Explore the foundations of processing structured and unstructured data with Ray Data and get an overview of the key concepts and patterns for distributed data processing.
This module maps the distributed data and compute industry landscape, covering the data, compute, and orchestration layers and the tools commonly used in each. You’ll learn how these layers fit together and how to categorize major distributed computing frameworks and commercial platforms by their primary use cases.
In **Structured**, you’ll learn how to use Ray Data to load large structured datasets (e.g., NYC taxi Parquet files) in a distributed, lazily executed way and inspect their schema and blocks. You’ll also practice applying scalable batch transformations with `map_batches` to engineer new features in parallel.
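As a rough illustration of the `map_batches` pattern described above, the sketch below adds one engineered column to each batch. The column names `trip_distance` and `fare_amount`, the helper `add_fare_per_mile`, and the tiny in-memory dataset are assumptions made for the example; with real data you would likely start from `ray.data.read_parquet` pointed at your own Parquet location instead.

```python
import numpy as np


def add_fare_per_mile(batch: dict) -> dict:
    """Add a fare-per-mile column to one batch (a dict of NumPy arrays).

    The column names are assumptions for this sketch, not a fixed schema.
    """
    distance = batch["trip_distance"].astype(np.float64)
    fare = batch["fare_amount"].astype(np.float64)
    # Rows with zero or negative distance get NaN instead of dividing by zero.
    safe_distance = np.where(distance > 0, distance, np.nan)
    return {**batch, "fare_per_mile": fare / safe_distance}


def demo() -> None:
    # Ray is imported here so the batch function can be tried without a cluster.
    import ray

    # Hypothetical stand-in rows; a real pipeline would read Parquet files.
    rows = [
        {"trip_distance": 2.0, "fare_amount": 10.0},
        {"trip_distance": 0.0, "fare_amount": 5.0},
        {"trip_distance": 4.0, "fare_amount": 18.0},
    ]
    ds = ray.data.from_items(rows)
    print(ds.schema())  # inspect column names and types

    # Transformations are lazy: this line only records the step.
    ds = ds.map_batches(add_fare_per_mile, batch_format="numpy")

    # Consuming the dataset triggers the distributed execution.
    print(ds.take_all())
```

Calling `demo()` in an environment with Ray installed runs the pipeline end to end. Keeping the feature logic in a plain function that takes and returns a dict of arrays makes it easy to check on a small batch before handing it to `map_batches`.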
This module introduces Ray Data for working with unstructured datasets at scale, explaining when to use it and how its distributed execution model (blocks and lazy transformations) works. You’ll practice building an end-to-end pipeline to read image data (e.g., MNIST from S3), apply transformations, and materialize/consume results in a parallel, distributed way.
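The read, transform, and consume flow described above might look roughly like the sketch below. It uses random in-memory arrays as a stand-in for MNIST so that no bucket path has to be invented; the helper `normalize_images`, the 28x28 shape, and the column name `"data"` (which recent Ray releases appear to use for `ray.data.from_numpy`) are assumptions for illustration. A real pipeline would more likely begin with `ray.data.read_images`, which places images in an `"image"` column.

```python
import numpy as np


def normalize_images(batch: dict, column: str = "image") -> dict:
    """Scale uint8 pixel values in one batch to float32 values in [0, 1]."""
    images = batch[column].astype(np.float32) / 255.0
    return {**batch, column: images}


def demo() -> None:
    # Ray is imported here so the batch function can be tried without a cluster.
    import ray

    # Stand-in for image data: 16 random 28x28 grayscale images.
    fake_images = np.random.randint(0, 256, size=(16, 28, 28), dtype=np.uint8)
    # Assumption: from_numpy stores the array under a column named "data".
    ds = ray.data.from_numpy(fake_images)

    # Lazy step: nothing runs until the dataset is consumed.
    ds = ds.map_batches(
        normalize_images,
        batch_format="numpy",
        fn_kwargs={"column": "data"},
    )

    # Consume the results batch by batch; this triggers execution.
    for batch in ds.iter_batches(batch_size=4, batch_format="numpy"):
        print(batch["data"].shape, batch["data"].dtype)
```

Calling `demo()` with Ray installed streams the transformed batches. If the results need to be reused several times, `ds.materialize()` executes the pipeline once and holds the resulting blocks in the object store, which is a trade of memory for avoiding recomputation.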