Data Processing Foundations (Ray Data)

1. Landscape

This module maps the landscape of distributed data and compute, covering the data, compute, and orchestration layers and the tools commonly used in each. You’ll learn how these layers fit together and how to categorize major distributed computing frameworks and commercial platforms by their primary use cases.

Introduction to Ray Data: Industry Landscape
The Compute Layer
The Orchestration Layer
Distributed Computing Frameworks
Data Processing with Ray Data
Ray Serve

+3 more lessons

2. Structured

In **Structured**, you’ll learn how to use Ray Data to load large structured datasets (e.g., NYC taxi Parquet files) in a distributed, lazily executed way, and how to inspect their schema and blocks. You’ll also practice applying scalable batch transformations with `map_batches` to engineer new features in parallel.
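
To make the ideas of blocks, lazy execution, and batch transformations concrete, here is a minimal pure-Python sketch. It is a toy model, not Ray Data's implementation: `ToyDataset`, `add_fare_per_mile`, and the column names `trip_distance` and `fare_amount` are assumptions made up for illustration, and the blocks are processed sequentially where a real engine would process them in parallel.

```python
from typing import Callable, Dict, List, Optional

# A batch is a small columnar table: column name -> list of values.
Batch = Dict[str, List[float]]


class ToyDataset:
    """Toy stand-in for a lazy, block-partitioned dataset (illustrative only)."""

    def __init__(self, blocks: List[Batch],
                 plan: Optional[List[Callable[[Batch], Batch]]] = None):
        self._blocks = blocks
        self._plan = list(plan or [])

    def map_batches(self, fn: Callable[[Batch], Batch]) -> "ToyDataset":
        # Lazy: only record the transformation; no data is touched yet.
        return ToyDataset(self._blocks, self._plan + [fn])

    def num_blocks(self) -> int:
        return len(self._blocks)

    def schema(self) -> List[str]:
        # Column names of the source blocks, before any planned transformation.
        return list(self._blocks[0].keys())

    def take_all(self) -> List[Batch]:
        # Execution happens here, one block at a time.
        out = []
        for block in self._blocks:
            # Copy the columns so the source blocks stay unchanged.
            batch = {name: list(values) for name, values in block.items()}
            for fn in self._plan:
                batch = fn(batch)
            out.append(batch)
        return out


def add_fare_per_mile(batch: Batch) -> Batch:
    """Engineer a new column from two existing ones (hypothetical feature)."""
    batch["fare_per_mile"] = [
        fare / dist if dist > 0 else 0.0
        for fare, dist in zip(batch["fare_amount"], batch["trip_distance"])
    ]
    return batch


blocks = [
    {"trip_distance": [1.0, 2.0], "fare_amount": [5.0, 9.0]},
    {"trip_distance": [0.0, 4.0], "fare_amount": [3.0, 18.0]},
]

ds = ToyDataset(blocks).map_batches(add_fare_per_mile)  # nothing runs yet
result = ds.take_all()                                  # the plan executes now
```

In Ray Data itself, the corresponding steps would use `ray.data.read_parquet` to create the dataset and `Dataset.map_batches` to apply the function; the sketch above only mirrors that shape.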

What is Ray Data?
Introduction to Ray Data: Ray Data + Structured Data
Loading Data
Transforming Data
Writing Data
Data Operations: Shuffling, Grouping and Aggregation
When to use Ray Data

+4 more lessons

3. Unstructured

This module introduces Ray Data for working with unstructured datasets at scale, explaining when to use it and how its distributed execution model (blocks and lazy transformations) works. You’ll practice building an end-to-end pipeline to read image data (e.g., MNIST from S3), apply transformations, and materialize/consume results in a parallel, distributed way.
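
The read, transform, and materialize flow described above can be sketched with plain Python generators. This is only an analogy for deferred reading and materialization, not Ray Data's API: `fake_read_images`, `lazy_map`, `normalize`, and the file names are hypothetical, and the "images" are tiny 2x2 lists of pixel values rather than real MNIST data from S3.

```python
from typing import Callable, Iterable, Iterator, List

# A tiny grayscale "image": rows of pixel values in the range 0..255.
Image = List[List[float]]

read_log: List[str] = []  # records which paths have actually been read


def fake_read_images(paths: List[str]) -> Iterator[Image]:
    """Pretend to load one image per path; the body runs only when iterated."""
    for path in paths:
        read_log.append(path)
        yield [[0, 255], [51, 102]]


def normalize(img: Image) -> Image:
    """Scale pixel values from 0..255 to 0..1."""
    return [[pixel / 255.0 for pixel in row] for row in img]


def lazy_map(fn: Callable[[Image], Image], items: Iterable[Image]) -> Iterator[Image]:
    """Apply fn to each item on demand instead of eagerly."""
    for item in items:
        yield fn(item)


paths = ["img_0.png", "img_1.png", "img_2.png"]  # hypothetical file names

# Building the pipeline reads nothing: it only describes the work.
pipeline = lazy_map(normalize, fake_read_images(paths))
reads_before = len(read_log)

# Materializing consumes the pipeline and keeps the results in memory.
materialized = list(pipeline)
reads_after = len(read_log)
```

Here `reads_before` is 0 and `reads_after` is 3, which shows that the reads happen only once the results are consumed. The materialized list can then be reused without re-reading the source.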

Intro to Ray Data: Ray Data + Unstructured Data
When to Consider Ray Data
How to work with Ray Data
Loading data
Lazy execution mode
Transforming data
Stateful transformations with Ray Actors
Materializing data
Data Operations: grouping, aggregation, and shuffling
Persisting data

+7 more lessons
