7817 Reviews (4.5 out of 5 stars)

Scalable Machine Learning with Apache Spark™

Price $1,500 USD (GSA $1,360.20)
Course Code SCALABLEML
Duration 2 days
Available Formats Classroom, Virtual

This course teaches you how to scale ML pipelines with Spark, including distributed training, hyperparameter tuning, and inference. You will build and tune ML models with SparkML while using MLflow to track, version, and manage them. The course covers the latest ML features in Apache Spark, such as pandas UDFs, the pandas function API, and the pandas API on Spark, as well as the latest ML product offerings, such as Feature Store and AutoML.

  • This course will prepare you to take the Databricks Certified Machine Learning Associate exam.

Skills Gained

  • Perform scalable EDA with Spark
  • Build and tune machine learning models with SparkML
  • Track, version, and deploy models with MLflow
  • Perform distributed hyperparameter tuning with Hyperopt
  • Use the Databricks Machine Learning workspace to create a Feature Store and AutoML experiments
  • Leverage the pandas API on Spark to scale your pandas code

Who Can Benefit

  • Data scientists
  • Machine learning engineers

Prerequisites

  • Intermediate experience with Python (or completion of Introduction to Python for Data Science & Data Engineering)
  • Familiarity with the PySpark DataFrame API (or completion of Apache Spark Programming)
  • Experience building machine learning models

Course Details

Day 1

  • Spark / ML overview
  • Exploratory data analysis (EDA) and feature engineering with Spark
  • SparkML: transformers, estimators, pipelines, and evaluators
  • MLflow Tracking and Model Registry

Day 2

  • Parallelizable hyperparameter tuning
  • Databricks AutoML and Feature Store
  • Integrating third-party packages (distributed XGBoost)
  • Distributed inference of scikit-learn models with pandas UDFs
  • Distributed training with the pandas function API
  • Pandas API on Spark for data manipulation