Apache Spark for Machine Learning and Data Science

  • Tuition USD $2,000
  • Reviews 4.5 out of 5 stars (508 Ratings)
  • Course Code DB301
  • Duration 3 days
  • Available Formats Classroom, Virtual

This 3-day course is primarily for data scientists but is directly applicable to analysts, architects, software engineers, and technical managers interested in a thorough, hands-on overview of Apache Spark and its applications to Machine Learning.

The course covers the fundamentals of Apache Spark, including Spark’s architecture and internals, the core APIs, SQL and other high-level data access tools, and Spark’s streaming capabilities, with a heavy focus on Spark’s machine learning APIs. The class is a mixture of lecture and hands-on labs.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them after the class ends; all examples are guaranteed to run in the environment the class was taught in (Azure Databricks or Databricks Community Edition).

Skills Gained

  • Understand when and where to use Spark
  • Articulate the difference between an RDD, DataFrame, and Dataset
  • Explain supervised vs unsupervised machine learning, and typical applications of both
  • Build a Machine Learning Pipeline using a combination of Transformers and Estimators, save and restore models, and apply models to streaming data
  • Perform hyperparameter tuning with cross-validation
  • Analyze Spark query performance using the Spark UI
  • Train models with 3rd party libraries such as XGBoost
  • Perform hyperparameter search in parallel using single-node libraries such as scikit-learn
  • Gain familiarity with Decision Trees, Random Forests, Gradient Boosted Trees, Linear Regression, Collaborative Filtering, and K-Means
  • Explain options for putting models into production

Who Can Benefit

Data scientists, analysts, architects, software engineers, and technical managers with experience in machine learning who want to adapt traditional machine learning tasks to run at scale using Apache Spark.


Prerequisites

  • Some familiarity with Apache Spark is helpful but not required.
  • Some familiarity with Machine Learning and Data Science concepts is highly recommended but not required.
  • Basic programming experience in an object-oriented or functional language is required. The class can be taught concurrently in Python and Scala.

Course Details

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser (Internet Explorer and Safari are not supported)
  • Internet access with unfettered connections to the following domains:
  • 1. * - required
  • 2. * - highly recommended
  • 3. - required
  • 4. - helpful but not required


Spark Overview

In-depth discussion of Spark SQL and DataFrames, including:

  • RDD vs DataFrame vs Dataset API
  • Spark SQL
  • Data Aggregation
  • Column Operations
  • The Functions API: date/time, string manipulation, aggregation
  • Caching and caching storage levels
  • Use of the Spark UI to analyze behavior and performance

Overview of Spark internals

  • Cluster Architecture
  • How Spark schedules and executes jobs and tasks
  • Shuffling, shuffle files, and performance
  • The Catalyst query optimizer

Spark Structured Streaming

  • Sources and sinks
  • Structured Streaming APIs
  • Windowing & Aggregation
  • Checkpointing & Watermarking
  • Reliability and Fault Tolerance

In-depth overview of Spark’s MLlib Pipeline API for Machine Learning

  • Build machine learning pipelines for both supervised and unsupervised learning
  • Transformer/Estimator/Pipeline API
  • Use transformers to perform pre-processing on a dataset prior to training
  • Train analytical models with Spark ML’s DataFrame-based estimators including Decision Trees, Random Forests, Gradient Boosted Trees, Linear Regression, K-Means, and Alternating Least Squares
  • Tune hyperparameters via cross-validation and grid search
  • Evaluate model performance
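
As a rough illustration of the cross-validation and grid search topic above, here is a minimal sketch in plain Python. It is not Spark's API and not the course's lab code: the one-weight ridge model, the helper names (`fit_ridge`, `k_fold_indices`, `grid_search`), and the toy data are all assumptions chosen to keep the example self-contained. The idea it shows is the general one: score every candidate hyperparameter by its average error over k held-out folds, then keep the best.

```python
from statistics import mean

def fit_ridge(xs, ys, lam):
    """Fit y ~ w * x with an L2 penalty; closed form w = sum(xy) / (sum(xx) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the one-weight model on the given rows."""
    return mean((y - w * x) ** 2 for x, y in zip(xs, ys))

def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold i holds out every k-th row starting at i."""
    for i in range(k):
        test = list(range(i, n, k))
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

def cross_validate(xs, ys, lam, k=3):
    """Average held-out error of one hyperparameter value across k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        w = fit_ridge([xs[j] for j in train], [ys[j] for j in train], lam)
        scores.append(mse(w, [xs[j] for j in test], [ys[j] for j in test]))
    return mean(scores)

def grid_search(xs, ys, grid, k=3):
    """Score every candidate in the grid and return the one with the lowest error."""
    results = {lam: cross_validate(xs, ys, lam, k) for lam in grid}
    best = min(results, key=results.get)
    return best, results

# Toy data (an assumption for illustration): y = 2x with no noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]
best_lam, results = grid_search(xs, ys, grid=[0.0, 1.0, 10.0])
print(best_lam, results)
```

Because the toy data has no noise, the unpenalized candidate wins here; on real data the penalty that generalizes best is usually not known in advance, which is the reason for searching a grid.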


  • Track and benchmark model performance

3rd Party Library Integrations

  • XGBoost
  • How to distribute single-node algorithms (like scikit-learn) with Spark
  • Spark-Sklearn: Perform scikit-learn hyperparameter search in parallel
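
The parallel hyperparameter search item above rests on one observation: each candidate setting can be trained and scored independently, so the candidates can be farmed out to separate workers. The sketch below illustrates only that idea, using a local thread pool from the Python standard library as a stand-in for a cluster. It does not use Spark or Spark-Sklearn, and the `evaluate` function, its toy loss, and the parameter names `depth` and `rate` are hypothetical placeholders for fitting and validating a real single-node model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(params):
    """Hypothetical stand-in for training one model and returning its validation loss."""
    depth, rate = params
    # Toy loss with its minimum at depth=4, rate=0.1 (an assumption for illustration).
    return (depth - 4) ** 2 + (rate - 0.1) ** 2

# Every combination of the two hyperparameters is one independent task.
grid = list(product([2, 4, 8], [0.01, 0.1, 1.0]))

# pool.map preserves input order, so losses[i] belongs to grid[i].
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(evaluate, grid))

best_params, best_loss = min(zip(grid, losses), key=lambda pair: pair[1])
print(best_params, best_loss)
```

In a distributed setting the same shape applies: the grid is the collection being partitioned, and the per-candidate function runs on whichever worker receives it.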

Production Discussion

How do I enroll?

A comprehensive listing of ExitCertified courses can be found here. You can register directly for the required course/location when you select "register". If you have any questions or prefer to speak with an ExitCertified education consultant directly, please submit your query here. A representative will contact you shortly.

How do I pay for a class?

You can pay at the time of registration by credit card (Mastercard, Visa, or American Express), cheque, or purchase order (PO).

What if I have training credits?

ExitCertified honors all savings programs from the partners we work with. ExitCertified also offers training credits across multiple partners through our FLEX Account.

When does class start/end?

Classes begin promptly at 9:00 am, and typically end at 5:00 pm.


Lunch is normally an hour long and begins at noon. Coffee, tea, hot chocolate and juice are available all day in the kitchen. Fruit, muffins and bagels are served each morning. There are numerous restaurants near each of our centers, and some popular ones are indicated on the Area Map in the Student Welcome Handbooks - these can be picked up in the lobby or requested from one of our ExitCertified staff.

How can someone reach me during class?

If someone should need to contact you while you are in class, please have them call the center telephone number and leave a message with the receptionist.

What languages are used to deliver training?

Most courses are conducted in English, unless otherwise specified. Some courses will have the word "FRENCH" marked in red beside the scheduled date(s) indicating the language of instruction.

Great company to work with. Have taken training with Exit in the past. This experience is consistent with the others.

Great course, the labs were the best part of the course because it helped the material and information really sink in. We did have issues with one of the labs not being correct though

very good... also gave relevant examples to our current business requirements.

Very prompt during an issue that occurred, and accidentally contacted an individual on their vacation, but they quickly pointed me to another individual who got my issue resolved.

The format and presentation using the iMVP software worked as well.

Instructor was engaging, very knowledgeable, and explained things clearly. Labs were helpful in getting used to the tools that were being taught. Course location (Atlanta Microtek facility) was very comfortably set up. I'd definitely recommend this to anyone learning a new skill. (I took the 3-day AWS Big Data training).

3 options available

  • Sep 14, 2020 - Sep 16, 2020 (3 days)
    9:00 AM - 5:00 PM PDT
  • Oct 26, 2020 - Oct 28, 2020 (3 days)
    9:00 AM - 5:00 PM EDT
  • Dec 14, 2020 - Dec 16, 2020 (3 days)
    9:00 AM - 5:00 PM EST
Contact Us 1-800-803-3948