7,787 reviews (4.5 out of 5 stars)

Optimizing Apache Spark™ on Databricks

Price $1,500 USD (GSA: $1,360.20)
Course Code OPTSPARK
Duration 2 days
Available Formats Classroom, Virtual

In this course, you will explore the five key problems that represent the vast majority of performance issues in an Apache Spark application: skew, spill, shuffle, storage, and serialization. With examples based on 100 GB to 1+ TB datasets, you will investigate and diagnose sources of bottlenecks with the Spark UI and learn effective mitigation strategies. You will also discover new features introduced in Spark 3 that can automatically address common performance problems. Lastly, you will learn how to design and configure clusters for optimal performance based on specific team needs and concerns.

Skills Gained

  • Articulate how the five most common performance problems in a Spark application can be mitigated to achieve better application performance
  • Summarize the most common performance problems associated with data ingestion and how to mitigate them
  • Articulate how new features in Spark 3.x can be employed to mitigate performance problems in your Spark applications
  • Configure a Spark cluster for maximum performance given specific job requirements

Prerequisites

  • Hands-on experience developing Apache Spark applications (6+ months). We recommend the Apache Spark Programming course to get started working with Spark.
  • Intermediate experience in Python or Scala

Course Details

Course Outline

Day 1

  • Review of Spark architecture and Spark UI
  • Skew
  • Spill
  • Shuffle
  • Storage
  • Serialization

Day 2

  • Ingestion basics
  • Predicate pushdowns
  • Disk partitioning
  • Z-ordering
  • Bucketing
  • Optimization with Adaptive Query Execution (AQE)
  • Designing and configuring clusters for high performance