Apache Spark Tuning and Best Practices

Course Details
Code: SPARK-TUNE
Tuition (USD): Contact Us For Pricing

This 3-day course is aimed primarily at software engineers, but it is directly applicable to analysts, architects, and data scientists interested in a deep dive into tuning Spark applications, developing best practices, and avoiding many of the common pitfalls associated with developing Spark applications.

This course is a lab-intensive workshop in which students implement various best practices while inducing, diagnosing, and then fixing a range of performance problems. The course continues with numerous instructor-led coding challenges in which students refactor existing code, applying the learned best practices to improve overall performance. It concludes with a full-day workshop in which students work individually or in teams to complete a typical, full-scale data migration of a poorly maintained dataset.

Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.

Skills Gained

After taking this class, students will be able to:

  • Shortcut investigations by developing common-sense intuition about the root causes of various performance issues.
  • Diagnose & fix various storage-related performance issues, including the tiny files problem, malformed partitions, unpartitioned data, over-partitioned data, and incorrectly typed data
  • Identify when and when not to cache data
  • Articulate the performance ramifications of different caching strategies
  • Diagnose & fix common coding mistakes that lead to de-optimization
  • Optimize joins via broadcasting, pruning, and pre-joining
  • Apply tips and tricks for: investigating the file system, diagnosing partition skew (a short diagnostic sketch follows this list), developing and distributing utility functions, developing micro-benchmarks & avoiding related pitfalls, rapid ETL development for extremely large datasets, and working in shared cluster environments
  • Develop different strategies for testing ETL components
  • Rapidly develop insights on otherwise costly datasets
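
To give a flavor of this diagnostic work, the sketch below shows one way to check for partition skew from a notebook by counting records per partition. It assumes a DataFrame named df already exists in the notebook; the name is illustrative only.

    from pyspark.sql import functions as F

    # Tag each record with the ID of the partition it lives in, then count
    # per partition; a few partitions holding most of the rows is a classic
    # sign of skew.
    partition_counts = (
        df.withColumn("partition_id", F.spark_partition_id())
          .groupBy("partition_id")
          .count()
    )

    # Compare the smallest, largest, and average partition sizes.
    partition_counts.agg(
        F.min("count").alias("min_records"),
        F.max("count").alias("max_records"),
        F.avg("count").alias("avg_records"),
    ).show()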

Who Can Benefit

Data engineers, analysts, architects, data scientists, and software engineers who want to further their skills by learning how to develop high-performance Spark applications through the use of best practices and by diagnosing and troubleshooting common performance problems.

Prerequisites

  • Proficiency with Apache Spark's DataFrames API is helpful but not required.
  • Intermediate to advanced programming experience in Python or Scala is required.
  • Several weeks of experience in developing Apache Spark applications is preferred.
  • The class can be taught concurrently in Python and Scala.

Lab Requirements

  • A computer or laptop
  • Chrome or Firefox web browser (Internet Explorer and Safari are not supported)
  • Internet access with unfettered connections to the following domains:
    1. *.databricks.com - required
    2. *.slack.com - highly recommended
    3. spark.apache.org - required
    4. drive.google.com - helpful but not required

Course Outline

Coding Exercises

  • This course serves as an excellent follow-up to Databricks’ other courses: Apache Spark Programming (DB 105) and Apache Spark for Machine Learning and Data Science (DB 301)
  • Students will implement more than 75% of all exercises, which in turn induce the various performance problems to be diagnosed and fixed

Partitioning

  • Explore the effects of different partitioning strategies
  • Diagnose performance problems related to improperly partitioned data
  • Explore different solutions for fixing mal-partitioned data
  • Work with and understand on-disk partitioning strategies
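
As a small taste of this material, the following sketch (not course code) illustrates two common fixes: compacting an over-partitioned dataset so it is not written out as thousands of tiny files, and partitioning the output on disk by a column that downstream queries filter on. The paths, column name, and partition count are assumptions made for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitioning-sketch").getOrCreate()

    # Hypothetical input that was originally written as many tiny files.
    df = spark.read.parquet("/mnt/example/events_raw")

    # Reduce the number of partitions before writing; coalesce avoids a
    # full shuffle when only shrinking the partition count.
    compacted = df.coalesce(32)

    # Partition the output on disk by a commonly filtered column so reads
    # can prune entire directories.
    (compacted.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("/mnt/example/events_compacted"))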

Caching

  • Develop tips and tricks for caching data on shared clusters
  • Explore the ramifications of different caching strategies
  • Learn why caching is one of the most common sources of performance problems
  • Develop intuitions as to when and when not to cache data
  • Use caching as an aid to troubleshooting
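
For illustration, here is a minimal sketch of the basic caching mechanics this module builds on: persisting a DataFrame that is reused by several actions, choosing an explicit storage level, and unpersisting it when finished. The example DataFrame and the choice of MEMORY_AND_DISK are assumptions, not course code.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("caching-sketch").getOrCreate()
    df = spark.range(0, 10_000_000).withColumnRenamed("id", "value")

    # Persist with an explicit storage level; MEMORY_AND_DISK spills blocks
    # to disk instead of recomputing partitions that do not fit in memory.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # The first action materializes the cache; later actions reuse it.
    print(df.count())
    print(df.filter("value % 2 = 0").count())

    # Release the cached blocks when done; this matters most on shared
    # clusters where executor memory is contended.
    df.unpersist()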

Joins

  • Explore different options for optimizing joins
  • Work with broadcast joins
  • Explore different options for avoiding joins
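
As one concrete example of the techniques covered, the sketch below contrasts Spark's default shuffle-based join with an explicit broadcast hint for a small dimension table. The table shapes and names are assumptions made for the illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("join-sketch").getOrCreate()

    # A large "fact" table and a small "dimension" table (both made up here).
    facts = spark.range(0, 1_000_000).withColumnRenamed("id", "customer_id")
    dims = spark.createDataFrame(
        [(i, f"segment-{i % 5}") for i in range(100)],
        ["customer_id", "segment"],
    )

    # Broadcasting the small side ships a full copy to every executor, so
    # the large side never has to be shuffled for the join.
    joined = facts.join(broadcast(dims), "customer_id")

    # The physical plan should show a BroadcastHashJoin instead of a
    # SortMergeJoin.
    joined.explain()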

Utility Functions

  • Develop utility functions for diagnosing performance problems, caching data, benchmarking, and common ETL tasks
  • Discuss deployment strategies for utility functions
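
A hypothetical example of the kind of helper this module works toward: a tiny benchmarking utility that times a Spark action. The function name and design are assumptions for illustration, not the utilities distributed in class.

    import time

    def benchmark(label, action):
        """Time a Spark action and print the elapsed wall-clock seconds.

        The callable must trigger an action (count, collect, write, ...);
        timing a transformation alone only measures plan construction.
        """
        start = time.perf_counter()
        result = action()
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f}s")
        return result

    # Example usage (assumes a DataFrame named df exists in the notebook):
    # rows = benchmark("filtered count", lambda: df.filter("value > 0").count())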

Testing Strategies

  • Strategies for testing transformations
  • Developing test datasets for unit tests
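
A minimal sketch of the testing style discussed here: factor each transformation into a pure function of DataFrames so it can be exercised against a small, hand-built dataset. The transformation, column name, and bare assert are illustrative assumptions rather than course code.

    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql import functions as F

    def normalize_emails(df: DataFrame) -> DataFrame:
        # Transformation under test: trim and lower-case the email column.
        return df.withColumn("email", F.lower(F.trim(F.col("email"))))

    def test_normalize_emails():
        spark = (SparkSession.builder
                 .master("local[2]")
                 .appName("transformation-tests")
                 .getOrCreate())
        input_df = spark.createDataFrame([("  Alice@Example.COM ",)], ["email"])
        result = normalize_emails(input_df).collect()
        assert result[0]["email"] == "alice@example.com"

    test_normalize_emails()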

De-optimization

  • Explore common coding practices that induce de-optimization
  • Explore solutions for avoiding de-optimization
  • Review the Catalyst Optimizer and its role in optimizing applications
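
As one concrete illustration of the kind of de-optimization examined here, the sketch below contrasts a Python UDF, which is opaque to the Catalyst optimizer and adds serialization between the JVM and Python workers, with an equivalent built-in expression that stays inside the optimized plan. The column and logic are assumptions for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("deopt-sketch").getOrCreate()
    df = spark.range(0, 1_000_000).withColumnRenamed("id", "value")

    # De-optimized: a Python UDF is a black box to Catalyst.
    label_udf = F.udf(lambda v: "even" if v % 2 == 0 else "odd", StringType())
    slow = df.withColumn("label", label_udf("value"))

    # Preferred: built-in expressions remain fully optimizable.
    fast = df.withColumn(
        "label", F.when(F.col("value") % 2 == 0, "even").otherwise("odd")
    )

    # Comparing the plans shows the extra BatchEvalPython step for the UDF.
    slow.explain()
    fast.explain()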