This 3-day course is aimed primarily at software engineers but is directly applicable to analysts, architects, and data scientists interested in a deep dive into tuning Spark applications, developing best practices, and avoiding many of the common pitfalls associated with developing Spark applications.
This course is a lab-intensive workshop in which students implement various best practices while inducing, diagnosing, and then fixing various performance problems. The course continues with numerous instructor-led coding challenges to refactor existing code, increasing overall performance by applying the best practices learned. It concludes with a full-day workshop in which students work individually or in teams to complete a typical, full-scale migration of a poorly maintained dataset.
Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.
After taking this class, students will be able to:
- Shortcut investigations by developing common-sense intuition as to the root cause of various performance issues.
- Diagnose & fix various storage-related performance issues, including: the tiny-files problem, malformed partitions, unpartitioned data, over-partitioned data, and incorrectly typed data
- Identify when and when not to cache data
- Articulate the performance ramifications of different caching strategies
- Diagnose & fix common coding mistakes that lead to de-optimization
- Optimize joins via broadcasting, pruning, and pre-joining
- Apply tips and tricks for: investigating the file system, diagnosing partition skew, developing and distributing utility functions, developing micro-benchmarks & avoiding related pitfalls, rapid ETL development for extremely large datasets, and working in shared cluster environments
- Develop different strategies for testing ETL components
- Rapidly develop insights on otherwise costly datasets
Who Can Benefit
Data engineers, analysts, architects, data scientists, and software engineers who want to further their skills by learning how to develop high-performance Spark applications through the use of best practices and by diagnosing and troubleshooting common performance problems.
- Proficiency with Apache Spark's DataFrames API is helpful but not required.
- Intermediate to advanced programming experience in Python or Scala is required.
- Several weeks of experience in developing Apache Spark applications is preferred.
- The class can be taught concurrently in Python and Scala.