This 3-day course is primarily for software engineers but is directly applicable to analysts, architects, and data scientists interested in a deep dive into tuning Spark applications, developing best practices, and avoiding many of the common pitfalls associated with developing Spark applications.
This course is a lab-intensive workshop in which students implement various best practices while inducing, diagnosing, and then fixing various performance problems. The course continues with numerous instructor-led coding challenges in which students refactor existing code to improve overall performance by applying the best practices they have learned. It then concludes with a full-day workshop in which students work individually or in teams to complete a typical, full-scale data migration of a poorly maintained dataset.
Each topic includes lecture content along with hands-on labs in the Databricks notebook environment. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering after the class ends; all examples are guaranteed to run in that environment.
After taking this class, students will be able to:
Shortcut investigations by developing common-sense intuition as to the root cause of various performance issues.
Diagnose & fix various storage-related performance issues, including the tiny files problem, malformed partitions, unpartitioned data, overpartitioned data, and incorrectly typed data
Identify when and when not to cache data
Articulate the performance ramifications of different caching strategies
Diagnose & fix common coding mistakes that lead to de-optimization
Optimize joins via broadcasting, pruning, and pre-joining
Apply tips and tricks for investigating the file system, diagnosing partition skew, developing and distributing utility functions, developing micro-benchmarks and avoiding related pitfalls, rapid ETL development for extremely large datasets, and working in shared cluster environments
Develop different strategies for testing ETL components
Rapidly develop insights on otherwise costly datasets
Who Can Benefit
Data engineers, analysts, architects, data scientists, and software engineers who want to further their skills by learning how to develop high-performance Spark applications through the use of best practices and by diagnosing and troubleshooting common performance problems.
Proficiency with Apache Spark's DataFrames API is helpful but not required.
Intermediate to advanced programming experience in Python or Scala is required.
Several weeks of experience in developing Apache Spark applications is preferred.
The class can be taught concurrently in Python and Scala.
A computer or laptop
Chrome or Firefox web browser (Internet Explorer and Safari are not supported)
Internet access with unfettered connections to the following domains:
1. *.databricks.com - required
2. *.slack.com - highly recommended
3. spark.apache.org - required
4. drive.google.com - helpful but not required
This course serves as an excellent follow-up to Databricks’ other courses: Apache Spark Programming (DB 105) and Apache Spark for Machine Learning and Data Science (DB 301).
Students will implement more than 75% of all exercises, which in turn induce the various performance problems to be diagnosed and fixed
Explore the effects of different partitioning strategies
Diagnose performance problems related to improperly partitioned data
Explore different solutions for fixing mal-partitioned data
Working with and understanding on-disk partitioning strategies
Develop tips and tricks for caching data on shared clusters
Explore the ramifications of different caching strategies
Learn why caching is one of the most common performance problems
Develop intuitions as to when and when not to cache data
How to use caching as an aid to troubleshooting
Explore different options for optimizing joins
Working with broadcast joins
Explore different options for avoiding joins
Diagnosing performance problems
Common ETL tasks
Discuss deployment strategies for utility functions
Strategies for testing transformations
Developing test datasets for unit tests
Exploring common coding practices that induce de-optimization
Solutions for avoiding de-optimization
Review of the Catalyst Optimizer and its role in optimizing applications
Course: Apache Spark Tuning and Best Practices (DB110), Databricks. Delivery: Classroom. Price: 2500.00 USD. Listing timestamp: 2019-02-05T16:46:21+00:00. Source: https://www.exitcertified.com/training/databricks/apache-spark-tuning-best-practices-56010-detail.html