Developer Training for Apache Spark

Course Details
Code: DEV-APACHE-SPARK
Tuition (USD): $1,815.00 • Self Paced

Prerequisites

This course is best suited to developers and engineers with prior knowledge of and experience with Hadoop. Course examples and exercises are presented in Python and Scala, so knowledge of one of these programming languages is required. Basic knowledge of Linux is assumed.

Course Details

Through videos and hands-on exercises, participants will learn about Apache Spark and how it integrates with the entire Hadoop ecosystem, including:

  • Using the Spark shell for interactive data analysis
  • The features of Spark’s Resilient Distributed Datasets
  • How Spark runs on a cluster
  • How Spark parallelizes task execution
  • Writing Spark applications
  • Processing streaming data with Spark

Subscription Details

This OnDemand offering provides you with a 180-day subscription that begins on the date of purchase. While the subscription is active, you have unlimited access to the course training materials, which include recorded course lectures and demonstrations, assessment components, and hands-on exercise instructions. You also receive 15 runtime hours of access to the online hands-on exercise environment, accessible through a web browser. You can start the exercise environment whenever you are ready, stop or pause it when you are done for the time being, and return at any time to continue where you left off. The exercise environment remains accessible until you have used all 15 runtime hours or the subscription period ends, whichever occurs first.

Introduction to Spark

  • What is Spark?
  • Review: From Hadoop MapReduce to Spark
  • Review: HDFS
  • Review: YARN
  • Spark Overview

Spark Basics

  • Using the Spark Shell
  • RDDs (Resilient Distributed Datasets)
  • Functional Programming in Spark
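
To give a flavor of this module, here is a minimal PySpark shell session; the input file and its contents are hypothetical placeholders, not course materials:

    # In the PySpark shell, the SparkContext is already available as `sc`.
    lines = sc.textFile("weblogs.txt")                     # hypothetical input file
    errors = lines.filter(lambda line: "ERROR" in line)    # lazy transformation
    print(errors.count())                                  # action triggers the computation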

Working with RDDs in Spark

  • Creating RDDs
  • Other General RDD Operations
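
A short sketch of the kind of RDD creation and general operations this module covers, again assuming the PySpark shell's `sc`; the file name is a placeholder:

    # Create an RDD from an in-memory collection and apply transformations.
    nums = sc.parallelize([1, 2, 3, 4, 5])
    print(nums.map(lambda n: n * n).collect())             # [1, 4, 9, 16, 25]

    # Create an RDD from a text file and chain general operations.
    words = sc.textFile("shakespeare.txt").flatMap(lambda line: line.split())
    print(words.distinct().take(10))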

Aggregating Data with Pair RDDs

  • Key-Value Pair RDDs
  • Map-Reduce
  • Other Pair RDD Operations
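
The classic word-count example illustrates the map-reduce pattern with pair RDDs; the input file is a placeholder:

    counts = (sc.textFile("shakespeare.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))               # build (key, value) pairs
                .reduceByKey(lambda a, b: a + b))          # aggregate values per key
    print(counts.sortBy(lambda kv: kv[1], ascending=False).take(5))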

Writing and Deploying Spark Applications

  • Spark Applications vs. Spark Shell
  • Creating the SparkContext
  • Building a Spark Application (Scala and Java)
  • Running a Spark Application
  • The Spark Application Web UI
  • Hands-On Exercise: Write and Run a Spark Application
  • Configuring Spark Properties
  • Logging
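
For orientation, a minimal stand-alone PySpark application of the sort written in this module; the script name, application name, and input file are hypothetical:

    # count_errors.py
    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        conf = SparkConf().setAppName("Count Errors")
        sc = SparkContext(conf=conf)
        errors = sc.textFile("weblogs.txt").filter(lambda line: "ERROR" in line)
        print("Error lines: %d" % errors.count())
        sc.stop()

Such a script is typically launched with spark-submit, for example: spark-submit count_errors.py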

Parallel Processing

  • Review: Spark on a Cluster
  • RDD Partitions
  • Partitioning of File-based RDDs
  • HDFS and Data Locality
  • Executing Parallel Operations
  • Stages and Tasks
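
A brief sketch of inspecting and controlling RDD partitioning, one of the topics in this module; the file name is a placeholder:

    lines = sc.textFile("weblogs.txt", minPartitions=4)    # ask for at least 4 partitions
    print(lines.getNumPartitions())
    repartitioned = lines.repartition(8)                   # shuffle the data into 8 partitions
    print(repartitioned.getNumPartitions())                # 8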

Spark RDD Persistence

  • RDD Lineage
  • RDD Persistence Overview
  • Distributed Persistence
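
A small example of RDD persistence as covered here; the file name is a placeholder:

    from pyspark import StorageLevel

    errors = sc.textFile("weblogs.txt").filter(lambda line: "ERROR" in line)
    errors.persist(StorageLevel.MEMORY_ONLY)               # keep partitions in memory between actions
    print(errors.count())                                  # first action computes and caches the RDD
    print(errors.take(3))                                  # subsequent actions reuse the cached data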

Basic Spark Streaming

  • Spark Streaming Overview
  • Example: Streaming Request Count
  • DStreams
  • Developing Spark Streaming Applications
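
A minimal Spark Streaming sketch in the spirit of the streaming request-count example; the host, port, and log format are placeholders:

    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, batchDuration=2)            # process data in 2-second batches
    lines = ssc.socketTextStream("localhost", 9999)        # DStream of text lines from a socket
    requests = lines.filter(lambda line: "GET" in line)
    requests.count().pprint()                              # print the count for each batch
    ssc.start()
    ssc.awaitTermination()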

Advanced Spark Streaming

  • Multi-Batch Operations
  • State Operations
  • Sliding Window Operations
  • Advanced Data Sources
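
A sketch of a sliding-window operation, continuing the previous streaming example; the log format is a placeholder and `lines` is the DStream defined above:

    ssc.checkpoint("checkpoints")                          # required for window and state operations
    pairs = lines.map(lambda line: (line.split()[0], 1))   # key each request by its first field
    windowed = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b,                                # add values entering the window
        lambda a, b: a - b,                                # subtract values leaving the window
        windowDuration=30,
        slideDuration=10)
    windowed.pprint()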

Common Patterns in Spark Data Processing

  • Common Spark Use Cases
  • Iterative Algorithms in Spark
  • Graph Processing and Analysis
  • Machine Learning
  • Example: k-means
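
As an illustration of the k-means example, a minimal MLlib clustering sketch; the input file and its format are placeholders:

    from pyspark.mllib.clustering import KMeans

    # Each line is assumed to contain space-separated numeric features.
    points = (sc.textFile("points.txt")
                .map(lambda line: [float(x) for x in line.split()]))
    model = KMeans.train(points, k=3, maxIterations=10)
    print(model.clusterCenters)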

Improving Spark Performance

  • Shared Variables: Broadcast Variables
  • Shared Variables: Accumulators
  • Common Performance Issues
  • Diagnosing Performance Problems
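
A short sketch of the two shared-variable types this module covers; the log format and lookup table are hypothetical:

    status_names = sc.broadcast({200: "OK", 404: "Not Found", 500: "Server Error"})
    bad_records = sc.accumulator(0)

    def describe(line):
        fields = line.split()
        try:
            code = int(fields[-1])                         # hypothetical log format
        except (IndexError, ValueError):
            bad_records.add(1)                             # count malformed lines
            return "UNKNOWN"
        return status_names.value.get(code, "OTHER")       # read the broadcast lookup table

    sc.textFile("weblogs.txt").map(describe).count()       # the action runs the job
    print("Malformed lines:", bad_records.value)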

Spark SQL and DataFrames

  • Spark SQL and the SQL Context
  • Creating DataFrames
  • Transforming and Querying DataFrames
  • Saving DataFrames
  • DataFrames and RDDs
  • Comparing Spark SQL, Impala and Hive-on-Spark
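
Finally, a brief DataFrame sketch using the SQL context; the JSON file and column names are placeholders:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    people = sqlContext.read.json("people.json")
    people.filter(people.age > 21).select("name", "age").show()

    people.registerTempTable("people")                     # query the same DataFrame with SQL
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
    print(people.rdd.take(2))                              # a DataFrame is also available as an RDD of Rows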