Introduction to Big Data

Course Code: ES-IntroBigData
Duration: 2 days
Available Formats: Classroom

Course Details

Topics Covered

Processing Big Data

  • Hadoop storage with HDFS
  • Importing data from SQL databases with Sqoop
  • Processing the data with Spark

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing dataset – part 1
  • Labs: Spark shell exploration

Spark Data Structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs

Caching

  • Caching overview
  • Various caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance

DataFrames / Datasets

  • DataFrames intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Using schema
  • Specifying schema for DataFrames
  • Labs: DataFrames, Datasets, Schema

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: querying structured data using SQL; evaluating data formats

Spark and Hadoop

  • Hadoop Primer: HDFS / YARN
  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark & Hive

Hive

  • Hive concepts & architecture
  • SQL support in Hive
  • Data warehousing in Hive
  • Data types
  • Table creation and queries
  • Partitions
  • Joins
  • Modern data formats
  • Text analytics
  • Hive performance