Cloudera Data Analyst Training

Course Details
Code: DATA-ANALYST
Tuition (USD): $3,195.00 • Classroom (4 days)
$3,195.00 • Virtual (4 days)
GSA (USD): $2,736.27 • Classroom (4 days)
$2,736.27 • Virtual (4 days)

Apache Hive makes transformation and analysis of complex, multi-structured data scalable in Hadoop. Apache Impala enables real-time interactive analysis of the data stored in Hadoop using a native SQL environment. Together, they make multi-structured data accessible to analysts, database administrators, and others without Java programming expertise.

Skills Gained

  • How the open source ecosystem of big data tools addresses challenges not met by traditional RDBMSs
  • How Apache Hive and Apache Impala are used to provide SQL access to data
  • How Hive and Impala syntax and data formats, including functions and subqueries, help answer questions about data
  • How to create, modify, and delete tables, views, and databases; load data; and store results of queries
  • How to create and use partitions and different file formats
  • How to combine two or more datasets using JOIN or UNION, as appropriate
  • What analytic and windowing functions are, and how to use them
  • How to store and query complex or nested data structures
  • How to process and analyze semi-structured and unstructured data
  • Different techniques for optimizing Hive and Impala queries
  • How to extend the capabilities of Hive and Impala using parameters, custom file formats and SerDes, and external scripts
  • How to determine whether Hive, Impala, an RDBMS, or a mix of these is best for a given task

Prerequisites

This course is designed for data analysts, business intelligence specialists, developers, system architects, and database administrators. Some knowledge of SQL is assumed, as is basic Linux command-line familiarity. Prior knowledge of Apache Hadoop is not required.

Course Outline

Introduction

Apache Hadoop Fundamentals

  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Hive and Impala
  • Database Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenario Explanation

Introduction to Apache Hive and Impala

  • What Is Hive?
  • What Is Impala?
  • Why Use Hive and Impala?
  • Schema and Data Storage
  • Comparing Hive and Impala to Traditional Databases
  • Use Cases

Querying with Apache Hive and Impala

  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Using Hue to Execute Queries
  • Using Beeline (Hive's Shell)
  • Using the Impala Shell

Common Operators and Built-In Functions

  • Operators
  • Scalar Functions
  • Aggregate Functions

Data Management

  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results

Data Storage and Performance

  • Partitioning Tables
  • Loading Data into Partitioned Tables
  • When to Use Partitioning
  • Choosing a File Format
  • Using Avro and Parquet File Formats

Working with Multiple Datasets

  • UNION and Joins
  • Handling NULL Values in Joins
  • Advanced Joins

Analytic Functions and Windowing

  • Using Analytic Functions
  • Other Analytic Functions
  • Sliding Windows

Complex Data

  • Complex Data with Hive
  • Complex Data with Impala

Analyzing Text

  • Using Regular Expressions with Hive and Impala
  • Processing Text Data with SerDes in Hive
  • Sentiment Analysis and n-grams in Hive

Apache Hive Optimization

  • Understanding Query Performance
  • Bucketing
  • Hive on Spark

Apache Impala Optimization

  • How Impala Executes Queries
  • Improving Impala Performance

Extending Apache Hive and Impala

  • Custom SerDes and File Formats in Hive
  • Data Transformation with Custom Scripts in Hive
  • User-Defined Functions
  • Parameterized Queries

Choosing the Best Tool for the Job

  • Comparing Hive, Impala, and Relational Databases
  • Which to Choose?

Conclusion

Apache Kudu

  • What Is Kudu?
  • Kudu Tables
  • Using Impala with Kudu