Accelebrate's Analyzing Big Data with R Programming training teaches attendees how to perform in-memory, on-disk, and distributed analysis using H2O, Hadoop, and Apache Spark, and how to integrate Microsoft Machine Learning Server with R.
Skills Gained
All students will be able to:
- Use data.table to manipulate large datasets that fit in memory
- Understand batch processing via SQL queries
- Implement online-learning-style models using R
- Describe the difference between Hadoop and Spark
- Understand the Hadoop Distributed File System (HDFS)
- Use the SparkR package to leverage Spark through an R API
- Manage data via H2O and the R API
- Implement models using H2O and the R API
- Explain how Microsoft R Server and Microsoft R Client work
- Use R Client with R Server to explore big data held in different data stores
- Visualize data by using graphs and plots
- Transform and clean big data sets
- Build and evaluate regression models generated from big data
- Create, score, and deploy partitioning models generated from big data
- Use R in the SQL Server and Hadoop environments
Prerequisites
In addition to their professional experience, students who attend this course should have:
- Programming experience using R, and familiarity with common R packages
- Knowledge of common statistical methods and data analysis best practices
- Basic knowledge of the Microsoft Windows operating system and its core functionality
Software Requirements
- R 3.0 or later with console
- IDE or text editor of your choice (RStudio recommended)
Big Data with R Training Outline
Introduction
In-memory Big Data: data.table
- Why do we need data.table?
- Why is it
- The i and the j arguments in data.table
- Renaming Columns
- Adding new columns
- Binning data (continuous to categorical)
- Combining categorical values
- Transforming Variables
- Group-by functions with data.table
- Handling missing data
- Long to Wide and Back
- Merging datasets together
- Stacking datasets together (concatenation)
SQL Connections and Sequential data updates
Data Munging and Machine Learning Via H2O
- Intro to H2O
- Launching the cluster, checking status
- Data import and manipulation in H2O
- Unstructured data analysis: Word2Vec
- Fitting models in H2O
- Generalized Linear Models
- Naïve Bayes
- Random Forest
- Gradient Boosting Machine (GBM)
- Ensemble model building
Overview of Hadoop
- Distributed data versus distributed analytics
- HDFS and MapReduce
Apache Spark
- Overview of Spark
- APIs to use Apache Spark with R
- sparklyr versus SparkR
- R, Python, Java and Scala APIs to Spark
- Applied Examples using SparkR
- Data import and manipulation in Spark(R)
- The Spark machine learning library MLlib:
- Generalized Linear Models
- Random Forest
- Naïve Bayes
Microsoft Machine Learning Server Overview
- What is Microsoft R Server?
- Using Microsoft R Client
- The ScaleR functions
Data Munging
- Understanding XDF files
- Data I/O
- Variable transformations
- Data subsetting, splitting, and merging
Data Summarization
- Creating visualizations
- Numerical summaries
Processing Big Data
- Transforming Big Data
- Managing datasets
Implementing General Linear Models
- Establishing and leveraging partitions/clusters
- Fitting regression models and making predictions
Implementing Other Models
- Decision Trees and Random Forests
- Naïve Bayes
Conclusion