Predictive Data Science: Foundations Boot Camp

The data relevant to our organizations, customers, operations, and goals has never accumulated faster or in larger volumes, and the need for intelligent data analysis has never been greater. Vast reserves of value hide within large, complex datasets. Finding that value can be a challenge, but the insights and answers lurking within our information can be translated into a host of opportunities and advantages. With the right skills, only your own creativity limits how you can leverage your data for better decisions, analytics, and prediction.

Fortunately, today's data science methods are more practical and accessible than ever. The open-source R environment provides a straightforward yet powerful toolbox for predictive modeling and deep analysis.

This hands-on, three-day course advances your data analysis skills into the realm of real-world data science. If you have a working familiarity with R, the class equips you to go back to work with practical predictive modeling and basic machine learning techniques. Guided by expert data scientists, you will work in R to lay your data science foundation and learn techniques that let you leverage your data in sophisticated, powerful new ways.

Who Can Benefit

Intermediate-level data analysts interested in expanding their data mining processes. The course emphasizes data foundation and machine learning concepts. All exercises are performed in R.

Prerequisites

  • Some knowledge of data analysis
  • Basic knowledge of descriptive statistics
  • Some experience with R

Course Details

Section I: Overview of Data Science

  • 1. Data Science as a quantitative discipline
  • 2. Overview of the data mining process cycle

Section II: The Data Foundation

  • 3. Data sources
  • 4. Types of data
  • 5. Working with missing values
  • 6. Working with outliers
  • 7. Working with duplicate records

Section III: Sampling and Hypothesis Testing

  • 8. Why sampling may be important for Machine Learning
  • 9. Sampling techniques and sample bias
  • 10. Statistical hypotheses
  • 11. Z-score, t-score and F statistic
  • 12. P-values
  • 13. Applying hypothesis testing to model evaluation

Section IV: Machine Learning Fundamentals

  • 14. What is Machine Learning?
  • 15. Supervised vs. unsupervised learning
  • 16. Overview of supervised Machine Learning
  • 17. Overview of unsupervised Machine Learning
  • 18. Overview of major steps in building and testing quantitative models

Section V: Building a Linear Regression Model with R

  • 19. Univariate regression vs. multiple regression
  • 20. Overview of the mathematical foundation of linear regression: least squares vs. maximum likelihood
  • 21. Model assumptions
  • 22. Working with continuous attributes
  • 23. Dealing with collinear variables
  • 24. Model subset selection
  • 25. Automating the model selection procedure
  • 26. Model parameter evaluation: R-squared vs. adjusted R-squared
  • 27. Validating the model
  • 28. Working with categorical variables
  • 29. Considering input variable interactions

Section VI: Example of Building a Classification Model with R

  • 30. Dealing with imbalanced training sets
  • 31. Understanding the confusion matrix
  • 32. Evaluating binary classifiers using ROC / AUC

Section VII: Example of Cluster Analysis with R

  • 33. Overview of the mathematical foundation of cluster analysis
  • 34. K-means clustering method

Section VIII: Dimension Reduction Techniques with R

  • 35. What is dimension reduction?
  • 36. The practical goals of dimension reduction implementation
  • 37. Principal component analysis vs. singular value decomposition
  • 38. How many components to choose

Section IX: Class Conclusion

  • 39. What was not covered in the class
  • 40. Big Data Analytics – the future of machine learning: main tools and concepts