This meeting is being recorded. Welcome to the webinar everyone. I'm just going to give a few moments for the attendees to trickle in. We should be getting started shortly. Thanks so much for joining us. (silence). Hello everyone. Thanks for joining. We're just going to give a few more minutes for the attendees to join. We should be getting started in less than a minute here. (silence).
All right, well, considering the webinar is just one hour long, I don't want to eat up too much time waiting for attendees to join so I think I'll get started. Welcome everyone. Today's webinar is titled Performing Data Science and Machine Learning At Scale Using Apache Spark. Thank you so much for joining us. This webinar will be presented by Brian Blakely, a data scientist and technical instructor at Cloudera.
During the webinar, everyone's phones will be muted. So if you have questions, please enter them into the Q&A box at the bottom of your screen and Brian or I will answer them during the question and answer portion of the webinar. Today's webinar is being recorded and you will all be sent a copy of the recording. The webinar is hosted on Zoom, and the features you will experience today are merely a snippet of how we deliver our live online classes, which we call IMVP.
Unlike this webinar, our IMVP classes are interactive. They allow students to see and hear both your instructor and classmates. If you enjoy this presentation today, and you're interested in learning more about training anywhere with IMVP, please visit our website or contact us. As someone interested in exploring data science, we encourage you to stay till the end of this presentation, where we will reveal an amazing promotion we're running with our upcoming Cloudera classes. Here's a hint. It's something to help get you motivated, keep healthy and be active while staying connected. More details to come. All right, let's get started with our presentation. Brian, you can take it away.
All right. Thank you Michelle, and thank you for inviting me here to talk about Performing Data Science And Machine Learning At Scale Using Apache Spark, and about our Cloudera data scientist training in particular. So as Michelle mentioned, my name is Brian Blakely. I'm a technical instructor at Cloudera. I'm primarily responsible for the development and delivery of our data science and machine learning curriculum. I've been at Cloudera for about two years. Before I joined Cloudera, I spent over 20 years as a practicing data scientist.
So I did a variety of things, from writing low level machine learning algorithms in a corporate research lab to doing very applied analytics for a management consulting firm. And then I did the obligatory tenure at a small startup that was gobbled up by a much larger company. So we try to inject that sort of experience into our course, as well as the collective experience of our Cloudera team. What I want to do today is give you a brief overview of our data science course and then jump in and give you a representative sample of what you'll experience in the course. Hopefully, you'll learn some machine learning and some Spark along the way, because that's our goal in the course: to help you learn.
So let's talk a little bit about the overarching objective of the course, Cloudera data scientist training. What we want to do here is give you the technical skills to explore, process, analyze and model much larger and more varied data sets than maybe you're working with today. And that's going to allow you to solve problems that you could not solve previously with smaller and maybe less varied data sets. Even though it might not be your primary objective, this is going to enable you to positively impact your organization and drive business value.
This is certainly your boss's, and your boss's boss's, primary objective. We all want to make a positive impact on our business, and we're getting more and more data, and more types of data, at a faster rate. We want to be able to work with that data and provide value to our organization. So those are the overarching objectives; we're going to be focused on the technical skills. Obviously there's a lot more to making a positive impact on your business than the technical side. There's the organizational side, the cultural side and other aspects, but we're going to focus on the technical skills.
Those technical skills are going to revolve around a technology called Apache Spark. So Apache Spark is what we call the lead actor in this course. If you're not familiar with Apache Spark, I'll give you a quick overview here. It's an open source cluster computing framework that allows you to explore, process, analyze and model huge datasets, on the order of terabytes, even up to petabytes, of data.
Underneath, it's a fairly low level library, but recent incarnations of Spark have made it much more feasible for data science. For example, Spark now provides high level libraries: the Spark SQL library for working with structured data, Spark Streaming and Structured Streaming for working with streaming rather than batch data, the MLlib library for doing machine learning, and the GraphX library for processing and analyzing graph or network data.
So these high level libraries give us much of the functionality that we need as data scientists. In addition, recent versions of Spark also provide Python and R interfaces that are really first class citizens among the Spark language APIs. This wasn't always the case. When I first got involved in Spark, four or five years ago, if you wanted to fully take advantage of Spark, which is written in the Scala programming language, you had to write either Scala or Java code. That's no longer the case for the most part. Python and R are first class citizens, and Apache Spark is a much more feasible and much more useful option for data scientists who are more comfortable in Python and R than in lower level languages like Scala and Java.
And finally, Spark is just one tool. It's not a silver bullet, it's not going to solve every problem that you have. It's going to allow you to solve problems that you could not solve before. But you're going to integrate it with your other tools. So it does naturally integrate with the Hadoop ecosystem, the big data ecosystem, as well as your favorite Python and R data science stacks.
So that's a nice segue into the supporting cast that we discuss in the Cloudera data scientist training. Again, Spark is going to be our primary actor here. The first member of the supporting cast, which we'll introduce you to and use in the demonstration that we're going to do here, is the Cloudera data science workbench. The Cloudera data science workbench is a data science environment built for the enterprise. It allows you to develop data science programs, but it also provides functionality that's important to the enterprise, such as security, scalability, and collaboration.
So all the demonstrations and all the exercises that we do in the course are performed via the Cloudera data science workbench. So Spark was designed from the beginning to integrate naturally with the Hadoop ecosystem. So the Hadoop ecosystem is a set of tools that you can use to again, process, analyze, and model big data. So for example, in the course, the data that we use is all stored in the Hadoop Distributed File System, which is a file system that you can use to store immense datasets in a distributed fashion.
So we talk about HDFS, and the data that we use is all stored in HDFS. There are different file formats you can use to store that data. A very popular file format in the Hadoop world is Parquet. Parquet is an efficient binary format, efficient both in terms of storage and computation. We introduce you to Parquet and some of the other file formats that are common within the Hadoop ecosystem.
So we're going to read that data into Spark. We'll also introduce you to a couple of SQL-on-Hadoop tools, Hive and Impala, that you can use to run SQL queries against your data in Hadoop. And also Hue, the Hadoop User Experience, which provides a nice graphical interface to many components of the Hadoop ecosystem. It allows you to browse data in HDFS, browse Hive and Impala tables, and run queries against those tables. You can also manage your Spark applications and other applications that are running on your Hadoop cluster.
So we introduce you to those Hadoop ecosystem components that you're most likely to run into as a data scientist. That's some of the supporting cast here on the left. But we also want to integrate Spark with our existing workflows. We may have a workflow in Python, we may have a workflow in R that we're comfortable with, and we want to integrate Spark into that workflow. So if you're using things like pandas, matplotlib, NumPy, or scikit-learn for machine learning, we want to be able to move back and forth between Spark and our favorite Python tools.
Analogously, if you're an R user, we want to do the same thing. If you're using dplyr, ggplot2, or the caret package to build your machine learning models, we want to integrate our small data workflows in Python and R with our big data workflows in Spark. We walk through these technologies and these integration points in the Cloudera data scientist training class.
What are the prerequisites for the course? When this course was originally designed, it was designed for experienced data scientists: data scientists who are knowledgeable and are practicing data science, so they're familiar with data science and machine learning processes, methodologies and algorithms. They're proficient with Python or R, maybe SAS, for doing data science and machine learning on relatively small data sets.
And when I say small here, it's not a pejorative, it just means these are data sets that maybe fit on your laptop. You can process them on your laptop, maybe a souped up desktop or server. But essentially they fit on a single machine. Increasingly, though, we're finding that we're running into scale issues. These data sets are getting larger and larger, and I can only do so much on my laptop. So we find these data scientists are coming to us because they're running into scale issues. They want to be able to scale up their processes, methodologies and algorithms to much larger data sets. They have little or no experience with the Hadoop ecosystem in general or Apache Spark in particular, and they really want to bootstrap and get up and running quickly.
So that's the original audience for which the course was designed. Probably about half of the folks that sign up for the course these days come in with that background. The other half are what I would characterize as aspiring data scientists. And in fact, there are many more aspiring data scientists in this world than there are experienced and practicing data scientists.
So these folks maybe come from a data engineering, data analyst, developer or technical architect background. They may not have the data science and machine learning experience. But if that's your background and you're proficient at asking and answering questions using data, maybe using a SQL dialect or some other general purpose programming language, then that's a good starting point and you're going to get a lot out of the course.
If you've got some experience with the Hadoop ecosystem, then that's a bonus. It's going to allow you to focus on what's new, the data science and machine learning component. And obviously, if you've got some exposure to data science and machine learning, maybe you work with data science teams in your organization, maybe you've read some books, sat in on some online courses, any sort of background that you have will ease the transition.
So we're not magically going to turn you into a data scientist or machine learning expert in four days. But these folks definitely walk away from the course feeling like it was a valuable experience. Probably the most valuable part, judging from the interviews that I've done, is that you really understand where you need to go next, where you need to focus your continuing education efforts if you really want to get into this exciting new area of data science and machine learning.
All right, so we accommodate both audiences in the course. How is the course organized? It's organized as a four day workshop, centered around a realistic case study. You'll see a sample of that case study when we walk through the sample module. We walk through pretty much an end to end analytic workflow, so we're going to start at the beginning. Once we've introduced you to the case study, we'll begin by reading data in from a variety of different data sources and different file formats.
We'll explore that data, clean it up and walk through the process. We do that in a sequence of short modules that consist of a very brief lecture; most of the time is spent on an interactive demonstration in the Cloudera data science workbench environment, and then we give you time to practice those skills and extend some of the work we do with extensive hands on exercises. It is a workshop, so there are specific exercises to do, but we also find that a lot of folks will go off and do their own exercises as well. Being curious data detectives, they want to explore the data in their own way, and we encourage and support that.
We try not to bombard you with too many PowerPoint slides. We've got a minimal number of slides. We want to get through interactive demonstrations and get on to exercises as quickly as possible; that's where you'll feel like you learn the most and the quickest. The course is primarily taught in Python, using the PySpark API, but we do give you an overview of the sparklyr interface to R. There are a couple of reasons why we focus on Python here. The first is that Python is increasingly becoming the more popular language for data science, although R is quite prevalent; I've probably spent more of my general career in R but more of my Spark career in Python.
But really, we don't do a lot of pure Python in the course. We use it as an enabling language; our primary objective is to teach you the Spark API. And you'll see that the Spark API is quite consistent across the different language APIs, whether you're using Scala, Java or Python. Probably the best way to learn it without the underlying language getting in the way is via Python. Even if you're an R user, and a sparklyr user in particular, this helps: sparklyr abstracts away a lot of the details of Spark, and that's great, that's what it's designed for. But when that doesn't work, it's helpful to know what's going on in the underlying Spark API.
So even if you do want to go forward and use the sparklyr API, knowing the underlying Spark API that sparklyr is calling is a great advantage; it's going to help you in the long run. So we focus on Python, but we will show you the similarities and the differences between the PySpark and sparklyr APIs. As I mentioned, there are a number of libraries in Spark. The ones that we're most interested in for this course are the Spark SQL library for processing and analyzing structured data, and the MLlib library for building machine learning models.
Here's some more detailed background on what we cover in this course, some selected topics. In fact, there are more topics than we can cover in a four day course. The instructor has some flexibility to swap out topics depending on the makeup of the audience. So one of the first things that we do is try to get a feel for your background and your interests, and when possible, customize the delivery to the audience by swapping some material out and some other material in.
Regardless of whether we cover the material, the underlying scripts are all yours. You're going to take them with you at the end of the course, so if there is something that we don't cover in class, you'll take it home and have a nice template, nice examples, to study on your own. And you'll see an example of one of those scripts when we walk through the sample module in a minute.
So here's a pretty representative selection of topics that we would cover in a typical delivery. The first thing we're going to do is get you introduced to the Cloudera data science workbench environment, since that's where we're going to spend most of our time, and get you up to speed on the case study. Then we jump into the data science process: reading and writing data, inspecting data quality, cleansing and transforming data, joining data, summarizing and grouping, and then exploring and visualizing your data.
That's focused on the Spark SQL library, and that's the first half of the course. We do talk about how distributed computing with Apache Spark is still hard; it's not as hard as it used to be, but you're still going to have to know how to monitor, configure and tune your Apache Spark applications. For the most part, we teach you the API and ignore what's happening under the hood, but at some point we need to teach you how Spark works underneath, so that you can optimize your distributed Spark applications.
Once we do that, the rest of the course is really focused on machine learning. We'll introduce you to the functionality that's available in the Spark MLlib library, and then we'll walk through various machine learning workflows: extracting, transforming and selecting features, which are the inputs to our machine learning models. And then we'll build a variety of different types of machine learning models: regression, classification, clustering, topic models, recommender models. We'll define those and walk through representative examples.
Also, most of these models require the user to experiment with a lot of different parameters, so we show you how to use Spark to automate those experiments. And then we talk about wrapping our machine learning workflows into pipelines, pipelines that let us share our process seamlessly with other data scientists, or with data engineers and developers, so that they can deploy them in machine learning applications written in, for example, Scala or Java. And then all our hard work can actually make that impact on the business.
All right, so those are the topics. What I'd like to do now is jump into a sample of one of these topics, give you a little bit more flavor for how the course runs, and hopefully teach you something about machine learning and Spark along the way. The organization that we focus on in the case study is a newly established ride sharing company. Hopefully many of you have domain expertise on ride sharing, at least as a rider; maybe some of you have driven at some point. One of the important things in data science is understanding your problem, having domain expertise.
So we've picked a case study that hopefully most people can relate to at some level. So we've got this newly established ride sharing company. Since they are newly established, their business objectives at this stage are to really understand their riders and the rider experience. They've got some data that will allow us to do that. So we've got some text reviews of the rides that are submitted by the actual riders.
And the analytic exercise we're going to walk through here is to extract themes from those ride reviews. Now, we certainly could use our biological neural network, our brain, to read through those ride reviews and extract themes or look for patterns. But we want to do that in an automated fashion. I'm going to walk through a typical module on processing text and using that processed text to fit and evaluate what we call a topic model.
I'm going to jump away from the slides and into our data science workbench environment. All right, so here's our Cloudera data science workbench; everyone will have access to this environment and a training project within it. Here I've got a data scientist training project, and each student has their own training project. So I'm going to go ahead and select my training project and open a workbench.
And this is where we spend most of our time in the course doing the interactive demonstrations and where the students will be doing exercises here. So this environment here on the left, you'll see I've got my project directories and project files and we've got a bunch of Python snippets. We spend most of our time as I mentioned, going through Python code. We work through these Python scripts throughout the course.
We've also got solutions to all the exercises available in a solutions directory, and then we've got supplemental material. As I mentioned, there's more material than we can cover. If there are certain questions or certain interests in the audience, the instructor has the flexibility to pull in some of this supplemental material and follow up on other topics. Again, you're going to take all this with you at the end of the course regardless of whether we cover it or not.
We've also got a directory of R code as well. So even though we don't go through all this R code, we've got parallel versions of most of the scripts in R, so if you do want to learn how to use sparklyr to run Apache Spark jobs, this will be available; you're going to take this home with you as well. All right, so we're going to walk through one of the Python examples here. To do that, I'm going to select my text file and it shows up in this text editor here.
And just to give myself a little bit more screen real estate and focus your attention on the results, I'm going to hide that sidebar. So here's my Python script. I'm going to walk through it in chunks, probably a little more quickly than I would in class, but it'll give you a flavor for what we do. In order to run this script, we need to marshal some computing resources in our Cloudera data science workbench environment.
So I'm going to start up a Python 2 session so I can run this script in Python, and I'm going to ask for a certain amount of resources. We get you familiar with this environment in the course, but we're just going to use it for now. Basically, what I'm doing is spinning up a virtual machine where I can run my Python script, and now I've got a Python environment. I can run Python commands and say hello, attendees.
I type a Python command down here and the results are echoed above. But what I want to do is run through this script and do some text processing and machine learning. So we're running through this example here; I'll just note the name of the script, and we can run blocks of code and talk through them. As I mentioned, the scenario is that we've got some text reviews and we want to extract some themes from those text reviews.
That's called topic modeling. A canonical example of topic modeling is news article categorization. Something like Google News: you've got articles and you want to categorize them into different topics. So I might have world news, US news, technology versus economics versus sports and so forth. These topic model algorithms are useful for extracting those topics from news articles, or in general from any set of text documents.
Here we've got a set of ride reviews. The particular algorithm that we're going to use, which is supported in Spark, is called the Latent Dirichlet Allocation (LDA) algorithm. It's beyond the scope of the course to go into the technical details of the algorithm, but essentially the algorithm assumes that each document is distributed across various topics with certain frequencies, and that certain words are distributed within a particular topic with certain frequencies as well.
It tries to find the frequencies that best fit the data that we give the algorithm. And the main user-specified parameter that we're going to have to explore as data scientists is the number of topics. So we're going to choose a number of topics and then let the algorithm extract that number of topics from the data.
That number of topics is what we call a hyperparameter. All right, so let's jump into an example. The first thing we're going to do is set up our environment, so we'll import some Python packages that we're going to use to do some plotting. Spark does not have plotting functionality, so if you want to plot your data, you're going to have to use your favorite Python or R package to do that. We're going to import a package here called wordcloud that's going to allow us to plot a word cloud, and we're going to use some other plotting packages that we've already installed.
The next thing we want to do is actually start up Spark, what we call a Spark session. Now, Spark is actually a set of Java processes that run on our distributed cluster, and in order to start up those Spark Java processes, we need to start a Spark session. We've covered this earlier in the course, and this is generally boilerplate code that you'll have in all your examples. So we're going to go ahead and start our Spark session here, and that's going to give us this Spark object.
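To give a rough idea, that boilerplate might look like the following minimal sketch; the application name here is an illustrative assumption, not the course's exact code.

```python
# Minimal sketch of starting a Spark session from PySpark.
# The app name below is a placeholder, not the course's exact setting.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ride-reviews-topic-model") \
    .getOrCreate()
```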
That Spark object is basically a handle into all those Java processes running on the various nodes of our cluster, which are going to read and process the data once we give Spark specific instructions to do so. All right, so we've got the Spark session up and running. Now what we want to do is read some data into Spark, into a Spark data frame.
So the next thing we're going to do is read in those text reviews. I'm going to select that block of code and load the data. What I've done here is taken that Spark session that we just created and read in a Parquet file. As I mentioned, Parquet is a nice efficient format that we use regularly in the big data world to store data. We're reading a file in from the Hadoop Distributed File System, located at this directory.
It's stored in Parquet format and we're creating a Spark data frame called reviews. Spark data frames are inspired by R data frames and pandas data frames. They have a lot of similarities to R and pandas data frames, but they are very different beasts. In the course we discuss the similarities and the important differences between Spark data frames, which are distributed data frames, and the R or pandas data frames that you may be more familiar with.
All right, so we've read in this reviews data frame and we can take a look at it. Actually, let me do a different version of this: I'm going to use the show method to print out a slightly more user friendly version of this data set. All right. So here's a sample of records from my ride reviews data frame; I've just taken my data frame and printed a few rows using the show method.
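As a rough sketch of that read-and-inspect step (the HDFS path below is a placeholder, not the actual course directory):

```python
# Read the ride reviews from HDFS, stored in Parquet format, into a Spark
# DataFrame. The path is a placeholder for the course's actual data location.
reviews = spark.read.parquet("hdfs:///path/to/ride_reviews/")
reviews.printSchema()
reviews.show(5, truncate=False)  # print a few rows in a readable form
```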
It's a very simple data frame. It has two columns: the ride ID, which will allow me to match this back to some ride data that I have, and, what we're more focused on in this example, the actual review text. Again, these are short snippets, almost tweets, about a ride experience, written by the actual ride sharing rider. So here's a review: Dale is extremely cordial, presumably Dale is the driver. He was very friendly, it was a very junky car, most awful stench of all time, and so forth. You can probably see some themes here: I could not breathe, the car stunk, and so forth.
So again, we could use our biological neural network, our built in deep learning, to look for themes, but we're going to use machine learning, Latent Dirichlet Allocation and Apache Spark, to process this ride reviews data, which could be quite large. But the Latent Dirichlet Allocation algorithm will not accept this data as is. We're going to have to do a little bit of work to transform the text data into numeric data that the Spark machine learning algorithm can process.
Fortunately, Spark MLlib provides a number of what are called feature transformers, extractors and selectors to help us prepare the data for the various machine learning algorithms. We're going to go through a few of them in this particular module; we walk through others that are more appropriate for other types of data preprocessing. The first thing we want to do in this particular example is to parse out, or tokenize, our words.
So I'm just going to run a block of code here and introduce you to a few of these Spark MLlib feature extractors, transformers and selectors. We're going to extract and transform features, and these features are going to be the inputs to my machine learning algorithm. The first thing we're going to do is take those text reviews and parse them into individual words. We can use the Tokenizer class here, a transformer that's available in the Spark MLlib library, to do that.
The pattern is going to be very consistent; the MLlib package is very nice, very clean, very consistent. Once you see a few examples, you'll get the hang of things, and most of these classes behave the same way. The first thing we have to do is import the Tokenizer, with a typical Python import statement, from the feature module. Then we create an instance of that Tokenizer. We pass in the input column, which is our text review, and we specify a new output column called words, which is going to hold an array of words after they've been tokenized.
Now, to actually do the tokenization, we call the transform method, pass in our review data, and we get back a tokenized data frame. This is a new Spark data frame that has the new column with the tokenized words. We can print out a few records here. So for example, you can see Dale is extremely cordial: here's our original text, and here are the tokenized words. What we have is an array of Unicode strings, the individual words: Dale, is, extremely, cordial.
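A minimal sketch of that tokenization step might look like this (the input column name "review" is an assumption about the schema, not necessarily the course's exact name):

```python
# Split each review into an array of words using the basic Tokenizer.
from pyspark.ml.feature import Tokenizer

tokenizer = Tokenizer(inputCol="review", outputCol="words")
tokenized = tokenizer.transform(reviews)
tokenized.select("review", "words").show(5, truncate=False)
```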
We've done that for each review on a potentially huge data frame that is generally distributed across the nodes of the cluster, but that's all hidden from view here. Now, if you've got an eagle eye, you may notice there are some potential issues. We have not stripped out the punctuation. So for example, the driver drove so well: we've got well with two exclamation points. We may or may not want to include those exclamation points in our word definition.
Fortunately, Spark MLlib has a more sophisticated tokenizer called the RegexTokenizer that we can use to customize our approach to tokenizing the words. The basic tokenizer is not terribly smart; the RegexTokenizer is smarter. I won't go into details here, but we can define regular expressions that define words or define split criteria. We can re-run the tokenization on our data and strip out just the words and not the punctuation.
If we look at the result of that newly tokenized data, we can see that for the driver drove so well, we've now got well without exclamation points. These are short text ride reviews, so there may be information in well with no exclamation points versus one, versus two, versus three. That's part of your decision as a data scientist: determining whether that punctuation is informative or not.
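Here's a hedged sketch of that customized tokenization, assuming we split on runs of non-word characters; the exact pattern used in class may differ.

```python
# RegexTokenizer lets us control how reviews are split into words.
# Splitting on runs of non-word characters (\W+) drops the punctuation;
# this particular pattern is an assumption, not necessarily the course's choice.
from pyspark.ml.feature import RegexTokenizer

regex_tokenizer = RegexTokenizer(inputCol="review", outputCol="words",
                                 gaps=True, pattern="\\W+")
tokenized = regex_tokenizer.transform(reviews)
tokenized.select("review", "words").show(5, truncate=False)
```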
All right, so once we've tokenized those words, we can produce a word cloud to visualize the frequency of occurrence of the words across all our reviews. What I've done here is integrate my Spark workflow with my Python workflow. I'm pulling some data out of my Spark data frame, which is sitting on my Hadoop cluster, bringing it into my local Python session, and plotting it using the wordcloud package.
We've written a Python function to do that: extract the data and plot the word cloud. The word cloud plots the words that show up in the overall reviews, and the size of each word is dictated by its frequency, the number of times it shows up across all reviews. Again, those of you who are familiar with natural language processing, or even those of you who are just clever, will notice that there are a lot of common words that probably won't be relevant as we develop our topic model.
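As an illustration, a small helper along these lines could pull the words back to the driver and plot them; the function and variable names are assumptions, not the course's exact helper, and collecting to the driver is only reasonable for modestly sized or sampled data.

```python
# Pull the tokenized words back into local Python and plot a word cloud.
# This is a sketch, not the course's exact helper function.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_word_cloud(df, col):
    # Collect the word arrays to the driver (fine for small or sampled data)
    # and flatten them into one long string for the wordcloud package.
    words = [w for row in df.select(col).collect() for w in row[col]]
    cloud = WordCloud(background_color="white").generate(" ".join(words))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()

plot_word_cloud(tokenized, "words")
```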
Obviously the, of, was: these are common words, called stop words in the natural language processing lexicon, that we probably want to exclude from our modeling exercise. We want to focus on the words that are relevant to a ride sharing experience. Fortunately, Spark has another transformer that will allow us to do that. It's called StopWordsRemover, and, not surprisingly, it will remove common words from our reviews.
The process is going to be very similar to what we did above. We import the StopWordsRemover from the feature module and create an instance of it. We'll use the default stop words; here's a sample of them. But we could pass in our own list, or we could use other languages. It supports, I think, about a dozen different languages with different stop word lists.
We're going to go with the default stop words and apply the remover using the transform method. We get a new data frame that has a new column called relevant words, which is an array of only those relevant words. So if we go back to our the driver drove so well, we can see that in the relevant words we've dropped the, we've dropped so, and we're just left with driver drove well. That's just one particular example. We can reproduce our word cloud, and now we see that the words that are left over are those that are most relevant for extracting themes or topics from our reviews.
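For reference, a minimal sketch of that stop word removal step, using Spark's default English stop word list (the column names are assumptions):

```python
# Remove common English stop words from the tokenized reviews.
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="relevant_words")
removed = remover.transform(tokenized)
removed.select("words", "relevant_words").show(5, truncate=False)
```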
Words like ride and driver, freshener, air, reeked, smelled, breathe, and so forth; horrible service. These are things that we're going to find useful. All right, we're getting closer to building a model, but we've got a little bit more work to do. We need to convert this into numeric data. We're going to use a bag of words approach and create a vector of counts, an array of counts. We're going to count up the number of times the most frequent words show up in each review.
So we're going to use the CountVectorizer to create what we call a bag of words for each review. The CountVectorizer is going to do two things. It's going to create a vocabulary, a list of the top 100 vocabulary words, where 100 is a setting that we choose as data scientists. And then it's going to count up how many times those vocabulary words show up in each review. Most of those words aren't going to show up at all, but when they do show up, we count them.
When we apply this CountVectorizer, what we get is a vector of length 100, where each element represents the count of one vocabulary word. We're in a big data world and most of those counts are going to be zero, and we don't want to store zeros; we want to be efficient. So this word count vector that Spark produces is actually what we call a sparse vector. It's a sparse vector of length 100, which means it has 100 elements. In this case, element one has the value one, meaning word one shows up one time; word 40 shows up one time, and word 96 shows up one time.
All the other vocabulary words do not show up at all, so we don't store those zeros. It's this word count vector, this vector of length 100, that we're going to take as the input to our LDA model. And with that, we've done all the preprocessing, which is the hard part.
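A sketch of that bag-of-words step; note that CountVectorizer is an estimator, so it is first fit to learn the vocabulary and then used to transform the data. The vocabulary size of 100 follows the description above, and the column names are assumptions.

```python
# Build a vocabulary of the 100 most frequent relevant words and count
# how often each appears in each review, stored as a sparse vector.
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(inputCol="relevant_words",
                     outputCol="word_count_vector", vocabSize=100)
cv_model = cv.fit(removed)                # learns the vocabulary
vectorized = cv_model.transform(removed)  # adds the sparse count vectors
```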
Actually building the machine learning model is somewhat anticlimactic, in the sense that the Spark MLlib package is very clean and consistent. It's basically a three step process. We import the LDA algorithm from the clustering module, we create an instance of it, and we pass in that word count vector column that we just created, along with the number of topics that we're going to look for.
This is something that we're going to have to experiment with. There are a lot of other parameters we could experiment with; we can see a list of those using explainParams, but we're going to go with the defaults. In general, though, you're going to have to explore many of those. We're also going to set a random number seed so that we can replicate our results. The other parameters that are listed here we'll leave at their defaults.
All right, so once we've defined our LDA instance, we can call the fit method, and this is where the magic happens. We take our LDA algorithm instance, call the fit method, and pass in the vectorized data, the data frame that has the count vector. That returns our actual LDA model. We can then evaluate that model; there are various attributes we can look at, but it's probably easiest just to plot some of the results of the model and get a feel for the topics.
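A minimal sketch of the define-and-fit steps just described, with two topics as in the walkthrough; the seed value and column names here are assumptions.

```python
# Fit a two-topic LDA model on the word count vectors.
from pyspark.ml.clustering import LDA

lda = LDA(featuresCol="word_count_vector", k=2, seed=12345)
print(lda.explainParams())       # list the other tunable parameters
lda_model = lda.fit(vectorized)  # this is where the work happens
```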
At this stage, it really becomes an exploratory data analysis exercise, where we want to understand the particular topics that the LDA model has discovered in the data. What I've done here is take some information out of the LDA model and write a small Python function that plots it in a way that's a little easier to digest. So here we've got two topics, topic zero and topic one; remember, we told the algorithm to search for two topics. For the first topic, topic zero, I've plotted the five most important words that the algorithm found.
Air and freshener are the top two words, followed by ride, vehicle and do. If you remember some of the reviews that we looked at, there were several, even in a small sample, that revolved around air quality or air freshness. So this first topic that the algorithm seems to have found revolves around air and freshener. We'd want to do more analysis here to delve into this topic and understand it more, but this is a starting point for that exercise.
Now, for the second topic: there are probably more than two topics in the data, and the conjecture here is that this topic captures everything else. We've extracted out an air quality topic, but there are probably a lot of other topics that have fallen into topic one. For topic one, the two most important words are driver and ride. If you go back and look at some of those original text ride reviews, you'll notice that they revolve around the rider and driver experience.
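For reference, a rough sketch of pulling the top words per topic out of the model yourself, rather than using the course's plotting helper (the variable names are assumptions):

```python
# Inspect the discovered topics: describeTopics gives the top term indices
# per topic, which we map back to words via the CountVectorizer vocabulary.
vocab = cv_model.vocabulary
for row in lda_model.describeTopics(maxTermsPerTopic=5).collect():
    terms = [vocab[i] for i in row["termIndices"]]
    print("Topic %d: %s" % (row["topic"], ", ".join(terms)))
```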
There are probably further subtopics in here, and in the exercises you can go ahead and work through those. We can continue to explore: we can take this model, apply it to new data, and see how each individual review gets classified (see the sketch below), and further analyze our model. In the demonstration in class we would do that, but then what we want to do is get you to the exercises and have you re-run this with more topics.
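Here's a brief sketch of that apply-to-data step; a hedged example, not the course's exact code, and the review column name is an assumption.

```python
# Score the reviews with the fitted model; each row gets a topicDistribution
# column giving the estimated mixture of topics for that review.
scored = lda_model.transform(vectorized)
scored.select("review", "topicDistribution").show(5, truncate=False)
```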
So try to break out further themes by re-running the LDA model with, for example, three topics, and see what you find. There is also a variety of other text preprocessing we can do. We did very simple text preprocessing, but Spark MLlib provides a number of other text processing algorithms. We'll introduce you to the NGram transformer in the exercises, so you can apply that to the example. And again, solutions are given for all these exercises if you want to walk through them, and you get to take all these scripts with you at the end of the course.
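As a pointer for that exercise, here is a minimal sketch of the NGram transformer; the choice of bigrams and the column names are assumptions.

```python
# Build bigrams from the relevant words; n-grams can capture short phrases
# like "air freshener" that single words miss.
from pyspark.ml.feature import NGram

ngram = NGram(n=2, inputCol="relevant_words", outputCol="bigrams")
with_bigrams = ngram.transform(removed)
with_bigrams.select("relevant_words", "bigrams").show(5, truncate=False)
```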
Finally, we give you some links to background information. More importantly, we make sure that you know how to use the Spark documentation. The Spark documentation is very handy, and you'll need it to work through some of the exercises. So we've got links available here that you can click on to get to the official documentation.
All right, so that was a quick, representative overview of what we would do in class. We didn't go through all the details that we normally would; there are a lot of assumptions, since by this stage of the course we would have already covered a lot of material. But that should give you a feel for how a typical module works. Hopefully you saw a little bit about how Spark SQL works, how Spark MLlib works, and also a little bit about machine learning. So what I want to do now is turn it back over to Michelle. Thank you for attending and sharing a precious hour of your day with us. I'll turn it over to Michelle and we'll do some Q&A with the time we've got left.
Thank you Brian and again, thank you so much to all of our attendees. We are at the end of our presentation and we'll be moving on to the Q&A portion. So if you do have any questions, please post them in the Q&A box you can find at the bottom of your screen. Now, before we jump into that Q&A session, I do want to talk about that exciting promo that I hinted at in the beginning. Anyone who purchases Cloudera training today through December 31st will receive a free Apple Watch. Terms and conditions apply. Visit our website for more details.
On the slide here, you can find the direct link to that page. It's www2.exitcertified.com/freeapplewatch. Or if you just go to exitcertified.com, you will be able to find all the details about this exciting promo. All right. It is time for our Q&A. If you do have any questions, feel free to post them in that Q&A box. At the moment it doesn't appear we have any questions, but I'll give it a little more time and maybe some will appear. As they do, we will answer them. Another reminder that each and every one of you will receive a recording of this presentation, so you should receive that by the end of the week. Thank you. (silence).
Brian, it looks like you were so efficient with your presentation that we don't have any questions to be answered. If anything does pop up for any of the attendees and they do want to get in contact with us, you can do so by going to our website. There is a chat feature right on the website, or you'll find all of our contact information right there. I hope you all have a great day. Thank you again for joining us.
Thank you, Michelle. Thank you attendees. (silence).
I will post the URL in the chat. That's a great question. Just hang on for that link. (silence). All right, that wraps up our webinar. Thank you again so much. Have a great day.