[Recommended by Lu Qin] Top 10 data mining algorithms in plain English (Part 1)
Knowing the top 10 most influential data mining algorithms is awesome.
Knowing how to USE the top 10 data mining algorithms in R is even more awesome.
That’s when you can slap a big ol’ “S” on your chest…
…because you’ll be unstoppable!
Today, I’m going to take you step-by-step through how to use each of the top 10 most influential data mining algorithms as voted on by 3 separate panels in this survey paper.
By the end of this post…
You’ll have 10 insanely actionable data mining superpowers that you’ll be able to use right away.
Contents

- Getting Started
  - First, what is R?
  - Okay, what's knitr?
  - Few pre-requisites
- 1. C5.0
- 2. k-means
- 3. Support Vector Machines
- 4. Apriori
- 5. EM
- 6. PageRank
- 7. AdaBoost
- 8. kNN
- 9. Naive Bayes
- 10. CART
- You Can Totally Do This!
Getting Started
First, what is R?
R is both a language and environment for statistical computing and graphics. It’s a powerful suite of software for data manipulation, calculation and graphical display.
R has 2 key selling points:
- R has a fantastic community of bloggers, mailing lists, forums, a Stack Overflow tag, and that's just for starters.
- The real kicker is R's awesome repository of packages over at CRAN.
It’s a great environment for manipulating data, but if you’re on the fence between R and Python, lots of folks have compared them.
For this post, do 2 things right now:
- Install R
- Install RStudio
The next step is to couple R with knitr…
Okay, what’s knitr?
knitr (pronounced nit-ter) weaves together plain text (like you’re reading) with R code into a single document. In the words of the author, it’s “elegant, flexible and fast!”
You’re probably wondering…
What does this have to do with data mining?
Using knitr to learn data mining is an odd pairing, but it’s also incredibly powerful.
Here are 3 reasons why:

- It's a perfect match for learning R. I'm not sure if anyone else is doing this, but knitr lets you experiment and see a reproducible document of what you've learned and accomplished. What better way to learn, teach and grow?
- Yihui (the author of knitr) is super on top of maintaining, enhancing and making knitr awesome.
- knitr is light-weight and comes with RStudio!
Don’t wait!
Follow these 5 steps to create your first knitr document:
- In RStudio, create a new R Markdown document by clicking File > New File > R Markdown…
- Set the Title to a meaningful name.
- Click OK.
- Delete the text after the second set of ---.
- Click Knit HTML.
Your R Markdown code should look like this:
```
---
title: "Your Title"
output: html_document
---
```
After “knitting” your document, you should see something like this in the Viewer pane:
Congratulations! You’ve coded your first knitr document!
Few pre-requisites
You’ll be installing these package pre-reqs:
- adabag
- arules
- C50
- dplyr
- e1071
- igraph
- mclust
One final package pre-req is printr, which is currently experimental (but I think is fantastic!).
In your RStudio console window, copy and paste these 2 commands:
```r
install.packages(c("adabag", "arules", "C50", "dplyr", "e1071", "igraph", "mclust"))
install.packages('printr', type = 'source',
                 repos = c('http://yihui.name/xran', 'http://cran.rstudio.com'))
```
Then press Enter.
Now let's get started with the data mining!
1. C5.0
Wait… what happened to C4.5? C5.0 is the successor to C4.5 (one of the original top 10 algorithms). The author of C4.5/C5.0 claims the successor is faster, more accurate and more robust.
Ok, so what are we doing? We’re going to train C5.0 to recognize 3 different species of irises. Once C5.0 is trained, we’ll test it with some data it hasn’t seen before to see how accurately it “learned” the characteristics of each species.
How do we start? Create a new knitr document, and title it C50.
Add this code to the bottom of your knitr document:
This code loads the required packages:
```{r}
library(C50)
library(printr)
```
Can you see how plain text is woven in with R code?
In knitr, the R code is surrounded by triple backticks at the start and end of the chunk. This tells knitr that the text between the triple backticks is R code and should be executed.
Hit the Knit HTML button, and you'll have a newly generated document with the code you just added.
Sweet! Packages are loaded, what’s next? Now we need to divide our data into training data and test data. C5.0 is a classifier, so you’ll be teaching it how to classify the different species of irises using the training data.
And the test data? That’s what you use to test whether C5.0 is classifying correctly.
Add this to the bottom of your knitr document:
This code takes a sample of 100 rows from the iris dataset:
```{r}
train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]
```
Hang on, what's iris? The iris dataset comes with R by default. It contains 150 rows of iris observations. Each iris observation consists of 5 columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species.
Although we know the species for every iris, you’re going to divide this dataset into training data and test data.
Here’s the deal:
Each row in the dataset is numbered 1 through 150.
The first line of code takes a random sample of 100 of the numbers 1 through 150. That's what sample() does. This sample is stored in train.indeces.
The second line selects some rows (specifically the 100 you sampled) and all columns (leaving the part after the comma empty means you want all columns). This partial dataset is stored in iris.train.
Remember, iris consists of rows and columns. Using the square brackets, you can select all rows, some rows, all columns or some columns (a few examples follow below).
The third line selects some rows (specifically the rows not in the 100 you sampled) and all columns. This is stored in iris.test.
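If the square-bracket notation is new to you, here's a quick sketch of the selection patterns it supports (illustrative only; try these in the console rather than your knitr document):

```{r}
# Rows 1 through 5, all columns
iris[1:5, ]
# All rows, just the Species column
iris[, "Species"]
# Everything except rows 1 through 100, and only the first 2 columns
iris[-(1:100), 1:2]
```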
Hit the Knit HTML button, and now you've divided your dataset!
How can you train C5.0? This is the most algorithmically complex part, but it will only take you one line of R code.
Check this out:
This code trains a model based on the training data:
```{r}
model <- C5.0(Species ~ ., data = iris.train)
```
Add the above code to your knitr document.
In this single line of R code you’re doing 3 things:
- You're using the C5.0() function from the C50 package to create a model. Remember, a model is something that describes how observed data is generated.
- You're telling C5.0() to use iris.train as its training data.
- Finally, you're telling C5.0 that the Species column depends on the other columns (Sepal.Width, Petal.Length, etc.). The tilde means "depends on" and the period means all the other columns, so you'd read it as "Species depends on all the other column data." (An expanded version of this formula is sketched just below.)
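In case the tilde-and-period shorthand feels opaque, here's an equivalent, fully spelled-out formula (a sketch for illustration; both calls train the same kind of model, and model.explicit is my own name):

```{r}
# Species ~ . is shorthand for listing every remaining column as a predictor
model.explicit <- C5.0(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                       data = iris.train)
```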
Hit the Knit HTML button, and now you've trained C5.0 with just one line of code!
How can you test the C5.0 model? Evaluating a predictive model can get really complicated. Lots of techniques are available for very sophisticated validation: part 1, part 2a/b, part 3 and part 4.
One of the simplest approaches is cross-validation.
What's cross-validation? Cross-validation is usually done in multiple rounds. Here, you're just going to do one round: training on part of the dataset followed by testing on the remaining dataset.
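For a taste of what multiple rounds could look like, here's a minimal 5-fold sketch (my own variable names, using only base R plus the C50 package loaded above):

```{r}
k <- 5
# Randomly assign each of the 150 rows to one of 5 folds
folds <- sample(rep(1:k, length.out = nrow(iris)))
accuracies <- sapply(1:k, function(i) {
  fit <- C5.0(Species ~ ., data = iris[folds != i, ])                  # train on 4 folds
  preds <- predict(fit, newdata = iris[folds == i, ], type = "class")  # predict the held-out fold
  mean(preds == iris$Species[folds == i])                              # fraction correct
})
mean(accuracies)  # average accuracy across the 5 rounds
```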
How can you cross-validate? Add this to the bottom of your knitr document:
This code tests the model using the test data:
```{r}
results <- predict(object = model, newdata = iris.test, type = "class")
```
The predict() function takes your model, the test data and one parameter that tells it to guess the class (in this case the species). Then it attempts to predict the species based on the other data columns and stores the results in results.
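Before summarizing, you can eyeball a few predictions next to the truth (a quick sketch; the data.frame() call is just for side-by-side display):

```{r}
# First few predicted vs. actual species
head(data.frame(predicted = results, actual = iris.test$Species))
```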
How to check the results? A quick way to check the results is to use a confusion matrix.
So… what’s a confusion matrix? Also known as a contingency table, a confusion matrix allows us to visually compare the predicted species vs. the actual species.
Here's an example:

| results | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 21 | 0 | 0 |
| versicolor | 0 | 14 | 1 |
| virginica | 0 | 0 | 14 |

The rows represent the predicted species, and the columns represent the actual species from the iris dataset.
Starting from the setosa row, you would read this as:
- 21 iris observations were predicted to be setosa when they were actually setosa.
- 14 iris observations were predicted to be versicolor when they were actually versicolor.
- 1 iris observation was predicted to be versicolor when it was actually virginica.
- 14 iris observations were predicted to be virginica when they were actually virginica.
How can I create a confusion matrix? Again, this is a one-liner:
This code generates a confusion matrix for the results:
```{r}
table(results, iris.test$Species)
```
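If you'd also like a single summary number, this small sketch computes overall accuracy from the same table (the confusion name is my own; correct predictions sit on the diagonal):

```{r}
confusion <- table(results, iris.test$Species)
sum(diag(confusion)) / sum(confusion)  # overall accuracy
```

For the example matrix above, that works out to (21 + 14 + 14) / 50 = 0.98.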
Hit the Knit HTML button, and now you see the 4 things woven together:
- You've divided the iris dataset into training and testing data.
- You've created a model after training C5.0 to predict the species using the training data.
- You've tested your model with the testing data.
- Finally, you've evaluated your model using a confusion matrix.
Don't sit back just yet — you've nailed classification, now check out clustering…
2. k-means
What are we doing? As you probably recall from my previous post, k-means is a cluster analysis technique. Using k-means, we’re looking to form groups (a.k.a. clusters) around data that “look similar.”
The problem k-means solves is:
We don’t know which data belongs to which group — we don’t even know the number of groups, but k-means can help.
How do we start? Create a new knitr document, and title it kmeans.
Add this code to the bottom of your knitr document:
This code loads the required packages:
```{r}
library(stats)
library(printr)
```
Hit the Knit HTML button, and you'll have imported the required libraries.
Okay, what’s next? Now we use k-means! With a single line of R code, we can apply the k-means algorithm.
Add this to the bottom of your knitr document:
This code removes the Species column from the iris dataset. Then it uses k-means to create 3 clusters:
```{r}
model <- kmeans(x = subset(iris, select = -Species), centers = 3)
```
2 things are happening in this single line of code:
- The subset() function is used to remove the Species column from the iris dataset. It's no fun if we know the Species before clustering, right? (An equivalent way to drop the column is sketched after this list.)
- Then kmeans() is applied to the iris dataset (w/ Species removed), and we tell it to create 3 clusters.
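As promised, here's the equivalent way to drop the column by position (just an illustration; subset() and a negative index give you the same 4 feature columns, and iris.features is my own name):

```{r}
# Species is the 5th column of iris, so -5 drops it
iris.features <- iris[, -5]
head(iris.features)
```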
Hit the Knit HTML button, and you'll have a newly generated document for kmeans.
How can you test the k-means clusters? Since we started with the known species from the iris dataset, it's straightforward to test how accurate k-means clustering is.
Add this code to the bottom of your knitr document:
This code generates a confusion matrix for the results:
```{r}
table(model$cluster, iris$Species)
```
Hit the Knit HTML button to generate your own confusion matrix.
What do the results tell us? The k-means results aren’t great, and your results will probably be slightly different.
Here’s what mine looks like:
What are the numbers along the side? The numbers along the side are the cluster numbers. Since we removed the Species column, k-means has no idea what to name the clusters, so it numbers them.
What does the matrix tell us? Here’s a potential interpretation of the matrix:
- k-means picked up really well on the characteristics for setosa in cluster 2. Out of 50 setosa irises, k-means grouped together all 50.
- k-means had a tough time with versicolor and virginica, since they are being grouped into both clusters 1 and 3. Cluster 1 favors versicolor, and cluster 3 strongly favors virginica.
- An interesting investigation would be to try clustering the data into 2 clusters rather than 3. You could easily experiment with the centers parameter in kmeans() to see if that would work better (sketched below).
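Here's that experiment as a sketch (the same call as before with centers changed; model2 is my own name, and the confusion matrix will tell you whether 2 clusters fits better):

```{r}
# Cluster into 2 groups instead of 3 and compare against the known species
model2 <- kmeans(x = subset(iris, select = -Species), centers = 2)
table(model2$cluster, iris$Species)
```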
Does this data mining stuff work? k-means didn’t do great in this instance. Unfortunately, no algorithm will be able to cluster or classify in every case.
Using this iris dataset, k-means could be used to cluster setosa and possibly virginica. With data mining, model testing/validation is super important, but we're not going to be able to cover it in this post. Perhaps a future one…
With C5.0 and k-means under your belt, let’s tackle a tougher one…