Project | A Machine Learning Project Template in R
Editor's note: this post covers the workflow of an end-to-end machine learning project implemented in R. Work through the steps in order, iterating and refining as you go.
This post introduces a machine learning project template that you can use to carry out end-to-end machine learning projects in R.
The machine learning project template
# 1. Prepare Problem
# a) Load libraries
# b) Load dataset
# c) Split-out validation dataset
# 2. Summarize Data
# a) Descriptive statistics
# b) Data visualizations
# 3. Prepare Data
# a) Data Cleaning
# b) Feature Selection
# c) Data Transforms
# 4. Evaluate Algorithms
# a) Test options and evaluation metric
# b) Spot Check Algorithms
# c) Compare Algorithms
# 5. Improve Accuracy
# a) Algorithm Tuning
# b) Ensembles
# 6. Finalize Model
# a) Predictions on validation dataset
# b) Create standalone model on entire training dataset
# c) Save model for later use
How do you use this template?

- Create an R project.
- Following the template, create a series of ordered R scripts, one per step.
- Write each R script; a minimal sketch of this layout follows.
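As a rough sketch (the file names below are hypothetical, not prescribed by the template), a short driver script can run the per-step scripts in order:

# run_all.R -- hypothetical driver script, one sourced file per template step
scripts <- c("01_prepare_problem.R",
             "02_summarize_data.R",
             "03_prepare_data.R",
             "04_evaluate_algorithms.R",
             "05_improve_accuracy.R",
             "06_finalize_model.R")
for (script in scripts) {
  source(script)  # each file implements one section of the template
}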
A worked example
An end-to-end machine learning project built on the template above, solving a breast cancer identification problem.
The code is as follows:
# Breast cancer identification
# A binary classification problem
# 问题描述: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original)
# World-Class Results: http://www.is.umk.pl/projects/datasets.html#Wisconsin
# Load R packages
library(mlbench)
library(caret)
library(doMC)
registerDoMC(cores=8)
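# Note: the doMC package works only on Unix-like systems. On Windows a common
# alternative (an assumption, not part of the original code) is:
#   library(doParallel); registerDoParallel(cores=8)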
# Load the dataset
data(BreastCancer)
# Split out a validation dataset
set.seed(7)
validation_index <- createDataPartition(BreastCancer$Class, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- BreastCancer[-validation_index,]
# use the remaining 80% of the data to train and test the models
dataset <- BreastCancer[validation_index,]
# Summarize data
# Dimensions of the dataset (rows and columns)
dim(dataset)
# Peek at the first 20 rows
head(dataset, n=20)
# Variable types
sapply(dataset, class)
# Remove the Id column
dataset <- dataset[,-1]
# Convert the input variables from factors to numeric
for(i in 1:9) {
  dataset[,i] <- as.numeric(as.character(dataset[,i]))
}
# Statistical summary
summary(dataset)
# Class distribution
cbind(freq=table(dataset$Class), percentage=prop.table(table(dataset$Class))*100)
# Correlations between the input variables
complete_cases <- complete.cases(dataset)
cor(dataset[complete_cases,1:9])
# Histograms of the input variables
par(mfrow=c(3,3))
for(i in 1:9) {
  hist(dataset[,i], main=names(dataset)[i])
}
# Density plots of the input variables
par(mfrow=c(3,3))
complete_cases <- complete.cases(dataset)
for(i in 1:9) {
  plot(density(dataset[complete_cases,i]), main=names(dataset)[i])
}
# Box plots of the input variables
par(mfrow=c(3,3))
for(i in 1:9) {
  boxplot(dataset[,i], main=names(dataset)[i])
}
# Scatter plot matrix
jittered_x <- sapply(dataset[,1:9], jitter)
pairs(jittered_x, labels=names(dataset)[1:9], col=dataset$Class)
# Bar plots of each input variable, split by class
par(mfrow=c(3,3))
for(i in 1:9) {
  barplot(table(dataset$Class, dataset[,i]), main=names(dataset)[i], legend.text=levels(dataset$Class))
}
# Evaluate algorithms
# 10-fold cross-validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# LG
set.seed(7)
fit.glm <- train(Class~., data=dataset, method="glm", metric=metric, trControl=control)
# LDA
set.seed(7)
fit.lda <- train(Class~., data=dataset, method="lda", metric=metric, trControl=control)
# GLMNET
set.seed(7)
fit.glmnet <- train(Class~., data=dataset, method="glmnet", metric=metric, trControl=control)
# KNN
set.seed(7)
fit.knn <- train(Class~., data=dataset, method="knn", metric=metric, trControl=control)
# CART
set.seed(7)
fit.cart <- train(Class~., data=dataset, method="rpart", metric=metric, trControl=control)
# Naive Bayes
set.seed(7)
fit.nb <- train(Class~., data=dataset, method="nb", metric=metric, trControl=control)
# SVM
set.seed(7)
fit.svm <- train(Class~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(results)
dotplot(results)
# Evaluate algorithms with a Box-Cox transform
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# LG
set.seed(7)
fit.glm <- train(Class~., data=dataset, method="glm", metric=metric, preProc=c("BoxCox"), trControl=control)
# LDA
set.seed(7)
fit.lda <- train(Class~., data=dataset, method="lda", metric=metric, preProc=c("BoxCox"), trControl=control)
# GLMNET
set.seed(7)
fit.glmnet <- train(Class~., data=dataset, method="glmnet", metric=metric, preProc=c("BoxCox"), trControl=control)
# KNN
set.seed(7)
fit.knn <- train(Class~., data=dataset, method="knn", metric=metric, preProc=c("BoxCox"), trControl=control)
# CART
set.seed(7)
fit.cart <- train(Class~., data=dataset, method="rpart", metric=metric, preProc=c("BoxCox"), trControl=control)
# Naive Bayes
set.seed(7)
fit.nb <- train(Class~., data=dataset, method="nb", metric=metric, preProc=c("BoxCox"), trControl=control)
# SVM
set.seed(7)
fit.svm <- train(Class~., data=dataset, method="svmRadial", metric=metric, preProc=c("BoxCox"), trControl=control)
# Compare algorithms
transform_results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(transform_results)
dotplot(transform_results)
# Improve accuracy
# Algorithm tuning
# Tune SVM
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(.sigma=c(0.025, 0.05, 0.1, 0.15), .C=seq(1, 10, by=1))
fit.svm <- train(Class~., data=dataset, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("BoxCox"), trControl=control)
print(fit.svm)
plot(fit.svm)
# Tune kNN
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(.k=seq(1,20,by=1))
fit.knn <- train(Class~., data=dataset, method="knn", metric=metric, tuneGrid=grid, preProc=c("BoxCox"), trControl=control)
print(fit.knn)
plot(fit.knn)
# Ensembles: Boosting and Bagging
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# Bagged CART
set.seed(7)
fit.treebag <- train(Class~., data=dataset, method="treebag", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Class~., data=dataset, method="rf", metric=metric, preProc=c("BoxCox"), trControl=control)
# Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(Class~., data=dataset, method="gbm", metric=metric, preProc=c("BoxCox"), trControl=control, verbose=FALSE)
# C5.0
set.seed(7)
fit.c50 <- train(Class~., data=dataset, method="C5.0", metric=metric, preProc=c("BoxCox"), trControl=control)
# Compare results
ensemble_results <- resamples(list(BAG=fit.treebag, RF=fit.rf, GBM=fit.gbm, C50=fit.c50))
summary(ensemble_results)
dotplot(ensemble_results)
# Finalize the model
# prepare parameters for data transform
set.seed(7)
dataset_nomissing <- dataset[complete.cases(dataset),]
x <- dataset_nomissing[,1:9]
preprocessParams <- preProcess(x, method=c("BoxCox"))
x <- predict(preprocessParams, x)
# prepare the validation dataset
set.seed(7)
# remove id column
validation <- validation[,-1]
# remove missing values (not allowed in this implementation of knn)
validation <- validation[complete.cases(validation),]
# convert to numeric
for(i in 1:9) {
  validation[,i] <- as.numeric(as.character(validation[,i]))
}
# transform the validation dataset
validation_x <- predict(preprocessParams, validation[,1:9])
# make predictions
set.seed(7)
predictions <- knn3Train(x, validation_x, dataset_nomissing$Class, k=9, prob=FALSE)
# knn3Train returns the predicted labels as a character vector; convert them to
# a factor with the same levels as the reference before calling confusionMatrix()
confusionMatrix(factor(predictions, levels=levels(validation$Class)), validation$Class)
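The template's final step, saving the model for later use, is not covered by the code above. Below is a minimal sketch using base R's saveRDS()/readRDS(); the file name final_model.rds and the bundled list structure are assumptions, not part of the original tutorial:

# Bundle the preprocessing parameters and the training data the kNN model needs
final_model <- list(preprocess=preprocessParams,
                    train_x=x,
                    train_y=dataset_nomissing$Class,
                    k=9)
saveRDS(final_model, "final_model.rds")  # hypothetical file name
# Later: reload, apply the same Box-Cox transform, and predict on new data
final_model <- readRDS("final_model.rds")
new_x <- predict(final_model$preprocess, validation[,1:9])
predictions <- knn3Train(final_model$train_x, new_x, final_model$train_y,
                         k=final_model$k, prob=FALSE)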
Reference:
https://machinelearningmastery.com/machine-learning-project-template-in-r/