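Handling Missing Values with the mice Package | R Packages
1. Data preparation
> Z1 <- read.table('clipboard', header=TRUE) # read the data from the clipboard
> head(Z1) # show the first six rows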
Age Gender Cholesterol SystolicBP BMI Smoking Education
1 67.9 Female 236.4 129.8 26.4 Yes High
2 54.8 Female 126.0 133.4 28.4 No Medium
3 68.4 Male 198.7 158.5 24.1 Yes High
4 67.9 Male 205.0 136.0 19.9 No Low
5 60.9 Male 207.7 145.4 26.7 No Medium
6 68.4 Female 222.5 130.6 30.6 No Low
> original <- Z1 # keep a copy of the source data to check imputation accuracy later
The source data contains no missing values, so we add some artificially to have something to impute.
> set.seed(10)
> Z1[sample(1:nrow(Z1), 20), "Cholesterol"] <- NA # set randomly chosen rows to NA
> Z1[sample(1:nrow(Z1), 20), "Smoking"] <- NA
> Z1[sample(1:nrow(Z1), 20), "Education"] <- NA
> Z1[sample(1:nrow(Z1), 5), "Age"] <- NA
> Z1[sample(1:nrow(Z1), 5), "BMI"] <- NA
> sapply(Z1, function(x) sum(is.na(x))) # count the missing values in each column of Z1
Age Gender Cholesterol SystolicBP BMI
5 0 20 0 5
Smoking Education
20 20
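The masking step can be tried on any data frame. The following is a minimal sketch on made-up data (the data frame `toy` and its columns are hypothetical, not the article's data set): it blanks out a random subset of rows in one column and counts the missing values per column.

```r
# Toy data frame; the names and values are illustrative only
set.seed(10)
toy <- data.frame(
  age  = round(rnorm(24, mean = 60, sd = 8), 1),
  chol = round(rnorm(24, mean = 200, sd = 30), 1)
)

# Pick 5 distinct rows at random and set one column to NA there
rows_to_mask <- sample(seq_len(nrow(toy)), 5)
toy[rows_to_mask, "chol"] <- NA

# Count missing values per column (same result as the sapply() call above)
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Because `sample()` draws without replacement by default, exactly 5 cells of `chol` become NA.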
> library(dplyr)
> Z1 <- Z1 %>%
+ mutate(Smoking = as.factor(Smoking)) %>% # convert the categorical variables to factors
+ mutate(Education = as.factor(Education)) %>%
+ mutate(Cholesterol = as.numeric(Cholesterol)) # make sure Cholesterol is numeric
> str(Z1)
'data.frame': 24 obs. of 7 variables:
$ Age : num NA 54.8 NA 67.9 60.9 68.4 67.9 60.9 NA 62.9 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 1 1 1 2 2 ...
$ Cholesterol: num NA NA NA NA NA ...
$ SystolicBP : num 130 133 158 136 145 ...
$ BMI : num 26.4 NA 24.1 19.9 NA 30.6 26.4 28.4 24.1 NA ...
$ Smoking : Factor w/ 2 levels "No","Yes": 2 NA NA 1 NA NA NA NA NA NA ...
$ Education : Factor w/ 3 levels "High","Low","Medium": NA NA NA NA NA NA NA NA NA 2 ...
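2. Imputation with mice
Traditionally, missing values are mostly filled in with the median or the mean of the variable. The mice package instead uses the other variables in the data: guided by a predictor matrix, it iteratively fits a model for each incomplete variable and draws values for the missing entries from that model (multiple imputation). Imputed values obtained this way are generally more accurate than a single mean or median fill.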
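For comparison, the traditional mean or median fill mentioned above can be sketched in base R. The vector `x` is a hypothetical example, not taken from the article's data.

```r
# Hypothetical numeric vector with two missing entries
x <- c(10, 12, NA, 14, NA, 18)

# Mean imputation: replace every NA with the mean of the observed values
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

# Median imputation works the same way
x_median <- x
x_median[is.na(x_median)] <- median(x, na.rm = TRUE)

# The filled-in vector has a smaller variance than the observed values,
# one reason a single-value fill can distort later analyses
var_observed <- var(x, na.rm = TRUE)
var_filled <- var(x_mean)
print(c(observed = var_observed, mean_filled = var_filled))
```

Every missing entry receives the same value, so the fill ignores whatever the other variables say about that row. That is the gap model-based imputation tries to close.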
> library(mice)
> library(Rcpp)
> init = mice(Z1, maxit=0)
> init
Multiply imputed data set
Call:
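mice(data = Z1, maxit = 0) # the dry run prints the imputation method for each variable, the predictor matrix, etc.
Number of multiple imputations: 5 # m = 5 imputed data sets by default (this is not the number of iterations)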
Missing cells per column:
Age Gender Cholesterol SystolicBP BMI
5 0 20 0 5
Smoking Education
20 20
Imputation methods:
Age Gender Cholesterol SystolicBP BMI
"pmm" "" "pmm" "" "pmm"
Smoking Education
"logreg" "polyreg"
VisitSequence:
Age Cholesterol BMI Education
1 3 5 7
PredictorMatrix: # rows are the variables to impute, columns the predictors used for them
Age Gender Cholesterol SystolicBP BMI Smoking Education
Age 0 1 1 1 1 0 1
Gender 0 0 0 0 0 0 0
Cholesterol 1 1 0 1 1 0 1
SystolicBP 0 0 0 0 0 0 0
BMI 1 1 1 1 0 0 1
Smoking 0 0 0 0 0 0 0
Education 1 1 1 1 1 0 0
Random generator seed value: NA
> meth = init$method
> predM = init$predictorMatrix
> predM[, c("BMI")] <- 0 # set the BMI column to 0 so that BMI is not used as a predictor
> meth[c("Age")] <- "" # an empty method means Age is not imputed
> predM
Age Gender Cholesterol SystolicBP BMI Smoking Education
Age 0 1 1 1 0 0 1
Gender 0 0 0 0 0 0 0
Cholesterol 1 1 0 1 0 0 1
SystolicBP 0 0 0 0 0 0 0
BMI 1 1 1 1 0 0 1
Smoking 0 0 0 0 0 0 0
Education 1 1 1 1 0 0 0
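How to read such a matrix can be shown with a small hand-built example. This sketch uses only base R. The three variable names are borrowed from the article, but the matrix `pm` is constructed by hand for illustration and is not produced by mice.

```r
# Hand-built matrix that mimics the shape of a predictor matrix
vars <- c("Age", "Cholesterol", "BMI")
pm <- matrix(1, nrow = 3, ncol = 3, dimnames = list(vars, vars))
diag(pm) <- 0                 # a variable never predicts itself

# Zero the BMI column: BMI is no longer a predictor for any target
pm[, "BMI"] <- 0

# For each target (row), list the predictors (columns) that are switched on
predictors_for <- lapply(vars, function(v) colnames(pm)[pm[v, ] == 1])
names(predictors_for) <- vars
print(predictors_for)
```

Zeroing a column removes a variable as a predictor, while zeroing a row would leave that variable without predictors of its own. The BMI row is untouched here, so BMI can still be imputed from Age and Cholesterol.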
> meth
Age Gender Cholesterol SystolicBP BMI
"" "" "pmm" "" "pmm"
Smoking Education
"logreg" "polyreg"
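> meth[c("Cholesterol")] <- "norm" # Bayesian linear regression for the continuous variable
> meth[c("Smoking")] <- "logreg" # logistic regression for the two-level factor
> meth[c("Education")] <- "polyreg" # polytomous regression for the three-level factor
> set.seed(103)
> imputed <- mice(Z1, method=meth, predictorMatrix=predM, m=5)
> imputed <- complete(imputed) # extract one completed data set (by default the first of the m imputations)
> head(imputed)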
Age Gender Cholesterol SystolicBP BMI Smoking Education
1 NA Female 225.7175 129.8 26.4 Yes Medium
2 54.8 Female 205.3551 133.4 30.6 No Medium
3 NA Male 130.8879 158.5 24.1 No Medium
4 67.9 Male 266.9956 136.0 19.9 No Low
5 60.9 Male 208.1604 145.4 19.9 No Low
6 68.4 Female 222.5000 130.6 30.6 No Low
> sapply(imputed, function(x) sum(is.na(x))) # check the missing values after imputation
Age Gender Cholesterol SystolicBP BMI
5 0 0 0 0
Smoking Education
0 0
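Note that complete() hands back a single completed data set, although mice generated m = 5 of them. The sketch below shows how the other imputations can be reached. It assumes the mice package is installed and uses `nhanes`, a small example data set shipped with mice, rather than the article's data.

```r
library(mice)

# nhanes is a small example data set shipped with mice (it contains NAs)
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 103)

# complete() returns the first imputed data set unless told otherwise
first <- complete(imp)          # same as complete(imp, 1)
third <- complete(imp, 3)       # the third of the five imputations

# Stack all five imputations; the .imp column identifies the imputation
long <- complete(imp, action = "long")
print(table(long$.imp))
```

Comparing `first` and `third` cell by cell gives a rough feel for how much the imputed values vary from one imputation to the next.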
> # Cholesterol
> actual <- original$Cholesterol[is.na(Z1$Cholesterol)]
> predicted <- imputed$Cholesterol[is.na(Z1$Cholesterol)]
> mean(actual) # mean of the true values that had been set to missing
[1] 216.665
> mean(predicted) # mean of the imputed values
[1] 211.977
> # Smoking
> actual <- original$Smoking[is.na(Z1$Smoking)]
> predicted <- imputed$Smoking[is.na(Z1$Smoking)]
> table(actual) # distribution of the true values that had been set to missing
actual
No Yes
13 7
> table(predicted) # distribution of the imputed values
predicted
No Yes
19 1
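Means and frequency tables only compare distributions. Since the true values are kept in `original`, a cell-by-cell comparison is also possible. The following is a minimal sketch with hypothetical vectors standing in for the masked cells; the helpers `rmse` and `hit_rate` are illustrative names, not mice functions.

```r
# Root mean squared error for a numeric variable
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Share of categorical values that were imputed correctly
hit_rate <- function(actual, predicted) {
  mean(as.character(actual) == as.character(predicted))
}

# Hypothetical values standing in for the masked cells
chol_actual     <- c(200, 220, 180, 240)
chol_predicted  <- c(210, 215, 190, 230)
smoke_actual    <- factor(c("No", "Yes", "No", "Yes"))
smoke_predicted <- factor(c("No", "No", "No", "Yes"))

print(rmse(chol_actual, chol_predicted))
print(hit_rate(smoke_actual, smoke_predicted))

# Cross-tabulation shows which categories get confused
print(table(actual = smoke_actual, predicted = smoke_predicted))
```

Applied to the article's objects, the same helpers would take the `actual` and `predicted` vectors computed above as arguments.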
Original article: http://datascienceplus.com/handling-missing-data-with-mice-package-a-simple-approach/
Do not modify. Reposting is allowed; please credit the author, 数据人网, and the original article link.