• 主页
  • 课程

    关于课程

    • 课程归档
    • 成为一名讲师
    • 讲师信息
    教学以及管理操作教程

    教学以及管理操作教程

    ¥1,000.00 ¥100.00
    阅读更多
  • 特色
    • 展示
    • 关于我们
    • 问答
  • 事件
  • 个性化
  • 博客
  • 联系
  • 站点资源
    有任何问题吗?
    (00) 123 456 789
    weinfoadmin@weinformatics.cn
    注册登录
    恒诺新知
    • 主页
    • 课程

      关于课程

      • 课程归档
      • 成为一名讲师
      • 讲师信息
      教学以及管理操作教程

      教学以及管理操作教程

      ¥1,000.00 ¥100.00
      阅读更多
    • 特色
      • 展示
      • 关于我们
      • 问答
    • 事件
    • 个性化
    • 博客
    • 联系
    • 站点资源

      R语言

      • 首页
      • 博客
      • R语言
      • 【RApplication】K Means Clustering in R

      【RApplication】K Means Clustering in R

      • 发布者 weinfoadmin
      • 分类 R语言
      • 日期 2016年1月5日
      测试开头

      Hello everyone, hope you had a wonderful Christmas! In this post I will show you how to do k means clustering in R. We will use the iris dataset from thedatasets library.

      What is K Means Clustering?

      K Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. Unsupervised learning means that there is no outcome to be predicted, and the algorithm just tries to find patterns in the data. In k means clustering, we have the specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster, and finds the centroid of each cluster. Then, the algorithm iterates through two steps:

      • Reassign data points to the cluster whose centroid is closest.

      • Calculate new centroid of each cluster.

      These two steps are repeated till the within cluster variation cannot be reduced any further. The within cluster variation is calculated as the sum of the euclidean distance between the data points and their respective cluster centroids.

      Exploring the data

      The iris dataset contains data about sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:

      library(datasets) head(iris)   Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1          5.1         3.5          1.4         0.2  setosa 2          4.9         3.0          1.4         0.2  setosa 3          4.7         3.2          1.3         0.2  setosa 4          4.6         3.1          1.5         0.2  setosa 5          5.0         3.6          1.4         0.2  setosa 6          5.4         3.9          1.7         0.4  setosa

      After a little bit of exploration, I found that Petal.Length andPetal.Width were similar among the same species but varied considerably between different species, as demonstrated below:

      library(ggplot2) ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()

      【RApplication】K Means Clustering in R

      Clustering

      Okay, now that we have seen the data, let us try to cluster it. Since the initial cluster assignments are random, let us set the seed to ensure reproducibility.

      set.seed(20) irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20) irisClusterK-means clustering with 3 clusters of sizes 46, 54, 50Cluster means:   Petal.Length Petal.Width1     5.626087    2.0478262     4.292593    1.3592593     1.462000    0.246000Clustering vector:   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3  [35] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2  [69] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1[103] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1[137] 1 1 2 1 1 1 1 1 1 1 1 1 1 1Within cluster sum of squares by cluster: [1] 15.16348 14.22741  2.02200  (between_SS / total_SS =  94.3 %)  Available components:  [1] "cluster"      "centers"      "totss"        "withinss"    [5] "tot.withinss" "betweenss"    "size"         "iter"        [9] "ifault"

      Since we know that there are 3 species involved, we ask the algorithm to group the data into 3 clusters, and since the starting assignments are random, we specify nstart = 20 . This means that R will try 20 different random starting assignments and then select the one with the lowest within cluster variation.

      We can see the cluster centroids, the clusters that each data point was assigned to, and the within cluster variation.

      Let us compare the clusters with the species.

      table(irisCluster$cluster, iris$Species)     setosa versicolor virginica   1      0          2        44   2      0         48         6   3     50          0         0

      As we can see, the data belonging to the setosa species got grouped into cluster 3, versicolor into cluster 2, and virginica into cluster 1. The algorithm wrongly classified two data points belonging to versicolor and six data points belonging to virginica .

      We can also plot the data to see the clusters:

      irisCluster$cluster <- as.factor(irisCluster$cluster)ggplot(iris, aes(Petal.Length, Petal.Width, color = iris$cluster)) + geom_point()

      【RApplication】K Means Clustering in R

      That brings us to the end of the article. I hope you enjoyed it! If you have any questions or feedback, feel free to leave a comment or reach out to me on wechat ID:luqin360。


      From link:http://datascienceplus.com/k-means-clustering-in-r/

      测试结尾

      请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!

      • 分享:
      作者头像
      weinfoadmin

      上一篇文章

      【R应用】R 语言企业级数据挖掘应用
      2016年1月5日

      下一篇文章

      【R应用】Logistic模型的客户流失预警分析
      2016年1月6日

      你可能也喜欢

      3-1665801675
      R语言学习:重读《R数据科学(中文版)》书籍
      28 9月, 2022
      6-1652833487
      经典铁死亡,再出新思路
      16 5月, 2022
      1-1651501980
      R语言学习:阅读《R For Everyone 》(第二版)
      1 5月, 2022

      搜索

      分类

      • R语言
      • TCGA数据挖掘
      • 单细胞RNA-seq测序
      • 在线会议直播预告与回放
      • 数据分析那些事儿分类
      • 未分类
      • 生信星球
      • 老俊俊的生信笔记

      投稿培训

      免费

      alphafold2培训

      免费

      群晖配置培训

      免费

      最新博文

      Nature | 单细胞技术揭示衰老细胞与肌肉再生
      301月2023
      lncRNA和miRNA生信分析系列讲座免费视频课和课件资源包,干货满满
      301月2023
      如何快速批量修改 Git 提交记录中的用户信息
      261月2023
      logo-eduma-the-best-lms-wordpress-theme

      (00) 123 456 789

      weinfoadmin@weinformatics.cn

      恒诺新知

      • 关于我们
      • 博客
      • 联系
      • 成为一名讲师

      链接

      • 课程
      • 事件
      • 展示
      • 问答

      支持

      • 文档
      • 论坛
      • 语言包
      • 发行状态

      推荐

      • iHub汉语代码托管
      • iLAB耗材管理
      • WooCommerce
      • 丁香园论坛

      weinformatics 即 恒诺新知。ICP备案号:粤ICP备19129767号

      • 关于我们
      • 博客
      • 联系
      • 成为一名讲师

      要成为一名讲师吗?

      加入数以千计的演讲者获得100%课时费!

      现在开始

      用你的站点账户登录

      忘记密码?

      还不是会员? 现在注册

      注册新帐户

      已经拥有注册账户? 现在登录

      close
      会员购买 你还没有登录,请先登录
      • ¥99 VIP-1个月
      • ¥199 VIP-半年
      • ¥299 VIP-1年
      在线支付 激活码

      立即支付
      支付宝
      微信支付
      请使用 支付宝 或 微信 扫码支付
      登录
      注册|忘记密码?