用R探索自行车租赁行为
自行车已经成为城市旅行中增长最快的方式之一,这就是为什么Lyft和Uber进入了两轮车的游戏中。Lyft最近收购了全球最大的自行车租赁公司,将与优步(Uber)的Jump和福特(Ford)的GoBikes展开竞争。这两家公司在旧金山分别实现了62.5万次和140万次的单车出行。
自行车租赁的增长对提供这些服务的公司和城市都提出了独特的挑战,以应对变化的规模,尤其是在预测需求方面。
下面的教程将使用R语言探索UCI机器学习库中提供的Capital Bikeshare数据来探索2011年到2012年华盛顿特区的自行车租赁情况。
加载包
1library(tidyverse)
2library(rsample) # data splitting
3library(randomForest) # basic implementation
4library(ranger) # a faster implementation of randomForest
5library(caret) # an aggregator package for performing many machine learning models
6library(ggthemes)
7library(scales)
8library(wesanderson)
9library(styler)
导入数据集
我们拥有的数据包括对天气状况(温度、湿度、风速等)的测量,租用了多少辆自行车,以及其他可能影响租赁的季节属性(即工作日和节假日)。奇怪的是,尽管周一气温较高,但骑车者更有可能在周一之后租一辆自行车,而周六气温在15至25摄氏度之间,这种可能性会增加。
这段代码将导入data文件夹中的日常自行车租赁数据。
1bike <- readr::read_csv("data/day.csv")
处理数据
下面的脚本将对数据进行处理,并为建模做好准备。阅读注释,了解为什么要采取每一步,以及这些变量是如何进入可视化和模型的。
1# WRANGLE ---------------------------------------------------------------
2# I like to be overly cautious when it comes to wrangling because all models
3# are only as good as the underlying data. This data set came with many
4# categorical variables coded numerically, so I am going to create a
5# character version of each variable (_chr) and a factor version (_fct).
6# Creating a character and factor variable will let me choose which one to
7# use for each graph and model.
8#
9#
10#
11
12# this recodes the weekday variable into a character variable
13# test
14# bike %>%
15# mutate(
16# weekday_chr =
17# case_when(
18# weekday == 0 ~ "Sunday",
19# weekday == 1 ~ "Monday",
20# weekday == 2 ~ "Tuesday",
21# weekday == 3 ~ "Wednesday",
22# weekday == 4 ~ "Thursday",
23# weekday == 5 ~ "Friday",
24# weekday == 6 ~ "Saturday",
25# TRUE ~ "other")) %>%
26# dplyr::count(weekday, weekday_chr) %>%
27# tidyr::spread(weekday, n)
28
29# assign
30bike <- bike %>%
31 mutate(
32 weekday_chr =
33 case_when(
34 weekday == 0 ~ "Sunday",
35 weekday == 1 ~ "Monday",
36 weekday == 2 ~ "Tuesday",
37 weekday == 3 ~ "Wednesday",
38 weekday == 4 ~ "Thursday",
39 weekday == 5 ~ "Friday",
40 weekday == 6 ~ "Saturday",
41 TRUE ~ "other"))
42
43# verify
44# bike %>%
45# dplyr::count(weekday, weekday_chr) %>%
46# tidyr::spread(weekday, n)
47
48# Weekdays (factor) ---
49
50# test factor variable
51# bike %>%
52# mutate(
53# weekday_fct = factor(x = weekday,
54# levels = c(0,1,2,3,4,5,6),
55# labels = c("Sunday",
56# "Monday",
57# "Tuesday",
58# "Wednesday",
59# "Thursday",
60# "Friday",
61# "Saturday"))) %>%
62# dplyr::count(weekday, weekday_fct) %>%
63# tidyr::spread(weekday, n)
64
65# assign factor variable
66bike <- bike %>%
67 mutate(
68 weekday_fct = factor(x = weekday,
69 levels = c(0,1,2,3,4,5,6),
70 labels = c("Sunday",
71 "Monday",
72 "Tuesday",
73 "Wednesday",
74 "Thursday",
75 "Friday",
76 "Saturday")))
77
78# verify factor variable
79# bike %>%
80# dplyr::count(weekday, weekday_fct) %>%
81# tidyr::spread(weekday, n)
82
83
84# Holidays ----
85# test
86# bike %>%
87# mutate(holiday_chr =
88# case_when(
89# holiday == 0 ~ "Non-Holiday",
90# holiday == 1 ~ "Holiday")) %>%
91# dplyr::count(holiday, holiday_chr) %>%
92# tidyr::spread(holiday, n)
93
94# assign
95bike <- bike %>%
96 mutate(holiday_chr =
97 case_when(
98 holiday == 0 ~ "Non-Holiday",
99 holiday == 1 ~ "Holiday"))
100
101# verify
102# bike %>%
103# dplyr::count(holiday, holiday_chr) %>%
104# tidyr::spread(holiday, n)
105
106# test
107# bike %>%
108# mutate(
109# holiday_fct = factor(x = holiday,
110# levels = c(0,1),
111# labels = c("Non-Holiday",
112# "Holiday"))) %>%
113# dplyr::count(holiday, holiday_fct) %>%
114# tidyr::spread(holiday, n)
115
116# assign
117bike <- bike %>%
118 mutate(
119 holiday_fct = factor(x = holiday,
120 levels = c(0,1),
121 labels = c("Non-Holiday",
122 "Holiday")))
123
124# # verify
125# bike %>%
126# dplyr::count(holiday_chr, holiday_fct) %>%
127# tidyr::spread(holiday_chr, n)
128
129# Working days ----
130# test
131 # bike %>%
132 # mutate(
133 # workingday_chr =
134 # case_when(
135 # workingday == 0 ~ "Non-Working Day",
136 # workingday == 1 ~ "Working Day",
137 # TRUE ~ "other")) %>%
138 # dplyr::count(workingday, workingday_chr) %>%
139 # tidyr::spread(workingday, n)
140
141# assign
142 bike <- bike %>%
143 mutate(
144 workingday_chr =
145 case_when(
146 workingday == 0 ~ "Non-Working Day",
147 workingday == 1 ~ "Working Day",
148 TRUE ~ "other"))
149
150 # verify
151 # bike %>%
152 # dplyr::count(workingday, workingday_chr) %>%
153 # tidyr::spread(workingday, n)
154
155# test
156# bike %>%
157# mutate(
158# workingday_fct = factor(x = workingday,
159# levels = c(0,1),
160# labels = c("Non-Working Day",
161# "Working Day"))) %>%
162# dplyr::count(workingday, workingday_fct) %>%
163# tidyr::spread(workingday, n)
164
165# assign
166bike <- bike %>%
167 mutate(
168 workingday_fct = factor(x = workingday,
169 levels = c(0,1),
170 labels = c("Non-Working Day",
171 "Working Day")))
172
173# verify
174# bike %>%
175# dplyr::count(workingday_chr, workingday_fct) %>%
176# tidyr::spread(workingday_chr, n)
177
178
179# Seasons
180bike <- bike %>%
181 mutate(
182 season_chr =
183 case_when(
184 season == 1 ~ "Spring",
185 season == 2 ~ "Summer",
186 season == 3 ~ "Fall",
187 season == 4 ~ "Winter",
188 TRUE ~ "other"
189 ))
190
191# test
192# bike %>%
193# mutate(
194# season_fct = factor(x = season,
195# levels = c(1, 2, 3, 4),
196# labels = c("Spring",
197# "Summer",
198# "Fall",
199# "Winter"))) %>%
200# dplyr::count(season_chr, season_fct) %>%
201# tidyr::spread(season_chr, n)
202
203# assign
204bike <- bike %>%
205 mutate(
206 season_fct = factor(x = season,
207 levels = c(1, 2, 3, 4),
208 labels = c("Spring",
209 "Summer",
210 "Fall",
211 "Winter")))
212
213# verify
214# bike %>%
215# dplyr::count(season_chr, season_fct) %>%
216# tidyr::spread(season_chr, n)
217
218
219# Weather situation ----
220# test
221# bike %>%
222# mutate(
223# weathersit_chr =
224# case_when(
225# weathersit == 1 ~ "Good",
226# weathersit == 2 ~ "Clouds/Mist",
227# weathersit == 3 ~ "Rain/Snow/Storm",
228# TRUE ~ "other")) %>%
229# dplyr::count(weathersit, weathersit_chr) %>%
230# tidyr::spread(weathersit, n)
231
232# assign
233bike <- bike %>%
234 mutate(
235 weathersit_chr =
236 case_when(
237 weathersit == 1 ~ "Good",
238 weathersit == 2 ~ "Clouds/Mist",
239 weathersit == 3 ~ "Rain/Snow/Storm"))
240
241# verify
242# bike %>%
243# dplyr::count(weathersit, weathersit_chr) %>%
244# tidyr::spread(weathersit, n)
245
246# test
247# bike %>%
248# mutate(
249# weathersit_fct = factor(x = weathersit,
250# levels = c(1, 2, 3),
251# labels = c("Good",
252# "Clouds/Mist",
253# "Rain/Snow/Storm"))) %>%
254# dplyr::count(weathersit, weathersit_fct) %>%
255# tidyr::spread(weathersit, n)
256
257# assign
258bike <- bike %>%
259 mutate(
260 weathersit_fct = factor(x = weathersit,
261 levels = c(1, 2, 3),
262 labels = c("Good",
263 "Clouds/Mist",
264 "Rain/Snow/Storm")))
265# verify
266# bike %>%
267# dplyr::count(weathersit_chr, weathersit_fct) %>%
268# tidyr::spread(weathersit_chr, n)
269
270
271# Months ----
272# huge shoutout to Thomas Mock over at RStudio for showing me
273# lubridate::month() (and stopping my case_when() obsession)
274# https://twitter.com/thomas_mock/status/1113105497480183818
275
276# test
277# bike %>%
278# mutate(month_ord =
279# lubridate::month(mnth, label = TRUE)) %>%
280# dplyr::count(month_ord, mnth) %>%
281# tidyr::spread(month_ord, n)
282
283# assign
284bike <- bike %>%
285 mutate(month_ord =
286 lubridate::month(mnth, label = TRUE))
287
288# verify
289# bike %>%
290# dplyr::count(month_ord, mnth) %>%
291# tidyr::spread(month_ord, n)
292
293
294# test
295# bike %>%
296# mutate(
297# month_fct = factor(x = mnth,
298# levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
299# labels = c("January", "February", "March", "April", "May",
300# "June", "July", "August", "September", "October",
301# "November", "December"))) %>%
302# dplyr::count(mnth, month_fct) %>%
303# tidyr::spread(month_fct, n)
304
305# assign
306bike <- bike %>%
307 mutate(
308 month_fct = factor(x = mnth,
309 levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
310 labels = c("January", "February", "March", "April", "May",
311 "June", "July", "August", "September", "October",
312 "November", "December")))
313
314# verify
315# bike %>%
316# dplyr::count(month_chr, month_fct) %>%
317# tidyr::spread(month_fct, n)
318
319# Year ----
320# test
321# bike %>%
322# mutate(
323# yr_chr =
324# case_when(
325# yr == 0 ~ "2011",
326# yr == 1 ~ "2012",
327# TRUE ~ "other")) %>%
328# dplyr::count(yr, yr_chr) %>%
329# tidyr::spread(yr, n)
330
331# assign
332bike <- bike %>%
333 mutate(
334 yr_chr =
335 case_when(
336 yr == 0 ~ "2011",
337 yr == 1 ~ "2012"))
338# verify
339# bike %>%
340# dplyr::count(yr, yr_chr) %>%
341# tidyr::spread(yr, n)
342
343# test
344# bike %>%
345# mutate(
346# yr_fct = factor(x = yr,
347# levels = c(0, 1),
348# labels = c("2011",
349# "2012"))) %>%
350# dplyr::count(yr, yr_fct) %>%
351# tidyr::spread(yr, n)
352
353# assign
354bike <- bike %>%
355 mutate(
356 yr_fct = factor(x = yr,
357 levels = c(0, 1),
358 labels = c("2011",
359 "2012")))
360# verify
361# bike %>%
362# dplyr::count(yr_chr, yr_fct) %>%
363# tidyr::spread(yr_chr, n)
364
365# normalize temperatures ----
366bike <- bike %>%
367 mutate(temp = as.integer(temp * (39 - (-8)) + (-8)))
368
369bike <- bike %>%
370 mutate(atemp = atemp * (50 - (16)) + (16))
371
372# ~ windspeed ----
373bike <- bike %>%
374 mutate(windspeed = as.integer(67 * bike$windspeed))
375
376# ~ humidity ----
377bike <- bike %>%
378 mutate(hum = as.integer(100 * bike$hum))
379
380# ~ convert to date ----
381bike <- bike %>%
382 mutate(dteday = as.Date(dteday))
383
384# check df
385# bike %>% dplyr::glimpse(78)
386
387# rename the data frame so these don't get confused
388BikeData <- bike
389
390# reorganize variables for easier inspection
391
392BikeData <- BikeData %>%
393 dplyr::select(
394 dplyr::starts_with("week"),
395 dplyr::starts_with("holi"),
396 dplyr::starts_with("seas"),
397 dplyr::starts_with("work"),
398 dplyr::starts_with("month"),
399 dplyr::starts_with("yr"),
400 dplyr::starts_with("weath"),
401 dplyr::everything())
402If you then dplyr::glimpse(BikeData) you should have the following:
403
404Observations: 731
405Variables: 30
406$ weekday <int> 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2…
407$ weekday_chr <chr> "Saturday", "Sunday", "Monday", "Tuesday", "Wednesda…
408$ weekday_fct <fct> Saturday, Sunday, Monday, Tuesday, Wednesday, Thursd…
409$ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
410$ holiday_chr <chr> "Non-Holiday", "Non-Holiday", "Non-Holiday", "Non-Ho…
411$ holiday_fct <fct> Non-Holiday, Non-Holiday, Non-Holiday, Non-Holiday, …
412$ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
413$ season_chr <chr> "Spring", "Spring", "Spring", "Spring", "Spring", "S…
414$ season_fct <fct> Spring, Spring, Spring, Spring, Spring, Spring, Spri…
415$ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1…
416$ workingday_chr <chr> "Non-Working Day", "Non-Working Day", "Working Day",…
417$ workingday_fct <fct> Non-Working Day, Non-Working Day, Working Day, Worki…
418$ month_ord <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Ja…
419$ month_fct <fct> January, January, January, January, January, January…
420$ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
421$ yr_chr <chr> "2011", "2011", "2011", "2011", "2011", "2011", "201…
422$ yr_fct <fct> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
423$ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2…
424$ weathersit_chr <chr> "Clouds/Mist", "Clouds/Mist", "Good", "Good", "Good"…
425$ weathersit_fct <fct> Clouds/Mist, Clouds/Mist, Good, Good, Good, Good, Cl…
426$ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
427$ dteday <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 201…
428$ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
429$ temp <int> 8, 9, 1, 1, 2, 1, 1, 0, -1, 0, 0, 0, 0, 0, 2, 2, 0, …
430$ atemp <dbl> 28.36325, 28.02713, 22.43977, 23.21215, 23.79518, 23…
431$ hum <int> 80, 69, 43, 59, 43, 51, 49, 53, 43, 48, 68, 59, 47, …
432$ windspeed <int> 10, 16, 16, 10, 12, 6, 11, 17, 24, 14, 8, 20, 20, 8,…
433$ casual <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25,…
434$ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 12…
435$ cnt <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 13…
探索性数据分析
这里有三个选项可以将自行车表汇总为统计信息,这将使我们更好地理解BikeData数据框中每个变量的分布。首先,我们将使用dplyr包。
1BikeDplyrSummary <- BikeData %>%
2 select(temp, atemp, hum, windspeed, casual, registered, cnt) %>%
3 summarise_each(list(
4 min = ~min,
5 q25 = ~quantile(., 0.25),
6 median = ~median,
7 q75 = ~quantile(., 0.75),
8 max = ~max,
9 mean = ~mean,
10 sd = ~sd
11 )) %>%
12 gather(stat, val) %>%
13 separate(stat,
14 into = c("var", "stat"),
15 sep = "_") %>%
16 spread(stat, val) %>%
17 select(var, min, q25, median, q75, max, mean, sd)
18
19knitr::kable(BikeDplyrSummary)

接下来,我们将使用skimr包(可以在这里找到)。
1BikeSkimrSummary <- bike %>%
2 skimr::skim_to_wide() %>%
3 dplyr::select(type,
4 variable,
5 missing,
6 complete,
7 min,
8 max,
9 mean,
10 sd,
11 median = p50,
12 hist)
13knitr::kable(BikeSkimrSummary)
最后,我们将使用mosaic包(可以在这里找到)。
1BikeMosaicInspect <- mosaic::inspect(BikeData)
2# categorical
3knitr::kable(BikeMosaicInspect$categorical)
1# date
2knitr::kable(BikeMosaicInspect$Date)

1# quantitative
2knitr::kable(BikeMosaicInspect$quantitative)
探讨天气状况对自行车租赁的影响
与拼车乘客不同,骑自行车的人易受天气条件的影响,这可能会影响他们选择自行车而不是其他交通工具的可能性。如果天气状况对交通决策有影响,我们希望看到自行车租赁数量与天气特征(包括温度、湿度和风速)之间的关系。让我们探索。
1# ~ rentals by temperature ----
2ggRentalsByTemp <- BikeData %>%
3 ggplot(aes(y = cnt,
4 x = temp,
5 color = weekday_fct)) +
6 geom_point(show.legend = FALSE) +
7 geom_smooth(se = FALSE,
8 show.legend = FALSE) +
9 facet_grid(~weekday_fct) +
10 scale_color_brewer(palette = "Dark2") +
11 theme_fivethirtyeight() +
12 theme(axis.title = element_text()) +
13 ylab("Bike Rentals") +
14 xlab("Temperature (°C)") +
15 ggtitle("Bike Rental Volume By Temperature")
16ggRentalsByTemp
下面的输出是R告诉我们如何通过每组数据绘制最佳拟合线的方法。这符合 LOESS局部多项式回归的数据。
1`geom_smooth()` using method = 'loess' and formula 'y ~ x'
我们还预计,大风天气将对自行车租赁产生负面影响。让我们分析数据。
1# ggRentalVolByWindSpeed ----
2ggRentalVolByWindSpeed <- ggplot(bike) +
3 geom_point(aes(y = cnt,
4 x = windspeed,
5 color = weekday_fct),
6 show.legend = FALSE) +
7 facet_grid(~weekday_fct) +
8 scale_color_brewer(palette = "Dark2") +
9 theme_fivethirtyeight() +
10 theme(axis.title = element_text()) +
11 ylab("Bike Rentals") +
12 xlab("Windspeed") +
13 ggtitle("Rental Volume By Windspeed")
14ggRentalVolByWindSpeed
15
探索假期对自行车租赁量的影响
假期可能会以不同的方式影响骑自行车的人。例如,我们可以把假期看作是自行车手享受少开车上路的更多机会,因为通常很少人在假期开车。我们也可以考虑这样一种情况,自行车爱好者只喜欢在夏季或春季假期骑自行车(考虑到我们了解到的天气条件对自行车租赁的影响)。
1ggRentalVolByHoliday <- ggplot(BikeData) +
2 geom_density(aes(x = cnt,
3 fill = holiday_chr),
4 alpha = 0.2) +
5 scale_fill_brewer(palette = "Paired") +
6
7 theme_fivethirtyeight() +
8 theme(axis.title = element_text()) +
9 labs(title = "Bike Rental Density By Holiday",
10 fill = "Holiday",
11 x = "Average Bike Rentals",
12 y = "Density")
13
14ggRentalVolByHoliday
15
随着Lyft和优步今年都在筹备ipo,每家公司的总乘车量分别达到10亿次和100亿次,全国城市的交通中断速度似乎正在加快。这种变化将如何影响自行车等更环保的交通方式,在一定程度上取决于企业和城市根据适应需求进行规划和执行的能力。学会预测这种转变对大多数相关人员来说都是一项重要的技能。
本文的GitHub,请点击下面链接:
https://github.com/mjfrigaard/storybenchR/tree/master/12.3-bike-rentals-example
阅读完本文后,请继续阅读【如何用R建模GBM】一文。
原文链接:
http://www.storybench.org/exploring-bike-rental-behavior-using-r/
数据人才(ID:datarencai)
(一个帮助数据人才找工作的公众号,
也分享数据人才学习和生活的有趣事情。)
内容推荐
请关注“恒诺新知”微信公众号,感谢“R语言“,”数据那些事儿“,”老俊俊的生信笔记“,”冷🈚️思“,“珞珈R”,“生信星球”的支持!