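The TIOBE Programming Community Index is an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers worldwide, courses, and third-party vendors. Popular websites such as Google, Bing, Yahoo, Wikipedia, Amazon, YouTube, and Baidu are used to calculate the ratings. Note that the TIOBE index is not about the best programming language, nor about the language in which the most lines of code have been written.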
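In the TIOBE index, R has long hovered between 10th and 20th place. We can scrape the index data from the TIOBE web page, tidy it into a data frame, and plot it as a line chart. This takes the following steps:
- Scrape the static web page: rvest.
- Clean the data: tidyverse.
- Plot: ggplot2.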
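1 Web scraping in R: parsing tabular data from web pages
1.1 Example 1: Scraping the 2022 US holidays
The URL is https://www.officeholidays.com/countries/usa/2022. The page contains a table listing the name and date of each holiday.
Inspecting the page source shows that the page is standard, well-formed HTML: the table is stored in a table tag, so it can be read directly with the html_table() function.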
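library(tidyverse) # provides %>%, map(), stringr, tidyr and ggplot2 for the later steps
library(rvest)
read_html("https://www.officeholidays.com/countries/usa/2022") %>%
  html_node("table") %>%
  html_table()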
# A tibble: 21 × 5
Day Date `Holiday Name` Type Comments
<chr> <chr> <chr> <chr> <chr>
1 Saturday Jan 01 New Year's Day Federal Holiday ""
2 Monday Jan 03 New Year's Day (in lieu) Regional Holiday ""
3 Monday Jan 17 Martin Luther King Jr. Day Federal Holiday "3rd Monday …
4 Monday Feb 21 President's Day Federal Holiday "3rd Monday …
5 Sunday Apr 17 Easter Sunday Not A Public Holiday ""
6 Sunday May 08 Mother's Day Not A Public Holiday "2nd Sunday …
7 Monday May 30 Memorial Day Federal Holiday "Last Monday…
8 Friday Jun 17 Juneteenth Federal Holiday "Emancipatio…
9 Sunday Jun 19 Juneteenth Federal Holiday "Emancipatio…
10 Sunday Jun 19 Father's Day Not A Public Holiday "3rd Sunday …
# ℹ 11 more rows
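To try html_table() without a network connection, the following is a minimal sketch that parses a small hand-written table. The table contents and the names page and holidays are made up for illustration; only the function calls mirror the example above.

```r
library(rvest)

# A tiny hand-written page standing in for the real one (assumed content).
page <- minimal_html('
  <table>
    <tr><th>Day</th><th>Date</th><th>Holiday Name</th></tr>
    <tr><td>Monday</td><td>Jan 17</td><td>Martin Luther King Jr. Day</td></tr>
    <tr><td>Monday</td><td>May 30</td><td>Memorial Day</td></tr>
  </table>
')

# html_element() picks the first table node; html_table() converts it to a tibble,
# using the row of th cells as column names.
holidays <- page %>%
  html_element("table") %>%
  html_table()

holidays
```

Because the header row is built from th cells, the column names come out as Day, Date and Holiday Name, the same pattern as in the real page.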
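1.2 Example 2: Scraping a food menu from XML
The URL is https://www.w3schools.com/xml/simple.xml. This document does not use the standard table tag structure.
In this case we fall back on the usual scraping approach: locate the specific tags, then extract their contents in batch.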
# First, extract a single tag
html <- read_html("https://www.w3schools.com/xml/simple.xml")
html %>%
  html_nodes("name") %>%
  html_text2()
[1] "Belgian Waffles" "Strawberry Belgian Waffles"
[3] "Berry-Berry Belgian Waffles" "French Toast"
[5] "Homestyle Breakfast"
# Use map() to extract all the required tags in batch, then combine them into a data frame
vars <- c("name", "price", "description", "calories")
map(
  vars,
  ~ html_nodes(html, .x) %>%
    html_text2()
) %>%
  set_names(vars) %>%
  as_tibble()
# A tibble: 5 × 4
name price description calories
<chr> <chr> <chr> <chr>
1 Belgian Waffles $5.95 Two of our famous Belgian Waffles … 650
2 Strawberry Belgian Waffles $7.95 Light Belgian waffles covered with… 900
3 Berry-Berry Belgian Waffles $8.95 Light Belgian waffles covered with… 900
4 French Toast $4.50 Thick slices made from our homemad… 600
5 Homestyle Breakfast $6.95 Two eggs, bacon or sausage, toast,… 950
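All four columns come back as character. If numeric values are needed for later calculations, one possible follow-up is sketched below on a hand-typed sample with the same column types. The names menu and menu_num are hypothetical, and parse_number() from readr is just one of several ways to drop the dollar sign.

```r
library(tibble)
library(dplyr)
library(readr)

# Hand-typed sample with the same column types as the scraped result.
menu <- tibble(
  name     = c("Belgian Waffles", "French Toast"),
  price    = c("$5.95", "$4.50"),
  calories = c("650", "600")
)

menu_num <- menu %>%
  mutate(
    price    = parse_number(price),   # drops the "$" and parses the number
    calories = as.integer(calories)   # plain digits convert directly
  )

menu_num
```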
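2 The TIOBE index for R: extracting from a single web page
2.1 Scraping the data
The TIOBE site is a static page, so we parse it with rvest. Normally the SelectorGadget extension is the recommended way to locate the node that holds the data, but it cannot locate it on this page, so we have to find it by hand.
Viewing the page source, the data is not hard to find: it is a very long series sitting inside a script tag. Note that the months need 1 added to line up with the calendar, because JavaScript's Date.UTC() numbers months from 0.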
url <- "https://www.tiobe.com/tiobe-index/r/"
dats <- read_html(url) %>% # parse the page
  html_node("script") %>% # locate the node
  html_text2() # get the text content
dats
[1] "window.dataLayer = window.dataLayer || []; function gtag() { dataLayer.push(arguments); } gtag(\"consent\", \"default\", { ad_user_data: \"denied\", ad_personalization: \"denied\", ad_storage: \"denied\", analytics_storage: \"denied\", functionality_storage: \"denied\", personalization_storage: \"denied\", security_storage: \"granted\", wait_for_update: 500, }); gtag(\"set\", \"ads_data_redaction\", true);"
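The result does not seem to be what we expected: html_node() returned the first script node on the page, which is not the one holding the data. Going back to the page source and looking one level further out, the node we want is article.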
dats <- read_html(url) %>% # parse the page
  html_node("article") %>% # locate the node
  html_text2() # get the text content
dats
[1] "Home » TIOBE Index\n\n$(function () { $('#container').highcharts({ credits: { enabled: false }, chart: { type: 'spline' }, plotOptions: { spline: { lineWidth: 4, states: { hover: { lineWidth: 5 } }, marker: { enabled: false } } }, title: { text: 'TIOBE Index for R', x: -20, useHTML: true }, subtitle: { text: 'Source: www.tiobe.com', x: -20, useHTML: true }, xAxis: { type: 'datetime', dateTimeLabelFormats: { year: '%Y' } }, yAxis: { title: { text: 'Ratings (%)' }, plotLines: [{ value: 0, width: 1, color: '#808080' }] }, tooltip: { valueSuffix: '%', dateTimeLabelFormats: { week: \"%B %Y\" } }, legend: { enabled: false }, series: [ {name : 'R',data : [[Date.UTC(2007, 4, 5), 0.06], [Date.UTC(2007, 5, 2), 0.07], [Date.UTC(2007, 6, 2), 0.07], [Date.UTC(2007, 7, 5), 0.06], [Date.UTC(2007, 8, 2), 0.07], [Date.UTC(2007, 9, 4), 0.07], [Date.UTC(2007, 10, 4), 0.09], [Date.UTC(2007, 11, 3), 0.09], [Date.UTC(2008, 0, 3), 0.04], [Date.UTC(2008, 1, 7), 0.06], [Date.UTC(2008, 5, 1), 0.09], [Date.UTC(2008, 6, 2), 0.08], [Date.UTC(2008, 7, 3), 0.06], [Date.UTC(2008, 8, 3), 0.07], [Date.UTC(2008, 9, 6), 0.07], [Date.UTC(2008, 10, 2), 0.08], [Date.UTC(2008, 11, 3), 0.07], [Date.UTC(2009, 0, 2), 0.07], [Date.UTC(2009, 1, 1), 0.08], [Date.UTC(2009, 2, 5), 0.08], [Date.UTC(2009, 3, 7), 0.09], [Date.UTC(2009, 4, 1), 0.10], [Date.UTC(2009, 5, 4), 0.10], [Date.UTC(2009, 6, 2), 0.10], [Date.UTC(2009, 7, 1), 0.11], [Date.UTC(2009, 8, 5), 0.09], [Date.UTC(2009, 9, 2), 0.10], [Date.UTC(2009, 10, 2), 0.10], [Date.UTC(2009, 11, 2), 0.11], [Date.UTC(2010, 0, 5), 0.10], [Date.UTC(2010, 1, 7), 0.14], [Date.UTC(2010, 2, 7), 0.15], [Date.UTC(2010, 3, 5), 0.14], [Date.UTC(2010, 4, 15), 0.20], [Date.UTC(2010, 6, 6), 0.17], [Date.UTC(2010, 6, 30), 0.18], [Date.UTC(2010, 8, 11), 0.33], [Date.UTC(2010, 9, 2), 0.36], [Date.UTC(2010, 10, 3), 0.36], [Date.UTC(2010, 11, 7), 0.54], [Date.UTC(2011, 0, 2), 0.54], [Date.UTC(2011, 1, 8), 0.56], [Date.UTC(2011, 2, 8), 0.49], [Date.UTC(2011, 3, 3), 0.42], 
[Date.UTC(2011, 4, 2), 0.43], [Date.UTC(2011, 5, 5), 0.37], [Date.UTC(2011, 5, 27), 0.37], [Date.UTC(2011, 6, 8), 0.36], [Date.UTC(2011, 7, 3), 0.40], [Date.UTC(2011, 8, 10), 0.39], [Date.UTC(2011, 9, 9), 0.42], [Date.UTC(2011, 10, 7), 0.50], [Date.UTC(2011, 11, 4), 0.52], [Date.UTC(2012, 0, 8), 0.60], [Date.UTC(2012, 1, 5), 0.62], [Date.UTC(2012, 2, 11), 0.50], [Date.UTC(2012, 3, 8), 0.38], [Date.UTC(2012, 4, 9), 0.38], [Date.UTC(2012, 5, 10), 0.44], [Date.UTC(2012, 6, 4), 0.44], [Date.UTC(2012, 7, 10), 0.43], [Date.UTC(2012, 8, 2), 0.44], [Date.UTC(2012, 9, 5), 0.42], [Date.UTC(2012, 10, 4), 0.42], [Date.UTC(2012, 11, 2), 0.45], [Date.UTC(2013, 0, 5), 0.44], [Date.UTC(2013, 1, 8), 0.46], [Date.UTC(2013, 2, 11), 0.53], [Date.UTC(2013, 3, 7), 0.48], [Date.UTC(2013, 4, 8), 0.54], [Date.UTC(2013, 5, 9), 0.48], [Date.UTC(2013, 6, 7), 0.51], [Date.UTC(2013, 6, 12), 0.51], [Date.UTC(2013, 7, 4), 0.39], [Date.UTC(2013, 8, 11), 0.65], [Date.UTC(2013, 9, 10), 0.55], [Date.UTC(2013, 10, 9), 0.41], [Date.UTC(2013, 11, 6), 0.25], [Date.UTC(2014, 0, 1), 0.25], [Date.UTC(2014, 1, 8), 0.25], [Date.UTC(2014, 2, 3), 0.23], [Date.UTC(2014, 3, 10), 0.26], [Date.UTC(2014, 4, 7), 0.38], [Date.UTC(2014, 5, 8), 0.67], [Date.UTC(2014, 6, 6), 0.41], [Date.UTC(2014, 7, 11), 0.52], [Date.UTC(2014, 8, 1), 0.80], [Date.UTC(2014, 9, 3), 1.52], [Date.UTC(2014, 10, 8), 1.55], [Date.UTC(2014, 11, 7), 1.63], [Date.UTC(2015, 0, 6), 1.04], [Date.UTC(2015, 1, 5), 0.96], [Date.UTC(2015, 2, 7), 0.95], [Date.UTC(2015, 3, 13), 1.03], [Date.UTC(2015, 4, 13), 1.44], [Date.UTC(2015, 5, 6), 1.52], [Date.UTC(2015, 6, 12), 1.23], [Date.UTC(2015, 7, 6), 1.01], [Date.UTC(2015, 8, 5), 1.04], [Date.UTC(2015, 9, 4), 0.99], [Date.UTC(2015, 10, 7), 1.01], [Date.UTC(2015, 11, 4), 1.12], [Date.UTC(2016, 0, 2), 1.05], [Date.UTC(2016, 1, 2), 1.19], [Date.UTC(2016, 2, 3), 1.29], [Date.UTC(2016, 3, 7), 1.27], [Date.UTC(2016, 4, 6), 1.33], [Date.UTC(2016, 5, 5), 1.54], [Date.UTC(2016, 6, 4), 1.51], [Date.UTC(2016, 7, 6), 
1.61], [Date.UTC(2016, 8, 8), 1.68], [Date.UTC(2016, 9, 7), 1.74], [Date.UTC(2016, 10, 5), 1.72], [Date.UTC(2016, 11, 4), 1.83], [Date.UTC(2017, 0, 7), 1.79], [Date.UTC(2017, 1, 8), 1.92], [Date.UTC(2017, 2, 7), 2.02], [Date.UTC(2017, 3, 9), 2.14], [Date.UTC(2017, 4, 6), 2.19], [Date.UTC(2017, 5, 3), 2.15], [Date.UTC(2017, 6, 7), 2.11], [Date.UTC(2017, 7, 2), 1.77], [Date.UTC(2017, 8, 6), 1.82], [Date.UTC(2017, 9, 5), 1.68], [Date.UTC(2017, 10, 12), 1.60], [Date.UTC(2017, 11, 9), 1.91], [Date.UTC(2018, 0, 3), 2.55], [Date.UTC(2018, 1, 8), 2.09], [Date.UTC(2018, 2, 7), 1.13], [Date.UTC(2018, 3, 1), 1.81], [Date.UTC(2018, 4, 6), 1.18], [Date.UTC(2018, 5, 10), 1.45], [Date.UTC(2018, 6, 7), 1.15], [Date.UTC(2018, 7, 1), 0.96], [Date.UTC(2018, 8, 3), 1.02], [Date.UTC(2018, 9, 5), 1.21], [Date.UTC(2018, 10, 8), 1.41], [Date.UTC(2018, 11, 2), 1.11], [Date.UTC(2019, 0, 4), 1.33], [Date.UTC(2019, 1, 6), 1.04], [Date.UTC(2019, 2, 2), 1.28], [Date.UTC(2019, 3, 7), 1.18], [Date.UTC(2019, 4, 4), 0.95], [Date.UTC(2019, 5, 9), 0.91], [Date.UTC(2019, 6, 6), 0.84], [Date.UTC(2019, 7, 5), 0.82], [Date.UTC(2019, 8, 9), 1.05], [Date.UTC(2019, 9, 5), 1.26], [Date.UTC(2019, 10, 3), 0.98], [Date.UTC(2019, 11, 6), 0.99], [Date.UTC(2020, 0, 5), 0.81], [Date.UTC(2020, 1, 4), 1.01], [Date.UTC(2020, 2, 4), 1.26], [Date.UTC(2020, 3, 2), 1.54], [Date.UTC(2020, 4, 2), 1.85], [Date.UTC(2020, 5, 1), 2.19], [Date.UTC(2020, 6, 4), 2.41], [Date.UTC(2020, 7, 2), 2.79], [Date.UTC(2020, 8, 6), 2.37], [Date.UTC(2020, 9, 4), 1.99], [Date.UTC(2020, 10, 3), 1.64], [Date.UTC(2020, 11, 3), 1.60], [Date.UTC(2021, 0, 2), 1.90], [Date.UTC(2021, 1, 6), 1.56], [Date.UTC(2021, 2, 4), 1.25], [Date.UTC(2021, 3, 4), 1.12], [Date.UTC(2021, 4, 2), 1.38], [Date.UTC(2021, 5, 5), 1.20], [Date.UTC(2021, 6, 4), 1.33], [Date.UTC(2021, 7, 3), 1.05], [Date.UTC(2021, 8, 11), 0.98], [Date.UTC(2021, 9, 6), 1.20], [Date.UTC(2021, 10, 6), 1.28], [Date.UTC(2021, 11, 5), 1.58], [Date.UTC(2022, 0, 1), 1.25], [Date.UTC(2022, 1, 2), 
1.11], [Date.UTC(2022, 2, 2), 1.37], [Date.UTC(2022, 3, 5), 1.55], [Date.UTC(2022, 4, 3), 1.22], [Date.UTC(2022, 5, 4), 0.98], [Date.UTC(2022, 6, 2), 0.76], [Date.UTC(2022, 7, 2), 0.90], [Date.UTC(2022, 8, 1), 0.95], [Date.UTC(2022, 9, 1), 1.22], [Date.UTC(2022, 10, 1), 1.14], [Date.UTC(2022, 11, 2), 1.25], [Date.UTC(2022, 11, 29), 1.04], [Date.UTC(2023, 1, 1), 1.08], [Date.UTC(2023, 2, 2), 0.93], [Date.UTC(2023, 3, 1), 0.76], [Date.UTC(2023, 4, 2), 0.82], [Date.UTC(2023, 5, 2), 0.94], [Date.UTC(2023, 6, 2), 0.87], [Date.UTC(2023, 7, 4), 0.92], [Date.UTC(2023, 8, 2), 0.97], [Date.UTC(2023, 9, 4), 0.96], [Date.UTC(2023, 10, 2), 0.93], [Date.UTC(2023, 11, 4), 0.72], [Date.UTC(2024, 0, 2), 0.74], [Date.UTC(2024, 1, 2), 0.99], [Date.UTC(2024, 2, 1), 0.81], [Date.UTC(2024, 3, 3), 0.84], [Date.UTC(2024, 4, 1), 0.75], [Date.UTC(2024, 5, 1), 0.96], [Date.UTC(2024, 6, 3), 0.83], [Date.UTC(2024, 7, 1), 1.11], [Date.UTC(2024, 8, 1), 1.20], [Date.UTC(2024, 9, 2), 1.09]]} ] }); });\nThe R Programming Language\nSome information about R:\n\nHighest Position (since 2007): #8 in Aug 2020\n\nLowest Position (since 2007): #73 in Dec 2008"
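This time the result contains the data, along with some information we do not need. Everything we need sits between [[ and ]], so a regular expression is the natural way to extract it.
# Extract with zero-width assertions (a lookbehind and a lookahead)
dats <- dats %>%
  str_extract("(?<=\\[\\[).*(?=\\]\\])")
dats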
[1] "Date.UTC(2007, 4, 5), 0.06], [Date.UTC(2007, 5, 2), 0.07], [Date.UTC(2007, 6, 2), 0.07], [Date.UTC(2007, 7, 5), 0.06], [Date.UTC(2007, 8, 2), 0.07], [Date.UTC(2007, 9, 4), 0.07], [Date.UTC(2007, 10, 4), 0.09], [Date.UTC(2007, 11, 3), 0.09], [Date.UTC(2008, 0, 3), 0.04], [Date.UTC(2008, 1, 7), 0.06], [Date.UTC(2008, 5, 1), 0.09], [Date.UTC(2008, 6, 2), 0.08], [Date.UTC(2008, 7, 3), 0.06], [Date.UTC(2008, 8, 3), 0.07], [Date.UTC(2008, 9, 6), 0.07], [Date.UTC(2008, 10, 2), 0.08], [Date.UTC(2008, 11, 3), 0.07], [Date.UTC(2009, 0, 2), 0.07], [Date.UTC(2009, 1, 1), 0.08], [Date.UTC(2009, 2, 5), 0.08], [Date.UTC(2009, 3, 7), 0.09], [Date.UTC(2009, 4, 1), 0.10], [Date.UTC(2009, 5, 4), 0.10], [Date.UTC(2009, 6, 2), 0.10], [Date.UTC(2009, 7, 1), 0.11], [Date.UTC(2009, 8, 5), 0.09], [Date.UTC(2009, 9, 2), 0.10], [Date.UTC(2009, 10, 2), 0.10], [Date.UTC(2009, 11, 2), 0.11], [Date.UTC(2010, 0, 5), 0.10], [Date.UTC(2010, 1, 7), 0.14], [Date.UTC(2010, 2, 7), 0.15], [Date.UTC(2010, 3, 5), 0.14], [Date.UTC(2010, 4, 15), 0.20], [Date.UTC(2010, 6, 6), 0.17], [Date.UTC(2010, 6, 30), 0.18], [Date.UTC(2010, 8, 11), 0.33], [Date.UTC(2010, 9, 2), 0.36], [Date.UTC(2010, 10, 3), 0.36], [Date.UTC(2010, 11, 7), 0.54], [Date.UTC(2011, 0, 2), 0.54], [Date.UTC(2011, 1, 8), 0.56], [Date.UTC(2011, 2, 8), 0.49], [Date.UTC(2011, 3, 3), 0.42], [Date.UTC(2011, 4, 2), 0.43], [Date.UTC(2011, 5, 5), 0.37], [Date.UTC(2011, 5, 27), 0.37], [Date.UTC(2011, 6, 8), 0.36], [Date.UTC(2011, 7, 3), 0.40], [Date.UTC(2011, 8, 10), 0.39], [Date.UTC(2011, 9, 9), 0.42], [Date.UTC(2011, 10, 7), 0.50], [Date.UTC(2011, 11, 4), 0.52], [Date.UTC(2012, 0, 8), 0.60], [Date.UTC(2012, 1, 5), 0.62], [Date.UTC(2012, 2, 11), 0.50], [Date.UTC(2012, 3, 8), 0.38], [Date.UTC(2012, 4, 9), 0.38], [Date.UTC(2012, 5, 10), 0.44], [Date.UTC(2012, 6, 4), 0.44], [Date.UTC(2012, 7, 10), 0.43], [Date.UTC(2012, 8, 2), 0.44], [Date.UTC(2012, 9, 5), 0.42], [Date.UTC(2012, 10, 4), 0.42], [Date.UTC(2012, 11, 2), 0.45], [Date.UTC(2013, 0, 5), 
0.44], [Date.UTC(2013, 1, 8), 0.46], [Date.UTC(2013, 2, 11), 0.53], [Date.UTC(2013, 3, 7), 0.48], [Date.UTC(2013, 4, 8), 0.54], [Date.UTC(2013, 5, 9), 0.48], [Date.UTC(2013, 6, 7), 0.51], [Date.UTC(2013, 6, 12), 0.51], [Date.UTC(2013, 7, 4), 0.39], [Date.UTC(2013, 8, 11), 0.65], [Date.UTC(2013, 9, 10), 0.55], [Date.UTC(2013, 10, 9), 0.41], [Date.UTC(2013, 11, 6), 0.25], [Date.UTC(2014, 0, 1), 0.25], [Date.UTC(2014, 1, 8), 0.25], [Date.UTC(2014, 2, 3), 0.23], [Date.UTC(2014, 3, 10), 0.26], [Date.UTC(2014, 4, 7), 0.38], [Date.UTC(2014, 5, 8), 0.67], [Date.UTC(2014, 6, 6), 0.41], [Date.UTC(2014, 7, 11), 0.52], [Date.UTC(2014, 8, 1), 0.80], [Date.UTC(2014, 9, 3), 1.52], [Date.UTC(2014, 10, 8), 1.55], [Date.UTC(2014, 11, 7), 1.63], [Date.UTC(2015, 0, 6), 1.04], [Date.UTC(2015, 1, 5), 0.96], [Date.UTC(2015, 2, 7), 0.95], [Date.UTC(2015, 3, 13), 1.03], [Date.UTC(2015, 4, 13), 1.44], [Date.UTC(2015, 5, 6), 1.52], [Date.UTC(2015, 6, 12), 1.23], [Date.UTC(2015, 7, 6), 1.01], [Date.UTC(2015, 8, 5), 1.04], [Date.UTC(2015, 9, 4), 0.99], [Date.UTC(2015, 10, 7), 1.01], [Date.UTC(2015, 11, 4), 1.12], [Date.UTC(2016, 0, 2), 1.05], [Date.UTC(2016, 1, 2), 1.19], [Date.UTC(2016, 2, 3), 1.29], [Date.UTC(2016, 3, 7), 1.27], [Date.UTC(2016, 4, 6), 1.33], [Date.UTC(2016, 5, 5), 1.54], [Date.UTC(2016, 6, 4), 1.51], [Date.UTC(2016, 7, 6), 1.61], [Date.UTC(2016, 8, 8), 1.68], [Date.UTC(2016, 9, 7), 1.74], [Date.UTC(2016, 10, 5), 1.72], [Date.UTC(2016, 11, 4), 1.83], [Date.UTC(2017, 0, 7), 1.79], [Date.UTC(2017, 1, 8), 1.92], [Date.UTC(2017, 2, 7), 2.02], [Date.UTC(2017, 3, 9), 2.14], [Date.UTC(2017, 4, 6), 2.19], [Date.UTC(2017, 5, 3), 2.15], [Date.UTC(2017, 6, 7), 2.11], [Date.UTC(2017, 7, 2), 1.77], [Date.UTC(2017, 8, 6), 1.82], [Date.UTC(2017, 9, 5), 1.68], [Date.UTC(2017, 10, 12), 1.60], [Date.UTC(2017, 11, 9), 1.91], [Date.UTC(2018, 0, 3), 2.55], [Date.UTC(2018, 1, 8), 2.09], [Date.UTC(2018, 2, 7), 1.13], [Date.UTC(2018, 3, 1), 1.81], [Date.UTC(2018, 4, 6), 1.18], [Date.UTC(2018, 5, 
10), 1.45], [Date.UTC(2018, 6, 7), 1.15], [Date.UTC(2018, 7, 1), 0.96], [Date.UTC(2018, 8, 3), 1.02], [Date.UTC(2018, 9, 5), 1.21], [Date.UTC(2018, 10, 8), 1.41], [Date.UTC(2018, 11, 2), 1.11], [Date.UTC(2019, 0, 4), 1.33], [Date.UTC(2019, 1, 6), 1.04], [Date.UTC(2019, 2, 2), 1.28], [Date.UTC(2019, 3, 7), 1.18], [Date.UTC(2019, 4, 4), 0.95], [Date.UTC(2019, 5, 9), 0.91], [Date.UTC(2019, 6, 6), 0.84], [Date.UTC(2019, 7, 5), 0.82], [Date.UTC(2019, 8, 9), 1.05], [Date.UTC(2019, 9, 5), 1.26], [Date.UTC(2019, 10, 3), 0.98], [Date.UTC(2019, 11, 6), 0.99], [Date.UTC(2020, 0, 5), 0.81], [Date.UTC(2020, 1, 4), 1.01], [Date.UTC(2020, 2, 4), 1.26], [Date.UTC(2020, 3, 2), 1.54], [Date.UTC(2020, 4, 2), 1.85], [Date.UTC(2020, 5, 1), 2.19], [Date.UTC(2020, 6, 4), 2.41], [Date.UTC(2020, 7, 2), 2.79], [Date.UTC(2020, 8, 6), 2.37], [Date.UTC(2020, 9, 4), 1.99], [Date.UTC(2020, 10, 3), 1.64], [Date.UTC(2020, 11, 3), 1.60], [Date.UTC(2021, 0, 2), 1.90], [Date.UTC(2021, 1, 6), 1.56], [Date.UTC(2021, 2, 4), 1.25], [Date.UTC(2021, 3, 4), 1.12], [Date.UTC(2021, 4, 2), 1.38], [Date.UTC(2021, 5, 5), 1.20], [Date.UTC(2021, 6, 4), 1.33], [Date.UTC(2021, 7, 3), 1.05], [Date.UTC(2021, 8, 11), 0.98], [Date.UTC(2021, 9, 6), 1.20], [Date.UTC(2021, 10, 6), 1.28], [Date.UTC(2021, 11, 5), 1.58], [Date.UTC(2022, 0, 1), 1.25], [Date.UTC(2022, 1, 2), 1.11], [Date.UTC(2022, 2, 2), 1.37], [Date.UTC(2022, 3, 5), 1.55], [Date.UTC(2022, 4, 3), 1.22], [Date.UTC(2022, 5, 4), 0.98], [Date.UTC(2022, 6, 2), 0.76], [Date.UTC(2022, 7, 2), 0.90], [Date.UTC(2022, 8, 1), 0.95], [Date.UTC(2022, 9, 1), 1.22], [Date.UTC(2022, 10, 1), 1.14], [Date.UTC(2022, 11, 2), 1.25], [Date.UTC(2022, 11, 29), 1.04], [Date.UTC(2023, 1, 1), 1.08], [Date.UTC(2023, 2, 2), 0.93], [Date.UTC(2023, 3, 1), 0.76], [Date.UTC(2023, 4, 2), 0.82], [Date.UTC(2023, 5, 2), 0.94], [Date.UTC(2023, 6, 2), 0.87], [Date.UTC(2023, 7, 4), 0.92], [Date.UTC(2023, 8, 2), 0.97], [Date.UTC(2023, 9, 4), 0.96], [Date.UTC(2023, 10, 2), 0.93], [Date.UTC(2023, 11, 4), 
0.72], [Date.UTC(2024, 0, 2), 0.74], [Date.UTC(2024, 1, 2), 0.99], [Date.UTC(2024, 2, 1), 0.81], [Date.UTC(2024, 3, 3), 0.84], [Date.UTC(2024, 4, 1), 0.75], [Date.UTC(2024, 5, 1), 0.96], [Date.UTC(2024, 6, 3), 0.83], [Date.UTC(2024, 7, 1), 1.11], [Date.UTC(2024, 8, 1), 1.20], [Date.UTC(2024, 9, 2), 1.09"
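To see what the two zero-width assertions do, here is a small sketch on a shortened string of the same shape as the scraped text. The sample string txt is cut down by hand and is only an illustration.

```r
library(stringr)

# Shortened stand-in for the scraped text (hand-made sample, same shape).
txt <- "series: [ {name : 'R',data : [[Date.UTC(2007, 4, 5), 0.06], [Date.UTC(2007, 5, 2), 0.07]]} ]"

# (?<=\\[\\[) requires "[[" immediately before the match,
# (?=\\]\\]) requires "]]" immediately after it; neither is part of the result.
inner <- str_extract(txt, "(?<=\\[\\[).*(?=\\]\\])")
inner
```

The brackets themselves are excluded from the match, which is why the extracted text starts with Date.UTC and ends with the last rating. The greedy .* is safe here only because the data sits on a single line and ]] occurs once.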
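2.2 Cleaning the data
dats is fairly regular data, but it is still one long string, so we need to clean it.
- Thinking in data-frame terms, first put the data into a data frame, then split the data points apart. Looking at the pattern, the data points are separated by "], [", so once again we use a regular expression.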
df <- tibble(x = dats) %>%
  separate_rows(x, sep = "\\], \\[")
df
# A tibble: 208 × 1
x
<chr>
1 Date.UTC(2007, 4, 5), 0.06
2 Date.UTC(2007, 5, 2), 0.07
3 Date.UTC(2007, 6, 2), 0.07
4 Date.UTC(2007, 7, 5), 0.06
5 Date.UTC(2007, 8, 2), 0.07
6 Date.UTC(2007, 9, 4), 0.07
7 Date.UTC(2007, 10, 4), 0.09
8 Date.UTC(2007, 11, 3), 0.09
9 Date.UTC(2008, 0, 3), 0.04
10 Date.UTC(2008, 1, 7), 0.06
# ℹ 198 more rows
- Split the result further into two columns and convert the character ratings to numeric.
# A tibble: 208 × 2
Date Index
<chr> <dbl>
1 Date.UTC(2007, 4, 5 0.06
2 Date.UTC(2007, 5, 2 0.07
3 Date.UTC(2007, 6, 2 0.07
4 Date.UTC(2007, 7, 5 0.06
5 Date.UTC(2007, 8, 2 0.07
6 Date.UTC(2007, 9, 4 0.07
7 Date.UTC(2007, 10, 4 0.09
8 Date.UTC(2007, 11, 3 0.09
9 Date.UTC(2008, 0, 3 0.04
10 Date.UTC(2008, 1, 7 0.06
# ℹ 198 more rows
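The code for this step is not shown in the text. One way to arrive at a tibble of that shape, offered as an assumption rather than the original code, is to split each row at the closing parenthesis with separate(). The sketch uses two sample rows and the hypothetical names df_sample and df_split.

```r
library(tibble)
library(tidyr)

# Two rows in the same shape as df after separate_rows() (values from the output above).
df_sample <- tibble(x = c(
  "Date.UTC(2007, 4, 5), 0.06",
  "Date.UTC(2007, 5, 2), 0.07"
))

# Splitting at "), " drops the closing parenthesis from the Date part;
# convert = TRUE turns the Index column into numbers.
df_split <- df_sample %>%
  separate(x, into = c("Date", "Index"), sep = "\\), ", convert = TRUE)

df_split
```

The Date column keeps its Date.UTC( prefix at this point, which matches the output shown above and is removed in the next step.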
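- Convert the Date column to dates. Note that row 8 has month 11 and row 9, in the following year, has month 0: the months run from 0 to 11 rather than from 1 to 12, so each month needs 1 added. We could parse the dates first, but it is simpler to extract the numbers, split them into columns, add 1 to the month column, and then build the date column.
library(lubridate) # for make_date()
df <- df %>%
  mutate(Date = str_replace(Date, "Date\\.UTC\\(", "")) %>%
  separate(Date, c("year", "month", "day"), convert = TRUE) %>%
  mutate(Date = make_date(year, month + 1, day)) %>%
  select(Date, Index)
df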
# A tibble: 208 × 2
Date Index
<date> <dbl>
1 2007-05-05 0.06
2 2007-06-02 0.07
3 2007-07-02 0.07
4 2007-08-05 0.06
5 2007-09-02 0.07
6 2007-10-04 0.07
7 2007-11-04 0.09
8 2007-12-03 0.09
9 2008-01-03 0.04
10 2008-02-07 0.06
# ℹ 198 more rows
Done!
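As a quick check of the month shift, the sketch below rebuilds rows 8 and 9 by hand. JavaScript's Date.UTC() counts months from 0 to 11, so Date.UTC(2007, 11, 3) is 3 December 2007 and Date.UTC(2008, 0, 3) is 3 January 2008. The vector names are made up for the example.

```r
library(lubridate)

# Year, zero-based month and day taken from rows 8 and 9 of the raw data.
js_year  <- c(2007, 2008)
js_month <- c(11, 0)
js_day   <- c(3, 3)

# Adding 1 maps the JavaScript months 0-11 onto the calendar months 1-12.
dates <- make_date(js_year, js_month + 1, js_day)
dates
```

Since the raw months never exceed 11, month + 1 never exceeds 12, so no year rollover needs handling.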
2.3 Visualization
Use ggplot2 to draw a line chart of how the R index has changed over time.
df %>%
  ggplot(aes(Date, Index)) +
  # the line width belongs in geom_line(); linewidth needs ggplot2 >= 3.4 (use size on older versions)
  geom_line(color = "steelblue", linewidth = 1.5) +
  scale_x_date(date_breaks = "1 years", date_labels = "%Y") +
  labs(
    x = "Date", y = "TIOBE index",
    title = "TIOBE index for R",
    caption = "Source: www.tiobe.com"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))
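3 Historical weather data for Beijing: batch extraction from multiple pages
Index page of historical Beijing weather on tianqi.com (天气网): http://lishi.tianqi.com/beijing/index.html
Opening the weather for March 2022: http://lishi.tianqi.com/beijing/202203.html
Although it looks like a table, the HTML is not a table element (if it were, this would be trivial), so we have to parse it with regular expressions.
3.1 Getting the page URLs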
postfix <- read_html("http://lishi.tianqi.com/beijing/index.html") %>%
  html_elements("a") %>%
  html_attr("href") %>% # extract the link targets
  str_subset("^/beijing") # keep only the Beijing monthly pages
urls <- str_c("http://lishi.tianqi.com", postfix)
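The filtering step can be tried offline. The sketch below uses made-up href values of the kind html_attr("href") might return (the sample vector hrefs is an assumption, not scraped data) to show how str_subset() keeps only the relative Beijing links before the host is prepended.

```r
library(stringr)

# Made-up href values, including an unrelated link and a missing value.
hrefs <- c("/beijing/202203.html", "/shanghai/202203.html",
           "https://example.com/", NA, "/beijing/202204.html")

# Keep only strings that start with "/beijing"; NA values are dropped as well.
postfix <- str_subset(hrefs, "^/beijing")

# Prepend the host to turn the relative paths into full URLs.
urls <- str_c("http://lishi.tianqi.com", postfix)
urls
```

Anchoring the pattern with ^ matters here: without it, an absolute link that merely contains /beijing somewhere in its path would also pass the filter and end up with the host prepended twice.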
3.2 Scraping and tidying a single page
3.2.1 Parsing the historical weather data from the page
weather <- read_html(urls[1]) %>%
  html_nodes(".thrui") %>%
  html_text2()
weather
[1] "\r\n2024-10-01 星期二\n\r\n17℃\n\r\n7℃\n\r\n多云~晴\n\r\n西北风 4级\n\r\r\n\r\n\r\n2024-10-02 星期三\n\r\n20℃\n\r\n7℃\n\r\n晴\n\r\n西北风 2级\n\r\r\n\r\n\r\n2024-10-03 星期四\n\r\n22℃\n\r\n8℃\n\r\n晴\n\r\n南风 2级\n\r\r\n\r\n\r\n2024-10-04 星期五\n\r\n22℃\n\r\n9℃\n\r\n多云~晴\n\r\n西南风 2级\n\r\r\n\r\n\r\n2024-10-05 星期六\n\r\n23℃\n\r\n13℃\n\r\n多云~阴\n\r\n南风 1级\n\r\r\n\r\n\r\n2024-10-06 星期日\n\r\n18℃\n\r\n10℃\n\r\n多云~小雨\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-07 星期一\n\r\n22℃\n\r\n8℃\n\r\n晴\n\r\n北风 2级\n\r\r\n\r\n\r\n2024-10-08 星期二\n\r\n23℃\n\r\n10℃\n\r\n晴~多云\n\r\n西南风 1级\n\r\r\n\r\n\r\n2024-10-09 星期三\n\r\n19℃\n\r\n11℃\n\r\n多云\n\r\n西南风 1级\n\r\r\n\r\n\r\n2024-10-10 星期四\n\r\n22℃\n\r\n8℃\n\r\n晴\n\r\n西北风 1级\n\r\r\n\r\n\r\n2024-10-11 星期五\n\r\n23℃\n\r\n10℃\n\r\n晴\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-12 星期六\n\r\n24℃\n\r\n13℃\n\r\n多云\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-13 星期日\n\r\n18℃\n\r\n13℃\n\r\n多云~雾\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-14 星期一\n\r\n20℃\n\r\n12℃\n\r\n多云\n\r\n西南风 2级\n\r\r\n\r\n\r\n2024-10-15 星期二\n\r\n21℃\n\r\n10℃\n\r\n阴~多云\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-16 星期三\n\r\n18℃\n\r\n13℃\n\r\n多云~阴\n\r\n南风 2级\n\r\r\n\r\n\r\n2024-10-17 星期四\n\r\n17℃\n\r\n14℃\n\r\n多云~小雨\n\r\n东北风 1级\n\r\r\n\r\n\r\n2024-10-18 星期五\n\r\n18℃\n\r\n5℃\n\r\n小雨~晴\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-19 星期六\n\r\n13℃\n\r\n2℃\n\r\n晴\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-20 星期日\n\r\n12℃\n\r\n1℃\n\r\n阴~多云\n\r\n南风 3级\n\r\r\n\r\n\r\n2024-10-21 星期一\n\r\n13℃\n\r\n6℃\n\r\n阴~小雨\n\r\n南风 1级\n\r\r\n\r\n\r\n2024-10-22 星期二\n\r\n17℃\n\r\n3℃\n\r\n阴~晴\n\r\n西北风 3级\n\r\r\n\r\n\r\n2024-10-23 星期三\n\r\n19℃\n\r\n5℃\n\r\n晴\n\r\n西南风 1级\n\r\r\n\r\n\r\n2024-10-24 星期四\n\r\n21℃\n\r\n7℃\n\r\n多云~晴\n\r\n东风 1级\n\r\r\n\r\n\r\n2024-10-25 星期五\n\r\n19℃\n\r\n10℃\n\r\n多云~雾\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-26 星期六\n\r\n16℃\n\r\n10℃\n\r\n多云~雾\n\r\n东北风 1级\n\r\r\n\r\n\r\n2024-10-27 星期日\n\r\n18℃\n\r\n4℃\n\r\n多云~晴\n\r\n北风 2级\n\r\r\n\r\n\r\n2024-10-28 星期一\n\r\n19℃\n\r\n7℃\n\r\n多云\n\r\n南风 1级\n\r\r\n\r\n\r\n2024-10-29 星期二\n\r\n19℃\n\r\n7℃\n\r\n多云~晴\n\r\n西北风 1级\n\r\r\n\r\n\r\n2024-10-30 
星期三\n\r\n18℃\n\r\n10℃\n\r\n多云~雾\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-31 星期四\n\r\n17℃\n\r\n12℃\n\r\n轻度雾霾\n\r\n北风 1级\n\r\r\n\r\n查看更多\n\r"
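This is the content we wanted, though not yet in its final form; it needs further tidying.
3.2.2 Cleaning the data
Clean the data and convert it into a data frame.
- Split the data into rows, one row per date.
- Then split each row into columns according to its pattern.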
tibble(weather) %>%
  separate_rows(weather, sep = "\n\r\n(?=\\d{4})") %>%
  mutate(weather = str_replace_all(weather, "\r\n|查看更多", "")) %>%
  separate(weather,
    sep = "\n\r*", extra = "drop",
    # column names: date, high temperature, low temperature, weather, wind
    into = c("日期", "最高气温", "最低气温", "天气", "风向")
  )
# A tibble: 31 × 5
日期 最高气温 最低气温 天气 风向
<chr> <chr> <chr> <chr> <chr>
1 2024-10-01 星期二 17℃ 7℃ 多云~晴 西北风 4级
2 2024-10-02 星期三 20℃ 7℃ 晴 西北风 2级
3 2024-10-03 星期四 22℃ 8℃ 晴 南风 2级
4 2024-10-04 星期五 22℃ 9℃ 多云~晴 西南风 2级
5 2024-10-05 星期六 23℃ 13℃ 多云~阴 南风 1级
6 2024-10-06 星期日 18℃ 10℃ 多云~小雨 东北风 2级
7 2024-10-07 星期一 22℃ 8℃ 晴 北风 2级
8 2024-10-08 星期二 23℃ 10℃ 晴~多云 西南风 1级
9 2024-10-09 星期三 19℃ 11℃ 多云 西南风 1级
10 2024-10-10 星期四 22℃ 8℃ 晴 西北风 1级
# ℹ 21 more rows
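- First split into rows on \n\r\n. The lookahead (?=\\d{4}) is added so that the trailing \n\r\n查看更多 (the "see more" link text) is not split off into a row of its own.
- Remove the leftover \r\n and 查看更多 by replacing them with the empty string.
- Split the single column into several columns, using \n followed by zero or more \r as the separator.
- The argument extra = "drop" suppresses a harmless warning about the extra pieces.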
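The three string operations above can be traced on a small sample. The sketch below uses an ASCII stand-in with the same delimiter layout as the scraped text: the values are invented, and the word more plays the role of the trailing 查看更多 link text. It calls stringr directly instead of the tidyr wrappers, so it illustrates the regular expressions rather than reproducing the pipeline.

```r
library(stringr)

# ASCII stand-in: two records plus trailing link text, same "\n\r\n" layout.
raw <- "\r\n2024-10-01 Tue\n\r\n17C\n\r\n7C\n\r\ncloudy\n\r\nNW 4\n\r\r\n\r\n\r\n2024-10-02 Wed\n\r\n20C\n\r\n7C\n\r\nsunny\n\r\nNW 2\n\r\r\n\r\nmore\n\r"

# Split only where "\n\r\n" is followed by four digits, i.e. the start of a date.
rows <- str_split(raw, "\n\r\n(?=\\d{4})")[[1]]

# Remove leftover "\r\n" pairs and the link text.
rows <- str_replace_all(rows, "\r\n|more", "")

# Split each record into fields on "\n" followed by zero or more "\r".
fields <- str_split(rows, "\n\r*")

length(rows)       # number of records
fields[[1]][1:5]   # the five fields of the first record
```

Each record ends with one or two empty trailing pieces after the split, which is what extra = "drop" discards in the tidyr version.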
3.3 Batch scraping multiple URLs and tidying the data
Wrap the single-URL scraping and tidying steps above into a function.
crawurl <- function(url) {
  Sys.sleep(sample(5, 1)) # wait a random 1-5 seconds between requests
  weather <- read_html(url) %>%
    html_nodes(".thrui") %>%
    html_text2()
  tibble(weather) %>%
    separate_rows(weather, sep = "\n\r\n(?=\\d{4})") %>%
    # assign the cleaned text back to weather (the original call omitted "weather =")
    mutate(weather = str_replace_all(weather, "\r\n|查看更多", "")) %>%
    separate(weather,
      sep = "\n\r*", extra = "drop",
      into = c("日期", "最高气温", "最低气温", "天气", "风向")
    )
}
Use the map() family of functions to iterate over all the URLs:
# beijing <- map_dfr(urls, crawurl)
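With many pages, a single failed request would stop the whole loop. One possible safeguard, sketched here as an assumption rather than part of the original workflow, is to wrap the function with purrr's possibly(). The names fake_crawl, safe_crawl and demo_urls are hypothetical, and fake_crawl only imitates crawurl() without making any request.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for crawurl(): returns a one-row tibble, or fails
# for one URL to imitate a network error. No request is made.
fake_crawl <- function(url) {
  if (grepl("bad", url)) stop("request failed: ", url)
  tibble(url = url, n_char = nchar(url))
}

# possibly() returns the fallback value (NULL) instead of raising the error.
safe_crawl <- possibly(fake_crawl, otherwise = NULL)

demo_urls <- c("http://example.com/a", "http://example.com/bad", "http://example.com/b")

# map_dfr() row-binds the results; NULL entries contribute no rows.
result <- map_dfr(demo_urls, safe_crawl)
result
```

In real use the wrapped function would be crawurl itself, for example possibly(crawurl, otherwise = NULL). The failed URLs are then simply missing from the result, so it is worth comparing the row count against the number of pages requested.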