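Tidyverse in Practice (1)

Web Scraping in R: TIOBE Index Analysis + Beijing Historical Weather Data

Breaking a problem into parts, designing a simple example with a built-in or hand-made dataset, and then writing and debugging code step by step is a very effective way of thinking.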
tidyverse
rvest
ggplot2
Author

Lee

Published

March 22, 2023

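The TIOBE Programming Community index is an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers worldwide, courses, and third-party vendors. Popular search engines and sites such as Google, Bing, Yahoo, Wikipedia, Amazon, YouTube, and Baidu are used to calculate the ratings. Note that the TIOBE index is not about the best programming language or the language in which the most lines of code have been written.

In the TIOBE index, R tends to hover between 10th and 20th place, so we can scrape the index data from the TIOBE page, tidy it into a data frame, and draw it as a line chart. Carrying out this idea takes the following steps: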

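1 Scraping with R: parsing tabular data from web pages

1.1 Example 1: scraping the 2022 US holidays

The URL is https://www.officeholidays.com/countries/usa/2022. In the middle of the page is a table recording the name and date of each holiday.

Inspecting the page source shows that the page is standard, well-formed HTML: the table is stored in a table tag, so the html_table() function can read it directly.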

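library(tidyverse) # dplyr, tidyr, stringr, purrr, ggplot2 (lubridate too since tidyverse 2.0.0)
library(rvest)
read_html("https://www.officeholidays.com/countries/usa/2022") %>%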
  html_node("table") %>% 
  html_table()
# A tibble: 21 × 5
   Day      Date   `Holiday Name`             Type                 Comments     
   <chr>    <chr>  <chr>                      <chr>                <chr>        
 1 Saturday Jan 01 New Year's Day             Federal Holiday      ""           
 2 Monday   Jan 03 New Year's Day (in lieu)   Regional Holiday     ""           
 3 Monday   Jan 17 Martin Luther King Jr. Day Federal Holiday      "3rd Monday …
 4 Monday   Feb 21 President's Day            Federal Holiday      "3rd Monday …
 5 Sunday   Apr 17 Easter Sunday              Not A Public Holiday ""           
 6 Sunday   May 08 Mother's Day               Not A Public Holiday "2nd Sunday …
 7 Monday   May 30 Memorial Day               Federal Holiday      "Last Monday…
 8 Friday   Jun 17 Juneteenth                 Federal Holiday      "Emancipatio…
 9 Sunday   Jun 19 Juneteenth                 Federal Holiday      "Emancipatio…
10 Sunday   Jun 19 Father's Day               Not A Public Holiday "3rd Sunday …
# ℹ 11 more rows
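To experiment with html_table() without depending on the live site, the same call can be tried on a small inline HTML string. The following is a minimal sketch: the snippet `holiday_html` is hand-written for illustration (two rows copied from the output above) and is not the real page source. It uses html_element(), the current rvest name for html_node().

```r
library(rvest)

# Hypothetical stand-in for the real page: a tiny HTML document with one table
holiday_html <- '
<html><body>
  <table>
    <tr><th>Day</th><th>Date</th><th>Holiday Name</th></tr>
    <tr><td>Monday</td><td>May 30</td><td>Memorial Day</td></tr>
    <tr><td>Friday</td><td>Jun 17</td><td>Juneteenth</td></tr>
  </table>
</body></html>'

holidays <- read_html(holiday_html) %>%
  html_element("table") %>% # first <table> in the document
  html_table()              # the <th> row becomes the column names

holidays
```

Because the header cells are th elements, html_table() uses them as column names, which is why no extra cleaning was needed in the example above.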

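1.2 Example 2: scraping a food menu from XML

The URL is https://www.w3schools.com/xml/simple.xml. The data are not stored in standard table markup:

In this case we fall back on ordinary scraping: find the specific tags and extract their contents in bulk.

# First, extract a single tag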
html <- read_html("https://www.w3schools.com/xml/simple.xml")
html %>%
  html_nodes("name") %>%
  html_text2()
[1] "Belgian Waffles"             "Strawberry Belgian Waffles" 
[3] "Berry-Berry Belgian Waffles" "French Toast"               
[5] "Homestyle Breakfast"        
# Use map() to extract all required tags in bulk and combine them into a data frame
vars <- c("name", "price", "description", "calories")
map(
  vars,
  ~ html_nodes(html, .x) %>%
    html_text2()
) %>%
  set_names(vars) %>%
  as_tibble()
# A tibble: 5 × 4
  name                        price description                         calories
  <chr>                       <chr> <chr>                               <chr>   
1 Belgian Waffles             $5.95 Two of our famous Belgian Waffles … 650     
2 Strawberry Belgian Waffles  $7.95 Light Belgian waffles covered with… 900     
3 Berry-Berry Belgian Waffles $8.95 Light Belgian waffles covered with… 900     
4 French Toast                $4.50 Thick slices made from our homemad… 600     
5 Homestyle Breakfast         $6.95 Two eggs, bacon or sausage, toast,… 950     
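The tag-by-tag approach can be reproduced offline on a small inline document. In this sketch `menu_xml` is a hypothetical two-item menu written in the same shape as simple.xml; set_names() without a second argument names each element after its own value, so the list comes out already named by tag.

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical two-item menu in the same shape as simple.xml
menu_xml <- "
<breakfast_menu>
  <food><name>Belgian Waffles</name><price>$5.95</price><calories>650</calories></food>
  <food><name>French Toast</name><price>$4.50</price><calories>600</calories></food>
</breakfast_menu>"

doc <- read_html(menu_xml)
vars <- c("name", "price", "calories")

menu <- vars %>%
  set_names() %>%                               # name the vector by its own values
  map(~ html_text2(html_elements(doc, .x))) %>% # one character vector per tag
  as_tibble()

menu
```

This column-wise extraction assumes that every food entry contains each tag exactly once. If a tag were missing from one entry, the vectors would have different lengths and as_tibble() would fail.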

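2 TIOBE index for R: extracting from a single page

2.1 Scraping the data

The TIOBE page is static, so rvest can parse it. Normally the SelectorGadget extension is the recommended way to locate the node that holds the data, but it cannot pin the node down on this page, so we have to locate it by hand:

Looking at the page source, the data are not hard to find:

The data are very long and sit between script tags. Note that the month needs +1 to line up.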

url <- "https://www.tiobe.com/tiobe-index/r/"
dats <- read_html(url) %>% # parse the page
  html_node("script") %>% # locate the node
  html_text2() # get the text content
dats
[1] "window.dataLayer = window.dataLayer || []; function gtag() { dataLayer.push(arguments); } gtag(\"consent\", \"default\", { ad_user_data: \"denied\", ad_personalization: \"denied\", ad_storage: \"denied\", analytics_storage: \"denied\", functionality_storage: \"denied\", personalization_storage: \"denied\", security_storage: \"granted\", wait_for_update: 500, }); gtag(\"set\", \"ads_data_redaction\", true);"

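The result does not seem to be what we expected. Going back to the source and looking one level further, the node we want is article.

dats <- read_html(url) %>% # parse the page
  html_node("article") %>% # locate the node
  html_text2() # get the text content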
dats
[1] "Home » TIOBE Index\n\n$(function () { $('#container').highcharts({ credits: { enabled: false }, chart: { type: 'spline' }, plotOptions: { spline: { lineWidth: 4, states: { hover: { lineWidth: 5 } }, marker: { enabled: false } } }, title: { text: 'TIOBE Index for R', x: -20, useHTML: true }, subtitle: { text: 'Source: www.tiobe.com', x: -20, useHTML: true }, xAxis: { type: 'datetime', dateTimeLabelFormats: { year: '%Y' } }, yAxis: { title: { text: 'Ratings (%)' }, plotLines: [{ value: 0, width: 1, color: '#808080' }] }, tooltip: { valueSuffix: '%', dateTimeLabelFormats: { week: \"%B %Y\" } }, legend: { enabled: false }, series: [ {name : 'R',data : [[Date.UTC(2007, 4, 5), 0.06], [Date.UTC(2007, 5, 2), 0.07], [Date.UTC(2007, 6, 2), 0.07], [Date.UTC(2007, 7, 5), 0.06], [Date.UTC(2007, 8, 2), 0.07], [Date.UTC(2007, 9, 4), 0.07], [Date.UTC(2007, 10, 4), 0.09], [Date.UTC(2007, 11, 3), 0.09], [Date.UTC(2008, 0, 3), 0.04], [Date.UTC(2008, 1, 7), 0.06], [Date.UTC(2008, 5, 1), 0.09], [Date.UTC(2008, 6, 2), 0.08], [Date.UTC(2008, 7, 3), 0.06], [Date.UTC(2008, 8, 3), 0.07], [Date.UTC(2008, 9, 6), 0.07], [Date.UTC(2008, 10, 2), 0.08], [Date.UTC(2008, 11, 3), 0.07], [Date.UTC(2009, 0, 2), 0.07], [Date.UTC(2009, 1, 1), 0.08], [Date.UTC(2009, 2, 5), 0.08], [Date.UTC(2009, 3, 7), 0.09], [Date.UTC(2009, 4, 1), 0.10], [Date.UTC(2009, 5, 4), 0.10], [Date.UTC(2009, 6, 2), 0.10], [Date.UTC(2009, 7, 1), 0.11], [Date.UTC(2009, 8, 5), 0.09], [Date.UTC(2009, 9, 2), 0.10], [Date.UTC(2009, 10, 2), 0.10], [Date.UTC(2009, 11, 2), 0.11], [Date.UTC(2010, 0, 5), 0.10], [Date.UTC(2010, 1, 7), 0.14], [Date.UTC(2010, 2, 7), 0.15], [Date.UTC(2010, 3, 5), 0.14], [Date.UTC(2010, 4, 15), 0.20], [Date.UTC(2010, 6, 6), 0.17], [Date.UTC(2010, 6, 30), 0.18], [Date.UTC(2010, 8, 11), 0.33], [Date.UTC(2010, 9, 2), 0.36], [Date.UTC(2010, 10, 3), 0.36], [Date.UTC(2010, 11, 7), 0.54], [Date.UTC(2011, 0, 2), 0.54], [Date.UTC(2011, 1, 8), 0.56], [Date.UTC(2011, 2, 8), 0.49], [Date.UTC(2011, 3, 3), 0.42], 
[Date.UTC(2011, 4, 2), 0.43], [Date.UTC(2011, 5, 5), 0.37], [Date.UTC(2011, 5, 27), 0.37], [Date.UTC(2011, 6, 8), 0.36], [Date.UTC(2011, 7, 3), 0.40], [Date.UTC(2011, 8, 10), 0.39], [Date.UTC(2011, 9, 9), 0.42], [Date.UTC(2011, 10, 7), 0.50], [Date.UTC(2011, 11, 4), 0.52], [Date.UTC(2012, 0, 8), 0.60], [Date.UTC(2012, 1, 5), 0.62], [Date.UTC(2012, 2, 11), 0.50], [Date.UTC(2012, 3, 8), 0.38], [Date.UTC(2012, 4, 9), 0.38], [Date.UTC(2012, 5, 10), 0.44], [Date.UTC(2012, 6, 4), 0.44], [Date.UTC(2012, 7, 10), 0.43], [Date.UTC(2012, 8, 2), 0.44], [Date.UTC(2012, 9, 5), 0.42], [Date.UTC(2012, 10, 4), 0.42], [Date.UTC(2012, 11, 2), 0.45], [Date.UTC(2013, 0, 5), 0.44], [Date.UTC(2013, 1, 8), 0.46], [Date.UTC(2013, 2, 11), 0.53], [Date.UTC(2013, 3, 7), 0.48], [Date.UTC(2013, 4, 8), 0.54], [Date.UTC(2013, 5, 9), 0.48], [Date.UTC(2013, 6, 7), 0.51], [Date.UTC(2013, 6, 12), 0.51], [Date.UTC(2013, 7, 4), 0.39], [Date.UTC(2013, 8, 11), 0.65], [Date.UTC(2013, 9, 10), 0.55], [Date.UTC(2013, 10, 9), 0.41], [Date.UTC(2013, 11, 6), 0.25], [Date.UTC(2014, 0, 1), 0.25], [Date.UTC(2014, 1, 8), 0.25], [Date.UTC(2014, 2, 3), 0.23], [Date.UTC(2014, 3, 10), 0.26], [Date.UTC(2014, 4, 7), 0.38], [Date.UTC(2014, 5, 8), 0.67], [Date.UTC(2014, 6, 6), 0.41], [Date.UTC(2014, 7, 11), 0.52], [Date.UTC(2014, 8, 1), 0.80], [Date.UTC(2014, 9, 3), 1.52], [Date.UTC(2014, 10, 8), 1.55], [Date.UTC(2014, 11, 7), 1.63], [Date.UTC(2015, 0, 6), 1.04], [Date.UTC(2015, 1, 5), 0.96], [Date.UTC(2015, 2, 7), 0.95], [Date.UTC(2015, 3, 13), 1.03], [Date.UTC(2015, 4, 13), 1.44], [Date.UTC(2015, 5, 6), 1.52], [Date.UTC(2015, 6, 12), 1.23], [Date.UTC(2015, 7, 6), 1.01], [Date.UTC(2015, 8, 5), 1.04], [Date.UTC(2015, 9, 4), 0.99], [Date.UTC(2015, 10, 7), 1.01], [Date.UTC(2015, 11, 4), 1.12], [Date.UTC(2016, 0, 2), 1.05], [Date.UTC(2016, 1, 2), 1.19], [Date.UTC(2016, 2, 3), 1.29], [Date.UTC(2016, 3, 7), 1.27], [Date.UTC(2016, 4, 6), 1.33], [Date.UTC(2016, 5, 5), 1.54], [Date.UTC(2016, 6, 4), 1.51], [Date.UTC(2016, 7, 6), 
1.61], [Date.UTC(2016, 8, 8), 1.68], [Date.UTC(2016, 9, 7), 1.74], [Date.UTC(2016, 10, 5), 1.72], [Date.UTC(2016, 11, 4), 1.83], [Date.UTC(2017, 0, 7), 1.79], [Date.UTC(2017, 1, 8), 1.92], [Date.UTC(2017, 2, 7), 2.02], [Date.UTC(2017, 3, 9), 2.14], [Date.UTC(2017, 4, 6), 2.19], [Date.UTC(2017, 5, 3), 2.15], [Date.UTC(2017, 6, 7), 2.11], [Date.UTC(2017, 7, 2), 1.77], [Date.UTC(2017, 8, 6), 1.82], [Date.UTC(2017, 9, 5), 1.68], [Date.UTC(2017, 10, 12), 1.60], [Date.UTC(2017, 11, 9), 1.91], [Date.UTC(2018, 0, 3), 2.55], [Date.UTC(2018, 1, 8), 2.09], [Date.UTC(2018, 2, 7), 1.13], [Date.UTC(2018, 3, 1), 1.81], [Date.UTC(2018, 4, 6), 1.18], [Date.UTC(2018, 5, 10), 1.45], [Date.UTC(2018, 6, 7), 1.15], [Date.UTC(2018, 7, 1), 0.96], [Date.UTC(2018, 8, 3), 1.02], [Date.UTC(2018, 9, 5), 1.21], [Date.UTC(2018, 10, 8), 1.41], [Date.UTC(2018, 11, 2), 1.11], [Date.UTC(2019, 0, 4), 1.33], [Date.UTC(2019, 1, 6), 1.04], [Date.UTC(2019, 2, 2), 1.28], [Date.UTC(2019, 3, 7), 1.18], [Date.UTC(2019, 4, 4), 0.95], [Date.UTC(2019, 5, 9), 0.91], [Date.UTC(2019, 6, 6), 0.84], [Date.UTC(2019, 7, 5), 0.82], [Date.UTC(2019, 8, 9), 1.05], [Date.UTC(2019, 9, 5), 1.26], [Date.UTC(2019, 10, 3), 0.98], [Date.UTC(2019, 11, 6), 0.99], [Date.UTC(2020, 0, 5), 0.81], [Date.UTC(2020, 1, 4), 1.01], [Date.UTC(2020, 2, 4), 1.26], [Date.UTC(2020, 3, 2), 1.54], [Date.UTC(2020, 4, 2), 1.85], [Date.UTC(2020, 5, 1), 2.19], [Date.UTC(2020, 6, 4), 2.41], [Date.UTC(2020, 7, 2), 2.79], [Date.UTC(2020, 8, 6), 2.37], [Date.UTC(2020, 9, 4), 1.99], [Date.UTC(2020, 10, 3), 1.64], [Date.UTC(2020, 11, 3), 1.60], [Date.UTC(2021, 0, 2), 1.90], [Date.UTC(2021, 1, 6), 1.56], [Date.UTC(2021, 2, 4), 1.25], [Date.UTC(2021, 3, 4), 1.12], [Date.UTC(2021, 4, 2), 1.38], [Date.UTC(2021, 5, 5), 1.20], [Date.UTC(2021, 6, 4), 1.33], [Date.UTC(2021, 7, 3), 1.05], [Date.UTC(2021, 8, 11), 0.98], [Date.UTC(2021, 9, 6), 1.20], [Date.UTC(2021, 10, 6), 1.28], [Date.UTC(2021, 11, 5), 1.58], [Date.UTC(2022, 0, 1), 1.25], [Date.UTC(2022, 1, 2), 
1.11], [Date.UTC(2022, 2, 2), 1.37], [Date.UTC(2022, 3, 5), 1.55], [Date.UTC(2022, 4, 3), 1.22], [Date.UTC(2022, 5, 4), 0.98], [Date.UTC(2022, 6, 2), 0.76], [Date.UTC(2022, 7, 2), 0.90], [Date.UTC(2022, 8, 1), 0.95], [Date.UTC(2022, 9, 1), 1.22], [Date.UTC(2022, 10, 1), 1.14], [Date.UTC(2022, 11, 2), 1.25], [Date.UTC(2022, 11, 29), 1.04], [Date.UTC(2023, 1, 1), 1.08], [Date.UTC(2023, 2, 2), 0.93], [Date.UTC(2023, 3, 1), 0.76], [Date.UTC(2023, 4, 2), 0.82], [Date.UTC(2023, 5, 2), 0.94], [Date.UTC(2023, 6, 2), 0.87], [Date.UTC(2023, 7, 4), 0.92], [Date.UTC(2023, 8, 2), 0.97], [Date.UTC(2023, 9, 4), 0.96], [Date.UTC(2023, 10, 2), 0.93], [Date.UTC(2023, 11, 4), 0.72], [Date.UTC(2024, 0, 2), 0.74], [Date.UTC(2024, 1, 2), 0.99], [Date.UTC(2024, 2, 1), 0.81], [Date.UTC(2024, 3, 3), 0.84], [Date.UTC(2024, 4, 1), 0.75], [Date.UTC(2024, 5, 1), 0.96], [Date.UTC(2024, 6, 3), 0.83], [Date.UTC(2024, 7, 1), 1.11], [Date.UTC(2024, 8, 1), 1.20], [Date.UTC(2024, 9, 2), 1.09]]} ] }); });\nThe R Programming Language\nSome information about R:\n\nHighest Position (since 2007): #8 in Aug 2020\n\nLowest Position (since 2007): #73 in Dec 2008"
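Rather than probing parent nodes one at a time, another option is to collect every script node and keep the one whose text mentions Date.UTC. The sketch below tries this on a hand-written HTML string. `page_html` is an assumption made for illustration and is far simpler than the real TIOBE page.

```r
library(rvest)
library(stringr)

# Hypothetical page: one unrelated script and one script holding the chart data
page_html <- '
<html><head><script>window.dataLayer = [];</script></head>
<body><article>
  <script>series: [{name: "R", data: [[Date.UTC(2007, 4, 5), 0.06], [Date.UTC(2007, 5, 2), 0.07]]}]</script>
</article></body></html>'

scripts <- read_html(page_html) %>%
  html_elements("script") %>% # all <script> nodes, not just the first
  html_text()

chart_js <- str_subset(scripts, fixed("Date.UTC")) # keep only scripts that hold the data
chart_js
```

Filtering on the content makes the step less sensitive to where the script sits in the page, at the cost of assuming that only one script mentions Date.UTC.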

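This time the result contains the data, along with some information we do not need. Everything we need lies between [[ and ]], so a regular expression is the natural way to extract it.

# Extract with zero-width assertions (lookbehind and lookahead)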
dats <- dats %>%
  str_extract("(?<=\\[\\[).*(?=\\]\\])")
dats
[1] "Date.UTC(2007, 4, 5), 0.06], [Date.UTC(2007, 5, 2), 0.07], [Date.UTC(2007, 6, 2), 0.07], [Date.UTC(2007, 7, 5), 0.06], [Date.UTC(2007, 8, 2), 0.07], [Date.UTC(2007, 9, 4), 0.07], [Date.UTC(2007, 10, 4), 0.09], [Date.UTC(2007, 11, 3), 0.09], [Date.UTC(2008, 0, 3), 0.04], [Date.UTC(2008, 1, 7), 0.06], [Date.UTC(2008, 5, 1), 0.09], [Date.UTC(2008, 6, 2), 0.08], [Date.UTC(2008, 7, 3), 0.06], [Date.UTC(2008, 8, 3), 0.07], [Date.UTC(2008, 9, 6), 0.07], [Date.UTC(2008, 10, 2), 0.08], [Date.UTC(2008, 11, 3), 0.07], [Date.UTC(2009, 0, 2), 0.07], [Date.UTC(2009, 1, 1), 0.08], [Date.UTC(2009, 2, 5), 0.08], [Date.UTC(2009, 3, 7), 0.09], [Date.UTC(2009, 4, 1), 0.10], [Date.UTC(2009, 5, 4), 0.10], [Date.UTC(2009, 6, 2), 0.10], [Date.UTC(2009, 7, 1), 0.11], [Date.UTC(2009, 8, 5), 0.09], [Date.UTC(2009, 9, 2), 0.10], [Date.UTC(2009, 10, 2), 0.10], [Date.UTC(2009, 11, 2), 0.11], [Date.UTC(2010, 0, 5), 0.10], [Date.UTC(2010, 1, 7), 0.14], [Date.UTC(2010, 2, 7), 0.15], [Date.UTC(2010, 3, 5), 0.14], [Date.UTC(2010, 4, 15), 0.20], [Date.UTC(2010, 6, 6), 0.17], [Date.UTC(2010, 6, 30), 0.18], [Date.UTC(2010, 8, 11), 0.33], [Date.UTC(2010, 9, 2), 0.36], [Date.UTC(2010, 10, 3), 0.36], [Date.UTC(2010, 11, 7), 0.54], [Date.UTC(2011, 0, 2), 0.54], [Date.UTC(2011, 1, 8), 0.56], [Date.UTC(2011, 2, 8), 0.49], [Date.UTC(2011, 3, 3), 0.42], [Date.UTC(2011, 4, 2), 0.43], [Date.UTC(2011, 5, 5), 0.37], [Date.UTC(2011, 5, 27), 0.37], [Date.UTC(2011, 6, 8), 0.36], [Date.UTC(2011, 7, 3), 0.40], [Date.UTC(2011, 8, 10), 0.39], [Date.UTC(2011, 9, 9), 0.42], [Date.UTC(2011, 10, 7), 0.50], [Date.UTC(2011, 11, 4), 0.52], [Date.UTC(2012, 0, 8), 0.60], [Date.UTC(2012, 1, 5), 0.62], [Date.UTC(2012, 2, 11), 0.50], [Date.UTC(2012, 3, 8), 0.38], [Date.UTC(2012, 4, 9), 0.38], [Date.UTC(2012, 5, 10), 0.44], [Date.UTC(2012, 6, 4), 0.44], [Date.UTC(2012, 7, 10), 0.43], [Date.UTC(2012, 8, 2), 0.44], [Date.UTC(2012, 9, 5), 0.42], [Date.UTC(2012, 10, 4), 0.42], [Date.UTC(2012, 11, 2), 0.45], [Date.UTC(2013, 0, 5), 
0.44], [Date.UTC(2013, 1, 8), 0.46], [Date.UTC(2013, 2, 11), 0.53], [Date.UTC(2013, 3, 7), 0.48], [Date.UTC(2013, 4, 8), 0.54], [Date.UTC(2013, 5, 9), 0.48], [Date.UTC(2013, 6, 7), 0.51], [Date.UTC(2013, 6, 12), 0.51], [Date.UTC(2013, 7, 4), 0.39], [Date.UTC(2013, 8, 11), 0.65], [Date.UTC(2013, 9, 10), 0.55], [Date.UTC(2013, 10, 9), 0.41], [Date.UTC(2013, 11, 6), 0.25], [Date.UTC(2014, 0, 1), 0.25], [Date.UTC(2014, 1, 8), 0.25], [Date.UTC(2014, 2, 3), 0.23], [Date.UTC(2014, 3, 10), 0.26], [Date.UTC(2014, 4, 7), 0.38], [Date.UTC(2014, 5, 8), 0.67], [Date.UTC(2014, 6, 6), 0.41], [Date.UTC(2014, 7, 11), 0.52], [Date.UTC(2014, 8, 1), 0.80], [Date.UTC(2014, 9, 3), 1.52], [Date.UTC(2014, 10, 8), 1.55], [Date.UTC(2014, 11, 7), 1.63], [Date.UTC(2015, 0, 6), 1.04], [Date.UTC(2015, 1, 5), 0.96], [Date.UTC(2015, 2, 7), 0.95], [Date.UTC(2015, 3, 13), 1.03], [Date.UTC(2015, 4, 13), 1.44], [Date.UTC(2015, 5, 6), 1.52], [Date.UTC(2015, 6, 12), 1.23], [Date.UTC(2015, 7, 6), 1.01], [Date.UTC(2015, 8, 5), 1.04], [Date.UTC(2015, 9, 4), 0.99], [Date.UTC(2015, 10, 7), 1.01], [Date.UTC(2015, 11, 4), 1.12], [Date.UTC(2016, 0, 2), 1.05], [Date.UTC(2016, 1, 2), 1.19], [Date.UTC(2016, 2, 3), 1.29], [Date.UTC(2016, 3, 7), 1.27], [Date.UTC(2016, 4, 6), 1.33], [Date.UTC(2016, 5, 5), 1.54], [Date.UTC(2016, 6, 4), 1.51], [Date.UTC(2016, 7, 6), 1.61], [Date.UTC(2016, 8, 8), 1.68], [Date.UTC(2016, 9, 7), 1.74], [Date.UTC(2016, 10, 5), 1.72], [Date.UTC(2016, 11, 4), 1.83], [Date.UTC(2017, 0, 7), 1.79], [Date.UTC(2017, 1, 8), 1.92], [Date.UTC(2017, 2, 7), 2.02], [Date.UTC(2017, 3, 9), 2.14], [Date.UTC(2017, 4, 6), 2.19], [Date.UTC(2017, 5, 3), 2.15], [Date.UTC(2017, 6, 7), 2.11], [Date.UTC(2017, 7, 2), 1.77], [Date.UTC(2017, 8, 6), 1.82], [Date.UTC(2017, 9, 5), 1.68], [Date.UTC(2017, 10, 12), 1.60], [Date.UTC(2017, 11, 9), 1.91], [Date.UTC(2018, 0, 3), 2.55], [Date.UTC(2018, 1, 8), 2.09], [Date.UTC(2018, 2, 7), 1.13], [Date.UTC(2018, 3, 1), 1.81], [Date.UTC(2018, 4, 6), 1.18], [Date.UTC(2018, 5, 
10), 1.45], [Date.UTC(2018, 6, 7), 1.15], [Date.UTC(2018, 7, 1), 0.96], [Date.UTC(2018, 8, 3), 1.02], [Date.UTC(2018, 9, 5), 1.21], [Date.UTC(2018, 10, 8), 1.41], [Date.UTC(2018, 11, 2), 1.11], [Date.UTC(2019, 0, 4), 1.33], [Date.UTC(2019, 1, 6), 1.04], [Date.UTC(2019, 2, 2), 1.28], [Date.UTC(2019, 3, 7), 1.18], [Date.UTC(2019, 4, 4), 0.95], [Date.UTC(2019, 5, 9), 0.91], [Date.UTC(2019, 6, 6), 0.84], [Date.UTC(2019, 7, 5), 0.82], [Date.UTC(2019, 8, 9), 1.05], [Date.UTC(2019, 9, 5), 1.26], [Date.UTC(2019, 10, 3), 0.98], [Date.UTC(2019, 11, 6), 0.99], [Date.UTC(2020, 0, 5), 0.81], [Date.UTC(2020, 1, 4), 1.01], [Date.UTC(2020, 2, 4), 1.26], [Date.UTC(2020, 3, 2), 1.54], [Date.UTC(2020, 4, 2), 1.85], [Date.UTC(2020, 5, 1), 2.19], [Date.UTC(2020, 6, 4), 2.41], [Date.UTC(2020, 7, 2), 2.79], [Date.UTC(2020, 8, 6), 2.37], [Date.UTC(2020, 9, 4), 1.99], [Date.UTC(2020, 10, 3), 1.64], [Date.UTC(2020, 11, 3), 1.60], [Date.UTC(2021, 0, 2), 1.90], [Date.UTC(2021, 1, 6), 1.56], [Date.UTC(2021, 2, 4), 1.25], [Date.UTC(2021, 3, 4), 1.12], [Date.UTC(2021, 4, 2), 1.38], [Date.UTC(2021, 5, 5), 1.20], [Date.UTC(2021, 6, 4), 1.33], [Date.UTC(2021, 7, 3), 1.05], [Date.UTC(2021, 8, 11), 0.98], [Date.UTC(2021, 9, 6), 1.20], [Date.UTC(2021, 10, 6), 1.28], [Date.UTC(2021, 11, 5), 1.58], [Date.UTC(2022, 0, 1), 1.25], [Date.UTC(2022, 1, 2), 1.11], [Date.UTC(2022, 2, 2), 1.37], [Date.UTC(2022, 3, 5), 1.55], [Date.UTC(2022, 4, 3), 1.22], [Date.UTC(2022, 5, 4), 0.98], [Date.UTC(2022, 6, 2), 0.76], [Date.UTC(2022, 7, 2), 0.90], [Date.UTC(2022, 8, 1), 0.95], [Date.UTC(2022, 9, 1), 1.22], [Date.UTC(2022, 10, 1), 1.14], [Date.UTC(2022, 11, 2), 1.25], [Date.UTC(2022, 11, 29), 1.04], [Date.UTC(2023, 1, 1), 1.08], [Date.UTC(2023, 2, 2), 0.93], [Date.UTC(2023, 3, 1), 0.76], [Date.UTC(2023, 4, 2), 0.82], [Date.UTC(2023, 5, 2), 0.94], [Date.UTC(2023, 6, 2), 0.87], [Date.UTC(2023, 7, 4), 0.92], [Date.UTC(2023, 8, 2), 0.97], [Date.UTC(2023, 9, 4), 0.96], [Date.UTC(2023, 10, 2), 0.93], [Date.UTC(2023, 11, 4), 
0.72], [Date.UTC(2024, 0, 2), 0.74], [Date.UTC(2024, 1, 2), 0.99], [Date.UTC(2024, 2, 1), 0.81], [Date.UTC(2024, 3, 3), 0.84], [Date.UTC(2024, 4, 1), 0.75], [Date.UTC(2024, 5, 1), 0.96], [Date.UTC(2024, 6, 3), 0.83], [Date.UTC(2024, 7, 1), 1.11], [Date.UTC(2024, 8, 1), 1.20], [Date.UTC(2024, 9, 2), 1.09"
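How the lookbehind (?<=\\[\\[) and the lookahead (?=\\]\\]) behave is easier to see on a short string. This sketch uses a made-up input `js` with two data points. The assertions consume no characters, so the double brackets themselves are left out of the match.

```r
library(stringr)

# Made-up, shortened version of the chart script
js <- "series: [{name : 'R', data : [[Date.UTC(2007, 4, 5), 0.06], [Date.UTC(2007, 5, 2), 0.07]]}]"

# Match everything after "[[" and before the last "]]"
inner <- str_extract(js, "(?<=\\[\\[).*(?=\\]\\])")
inner
```

The single brackets between data points stay in the result, which is why the next step splits on them.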

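2.2 Cleaning the data

dats is fairly regular, but it is still just one long string, so we need to clean it.

  1. Think in data frames: first put the data into a data frame, then split the data points apart. Look for the pattern: the separator between data points is ], [ so we again use a regular expression.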
df <- tibble(x = dats) %>%
  separate_rows(x, sep = "\\], \\[")
df
# A tibble: 208 × 1
   x                          
   <chr>                      
 1 Date.UTC(2007, 4, 5), 0.06 
 2 Date.UTC(2007, 5, 2), 0.07 
 3 Date.UTC(2007, 6, 2), 0.07 
 4 Date.UTC(2007, 7, 5), 0.06 
 5 Date.UTC(2007, 8, 2), 0.07 
 6 Date.UTC(2007, 9, 4), 0.07 
 7 Date.UTC(2007, 10, 4), 0.09
 8 Date.UTC(2007, 11, 3), 0.09
 9 Date.UTC(2008, 0, 3), 0.04 
10 Date.UTC(2008, 1, 7), 0.06 
# ℹ 198 more rows
  2. Split the result further, converting the character data to numeric at the same time.
df <- df %>%
  separate(x, c("Date", "Index"), sep = "\\),", convert = TRUE)
df
# A tibble: 208 × 2
   Date                 Index
   <chr>                <dbl>
 1 Date.UTC(2007, 4, 5   0.06
 2 Date.UTC(2007, 5, 2   0.07
 3 Date.UTC(2007, 6, 2   0.07
 4 Date.UTC(2007, 7, 5   0.06
 5 Date.UTC(2007, 8, 2   0.07
 6 Date.UTC(2007, 9, 4   0.07
 7 Date.UTC(2007, 10, 4  0.09
 8 Date.UTC(2007, 11, 3  0.09
 9 Date.UTC(2008, 0, 3   0.04
10 Date.UTC(2008, 1, 7   0.06
# ℹ 198 more rows
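  3. Convert the Date column to dates. Note that row 8 has month 11 while row 9 has month 0: Date.UTC in JavaScript counts months from 0 (January) to 11 (December), so every month needs +1. We could parse the date first, but it is simpler to extract the numbers, split them, add 1 to the month column, and then convert to a date column.
df <- df %>%
  mutate(Date = str_replace(Date, "Date\\.UTC\\(", "")) %>%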
  separate(Date, c("year", "month", "day"), convert = TRUE) %>%
  mutate(Date = make_date(year, month + 1, day)) %>%
  select(Date, Index) 
df
# A tibble: 208 × 2
   Date       Index
   <date>     <dbl>
 1 2007-05-05  0.06
 2 2007-06-02  0.07
 3 2007-07-02  0.07
 4 2007-08-05  0.06
 5 2007-09-02  0.07
 6 2007-10-04  0.07
 7 2007-11-04  0.09
 8 2007-12-03  0.09
 9 2008-01-03  0.04
10 2008-02-07  0.06
# ℹ 198 more rows
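The three cleaning steps can also be condensed into one pass with capture groups. The following is an alternative sketch, not the approach used above: str_match_all() pulls year, month, day, and value out of each Date.UTC(...) entry, and the month is shifted by 1 because Date.UTC counts months from 0. The input `raw` is a shortened copy of dats, and the names `m` and `tiobe` are introduced here for illustration.

```r
library(stringr)
library(tibble)
library(lubridate)

# Shortened copy of dats: December 2007 and January 2008
raw <- "Date.UTC(2007, 11, 3), 0.09], [Date.UTC(2008, 0, 3), 0.04"

# One row per match; columns 2 to 5 hold the four capture groups
m <- str_match_all(
  raw,
  "Date\\.UTC\\((\\d+), (\\d+), (\\d+)\\), ([0-9.]+)"
)[[1]]

tiobe <- tibble(
  Date = make_date(
    year = as.integer(m[, 2]),
    month = as.integer(m[, 3]) + 1L, # zero-based month -> calendar month
    day = as.integer(m[, 4])
  ),
  Index = as.numeric(m[, 5])
)
tiobe
```

The step-by-step version above is easier to debug because each intermediate data frame can be inspected. The regex version is shorter but fails silently on entries that do not match the pattern.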

Done!

2.3 Visualization

Use ggplot2 to draw a line chart of how R's index changes over time.

df %>%
  ggplot(aes(Date, Index)) +
  # size = 1.5 passed to ggplot() had no effect; linewidth needs ggplot2 >= 3.4.0
  geom_line(color = "steelblue", linewidth = 1.5) +
  scale_x_date(date_breaks = "1 years", date_labels = "%Y") +
  labs(
    x = "Date", y = "TIOBE index",
    title = "TIOBE index for R",
    caption = "Source: www.tiobe.com"
  ) +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))
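As an optional refinement, the highest rating can be marked on the chart. This is a minimal sketch on four data points derived from the scraped values for mid-2020 (months already shifted by 1). The names `df_demo`, `peak`, and `p` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# Four points taken from the scraped series, June to September 2020
df_demo <- tibble(
  Date = as.Date(c("2020-06-01", "2020-07-04", "2020-08-02", "2020-09-06")),
  Index = c(2.19, 2.41, 2.79, 2.37)
)

peak <- slice_max(df_demo, Index, n = 1) # row with the highest rating

p <- ggplot(df_demo, aes(Date, Index)) +
  geom_line(color = "steelblue") +
  geom_point(data = peak, color = "firebrick", size = 3) + # mark the peak
  geom_text(data = peak, aes(label = Index), vjust = -1)   # label it with its value
p
```

Passing a separate data argument to a layer keeps the main series untouched while drawing the extra point only for the selected row.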

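3 Beijing historical weather data: batch extraction from multiple pages

Index page of Beijing historical weather on tianqi.com: http://lishi.tianqi.com/beijing/index.html

Opening the weather for March 2022: http://lishi.tianqi.com/beijing/202203.html

Although it looks like a table, the HTML is not a table element (it would be trivial if it were), so we have to parse it with regular expressions.

3.1 Getting the page URLs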

postfix <- read_html("http://lishi.tianqi.com/beijing/index.html") %>%
  html_elements("a") %>%
  html_attr("href") %>% # extract the link targets
  str_subset("^/beijing") # keep only the Beijing pages

urls <- str_c("http://lishi.tianqi.com", postfix)
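The filtering and pasting steps can be checked offline on a few hand-written href values. In this sketch `hrefs` is invented for illustration; only the ^/beijing pattern and the base URL come from the code above. The NA stands for an a tag without an href attribute, which html_attr() reports as missing.

```r
library(stringr)

# Hypothetical href values, including entries that should be discarded
hrefs <- c(
  "/beijing/202203.html",
  "/shanghai/202203.html",
  "https://www.tianqi.com/",
  "/beijing/202202.html",
  NA
)

postfix <- str_subset(hrefs, "^/beijing") # NA and non-matching entries are dropped
urls <- str_c("http://lishi.tianqi.com", postfix)
urls
```

Anchoring the pattern with ^ matters here: without it, an absolute URL that merely contains /beijing somewhere would also pass the filter and produce a malformed address after pasting.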

3.2 Scraping and tidying a single page

3.2.1 Parsing historical weather data from the page

weather <- read_html(urls[1]) %>%
  html_nodes(".thrui") %>%
  html_text2()
weather
[1] "\r\n2024-10-01 星期二\n\r\n17℃\n\r\n7℃\n\r\n多云~晴\n\r\n西北风 4级\n\r\r\n\r\n\r\n2024-10-02 星期三\n\r\n20℃\n\r\n7℃\n\r\n晴\n\r\n西北风 2级\n\r\r\n\r\n\r\n2024-10-03 星期四\n\r\n22℃\n\r\n8℃\n\r\n晴\n\r\n南风 2级\n\r\r\n\r\n\r\n2024-10-04 星期五\n\r\n22℃\n\r\n9℃\n\r\n多云~晴\n\r\n西南风 2级\n\r\r\n\r\n\r\n2024-10-05 星期六\n\r\n23℃\n\r\n13℃\n\r\n多云~阴\n\r\n南风 1级\n\r\r\n\r\n\r\n2024-10-06 星期日\n\r\n18℃\n\r\n10℃\n\r\n多云~小雨\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-07 星期一\n\r\n22℃\n\r\n8℃\n\r\n晴\n\r\n北风 2级\n\r\r\n\r\n\r\n2024-10-08 星期二\n\r\n23℃\n\r\n10℃\n\r\n晴~多云\n\r\n西南风 1级\n\r\r\n\r\n\r\n2024-10-09 星期三\n\r\n19℃\n\r\n11℃\n\r\n多云\n\r\n西南风 1级\n\r\r\n\r\n\r\n2024-10-10 星期四\n\r\n22℃\n\r\n8℃\n\r\n晴\n\r\n西北风 1级\n\r\r\n\r\n\r\n2024-10-11 星期五\n\r\n23℃\n\r\n10℃\n\r\n晴\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-12 星期六\n\r\n24℃\n\r\n13℃\n\r\n多云\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-13 星期日\n\r\n18℃\n\r\n13℃\n\r\n多云~雾\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-14 星期一\n\r\n20℃\n\r\n12℃\n\r\n多云\n\r\n西南风 2级\n\r\r\n\r\n\r\n2024-10-15 星期二\n\r\n21℃\n\r\n10℃\n\r\n阴~多云\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-16 星期三\n\r\n18℃\n\r\n13℃\n\r\n多云~阴\n\r\n南风 2级\n\r\r\n\r\n\r\n2024-10-17 星期四\n\r\n17℃\n\r\n14℃\n\r\n多云~小雨\n\r\n东北风 1级\n\r\r\n\r\n\r\n2024-10-18 星期五\n\r\n18℃\n\r\n5℃\n\r\n小雨~晴\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-19 星期六\n\r\n13℃\n\r\n2℃\n\r\n晴\n\r\n东北风 2级\n\r\r\n\r\n\r\n2024-10-20 星期日\n\r\n12℃\n\r\n1℃\n\r\n阴~多云\n\r\n南风 3级\n\r\r\n\r\n\r\n2024-10-21 星期一\n\r\n13℃\n\r\n6℃\n\r\n阴~小雨\n\r\n南风 1级\n\r\r\n\r\n\r\n2024-10-22 星期二\n\r\n17℃\n\r\n3℃\n\r\n阴~晴\n\r\n西北风 3级\n\r\r\n\r\n\r\n2024-10-23 星期三\n\r\n19℃\n\r\n5℃\n\r\n晴\n\r\n西南风 1级\n\r\r\n\r\n\r\n2024-10-24 星期四\n\r\n21℃\n\r\n7℃\n\r\n多云~晴\n\r\n东风 1级\n\r\r\n\r\n\r\n2024-10-25 星期五\n\r\n19℃\n\r\n10℃\n\r\n多云~雾\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-26 星期六\n\r\n16℃\n\r\n10℃\n\r\n多云~雾\n\r\n东北风 1级\n\r\r\n\r\n\r\n2024-10-27 星期日\n\r\n18℃\n\r\n4℃\n\r\n多云~晴\n\r\n北风 2级\n\r\r\n\r\n\r\n2024-10-28 星期一\n\r\n19℃\n\r\n7℃\n\r\n多云\n\r\n南风 1级\n\r\r\n\r\n\r\n2024-10-29 星期二\n\r\n19℃\n\r\n7℃\n\r\n多云~晴\n\r\n西北风 1级\n\r\r\n\r\n\r\n2024-10-30 
星期三\n\r\n18℃\n\r\n10℃\n\r\n多云~雾\n\r\n东南风 1级\n\r\r\n\r\n\r\n2024-10-31 星期四\n\r\n17℃\n\r\n12℃\n\r\n轻度雾霾\n\r\n北风 1级\n\r\r\n\r\n查看更多\n\r"

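We got what we wanted, although it is not yet the final result and needs further tidying.

3.2.2 Cleaning the data

Clean the data and convert it into a data frame.

  • Split the data into rows, one row per date.
  • Then split each row into columns by pattern: 日期 (date), 最高气温 (high temperature), 最低气温 (low temperature), 天气 (weather), 风向 (wind).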
tibble(weather) %>%
  separate_rows(weather, sep = "\n\r\n(?=\\d{4})") %>%
  mutate(weather = str_replace_all(weather, "\r\n|查看更多", "")) %>%
  separate(weather,
    sep = "\n\r*", extra = "drop",
    into = c("日期", "最高气温", "最低气温", "天气", "风向")
  )
# A tibble: 31 × 5
   日期              最高气温 最低气温 天气      风向      
   <chr>             <chr>    <chr>    <chr>     <chr>     
 1 2024-10-01 星期二 17℃      7℃       多云~晴   西北风 4级
 2 2024-10-02 星期三 20℃      7℃       晴        西北风 2级
 3 2024-10-03 星期四 22℃      8℃       晴        南风 2级  
 4 2024-10-04 星期五 22℃      9℃       多云~晴   西南风 2级
 5 2024-10-05 星期六 23℃      13℃      多云~阴   南风 1级  
 6 2024-10-06 星期日 18℃      10℃      多云~小雨 东北风 2级
 7 2024-10-07 星期一 22℃      8℃       晴        北风 2级  
 8 2024-10-08 星期二 23℃      10℃      晴~多云   西南风 1级
 9 2024-10-09 星期三 19℃      11℃      多云      西南风 1级
10 2024-10-10 星期四 22℃      8℃       晴        西北风 1级
# ℹ 21 more rows
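  • First split into rows on \n\r\n; the zero-width lookahead (?=\\d{4}) keeps \n\r\n查看更多 ("view more") from starting a row of its own.
  • Replace the superfluous \r\n and 查看更多 with an empty string.
  • Use \n followed by zero or more \r as the separator to split the single column into several columns.
  • The argument extra = "drop" suppresses a harmless warning message.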
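The splitting rules above can be traced on a shortened input, followed by a type conversion the original pipeline leaves out. This is a sketch under assumptions: `weather` holds only two days copied from the scraped text, the column names are replaced by English ones (`date`, `high`, `low`, `condition`, `wind`) for illustration, and the conversion with parse_number() is an addition, not part of the code above.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(readr)

# Shortened copy of the scraped text: two days plus the trailing "查看更多"
weather <- "\r\n2024-10-01 星期二\n\r\n17℃\n\r\n7℃\n\r\n多云~晴\n\r\n西北风 4级\n\r\r\n\r\n\r\n2024-10-02 星期三\n\r\n20℃\n\r\n7℃\n\r\n晴\n\r\n西北风 2级\n\r\r\n\r\n查看更多\n\r"

days <- tibble(weather) %>%
  separate_rows(weather, sep = "\n\r\n(?=\\d{4})") %>%                # one row per date
  mutate(weather = str_replace_all(weather, "\r\n|查看更多", "")) %>% # strip the noise
  separate(weather,
    sep = "\n\r*", extra = "drop",
    into = c("date", "high", "low", "condition", "wind") # English names: an assumption
  ) %>%
  mutate(
    date = as.Date(str_sub(date, 1, 10)), # keep only the yyyy-mm-dd part
    high = parse_number(high),            # "17℃" -> 17
    low = parse_number(low)
  )
days
```

With numeric temperatures and a real Date column, the result can be sorted, summarised, or plotted directly.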

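3.3 Scraping multiple URLs in bulk and tidying the data

Wrap the scraping and tidying steps for a single URL into a function.

crawurl <- function(url) {
  Sys.sleep(sample(5, 1)) # wait a random 1 to 5 seconds between requests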
  weather <- read_html(url) %>%
    html_nodes(".thrui") %>%
    html_text2()
  tibble(weather) %>%
    separate_rows(weather, sep = "\n\r\n(?=\\d{4})") %>%
    mutate(weather = str_replace_all(weather, "\r\n|查看更多", "")) %>%
    separate(weather,
      sep = "\n\r*", extra = "drop",
      into = c("日期", "最高气温", "最低气温", "天气", "风向")
    )
}

Use the map() family of functions to iterate over the URLs:

# beijing <- map_dfr(urls, crawurl)
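When many pages are fetched in one go, a single failed request aborts the whole map_dfr() call. One way to guard against that, sketched here under assumptions, is purrr::possibly(). The function `fake_crawl()` and the vector `demo_urls` are hypothetical placeholders that make no network requests, and list_rbind() needs purrr 1.0.0 or later.

```r
library(purrr)
library(tibble)

# Hypothetical stand-in for crawurl(): fails for one URL, succeeds for the rest
fake_crawl <- function(url) {
  if (grepl("bad", url)) stop("request failed: ", url)
  tibble(url = url, n_days = 31L)
}

demo_urls <- c(
  "http://example.com/a.html",
  "http://example.com/bad.html",
  "http://example.com/b.html"
)

safe_crawl <- possibly(fake_crawl, otherwise = NULL) # return NULL instead of raising an error

result <- demo_urls %>%
  map(safe_crawl) %>%
  list_rbind() # NULL entries are skipped when the rows are bound
result
```

In the real workflow, crawurl would take the place of fake_crawl, and comparing the rows in the result with the input URLs shows which pages need a retry.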