library(tidyverse)
library(stringr)
library(babynames)
double_quote <- "\""
single_quote <- "'"
backslash <- "\\"
x <- c(double_quote, single_quote, backslash)
x # printing the vector shows the escape characters
[1] "\"" "'" "\\"
str_view(x) # shows the actual contents, with the escapes interpreted
[1] │ "
[2] │ '
[3] │ \
Lee
November 8, 2023
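In this chapter we mainly use the stringr package to work with strings efficiently.
This part focuses on escape characters.
Sometimes a string has to contain a quote character. In that case we escape it with a backslash (\), which tells R that the quote is part of the string and not the delimiter that ends it.
A literal backslash inside a string must likewise be escaped with another backslash, written as \\.
Building a string that contains several quotes or backslashes quickly becomes confusing.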
library(stringr)
tricky <- "double_quote <- \"\\\"\" # or '\"'
single_quote <- '\\'' # or \"'\""
str_view(tricky)
[1] │ double_quote <- "\"" # or '"'
│ single_quote <- '\'' # or "'"
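The code above contains a large number of backslashes. To get rid of the excess escapes (mostly escapes of escapes), use a raw string.
A raw string usually starts with r"( and ends with )". If the string itself contains )", you can use r"[]" or r"{}" instead. If that is still not enough, insert any number of dashes so that the opening and closing delimiters become unique, for example r"--()--" or r"---()---".
So how can the tricky string above be rewritten?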
tricky_amendment <- r"(double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'")"
str_view(tricky_amendment)
[1] │ double_quote <- "\"" # or '"'
│ single_quote <- '\'' # or "'"
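The delimiter rules for raw strings are easier to see on short inputs. The following is a minimal sketch, assuming R 4.0.0 or later (where raw strings were introduced); the variable names and sample strings are made up for illustration.

```r
# Raw strings need R 4.0.0 or later.
escaped <- "C:\\Users\\lee"      # classic form: every backslash is doubled
raw     <- r"(C:\Users\lee)"     # raw form: typed exactly as it should appear

# Both spellings create the same string.
same_path <- identical(escaped, raw)

# Alternative delimiters: square brackets, or dashes when the text contains )"
bracketed <- r"[say "hi" (quietly)]"
dashed    <- r"---(ends with )" inside)---"

# writeLines() prints the contents without quoting or escaping.
writeLines(c(raw, bracketed, dashed))
```

The dashes only have to make the closing delimiter unambiguous, so one dash is often enough.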
The most common special characters are:
- \n: newline
- \t: tab
- \u and \U: Unicode escapes, used for non-English characters
More special characters are listed in ?Quotes.
[1] │ one
│ tow
[2] │ one{\t}two
[3] │ µ
[4] │ 😄
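Output of this kind comes from a vector of strings written with escape sequences. The sketch below is an assumed example and not necessarily the exact input used for the output above; it shows that each escape sequence is stored as a single character.

```r
library(stringr)

# One string per escape type: newline, tab, and two Unicode escapes.
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")

# Each escape sequence counts as one character.
lengths_in_chars <- str_length(x)
lengths_in_chars

# writeLines() prints the contents with the escapes interpreted.
writeLines(x)
```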
Create strings that contain the following values:
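This section covers:
- how to use the three functions str_c(), str_glue(), and str_flatten();
- how to combine these functions with mutate(), summarise(), and similar functions.
Creating strings by hand is only the starting point. More often we want to join some fixed text to text that is already stored in the data, for example joining "hello" to a column of names to produce a series of greetings.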
The str_c() function
Using str_c() together with mutate():
df <- tibble(name = c("Flora", "David", "Terra", NA))
df %>%
  mutate(greeting1 = str_c("Hi ", name, "!")) %>%
  # coalesce() replaces NA with a given value; it can wrap the input or the result
  mutate(greeting2 = str_c("Hi ", coalesce(name, "you"), "!")) %>%
  mutate(greeting3 = coalesce(str_c("Hi ", name, "!"), "Hi!"))
# A tibble: 4 × 4
name greeting1 greeting2 greeting3
<chr> <chr> <chr> <chr>
1 Flora Hi Flora! Hi Flora! Hi Flora!
2 David Hi David! Hi David! Hi David!
3 Terra Hi Terra! Hi Terra! Hi Terra!
4 <NA> <NA> Hi you! Hi!
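The NA in greeting1 follows from how str_c() treats its inputs. The sketch below, using made-up vectors, shows the two rules at work: shorter inputs of length one are recycled, and a missing value in any input makes the corresponding result missing.

```r
library(stringr)

# A length-one input is recycled against the longer vector.
recycled <- str_c("Hello ", c("John", "Susan"))
recycled

# A missing value propagates to the result for that element.
with_na <- str_c("Hi ", c("Flora", NA), "!")
with_na
```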
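The str_glue() function
Combining strings with str_c() has one drawback: it requires typing a lot of quotes ("), and once the string gets long it is hard to see at a glance what the code is meant to produce. str_glue() offers another way to combine strings.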
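As a minimal sketch of that alternative, the example below reuses a names vector like the one above (shortened here for illustration). Anything inside {} is evaluated as R code, and doubling the braces produces a literal brace. Note that, unlike str_c(), str_glue() turns a missing value into the text "NA".

```r
library(stringr)

name <- c("Flora", "David", NA)

# Anything inside {} is evaluated and pasted into the string.
greeting <- str_glue("Hi {name}!")
greeting

# Doubled braces give a literal { or } in the result.
braced <- str_glue("{{Hi {name}!}}")
braced
```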
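The str_flatten() function
str_c() and str_glue() work well with mutate() because their output has the same length as their input. What if you want a function that works well with summarize(), that is, one that always returns a single string? That is the job of str_flatten(): it takes a character vector and combines all of its elements into one string: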
str_flatten(c("x", "y", "z")) # returns a single string
[1] "xyz"
# used together with summarise()
df <- tribble(
~name, ~fruit,
"Carmen", "banana",
"Carmen", "apple",
"Marvin", "nectarine",
"Terence", "cantaloupe",
"Terence", "papaya",
"Terence", "mandarin"
)
df %>%
group_by(name) %>%
summarise(
str_flatten(fruit, ", ")
)
# A tibble: 3 × 2
name `str_flatten(fruit, ", ")`
<chr> <chr>
1 Carmen banana, apple
2 Marvin nectarine
3 Terence cantaloupe, papaya, mandarin
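The summary column above inherits the whole expression as its name. A small variation, sketched below on the same data, names the column explicitly and uses the last argument of str_flatten() to choose a different separator before the final element. This assumes stringr 1.5.0 or later; the names fruits and fruit_list are illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)

fruits <- tribble(
  ~name,     ~fruit,
  "Carmen",  "banana",
  "Carmen",  "apple",
  "Marvin",  "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya",
  "Terence", "mandarin"
)

fruit_list <- fruits %>%
  group_by(name) %>%
  # one string per person; `last` is used between the final two elements
  summarise(fruits = str_flatten(fruit, collapse = ", ", last = " and "))

fruit_list
```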
Convert the following expressions from str_c() to str_glue(), or vice versa:
str_c("The price of ", food, " is ", price)
str_glue("I'm {age} years old and live in {country}")
str_c("\\section{", title, "}")
food <- c("apple", "rice")
price <- c(10, 20)
age <- c(35, 45)
country <- c("China", "US")
title <- c("doctor", "teacher")
str_glue("The price of {food} is {price}")
The price of apple is 10
The price of rice is 20
str_c("I'm ", age, " years old and live in ", country)
[1] "I'm 35 years old and live in China" "I'm 45 years old and live in US"
str_glue("\\section{{{title}}}")
\section{doctor}
\section{teacher}
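When working on a string problem, it helps to break the string into its parts first. For example, the string built by str_c("\\section{", title, "}") has three parts:
- the literal text \section{
- the value of title
- the closing brace }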
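This section covers how to extract several variables from a single string. Four functions with similar signatures do the work:
- separate_longer_delim(col, delim)
- separate_longer_position(col, width)
- separate_wider_delim(col, delim, names)
- separate_wider_position(col, widths)
Their names follow a pattern:
- As with pivot_longer() and pivot_wider(), the longer and wider variants reshape the result: longer adds rows, wider adds columns.
- delim splits a string at a delimiter, while position splits it at fixed widths.
- There is also separate_wider_regex(), which splits a string according to a regular expression; using it requires some familiarity with regular expressions.
df1 <- tibble(x = c("a,b,c", "d,e", "f"))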
df1 %>%
separate_longer_delim(x, delim = ",")
# A tibble: 6 × 1
x
<chr>
1 a
2 b
3 c
4 d
5 e
6 f
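The position-based counterpart, separate_longer_position(), is listed above but not demonstrated. A minimal sketch, assuming tidyr 1.3.0 or later and a made-up data frame df2, splits every string into one-character rows:

```r
library(tidyr)
library(tibble)

df2 <- tibble(x = c("1211", "131", "21"))

# width = 1 cuts each string into single characters, one per row.
chars <- separate_longer_position(df2, x, width = 1)
chars
```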
df3 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))
df3 %>%
separate_wider_delim(x, delim = ".", names = c("code", "edition", "year"))
# A tibble: 3 × 3
code edition year
<chr> <chr> <chr>
1 a10 1 2022
2 b10 2 2011
3 e15 1 2015
To drop one of the pieces during the split, set the corresponding entry of the names argument to NA:
df3 %>%
separate_wider_delim(
x,
delim = ".",
names = c("code", NA, "year")
)
# A tibble: 3 × 2
code year
<chr> <chr>
1 a10 2022
2 b10 2011
3 e15 2015
df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))
df4 %>%
separate_wider_position(
x,
widths = c(year = 4, age = 2, state = 2)
)
# A tibble: 3 × 3
year age state
<chr> <chr> <chr>
1 2022 15 TX
2 2021 22 LA
3 2023 25 CA
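separate_wider_position() can also drop a piece, in its case by leaving the corresponding width unnamed. The sketch below reuses the same data; the result name trimmed is illustrative.

```r
library(tidyr)
library(tibble)

df4 <- tibble(x = c("202215TX", "202122LA", "202325CA"))

# The middle width has no name, so its two characters are matched but dropped.
trimmed <- separate_wider_position(
  df4,
  x,
  widths = c(year = 4, 2, state = 2)
)
trimmed
```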
Let's start with an example:
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
df %>%
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z")
)
Error in `separate_wider_delim()`:
! Expected 3 pieces in each element of `x`.
! 2 values were too short.
ℹ Use `too_few = "debug"` to diagnose the problem.
ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
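It fails! The reason is that separate_wider_delim() expects every string to split into the same number of pieces, and in the code above "1-3" and "1" clearly yield fewer pieces than the others. In real data a string may have too many or too few pieces. For strings with too few pieces, add the too_few argument to separate_wider_delim().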
df %>%
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_few = "debug" # debug mode: return diagnostic columns instead of failing
)
# A tibble: 5 × 6
x y z x_ok x_pieces x_remainder
<chr> <chr> <chr> <lgl> <int> <chr>
1 1-1-1 1 1 TRUE 3 ""
2 1-1-2 1 2 TRUE 3 ""
3 1-3 3 <NA> FALSE 2 ""
4 1-3-2 3 2 TRUE 3 ""
5 1 <NA> <NA> FALSE 1 ""
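In debug mode the output gains three extra columns: x_ok, x_pieces, and x_remainder (separating a variable with a different name gives a different prefix):
- x_ok lets you quickly find the inputs that failed.
- x_pieces records how many pieces each string was split into.
- x_remainder holds the leftover pieces, which is mostly useful when there are too many pieces.
Once the problem is understood, there are two ways to handle it. The one shown here sets the too_few argument so that the missing pieces are filled with NA.
df %>%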
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_few = "align_start" # align_start and align_end control where the NA padding goes
)
# A tibble: 5 × 3
x y z
<chr> <chr> <chr>
1 1 1 1
2 1 1 2
3 1 3 <NA>
4 1 3 2
5 1 <NA> <NA>
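For comparison, the sketch below runs the same kind of split with too_few = "align_end", which pads with NA at the start so that the pieces line up on the right. The data frame short and the result name padded are illustrative.

```r
library(tidyr)
library(tibble)

short <- tibble(x = c("1-1-1", "1-3", "1"))

# align_end keeps the last pieces in the last columns and pads the front with NA.
padded <- separate_wider_delim(
  short,
  x,
  delim = "-",
  names = c("x", "y", "z"),
  too_few = "align_end"
)
padded
```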
df <- tibble(x = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
df %>%
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "debug"
)
# A tibble: 5 × 6
x y z x_ok x_pieces x_remainder
<chr> <chr> <chr> <lgl> <int> <chr>
1 1-1-1 1 1 TRUE 3 ""
2 1-1-2 1 2 TRUE 3 ""
3 1-3-5-6 3 5 FALSE 4 "-6"
4 1-3-2 3 2 TRUE 3 ""
5 1-3-5-7-9 3 5 FALSE 5 "-7-9"
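For too many pieces there are again two options: drop the extra pieces with too_many = "drop", or merge them into the last column with too_many = "merge".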
df %>%
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "drop"
)
# A tibble: 5 × 3
x y z
<chr> <chr> <chr>
1 1 1 1
2 1 1 2
3 1 3 5
4 1 3 2
5 1 3 5
df %>%
separate_wider_delim(
x,
delim = "-",
names = c("x", "y", "z"),
too_many = "merge"
)
# A tibble: 5 × 3
x y z
<chr> <chr> <chr>
1 1 1 1
2 1 1 2
3 1 3 5-6
4 1 3 2
5 1 3 5-7-9
This section covers how to find the length of a string, how to extract substrings, and how to handle strings in plots and tables.
str_length(c("a", "R for data science", NA))
[1] 1 18 NA
str_length() can be used together with the dplyr functions.
babynames %>%
count(length = str_length(name), wt = n)
# A tibble: 14 × 2
length n
<int> <int>
1 2 338150
2 3 8589596
3 4 48506739
4 5 87011607
5 6 90749404
6 7 72120767
7 8 25404066
8 9 11926551
9 10 1306159
10 11 2135827
11 12 16295
12 13 10845
13 14 3681
14 15 830
babynames %>%
filter(str_length(name) == 15) %>%
count(name, wt = n, sort = TRUE)
# A tibble: 34 × 2
name n
<chr> <int>
1 Franciscojavier 123
2 Christopherjohn 118
3 Johnchristopher 118
4 Christopherjame 108
5 Christophermich 52
6 Ryanchristopher 45
7 Mariadelosangel 28
8 Jonathanmichael 25
9 Christianjoseph 22
10 Christopherjose 22
# ℹ 24 more rows
str_sub(string, start, end)
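A short sketch of the start and end arguments on a made-up vector: positive positions count from the left, negative positions count from the right, and both ends are inclusive. A range that runs past the end of the string does not fail; it returns as much as is available.

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")

first_three <- str_sub(x, 1, 3)    # positions 1 to 3, counted from the left
last_three  <- str_sub(x, -3, -1)  # the last three characters
too_long    <- str_sub("a", 1, 5)  # the range exceeds the string length

first_three
last_three
too_long
```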
str_sub() can likewise be used together with the dplyr functions.
babynames %>%
mutate(
first = str_sub(name, 1, 1), # first letter of each name
last = str_sub(name, -1, -1) # last letter of each name
)
# A tibble: 1,924,665 × 7
year sex name n prop first last
<dbl> <chr> <chr> <int> <dbl> <chr> <chr>
1 1880 F Mary 7065 0.0724 M y
2 1880 F Anna 2604 0.0267 A a
3 1880 F Emma 2003 0.0205 E a
4 1880 F Elizabeth 1939 0.0199 E h
5 1880 F Minnie 1746 0.0179 M e
6 1880 F Margaret 1578 0.0162 M t
7 1880 F Ida 1472 0.0151 I a
8 1880 F Alice 1414 0.0145 A e
9 1880 F Bertha 1320 0.0135 B a
10 1880 F Sarah 1288 0.0132 S h
# ℹ 1,924,655 more rows
Extract the middle letter (or the two middle letters, for names of even length) of each baby name.
babynames %>%
  mutate(
    middle_letter = if_else(
      str_length(name) %% 2 == 1,
      # odd length: the single middle letter
      str_sub(name, (str_length(name) + 1) %/% 2, (str_length(name) + 1) %/% 2),
      # even length: the two letters around the midpoint
      str_sub(name, str_length(name) %/% 2, str_length(name) %/% 2 + 1)
    )
  )
# A tibble: 1,924,665 × 6
year sex name n prop middle_letter
<dbl> <chr> <chr> <int> <dbl> <chr>
1 1880 F Mary 7065 0.0724 ar
2 1880 F Anna 2604 0.0267 nn
3 1880 F Emma 2003 0.0205 mm
4 1880 F Elizabeth 1939 0.0199 a
5 1880 F Minnie 1746 0.0179 nn
6 1880 F Margaret 1578 0.0162 ga
7 1880 F Ida 1472 0.0151 d
8 1880 F Alice 1414 0.0145 i
9 1880 F Bertha 1320 0.0135 rt
10 1880 F Sarah 1288 0.0132 r
# ℹ 1,924,655 more rows
#TODO:
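Regular expressions are a concise and powerful language for describing patterns in strings. This section introduces the basic ideas through a few simple examples.
str_view() (introduced in Section 1.2) shows the elements of a character vector that match a pattern, wrapping each match in <> and highlighting it in blue.
Letters and digits match themselves, whereas punctuation characters such as ., +, *, [, ], and ? have special meanings and are called metacharacters. Three of them are quantifiers that control how many times the preceding element may match:
- ? matches 0 or 1 times.
- + matches at least once.
- * matches any number of times, including 0.
# match literal characters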
str_view(fruit, "berry")
[6] │ bil<berry>
[7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>
[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>
# metacharacter matching: find every fruit that contains an "a", then any three characters, then an "e"
str_view(fruit, "a...e")
[1] │ <apple>
[7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
[2] │ <ab>
[3] │ <abb>
[1] │ <a>
[2] │ <ab>
[3] │ <abb>
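The three quantifiers can be compared side by side on one small vector. The sketch below is an assumed example that uses str_extract() in place of str_view(), so that the matched text comes back as an ordinary character vector; the variable names are illustrative.

```r
library(stringr)

x <- c("a", "ab", "abb")

optional_b  <- str_extract(x, "ab?")  # "a" followed by zero or one "b"
one_or_more <- str_extract(x, "ab+")  # "a" followed by at least one "b"
any_number  <- str_extract(x, "ab*")  # "a" followed by any number of "b"

optional_b
one_or_more
any_number
```

Where the pattern does not match at all, as with "a" and "ab+", str_extract() returns NA and str_view() simply omits the element.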