Tidyverse包中的across()函数

dplyr
across
Author

Lee

Published

October 13, 2022

1 一个简单的例子

tidyverse-dplyr中的across()函数是R语言以及tidy思想的完美体现。across()函数 可以与dplyr中的筛选与汇总计算函数相结合使用,实现强大到匪夷所思的功能。

我们先看一个例子

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <fct>, year <int>

如果我们希望统计数据中每一行的缺失值有多少,那么传统方法要怎么写呢?(当然可以使用naniar来简单的计算,但我们更希望在这里用一些tidy传统的方法说明一些问题)

penguins %>%
  summarise(
    na_in_species = sum(is.na(species)),
    na_in_island  = sum(is.na(island)),
    na_in_length  = sum(is.na(bill_length_mm)),
    na_in_depth   = sum(is.na(bill_depth_mm)),
    na_in_flipper = sum(is.na(flipper_length_mm)),
    na_in_body    = sum(is.na(body_mass_g)),
    na_in_sex     = sum(is.na(sex)),
    na_in_year    = sum(is.na(year))
  )
# A tibble: 1 × 8
  na_in_species na_in_island na_in_length na_in_depth na_in_flipper na_in_body
          <int>        <int>        <int>       <int>         <int>      <int>
1             0            0            2           2             2          2
# ℹ 2 more variables: na_in_sex <int>, na_in_year <int>

虽然也计算出来了,但好像有点复杂,如果数据框有很多列,那工作量就太大了。有没有简单的方法?试试summarise_all()函数。

penguins %>%
  summarise_all(
    ~ sum(is.na(.x))
  )
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       0      0              2             2                 2           2
# ℹ 2 more variables: sex <int>, year <int>

没问题。我希望进一步按照企鹅不同的种类,统计各体征数据的均值,用summarise_if()

df1 <- penguins %>%
  group_by(species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE) %>%
  select(-year)
df1
# A tibble: 3 × 5
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie              38.8          18.3              190.       3701.
2 Chinstrap           48.8          18.4              196.       3733.
3 Gentoo              47.5          15.0              217.       5076.

除了结果多了一个year列需要处理到外,一切OK。我还希望在结果中加入每类企鹅的数量。因为summarise_if()的缘故,这个就无法直接计算了。需要从新计算后将两个结果合并。

# 计算各类企鹅数量,很简单
df2 <- penguins %>%
  count(species) # 直接用count,连group_by()都给它省略。
df2
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
df2 %>%
  left_join(df1, by = "species")
# A tibble: 3 × 6
  species       n bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <int>          <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie      152           38.8          18.3              190.       3701.
2 Chinstrap    68           48.8          18.4              196.       3733.
3 Gentoo      124           47.5          15.0              217.       5076.

我们将整个步骤回忆下:

  1. summarise_all()探索缺失值。
  2. summarise_if()summarise()分组汇总。
  3. left_join()合并结果数据。

没问题,思路清晰,代码简洁。但……还有没有更简单的方法?

2 across()函数的基本形式

across()有三个主要的参数,across(.cols, .fns, .nams)

  • .cols选取列,和select()的用法非常一致。

  • .fns,指定对所选列需要执行的函数。可以使默认函数,purrr风格匿名函数,以及一个函数列表(这个列表中也可以同时存在默认函数和匿名函数)。

  • .names,指定所计算新列的列名,主要针对.fns为一个函数列表时使用。如果.fns是多个函数,就在数据列的列名后面跟上函数名,比如"{.col}_{.fn}";当然,我们也可以简单调整列名和函数之间的顺序或者增加一个标识的字符串,比如弄成"{.fn}_{.col}""{.col}_{.fn}_aa“。

那么现在使用across()函数,“一键”搞定上面的三个步骤!其实,主要要解决summarise_if()summarise()分组汇总不能连用的痛点。

# 探索缺失值
penguins %>%
  summarise(
    across(everything(), ~ sum(is.na(.x)))
  )
# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
    <int>  <int>          <int>         <int>             <int>       <int>
1       0      0              2             2                 2           2
# ℹ 2 more variables: sex <int>, year <int>
# 汇总分析
penguins %>%
  group_by(species) %>%
  summarise(
    across(where(is.numeric) & !year, mean, na.rm = TRUE), # 选择列,定义函数。
    n = n()
  )
# A tibble: 3 × 6
  species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g     n
  <fct>              <dbl>         <dbl>             <dbl>       <dbl> <int>
1 Adelie              38.8          18.3              190.       3701.   152
2 Chinstrap           48.8          18.4              196.       3733.    68
3 Gentoo              47.5          15.0              217.       5076.   124

好!

3 across()函数应用

3.1summarise()连用,对多个列的多个统计量进行统计汇总。

3.1.1 每个类型变量下有多少组

penguins %>%
  summarise(
    across(c(species, island, sex), n_distinct)
  )
# A tibble: 1 × 3
  species island   sex
    <int>  <int> <int>
1       3      3     3

3.1.2 多列多个统计函数

计算每个体征的平均值和标准差。

penguins %>%
  group_by(species) %>%
  summarise(
    across(where(is.numeric), list(mean = mean, sd = sd),
      .names = "{.col}-{.fn}", na.rm = TRUE
    ),
    n = n()
  )
# A tibble: 3 × 12
  species   `bill_length_mm-mean` `bill_length_mm-sd` `bill_depth_mm-mean`
  <fct>                     <dbl>               <dbl>                <dbl>
1 Adelie                     38.8                2.66                 18.3
2 Chinstrap                  48.8                3.34                 18.4
3 Gentoo                     47.5                3.08                 15.0
# ℹ 8 more variables: `bill_depth_mm-sd` <dbl>, `flipper_length_mm-mean` <dbl>,
#   `flipper_length_mm-sd` <dbl>, `body_mass_g-mean` <dbl>,
#   `body_mass_g-sd` <dbl>, `year-mean` <dbl>, `year-sd` <dbl>, n <int>

3.1.3 不同分组下数据变量的多个分位数

# 传统方法
penguins %>%
  group_by(species, island) %>%
  reframe(
    prob = c(0.25, 0.75),
    length = quantile(bill_length_mm, prob, na.rm = TRUE),
    depth = quantile(bill_depth_mm, prob, na.rm = TRUE),
    flipper = quantile(flipper_length_mm, prob, na.rm = TRUE)
  )
# A tibble: 10 × 6
   species   island     prob length depth flipper
   <fct>     <fct>     <dbl>  <dbl> <dbl>   <dbl>
 1 Adelie    Biscoe     0.25   37.7  17.6    185.
 2 Adelie    Biscoe     0.75   40.7  19.0    193 
 3 Adelie    Dream      0.25   36.8  17.5    185 
 4 Adelie    Dream      0.75   40.4  18.8    193 
 5 Adelie    Torgersen  0.25   36.7  17.4    187 
 6 Adelie    Torgersen  0.75   41.1  19.2    195 
 7 Chinstrap Dream      0.25   46.3  17.5    191 
 8 Chinstrap Dream      0.75   51.1  19.4    201 
 9 Gentoo    Biscoe     0.25   45.3  14.2    212 
10 Gentoo    Biscoe     0.75   49.6  15.7    221 
# across()方法
penguins %>%
  group_by(species, island) %>%
  reframe(
    prob = c(0.25, 0.75),
    across(
      ends_with("_mm"), ~ quantile(.x, prob, na.rm = TRUE)
    )
  )
# A tibble: 10 × 6
   species   island     prob bill_length_mm bill_depth_mm flipper_length_mm
   <fct>     <fct>     <dbl>          <dbl>         <dbl>             <dbl>
 1 Adelie    Biscoe     0.25           37.7          17.6              185.
 2 Adelie    Biscoe     0.75           40.7          19.0              193 
 3 Adelie    Dream      0.25           36.8          17.5              185 
 4 Adelie    Dream      0.75           40.4          18.8              193 
 5 Adelie    Torgersen  0.25           36.7          17.4              187 
 6 Adelie    Torgersen  0.75           41.1          19.2              195 
 7 Chinstrap Dream      0.25           46.3          17.5              191 
 8 Chinstrap Dream      0.75           51.1          19.4              201 
 9 Gentoo    Biscoe     0.25           45.3          14.2              212 
10 Gentoo    Biscoe     0.75           49.6          15.7              221 

3.1.4 更复杂的统计

across()函数与其他统计变量连用。

penguins %>%
  group_by(species) %>%
  summarise(
    n = n(),
    across(starts_with("bill_"), mean, na.rm = TRUE),
    Area = mean(bill_length_mm * bill_depth_mm, na.rm = TRUE),
    across(ends_with("_g"), mean, na.rm = TRUE)
  )
# A tibble: 3 × 6
  species       n bill_length_mm bill_depth_mm  Area body_mass_g
  <fct>     <int>          <dbl>         <dbl> <dbl>       <dbl>
1 Adelie      152           38.8          18.3  712.       3701.
2 Chinstrap    68           48.8          18.4  900.       3733.
3 Gentoo      124           47.5          15.0  712.       5076.

3.2mutate()连用-对所有符合条件的列进行函数操作

3.2.1 将数据中小于0的值替换为NA

test <- tibble(
  Staff.Confirmed = c(0, 1, -999),
  Residents.Confirmed = c(12, -192, 0)
)
test
# A tibble: 3 × 2
  Staff.Confirmed Residents.Confirmed
            <dbl>               <dbl>
1               0                  12
2               1                -192
3            -999                   0
test %>%
  mutate(
    across(contains("Confirmed"), ~ if_else(.x < 0, NA_real_, .x), .names = "res_{.col}")
  )
# A tibble: 3 × 4
  Staff.Confirmed Residents.Confirmed res_Staff.Confirmed res_Residents.Confir…¹
            <dbl>               <dbl>               <dbl>                  <dbl>
1               0                  12                   0                     12
2               1                -192                   1                     NA
3            -999                   0                  NA                      0
# ℹ abbreviated name: ¹​res_Residents.Confirmed

3.2.2 对所有符合条件的列操作

palmerpenguins::penguins %>%
  drop_na() %>%
  mutate(
    across(where(is.numeric), log),
    across(where(is.character), as.factor)
  )
# A tibble: 333 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           3.67          2.93              5.20        8.23
 2 Adelie  Torgersen           3.68          2.86              5.23        8.24
 3 Adelie  Torgersen           3.70          2.89              5.27        8.09
 4 Adelie  Torgersen           3.60          2.96              5.26        8.15
 5 Adelie  Torgersen           3.67          3.03              5.25        8.20
 6 Adelie  Torgersen           3.66          2.88              5.20        8.20
 7 Adelie  Torgersen           3.67          2.98              5.27        8.45
 8 Adelie  Torgersen           3.72          2.87              5.20        8.07
 9 Adelie  Torgersen           3.65          3.05              5.25        8.24
10 Adelie  Torgersen           3.54          3.05              5.29        8.39
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <dbl>

4filter()连用-if_all()if_any()函数

我们通过几个例子来了解下if_any()if_all()两个函数的使用方法。

为什么要有这俩函数?因across()函数在针对行进行筛选时(即在与filter()函数连用时)存在一些问题,会导致报错,且返回的结果可能存在错误。就像下面的代码:

penguins %>%
  filter(
    across(everything(), is.na)
  )
# A tibble: 0 × 8
# ℹ 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
#   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
#   year <int>

4.1 筛选有缺失值的行

penguins %>%
  filter(if_any(everything(), is.na))
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           NA            NA                  NA          NA
 2 Adelie  Torgersen           34.1          18.1               193        3475
 3 Adelie  Torgersen           42            20.2               190        4250
 4 Adelie  Torgersen           37.8          17.1               186        3300
 5 Adelie  Torgersen           37.8          17.3               180        3700
 6 Adelie  Dream               37.5          18.9               179        2975
 7 Gentoo  Biscoe              44.5          14.3               216        4100
 8 Gentoo  Biscoe              46.2          14.4               214        4650
 9 Gentoo  Biscoe              47.3          13.8               216        4725
10 Gentoo  Biscoe              44.5          15.7               217        4875
11 Gentoo  Biscoe              NA            NA                  NA          NA
# ℹ 2 more variables: sex <fct>, year <int>

4.2 筛选某列大于某值得所有航

penguins %>%
  filter(if_any(contains("bill"), ~ .x > 21))
# A tibble: 342 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           34.1          18.1               193        3475
 9 Adelie  Torgersen           42            20.2               190        4250
10 Adelie  Torgersen           37.8          17.1               186        3300
# ℹ 332 more rows
# ℹ 2 more variables: sex <fct>, year <int>

4.3 筛选保留符合条件的行

  1. 在指定的列(嘴峰长度和厚度)中检查每行的元素,如果这些元素都大于各自所在列的均值,就保留下来.
bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}

penguins %>%
  filter(if_all(contains("bill"), bigger_than_mean))
# A tibble: 61 × 8
   species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
 1 Adelie    Torgers…           46            21.5               194        4200
 2 Adelie    Dream              44.1          19.7               196        4400
 3 Adelie    Torgers…           45.8          18.9               197        4150
 4 Adelie    Biscoe             45.6          20.3               191        4600
 5 Adelie    Torgers…           44.1          18                 210        4000
 6 Gentoo    Biscoe             44.4          17.3               219        5250
 7 Gentoo    Biscoe             50.8          17.3               228        5600
 8 Chinstrap Dream              46.5          17.9               192        3500
 9 Chinstrap Dream              50            19.5               196        3900
10 Chinstrap Dream              51.3          19.2               193        3650
# ℹ 51 more rows
# ℹ 2 more variables: sex <fct>, year <int>
  • 在指定的列(嘴峰长度和嘴峰厚度)中检查每行的元素,如果这些元素都大于各自所在列的均值,就“both big”;如果这些元素有一个大于自己所在列的均值,就“one big”,(注意case_when中if_all要在if_any之前)。
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  mutate(
    category = case_when(
      if_all(contains("bill"), bigger_than_mean) ~ "both big",
      if_any(contains("bill"), bigger_than_mean) ~ "one big",
      TRUE ~ "small"
    )
  )
# A tibble: 342 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           36.7          19.3               193        3450
 5 Adelie  Torgersen           39.3          20.6               190        3650
 6 Adelie  Torgersen           38.9          17.8               181        3625
 7 Adelie  Torgersen           39.2          19.6               195        4675
 8 Adelie  Torgersen           34.1          18.1               193        3475
 9 Adelie  Torgersen           42            20.2               190        4250
10 Adelie  Torgersen           37.8          17.1               186        3300
# ℹ 332 more rows
# ℹ 3 more variables: sex <fct>, year <int>, category <chr>

5 更复杂的例子-across()与迭代

across()函数的结果是返回数据框,利用这个特性可以做很多事。

假如我们需要计算一个人每天摄入的水分和食物所占的比例。

library(tidyverse)
library(palmerpenguins)
df <- tibble::tribble( # tribble 生成rowwise形式的数据框
  ~water, ~food,
  10.0,   10.0,
  12.1,   10.3,
  13.5,   19.1,
  17.4,   16.0,
  25.8,   15.6,
  27.4,   19.8
)
df
# A tibble: 6 × 2
  water  food
  <dbl> <dbl>
1  10    10  
2  12.1  10.3
3  13.5  19.1
4  17.4  16  
5  25.8  15.6
6  27.4  19.8

如果是传统思路,我们应该怎么做?

df %>%
  rownames_to_column() %>%
  pivot_longer(cols = -rowname) %>%
  group_by(rowname) %>%
  mutate(
    percent = 100 * value / sum(value)
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = name,
    values_from = c(value, percent),
    names_glue = "{name}_{.value}"
  )
# A tibble: 6 × 5
  rowname water_value food_value water_percent food_percent
  <chr>         <dbl>      <dbl>         <dbl>        <dbl>
1 1              10         10            50           50  
2 2              12.1       10.3          54.0         46.0
3 3              13.5       19.1          41.4         58.6
4 4              17.4       16            52.1         47.9
5 5              25.8       15.6          62.3         37.7
6 6              27.4       19.8          58.1         41.9

5.1 使用across()解决以上问题

df %>%
  mutate(100 * across(.names = "porb(%{.col})") / rowSums(across())) %>%
  ungroup()
# A tibble: 6 × 4
  water  food `porb(%water)` `porb(%food)`
  <dbl> <dbl>          <dbl>         <dbl>
1  10    10             50            50  
2  12.1  10.3           54.0          46.0
3  13.5  19.1           41.4          58.6
4  17.4  16             52.1          47.9
5  25.8  15.6           62.3          37.7
6  27.4  19.8           58.1          41.9

5.2 迭代

Note
  1. rowwise()+across()组合起来,就是一行一行处理+一列一列处理,换句话说就是一个一个处理。

  2. across()+map_dbl():

    • purrr::map_dbl() 作为across( .fns = ) 中的函数,即across(.cols = , .fns = map_dbl())across()一列一列的迭代,每一列又传入purrr::map_dbl()再次迭代,因此这里是双重迭代。

    • across()作为purrr::map_df(.x = )的数据,即purrr::map_df(.x = across(), .f = )。因为在mutate()across()返回数据框,因此可以把across()整体视为数据框,然后这个数据框传入purrr::map_df(.x = )进行迭代,因此这种情形可以认为只有purrr::map_*()一次迭代。

有点拗口,我们具体来看。

# rowwise() + across()
# rowwise()设定行方向后,接着across()就行方向上的元素,分别执行.fns。
# 循环的模式为:第一层一行一行处理,第二层在每一行一个元素一个元素处理。
palmerpenguins::penguins %>%
  group_by(species, year) %>%
  summarise(flipper_length_mm = list(flipper_length_mm)) %>%
  ungroup() %>%
  pivot_wider(
    names_from = year,
    values_from = flipper_length_mm
  ) %>%
  rowwise() %>%
  mutate(
    across(where(is.list), .fns = length)
  )
# A tibble: 3 × 4
# Rowwise: 
  species   `2007` `2008` `2009`
  <fct>      <int>  <int>  <int>
1 Adelie        50     50     52
2 Chinstrap     26     18     24
3 Gentoo        34     46     44
# across(.cols = , .fns = purrr::map_dbl())
# 用across()就是一列一列的处理,
# 此时的一列是vector or list,又可以进入purrr::map_dbl()再次迭代,对这一列的每个元素,执行.fns
# 然后across()到下一列
# 循环模式:第一层,一列一列,第二层在每一列里,一个元素到一个元素
palmerpenguins::penguins %>%
  group_by(species, year) %>%
  summarise(flipper_length_mm = list(flipper_length_mm)) %>%
  ungroup() %>%
  pivot_wider(
    names_from = year,
    values_from = flipper_length_mm
  ) %>%
  mutate(
    across(where(is.list), function(x) map_dbl(x, length))
  )
# A tibble: 3 × 4
  species   `2007` `2008` `2009`
  <fct>      <dbl>  <dbl>  <dbl>
1 Adelie        50     50     52
2 Chinstrap     26     18     24
3 Gentoo        34     46     44

5.3 案例1

tt <- penguins %>%
  group_by(species, year) %>%
  summarise(
    across(c(bill_length_mm, bill_depth_mm), list)
  ) %>%
  ungroup()
tt
# A tibble: 9 × 4
  species    year bill_length_mm bill_depth_mm
  <fct>     <int> <list>         <list>       
1 Adelie     2007 <dbl [50]>     <dbl [50]>   
2 Adelie     2008 <dbl [50]>     <dbl [50]>   
3 Adelie     2009 <dbl [52]>     <dbl [52]>   
4 Chinstrap  2007 <dbl [26]>     <dbl [26]>   
5 Chinstrap  2008 <dbl [18]>     <dbl [18]>   
6 Chinstrap  2009 <dbl [24]>     <dbl [24]>   
7 Gentoo     2007 <dbl [34]>     <dbl [34]>   
8 Gentoo     2008 <dbl [46]>     <dbl [46]>   
9 Gentoo     2009 <dbl [44]>     <dbl [44]>   
# 分别计算bill_length_mm和bill_depth_mm两个列表列的长度。
tt %>%
  mutate(
    across(
      ends_with("_mm"),
      .fns = ~ map_dbl(.x, length),
      .names = '{sub("_mm", "", .col)}'
    )
  )
# A tibble: 9 × 6
  species    year bill_length_mm bill_depth_mm bill_length bill_depth
  <fct>     <int> <list>         <list>              <dbl>      <dbl>
1 Adelie     2007 <dbl [50]>     <dbl [50]>             50         50
2 Adelie     2008 <dbl [50]>     <dbl [50]>             50         50
3 Adelie     2009 <dbl [52]>     <dbl [52]>             52         52
4 Chinstrap  2007 <dbl [26]>     <dbl [26]>             26         26
5 Chinstrap  2008 <dbl [18]>     <dbl [18]>             18         18
6 Chinstrap  2009 <dbl [24]>     <dbl [24]>             24         24
7 Gentoo     2007 <dbl [34]>     <dbl [34]>             34         34
8 Gentoo     2008 <dbl [46]>     <dbl [46]>             46         46
9 Gentoo     2009 <dbl [44]>     <dbl [44]>             44         44
# 更好理解的写法-注意,across()结果返回的是数据框
tt %>%
  mutate(
    map_dfc(
      .x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
      .f = ~ map_dbl(.x, length)
    )
  )
# A tibble: 9 × 6
  species    year bill_length_mm bill_depth_mm bill_length bill_depth
  <fct>     <int> <list>         <list>              <dbl>      <dbl>
1 Adelie     2007 <dbl [50]>     <dbl [50]>             50         50
2 Adelie     2008 <dbl [50]>     <dbl [50]>             50         50
3 Adelie     2009 <dbl [52]>     <dbl [52]>             52         52
4 Chinstrap  2007 <dbl [26]>     <dbl [26]>             26         26
5 Chinstrap  2008 <dbl [18]>     <dbl [18]>             18         18
6 Chinstrap  2009 <dbl [24]>     <dbl [24]>             24         24
7 Gentoo     2007 <dbl [34]>     <dbl [34]>             34         34
8 Gentoo     2008 <dbl [46]>     <dbl [46]>             46         46
9 Gentoo     2009 <dbl [44]>     <dbl [44]>             44         44
tt %>%
  mutate(
    map_dfc(
      .x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
      .f = ~ map_dbl(.x, length)
    )
  )
# A tibble: 9 × 6
  species    year bill_length_mm bill_depth_mm bill_length bill_depth
  <fct>     <int> <list>         <list>              <dbl>      <dbl>
1 Adelie     2007 <dbl [50]>     <dbl [50]>             50         50
2 Adelie     2008 <dbl [50]>     <dbl [50]>             50         50
3 Adelie     2009 <dbl [52]>     <dbl [52]>             52         52
4 Chinstrap  2007 <dbl [26]>     <dbl [26]>             26         26
5 Chinstrap  2008 <dbl [18]>     <dbl [18]>             18         18
6 Chinstrap  2009 <dbl [24]>     <dbl [24]>             24         24
7 Gentoo     2007 <dbl [34]>     <dbl [34]>             34         34
8 Gentoo     2008 <dbl [46]>     <dbl [46]>             46         46
9 Gentoo     2009 <dbl [44]>     <dbl [44]>             44         44

5.4 案例2

对于以下数据:

df <- tibble(
  id = 1:10,
  sex = c("m", "m", "m", "f", "f", "f", "m", "f", "f", "m"),
  lds1.x = c(NA, 1, 0, 1, NA, 0, 0, NA, 0, 1),
  lds1.y = c(1, NA, 1, 1, 0, NA, 0, 3, NA, 1),
  lds2.x = c(2, 1, NA, 0, 0, NA, 1, NA, NA, 1),
  lds2.y = c(0, 2, 2, NA, NA, 0, 0, 3, 0, NA)
)
df
# A tibble: 10 × 6
      id sex   lds1.x lds1.y lds2.x lds2.y
   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl>
 1     1 m         NA      1      2      0
 2     2 m          1     NA      1      2
 3     3 m          0      1     NA      2
 4     4 f          1      1      0     NA
 5     5 f         NA      0      0     NA
 6     6 f          0     NA     NA      0
 7     7 m          0      0      1      0
 8     8 f         NA      3     NA      3
 9     9 f          0     NA     NA      0
10    10 m          1      1      1     NA

across()的方式实现以下操作:

df %>%
  mutate(
    lds1 = coalesce(lds1.x, lds1.y),
    lds2 = coalesce(lds2.x, lds2.y)
  )
# A tibble: 10 × 8
      id sex   lds1.x lds1.y lds2.x lds2.y  lds1  lds2
   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
 1     1 m         NA      1      2      0     1     2
 2     2 m          1     NA      1      2     1     1
 3     3 m          0      1     NA      2     0     2
 4     4 f          1      1      0     NA     1     0
 5     5 f         NA      0      0     NA     0     0
 6     6 f          0     NA     NA      0     0     0
 7     7 m          0      0      1      0     0     1
 8     8 f         NA      3     NA      3     3     3
 9     9 f          0     NA     NA      0     0     0
10    10 m          1      1      1     NA     1     1

思路:

mutate()中,把across()整体当做数据框使用

df %>%
  mutate(
    across(ends_with(".x"), .names = '{sub(".x","", .col)}')
  )
# A tibble: 10 × 8
      id sex   lds1.x lds1.y lds2.x lds2.y  lds1  lds2
   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
 1     1 m         NA      1      2      0    NA     2
 2     2 m          1     NA      1      2     1     1
 3     3 m          0      1     NA      2     0    NA
 4     4 f          1      1      0     NA     1     0
 5     5 f         NA      0      0     NA    NA     0
 6     6 f          0     NA     NA      0     0    NA
 7     7 m          0      0      1      0     0     1
 8     8 f         NA      3     NA      3    NA    NA
 9     9 f          0     NA     NA      0     0    NA
10    10 m          1      1      1     NA     1     1
# 将以上结果中的across()整体传入map_**()函数中。
df %>%
  mutate(
    map_dfc(
      .x = across(ends_with(".x"), .names = '{sub(".x","", .col)}'),
      .f = ~ is.na(.x)
    )
  )
# A tibble: 10 × 8
      id sex   lds1.x lds1.y lds2.x lds2.y lds1  lds2 
   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <lgl> <lgl>
 1     1 m         NA      1      2      0 TRUE  FALSE
 2     2 m          1     NA      1      2 FALSE FALSE
 3     3 m          0      1     NA      2 FALSE TRUE 
 4     4 f          1      1      0     NA FALSE FALSE
 5     5 f         NA      0      0     NA TRUE  FALSE
 6     6 f          0     NA     NA      0 FALSE TRUE 
 7     7 m          0      0      1      0 FALSE FALSE
 8     8 f         NA      3     NA      3 TRUE  TRUE 
 9     9 f          0     NA     NA      0 FALSE TRUE 
10    10 m          1      1      1     NA FALSE FALSE
# 针对.y进行同样的操作,并将两个across()数据框传递给map2_**()函数
df %>%
  mutate(
    map2_dfc(
      .x = across(ends_with(".x"), .names = '{sub(".x","", .col)}'),
      .y = across(ends_with(".y"), .names = '{sub(".y","", .col)}'),
      .f = coalesce
    )
  )
# A tibble: 10 × 8
      id sex   lds1.x lds1.y lds2.x lds2.y  lds1  lds2
   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
 1     1 m         NA      1      2      0     1     2
 2     2 m          1     NA      1      2     1     1
 3     3 m          0      1     NA      2     0     2
 4     4 f          1      1      0     NA     1     0
 5     5 f         NA      0      0     NA     0     0
 6     6 f          0     NA     NA      0     0     0
 7     7 m          0      0      1      0     0     1
 8     8 f         NA      3     NA      3     3     3
 9     9 f          0     NA     NA      0     0     0
10    10 m          1      1      1     NA     1     1

上述例子中的迭代过程为:.x和.y两个across()数据框的第一列与第一列、第二列与第二列,…进行coalesce操作。分步骤的写法:

df %>%
  mutate({
    df1 <- across(ends_with(".x"), .names = '{sub(".x","",.col)}')
    df2 <- across(ends_with(".y"))
    out <- map2_dfc(df1, df2, ~ coalesce(.x, .y))
    out
  })
# A tibble: 10 × 8
      id sex   lds1.x lds1.y lds2.x lds2.y  lds1  lds2
   <int> <chr>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
 1     1 m         NA      1      2      0     1     2
 2     2 m          1     NA      1      2     1     1
 3     3 m          0      1     NA      2     0     2
 4     4 f          1      1      0     NA     1     0
 5     5 f         NA      0      0     NA     0     0
 6     6 f          0     NA     NA      0     0     0
 7     7 m          0      0      1      0     0     1
 8     8 f         NA      3     NA      3     3     3
 9     9 f          0     NA     NA      0     0     0
10    10 m          1      1      1     NA     1     1

6 小结

across()函数在summarise()/mutate()/...等函数中使用,可以实现以下功能:

  1. 针对数据框中的多列执行相同操作。

  2. 不同性质的操作,有时可以一起写出,不用在使用left_join()等函数链接各结果。

  3. 在针对行进行筛选时,应该使用if_any()if_all()