Diving into dependen-“sea”

How CRAN packages are interconnected

H. Sherry Zhang

2022-10-18

When writing a package, we may want to use functions from other packages. This creates a dependency for our package and a reverse dependency for the package we borrow functions from. As one of the recipients of the isoband email¹, I’m curious to know how interconnected CRAN packages are. Luckily, it is not too hard to get data on this, and so the journey begins…

Preparing dependency data

The utils package provides the function available.packages() to extract CRAN package information. The data include each package’s name, version, dependencies, and license:

Code
library(tidyverse)  # dplyr, tidyr, tibble, stringr, purrr and ggplot2 are used throughout
(raw <- utils::available.packages() %>% as_tibble())
# A tibble: 18,692 × 17
   Package     Version Priority Depends  Imports Linki…¹ Sugge…² Enhan…³ License
   <chr>       <chr>   <chr>    <chr>    <chr>   <chr>   <chr>   <chr>   <chr>  
 1 A3          1.0.0   <NA>     R (>= 2… <NA>    <NA>    random… <NA>    GPL (>…
 2 AATtools    0.0.2   <NA>     R (>= 3… magrit… <NA>    <NA>    <NA>    GPL-3  
 3 ABACUS      1.0.0   <NA>     R (>= 3… ggplot… <NA>    rmarkd… <NA>    GPL-3  
 4 abbreviate  0.1     <NA>     <NA>     <NA>    <NA>    testth… <NA>    GPL-3  
 5 abbyyR      0.5.5   <NA>     R (>= 3… httr, … <NA>    testth… <NA>    MIT + …
 6 abc         2.2.1   <NA>     R (>= 2… <NA>    <NA>    <NA>    <NA>    GPL (>…
 7 abc.data    1.0     <NA>     R (>= 2… <NA>    <NA>    <NA>    <NA>    GPL (>…
 8 ABC.RAP     0.9.0   <NA>     R (>= 3… graphi… <NA>    knitr,… <NA>    GPL-3  
 9 abcADM      1.0     <NA>     <NA>     Rcpp (… Rcpp, … <NA>    <NA>    GPL-3  
10 ABCanalysis 1.2.1   <NA>     R (>= 2… plotrix <NA>    <NA>    <NA>    GPL-3  
# … with 18,682 more rows, 8 more variables: License_is_FOSS <chr>,
#   License_restricts_use <chr>, OS_type <chr>, Archs <chr>, MD5sum <chr>,
#   NeedsCompilation <chr>, File <chr>, Repository <chr>, and abbreviated
#   variable names ¹​LinkingTo, ²​Suggests, ³​Enhances

From this, we can extract a table that maps out the direct dependencies of every CRAN package. In this post we will focus on two of the strong dependency types, Depends and Imports:

Code
all_pkgs <- raw %>% 
  tidyr::separate_rows(Imports, sep = ",") %>% 
  tidyr::separate_rows(Depends, sep = ",") %>% 
  mutate(
    # drop version constraints such as "(>= 3.5.0)", then trim whitespace
    across(c(Depends, Imports), ~gsub("\\(.*\\)", "", .x)),
    across(c(Depends, Imports), str_trim)
    )

(dep_lookup_tbl <- all_pkgs %>% 
  dplyr::select(Package, Depends, Imports) %>% 
  rename(downstream = Package) %>% 
  pivot_longer(Depends:Imports, names_to = "type", values_to = "upstream") %>% 
  distinct() %>% 
  filter(!upstream %in% c("R", "")) %>% 
  filter(!is.na(upstream)) %>% 
  arrange(downstream))
# A tibble: 96,765 × 3
   downstream type    upstream  
   <chr>      <chr>   <chr>     
 1 A3         Depends xtable    
 2 A3         Depends pbapply   
 3 AATtools   Imports magrittr  
 4 AATtools   Imports dplyr     
 5 AATtools   Imports doParallel
 6 AATtools   Imports foreach   
 7 ABACUS     Imports ggplot2   
 8 ABACUS     Imports shiny     
 9 ABC.RAP    Imports graphics  
10 ABC.RAP    Imports stats     
# … with 96,755 more rows
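
To make the cleaning step concrete, here is a minimal base R sketch of how a single raw dependency field could be split into package names. The helper name `parse_dep_field()` and the example string are made up for illustration; the pipeline above does the same job on whole columns with `separate_rows()`, `gsub()` and `str_trim()`.

```r
# Hypothetical helper: turn one raw Depends/Imports string into package names.
parse_dep_field <- function(x) {
  if (is.na(x)) return(character(0))
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]  # one element per dependency
  parts <- gsub("\\([^)]*\\)", "", parts)       # drop version constraints such as "(>= 3.5.0)"
  parts <- trimws(parts)                        # remove spaces and line breaks
  parts[!parts %in% c("R", "")]                 # R itself is not a package
}

# Made-up field in the style of a DESCRIPTION file
example_field <- "R (>= 3.5.0), dplyr,\nggplot2 (>= 3.0.0)"
parse_dep_field(example_field)  # returns c("dplyr", "ggplot2")
```

A package with no entry in a field is stored as `NA`, which the sketch maps to an empty character vector; the tidyverse pipeline handles the same case with `filter(!is.na(upstream))`.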

Dependency is a transitive relation. This means a package also (indirectly) depends on all the dependencies of the packages it imports, and so on. Changes to a package propagate downstream through its dependency chain. With the direct dependency table above, we can iteratively construct the extended dependency tree:

Code
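# Walk the reverse-dependency chain of one package.
# `upstream` is the package name (supplied by map2() below, not used in the body);
# `data` holds its direct reverse dependencies (columns: type, downstream).
find_all_deps <- function(upstream, data){
  dt <- tibble()
  dt2 <- data
  i <- 1
  # keep joining one more level for as long as the join adds rows
  while(nrow(dt2) > nrow(dt)){
    dt <- dt2
    n <- paste0("upstream", i) 
    dt2 <- dt %>% 
      rename(upstream = downstream) %>% 
      left_join(dep_lookup_tbl %>% select(-type), by = "upstream") %>% 
      rename(!!n := upstream)
    i <- i + 1
  }
  
  # stack every level into one column and keep each package once
  dep <- dt2 %>%
    pivot_longer(
      cols = c(contains("upstream"),  "downstream"),
      names_to = "dump", values_to = "downstream") %>%
    distinct(downstream) %>%
    filter(!is.na(downstream)) %>%
    arrange(downstream)
  return(dep)
}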

dep_all <- dep_lookup_tbl %>% 
  arrange(upstream) %>% 
  nest(direct_deps = -upstream) %>% 
  mutate(all_deps = map2(upstream, direct_deps, find_all_deps))

(edges <- dep_all %>% 
    select(-direct_deps) %>% 
    unnest(all_deps) %>% 
    filter(!is.na(upstream), !is.na(downstream)))
# A tibble: 551,713 × 2
   upstream downstream
   <chr>    <chr>     
 1 a4Core   nlcv      
 2 abc      abctools  
 3 abc      EasyABC   
 4 abc      ecolottery
 5 abc      nlrx      
 6 abc      paleopop  
 7 abc      poems     
 8 abc.data abc       
 9 abc.data abctools  
10 abc.data EasyABC   
# … with 551,703 more rows
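
As a cross-check on the idea of transitivity, the same closure can be sketched with a set-based walk in base R. This is not the method used above, only an illustration under stated assumptions: the package names in `toy_edges` and the function name `all_revdeps()` are hypothetical. The join-based loop stops once a join adds no rows, which could in principle happen while the newest level still has unexplored reverse dependencies (for example, a chain in which every package has exactly one reverse dependency). The sketch below stops only when no new package is found.

```r
# Toy direct-dependency table (hypothetical packages): "downstream" uses "upstream".
toy_edges <- data.frame(
  upstream   = c("A", "B", "C", "A"),
  downstream = c("B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Collect every package that directly or indirectly depends on `pkg`.
all_revdeps <- function(pkg, edges) {
  found <- character(0)
  frontier <- pkg
  while (length(frontier) > 0) {
    # packages that directly use anything in the current frontier
    nxt <- unique(edges$downstream[edges$upstream %in% frontier])
    # keep only packages not seen before, so the loop also ends if there is a cycle
    frontier <- setdiff(nxt, c(found, pkg))
    found <- c(found, frontier)
  }
  sort(found)
}

all_revdeps("A", toy_edges)  # returns c("B", "C", "D", "E")
all_revdeps("C", toy_edges)  # returns "D"
```

Because each package is visited once, the work grows with the number of edges rather than the number of paths. Base R also ships `tools::package_dependencies()`, whose `recursive` and `reverse` arguments could serve as a further check on the real CRAN data.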

The plot below shows the number of dependencies and reverse dependencies each package has, counting both direct and indirect ones.

Code
nodes <- tibble(id = unique(c(edges$upstream, edges$downstream))) %>% 
  left_join(edges %>% count(upstream, name = "n_revdep"), by = c("id" = "upstream")) %>% 
  left_join(edges %>% count(downstream, name = "n_dep"), by = c("id" = "downstream")) %>% 
  filter(!is.na(id)) %>% 
  # packages missing from a count have zero (reverse) dependencies
  mutate(n_revdep = coalesce(n_revdep, 0L),
         n_dep = coalesce(n_dep, 0L))

################################################################
# deriving color categories
recommended <- raw %>% filter(Priority == "recommended") %>% pull(Package)

base <- c("base", "compiler", "datasets", "grDevices", "graphics", "grid", "methods", "parallel", "splines", "stats", "stats4", "tcltk", "tools", "translations", "utils")

# repository names of the r-lib and tidyverse GitHub organisations (needs the gh package)
r_lib_gh <- gh::gh("GET /orgs/{username}/repos", username = "r-lib", .limit = 200)
r_lib <- vapply(r_lib_gh, "[[", "", "name")

r_tidyverse_gh <- gh::gh("GET /orgs/{username}/repos", username = "tidyverse", .limit = 40)
tidyverse <- vapply(r_tidyverse_gh, "[[", "", "name")

nodes <- nodes %>% 
  mutate(category = 
           case_when(id %in% tidyverse ~ "tidyverse", 
                     id %in% base ~ "base",
                     id %in% r_lib ~ "r-lib",
                     id %in% recommended ~ "recommended",
                     TRUE ~ "zzz"))
################################################################
# to deal with the zero mark after the sqrt transform
# https://github.com/tidyverse/ggplot2/issues/980
mysqrt_trans <- function() {
    scales::trans_new("mysqrt", 
              transform = base::sqrt,
              inverse = function(x) ifelse(x<0, 0, x^2),
              domain = c(0, Inf))
}

p <- nodes %>% 
  mutate(tooltip = glue::glue("Pkg: {id}, dep: {n_dep}, revdep: {n_revdep}")) %>% 
  ggplot(aes(x = n_dep, y = n_revdep)) + 
  ggiraph::geom_point_interactive(aes(tooltip = tooltip)) +
  ggrepel::geom_text_repel(
    data = nodes %>% filter(n_revdep > 3100),
    aes(color= category, label = id), min.segment.length = 0) +
  scale_color_brewer(palette = "Set1") + 
  scale_y_continuous(breaks = c(0,  50, 200, 500, 1000, 2500, 5000, 7500, 10000, 15000), trans = "mysqrt") + 
  scale_x_continuous(breaks = c(0, 1, 5, 10, 20, 40, 80, 120, 160, 200), trans = "mysqrt") + 
  theme(panel.grid.minor = element_blank(),
        legend.position = "bottom") + 
  xlab("Number of dependencies") + 
  ylab("Number of reverse dependencies")

ggiraph::girafe(ggobj = p, width_svg = 16, height_svg = 12)
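
The clamped inverse in `mysqrt_trans()` is the part that matters for the zero tick. A rough illustration in plain base R follows; the function names and the padded limit of -0.5 are made up for this sketch. The likely mechanism, going by the linked ggplot2 issue, is that the axis range is padded on the transformed scale, so the lower limit can fall slightly below zero, and squaring a negative value maps it back to a positive one, which pushes the break at 0 outside the limits.

```r
# Hypothetical stand-ins for the two candidate inverses of a sqrt scale.
sqrt_inverse_plain   <- function(x) x^2
sqrt_inverse_clamped <- function(x) ifelse(x < 0, 0, x^2)  # same rule as in mysqrt_trans()

# Pretend the padded axis limits are -0.5 and 3 on the sqrt scale (made-up values).
padded_limits <- c(-0.5, 3)

sqrt_inverse_plain(padded_limits)    # 0.25 9.00: the lower limit lands above 0
sqrt_inverse_clamped(padded_limits)  # 0 9: the lower limit stays at 0
```

With the clamped version the back-transformed range still contains 0, so a break at 0 can be drawn.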