The database manager in OmnipathR
Denes Turei
Institute for Computational Biomedicine, Heidelberg University
turei.denes@gmail.com
Source: vignettes/db_manager.Rmd
Abstract
The database manager is an API within OmnipathR that loads various datasets, keeps track of their usage and removes them after an expiry period. Currently it supports a few Gene Ontology and UniProt datasets, but it can easily be extended to cover all datasets in the package.
Available datasets
To see a full list of datasets, call the omnipath_show_db function:
## # A tibble: 20 × 10
## name last_…¹ lifet…² package loader loader_param lates…³ loaded db key
## <chr> <lgl> <dbl> <chr> <chr> <list> <lgl> <lgl> <lgl> <chr>
## 1 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_b…
## 2 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_f…
## 3 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_a…
## 4 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_a…
## 5 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_s…
## 6 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_c…
## 7 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_d…
## 8 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_c…
## 9 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_m…
## 10 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_p…
## 11 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_m…
## 12 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_p…
## 13 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_p…
## 14 Gene … NA 300 Omnipa… go_on… <named list> NA FALSE NA go_y…
## 15 GO an… NA 300 Omnipa… go_an… <named list> NA FALSE NA goa_…
## 16 UniPr… NA 300 Omnipa… unipr… <named list> NA FALSE NA up_gs
## 17 Ensem… NA 10800 Omnipa… taxon… <NULL> NA FALSE NA orga…
## 18 All S… NA 10800 Omnipa… all_u… <named list> NA FALSE NA swis…
## 19 All T… NA 10800 Omnipa… all_u… <named list> NA FALSE NA trem…
## 20 OmniP… NA 300 Omnipa… build… <named list> NA FALSE NA sear…
## # … with abbreviated variable names ¹last_used, ²lifetime, ³latest_param
It returns a tibble where each dataset has a human-readable name and a key which can be used to refer to it. Here we can also check whether the dataset is currently loaded, when it was last used, and which loader function and arguments are used to load it.
Access a dataset
Datasets can be accessed with the get_db function. Ideally you should call this function every time you use the dataset. The first time, the dataset will be loaded; on subsequent calls the already loaded dataset will be returned. This way each access is registered and extends the expiry time. Let’s load the human UniProt-GeneSymbol table. Above we see that its key is up_gs.
up_gs <- get_db('up_gs')
up_gs
## # A tibble: 20,375 × 2
## From To
## <chr> <chr>
## 1 Q96JT2 SLC45A3
## 2 Q9UP95 SLC12A4
## 3 Q08357 SLC20A2
## 4 O94855 SEC24D
## 5 Q8N2U9 SLC66A2
## 6 Q96CW6 SLC7A6OS
## 7 Q01959 SLC6A3
## 8 Q9NQ03 SCRT2
## 9 P48061 CXCL12
## 10 Q15047 SETDB1
## # … with 20,365 more rows
This dataset is a two-column data frame of SwissProt IDs and Gene Symbols. Looking again at the datasets, we find that this dataset is now loaded and the last_used timestamp is set to the time we called get_db:
## # A tibble: 20 × 10
## name last_used lifet…¹ package loader loader_param latest_param
## <chr> <dttm> <dbl> <chr> <chr> <list> <list>
## 1 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 2 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 3 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 4 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 5 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 6 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 7 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 8 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 9 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 10 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 11 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 12 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 13 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 14 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 15 GO anno… NA 300 Omnipa… go_an… <named list> <lgl [1]>
## 16 UniProt… 2023-03-22 20:24:45 300 Omnipa… unipr… <named list> <named list>
## 17 Ensembl… 2023-03-22 20:24:44 10800 Omnipa… taxon… <NULL> <NULL>
## 18 All Swi… NA 10800 Omnipa… all_u… <named list> <lgl [1]>
## 19 All TrE… NA 10800 Omnipa… all_u… <named list> <lgl [1]>
## 20 OmniPat… NA 300 Omnipa… build… <named list> <lgl [1]>
## # … with 3 more variables: loaded <lgl>, db <list>, key <chr>, and abbreviated
## # variable name ¹lifetime
The table above also contains a reference to the dataset (in the db column) and the arguments passed to the loader function:
d <- omnipath_show_db()
d %>% dplyr::pull(db) %>% magrittr::extract2(16)
## # A tibble: 20,375 × 2
## From To
## <chr> <chr>
## 1 Q96JT2 SLC45A3
## 2 Q9UP95 SLC12A4
## 3 Q08357 SLC20A2
## 4 O94855 SEC24D
## 5 Q8N2U9 SLC66A2
## 6 Q96CW6 SLC7A6OS
## 7 Q01959 SLC6A3
## 8 Q9NQ03 SCRT2
## 9 P48061 CXCL12
## 10 Q15047 SETDB1
## # … with 20,365 more rows
## $to
## [1] "genesymbol"
##
## $organism
## [1] 9606
If we call get_db again, the timestamp is updated, resetting the expiry counter:
up_gs <- get_db('up_gs')
omnipath_show_db()
## # A tibble: 20 × 10
## name last_used lifet…¹ package loader loader_param latest_param
## <chr> <dttm> <dbl> <chr> <chr> <list> <list>
## 1 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 2 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 3 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 4 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 5 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 6 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 7 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 8 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 9 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 10 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 11 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 12 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 13 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 14 Gene On… NA 300 Omnipa… go_on… <named list> <lgl [1]>
## 15 GO anno… NA 300 Omnipa… go_an… <named list> <lgl [1]>
## 16 UniProt… 2023-03-22 20:24:55 300 Omnipa… unipr… <named list> <named list>
## 17 Ensembl… 2023-03-22 20:24:44 10800 Omnipa… taxon… <NULL> <NULL>
## 18 All Swi… NA 10800 Omnipa… all_u… <named list> <lgl [1]>
## 19 All TrE… NA 10800 Omnipa… all_u… <named list> <lgl [1]>
## 20 OmniPat… NA 300 Omnipa… build… <named list> <lgl [1]>
## # … with 3 more variables: loaded <lgl>, db <list>, key <chr>, and abbreviated
## # variable name ¹lifetime
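The access pattern described in this section (load on first access, return the cached copy afterwards, refresh the timestamp on every call, drop the data once it has not been used for longer than its lifetime) can be illustrated with a small toy registry. This is only a sketch in base R and not the OmnipathR implementation: the names registry, register_db, get_toy_db and expire_toy_dbs are hypothetical, and treating the lifetime as a number of seconds is an assumption.

```r
# Toy registry illustrating the access pattern; NOT OmnipathR code.
# All function and variable names here are hypothetical.
registry <- new.env()

# Register a dataset: a loader function and a lifetime (assumed seconds).
register_db <- function(key, loader, lifetime = 300) {
    entry <- list(
        loader = loader,
        lifetime = lifetime,
        loaded = FALSE,
        last_used = NA,
        db = NULL
    )
    assign(key, entry, envir = registry)
    invisible(key)
}

# Return the dataset, loading it only if it is not loaded yet.
# Every call refreshes the last_used timestamp.
get_toy_db <- function(key) {
    entry <- get0(key, envir = registry, inherits = FALSE)
    if (is.null(entry)) stop('Unknown dataset key: ', key)
    if (!entry$loaded) {
        entry$db <- entry$loader()
        entry$loaded <- TRUE
    }
    entry$last_used <- Sys.time()
    assign(key, entry, envir = registry)
    entry$db
}

# Drop every loaded dataset whose last use is older than its lifetime.
expire_toy_dbs <- function(now = Sys.time()) {
    for (key in ls(registry)) {
        entry <- get(key, envir = registry)
        if (!entry$loaded) next
        age <- as.numeric(difftime(now, entry$last_used, units = 'secs'))
        if (age > entry$lifetime) {
            entry['db'] <- list(NULL)
            entry$loaded <- FALSE
            assign(key, entry, envir = registry)
        }
    }
    invisible(NULL)
}

# Usage: count how many times the loader actually runs.
counter <- new.env()
counter$n <- 0
register_db(
    'toy_table',
    function() {
        counter$n <- counter$n + 1
        data.frame(From = c('P1', 'P2'), To = c('GENE1', 'GENE2'))
    },
    lifetime = 300
)

first <- get_toy_db('toy_table')   # loads the data
second <- get_toy_db('toy_table')  # returns the cached copy

# Pretend 301 seconds have passed since the last access.
expire_toy_dbs(now = Sys.time() + 301)
still_loaded <- get('toy_table', envir = registry)$loaded
```

In this sketch the loader runs only once for the two accesses, and the dataset is unloaded once the simulated clock passes the lifetime. A later call to get_toy_db would simply load it again.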
Where are the loaded datasets?
The loaded datasets live in an environment which belongs to the OmnipathR package. Normally users don’t need to access this environment. As we see below, omnipath_show_db presents all the information available by looking directly at the environment:
OmnipathR:::omnipath.env$db$up_gs
## $name
## [1] "UniProt-GeneSymbol table"
##
## $last_used
## [1] "2023-03-22 20:24:55 CET"
##
## $lifetime
## [1] 300
##
## $package
## [1] "OmnipathR"
##
## $loader
## [1] "uniprot_full_id_mapping_table"
##
## $loader_param
## $loader_param$to
## [1] "genesymbol"
##
## $loader_param$organism
## [1] 9606
##
##
## $latest_param
## $latest_param$to
## [1] "genesymbol"
##
## $latest_param$organism
## [1] 9606
##
##
## $loaded
## [1] TRUE
##
## $db
## # A tibble: 20,375 × 2
## From To
## <chr> <chr>
## 1 Q96JT2 SLC45A3
## 2 Q9UP95 SLC12A4
## 3 Q08357 SLC20A2
## 4 O94855 SEC24D
## 5 Q8N2U9 SLC66A2
## 6 Q96CW6 SLC7A6OS
## 7 Q01959 SLC6A3
## 8 Q9NQ03 SCRT2
## 9 P48061 CXCL12
## 10 Q15047 SETDB1
## # … with 20,365 more rows
How to extend the expiry period?
The default expiry of datasets is given by the option omnipath.db_lifetime. By calling omnipath_save_config, this option is saved to the default config file and will be valid in all subsequent sessions. Otherwise it’s valid only in the current session.
options(omnipath.db_lifetime = 600)
omnipath_save_config()
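For a change that should last only for part of a session, the standard R options mechanism can be used without saving the config. The sketch below uses only base R; the option name comes from the text above, while the value 600 and the variable names are purely illustrative.

```r
# options() returns the previous values, so they can be restored later.
old <- options(omnipath.db_lifetime = 600)

# Check the value that is now in effect in this session.
current <- getOption('omnipath.db_lifetime')
current

# Restore whatever was set before (including "not set at all").
options(old)
```

Because omnipath_save_config is not called here, nothing is written to the config file and later sessions are unaffected.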
Where are the datasets defined?
The built-in dataset definitions are in a JSON file shipped with the package. The easiest way to view it is through the git web interface.
How to add custom datasets?
Currently there is no API available for this, but it would be easy to implement. It would be a matter of providing a JSON file similar to the one above, or calling a function. Please open an issue if you are interested in this feature.
Session information
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libopenblasp-r0.3.21.so
## LAPACK: /usr/lib/liblapack.so.3.11.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] OmnipathR_3.7.8 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 tidyr_1.3.0 prettyunits_1.1.1
## [4] rprojroot_2.0.3 digest_0.6.31 utf8_1.2.3
## [7] R6_2.5.1 cellranger_1.1.0 backports_1.4.1
## [10] evaluate_0.20 httr_1.4.4 pillar_1.8.1
## [13] rlang_1.0.6 progress_1.2.2 curl_5.0.0
## [16] readxl_1.4.2 jquerylib_0.1.4 checkmate_2.1.0
## [19] rmarkdown_2.20 pkgdown_2.0.7 textshaping_0.3.6
## [22] desc_1.4.2 readr_2.1.4 stringr_1.5.0
## [25] selectr_0.4-2 igraph_1.4.0 bit_4.0.5
## [28] compiler_4.2.2 xfun_0.37 pkgconfig_2.0.3
## [31] systemfonts_1.0.4 htmltools_0.5.4 tidyselect_1.2.0
## [34] tibble_3.1.8 bookdown_0.32 fansi_1.0.4
## [37] crayon_1.5.2 dplyr_1.1.0 tzdb_0.3.0
## [40] withr_2.5.0 later_1.3.0 rappdirs_0.3.3
## [43] jsonlite_1.8.4 lifecycle_1.0.3 magrittr_2.0.3
## [46] cli_3.6.0 stringi_1.7.12 vroom_1.6.1
## [49] cachem_1.0.6 fs_1.6.1 xml2_1.3.3
## [52] logger_0.2.2 bslib_0.4.2 ellipsis_0.3.2
## [55] ragg_1.2.5 generics_0.1.3 vctrs_0.5.2
## [58] tools_4.2.2 bit64_4.0.5 glue_1.6.2
## [61] purrr_1.0.1 hms_1.1.2 fastmap_1.1.0
## [64] yaml_2.3.7 BiocManager_1.30.19 rvest_1.0.3
## [67] memoise_2.0.1 knitr_1.42 sass_0.4.5