Building prior knowledge network (PKN) for COSMOS
Denes Turei
Institute for Computational Biomedicine, Heidelberg University
turei.denes@gmail.com
Source: vignettes/cosmos.Rmd
Abstract
The prior knowledge network (PKN) used by COSMOS is a network of heterogeneous causal interactions: it contains protein-protein, reactant-enzyme and enzyme-product interactions. It is a combination of multiple resources. Here we present the functions that load each component, the options for customization, and the functions that build the complete PKN.
Introduction
The COSMOS PKN is a combination of the following datasets:
- Genome-scale metabolic model (GEM) from Chalmers Sysbio (Wang et al., 2021)
- Network of chemical-protein interactions from STITCH (http://stitch.embl.de/)
- Protein-protein interactions from OmniPath (Türei et al., 2021)
The PKN can be built by calling the cosmos_pkn() function, which we present in the last section.
First let’s take a closer look at each resource.
Chalmers Sysbio GEM
The Chalmers Sysbio group provides genome-scale models of metabolism (GEMs) for various organisms: human, mouse, rat, fish, fly and worm. These models are available as Matlab files, which contain reaction data, a stoichiometry matrix and identifier translation data.
The raw contents of the Matlab file can be loaded by chalmers_gem_raw(). This results in a convoluted structure of nested lists and arrays:
ch_raw <- chalmers_gem_raw()
Another function, chalmers_gem(), processes the above structure into a data frame of reactions, also keeping the stoichiometry matrix and including the identifier translation data:
ch_processed <- chalmers_gem()
## Loading required package: Matrix
names(ch_processed)
## [1] "organism" "reactions" "metabolites" "reaction_ids"
## [5] "metabolite_ids" "S"
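To illustrate how a stoichiometry matrix encodes reactions, here is a minimal sketch. It assumes the usual GEM convention (rows are metabolites, columns are reactions, negative coefficients mark consumed metabolites and positive ones produced metabolites). The matrix and all names in it are a made-up toy, not data returned by chalmers_gem().

```r
# Toy stoichiometric matrix (hypothetical values):
# rows are metabolites, columns are reactions.
S <- matrix(
  c(-1,  0,
    -1,  1,
     1, -1,
     0,  1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), c("R1", "R2"))
)

# For each reaction, collect the metabolites with negative coefficients
# (reactants) and those with positive coefficients (products).
reactants <- setNames(
  lapply(colnames(S), function(r) rownames(S)[S[, r] < 0]),
  colnames(S)
)
products <- setNames(
  lapply(colnames(S), function(r) rownames(S)[S[, r] > 0]),
  colnames(S)
)

reactants$R1  # "A" "B"
products$R2   # "B" "D"
```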
The identifier translation tables are available through separate functions, chalmers_gem_metabolites() and chalmers_gem_reactions(), for metabolite and reaction (and enzyme) identifiers, respectively. These return simple data frames.
ch_met <- chalmers_gem_metabolites()
ch_met
## # A tibble: 8,456 × 14
## mets metsNoComp metBiGGID metKEGGID metHMDBID metChEBIID metPubChemID
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 MAM00001c MAM00001 carveol C00964 NA CHEBI:15389 NA
## 2 MAM00001e MAM00001 carveol C00964 NA CHEBI:15389 NA
## 3 MAM00002c MAM00002 appnn C09880 HMDB0006525 CHEBI:36740 6654
## 4 MAM00002e MAM00002 appnn C09880 HMDB0006525 CHEBI:36740 6654
## 5 MAM00003c MAM00003 NA NA NA CHEBI:78990 NA
## 6 MAM00003l MAM00003 NA NA NA CHEBI:78990 NA
## 7 MAM00003r MAM00003 NA NA NA CHEBI:78990 NA
## 8 MAM00003e MAM00003 NA NA NA CHEBI:78990 NA
## 9 MAM00004c MAM00004 NA NA NA NA NA
## 10 MAM00004m MAM00004 NA NA NA NA NA
## # ℹ 8,446 more rows
## # ℹ 7 more variables: metLipidMapsID <chr>, metEHMNID <chr>,
## # metHepatoNET1ID <chr>, metRecon3DID <chr>, metMetaNetXID <chr>,
## # metHMR2ID <chr>, metRetired <chr>
The metabolite identifier translation available here is also integrated into the package’s translation service, available through translate_ids() and other functions.
translate_ids('MAM00001', 'metabolicatlas', 'recon3d', chalmers = TRUE)
## [1] "carveol"
Finally, the chalmers_gem_network() function uses all the above data to compile a network of binary chemical-protein interactions. By default, Metabolic Atlas identifiers are used for the metabolites and Ensembl Gene IDs for the enzymes. These can be translated to the desired identifiers using the metabolite_ids and protein_ids arguments. Translation to multiple identifier types is possible.
The ri or record_id column, in the case of the Chalmers GEM, represents the reaction ID, a unique identifier of the original reaction. One reaction yields many binary interactions, as it consists of a number of gene products, reactants and products. The column ci means “complex ID”: it is a unique identifier of groups of enzymes required together to carry out the reaction. The column reverse indicates whether the row is derived from the reversed version of a reversible reaction. The column transporter flags reactions where the same metabolite occurs on both the reactant and the product side; these are assumed to be transport reactions. In the Chalmers GEM the reactions are also assigned to compartments, encoded by single-letter codes in the comp column. In the original data the compartment codes are suffixes of the metabolite IDs; here we move them into a separate column, leaving the Metabolic Atlas IDs clean and usable.
ch <- chalmers_gem_network()
ch
## # A tibble: 135,769 × 16
## ri ci source target reverse met_to_gene comp transporter
## <int> <int> <chr> <chr> <lgl> <lgl> <fct> <lgl>
## 1 1 1 MAM01796 ENSG00000147576 FALSE TRUE c NA
## 2 1 1 MAM01796 ENSG00000147576 FALSE TRUE c NA
## 3 1 2 MAM01796 ENSG00000172955 FALSE TRUE c NA
## 4 1 2 MAM01796 ENSG00000172955 FALSE TRUE c NA
## 5 1 3 MAM01796 ENSG00000180011 FALSE TRUE c NA
## 6 1 3 MAM01796 ENSG00000180011 FALSE TRUE c NA
## 7 1 4 MAM01796 ENSG00000187758 FALSE TRUE c NA
## 8 1 5 MAM01796 ENSG00000196344 FALSE TRUE c NA
## 9 1 5 MAM01796 ENSG00000196344 FALSE TRUE c NA
## 10 1 6 MAM01796 ENSG00000196616 FALSE TRUE c NA
## # ℹ 135,759 more rows
## # ℹ 8 more variables: uniprot_source <chr>, genesymbol_source <chr>,
## # uniprot_target <chr>, genesymbol_target <chr>, hmdb_source <chr>,
## # kegg_source <chr>, hmdb_target <chr>, kegg_target <chr>
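The expansion of one reaction into many binary interactions can be sketched as follows. This is an illustrative toy under stated assumptions, not the package's implementation: the reaction, its identifiers and the helper one_direction() are hypothetical, and enzyme complexes (the ci column) are not modeled. Each reactant is linked to each enzyme and each enzyme to each product; a reversible reaction contributes a second set of rows with the two sides swapped.

```r
# Hypothetical toy reaction; identifiers are placeholders, not real records.
reaction <- list(
  ri         = 1L,
  reactants  = c("MET_A", "MET_B"),
  products   = "MET_C",
  enzymes    = c("GENE_1", "GENE_2"),
  reversible = TRUE
)

# Binary interactions for one direction of a reaction:
# reactant -> enzyme rows, followed by enzyme -> product rows.
one_direction <- function(reactants, products, enzymes, reverse) {
  to_gene <- expand.grid(
    source = reactants, target = enzymes, stringsAsFactors = FALSE
  )
  from_gene <- expand.grid(
    source = enzymes, target = products, stringsAsFactors = FALSE
  )
  out <- rbind(
    cbind(to_gene, met_to_gene = TRUE),
    cbind(from_gene, met_to_gene = FALSE)
  )
  out$reverse <- reverse
  out
}

edges <- one_direction(
  reaction$reactants, reaction$products, reaction$enzymes, reverse = FALSE
)
if (reaction$reversible) {
  # The reversed version swaps the reactant and product sides.
  edges <- rbind(
    edges,
    one_direction(
      reaction$products, reaction$reactants, reaction$enzymes, reverse = TRUE
    )
  )
}
edges$ri <- reaction$ri

nrow(edges)  # 12 binary interactions from a single reaction
```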
STITCH enzyme-metabolite interactions
STITCH is a large compendium of binary interactions between proteins and chemicals. Some of these are derived from metabolic reactions. Various attributes, such as mode of action, effect sign and scores, are assigned to each link. The datasets are available by organism, stored in “actions” and “links” tables, which are loaded by the stitch_actions() and stitch_links() functions, respectively. STITCH supports a broad variety of organisms; please refer to their website (https://stitch.embl.de/).
sta <- stitch_actions()
sta
## # A tibble: 21,773,491 × 6
## item_id_a item_id_b mode action a_is_acting score
## <chr> <chr> <chr> <fct> <lgl> <dbl>
## 1 ENSP00000170630 10461 expression NA FALSE 150
## 2 10461 ENSP00000170630 expression NA TRUE 150
## 3 ENSP00000353915 23627457 binding NA FALSE 191
## 4 23627457 ENSP00000353915 binding NA FALSE 191
## 5 ENSP00000256906 44408029 binding NA FALSE 521
## 6 44408029 ENSP00000256906 binding NA FALSE 521
## 7 ENSP00000267377 23590374 pred_bind NA FALSE 170
## 8 23590374 ENSP00000267377 pred_bind NA FALSE 170
## 9 ENSP00000267377 23590374 binding NA FALSE 159
## 10 23590374 ENSP00000267377 binding NA FALSE 159
## # ℹ 21,773,481 more rows
stl <- stitch_links()
stl
## # A tibble: 15,473,939 × 7
## chemical protein experimental prediction database textmining combined_score
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 91758680 ENSP0000… 0 0 0 278 279
## 2 91758680 ENSP0000… 0 0 0 154 154
## 3 91758408 ENSP0000… 0 0 0 225 225
## 4 91758408 ENSP0000… 0 0 0 178 178
## 5 91758408 ENSP0000… 0 0 0 225 225
## 6 91758408 ENSP0000… 0 0 0 151 151
## 7 91758408 ENSP0000… 0 0 0 162 162
## 8 91758408 ENSP0000… 0 0 0 194 194
## 9 91758408 ENSP0000… 0 0 0 169 169
## 10 91758408 ENSP0000… 0 0 0 163 163
## # ℹ 15,473,929 more rows
stitch_gem() combines the actions and links data frames, filters by confidence scores, removes the prefixes from identifiers and translates them to the desired ID types. STITCH prefixes Ensembl Protein IDs with the NCBI Taxonomy ID, and PubChem CIDs with CID plus a lowercase letter “s” or “m”, meaning stereo-specific or merged stereoisomers, respectively. These prefixes are removed by default by the stitch_remove_prefixes() function. Effect signs (1 = activation, -1 = inhibition) and combined scores are also included in the data frame. As with the Chalmers GEM, translation of chemical and protein identifiers is available. The record_id column uniquely identifies the original records. Multiple rows with the same record_id are due to one-to-many identifier translation.
st <- stitch_gem()
st
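The prefix removal and score filtering described above can be sketched in base R. This is a hedged illustration, not the code of stitch_remove_prefixes() or stitch_gem(): the records are made up, the zero padding of the CIDs and the dot after the taxonomy ID are assumptions about the raw format, and the threshold min_score is an arbitrary value chosen for the example.

```r
# Made-up records in an assumed raw STITCH style: proteins prefixed with
# "<taxonomy id>.", chemicals prefixed with "CIDs" or "CIDm".
raw <- data.frame(
  chemical = c("CIDm00010461", "CIDs23627457", "CIDm00006654"),
  protein  = c(
    "9606.ENSP00000170630",
    "9606.ENSP00000353915",
    "9606.ENSP00000256906"
  ),
  combined_score = c(150, 910, 780),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip both kinds of prefixes.
strip_prefixes <- function(x) {
  x <- sub("^CID[sm]0*", "", x)  # drop "CIDs"/"CIDm" and leading zeros
  sub("^[0-9]+\\.", "", x)       # drop the taxonomy ID and the dot
}

min_score <- 700  # arbitrary confidence threshold for this example

clean <- raw[raw$combined_score >= min_score, ]
clean$chemical <- strip_prefixes(clean$chemical)
clean$protein  <- strip_prefixes(clean$protein)
clean
```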
OmniPath signaling network
All parameters supported by the OmniPath web service (import_omnipath_interactions()) can be passed to omnipath_for_cosmos(), enabling precise control over the resources, interaction types and other options when preparing the signaling network from OmniPath. By default the “omnipath” dataset is included, which contains high-confidence, literature-curated, causal protein-protein interactions. For human, mouse and rat, orthology-translated data is retrieved from the web service, while for other organisms translation by orthologous gene pairs is carried out on the client side.
op <- omnipath_for_cosmos()
op
## # A tibble: 141,695 × 8
## source target sign record_id uniprot_source uniprot_target genesymbol_source
## <chr> <chr> <dbl> <int> <chr> <chr> <chr>
## 1 P0DP23 P48995 -1 1 P0DP23 P48995 CALM1
## 2 P0DP25 P48995 -1 2 P0DP25 P48995 CALM3
## 3 P0DP25 P48995 -1 2 P0DP25 P48995 CALM1
## 4 P0DP24 P48995 -1 3 P0DP24 P48995 CALM2
## 5 P0DP24 P48995 -1 3 P0DP24 P48995 CALM1
## 6 Q03135 P48995 1 4 Q03135 P48995 CAV1
## 7 P14416 P48995 1 5 P14416 P48995 DRD2
## 8 Q99750 P48995 -1 6 Q99750 P48995 MDFI
## 9 Q14571 P48995 1 7 Q14571 P48995 ITPR2
## 10 P29966 P48995 -1 8 P29966 P48995 MARCKS
## # ℹ 141,685 more rows
## # ℹ 1 more variable: genesymbol_target <chr>
Complete build
The complete COSMOS PKN is built by cosmos_pkn(). All the resources above can be customized by arguments passed to this function. With all downloads and processing, the build might take 30-40 minutes. Data is cached at various levels of processing, which shortens subsequent builds. With all data downloaded and HMDB ID translation data preprocessed, the build takes 3-4 minutes. The complete PKN is also saved in the cache; if it is available, loading it takes only a few seconds.
pkn <- cosmos_pkn()
pkn
## # A tibble: 323,716 × 20
## source target sign record_id resource entity_type_source entity_type_target
## <chr> <chr> <dbl> <int> <chr> <fct> <fct>
## 1 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 2 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 3 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 4 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 5 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 6 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 7 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 8 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 9 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## 10 MAM017… ENSG0… NA 1 Chalmer… metabolite protein
## # ℹ 323,706 more rows
## # ℹ 13 more variables: score_stitch <dbl>, ci_chalmers <int>,
## # comp_chalmers <fct>, reverse_chalmers <lgl>, transporter_chalmers <lgl>,
## # uniprot_source <chr>, genesymbol_source <chr>, uniprot_target <chr>,
## # genesymbol_target <chr>, hmdb_source <chr>, kegg_source <chr>,
## # hmdb_target <chr>, kegg_target <chr>
The record_id column identifies the original records within each resource. If one record_id yields multiple records in the final data frame, it is the result of one-to-many ID translation or other processing steps. Before use, it is recommended to select one pair of ID type columns (by combining the preferred ones) and apply distinct on the identifier columns and sign. After the common columns, resource-specific columns are labeled with the resource name; after these, the molecule-type- and side-specific identifier columns are named after the ID type and the side of the interaction (“source” vs. “target”).
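The recommended clean-up can be sketched on a toy data frame. The example assumes gene symbols are preferred for proteins and HMDB IDs for metabolites; the rows are made up and only mimic the column naming of the PKN. Base R's duplicated() is used here to keep the sketch dependency-free; dplyr::distinct() on the same three columns would be the equivalent in a dplyr pipeline.

```r
# Made-up rows mimicking the PKN column layout; rows 1 and 2 are
# duplicates of the kind that one-to-many ID translation produces.
pkn_toy <- data.frame(
  genesymbol_source = c("GENE_A", "GENE_A", "GENE_B", NA),
  hmdb_source       = c(NA, NA, NA, "HMDB_X"),
  genesymbol_target = c("GENE_C", "GENE_C", "GENE_C", "GENE_D"),
  hmdb_target       = rep(NA_character_, 4),
  sign              = c(-1, -1, 1, NA),
  stringsAsFactors  = FALSE
)

# Combine the preferred ID types into one pair of identifier columns:
# take the gene symbol where present, otherwise the HMDB ID.
pkn_toy$source_id <- ifelse(
  is.na(pkn_toy$genesymbol_source),
  pkn_toy$hmdb_source,
  pkn_toy$genesymbol_source
)
pkn_toy$target_id <- ifelse(
  is.na(pkn_toy$genesymbol_target),
  pkn_toy$hmdb_target,
  pkn_toy$genesymbol_target
)

# Keep one row per unique (source, target, sign) combination.
key <- pkn_toy[, c("source_id", "target_id", "sign")]
pkn_unique <- pkn_toy[!duplicated(key), ]

nrow(pkn_unique)  # 3
```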
Session information
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.12.0
## LAPACK: /usr/lib/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] Matrix_1.6-5 OmnipathR_3.13.1 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] xfun_0.42 bslib_0.6.1 lattice_0.22-5
## [4] tzdb_0.4.0 vctrs_0.6.5 tools_4.3.3
## [7] generics_0.1.3 curl_5.2.1 tibble_3.2.1
## [10] fansi_1.0.6 R.oo_1.25.0 pkgconfig_2.0.3
## [13] checkmate_2.3.1 desc_1.4.3 readxl_1.4.3
## [16] lifecycle_1.0.4 compiler_4.3.3 stringr_1.5.1
## [19] textshaping_0.3.7 progress_1.2.3 htmltools_0.5.7
## [22] sass_0.4.9 yaml_2.3.8 later_1.3.2
## [25] pillar_1.9.0 pkgdown_2.0.7 crayon_1.5.2
## [28] jquerylib_0.1.4 tidyr_1.3.1 R.utils_2.12.3
## [31] cachem_1.0.8 tidyselect_1.2.1 rvest_1.0.4
## [34] zip_2.3.0 digest_0.6.35 stringi_1.8.3
## [37] dplyr_1.1.4 purrr_1.0.2 bookdown_0.38
## [40] grid_4.3.3 fastmap_1.1.1 cli_3.6.2
## [43] logger_0.3.0 magrittr_2.0.3 XML_3.99-0.16.1
## [46] utf8_1.2.4 readr_2.1.5 withr_3.0.0
## [49] prettyunits_1.2.0 backports_1.4.1 rappdirs_0.3.3
## [52] bit64_4.0.5 lubridate_1.9.3 timechange_0.3.0
## [55] rmarkdown_2.26 httr_1.4.7 igraph_2.0.3
## [58] bit_4.0.5 R.matlab_3.7.0 cellranger_1.1.0
## [61] R.methodsS3_1.8.2 ragg_1.2.7 hms_1.1.3
## [64] memoise_2.0.1 evaluate_0.23 knitr_1.45
## [67] rlang_1.1.3 Rcpp_1.0.12 glue_1.7.0
## [70] selectr_0.4-2 BiocManager_1.30.22 xml2_1.3.6
## [73] vroom_1.6.5 jsonlite_1.8.8 R6_2.5.1
## [76] systemfonts_1.0.6 fs_1.6.3