Building prior knowledge network (PKN) for COSMOS

Abstract

The prior knowledge network (PKN) used by COSMOS is a network of heterogeneous causal interactions: it contains protein-protein, reactant-enzyme and enzyme-product interactions. It is a combination of multiple resources. Here we present the functions that load each component, the options for customization, and the functions to build the complete PKN.

Introduction

The COSMOS PKN is a combination of the following datasets:

Genome-scale metabolic model (GEM) from Chalmers Sysbio (Wang et al., 2021)
Network of chemical-protein interactions from STITCH (http://stitch.embl.de/)
Protein-protein interactions from Omnipath (Türei et al., 2021)

The PKN can be built by calling the cosmos_pkn() function, which we present in the last section. First, let’s take a closer look at each resource.

library(OmnipathR)

Chalmers Sysbio GEM

The Chalmers Sysbio group provides genome scale models of metabolism (GEMs) for various organisms: human, mouse, rat, fish, fly and worm. These models are available as Matlab files, which contain reaction data, stoichiometry matrix and identifier translation data.

The raw contents of the Matlab file can be loaded by chalmers_gem_raw(). This results in a convoluted structure of nested lists and arrays:

ch_raw <- chalmers_gem_raw()

Another function, chalmers_gem(), processes the above structure into a data frame of reactions, also keeping the stoichiometry matrix and including the identifier translation data:

ch_processed <- chalmers_gem()

## Loading required package: Matrix

names(ch_processed)

## [1] "organism"       "reactions"      "metabolites"    "reaction_ids"  
## [5] "metabolite_ids" "S"

The identifier translation tables are available by separate functions, chalmers_gem_metabolites() and chalmers_gem_reactions() for metabolite and reaction (and enzyme) identifiers, respectively. These return simple data frames.

ch_met <- chalmers_gem_metabolites()

ch_met

## # A tibble: 8,456 × 14
##    mets      metsNoComp metBiGGID metKEGGID metHMDBID   metChEBIID  metPubChemID
##    <chr>     <chr>      <chr>     <chr>     <chr>       <chr>              <dbl>
##  1 MAM00001c MAM00001   carveol   C00964    NA          CHEBI:15389           NA
##  2 MAM00001e MAM00001   carveol   C00964    NA          CHEBI:15389           NA
##  3 MAM00002c MAM00002   appnn     C09880    HMDB0006525 CHEBI:36740         6654
##  4 MAM00002e MAM00002   appnn     C09880    HMDB0006525 CHEBI:36740         6654
##  5 MAM00003c MAM00003   NA        NA        NA          CHEBI:78990           NA
##  6 MAM00003l MAM00003   NA        NA        NA          CHEBI:78990           NA
##  7 MAM00003r MAM00003   NA        NA        NA          CHEBI:78990           NA
##  8 MAM00003e MAM00003   NA        NA        NA          CHEBI:78990           NA
##  9 MAM00004c MAM00004   NA        NA        NA          NA                    NA
## 10 MAM00004m MAM00004   NA        NA        NA          NA                    NA
## # ℹ 8,446 more rows
## # ℹ 7 more variables: metLipidMapsID <chr>, metEHMNID <chr>,
## #   metHepatoNET1ID <chr>, metRecon3DID <chr>, metMetaNetXID <chr>,
## #   metHMR2ID <chr>, metRetired <chr>

The metabolite identifier translation available here is also integrated into the package’s translation service, accessible through translate_ids() and other functions.

translate_ids('MAM00001', 'metabolicatlas', 'recon3d', chalmers = TRUE)

## [1] "carveol"

Finally, the chalmers_gem_network() function uses all the above data to compile a network of binary chemical-protein interactions. By default, Metabolic Atlas identifiers are used for the metabolites and Ensembl Gene IDs for the enzymes. These can be translated to the desired identifiers using the metabolite_ids and protein_ids arguments; translation to multiple identifiers is possible.

The ri (or record_id) column, in the case of the Chalmers GEM, represents the reaction ID, a unique identifier of the original reaction. One reaction yields many binary interactions because it involves a number of gene products, reactants and products. The ci column means “complex ID”: it uniquely identifies a group of enzymes that are required together to carry out the reaction. The reverse column indicates whether the row is derived from the reversed version of a reversible reaction. The transporter column flags reactions where the same metabolite occurs on both the reactant and the product side; these are assumed to be transport reactions. In the Chalmers GEM, reactions are also assigned to compartments, encoded by single-letter codes in the comp column. In the original data the compartment codes are suffixes of the metabolite IDs; here they are moved into a separate column, leaving the Metabolic Atlas IDs clean and usable.

ch <- chalmers_gem_network()

ch

## # A tibble: 135,769 × 16
##       ri    ci source   target          reverse met_to_gene comp  transporter
##    <int> <int> <chr>    <chr>           <lgl>   <lgl>       <fct> <lgl>      
##  1     1     1 MAM01796 ENSG00000147576 FALSE   TRUE        c     NA         
##  2     1     1 MAM01796 ENSG00000147576 FALSE   TRUE        c     NA         
##  3     1     2 MAM01796 ENSG00000172955 FALSE   TRUE        c     NA         
##  4     1     2 MAM01796 ENSG00000172955 FALSE   TRUE        c     NA         
##  5     1     3 MAM01796 ENSG00000180011 FALSE   TRUE        c     NA         
##  6     1     3 MAM01796 ENSG00000180011 FALSE   TRUE        c     NA         
##  7     1     4 MAM01796 ENSG00000187758 FALSE   TRUE        c     NA         
##  8     1     5 MAM01796 ENSG00000196344 FALSE   TRUE        c     NA         
##  9     1     5 MAM01796 ENSG00000196344 FALSE   TRUE        c     NA         
## 10     1     6 MAM01796 ENSG00000196616 FALSE   TRUE        c     NA         
## # ℹ 135,759 more rows
## # ℹ 8 more variables: uniprot_source <chr>, genesymbol_source <chr>,
## #   uniprot_target <chr>, genesymbol_target <chr>, hmdb_source <chr>,
## #   kegg_source <chr>, hmdb_target <chr>, kegg_target <chr>
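
To illustrate how a single reaction can give rise to several binary interactions, here is a minimal base R sketch on a toy reaction. It is not the implementation of chalmers_gem_network(): the reaction, the enzyme IDs and the helper expand_reaction() are hypothetical, and reading met_to_gene = TRUE as "metabolite is the source" is an assumption made for this example.

```r
# Toy reaction with invented IDs (not taken from the Chalmers GEM)
reaction <- list(
    id = 1L,
    reactants = c("MAM00001c", "MAM00002c"),
    products = c("MAM00003c"),
    enzymes = c("ENSG_A", "ENSG_B")
)

# Hypothetical helper: expand one reaction into binary interactions
expand_reaction <- function(rxn) {
    # Every reactant is paired with every enzyme
    r2e <- expand.grid(
        source = rxn$reactants,
        target = rxn$enzymes,
        stringsAsFactors = FALSE
    )
    r2e$met_to_gene <- TRUE
    # Every enzyme is paired with every product
    e2p <- expand.grid(
        source = rxn$enzymes,
        target = rxn$products,
        stringsAsFactors = FALSE
    )
    e2p$met_to_gene <- FALSE
    edges <- rbind(r2e, e2p)
    edges$ri <- rxn$id
    # The compartment code is the last letter of the metabolite ID
    met <- ifelse(edges$met_to_gene, edges$source, edges$target)
    edges$comp <- sub("^.*([a-z])$", "\\1", met)
    # Strip the compartment suffix from metabolite IDs only
    strip <- function(x) sub("^(MAM[0-9]+)[a-z]$", "\\1", x)
    edges$source <- strip(edges$source)
    edges$target <- strip(edges$target)
    edges
}

edges <- expand_reaction(reaction)
print(edges)
```

With two reactants, two enzymes and one product, the toy reaction expands to four reactant-enzyme rows plus two enzyme-product rows, all sharing the same ri.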

STITCH enzyme-metabolite interactions

STITCH is a large compendium of binary interactions between proteins and chemicals. Some of these are derived from metabolic reactions. Various attributes such as mode of action, effect sign and scores are assigned to each link. The datasets are available by organism, stored in “actions” and “links” tables, which can be loaded by the stitch_actions() and stitch_links() functions, respectively. STITCH supports a broad variety of organisms; for details, please refer to its website (https://stitch.embl.de/).

sta <- stitch_actions()

sta

## # A tibble: 21,773,491 × 6
##    item_id_a       item_id_b       mode       action a_is_acting score
##    <chr>           <chr>           <chr>      <fct>  <lgl>       <dbl>
##  1 ENSP00000170630 10461           expression NA     FALSE         150
##  2 10461           ENSP00000170630 expression NA     TRUE          150
##  3 ENSP00000353915 23627457        binding    NA     FALSE         191
##  4 23627457        ENSP00000353915 binding    NA     FALSE         191
##  5 ENSP00000256906 44408029        binding    NA     FALSE         521
##  6 44408029        ENSP00000256906 binding    NA     FALSE         521
##  7 ENSP00000267377 23590374        pred_bind  NA     FALSE         170
##  8 23590374        ENSP00000267377 pred_bind  NA     FALSE         170
##  9 ENSP00000267377 23590374        binding    NA     FALSE         159
## 10 23590374        ENSP00000267377 binding    NA     FALSE         159
## # ℹ 21,773,481 more rows

stl <- stitch_links()

stl

## # A tibble: 15,473,939 × 7
##    chemical protein   experimental prediction database textmining combined_score
##    <chr>    <chr>            <dbl>      <dbl>    <dbl>      <dbl>          <dbl>
##  1 91758680 ENSP0000…            0          0        0        278            279
##  2 91758680 ENSP0000…            0          0        0        154            154
##  3 91758408 ENSP0000…            0          0        0        225            225
##  4 91758408 ENSP0000…            0          0        0        178            178
##  5 91758408 ENSP0000…            0          0        0        225            225
##  6 91758408 ENSP0000…            0          0        0        151            151
##  7 91758408 ENSP0000…            0          0        0        162            162
##  8 91758408 ENSP0000…            0          0        0        194            194
##  9 91758408 ENSP0000…            0          0        0        169            169
## 10 91758408 ENSP0000…            0          0        0        163            163
## # ℹ 15,473,929 more rows

stitch_gem() combines the actions and links data frames, filters by confidence scores, removes the prefixes from identifiers and translates them to the desired ID types. STITCH prefixes Ensembl Protein IDs with the NCBI Taxonomy ID, and PubChem CIDs with “CID” plus a lowercase letter “s” or “m”, meaning stereo-specific or merged stereoisomers, respectively. These prefixes are removed by default by the stitch_remove_prefixes() function. Effect signs (1 = activation, -1 = inhibition) and combined scores are also included in the data frame. As with the Chalmers GEM, translation of chemical and protein identifiers is available. The record_id column uniquely identifies the original records; multiple rows with the same record_id result from one-to-many identifier translation.

st <- stitch_gem()

st
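
The prefix removal and score filtering described above can be sketched in base R on a few invented records. This is only an illustration, not the code behind stitch_remove_prefixes() or stitch_gem(): the raw ID layout (zero-padded CIDs, taxonomy ID followed by a dot), the helper strip_prefixes() and the threshold of 200 are assumptions chosen for the example.

```r
# Invented records in an assumed raw STITCH-like layout
raw <- data.frame(
    chemical = c("CIDs00010461", "CIDm23627457", "CIDm00006654"),
    protein = c(
        "9606.ENSP00000170630",
        "9606.ENSP00000353915",
        "9606.ENSP00000256906"
    ),
    combined_score = c(150, 191, 521),
    stringsAsFactors = FALSE
)

# Hypothetical helper: drop the "CIDs"/"CIDm" prefix (and zero padding)
# from chemicals and the "<taxonomy id>." prefix from proteins
strip_prefixes <- function(d) {
    d$chemical <- sub("^CID[sm]0*", "", d$chemical)
    d$protein <- sub("^[0-9]+\\.", "", d$protein)
    d
}

# Arbitrary confidence threshold, used here only for illustration
min_score <- 200

stripped <- strip_prefixes(raw)
clean <- stripped[stripped$combined_score >= min_score, ]
print(clean)
```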

OmniPath signaling network

All parameters supported by the OmniPath web service (import_omnipath_interactions()) can be passed to omnipath_for_cosmos(), enabling precise control over the resources, interaction types and other options when preparing the signaling network from OmniPath. By default, the “omnipath” dataset is included, which contains high-confidence, literature-curated, causal protein-protein interactions. For human, mouse and rat, orthology-translated data is retrieved from the web service, while for other organisms translation by orthologous gene pairs is carried out on the client side.

op <- omnipath_for_cosmos()

op

## # A tibble: 141,695 × 8
##    source target  sign record_id uniprot_source uniprot_target genesymbol_source
##    <chr>  <chr>  <dbl>     <int> <chr>          <chr>          <chr>            
##  1 P0DP23 P48995    -1         1 P0DP23         P48995         CALM1            
##  2 P0DP25 P48995    -1         2 P0DP25         P48995         CALM3            
##  3 P0DP25 P48995    -1         2 P0DP25         P48995         CALM1            
##  4 P0DP24 P48995    -1         3 P0DP24         P48995         CALM2            
##  5 P0DP24 P48995    -1         3 P0DP24         P48995         CALM1            
##  6 Q03135 P48995     1         4 Q03135         P48995         CAV1             
##  7 P14416 P48995     1         5 P14416         P48995         DRD2             
##  8 Q99750 P48995    -1         6 Q99750         P48995         MDFI             
##  9 Q14571 P48995     1         7 Q14571         P48995         ITPR2            
## 10 P29966 P48995    -1         8 P29966         P48995         MARCKS           
## # ℹ 141,685 more rows
## # ℹ 1 more variable: genesymbol_target <chr>

Complete build

Building the complete COSMOS PKN is done by cosmos_pkn(). All the resources above can be customized by arguments passed to this function. With all downloads and processing, the build might take 30-40 minutes. Data is cached at various levels of processing, which shortens subsequent builds. With all data downloaded and the HMDB ID translation data preprocessed, the build takes 3-4 minutes. The complete PKN is also saved in the cache; if it is available there, loading it takes only a few seconds.

pkn <- cosmos_pkn()

pkn

## # A tibble: 323,716 × 20
##    source  target  sign record_id resource entity_type_source entity_type_target
##    <chr>   <chr>  <dbl>     <int> <chr>    <fct>              <fct>             
##  1 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  2 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  3 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  4 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  5 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  6 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  7 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  8 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
##  9 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
## 10 MAM017… ENSG0…    NA         1 Chalmer… metabolite         protein           
## # ℹ 323,706 more rows
## # ℹ 13 more variables: score_stitch <dbl>, ci_chalmers <int>,
## #   comp_chalmers <fct>, reverse_chalmers <lgl>, transporter_chalmers <lgl>,
## #   uniprot_source <chr>, genesymbol_source <chr>, uniprot_target <chr>,
## #   genesymbol_target <chr>, hmdb_source <chr>, kegg_source <chr>,
## #   hmdb_target <chr>, kegg_target <chr>

The record_id column identifies the original records within each resource. If one record_id yields multiple records in the final data frame, it is the result of one-to-many ID translation or other processing steps. Before use, it is recommended to select one pair of ID type columns (by combining the preferred ones) and to keep only distinct rows with respect to the identifier columns and sign. After the common columns, resource-specific columns are labeled with the resource name; after these, the molecule type and side-specific identifier columns are named after the ID type and the side of the interaction (“source” vs. “target”).
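
One possible way to carry out this post-processing is sketched below in base R. The toy data frame only mimics a few PKN columns with invented values, and the choice of HMDB IDs for metabolites and gene symbols for proteins, as well as the helper first_non_na(), are assumptions made for the example; the same steps could equally be written with dplyr.

```r
# Toy data mimicking a few columns of the PKN (all values invented)
pkn_toy <- data.frame(
    source = c("MAM01796", "MAM01796", "P0DP25", "P0DP25"),
    target = c("ENSG_X", "ENSG_X", "P48995", "P48995"),
    sign = c(NA, NA, -1, -1),
    hmdb_source = c("HMDB0000001", "HMDB0000001", NA, NA),
    genesymbol_source = c(NA, NA, "CALM3", "CALM1"),
    genesymbol_target = c("GENE_X", "GENE_X", "GENE_T", "GENE_T"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: take the first non-missing value across columns
first_non_na <- function(...) {
    cols <- list(...)
    out <- cols[[1]]
    for (col in cols[-1]) {
        out <- ifelse(is.na(out), col, out)
    }
    out
}

# Combine the preferred ID columns into one source/target pair
edges <- data.frame(
    source = first_non_na(pkn_toy$hmdb_source, pkn_toy$genesymbol_source),
    target = pkn_toy$genesymbol_target,
    sign = pkn_toy$sign,
    stringsAsFactors = FALSE
)

# Keep only distinct combinations of source, target and sign
edges <- unique(edges)
print(edges)
```

Here the two identical metabolite rows collapse into one, while the two rows produced by one-to-many gene symbol translation remain separate because their source identifiers differ.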

Session information

sessionInfo()