Resource specific interaction attributes
Denes Turei
Institute for Computational Biomedicine, Heidelberg University
turei.denes@gmail.com
Source: vignettes/extra_attrs.Rmd
Abstract
OmniPath provides a broad variety of protein annotations, but
for interactions, until recently, only a standard set of essential
attributes (direction, effect, etc) and a handful of others
(e.g. DoRothEA confidence level) were available. The newly
introduced extra_attrs column consists of JSON-encoded, custom,
resource-specific attributes from network
databases. We also revised the processing of these resources to
ensure that we include as many useful attributes as possible. In
the OmnipathR package we added a few new functions to support the
processing of the JSON encoded column: to scan it for keys and
values, and to extract specific variables of interest into new
columns. We give a brief overview of these here.
Loading a network
First we retrieve the complete directed PPI network. Importantly, the
extra attributes are only included if the
fields = "extra_attrs"
argument is provided.
i <- import_post_translational_interactions(fields = 'extra_attrs')
dplyr::select(i, source_genesymbol, target_genesymbol, extra_attrs)
## # A tibble: 134,282 × 3
## source_genesymbol target_genesymbol extra_attrs
## <chr> <chr> <list>
## 1 CALM1 TRPC1 <named list [1]>
## 2 CALM3 TRPC1 <named list [1]>
## 3 CALM2 TRPC1 <named list [1]>
## 4 CAV1 TRPC1 <named list [1]>
## 5 DRD2 TRPC1 <named list [1]>
## 6 MDFI TRPC1 <named list [1]>
## 7 ITPR2 TRPC1 <named list [1]>
## 8 MARCKS TRPC1 <named list [1]>
## 9 TRPC1 GRM1 <named list [0]>
## 10 GRM1 TRPC1 <named list [1]>
## # ℹ 134,272 more rows
As seen above, the extra_attrs column is a list column. Each element
is itself a nested list containing the extra attributes from all
resources, as extracted from the JSON.
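To illustrate what such a list column looks like, here is a minimal sketch that parses a few hand-written JSON strings with jsonlite::fromJSON. The JSON strings are made up for illustration, and the parsing done inside OmnipathR may differ in its details.

```r
# Toy JSON strings imitating the raw extra_attrs field (made-up values)
raw <- c(
    '{"TRIP_method": ["Co-immunoprecipitation"]}',
    '{}',
    '{"SIGNOR_mechanism": ["phosphorylation"], "CA1_effect": ["+"]}'
)

# Parse each string into a named list: one list element per record
parsed <- lapply(raw, jsonlite::fromJSON)

# Number of top-level keys per record
lengths(parsed)

# Access one attribute of one record
parsed[[3]]$SIGNOR_mechanism
```

Records without any extra attribute end up as empty named lists, which is what the `<named list [0]>` entries in the tibble above correspond to.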
Which extra attributes are available?
Which attributes are present in the network depends only on the
interactions: if none of the interactions is from the SPIKE
database, the SPIKE_mechanism attribute won’t be present.
The names of the extra attributes consist of the name of the resource
and the name of the attribute, separated by an underscore. Note that a
few resource names (e.g. SPIKE_LC) contain underscores themselves, and
so do some attribute names. To list the extra attributes available in a
particular data frame, use the extra_attrs function:
extra_attrs(i)
## [1] "TRIP_method" "SIGNOR_mechanism"
## [3] "PhosphoSite_noref_evidence" "PhosphoPoint_category"
## [5] "PhosphoSite_evidence" "HPRD-phos_mechanism"
## [7] "Li2012_mechanism" "Li2012_route"
## [9] "SPIKE_effect" "SPIKE_mechanism"
## [11] "SPIKE_LC_effect" "SPIKE_LC_mechanism"
## [13] "CA1_effect" "CA1_type"
## [15] "Macrophage_type" "Macrophage_location"
## [17] "ACSN_effect" "Cellinker_type"
## [19] "CellChatDB_category" "talklr_putative"
## [21] "CellPhoneDB_type" "Ramilowski2015_source"
## [23] "HPMR_partner_role" "ARN_effect"
## [25] "ARN_is_direct" "ARN_is_directed"
## [27] "NRF2ome_effect" "NRF2ome_is_direct"
## [29] "NRF2ome_is_directed"
The labels listed here are the top-level keys in the lists in the
extra_attrs column. Note that the coverage of these variables
varies a lot, typically in proportion to the size of the resource.
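The idea of collecting top-level keys, and of checking their coverage, can be sketched in base R on a toy list. The `attrs` object below is a hypothetical stand-in for the extra_attrs column. It is not how the extra_attrs function is implemented.

```r
# Hypothetical stand-in for the extra_attrs list column
attrs <- list(
    list(TRIP_method = "Co-immunoprecipitation"),
    list(),
    list(SIGNOR_mechanism = "phosphorylation", CA1_effect = "+"),
    list(SIGNOR_mechanism = "binding")
)

# All top-level keys of all records, in order of first appearance
all_keys <- unlist(lapply(attrs, names))
keys <- unique(all_keys)
keys

# Coverage: the number of records carrying each key
coverage <- table(all_keys)
coverage
```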
Inspecting one attribute
The values of each extra attribute, in theory, can be arbitrarily
complex nested lists, but in reality, these are most often simple
numeric, logical or character values or vectors. To see the unique
values of one attribute use the extra_attr_values
function.
Let’s see the values of the SIGNOR_mechanism
attribute:
extra_attr_values(i, SIGNOR_mechanism)
## [1] "phosphorylation" "binding"
## [3] "dephosphorylation" "Phosphorylation"
## [5] "ubiquitination" "N/A"
## [7] "Physical Interaction" "Proteolytic Processing"
## [9] "cleavage" "Ubiquitination"
## [11] "Deubiqitination" "deubiquitination"
## [13] "relocalization" "Dephosphorylation"
## [15] "Other" "guanine nucleotide exchange factor"
## [17] "Transcription Regulation" "gtpase-activating protein"
## [19] "Indirect" ""
## [21] "Sumoylation" "sumoylation"
## [23] "palmitoylation" "Acetylation"
## [25] "acetylation" "polyubiquitination"
## [27] "Demethylation" "demethylation"
## [29] "mRNA stability" "methylation"
## [31] "Methylation" "trimethylation"
## [33] "hydroxylation" "monoubiquitination"
## [35] "Deacetylation" "deacetylation"
## [37] "Translational Regulation" "Protein Degradation"
## [39] "Glycosylation" "s-nitrosylation"
## [41] "phosphomotif_binding" "chemical activation"
## [43] "Proteolytic Cleavage" "tyrosination"
## [45] "post transcriptional regulation" "post translational modification"
## [47] "translation regulation" "carboxylation"
## [49] "neddylation" "Carboxylation"
## [51] "desumoylation" "glycosylation"
## [53] "ADP-ribosylation" "stabilization"
## [55] "catalytic activity" "deglycosylation"
## [57] "destabilization" "chemical inhibition"
## [59] "isomerization" "Neddylation"
## [61] "lipidation" "chemical modification"
## [63] "oxidation" "Alkylation"
The values are provided as they are in the original resource, including potential typos and inconsistencies; see, for example, the capitalized and lowercase forms of many values above, or the misspelled “Deubiqitination”.
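If the case differences get in the way, one simple option is to fold the values to lowercase before comparing them. The sketch below uses a handful of values copied from the output above. It is only an illustration, not a feature of OmnipathR.

```r
# A few values copied from the SIGNOR_mechanism listing above
values <- c(
    "phosphorylation", "Phosphorylation",
    "Ubiquitination", "ubiquitination",
    "Deubiqitination", "deubiquitination"
)

# Case folding merges the capitalized and lowercase variants
normalized <- unique(tolower(values))
normalized
```

Case folding does not repair typos: "deubiqitination" and "deubiquitination" remain two distinct values and would need an explicit mapping.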
Converting extra attributes to columns
To make use of the attributes, it is convenient to extract the
interesting ones into separate columns of the data frame. With the
extra_attrs_to_cols
function multiple attributes can be
converted in a single call. Custom column names can be passed by
argument names. As an example, let’s extract two attributes:
i0 <- extra_attrs_to_cols(
    i,
    si_mechanism = SIGNOR_mechanism,
    ma_mechanism = Macrophage_type,
    keep_empty = FALSE
)
dplyr::select(
    i0,
    source_genesymbol,
    target_genesymbol,
    si_mechanism,
    ma_mechanism
)
## # A tibble: 61,406 × 4
## source_genesymbol target_genesymbol si_mechanism ma_mechanism
## <chr> <chr> <list> <list>
## 1 PRKG1 TRPC3 <list [1]> <NULL>
## 2 PRKG1 TRPC7 <list [1]> <NULL>
## 3 OS9 TRPV4 <list [1]> <NULL>
## 4 PTPN1 TRPV6 <list [1]> <NULL>
## 5 RACK1 TRPM6 <list [1]> <NULL>
## 6 PRKACA MCOLN1 <list [1]> <NULL>
## 7 MAPK14 MAPKAPK2 <list [1]> <list [1]>
## 8 MAPKAPK2 HNRNPA0 <list [2]> <NULL>
## 9 MAPKAPK2 PARN <list [2]> <NULL>
## 10 JAK2 EPOR <list [2]> <NULL>
## # ℹ 61,396 more rows
Above we disabled the keep_empty option, otherwise the new columns
would have NULL values for more than half of the records, simply
because out of the roughly 134 thousand interactions in the data frame
only about 61 thousand carry an attribute from either SIGNOR or
Macrophage. The new columns are of list type, and the individual values
are character vectors. Let’s look into one value:
dplyr::pull(i0, si_mechanism)[[7]]
## [[1]]
## [1] "phosphorylation"
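Here we have a single value. Other records, such as rows 8 to 10 above, hold two values, which can happen merely because of inconsistent names in the resource (e.g. “phosphorylation” and “Phosphorylation”).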
Depending on downstream methods, atomic columns might be preferable
to lists. In this case one interaction record might yield multiple rows
in the resulting data frame, depending on the number of attribute
values it has. To have atomic columns, use the flatten option:
i1 <- extra_attrs_to_cols(
    i,
    si_mechanism = SIGNOR_mechanism,
    ma_mechanism = Macrophage_type,
    keep_empty = FALSE,
    flatten = TRUE
)
dplyr::select(
    i1,
    source_genesymbol,
    target_genesymbol,
    si_mechanism,
    ma_mechanism
)
## # A tibble: 63,434 × 4
## source_genesymbol target_genesymbol si_mechanism ma_mechanism
## <chr> <chr> <list> <list>
## 1 PRKG1 TRPC3 <chr [1]> <NULL>
## 2 PRKG1 TRPC7 <chr [1]> <NULL>
## 3 OS9 TRPV4 <chr [1]> <NULL>
## 4 PTPN1 TRPV6 <chr [1]> <NULL>
## 5 RACK1 TRPM6 <chr [1]> <NULL>
## 6 PRKACA MCOLN1 <chr [1]> <NULL>
## 7 MAPK14 MAPKAPK2 <chr [1]> <chr [1]>
## 8 MAPKAPK2 HNRNPA0 <chr [1]> <NULL>
## 9 MAPKAPK2 HNRNPA0 <chr [1]> <NULL>
## 10 MAPKAPK2 PARN <chr [1]> <NULL>
## # ℹ 63,424 more rows
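Why flattening increases the number of rows can be seen on a small base R example. The vectors below are toy data, and the code only mimics the effect of flattening. It is not the implementation behind the flatten option.

```r
# Toy records: one source per record, zero or more mechanism values each
sources <- c("MAPK14", "MAPKAPK2", "TRPC1")
mechanisms <- list(
    "phosphorylation",
    c("Phosphorylation", "phosphorylation"),
    NULL
)

# Repeat each record once per value; records without values drop out
n <- lengths(mechanisms)
flat <- data.frame(
    source = rep(sources, times = n),
    mechanism = unlist(mechanisms),
    stringsAsFactors = FALSE
)
flat
```

The second record has two values and therefore yields two rows, in the same way as the repeated MAPKAPK2 and HNRNPA0 rows in the output above.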
Filtering records based on extra attributes
Another useful application of extra attributes is filtering the
records of the interactions data frame. The
with_extra_attrs
function filters to records which have
certain extra attributes. For example, to have only interactions with
SIGNOR_mechanism
given:
nrow(with_extra_attrs(i, SIGNOR_mechanism))
## [1] 61111
This results in around 61 thousand rows. When filtering for multiple attributes, the records which have at least one of them will be selected. Adding some more attributes results in more interactions:
nrow(with_extra_attrs(i, SIGNOR_mechanism, CA1_effect, Li2012_mechanism))
## [1] 62017
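The "at least one of them" rule can be sketched in base R on a toy list. Both `attrs` and the helper `has_any` are hypothetical names introduced for this illustration. They are not part of OmnipathR.

```r
# Hypothetical stand-in for the extra_attrs list column
attrs <- list(
    list(SIGNOR_mechanism = "phosphorylation"),
    list(CA1_effect = "+"),
    list(TRIP_method = "Co-immunoprecipitation"),
    list()
)

# TRUE if the record carries at least one of the requested keys
has_any <- function(x, keys) any(keys %in% names(x))

one_key <- vapply(attrs, has_any, logical(1), keys = "SIGNOR_mechanism")
two_keys <- vapply(
    attrs, has_any, logical(1),
    keys = c("SIGNOR_mechanism", "CA1_effect")
)

# Adding keys can only keep or increase the number of matching records
sum(one_key)
sum(two_keys)
```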
It is possible to filter the records not only by the names but also by the values of the extra attributes. Let’s select the interactions which are phosphorylation according to SIGNOR:
phos <- c('phosphorylation', 'Phosphorylation')
si_phos <- filter_extra_attrs(i, SIGNOR_mechanism = phos)
dplyr::select(si_phos, source_genesymbol, target_genesymbol)
## # A tibble: 4,353 × 2
## source_genesymbol target_genesymbol
## <chr> <chr>
## 1 PRKG1 TRPC3
## 2 PRKG1 TRPC7
## 3 PRKACA MCOLN1
## 4 MAPK14 MAPKAPK2
## 5 MAPKAPK2 HNRNPA0
## 6 MAPKAPK2 PARN
## 7 JAK2 EPOR
## 8 MAPK14 ZFP36
## 9 MAPKAPK2 ZFP36
## 10 PRKAA1_PRKAA2_PRKAB1_PRKAB2_PRKAG1_PRKAG2_PRKAG3 CRTC2
## # ℹ 4,343 more rows
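Filtering by value can be mimicked on a toy list as follows. This is a base R sketch on made-up records, not the implementation of filter_extra_attrs.

```r
# Hypothetical stand-in for the extra_attrs list column
attrs <- list(
    list(SIGNOR_mechanism = "Phosphorylation"),
    list(SIGNOR_mechanism = c("binding", "phosphorylation")),
    list(SIGNOR_mechanism = "binding"),
    list(CA1_effect = "+")
)

phos <- c("phosphorylation", "Phosphorylation")

# A record matches if any of its SIGNOR_mechanism values is in `phos`;
# records lacking the key yield NULL and therefore do not match
is_phos <- vapply(
    attrs,
    function(x) any(x[["SIGNOR_mechanism"]] %in% phos),
    logical(1)
)
is_phos
```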
Example: finding ubiquitination interactions
First let’s search for the word “ubiquitination” in the attributes. Below is a slow but simple solution:
keys <- extra_attrs(i)
keys_ubi <- purrr::keep(
    keys,
    function(k){
        any(stringr::str_detect(extra_attr_values(i, !!k), 'biqu'))
    }
)
keys_ubi
## [1] "SIGNOR_mechanism" "HPRD-phos_mechanism" "SPIKE_mechanism"
## [4] "SPIKE_LC_mechanism" "CA1_type" "Macrophage_type"
We found six attributes that have at least one value which matches “biqu”. Next, take a look at their values:
ubi <- rlang::set_names(
    purrr::map(
        keys_ubi,
        function(k){
            stringr::str_subset(extra_attr_values(i, !!k), 'biqu')
        }
    ),
    keys_ubi
)
ubi
## $SIGNOR_mechanism
## [1] "ubiquitination" "Ubiquitination" "deubiquitination"
## [4] "polyubiquitination" "monoubiquitination"
##
## $`HPRD-phos_mechanism`
## [1] "Ubiquitination"
##
## $SPIKE_mechanism
## [1] "Ubiquitination" "Polyubiquitination"
##
## $SPIKE_LC_mechanism
## [1] "Ubiquitination" "Polyubiquitination"
##
## $CA1_type
## [1] "Ubiquitination"
##
## $Macrophage_type
## [1] "Ubiquitination"
To match all ubiquitination interactions, it’s enough to filter for “ubiquitination” in its lowercase and capitalized forms (note that we could also include deubiquitination and polyubiquitination):
ubi_kws <- c('ubiquitination', 'Ubiquitination')
i_ubi <-
    dplyr::distinct(
        dplyr::bind_rows(
            purrr::map(
                keys_ubi,
                function(k){
                    filter_extra_attrs(i, !!k := ubi_kws, na_ok = FALSE)
                }
            )
        )
    )
dplyr::select(i_ubi, source_genesymbol, target_genesymbol)
## # A tibble: 49,308 × 2
## source_genesymbol target_genesymbol
## <chr> <chr>
## 1 NUMB NOTCH1
## 2 BTRC_CUL1_SKP1 PER2
## 3 PRKN RANBP2
## 4 PRKN SNCA
## 5 FBXW7 MYC
## 6 UBE2T FANCL
## 7 BIRC2 TRAF2
## 8 TRAF2 MAP3K14
## 9 TRAF6 MAP3K7
## 10 BTRC_CUL1_SKP1 WEE1
## # ℹ 49,298 more rows
We found 49,308 ubiquitination interactions. We had to use map,
bind_rows and distinct because otherwise filter_extra_attrs would
return the intersection of the matches instead of their union.
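The difference between the union and the intersection of per-attribute matches can be shown on a toy data frame in base R. The table and its values are made up, and the attributes are plain columns here rather than a nested list column.

```r
# Toy table with two attributes already extracted into columns
toy <- data.frame(
    id = 1:4,
    SIGNOR_mechanism = c("ubiquitination", NA, "binding", "Ubiquitination"),
    SPIKE_mechanism = c(NA, "Ubiquitination", NA, "Ubiquitination"),
    stringsAsFactors = FALSE
)
kws <- c("ubiquitination", "Ubiquitination")

# Filter once per attribute, then stack the results and drop duplicates
per_key <- lapply(
    c("SIGNOR_mechanism", "SPIKE_mechanism"),
    function(k) toy[toy[[k]] %in% kws, ]
)
union_hits <- unique(do.call(rbind, per_key))

# Requiring all conditions at once keeps only the intersection
both <- toy[
    toy$SIGNOR_mechanism %in% kws & toy$SPIKE_mechanism %in% kws,
]

nrow(union_hits)
nrow(both)
```

Record 4 matches under both attributes, so it appears twice after stacking; removing duplicates keeps it once.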
In this data frame we have 365 unique ubiquitin E3 ligases:
## [1] 365
UniProt annotates E3 ligases by the “Ubl conjugation” keyword. We can check how many of those 365 proteins have this annotation:
uniprot_kws <- import_omnipath_annotations(
    resources = 'UniProt_keyword',
    entity_type = 'protein',
    wide = TRUE
)
e3_ligases <- dplyr::pull(
    dplyr::filter(uniprot_kws, keyword == 'Ubl conjugation'),
    genesymbol
)
length(e3_ligases)
## [1] 2542
## [1] 106
## [1] 259
We retrieved 2542 E3 ligases from UniProt. 106 of these have substrates in the interaction database, while 259 of the effectors of the interactions are not annotated in UniProt.
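Counts of this kind can be obtained with plain set operations. The sketch below uses placeholder names instead of real gene symbols. It is an assumption about how such numbers might be computed, not the code that produced the output above.

```r
# Placeholder identifiers, not real annotation data
effectors <- c("ENZ1", "ENZ2", "ENZ3", "ENZ4", "ENZ2")
annotated <- c("ENZ1", "ENZ2", "ENZ9")

# Unique effectors in the toy network
unique_effectors <- unique(effectors)

# Effectors which carry the annotation, and those which do not
in_both <- intersect(unique_effectors, annotated)
not_annotated <- setdiff(unique_effectors, annotated)

length(unique_effectors)
length(in_both)
length(not_annotated)
```

The two subsets partition the unique effectors, so their sizes add up to the total, just as 106 + 259 = 365 in the output above.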
In the OmniPath enzyme-substrate database we collect ubiquitination interactions from enzyme-PTM resources. However, these contain only a small number of interactions:
es_ubi <- import_omnipath_enzsub(types = 'ubiquitination')
es_ubi
## # A tibble: 70 × 12
## enzyme substrate enzyme_genesymbol substrate_genesymbol residue_type
## <chr> <chr> <chr> <chr> <chr>
## 1 Q12933 Q13546 TRAF2 RIPK1 K
## 2 Q8IUD6 O95786 RNF135 RIGI K
## 3 Q8IUD6 O95786 RNF135 RIGI K
## 4 P60604 Q92813 UBE2G2 DIO2 K
## 5 P60604 Q92813 UBE2G2 DIO2 K
## 6 Q13489 Q13546 BIRC3 RIPK1 K
## 7 Q96J02 Q7Z434 ITCH MAVS K
## 8 Q96J02 Q7Z434 ITCH MAVS K
## 9 Q66K89 P04637 E4F1 TP53 K
## 10 Q66K89 P04637 E4F1 TP53 K
## # ℹ 60 more rows
## # ℹ 7 more variables: residue_offset <dbl>, modification <chr>, sources <chr>,
## # references <chr>, curation_effort <dbl>, n_references <int>,
## # n_resources <int>
With only two exceptions, all of these have been recovered by using the extra attributes from the network database:
es_i_ubi <-
    dplyr::inner_join(
        es_ubi,
        i_ubi,
        by = c(
            'enzyme_genesymbol' = 'source_genesymbol',
            'substrate_genesymbol' = 'target_genesymbol'
        )
    )
nrow(dplyr::distinct(dplyr::select(es_i_ubi, enzyme, substrate, residue_offset)))
## [1] 57
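The same kind of comparison can be tried on toy tables using base R's merge, which behaves like an inner join by default. All identifiers and residue offsets below are placeholders invented for this sketch.

```r
# Toy enzyme-substrate records (placeholder names and offsets)
es <- data.frame(
    enzyme_genesymbol = c("ENZ1", "ENZ2", "ENZ2", "ENZ3"),
    substrate_genesymbol = c("SUB1", "SUB2", "SUB2", "SUB3"),
    residue_offset = c(10, 20, 30, 40),
    stringsAsFactors = FALSE
)

# Toy network interactions
net <- data.frame(
    source_genesymbol = c("ENZ1", "ENZ2"),
    target_genesymbol = c("SUB1", "SUB2"),
    stringsAsFactors = FALSE
)

# Inner join on the enzyme-substrate pair
joined <- merge(
    es, net,
    by.x = c("enzyme_genesymbol", "substrate_genesymbol"),
    by.y = c("source_genesymbol", "target_genesymbol")
)

key_cols <- c("enzyme_genesymbol", "substrate_genesymbol", "residue_offset")
recovered <- nrow(unique(joined[, key_cols]))
total <- nrow(unique(es[, key_cols]))

# Enzyme-substrate records without a matching network interaction
total - recovered
```

Comparing the distinct count after the join with the distinct count before it gives the number of records that were not recovered.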
Session information
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Arch Linux
##
## Matrix products: default
## BLAS: /usr/lib/libblas.so.3.12.0
## LAPACK: /usr/lib/liblapack.so.3.12.0
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] OmnipathR_3.11.10 BiocStyle_2.30.0
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.1 rappdirs_0.3.3 sass_0.4.9
## [4] utf8_1.2.4 generics_0.1.3 xml2_1.3.6
## [7] stringi_1.8.3 hms_1.1.3 digest_0.6.35
## [10] magrittr_2.0.3 evaluate_0.23 timechange_0.3.0
## [13] bookdown_0.38 fastmap_1.1.1 cellranger_1.1.0
## [16] jsonlite_1.8.8 progress_1.2.3 backports_1.4.1
## [19] BiocManager_1.30.22 rvest_1.0.4 httr_1.4.7
## [22] selectr_0.4-2 purrr_1.0.2 fansi_1.0.6
## [25] textshaping_0.3.7 jquerylib_0.1.4 cli_3.6.2
## [28] rlang_1.1.3 crayon_1.5.2 bit64_4.0.5
## [31] withr_3.0.0 cachem_1.0.8 yaml_2.3.8
## [34] parallel_4.3.2 tools_4.3.2 tzdb_0.4.0
## [37] memoise_2.0.1 checkmate_2.3.1 dplyr_1.1.4
## [40] curl_5.2.1 vctrs_0.6.5 logger_0.3.0
## [43] R6_2.5.1 lifecycle_1.0.4 lubridate_1.9.3
## [46] stringr_1.5.1 bit_4.0.5 fs_1.6.3
## [49] vroom_1.6.5 ragg_1.2.7 pkgconfig_2.0.3
## [52] desc_1.4.3 pkgdown_2.0.7 pillar_1.9.0
## [55] bslib_0.6.1 later_1.3.2 glue_1.7.0
## [58] Rcpp_1.0.12 systemfonts_1.0.6 xfun_0.42
## [61] tibble_3.2.1 tidyselect_1.2.1 knitr_1.45
## [64] htmltools_0.5.7 igraph_2.0.3 rmarkdown_2.26
## [67] readr_2.1.5 compiler_4.3.2 prettyunits_1.2.0
## [70] readxl_1.4.3