Translate gene, protein and small molecule identifiers from multiple columns

When translating network interactions, where two ID columns exist (source and target), it is convenient to apply the same ID translation to several columns at once. The translate_ids function can already translate to multiple ID types in one call, but it works from only one source column. translate_ids_multi accepts several source columns, and multiple target IDs are supported here too.

The source columns can be listed explicitly, or they can share a common stem. In the latter case the first element of ... is used as the stem, and the column names are created by appending the suffixes. The suffixes are also used to name the target columns. If no suffixes are provided, the names of the source columns are appended to the names of the target columns. ID types can be defined the same way as for translate_ids. The only limitation is that source columns provided as stem plus suffixes must all be of the same ID type.
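The stem-and-suffix naming can be illustrated in base R. This is only a sketch of the naming convention described above, not the package's internal code; the stem `id`, the suffixes and the target name `genesymbol` are made-up values.

```r
# Hypothetical values, chosen only for illustration
stem <- "id"
suffixes <- c("source", "target")
suffix_sep <- "_"

# Source columns: stem + separator + suffix
source_cols <- paste(stem, suffixes, sep = suffix_sep)
print(source_cols)
#> [1] "id_source" "id_target"

# Target columns reuse the same suffixes
target_cols <- paste("genesymbol", suffixes, sep = suffix_sep)
print(target_cols)
#> [1] "genesymbol_source" "genesymbol_target"
```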

Usage

translate_ids_multi(
  d,
  ...,
  suffixes = NULL,
  suffix_sep = "_",
  uploadlists = FALSE,
  ensembl = FALSE,
  hmdb = FALSE,
  chalmers = FALSE,
  entity_type = NULL,
  keep_untranslated = TRUE,
  organism = 9606,
  reviewed = TRUE
)

Arguments

d: A data frame.
...: At least two arguments, with or without names. These arguments describe identifier columns, either the ones we translate from (source) or the ones we translate to (target). Columns that exist in the data frame are used as source columns; all the rest are considered target columns. Alternatively, the source columns can be defined by a stem and a vector of suffixes, plus a separator between the stem and the suffix. In this case, the source columns are the ones that exist in the data frame once the suffixes are added to the stem. The values of all these arguments must be valid identifier types as listed at translate_ids. If an ID type is provided only for the first source column, the rest of the source columns are assumed to have the same ID type. For the target identifiers, new columns are created with the desired names, with the suffixes added. If no suffixes are provided, the names of the source columns are used instead.
suffixes: Column name suffixes in case the names should be composed of stem and suffix.
suffix_sep: Character: separator between the stem and suffixes.
uploadlists: Force using the `uploadlists` service from UniProt. By default the plain query interface is used (implemented in uniprot_full_id_mapping_table in this package). If any of the provided ID types is available only in the uploadlists service, that service is selected automatically. The plain query interface is preferred because in the long term, with caching, it requires less download and data storage.
ensembl: Logical: use data from Ensembl BioMart instead of UniProt.
hmdb: Logical: use HMDB ID translation data.
chalmers: Logical: use ID translation data from Chalmers Sysbio GEM.
entity_type: Character: "gene" and "smol" are short symbols for genes or proteins, and for small molecules, respectively. Several other synonyms are also accepted.
keep_untranslated: In case the output is a data frame, keep the records where the source identifier could not be translated. In these records the target identifier will be NA.
organism: Character or integer, name or NCBI Taxonomy ID of the organism (by default 9606 for human). Matters only if uploadlists is FALSE.
reviewed: Translate only reviewed (TRUE), only unreviewed (FALSE) or both (NULL) UniProt records. Matters only if uploadlists is FALSE.
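As a rough illustration of how the arguments fit together, the sketch below calls the function on a small hand-made data frame, using the same call form as the documented example. The data frame `d` and its contents are made up for this sketch; the call assumes that OmnipathR is installed and that network access is available, and it is skipped if the package is missing.

```r
# A made-up two-column data frame of UniProt accessions
d <- data.frame(
  source = c("P00533", "P01133"),
  target = c("P04626", "P00533")
)

# Requires OmnipathR and network access; skipped if the package is absent
if (requireNamespace("OmnipathR", quietly = TRUE)) {
  translated <- OmnipathR::translate_ids_multi(
    d,
    source = uniprot,  # ID type of the first source column
    target,            # second source column, assumed to share that ID type
    genesymbol         # not a column of d, so treated as a target ID type
  )
  print(colnames(translated))
}
```

Going by the naming rule described for `...`, the new columns would be expected to be called genesymbol_source and genesymbol_target.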

Value

A data frame with all source columns translated to all target identifiers. The number of new columns is the product of the number of source columns and the number of target identifiers. The target columns are distinguished by the suffixes added to their names.
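The column count can be checked with a small base R sketch. The target ID types below are chosen only for illustration, and the ordering of the generated names is arbitrary here; the order of the columns in the actual output is not specified by this sketch.

```r
# Two source columns and two hypothetical target ID types
source_cols <- c("source", "target")
target_ids <- c("ensp", "genesymbol")

# Every target ID type is combined with every source column
new_cols <- as.vector(outer(target_ids, source_cols, paste, sep = "_"))
print(new_cols)
#> [1] "ensp_source"       "genesymbol_source" "ensp_target"
#> [4] "genesymbol_target"

# Number of new columns = number of sources x number of targets
length(new_cols) == length(source_cols) * length(target_ids)
#> [1] TRUE
```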

Examples

ia <- omnipath()
translate_ids_multi(ia, source = uniprot, target, ensp, ensembl = TRUE)
#> # A tibble: 560,056 × 17
#>    source target source_genesymbol target_genesymbol is_directed is_stimulation
#>    <chr>  <chr>  <chr>             <chr>                   <dbl>          <dbl>
#>  1 P0DP23 P48995 CALM1             TRPC1                       1              0
#>  2 P0DP23 P48995 CALM1             TRPC1                       1              0
#>  3 P0DP25 P48995 CALM3             TRPC1                       1              0
#>  4 P0DP25 P48995 CALM3             TRPC1                       1              0
#>  5 P0DP25 P48995 CALM3             TRPC1                       1              0
#>  6 P0DP25 P48995 CALM3             TRPC1                       1              0
#>  7 P0DP24 P48995 CALM2             TRPC1                       1              0
#>  8 P0DP24 P48995 CALM2             TRPC1                       1              0
#>  9 Q03135 P48995 CAV1              TRPC1                       1              1
#> 10 Q03135 P48995 CAV1              TRPC1                       1              1
#> # ℹ 560,046 more rows
#> # ℹ 11 more variables: is_inhibition <dbl>, consensus_direction <dbl>,
#> #   consensus_stimulation <dbl>, consensus_inhibition <dbl>, sources <chr>,
#> #   references <chr>, curation_effort <dbl>, n_references <int>,
#> #   n_resources <int>, ensp_source <chr>, ensp_target <chr>
