Translates a vector of identifiers, resulting in a new vector, or a column of identifiers in a data frame, creating another column with the target identifiers.
Usage
translate_ids(
  d,
  ...,
  uploadlists = FALSE,
  ensembl = FALSE,
  hmdb = FALSE,
  ramp = FALSE,
  chalmers = FALSE,
  entity_type = NULL,
  keep_untranslated = TRUE,
  return_df = FALSE,
  organism = 9606,
  reviewed = TRUE,
  complexes = NULL,
  complexes_one_to_many = NULL,
  track = FALSE,
  quantify_ambiguity = FALSE,
  qualify_ambiguity = FALSE,
  ambiguity_groups = NULL,
  ambiguity_global = FALSE,
  ambiguity_summary = FALSE,
  expand = TRUE
)
Arguments
- d
- Character vector or data frame. 
- ...
- At least two arguments, with or without names. The first of these arguments describes the source identifier, the rest describe the target identifier(s). The values of all these arguments must be valid identifier types as shown in Details. The names of the arguments are column names. For the first (source) ID, the column must already exist. For the rest of the IDs, new columns will be created with the desired names. For ID types provided as arguments without names, the name of the ID type will be used as the column name. 
- uploadlists
- Force using the uploadlists service from UniProt. By default the plain query interface is used (implemented in uniprot_full_id_mapping_table in this package). If any of the provided ID types is only available in the uploadlists service, it will be automatically selected. The plain query interface is preferred because in the long term, with caching, it requires less download and data storage.
- ensembl
- Logical: use data from Ensembl BioMart instead of UniProt. 
- hmdb
- Logical: use HMDB ID translation data. 
- ramp
- Logical: use RaMP ID translation data. 
- chalmers
- Logical: use ID translation data from Chalmers Sysbio GEM. 
- entity_type
- Character: the type of the entities to translate (proteins, genes or small molecules). Short symbols such as "gene" and "smol" are accepted, along with several other synonyms. 
- keep_untranslated
- If the output is a data frame, keep the records where the source identifier could not be translated. In these records the target identifier will be NA. 
- return_df
- Return a data frame even if the input is a vector. 
- organism
- Character or integer, name or NCBI Taxonomy ID of the organism (by default 9606 for human). Matters only if uploadlists is FALSE.
- reviewed
- Translate only reviewed (TRUE), only unreviewed (FALSE) or both (NULL) UniProt records. Matters only if uploadlists is FALSE.
- complexes
- Logical: translate complexes by their members. Only complexes where all members can be translated will be included in the result. If NULL, the option omnipathr.complex_translation will be used.
- complexes_one_to_many
- Logical: allow combinatorial expansion or use only the first target identifier for each member of each complex. If NULL, the option omnipathr.complex_translation_one_to_many will be used.
- track
- Logical: track the records (rows) in the input data frame by adding a column record_id with the original row numbers.
- quantify_ambiguity
- Logical or character: inspect the mappings for each ID for ambiguity. If TRUE, for each translated column, two new columns will be created with numeric values, representing the ambiguity of the mapping on the "from" and "to" side of the translation, respectively. If a character value is provided, it will be used as a column name suffix for the new columns. 
- qualify_ambiguity
- Logical or character: inspect the mappings for each ID for ambiguity. If TRUE, for each translated column, a new column will be included with values one-to-one, one-to-many, many-to-one or many-to-many. If a character value is provided, it will be used as a column name suffix for the new column.
- ambiguity_groups
- Character vector: additional column names to group by during inspecting ambiguity. By default, the identifier columns (from and to) will be used to determine the ambiguity of mappings. 
- ambiguity_global
- Logical or character: if ambiguity_groups are provided, analyse ambiguity also globally, across the whole data frame. A character value provides a custom suffix for the columns quantifying and qualifying global ambiguity.
- ambiguity_summary
- Logical: generate a summary about the ambiguity of the translation and make it available as an attribute. 
- expand
- Logical: if TRUE, ambiguous (to-many) mappings will be expanded to multiple rows, resulting in character type columns; if FALSE, the original rows will be kept intact, and the target columns will be lists of character vectors.
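The two output shapes described for expand can be pictured with plain data frames. The following is a minimal base R sketch of the shapes only, not of how the package builds them; the identifiers ID_X, ID_Y, T1, T2 and T3 and the object names are hypothetical.

```r
# Hypothetical source/target pairs: ID_X maps to two targets, ID_Y to one.
pairs <- data.frame(
    id = c('ID_X', 'ID_X', 'ID_Y'),
    target = c('T1', 'T2', 'T3'),
    stringsAsFactors = FALSE
)

# Shape corresponding to expand = TRUE: one row per (source, target) pair,
# the target column is a plain character column.
expanded <- pairs

# Shape corresponding to expand = FALSE: one row per source ID,
# the target column is a list of character vectors.
collapsed <- data.frame(id = unique(pairs$id), stringsAsFactors = FALSE)
collapsed$target <- unname(split(pairs$target, pairs$id)[collapsed$id])

str(expanded)
str(collapsed)
```

The expanded form is convenient for joins and filtering, while the list column form preserves the row count of the input.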
Value
- Data frame: if the input is a data frame or the input is a vector and return_df is TRUE.
- Vector: if the input is a vector, there is only one target ID type and return_df is FALSE.
- List of vectors: if the input is a vector, there is more than one target ID type and return_df is FALSE. The names of the list will be ID types (as if they were column names, see the description of the ... argument), and the list will also include the source IDs.
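The three return shapes listed above could be explored with a small helper like the one below. This is a sketch under assumptions: return_shapes_demo is a hypothetical name, the calls rely only on the arguments documented on this page, and running it requires OmnipathR to be installed and network access for the translation tables. The calls are wrapped in a function so that nothing is downloaded until it is invoked.

```r
# Hypothetical helper: collects the three documented return shapes.
# Nothing is downloaded until the function is called.
return_shapes_demo <- function() {
    # UniProt IDs taken from the Examples section
    ids <- c('P00533', 'Q9ULV1')
    list(
        # one target ID type, return_df = FALSE (the default): a vector
        vector = OmnipathR::translate_ids(ids, uniprot, genesymbol),
        # more than one target ID type: a list of vectors,
        # which also includes the source IDs
        list_of_vectors = OmnipathR::translate_ids(
            ids, uniprot, genesymbol, entrez
        ),
        # return_df = TRUE: a data frame even though the input is a vector
        data_frame = OmnipathR::translate_ids(
            ids, uniprot, genesymbol, return_df = TRUE
        )
    )
}
```

Calling str(return_shapes_demo()) in an interactive session should show the class of each result.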
Details
This function, depending on the uploadlists parameter, uses either
the uploadlists service of UniProt or plain UniProt queries to obtain
identifier translation tables. The possible values for from and to
are the identifier type abbreviations used in the UniProt API; please
refer to the table here: https://www.uniprot.org/help/api_idmapping.
In addition, simple synonyms are available which provide a uniform API
for the uploadlists and UniProt query based backends. These are the
following:
| OmnipathR | Uploadlists | UniProt query | Ensembl BioMart |
| uniprot | ACC | id | uniprotswissprot |
| uniprot_entry | ID | entry name |  |
| trembl | reviewed = FALSE | reviewed = FALSE | uniprotsptrembl |
| genesymbol | GENENAME | genes(PREFERRED) | external_gene_name |
| genesymbol_syn |  | genes(ALTERNATIVE) | external_synonym |
| hgnc | HGNC_ID | database(HGNC) | hgnc_symbol |
| entrez | P_ENTREZGENEID | database(GeneID) |  |
| ensembl | ENSEMBL_ID |  | ensembl_gene_id |
| ensg | ENSEMBL_ID |  | ensembl_gene_id |
| enst | ENSEMBL_TRS_ID | database(Ensembl) | ensembl_transcript_id |
| ensp | ENSEMBL_PRO_ID |  | ensembl_peptide_id |
| ensgg | ENSEMBLGENOME_ID |  |  |
| ensgt | ENSEMBLGENOME_TRS_ID |  |  |
| ensgp | ENSEMBLGENOME_PRO_ID |  |  |
| protein_name |  | protein names |  |
| pir | PIR | database(PIR) |  |
| ccds |  | database(CCDS) |  |
| refseqp | P_REFSEQ_AC | database(refseq) |  |
| ipro |  |  | interpro |
| ipro_desc |  |  | interpro_description |
| ipro_sdesc |  |  | interpro_short_description |
| wikigene |  |  | wikigene_name |
| rnacentral |  |  | rnacentral |
| gene_desc |  |  | description |
| wormbase |  | database(WormBase) |  |
| flybase |  | database(FlyBase) |  |
| xenbase |  | database(Xenbase) |  |
| zfin |  | database(ZFIN) |  |
| pbd | PBD_ID | database(PDB) | pbd |
For a complete list of ID types and their synonyms, including metabolite and
chemical ID types which are not shown here, see id_types.
The mapping between identifiers can be ambiguous. In this case one row in the original data frame yields multiple rows or elements in the returned data frame or vector(s).
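To make the ambiguity labels concrete, the sketch below classifies a small mapping table in base R. It illustrates what one-to-one, one-to-many and many-to-one mean; it is not the package's implementation, and the package may count and group differently. The CALCA and EGFR rows come from the example output on this page; ID_X, T1 and T2 are hypothetical identifiers added to show a to-many case.

```r
# Mapping table: two UniProt IDs share one gene symbol (many-to-one),
# and the hypothetical ID_X has two targets (one-to-many).
mapping <- data.frame(
    from = c('P01258', 'P06881', 'P00533', 'ID_X', 'ID_X'),
    to = c('CALCA', 'CALCA', 'EGFR', 'T1', 'T2'),
    stringsAsFactors = FALSE
)

# Number of distinct targets per source ID and distinct sources per target ID.
n_to <- tapply(mapping$to, mapping$from, function(x) length(unique(x)))
n_from <- tapply(mapping$from, mapping$to, function(x) length(unique(x)))

# Attach the counts to each row (analogous in spirit to quantify_ambiguity).
mapping$to_count <- as.integer(n_to[mapping$from])
mapping$from_count <- as.integer(n_from[mapping$to])

# Derive a label per row (analogous in spirit to qualify_ambiguity).
mapping$ambiguity <- paste0(
    ifelse(mapping$from_count > 1L, 'many', 'one'),
    '-to-',
    ifelse(mapping$to_count > 1L, 'many', 'one')
)

mapping
```

In the function itself, the quantify_ambiguity and qualify_ambiguity arguments request this kind of information for each translated column.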
Examples
d <- data.frame(
    uniprot_id = c(
        'P00533', 'Q9ULV1', 'P43897', 'Q9Y2P5',
        'P01258', 'P06881', 'P42771', 'Q8N726'
    )
)
d <- translate_ids(d, uniprot_id = uniprot, genesymbol)
d
#>   uniprot_id genesymbol
#> 1     P00533       EGFR
#> 2     Q9ULV1       FZD4
#> 3     P43897       TSFM
#> 4     Q9Y2P5    SLC27A5
#> 5     P01258      CALCA
#> 6     P06881      CALCA
#> 7     P42771     CDKN2A
#> 8     Q8N726     CDKN2A
#   uniprot_id genesymbol
# 1     P00533       EGFR
# 2     Q9ULV1       FZD4
# 3     P43897       TSFM
# 4     Q9Y2P5    SLC27A5