ELM Help page
Questions and answers
- What methods are used for detecting functional sites?
- Why are the ELM predictions not scored?
- What does the ELM instance mapper do?
- Why is the context of a functional site important?
- What are the currently implemented context filters?
- Is there a nomenclature for representing functional site motifs?
- How can the ELM DB be accessed programmatically?
- What are "regular expressions"?
- Why use regular expressions in ELM?
What methods are used for detecting functional sites?
Currently, functional sites are detected with patterns written as regular expressions.
Since most ELMs are very short motifs, many of these patterns overpredict: most of the matches shown are more likely
to be false positives than true matches. To improve the predictive value of ELM, we use logical filters (or rules) based
on context information to discriminate between likely true and false positives (see below).
Why are the ELM predictions not scored?
Since a regular expression either matches or does not match a string (here, a protein subsequence), its score is by nature 1 or 0.
If a prediction is in a sequence that is similar or identical to an ELM instance sequence, and the motif is positionally conserved, the prediction
will be scored by the ELM instance mapper. If a prediction is within a known domain structure, it will be scored by the structure filter.
What does the ELM instance mapper do?
The ELM instance mapper uses PHI-BLAST
to map ELM predictions of a query sequence to known annotated ELM instances. The basis for the mapping is sequence similarity and positional conservation of the motif. If
an ELM prediction is mapped to a known ELM instance, the prediction is given a score from 0 to 1, where 1 means that the query sequence is identical to the database sequence,
and the prediction can be considered an experimentally verified ELM instance.
The result of a successful ELM instance mapping is the raw PHI-BLAST output and a summary of the results with the calculated score.
Why is the context of a functional site important?
A primary prerequisite for a site to be functional is that it must be recognised by the protein that will act upon it; in other words, it must occur in the proper biochemical context. This implies that the functional site must reside in a non-structured region of a protein: either in a non-globular region or in loops within globular domains (see structure filter).
Furthermore, in order for a molecular function to occur, the functional site must be in the right cellular context.
That is, the protein harbouring the functional site must be in the right cellular compartment (see cell compartment filter).
Finally, while some functional sites are found in all eukaryotes, others are restricted to a specific species range (see taxonomic range filter).
What are the currently implemented context filters?
Taxonomic range filter
This filter relies on the user to submit the correct species for the protein sequence. ELMs are annotated with the proper taxonomic
range using NCBI taxonomy node identifiers, and only ELMs relevant for that species will be shown in the prediction results.
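The taxonomic range check can be pictured with a minimal sketch. It assumes that each ELM class carries a single NCBI taxonomy node identifier and that the lineage of the query species is available as a set of node identifiers. The class names, the function name `classes_in_range` and the hand-written lineage are hypothetical and serve only as an illustration; this is not the ELM server's own code.

```python
# Hypothetical annotation: ELM class name -> NCBI taxonomy node ID of its range.
ELM_TAXON_RANGE = {
    "EXAMPLE_ALL_EUKARYOTES": 2759,  # Eukaryota
    "EXAMPLE_METAZOA_ONLY": 33208,   # Metazoa
    "EXAMPLE_FUNGI_ONLY": 4751,      # Fungi
}

# Abbreviated, hand-written lineage for Homo sapiens (taxonomy ID 9606).
# In practice the lineage would be looked up in the NCBI taxonomy.
HUMAN_LINEAGE = {9606, 33208, 2759}


def classes_in_range(class_ranges, lineage):
    """Return the classes whose taxonomic node lies on the query species' lineage."""
    return sorted(name for name, node in class_ranges.items() if node in lineage)


shown = classes_in_range(ELM_TAXON_RANGE, HUMAN_LINEAGE)
print(shown)
```

With this made-up annotation, the fungi-only class is dropped for a human query while the other two classes are kept.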
Cell compartment filter
Since many functional sites are restricted to one or several cellular compartments,
ELMs are annotated with GO terms for the proper set of compartments. Currently this filter relies on user-supplied information; in the future, the compartment will be predicted when it is not supplied.
Only ELMs relevant to the user-supplied compartments will be shown in the prediction results.
Structure filter
Since functional sites must be accessible, they cannot reside deep inside globular domains (unless the domain is known to be allosteric). Therefore, most true motifs are in the exposed loops. Whenever a globular domain structure is available (≥70% identity to the query sequence), the structure filter evaluates the context of linear motifs lying within it.
A summary of the 2D structural elements is provided in the 2D structure link using PDBsum. The filter scores both the secondary structure overlap and the accessibility of the individual residues. Final secondary structure (Qsse) and accessibility (Qacc) scores are assigned by averaging the individual scores over the non-wildcard positions of the motif. The score range is calibrated on a benchmark of true motifs in solved structures and divided into three categories (blue: enriched, half-blue: neutral, grey: sparse). Hovering the mouse over a match in the ELM graphic reveals its scores in a pop-up window. P-values below 0.01 are significant, but remember that important assumptions are implicit in the statistical model, e.g. that any motif-interacting proteins do come into contact (i.e. at a minimum are in the same cell compartment).
Links are provided to display each motif in context using JMOL,
and separately to a full summary of the filter results for each ELM.
See here for more in-depth details on the structure filter.
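The averaging step used for Qsse and Qacc can be illustrated with a minimal sketch. How the individual per-residue scores are derived is not described here, so the values below are invented, and the function name `average_over_defined_positions` is hypothetical. The only thing shown is that wildcard positions ('.') are left out of the average.

```python
def average_over_defined_positions(tokens, per_position_scores):
    """Average the scores of all motif positions that are not the wildcard '.'."""
    if len(tokens) != len(per_position_scores):
        raise ValueError("one score per motif position is required")
    kept = [score for token, score in zip(tokens, per_position_scores) if token != "."]
    if not kept:
        raise ValueError("motif has no non-wildcard positions")
    return sum(kept) / len(kept)


# MOD_SUMO, [VILAFP](K).[EDNGP], split by hand into its four positions.
sumo_tokens = ["[VILAFP]", "(K)", ".", "[EDNGP]"]
# Invented per-residue accessibility scores for one match (illustration only).
accessibility = [0.2, 0.9, 0.5, 0.7]

q_acc = average_over_defined_positions(sumo_tokens, accessibility)
print(round(q_acc, 3))
```

The third position is a wildcard, so its score of 0.5 does not contribute to the result.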
Globular domain filter
If no structure is available but there is an identified domain, the globular domain filter provides a rougher guide to context. It uses SMART
and Pfam to predict globular domains, and ELMs predicted inside these are coloured grey. Since some functional
sites can occur in loops of globular domains (e.g. RGD motifs), some true predictions are removed by this filter. Users are therefore encouraged to inspect the domain DB entries; 2D structure predictions might also be informative.
Users should be aware that about 10% of the Pfam entries do not in fact correspond to globular domains. Currently these are treated as globular domains by the ELM server, but this will be changed in the future.
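The idea of greying out matches that fall inside predicted domains amounts to an interval check, sketched below. The domain boundaries and match coordinates are invented, the function name `inside_any_domain` is hypothetical, and treating only fully contained matches as "inside" is an assumption; the server may handle partial overlaps differently.

```python
def inside_any_domain(match_start, match_end, domains):
    """True if the match [start, end) lies entirely within one domain interval."""
    return any(d_start <= match_start and match_end <= d_end
               for d_start, d_end in domains)


# Hypothetical predicted domain boundaries (0-based, end-exclusive).
domains = [(10, 120), (200, 310)]
# Hypothetical motif matches: name -> (start, end).
matches = {"match_a": (50, 54), "match_b": (150, 158), "match_c": (118, 124)}

greyed = sorted(name for name, (start, end) in matches.items()
                if inside_any_domain(start, end, domains))
print(greyed)
```

Here only `match_a` is fully contained in a domain; `match_c` straddles a domain boundary and is therefore not flagged under the assumption stated above.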
In addition to globular domains, SMART predicts coiled coils (COILS), signal peptide cleavage sites (Sigcleave), low complexity regions (SEG),
and transmembrane helices (TMHMM2). ELM uses these predictions for rough filtering as well.
The GlobPlot algorithm is used to detect potential globular domains as well as protein disorder:
- GlobDoms: segments of the sequence that are ordered according to the Russell/Linding scale. These segments correspond to potential globular domains.
- Disorder: segments of the sequence that are potentially disordered and unstructured according to the Russell/Linding scale. These regions are on average enriched in ELMs [Linding et al. 2004].
For more information refer to the GlobPlot and DisEMBL papers.
Probability filter
For each ELM class, a probability score (expect cutoff) is calculated based on its regular expression, using the following amino acid probabilities (derived from UniProt, using an IUPred cutoff of >= 0.4):
'A': 0.074253, 'C': 0.009697, 'D': 0.050147, 'E': 0.089011, 'F': 0.018359, 'G': 0.073955, 'H': 0.025784, 'I': 0.026977, 'K': 0.061170, 'L': 0.073093, 'M': 0.019150, 'N': 0.033851, 'P': 0.094297, 'Q': 0.056396, 'R': 0.063802, 'S': 0.104119, 'T': 0.059547, 'V': 0.045399, 'W': 0.006850, 'Y': 0.014141
This probability score is low for strictly annotated regular expressions (e.g. TRG_LysEnd_GGAAcLL_2, 'S[LW]LD[DE]EL[LM]', 1.03823788548E-9)
and high for more degenerate ones (e.g. MOD_GSK3_1, '...([ST])...[ST]', 0.026786559556). It should reflect the probability of finding the regular expression by chance in any given protein sequence.
When searching a protein sequence for putative ELM motif instances, this score is used to limit the number of predicted degenerate motif instances.
(Currently, ELM does not correct the probability score for protein sequence length and motif length to avoid length bias).
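As a rough illustration of how such a chance probability can be obtained, the sketch below multiplies per-position probabilities taken from the table above. It assumes that positions are independent and handles only fixed-length patterns built from single residues, '.', character classes and marking parentheses. The function name `motif_probability` is hypothetical, and this is not necessarily the exact procedure used by the ELM server; under these assumptions it does reproduce the MOD_GSK3_1 value quoted above.

```python
import re

# Amino acid probabilities as listed above.
AA_PROB = {
    'A': 0.074253, 'C': 0.009697, 'D': 0.050147, 'E': 0.089011, 'F': 0.018359,
    'G': 0.073955, 'H': 0.025784, 'I': 0.026977, 'K': 0.061170, 'L': 0.073093,
    'M': 0.019150, 'N': 0.033851, 'P': 0.094297, 'Q': 0.056396, 'R': 0.063802,
    'S': 0.104119, 'T': 0.059547, 'V': 0.045399, 'W': 0.006850, 'Y': 0.014141,
}

# One motif position: a (possibly negated) character class, a residue, or '.'.
TOKEN = re.compile(r"\[(\^?)([A-Z]+)\]|([A-Z.])")


def motif_probability(pattern):
    """Product of per-position probabilities (assumes independent positions)."""
    # Parentheses that only mark positions of interest do not change the match.
    stripped = pattern.replace("(", "").replace(")", "")
    probability = 1.0
    pos = 0
    while pos < len(stripped):
        token = TOKEN.match(stripped, pos)
        if token is None:
            raise ValueError("unsupported syntax at position %d" % pos)
        negated, members, single = token.groups()
        if single == ".":
            p = 1.0                      # any amino acid
        elif single:
            p = AA_PROB[single]          # one specific amino acid
        else:
            p = sum(AA_PROB[aa] for aa in set(members))
            if negated:
                p = 1.0 - p              # everything except the listed residues
        probability *= p
        pos = token.end()
    return probability


print(motif_probability("...([ST])...[ST]"))
```

Patterns that use anchors, ranges or alternation are deliberately rejected with a `ValueError`, since a simple product over positions no longer applies to them.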
Is there a nomenclature for representing functional site motifs?
Yes! In the ELM project we intend to use the nomenclature suggested by R. Aasland et al. (FEBS Lett. 2002; 513: 141-144) when appropriate.
For example: the sumoylation motif (MOD_SUMO) can be described by the regular expression [VILAFP](K).[EDNGP], but a simpler description is the consensus sequence: %KxE. This is not as precise as the regular expression, but easier to read. Amino acids are written in the IUPAC one letter code, the % sign means hydrophobic amino acid residue, and the x is any amino acid.
Primarily, Greek symbols are used for representing different groups of amino acids, and ASCII equivalents are given for use in plain text files and e.g. for web-pages. ELM uses the ASCII symbols.
How can the ELM DB be accessed programmatically?
Warning: The ELM web services are deprecated and deliver outdated data; they will be replaced by RESTful services in the next major release.
For users interested in querying the ELM DB for the latest information, we provide an API via web services (WSDL, Web Services Description Language):
- Description of the services: http://api.bioinfo.no/wsdl/ELMdb.wsdl
- The actual WSDL file: http://elm.eu.org/webservice/ELMdb.wsdl
- A sample client: http://api.bioinfo.no/clients/ELMdb.py
In case of problems, please contact the ELM webmaster.
Regular expressions
Regular expressions are similar to the PROSITE patterns, but have a slightly different syntax:
Amino acids are written in one letter code (see IUPAC Nomenclature and Symbolism for Amino Acids and Peptides), other characters are:
| Character | Name | Meaning |
|---|---|---|
| . | dot | Any amino acid allowed |
| [...] | character class | Amino acids listed are allowed |
| [^...] | negated character class | Amino acids listed are not allowed |
| {min,max} | specified range | The preceding element must occur at least min and at most max times |
| ^ | caret | Matches the amino terminus |
| $ | dollar | Matches the carboxy terminus |
| ? | question | The preceding amino acid is optional (zero or one occurrence) |
| * | star | The preceding amino acid may occur any number of times, including zero |
| + | plus | The preceding amino acid must occur at least once; additional occurrences are optional |
| \| | alternation | Matches either expression it separates |
| (...) | parentheses | 1. Used to mark positions of specific interest, e.g. the amino acid being covalently modified. 2. Used to group parts of an expression, e.g. the alternatives of an alternation. |
Examples:
The regular expression for MOD_SUMO: [VILAFP](K).[EDNGP]
- First position: [VILAFP] - one of the amino acids V, I, L, A, F or P must occur.
- Second position: (K) - K must occur. This amino acid is covalently modified, and is therefore surrounded by parentheses.
- Third position: . - any amino acid is allowed.
- Fourth position: [EDNGP] - one of the amino acids E, D, N, G or P must occur.
- Examples of matching sequences:
The regular expression for LIG_PCNA: (^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..
- First position: (^.{0,3}|Q) - either the motif lies at the amino terminus, preceded by at most 3 arbitrary amino acids, or this position must be Q. The parentheses group the two alternatives.
- Second position: . - any amino acid is allowed.
- Third position: [SKRNDT] - one of the amino acids S, K, R, N, D or T must occur.
- Fourth position: [ILM] - one of the amino acids I, L, or M must occur.
- Fifth and sixth positions: [^P] - all amino acids except P are allowed.
- Seventh position: [FHM] - one of the amino acids F, H, or M must occur.
- Eighth position: [YFM] - one of the amino acids Y, F, or M must occur.
- Positions 9 and 10: . - any amino acid is allowed. These are included to indicate that the motif should be at least 10 amino acids long.
- Examples of matching sequences:
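The MOD_SUMO and LIG_PCNA patterns described above can be tried out with any regular expression engine. The following is a minimal sketch using Python's standard `re` module, which accepts both patterns as written. The two test sequences are made up for illustration, and the helper name `find_motifs` is hypothetical.

```python
import re

MOD_SUMO = r"[VILAFP](K).[EDNGP]"
LIG_PCNA = r"(^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM].."


def find_motifs(pattern, sequence):
    """Return (1-based start, 1-based end, matched text) per non-overlapping match."""
    return [(m.start() + 1, m.end(), m.group(0))
            for m in re.finditer(pattern, sequence)]


# Made-up peptide sequences, used only to exercise the patterns.
sumo_hits = find_motifs(MOD_SUMO, "MGSAVKTEGG")
pcna_hits = find_motifs(LIG_PCNA, "MGSAGGQKRLADFYEEGG")
print(sumo_hits)
print(pcna_hits)

# The parentheses around K expose the modified residue as group 1.
m = re.search(MOD_SUMO, "MGSAVKTEGG")
modified_position = m.start(1) + 1  # 1-based position of the marked lysine
print(modified_position)
```

In the first sequence the MOD_SUMO pattern matches VKTE at positions 5 to 8, with the marked lysine at position 6. In the second sequence LIG_PCNA matches through its Q alternative, since the amino-terminal alternative does not fit. Note that `re.finditer` reports non-overlapping matches only.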
Why use Regular Expressions in ELM?
The three most commonly used methods for the bioinformatic representation of sequence conservation patterns are profiles/Hidden Markov Models (HMMs), artificial neural networks (ANNs) and regular expressions (RegExps). Of these, RegExps are considered the worst approach for capturing protein sequence information. They are ad hoc, typically created by annotators without applying a consistent formalism. The motif characters are represented with integer values, so RegExps cannot use position-weighting to capture weaker preferences. They are over-determined and can only capture exactly what is specified (whereas the more probabilistic HMMs and ANNs can rank near misses too). They do not support searching for an exact number of a given amino acid character within a specified range (which would better approximate the charged runs in e.g. CAP-Gly and NLS motifs).
Despite these shortcomings, using RegExps to establish ELM has proved to be the correct decision, for two reasons. First, many LMs have short indels in the pattern. HMM software does not provide for variable gaps with exactly bounded ranges, while ANNs do not account for gaps at all: a motif such as the NES, with multiple short indels, is hard to represent with these algorithms. Second, the scoring of presence/absence matches for LM RegExps simplifies statistical analyses of motif searches. These two advantages have been critical to the first wave of development of motif-hunting software.
Thus we consider that it was appropriate to initiate LM database resources with RegExps. Of course, HMMs and ANNs are used in a number of useful predictive tools e.g. Scansite and NetPhorest and there is little doubt that HMMs, Neural Networks and other methods will grow in importance for LM analyses in future, once the contexts can be better controlled.
Dictionary
Definitions and explanations of terms used in the ELM resource.
- Biochemical context
- For functional sites, the biochemical context has several components: the sequence motif, its relation to the local structure and other domains in the protein as well as the protein complex it may reside in.
- Cellular context
- Where and when in the cell a site is functional.
- Context
- The space and time where a molecular function takes place.
- ELM
- 1. Eukaryotic Linear Motif.
2. The common pattern of a set of linear (sub)sequences that can be related to a molecular function.
- ELM instance
- An experimentally verified instance of an ELM in a particular polypeptide.
- ELM instance sequence
- A protein sequence carrying one or more experimentally verified ELM instances.
- Filter
- Method for discriminating between likely positive and negative ELM predictions; based on context information.
- Functional site
- A set of short linear (sub)sequences that can be related to a molecular function.
- GO
- Gene Ontology: a controlled vocabulary for describing the molecular functions, cellular components and biological processes of genes. See http://www.geneontology.org/
- Molecular context
- See Biochemical context.
- Molecular function
- The nature of interaction of a protein with another molecule.
- Sequence motif
- A recurrent pattern of conserved amino acids. The amino acids may be absolutely, chemically or sterically conserved.
- Siteseeing
- The process of annotating ELMs, including development of detection methods and evaluation of context information used to formulate discriminating filters and rules.
Please cite: ELM - the database of eukaryotic linear motifs (PMID: 22110040)
ELM data can be downloaded and distributed for non-commercial use according to the ELM Software License Agreement.

