ELM Help page

Questions and answers

What methods are used for detecting functional sites?

Currently, patterns written as regular expressions are used for detecting functional sites. Since most ELMs are very short motifs, many of them overpredict: most matches shown are more likely to be false positives than true matches. To improve the predictive value of ELM, we use logical filters (or rules) based on context information to discriminate between likely true and false positives. (See below.)

Why are the ELM predictions not scored?

Since a regular expression either matches or does not match a string (or a protein subsequence), by its nature the score would be 1 or 0.
If a prediction is in a sequence that is similar or identical to an ELM instance sequence, and the motif is positionally conserved, the prediction will be scored by the ELM instance mapper. If a prediction is within a known domain structure, it will be scored by the structure filter.

What does the ELM instance mapper do?

The ELM instance mapper uses PHI-BLAST to map ELM predictions of a query sequence to known annotated ELM instances. The basis for the mapping is sequence similarity and positional conservation of the motif. If an ELM prediction is mapped to a known ELM instance, the prediction is given a score from 0 to 1, where 1 means that the query sequence is identical to the database sequence, and the prediction can be considered an experimentally verified ELM instance.
The result of a successful ELM instance mapping is the raw PHI-BLAST output and a summary of the results with the calculated score.

Why is the context of a functional site important?

A primary prerequisite for a site to be functional is that it must be recognised by the protein that will act upon it; in other words, it must occur in the proper biochemical context. This implies that the functional site must reside in a non-structured region of a protein: either in a non-globular region or in loops within globular domains (see structure filter).
Furthermore, in order for a molecular function to occur, the functional site must be in the right cellular context. That is, the protein harbouring the functional site must be in the right cellular compartment (see cell compartment filter).
Finally, while some functional sites are found in all eukaryotes, others are restricted to a specific species range (see taxonomic range filter).

What are the currently implemented context filters?

Taxonomic range filter
This filter relies on the user to submit the correct species for the protein sequence. ELMs are annotated with the proper taxonomic range using NCBI taxonomy node identifiers, and only ELMs relevant for that species will be shown in the prediction results.
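The idea behind this filter can be sketched in a few lines of Python. This is only an illustration, not ELM's implementation: the class names and their taxon assignments are hypothetical, and the lineage is a hand-written partial lineage for Homo sapiens. Only the NCBI taxonomy identifiers themselves (2759 Eukaryota, 33208 Metazoa, 4751 Fungi, 7742 Vertebrata, 40674 Mammalia, 9606 Homo sapiens) are real.

```python
# Hypothetical annotation: ELM class name -> NCBI taxonomy node IDs it applies to.
# The class names and assignments are illustrative, not taken from ELM.
ELM_TAXON_RANGE = {
    "CLASS_ALL_EUKARYOTES": {2759},  # Eukaryota
    "CLASS_METAZOA_ONLY": {33208},   # Metazoa
    "CLASS_FUNGI_ONLY": {4751},      # Fungi
}

# Partial lineage of Homo sapiens (9606), root-most node first.
HUMAN_LINEAGE = [2759, 33208, 7742, 40674, 9606]


def classes_for_species(lineage, taxon_range=ELM_TAXON_RANGE):
    """Return the classes whose annotated taxon nodes occur in the lineage."""
    lineage_ids = set(lineage)
    return sorted(
        name for name, nodes in taxon_range.items() if nodes & lineage_ids
    )


relevant = classes_for_species(HUMAN_LINEAGE)
print(relevant)
```

A class is kept when any of its annotated nodes is an ancestor of (or equal to) the submitted species, which is why the fungi-only class drops out for a human protein.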

Cell compartment filter
Since many functional sites are restricted to one or several cellular compartments, ELMs are annotated with GO terms for the proper set of compartments. Currently this filter relies on user-supplied information; in the future, the compartment will be predicted when it is not supplied. Only ELMs relevant to the user-supplied compartments will be shown in the prediction results.

Structure filter

Since functional sites must be accessible, they cannot reside deep inside globular domains (unless the domain is known to be allosteric). Therefore, most true motifs are in the exposed loops. Whenever a globular domain structure is available (≥70% identity to the query sequence), the structure filter evaluates the context of linear motifs lying within it.

A summary of the 2D structural elements is provided in the 2D structure link using PDBsum. The filter scores both the secondary structure overlap and the accessibility of the individual residues. Final secondary structure (Q_sse) and accessibility (Q_acc) scores are assigned by averaging individual scores over the non-wildcard positions of the motif. The score range is calibrated on a benchmark of true motifs in solved structures and assigned into three categories (blue: enriched, half-blue: neutral, grey: sparse). Hovering the mouse over the ELM graphic matches reveals the scores in individual pop-up windows. P-values below 0.01 are significant, but remember that important assumptions are implicit in the statistical model, e.g. that any motif-interacting proteins do come into contact (i.e. at the minimum are in the same cell compartment).
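The averaging step can be illustrated as follows. This is a minimal sketch under stated assumptions: how the individual per-residue scores are derived is not described here, so the values below are made up, and the helper names (tokenize, mean_over_defined_positions) are hypothetical. Only the rule "average over the non-wildcard positions" comes from the text above.

```python
import re


def tokenize(pattern):
    """Split a simple ELM-style regex into one token per residue position.

    Handles literals, '.', and [...] classes; marking parentheses are dropped.
    Quantifiers, anchors and alternation are not handled in this sketch.
    """
    return re.findall(r"\[[^\]]+\]|\.|[A-Z]", pattern)


def mean_over_defined_positions(pattern, per_residue_scores):
    """Average per-residue scores, skipping wildcard ('.') positions."""
    tokens = tokenize(pattern)
    if len(tokens) != len(per_residue_scores):
        raise ValueError("need exactly one score per motif position")
    kept = [s for t, s in zip(tokens, per_residue_scores) if t != "."]
    if not kept:
        raise ValueError("motif has no non-wildcard positions")
    return sum(kept) / len(kept)


# Made-up accessibility scores for a four-residue MOD_SUMO match.
q_acc = mean_over_defined_positions("[VILAFP](K).[EDNGP]", [0.8, 0.6, 0.1, 0.4])
print(round(q_acc, 3))
```

The third position is a wildcard, so its score (0.1) does not contribute to the average.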

Links are provided to display each motif in context using JMOL and separately to a full summary of the filter results for each ELM.

See here for more in-depth details on the structure filter.

Globular domain filter
If no structure is available but there is an identified domain, the globular domain filter provides a rougher guide to context. It uses SMART and Pfam to predict globular domains, and ELMs predicted inside these are coloured grey. Since some functional sites can occur in loops of globular domains (e.g. RGD motifs), some true predictions are removed by this filter, and users are encouraged to inspect the domain DB entries, while 2D structure predictions might be informative.
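As a rough illustration of this filter, the sketch below marks a motif match as grey when it lies within a predicted domain. It is not ELM's code: the function names and coordinates are hypothetical, and treating "inside" as "entirely contained in the domain" is an assumption made for the example.

```python
def inside_domain(match_start, match_end, domains):
    """True if the match (1-based, inclusive) lies entirely within any domain."""
    return any(
        d_start <= match_start and match_end <= d_end
        for d_start, d_end in domains
    )


def colour_matches(matches, domains):
    """Label each (start, end) match 'grey' if inside a domain, else 'normal'."""
    return [
        "grey" if inside_domain(start, end, domains) else "normal"
        for start, end in matches
    ]


# Made-up coordinates: one predicted domain and three motif matches.
predicted_domains = [(50, 150)]
motif_matches = [(10, 13), (60, 63), (148, 152)]
colours = colour_matches(motif_matches, predicted_domains)
print(colours)
```

The last match straddles the domain boundary and is therefore left uncoloured under the containment assumption; a stricter filter could grey out any overlap instead.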

Users should be aware that about 10% of the Pfam entries do not in fact correspond to globular domains. Currently these are treated as globular domains by the ELM server, but this will be changed in the future.

In addition to globular domains, SMART predicts coiled coils (COILS), signal peptide cleavage sites (Sigcleave), low complexity regions (SEG), and transmembrane helices (TMHMM2). ELM uses these predictions for rough filtering as well.

The GlobPlot algorithm is used to detect potential globular domains as well as protein disorder:

GlobDoms: segments of the sequence that are ordered according to the Russell/Linding scale. These segments correspond to potential globular domains.

Disorder: segments of the sequence that are potentially disordered and unstructured according to the Russell/Linding scale. These regions are on average enriched in ELMs [Linding et al. 2004].
For more information refer to the GlobPlot and DisEMBL papers.

Probability filter
For each ELM class, a probability score (expect cutoff) is calculated based on its regular expression, using the following amino acid probabilities (derived from UniProt, using an IUPred cutoff of >= 0.4):

'A':	0.074253	'C':	0.009697	'D':	0.050147	'E':	0.089011	'F':	0.018359
'G':	0.073955	'H':	0.025784	'I':	0.026977	'K':	0.061170	'L':	0.073093
'M':	0.019150	'N':	0.033851	'P':	0.094297	'Q':	0.056396	'R':	0.063802
'S':	0.104119	'T':	0.059547	'V':	0.045399	'W':	0.006850	'Y':	0.014141

This probability score is low for strictly annotated regular expressions (e.g. TRG_LysEnd_GGAAcLL_2, 'S[LW]LD[DE]EL[LM]', 1.03823788548E-9) and high for more degenerate ones (e.g. MOD_GSK3_1, '...([ST])...[ST]', 0.026786559556). It should reflect the probability of the regular expression being found by chance in any given protein sequence. When searching a protein sequence for putative ELM motif instances, this score is used to limit the number of predicted degenerate motif instances. (Currently, ELM does not correct the probability score for protein sequence length and motif length to avoid length bias.)
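One simple way to turn a regular expression into such a score is to multiply, position by position, the summed probabilities of the allowed amino acids. The sketch below does this for patterns built only from literals, '.', and character classes. It is an assumption-laden illustration, not ELM's actual procedure: the function names are hypothetical, and anchors, quantifiers and alternation are rejected rather than modelled.

```python
import math
import re

# Amino acid probabilities as listed above.
AA_PROB = {
    "A": 0.074253, "C": 0.009697, "D": 0.050147, "E": 0.089011, "F": 0.018359,
    "G": 0.073955, "H": 0.025784, "I": 0.026977, "K": 0.061170, "L": 0.073093,
    "M": 0.019150, "N": 0.033851, "P": 0.094297, "Q": 0.056396, "R": 0.063802,
    "S": 0.104119, "T": 0.059547, "V": 0.045399, "W": 0.006850, "Y": 0.014141,
}

TOKEN = re.compile(r"\[\^?[A-Z]+\]|\.|[A-Z]")


def position_probability(token):
    """Probability that a random residue satisfies one pattern position."""
    if token == ".":
        return 1.0
    if token.startswith("[^"):
        return 1.0 - sum(AA_PROB[a] for a in set(token[2:-1]))
    if token.startswith("["):
        return sum(AA_PROB[a] for a in set(token[1:-1]))
    return AA_PROB[token]


def pattern_probability(pattern):
    """Product of per-position probabilities for a simple pattern."""
    stripped = pattern.replace("(", "").replace(")", "")
    tokens = TOKEN.findall(stripped)
    if "".join(tokens) != stripped:
        raise ValueError("unsupported regex syntax in this sketch")
    return math.prod(position_probability(t) for t in tokens)


gsk3 = pattern_probability("...([ST])...[ST]")
print(gsk3)
```

For MOD_GSK3_1 this product, (0.104119 + 0.059547) squared, agrees with the quoted 0.026786559556 up to floating-point rounding. ELM's own calculation may treat other classes differently, so agreement is not guaranteed in general.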

Is there a nomenclature for representing functional site motifs?

Yes! In the ELM project we intend to use the nomenclature suggested by R. Aasland et al. (FEBS Lett. 2002; 513: 141-144.) when appropriate.

For example: the sumoylation motif (MOD_SUMO) can be described by the regular expression [VILAFP](K).[EDNGP], but a simpler description is the consensus sequence: %KxE. This is not as precise as the regular expression, but easier to read. Amino acids are written in the IUPAC one letter code, the % sign means hydrophobic amino acid residue, and the x is any amino acid.

Primarily, Greek symbols are used for representing different groups of amino acids, and ASCII equivalents are given for use in plain text files and e.g. for web-pages. ELM uses the ASCII symbols.

How can the ELM DB be accessed programmatically?

Warning: The ELM web services have been deprecated and replaced by RESTful services. Please see the downloads page for details on how to retrieve the different types of ELM data. In case of problems, please contact the ELM webmaster.

Regular expressions

Regular expressions are similar to the PROSITE patterns, but have a slightly different syntax:

Amino acids are written in one letter code (see IUPAC Nomenclature and Symbolism for Amino Acids and Peptides), other characters are:

Character	Name	Meaning
`.`	dot	Any amino acid allowed
`[...]`	character class	Amino acids listed are allowed
`[^...]`	negated character class	Amino acids listed are not allowed
`{min,max}`	specified range	The preceding element must occur at least min and at most max times
`^`	caret	Matches the amino terminus
`$`	dollar	Matches the carboxy terminus
`?`	question	The preceding element is optional (zero or one occurrence)
`*`	star	The preceding element may occur any number of times, including zero
`+`	plus	The preceding element must occur at least once; additional occurrences are optional
`|`	alternation	Matches either expression it separates
`(...)`	parentheses	1. Used to mark positions of specific interest; e.g. the amino acid being covalently modified. 2. Used to group parts of the expression

Examples:

The regular expression for MOD_SUMO: [VILAFP](K).[EDNGP]

First position: [VILAFP] - one of the amino acids V, I, L, A, F or P must occur.
Second position: (K) - K must occur. This amino acid is covalently modified, and is surrounded by parentheses.
Third position: . - any amino acid is allowed.
Fourth position: [EDNGP] - one of the amino acids E, D, N, G or P must occur.
Examples of matching sequences:
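As an illustration, the MOD_SUMO pattern can be applied with Python's re module. The sketch below is not part of ELM: the peptide is made up, the function name is hypothetical, and the pattern is wrapped in a lookahead only so that overlapping matches would be reported as well. The inner parentheses around K make it easy to report the position of the modified lysine.

```python
import re

# MOD_SUMO pattern as given above, wrapped in a lookahead so that
# overlapping matches are reported too.
MOD_SUMO_SCAN = re.compile(r"(?=([VILAFP](K).[EDNGP]))")


def find_sumo_sites(sequence):
    """Return (start, end, matched_text, lysine_position), 1-based inclusive."""
    hits = []
    for m in MOD_SUMO_SCAN.finditer(sequence):
        hits.append((m.start(1) + 1, m.end(1), m.group(1), m.start(2) + 1))
    return hits


# Made-up peptide, used only to exercise the pattern.
sites = find_sumo_sites("MSDLKTEAAIKQEPG")
for site in sites:
    print(site)
```

In this peptide the pattern matches LKTE (residues 4 to 7) and IKQE (residues 10 to 13), with the candidate lysines at positions 5 and 11.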

The regular expression for LIG_PCNA_PIPBox_1: (^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..

First position: (^.{0,3}|Q) - either the amino terminus followed by zero to three arbitrary amino acids, or Q. The parentheses are used to group the two alternatives.
Second position: . - any amino acid is allowed.
Third position: [SKRNDT] - one of the amino acids S, K, R, N, D or T must occur.
Fourth position: [ILM] - one of the amino acids I, L, or M must occur.
Fifth and sixth positions: [^P] - all amino acids except P are allowed.
Seventh position: [FHM] - one of the amino acids F, H, or M must occur.
Eighth position: [YFM] - one of the amino acids Y, F, or M must occur.
Positions 9 and 10: . - any amino acid is allowed. These are included to indicate that the motif should be at least 10 amino acids long.
Examples of matching sequences:
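The effect of the (^.{0,3}|Q) alternative can be checked with a short Python sketch. The three peptides are made up for illustration and the variable names are hypothetical: the first carries a Q-initiated motif in the middle of the sequence, the second carries a Q-less motif right at the amino terminus, and the third carries the same Q-less motif further into the sequence, where neither alternative applies.

```python
import re

# LIG_PCNA_PIPBox_1 pattern as given above.
PIP_BOX = re.compile(r"(^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..")

# Made-up test peptides.
internal_q = "AAAAAQTSMTDFYHSAAA"   # Q-initiated motif inside the sequence
n_terminal = "MKLAAFYGGAAA"         # no Q, but motif starts at the N-terminus
buried = "GGGGGMKLAAFYGGAAA"        # same Q-less motif, too far from the N-terminus

m_internal = PIP_BOX.search(internal_q)
m_nterm = PIP_BOX.search(n_terminal)
m_buried = PIP_BOX.search(buried)

print(m_internal.group(0) if m_internal else None)
print(m_nterm.group(0) if m_nterm else None)
print(m_buried)
```

The first two peptides match (via the Q branch and the amino-terminal branch, respectively), while the third does not, because without a Q the motif must begin within the first few residues of the protein.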

More information on regular expressions can be found in the Regular expression HOWTO

Why use Regular Expressions in ELM?

We consider that it was appropriate to initiate linear motif (LM) database resources with regular expressions (RegExps). Of course, HMMs and ANNs are used in a number of useful predictive tools, e.g. Scansite and NetPhorest, and there is little doubt that HMMs, neural networks and other methods will grow in importance for LM analyses in future, once the contexts can be better controlled.

Dictionary

Definitions and explanations to terms used in the ELM resource.

Biochemical context: For functional sites, the biochemical context has several components: the sequence motif, its relation to the local structure and other domains in the protein as well as the protein complex it may reside in.
Cellular context: Where and when in the cell a site is functional.
Context: The space and time where a molecular function takes place.
ELM: 1. Eukaryotic Linear Motif.
2. The common pattern of a set of linear (sub)sequences that can be related to a molecular function.
ELM instance: An experimentally verified instance of an ELM in a particular polypeptide.
ELM instance sequence: A protein sequence carrying one or more experimentally verified ELM instances.
Filter: Method for discriminating between likely positive and negative ELM predictions; based on context information.
Functional site: A set of short linear (sub)sequences that can be related to a molecular function.
GO: Gene Ontology: a controlled vocabulary for describing the molecular functions, cellular components and biological processes of genes. See http://www.geneontology.org/
Logic: TP (True Positive): An instance annotated with experimental evidence showing this instance to be functional.
FP (False Positive): An instance with experimental evidence hinting at a function, but after careful inspection our annotators believe this instance to be non-functional.
TN (True Negative): An annotated instance where experiments have shown it to be non-functional.
U (Unknown): Not enough convincing evidence could be found to determine whether this instance is functional or not.
Molecular context: see Biochemical context.
Molecular function: The nature of interaction of a protein with another molecule.
Sequence motif: A recurrent pattern of conserved amino acids. The amino acids may be absolutely, chemically or sterically conserved.
Siteseeing: The process of annotating ELMs, including development of detection methods and evaluation of context information used to formulate discriminating filters and rules.

Please cite: ELM-the Eukaryotic Linear Motif resource-2024 update. (PMID:37962385)

ELM data can be downloaded & distributed for non-commercial use according to the ELM Software License Agreement

feedback@elm.eu.org