ELM: The Eukaryotic Linear Motif resource for Functional Sites in Proteins
o Questions and answers
 

o What methods are used for detecting functional sites?

Functional sites are currently detected with patterns written as regular expressions. Because most ELMs are very short motifs, many of them overpredict: most of the matches shown are more likely to be false positives than true matches. To improve the predictive value of ELM, we use logical filters (or rules) based on context information to discriminate between likely true and likely false positives (see the context filters below).
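As a rough illustration of regular-expression scanning, the sketch below runs the two example expressions given later on this page over a made-up sequence. The function name `scan_sequence` and the test sequence are assumptions for this example, and this is not the ELM server's own code.

```python
import re

# Regular expressions copied from the examples further down this page.
ELM_PATTERNS = {
    "MOD_SUMO": r"[VILAFP](K).[EDNGP]",
    "LIG_PCNA": r"(^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..",
}


def scan_sequence(sequence, patterns=ELM_PATTERNS):
    """Return (elm_name, start, end, matched_text) for every match.

    Positions are 1-based and inclusive, as is usual for protein sequences.
    """
    hits = []
    for name, pattern in patterns.items():
        # finditer reports non-overlapping matches, scanning left to right.
        for match in re.finditer(pattern, sequence):
            hits.append((name, match.start() + 1, match.end(), match.group(0)))
    return hits


# Made-up sequence, for illustration only.
hits = scan_sequence("MAAVKTEPLLQ")
for hit in hits:
    print(hit)
```

Every hit is reported with equal weight, which is why the context filters are needed to separate likely true matches from likely false ones.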

 

o Why are the ELM predictions not scored?

A regular expression either matches or does not match a string (here, a protein subsequence), so by its nature the score could only be 1 or 0.
   Two kinds of prediction are scored nonetheless. If a prediction is in a sequence that is similar or identical to an ELM instance sequence, and the motif is positionally conserved, the prediction is scored by the ELM instance mapper. If a prediction is within a known domain structure, it is scored by the structure filter.

 

o What does the ELM instance mapper do?

The ELM instance mapper uses PHI-BLAST to map ELM predictions of a query sequence to known annotated ELM instances. The basis for the mapping is sequence similarity and positional conservation of the motif. If an ELM prediction is mapped to a known ELM instance, the prediction is given a score from 0 to 1, where 1 means that the query sequence is identical to the database sequence, and the prediction can be considered an experimentally verified ELM instance.
   The result of a successful ELM instance mapping is the raw PHI-BLAST output and a summary of the results with the calculated score.

 

o Why is the context of a functional site important?

A primary prerequisite for a site to be functional is that it must be recognised by the protein that will act upon it; in other words, it must occur in the proper biochemical context. This implies that the functional site must reside in a non-structured region of a protein: either in a non-globular region or in loops within globular domains (see structure filter).
   Furthermore, in order for a molecular function to occur, the functional site must be in the right cellular context. That is, the protein harbouring the functional site must be in the right cellular compartment (see cell compartment filter).
   Finally, while some functional sites are found in all eukaryotes, others are restricted to a specific species range (see taxonomic range filter).

 

o What are the currently implemented context filters?

Taxonomic range filter
This filter relies on the user to submit the correct species for the protein sequence. ELMs are annotated with the proper taxonomic range using NCBI taxonomy node identifiers, and only ELMs relevant for that species will be shown in the prediction results.

Cell compartment filter
Since many functional sites are restricted to one or several cellular compartments, ELMs are annotated with GO terms for the proper set of compartments. Currently this filter relies on user-supplied information; in the future, we plan to predict the compartment when it is not supplied. Only ELMs relevant to the user-supplied compartments will be shown in the prediction results.
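The taxonomic range and cell compartment filters described above can be pictured as simple set tests. The sketch below is only an illustration under stated assumptions: the ELM names, their annotations, the abbreviated lineages and the function `context_filter` are all hypothetical, although the NCBI taxonomy identifiers and GO terms used are real ones.

```python
# Hypothetical annotations: each ELM lists the NCBI taxonomy node that bounds
# its range and the GO cellular component terms where it is functional.
ELM_ANNOTATIONS = {
    "ELM_A": {"taxon": 2759, "compartments": {"GO:0005634"}},   # Eukaryota; nucleus
    "ELM_B": {"taxon": 33208, "compartments": {"GO:0005737"}},  # Metazoa; cytoplasm
    "ELM_C": {"taxon": 4751, "compartments": {"GO:0005634"}},   # Fungi; nucleus
}

# Abbreviated lineages (species node plus a few ancestors), assumed here
# instead of querying the full NCBI taxonomy.
LINEAGES = {
    9606: {9606, 33208, 2759},  # Homo sapiens
    4932: {4932, 4751, 2759},   # Saccharomyces cerevisiae
}


def context_filter(elm_names, species_taxon, user_compartments):
    """Keep ELMs whose taxonomic range and compartments fit the query."""
    lineage = LINEAGES[species_taxon]
    kept = []
    for name in elm_names:
        annotation = ELM_ANNOTATIONS[name]
        in_range = annotation["taxon"] in lineage
        in_compartment = bool(annotation["compartments"] & set(user_compartments))
        if in_range and in_compartment:
            kept.append(name)
    return kept


# A human nuclear protein: only ELM_A passes both filters.
kept = context_filter(["ELM_A", "ELM_B", "ELM_C"], 9606, {"GO:0005634"})
print(kept)
```

For a yeast nuclear protein (taxon 4932) the same call would keep ELM_A and ELM_C, since the fungal motif is then within range.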

Structure filter
Since functional sites must be accessible, they cannot reside deep inside globular domains (unless the domain is known to be allosteric). Therefore, most true motifs are in exposed loops. Whenever a globular domain structure is available (≥70% identity to the query sequence), the structure filter evaluates the context of linear motifs lying within it. A summary of the secondary (2D) structure elements is provided in the 2D structure line using PDBsum. The filter scores both the secondary structure overlap and the accessibility of the individual residues. Final secondary structure (SSSE) and accessibility (SA) scores are assigned by averaging the individual scores over the whole motif. The score range is calibrated on a benchmark of true motifs in solved structures and divided into three categories (blue: very good, half-blue: possible, grey: bad). Hovering the mouse over the ELM graphic reveals the scores. Links are provided to display each motif in context using Jmol and, separately, to a full summary of the filter results.
See here for more in-depth details on the structure filter.
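The averaging step of the structure filter can be sketched as follows. The per-residue values and the two thresholds are invented for illustration; the real score ranges are calibrated on the benchmark mentioned above and are not reproduced here.

```python
from statistics import mean


def motif_score(per_residue_scores, start, end):
    """Average per-residue scores over a motif (1-based, inclusive positions)."""
    return mean(per_residue_scores[start - 1:end])


def categorize(score, good=0.6, possible=0.3):
    """Bin a score into three categories; both thresholds are hypothetical."""
    if score >= good:
        return "very good"  # shown in blue
    if score >= possible:
        return "possible"   # shown in half-blue
    return "bad"            # shown in grey


# Made-up accessibility values for a 7-residue stretch of a structure.
accessibility = [0.0, 0.25, 0.75, 1.0, 0.75, 0.5, 0.25]

# Motif spanning residues 3 to 6.
sa = motif_score(accessibility, 3, 6)
print(sa, categorize(sa))
```

The same two functions would apply unchanged to a list of per-residue secondary structure scores.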

Globular domain filter
If no structure is available but there is an identified domain, the globular domain filter provides a rougher guide to context. It uses SMART and Pfam to predict globular domains, and ELMs predicted inside these are coloured grey. Since some functional sites can occur in loops of globular domains (e.g. RGD motifs), some true predictions are removed by this filter. Users are therefore encouraged to inspect the domain database entries; 2D structure predictions might also be informative.

Users should be aware that about 10% of the Pfam entries do not in fact correspond to globular domains. Currently these are treated as globular domains by the ELM server, but they will be eliminated in the future.

In addition to globular domains, SMART predicts coiled coils (COILS), signal peptide cleavage sites (Sigcleave), low complexity regions (SEG), and transmembrane helices (TMHMM2). ELM uses these predictions for rough filtering as well.

The GlobPlot algorithm is used to detect potential globular domains as well as protein disorder:

GlobDoms: segments of the sequence that are ordered according to the Russell/Linding scale. These segments correspond to potential globular domains.

Disorder: segments of the sequence that are potentially disordered and unstructured according to the Russell/Linding scale. These regions are on average enriched in ELMs [Linding et al. 2004].
For more information refer to the GlobPlot and DisEMBL papers.

 

o Is there a nomenclature for representing functional site motifs?

Yes! In the ELM project we intend to use the nomenclature suggested by R. Aasland et al. (FEBS Lett. 2002; 513: 141-144.) when appropriate.

For example: the sumoylation motif (MOD_SUMO) can be described by the regular expression [VILAFP](K).[EDNGP], but a simpler description is the consensus sequence: %KxE. This is not as precise as the regular expression, but easier to read. Amino acids are written in the IUPAC one letter code, the % sign means hydrophobic amino acid residue, and the x is any amino acid.

The nomenclature primarily uses Greek symbols to represent different groups of amino acids, with ASCII equivalents given for use in, for example, plain text files and web pages. ELM uses the ASCII symbols.




o Regular expressions

Regular expressions are similar to the PROSITE patterns, but have a slightly different syntax:

Amino acids are written in the one-letter code (see IUPAC Nomenclature and Symbolism for Amino Acids and Peptides). The other characters are:

Character    Name                      Meaning
.            dot                       Any amino acid allowed
[...]        character class           Amino acids listed are allowed
[^...]       negated character class   Amino acids listed are not allowed
{min,max}    specified range           The preceding element occurs at least min and at most max times
^            caret                     Matches the amino terminus
$            dollar                    Matches the carboxy terminus
?            question                  The preceding element is optional (zero or one occurrence)
*            star                      The preceding element may occur any number of times, including zero
+            plus                      The preceding element occurs once or more
|            alternation               Matches either expression it separates
(...)        parentheses               1. Used to mark positions of specific interest, e.g. the amino acid being covalently modified.
                                       2. Used to group parts of the expression.

Examples:

The regular expression for MOD_SUMO: [VILAFP](K).[EDNGP]

  • First position: [VILAFP] - one of the amino acids V, I, L, A, F or P must occur.
  • Second position: (K) - K must occur. This amino acid is covalently modified, and is therefore surrounded by parentheses.
  • Third position: . - any amino acid is allowed.
  • Fourth position: [EDNGP] - one of the amino acids E, D, N, G or P must occur.
  • Examples of matching sequences:
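To try the MOD_SUMO expression out, the sketch below tests it against a few made-up peptides and uses the parenthesised group to locate the modified lysine. The peptides and the helper name `modified_lysine_position` are assumptions for this example, not annotated ELM instances.

```python
import re

# MOD_SUMO expression as given above; group 1 marks the modified lysine.
MOD_SUMO = re.compile(r"[VILAFP](K).[EDNGP]")


def modified_lysine_position(peptide):
    """Return the 1-based position of the modified K, or None if no match."""
    match = MOD_SUMO.search(peptide)
    if match is None:
        return None
    return match.start(1) + 1


# Made-up peptides: the first three satisfy the expression, the last two
# fail at the first position (G) and at the second position (R).
peptides = ["VKTE", "IKQE", "AKAP", "GKTE", "VRTE"]
results = {peptide: modified_lysine_position(peptide) for peptide in peptides}
for peptide, position in results.items():
    print(peptide, position)
```

In a longer sequence the same helper reports the position within the whole sequence; for the made-up sequence MAAVKTEPLLQ it returns 5.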

The regular expression for LIG_PCNA: (^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..

  • First position: (^.{0,3}|Q) - either the amino terminus followed by 0 to 3 amino acids of any kind, or Q. The parentheses group the two alternatives.
  • Second position: . - any amino acid is allowed.
  • Third position: [SKRNDT] - one of the amino acids S, K, R, N, D or T must occur.
  • Fourth position: [ILM] - one of the amino acids I, L, or M must occur.
  • Fifth and sixth position: [^P] - any amino acid except P is allowed.
  • Seventh position: [FHM] - one of the amino acids F, H, or M must occur.
  • Eighth position: [YFM] - one of the amino acids Y, F, or M must occur.
  • Positions 9 and 10: . - any amino acid is allowed. These are included to indicate that the motif should be at least 10 amino acids long.
  • Examples of matching sequences:
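The two alternatives in the first position of the LIG_PCNA expression can be seen in a small test. The peptides below are made up to exercise each branch and are not annotated ELM instances; `find_pcna_box` is a hypothetical helper name.

```python
import re

# LIG_PCNA expression as given above; group 1 holds the first position.
LIG_PCNA = re.compile(r"(^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..")


def find_pcna_box(peptide):
    """Return (start, end, first_position_text) or None; positions are 1-based."""
    match = LIG_PCNA.search(peptide)
    if match is None:
        return None
    return match.start() + 1, match.end(), match.group(1)


peptides = [
    "MSKLAAFYAA",       # motif within 3 residues of the amino terminus
    "AAAAAQASLAAFFAA",  # internal motif, anchored by Q
    "AAAAAAASLAAFFAA",  # same core, but no Q and too far from the terminus
    "QASLPAFFAA",       # P in the fifth position breaks the motif
]
results = {peptide: find_pcna_box(peptide) for peptide in peptides}
for peptide, result in results.items():
    print(peptide, result)
```

The first two peptides match through different branches of the alternation, while the last two return None.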
 
More information on regular expressions: Regular expression HOWTO


o Dictionary
Definitions and explanations of terms used in the ELM resource.
Biochemical context: For functional sites, the biochemical context has several components: the sequence motif, its relation to the local structure and other domains in the protein, as well as the protein complex it may reside in.
Cellular context: Where and when in the cell a site is functional.
Context: The space and time where a molecular function takes place.
ELM: 1. Eukaryotic Linear Motif.
     2. The common pattern of a set of linear (sub)sequences that can be related to a molecular function.
ELM instance: An experimentally verified instance of an ELM in a particular polypeptide.
ELM instance sequence: A protein sequence carrying one or more experimentally verified ELM instances.
Filter: Method for discriminating between likely positive and negative ELM predictions, based on context information.
Functional site: A set of short linear (sub)sequences that can be related to a molecular function.
GO: Gene Ontology, a controlled vocabulary for describing the molecular functions, cellular components and biological processes of genes. See http://www.geneontology.org/
Molecular context: See Biochemical context.
Molecular function: The nature of the interaction of a protein with another molecule.
Sequence motif: A recurrent pattern of conserved amino acids. The amino acids may be absolutely, chemically or sterically conserved.
Siteseeing: The process of annotating ELMs, including development of detection methods and evaluation of context information used to formulate discriminating filters and rules.

Last modified 12-FEB-2008- webmaster