Questions and answers |
|
| |
What methods are used for detecting functional
sites?
Currently, patterns written as regular
expressions are used for detecting functional
sites. Since most ELMs are very short motifs,
many of them will overpredict, implying
that most matches shown are more likely
to be false positives than true matches.
To improve the predictive value of ELM,
we use logical filters (or rules) based
on context information to discriminate between
likely true and false positives. (See below.)
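As a rough illustration of regular-expression based detection, the following minimal Python sketch scans a sequence for a pattern and reports every match. The MOD_SUMO expression is the one quoted further down this page; the function name `scan` and the test sequence are invented for the example and are not part of the ELM server.

```python
import re

# Illustrative pattern table. Only the MOD_SUMO expression comes from this page;
# the function name and the example sequence are made up for the sketch.
PATTERNS = {
    "MOD_SUMO": r"[VILAFP](K).[EDNGP]",
}

def scan(sequence, patterns=PATTERNS):
    """Return (name, start, end, matched_text) tuples, 1-based and inclusive."""
    hits = []
    for name, regex in patterns.items():
        # Wrapping the pattern in a zero-width lookahead reports overlapping matches too.
        for m in re.finditer(f"(?=({regex}))", sequence):
            text = m.group(1)
            start = m.start() + 1
            hits.append((name, start, start + len(text) - 1, text))
    return hits

for hit in scan("MSAIKTEPLKHEAA"):
    print(hit)
```

Even this short invented sequence gives two matches, which illustrates why a four-residue pattern needs context filters before a match can be trusted.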
|
| |
Why are the ELM predictions not scored?
Since a regular expression either matches
or does not match a string (or a protein
subsequence), by its nature the score would
be 1 or 0.
However, if a prediction is in
a sequence that is similar or identical
to an ELM instance sequence, and the motif
is positionally conserved, the prediction
will be scored by the ELM instance mapper. If a prediction is within a known domain structure, it will be scored by the structure filter. |
| |
What does the ELM instance mapper do?
The ELM instance mapper uses PHI-BLAST
to map ELM predictions of a query sequence
to known annotated ELM instances. The basis
for the mapping is sequence similarity and
positional conservation of the motif. If
an ELM prediction is mapped to a known ELM
instance, the prediction is given a score
from 0 to 1, where 1 means that the query
sequence is identical to the database sequence,
and the prediction can be considered an
experimentally verified ELM instance.
The result of a successful
ELM instance mapping is the raw PHI-BLAST
output and a summary of the results with
the calculated score. |
| |
Why is the context of a functional site
important?
A primary prerequisite for a site to be
functional is that it must be recognised
by the protein that will act upon it, in
other words it must occur in the proper
biochemical context. This implies that the
functional site must reside in a non-structured
region of a protein: either in a non-globular
region or in loops within globular domains
(see structure
filter).
Furthermore, in order
for a molecular function to occur, the functional
site must be in the right cellular context.
That is, the protein harbouring the functional
site must be in the right cellular compartment
(see cell
compartment filter).
Finally, while some functional
sites are found in all eukaryotes, others
are restricted to a specific species range
(see taxonomic
range filter). |
| |
What are the currently implemented context
filters?
Taxonomic
range filter
This filter relies on the user to submit
the correct species for the protein sequence.
ELMs are annotated with the proper taxonomic
range using NCBI
taxonomy node identifiers, and only
ELMs relevant for that species will be shown
in the prediction results.
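The idea behind this filter can be sketched as a lookup of the submitted species' lineage against the taxonomy nodes annotated on each ELM. Everything below is an assumption made for illustration: the lineages are hand-written and heavily shortened, the ELM names are placeholders, and the real server reads its annotations from its own database.

```python
# Hypothetical, shortened lineages given as NCBI taxonomy node identifiers.
# A real tool would obtain the full lineage from the NCBI taxonomy.
LINEAGES = {
    9606: [2759, 33208, 9606],  # Homo sapiens: Eukaryota, Metazoa, species
    4932: [2759, 4751, 4932],   # Saccharomyces cerevisiae: Eukaryota, Fungi, species
}

# Placeholder ELM annotations: each ELM lists the taxonomy nodes it is valid for.
ELM_TAXONOMY = {
    "ELM_ALL_EUKARYOTES": {2759},
    "ELM_METAZOA_ONLY": {33208},
    "ELM_FUNGI_ONLY": {4751},
}

def elms_for_species(species_taxid):
    """Names of ELMs whose taxonomic range contains the given species."""
    lineage = set(LINEAGES[species_taxid])
    return sorted(name for name, nodes in ELM_TAXONOMY.items() if nodes & lineage)

print(elms_for_species(9606))
print(elms_for_species(4932))
```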
Cell
compartment filter
Since many functional sites are restricted
to one or several cellular compartments,
ELMs are annotated with GO terms for the
proper set of compartments. Currently this
filter relies on user-supplied information;
in the future, the compartment will be predicted
if it is not supplied. Only ELMs relevant to the
user-supplied compartments will be shown
in the prediction results.
Structure filter
Since functional sites must be accessible, they cannot reside deep inside globular domains
(unless the domain is known to be allosteric). Therefore, most true motifs are in the exposed
loops. Whenever a globular domain structure is available (≥70% identity to the query sequence),
the structure filter evaluates the context of linear motifs lying within it. A summary of the 2D structural
elements is provided in the 2D structure line using PDBsum.
The filter scores both the secondary structure overlap and the accessibility of the individual residues. Final secondary structure (SSSE) and accessibility (SA) scores are assigned by averaging individual scores over the whole motif. The score range
is calibrated on a benchmark of true motifs in solved structures and divided into three categories (blue: very good, half-blue: possible, grey: bad). Hovering the mouse over the ELM graphic reveals
the scores. Links are provided to display each motif in context using JMOL
and separately to a full summary of the filter results.
See here for more in-depth details on the structure filter.
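The averaging step can be sketched as follows. This is only a minimal illustration: the cut-off values, the per-residue numbers and the function names are invented, since the benchmark calibration used by the server is not given on this page.

```python
from statistics import mean

# Hypothetical cut-offs; the real calibration comes from the benchmark of
# true motifs mentioned in the text and is not reproduced here.
GOOD_CUTOFF = 0.7
POSSIBLE_CUTOFF = 0.4

def motif_score(per_residue_scores, start, end):
    """Average per-residue scores over a motif (1-based, inclusive positions)."""
    return mean(per_residue_scores[start - 1:end])

def category(score):
    """Map an averaged score onto the three display categories."""
    if score >= GOOD_CUTOFF:
        return "blue (very good)"
    if score >= POSSIBLE_CUTOFF:
        return "half-blue (possible)"
    return "grey (bad)"

# Invented per-residue accessibility values for an 8-residue stretch.
accessibility = [0.1, 0.2, 0.9, 0.8, 0.7, 0.8, 0.2, 0.1]
sa = motif_score(accessibility, 3, 6)
print(round(sa, 2), category(sa))
```

The same averaging would apply to the secondary structure score, with its own cut-offs.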
Globular
domain filter
If no structure is available but there is an identified domain, the globular
domain filter provides a rougher guide to context. It uses SMART
and Pfam to predict globular
domains, and ELMs predicted inside these are coloured grey. Since some functional
sites can occur in loops of globular domains (e.g. RGD motifs), some true predictions
are filtered out; users are encouraged to inspect the domain database entries,
and 2D structure predictions might also be informative.
Users should be aware that about 10% of
the Pfam entries do not in fact correspond
to globular domains. Currently these are
treated as globular domains by the ELM server,
but these will be eliminated in the future.
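A minimal sketch of this kind of interval-based filtering is shown below. It assumes that a match is greyed out only when it lies completely inside a predicted domain; how the server treats partial overlaps is not stated here, and the coordinates and function names are invented.

```python
def inside_domain(match_start, match_end, domains):
    """True if the match lies completely within any (start, end) domain interval."""
    return any(d_start <= match_start and match_end <= d_end
               for d_start, d_end in domains)

def colour_matches(matches, domains):
    """Label each (name, start, end) match as 'grey' or 'normal'."""
    return [
        (name, start, end, "grey" if inside_domain(start, end, domains) else "normal")
        for name, start, end in matches
    ]

# Invented coordinates: one predicted domain and three motif matches.
domains = [(40, 100)]
matches = [("MOD_SUMO", 12, 15), ("MOD_SUMO", 48, 51), ("LIG_PCNA", 95, 104)]
for row in colour_matches(matches, domains):
    print(row)
```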
In addition
to globular domains, SMART predicts coiled
coils (COILS),
signal peptide cleavage sites
(Sigcleave),
low complexity regions
(SEG),
and transmembrane helices (TMHMM2).
ELM uses these predictions for rough filtering
as well.
The GlobPlot algorithm
is used to detect potential globular domains
as well as protein disorder:
GlobDoms: segments of
the sequence that are ordered according to
the Russell/Linding scale. These segments
correspond to potential globular domains.
Disorder: segments of
the sequence that are potentially disordered
and unstructured according to the Russell/Linding
scale. These regions are on average enriched
in ELMs [Linding et al. 2004].
For more information refer to the GlobPlot
and DisEMBL
papers. |
| |
Is there a nomenclature
for representing functional site motifs?
Yes! In the ELM project we intend to use
the nomenclature suggested by R.
Aasland et al. (FEBS Lett. 2002; 513: 141-144.)
when appropriate.
For example: the sumoylation motif (MOD_SUMO)
can be described by the regular
expression [VILAFP](K).[EDNGP], but
a simpler description is the consensus sequence:
%KxE. This is not as precise
as the regular expression, but easier to
read. Amino acids are written in the IUPAC
one letter code, the % sign means hydrophobic
amino acid residue, and the x is any amino
acid.
Primarily, Greek symbols are used for representing
different groups of amino acids, and ASCII
equivalents are given for use in plain text
files and e.g. for web-pages. ELM uses the
ASCII symbols. |
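To make the relation between a consensus sequence and a regular expression concrete, here is a small Python sketch that expands %KxE into a pattern. Only the meaning of % and x is taken from the text above; the residue set chosen for "hydrophobic" is an assumption made for this example and may differ from the published nomenclature.

```python
import re

# Assumed symbol table: the text defines '%' as hydrophobic and 'x' as any
# amino acid; the concrete hydrophobic residue set is an illustrative guess.
SYMBOLS = {
    "%": "[VILFWYM]",
    "x": ".",
}

def consensus_to_regex(consensus):
    """Expand consensus symbols; ordinary one-letter codes are kept as they are."""
    return "".join(SYMBOLS.get(ch, ch) for ch in consensus)

consensus_pattern = consensus_to_regex("%KxE")
elm_pattern = r"[VILAFP](K).[EDNGP]"  # MOD_SUMO expression quoted above

for peptide in ("IKTE", "AKTD", "WKTE"):
    print(peptide,
          bool(re.fullmatch(consensus_pattern, peptide)),
          bool(re.fullmatch(elm_pattern, peptide)))
```

Under these assumptions the invented peptide AKTD fits the regular expression but not the consensus, while WKTE fits the consensus but not the regular expression, which shows how the two descriptions can disagree.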
|

Regular expressions |
Regular expressions are similar to the
PROSITE
patterns, but have a slightly different
syntax:
Amino acids are written in one letter code
(see IUPAC
Nomenclature and Symbolism for Amino Acids
and Peptides), other characters are:
| Character | Name | Meaning |
| . | dot | Any amino acid allowed |
| [...] | character class | Amino acids listed are allowed |
| [^...] | negated character class | Amino acids listed are not allowed |
| {min,max} | specified range | The preceding amino acid (or group) must occur at least min and at most max times |
| ^ | caret | Matches the amino terminus |
| $ | dollar | Matches the carboxy terminus |
| ? | question | The preceding amino acid (or group) is optional |
| * | star | The preceding amino acid (or group) may occur any number of times, including zero |
| + | plus | The preceding amino acid (or group) must occur at least once; additional repeats are optional |
| | | alternation | Matches either expression it separates |
| (...) | parentheses | 1. Used to mark positions of specific interest, e.g. the amino acid being covalently modified. 2. Used to group parts of the expression |
|
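Because the syntax is close to that of PROSITE patterns, a simple pattern can be translated mechanically. The converter below is a hedged sketch that handles only the common PROSITE elements (x, [..], {..}, repetition counts, and the < and > anchors); the function name and the example pattern are invented, and it is not the tool used by ELM.

```python
import re

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern to the regular-expression syntax above."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (2) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, repeat = m.group(1), m.group(2)
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # excluded amino acids
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)

# Invented pattern, used only to exercise every supported element.
regex = prosite_to_regex("<M-x(0,3)-[ST]-{P}-K>.")
print(regex)
print(bool(re.search(regex, "MAASAK")), bool(re.search(regex, "MAASPK")))
```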
Examples:
The regular expression for MOD_SUMO:
[VILAFP](K).[EDNGP]
- First position: [VILAFP]
- one of the amino acids V, I, L, A, F
or P must occur.
- Second position: (K)
- K must occur. This amino acid is covalently
modified, and is surrounded by parentheses.
- Third position: . -
any amino acid is allowed.
- Fourth position: [EDNGP]
- one of the amino acids E, D, N, G or
P must occur.
- Examples of matching sequences:
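The positions described above can be tried out with Python's `re` module. The peptides in this sketch are invented for illustration and are not annotated ELM instances; the helper name `modified_positions` is likewise an assumption.

```python
import re

MOD_SUMO = re.compile(r"[VILAFP](K).[EDNGP]")

def modified_positions(sequence):
    """1-based positions of the lysine marked by the parentheses in each match."""
    return [m.start(1) + 1 for m in MOD_SUMO.finditer(sequence)]

# Invented test peptides.
for peptide in ("IKTE", "AKQD", "GGLKPEGG", "IKTK"):
    print(peptide, modified_positions(peptide))
```

The last peptide gives no match because its fourth position is K, which is not in [EDNGP].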
The regular expression for LIG_PCNA:
(^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..
- First position: (^.{0,3}|Q)
- either the amino terminus followed by at
most 3 amino acids of any kind, or a Q.
The parentheses are used to group the
two alternatives.
- Second position: .
- any amino acid is allowed.
- Third position: [SKRNDT]
- one of the amino acids S, K, R, N, D
or T must occur.
- Fourth position: [ILM]
- one of the amino acids I, L, or M must
occur.
- Fifth and sixth positions: [^P]
- any amino acid except P is allowed.
- Seventh position: [FHM]
- one of the amino acids F, H, or M must
occur.
- Eighth position: [YFM]
- one of the amino acids Y, F, or M must
occur.
- Positions 9 and 10: .
- any amino acid is allowed. These are
included to indicate that the motif should
be at least 10 amino acids long.
- Examples of matching sequences:
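The two alternatives in the first position can be checked with a short Python sketch. The sequences are invented for the example and do not come from the ELM database; `find_pcna` is a hypothetical helper name.

```python
import re

LIG_PCNA = re.compile(r"(^.{0,3}|Q).[SKRNDT][ILM][^P][^P][FHM][YFM]..")

def find_pcna(sequence):
    """Return (start, end, text) of the first match, 1-based inclusive, or None."""
    m = LIG_PCNA.search(sequence)
    return None if m is None else (m.start() + 1, m.end(), m.group(0))

# Invented sequences: an internal motif preceded by Q, a motif close to the
# amino terminus without Q, and the same motif too far from the terminus.
print(find_pcna("GGGGGQASLAAFYGG"))
print(find_pcna("MKSLAAFYGG"))
print(find_pcna("GGGGGMKSLAAFYGG"))
```

The third sequence is rejected because its motif neither starts with Q nor lies within the first few residues of the chain.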
|
| |
| More information on regular expressions:
Regular
expression HOWTO |
|

|
Dictionary |
| Definitions and explanations to terms used
in the ELM resource. |
| Biochemical
context |
For functional
sites, the biochemical context has
several components: the sequence motif,
its relation to the local structure
and other domains in the protein as
well as the protein complex it may reside
in. |
| Cellular
context |
Where and when in the cell a site
is functional. |
| Context |
The space and time where a molecular
function takes place. |
| ELM |
1. Eukaryotic Linear Motif.
2. The common pattern of a set of linear
(sub)sequences that can be related to
a molecular
function. |
| ELM
instance |
An experimentally verified instance
of an ELM in a particular
polypeptide. |
| ELM instance sequence |
A protein sequence carrying one or
more experimentally verified ELM
instances |
| Filter |
Method for discriminating between
likely positive and negative ELM predictions;
based on context information. |
| Functional
site |
A set of short linear (sub)sequences
that can be related to a molecular
function. |
| GO |
Gene Ontology: a controlled vocabulary
for describing the molecular functions,
cellular components and biological processes
of genes. See http://www.geneontology.org/ |
| Molecular
context |
see Biochemical
context. |
| Molecular
function |
The nature of interaction of a protein
with another molecule. |
| Sequence
motif |
A recurrent pattern of conserved amino
acids. The amino acids may be absolutely,
chemically or sterically conserved. |
| Siteseeing |
The process of annotating ELMs, including
development of detection methods and
evaluation of context information used
to formulate discriminating filters
and rules. |
|
|
 |