For MS-MS Database Search
Table of Contents:
I.
Introduction
II.
Usage
a.
Basic
b.
Specifying
amino acid masses
c.
Specifying static or
dynamic mass modifications
d.
Configuration
parameters guide
III.
Interpreting
results
c.
Validation
MyriMatch is
a tool designed to take experimental data from shotgun proteomics experiments
and compare those spectra against sequences in a known database of proteins.
Whether the program is being run in a single-computer environment or across an
entire cluster of processing nodes, it is able to optimally divide work in a
much more efficient way than many other database search programs. This is
because it only generates candidate sequences from the known database once for
the entire set of spectra instead of once for every spectrum. Thus, for each
candidate sequence generated, it is compared against every spectrum. The
spectra keep a certain (user-defined) number of candidate sequences that had
the highest scores.
a) The basic usage of
MyriMatch is quite simple:
myrimatch [flags] <FASTA database filepath>
<MS/MS data filepath in a supported file format> <another MS/MS data
filepath>
When running
it from the command line, the command line parser first determines what flags
you have specified in the command. The flags can be anywhere on the command
line. The following basic flags are supported:
-cfg <file> specifies
a runtime configuration [default: myrimatch.cfg]
-rescfg <file> specifies
an amino acid residue mass configuration [default: residue_masses.cfg]
-workdir <path> specifies
an absolute path to use as the working directory during execution [default:
current working directory]
-cpus <integer> specifies
the number of worker threads to use during search [default: all available
processors]
If a flag is
specified that expects an argument but no argument is provided, it might be
treated as a spectrum data file which probably undesirable. If you do not
specify a runtime configuration file with -cfg and the default configuration
file is not found, then default runtime values are used (and a warning that no
configuration file was found will be shown). Likewise, if you do not specify a
residue masses file with -rescfg and the default residue masses configuration
file is not found, then hard-coded values will be used for the 20 common amino
acids.
There is
another type of flag that is supported that has a unique pattern: the override
flags. Instead of having a name like cfg, the override flags have the same name
as the variable that they override. Overriding a variable is specifying a
different value on the command line than the one that is in the configuration
file (just like the configuration file overrides the built-in values). For
example, to override the variable DynamicMods to have the value “M @ 16”, use
the override flag:
-DynamicMods
"M @ 16"
The double
quotes are necessary on the command line because the value of the variable has
spaces in it.
After the
flags are parsed, the file arguments are processed. The first file argument
must be the relative or absolute path to the FASTA protein database you want to
search against. Every file argument after that is a relative or absolute path
to a MS/MS spectra data file from which to extract experimental spectra. The
FASTA database filepath must be a valid filepath, it does not support
wildcards. The MS/MS spectra data filepaths, however, do support wildcards. The
provided spectra must be in one of the following supported formats:
mzML 1.0
mzXML 3.0
Mascot Generic (MGF)
Bruker BioTools Data Exchange (btdx.xml)
Thermo RAW
Waters RAW
b) The residue masses configuration file,
if present, will override the 20 default amino acids whose masses are
hard-coded. The
residue masses file is not intended to be changed regularly if you want to
specify a static mass modification, refer to the StaticMods variable. The file should have a number of rows,
one per amino acid, where each row takes the form:
<AA residue
character> <monoisotopic mass> <average mass>
c) A static mass modification is
something like carboxymethylation of cysteines, where all cysteines should be
treated as about +57 in MyriMatch and all subsequent downstream analysis. Refer
to the StaticMods variable in the configuration parameters guide. A dynamic
mass modification is something like a potential oxidation
of methionine, where each methionine may be occur as either its natural mass or
about +16. Refer to the DynamicMods variable in the configuration parameters
guide.
d)
Configuration
parameters guide
|
Name (Type, Default
Value) |
Description |
|
NumChargeStates (integer, 3) |
Controls
the number of charge states that MyriMatch will handle during all stages of
the program. It is especially important during determination of charge state
(see DuplicateSpectra for more information). |
|
OutputSuffix (string, none) |
The output
of a MyriMatch job will be a SQT file for each input file. The string
specified by this parameter will be appended to each SQT filename. It is
useful for differentiating jobs within a single directory. |
|
StartSpectraScanNum (integer, 0) EndSpectraScanNum (integer, -1) |
A useful
feature to focus a search on a subset of spectra in a particular data file,
these two parameters can be set in order to limit the possible range of scan
numbers that MyriMatch will read from the input data files. By default, all
tandem mass spectra in the input files are read in for processing. |
|
StartProteinIndex (integer, 0) EndProteinIndex (integer, -1) |
A useful
feature to focus a search on a subset of proteins in the protein database, these
two parameters can be set in order to limit the range of proteins that
MyriMatch will read from the protein database. By default, all proteins in
the protein database are read in for processing. |
|
StatusUpdateFrequency (real, 5 seconds) |
Preprocessing
spectra and scoring candidates may take a long time. A measure of progress
through the protein database will be given on intervals that are specified by
this parameter, measured in seconds. |
|
UseChargeStateFromMS (boolean, false) |
If true,
MyriMatch will use the charge state from the input data if it is available.
If false, or if charge state is not available from a particular spectrum,
MyriMatch will use its internal algorithm to determine charge state. If, for
a given spectrum, MyriMatch uses its internal algorithm to determine charge
state and the result is multiply charged, that spectrum may be duplicated to
other charge states (see DuplicateSpectra for more information). |
|
DuplicateSpectra (boolean, true) |
If
MyriMatch determines a spectrum to be multiply charged and this parameter is
true, the spectrum will be copied and treated as if it was all possible
charge states from +2 to +<NumChargeStates>. If this parameter is
false, the spectrum will simply be treated as a +2. |
|
MinSequenceMass (real, 0 Da) |
When
preprocessing the experimental spectra, any spectrum with a precursor mass
that is less than the specified mass will be disqualified. This parameter is
useful to eliminate inherently unidentifiable spectra from an input data set.
A setting of 500 for example, will eliminate most 3-residue matches and clean
up the output file quite a lot. |
|
MaxSequenceMass (real, 10000 Da) |
When
preprocessing the experimental spectra, any spectrum with a precursor mass that
exceeds the specified mass will be disqualified. |
|
TicCutoffPercentage (real, 0.98) |
In order to
maximize the effectiveness of the MVH scoring algorithm, an important step in
preprocessing the experimental spectra is filtering out noise peaks. Noise
peaks are filtered out by sorting the original peaks in descending order of
intensity, and then picking peaks from that list until the cumulative ion
current of the picked peaks divided by the total ion current (TIC) is greater
than or equal to this parameter. Lower percentages mean that less of the
spectrums total intensity will be allowed to pass through preprocessing. See
the section on Advanced Usage for tips on how to use this parameter
optimally. |
|
NumIntensityClasses (integer, 3) |
Before
scoring any candidates, experimental spectra have their peaks stratified into
the number of intensity classes specified by this parameter. Spectra that are
very dense in peaks will likely benefit from more intensity classes in order
to best take advantage of the variation in peak intensities. Spectra that are
very sparse will not see much benefit from using many intensity classes. |
|
AdjustPrecursorMass (boolean, false) |
If true, the
preprocessing step will correct the precursor mass by adjusting it through a
specified range in steps of a specified length, finally choosing the optimal
adjustment. The optimal adjustment is the one that maximizes the sum of
products of all complementary peaks in the spectrum. |
|
MinPrecursorAdjustment (real, -2.5 Da) |
When
adjusting the precursor mass, this parameter sets the lower mass limit of
adjustment allowable from the original precursor mass, measured in |
|
MaxPrecursorAdjustment (real, 2.5 Da) |
When
adjusting the precursor mass, this parameter sets the upper mass limit of
adjustment allowable from the original precursor mass, measured in |
|
PrecursorAdjustmentStep (real, 0.1 Da) |
When adjusting
the precursor mass, this parameter sets the size of the steps between
adjustments, measured in |
|
DeisotopingMode (integer, 0) |
Deisotoping
a spectrum (consolidating isotopic peak intensities into the monoisotopic
peaks intensity) during preprocessing will significantly improve precursor
adjustment, and it may be desirable to keep the deisotoped spectrum around
for candidate scoring as well. Set to 0, no deisotoping will be used. Set to
1, deisotoping will be used for precursor adjustment only. Set to 2,
deisotoping will be used for both precursor adjustment and for candidate
scoring. |
|
IsotopeMzTolerance (real, 0.25 m/z) |
When
deisotoping a spectrum, an isotopic peak is one that is the mass of a neutron
higher than another peak, tolerating variation based on the value of this
parameter. Deisotoping actually traverses the spectrum at multiple charge
states, starting from the highest (NumChargeStates) and ending at the lowest. |
|
ComplementMzTolerance (real, 0.5 m/z) |
When
adjusting the precursor mass, this parameter controls how much tolerance
there is on each side of the calculated m/z when looking for a peaks
complement. |
|
StaticMods (string, none) |
If a
residue (or multiple residues) should always be treated as having a
modification on their natural mass, set this parameter to inform the search
engine which residues are modified. Residues are entered into this string as
a space-delimited list of pairs. Each pair is of the form: <AA residue character> <mod
mass> Thus, to
treat cysteine as always being carboxymethylated, this parameter would be set
to something like the string: “C 57” |
|
DynamicMods (string, none) |
Note: avoid using the ‘#’ symbol in a
configuration file since it begins a comment section. Using the ‘#’ symbol in
a command-line override works fine. In order to
search a database for potential post-translational modifications of candidate
sequences, the user must configure this parameter to inform the search engine
which residues may be modified. Residues that are modifiable are entered into
this string in a space-delimited list of triplets. Each triplet is of the
form: <AA motif> <character to
represent mod> <mod mass> Thus, to
search for potentially oxidized methionines and phosphorylated serines, this
parameter would be set to something like the string: “M * 15.995
S $ 79.966” The AA
motif can include multiple residues, and the peptide termini are represented
by opening ‘(‘ and closing ‘)’ parentheses for the N
and C termini, respectively. For example, an N terminal deamidation of
glutamine may be specified by the string: “(Q ^ -17” If the last
residue in the motif is not the residue intended to be modifiable, then use
an exclamation mark to indicate that the residue preceding the mark is the
modifiable residue. Using the previous example, “(Q! ^ -17” is an equivalent
way to specify it. Another example would be specifying the demidation of
asparagine when it is N terminal to a glycine, which might look like: “N!G @ -17” Another
possibility is to specify a block of interchangeable residues in the motif,
which is supported by the ‘[‘ and ‘]’ brackets. For
example, to specify a potential phosphorylation on any serine, threonine, or
tyrosine, use the string: “[STY] *
79.966” The ‘{‘ and ‘}’ brackets work in the opposite way as the ‘[‘ and
‘]’ brackets, i.e. “{STY} * 79.966” specifies a potential phosphorylation on
every residue EXCEPT serine, threonine, or tyrosine. Both kinds of brackets
can be combined with the exclamation mark, in which case the exclamation mark
should come after the block (because the block counts as a single residue).
Using the previous example, “[STY]! * 79.966” is an equivalent way to specify
it. Using the
negative multi-residue brackets is the best way to indicate the “any residue
except” concept, and it works on single residues as well. For example, to
specify a mod on lysine except when it is at the C terminus of a peptide, use
something like the string “K!{)} # 144”. Another
example is specifying the cleavage-blocking homoserine mod in a CnBr digest
when a serine or threonine is C terminal to a methionine: “M![ST] *
-29.99” Note that
it is not currently possible to specify (for example) the non-cleavage-blocking
homoserine lactone mod in a CnBr digest, because the motif would extend
outside of the peptide sequence itself. In the future a string like “M!){ST}
* -17” might work for that, but for now, if ‘(‘ is
used it must be the first character in the motif, and likewise if ‘)’ is used
it must be the last character in a motif. |
|
MaxDynamicMods (integer, 2) |
This
parameter sets the maximum number of modified residues that may be in any
candidate sequence. |
|
MaxResults (integer, 5) |
This parameter
sets the maximum number of candidate sequence matches to report for each
spectrum. |
|
CleavageRules (string,
“[|R|K . . ]”) |
This
important parameter allows the user to control the way peptides are generated
from the protein database. It can be used to configure the search on tryptic
peptides only, on non-tryptics, or anything in between. It can even be used
to test multiple residue motifs at a potential cleavage site. The parameter
itself is a space-delimited list of cleavage rules. A cleavage rule is a
space-delimited pair of strings. The first string of the cleavage rule
specifies the residue or residues that must be N-terminal to a potential
cleavage site. The second string specifies the residue or residues that must
be C-terminal to the site. Either string in the pair can contain multiple
sequences of one or more residues, separated by the | character. A .
character is a wildcard that will accept anything. Additionally, the [ and ]
characters refer to the N and C termini of a protein. Now that
you are thoroughly confused, here are examples of single cleavage rules: R . a
site is valid for cleavage if the N-terminal residue is R R|K . a
site is valid for cleavage if the N-terminal residue is R or K [ . a
site is valid for cleavage at the N terminus of a protein . ] a
site is valid for cleavage at the C terminus of a protein The ‘.’ wildcard is important because it
allows the cleavage routine to work very quickly. However, if you wanted to
leave out proline from a tryptic digest, you would have to explicitly declare
the valid residues for both sides of a cleavage site: R|K
A|C|D|E|F|G|H|I|K|L|M|N|Q|R|S|T|V|W|Y Remember
that this parameter is a list of
cleavage rules; a real tryptic digest can be declared with two cleavage
rules: [|R|K
. . ] a site is valid for
cleavage if it is at the N terminus of a protein, or
if the N-terminal residue is R or K; a site is valid for cleavage if
it is at the C terminus of a protein Also note
that a cleavage rule can have a residue string of more than one residue,
allowing for multiple-residue cleavage motifs: [M|[ . a
site is valid for cleavage if it is at the N terminus of a protein, or if the
N-terminal sequence of residues is [M (i.e. the M must be at the
N terminus of a protein) |
|
NumMinTerminiCleavages (integer, 2) |
By default,
when generating peptides from the protein database, a peptide must start
after a cleavage and end before a cleavage. Setting this parameter to 0 or 1
will reduce that requirement, so that neither terminus or only one terminus
of the peptide must match one of the cleavage rules specified in the
CleavageRules parameter. This parameter is useful to turn a tryptic digest
into a semi-tryptic digest. |
|
NumMaxMissedCleavages (integer, -1) |
By default,
when generating peptides from the protein database, a peptide may contain any
number of missed cleavages. A missed cleavage is a site within the peptide
that matches one of the cleavage rules (refer to CleavageRules). Settings
this parameter to some other number will stop generating peptides from a
sequence if it contains more than the specified number of missed cleavages. |
|
UseAvgMassOfSequences (boolean, true) |
If true,
the mass of candidate sequences will be calculated using the average masses of
its amino acid residues. This parameter should be set based on whether the
experimental data has precursor masses that are monoisotopic. For example,
LCQ/LTQ-derived precursors are generally measured by average masses and
FT-ICR/Orbitrap-derived precursors are generally measured by monoisotopic
masses. |
|
UseSmartPlusThreeModel (boolean, true) |
Once a
candidate sequence has been generated from the protein database, MyriMatch
determines which spectra will be compared to the sequence. For each unique
charge state of those spectra, a set of theoretical fragment ions is
generated by one of several different algorithms. For +1 and
+2 precursors, a +1 b and y ion is always predicted at each peptide bond. For +3 and higher
precursors, the fragment ions predicted depend on the way this parameter is
set. When this parameter is true, then for each peptide bond, an internal
calculation is done to estimate the basicity of the b and y fragment
sequence. The precursors protons are distributed to those ions based on that
calculation, with the more basic sequence generally getting more of the
protons. For example, when this parameter is true, each peptide bond of a +3
precursor will either generate a +2 bi and a +1 yi ion,
or a +1 bi and a +2 yi ion. For a +4 precursor,
depending on basicity, a peptide bond breakage may result in a +1 bi
and a +3 yi ion, a +2 bi and a +2 yi ion, or
a +3 bi and a +1 yi ion. When this parameter is false,
however, ALL possible charge distributions for the fragment ions are
generated for every peptide bond. So a +3 sequence of length 10 will always
have theoretical +1 y5, +2 y5, +1 b5, and +2 b5 ions. |
|
PrecursorMzTolerance (real, 1.25 m/z) |
A generated
sequence candidate is only compared to an experimental spectrum if the
candidates mass is within this tolerance of the experimental spectrums
precursor mass. This value is given in Da/z units, but the actual tolerance
is calculated by multiplying by the charge state. This parameter should be
set to the tolerance that is desired for +1 spectra. At the default value,
the precursor mass tolerances are 1.25, 2.5, and 3.75 Da for the first three
charge states, respectively. |
|
FragmentMzTolerance (real, 0.5 m/z) |
This
parameter controls how much tolerance there is on each side of the calculated
m/z when looking for an ion fragment peak during candidate scoring. |
|
ProteinSampleSize (integer, 100) |
Before
beginning sequence candidate generation and scoring, MyriMatch will do a
random sampling of the protein database to get an estimate of the number of
comparisons that will be done by the job. The bigger the sample size, the
longer this estimate will take and the more accurate it will be. Of course,
if there are fewer proteins in the database than the sample size, all
proteins will be used in the sampling and the number of comparisons will be
exact. |
|
ClassSizeMultiplier (real, 2) |
When
stratifying peaks into a specified, fixed number of intensity classes, this
parameter controls the size of each class relative to the class above it
(where the peaks are more intense). At default values, if the best class, A,
has 1 peak in it, then class B will have 2 peaks in it and class C will have
4 peaks. |
|
NumBatches (integer, 50) |
This
parameter sets a number of batches per node to strive for when using the
MPI-based parallelization features. Setting this too low means that some
nodes will finish before others (idle processor time), while setting it too
high means more overhead in network transmission as each batch is smaller. |
|
ThreadCountMultiplier (integer, 10) |
MyriMatch
is designed to take advantage of (symmetric) multiprocessor systems by
multithreading the database search. A search process on an SMP system will
spawn one worker thread for each processing unit (where a processing unit can
be either a core on a multi-core CPU or a separate CPU entirely). The main
thread then generates a list of worker numbers which is equal to the number
of worker threads multiplied by this parameter. The worker threads then take
a worker number from the list and use that number to iterate through the
protein list. It is possible that one thread will be assigned all the
proteins that generate a few candidates while another thread is assigned all
the proteins that generate many candidates, resulting in one thread finishing
its searching early. By having each thread use multiple worker numbers, the
chance of one thread being penalized for picking all the easy proteins is
reduced because if it finishes early it can just pick a new number. The only
disadvantage to this system is that picking the new number incurs some
overhead because of synchronizing with the other worker threads that might be
trying to pick a worker number at the same time. The default value is a nice
compromise between incurring that overhead and minimizing wasted time. |
|
UseMultipleProcessors (boolean, true) |
If true,
each process will use all the processing units available on the system it is
running on. |
a)
Search-time output of MyriMatch serves
several purposes. The majority of the output will usually be progress
information, telling the user which part of the job that MyriMatch is currently
working on, and in some cases how far along into that part the job is. There
will be periodic updates when MyriMatch is preprocessing spectra and when it is
generating candidates from the database and comparing those candidates against
the spectra. In a multi-process (MPI) job, there will also be progress
information on bulk transfers of data over the network. Additionally, MyriMatch
will display statistics on the spectra that remain after preprocessing,
specifically the average number of peaks in a spectrum before and after
preprocessing. Also provided is the average number of the percentage of peaks
that were filtered out by the preprocessing step. Finally, in the case of an
MPI job, when the database search is complete each node that took part in the
search will display statistics detailing the work that node did. The lines will
be like:
Process #1 (foohost)
stats: <numBatches> / <numProteins> /
<numCandidatesGenerated> / <numCandidatesSearched> /
<numComparisonsDone>
b)
One pepXML (a database search output format
originally developed at the SPC) file is produced for every input spectra file
that a MyriMatch job searches. The file contains an entry for each spectrum
kept during the search (i.e. only the spectra that were not obviously junk) that was compared to any candidate. Only the best
MaxResults (a config parameter, defaulting to 5) matches for each spectrum are
kept and output in the file. The pepXML format is quite versatile, but does
have a few limitations. It is not possible to fully encode MyriMatch’s
CleavageRules variable if multi-residue motifs are used (in the rule “[|[M|K|R .” the “[M” part is a multi-residue motif), and it
is not possible to fully encode Static/DynamicMods if multi-residue motifs are
used (in the mod “(Q ^ -17” the “(Q” part is a multi-residue motif). We have
proposed extensions to pepXML to support these motif-oriented capabilities.
c) To validate results generated by
MyriMatch (and also several other popular search engines), users can pass
pepXML files to IDPickerQonvert (in preparation for analyzing the results in
the IDPicker suite). This approach only works when the protein database that
was searched included distracter (decoy) proteins with a common prefix in their
name that is not a prefix in any of the valid proteins. Most often, the
database that was searched will include proteins in their forward and reverse
orders, with the reversed protein having a prefix added to the name like
“rev_“.