DirecTag
For MS-MS Sequence Tagging
Table of Contents:
I. Introduction
II. Usage
   a. Basic
   b. Specifying amino acid masses
   c. Specifying static or dynamic mass modifications
   d. Configuration parameters guide
III. Interpreting results
I. Introduction
DirecTag is a tool that takes experimental data from shotgun proteomics
experiments and builds sequence tags from the gaps between the most appropriate
peaks in each spectrum. It can divide work evenly across any number of
computers with any number of processors. Once generated, sequence tags are
ranked by several scores that are described more fully in the DirecTag paper.
For each spectrum, DirecTag keeps a user-defined number of the best-ranked
sequence tags.
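To make the idea of building tags from peak gaps concrete, here is a minimal Python sketch. It is not DirecTag's implementation: the function name, the three-residue mass table, and the toy peak list are illustrative assumptions, and the sketch ignores charge states, peak intensities, and scoring. It simply extends a tag whenever the gap between two peaks matches a residue mass within a tolerance.

```python
# Illustrative sketch only; not DirecTag's actual algorithm.
# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203}


def find_tags(peaks_mz, tag_length=3, tolerance=0.5):
    """Return (tag, start_mz) pairs built from chains of residue-sized gaps.

    A tag of length N needs N + 1 peaks, one gap per residue.
    """
    peaks = sorted(peaks_mz)
    tags = []

    def extend(last_index, sequence, start_mz):
        if len(sequence) == tag_length:
            tags.append((sequence, start_mz))
            return
        # Try every later peak as the next peak in the chain.
        for j in range(last_index + 1, len(peaks)):
            gap = peaks[j] - peaks[last_index]
            for residue, mass in RESIDUE_MASSES.items():
                if abs(gap - mass) <= tolerance:
                    extend(j, sequence + residue, start_mz)

    for i in range(len(peaks)):
        extend(i, "", peaks[i])
    return tags


# Toy singly charged spectrum: gaps of about 57.02, 71.04, and 87.03.
peaks = [200.00, 257.02, 328.06, 415.09]
print(find_tags(peaks))
```

With this toy input the only chain of three matching gaps starts at the first peak, so a single tag is reported.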
II. Usage
a) The basic usage of DirecTag:
directag [flags] <MS/MS data filepath in a supported file format> <another MS/MS data filepath>
When DirecTag is run from the command line, the command line parser first
determines which flags you have specified. Flags can appear anywhere on the
command line. The following basic flags are supported:
-cfg <file>       specifies a runtime configuration [default: directag.cfg]
-rescfg <file>    specifies an amino acid residue mass configuration [default: residue_masses.cfg]
-workdir <path>   specifies an absolute path to use as the working directory during execution [default: current working directory]
-cpus <integer>   specifies the number of worker threads to use during tagging [default: all available processors]
If a flag that expects an argument is specified without one, the next item on
the command line might be misinterpreted (for example, the flag might be treated
as a spectrum data file), which is probably undesirable. If you do not specify a runtime
configuration file with -cfg and the default
configuration file is not found, then default runtime values are used (and a
warning that no configuration file was found will be shown). Likewise, if you
do not specify a residue masses file with -rescfg and
the default residue masses configuration file is not found, then hard-coded
values will be used for the 20 common amino acids.
Another type of flag, the override flag, follows a different pattern. Instead
of having a fixed name like cfg, an override flag has the same name as the
variable it overrides. Overriding a variable means specifying a value on the
command line that replaces the one in the configuration file (just as the
configuration file overrides the built-in values). For example, to override
the variable DynamicMods to have the value "M @ 16", use the override flag:
-DynamicMods "M @ 16"
The double quotes are necessary on the command line because the value of the
variable has spaces in it.
After the flags are parsed, the file arguments are processed. Each file
argument must be the relative or absolute path to an MS/MS spectra data file
in one of the following supported file formats:
- mzData 1.05
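The flag handling described above can be sketched as a small Python helper that assembles a DirecTag argument list. This is only an illustration: the helper name build_directag_command and the input filename run1.mzData are hypothetical, and the function does not run anything. Passing the list to a process launcher avoids the shell quoting issue, while shlex.join shows what a quoted shell command line would look like.

```python
import shlex


def build_directag_command(spectra_files, cfg=None, rescfg=None,
                           workdir=None, cpus=None, overrides=None):
    """Assemble an argument list for directag (hypothetical helper)."""
    args = ["directag"]
    if cfg is not None:
        args += ["-cfg", cfg]
    if rescfg is not None:
        args += ["-rescfg", rescfg]
    if workdir is not None:
        args += ["-workdir", workdir]
    if cpus is not None:
        args += ["-cpus", str(cpus)]
    # Override flags use the name of the variable they override.
    for name, value in (overrides or {}).items():
        args += ["-" + name, str(value)]
    # Every flag above is followed by its argument, so the file
    # arguments cannot be mistaken for flag values.
    args += list(spectra_files)
    return args


command = build_directag_command(
    ["run1.mzData"],               # hypothetical input file
    cfg="directag.cfg",
    cpus=4,
    overrides={"DynamicMods": "M @ 16"},
)
# shlex.join quotes the value that contains spaces.
print(shlex.join(command))
```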
b) The residue masses configuration file, if present, will override the
hard-coded masses of the 20 default amino acids. The residue masses file is
not intended to be changed regularly; if you want to specify a static mass
modification, refer to the StaticMods variable instead. The file should have
a number of rows, one per amino acid, where each row takes the form:
<AA residue character> <monoisotopic mass> <average mass>
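As an illustration of the row format above, the following Python sketch parses residue mass rows into a dictionary. It assumes whitespace-separated fields and nothing else about the file (for example, it does not assume any comment syntax); the function name and the two sample rows are illustrative, and the sample masses are approximate.

```python
def parse_residue_masses(text):
    """Map each residue character to (monoisotopic mass, average mass)."""
    masses = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3 or len(fields[0]) != 1:
            raise ValueError(f"line {line_number}: expected "
                             "'<residue> <mono mass> <avg mass>'")
        residue, mono, avg = fields
        masses[residue] = (float(mono), float(avg))
    return masses


# Two sample rows (glycine and alanine) with approximate masses.
sample = "G 57.02146 57.0519\nA 71.03711 71.0788\n"
print(parse_residue_masses(sample))
```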
c) A static mass
modification is something like carbamidomethylation of cysteines, where all cysteines
should be treated as about +57 in DirecTag and all
subsequent downstream analysis. Refer to the StaticMods
variable in the configuration parameters guide. A dynamic mass modification is
something like a potential oxidation of methionine, where each methionine
may occur as either its natural mass or about +16. Refer to the DynamicMods variable in the configuration parameters guide.
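The difference between the two kinds of modification can be sketched in Python. The parsers below follow the string formats given in the configuration parameters guide (pairs for StaticMods, triplets for DynamicMods); everything else, including the function names and the way masses are combined, is an assumption made for illustration rather than DirecTag's code.

```python
def parse_static_mods(value):
    """Parse '<residue> <mod mass>' pairs, e.g. 'C 57'."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("StaticMods must be a list of pairs")
    return {tokens[i]: float(tokens[i + 1])
            for i in range(0, len(tokens), 2)}


def parse_dynamic_mods(value):
    """Parse '<residue> <symbol> <mod mass>' triplets."""
    tokens = value.split()
    if len(tokens) % 3 != 0:
        raise ValueError("DynamicMods must be a list of triplets")
    mods = {}
    for i in range(0, len(tokens), 3):
        residue, symbol, mass = tokens[i:i + 3]
        mods.setdefault(residue, []).append((symbol, float(mass)))
    return mods


def possible_masses(residue, base_mass, static_mods, dynamic_mods):
    """List (label, mass) options for one residue.

    A static mod always applies; each dynamic mod adds one extra option.
    """
    mass = base_mass + static_mods.get(residue, 0.0)
    options = [(residue, mass)]
    for symbol, delta in dynamic_mods.get(residue, []):
        options.append((residue + symbol, mass + delta))
    return options


static = parse_static_mods("C 57")
dynamic = parse_dynamic_mods("M * 15.995 S # 79.966")
# Methionine (about 131.04 Da) yields two options: unmodified and oxidized.
print(possible_masses("M", 131.04049, static, dynamic))
# Cysteine (about 103.01 Da) yields one option that includes the static mod.
print(possible_masses("C", 103.00919, static, dynamic))
```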
d) Configuration parameters guide
| Category | Name | Type (Default) | Description |
|---|---|---|---|
| General | NumChargeStates | integer (3) | Controls the number of charge states that DirecTag will handle during all stages of the program. It is especially important during determination of charge state (see DuplicateSpectra for more information). |
| General | OutputSuffix | string (none) | The output of a DirecTag job will be a TAGS file for each input file. The string specified by this parameter will be appended to each TAGS filename. It is useful for differentiating jobs within a single directory. |
| General | StartSpectraScanNum, EndSpectraScanNum | integer (0, -1) | These two parameters limit the range of scan numbers that DirecTag will read from the input data files, which is useful for focusing a job on a subset of spectra in a particular data file. By default, all tandem mass spectra in the input files are read in for processing. |
| General | StatusUpdateFrequency | real (5 seconds) | Preprocessing spectra and generating tags may take a long time. A measure of progress through the spectra will be given at intervals specified by this parameter. |
| Preprocessing | UseChargeStateFromMS | boolean (false) | If true, DirecTag will use the charge state from the input data if it is available. If false, or if charge state is not available for a particular spectrum, DirecTag will use its internal algorithm to determine charge state. If, for a given spectrum, DirecTag uses its internal algorithm to determine charge state and the result is multiply charged, that spectrum may be duplicated to other charge states (see DuplicateSpectra for more information). |
| Preprocessing | DuplicateSpectra | boolean (true) | If DirecTag determines a spectrum to be multiply charged and this parameter is true, the spectrum will be copied and treated as if it were each possible charge state from +2 to +<NumChargeStates>. If this parameter is false, the spectrum will simply be treated as a +2. |
| Preprocessing | TicCutoffPercentage | real (85%) | In order to maximize the effectiveness of the MVH scoring algorithm, an important step in preprocessing the experimental spectra is filtering out noise peaks. Noise peaks are filtered out by sorting the original peaks in descending order of intensity, and then picking peaks from that list until the cumulative ion current of the picked peaks divided by the total ion current (TIC) is greater than or equal to this parameter. Lower percentages mean that less of the spectrum's total intensity will be allowed to pass through preprocessing. See the section on Advanced Usage for tips on how to use this parameter optimally. |
| Preprocessing | MaxPeakCount | integer (400) | Another way of increasing the effectiveness of the MVH scoring algorithm when used for tagging is to set an upper bound on the number of peaks in a spectrum before generating tags. This step tends to get rid of most noise peaks and makes tagging much more feasible because far fewer false positives are generated. |
| Preprocessing | NumIntensityClasses | integer (3) | Before scoring any candidates, experimental spectra have their peaks stratified into the number of intensity classes specified by this parameter. Spectra that are very dense in peaks will likely benefit from more intensity classes in order to best take advantage of the variation in peak intensities. Spectra that are very sparse will not see much benefit from using many intensity classes. |
| Preprocessing | AdjustPrecursorMass | boolean (false) | If true, the preprocessing step will correct the precursor mass by adjusting it through a specified range in steps of a specified length, finally choosing the optimal adjustment. The optimal adjustment is the one that maximizes the sum of products of all complementary peaks in the spectrum. |
| Preprocessing | MinPrecursorAdjustment | real (-2.5 Da) | When adjusting the precursor mass, this parameter sets the lower mass limit of adjustment allowable from the original precursor mass, measured in daltons (Da). |
| Preprocessing | MaxPrecursorAdjustment | real (2.5 Da) | When adjusting the precursor mass, this parameter sets the upper mass limit of adjustment allowable from the original precursor mass, measured in daltons (Da). |
| Preprocessing | PrecursorAdjustmentStep | real (0.1 Da) | When adjusting the precursor mass, this parameter sets the size of the steps between adjustments, measured in daltons (Da). |
| Preprocessing | DeisotopingMode | integer (0) | Deisotoping a spectrum (consolidating isotopic peak intensities into the monoisotopic peak's intensity) during preprocessing will significantly improve precursor adjustment, and it may be desirable to keep the deisotoped spectrum around for candidate scoring as well. If set to 0, no deisotoping is used. If set to 1, deisotoping is used for precursor adjustment only. If set to 2, deisotoping is used for both precursor adjustment and candidate scoring. |
| Preprocessing | PrecursorMzTolerance | real (1.25 Da/z) | In DirecTag this variable is only used to adjust the maximum peak m/z possible in a spectrum. |
| Preprocessing | IsotopeMzTolerance | real (0.25 Da/z) | When deisotoping a spectrum, an isotopic peak is one that is the mass of a neutron higher than another peak, tolerating variation based on the value of this parameter. Deisotoping actually traverses the spectrum at multiple charge states, starting from the highest (NumChargeStates) and ending at the lowest. |
| Preprocessing | ComplementMzTolerance | real (0.5 Da/z) | When adjusting the precursor mass, this parameter controls how much tolerance there is on each side of the calculated m/z when looking for a peak's complement. |
| Generating Tags | TagLength | integer (3) | A sequence tag is generated from the gaps between a number of peaks equal to this parameter plus one. Longer tags are more specific, but harder to find because long runs of consecutive fragment ions are rare. |
| Generating Tags | StaticMods | string (none) | If a residue (or multiple residues) should always be treated as having a modification on its natural mass, set this parameter to inform the tagging engine which residues are modified. Residues are entered into this string as a space-delimited list of pairs. Each pair is of the form: <AA residue character> <mod mass>. Thus, to treat cysteine as always being carbamidomethylated, this parameter would be set to something like the string "C 57". |
| Generating Tags | DynamicMods | string (none) | In order to generate tags with potential post-translational modifications to amino acid residues, the user must configure this parameter to inform the tagging engine which residues may be modified. Residues that are modifiable are entered into this string in a space-delimited list of triplets. Each triplet is of the form: <AA residue character> <character to represent mod> <mod mass>. Thus, to generate tags for potentially oxidized methionines and phosphorylated serines, this parameter would be set to something like the string "M * 15.995 S # 79.966". |
| Generating Tags | MaxDynamicMods | integer (2) | This parameter sets the maximum number of modified residues that may be in any candidate sequence. |
| Generating Tags | MaxResults | integer (20) | This parameter sets the maximum number of sequence tags to report for each spectrum. |
| Generating Tags | FragmentMzTolerance | real (0.5 Da/z) | This parameter controls how much tolerance there is when an m/z gap between two peaks is compared to the masses of the amino acid residues. |
| Generating Tags | IntensityScoreWeight, MzFidelityScoreWeight, ComplementScoreWeight | real (1.0) each | This group of parameters controls how tag scores are combined to form a total score. The total score of a tag is what is used to determine its final ranking in the result list. |
| Advanced | ClassSizeMultiplier | real (2) | When stratifying peaks into a specified, fixed number of intensity classes, this parameter controls the size of each class relative to the class above it (where the peaks are more intense). At default values, if the best class, A, has 1 peak in it, then class B will have 2 peaks in it and class C will have 4 peaks. |
| Advanced | NumBatches | integer (50) | This parameter sets a number of batches per node to strive for when using the MPI-based parallelization features. Setting this too low means that some nodes will finish before others (idle processor time), while setting it too high means more overhead in network transmission as each batch is smaller. |
| Advanced | ThreadCountMultiplier | integer (10) | DirecTag is designed to take advantage of (symmetric) multiprocessor systems by multithreading the tagging job. A tagging process on an SMP system will spawn one worker thread for each processing unit (where a processing unit can be either a core on a multi-core CPU or a separate CPU entirely). The main thread then generates a list of worker numbers whose length is equal to the number of worker threads multiplied by this parameter. The worker threads then take a worker number from the list and use that number to iterate through the protein list. It is possible that one thread will be assigned all the proteins that generate a few candidates while another thread is assigned all the proteins that generate many candidates, resulting in one thread finishing its tagging early. By having each thread use multiple worker numbers, the chance of one thread being penalized for picking all the easy proteins is reduced because if it finishes early it can just pick a new number. The only disadvantage to this system is that picking the new number incurs some overhead because of synchronizing with the other worker threads that might be trying to pick a worker number at the same time. The default value is a reasonable compromise between incurring that overhead and minimizing wasted time. |
| Advanced | UseMultipleProcessors | boolean (true) | If true, each process will use all the processing units available on the system it is running on. |
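Two of the preprocessing steps in the guide above, the TicCutoffPercentage filter and the stratification controlled by NumIntensityClasses and ClassSizeMultiplier, can be sketched in Python. This is a sketch under assumptions, not DirecTag's code: the function names are made up, peaks are (m/z, intensity) pairs, and the rule for turning the multiplier into class sizes (proportional shares, with the last class taking the remainder) is one plausible reading of the description.

```python
def filter_by_tic(peaks, cutoff=0.85):
    """Keep the most intense peaks until their share of the TIC >= cutoff."""
    total = sum(intensity for _, intensity in peaks)
    kept, running = [], 0.0
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if total == 0 or running / total >= cutoff:
            break
        kept.append((mz, intensity))
        running += intensity
    return sorted(kept)  # back to ascending m/z order


def stratify(peaks, num_classes=3, multiplier=2.0):
    """Split peaks into intensity classes, most intense class first.

    Assumption: class k gets a share proportional to multiplier ** k,
    and the last class absorbs any rounding remainder.
    """
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    weights = [multiplier ** k for k in range(num_classes)]
    total_weight = sum(weights)
    classes, start = [], 0
    for k, weight in enumerate(weights):
        if k == num_classes - 1:
            size = len(ranked) - start
        else:
            size = int(len(ranked) * weight / total_weight)
        classes.append(ranked[start:start + size])
        start += size
    return classes


# Intensities 50, 30, 10, 6, 4: the top three reach 90% of the TIC.
spectrum = [(100.0, 6), (150.0, 50), (200.0, 4), (250.0, 30), (300.0, 10)]
print(filter_by_tic(spectrum))

# Seven peaks at default settings give classes of 1, 2, and 4 peaks.
seven = [(100.0 + i, 70 - 10 * i) for i in range(7)]
print([len(c) for c in stratify(seven)])
```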
III. Interpreting results
a) Tagging-time output of DirecTag serves several purposes. The majority of
the output will usually be progress information, telling the user which part
of the job DirecTag is currently working on, and in some cases how far along
that part is. There will be periodic updates when DirecTag is preprocessing
spectra and when it is generating tags. In a multi-process (MPI) job, there
will also be progress information on bulk transfers of data over the network.
Additionally, DirecTag will display statistics on the spectra that remain
after preprocessing, specifically the average number of peaks in a spectrum
before and after preprocessing. Also provided is the average percentage of
peaks that were filtered out by the preprocessing step. Finally, in the case
of an MPI job, when tagging is complete each node that took part in the
tagging will display statistics detailing the work that node did. The lines
will look like:
Process #1 (foohost) stats: <numBatches> / <numSpectraTagged> / <numResidueMassGaps> / <numTagsGenerated> / <numTagsRetained>
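If these per-node lines need to be collected from a log, a small parser along the following lines may help. It is a sketch that assumes the fields are plain numbers separated by slashes, exactly as in the template above; the real output may differ in detail, and the sample line uses made-up values.

```python
import re

# Field names taken from the template line; the regex is an assumption.
FIELDS = ("numBatches", "numSpectraTagged", "numResidueMassGaps",
          "numTagsGenerated", "numTagsRetained")
STATS_RE = re.compile(r"Process #(\d+) \((.+?)\) stats: (.+)")


def parse_stats_line(line):
    """Return a dict for one stats line, or None if it does not match."""
    match = STATS_RE.match(line.strip())
    if match is None:
        return None
    try:
        values = [float(v) for v in match.group(3).split("/")]
    except ValueError:
        return None
    if len(values) != len(FIELDS):
        return None
    record = {"process": int(match.group(1)), "host": match.group(2)}
    record.update(zip(FIELDS, values))
    return record


# Made-up sample values, for illustration only.
print(parse_stats_line("Process #1 (foohost) stats: 50 / 1200 / 34000 / 9000 / 800"))
```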
b) One TAGS file is produced for every input spectra file that a DirecTag job
tags. A TAGS file contains an entry for each spectrum kept during the job
(i.e. only the spectra that were not obviously junk) for which at least one
tag was generated. Only the best MaxResults tags for each spectrum (MaxResults
is a configuration parameter, defaulting to 20) are put in the file, and those
tags are represented by entries under the spectrum entry. Each tag entry has a
field for the missing mass on each terminus of the tag, the starting peak's
m/z for the tag, the total score of the tag, and the individual subscores
that made up the total score.
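The contents of a tag entry and the MaxResults cutoff can be modelled with a short Python sketch. The class and field names are hypothetical and do not reflect the actual TAGS file syntax. The sketch also assumes that the total score is a weighted sum of the subscores and that higher totals rank better; neither detail is stated here, so the ranking direction is left as a parameter.

```python
from dataclasses import dataclass


@dataclass
class TagEntry:
    """Hypothetical in-memory model of one tag entry."""
    sequence: str
    n_terminal_mass: float   # missing mass on the N-terminal side
    c_terminal_mass: float   # missing mass on the C-terminal side
    start_mz: float          # m/z of the tag's starting peak
    subscores: dict          # e.g. intensity, mz_fidelity, complement


def total_score(tag, weights):
    # Assumption: the total is a weighted sum of the subscores.
    return sum(weights[name] * value
               for name, value in tag.subscores.items())


def best_tags(tags, weights, max_results=20, higher_is_better=True):
    """Rank tags by total score and keep at most max_results of them."""
    ranked = sorted(tags, key=lambda t: total_score(t, weights),
                    reverse=higher_is_better)
    return ranked[:max_results]


weights = {"intensity": 1.0, "mz_fidelity": 1.0, "complement": 1.0}
tags = [
    TagEntry("GAS", 199.0, 500.0, 200.0,
             {"intensity": 3.0, "mz_fidelity": 2.0, "complement": 1.0}),
    TagEntry("ASG", 256.0, 443.0, 257.0,
             {"intensity": 1.0, "mz_fidelity": 1.0, "complement": 1.0}),
    TagEntry("SGA", 327.0, 372.0, 328.0,
             {"intensity": 2.0, "mz_fidelity": 2.0, "complement": 1.0}),
]
print([t.sequence for t in best_tags(tags, weights, max_results=2)])
```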