DirecTag

For MS-MS Sequence Tagging

 

Table of Contents:

I.    Introduction

II.   Usage

      a.  Basic

      b.  Specifying amino acid masses

      c.  Specifying static or dynamic mass modifications

      d.  Configuration parameters guide

III.  Interpreting results

      a.  Tagging-time output

      b.  TAGS format guide

 

 

Introduction

DirecTag is a tool designed to take experimental data from shotgun proteomics experiments and build sequence tags from the gaps between the most appropriate peaks in each spectrum. It is able to divide work evenly across any number of computers with any number of processors. Once generated, sequence tags are ranked based on various scores that are more fully described in the DirecTag paper. For each spectrum, only the best-ranked tags are kept, up to a user-defined limit (see MaxResults).

 

Usage

a)  The basic usage of DirecTag:

directag [flags] <MS/MS data filepath in a supported file format> <another MS/MS data filepath>

 

When DirecTag is run from the command line, the command line parser first determines which flags you have specified. The flags can be anywhere on the command line. The following basic flags are supported:

            -cfg <file>          specifies a runtime configuration [default: directag.cfg]

            -rescfg <file>       specifies an amino acid residue mass configuration [default: residue_masses.cfg]

            -workdir <path>      specifies an absolute path to use as the working directory during execution [default: current working directory]

            -cpus <integer>      specifies the number of worker threads to use during tagging [default: all available processors]

 

If a flag that expects an argument is given without one, it might be treated as a spectrum data file, which is probably undesirable. If you do not specify a runtime configuration file with -cfg and the default configuration file is not found, then default runtime values are used (and a warning that no configuration file was found will be shown). Likewise, if you do not specify a residue masses file with -rescfg and the default residue masses configuration file is not found, then hard-coded values will be used for the 20 common amino acids.

 

A second type of flag follows a different pattern: the override flag. Instead of having a fixed name like cfg, an override flag has the same name as the variable that it overrides. Overriding a variable means specifying a different value on the command line than the one in the configuration file (just as the configuration file overrides the built-in values). For example, to override the variable DynamicMods to have the value M @ 16, use the override flag:

      -DynamicMods "M @ 16"

 

The double quotes are necessary on the command line because the value of the variable has spaces in it.
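
The precedence described above (built-in values, then the configuration file, then override flags) can be sketched in a few lines of Python. This is only an illustration of the layering, not DirecTag's actual parser: the function name resolve_settings and the example values are hypothetical, and only the parameter names and defaults come from this guide.

```python
from collections import ChainMap

# Built-in values (defaults as listed in the configuration parameters guide).
BUILT_IN = {"MaxResults": "20", "TagLength": "3", "DynamicMods": ""}


def resolve_settings(config_file_values, override_flags):
    """Return the effective settings (hypothetical helper).

    ChainMap looks a key up in its maps from left to right, so override
    flags shadow the configuration file, which shadows the built-in values.
    """
    return dict(ChainMap(override_flags, config_file_values, BUILT_IN))


settings = resolve_settings(
    config_file_values={"MaxResults": "50"},      # assumed config file content
    override_flags={"DynamicMods": "M @ 16"},     # from: -DynamicMods "M @ 16"
)
print(settings)
```

Here MaxResults comes from the configuration file, DynamicMods from the command line, and TagLength falls back to its built-in value.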

 

After the flags are parsed, the file arguments are processed. Each file argument must be a relative or absolute path to an MS/MS spectra data file in one of the following supported file formats:

            - mzData 1.05

 

b)    The residue masses configuration file, if present, overrides the hard-coded masses of the 20 default amino acids. The residue masses file is not intended to be changed regularly; if you want to specify a static mass modification, refer to the StaticMods variable instead. The file should have a number of rows, one per amino acid, where each row takes the form:

<AA residue character> <monoisotopic mass> <average mass>
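
As an illustration of the row format above, the following Python sketch reads such rows into a lookup table. It is not DirecTag's own reader: the function name parse_residue_masses is hypothetical, skipping blank lines is an assumption, and the two example rows use commonly quoted masses for glycine and alanine.

```python
def parse_residue_masses(text):
    """Parse rows of '<AA residue character> <monoisotopic mass> <average mass>'.

    Hypothetical helper: returns {residue: (monoisotopic, average)}.
    """
    masses = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are ignored
        if len(fields) != 3 or len(fields[0]) != 1:
            raise ValueError(f"malformed residue row: {line!r}")
        residue, monoisotopic, average = fields
        masses[residue] = (float(monoisotopic), float(average))
    return masses


example_rows = """\
G 57.02146 57.0519
A 71.03711 71.0788
"""
residue_masses = parse_residue_masses(example_rows)
print(residue_masses)
```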

 

c)     A static mass modification is something like carbamidomethylation of cysteines, where all cysteines should be treated as about +57 in DirecTag and all subsequent downstream analysis. Refer to the StaticMods variable in the configuration parameters guide. A dynamic mass modification is something like a potential oxidation of methionine, where each methionine may occur with either its natural mass or about +16. Refer to the DynamicMods variable in the configuration parameters guide.
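
The difference between the two kinds of modification can be shown with a small Python sketch. The helper possible_masses is hypothetical and only illustrates the idea: a static modification replaces the residue mass, while a dynamic modification adds an alternative. The residue masses are standard monoisotopic values, and the deltas are the approximate +57 and +16 mentioned above.

```python
# Monoisotopic residue masses in Daltons.
CYSTEINE = 103.00919
METHIONINE = 131.04049


def possible_masses(natural_mass, static_delta=0.0, dynamic_deltas=()):
    """Return every mass a residue may take (hypothetical helper).

    The static delta always applies; each dynamic delta contributes one
    additional alternative on top of the statically adjusted mass.
    """
    base = natural_mass + static_delta
    return [base] + [base + delta for delta in dynamic_deltas]


# Static: every cysteine is treated as about +57, so only one mass remains.
cysteine_masses = possible_masses(CYSTEINE, static_delta=57.0)

# Dynamic: each methionine may be unmodified or about +16.
methionine_masses = possible_masses(METHIONINE, dynamic_deltas=(16.0,))

print(cysteine_masses)
print(methionine_masses)
```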

 

 

d)    Configuration parameters guide

Each entry below lists, in order: Category, Name, Type (Default), Description.

General

NumChargeStates

integer (3)

Controls the number of charge states that DirecTag will handle during all stages of the program. It is especially important during determination of charge state (see DuplicateSpectra for more information).

General

OutputSuffix

string (none)

The output of a DirecTag job will be a TAGS file for each input file. The string specified by this parameter will be appended to each TAGS filename. It is useful for differentiating jobs within a single directory.

General

StartSpectraScanNum

EndSpectraScanNum

integer (0, -1)

These two parameters limit the range of scan numbers that DirecTag will read from the input data files, which is useful for focusing a job on a subset of spectra in a particular data file. By default, all tandem mass spectra in the input files are read in for processing.

General

StatusUpdateFrequency

real (5 seconds)

Preprocessing spectra and generating tags may take a long time. A measure of progress through the spectra will be given at the interval specified by this parameter.

Preprocessing

UseChargeStateFromMS

boolean (false)

If true, DirecTag will use the charge state from the input data if it is available. If false, or if charge state is not available from a particular spectrum, DirecTag will use its internal algorithm to determine charge state. If, for a given spectrum, DirecTag uses its internal algorithm to determine charge state and the result is multiply charged, that spectrum may be duplicated to other charge states (see DuplicateSpectra for more information).

Preprocessing

DuplicateSpectra

boolean (true)

If DirecTag determines a spectrum to be multiply charged and this parameter is true, the spectrum will be copied and treated once as each possible charge state from +2 to +<NumChargeStates>. If this parameter is false, the spectrum will simply be treated as a +2.
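
The rule above can be written as a short Python sketch. The function name charge_states_to_try is hypothetical; the logic only restates the description for a spectrum that has already been judged multiply charged.

```python
def charge_states_to_try(num_charge_states=3, duplicate_spectra=True):
    """Charge states assigned to a multiply charged spectrum (hypothetical helper)."""
    if duplicate_spectra:
        # One copy of the spectrum for each charge state from +2 upward.
        return list(range(2, num_charge_states + 1))
    # Without duplication the spectrum is simply treated as +2.
    return [2]


print(charge_states_to_try())                          # default settings
print(charge_states_to_try(num_charge_states=4))
print(charge_states_to_try(duplicate_spectra=False))
```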

Preprocessing

TicCutoffPercentage

real (85%)

In order to maximize the effectiveness of the MVH scoring algorithm, an important step in preprocessing the experimental spectra is filtering out noise peaks. Noise peaks are filtered out by sorting the original peaks in descending order of intensity, and then picking peaks from that list until the cumulative ion current of the picked peaks divided by the total ion current (TIC) is greater than or equal to this parameter. Lower percentages mean that less of the spectrum's total intensity will be allowed to pass through preprocessing. See the section on Advanced Usage for tips on how to use this parameter optimally.
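
The filtering step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not DirecTag's implementation: the function name filter_by_tic is hypothetical, peaks are represented as (m/z, intensity) pairs, and the cutoff is passed as a fraction (0.85) rather than a percentage.

```python
def filter_by_tic(peaks, tic_cutoff=0.85):
    """Keep the most intense peaks until they reach tic_cutoff of the TIC.

    peaks: list of (mz, intensity) tuples. Returns the kept peaks sorted by m/z.
    """
    total_ion_current = sum(intensity for _, intensity in peaks)
    if total_ion_current <= 0:
        return []
    kept = []
    cumulative = 0.0
    # Walk the peaks in descending order of intensity.
    for mz, intensity in sorted(peaks, key=lambda peak: peak[1], reverse=True):
        kept.append((mz, intensity))
        cumulative += intensity
        if cumulative / total_ion_current >= tic_cutoff:
            break
    return sorted(kept)


example_peaks = [(100.0, 50.0), (200.0, 30.0), (300.0, 15.0), (400.0, 5.0)]
print(filter_by_tic(example_peaks))
```

In this example the three most intense peaks together carry 95% of the total ion current, so the weakest peak at m/z 400 is dropped.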

Preprocessing

MaxPeakCount

integer (400)

Another way of increasing the effectiveness of the MVH scoring algorithm when used for tagging is to set an upper bound on the number of peaks in a spectrum before generating tags. This step tends to get rid of most noise peaks and makes tagging much more feasible because so many fewer false positives are generated.

Preprocessing

NumIntensityClasses

integer (3)

Before scoring any candidates, experimental spectra have their peaks stratified into the number of intensity classes specified by this parameter. Spectra that are very dense in peaks will likely benefit from more intensity classes in order to best take advantage of the variation in peak intensities. Spectra that are very sparse will not see much benefit from using many intensity classes.

Preprocessing

AdjustPrecursorMass

boolean (false)

If true, the preprocessing step will correct the precursor mass by adjusting it through a specified range in steps of a specified length, finally choosing the optimal adjustment. The optimal adjustment is the one that maximizes the sum of products of all complementary peaks in the spectrum.

Preprocessing

MinPrecursorAdjustment

real (-2.5 Da)

When adjusting the precursor mass, this parameter sets the lower mass limit of adjustment allowable from the original precursor mass, measured in Daltons.

Preprocessing

MaxPrecursorAdjustment

real (2.5 Da)

When adjusting the precursor mass, this parameter sets the upper mass limit of adjustment allowable from the original precursor mass, measured in Daltons.

Preprocessing

PrecursorAdjustmentStep

real (0.1 Da)

When adjusting the precursor mass, this parameter sets the size of the steps between adjustments, measured in Daltons.
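
Taken together, MinPrecursorAdjustment, MaxPrecursorAdjustment and PrecursorAdjustmentStep define a grid of candidate adjustments. The Python sketch below enumerates that grid and picks the best candidate for a caller-supplied score. Both function names are hypothetical, and the scoring function is left to the caller because the complement-based score is not reproduced here.

```python
def precursor_adjustments(min_adjustment=-2.5, max_adjustment=2.5, step=0.1):
    """Enumerate candidate adjustments in Daltons, both limits included."""
    if step <= 0:
        raise ValueError("step must be positive")
    count = int(round((max_adjustment - min_adjustment) / step))
    # Index-based stepping avoids accumulating floating-point error.
    return [round(min_adjustment + i * step, 6) for i in range(count + 1)]


def best_adjustment(score, **grid):
    """Return the adjustment with the highest score (hypothetical helper)."""
    return max(precursor_adjustments(**grid), key=score)


candidates = precursor_adjustments()
print(len(candidates), candidates[0], candidates[-1])

# Toy score for demonstration only: prefers adjustments close to +0.3 Da.
print(best_adjustment(lambda adjustment: -abs(adjustment - 0.3)))
```

At the default values the grid contains 51 candidates, so the cost of precursor adjustment grows with the width of the range and shrinks with the step size.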

Preprocessing

DeisotopingMode

integer (0)

Deisotoping a spectrum (consolidating isotopic peak intensities into the monoisotopic peak's intensity) during preprocessing will significantly improve precursor adjustment, and it may be desirable to keep the deisotoped spectrum around for candidate scoring as well. Set to 0, no deisotoping will be used. Set to 1, deisotoping will be used for precursor adjustment only. Set to 2, deisotoping will be used for both precursor adjustment and for candidate scoring.

Preprocessing

PrecursorMzTolerance

real (1.25 Da/z)

In DirecTag this variable is only used to adjust the maximum peak m/z possible in a spectrum.

Preprocessing

IsotopeMzTolerance

real (0.25 Da/z)

When deisotoping a spectrum, an isotopic peak is one that is the mass of a neutron higher than another peak, tolerating variation based on the value of this parameter. Deisotoping actually traverses the spectrum at multiple charge states, starting from the highest (NumChargeStates) and ending at the lowest.

Preprocessing

ComplementMzTolerance

real (0.5 Da/z)

When adjusting the precursor mass, this parameter controls how much tolerance there is on each side of the calculated m/z when looking for a peak's complement.

Generating Tags

TagLength

integer (3)

A sequence tag is generated from the gaps between a number of peaks equal to this parameter plus one. Longer tags are more specific, but harder to find because long runs of consecutive fragment ions are rare.
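
To make the relationship between peaks, gaps and tags concrete, here is a minimal Python sketch that builds tags of a given length from a list of peak m/z values. It is a simplified illustration, not DirecTag's algorithm: the function name generate_tags is hypothetical, all peaks are assumed to be singly charged (so an m/z gap equals a mass gap), only four residues are included, and no scoring is performed. The tolerance argument plays the role of FragmentMzTolerance.

```python
# Monoisotopic residue masses in Daltons for a small illustrative subset.
RESIDUE_MASSES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "V": 99.06841}


def generate_tags(peak_mzs, tag_length=3, tolerance=0.5, masses=RESIDUE_MASSES):
    """Return (tag, starting peak m/z) for every chain of tag_length gaps."""
    peaks = sorted(peak_mzs)
    tags = []

    def extend(last_index, residues, start_mz):
        if len(residues) == tag_length:
            # tag_length gaps means tag_length + 1 peaks were used.
            tags.append(("".join(residues), start_mz))
            return
        for next_index in range(last_index + 1, len(peaks)):
            gap = peaks[next_index] - peaks[last_index]
            for residue, mass in masses.items():
                if abs(gap - mass) <= tolerance:
                    extend(next_index, residues + [residue], start_mz)

    for index, mz in enumerate(peaks):
        extend(index, [], mz)
    return tags


example_mzs = [200.0, 257.02, 328.06, 415.09, 500.0]
print(generate_tags(example_mzs))
```

The first four peaks are separated by gaps that match G, A and S, so a single tag of length 3 is found; with tag_length=2 the same peaks yield two shorter tags.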

Generating Tags

StaticMods

string (none)

If a residue (or multiple residues) should always be treated as having a modification on their natural mass, set this parameter to inform the tagging engine which residues are modified. Residues are entered into this string as a space-delimited list of pairs. Each pair is of the form:

<AA residue character> <mod mass>

Thus, to treat cysteine as always being carbamidomethylated, this parameter would be set to something like the string "C 57".

Generating Tags

DynamicMods

string (none)

In order to generate tags with potential post-translational modifications to amino acid residues, the user must configure this parameter to inform the tagging engine which residues may be modified. Residues that are modifiable are entered into this string in a space-delimited list of triplets. Each triplet is of the form:

<AA residue character> <character to represent mod> <mod mass>

Thus, to generate tags for potentially oxidized methionines and phosphorylated serines, this parameter would be set to something like the string "M * 15.995 S # 79.966".
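
The triplet format can be illustrated with a short Python sketch that splits a DynamicMods string into its parts. The function name parse_dynamic_mods is hypothetical and the error handling is an assumption; a StaticMods string could be handled the same way with pairs instead of triplets.

```python
def parse_dynamic_mods(value):
    """Split a DynamicMods string into (residue, symbol, mod mass) triplets."""
    fields = value.split()
    if len(fields) % 3 != 0:
        raise ValueError("expected triplets of <residue> <symbol> <mod mass>")
    mods = []
    for start in range(0, len(fields), 3):
        residue, symbol, mass = fields[start:start + 3]
        mods.append((residue, symbol, float(mass)))
    return mods


dynamic_mods = parse_dynamic_mods("M * 15.995 S # 79.966")
print(dynamic_mods)
```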

Generating Tags

MaxDynamicMods

integer (2)

This parameter sets the maximum number of modified residues that may be in any candidate sequence.

Generating Tags

MaxResults

integer (20)

This parameter sets the maximum number of sequence tags to report for each spectrum.

Generating Tags

FragmentMzTolerance

real (0.5 Da/z)

This parameter controls how much tolerance there is when an m/z gap between two peaks is compared to the masses of the amino acid residues.

Generating Tags

IntensityScoreWeight

MzFidelityScoreWeight

ComplementScoreWeight

real (1.0)

real (1.0)

real (1.0)

This group of parameters controls how tag scores are combined to form a total score. The total score of a tag is what is used to determine its final ranking in the result list.

Advanced

ClassSizeMultiplier

real (2)

When stratifying peaks into a specified, fixed number of intensity classes, this parameter controls the size of each class relative to the class above it (where the peaks are more intense). At default values, if the best class, A, has 1 peak in it, then class B will have 2 peaks in it and class C will have 4 peaks.
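
The geometric growth of class sizes can be sketched in Python as follows. This is an illustration only: the function name stratify is hypothetical, and splitting the ranked peaks in proportion to 1 : multiplier : multiplier squared (and so on) is an assumption about how the relative sizes translate into actual class boundaries.

```python
def stratify(intensities, num_classes=3, multiplier=2.0):
    """Split intensities into classes whose sizes grow by `multiplier`.

    Returns a list of classes, most intense class first.
    """
    ranked = sorted(intensities, reverse=True)
    weights = [multiplier ** k for k in range(num_classes)]
    total_weight = sum(weights)
    classes = []
    start = 0
    cumulative = 0.0
    for weight in weights:
        cumulative += weight
        # Boundary index proportional to the cumulative share of the weights.
        end = round(len(ranked) * cumulative / total_weight)
        classes.append(ranked[start:end])
        start = end
    return classes


example_classes = stratify([70, 60, 50, 40, 30, 20, 10])
print([len(peak_class) for peak_class in example_classes])
```

With seven peaks and the default values, the classes hold 1, 2 and 4 peaks, which matches the A, B, C example above.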

Advanced

NumBatches

integer (50)

This parameter sets a number of batches per node to strive for when using the MPI-based parallelization features. Setting this too low means that some nodes will finish before others (idle processor time), while setting it too high means more overhead in network transmission as each batch is smaller.

Advanced

ThreadCountMultiplier

integer (10)

DirecTag is designed to take advantage of (symmetric) multiprocessor systems by multithreading the tagging job. A tagging process on an SMP system will spawn one worker thread for each processing unit (where a processing unit can be either a core on a multi-core CPU or a separate CPU entirely). The main thread then generates a list of worker numbers which is equal to the number of worker threads multiplied by this parameter. The worker threads then take a worker number from the list and use that number to iterate through the protein list. It is possible that one thread will be assigned all the proteins that generate a few candidates while another thread is assigned all the proteins that generate many candidates, resulting in one thread finishing its tagging early. By having each thread use multiple worker numbers, the chance of one thread being penalized for picking all the easy proteins is reduced because if it finishes early it can just pick a new number. The only disadvantage to this system is that picking the new number incurs some overhead because of synchronizing with the other worker threads that might be trying to pick a worker number at the same time. The default value is a nice compromise between incurring that overhead and minimizing wasted time.
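
The worker-number scheme can be sketched in Python with a shared queue. This is a simplified model, not DirecTag's code: the function name run_workers is hypothetical, the work items are plain list elements, and the assumption is that worker number n handles every item whose index is congruent to n modulo the total count of worker numbers.

```python
import queue
import threading


def run_workers(items, num_threads=2, thread_count_multiplier=10):
    """Process items with num_threads threads sharing a pool of worker numbers."""
    num_worker_numbers = num_threads * thread_count_multiplier
    worker_numbers = queue.Queue()
    for number in range(num_worker_numbers):
        worker_numbers.put(number)

    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Picking a number is the synchronized (and thus costly) step.
                number = worker_numbers.get_nowait()
            except queue.Empty:
                return  # no worker numbers left, so this thread is done
            # Stride through the list: number, number + N, number + 2N, ...
            chunk = items[number::num_worker_numbers]
            with lock:
                processed.extend(chunk)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


results = run_workers(list(range(100)))
print(len(results))
```

A thread that happens to draw only cheap chunks simply returns to the queue sooner, which is the load-balancing effect the parameter is meant to provide.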

Advanced

UseMultipleProcessors

boolean (true)

If true, each process will use all the processing units available on the system it is running on.

 

 

Interpreting results

a)     Tagging-time output of DirecTag serves several purposes. The majority of the output will usually be progress information, telling the user which part of the job DirecTag is currently working on, and in some cases how far along that part is. There will be periodic updates when DirecTag is preprocessing spectra and when it is generating tags. In a multi-process (MPI) job, there will also be progress information on bulk transfers of data over the network. Additionally, DirecTag will display statistics on the spectra that remain after preprocessing, specifically the average number of peaks in a spectrum before and after preprocessing, as well as the average percentage of peaks that were filtered out by the preprocessing step. Finally, in the case of an MPI job, when tagging is complete each node that took part in the tagging will display statistics detailing the work that node did. The lines will look like:

Process #1 (foohost) stats: <numBatches> / <numSpectraTagged> / <numResidueMassGaps> / <numTagsGenerated> / <numTagsRetained>
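
If these per-node lines need to be collected from a log, a pattern like the one in the Python sketch below could be used. The function name parse_stats_line is hypothetical, the sketch assumes every statistic is printed as a plain integer, and the numbers in the example line are invented for illustration.

```python
import re

STATS_PATTERN = re.compile(
    r"Process #(\d+) \((.+?)\) stats: (\d+) / (\d+) / (\d+) / (\d+) / (\d+)"
)
FIELD_NAMES = (
    "numBatches",
    "numSpectraTagged",
    "numResidueMassGaps",
    "numTagsGenerated",
    "numTagsRetained",
)


def parse_stats_line(line):
    """Return the statistics of one node as a dict, or None if no match."""
    match = STATS_PATTERN.search(line)
    if match is None:
        return None
    process, host, *counts = match.groups()
    stats = dict(zip(FIELD_NAMES, map(int, counts)))
    stats["process"] = int(process)
    stats["host"] = host
    return stats


# Invented example values; only the layout follows the line shown above.
example_line = "Process #1 (foohost) stats: 50 / 1200 / 34000 / 5600 / 2400"
node_stats = parse_stats_line(example_line)
print(node_stats)
```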

 

b)     One TAGS file is produced for every input spectra file that a DirecTag job tags. A TAGS file contains an entry for each spectrum kept during the job (i.e. only the spectra that were not obviously junk) for which at least one tag was generated. Only the best MaxResults (a config parameter, defaulting to 20) tags for each spectrum are put in the file, and those tags are represented by entries under the spectrum entry. Each tag entry has a field for the missing mass on each terminus of the tag, the starting peak's m/z for the tag, the total score of the tag, and the individual subscores that make up the total score.