IDPicker v2.1

Accurate, transparent, and parsimonious protein assembly for MS/MS search results

 

Table of Contents:

I. Overview
II. Usage
    1. Setting default options and search paths
        a) Tools/Options
    2. Managing reports
        a) Create or Clone
            i. Load input pepXMLs
            ii. Assemble groups
            iii. Configure filtering
            iv. Advanced options
                1) Score names and weights
                2) Distinct/indistinct modifications
        b) View
        c) Export to ZIP (entire report)
        d) Export to CSV (tables only)
        e) Delete or Remove
    3. Interpreting reports
III. Glossary

 

Overview

A comprehensive introduction to IDPicker and its resulting output is available in this whitepaper.

 

Usage

This document details the steps involved in running IDPicker from the graphical user interface.

 

1.   Setting default options and search paths

 

a)    Tools/Options

Use this menu to set options that apply to all reports, even ones that have already been created.

 

image001

 

The “report directory” is the destination for the report’s output: a set of HTML files linked together by an index page. The “decoy prefix” is used for FDR calculation; the default can be changed for convenience. The “source extensions” are used together with the “source search path” to find the spectral data files for viewing peptide-spectrum matches. IDPicker can currently read spectral data from mzML, mzXML, MGF, Bruker FID/YEP/BAF, Thermo RAW (if Xcalibur is installed), and Waters RAW (if MassLynx is installed).

 

The search paths are used for finding certain files while creating and managing reports. To create a report, the FASTA protein database used to search the input files must be available in the “database search path” with the exact filename as it appears in the pepXML. To view peptide-spectrum matches, the source files must be available in the “source search path” with a filename that exactly matches their “source name” (usually the pepXML filename without the extension). Finally, in order to include the pepXML files in a ZIP export, they must be available in the “search search path” (that is, the search path for the search result files). All of the search paths support “relative” paths, where the search is done relative to a report-specific variable:

·       <RootInputDirectory> is replaced with a report’s root input directory set when it was created; it can be used for all three search paths

·       <DatabaseDirectory> is replaced with the directory that a report’s database was found in when it was created; it can only be used in the source search path

·       <SourceName> is replaced with the source name of an input pepXML and is only used when IDPicker needs to access spectra in source files
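
The placeholder substitution described above can be sketched in a few lines of Python. This is only an illustration, not IDPicker's actual code: the function names are hypothetical, and the use of a semicolon to separate multiple directories in a search path is an assumption.

```python
from pathlib import Path

def expand_search_path(search_path, root_input_dir, database_dir=None, source_name=None):
    """Expand the report-specific placeholders in a search path.

    Assumption: a search path is a semicolon-delimited list of directories.
    Entries that use a placeholder with no known value are skipped.
    """
    substitutions = {
        "<RootInputDirectory>": root_input_dir,
        "<DatabaseDirectory>": database_dir,
        "<SourceName>": source_name,
    }
    expanded = []
    for entry in search_path.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        usable = True
        for placeholder, value in substitutions.items():
            if placeholder in entry:
                if value is None:
                    usable = False
                    break
                entry = entry.replace(placeholder, value)
        if usable:
            expanded.append(entry)
    return expanded

def find_file(filename, directories):
    """Return the first directory entry that contains the exact filename, else None."""
    for directory in directories:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None
```

For example, `expand_search_path("<RootInputDirectory>/db; C:/fasta", "/data/run1")` yields the two directories `/data/run1/db` and `C:/fasta`, which `find_file` then checks in order.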

 

 

 

2. Managing reports

 

a)   Create or clone a new report

 

To create a new report, select “File / New Report” or right click on an empty area on the My Reports page and select “New.” To clone a report (copying all the settings from a previously run report into a new report form), right click on the previously run report on the My Reports page and select “Clone.” The form that pops up starts the configuration of a new or cloned report. The first step is to pick some pepXML search files to use as sources for the report.

 

 

                           i.          Load input pepXMLs

Choose a name for the generated report, a directory to write the report to, and the decoy prefix that the input files use for their target-decoy search strategy. Then choose a root source directory. All input files must be somewhere within the source directory. IDPicker currently supports reading raw peptide identifications via the Seattle Proteome Center’s pepXML format. Many search engines have output that can be converted to this format using various SPC tools. Therefore, all input files must be pepXML files. The “List Files” button will search the source directory for any files with the file extension “.pepXML”. If the “include sub folders” checkbox is checked, the search will recurse through the source directory’s subdirectories as well.
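
The behavior of the “List Files” button can be approximated with the following Python sketch. The function name is hypothetical, and matching the extension without regard to case is an assumption rather than documented behavior.

```python
from pathlib import Path

def list_pepxml_files(source_dir, include_subfolders=False):
    """List files ending in .pepXML under source_dir.

    include_subfolders mirrors the "include sub folders" checkbox.
    Assumption: the extension is compared case-insensitively.
    """
    root = Path(source_dir)
    candidates = root.rglob("*") if include_subfolders else root.glob("*")
    return sorted(
        path for path in candidates
        if path.is_file() and path.suffix.lower() == ".pepxml"
    )
```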

 

At this point, the form should look something like:

 

LoadAndQonvert

 

 

The input file tree view is populated with the pepXML files that were found in the search space. IDPicker quickly checks each pepXML file to make sure it is valid and to check what protein database was used to generate it. Currently IDPicker will only work with FASTA protein databases. Put a check next to a file to include it in the report.

 

All input files in a report must use the same FASTA database. To make it easy to select files which use the same database, the “Database” combo box is populated with the various database filenames that were found in the search of the source directory. When a database is selected in the combo box, only the pepXML files that are associated with that database can be selected.

 

The FASTA database is read to obtain each protein’s description. It is also used to find the offset of each peptide within its protein. If a peptide maps to more than one protein in the database, the offset is calculated using the first protein only.
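
As a rough illustration of how descriptions and peptide offsets might be derived from a FASTA database, consider the Python sketch below. The function names are hypothetical, and two details are assumptions: the offset is reported as a zero-based index, and the “first protein” is taken to be the first matching entry in database order.

```python
def read_fasta(lines):
    """Parse FASTA text lines into (accession, description, sequence) tuples, in file order."""
    records = []
    accession, description, chunks = None, "", []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if accession is not None:
                records.append((accession, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            accession = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif line and accession is not None:
            chunks.append(line)
    if accession is not None:
        records.append((accession, description, "".join(chunks)))
    return records

def peptide_offset(peptide, records):
    """Return (accession, zero-based offset) for the first protein containing the peptide."""
    for accession, _description, sequence in records:
        offset = sequence.find(peptide)
        if offset >= 0:
            return accession, offset
    return None
```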

 

 

                         ii.          Assemble groups

When the “Next” button is clicked, the next step is to assemble the source files into one or more groups that represent an experimental hierarchy. IDPicker supports protein assembly and analysis with arbitrarily complex hierarchies. This important information is not lost when analyzing the data. A few relatively simple use cases could be a report on a single LC-MS run, a group of LC-MS runs (like a MudPIT), or a group of groups of LC-MS runs (a group of MudPITs). A more complex use case could be a multi-level hierarchy across multiple instrument types, search engines, replicate types, etc.

After assembling the previous example into groups, the form might look something like this:

GroupAndFilter.png

 

          By default, included sources will be grouped according to their organization on the filesystem (relative to the source directory). The buttons at the top of the group management panes can be used (from left to right) to collapse all the groups, expand all the groups, set all sources to non-grouped, and set all sources to the default organization. It is allowable to have non-grouped files present when a report is run, but those files will not be present in the report. However, they will be available if that report is cloned in the future.
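
The default grouping by filesystem layout can be pictured with a short Python sketch. The function names and the slash-separated group naming (with “/” as the root group) are assumptions made for illustration only.

```python
from pathlib import PurePosixPath

def default_group(source_path, source_dir):
    """Derive a group name from a file's directory relative to the source directory."""
    relative_parent = PurePosixPath(source_path).relative_to(source_dir).parent
    if str(relative_parent) == ".":
        return "/"  # files directly in the source directory fall into the root group
    return "/" + relative_parent.as_posix()

def assemble_default_groups(source_paths, source_dir):
    """Map each group name to the source names (filenames without extension) it contains."""
    groups = {}
    for path in source_paths:
        group = default_group(path, source_dir)
        groups.setdefault(group, []).append(PurePosixPath(path).stem)
    return groups
```

With files `/data/mudpit1/run1.pepXML` and `/data/mudpit2/run3.pepXML` under the source directory `/data`, this sketch places `run1` in the group `/mudpit1` and `run3` in the group `/mudpit2`.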

 

 

                       iii.          Configure filtering

The final step is setting the various filter criteria (see the glossary for complete explanations of each filter). After setting the filters, press the “Run Report” button to start generating the report.

 

                        iv.          Advanced options

It is sometimes necessary to configure some advanced options by clicking the “Advanced” button on the “Load and Qonvert” page of the new/clone report form.

1)    Score names and weights

IDPicker is set up by default to work with pepXML files from MyriMatch, X! Tandem, Sequest, and Mascot. However, it can read arbitrary score names from input files and assign them arbitrary weights to produce the “total score,” which is used to sort the results from each spectrum. The default values appear in the advanced options like this:

     AdvancedOptionsScoreNamesAndWeights.png 

 

For each score, IDPicker must have a name to look for in the pepXML, a static weight, and knowledge of whether higher or lower values of that score indicate a better result. An “ascending” score means that higher scores are better. A weight of “0” indicates that the score will be effectively ignored.

 

IDPicker must also know how to combine the scores to produce a total. The default is “static” weighting which combines the scores by a weighted arithmetic mean. The “combine scores as quantiles” option tells IDPicker to convert the scores to a quantized version on a scale from 0 to 1, where 0 is the worst instance of that score in a pepXML, and 1 is the best instance of that score. The quantization is done before combination.
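
The two combination modes can be illustrated with a small Python sketch. It is not IDPicker's implementation: the function names are hypothetical, the quantile conversion shown here is a simple rank-based scaling, and negating “lower is better” scores in static mode is an assumption.

```python
def to_quantiles(values, ascending=True):
    """Scale scores onto [0, 1], where 0 is the worst value and 1 is the best.

    ascending=True means higher scores are better.
    """
    distinct = sorted(set(values), reverse=not ascending)
    if len(distinct) == 1:
        return [1.0 for _ in values]
    position = {value: i / (len(distinct) - 1) for i, value in enumerate(distinct)}
    return [position[value] for value in values]

def total_scores(scores, weights, ascending, as_quantiles=False):
    """Combine named score columns into one total per identification.

    scores: {name: [value, ...]}, weights: {name: weight}, ascending: {name: bool}.
    The total is a weighted arithmetic mean; scores with weight 0 are ignored.
    """
    names = [name for name, weight in weights.items() if weight != 0]
    if not names:
        raise ValueError("at least one score needs a non-zero weight")
    columns = {}
    for name in names:
        column = scores[name]
        if as_quantiles:
            column = to_quantiles(column, ascending[name])
        elif not ascending[name]:
            column = [-value for value in column]  # assumption: flip so higher is better
        columns[name] = column
    weight_sum = sum(weights[name] for name in names)
    count = len(columns[names[0]])
    return [
        sum(weights[name] * columns[name][i] for name in names) / weight_sum
        for i in range(count)
    ]
```

Converting to quantiles before combining puts scores with very different numeric ranges on a common scale, so that no single score dominates the total simply because its values are larger.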

 

A new feature of IDPicker 2.0 is the ability to optimize the score weights with a Monte Carlo method: weights are randomly permuted, and the permutation that produces the best performance (as measured by the total number of ids) is chosen. This feature is disabled by default; enable it by checking the “Apply score optimization” checkbox. The permutation count has meaning only when score optimization is enabled.
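
The general idea of a Monte Carlo weight search can be sketched in Python as follows. This is a simplified illustration under stated assumptions: candidate weights are drawn uniformly from 0 to 1, and `count_ids` is a hypothetical callback that returns the number of ids obtained with a given set of weights.

```python
import random

def optimize_weights(score_names, count_ids, permutations=200, seed=0):
    """Try random weight sets and keep the one that yields the most ids.

    count_ids is a caller-supplied function: {score name: weight} -> number of ids.
    """
    rng = random.Random(seed)
    best_weights, best_count = None, -1
    for _ in range(permutations):
        candidate = {name: rng.random() for name in score_names}
        count = count_ids(candidate)
        if count > best_count:
            best_weights, best_count = candidate, count
    return best_weights, best_count
```

A larger number of permutations explores more candidate weight sets at the cost of a proportionally longer run time.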

 

 

2)    Distinct/indistinct modifications

By default, IDPicker will treat peptides with a (post-translational) modification as distinct from the unmodified peptide (or a differently modified instance) for the purposes of the “minimum distinct peptides” filter and organization in the cluster reports. Configuring the distinctness or indistinctness of various mods is fairly self-explanatory:

 

AdvancedOptionsModifications.gif

 

A peptide with a “distinct” modification will be presented differently in the various IDPicker reports:

·       Peptide list on the cluster pages (as a separate row and a change in the sequence count for one or more proteins)

·       Spectra per peptide by group table (as a separate row)

·       Sequences per protein by group table (a change in the sequence counts)

A “distinct” modification will be presented in the index by modification page. “Indistinct” modifications are folded into the “none” category on that page.
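
The effect of distinct and indistinct modifications on peptide counting can be illustrated with a short Python sketch. The representation of a modification as a (position, name) pair and the function names are hypothetical.

```python
def distinct_peptide_key(sequence, modifications, distinct_mods):
    """Build a key that keeps only the modifications configured as distinct.

    modifications: list of (position, name) pairs (a hypothetical representation).
    Indistinct modifications are dropped, so they fold into the unmodified peptide.
    """
    kept = tuple(sorted((pos, name) for pos, name in modifications if name in distinct_mods))
    return (sequence, kept)

def count_distinct_peptides(identifications, distinct_mods):
    """Count distinct peptides among (sequence, modifications) identifications."""
    return len({
        distinct_peptide_key(sequence, mods, distinct_mods)
        for sequence, mods in identifications
    })
```

In this sketch, an unmodified peptide and its oxidized form count as two distinct peptides when oxidation is configured as distinct, and as one when it is indistinct.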

 

b)   View an existing report

 

A report is opened up as it is created, but it can also be reopened at a later time from the “My Reports” page. Double click on a report in the “My Reports” page to view it, or right click and select “view.”

 

 

c)   Export to ZIP (entire report)

 

Since an IDPicker report is almost entirely rendered with HTML and JavaScript, it is inherently viewable on any system with a web browser. But an IDPicker report is made up of many files, so the best way to transfer it is to “ZIP” it into a single file for easier file management. To set up an export to ZIP, right click on a report on the “My Reports” page and click “Export.” The export dialog will pop up and allow configuration of which files get included in the ZIP:

 

ExportToZIP.png

 

By default, IDPicker will only include the HTML and associated files necessary to view the report in a standalone web browser. However, it is also easy from this dialog to selectively include the input pepXMLs used to generate the report (the “search” files), the raw spectra files that contain the actual spectral data (the “source” files), and the protein FASTA database. These extra options depend on the search paths being set up correctly in the global Tools/Options menu.

 

When “source files” are selected for export, the “source extensions” and “include” controls become active. “Source extensions” controls which file extensions the export will look for in the source search path. The search happens in the order the extensions are listed in the semicolon-delimited list. When looking for a particular source name in the source search path, it is possible that multiple source files are found (e.g. MyData.RAW and MyData.mzML). In that case, the “include” combo box controls whether only the first matching file is included or whether all matching files are included.
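
The lookup order and the effect of the “include” setting can be sketched in Python as follows. The function name and parameters are hypothetical, and accepting extensions with or without a leading dot is an assumption.

```python
from pathlib import Path

def find_source_files(source_name, search_dirs, extensions, include_all=False):
    """Find source files for a source name, trying extensions in the listed order.

    extensions is a semicolon-delimited string such as "mzML;RAW".
    With include_all=False only the first match is returned.
    """
    matches = []
    for ext in [e.strip() for e in extensions.split(";") if e.strip()]:
        for directory in search_dirs:
            candidate = Path(directory) / f"{source_name}.{ext.lstrip('.')}"
            if candidate.is_file():
                matches.append(candidate)
                if not include_all:
                    return matches
    return matches
```

Listing a compact open format before a vendor format in the extension list would, in this sketch, cause the open-format file to be chosen when only the first match is included.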

 

When the desired options are set, click “Export” to begin the export process. Note that it may take a long time to finish the export if the extra export options are turned on (because those extra files tend to be very large).

 

 

d)   Export to CSV (tables only)

 

Sometimes it may only be desirable to have a parser-friendly version of the data-rich tables that IDPicker generates in HTML format. To set up an export to CSV (Comma-Separated-Values), right click on a report on the “My Reports” page and click “Export.” The export dialog will pop up. Select “CSV” in the “Export Type” combo box. The options will change to look like:

 

     ExportToCSV.png 

 

The most frequently used tables are available for export to CSV. Check which ones are desired and click “Export.”

 

 

e)   Delete or Remove

 

When removing a report from the “My Reports” page, the user can choose either to simply exclude the report from the display or to exclude the report and also delete its associated files in the report’s destination directory. To remove or delete a report, right click on it on the “My Reports” page and click “Delete” (a prompt will pop up to explain the difference).

 

 

Glossary (a.k.a. what do all these ambiguous terms actually mean?!)

Click on a term name in the left column to look it up on Wikipedia; term names in the definitions link to other terms in the glossary.

 

Term Name

Term Definition

/

The root source group. This group always contains the analysis results for all other source groups combined.

Bipartite Graph

A visual way of representing the relationships between proteins and peptides, or between protein groups and peptide groups.

CID

A unique cluster identification number for a given analysis.

Cluster (connected component)

A set of interrelated protein groups and peptide groups. A cluster is not interrelated with any other cluster. In graph theory, this concept is called a connected component.

 

Decoy Prefix

To assess confident identifications, IDPicker depends on the target-decoy search strategy. IDPicker assumes that protein accession IDs that have a certain unique prefix (set by this parameter) are decoys. Decoys are used to calculate FDR. The default decoy prefix is “rev_” but it might be different for every database.

False Discovery Rate (FDR)

A statistical estimate of the proportion of accepted results that are false. The FDR is based on the scores that the search engines supply (e.g. XCorr, MVH, Hyperscore, DeltaCn, Expectation value). For a given result, the FDR is the percentage of results scoring at least as well as that result that can be expected to be false. An FDR measurement at a given result is a “q value.”

GID

A unique global identification number for either a protein group or a peptide group.

GLID

A numeric or alphabetic symbol used as a cluster-local identifier for a protein group or peptide group.

Identification (id)

A match between a spectrum and a peptide. An id may constitute an entire result or just part of one.

Maximum FDR

This filter sets the maximum FDR that a result can have to be included in the report. This is the first filter to be applied. The default maximum FDR is 5%, which means that immediately after this filter is applied, an estimated 5% of the results are expected to be false. After further filters are applied, that percentage may get higher or lower depending on the efficacy of the additional filtering steps.

Maximum ambiguous ids

An ambiguous identification comes from a result that has multiple peptides with equal scores, i.e. belonging to the same rank. These equal scores between different identifications are almost always due to isobaric or nearly isobaric residues such as L/I or Q/K. Such results can severely distort the efficacy of IDPicker’s clustering analysis, so this filter can be set to reject them outright to prevent it. The default value is 2 which means an ambiguous result like “LGMSTK/IGMSTK” is acceptable, but “LLGMSTK/ILGMSTK/LIGMSTK/IIGMSTK” is not.

Minimum additional peptides

In the IDPicker context, an “additional” peptide is one which provides evidence for a protein group not yet accounted for by other peptides. It is used when building a minimum covering set for a cluster. When set to 0, a protein group is accepted regardless of whether it provides additional evidence or not. When set to 1, a protein group is accepted if it explains at least one additional peptide in the cluster. When set to 2, a protein group is accepted if it explains at least two additional peptides in the cluster. Higher numbers increase the parsimoniousness of the report. The default value is 1 to provide a basic level of parsimony.

Minimum distinct peptides

A peptide may be identified many times in a collection of sources. A thousand identifications of the same peptide sequence count as one “distinct” peptide, and this parameter sets the minimum number of distinct peptides that a protein must be linked to in order to be included in the report. The default value is 2 in order to exclude “one-hit wonders.”

Minimum peptide length

This filter sets the minimum number of amino acids a peptide must have to be accepted as an id. The default value is 5 because smaller peptides map to so many proteins that the protein assembly is unpresentable.

Minimum Covering Set (MCS)

The smallest set of protein groups necessary to explain the existence of all peptide groups in a cluster.

Peptide

A short chain of amino acids that can be matched with a spectrum to form an identification. This usage of peptide implies distinctness, i.e. that it may have been identified to many different spectra in the input data, but it only counts as one peptide. A unique peptide, on the other hand, is one that belongs to only one protein in the FASTA database.

Peptide Group (metapeptide)

A group of results that share the same set of proteins. These groups are used to make the analysis more presentable.

Protein

A unique accession string that identifies a large chain of amino acids with biological meaning in a protein database.

Protein Database

A list of protein identifiers with associated amino acid sequences and usually descriptions as well. IDPicker currently only supports FASTA format databases. All sources in an IDPicker analysis must be identified against the same database.

Protein Group (metaprotein)

A group of proteins that share the same set of results. They are indiscernible from each other based on available evidence.

Rank

A number assigned to a result to establish its relative ordering to other results for a given spectrum. Lower ranks are better, e.g. rank 1 is the best result.

Result

One or more identifications that matched to a spectrum with the same score and are members of the same rank. When a result has more than one identification, all of them are presented together to emphasize the ambiguity of the result.

Sequence

A sequence simply refers to a distinct peptide.

SID

A sequence identification number for distinguishing the different members of a protein group or peptide group.

Source Group

A canonical and hierarchical assignment of one or more input files to cause them to be analyzed separately (as well as with other groups).

Spectrum (Spectra)

A centroided list of mass-to-charge peaks from a tandem mass spectrum with an assigned charge state.
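
To make the “Cluster (connected component)” entry above more concrete, the following Python sketch finds connected components in a bipartite graph of peptides and proteins. It is only an illustration: the function name is hypothetical, and it works on individual proteins and peptides rather than on protein groups and peptide groups, although the connectivity is the same.

```python
from collections import defaultdict

def find_clusters(peptide_to_proteins):
    """Split a peptide-to-proteins mapping into connected components.

    Returns a list of (set of proteins, set of peptides) pairs, one per cluster.
    """
    neighbors = defaultdict(set)
    for peptide, proteins in peptide_to_proteins.items():
        neighbors[("peptide", peptide)]  # make sure every peptide is a node
        for protein in proteins:
            neighbors[("peptide", peptide)].add(("protein", protein))
            neighbors[("protein", protein)].add(("peptide", peptide))

    seen = set()
    clusters = []
    for start in list(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from this node.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for adjacent in neighbors[node]:
                if adjacent not in seen:
                    seen.add(adjacent)
                    stack.append(adjacent)
        proteins = {name for kind, name in component if kind == "protein"}
        peptides = {name for kind, name in component if kind == "peptide"}
        clusters.append((proteins, peptides))
    return clusters
```

Two proteins end up in the same cluster whenever a chain of shared peptides links them, even if they share no peptide directly.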