IDPicker v2.1
Accurate, transparent, and parsimonious protein assembly for MS/MS search results
Table of Contents:
I. Overview
II. Usage
1. Setting default options and search paths
2. Managing reports
ii. Assemble groups
iii. Configure filtering
iv. Advanced options
2) Distinct/indistinct modifications
b) View
c) Export to ZIP (entire report)
d) Export to CSV (tables only)
3. Interpreting reports
III. Glossary
A comprehensive introduction to IDPicker and its resulting output is available in this whitepaper.
This document details the steps involved in running IDPicker from the graphical user interface.
1. Setting default options and search paths
Use the global Tools/Options menu to set options that apply to all reports, even ones that have already been created.

The “report directory” is the destination for the report’s output: a set of HTML files linked together with an index page. The “decoy prefix” is used for FDR calculation, and the default can be changed for convenience. The “source extensions” are used in concert with the “source search path” to find the spectral data files for viewing peptide-spectrum matches. IDPicker can currently read spectral data from mzML, mzXML, MGF, Bruker FID/YEP/BAF, Thermo RAW (if Xcalibur is installed), and Waters RAW (if MassLynx is installed).
The search paths are used for finding certain files while creating and managing reports. To create a report, the FASTA protein database used to search the input files must be available in the “database search path” with the exact filename as it appears in the pepXML. To view peptide-spectrum matches, the source files must be available in the “source search path” with a filename that exactly matches their “source name” (usually the pepXML filename without the extension). Finally, in order to include the pepXML files in a ZIP export, they must be available in the “search search path” (that’s not confusing, is it?). All of the search paths support “relative” paths, where the search is done relative to a report-specific variable:
· <RootInputDirectory> is replaced with a report’s root input directory, set when the report was created; it can be used for all three search paths
· <DatabaseDirectory> is replaced with the directory that a report’s database was found in when the report was created; it can only be used in the source search path
· <SourceName> is replaced with the source name of an input pepXML and is only used when IDPicker needs to access spectra in source files
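The placeholder substitution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not IDPicker's own code: the function name `expand_search_path` and the example directories are hypothetical.

```python
def expand_search_path(template, root_input_dir, database_dir=None, source_name=None):
    """Substitute the report-specific placeholders in one search path entry.

    Hypothetical helper: it only mirrors the three placeholders listed above.
    """
    replacements = {
        "<RootInputDirectory>": root_input_dir,
        "<DatabaseDirectory>": database_dir,
        "<SourceName>": source_name,
    }
    expanded = template
    for placeholder, value in replacements.items():
        if placeholder in expanded:
            if value is None:
                # The placeholder is used where no value is known for it.
                raise ValueError(f"{placeholder} has no value in this context")
            expanded = expanded.replace(placeholder, value)
    return expanded


# Example with made-up directories: look for spectra next to the input files.
print(expand_search_path("<RootInputDirectory>/raw/<SourceName>",
                         "/data/exp1", source_name="run01"))
```

A path without any placeholder passes through unchanged, so absolute and relative entries can be mixed in the same search path list.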
2. Managing reports
a) Create or clone a new report
To create a new report, select “File / New Report” or right click on an empty area on the My Reports page and select “New.” To clone a report (copying all the settings from a previously run report into a new report form), right click on the previously run report on the My Reports page and select “Clone.” The form that pops up starts the configuration of a new or cloned report. The first step is to pick some pepXML search files to use as sources for the report.
Choose a name for the generated report, a directory to write the report to, and the decoy prefix that the input files use for their target-decoy search strategy. Then choose a root source directory. All input files must be somewhere within the source directory. IDPicker currently supports reading raw peptide identifications via the
At this point, the form should look something like:

The input file tree view is populated with the pepXML files that were found in the search space. IDPicker quickly checks each pepXML file to make sure it is valid and to determine which protein database was used to generate it. Currently, IDPicker only works with FASTA protein databases. Put a check next to a file to include it in the report.
All input files in a report must use the same FASTA database. To make it easy to select files which use the same database, the “Database” combo box is populated with the various database filenames that were found in the search of the source directory. When a database is selected in the combo box, only the pepXML files that are associated with that database can be selected.
IDPicker reads the FASTA database to obtain a description for each protein. It also uses the database to find each peptide’s offset into its protein. If a peptide maps to more than one protein in the database, the offset is calculated using the first protein only.
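As a rough illustration of the “first protein only” rule, the sketch below looks up a peptide’s offset in the first protein it maps to. The function name, the dictionary-based database, the sequences, and the zero-based offset are all assumptions made for this example; they are not taken from IDPicker.

```python
def peptide_offset(peptide, protein_accessions, database):
    """Return (accession, zero-based offset) using only the first mapped protein.

    `database` maps accession -> amino acid sequence (as read from a FASTA file).
    The offset is -1 if the peptide does not occur in that protein.
    """
    first = protein_accessions[0]
    return first, database[first].find(peptide)


# Made-up sequences: the peptide occurs in both proteins at different offsets.
database = {"P1": "MKTAYIAKQR", "P2": "AAAYIAKGG"}
print(peptide_offset("AYIAK", ["P1", "P2"], database))
```

Only the first accession in the list is consulted, so the offset in "P2" is never reported, which mirrors the behavior described above.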
When the “Next” button is clicked, the next step is to assemble the source files into one or more groups that represent an experimental hierarchy. IDPicker supports protein assembly and analysis with arbitrarily complex hierarchies, so this important information is not lost when analyzing the data. A few relatively simple use cases could be a report on a single LC-MS run, a group of LC-MS runs (like a MudPIT), or a group of groups of LC-MS runs (a group of MudPITs). A more complex use case could be a multi-level hierarchy across multiple instrument types, search engines, replicate types, etc.
After assembling the previous example into groups, the form might look something like this:

By default, included sources will be grouped according to their organization on the filesystem (relative to the source directory). The buttons at the top of the group management panes can be used (from left to right) to collapse all the groups, expand all the groups, set all sources to non-grouped, and set all sources to the default organization. Non-grouped files may be present when a report is run, but those files will not appear in the report. However, they will be available if that report is cloned in the future.
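The default grouping by filesystem layout can be pictured with the following sketch, which derives a group name from a file’s location relative to the source directory. The function name and paths are hypothetical, and the exact group naming IDPicker uses is an assumption here (the glossary only states that “/” is the root group).

```python
from pathlib import PurePosixPath


def default_group(pepxml_path, source_dir):
    """Derive a group name from the file's directory relative to source_dir.

    Assumption: groups are named like POSIX paths rooted at "/".
    """
    relative = PurePosixPath(pepxml_path).relative_to(source_dir)
    parent = relative.parent
    # Files directly inside the source directory fall into the root group.
    return "/" if str(parent) == "." else "/" + str(parent)


# Made-up layout: two MudPIT experiments, each in its own subdirectory.
for path in ["/data/exp/mudpit1/frac01.pepXML",
             "/data/exp/mudpit2/frac01.pepXML",
             "/data/exp/single_run.pepXML"]:
    print(path, "->", default_group(path, "/data/exp"))
```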
The final step is setting the various filter criteria (see the glossary for complete explanations of each filter). After setting the filters, press the “Run Report” button to start generating the report.
It is sometimes necessary to configure some advanced options by clicking the “Advanced” button on the “Load and Qonvert” page of the new/clone report form.
IDPicker is set up by default to work with pepXML files from MyriMatch, X! Tandem, Sequest, and Mascot. However, it is able to read arbitrary score names from input files and assign them arbitrary weights to produce the “total score,” which is used to sort the results from each spectrum. These default values show up as advanced options like:
For each score, IDPicker must have a name to look for in the pepXML, a static weight, and knowledge of whether higher or lower values of that score indicate a better result. An “ascending” score means that higher scores are better. A weight of “0” indicates that the score will be effectively ignored.
IDPicker must also know how to combine the scores to produce a total. The default is “static” weighting, which combines the scores by a weighted arithmetic mean. The “combine scores as quantiles” option tells IDPicker to convert each score to a quantile on a scale from 0 to 1, where 0 is the worst instance of that score in a pepXML and 1 is the best instance of that score. The conversion to quantiles is done before combination.
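The two combination modes can be illustrated with a short sketch. It is not IDPicker’s implementation: the function names are hypothetical, and the quantile conversion shown here is rank-based over the distinct values of a score, which is only one plausible reading of the description above.

```python
def to_quantiles(values, higher_is_better=True):
    """Map each value to [0, 1]: 0 for the worst value seen, 1 for the best.

    Assumption: quantiles are the rank positions of the distinct values.
    """
    distinct = sorted(set(values), reverse=not higher_is_better)
    if len(distinct) == 1:
        return [1.0 for _ in values]
    position = {v: i / (len(distinct) - 1) for i, v in enumerate(distinct)}
    return [position[v] for v in values]


def total_score(scores, weights):
    """Weighted arithmetic mean of named scores (the "static" weighting)."""
    weight_sum = sum(weights.values())
    if weight_sum == 0:
        raise ValueError("at least one score needs a non-zero weight")
    return sum(weights[name] * scores[name] for name in weights) / weight_sum


# Illustrative score names and weights; a weight of 0 ignores that score.
print(total_score({"xcorr": 2.0, "deltacn": 4.0}, {"xcorr": 1.0, "deltacn": 3.0}))
print(to_quantiles([1.0, 3.0, 2.0]))                          # higher is better
print(to_quantiles([1.0, 3.0, 2.0], higher_is_better=False))  # lower is better
```

When quantile mode is on, each score column would be passed through `to_quantiles` first and the results then fed to `total_score`, so scores with very different ranges contribute on the same 0 to 1 scale.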
A new feature of IDPicker 2.0 is the ability to optimize the score weights with a Monte Carlo method: IDPicker randomly permutes the weights and chooses the permutation that produces the best performance (as measured by the total number of ids). This feature is disabled by default; enable it by checking the “Apply score optimization” checkbox. The permutation count only has meaning when score optimization is enabled.
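A minimal sketch of this kind of Monte Carlo search is shown below. It assumes that candidate weights are drawn uniformly at random and that a caller-supplied function reports how many ids a given weighting yields; both the sampling scheme and the names `optimize_weights` and `count_ids` are assumptions for illustration, not IDPicker’s actual procedure.

```python
import random


def optimize_weights(score_names, count_ids, permutations=200, seed=0):
    """Try random weightings and keep the one that yields the most ids.

    `count_ids` is a hypothetical callback: weights dict -> number of ids.
    """
    rng = random.Random(seed)
    # Start from equal weights so the result is never worse than that baseline.
    best_weights = {name: 1.0 for name in score_names}
    best_count = count_ids(best_weights)
    for _ in range(permutations):
        candidate = {name: rng.random() for name in score_names}
        count = count_ids(candidate)
        if count > best_count:
            best_weights, best_count = candidate, count
    return best_weights, best_count


# Toy objective standing in for a real filtering run: it rewards weight on "a".
def toy_count(weights):
    return int(100 * weights["a"] / (weights["a"] + weights["b"] + 1e-9))


weights, count = optimize_weights(["a", "b"], toy_count)
print(weights, count)
```

The `permutations` argument plays the role of the permutation count option: more trials cost more time but explore more weightings.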
2) Distinct/indistinct modifications
By default, IDPicker will treat peptides with a (post-translational) modification as distinct from the unmodified peptide (or a differently modified instance) for the purposes of the “minimum distinct peptides” filter and organization in the cluster reports. Configuring the distinctness or indistinctness of various mods is fairly self-explanatory:

A peptide with a “distinct” modification will be presented differently in the various IDPicker reports:
· Peptide list on the cluster pages (as a separate row and a change in the sequence count for one or more proteins)
· Spectra per peptide by group table (as a separate row)
· Sequences per protein by group table (a change in the sequence counts)
A “distinct” modification will be presented in the index by modification page. “Indistinct” modifications are folded into the “none” category on that page.
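The effect of marking a modification as distinct or indistinct on the distinct peptide count can be sketched as follows. The data layout (a sequence plus a tuple of position/name pairs), the function name, and the modification names are illustrative assumptions, not IDPicker’s internal representation.

```python
def distinct_peptide_keys(identifications, distinct_mods):
    """Collapse identifications into distinct peptides.

    Each identification is (sequence, mods), where mods is a tuple of
    (position, modification_name) pairs. Modifications whose name is not in
    `distinct_mods` are ignored, so they fold into the unmodified peptide.
    """
    keys = set()
    for sequence, mods in identifications:
        kept = tuple(sorted(m for m in mods if m[1] in distinct_mods))
        keys.add((sequence, kept))
    return keys


# Three ids of one sequence: unmodified, phosphorylated, and oxidized.
ids = [
    ("PEPTIDEK", ()),
    ("PEPTIDEK", ((3, "Phospho"),)),
    ("PEPTIDEK", ((1, "Oxidation"),)),
]
print(len(distinct_peptide_keys(ids, {"Phospho", "Oxidation"})))  # all distinct
print(len(distinct_peptide_keys(ids, {"Phospho"})))               # oxidation folded in
print(len(distinct_peptide_keys(ids, set())))                     # everything folded in
```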
b) View
A report is opened as soon as it is created, but it can also be reopened at a later time from the “My Reports” page. Double click on a report on the “My Reports” page to view it, or right click and select “view.”
c) Export to ZIP (entire report)
Since an IDPicker report is almost entirely rendered with HTML and JavaScript, it is inherently viewable on any system with a web browser. But an IDPicker report is made up of many files, so the best way to transfer it is to “ZIP” it into a single file for easier file management. To set up an export to ZIP, right click on a report on the “My Reports” page and click “Export.” The export dialog will pop up and allow configuration of which files get included in the ZIP:

By default, IDPicker will only include the HTML and associated files necessary to view the report in a standalone web browser. However, it is also easy from this dialog to selectively include the input pepXMLs used to generate the report (the “search” files), the raw spectra files that contain the actual spectral data (the “source” files), and the protein FASTA database. These extra options depend on the search paths being set up correctly in the global Tools/Options menu.
When “source files” are selected for export, the “source extensions” and “include” controls become active. “Source extensions” controls which file extensions the export will look for in the source search path. The search happens in the order the extensions are listed in the semicolon-delimited list. When looking for a particular source name in the source search path, it is possible that multiple source files are found (e.g. MyData.RAW and MyData.mzML). In that case, the “include” combo box controls whether only the first matching file is included or whether all matching files are included.
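The ordered extension lookup and the “first match” versus “all matches” choice can be sketched like this. The function and parameter names are hypothetical, only a single directory is searched, and only regular files are considered, which simplifies whatever IDPicker actually does.

```python
from pathlib import Path


def find_source_files(source_name, search_dir, extensions, include_all=False):
    """Look for <source_name>.<ext> in search_dir, trying extensions in order.

    `extensions` is a semicolon-delimited list such as "RAW;mzML;mgf".
    Returns only the first match unless include_all is True.
    """
    matches = []
    for ext in extensions.split(";"):
        ext = ext.strip()
        if not ext:
            continue  # tolerate stray semicolons
        candidate = Path(search_dir) / f"{source_name}.{ext.lstrip('.')}"
        if candidate.is_file():
            matches.append(candidate)
            if not include_all:
                break
    return matches
```

For example, `find_source_files("MyData", "/data/raw", "RAW;mzML")` (with a made-up directory) would return at most one path, preferring MyData.RAW over MyData.mzML because RAW is listed first.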
When the desired options are set, click “Export” to begin the export process. Note that it may take a long time to finish the export if the extra export options are turned on (because those extra files tend to be very large).
d) Export to CSV (tables only)
Sometimes all that is needed is a parser-friendly version of the data-rich tables that IDPicker generates in HTML format. To set up an export to CSV (Comma-Separated Values), right click on a report on the “My Reports” page and click “Export.” The export dialog will pop up. Select “CSV” in the “Export Type” combo box. The options will change to look like:
The most frequently used tables are available for export to CSV. Check which ones are desired and click “Export.”
When removing a report from the “My Reports” page, the user can choose either to simply exclude the report from the display or to both exclude the report and delete its associated files in the report’s destination directory. Right click on a report on the “My Reports” page and click “Delete” to either remove or delete a report (a prompt will pop up to explain the difference).
Glossary (a.k.a. what do all these ambiguous terms actually mean?!)
Click on a term name in the left column to look it up in Wikipedia; term names in the definitions link to other terms in the glossary.
| Term Name | Term Definition |
| / | The root source group. This group always contains the analysis results for all other source groups combined. |
|  | A visual way of representing the relationships between proteins and peptides, or between protein groups and peptide groups. |
| CID | A unique cluster identification number for a given analysis. |
| Cluster (connected component) | A set of interrelated protein groups and peptide groups. A cluster is not interrelated with any other cluster. In graph theory, this concept is called a connected component. |
|  | To assess confident identifications, IDPicker depends on the target-decoy search strategy. IDPicker assumes that protein accession IDs that have a certain unique prefix (set by this parameter) are decoys. Decoys are used to calculate FDR. The default decoy prefix is “rev_” but it might be different for every database. |
|  | A statistical estimate of the probability that a given result is a false one. The FDR is based on the scores that the search engines supply (e.g. XCorr, MVH, Hyperscore, DeltaCn, Expectation value). For a given result, the FDR is the percentage of results (of the results that scored better) that can be expected to be false. An FDR measurement at a given result is a “q value.” |
| GID | A unique global identification number for either a protein group or a peptide group. |
| GLID | A numeric or alphabetic symbol used as a cluster-local identifier for a protein group or peptide group. |
|  | A match between a spectrum and a peptide. An id may constitute an entire result or just part of one. |
|  | This filter sets the maximum FDR that a result can have to be included in the report. This is the first filter to be applied. The default maximum FDR is 5% which means that immediately after this filter is applied, 5% of the results should be false. After further filters are applied, that percentage may get higher or lower depending on the efficacy of the additional filtering steps. |
|  | An ambiguous identification comes from a result that has multiple peptides with equal scores, i.e. belonging to the same rank. These equal scores between different identifications are almost always due to isobaric or nearly isobaric residues such as L/I or Q/K. Such results can severely distort the efficacy of IDPicker’s clustering analysis, so this filter can be set to reject them outright to prevent it. The default value is 2 which means an ambiguous result like “LGMSTK/IGMSTK” is acceptable, but “LLGMSTK/ILGMSTK/LIGMSTK/IIGMSTK” is not. |
|  | In the IDPicker context, an “additional” peptide is one which provides evidence for a protein group not yet accounted for by other peptides. It is used when building a minimum covering set for a cluster. When set to 0, a protein group is accepted regardless of whether it provides additional evidence or not. When set to 1, a protein group is accepted if it explains at least one additional peptide in the cluster. When set to 2, a protein group is accepted if it explains at least two additional peptides in the cluster. Higher numbers increase the parsimoniousness of the report. The default value is 1 to provide a basic level of parsimony. |
|  | A peptide may be identified many times in a collection of sources. A thousand identifications of the same peptide sequence count as one “distinct” peptide, and this parameter sets the minimum number of distinct peptides that a protein must be linked to in order to be included in the report. The default value is 2 in order to exclude “one-hit wonders.” |
|  | This filter sets the minimum number of amino acids a peptide must have to be accepted as an id. The default value is 5 because smaller peptides map to so many proteins that the protein assembly is unpresentable. |
|  | The smallest set of protein groups necessary to explain the existence of all peptide groups in a cluster. |
|  | A short chain of amino acids that can be matched with a spectrum to form an identification. This usage of peptide implies distinctness, i.e. that it may have been identified to many different spectra in the input data, but it only counts as one peptide. A unique peptide, on the other hand, is one that belongs to only one protein in the FASTA database. |
|  | A group of results that share the same set of proteins. These groups are used to make the analysis more presentable. |
|  | A unique accession string that identifies a large chain of amino acids with biological meaning in a protein database. |
|  | A list of protein identifiers with associated amino acid sequences and usually descriptions as well. IDPicker currently only supports FASTA format databases. All sources in an IDPicker analysis must be identified against the same database. |
|  | A group of proteins that share the same set of results. They are indiscernible from each other based on available evidence. |
|  | A number assigned to a result to establish its relative ordering to other results for a given spectrum. Lower ranks are better, e.g. rank 1 is the best result. |
|  | One or more identifications that matched to a spectrum with the same score and are members of the same rank. When a result has more than one identification, all of them are presented together to emphasize the ambiguity of the result. |
|  | A sequence simply refers to a distinct peptide. |
|  | A sequence identification number for distinguishing the different members of a protein group or peptide group. |
|  | A canonical and hierarchical assignment of one or more input files to cause them to be analyzed separately (as well as with other groups). |
|  | A centroided list of mass-to-charge peaks from a tandem mass spectrum with an assigned charge state. |
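To make the glossary’s description of FDR and the maximum FDR filter more concrete, here is a small sketch of a target-decoy calculation. The glossary does not give the formula, so the estimate used below (twice the number of decoys divided by the number of results seen so far) is an assumption, as are the function names and the toy accessions; only the “rev_” default prefix comes from the text above.

```python
def fdr_per_result(results, decoy_prefix="rev_"):
    """Return (accession, score, fdr) tuples, best score first.

    `results` is a list of (accession, score) pairs where higher scores are
    better. The FDR at each result is estimated from the decoys seen so far.
    """
    ordered = sorted(results, key=lambda r: r[1], reverse=True)
    decoys = 0
    annotated = []
    for seen, (accession, score) in enumerate(ordered, start=1):
        if accession.startswith(decoy_prefix):
            decoys += 1
        # Assumed estimate: each decoy hit implies one false target hit.
        annotated.append((accession, score, min(1.0, 2 * decoys / seen)))
    return annotated


def apply_max_fdr(results, max_fdr=0.05, decoy_prefix="rev_"):
    """Keep target accessions down to the last result whose FDR <= max_fdr."""
    annotated = fdr_per_result(results, decoy_prefix)
    passing = [i for i, (_, _, fdr) in enumerate(annotated) if fdr <= max_fdr]
    if not passing:
        return []
    cutoff = passing[-1]
    return [acc for acc, _, _ in annotated[:cutoff + 1]
            if not acc.startswith(decoy_prefix)]


# Toy data: three target hits and one decoy hit.
results = [("P1", 10.0), ("P2", 9.0), ("rev_P3", 8.0), ("P4", 7.0)]
print(apply_max_fdr(results, max_fdr=0.05))
print(apply_max_fdr(results, max_fdr=0.5))
```

Cutting at the last passing position, rather than the first failing one, follows the usual q-value convention; whether IDPicker does exactly this is not stated in this document.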