
Protein grouping in mzIdentML

ProteinDetectionList
    ProteinAmbiguityGroup id=“PAG1”
        ProteinDetectionHypothesis id=“PDH1” dbseq_ref=“dbseq_Q05421|CP2E1_MOUSE” (anchor protein)
        ProteinDetectionHypothesis id=“PDH2” dbseq_ref=“dbseq_Q05423|CP2E2_MOUSE” (sequence same-set)
        ProteinDetectionHypothesis id=“PDH3” dbseq_ref=“dbseq_Q05312|CP2F1_MOUSE” (sequence subset)
    ProteinAmbiguityGroup id=“PAG2”
        ....

ProteinAmbiguityGroup and ProteinDetectionHypothesis

Existing CV terms for ProteinDetectionHypothesis

id: MS:1001591
name: anchor protein
def: "A representative protein selected from a set of sequence same-set or spectrum same-set proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001592
name: family member protein
def: "A protein with significant homology to another protein, but some distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001593
name: group member with undefined relationship OR ortholog protein
def: "TO ENDETAIL: a really generic relationship OR ortholog protein." [PSI:MS]
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001594
name: sequence same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to an identical set of peptide sequences." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001595
name: spectrum same-set protein
def: "A protein which is indistinguishable or equivalent to another protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

Existing CV terms for ProteinDetectionHypothesis

id: MS:1001596
name: sequence sub-set protein
def: "A protein with a sub-set of the peptide sequence matches for another protein, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001597
name: spectrum sub-set protein
def: "A protein with a sub-set of the matched spectra for another protein, where the matches cannot be distinguished using the evidence in the mass spectra, and no distinguishing peptide matches." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001598
name: sequence subsumable protein
def: "A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

id: MS:1001599
name: spectrum subsumable protein
def: "A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins." [PSI:MS]
xref: value-type:xsd\:string "The allowed value-type for this CV term."
is_a: MS:1001101 ! protein group or subset relationship

Problems

• No requirement for any exporter to use the terms (“MAY”)
• “anchor protein” doesn’t capture intended role and isn’t used consistently
• No definition of what should be put in the value slot of cv terms:

    id: MS:1001596
    name: sequence sub-set protein
    def: "A protein with a sub-set ...." [PSI:MS]
    xref: value-type:xsd\:string "The allowed value-type for this CV term."
    is_a: MS:1001101 ! protein group or subset relationship

    • Could be the PDH identifier, accession or DBSequence identifier of group representative or any other protein that is super-set to this protein
    • Or anything else for that matter
• What does passThreshold = “true” on PDH mean?
• Unclear how to count the number of identified proteins in an mzIdentML file
    • Count PAGs or count PDHs?
• No terms for protocol describing how inference has been done or how to interpret results

Proposed work group outcomes

• Attach cv terms describing how protein inference has been done
    – Still under discussion, since these effectively describe parts of the algorithm used
• Exactly one mandatory “representative protein” MUST be present per group (new name for “anchor protein”) on PDH
    – To be checked by semantic validator
• ProteinDetectionList MUST have a cv term “number of identified proteins” (count PAGs that have “representative protein” PDH with passThreshold=“true”)
• Each PDH SHOULD be flagged with one term from a group stating whether it is “representative protein”, “sequence|spectrum same-set”, “sequence|spectrum subset”, “sequence|spectrum subsumed” or “marginally distinguished” (i.e. not strictly any of these, but not enough evidence to be a group representative)
    – Value slot of these terms SHOULD contain a comma-separated list of super-set or same-set (as appropriate) PDH IDs
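The counting rule for “number of identified proteins” can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XML fragment is hypothetical and unnamespaced, the PDH identifiers are made up, and the cvParam is matched by the proposed name “representative protein” because the slides do not give an accession for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. Real mzIdentML files are namespaced and
# carry accession/cvRef attributes on cvParam; those are omitted here.
SAMPLE = """
<ProteinDetectionList id="PDL1">
  <ProteinAmbiguityGroup id="PAG1">
    <ProteinDetectionHypothesis id="PDH1" passThreshold="true">
      <cvParam name="representative protein"/>
    </ProteinDetectionHypothesis>
    <ProteinDetectionHypothesis id="PDH2" passThreshold="true">
      <cvParam name="sequence same-set protein" value="PDH1"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
  <ProteinAmbiguityGroup id="PAG2">
    <ProteinDetectionHypothesis id="PDH4" passThreshold="false">
      <cvParam name="representative protein"/>
    </ProteinDetectionHypothesis>
  </ProteinAmbiguityGroup>
</ProteinDetectionList>
"""

REPRESENTATIVE = "representative protein"  # assumed cvParam name (proposed term)


def local_name(tag):
    """Reduce '{namespace}Tag' to 'Tag' so namespaced files also work."""
    return tag.rsplit("}", 1)[-1]


def is_representative(pdh):
    """True if the PDH carries a cvParam named 'representative protein'."""
    return any(
        local_name(child.tag) == "cvParam" and child.get("name") == REPRESENTATIVE
        for child in pdh
    )


def count_identified_proteins(protein_detection_list):
    """Count PAGs holding a representative PDH with passThreshold='true'."""
    count = 0
    for pag in protein_detection_list:
        if local_name(pag.tag) != "ProteinAmbiguityGroup":
            continue
        if any(
            local_name(pdh.tag) == "ProteinDetectionHypothesis"
            and is_representative(pdh)
            and pdh.get("passThreshold") == "true"
            for pdh in pag
        ):
            count += 1
    return count


root = ET.fromstring(SAMPLE.strip())
print(count_identified_proteins(root))  # 1: PAG2's representative did not pass
```

Counting PAGs rather than PDHs means the same-set protein PDH2 does not inflate the total, which is the point of the proposal.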

mzIdentML context: ProteinDetectionProtocol
CV terms: “No parsimony”, “Strict parsimony”, “Parsimony with additional considerations”
Parent term: “Parsimony usage”
Values: xsd:String (to allow free text description)
Requirement level: SHOULD
Description: No parsimony used means no parsimony approach has been applied in generating the protein list. Strict parsimony used should be indicated if parsimony is the only consideration used to report proteins. Parsimony with additional considerations used should be indicated if additional information such as quantitation information is used to influence which proteins are reported, or if some additional proteins are reported for other reasons, such as a desire to report one protein from each gene to which any matched peptide maps.

mzIdentML context: ProteinDetectionProtocol
CV terms: “No intact protein separation for protein inference”, “Partial isolation for protein inference”, “Nearly complete isolation for protein inference”
Parent term: “Role of intact protein separation in protein inference”
Values: xsd:String (to allow free text description)
Requirement level: SHOULD
Description: In workflows where proteins are not separated to any degree, or in which protein separation information is not used in the protein inference, this will have a value of “No intact protein separation for protein inference”, as will be the case in strictly bottom up proteomics. At the other limit, “Nearly complete isolation for protein inference” should be indicated when separation of intact proteins is conducted and relied upon for protein inference, as is common in multidimensional gel-based work. The “Partial isolation for protein inference” value should be specified for cases where some level of protein isolation is used, for example if a sizing column is used to separate intact proteins into fractions, or in the common GeLC-MS workflow where 1D gel separation is followed by bottom up analysis of the gel slices.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Attempted isoform differentiation”, “Prevented isoform differentiation”
Parent term: “Isoform Differentiation”
Values: -
Requirement level: SHOULD
Description: In the context of a parsimony approach, an inference tool can either attempt to report multiple protein forms by determining if there is adequate evidence to support the detection of more than one isoform in a cluster (most common), or alternatively the tool could prevent this differentiation process and maximally group instead.

mzIdentML context: ProteinDetectionProtocol
CV terms: Accession Ambiguity is Reported
Values: “true”, “false”
Requirement level: SHOULD
Description: Used for reporting whether ambiguity is reported, i.e. if true, PAGs may contain one or more PDHs; if false, each PAG must contain only one PDH (no attempt to report ambiguity).

Table 1. New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
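The mapping from requirement level to validator message in the Table 1 caption can be sketched as a lookup. This is a minimal sketch, not the actual mzIdentML semantic validator: the function name, the dictionary layout and the message wording are all hypothetical, and terms are matched by parent-term name rather than by accession.

```python
# Requirement level -> message type, as stated in the Table 1 caption.
LEVEL_TO_SEVERITY = {"MUST": "error", "SHOULD": "warning", "MAY": "info"}

# Parent terms from Table 1; every one of them is listed at SHOULD level.
PROTOCOL_RULES = {
    "Parsimony usage": "SHOULD",
    "Role of intact protein separation in protein inference": "SHOULD",
    "Isoform Differentiation": "SHOULD",
    "Accession Ambiguity is Reported": "SHOULD",
}


def check_protocol_terms(reported, rules=PROTOCOL_RULES):
    """Return (severity, message) pairs for every rule with no reported term.

    `reported` maps a parent term name to the value found in the file
    (hypothetical layout chosen for this sketch).
    """
    messages = []
    for term, level in rules.items():
        if term not in reported:
            severity = LEVEL_TO_SEVERITY[level]
            messages.append((severity, f"{level} term missing: {term}"))
    return messages


reported = {
    "Parsimony usage": "Strict parsimony",
    "Accession Ambiguity is Reported": "true",
}
for severity, message in check_protocol_terms(reported):
    print(severity, message)
```

Because all four protocol terms are SHOULD rules, a file that omits them still validates, with warnings only.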

mzIdentML context: ProteinDetectionProtocol
CV terms: Threshold applied to Peptides
Values: “true”, “false”
Requirement level: SHOULD
Description: Set to true if thresholds are applied at the PSM or peptide level prior to protein inference. If thresholds have been applied, these should be reported under ProteinDetectionProtocol->Threshold using appropriate CV terms.

mzIdentML context: ProteinDetectionProtocol
CV terms: Multiple matches per spectrum are considered
Values: “true”, “false”
Requirement level: SHOULD
Description: This should be set to false for protein inference approaches that limit to a single top ranking peptide per spectrum for consideration during protein inference; true should be set for approaches that preserve multiple answers per spectrum and provide all of these to the protein inference algorithm.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Spectrum-centric parsimony minimization”, “Sequence-centric parsimony minimization”, “Sequence-centric parsimony minimization with additional rules”, “No parsimony minimization”
Parent term: Parsimony Minimization Method
Requirement level: SHOULD
Description: Sequence-centric parsimony minimization means that the inference method has sought to find the minimal set of proteins that explains all the peptide sequences observed, while Spectrum-centric parsimony minimization means the inference approach has sought to find the minimal set of proteins that explains the collection of observed spectra. Sequence-centric parsimony minimization with additional rules would apply if a sequence-centric approach is used but additional rules are applied, for example if allowances are made to compensate for limitations of this approach such as I/L and deamidation ambiguities. No parsimony minimization should be indicated only if the Parsimony usage field is set to No parsimony.

mzIdentML context: ProteinDetectionProtocol
CV terms: “Exhaustive list ambiguity modeling”, “Limited list ambiguity modeling”
Parent term: Ambiguity Modeling Approach
Requirement level: SHOULD
Description: In modelling a PAG, in one approach an algorithm can list all known intersection relationships, including accessions that have very limited overlap with the representative protein in the group. Alternatively, an algorithm can limit the scope of accessions that are included. For example, one could list only accessions that have at least some minimal level of intersection with the representative protein in the group. This CV term simply captures whether the group modelling is limited in some way or is exhaustive in listing accessions.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Protein Quality Threshold: MinimumNumSequencesRequired
Values: Integer
Requirement level: SHOULD
Description: An integer value representing the number of identified peptide sequences required for creating a PDH.

mzIdentML context: ProteinDetectionProtocol
CV terms: TaxonomyBasedPreference
Values: “true”, “false”
Requirement level: SHOULD
Description: In some workflows, one might map identified peptides to a multi-species protein sequence database, but prefer matches to sequences from a particular species.

mzIdentML context: ProteinDetectionProtocol->Threshold
CV terms: Other thresholding terms?

Table 1 cont. New CV terms for reporting how protein inference has been performed. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
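Two of the rules in this part of Table 1 are mechanical enough to illustrate: the constraint tying “No parsimony minimization” to “No parsimony”, and the MinimumNumSequencesRequired threshold. The sketch below assumes hypothetical function names, protein labels and peptide strings; it is not taken from any existing tool.

```python
def parsimony_terms_consistent(parsimony_usage, minimization_method):
    """Check the one stated constraint between the two protocol terms.

    'No parsimony minimization' should be indicated only if Parsimony usage
    is 'No parsimony'. The table states no rule for the reverse direction,
    so every other combination is accepted here.
    """
    if minimization_method == "No parsimony minimization":
        return parsimony_usage == "No parsimony"
    return True


def proteins_meeting_minimum(peptides_by_protein, minimum_num_sequences):
    """Return proteins with enough distinct identified peptide sequences.

    Repeated identifications of the same sequence count once, on the
    assumption that the threshold refers to distinct sequences.
    """
    return sorted(
        protein
        for protein, peptides in peptides_by_protein.items()
        if len(set(peptides)) >= minimum_num_sequences
    )


# Hypothetical identifications: protA has two distinct sequences, protB one.
peptides_by_protein = {
    "protA": ["PEPTIDEK", "PEPTIDEK", "SEQVENCER"],
    "protB": ["PEPTIDEK"],
}

print(parsimony_terms_consistent("Strict parsimony", "No parsimony minimization"))  # False
print(proteins_meeting_minimum(peptides_by_protein, 2))  # ['protA']
```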

mzIdentML context: ProteinDetectionList
CV term: number of identified proteins
Values: Integer
Requirement level: MUST
Description: The value reported should equal the number of PAGs containing a PDH flagged as Representative Protein and passThreshold=“true”.

mzIdentML context: ProteinAmbiguityGroup
CV term: Protein cluster identifier
Values: String. A within-file unique identifier
Requirement level: MAY
Description: A common identifier reported allows multiple PAGs to be linked, for example indicating some peptides are shared between different PAGs.

mzIdentML context: ProteinAmbiguityGroup
CV term: NumberDistinctProteinSequences
Values: Integer
Requirement level: SHOULD
Description: The number of distinct protein sequences among the PDHs in the group. For example, if there are two PDHs with different identifiers that have identical full length sequences, the NumberDistinctProteinSequences would be one.

mzIdentML context: ProteinDetectionHypothesis
CV term: Representative protein
Values: -
Requirement level: MUST (be present on one PDH per PAG that is counted)
Description: The Representative protein will generally have likelihood greater than or equal to other proteins in the ProteinAmbiguityGroup, but this is not required. Exactly one PDH within a PAG must be assigned with this label to serve as the representative for the putatively detected protein. A PDH labelled as the Representative protein can have passThreshold=“true|false”, i.e. it need not have passed the threshold reported in the ProteinDetectionProtocol.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Same-Set Protein
Values: xsd:String, comma separated list of PDH Ids that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to another protein in the group, having matches to an identical set of peptide sequences.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Same-Set Protein
Values: xsd:String, comma separated list of PDH Ids that are same-set
Requirement level: SHOULD
Description: A protein that is indistinguishable or equivalent to the Representative protein, having matches to a set of peptide sequences that cannot be distinguished using the evidence in the mass spectra.

Table 2. New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
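The “exactly one Representative protein per PAG” rule is the kind of check the semantic validator would make. A minimal sketch follows, assuming a hypothetical in-memory layout (PAG id to PDH id to list of term names) instead of parsed mzIdentML, and ignoring the “per PAG that is counted” qualifier from the table.

```python
REPRESENTATIVE = "Representative protein"  # term name as written in Table 2

# Hypothetical data: PAG id -> {PDH id -> term names attached to that PDH}.
pags = {
    "PAG1": {
        "PDH1": ["Representative protein"],
        "PDH2": ["Sequence Same-Set Protein"],
    },
    "PAG2": {
        "PDH4": ["Sequence Subset Protein"],  # no representative at all
    },
    "PAG3": {
        "PDH5": ["Representative protein"],
        "PDH6": ["Representative protein"],  # two representatives
    },
}


def check_representatives(pags):
    """Return one error string per PAG without exactly one representative."""
    errors = []
    for pag_id, pdhs in pags.items():
        representatives = [
            pdh_id for pdh_id, terms in pdhs.items() if REPRESENTATIVE in terms
        ]
        if len(representatives) != 1:
            errors.append(
                f"{pag_id}: expected exactly one {REPRESENTATIVE}, "
                f"found {len(representatives)}"
            )
    return errors


for error in check_representatives(pags):
    print(error)
```

Note that the check deliberately ignores passThreshold: per the table, a representative PDH may have passThreshold “true” or “false”.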

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Subset Protein
Values: xsd:String, comma separated list of PDH Ids that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the peptide sequence matches for the Representative protein, and no distinguishing peptide matches.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Subset Protein
Values: xsd:String, comma separated list of PDH Ids that are super-set
Requirement level: SHOULD
Description: A protein with a sub-set of the matched spectra for the Representative protein, where the matches cannot be distinguished using the evidence in the mass spectra.

mzIdentML context: ProteinDetectionHypothesis
CV term: Sequence Multiply Subsumable Protein
Values: xsd:String, comma separated list of PDH Ids that subsume this PDH
Requirement level: SHOULD
Description: A sequence same-set or sequence sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Spectrum Multiply Subsumable Protein
Values: xsd:String, comma separated list of PDH Ids that subsume this PDH
Requirement level: SHOULD
Description: A spectrum same-set or spectrum sub-set protein where the matches are distributed across two or more proteins.

mzIdentML context: ProteinDetectionHypothesis
CV term: Marginally distinguished protein
Requirement level: MAY
Description: Assigned to a PDH that has some evidence to support its presence in addition to the representative protein, i.e. it has a unique peptide but not sufficient evidence to be promoted to Representative Protein in a PAG.

mzIdentML context: ProteinDetectionHypothesis
CV term: Covering Set Protein
Requirement level: MAY
Description: A member of a minimal set of proteins sufficient to explain all matched peptides/spectra via a parsimony approach. This provides an alternative means of reporting a parsimonious protein list when ParsimonyUsage=“Parsimony with additional considerations”. A PAG can contain zero, one, or multiple PDHs bearing this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Identical
Values: xsd:String, comma separated list of native accession(s) of protein with identical protein sequence
Requirement level: MAY
Description: Full length protein sequence is identical with respect to the protein specified in the value attribute of this term.

mzIdentML context: DBSequence
CV term: Protein Sequence Subsequence
Values: xsd:String, native accession of protein with “super”-sequence
Requirement level: MAY
Description: Full length protein sequence is a subsequence of the protein specified in the value attribute of this term.

Table 2 cont. New CV terms for reporting protein set (group) relationships and global statistics about the protein identification results. The semantic validation software for mzIdentML reports an error (MUST), a warning (SHOULD) or an informational message (MAY) if these terms are not reported within the file.
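The sequence-level same-set and subset terms, together with their comma-separated value slots, follow directly from set comparisons on matched peptide sequences. The sketch below is an illustration under assumptions: the peptide strings and the function name are hypothetical, the PDH ids mirror the PAG1 example from the opening slide, and the spectrum-level, multiply subsumable and marginally distinguished terms are not handled.

```python
def sequence_relationship_terms(peptides_by_pdh, representative):
    """Assign a sequence-level term and value slot to each PDH in one PAG.

    Returns PDH id -> (term name, value), where value is a comma-separated
    list of same-set or super-set PDH ids. PDHs that fit neither category
    get (None, "").
    """
    terms = {}
    for pdh, peptides in peptides_by_pdh.items():
        if pdh == representative:
            terms[pdh] = ("Representative protein", "")
            continue
        # Other PDHs matching exactly the same peptide sequences.
        same = sorted(
            other for other, other_peps in peptides_by_pdh.items()
            if other != pdh and other_peps == peptides
        )
        # PDHs whose peptide set strictly contains this one.
        supers = sorted(
            other for other, other_peps in peptides_by_pdh.items()
            if peptides < other_peps
        )
        if same:
            terms[pdh] = ("Sequence Same-Set Protein", ",".join(same))
        elif supers:
            terms[pdh] = ("Sequence Subset Protein", ",".join(supers))
        else:
            terms[pdh] = (None, "")
    return terms


# Hypothetical peptide evidence for the three PDHs of PAG1.
peptides_by_pdh = {
    "PDH1": {"AAAK", "CCCR", "DDDK"},
    "PDH2": {"AAAK", "CCCR", "DDDK"},
    "PDH3": {"AAAK", "CCCR"},
}

for pdh, (term, value) in sequence_relationship_terms(peptides_by_pdh, "PDH1").items():
    print(pdh, term, value)
```

With this evidence PDH2 is same-set with PDH1, and PDH3 lists both PDH1 and PDH2 as super-sets, which shows why the value slot needs to hold a list rather than a single identifier.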

Unresolved issues

• Are the protocol terms necessary / sensible / overkill?
• Is there general consensus on the idea that the number of identified proteins MUST be reported and must equal count of PAGs with PDH passThreshold=“true”?
• Is it sensible to have SHOULD rules on all subset/same-sets?
• Extra terms for relationships between protein sequences
    – Probably these will be removed
• Mechanism for updating the mzIdentML specifications and validation software
    – Minor update + submission to shortened PSI process?