Research and Data Integrity
Research Integrity: The Importance of Data Acquisition and Management.
Jennifer E. Van Eyk, Ph.D.
Prof. Medicine, Biol. Chem. and BME
Director, JHU Bayview Proteomics Center
Director, JHU ICTR Biomarker Development Center
JHU NHLBI Innovative Proteomics Center on Heart Failure
Principles of Research Data Integrity
• Research integrity depends on data integrity.
  – Includes all aspects of collection, use, storage and sharing of data.
• Data integrity is a shared responsibility.
  – Everyone involved in the research is responsible.
  – The ultimate responsibility belongs to the PI.
  – However, there is a broader role and responsibility for the institute and scientific community.
• Transparency of the research data is required.
  – Free and accurate information exchange is fundamental to scientific progress.
Data integrity can be compromised in numerous ways:
i) malicious proprietors, ii) human mistakes and naivety, iii) technical error.
Data integrity is based on accurate and traceable: i) collection, ii) recording, iii) storage, iv) reporting.
The Consequences of Failure
• Personal loss
• Blocked scientific progression
• Impaired technology development
• Damage to the institution and sponsors
• Tarnished public perception of science
• Damage to or loss of patent protection
Clinical trials based on genomic selection: Duke University
Three clinical trials were undertaken on the basis of two genomic studies from the same multi-disciplinary group (Potti and Nevins). All clinical trials have ultimately been suspended.
1. Papers using cancer tissue (Potti et al., N Engl J Med 2006;355:570) and cell-based approaches (Hsu et al., J Clin Oncol. 2007;25:4350; Potti, A. et al. Nat. Med. 2006 12, 1294–1300) were published with a lot of hype.
2. Issues were raised by K. Baggerly and/or K. Coombes (M.D. Anderson Cancer Center), based on publicly available data (Annals of Applied Statistics, 2009:3:1309 and Nat Med. 2007;13:1276), as well as by others.
3. Duke argues mistakes were “clerical errors” and do not alter fundamental conclusions of papers.
4. NCI/CTEP required LMS to be tested in a blinded pre-validation study. It failed. The predictor was altered after correction for the work having been carried out in two different labs. Trials had to be randomized and blinded to minimize sensitivity of the predictor to laboratory effects.
5. Due to continued concerns, NCI requests all computer code and data preprocessing in order to try to replicate the earlier finding. It failed. Using the predictor for randomization stratification was stopped in the trial.
6. Duke carried out an internal review and reopened the trial.
7. NCI determines it is partially funding another trial based on a different paper (Chemo-sensitivity). Issues were discovered with respect to differences in data used to build the predictor and data used for validation. Trials were ultimately suspended.
8. Potti’s academic credentials were found to be falsified. Duke acts.
9. Duke statement indicates that, with respect to validation studies, the sensitivity labels are wrong, the sample labels are wrong, and the gene labels are wrong, making it “wrong in a way that could lead to assignment of patients to the wrong treatment.”
10. The institute directed by co-author J. Nevins (Duke Institute for Genomic Science and Policy) is closed (“due to reorganization”), as is the Center for Applied Genomics and Technology, where Dr. Anil Potti was based. Hsu D is still publishing.
11. Papers were retracted (JCO Dec 2010;28:5229 and N Engl J Med. Mar 2011;364:1176).
12. Institute of Medicine (IOM) committee struck for independent review and recommendations.
13. FDA audit at Duke starting in 2011.
Source: The Cancer Letter, edited and published by P. Goldberg, www.cancerletter.com
Learn From Mistakes
• Mentorship (do no harm)
• Oversight, training
• Verification and more verification
• Development of multiple processing pipelines
• Patience and wisdom in applying translation
• Ensure benefit to patients (do no harm)
Outline
• Being practical while avoiding potential errors
  – The data
    • Individual Responsibilities
  – Data Management
    • Data Collection
    • Data Storage
  – Data Interpretation
    • Data interpretation and publication in a changing world of translation
  – The reality of translational science
    • Challenges
    • Role of Core Facilities
  – The role of the scientific community
    • Journals
    • Scientific organizations
• Round table discussion on the research at Johns Hopkins
Images from the front covers of Circulation Research – S. Elliott (Van Eyk Lab)
The Data
• Fundamental to research
• Basis for writing papers
• Important for experiment replication
• Meet contractual/funding requirements
• Settle intellectual property claims
• Defense against a charge of fraud
Individual responsibility
Data Management
Four aspects to consider before starting your data collection:
1. Ownership
2. Collection
3. Storage/protection of confidentiality/sharing
4. Interpretation and publication
Whose data is it?
• Custody does not imply ownership.
• Custody remains with investigator (PI) but JHU owns all data.
• But, others have rights.
  – Funders
  – Other data sources
• If there is intellectual property or the research was funded by a sponsored research agreement with a company, who owns the data?
Data Collection
• Depends on the type of raw data
• Notebooks – day to day or specialized types of experiments
• Images
• Generated numbers and information
Goal is to preserve raw data, transparent processing of data, unbiased interpretation and representation of data.
Data integrity in the digital age
“With the emergence of web-based lab notebooks, digital image ‘enhancement’, and the quick and easy (and possibly dirty) generation and dissemination of colossal amounts of data, it’s becoming increasingly clear that technology provides new challenges to maintaining scientific integrity. In an attempt to tame the beast while it still has its baby teeth, the US National Academy of Sciences released a report today that provided a framework for dealing with these challenges: ‘Ensuring the integrity, accessibility and stewardship of research data in the digital age.’”
http://blogs.nature.com/news/2009/07/data_integrity.html
“One theme, that threads through many fields: the primacy of scrupulously recorded data. Because the techniques that researchers employ to ensure the integrity—the truth and accuracy—of their data are as varied as the fields themselves, there are no universal procedures for achieving technical accuracy. The term “integrity of data” also has a structural meaning, related to the data’s preservation and presentation. “ “Broadly accepted practices for generating and analyzing research must be shown to be reproducible in order to be credible. Other general practices include checking and rechecking data to confirm their accuracy, validity and also submitting data and results to peer review to ensure that the interpretation is valid.”
What evidence proves the 67 kDa band is the same data as the 32 kDa band?
Gross manipulation of blots
http://ori.hhs.gov/
How can you show that three lanes are the same data?
What image should be published?
http://jcb.rupress.org/content/166/1/11/F6.expansion.html
Misrepresentation of immunogold data. The gold particles, which were actually present in the original (left), have been enhanced in the manipulated image (right). Note also that the background dot in the original data has been removed in the manipulated image. Example provided by Journal of Cell Biology.
http://jcb.rupress.org/content/166/1/11/F6.expansion.html
Data Forensics
• Can only "de-authenticate" an image (indicate discrepancies).
• Authentication requires access to the original data.
• The identification of a discrepancy is an allegation, and does not mean there was an intentional falsification of data.
• The interpretation of whether any image manipulation is serious requires familiarity with the experiment(s) and imaging instruments.
Data Forensic Tools are employed by journals in a manner similar to tools used to detect plagiarism.
Office of Research Integrity, US Department of Health and Human Services, http://ori.hhs.gov/
Data Storage, Protection & Sharing
• Raw data needs to be stored
  – Lab notebooks should be stored in a safe place
  – Computer files should be backed up
  – Protected and limited access to computer raw data
• Samples should be saved appropriately so they will not degrade over time.
• Data and experiment information should be available after publication.
• Data should be retained for a reasonable period of time.
Dilemma: When a postdoctoral fellow (PDF) or clinical fellow leaves a lab (especially if the paper is not written up), where does the data stay?
How long?
• Retain study records and records of disclosures of study information:
  – For IRB clinical trial
    • Retain records for 7 years after last subject completed study OR 7 years after date of last disclosure of identifiable health information from study records.
    • If research subject is a child, retain until subject reaches age of 23.
  – For Investigational New Drug (IND) research
    • Retain records for 2 years after marketing application is approved for new drug, or until 2 years after shipment and delivery of drug for investigational use is discontinued.
  – For Investigational Device Exemption (IDE) research
    • Retain records for 2 years after the later of the following two dates: date on which investigation is terminated or completed, or date on which records are no longer required for purposes of supporting a premarket approval application or notice of completion of a product development protocol.
• Provide adequate data and safety monitoring (if activity represents more than minimal risk to participants)
• Complete required training
  – Human Subject Research (HSR) compliance
  – Conflict of Interest (COI)
  – Privacy issues
Traditional Lab Notebook Best Practices
• Date all entries (especially important if contesting IP)
• Title and state purpose of experiment
• Describe experiment in detail
  – Protocol
  – Calculations
  – Reagents (lot numbers, passage numbers, etc.)
  – Results (everything that does and doesn’t happen)
  – Print-outs, pictures, graphs, etc. with links to other data storage locations.
• Record needs to be intact and permanent
  – All mistakes are to be left (cross out)
  – Do not remove pages
  – Write in pen
  – Clearly link connected experiments across time
Requirement: Need to be able to follow the development and execution of the experiments and all of the data analysis.
Laboratory Notebooks: Types, Advantages & Drawbacks
Type: Bound book
  Advantages:
    • No lost sheets
    • Proof against fraud
  Disadvantages:
    • Experiments entered as done, no logical order
    • Cannot keep some raw data forms
Type: Loose-leaf sheets/folders
  Advantages:
    • Can group by experiment, maintain order
    • Easy to record data during experiments
    • More flexible to hold various types of data
  Disadvantages:
    • Can lose sheets, harder to prove authenticity
Type: Electronic notebook
  Advantages:
    • Easy to read
    • Easy to do calculations
  Disadvantages:
    • Must back up data, harder to prove authenticity
    • Can be manipulated after the fact.
Barker, Kathy. At the Bench: A Laboratory Navigator. Cold Spring Harbor: Cold Spring Harbor Laboratory Press (2005), 90.
Proteomic data workflow (figure): raw MS spectrum; processed MS spectrum; peptides identified by comparison to an existing database; peptides clustered to identify the protein name; proteins clustered to remove name redundancy.
BACKGROUND
Proteomics uses mass spectrometers (MS) to identify peptides and proteins. MS accurately weighs the mass of peptides and their fragments.
The observed spectrum is compared to the theoretical mass of all known amino acid sequences in a database, allowing assignment to a protein with a certain probability. Quantification can be based on the number of spectra observed per analysis (spectral count).
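Spectral counting, as described above, amounts to tallying how many spectra were assigned to each protein in one analysis. The sketch below is a minimal illustration, not the lab's pipeline: the function name, the 0.95 probability cut-off, and the example assignments are all assumptions made for the example.

```python
from collections import Counter


def spectral_counts(assignments, min_probability=0.95):
    """Count spectra per protein, keeping only confident assignments.

    `assignments` is an iterable of (protein_name, probability) pairs, one
    per identified spectrum. The 0.95 threshold is an illustrative choice.
    """
    counts = Counter()
    for protein, probability in assignments:
        if probability >= min_probability:
            counts[protein] += 1
    return counts


# Hypothetical spectrum-to-protein assignments from a single analysis.
example = [
    ("TNNI3", 0.99),
    ("TNNI3", 0.97),
    ("MYH7", 0.98),
    ("MYH7", 0.60),   # below threshold, not counted
    ("ACTC1", 0.96),
]
counts = spectral_counts(example)
print(counts.most_common())
```

Because the counts depend on the probability cut-off, the cut-off belongs in the recorded processing parameters alongside the raw files.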
One high-accuracy MS instrument (Orbitrap LC/MS/MS) produces in a single run C number of spectra. This amounts to ~21-25 million spectra/year, equal to ~1 terabyte of raw data/year.
It is challenging, as science, technology and our understanding are always advancing. Interpretation can change but RAW data never does.
[Diagram: raw data flow between the JVE Lab, collaborators, and the IA Storage Server (currently using 1.73 TB; multiple redundant hard disks; secured data center). Access is uploads and downloads ONLY: NO deletes, NO overwrites. The server is covered by auditing logs plus offsite and independent local back-ups. Downstream, informatics processing feeds the lab (Pass) database, with checks/balances at steps 1, 2 and 3 that flag errors, including rounding errors.]
Backup of raw mass spectrometry data
1. Source Matches Target
  i) Size and date (easy but 95% accurate)
  ii) CheckSum method (time consuming but 100% accurate)
2. No file update and/or delete permitted
  i) If deletion of file required, written authorization required from PI
  ii) Overwriting is not possible
3. Easily accessible auditing of all activities (who uploaded/downloaded which files, when, from where etc.)
4. Backup, backup, backup….
  i) Different locations
  ii) Multiple time points
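The two "source matches target" checks in point 1 can be sketched as follows. This is a minimal illustration using Python's standard library, not the lab's actual tooling: the function names are hypothetical, and SHA-256 is an assumed choice of checksum (the slide does not name one).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large raw files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quick_match(source, target):
    """Size-and-date comparison: fast, but blind to changed content."""
    s, t = Path(source).stat(), Path(target).stat()
    return s.st_size == t.st_size and int(s.st_mtime) == int(t.st_mtime)


def checksum_match(source, target):
    """Checksum comparison: reads every byte of both files, so it is slow."""
    return sha256_of(source) == sha256_of(target)
```

A backup copy whose content differs while its size and timestamp are unchanged passes `quick_match` but fails `checksum_match`, which is the trade-off between speed and certainty the list describes.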
Data Processing Pipeline Quality Control
Ensuring integrity of data analysis at local level
• “Bookkeeping Checks”
  – Treat core or, if reasonable, the entire informatics pipeline as “black box”, then run as many “integrity tests” as possible to verify input (i.e. original raw files) matches final output (e.g. reports).
  – Manual Spot-Checking
    • Compare final outputs (e.g. from reports) for a few select data points with same data points in original inputs (e.g. raw files).
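The bookkeeping and spot-checking ideas above can be sketched as two small functions that treat the pipeline as a black box and compare only what went in with what came out. Everything here is an assumption for illustration: records are modelled as a dictionary from a spectrum identifier to one numeric value, and the function names, tolerance, and sample size are hypothetical.

```python
import random


def bookkeeping_check(raw_records, report_records):
    """List records that were lost or invented between input and output."""
    problems = []
    missing = set(raw_records) - set(report_records)
    extra = set(report_records) - set(raw_records)
    if missing:
        problems.append(f"missing from report: {sorted(missing)}")
    if extra:
        problems.append(f"not in raw data: {sorted(extra)}")
    return problems


def spot_check(raw_records, report_records, sample_size=3,
               tolerance=1e-6, seed=0):
    """Compare a few randomly chosen data points between input and output.

    Returns the identifiers whose reported value differs from the raw value
    by more than `tolerance` (for example, after a rounding error).
    """
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    shared = sorted(set(raw_records) & set(report_records))
    sample = rng.sample(shared, min(sample_size, len(shared)))
    return [key for key in sample
            if abs(raw_records[key] - report_records[key]) > tolerance]


# Hypothetical data: identifiers and values are made up for the example.
raw = {"s1": 500.25, "s2": 612.10, "s3": 733.40}
report = {"s1": 500.25, "s2": 612.10, "s4": 1.0}
print(bookkeeping_check(raw, report))
print(spot_check(raw, {"s1": 500.25, "s2": 612.0}, sample_size=2))
```

Neither check needs to know how the pipeline works internally, so both keep working when the processing software changes.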
What is different?
[Diagram: the storage architecture with one addition. The JHU NHLBI Proteomic Community now appears alongside the JVE Lab and collaborators as a user of the IA Storage Server (currently using 1.73 TB; multiple redundant hard disks; secured data center). Access remains uploads and downloads ONLY: NO deletes, NO overwrites, with auditing logs, offsite and independent local back-ups, and checks/balances at steps 1, 2 and 3 that flag errors, including rounding errors.]
When there is lots of data and/or fast analysis is required….how secure is the Cloud?
• 99.9% uptime guarantee for most cloud providers
• Still, good to have local “cheap” backups (e.g. 23 computers in the JVE lab)
• To ensure security…
  – Transmission over internet can use same security as your online banking system (not hard to do…)
  – Clouds can be “VPN-enabled”, so that those cloud machines are “behind” the JHU firewall, thus benefitting from it
  – Nevertheless, best idea (for any system, cloud or not): minimize capturing sensitive information unless absolutely required for stats/analysis; if that’s not possible, encrypt sensitive information and restrict access conservatively
• NB – Different rules for HIPAA protected data
Learning from Recent Warning Letters Related to Computer Validation
6th October 2011 at 11:00 to 12:00
Organizer: Dr. Ludwig Huber
Type: Webinar
This seminar will provide more than 20 examples of recent FDA warning letters and give clear recommendations for corrective and preventive actions.
In the last couple of years the FDA has discovered serious fraud related to security and integrity of electronic data. As a result FDA inspections look more than ever at computers, how they are validated and how companies comply with FDA's 21 CFR Part 11. This seminar will present more than 20 related warning letter examples together with detailed recommendations on how to avoid them.
Areas Covered in the Seminar:
• FDA inspections: preparation, conduct, follow up
• The meaning of warning letters and 483 inspectional observations
• Learning from an FDA presentation: “Data Integrity and Fraud – Another Looming Crisis?”
• Data integrity and authenticity: FDA's new focus during inspections
• Examples of recent Part 11 related 483's and warning letters
• Examples of recent 483's and warning letters related to computer system validation
• Most obvious reasons for deviations
• Responding to 483's to avoid warning letters: going through case studies
• Writing corrective AND preventive action plans as follow up to 483's
• Using internal audits to prepare yourself for Part 11 related FDA inspections
• Strategies and tools for compliant Part 11 implementation
• The future of Part 11 and computer system validation
http://www.nature.com/natureevents/science/events/12389
Interpretation and Publication
• Use of core facilities
• Role of collaborators
• Setting standards via professional societies
• Responsibilities of journals and reviewers
Translational Science - Big Science
Dealing with massive data sets, new technologies, and novel statistical approaches.
More Lessons from Duke
1. Requirement of additional expertise outside group. It is still rare to have someone in the group with sufficient expertise to monitor across all aspects of the project.
2. Massive amounts of data and software complexity.
3. Error introduced due to data handling and poor documentation.
4. Computer software may be “research grade”, highly complex and misunderstood or used inappropriately.
5. If you think or figure out something is not right, admit it, track it down and correct it.
“Sometimes the glamour (and ease) of (some) technology makes investigators forget basic scientific (and biological) principles”
Source: The Cancer Letter, edited and published by P. Goldberg, www.cancerletter.com
What’s wrong with this MS spectrum? Unless you are an expert you will not know. But, it is wrong.
Hong SJ, Gokulrangan G, Schöneich C. Proteomic analysis of age dependent nitration of rat cardiac proteins by solution isoelectric focusing coupled to nanoHPLC tandem mass spectrometry. Exp Gerontol. 2007;42(7):639-51.
4 proteins and post-translationally modified amino acid residues were reported; all were subsequently shown to be incorrect.
Manuscript that corrected it:
Prokai L. Misidentification of nitrated peptides: comments on Hong, S.J., Gokulrangan, G., Schöneich, C., 2007. Exp. Gerontol. 42, 639-651. Exp Gerontol. 2009;44(6-7):367-9.
Use of core facilities: Can they (should they) provide required expertise?
• Concern:
  – Who is responsible for analysis?
  – Complex data still requires understanding of technology limitations.
  – Situation is worse with emerging technologies where data analysis is still being developed.
• Solutions
  – Cores with experts (and provided support to help with data analysis, but this is time consuming and expensive)
  – New hybrid core-academic technology development labs.
  – Preservation of data transparency and storage of raw data.
  – Time to learn the methods across disciplines.
  – New paradigm in collaboration requires new approaches to training.
We assume the best in people.
B. Obama at Martin Luther King Memorial speech Oct 2011.
What is the role of a collaborator?
Is being naïve or inexperienced a sufficient reason for not being responsible for data integrity and data interpretation?
How is broad (but in-depth) experience obtained when focused expertise is the norm?
How do you develop collaborative networks where in-depth cross-disciplinary learning and training is intrinsic?
Role of our scientific community Setting data standards
HUPO Proteomics Standards Initiative
www.psidev.info/ The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification. The PSI was founded at the HUPO meeting in Washington, April 28-29, 2002 (Science 296,827).
MIAPE: The minimum information about a proteomics experiment
• 2DE
• MS
• MS informatics
• Column chromatography
• Capillary electrophoresis
• Protein modifications
• Protein affinity
• Bioactive entities
All processed MS data can be/should be uploaded to a public database
Role of journals
• State requirements in instruction to authors
• Set standards for data integrity?
• Store raw or processed data?
• Detection prior to publication?
• Have expert reviewers?
• Who curates?
• How is it annotated?
• How many DBs?
• Who pays?
• If detected, bar authors from publishing in their journal and/or inform author’s institute?
• If detected after publication? Enforced retraction?
As a reviewer, what is your role?
How do we train reviewers?
More lessons from Duke
The IOM “Committee on the Review of Omics-Based Tests for Predicting Outcomes in Clinical Trials” will determine criteria important for the analytical validation, qualification and utilization components of test evaluation for the use of models that predict clinical outcomes from genomic and other omics technologies. The report is due shortly.
Hopefully, IOM report will set goal posts and pathways in the same manner as for drug trials.
What I have learnt.
Science is difficult. We are limited by our knowledge.
Raw data is never wrong. We may misinterpret it, be fooled by an incorrect assumption, be limited by technology/approach but intrinsically, it is not wrong. Your scientific reputation is based on the quality of your data.
Conclusion
Data Integrity Principle: Research data integrity is essential for advancing scientific and medical knowledge and for maintaining public trust. Researchers are ultimately responsible.
Data Access and Sharing Principle: Research (raw and processed) data, methods, and other information integral to publicly reported results should be publicly accessible.
Data Stewardship Principle: Research data should be retained to serve future use. Thus, (raw and processed) data must be documented, referenced, and indexed in order for them to be used accurately and appropriately.
Round table
Sheila Garrity (Moderator), Director, Division of Research Integrity
Allen Everett (Assoc Professor), Pediatrics - Cardiology
Kathleen Barnes (Professor), Medicine–Clinical Immunology
David Graham (Asst Professor), Molecular and Comparative Pathobiology
Johns Hopkins
• JHU Data Management Policy: http://jhuresearch.jhu.edu/Data_Management_Policy.pdf
• Overall list of JHU policies page: http://jhuresearch.jhu.edu/policies-hopkins.htm
• JHU laptop encryption: http://www.it.johnshopkins.edu/security/encryption.html
• Overall JHU IT security page: http://www.it.johnshopkins.edu/security/