Dealing with Electronic Information

Digital data – integrity and standards
Dr Simon Cockell
Bioinformatics Support Unit
[email protected]
http://bsu.ncl.ac.uk
http://twitter.com/nclbsu
Generating data
Generating big data
Problem...
Data doesn’t fit in:
(try pasting an informative genome sequence, DIGE gel, array experiment or Excel spreadsheet into this)
So where do we store it instead?
Why is this a problem?
• Is your PC backed
up?
• Do you check
integrity of files?
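One common way to check file integrity is to record a checksum when the data is generated and recompute it later. The following is a minimal sketch, assuming SHA-256 is an acceptable digest for your purposes; the function names `sha256_of_file` and `file_is_intact` are hypothetical and used only for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that large data files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_is_intact(path, expected_digest):
    """Compare the file's current digest with one recorded earlier."""
    return sha256_of_file(path) == expected_digest.lower()
```

Recording the digest next to the matching lab book entry is one way to link the paper record to the exact file it describes.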
Why is this a problem?
• How organised is
your electronic life?
• How well does it
correlate with your
lab book?
There is another issue
• Big data - scientific data - is EXPENSIVE to
generate
• It makes sense to get the most value out of it
• Your funding bodies know this
• Often your funding is from public money
• You may think you have ownership of this data
• But do you? Or does the University?
Increasing pressure to share data
BBSRC expects research data generated as a result of
BBSRC support to be made available with as few
restrictions as possible in a timely and responsible
manner to the scientific community for subsequent
research. Applicants should make use of existing
standards for data collection and management and
make data available through existing community
resources or databases where possible.
http://www.bbsrc.ac.uk/organisation/policies/position/policy/data-sharing-policy.aspx
Increasing pressure to share data
The MRC expects valuable data arising from MRC-funded research to be made available to the scientific
community with as few restrictions as possible so as
to maximize the value of the data for research and
for eventual patient and public benefit. Such data
must be shared in a timely and responsible manner.
http://www.mrc.ac.uk/Ourresearch/Ethicsresearchguidance/datasharing/index.htm
How do you solve a problem like data
sharing?
• How do we make sure that we can exchange and understand
the data that we share with other researchers?
• Standardised formats for reporting certain experimental data
types have been developed
• A new set of data standards has emerged for modern
biological data
• Often called ‘MI’ data standards
• Capture ‘minimum information’ metadata (data about data)
required to comprehend and share scientific data
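The idea of 'minimum information' can be sketched as a check that a metadata record carries every field a standard asks for. The field names below are hypothetical placeholders chosen for illustration. They are not the actual MIAME checklist or that of any other MI standard.

```python
# Hypothetical required fields, for illustration only.
# A real MI standard defines its own checklist.
REQUIRED_FIELDS = (
    "experiment_title",
    "organism",
    "sample_type",
    "protocol",
    "raw_data_file",
    "contact",
)


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Example: a record that is not yet ready to share.
record = {
    "experiment_title": "Heat shock time course",
    "organism": "Saccharomyces cerevisiae",
    "sample_type": "cell extract",
}
print(missing_metadata(record))
```

Running a check like this before submission shows which details still need to be recovered from the lab book.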
Particularly for high throughput data
• All started with MIAME (minimum information
about a microarray experiment)
• Now extends to proteomics, neurophysiology,
genome sequences – even gel electrophoresis
• If you are going to publish a high throughput
experiment it is very likely that the journal you
publish in will MANDATE that the data is
annotated to the correct standards AND deposited
in a recognised repository for that data
Where does the data go?
• Microarray (MIAME)
– ArrayExpress
– Gene Expression Omnibus
• Proteomics (PSI)
– PRIDE (EBI)
• Next generation sequencing data
– SRA (NCBI)
– ENA (EBI)
Why should you care?
• It means when you call a ‘cell extract’ a ‘cell extract’
in one data standard, it’s also called a ‘cell extract’ in
the other data standards
• And also by other researchers!
• Data can be integrated across experiments and
domains
• You can submit your data, or share it with a
colleague, in a standardised, community-agreed
format
• Major publishers require this for certain data types
• And, as discussed, your funding body too
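The 'cell extract' point can be illustrated with a small lookup that maps the labels people use locally onto one agreed term. This is only a sketch: the synonym table is a hypothetical example, not taken from any real ontology or data standard, and `normalise_term` is an illustrative name.

```python
# Hypothetical synonym table, for illustration only.
# A real project would use the vocabulary of the relevant standard.
SYNONYMS = {
    "cell extract": "cell extract",
    "cell lysate": "cell extract",
    "lysate": "cell extract",
}


def normalise_term(label, synonyms=SYNONYMS):
    """Map a free-text label onto the agreed term, ignoring case
    and extra whitespace. Unknown labels raise ValueError so they
    are reviewed rather than silently accepted."""
    key = " ".join(label.lower().split())
    if key not in synonyms:
        raise ValueError(f"no agreed term for label: {label!r}")
    return synonyms[key]
```

Once every dataset uses the same term for the same thing, records from different experiments can be matched on that term directly.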
To sum up
• Back up your machine!
– USB hard drives do not count
• Be as organised with your digital data as you are
with your lab books
• Be aware of the expectations for releasing data
• Capturing good metadata will make many things
easier (including writing up!)
• <shameless plug>Want to talk about how best to
analyse and store your digital data? Come talk to
us!</shameless plug>
Digital Data – Integrity and Standards
Simon Cockell
bsu.ncl.ac.uk
@sjcockell