Digital Curation Centre
a centre of support for data curation and preservation

UK Digital Curation Centre One Year On
Liz Lyon, Associate Director, Outreach
Chris Rusbridge, DCC Director

Overview
• Why is digital curation important?
• What are the challenges that the DCC faces?
• About the people and our collaborative approach
• Addressing the issues
• How can you contribute to the DCC?

Curation?
“maintaining and adding value to a trusted body of digital information for current and future use”

Digital curation continuum
• For later use? Static. Data preservation
• In use now (and the future)? Dynamic. Data curation

Assuring permanent access to the records of science & the humanities?
Long term access to primary data
• Increasing data volumes from eScience and Grid-enabled / cyberinfrastructure applications
• Changing research paradigm: data-driven science, “big science”
• Observational data, simulations, large-scale experimentation
• Multi-media resources, statistical data, surveys, geo-spatial data…

Facilitate “post-processing” and knowledge extraction
Enable the acquisition of newly-derived information and knowledge
• Run complex algorithms over primary datasets
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological)
• Analysis (statistical, lexical, pattern matching, gene)
• Presentation (visualisation, rendering)

Provide additional functionality beyond digital preservation processes
Annotations
• Gene and protein sequences
• e-Lab books (Smart Tea Project in chemistry)

The scholarly knowledge cycle: linking research data to publications
[Diagram labels]
• Presentation services: subject, media-specific, data, commercial portals
• Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
• Resource discovery, linking, embedding
• Data analysis, transformation, mining, modelling
• Searching, harvesting, embedding
• Aggregator services: national, commercial
• Harvesting metadata
eBank UK Project
http://www.ukoln.ac.uk/projects/ebank-uk/
• Research & e-Science workflows
• Repositories: institutional, e-prints, subject, data, learning objects
• Validation
• Deposit / self-archiving
• Validation
• Publication
• Linking
• Data curation: databases & databanks
• Peer-reviewed publications: journals, conference proceedings
Emerging policy on open access to data

DCC people (some of them…)
• Management & Co-ordination
  – Director Chris Rusbridge (University of Edinburgh)
• Community Support & Outreach
  – Led by Dr Liz Lyon (UKOLN, University of Bath)
• Service Definition & Delivery
  – Led by Professor Seamus Ross (HATII [ERPANET], University of Glasgow)
• Development
  – Led by Dr David Giaretta (Astronomical Software & Services, CCLRC)
• Research
  – Led by Professor Peter Buneman (Informatics, University of Edinburgh)

The challenges we face
Standards
• Interoperability issues: technical & hopefully soluble
Scale
• Volume and diversity of datasets
Culture
• Bringing communities together
• Library/information science/archives “document tradition”
• Domain research (chemists, astronomers, biologists)
• Computer science (databases)
• Commercial suppliers (storage technology)

More challenges…
Process
• Highly-distributed organisation: use collaborative tools
Skills
• Distributed amongst the 4 partners & beyond
Engagement
• Lots of existing work and many significant players
Impact
• Visible & measurable, in the short & long-term
Meeting expectations (which are high…)
• Of the community and our funders

User requirements analysis
Commissioned study
• Leona Carpenter
• Reporting now
• Desk-based research
• Focus groups
• Interviews
Results will inform research, development, service definition / delivery and outreach

Recommendations and priority tasks
Some sound bytes…
• R&D issues: Annotation services, Ontology development, Automating metadata creation, Tools and toolkits, Data Format Description Language, Identifiers, Registries, Economic and cost-benefits studies
• Advisory services: “Ask-a-Curator”, FAQs, reports, briefings, awareness-raising materials, best practice guidance, Storage media, “Like Erpanet”, advise Government, Research Councils, funding bodies
• Professional development: Short courses, conferences, seminars, workshops, secondments to DCC and to working repository services
• Outreach: Leadership for the future, case studies, sharing solutions, collaboration with other partners, international peers, industry links
• Taxonomy of “Users”

Outline Taxonomy of digital curation users by role
1. Data Creators
2. Data Curators
3. Data Re-users
4. Policy makers
  - funding bodies
  - other leaders

Outline Taxonomy of digital curation users by role
1. Data Creators
2. Data Curators
3. Data Re-users
4. Policy makers
  - funding bodies
  - other leaders
Data Preservers
Data publishers

Outline Taxonomy by significant function of organisational entity
1. Research
2. Service provision
3. Learning & teaching
4. Funders
5. Policy / strategy makers
“Designated communities”

Outline Taxonomy by significant function of organisational entity
1. Research
3. Learning & teaching
4. Funders
5. Policy / strategy makers
2.
Service provision
Commercial
“Designated communities”

Service definition & delivery
• Advisory services
  – Responses to queries, from legal to technical guidance [email protected]
  – Site visits (National Institute of Environmental eScience)
• Information Services
  – Briefing Documents - Freedom of Information by Mags McGinley
  – DIGITAL CURATION MANUAL
  – 20 chapters written by community experts e.g. Metadata written by Michael Day, UKOLN
  – Peer-reviewed
  – Checklist for Compliance with best practices and standards
  – Technology Watch

Services: workshops
• 2005 Programme
  – Preservation of medical databases: 24-25 May at the Gulbenkian Institute, Lisbon, in collaboration with ERPANET & the Wellcome Trust
  – Institutional repositories: 6 July at the University of Cambridge, UK, in collaboration with DSpace
  – Cost models: in collaboration with the Digital Preservation Coalition, July at British Library
  – Persistent identifiers: liaising with NISO, summer, UK location tbc

Development approach
• OAIS (Open Archival Information System) linkage: focus on representation information
  – link to global work on format registries?
  – Concentrate on scientific data formats?
• Repository
  – Representation Information
  – Standards and Tools
  – Aim for OAIS compliance
• Persistent identifiers
• Certification… RLG task force
• Open development wiki and email list

OAIS Reference Model – Functional Model
How relevant to curation?

Representation Net

Representation Information
More detail
How does this relate to format registries?

High Level View
Example of use of Representation Information
Labelling

Registry issues?
• Trusted repository of Representation Information
  – Authenticity of information
  – Access control
  – Certificates/Digests (are they trustable over the long term?)
• Findability
  – Persistent IDs
• What can we rely on?
  – Labels (to support automated processing)
• Extensibility
• Distributed

Registry development
• Simple PHP prototype
• Scoping study - unification
  – Formats, standards, tools
• More robust prototype in development
  – Based on ebXML & JAXR
  – Potentially distributed, cooperative maintenance model

Development Roadmap
• Registry: complete prototype, link to PRONOM, GDFR etc, handover to service
• Representation information: describe CCLRC (science) data using EAST, etc
• Certification work continues
• Additional tools: metadata extraction
• Testbeds, interactions with others

Research approaches
• Publishing & integrating scientific databases
• ‘Archiving’ past states of volatile databases
• Database provenance and annotation
• Organisational dynamics of trusted repositories
• Automating metadata extraction
• Cost-benefit analysis of data curation
• Rights and responsibilities

The database picture
[Diagram labels]
• Source data
• Curated data: classified, cleaned, annotated, integrated, cross-linked

Curated Databases are Central
Much/most scientific data is now in databases
• They often do not contain source experimental data. Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your data as well as your (paper) publications!!
• These databases are built and maintained with a great deal of human or computational effort.
What makes a database? It has internal structure or it changes. Size alone doesn’t qualify

Archiving (preserving) volatile databases
• How do you preserve something that changes every hour or minute?
  – Important for the scientific record: someone might have cited your data at time t.
• Current practice
  – Create versions (how often?)
  – Log changes
  – Use diffs
  – Do nothing (common!)

Curated databases – some issues
• Integrating and publishing data so that someone else can use it.
• Annotating existing data and moving annotations to other databases
• Provenance: where did this data come from?
• Archiving: how do you preserve something that is constantly changing?

How do we cite data?
• A URL or citation to an article is already unsatisfactory.
  – DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”
• The problem is much worse when you are citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
  – Connections with DB archiving and DOIs

Research approaches
• Publishing & integrating scientific databases
• ‘Archiving’ past states of volatile databases
• Database provenance and annotation
• Organisational dynamics of trusted repositories
• Automating metadata extraction
• Cost-benefit analysis of data curation
• Rights and responsibilities
  – “Public domain, public interest, public funding” paper, Waelde & McGinley

www.dcc.ac.uk

• www.ijdc.net
• Launch planned June/July
• Peer-reviewed contributions
• Peter Buneman, Editor (research)
• Production editor Philip Hunter

Sample issue
• Full papers
• Invited articles
• News & views
Papers for submission are very welcome!

1st DCC International Conference
• Location - Bath UK
• 29-30 September 2005
• Keynote speakers
  – Cliff Lynch, CNI
  – Graham Cameron, European Bio-informatics Institute
• DCC Research update
• Social highlights

Associates Network
Goals
• Develop understanding, share best practice, advance research, promote recognition, develop consensus
Membership
• International groups, national bodies, industry partners, funders, research groups, HEIs, FEIs, individuals…
Benefits
• Early access to R&D outputs, advisory services, training, input to definition and design, community participation
Discussion Forum
www.dcc.ac.uk
Please join us!
[Diagram labels]
BADC Cambridge Leicester Jodrell Bank NIEeS ESO RLG CMS-Bristol BODC NASA NARA CNES ESA RLG BNSC RG IVOA ESA SDSC RI UNC International Collaborations CEH DPC Council for Museums, Archives & Libraries ResearchEDG InstitutesGridPP EGEE So’ton MIMAS NOF ILRT CCLRC NEODC UKOLN DELOS AHDS DPC Standards Bodies NeSC UofE DLI (US) Research Councils Capri

[Diagram labels]
IBM Almaden OCLC CDS ESO JHU CSIRO TU Vienna Caltech JHU CSIRO Data Archive LDC Roslin INRIA MRC HGU UPenn Kyoto USC MIMAS WT-CFG Leicester IC Maastricht Durham NTUA INRIA HUJ UPC MaxPlanck Dutch NA Swiss NA Urbino Salzburg UNC EBI GSK ACM HEIs & FE Oxford UofG Innogen NHS NLA OAI NCS Microsoft IBM Oracle BT STK RDN. OCLC IASSIST

Acknowledgements
Slides from Peter Buneman, David Giaretta and others used with thanks.

How you can help us
• How does OAIS relate to curation?
• How do format registries relate to representation information?
• Who else is working across these areas?
• What outcomes would you like to see?
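
Illustrative sketch
The "Archiving (preserving) volatile databases" and "How do we cite data?" slides ask how data cited "at time t" can be found again once the database has changed. One of the listed current practices, logging changes, can be sketched as below. This is a minimal illustration only, not a DCC tool or method: the class `VersionedStore`, its method names, the integer timestamps, and the use of a SHA-256 digest inside the citation are all assumptions made for the example.

```python
import bisect
import hashlib
import json


def _digest(value):
    """SHA-256 over a canonical JSON rendering of the value (an assumed convention)."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode("utf-8")).hexdigest()


class VersionedStore:
    """Hypothetical toy archive: every change is appended to a timestamped log."""

    def __init__(self):
        self._times = []    # change timestamps, strictly increasing
        self._changes = []  # (key, value) pairs; a value of None means deletion

    def record(self, t, key, value):
        """Log that `key` took `value` at time `t` (None deletes the key)."""
        if self._times and t <= self._times[-1]:
            raise ValueError("changes must be logged in increasing time order")
        self._times.append(t)
        self._changes.append((key, value))

    def state_at(self, t):
        """Rebuild the database as it stood at time `t` by replaying the log."""
        upto = bisect.bisect_right(self._times, t)
        state = {}
        for key, value in self._changes[:upto]:
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
        return state

    def cite(self, t, key):
        """Build a citation: the key, the time, and a digest of the cited value."""
        return {"key": key, "time": t, "sha256": _digest(self.state_at(t)[key])}

    def resolve(self, citation):
        """Locate the cited value again and check it still matches the digest."""
        value = self.state_at(citation["time"])[citation["key"]]
        if _digest(value) != citation["sha256"]:
            raise ValueError("archived value does not match the citation digest")
        return value


store = VersionedStore()
store.record(1, "gene:abc", {"annotation": "putative kinase"})
store.record(5, "gene:abc", {"annotation": "confirmed kinase"})
ref = store.cite(3, "gene:abc")   # cite the entry as it stood at time 3
print(store.resolve(ref))         # {'annotation': 'putative kinase'}
```

Replaying a log keeps every past state recoverable at the cost of slower reads; periodic full versions, the first option on the slide, trade storage for faster lookup. Whether a digest remains trustworthy over the long term is the open question raised on the "Registry issues?" slide.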