Preparation of DTOC Indices

Scientific Database Approaches

John H. Porter, University of Virginia & Kristin Vanderbilt, University of New Mexico

Road Map

▪ Why have Scientific Databases?
▪ Challenges for Scientific Databases
▪ Approaches to Scientific Databases
▪ Strategies for Initiating Ecological Databases

WHY have Scientific Databases?

▪ Improvement of data quality
  • multiple users provide multiple opportunities for detecting and correcting problems in data
▪ Cost
  • data cost less to save than to collect again
  • with environmental data, often data cannot be collected again at any cost

WHY have Scientific Databases?

▪ Environmental Policy and Management
  • environmental policy decisions require data that are regional or national, but most ecological data are collected at smaller scales
  • National Policies
  • International Policies

WHY have Scientific Databases?



▪ New Science
  • Long Term
    – long-term studies depend on databases to retain project history
  • Synthesis
    – use of data for a purpose other than that for which they were collected
  • Integrated, multidisciplinary projects
    – depend on databases to facilitate sharing of data

Evolution of Data Sharing - Traditional Model

[Diagram boxes: Data Collection; Use Data; Lose or Discard Data; Publications]

Evolution of Data Sharing - New Model

[Diagram boxes: Data Collection; Use Data; Data and Metadata; Publications]

• Regional Analyses
• Global Change
• Long-term Studies
• Synthesis

Challenges for Scientific Databases

▪ Long-term perspective
  • without databases, most data do not outlive the project that collected them
▪ The 20-year rule
  • GOAL: data that are accessible and interpretable 20 years in the future

Meeting Long Term Needs

• TECHNOLOGICAL – media & formats that do not become obsolete
• CONTEXTUAL – need to capture context of data collection
• SEMANTIC – terms need to be well-defined

Challenges for Scientific Databases



▪ Deal with Diversity
  • science means asking NEW questions
    – new kinds of queries
  • scientific data are heterogeneous and diverse
  • scientific users have different backgrounds and goals
  • the user community for a given database will be dynamic

Characteristics of Ecological Data

[Figure: types of data arranged by Data Volume (per dataset), low to high, and Complexity/Metadata Requirements, low to high. Labels shown: Satellite Images, Weather Stations, Business Data, Most Software, Gene Sequences, GIS, Most Ecological Data, Primary Productivity, Biodiversity Surveys, Population Data, Soil Cores]

Comparison to Business Databases

▪ Business-oriented databases have been very different from scientific databases
  • Relatively small number of well-defined data elements
    – E.g., part number, count, price
  • Repeatable reports (e.g., sales report)
  • Rules for integrating data well understood
  • Intolerant of different values associated with an element
    – E.g., hourly rate of pay

Ecoinformatics Development: Alignment with IT community

[Diagram labels: Information Technology; Ecoinformatics]

Reason: IT focused on proprietary business applications

Modified from James Brunt

Changing Times

▪ New emphases on “data mining” are forcing business databases to become more like scientific databases
  • Example: data on customer demographics are linked to regional store inventories
  • Integration of data resources not designed with integration in mind

Ecoinformatics Development: Alignment with IT community

[Diagram labels: IT; Ecoinformatics; XML, Web Services, Semantic Mediation]

Reason: IT now focuses on domain-neutral access to distributed data products.

Modified from James Brunt

The Ecoinformatics Challenge:

▪ Can we make information available to ecologists:
  • In ways they can locate the information they need?
  • With information in forms they can readily use?
▪ How can we assure that the information is current and accurate?

Not all Scientific Databases are Alike!

Scientific data are available at a number of different “levels”
▪ LOW: individual investigator posts data on web page for students to retrieve
▪ MEDIUM: online databases for supporting a project
▪ HIGH: system automatically integrates data from a large number of sources

Different types of Scientific Databases

[Diagram, with the label “Researchers”, showing these tiers:]
• “Portal”, “Value-Added” or “Integrated” Infobases
• International/National/Regional Systems
• Project or Site-Based Systems
• Individual datasets

Tools for Creating Scientific Databases

▪ Web Server – HTML, XML
  • IIS
  • Apache – open source
▪ Database Management Systems (DBMS)
  • Input, query, update, sort, output
▪ Statistical Packages
  • Aggregate, graph
▪ Programming Languages
  • C++, Java, Perl, Python, Visual Basic, PHP
  • Create custom code
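A minimal sketch of the DBMS operations listed above (input, query, update, sort, output), using Python's built-in sqlite3 module. SQLite is an assumption, chosen only because it needs no server; it is not one of the systems named in these slides, and the table, columns, and values are made up for illustration.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE soil_core ("
    "core_id INTEGER PRIMARY KEY, plot TEXT, depth_cm REAL, ph REAL)"
)

# Input: load a few made-up records.
rows = [(1, "A", 10.0, 6.8), (2, "B", 10.0, 5.9), (3, "A", 20.0, 7.1)]
conn.executemany("INSERT INTO soil_core VALUES (?, ?, ?, ?)", rows)

# Update: correct a value after an error is detected.
conn.execute("UPDATE soil_core SET ph = ? WHERE core_id = ?", (6.1, 2))

# Query and sort: cores from plot A, deepest first.
plot_a = conn.execute(
    "SELECT core_id, depth_cm, ph FROM soil_core "
    "WHERE plot = ? ORDER BY depth_cm DESC",
    ("A",),
).fetchall()

# Output: print the result as comma-delimited text.
for core_id, depth_cm, ph in plot_a:
    print(f"{core_id},{depth_cm},{ph}")
```

The same five operations would look much the same in any of the relational systems, since they are expressed in SQL rather than in tool-specific commands.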

Tools for Scientific Database Development

▪ Relational Database Management Systems (RDBMS) in common use
  • Access / Microsoft SQL Server
  • Oracle
  • MySQL – open source
▪ Statistical Packages
  • SAS
  • SPSS
  • R – open source

Spreadsheets

▪ Spreadsheets are fantastic tools – but not for scientific databases!
  • Encourage “bad practice” – irregular data structures that can’t be parsed easily
  • Lack “auditability” – difficult or impossible to back-track calculations
  • Proprietary formats become obsolete
  • Lack export capabilities for other than values or graphs (no formulae)

Not Every Scientific DB needs or uses the same tools

▪ Example 1 – Basic Data Access
  • Post comma-delimited files on web server
  • Metadata files – XML text files (structured) or unstructured
▪ Example 2 – Add Products
  • Use SAS to conduct error-checking and generate graphics from data
  • Use scripts/programs to automate production process
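Examples 1 and 2 can be sketched together: a comma-delimited data file, a small structured XML metadata file, and a script that uses the metadata to automate error-checking. Python stands in for SAS here, and the XML element names, attribute names, and data values are hypothetical; they do not follow EML or any other published schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical structured metadata; names are illustrative only.
METADATA_XML = """
<dataset title="Example air temperature">
  <column name="date" type="text"/>
  <column name="air_temp_c" type="number" min="-40" max="50"/>
</dataset>
"""

# Made-up comma-delimited data, as it might be posted on a web server.
DATA_CSV = """date,air_temp_c
2004-07-01,24.5
2004-07-02,245
2004-07-03,23.9
"""


def find_range_errors(metadata_xml, data_csv):
    """Return (row_number, column, value) for values outside the metadata range."""
    root = ET.fromstring(metadata_xml)
    # Collect limits only for columns that declare both min and max.
    limits = {
        col.get("name"): (float(col.get("min")), float(col.get("max")))
        for col in root.findall("column")
        if col.get("min") is not None and col.get("max") is not None
    }
    errors = []
    reader = csv.DictReader(io.StringIO(data_csv))
    for row_number, row in enumerate(reader, start=1):
        for name, (low, high) in limits.items():
            value = float(row[name])
            if not low <= value <= high:
                errors.append((row_number, name, value))
    return errors


errors = find_range_errors(METADATA_XML, DATA_CSV)
print(errors)
```

Because the valid ranges live in the metadata rather than in the script, the same check can be rerun unchanged whenever new data files are posted.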

Possible Systems

▪ Example 3 – Manage Metadata in DBMS
  • Metadata in Access Database
  • Provide comma-delimited data files
▪ Example 4 – Manage Metadata on Web
  • Link web forms to backend DBMS
▪ Example 5 – Full DBMS system
  • Metadata in DBMS
  • Data dynamically queried from DBMS using web interface
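A rough sketch in the spirit of Examples 3 and 5: metadata and data held in a DBMS, with a function that runs a parameterized query and returns comma-delimited text, as a web interface might do behind the scenes. SQLite (via Python's sqlite3 module), the table layout, and the function names are assumptions for illustration, not the systems described in the slides.

```python
import csv
import io
import sqlite3


def build_example_db():
    """Create an in-memory database with one metadata table and one data table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dataset (dataset_id INTEGER PRIMARY KEY, title TEXT, units TEXT);
        CREATE TABLE observation (dataset_id INTEGER, obs_date TEXT, value REAL);
        INSERT INTO dataset VALUES (1, 'Daily precipitation', 'mm');
        INSERT INTO observation VALUES (1, '2004-07-01', 0.0);
        INSERT INTO observation VALUES (1, '2004-07-02', 12.4);
        INSERT INTO observation VALUES (1, '2004-08-01', 3.1);
    """)
    return conn


def query_as_csv(conn, dataset_id, start_date, end_date):
    """Return (title, csv_text) for observations within a date range."""
    # Metadata lookup: the title and units come from the DBMS, not from code.
    title, units = conn.execute(
        "SELECT title, units FROM dataset WHERE dataset_id = ?", (dataset_id,)
    ).fetchone()
    # Data lookup: ISO dates stored as text compare correctly as strings.
    rows = conn.execute(
        "SELECT obs_date, value FROM observation "
        "WHERE dataset_id = ? AND obs_date BETWEEN ? AND ? ORDER BY obs_date",
        (dataset_id, start_date, end_date),
    ).fetchall()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["obs_date", f"value_{units}"])
    writer.writerows(rows)
    return title, out.getvalue()


conn = build_example_db()
title, text = query_as_csv(conn, 1, "2004-07-01", "2004-07-31")
print(title)
print(text)
```

Passing the filter values as query parameters, rather than pasting them into the SQL string, matters once the values come from a web form.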

Level of Structure

▪ Unstructured Data/Metadata
  • Easy to produce
  • Hard to use
▪ Structured Data/Metadata
  • Harder to produce
  • Easy to export, alter, update
  • The specific tool used to structure data (e.g., XML, DBMS) is increasingly less critical than the structure itself
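One way to see why the structure matters more than the tool: a single structured metadata record can be updated once and exported to more than one format. The sketch below uses a plain Python dictionary with hypothetical field names and writes it out as both JSON and XML.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical metadata record; the field names are illustrative.
record = {
    "title": "Soil cores, north plots",
    "creator": "Field station staff",
    "units": "cm",
}


def to_json(rec):
    """Export the record as JSON text with keys in a stable order."""
    return json.dumps(rec, sort_keys=True)


def to_xml(rec):
    """Export the record as a flat XML document, one element per field."""
    root = ET.Element("metadata")
    for key in sorted(rec):
        ET.SubElement(root, key).text = rec[key]
    return ET.tostring(root, encoding="unicode")


# Alter the record once; every export picks up the change.
record["units"] = "m"
json_text = to_json(record)
xml_text = to_xml(record)
print(json_text)
print(xml_text)
```

With unstructured metadata (for example, a free-text description), the same change would have to be found and edited by hand in every copy.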

Evolving a Database



▪ Development of a database is an evolutionary process
▪ Implement system based on current priorities - but think ahead!
▪ Seek scalable solutions
  • avoid bottlenecks
  • adding the 1000th piece of data should be as easy as adding the first (or easier)

Developing a Database: Questions to Ask

▪ Why is this database NEEDED?
▪ Who will be the USERS of the database?
▪ What types of QUESTIONS should the database be able to answer?
▪ What INCENTIVES will be available for data providers?

Meeting the Challenges

▪ Prioritize
  • focus on developing the most critical data resources
  • most commonly, critical data refer to the research site as a whole
    – Meteorology & Climatology
    – Bibliography of past research at the station
    – GIS data layers for the station research area

Meeting the Challenges

▪ Get additional resources
  • NSF Grants
  • Upcoming NSF initiatives:
    – SEI+II – interdisciplinary research
    – National Ecological Observatory Network (NEON)
  • Institutional Support

Meeting the Challenges

▪ Work with researchers and enlist their help in developing ecological databases
  • Develop policies for data collection and sharing that dictate the responsibilities of:
    – The data provider/producer
    – The data system
    – Users of the data

Use Standard Methods when Possible

▪ Advantages of using standard methods
  • Increases intercomparability (and hence, value) of data, facilitating cross-site comparisons
  • Reduces cost of methods development

Standards

▪ Costs of using standards
  • Standard methods may be poorly suited to local conditions
  • Developing standards is time consuming and difficult
▪ For some types of monitoring, standards may not exist, or may do a poor job characterizing desired parameters

Standards

“The wonderful thing about standards is that there are so many of them to choose from”
▪ Sources of Standards
  • Published literature
  • Government Agencies (e.g., USGS, EPA)
  • Project standards (e.g., LTER Climate Stations)
  • Resource Discovery Initiative for Field Stations (RDIFS) directory (under development)

Information Systems

▪ Developing an information system is a critical component of research
  • You can’t exploit data you no longer have!
▪ Creating good “metadata” (data about data) is crucial to maintaining data usability over time

Exploit Partnerships & Existing Resources

▪ OBFS Resource Discovery Initiative for Field Stations (RDIFS)
  • Ecoinformatics Training
  • Publications Database
  • Registry for field station data (free advertising!)
  • Database of standards
  • Keyword Thesaurus
▪ Ecoinformatics.org / Knowledge Network for Biocomplexity Project
  • Ecological Metadata Language
  • Tools
▪ Ecological Metadata Language (EML)

Other Possible Collaborations

▪ ORNL Mercury System
  • Cataloging and metadata tools with the data and metadata left on your system
▪ Global Change Master Directory
  • online system for metadata with searching capabilities
▪ OPeNDAP.org
  • Online tools for oceanographic data

Exploiting External Resources

▪ Ecological Society of America journal Ecological Archives
  • accepts “data papers” for major and important data sets

Concluding Thoughts

▪ Developing ecological information systems seems a daunting task
▪ Every system starts somewhere. Even oaks start with acorns!
▪ Once started, you can build on successes, a little at a time
▪ Remember, the compound interest on zero is zero!

Next Step

▪ Experience is a good guide to helping build the sort of database your users will want to use
▪ It’s good to try out the existing systems to see what works (and what doesn’t) as a user