eResearch briefing - Australian National Data Service


Getting from managed to reused:
Making it easier for researchers to
do something useful with data
Mingfang Wu, Stefanie Kethers, Andrew Treloar
What is ANDS?
• ANDS is supported by the Australian Government
• Began in 2009, currently funded to mid 2015
• Collaboration between Monash University, CSIRO and the Australian National University
• Staff in 6 cities across the country
• Funded 200+ projects across 68 institutions
ANDS aims to make data more valuable to researchers, research institutions and the nation
How Do We Make Data More Valuable?
Value
So that researchers can easily publish, discover, access and use
research data through the Australian Research Data Commons.
ANDS Programs
• Underpinning infrastructure for discovery and citation (ARDC Core)
• Enable rich metadata about data to be managed and accessible (Metadata Stores)
• Make new data and associated metadata available from a range of instruments (Data Capture)
• Make a selection of existing data and associated metadata available from Australia’s research-producing universities (Seeding the Commons)
• Make data and associated metadata available from government departments (Public Sector Data)
• Provide the overall policy and practice frameworks to support better data management and re-use (Frameworks and Capabilities)
• Demonstrate the value of doing all these (Applications)
Tools for Data-reuse
[Diagram with three layers. Research Activities: form hypothesis; look up data; design & run experiment; analyse data/results; publish paper, data, software. Computing: discover data; extract data; transform data; integrate data; analyse data; visualise data; workflow; register data. Data: data collections; metadata.]
The ANDS Applications Program
• Funded through EIF (Education Investment Fund)
• Focus on software infrastructure to enable research
• Goal of the Applications program: “to produce compelling demonstrations of the value of having data available for re-use” (i.e. enabling research across many sources of data that was not previously possible).
Developed software might…
• empower researchers to solve important problems
• build new connections
• enable important problems to be solved
• enable new questions to be answered
• simplify problems
• accelerate solving problems, or analysing data
What has been funded under the Applications program?
• 7 projects in bio/characterisation
• 8 projects in climate change adaptation
• 10 others (urban planning, marine research, public health, humanities)
• For a complete list of the Applications projects and their profiles, please visit the ANDS project registry: https://projects.ands.org.au/getAllProjects.php?start=app
What kind of tools have been developed?
• Data transformation
• Data linkage and integration
• Data service
• Data analysis and modelling
• Data visualisation
• Data manipulation workflow
• …
Example Applications
• Climate Model Downscaling Data for Impacts Research
• Cancer Genomics Linkage Application
• Brain Mapping National Resource
• POSITIVE PLACES: Spatial Analysis of Public Open Space
Climate Model Downscaling Data for Impacts Research
Regional Climate Model Data Collection
• Very big!
  • High spatial and temporal resolution
  • Large region
  • Many climate variables
  • Many atmospheric layers
  • Multiple simulations
• Data on an irregular model grid
• Stored in netCDF
Regional Climate Model Downscaling Data
• Agricultural Impacts Researchers
• Hydrological Impact Researchers
• Health Impacts Researchers
• Ecological Impacts Group
Regional Climate Model Downscaling Data
Impacts-relevant high res
• Very big!
  • High spatial and temporal resolution
  • Large region
  • Many climate variables
  • Many atmospheric layers
  • Multiple simulations
• Data on an irregular model grid
• Stored in netCDF
Climate Change Impact Researchers: I see some problems!
• What is a Regional Climate Model?
• I don’t have enough disk space for this dataset on my computer
• I can’t find data for the sites I’m interested in
• My software tools can’t handle this irregular grid.
• I can’t read this netCDF data format
• This data set doesn’t contain data for my site
• This data gives me strange results for the current climate
• This dataset is great! How can I share my work on it with others?
Data service – Climate Model Downscaling Data for Impact Research (CliMDDIR) (AP04, UNSW)
Provide open source software to transform RCM data
• Extract subsets of data (e.g. variables, regions)
• Regrid or interpolate data to sites
• Reformat data (e.g. GIS, ASCII, CSV)
• Calculate derived variables (e.g. pan evaporation)
• Apply statistical corrections (if necessary)
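The transformations listed for CliMDDIR (subsetting, mapping gridded data to a site, reformatting to CSV) can be illustrated with a minimal Python sketch. This is not the CliMDDIR software or its API: the grid cells, the `tasmax` values and the function names are all made up for illustration, and nearest-neighbour lookup stands in for whatever regridding or interpolation method the project actually uses.

```python
import csv
import io
import math

# Hypothetical toy "model grid": each cell carries its own lat/lon, as on an
# irregular grid, plus a short daily maximum temperature series (made-up values).
GRID = [
    {"lat": -33.80, "lon": 150.90, "tasmax": [24.1, 25.3, 23.8]},
    {"lat": -33.95, "lon": 151.25, "tasmax": [22.7, 23.9, 22.4]},
    {"lat": -35.20, "lon": 149.05, "tasmax": [26.0, 27.4, 25.1]},
]


def subset_region(cells, lat_min, lat_max, lon_min, lon_max):
    """Extract the cells that fall inside a lat/lon bounding box."""
    return [
        c for c in cells
        if lat_min <= c["lat"] <= lat_max and lon_min <= c["lon"] <= lon_max
    ]


def nearest_cell(cells, lat, lon):
    """Pick the cell closest to a site.

    Distance is measured in plain degrees, which is a crude approximation
    that is only reasonable over a small region.
    """
    return min(cells, key=lambda c: math.hypot(c["lat"] - lat, c["lon"] - lon))


def to_csv(values, variable):
    """Reformat a time series as CSV text with a time-step index column."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["step", variable])
    for step, value in enumerate(values):
        writer.writerow([step, value])
    return buf.getvalue()


# Subset to a region, map the gridded data to one site, export as CSV.
sydney_region = subset_region(GRID, -34.5, -33.5, 150.5, 151.5)
site_cell = nearest_cell(sydney_region, lat=-33.87, lon=151.21)
site_csv = to_csv(site_cell["tasmax"], "tasmax")
print(site_csv)
```

Real RCM output would be read from netCDF files rather than a Python list, and derived variables or statistical corrections would be further steps applied to the extracted series.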
http://www.climddir.org/node/33
CliMDDIR Service
Collection Description at RDA
Service Description at RDA
CliMDDIR Service Portal
Climate impact researchers can
• select region
• select time coverage
• select variables
• select simulation models
• select output format
• share (subset) data with other researchers
Climate Modellers
Agricultural Impact Researchers: assess how climate change impacts wheat cropping in NSW using the APSIM agriculture model
IT Specialists
Workflow - Cancer Genome Linkage Project
Challenges faced by biologists and clinicians:
• The manual process required to integrate their research data with other data sets
• The lack of standardised analytical processes
• The delay in transitioning from analysis to publication-ready results
http://ap27-cgla.blogspot.com.au/
Raw data: tttctgaaga ccatggacta tgagacctct
Derived data (i.e. mutation info) is released through the ICGC Data Portal
Workflow - Cancer Genome Linkage Project
Provide software/infrastructure to enable integration/transformation of multiple datasets within the GVL environment
• Software development by QFAB (Queensland Facility for Advanced Bioinformatics, UQ)
• Development aligned with that of the NeCTAR GVL
• Inclusion of the very large raw ICGC Pancreatic Dataset into the NeCTAR GVL
• Development of (reusable) Galaxy workflows for easier mutation searching
[Figure: variant detection pipeline in Galaxy]
[Screenshots of output data]
Data Visualisation
Brain Mapping National Resource
• Funded at QCIF and Centre for Advanced Imaging, UQ
• Developed TissueStack, which lets users link to specific parts of the data, and rapidly view and collaboratively annotate very large 3D datasets via a web browser.
• For details, please see Dr. Andrew Janke’s presentation on Wed. 12:05 – 12:25, Room P1
POSITIVE PLACES: spatial analysis of public open space
• Are the current provisions of public open space (POS) and parks adequate for the projected urban densification and population growth?
• Will there be enough POS? (i.e. will it still meet the 10% land provision?)
• Will the provision of different park types and facilities that encourage use by different population demographics (e.g. small pocket parks with play equipment for young children) or for different uses (e.g. active or passive recreation) be adequate? What more / less will be needed?
• Is there sufficient large open space for active recreation and sporting needs?
• What type of POS can promote increased social connectedness within communities?
Challenge: lack of a comprehensive and consistent digital dataset of public open space
Data integration and interrogation: Public Open Space (POS) Tool developed at UWA
7624 areas of POS
• 3813 parks (up to 43 different facilities and amenities per park)
• 820 school grounds/playing fields
• 1860 natural and conservation or bushland areas
• 771 areas of residual green space
POS statistics of a searched suburb or LGA can be downloaded as an Excel spreadsheet
With advanced features, users can:
• define an area of interest directly on screen
• upload a user-defined region as a GIS shapefile
• scenario-test the relationship between changes in population structure for a user-defined area and the provision of POS
http://positiveplaces.blogspot.com.au/
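The kind of scenario test described for the POS Tool can be illustrated with a small arithmetic sketch in Python. Only the 10% land provision figure comes from the slides. The suburb, its areas and its populations are hypothetical, and the two indicators below are assumptions chosen for illustration, not the POS Tool's actual method.

```python
def pos_share(pos_hectares, total_hectares):
    """Percentage of the land in an area that is public open space."""
    return 100.0 * pos_hectares / total_hectares


def pos_per_1000(pos_hectares, population):
    """Hectares of public open space per 1000 residents."""
    return 1000.0 * pos_hectares / population


# Hypothetical suburb: 500 ha in total, of which 55 ha is POS.
TOTAL_HA, POS_HA = 500.0, 55.0

# Hypothetical population scenarios for the same suburb.
scenarios = {"current": 10_000, "densified": 16_000}

share = pos_share(POS_HA, TOTAL_HA)
meets_target = share >= 10.0  # the 10% land provision mentioned in the slides
per_capita = {name: pos_per_1000(POS_HA, pop) for name, pop in scenarios.items()}

print(f"POS share of land: {share:.1f}% (meets 10% provision: {meets_target})")
for name, value in per_capita.items():
    print(f"{name}: {value:.2f} ha of POS per 1000 residents")
```

In this made-up case the land share stays above 10% under densification while the POS available per resident falls, which is why a land-share target alone may not answer the adequacy questions raised for POSITIVE PLACES.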
Who benefits from the Applications projects?
• Researchers
  • Conduct existing research more efficiently
  • Enable new research
  • Increase research collaboration opportunities
  • Strengthen relationships with government agencies and industries
  • Connect science to the public
• Government agencies, urban planners, infrastructure planners, …
• The public
Prof. Charles Watson, from Curtin University and Neuroscience Research Australia, commented: “The ability to share data from cloud, access it through TissueStack, would make a huge difference to the way we are able to interact, the ability for all participants to access the same dataset, to annotate it and to have a discussion on the way forward.”
Max De Antoni Migliorati (PhD candidate from QUT) on Semaphore: Monitoring and Modelling Australian Gas Emissions: “It is much more time effective, it is much easier to get our result with Semaphore. Now I can run 5 simulations today, while with a previous method, it took me one day to get one simulation done.”
Summary
• Substantial data infrastructures have been built to enable data sharing and data reuse
• The ANDS Applications program has demonstrated the value of data sharing and data reuse
Information
• ANDS project registry: https://projects.ands.org.au/getAllProjects.php?start=all
• Project blogs: http://andsapps.blogspot.com.au/p/projectfeed.html
• Demonstrations of value: http://andsapps.blogspot.com.au/p/resources.html
Thanks
• To Ian Macadam (from UNSW) for providing some slides about the CliMDDIR project
• To all who have participated in and contributed to the program
Questions?