Computational Science and the School of Informatics at Indiana University IU/HBCU STEM Initiative IUPUI April 11 2007 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington.


Computational Science and the
School of Informatics at
Indiana University
IU/HBCU STEM Initiative
IUPUI
April 11 2007
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
http://www.infomall.org
What is Computational Science?


Informatics is the integration of the art, science, and the human dimensions of information technology to provide solutions to discipline-specific problems
Informatics is a response to the data/information/knowledge gaps (data deluge) caused by “billions and billions of bits”
• Grids are technology supporting this in distributed research
Computational Science could be the same as this or focus on the large scale simulation part
Multicore chips will revitalize simulation!
Bioinformatics Data Deluge
Challenge and Opportunity
                1985    2000
Experiments     1       1
Genes           1       10,000
Data            10      10,000,000
e-moreorlessanything and the Grid







‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’ from its inventor John Taylor, Director General of Research Councils UK, Office of Science and Technology
e-Science is about developing tools and technologies that allow scientists to do ‘faster, better or different’ research
Similarly e-Business captures an emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world.
• The growing use of outsourcing is one example
The Grid provides the information technology e-infrastructure for e-moreorlessanything.
A deluge of data of unprecedented and inevitable size must be managed and understood.
People, computers, data and instruments must be linked.
On demand assignment of experts, computers, networks and storage resources must be supported
Why Grids/ Cyberinfrastructure Useful








Supports distributed science – data, people, computers
Exploits Internet technology (Web 2.0), adding management, security, supercomputers etc.
It has two aspects: parallel, with low latency (microseconds) between nodes, and distributed, with highish latency (milliseconds) between nodes
Parallel is needed to get high performance on individual 3D simulations, data analysis etc.; the problem must be decomposed
Distributed aspect integrates already distinct components
Cyberinfrastructure is in general a distributed collection of parallel systems
Grids are made of services that are “just” programs or data sources packaged for distributed access
Web 2.0 can be used “instead of” Grids
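
As a rough illustration of what decomposing the problem means for the parallel case, the sketch below splits the points of a 1D grid into contiguous blocks, one per node. The helper name `block_ranges` and the sizes are illustrative assumptions, not taken from the talk; a real 3D simulation would decompose in three dimensions and exchange boundary data between neighbouring nodes over the low-latency network.

```python
def block_ranges(n_points, n_nodes):
    """Split range(n_points) into n_nodes contiguous, near-equal blocks.

    Returns a list of (start, stop) half-open index ranges.
    Hypothetical helper, for illustration only.
    """
    if n_nodes <= 0:
        raise ValueError("n_nodes must be positive")
    base, extra = divmod(n_points, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional point each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Example: 10 grid points shared among 4 nodes.
for node, (start, stop) in enumerate(block_ranges(10, 4)):
    print(f"node {node}: indices [{start}, {stop})")
```

Each node then works only on its own block, which is why low latency between nodes matters: neighbouring blocks typically have to trade edge values at every step.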
TeraGrid: Integrating NSF Cyberinfrastructure
[Map of sites: Buffalo, Wisc, UC/ANL, Utah, Cornell, Iowa, PU, NCAR, IU, NCSA, Caltech, PSC, ORNL, USC-ISI, UNC-RENCI, SDSC, TACC]
TeraGrid is a facility that integrates computational, information, and analysis resources at the
San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of
Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications,
Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh
Supercomputing Center, and the National Center for Atmospheric Research.
Today 100 teraflops; tomorrow a petaflop. Indiana: 20 teraflops today and doubling
APEC Cooperation for Earthquake Simulation

ACES is a seven-year-long collaboration among scientists
interested in earthquake and tsunami prediction
• iSERVO is infrastructure to support
work of ACES
• SERVOGrid is a (completed) US Grid that is
a prototype of iSERVO
• http://www.quakes.uq.edu.au/ACES/

Chartered under APEC –
the Asia Pacific Economic
Cooperation of 21 economies
Grid of Grids: Research Grid and Education Grid
[Diagram labels: Repositories, Federated Databases; Sensors, Streaming Data; Field Trip Data; Sensor Grid; Database Grid; Compute Grid; GIS Grid; Discovery Services; Data Filter Services; Research Simulations; Customization Services (From Research to Education); Analysis and Visualization Portal; SERVOGrid (Research, Education); Education Grid Computer Farm]
SERVOGrid and Cyberinfrastructure


Grids are the technology based on Web services that implement
Cyberinfrastructure, i.e. support eScience or science as a team
sport
• Internet-scale managed services that link computers, data
repositories, sensors, instruments and people
There is a portal and services in SERVOGrid for
• Applications such as GeoFEST, RDAHMM, Pattern
Informatics, Virtual California (VC), Simplex, mesh
generating programs ...
• Job management and monitoring web services for running
the above codes.
• File management web services for moving files between
various machines.
• Geographical Information System services
• Quaketables earthquake specific database
• Sensors as well as databases
• Context (dynamic metadata) and UDDI system long term
metadata services
• Services support streaming real-time data
LEAD Gateway Portal
NSF Large ITR and Teragrid Gateway
- Adaptive Response to Mesoscale
weather events
- Supports Data exploration, Grid Workflow
Grid Workflow Datamining in Earth Science

NASA GPS

Work with Scripps Institute
Grid services controlled by workflow process real time
data from ~70 GPS Sensors in Southern California
[Diagram labels: Earthquake; Streaming Data Support; Transformations; Data Checking; Hidden Markov Datamining (JPL); Display (GIS)]
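
The chain of services named on this slide (transformations, data checking, datamining) can be pictured as filters applied in turn to a stream of sensor readings. The sketch below is only an assumption about how such a chain might look, using Python generators; the stage order, function names, thresholds and sample data are all hypothetical, and the datamining step is a simple moving-average change detector standing in for the Hidden Markov analysis, not an implementation of it.

```python
from statistics import mean


def transform(readings, offset=0.0):
    """Transformation step: shift each raw value by a fixed offset."""
    for station, value in readings:
        yield station, value - offset


def check(readings, limit=1000.0):
    """Data-checking step: drop implausibly large values (e.g. sentinels)."""
    for station, value in readings:
        if abs(value) <= limit:
            yield station, value


def detect_changes(readings, window=3, threshold=5.0):
    """Stand-in for datamining: flag a value that departs from the mean of
    the previous `window` values at the same station. NOT a Hidden Markov model."""
    history = {}
    for station, value in readings:
        past = history.setdefault(station, [])
        if len(past) == window and abs(value - mean(past)) > threshold:
            yield station, value
        past.append(value)
        if len(past) > window:
            past.pop(0)


# Made-up readings from one hypothetical station "A".
stream = [("A", 1.0), ("A", 1.2), ("A", 0.9), ("A", 9999.0), ("A", 1.1), ("A", 9.0)]
for station, value in detect_changes(check(transform(stream))):
    print(f"possible event at station {station}: {value}")
```

Because each stage consumes and yields one reading at a time, the chain never needs the whole data set in memory, which is the property a real-time streaming workflow relies on.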
Some Organizations I work with
• MSI CI2: Minority-Serving Institutions (MSI) Cyberinfrastructure Institute, led by the Alliance for Equity in Higher Education. Working with the Alliance will have systemic impact on at least 335 Minority Serving Institutions covered by
– AIHEC (American Indian Higher Education Consortium)
– HACU (Hispanic Association of Colleges and Universities)
– NAFEO (National Association for Equal Opportunity in Higher Education)
• MSI-CIEC: Minority-Serving Institution Cyberinfrastructure (CI) Empowerment Coalition, led by UHD (University of Houston Downtown) as a major Hispanic Serving Institution
• I am Senior Research Associate in the Center for Computational Science
and Advanced Distributed Simulation at UHD and Visiting Scholar for
Cyberinfrastructure Development at the Alliance for Equity in Higher
Education
Basic Ideas
• Cyberinfrastructure is critical to all involved in
Research and Education
• Cyberinfrastructure is intrinsically democratic
supporting broad participation
• MSIs should lead MSI integration with
Cyberinfrastructure
• One should guide the projects with experts
• One should aim at scalable (systemic) approaches
• Goal is peer collaborations involving all institutions of
higher education
Teaching Jackson State Fall 97 to Spring 2005
Syracuse
JSU
Example: Setting up a Polar CI/Grid
• NSF CI-Team project with HBCU ECSU in North Carolina and
Kansas University will design and set up a Polar Grid
– CI-enable MSIs (ECSU, Haskell) and a community (Polar Science)
• The North and South poles are melting, with potentially huge
environmental impact
– We have changed the 100,000 year Glacier cycle into a ~50 year cycle;
the field has increased dramatically in importance and interest
• Polar Grid is a network of computers, sensors (on robots and
satellites), data and people aimed at understanding science of
ice-sheets and impact of global warming
• We are planning Polar Grid relevant CI Education Infrastructure
and initial projects with Undergraduate students (ECSU) and
Graduate students (Kansas)
– Polar weather stations as Grid resources
– Use distance education to cover all CReSIS sites
CReSIS PolarGrid
• Important CReSIS-specific Cyberinfrastructure components include
– Managed data from sensors and satellites
– Data analysis such as SAR processing – possibly with parallel
algorithms
– Electromagnetic simulations (currently commercial codes) to design
instrument antennas
– 3D simulations of ice-sheets (glaciers) with non-uniform meshes
– GIS Geographical Information Systems
• Also need capabilities present in many Grids
– Portal i.e. Science Gateway
– Submitting multiple sequential or parallel jobs
• TeraGrid etc. (the National Cyberinfrastructure) is holding
Cyberinfrastructure days at various places around the country to popularize
it and identify how institutions can participate
– ECSU’s will be later this year
Indiana University Cheminformatics Center Summary
Indiana University is focusing on two major areas:
• Creating a comprehensive, easily accessible infrastructure for
chemoinformatics tools and data sources, linked with PubChem and
made available as web services, and partnering with screening centers and
other users to demonstrate how this infrastructure can be usefully applied
– Infrastructure can include any tools, not just ours (commercial/open source,
chemoinformatics, bioinformatics, and so on)
– New, custom applications can be built quickly using existing services in a
similar way to Google Maps and other “web 2.0” resources
• Being a central hub of chemoinformatics education, including offering
distance courses on chemoinformatics theory and techniques, practical
workshops on using chemoinformatics resources, and freely available web-based educational resources
– We currently offer a Ph.D., M.S. and graduate certificate (distance) in chemical
informatics
– Distance education program allows you to “pick and choose” courses to meet
educational needs: certificate is awarded on completion of four courses
CICC
Chemical Informatics and Cyberinfrastructure Collaboratory
Funded by the National Institutes of Health
www.chembiogrid.org
CICC Combines Grid Computing with Chemical Informatics
Large Scale Computing Challenges
Chemical Informatics is a non-traditional area of high
performance computing, but many new, challenging
problems may be investigated.
NIH PubMed DataBase
Chemical informatics text analysis programs can process 100,000’s of abstracts of online journal articles to extract chemical signatures of potential drugs.
[Diagram labels: OSCAR Text Analysis; Initial 3D Structure Calculation; Molecular Mechanics Calculations; Cluster Grouping; Toxicity Filtering]
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data deluge
of publicly available data on potential new drugs.
Docking
OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.
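
A task is “pleasingly parallel” when every item can be processed without reference to any other. The sketch below shows that pattern under stated assumptions: `score` is a made-up placeholder, not a docking code, and the molecule strings are arbitrary examples. A thread pool keeps the example self-contained; CPU-bound scoring on a real system would more likely be farmed out to processes or cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def score(molecule):
    """Placeholder score (hypothetical): NOT a real docking calculation."""
    return sum(ord(ch) for ch in molecule) % 100


def top_ranked(molecules, k=2, workers=4):
    """Score every molecule independently, then keep the k best."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so scores line up with molecules.
        scores = list(pool.map(score, molecules))
    ranked = sorted(zip(molecules, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Arbitrary example strings; any list of independent work items would do.
candidates = ["CCO", "c1ccccc1", "CC(=O)O", "N"]
for molecule, value in top_ranked(candidates):
    print(molecule, value)
```

No worker ever waits on another, so adding workers shortens the run until the pool outgrows the number of items; only the final ranking step needs all the results together.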
[Diagram labels: Quantum Mechanics Calculations; NIH PubChem DataBase; POVRay Parallel Rendering; IU’s Varuna DataBase]
Big Red (and the TeraGrid) will also enable us to perform time consuming, multi-stepped Quantum Chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community.
CICC supports the NIH mission by combining state of
the art chemical informatics techniques with
• World class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists
world wide
Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories
CICC Web Service Infrastructure
Cheminformatics Services
  Core functionality: Fingerprints; Similarity; Descriptors; 2D diagrams; File format conversion
  Applications: Docking; Filtering; Druglikeness; Toxicity predictions; Mutagenicity predictions; Anti-cancer activity predictions; Pharmacokinetic parameters
Statistics Services
  Computation functionality: Regression; Classification; Clustering; Sampling distributions
  Applications: Predictive models; Feature selection; 2D plots; Arbitrary R code (PkCell)
Database Services
  3D structures by CID, SMARTS, 3D Similarity
  Docking scores/poses by CID, SMARTS, Protein, Docking scores
  PubChem related data by CID, SMARTS
OSCAR Document Analysis
InChI Generation/Search
Computational Chemistry (Gamess, Jaguar etc.)
Varuna.net Quantum Chemistry
Grid Services: Service Registry; Job Submission and Management; Local Clusters; IU Big Red; TeraGrid, Open Science Grid
Portal Services: RSS Feeds; User Profiles; Collaboration as in Sakai
Varuna environment for molecular modeling (Baik, IU)
[Diagram labels: Researcher; Chemical Concepts; Papers etc.; ChemBioGrid; Experiments; Reaction DB; QM Database; PubChem, PDB, NCI, etc.; DB Service (Queries, Clustering, Curation, etc.); QM/MM Database; Simulation Service (FORTRAN Code, Scripts); Condor; TeraGrid; Supercomputers; “Flocks”]