Virtual Organizations: Building Interdisciplinary Collaborations

Dan Reed
[email protected]
Chancellor’s Eminent Professor
Vice Chancellor for IT
University of North Carolina at Chapel Hill
Director, Renaissance Computing Institute

Acknowledgments
• Funding agencies
  – NIH
    • Carolina Center for Exploratory Genetic Analysis (CCEGA)
  – NSF
    • TeraGrid Science Gateways
  – State of North Carolina
    • RENCI and ancillary Bioportal support
• RENCI staff
  – Alan Blatecky, Kevin Gamiel, Xiaojun Guan
  – Clark Jefferies, Howard Lander
  – John Magee, Ruth Marinshaw, Jeff Tilson
  – Lavanya Ramakrishnan
• And a host of others …

21st Century Challenges
• The three fold way
  – theory and scholarship
  – experiment and measurement
  – computation and analysis
• Supported by
  – distributed, multidisciplinary teams
  – multimodal collaboration systems
  – distributed, large scale data sources
  – leading edge computing systems
  – distributed experimental facilities
• Socialization and community
  – multidisciplinary groups
  – geographic distribution
  – new enabling technologies
  – creation of 21st century IT infrastructure
    • sustainable, multidisciplinary communities
    • “Come as you are” response
[Diagram labels: Theory; Experiment; Computation]

Exemplar 21st Century Challenges
• Population growth in sensitive areas
  – severe weather sensitivity
    • national impact
  – geobiology and environment
  – economics and finance
  – sociology and policy
• Economics and health care
  – longitudinal public health data
    • environmental interactions
  – genetic susceptibility
    • heart disease, cancer, Alzheimer's
  – privacy and insurance
  – public policy and coordination

Mean Onset of Alzheimer’s Disease
• apolipoprotein (apo)
  – apoE2, apoE3 and apoE4 alleles
    • on chromosome 19
  – apoE4 allele
    • 40% to 60% of Alzheimer's patients
    • not the only cause for Alzheimer’s
• apo gene inheritance
  – ~25% inherit 1 copy of apoE4 allele
    • Alzheimer's risk increases 4X
  – 2% inherit 2 copies of apoE4 allele
    • Alzheimer's risk increases 10X
[Figure: proportion of each genotype unaffected (0 to 1.0) versus age at onset (60 to 85), with curves for genotypes 2/3, 3/3, 2/4, 3/4 and 4/4]
Source: Alan Roses, GSK

Big Questions
[Diagram labels: DNA sequence; Promoter; Message; Sequence Annotation; Protein sequence and regulation; Protein structure; Homology based protein structure prediction; Molecular simulations; Protein/enzyme function; Multi-protein machines; Metabolic pathways and regulatory networks; Pathway simulations; Network analysis; Data integration; Bacteria and cells; Organs, Organisms and Ecologies]

Genetics and Disease Susceptibility
[Diagram labels: Phenotype 1; Phenotype 2; Phenotype 3; Phenotype 4; Ethnicity; Environment; Age; Gender; Identify Genes; Pharmacokinetics; Metabolism; Endocrine; Physiology; Immune; Proteome; Transcriptome; Morphometrics; Biomarker Signatures; Predictive Disease Susceptibility]
Source: Terry Magnuson, UNC

PITAC Report Contents
• Computational Science: Ensuring America’s Competitiveness
  1. A Wake-up Call: The Challenges to U.S. Preeminence and Competitiveness
  2. Medieval or Modern? Research and Education Structures for the 21st Century
  3. Multi-decade Roadmap for Computational Science
  4. Sustained Infrastructure for Discovery and Competitiveness
  5. Research and Development Challenges
• Two key appendices
  – Examples of Computational Science at Work
  – Computational Science Warnings – A Message Rarely Heeded
• Available at www.nitrd.gov

Life Science Lessons from Astronomy
• Historically, discoveries accrued to those
  – with access to unique data
  – who built next generation telescopes
• Two things changed
  – growing costs and complexity of telescopes
  – emergence of whole sky surveys
• The result
  – virtual astronomy
  – discovering significant patterns
    • analysis of rich image/catalog databases
  – understanding complex astrophysical systems
    • integrated data/large numerical simulations

{Inter}national Virtual Observatory
[Diagram labels: Cluster Galaxy Morphology Analysis Portal; Chandra SIA; Skyview SIA; DSS SIA; CNOC SIA; NED Cone Search; CADC CNOC Cone Search; Morphology Calculation Service; clusters; User’s Machine; web browser]
1. User selects a cluster
2. Look up cluster in internally stored catalog
3. X-ray and Optical Images retrieved via SIA interface
4. User launches distributed analysis
5. Initial Galaxy Catalog generated via Cone Search
6. Image cutout pointers merged into catalog
7. Morphological parameters calculated on grid for each galaxy
8. User downloads final table and images for analysis & visualization
Source: Ray Plante, NCSA

The Bioinformatics Challenge
• Challenge
  – the rise of quantitative biology
    • burgeoning bioinformatics data
  – complex analysis and modeling problems
  – education and training in new technologies
• Reality
  – diverse tools with idiosyncratic interfaces
    • steep learning curves
  – software development by diverse groups
  – distributed databases with diverse metadata
• Need
  – integrated, easy-to-use toolset with standard interfaces
  – extensible mechanisms that hide idiosyncrasies
  – tool and bioinformatics training
• The solution
  – bioinformatics infrastructure and coupled training

Need: Simple, Easy-To-Use Tools
“Genome. Bought the book. Hard to read.”
Eric Lander

Web and Social Processes
• Google
  – it’s a search engine, it’s a verb, …
• Blogs
  – published self-expression
• Instant Messenger
  – social networks
• Wireless messaging
  – semi-synchronous
• Internet commerce
  – the dot.com boom/bust
  – eBay, Amazon
• Spam, phishing, …
  – anti-social behavior

Benefits of Standards
• Interoperability
• Separation of concerns
• Reuse
• Independence
• Dependability
• Sharing
• Commonality
• Shared knowledge base
  – knowledge reuse
  – simplification (one hopes)

Grids of All Flavors

What’s A Grid/Web Service?
It’s been 12 years!
Web: Uniform access to documents
Grid/Web Services: Flexible, high-performance access to resources and services for distributed communities
[Diagram labels: http://; Software catalogs; Computers; Sensors and instruments; Colleagues; Data archives]

Grid History: I-Way at SC’95
• A prototype national infrastructure
  – 17 sites, connected by
    • vBNS and six other ATM networks
  – 60 applications
• Features
  – I-POPs for site access
  – Kerberos authentication
  – manual scheduling
  – distributed communication libraries
• Experiences
  – led to Globus Grid toolkit
• Concurrent industry needs
  – led to web services for B2B interoperation

Web Services: “Commercial Grids”
• From browser-centric to service-centric
  – from human-computer to computer-computer
  – structured negotiation and response
• Workflow creation and management
  – end-to-end service negotiation
  – inter-organizational interaction
• Prerequisites
  – metadata standard for service descriptions
  – standard communication mechanisms
  – resource discovery and registration

eBay Web Services Architecture
• Over 40% of eBay's listings are now via API calls
Source: IBM

Web Services: A Definition
A web service is … designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact … [using] its description using SOAP messages, … using HTTP with an XML serialization ....
W3C Working Draft, August 2003
[Diagram: Service Provider, Service Broker and Service Consumer linked by Publish, Locate and Invoke arrows, each labeled SOAP; also labeled WSDL and UDDI]
• SOAP (Simple Object Access Protocol)
• WSDL (Web Services Description Language)
• UDDI (Universal Description, Discovery and Integration)

Technology Push
Source: Gartner Group

European myGrid Architecture
Source: www.mygrid.org

The Bioinformatics Challenges
• Complex, multilevel models
  – integration and in silico designs
• Information visualization
  – complexity and scale
• Data models and ontologies
  – community definition
• Data federation, storage and management
  – shared access and support
• User access portals
  – web-based tool and service interfaces
• Packaging, distribution and deployment
  – community building

Multilevel Cellular Models
• Signaling networks
  – environmental triggers and behavior
    • e.g., cell lifecycle
  – different pathways in each tissue type
• Metabolic networks
  – measurable products in pathway
  – many systems are steady state
  – negative feedback leads to stabilization
• Protein interaction networks
  – localization of proteins that interact for function
  – protein-protein interactions for specific actions
• Gene regulatory networks
  – many things affect gene product concentration
  – nucleic-nucleic, protein-nucleic interactions
• Computing, physics, engineering and biology
  – control theory, mathematical models, phase spaces
  – from biological cartoons to predictive models
    • e.g., microRNAs and gene expression controls

Biological Models
• Simulation and prediction
  – structures and dynamics
• Reasoning and discovery
  – reverse engineering
[Figure: temporal axis (seconds, 10^-12 to 10^6) labeled Bond Motion, Catalysis, Diffusion, Transcription, Translation, Growth & Division; spatial axis (nm^3, 10^0 to 10^12) labeled Metabolites, Proteins, Ribosomes, Prokaryotes, Eukaryotes]

Biophysical and Environmental Modeling
[Diagram labels: Airway/flow; Mucus; Cilia; Cell biochemistry and structure; Proteomics; Genomics]
Source: Ric Boucher, UNC

Data Heterogeneity and Complexity
Genomic, proteomic, transcriptomic, metabolomic, protein-protein interactions, regulatory bionetworks, alignments, disease, patterns and motifs, protein structure, protein classifications, specialist proteins (enzymes, receptors), …
[Diagram labels: Phenotype; Disease; Clinical trial; Drug; Genome sequence; Gene sequence; Gene expression; Proteome; Protein Structure; Protein homology; Protein Sequence; P-P interactions]
Source: Carole Goble (Manchester)

Sensor Data Overload
• High resolution brain imaging
  – 4.5 petabytes (PB) per brain
Source: Chris Johnson, Utah; Art Toga, UCLA
Source: Robert Morris, IBM

RENCI: What Is It?
• Statewide objectives
  – create broad benefit in a competitive world
  – engage industry, academia, government and citizens
• Four target areas
  – public benefit
    • supporting urban planning, disaster response, …
  – economic development
    • helping companies and people with innovative ideas
  – research engagement across disciplines
    • catalyzing new projects and increasing success
    • building multidisciplinary partnerships
  – education and outreach
    • providing hands on experiences and broadening participation
• Mechanisms and approaches
  – partnerships and collaborations
  – infrastructure as needed to accomplish goals

Carolina Center for Exploratory Genetic Analysis (CCEGA)
[Diagram labels: Interoperable Data Management; Faculty, Staff & Students; Driving Problems; Analysis Techniques; Extant Data Models; Promoting Mutual Awareness; Experimental Genetics Portal; Statistical & Computational Techniques; Virtuous Cycle; Interdisciplinary Research & Education]

CCEGA Participants
• Coordination team
  – Dan Reed, RENCI
  – Terry Magnuson, CCGS
  – Alan Blatecky, RENCI
  – Kirk Wilhelmsen, CCGS
• Eleven departments/institutes
  – Biostatistics
  – Cancer Center
  – Genetics
  – Computer Science
  – Epidemiology
  – Genetics
  – Health Science Library
  – Information and Library Science
  – Pharmacy
  – RENCI
  – Statistics
• Campus wide support
  – from many sources
• Project participants
  – Brad Hemminger, Information & Library Science
  – James Evans, Genetics
  – Kevin Gamiel, RENCI
  – Xiaojun Guan, RENCI
  – Barrie Hays, Health Science Library
  – Clark Jefferies, RENCI
  – Ethan Lange, Genetics
  – Andrew Nobel, Statistics
  – Karen Mohlke, Genetics
  – Kari North, Epidemiology
  – Susan Paulsen, Computer Science
  – Fernando Manuel Pardo, Genetics
  – Charles Perou, Cancer Center
  – Lavanya Ramakrishnan, RENCI
  – Jan Prins, Computer Science
  – Patrick Sullivan, Genetics
  – Lisa Susswein, Cancer Center
  – David Threadgill, Genetics
  – Alexander Tropsha, Pharmacy
  – K.T.L. Vaughan, Health Science Library
  – Fred Wright, Biostatistics
  – Wei Wang, Computer Science
  – Fei Zou, Biostatistics

Data: From Lab and Clinic to Analysis
• Independent data management
  – data security
  – version control
  – redundancy
  – controlled access
• NIH CCEGA
  – Carolina Center for Exploratory Genetic Analysis
[Diagram labels: Clinical; Laboratory; ELSI; Integration & Informatics; Analysis]
Source: Brad Hemminger, UNC

Data Management and Information Viz
[Diagram labels: Published Domain Literature; Taxonomy…; GenBank; DB Schema; Ontology; Annotation; Annotated Domain Literature; Information Mining Module; Information Visualization Module]

From SNPs to HapMap
• Single Nucleotide Polymorphisms (SNPs)
  – one in ~1200 bases differs across individuals
  – SNPs act as markers to locate genes
• Common groups of SNPs are shared
  – i.e., form a haplotype
• HapMap data sources
  – 90 Yoruba individuals (30 trios) from Nigeria (YRI)
  – 90 individuals (30 trios) of European descent from Utah (CEU)
  – 45 Han Chinese individuals from Beijing (CHB)
  – 45 Japanese individuals from Tokyo (JPT)
• ~3,500,000 SNPs typed
  – basis for association studies for disease identification

CCEGA HapMap Simulator
• Synthetic data
  – disease models
  – model testing
    • mining bakeoffs

Carolina Bioportal
• Three overlapping target groups
  – undergraduate education
  – graduate education and research
  – academic/industrial research
• Features
  – access to common bioinformatics tools
  – extensible toolkit and infrastructure
    • OGCE and National Middleware Initiative (NMI)
    • leverages emerging international standards
  – remotely accessible or locally deployable
  – packaged and distributed with documentation
• National reach and community
  – TeraGrid deployment
    • science gateway
• Education and training
  – hands-on workshops
    • clusters, Grids, portals and bioinformatics

Distributed Grid and Web Services
[Diagram layers, in the order given:]
• Grid Portals: Launch, configure and control; Application Interface; Workflow service
• App Instances
• Open Grid Service Architecture Layer: Registries and Name binding; Data Management Service; Security; Policy; Reservations and Scheduling; Administration & Monitoring; Accounting Service; Logging; Grid Orchestration; Event/Message Service
• Open Grid Service Infrastructure (web service component model)
• Resource Layer (from PCs to Supercomputers); Online instruments
Source: Dennis Gannon, Indiana

Bioportal Architecture
www.ncbioportal.org
[Diagram labels: HTML Files; Interface Generator; PISE Application XML Description; Application Processing; Application Databases; Remote File Access; Job Records; Job History Database; Job Submission; Bioportal; Velocity Files; Command Files; User Profile; OGCE; User Databases; MyProxy; GridFTP; Gatekeeper; Local cluster; Authentication, Grid Credential]
• OGCE toolkit
  – used by cyberinfrastructure projects
    • LEAD, NEES, PACI, DOE, TeraGrid …

Putting the Technologies Together
[Diagram labels:]
• NC Bioportal
• OGCE Toolkit (Grid middleware)
• Chef (collaboration/standard portlets)
• Jakarta Jetspeed (enterprise portal)
• Turbine (web app framework)
• Velocity (template engine)
• VMC
• PISE (XML Wrapper)
• Tomcat (Apache servlet container)
• Bio Applications
• Grid Portlets, CoG
• Databases

Community Software Toolkit: Lessons
• NSF PACI Alliance “In a Box” toolkits
  – cluster software (aka OSCAR)
  – Grid infrastructure (aka NMI)
  – Access Grid for distributed collaboration
  – tiled display walls for visualization
• Distribution materials
  – software and training materials
    • CDs and web
• Community workshops and training
  – Linux Clusters Institute
  – MSI HPC workshops
  – hands on training
• Lowering the entry barrier
  – usage and deployment
• Bioportal distribution
  – workshops, tutorials
  – training materials
  – road shows

NC Bioportal: What’s Next
• Engagement
  – workshops, experiences and deployments
• Infrastructure
  – dynamic job scheduling across multiple sites
  – migration to OGCE 2.0
  – fully automated database updates
  – workflow construction and processing
• Portal tool suite
  – expanded applications and databases
    • phylogeny, morphology, microarray analysis, …
• Training materials
  – additional modules based on user feedback
  – workshop materials packaged for self-study
• Leverage national presence
  – TeraGrid/NCSA bioinformatics portal

The Vision of Grid/Web Services
“… Behold, the people is one, and they have all one language; and this they begin to do: and now nothing will be restrained from them, which they have imagined to do.”
– Book of Genesis
Pieter Bruegel, The Tower of Babel (1563)

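Sketch: Publish, Locate, Invoke
The "one language" idea behind the publish/locate/invoke pattern from the Web Services definition slide can be illustrated with a toy in-memory registry. This is only a minimal sketch under stated assumptions, not SOAP, WSDL or UDDI themselves: the names `ServiceDescription`, `ServiceBroker`, `publish`, `locate`, `invoke` and the `blast_search` service are all hypothetical, and plain Python dictionaries stand in for the XML messages a real deployment would exchange.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical stand-in for a WSDL-style service description."""
    name: str
    inputs: Tuple[str, ...]            # required message fields
    endpoint: Callable[[dict], dict]   # what actually handles a request


class ServiceBroker:
    """Hypothetical stand-in for a UDDI-style registry."""

    def __init__(self) -> None:
        self._registry: Dict[str, ServiceDescription] = {}

    def publish(self, description: ServiceDescription) -> None:
        # Provider side: register the description under its name.
        self._registry[description.name] = description

    def locate(self, name: str) -> ServiceDescription:
        # Consumer side: look the description up by name.
        return self._registry[name]


def invoke(description: ServiceDescription, message: dict) -> dict:
    """Check the message against the description, then call the provider."""
    missing = [field for field in description.inputs if field not in message]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    return description.endpoint(message)


def blast_search(message: dict) -> dict:
    # Toy provider: reports the query length instead of running a search.
    return {"query_length": len(message["sequence"])}


broker = ServiceBroker()
broker.publish(ServiceDescription("blast_search", ("sequence",), blast_search))
found = broker.locate("blast_search")
result = invoke(found, {"sequence": "TATACAGTACCGT"})
```

The consumer depends only on the published description, not on how the provider is implemented, which is the separation of concerns listed under Benefits of Standards.
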
Interdisciplinary Collaborations
• Appropriate reward structures
  – well-matched time constants
• Intellectual equality
  – balanced recognition of contributions
• Research/infrastructure distinctions
  – timelines and people needs differ
• Confidentiality and openness
  – academic/industry collaboration perspectives
• Intellectual property
  – background IP and differential disciplinary models

Some Thoughts on the Future
• Grids/web services are not a panacea
  – we have seen this movie before
    • standards debates can be endless
    • make new mistakes, not the same old ones
  – code is shifted from modules to interfaces
• Danger of “Death by CS Abstraction”
  – “all problems can be solved by another level of indirection”
• Appropriate decomposition is a challenge
  – performance, usability, flexibility
• Generality and extensibility really matter
  – incremental aggregation and interoperability
  – data management and federation
• Better questions, not just private capabilities
  – limited by creativity not resources

The Cambrian Explosion
• Most phyla appear
  – sponges, archaeocyathids, brachiopods
  – trilobites, primitive mollusks, echinoderms
• Indeed, most appeared quickly!
  – Tommotian and Atdabanian
  – as little as five million years
• Lessons for computing
  – it doesn’t take long when conditions are right
    • raw materials and environment
  – leave fossil records if you want to be remembered!

Thanks for the Invitation!