Using Existing Products And Technologies For Scientific Research Dan Fay Director – North America Technical Computing Microsoft Corporation.
Can “Here And Now” Technologies Reduce Time To Insight?
Can “business” tools and techniques for dealing with … be used in scientific research to raise the bar and allow researchers to be scientists and not computer scientists?

The Problem For The e-Scientist
[Diagram labels: Experiments & Instruments, Other Archives, Literature, Simulations; facts, questions, answers]
- Data ingest
- Managing a petabyte
- Common schema
- How to organize it?
- How to reorganize it?
- How to coexist and cooperate with others?
- Data query and visualization tools
- Support/training
- Performance
  - Execute queries in a minute
  - Batch (big) query scheduling

[Diagram labels: Computational Modeling; Persistent Distributed Data; Workflow, Data Mining & Algorithms; Interpretation & Insight; Real-world Data; Persistent Distributed Storage; Visual Programming; Distributed Computation; Interoperability & Legacy Support via Web Services; Searching & Visualization; Live Documents; Reputation & Influence]

The Scripps Research Institute, Peter Kuhn Lab
Research Focus
- Early detection and therapy management of cancer patients
- Modulation of protein interactions for therapeutic intervention
Projects
- Cancer bioengineering partnership
- Structural Proteomics of SARS

TSRI Goals
- Improve collaboration
  - Complex experimental data
  - Within Scripps and with outside organizations
- Capture more data electronically
  - Images
  - Discussions
  - Structured data
- Provide project data and decisions in context, e.g.
  annotations on 2D and 3D objects
- Leverage existing productivity applications

The Collaborative Molecular Environment
Application
- Allows the user to establish context among projects, entities, and annotations
- Easily collects data from multiple sources (notes, files, URLs, screen clippings)
- Provides for annotation on pictures, data, and molecules
- Very simple reporting (not yet implemented)
Windows Presentation Foundation
- Application container
- Controls for annotating 2D and 3D images
- Rapid application environment for images and 3D data
SharePoint 2007
- Supports the application with standard Web Services
- Provides the security context for project teams and external collaborators
- Enables search of annotations in order to find relevant images
- Provides a single repository for collaboration with internal and external (SSL) collaborators
Office 2007
- Captures metadata to describe application context (image, investigator, etc.)

External Research & Programs

C-ME And 2D Annotation

Annotating Protein Data

Research
Integrate
- Data acquisition from source systems and integration
- Data transformation and synthesis
Analyze
- Data enrichment, with business logic, hierarchical views
- Data discovery via data mining
Report
- Data presentation and distribution
- Data access for the masses

Comparison Of Soil Moisture
[Figure: two scatter plots comparing the Vaira and Tonzi sites, titled "Water Content at 5 cm" and "Water Content at 20 cm"; fitted lines y = 0.4712x (R² = 0.7039) and y = 0.5854x (R² = 0.9163)]
Thanks to Gretchen Miller (UC Berkeley) and Catharine Van Ingen (MSR)

Other Applications
[Figure: "Temperature at North American Sites"; average temperature in °C plotted against latitude]

Dynameomics
Goal: Perform MD simulations of representatives of all fold families (unique structures) to maximize sampling of fold and sequence space
- Protein folding
- Protein unfolding (more tractable)
- In the Protein Data Bank: ~17,000 structures, >35,000 domains
- CATH, SCOP,
  Dali fold classification methods: consensus → 1130 non-redundant protein folds
Thanks to Valerie Daggett, Univ of Washington
Day et al, Prot. Sci., 2003
www.dynameomics.org

Top 30 folds
Represent ~50% of the structures in the Protein Data Bank
- Jelly roll: 1sac SAP
- α/β-plait: 1ris S6
- 4-helix bundle: 2a0b phosphotransfer domain
- β-grasp: 1pgb protein G
- EF-hand: 4icb calbindin
- OB fold: 1mjc CspA
- IGG-like: 1e65 azurin
- Cytochrome C: 1hrc cytochrome C
- IGG-like: 1fna fibronectin
- Rossmann: 3chy CheY
- TIM barrel: 1ypi TIM
- 3-helix bundle: 1enh engrailed homeodomain
- Globin: 1a6n myoglobin
- Trypsin-like serine protease: 1qq4 α-lytic protease
- Thioredoxin-like: 1ev4 GST A1-1
- Rossmann: 1ght transposon resolvase
- SH3 barrel: 1shg α-spectrin SH3
- Knottin: 1snb neurotoxin BMK M8
- C-type lectin: 2afp type II antifreeze prot.
- FAD/NAD(P) binding domain: 1ebd oxidoreductase
- Lipocalin: 1ifc fatty acid binding prot.
- Trefoil: 1tld bovine trypsin
- Zn finger: 2adr Zn finger (ADR)
- Snake toxin: 1ntn cobra neurotoxin
- Acid protease: 1g6l HIV-1 protease
- Rossmann: 2pth peptidyl tRNA hydrolase
- GST (C-term): 1ev4 GST A1-1
- IL-8 like (OB): 1bf4 Sso7d
- PLP dep. transferase: 1e5f methionine γ-lyase
- Laminin-like: 1edm coagulation factor IX

Example simulation system (1fna)
Fibronectin, a representative from the top-ranked fold (IGG-like), is prepared for molecular dynamics (MD) simulation by adding hydrogens (not shown) to the PDB structure and solvating it with explicit waters (red and white in ball & stick). MD is the time-dependent integration of the classical equations of motion for a molecular system. Our MD methods have been qualitatively and quantitatively benchmarked against experiment for more than 50 proteins in the past 15 years.

Example unfolding simulation (1fna)
[Figure: structure snapshots labeled Starting structure, Native (N), 10 nanoseconds, 21 ns, Denatured (D)]
Unfolding of fibronectin, a representative from the top-ranked fold (IGG-like), from its biologically active state (N) to a denatured, inactive state (D).
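The slide above describes MD as the time-dependent integration of the classical equations of motion. As a rough illustration only, the sketch below integrates a single particle on a one-dimensional spring with the velocity Verlet scheme. The slides do not say which integrator or force field the Dynameomics simulations use, so the integrator choice, the function names (`velocity_verlet`, `harmonic_force`, `total_energy`), and all parameter values are assumptions made for this example. It is a toy system, not a protein simulation.

```python
import math


def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate Newton's equation m*a = F(x) for one particle in 1D.

    Returns the list of (position, velocity) pairs, including the start.
    """
    trajectory = [(x, v)]
    a = force(x) / mass
    for _ in range(steps):
        # Advance the position using the current velocity and acceleration.
        x = x + v * dt + 0.5 * a * dt * dt
        # Re-evaluate the force at the new position.
        a_new = force(x) / mass
        # Advance the velocity with the average of old and new acceleration.
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        trajectory.append((x, v))
    return trajectory


def harmonic_force(x, k=1.0):
    """Hypothetical toy force: an ideal spring with stiffness k."""
    return -k * x


def total_energy(x, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy spring system."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


# Start at x = 1 with zero velocity and integrate for 10 time units.
traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                       mass=1.0, dt=0.01, steps=1000)
x_end, v_end = traj[-1]
print("final position:", x_end, "exact:", math.cos(10.0))
print("energy drift:", total_energy(x_end, v_end) - total_energy(*traj[0]))
```

A real MD code applies the same update to every atom in three dimensions, with forces derived from a molecular force field. Comparing the total energy at the start and end of a run is a common sanity check on the time step.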
During unfolding, fibronectin loses a critical hydrophobic contact in its core between a valine and a tyrosine.

Dynameomics
- 200 targets complete, 6 simulations of each (1 native, 5 unfolding)
- DOE INCITE Award: 3,300,000 CPU hours on NERSC
- 250 GB every 48 hours
- Projecting that we will have 100 TB (compressed)
- Now a database is required

SQL Server 2005 For Their Purposes
- Suite of applications
- Relational database engine
- OLAP engine
- Performance tools
- Extraction, Transformation and Loading tools
- Integrated development environment
OLAP: On-line Analytical Processing
MOLAP: Multi-dimensional OLAP

MOLAP For Scientific Analysis
Why A Multidimensional Database Is Desired
- It is efficient: most of the time, only two or three dimensions are actively in play
- The multidimensionality allows the user to select properties of interest and sideline the rest
- Better than SQL because it eliminates the need for complicated joins
- Sparsity tolerant
- Faster time to insight
- Better integration with existing Windows infrastructure
- Integrated and familiar development environment

Fighting HIV With Computer Science
Nebojsa Jojic and David Heckerman, MSR
A major problem:
- Over 40 million infected
- Drug treatments are effective but are an expensive life commitment
- Vaccine needed for third world countries
- Effective vaccine could eradicate disease
Methods from computer science are helping with the design of a vaccine:
- Machine learning: finding biological patterns that may stimulate the immune system to fight the HIV virus
- Optimization methods: compressing these patterns into a small, effective vaccine

Developed Set Of Specialist Tools
- Chromatogram deconvolution
- Pathway analysis/association/causal models
- Clustering/trees (phylo, haplotypes, etc.)
- Protein binding and folding
- Sequence diversity models (epitomes)
- Image analysis/classification
- Evolution modeling and inference
- Epitope prediction

HIV: The Diabolical Virus
The train-and-kill mechanism doesn’t work for HIV: the virus adapts through rapid mutation.
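The optimization idea named on the slides above, compressing immune-target patterns into a small vaccine, can be illustrated with a toy greedy selection. This is only a sketch and not the method used by the MSR group, which the slides do not specify in detail. The sequences are made-up strings, and the function names (`kmers`, `greedy_cover`) and parameters (`k`, `budget`) are hypothetical. It repeatedly picks the short peptide that occurs in the largest number of strains not yet covered.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings (candidate peptides) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def greedy_cover(sequences, k, budget):
    """Pick up to `budget` k-mers, each time taking the one found in the
    most still-uncovered sequences. Returns (chosen k-mers, uncovered indices).
    """
    kmer_sets = [kmers(s, k) for s in sequences]
    uncovered = set(range(len(sequences)))
    chosen = []
    while uncovered and len(chosen) < budget:
        # Count, for each k-mer, how many uncovered sequences contain it.
        counts = {}
        for i in uncovered:
            for km in kmer_sets[i]:
                counts[km] = counts.get(km, 0) + 1
        if not counts:
            break
        # Sort first so ties are broken alphabetically and runs are repeatable.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {i for i in uncovered if best not in kmer_sets[i]}
    return chosen, uncovered


# Made-up toy "strains"; real inputs would be aligned viral protein sequences.
strains = ["ACDEFGHIK", "ACDEFGHLK", "MCDEFQHIK", "MNPQRSTVW"]
picked, missed = greedy_cover(strains, k=4, budget=1)
print(picked, missed)  # ['CDEF'] {3}
```

With a budget of one peptide the sketch picks `CDEF`, which is shared by three of the four toy strains, and reports the fourth as uncovered. Raising the budget trades vaccine size for coverage.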
As soon as the killer cells get the upper hand, the epitopes start changing.

Strategy
- Find peptides or epitopes that occur commonly across a *population* of HIV viruses
- Compact the known or potential immune targets into a small vaccine

HPC and HIV Vaccine Design
Carl Kadie and David Heckerman, Machine Learning and Applied Statistics, Microsoft Research
- Developed software: 8 or so new research programs. Most in .NET (C# and C++/CLI), one in R, one in native C++.
- Hardware: cluster of 25 IBM eServer 326, 2 processors per machine
- Cluster software: Windows Compute Cluster Server 2003

Fusion Events

Integrated Discovery in Gene Networks
- Integrate genome-scale data for discovery and prediction
- Incorporate disease, multiple organisms
- Create applied systems network standard
Thanks to: Mehmet Dalkilic, Indiana University
Andrews-Dalkilic Laboratory: James Costello (PhD Candidate), Rupali Patwardhan, Sumit Middha, Brian Eads, John Colbourne, Scott Beason, Junguk Hur

Microarray Co-Expression
- Arbeitman: “Life Cycle of Drosophila”
- Parisi: “Incyte Drosophila LifeArray v1.0”
- White: “Larval Tissues-Specific Transcripts”
Protein-Protein Interaction
- FlyGRID (Fly General Repository for Interaction Dataset)
- DIP (Drosophila Interaction Database by CuraGen)
- MINT (Molecular Interaction Database)
- BIND (Biomolecular Interaction Network Database)
Genetic Interaction
- Flybase
Phenotypic Data
- Flybase
Binding Site
- DNase I Footprint Database and Patser3 using PWM
RNAi Screens
- Harvard RNAi Screen, Norbert Perrimon

xl-caBIG Smart Client
How to give scientists a graphical interface for accessing cancer Biomedical Informatics Grid (caBIG) data-services
http://xl-cabig-client.sourceforge.net/
PI: Katarzyna Macura, Johns Hopkins, caBIG In Vivo Imaging Workspace Subject Matter Expert

xl-caBIG Smart Client

Reproducible Research Document
Broad Institute
Infusion Development

SharePoint Products And Technologies
Microsoft Office SharePoint Server 2007
- Business Intelligence: server-based Excel spreadsheets and data visualization, Report Center, BI Web Parts, KPIs/Dashboards
- Collaboration: docs/tasks/calendars, blogs, wikis, e-mail integration, project management “lite”, Outlook integration, offline docs/lists
- Business Forms: rich and Web forms based front ends, LOB actions, enterprise SSO
- Content Management: integrated document management, records management, and Web content management with policies and workflow
- Portal: Enterprise Portal template, Site Directory, My Sites, social networking, privacy control
- Search: enterprise scalability, contextual relevance, rich people and business data search
- Platform Services: Workspaces, Mgmt, Security, Storage, Topology, Site Model

Excel Services Overview
Design and author (Excel 2007)
- Publish spreadsheets
SharePoint platform and Excel Services
- Spreadsheets stored in document libraries
- Spreadsheet calculation and rendering
- External data retrieval and caching
- 100% calculation fidelity
View and interact (Browser)
- High quality web rendering
- Zero-footprint
- Interactive: set parameters, sort, filter, explore
- Limit to browser access
Export/Snapshot into Excel (Excel 2007)
- Open in Excel for rich exploration and analysis
- Open snapshots
Programmatic access (Custom applications)
- Set values, perform calculations, get updated values via web services
- Retrieve full workbook file

Development, Data, Workflow, Collaboration, Publications
- .NET & Visual Studio
- F#
- IronPython
- SQL Server
- SQL Server Analysis Services
- Windows Workflow
- SharePoint Server 2007
- Knowledge Network
- Instant Messenger
- ConferenceXP
- Academic Live, Onfolio, etc.

Resources
- Windows Compute Cluster Server: Tuesday 12-1, Baker
- High-Performance Computing with Windows: http://windowshpc.net/
- Data mining: www.sqlserverdatamining.com/
- Develop without Borders Challenge: www.developwithoutborders.com
- Technical Computing Blogs: http://blogs.msdn.com/dan_fay and http://blogs.msdn.com/eScience

© 2006 Microsoft Corporation. All rights reserved.
Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.