Coordinated Laboratory for Computational Genomics

Computational Methods in Molecular Biology

Thomas L. Casavant, Ph.D.; Director
Tom B. Bair, Ph.D.*; Post-Doctoral Fellow
Terry A. Braun, Ph.D.; Sr. Computational Scientist
Todd E. Scheetz, Ph.D.; Sr. Computational Scientist

Coordinated Laboratory for Computational Genomics, Department of Electrical and Computer Engineering, The University of Iowa

Fundamental Observation

• The study of the Life Sciences is changing – The so-called post genome era is here.

– 100’s of databases of DNA sequence and mRNA transcripts exist.

– Functional knowledge is expanding constantly.

– Most biological researchers are not prepared for this.

– People with computational training have not been prepared to work in this application area.

Hub & Spoke Model

In the Hub:
• Applied Math and Statistics
• CS/Math
• ECE
• Allied Disciplines
• Mgmt. Inf. Sys.
• Physics
• Information Science

Hub and Spoke Model

Hub (center), with spokes to:
• Gene Therapy
• Ophthalmology
• Cancer Center
• Center for Macular Degen.
• Pediatrics
• Genetics
• Pulmonary Medicine
• BioChemistry
• Allied Disciplines
• Immunology
• Biology

Benefits of Such a Model?

• In the Hub: Computational Scientists.
  – Interacting with each other to share:
    • Knowledge and methods, computing infrastructure, organizational support, collaborative connections, curricular efforts, human resources (staff, post-docs, etc.).
• At the Ends of the Spokes: Disciplinary Scientists and Interdisciplinary Collaborators
  – Computational Scientists form the links with collaborators who provide:
    • Access to real problems, and x-training in allied disciplines

Consequence of Model for this Course?

• Who are you?

  – Some with lots of computer background
  – Some with lots of biology background
• We will bring both groups along.
  – Some material will inevitably be review for some
  – Some material will be over your head.
• Like the hub and spoke model, we will try to work on the “together”
• “Different” kind of course …

Course Approach

• Three Levels
  – Lecture/Concepts
  – Demonstration
  – Hands-on experience
• Areas of Focus:
  – Concepts and Algorithms
  – Basic Computer Knowledge (learn a bit about what there is to learn)
  – Basic Molecular Biology/Genetics
  – How bioinformatics tools are designed
  – How to use non web-based software
  – UNIX and Perl

Main Subjects

1. Overview of Computer Science (today); Bioinformatics and Computational Biology. Models, Algorithms, and Available Tools. (Dr. Casavant)
2. An introduction to using UNIX and why it is worth the “pain”. Script writing in Perl, installing and running free software, files, etc. (Drs. Scheetz, Braun, and Bair)
3. An introduction to Molecular Biology, Biochemistry, and Genetics (Drs. Scheetz, Braun, and Bair)
4. Light-speed tour of contemporary problem areas in Bioinformatics and Computational Biology (Drs. Casavant, Scheetz, Braun, and Bair).

Main BCB Subjects (cont)

1. Sequence analyses
2. Hidden Markov Models
3. Gene Prediction
4. Phylogenetics
5. Protein Structure and Domains
6. Mapping
7. Linkage Analysis
8. Microarrays and Expression Study
9. Pathways
10. Promoters

Intended Take Home Lessons?

• The computer is a tool, but
• A fairly blunt tool.
• Need to be able to flex this tool
• Too many disciplines for a single person
  – Computation
    • Engineering
    • Science
  – Mathematics/Probability/Statistics
  – Biology (Cellular, Molecular, etc.)
  – Genetics (Human, evolution, molecular, etc.)
  – Physics (bio, molecular, etc.)
  – Chemistry
• Must learn to work in teams

BioInformatics and Computational Biology

• Defn: The modern study of biology, and its applications to medicine, agriculture, and other areas – centered on the use of substantial computing resources – hardware, software, and human.


How do Bioinformatics and Computational Biology differ?

• Bioinformatics (Data Driven): Driven by large datasets which must be gathered, curated, stored, organized, searched, and archived.
• Computational Biology (Model Driven): Development of algorithms and computational procedures to model, simulate, and analyze biological systems.

BCB Training: Three Critical Components

1. Continuously learn Biological/Computational Science (X-training)
2. Develop computer systems (hardware and software) to “process” raw data and store it in a form that can be used for later analysis (Data Pipelines)
3. Work with other biological/genetic/medical researchers to answer “the questions” by developing computational tools to analyze the archived, as well as new, data (Interactive Tools)

BCB X-Training: Learning Biological/Computational Science

• Begin with a recognized strength in either Biological or Computational Science
• Then work to continuously strengthen the other area. If:
  – Computational Background:
    • Must study biology, biochemistry, genetics
    • Must develop a working knowledge of laboratory methods
  – Biological Background:
    • Must study algorithms, computer systems, programming
    • Must either develop competence in using computers at the “UNIX” level, or
    • Must be able to span the gap between biologists and those who can work with computers at that level

BCB Data Pipeline Construction

• Develop computer systems (hardware and software) to “process” raw data and store it in a form that can be used for later analysis: a “pipeline” of efficient, accurate, and high performance processing steps.
• Examples
  – Data acquisition hardware/software
  – Format conversion
  – Quality screening
  – Correlation functions
  – Annotation
  – Database deposition
  – Distribution of data
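A minimal sketch of three of the stages above (format conversion, quality screening, annotation) chained as a pipeline. This is an illustration in Python, not the laboratory's actual pipeline: the record shape, the function names, and the 10% ambiguous-base threshold are all assumptions.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (name, sequence); a hypothetical record shape


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Format conversion: turn FASTA-style text lines into (name, sequence) records."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def quality_screen(records: Iterable[Record],
                   max_n_fraction: float = 0.10) -> Iterator[Record]:
    """Quality screening: drop empty reads and reads with too many ambiguous bases (N).

    The 10% default is an arbitrary, illustrative threshold.
    """
    for name, seq in records:
        if seq and seq.count("N") / len(seq) <= max_n_fraction:
            yield name, seq


def annotate(records: Iterable[Record]) -> Iterator[Dict]:
    """Annotation: attach length and GC content to each surviving record."""
    for name, seq in records:
        gc = (seq.count("G") + seq.count("C")) / len(seq)
        yield {"name": name, "length": len(seq), "gc": round(gc, 3), "sequence": seq}


def run_pipeline(lines: Iterable[str]) -> List[Dict]:
    """Chain the stages; each stage consumes the previous stage's output."""
    return list(annotate(quality_screen(parse_fasta(lines))))
```

Because each stage is a generator, records flow through one at a time, so a stage can be added, removed, or replaced without touching the others.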

MGC FL cDNA Sequencing Pipe


BCB Interactive Tool Development

• Formulating and answering “the questions” in cooperation with other biologists.
  – Understanding the problem
  – Formulating the question as it relates to informatics
  – Prototyping the analysis process
  – Iterating on the form and content of the analysis
  – Applying the tools to more general cases
  – Dissemination of tools and documenting their use
  – Evaluation of effectiveness
  – Continuous evolution of tools and their functions

Java Cluster Viewer


Hierarchical Clustering


Outline for Today

• Administrative matters.
• Some Definitions and Scope
• A Spectrum of Computing Issues
  I. Programming Fundamentals
  II. Data Structures
  III. Algorithms
  IV. Systems and Networks
  V. Tools and Scripts
  VI. Databases

I. Programming Fundamentals

1. Problem Solving
2. Problem Specification
3. Top-down Design
4. Languages
5. Debugging/Performance Tuning
6. Testing
7. Maintenance

1. Problem Solving

(I. Programming Fundamentals)
• Since the late 60’s, the phrase “Problem Solving with Computers”
• The computer as a tool:
  1. Understand Problem
  2. Specification
  3. Design
  4. Implement
  5. Test
  6. Maintain

2. Problem Specification

(I. Programming Fundamentals)
• Can be informal, formal, or in between.
• A definition of Input/Output Relationships
• Uncovers and clarifies any ambiguous issues
• Involves interactions between end users and solution developers
• Ideally, produces a specification “document”
• Realistically, prototyping usually starts simultaneously with specification

3. Top-down Design

(I. Programming Fundamentals)
• Extremely important methodological philosophy.
• Develops a solution in successive phases of decreasing levels of abstraction
• Any problem can have a solution described (at some level of abstraction) on about a half sheet of paper.
• Aliases: modular programming, stepwise refinement (object oriented design is this philosophy with training wheels added)
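A small sketch of stepwise refinement, assuming a made-up task (reporting the base composition of a DNA string). The top function states the whole solution in three steps; each helper refines one step. All function names are hypothetical.

```python
def base_composition_report(sequence: str) -> str:
    """Level 1: the whole solution, stated in three abstract steps."""
    cleaned = clean(sequence)
    counts = count_bases(cleaned)
    return format_report(counts)


def clean(sequence: str) -> str:
    """Level 2: remove all whitespace and normalize to upper case."""
    return "".join(sequence.split()).upper()


def count_bases(sequence: str) -> dict:
    """Level 2: tally each of A, C, G, T (other symbols are ignored)."""
    return {base: sequence.count(base) for base in "ACGT"}


def format_report(counts: dict) -> str:
    """Level 2: render the tallies as one line of text."""
    return " ".join(f"{base}={n}" for base, n in counts.items())
```

The top-level function fits on far less than half a sheet of paper; the detail lives one level down, where each piece can be written and tested separately.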

4. Languages

(I. Programming Fundamentals)
• Choice of language can be important.
• Often, however, final choice of language can be as much a matter of subjective, personal choice as is a type of paint brush to an artist.
• Issues: acceptance (for maintenance), performance, portability

4. Languages

(I. Programming Fundamentals)
• Language types:
  – Procedural (C, Fortran, Pascal, Basic)
  – Object-oriented (C++, Smalltalk)
  – Functional, Declarative (LISP, Prolog)
  – String processing (SNOBOL)
• Deployment technologies:
  – Interpreted (Basic)
  – Compiled (C, Fortran, most languages)
• Run-time systems
  – Statically linked (real-time, older systems)
  – Dynamically linked (most modern environments)

5. Debugging/Performance Tuning

(I. Programming Fundamentals)
• The most unpredictable phase of the process
• Not a matter of luck, however
• Scientific principles are critical
  1. Formulate hypothesis
  2. Perform an experiment, examine results
  3. Make a single change
  4. Repeat at step 1.
• “Crash” testing, and obvious error finding
• Debugging tools can assist (gdb)

6. Testing

(I. Programming Fundamentals)
• Goes beyond crash testing
• Need to develop test sets
• Functional testing
• Structural testing
• Must struggle with specification now
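A hedged illustration of the two kinds of test set named above, using a hypothetical `reverse_complement` function as the code under test. Functional tests come from the specification (input/output pairs); structural tests are chosen by reading the code so that each path is exercised.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    try:
        return "".join(pairs[base] for base in reversed(seq.upper()))
    except KeyError as err:
        raise ValueError(f"unexpected base: {err.args[0]}") from None


def functional_tests() -> None:
    # Derived from the specification alone: known inputs, expected outputs.
    assert reverse_complement("ACGT") == "ACGT"
    assert reverse_complement("AAGC") == "GCTT"
    assert reverse_complement("") == ""


def structural_tests() -> None:
    # Derived from the code: the upper-casing path, the N entry, the error path.
    assert reverse_complement("acgn") == "NCGT"
    try:
        reverse_complement("AXG")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for an unknown base")
```

Writing the functional tests forces a decision about what the empty string or an unknown base should produce, which is where the struggle with the specification shows up.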

7. Maintenance

(I. Programming Fundamentals)
• The on-going, necessary update and debugging of “finished” software
• This step never ends
• Often, earlier steps ignore this phase for the sake of expediency
• Language choice, specification, modularization, all bear on this step

II. Data Structures

• A practical “framework” for holding data.
• Must consider input, intermediate, and possibly computed output data
• Impacts on:
  – Development time
  – Memory usage
  – Performance (execution time)
  – Maintainability

II. Data Structures (cont.)

• Scalar and array variables
• Static and dynamic structures
• Dense and sparse structures
• Linear and linked structures
• Lists, Stacks (LIFO), Queues (FIFO), Trees, Graphs, and Heaps
• Dynamic structure efficiency relies on OS interaction, and program “behavior”
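A short sketch of a few of these structures using Python's standard library; the function name and sample values are made up for illustration.

```python
from collections import deque
import heapq


def demo_structures():
    """Show LIFO, FIFO, heap, and dense-versus-sparse behavior on small inputs."""
    # Stack (LIFO): the last item pushed is the first one popped.
    stack = []
    for x in (1, 2, 3):
        stack.append(x)
    lifo = [stack.pop() for _ in range(3)]

    # Queue (FIFO): the first item enqueued is the first one dequeued.
    queue = deque()
    for x in (1, 2, 3):
        queue.append(x)
    fifo = [queue.popleft() for _ in range(3)]

    # Heap: always hands back the smallest remaining item.
    heap = []
    for x in (5, 1, 4):
        heapq.heappush(heap, x)
    smallest_first = [heapq.heappop(heap) for _ in range(3)]

    # Dense versus sparse: store every slot, or only the non-zero ones.
    dense = [0, 0, 7, 0, 0]
    sparse = {i: v for i, v in enumerate(dense) if v != 0}

    return lifo, fifo, smallest_first, sparse
```

The same three values come back in a different order from each container, which is the whole point of choosing one structure over another.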

III. Algorithms

• Control flow
• Template structures
• Complexity analysis

III. Algorithms (Control Flow)

• Sequential
• Alternation or selection: a test (?) chooses between Statement 1 and Statement 2
• Iteration or looping: a test (?) controls repetition of the Loop Body (Statement 1, Statement 2)
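The three control-flow forms can be seen together in one small function. This is only a sketch; `gc_fraction` is a hypothetical example, not something from the course materials.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    # Sequential: statements run one after another.
    sequence = sequence.upper()
    gc = 0

    # Iteration: the loop body runs once per base.
    for base in sequence:
        # Selection: the test decides whether the statement runs.
        if base in "GC":
            gc += 1

    # Selection with two alternatives.
    if len(sequence) == 0:
        return 0.0
    else:
        return gc / len(sequence)
```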

III. Algorithms (Template Structures)

• Divide and Conquer
• Greedy
• Backtracking
• Branch and Bound
• Searching (depth first, breadth first)
• Dynamic Programming
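As one concrete instance of these templates, here is a sketch of dynamic programming applied to the longest common subsequence of two strings, a textbook problem chosen for illustration (the slide itself names no specific problem). Each table cell is filled from already-solved smaller subproblems.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # table[i][j] holds the answer for the prefixes a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Matching characters extend the best answer for both shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise reuse the better of the two neighboring subproblems.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]
```

The two nested loops fill a table of size (len(a)+1) by (len(b)+1), so the running time grows with the product of the two lengths.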

III. Algorithms (Complexity Analysis)

    for i = 1 to 100
        for j = 1 to 50
            x[i] = a[i] + b[j]

Inner statement executes 50 x 100, or 5,000 times. If the outer loop executed n times, and the inner one n times, we would say that this “algorithm” had complexity O(n^2).

In some sense, as the problem size n grows, the execution time will grow as the square of n.
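The count can be checked directly. A minimal sketch that tallies how often the inner statement runs; the function name is hypothetical.

```python
def count_inner_executions(outer: int, inner: int) -> int:
    """Count how many times the innermost statement of a doubly nested loop runs."""
    count = 0
    for _ in range(outer):
        for _ in range(inner):
            count += 1  # stands in for x[i] = a[i] + b[j]
    return count
```

With `outer=100` and `inner=50` the count is 5,000. With both set to n, doubling n quadruples the count, which is what O(n^2) growth means in practice.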

IV. Systems and Networks

[Figure: Memory, holding DATA (Scalar, Array) and PROGRAMS, connected to the Processor]

The Processor cycle:
1. Fetch Instruction
2. Fetch Data
3. Execute Instruction
4. Store Data
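The four-step cycle can be imitated with a toy simulator. This is a sketch only: the instruction set (LOAD, ADD, STORE, HALT) and the single accumulator register are invented for illustration and do not correspond to any real processor.

```python
def run(program, data):
    """Run a list of (opcode, address) pairs against a list of data cells."""
    memory = list(data)        # DATA region of memory (copied, caller's list is untouched)
    accumulator = 0            # the processor's single working register
    pc = 0                     # program counter: index of the next instruction

    while pc < len(program):
        opcode, address = program[pc]   # 1. fetch instruction
        pc += 1
        if opcode == "LOAD":
            accumulator = memory[address]    # 2. fetch data
        elif opcode == "ADD":
            operand = memory[address]        # 2. fetch data
            accumulator += operand           # 3. execute instruction
        elif opcode == "STORE":
            memory[address] = accumulator    # 4. store data
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory
```

For example, `run([("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", 0)], [2, 3, 0])` adds the first two cells and stores the sum in the third.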

IV. Systems and Networks (cont.)

• Tools and Applications
• Libraries and Languages
• Network Operating System
• Peripherals: Disks, etc.
• CPU/Memory
• Local Operating System

IV. Systems and Networks (cont.)

[Figure: several computers, each with its own CPU, Memory, and Disk, attached to a shared Network Medium]

• Many Possible Media: Physical and Protocols
• Functional Variants: message passing, shared files, shared memory
• Security issues: protecting data, allowing sharing
• Heterogeneous Operating Systems

V. Tools and Scripts

• Tools:
  – Debugging
  – Performance Tuning
  – Administration
• Scripting:
  – Programs of “shell” commands
  – “Glue” to allow other programs to work together, and manipulate whole files (of sequence, for example) as simple data objects
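A sketch of a glue script that treats whole sequence files as simple data objects, merging them without looking inside. The course itself uses shell and Perl for this; Python is swapped in here only for illustration, and the function name and the `*.fa` pattern are assumptions.

```python
from pathlib import Path
import shutil


def merge_sequence_files(source_dir, merged_path, pattern="*.fa"):
    """Concatenate every file matching pattern in source_dir into merged_path.

    Returns the names of the input files, in the order they were appended.
    """
    merged_path = Path(merged_path)
    # Collect inputs first, skipping the output file if it already exists there.
    inputs = sorted(
        p for p in Path(source_dir).glob(pattern)
        if p.resolve() != merged_path.resolve()
    )
    with merged_path.open("wb") as out:
        for path in inputs:
            with path.open("rb") as src:
                shutil.copyfileobj(src, out)  # copy the file as an opaque blob
    return [p.name for p in inputs]
```

The script never parses a sequence; it only moves files around. Input files are assumed to end with a newline, since otherwise the last line of one file would run into the first line of the next.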

VI. Databases

• Pile o’ data
• Stored on large non-volatile media (e.g. disk system), local vs. networked.
• Table Structures
• Primary key for each item
• Strength is “relational” query methods
  – SQL – structured query language
  – “retrieve from table X where name like ‘Joe’ and age equal 32”
  – Insert, delete, update, etc.
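The query on the slide can be tried with Python's built-in `sqlite3` module. A minimal sketch: the table name `people`, its columns, and the sample rows are made up for illustration.

```python
import sqlite3


def demo_query():
    """Build a small in-memory table, modify it, and run the slide's query."""
    conn = sqlite3.connect(":memory:")
    # Table structure with a primary key for each item.
    conn.execute(
        "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    # Insert
    conn.executemany(
        "INSERT INTO people (name, age) VALUES (?, ?)",
        [("Joe Smith", 32), ("Joe Jones", 45), ("Ann Lee", 32)],
    )
    # Update and delete
    conn.execute("UPDATE people SET age = 33 WHERE name = ?", ("Ann Lee",))
    conn.execute("DELETE FROM people WHERE name = ?", ("Joe Jones",))
    # "retrieve from table X where name like 'Joe' and age equal 32"
    rows = conn.execute(
        "SELECT id, name, age FROM people WHERE name LIKE ? AND age = ?",
        ("Joe%", 32),
    ).fetchall()
    conn.close()
    return rows
```

The `%` in `Joe%` is SQL's wildcard, so the query matches any name that starts with "Joe".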

Fin
