Transcript Slide 1

Interdisciplinary Introductory Course in Bioinformatics

Yana Kortsarts

Computer Science Department

Robert Morris

Biology Department

Janine Utell

English Department, Widener University, Chester, PA

What is Bioinformatics?

Bioinformatics is a relatively new interdisciplinary field that integrates computer science, mathematics, biology, and information technology to manage, analyze, and understand biological, biochemical, and biophysical information.

Bioinformatics is a computational science and a subset of the larger field of computational biology.

Motivation

- IS professionals must have strong analytical and critical thinking skills (IS 2002 Model Curriculum and Guidelines for Undergraduate Degree Programs in IS).
- Introducing bioinformatics to CIS students will strengthen these required skills.
- The course equips students with some of the following capabilities, as suggested in the IS 2002 guidelines:
  - Creativity
  - Application of both traditional and new concepts and skills
  - Application development
  - Problem-solving abilities
  - Ability to communicate effectively (oral, written, and listening)

Motivation

- Provides opportunities for students to become familiar with Python, one of the most widely used scripting languages.
- Lets students explore data structures and algorithmic techniques that are not traditionally covered in other courses.
- Helps students connect theoretical topics learned in core CS and CIS courses, such as Data Structures and Algorithms, and apply that knowledge to real-world biology problems.
- Helps diversify the department's course offerings and provides interdisciplinary opportunities for CS and CIS students.
- A bioinformatics background enhances the employment qualifications of CIS and CS students in a competitive job market.

Challenges

- Students have different backgrounds
- Choosing a programming language
- Defining course prerequisites
- Defining course content
  - Programming
  - Algorithms
  - End-user bioinformatics tools
- Balanced course
  - Content
  - Hands-on/lecture
- Interdisciplinary nature

      

Course Development

- First iteration: Spring 05; second iteration: Spring 08
- Cross-listed upper-level technical elective
- Prerequisites:
  - Biology/Chemistry/Biochemistry majors: Introduction to Molecular Biology
  - CS/CIS/MATH majors: Introduction to Computer Science I
  - Chemical Engineering majors: Computer Programming and Engineering Problem Solving
- Team teaching: Biology and Computer Science faculty
- 4 credits, 6 hours: 4 CS, 2 Biology
- Spring 05 enrollment: 6 Biology students, 6 CS/CIS students
- Spring 08 enrollment: 6 CS/CIS students

Course Objectives and Goals

- To integrate bioinformatics algorithms into the course and to teach the foundations of the algorithms and important results in bioinformatics
- To introduce students to the Python programming language. The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

Course Objectives and Goals

- To introduce students to the principles that drive an algorithm's design and to the intellectual content of bioinformatics
- To provide an opportunity for interdisciplinary collaboration in the in-class assignments and the course project

      

Course Curriculum

Ethics, Computing and Genomics

- Project-oriented component, new for Spring 08
- 20% of the final grade; students had three weeks to work on the project
- Goal: to develop oral and written communication skills and to engage students in the knowledge exchange process
- Students learned about the ethics, computing, and genomics topic independently and presented the results of their self-learning.
- Students were assigned one or more scholarly articles from the collection Ethics, Computing, and Genomics, edited by Herman Tavani.
- Students were required to:
  - read the assigned essays
  - prepare a 25-minute PowerPoint presentation with a summary of the paper and answers to the questions posed in the introductory part of the corresponding section
  - prepare a mini-quiz to assess their peers' understanding of the presented material

Ethics, Computing and Genomics

- Collaborative work with an English faculty member
- The English faculty member gave a short presentation before students started work on this assignment:
  - discussion of how to read critically and what questions to ask while reading the text
  - discussion of how to summarize the paper, using the structure of the essay as a guide and elucidating key points and key moments of evidence while making connections to the rest of the class material
  - tips on writing the summary in three steps: prewriting, drafting, revising
  - discussion of how to design an effective presentation of information
- The English faculty member was present at all oral presentations and provided detailed notes for each student, pointing out the positive and negative aspects of the presentation and explaining ways it could have been stronger.
- This successful and enjoyable experience showed the value of working with colleagues across disciplines to further student learning.

Course Curriculum:

Introduction to Python

- Students were introduced to Python quickly during the first few weeks of the course.
- Students worked on different algorithmic problem-solving techniques.
- Introductory topics: arithmetic, decision and loop structures, functions, and simple manipulations of strings, lists, tuples, and dictionaries.
- Advanced Python topics were taught later throughout the course, building students' knowledge and their ability to tackle real-world biology problems.
- The programming examples were all biology-oriented and motivated students to learn in order to solve practical problems.

Introduction to Python

- Spring 05: 6 biology students with no programming experience and 6 CS and CIS students with no experience in Python.
- Spring 05: 6 interdisciplinary teams; all concepts were practiced within the team.
- Spring 08: all students were CS and CIS majors with prior programming experience in C and Java, and some had introductory knowledge of Python.
- Special handouts were prepared to walk students from the introductory topics toward advanced Python concepts.
- Each topic was supported by a list of examples in increasing order of complexity.
- Students were required to run the proposed programs in order to gain an understanding of basic Python structures.
- To assess the understanding of each concept, students were required to write short programs solving biology-oriented problems.

Introduction to Python

Examples of the problems, given here in increasing level of complexity:

- computing the alignment score between two DNA sequences using different score matrices
- finding the maximal alignment score, using different score matrices, if no internal gaps are allowed
- finding all occurrences of one sequence in another sequence
- writing a program that reads a DNA sequence, transcribes the DNA into RNA and prints the resulting RNA sequence, and then translates the RNA into a protein sequence: first the program divides the RNA into codons and prints the list of codons, and then the codons are translated into the protein using the genetic code table
- finding the maximal alignment score, using different score matrices, if internal gaps are allowed
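A few of the introductory problems above can be sketched in a handful of lines. This is only an illustrative sketch, not the course's handout code: the function names are hypothetical, a simple match/mismatch rule stands in for a score matrix, transcription is modeled as replacing T with U on the coding strand, and translation assumes the standard genetic code and stops at the first stop codon.

```python
# Standard genetic code, built from the usual U, C, A, G ordering of bases.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def ungapped_score(a, b, match=1, mismatch=0):
    """Score two equal-length sequences position by position (no gaps)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(match if p == q else mismatch for p, q in zip(a, b))


def find_all(pattern, text):
    """Return every start index where pattern occurs in text (overlaps included)."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def transcribe(dna):
    """Assumption: the input is the coding strand, so transcription is T -> U."""
    return dna.upper().replace("T", "U")


def split_codons(rna):
    """Divide RNA into complete three-letter codons, dropping any leftover bases."""
    return [rna[i:i + 3] for i in range(0, len(rna) - 2, 3)]


def translate(rna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for codon in split_codons(rna):
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `transcribe("ATGGCCTAA")` gives `"AUGGCCUAA"`, which splits into the codons `AUG`, `GCC`, `UAA` and translates to `"MA"`.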

Introduction to Python

- Spring 08: students had different levels of programming and computational experience, and the best way to cover this topic was through independent learning.
- Using handouts and the Python and Biopython tutorials (www.python.org, www.biopython.org), students each worked at their own pace.
- Grading rubrics were provided for each programming concept, along with the minimal requirements to pass that concept and a list of more advanced examples for students with prior Python experience.
- Students with previous Python knowledge advanced their experience further, and students new to Python learned the new programming language independently using structured guidance.
- Python makes it possible to solve some problems very concisely, and students enjoyed trying to find the shortest solution to the proposed problems using Python functions and libraries.

Introduction to Bioinformatics Algorithms

- Sequence alignments, scoring, gaps
- Algorithm design techniques: exhaustive search, dynamic programming
- The Needleman and Wunsch algorithm
- The Smith-Waterman algorithm
- Introduction to BLAST
- Introduction to multiple sequence alignments
- Visualization of algorithms: ALGGEN – EMBER Web resources

Introduction to Bioinformatics Algorithms

- The dynamic programming technique is usually not covered in a core algorithms course.
- The topic provided an opportunity to expand the theoretical background and to make connections between theory and practice.
- It helped maintain the level of theoretical content required for upper-level elective courses in our department.
- The topic blended well with the biology topics, and students had an opportunity to learn the concept of sequence alignment from both the biology and the computer science points of view.
- The EMBER website provides a suite of multimedia bioinformatics educational tools that can be used to create hands-on activities that help students understand the dynamic programming technique in general and specific algorithms in particular.

Course Curriculum – Biology Topics

- Biological research on the Web
  - Public biological databases and data formats
  - NCBI - National Center for Biotechnology Information
  - Searching biological databases
- Review of molecular biology and biochemistry concepts
  - DNA and protein structure
  - Gene expression (transcription and translation)
  - Molecular biology central dogma
- Sequence alignments

Hands-On Activities

- Microbes Count! (BioQUEST Curriculum Consortium)
- Exploring HIV Evolution: An Opportunity for Research
  - The HIV genome is very small and relatively simple. It is made up of nine genes and about 9,500 nucleotides.
  - In this lab students worked with HIV sequence data collected from 15 individuals from an intravenous-drug-using population in Baltimore.
  - The goal of the study was to determine whether the HIV isolated from particular subgroups of subjects derives from a common source.
- CLUSTALW multiple sequence alignment tool
- Biology Workbench: http://workbench.sdsc.edu/

Biology Topics – Hands-On Activities

- Microarray lab "DNA Chips: Genes to Disease", developed by Campbell and Heyer and sold by Carolina Biologicals
  - Understanding how microarrays are used to identify gene changes in disease and the role of gene expression in cancer.
  - Students compared the relative expression levels of six different genes in healthy lung cells and lung cancer cells.
  - After completing the lab, students had an opportunity to discuss the significance of the relative expression levels with respect to the genes' roles in causing cancer.

Biology Topics – Hands-On Activities

- Epidemiology: the study of the distribution of diseases in populations
  - Students explored factors that influence the spread of disease through populations using the software Epidemiology.
  - Ebola was used as a model organism, and epidemiology was presented from both a microbiological and a social perspective.
- Exploration of the structure and function of insulin
  - Students generated a phylogenetic tree demonstrating the evolution of insulin among the vertebrates (animals with an internal skeleton made of bone).

Project: DNA Sequence Annotation

- Real data: Bacillus anthracis str. Ames, a J. Craig Venter Institute project
- Input DNA: about 50,000 nucleotides long; students worked on different sequences from the same organism.
- Project steps:
  - Find all potential genes and pseudo-genes in the input DNA sequence using start and end codons, and arrange the sequences found in two separate lists, each in order of increasing length: potential genes (length greater than 300) and pseudo-genes (length less than 300).
  - Locate the potential promoter in the given DNA sequence for each potential gene found in the first step, and calculate the strength of the promoter. A promoter is a region of DNA near the beginning of a gene that controls if and when the gene is actually expressed.
  - Output: list the potential genes in order of decreasing promoter strength.
  - BLAST all potential genes and pseudo-genes that were found, and analyze the results.
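The first project step can be sketched as a scan for start and stop codons. The slides do not spell out the exact rules, so everything below is an assumption made for illustration: `find_orfs` and `classify` are hypothetical names, ATG is taken as the start codon and TAA, TAG, TGA as the end codons, only the given strand is scanned (all three reading frames, no overlapping candidates within a frame), and a sequence of exactly 300 nucleotides is placed with the pseudo-genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, start_codon="ATG"):
    """Return (start_index, sequence) for each start-to-stop stretch in each frame."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == start_codon:
                # Walk codon by codon until the first in-frame stop codon.
                for k in range(i + 3, len(dna) - 2, 3):
                    if dna[k:k + 3] in STOP_CODONS:
                        orfs.append((i, dna[i:k + 3]))
                        i = k  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


def classify(orfs, threshold=300):
    """Split candidates into potential genes and pseudo-genes, shortest first."""
    by_length = sorted(orfs, key=lambda orf: len(orf[1]))
    genes = [orf for orf in by_length if len(orf[1]) > threshold]
    # Assumption: a length equal to the threshold counts as a pseudo-gene.
    pseudo_genes = [orf for orf in by_length if len(orf[1]) <= threshold]
    return genes, pseudo_genes
```

A fuller version would also scan the complementary strand; the course feedback notes that the rules for the main and complement strands needed to be stated clearly.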

Project

- Summary: for each potential gene and pseudo-gene: start position, length, promoter score, BLAST results, summary, and conclusion.
- For each sequence, we asked students to determine whether a potential gene could be a real gene based on the strength of the promoter and the BLAST results.
- 15-minute in-class presentations covered the Python program, a description of all Python functions used and the purpose of each function, all algorithms and programming techniques used, and an explanation of the summary results, including information about the specific organism whose DNA was used as the input.
- Spring 05: team project, in teams of CS/CIS and biology students. The programming part of the project was mostly done by the computer science students, and the biology students were required to understand and explain the programming techniques and algorithms that were used. The project made truly interdisciplinary collaboration between computer science and biology students possible.
- Spring 08: individual project.

   

Course Results

Spring 05: no formal assessment survey was conducted. An informal discussion about the course was conducted at the end of the semester and we asked students to provide their feedback. Students completed teaching evaluations and provided their comments there as well. All students showed satisfaction with the course and we were very pleased to receive the request to extend the programming component of the course from almost all students. Biology students showed interest in programming and asked that an environment be created where they would be able to more fully participate in all stages of the course project.

Spring 08: a short post-survey was designed to assess the students' experience. It listed the topics covered in the course, and we asked students to rate their level of learning for each topic on a scale of 1 (not well) to 5 (very well). Six students were enrolled in Spring 08.

Topic: Average

1. Introductory Python and ability to design simple Python programs: 3.7
2. Advanced Python topics: functions, loops, if-else statements, string manipulations, lists, and list manipulations: 3.5
3. Designing complex Python programs using advanced Python features: 3.3
4. Understanding the concept of sequence alignment: global, local, semi-global, multiple sequence alignment: 3.2
5. Understanding dynamic programming algorithmic technique: 3.7
6. Understanding Exhaustive Search (brute force) algorithmic technique: 4
7. Understanding Needleman-Wunsch algorithm and be able to trace the algorithm to produce the final result: 3.8
8. Understanding Smith-Waterman algorithm and be able to trace the algorithm to produce the final result: 3.8
9. The ability to work independently on the research-based project applying computer science and biology knowledge to solve problems: 4.3
10. Understanding how to use BLAST tool and to read the results of BLAST: 4.2
11. Using sequence alignments to understand relatedness among species: 3.8
12. Using sequence alignments forensically (HIV experiment): 4.2
13. Understanding how microarrays are used to identify gene changes in disease: 3.3
14. Understanding the flow of information from DNA to protein: 3.3
15. Using computer simulations to test hypotheses about disease spread: 3.8
16. The ability to read a research paper in the Ethics, Computing and Genomics: 3.8
17. The ability to communicate effectively through the participation in the Ethics, Computing and Genomics project: 4
18. The ability to create an informative PowerPoint presentation to present the results of the Ethics, Computing and Genomics project: 4.3
19. The ability to learn the topic by yourself and the ability to present results of learning in clear way: 3.8

Course Results

- All topics were learned at an above-average level.
- Some of the topics will require special attention and should be revised for future iterations.
- The Ethics, Computing and Genomics component received positive feedback from most of the students.
- Comments on the course: most of the students said that they loved the course and would recommend it to their peers; they were satisfied with the level of the course, the amount of material covered, and the depth of the coverage.
- Students also said that the final project was very interesting, but they asked for a more careful project description with clear rules for finding genes on the main and complement strands, to avoid confusion.

      

Future Plans

- Blend computer science and biology topics more effectively
- Invite a guest speaker from the field
- The teaching approach will try to foster student learning through a research-based process.
- Further expand the programming and algorithms component
- Return to team work in the project in order to enhance the collaborative component of the course
- Provide a more careful project description
- Bioinformatics across the computer science curriculum:
  - Introduction to computer science
  - Design and analysis of algorithms
  - Programming for non-majors

References

1. N. C. Jones and P. A. Pevzner, An Introduction to Bioinformatics Algorithms, The MIT Press, 2004
2. D. E. Krane and M. L. Raymer, Fundamental Concepts of Bioinformatics, Benjamin Cummings, 2002
3. C. Gibas and P. Jambeck, Developing Bioinformatics Computer Skills, O'Reilly, 2001
4. Python/Biopython websites: http://python.org, http://biopython.org
5. ALGGEN – EMBER Web Resources: http://alggen.lsi.upc.es/docencia/ember/frame-ember.html
6. J. R. Jungck, E. D. Stanley, and M. F. Fass, Microbes Count!, BioQUEST Curriculum Consortium. http://bioquest.org/microbescount/modules_by_tools.pdf
7. L. J. Heyer and A. M. Campbell, Microarray Lab: DNA Chips: Genes to Disease, Carolina Biologicals
8. N. Campbell and J. Reece, Biology, 7th edition, Benjamin Cummings, 2004

BLAST

- The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
- The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
- BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.
- Introduced by S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman in the early 1990s.
- The original BLAST algorithm searches a sequence database for maximal ungapped local alignments.

Sequence Alignment

Global alignment: compares two sequences in their entirety; the gap penalty is assessed regardless of whether gaps are located internally within a sequence or at the end of one or both sequences. Algorithm: the Needleman and Wunsch algorithm.

Local alignment: finds the best-matching subsequences within the two search sequences. Algorithm: the Smith-Waterman algorithm.

Sequence Alignment

Semi-global alignment: terminal (end) gaps are treated differently from internal gaps. Terminal gaps are usually the result of incomplete data and do not have biological significance. Example: searching for the best alignment between a short sequence and an entire genome. Algorithm: a modification of the Needleman and Wunsch algorithm.
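One common way to realize the "different treatment of terminal gaps" is to make them free. The sketch below assumes exactly that and is not taken from the slides: the boundary cells are initialized to 0 so leading gaps cost nothing, and the result is the best value in the last row or last column so trailing gaps cost nothing either. The function name `semi_global_score` and the simple match/mismatch scoring are hypothetical stand-ins for a scoring matrix.

```python
def semi_global_score(x, y, match=1, mismatch=0, gap=-1):
    """Best alignment score of x and y when terminal gaps are not penalized."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at 0: leading terminal gaps are free.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # x_i aligned to y_j
                F[i - 1][j] + gap,    # x_i aligned to an internal gap
                F[i][j - 1] + gap,    # y_j aligned to an internal gap
            )
    # Trailing terminal gaps are free: take the best cell on the last row or column.
    return max(max(F[m]), max(row[n] for row in F))
```

With these defaults, aligning the short sequence `ACG` against `TTACGTT` scores 3, because the unmatched ends of the longer sequence are not charged.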

Algorithm Design Techniques

Exhaustive search (brute force): the algorithm examines every possible alternative to find one particular solution.

Dynamic programming: the algorithm breaks the problem into smaller sub-problems and uses the solutions of the sub-problems to construct the solution of the larger problem.

Needleman and Wunsch Algorithm

Input: two strings $X = x_1 \dots x_M$ and $Y = y_1 \dots y_N$, and scoring rules: a scoring matrix $s$ and a gap penalty GP.

Output: an alignment of $X$ and $Y$ whose score, as defined by the scoring rules, is maximal among all possible alignments of $X$ and $Y$.

Let $F(i, j)$ be the optimal score of aligning $x_1 \dots x_i$ and $y_1 \dots y_j$.

Initialization: $F(0, 0) = 0$, $F(i, 0) = i \cdot GP$ for $i = 1, \dots, M$, and $F(0, j) = j \cdot GP$ for $j = 1, \dots, N$ (that is, $-i$ and $-j$ when GP = -1).

Main iteration: for each $i = 1, \dots, M$ and $j = 1, \dots, N$:

$$
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) & \text{(case 1)} \\
F(i-1, j) + GP & \text{(case 2)} \\
F(i, j-1) + GP & \text{(case 3)}
\end{cases}
$$

$$
\mathrm{TraceBackP}(i, j) =
\begin{cases}
\text{diagonal} & \text{if case 1} \\
\text{left} & \text{if case 2} \\
\text{up} & \text{if case 3}
\end{cases}
$$

Termination: $F(M, N)$ is the optimal score.

Finding the optimal alignment:

- Every non-decreasing path from $(0, 0)$ to $(M, N)$ corresponds to a global alignment of the two sequences.
- Use TraceBackP starting at $(M, N)$ to trace back an optimal alignment.
- case 1: $x_i$ aligns to $y_j$
- case 2: $x_i$ aligns to a gap
- case 3: $y_j$ aligns to a gap

 

Global Alignment Example

Find the optimal global alignment of AACT and ACG.

Scoring rules: match = 1, mismatch = 0, gap penalty GP = -1

             A    C    G
        0   -1   -2   -3
   A   -1    1    0   -1
   A   -2    0    1    0
   C   -3   -1    1    1
   T   -4   -2    0    1

Optimal alignments:

Alignment 1 (score = 1):
   A A C T
   | | | |
   - A C G

Alignment 2 (score = 1):
   A A C T
   | | | |
   A - C G
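The global alignment example can be reproduced with a short dynamic programming sketch. It is illustrative only: the function name `needleman_wunsch` is hypothetical, a match/mismatch rule stands in for the scoring matrix s, and when several moves are optimal the traceback prefers the diagonal move, so it returns just one of the optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    # Boundary: aligning a prefix against nothing costs one gap per letter.
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # case 1: x_i aligns to y_j
                F[i - 1][j] + gap,    # case 2: x_i aligns to a gap
                F[i][j - 1] + gap,    # case 3: y_j aligns to a gap
            )
    # Trace back from (M, N) to (0, 0), re-deriving which case produced each cell.
    aligned_x, aligned_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))
```

Calling `needleman_wunsch("AACT", "ACG")` returns a score of 1 with the alignment `AACT` over `-ACG`, which is Alignment 1 in the example; a different tie-breaking order in the traceback would yield Alignment 2.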

Smith-Waterman Algorithm

Input: strings $X$ and $Y$, and scoring rules: a scoring matrix $s$ and a gap penalty GP.

Output: substrings of $X$ and $Y$ whose global alignment score, as defined by the scoring rules, is maximal among all global alignments of all substrings of $X$ and $Y$.

Initialization: $F(0, 0) = 0$, $F(i, 0) = 0$ for $i = 1, \dots, M$, and $F(0, j) = 0$ for $j = 1, \dots, N$.

Main iteration: for each $i = 1, \dots, M$ and $j = 1, \dots, N$:

$$
F(i, j) = \max
\begin{cases}
0 \\
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) + GP \\
F(i, j-1) + GP
\end{cases}
$$

The largest value of $F(i, j)$ represents the score of the best local alignment of $X$ and $Y$.

Traceback begins at the highest score in the matrix and continues until a cell with score 0 is reached.

Local Alignment Example

Find the optimal local alignment of AACT and ACG.

Scoring rules: match = 1, mismatch = 0, gap penalty GP = -1

            A   A   C   T
        0   0   0   0   0
   A    0   1   1   0   0
   C    0   0   1   2   1
   G    0   0   0   1   2

Solution: local alignment score = 2

   A C
   | |
   A C
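The local alignment example can be checked with a similar sketch. As before, this is an illustration under stated assumptions rather than the course's code: `smith_waterman` is a hypothetical name, a match/mismatch rule replaces the scoring matrix, the traceback starts at the first cell (in row order) that holds the maximum, and diagonal moves are preferred on ties.

```python
def smith_waterman(x, y, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                0,                    # start a fresh local alignment here
                F[i - 1][j - 1] + s,  # x_i aligns to y_j
                F[i - 1][j] + gap,    # x_i aligns to a gap
                F[i][j - 1] + gap,    # y_j aligns to a gap
            )
            if F[i][j] > best:
                best, best_i, best_j = F[i][j], i, j
    # Trace back from the highest-scoring cell until a cell with score 0.
    aligned_x, aligned_y = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))
```

`smith_waterman("AACT", "ACG")` returns a score of 2 with `AC` aligned to `AC`, matching the solution above. Because mismatch = 0 here, the matrix contains the value 2 in two cells, so another starting rule for the traceback could report a different alignment with the same score.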