Bioinformatics lectures at Rice University


Li Zhang
Lecture 1
Department of Bioinformatics and Computational Biology
MD Anderson Cancer Center
March-April, 2015
Contact information
• Li Zhang
• Phone: 713-563-4298 (office)
713-962-6661 (cell)
• Email: [email protected]
• URL: http://odin.mdacc.tmc.edu/~llzhang/RiceCourse/
• Office location: FCT4.5034. Pickens Tower, 4th
floor, MD Anderson Cancer Center.
Homework
• There will be 2-3 assignments posted online.
• All students are required to complete the assignments.
Homework will be submitted at the beginning of class on
the due date.
• If circumstances beyond the student’s control arise and an
assignment cannot be submitted on the due date, an
instructor should be contacted prior to the due date. With
an instructor’s permission, late homework may be accepted
within one week of the due date.
• All decisions will be made on an individual student basis
and the final decision rests with the instructor assigning the
homework. A penalty of 10 percentage points will be
applied to late homework.
The Cancer Genome Atlas Project
What is bioinformatics?
• Bioinformatics is the application of computer science
and information technology to the field of biology and
medicine. Bioinformatics deals with algorithms,
databases and information systems, web technologies,
artificial intelligence and soft computing, information
and computation theory, software engineering, data
mining, image processing, modeling and simulation,
signal processing, discrete mathematics, control and
system theory, circuit theory, and statistics, for
generating new knowledge of biology and medicine,
and improving & discovering new models of
computation (e.g. DNA computing, neural computing,
evolutionary computing, immuno-computing, swarm-computing,
cellular-computing).
• Commonly used software tools and technologies in this
field include Java, XML, Perl, C, C++, Python, R, MySQL,
noSQL, CUDA, MATLAB, and Microsoft Excel.
Focus area of this course
• Reference book: Pierre Baldi’s
“Bioinformatics: The Machine Learning Approach”
and a few key papers.
• Introducing high throughput technologies that
provide the data.
• Machine learning algorithms and models to
visualize and explore large datasets and to identify
patterns & relationships.
• Computing language: R/Perl.
• Database: non-relational (NoSQL) databases.
• Not covered: web applications and structural
biology.
Why should we study bioinformatics?
Why is it important to study bioinformatics?
Let us see a few growth charts …
There are 187 billion bases in 171 million sequence
records in the traditional GenBank as of February 2015.
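To put the GenBank totals in perspective, a quick back-of-the-envelope calculation gives the average record length. This is only a sketch: the two totals are the figures quoted on the slide, and the function name `mean_record_length` is made up for illustration.

```python
def mean_record_length(total_bases, n_records):
    """Average number of bases per sequence record."""
    return total_bases / n_records

# Totals quoted on the slide (traditional GenBank, February 2015)
genbank_bases = 187_000_000_000   # 187 billion bases
genbank_records = 171_000_000     # 171 million sequence records

avg = mean_record_length(genbank_bases, genbank_records)
print(f"Average record length: about {avg:.0f} bases")
```

The mean comes out near 1,100 bases per record. It says nothing about the spread: record lengths vary widely, so the figure is only a rough scale.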
Growth chart of GEO (RNA, etc.)
The Gene Expression Omnibus (GEO) database holds over
10,000 experiments comprising 300,000 samples and
16 billion individual abundance measurements, for over
500 organisms, submitted by 5,000 laboratories from
around the world. The database typically receives over
60,000 query hits and 10,000 bulk FTP downloads per
day, and has been cited in over 5,000 manuscripts.
Distribution of the number and types of selected studies released by GEO each year since
inception.
Tanya Barrett et al. Nucl. Acids Res. 2013;41:D991-D995
Published by Oxford University Press 2012.
Growth of PDB (Protein Structures)
The Protein Data Bank (PDB) is a
repository for the 3-D structural
data of large biological molecules,
such as proteins and nucleic acids.
Most structures are determined by
X-ray diffraction, but about 15% of
structures are determined by
NMR.
Large scale organized efforts by
Structural Genomics Initiative and
International Structural Genomics
Consortium have greatly
accelerated the pace of growth.
A brief history of the big bang of the digital universe
The age of big data
“The story is similar in fields as varied as science and
sports, advertising and public health — a drift toward
data-driven discovery and decision-making. It’s a
revolution. We’re really just getting under way. But the
march of quantification, made possible by enormous
new sources of data, will sweep through academia,
business and government. There is no area that is
going to be untouched.”
-------- Steve Lohr, “The Age of Big Data”, The New York
Times, 2012.
What is big data?
3 Vs of big data:
• High volume
• High velocity
• High variety
--- A definition of big data, Gartner, Inc.
Simply put, it is big and complex.
The big value of big data
The value of big data is that its analysis can lead to
(1) enhanced decision making,
(2) insight discovery, and
(3) process optimization.
In business, big data can help to identify unknown
needs, customize advertising, and monitor and
evaluate operations, which leads to big profits and
big savings. In science, big data is a huge resource
for scientific discovery.
A brief introduction to molecular biology
DNA
James Watson and Francis Crick
Next generation sequencing
The cost of sequencing has fallen 100,000-fold
in the past 12 years
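As a rough worked example, the 100,000-fold drop over 12 years can be converted into a halving time. The sketch below assumes the cost fell at a constant exponential rate, which is a simplification; the helper name `halving_time_years` is made up for illustration.

```python
import math

def halving_time_years(fold_reduction, years):
    """Years needed for the cost to halve, assuming a constant
    exponential rate of decline over the whole period."""
    return years / math.log2(fold_reduction)

# Figures from the slide: 100,000-fold reduction in 12 years
t = halving_time_years(100_000, 12)
print(f"Cost halves roughly every {t * 12:.1f} months")
# prints: Cost halves roughly every 8.7 months
```

For comparison, Moore's law is usually stated as a doubling about every two years, so under this assumption sequencing cost has been falling considerably faster.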
Data explosion in the era of genomics
• There has been a large series of breakthroughs
in micro-electronics and nano-electronics that
have produced instruments that quantify and/or
characterize large numbers of biological molecules
in parallel using very small amounts of biomaterial.
• Such technical advances have made it possible to
comprehensively characterize and quantify the
building blocks (DNA, RNA, protein) in a
biological system.
Think Google …
Or, think Netflix.
Bioinformatics is the key in genomics
Genome, genomics and post genomic era
List of sequenced genomes of mammals:
Type
Cow
Dog
Guinea Pig
Nine-banded Armadillo
Hedgehog-Tenrec
Horse
Western European Hedgehog
Cat
Human
African Elephant
Rhesus Macaque
Gray Mouse Lemur
Gray Short-tailed Opossum
Mouse
Little Brown Bat
American Pika
Platypus
Rabbit
Small-eared Galago, or Bushbaby
Chimpanzee
Orangutan
Rat
European Shrew
Thirteen-lined Ground Squirrel
Domestic pig
Northern Tree Shrew
Genome size
3.0 Gb
2.4 Gb
3.4 Gb
3.0 Gb
Year of completion
2009
2005
2.1 Gb
2007
3 Gb
3.2 Gb
3 Gb
2007
2001
3.5 Gb
2.5 Gb
2007
2002
2.5 Gb
3.1 Gb
3.0 Gb
2.8 Gb
3.0 Gb
2005
2004
2009
Large Projects
• TCGA: The Cancer Genome Atlas
• 1000 Genomes Project
• 1001 Genomes Project
• ICGC: International Cancer Genome Consortium
• The International HapMap Project
• …
Data → Information → Knowledge/power
Bioinformatics provides tools to catalyze the transformations
Ion semiconductor sensing
Ion Torrent