Machine Learning and Data Mining: A Case Study with Enterotypes
Gabe Al-Ghalith
Jimmy Reeve
Chapter 28, data mining
http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-047-computational-biology-genomesnetworks-evolution-fall-2008/lecture-notes/MIT6_047f08_lec04_slide04.pdf
Choosing Between Clustering and Classification
● Clustering: summarize big data without a priori hypotheses
● How would you categorize people based on their:
– Blood type?
– Gut bacteria?
● Blood type calls for classification
– Consensus on blood groups: A, B, AB, O
● Gut bacteria call for clustering
– No consensus on types or even number of categories
http://www.nytimes.com/2011/04/21/science/21gut.html?_r=2&scp=2&sq=bacteria&st=cse&
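The contrast between the two tasks can be sketched in a few lines of Python. This is an illustrative toy, not something from the slides: the antigen-based rule, the helper names (classify_blood_type, nearest, kmeans), and the two-feature abundance values are all assumptions made for the example, and the clustering is a bare-bones k-means with hand-picked starting centers.

```python
# Classification: the categories (A, B, AB, O) are agreed on in advance,
# so assigning a person to one is a matter of applying known rules.
def classify_blood_type(has_a_antigen, has_b_antigen):
    if has_a_antigen and has_b_antigen:
        return "AB"
    if has_a_antigen:
        return "A"
    if has_b_antigen:
        return "B"
    return "O"


# Clustering: no labels are given; the groups are inferred from the data.
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))


def kmeans(points, centers, iterations=10):
    """Bare-bones k-means: returns one cluster index per point."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return [nearest(p, centers) for p in points]


# Made-up relative abundances of two genera in six subjects.
samples = [(0.8, 0.1), (0.7, 0.2), (0.75, 0.15),
           (0.1, 0.8), (0.2, 0.7), (0.15, 0.85)]
labels = kmeans(samples, centers=[samples[0], samples[3]])
```

The classifier needs the answer key up front; the clustering routine only needs the number of groups to look for, which for gut bacteria is itself an open question.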
Reasons to Consider Gut Bacteria
● Contribute to diseases and to how patients respond to treatments
● Play protective and digestive roles
● Humans have hundreds of genes involved in handling these bacteria
● NPR.org: “Gut bacteria might guide the workings of our minds”
● Characterizing these bacteria can help us tease out these associations:
– Personalized medicine and treatment
http://www.gutmicrobiotawatch.org/gut-microbiota-info/
http://www.npr.org/blogs/health/2013/11/18/244526773/gut-bacteria-might-guide-the-workings-of-our-minds
3 Distinct “Enterotypes” Revealed from Clustering Approach
● Bacterial populations fell into 3 groups based on population composition
● These three “enterotypes” are each typified by one representative genus of gut bacteria (chief/first principal component)
– Enterotype 1: Bacteroides, enriched in biosynthesis of vitamins B5, B7, C
– Enterotype 2: Prevotella, enriched in biosynthesis of vitamins B1, B9
– Enterotype 3: Blautia (Ruminococcus), converts H2/CO2 to acetate
● ~1500 known sequences were used as a filter for raw metagenomic reads. These are the “features.” A “sample” is the population composition in a subject's gut.
● 85 metagenomes from one source, 154 from another, 33 from a third. The same 3 clusters emerged upon clustering each.
Enterotypes of the human gut microbiome. Nature 473: 174–180.
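To make the “feature” and “sample” vocabulary concrete, here is a minimal Python sketch. It assumes reads have already been matched against the reference sequences and tallied per genus; the counts and the helper names (relative_abundance, dominant_taxon) are invented for illustration. Picking the most abundant genus is only a rough stand-in: the enterotypes on the slide came from clustering whole compositions, not from taking the top genus.

```python
def relative_abundance(counts):
    """Turn per-taxon read counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def dominant_taxon(composition):
    """Name of the taxon with the largest share of the sample."""
    return max(composition, key=composition.get)


# Hypothetical read counts for one subject; each key is one feature.
raw_counts = {"Bacteroides": 600, "Prevotella": 100, "Ruminococcus": 300}

# The sample is the resulting population composition.
sample = relative_abundance(raw_counts)
top = dominant_taxon(sample)
```

Working with fractions rather than raw counts keeps subjects with different sequencing depths comparable.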
Clustering Methodology Used in the Original Paper
● Karhunen–Loève transform (KLT) – PCA
– Dimensionality reduction technique
– Parallels with SQL3: “pivot” along the axis with the most variance, then a final “roll up” based on a distance metric
● Some metrics: Euclidean, Manhattan, vector angle, Pearson's, Jensen-Shannon...
● In R, the “pam” algorithm (“k-medoids”) comes from the cluster package, used alongside ade4
Enterotypes of the human gut microbiome. Nature 473: 174–180.
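The distance-plus-medoids idea can be sketched in plain Python. This is a rough illustration under stated assumptions, not the paper's pipeline or the R pam implementation: it uses Jensen-Shannon distance (one of the metrics listed above), a swap-only search started from the first k samples instead of PAM's BUILD phase, and made-up three-genus compositions. All function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (natural log)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))


def total_cost(dist, medoids):
    """Sum over all samples of the distance to the nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def k_medoids(dist, k):
    """Greedy swap search: replace a medoid whenever that lowers the cost."""
    n = len(dist)
    medoids = list(range(k))  # naive start (assumption); PAM builds a better one
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Made-up compositions over three genera for six subjects.
samples = [
    [0.7, 0.2, 0.1], [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8], [0.2, 0.1, 0.7],
]
dist = [[jensen_shannon_distance(a, b) for b in samples] for a in samples]
medoids, labels = k_medoids(dist, k=3)
```

Unlike k-means, each cluster center here is an actual sample (a medoid), so the method needs only pairwise distances and works with any of the metrics listed above.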
References
● Cluster in R (ade4 hooks this): http://cran.r-project.org/web/packages/cluster/cluster.pdf
● Ade4 primer on dimensionality reduction: cran.r-project.org/web/packages/ade4/index.html
● “The human gut microbiome: are we our enterotypes?” Microbial Biotechnology (2011) 4(5), 550–553
● “Bacteria Divide People Into 3 Types, Scientists Say.” New York Times, April 20th, 2011.
● Dan Knights. Seminar: “Diet and microbiome: Which came first, the chicken nuggets or the Eggerthella?” Sep 26, 2013