Data Structures and Visualization
J. B. Cole
Animal Improvement Programs Laboratory
Agricultural Research Service, USDA
Beltsville, MD 20705-2350
[email protected]
Introduction
• We’re drowning in information
• Genetics are viewed as a commodity
• We need to get better data from
fewer cows
• Do we have the resources we need?
Really Big Data Symposium, ADSA/ASAS Joint Annual Meeting, July 2011 (2)
Cole
U.S. dairy population
[Figure: U.S. dairy cows (millions, axis from 0 to 30) by year (axis ticks 40 through 00)]
We need to do more with less
• 47% of U.S. dairy cows are enrolled
in DHIA testing
• The Class III milk price is $17/cwt
• Grain prices are very high
• Corn averaged $6/bu in May
• Soybeans averaged $13/bu in May
• Enrollment and cow numbers are
unlikely to increase
Major topics
• Different sources of data
• Data source integration and quality
• Data mining models
• Visualization examples
• Computational resources
Data currently in national database
• Identification and registration
• Conformation scores
• Milk production and composition
• Fertility and reproduction
• Longevity
• Some genotypes
Data not routinely available
• Farm and herd management
• Geography and climate
• Housing systems
• Feed intake
• Milk composition
• Fatty acids, casein variants
• Conductivity, lactose, MUN
Photo: NOAA
• DNA data
• Cow SNP genotypes, DNA sequence data
Data “trapped” on the farm
• Fertility and reproduction
• Insemination information
• Use of estrus synchronization
• Cow health and longevity
• Body condition scores
• Birth weights and mature weights
• Disease occurrence data
Electronic milk meters
• Currently can provide—
• Milk yield
• Milking speed
• Electrical conductivity
• May possibly supply—
• Progesterone levels
• Milk temperature
• Fat and protein concentrations
Photo: afimilk
Other sources of data
• RFID tags give lower ID error rates for meter data
• Pedometers are useful for
detecting estrus, the
onset of calving, and
some early-stage
disease
Top: Allflex; Bottom: afimilk
Current sources of data
[Diagram: PDCA, NAAB, AIPL, DHI, CDCB, universities]
AIPL: Animal Improvement Programs Lab., USDA
CDCB: Council on Dairy Cattle Breeding
DHI: Dairy Herd Improvement (milk recording organizations)
NAAB: National Association of Animal Breeders (AI)
PDCA: Purebred Dairy Cattle Association (breed registries)
Sources of genomic data
[Diagram: requesters (e.g., AI, breeds), the Genomic Evaluation Lab, dairy producers, and DNA laboratories; one label reads "samples"]
Data source integration
• Incoming data from different sources
are checked against one another
• The AIPL edits system consists of
~64,000 SLOC
• Mostly C, some Fortran 90
• Data stored in a relational database
Typical edits
• Match birth date with dam's calving date
• Compare with other sources (e.g., breed association)
• Investigate maternal sibs born within 9 mo of each other (may assume ET)
• Animals with IDs within 100 of each other and the same sire, dam, and birth date are assumed to be twins
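The edits listed on this slide can be sketched as simple record checks. This is only an illustration, not the AIPL edits system (which the slides describe as mostly C with some Fortran 90). The `Animal` fields, the function names, and the use of 270 days for "9 mo" are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Animal:
    numeric_id: int   # hypothetical numeric part of the animal ID
    sire: str
    dam: str
    birth: date


def birth_matches_calving(animal, dam_calving_dates):
    """True if the reported birth date is one of the dam's calving dates."""
    return animal.birth in dam_calving_dates


def close_maternal_sibs(a, b, max_days=270):
    """Flag maternal sibs born on different dates less than about 9 months apart."""
    gap = abs((a.birth - b.birth).days)
    return a.dam == b.dam and 0 < gap <= max_days


def assumed_twins(a, b, max_id_gap=100):
    """IDs within 100 with the same sire, dam, and birth date are treated as twins."""
    same_parents_and_birth = (a.sire, a.dam, a.birth) == (b.sire, b.dam, b.birth)
    return abs(a.numeric_id - b.numeric_id) <= max_id_gap and same_parents_and_birth


# Made-up records for illustration only.
calf1 = Animal(1000, "S1", "D1", date(2010, 3, 1))
calf2 = Animal(1042, "S1", "D1", date(2010, 3, 1))
calf3 = Animal(5000, "S2", "D1", date(2010, 8, 15))

print(assumed_twins(calf1, calf2))        # same sire, dam, birth date; IDs 42 apart
print(close_maternal_sibs(calf1, calf3))  # same dam, births 167 days apart
```

A flagged pair is a prompt for investigation (for example, possible embryo transfer), not an automatic rejection.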
How do we assess data quality?
• Consistency
• e.g., calving, progeny birth,
breeding, dry dates
• Parentage verification
• Electronic ID
• Within-herd heritability
Data mining
• The discovery of useful, possibly
unexpected patterns in data
• Four principal tasks
• Association
• Clustering
• Classification
• Regression
Bonferroni’s principle
• You will find interesting patterns if
you look hard enough
• Not all relationships are legitimate
• You must have enough data to
support the questions you’re
asking
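A small numerical sketch of why this matters: if many unrelated patterns are each tested at a fixed significance level, some will look "interesting" by chance alone. The number of tests (10,000) and the level (0.05) are arbitrary choices for illustration, and the Bonferroni correction shown at the end is a related standard technique that the slide itself does not mention.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of chance 'discoveries' when no real effect exists."""
    return n_tests * alpha


def count_chance_hits(n_tests, alpha, seed=1):
    """Simulate p-values under the null (uniform on [0, 1)) and count those below alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_tests))


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_tests, alpha = 10_000, 0.05
print("expected by chance:", expected_false_positives(n_tests, alpha))
print("simulated chance hits:", count_chance_hits(n_tests, alpha))
print("corrected per-test threshold:", bonferroni_threshold(n_tests, alpha))
```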
Association analysis
• Discover interesting relationships
among variables in large databases
• e.g., predicting protein function and
identifying SNP-disease associations
• Not statistical association analysis!
• Lots of algorithms, many based on
counting attributes
• Watch for false positives
• Measures co-occurrence, not causality
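As a minimal sketch of a counting-based approach, the function below computes support and confidence for pairs of items that occur together. It is a simplified illustration rather than any specific published algorithm; the function name, the 0.5 support cutoff, and the health-event records are all made up for the example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return {(antecedent, consequent): (support, confidence)} for frequent pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of records containing both items
        if support >= min_support:
            rules[(a, b)] = (support, count / item_counts[a])
            rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Hypothetical health events recorded per cow (illustration only).
records = [
    {"ketosis", "displaced_abomasum"},
    {"ketosis", "displaced_abomasum"},
    {"ketosis"},
    {"mastitis"},
]
for rule, (support, confidence) in pair_rules(records).items():
    print(rule, round(support, 2), round(confidence, 2))
```

High confidence here only says the two events were recorded together often. It says nothing about which, if either, causes the other.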
Clustering
• Place items into distinct groups
such that
• Items in a group are similar
• Items in one group are dissimilar to
those in other groups
• Hierarchical or partitional
approaches
Partitional clustering
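One common partitional method is k-means, sketched below in plain Python. The slides do not say which algorithm the figure used, so k-means is an assumption here, and the points are made up. Each item ends up in exactly one group.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Assign each point to its nearest center, then recompute centers; repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Two well-separated, made-up groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```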
Hierarchical clustering
• Nested clusters organized into
hierarchical trees
• Data objects may belong to
multiple subsets
• Examples
• Relationships among species
• Evolutionary history of proteins
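The nesting can be illustrated with a small agglomerative (bottom-up) sketch using single linkage, where the distance between two clusters is the smallest distance between their members. The choice of single linkage and the one-dimensional values are assumptions for the example. Each merge creates a larger cluster that contains the earlier ones.

```python
def single_linkage(values):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(v,) for v in values]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up values; read the merges from top to bottom as a tree built from the leaves up.
merges = single_linkage([1.0, 2.0, 6.0, 7.0, 20.0])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```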
Partners
Deep SNP Discovery
N’Dama
Sahiwal
Simmental
Hanwoo
Blonde d’Aquitaine
Montbeliard
Pfizer
Light SNP Discovery
Angus
Holstein
Jersey
Hereford
Charolais
Simmental
Brahman
Wagyu
BFGL
Genome Assemblies
Nelore
Water Buffalo
BFGL-Illumina
Deep SNP Discovery
Angus
Holstein
Limousin
Jersey
Nelore
Brahman
Romagnola
Gir
Classification
• Training set used to develop a rule
for assigning individuals to classes
• Validation set used to assess the
accuracy of the classification rule
• Examples
• Identify cows with subclinical mastitis
• Mate assignment
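The training and validation roles can be sketched with a one-nearest-neighbor classifier, one of the simplest classification rules. The two features and all of the values and labels below are invented for illustration and are not real mastitis data.

```python
def nearest_neighbor_predict(train, x):
    """Return the label of the training record closest to x."""
    _, label = min(
        train,
        key=lambda record: sum((a - b) ** 2 for a, b in zip(record[0], x)),
    )
    return label


def accuracy(train, validation):
    """Fraction of validation records whose predicted label matches the true label."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in validation)
    return correct / len(validation)


# Hypothetical (feature_1, feature_2) measurements with known status.
training_set = [
    ((2.0, 5.0), "healthy"),
    ((2.5, 5.2), "healthy"),
    ((5.0, 6.5), "subclinical"),
    ((5.5, 6.8), "subclinical"),
]
validation_set = [
    ((2.2, 5.1), "healthy"),
    ((5.2, 6.6), "subclinical"),
    ((4.0, 6.0), "healthy"),  # borderline record that the rule gets wrong
]
print(accuracy(training_set, validation_set))
```

Because the validation records were not used to build the rule, the accuracy reported on them is a fairer estimate of how the rule would behave on new animals.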
Classification methods
• Bayesian belief networks
• Decision trees
• Nearest-neighbor classification
• Neural networks
• Rule-based classification
• Support vector machines
Decision tree classification
Pinzón-Sánchez et al., 2011, JDS, 94:1873-1892.
Rule-based classification
• Classify records using a series of
“if…then” rules
• Rules come directly from the data,
or from other classification models
• e.g., if (PTA NM$ ≥ $800) and (EFI ≤
0.05) then (breed to cow)
• Easy to generate and interpret
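The example rule on this slide translates almost directly into code. The function name and the "no decision" fallback are assumptions added for the sketch; only the thresholds and the "breed to cow" outcome come from the slide.

```python
def mate_rule(pta_nm_dollars, efi):
    """Apply the slide's rule: PTA NM$ >= 800 and EFI <= 0.05 means breed to cow."""
    if pta_nm_dollars >= 800 and efi <= 0.05:
        return "breed to cow"
    return "no decision"  # assumed fallback when the rule does not fire


print(mate_rule(850, 0.04))
print(mate_rule(850, 0.07))
```

In practice a rule set would hold several such rules checked in order, which keeps each decision easy to trace back to the condition that triggered it.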
Regression models
• Prediction of real-valued outputs
• Given one or more attributes, we
can predict, for example—
• Breeding values
• Feed intake
• Milk and components yields
• Very mature analytical tools
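As the simplest instance, an ordinary least-squares line through one attribute can be fitted in a few lines. This is a sketch with made-up numbers, not the models used for genetic evaluation, and the variable names are placeholders.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted real-valued output for a new attribute value."""
    return intercept + slope * x


# Made-up attribute and response values that lie exactly on y = 1 + 2x.
attribute = [1.0, 2.0, 3.0, 4.0]
response = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(attribute, response)
print(intercept, slope, predict(intercept, slope, 5.0))
```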
Visualization
• How do we present lots of numbers
in a compact form?
• “Graphical methods can retain the
information in the data.” ― Deming
• Complements numerical
techniques
• Tukey (1977), Tufte (1983, 1990, 1997, 2006), Cleveland (1985, 1993), Wickham (2009)
One image, millions of points
43,382 SNP solutions × 4,064 animals = 176,304,448 data points
Use size to denote importance
Markers are proportional in area to SNP effect sizes.
O-Style Haplotypes
(chromosome 15)
Correlations among calving traits
Provide multiple cues
Lines are differentiated by color and pattern.
Cole and VanRaden. 2011. J. Anim. Breed. Genet. Online, 1-10.
Interstitial figures
Cole and VanRaden. 2010. J. Dairy Sci. 93(6):2727-2740.
Computational capacity is abundant
WikiMedia Commons, Wgsimon, Transistor_Count_and_Moore%27s_Law_-_2011.svg
Supercomputer performance
• Cray-1 (1976): 136 megaFLOPS (10^6)
• Fujitsu K machine (2011): 8.16 petaFLOPS (10^15)
• Commodity hardware
also has experienced
gains in performance
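The size of that gain is easier to appreciate as a ratio. The short calculation below uses only the two figures on this slide; the "years per doubling" summary is a derived quantity added for illustration, not a claim from the talk.

```python
import math

cray_1_flops = 136e6      # 136 megaFLOPS (1976)
k_machine_flops = 8.16e15 # 8.16 petaFLOPS (2011)

speedup = k_machine_flops / cray_1_flops
doublings = math.log2(speedup)
years_per_doubling = (2011 - 1976) / doublings

print(f"speedup: {speedup:.2e}")
print(f"doublings: {doublings:.1f}")
print(f"years per doubling: {years_per_doubling:.2f}")
```

The ratio works out to about 60 million, or roughly 26 doublings in 35 years.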
Top: Sherwin Gooch; Bottom: Riken
Storage costs are plummeting
Matthew Komorowski, http://www.mkomo.com/cost-per-gigabyte
Data storage technologies
• Storage costs are now as
low as $100/TB
• Quality costs!
• Solid state disks are
promising, but relatively
low-capacity
• What do you do about
backups?
Top: Snopes/IBM; Bottom: Tom’s Hardware
Memory is very cheap
Lev Lafayette, http://www.organdi.net/article.php3?id_article=82
Random access memory
• RAM is still much
faster than disk (ns
vs. ms access times)
• A 64-bit OS can address 2^64 bytes (16 EiB, about 18.4 EB), in theory
• How much can your
motherboard hold?
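The theoretical limit follows from simple arithmetic, assuming a byte-addressable machine that uses all 64 address bits (real processors and motherboards expose far fewer). The variable names are placeholders for the sketch.

```python
address_space_bytes = 2 ** 64

exabytes = address_space_bytes / 10 ** 18   # decimal units (EB)
exbibytes = address_space_bytes / 2 ** 60   # binary units (EiB)
tebibytes = address_space_bytes // 2 ** 40  # binary units (TiB)

print(f"{address_space_bytes:,} bytes")
print(f"{exabytes:.1f} EB, {exbibytes:.0f} EiB, {tebibytes:,} TiB")
```

In binary units this is 16 EiB, or about 16.8 million TiB; in decimal units it is about 18.4 EB.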
Top: Stan Yack; Bottom: Samsung
Software
• Complexity is increasing
• Parallelism is hard and debugging is much
harder
• Productive developers are expensive
and difficult to find
• Top programmers are many times more productive than average ones
Conclusions
• The more data we get, the more data
we want
• Relationships among traits may become
as important as individual traits
• Software may be more limiting than
hardware
Questions?