Condor Build & Test:
NMI, OMII, ETICS
Peter F. Couvares
Associate Researcher, Condor Team
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
How the Condor Team Got
Started in the Build/Test
Business: Prehistory
› Oracle shamed^H^H^H^H^H^Hinspired us.
› The Condor team was in the stone age,
producing modern software to help people
reliably automate their computing tasks -- with our bare hands.
• Every Condor release took weeks/months to do.
• Build by hand on each platform, discover lots of
bugs introduced since the last release, track
them down, re-build, etc.
www.cs.wisc.edu/condor
What Did Oracle Do?
› Oracle selected Condor as the resource manager
underneath their Automated Integration
Management Environment (AIME)
› Relied on to perform automated build and
regression testing of multiple components for
Oracle's flagship Database Server product.
› Oracle chose Condor because they liked the
maturity of Condor's core components.
Doh!
› Oracle used distributed computing to automate
their build/test cycle, with huge success.
› If Oracle can do it, why can’t we?
› Use Condor to build Condor!
› NSF Middleware Initiative (NMI)
• right initiative at the right time!
• opportunity to collaborate with others to do for
production software developers like Condor what Oracle
was doing for themselves
• important service to the scientific computing community
NMI Statement
› Purpose – to develop, deploy and sustain a set of
reusable and expandable middleware functions
that benefit many science and engineering
applications in a networked environment
› Program encourages open source software
development and development of middleware
standards
Why should you care?
From our experience, the functionality,
robustness and maintainability of a
production-quality software component
depend on the effort involved in
building, deploying and testing the
component.
• If it is true for a component, it is definitely
true for a software stack
• Doing it right is much harder than it appears
from the outside
• Most of us had very little experience in this
area
Goals of the
NMI Build & Test System
› Design, develop and deploy a complete build
system (HW and SW) capable of performing
daily builds and tests of a suite of disparate
software packages on a heterogeneous (HW,
OS, libraries, …) collection of platforms
› And make it:
• Dependable
• Traceable
• Manageable
• Portable
• Extensible
• Schedulable
The Build Challenge
› Automation – “build the component at the push of a button!”
• always more to it than just “configure” & “make”
• e.g., ssh to right host; cvs checkout; untar; setenv, etc.
› Reproducibility – “build the version we released 2 years ago!”
• Well-managed & comprehensive source repository
• Know your “externals” and keep them around
› Portability – “build the component on nodeX.cluster.com!”
• No dependencies on “local” capabilities
• Understand your hardware & software requirements
› Manageability – “run the build daily on 15 platforms and
email me the outcome!”
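The "push of a button" idea above can be sketched as a script that runs every build step in order, in an explicitly constructed environment, and stops at the first failure. This is only an illustration under stated assumptions, not the NMI software: `minimal_env`, `run_build`, and the demo step names are hypothetical, and a real build would run the checkout, unpack, configure, and make commands in place of the stand-ins.

```python
import os
import subprocess
import sys


def minimal_env(keep=("PATH", "SYSTEMROOT")):
    """Copy only the named variables from the local environment.

    Starting from a small, explicit environment makes hidden
    dependencies on "local" settings show up as build failures.
    """
    return {name: os.environ[name] for name in keep if name in os.environ}


def run_build(steps, env):
    """Run each (name, argv) step in order; stop at the first failure.

    Returns a list of (name, return_code) for every step that was run.
    """
    results = []
    for name, argv in steps:
        proc = subprocess.run(argv, env=env)
        results.append((name, proc.returncode))
        if proc.returncode != 0:
            break  # later steps depend on earlier ones
    return results


if __name__ == "__main__":
    # Hypothetical stand-in steps; each one just prints its own name.
    demo_steps = [
        (step, [sys.executable, "-c", f"print('{step}')"])
        for step in ("checkout", "untar", "configure", "make")
    ]
    for name, code in run_build(demo_steps, minimal_env()):
        print(name, "ok" if code == 0 else f"failed ({code})")
```

Returning the per-step outcome, rather than only printing it, is what would let a scheduler run the same build daily on many platforms and mail out a summary.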
The Testing Challenge
› All the same challenges as builds (automation,
reproducibility, portability, manageability), plus:
› Flexibility
• “test our RHEL4 binaries on RHEL5!”
• “run our new tests on our old binaries”
• important to decouple build & test functions
• making tests just a part of a build -- instead of an
independent step -- makes it difficult/impossible to:
• run new tests against old builds
• test one platform’s binaries on another platform
• run different tests at different frequencies
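One way to picture the decoupling described above is to treat a finished build as a stored artifact and a test run as a separate record that points at it. The sketch below is an assumption-laden illustration, not the NMI data model: `Build`, `PlannedRun`, `plan_test_runs`, and the version and platform strings are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Build:
    version: str    # e.g. a release tag or a nightly date
    platform: str   # platform the binaries were built on


@dataclass(frozen=True)
class PlannedRun:
    build: Build
    test_suite: str     # version of the test suite to run
    run_platform: str   # platform the tests execute on


def plan_test_runs(builds, test_suites, run_platforms,
                   is_compatible=lambda build, run_platform: True):
    """Pair stored builds with test suites and run platforms.

    Because a test run only refers to a build, any suite can be run
    against any build on any platform the caller considers compatible.
    """
    return [
        PlannedRun(build, suite, platform)
        for build, suite, platform in product(builds, test_suites, run_platforms)
        if is_compatible(build, platform)
    ]


if __name__ == "__main__":
    old_build = Build(version="old-release", platform="rhel4")
    for run in plan_test_runs([old_build], ["new-tests"], ["rhel4", "rhel5"]):
        print(run)
```

With this separation, "run our new tests on our old binaries" and "test our RHEL4 binaries on RHEL5" are just different arguments to the same planning step.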
“Eating Our Own Dogfood”
› What Did We Do?
• We built the NMI Build & Test Lab on top of
Condor, DAGMan, and other distributed computing
technologies to automate the build, deploy, and
test cycle.
• To support it, we’ve had to construct and manage a
dedicated, heterogeneous distributed computing
facility.
• Opposite extreme from typical “cluster” -- instead
of 1000’s of identical CPUs, we have a handful of
CPUs each for ~40 platforms.
• Much harder to manage! You try finding a sysadmin
tool that works on 40 platforms!
• We’re just another big Condor user
• If Condor sucks, we feel the pain.
NMI Build & Test Facility
[Figure: architecture diagram. Labels: INPUT (Customer Source Code, Customer Build/Test Scripts, Spec File); NMI Build & Test Software; DAG; DAGMan; Condor Queue; build/test jobs; Distributed Build/Test Pool; results; OUTPUT (MySQL Results DB, Web Portal, Finished Binaries).]
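The diagram above shows the software turning a spec file into a DAG that DAGMan then executes. As a rough illustration of that step, the sketch below writes a DAGMan input file with one build node and one dependent test node per platform. `JOB`, `VARS`, and `PARENT ... CHILD` are standard DAGMan input-file commands; everything else here (the function `make_dag`, the submit file names `build.sub` and `test.sub`, the node naming) is a hypothetical assumption, and the DAGs the NMI software actually generates are not described in these slides.

```python
import re


def make_dag(platforms, build_submit="build.sub", test_submit="test.sub"):
    """Return DAGMan input text: per platform, a build node then a test node."""
    lines = []
    for platform in platforms:
        # Keep node names to letters, digits and underscores.
        safe = re.sub(r"\W", "_", platform)
        build, test = f"build_{safe}", f"test_{safe}"
        lines.append(f"JOB {build} {build_submit}")
        lines.append(f'VARS {build} platform="{platform}"')
        lines.append(f"JOB {test} {test_submit}")
        lines.append(f'VARS {test} platform="{platform}"')
        # The test node runs only after its build node succeeds.
        lines.append(f"PARENT {build} CHILD {test}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_dag(["rhel3", "sol9"]), end="")
```

In this sketch the submit files would refer to the per-node value as `$(platform)`, so one pair of submit files can serve every platform.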
Numbers
100 CPUs
39 HW/OS “Platforms”
34 OS
9 HW Arch
3 Sites
~100 GB of results per day
~1400 Builds/tests per month
~350 Condor jobs per day
#   Name                          Arch   OS
1   atlantis.mcs.anl.gov          sparc  sol9
2   grandcentral                  i386   rh9
3   janet                         i386   winxp
4   nmi-build15                   i386   rh72
5   nmi-build16                   i386   rh8
6   nmi-build17                   i386   rh9
7   nmi-build18                   sparc  sol9
8   nmi-build21                   i386   fc2
9   nmi-build29                   sparc  sol8
10  nmi-build33                   ia64   sles8
11  nmi-build5                    i386   rhel3
12  nmi-build6                    G5     osx
13  nmi-rhas3-amd64               amd64  rhel3
14  nmi-sles8-amd64               amd64  sles8
15  nmi-test-3                    i386   rh9
16  nmi-test-4                    i386   rh9
17  [unknown]                     hp     hpux11
18  [unknown]                     sgi    irix6?
19  [unknown]                     sparc  sol10
20  [unknown]                     sparc  sol7
21  [unknown]                     sparc  sol8
22  [unknown]                     sparc  sol9
23  nmi-build1                    i386   rh9
24  nmi-build14                   ppc    aix52
25  nmi-build24                   i386   tao1
26  nmi-build31                   ppc    aix52
27  nmi-build32                   i386   fc3
28  nmi-build8                    ia64   rhel3
29  nmi-dux40f                    alpha  dux4
30  nmi-hpux11                    hp     hpux11
31  nmi-ia64-1                    ia64   sles8
32  nmi-sles8-ia64                ia64   sles8
33  rebbie                        i386   winxp
34  rocks-{122,123,124}.sdsc.edu  i386   ???
35  supermicro2                   i386   rhel4
36  b80n15.sdsc.edu               ppc    aix51
37  imola                         i386   rh9
38  nmi-aix                       ppc    aix52
39  nmi-build2                    i386   rh8
40  nmi-build3                    i386   rh72
41  nmi-build4                    i386   winxp
42  nmi-build7                    G4     osx
43  nmi-build9                    ia64   rhel3
44  nmi-hpux                      hp     hpux10
45  nmi-irix                      sgi    irix65
46  nmi-redhat72-build            i386   rh72
47  nmi-redhat72-dev              i386   rh72
48  nmi-redhat80-ia32             i386   rh8
49  nmi-rh72-alpha                alpha  rh72
50  nmi-solaris8                  sparc  sol8
51  nmi-solaris9                  sparc  sol9
52  nmi-test-1                    i386   rh9
53  nmi-tru64                     alpha  dux51
54  vger                          i386
55  monster                       i386
56  nmi-test-5                    i386
57  nmi-test-6                    i386
58  nmi-test-7                    i386
59  nmi-build22                   i386
60  nmi-build25                   i386
61  nmi-build26                   i386
62  nmi-build27                   i386
63  nmi-fedora                    i386
(OS for rows 54-63: only six values survive, in the order rh73, rh9, rh9, rh9, rh9, fc2; their row assignment is not recoverable.)
Condor Build & Test
› Automated Condor Builds
• Two (sometimes three) separate Condor
versions, each automatically built using NMI on
13-17 platforms nightly
• Stable, developer, special release branches
› Automated Condor Tests
• Each nightly build’s output becomes the input to
a new NMI run of our full Condor test suite
› Ad-Hoc Builds & Tests
• Each Condor developer can use NMI to submit
ad-hoc builds & tests of their experimental
workspaces or CVS branches to any or all
platforms
More Condor Testing Work
• Advanced Test Suite
• Using binaries from each build, we deploy an
entire self-contained Condor pool on each test
machine
• Runs a battery of Condor jobs and tests to
verify critical features
• Currently >150 distinct tests
• each executed for each build, on each platform, for
each release, every night
• Flightworthy Initiative
• Ensuring continued “core” Condor scalability, robustness
• NSF funded, like NMI
• Producing new tests all the time
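To get a feel for the nightly volume implied above, the figures from these slides can simply be multiplied: more than 150 tests, 13-17 platforms per build, and two (sometimes three) Condor versions. The helper name below is hypothetical, and because the test count is only given as ">150", both results are rough floors rather than exact counts.

```python
def nightly_test_executions(tests, platforms, releases):
    """Every test runs on every platform for every release, every night."""
    return tests * platforms * releases


if __name__ == "__main__":
    low = nightly_test_executions(tests=150, platforms=13, releases=2)
    high = nightly_test_executions(tests=150, platforms=17, releases=3)
    print(f"roughly {low} to {high} test executions per night")
```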
NMI Build & Test
Customers
› NMI Build & Test Facility was built to
serve all NMI projects
› Who else is building and testing?
• Globus
• NMI Middleware Distribution
• many “grid” tools, including Condor & Globus
• Virtual Data Toolkit (VDT) for the Open
Science Grid (OSG)
• 40+ components
• Soon TeraGrid, NEESgrid, others…
Build & Test Beyond NMI
› We want to integrate with other,
related software quality projects, and
share build/test resources...
• an international (US/Europe/China) federation of
build/test grids…
• Offer our tools as the foundation for other B&T systems
• Leverage others’ work to improve our own B&T service
OMII-UK
• Integrating software from multiple sources
• Established open-source projects
• Commissioned services & infrastructure
• Deployment across multiple platforms
• Verify interoperability between platforms & versions
• Automatic Software Testing vital for the Grid
• Build Testing – Cross platform builds
• Unit Testing – Local Verification of APIs
• Deployment Testing – Deploy & run package
• Distributed Testing – Cross domain operation
• Regression Testing – Compatibility between versions
• Stress Testing – Correct operation under real loads
• Distributed Testbed
• Need a breadth & variety of resources not power
• Needs to be a managed resource – process
NMI/OMII-UK
Collaboration
› Phase I: OMII-UK developed automated builds &
tests using the NMI Build & Test Lab at UW-Madison
› Phase II: OMII-UK deployed their own instance of
the NMI Build & Test Lab at Southampton
University
• Our lab at UW-Madison is well and good, but some
collaborators want/need their own local facilities.
› Phase III (in progress): Move jobs freely between
UW and OMII-UK B&T labs as needed.
Next: ETICS
[Figure: ETICS partner contributions; partner names did not survive extraction.]
• Build system, software configuration, service infrastructure, dissemination, EGEE, gLite, project coord.
• NMI Build & Test Framework, Condor, distributed testing tools, service infrastructure
• Software configuration, service infrastructure, dissemination
• Web portals and tools, quality process, dissemination, DILIGENT
• Test methods and metrics, unit testing tools, EBIT
ETICS Project Goals
› ETICS will provide a multi-platform environment for building
and testing middleware and applications for major European
e-Science projects
› “Strong point is automation: of builds, of tests, of reporting,
etc. The goal is to simplify life when managing complex
software management tasks”
• One button to generate finished package (e.g., RPMs) for any
chosen component
› ETICS is developing a higher-level web service and DB to
generate B&T jobs -- and use multiple, distributed NMI B&T
Labs to execute & manage them
• This work complements the existing NMI Build & Test system
and is something we want to integrate & use to benefit other
NMI users!
ETICS Web Interface
OMII-Japan
• What They’re Doing
• “…provide service which can use on-demand autobuild and test systems
for Grid middlewares on on-demand virtual cluster. Developers can build
and test their software immediately by using our autobuild and test
systems”
• Underlying B&T Infrastructure is NMI Build & Test Software
This was a Lot of Work… But
It Got Easier Each Time
› Deployments of the NMI B&T Software
with international collaborators taught us
how to export Build & Test as a service.
› Tolya Karp: International B&T Hero
• Improved (i.e., wrote) NMI install scripts
• Improved configuration process
• Debugged and solved a myriad of details that
didn’t work in new environments
What This Means For You
› NMI B&T Lab Deployment Experience +
Improved Packaging + Improved
Portability…
› We now have a unique ability to give you not
only source code, but a whole production
build & test infrastructure to go along with
it
› … and we have done it for a number of
users already
New Condor+NMI Users
› Yahoo
• First industrial user to deploy NMI B&T
Framework to build/test custom Condor
contributions
› Hartford Financial
• Deploying it as we speak…
What’s to Come
› More US & international collaborations
• OMII-Europe
• More Industrial User/Developers…
› New Features
• Becky Gietzel: parallel testing!
• Major new feature: multiple co-scheduled resources
for individual tests
• Going beyond multi-platform testing to cross-platform parallel testing
› UW-Madison B&T Lab: ever more
platforms
• “it’s time to make the doughnuts”
• Questions?