Clustering the Reliable File Transfer Service

Jim Basney and Patrick Duda
NCSA, University of Illinois
This material is based upon work supported by the National Science Foundation under Grant No. 0426972.
June 6, 2007
TeraGrid '07
Goal
• Provide a highly available Reliable File Transfer (RFT) Service
  – Tolerate server failures
    • Hardware/software faults and resource exhaustion
  – Continue to handle incoming requests
  – Continue to make forward progress on file transfers in the queue
Globus Toolkit Reliable File Transfer Service
[Diagram: a client submits transfer requests to the RFT service, which manages transfers between GridFTP servers]
RFT and GridFTP Clustering
[Diagram: multiple RFT instances open GridFTP control connections to GridFTP servers, with the transfers carried by multiple GridFTP data nodes]
Clustering Approach
[Diagram: a load balancer distributes incoming requests across multiple RFT instances, all backed by a highly available (HA) DBMS]
RFT State Management
[Diagram: a client interacts with the RFT and Delegation Service hosted in a web service container; both services persist their state in the DBMS]
RFT DB Tables
• Request: ID, Termination Time, Started Flag, Max Attempts, Delegated EPR, Container ID, Start Time
• Transfer: ID, Request ID, Source URL, Destination URL, Status, Attempts, Retry Time
• Restart: Transfer ID, Restart Marker, Last Update Time
• Added fields: Container ID, Last Update Time
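
The slide gives field names only, not the underlying SQL. As a rough, hypothetical sketch (in Java via JDBC, assuming a MySQL schema with tables named request and restart; the column names, JDBC URL, and credentials below are illustrative assumptions, not the actual GT4 RFT schema), the two added clustering columns might look like this:

    // Hypothetical sketch only: table names, column names, and the JDBC URL are assumptions.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AddClusteringColumns {
        public static void main(String[] args) throws Exception {
            try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://dbhost/rftDatabase", "rftuser", "secret");
                 Statement stmt = db.createStatement()) {
                // Which web services container (RFT instance) currently owns the request;
                // on fail-over, another instance claims ownership by rewriting this value.
                stmt.executeUpdate(
                    "ALTER TABLE request ADD COLUMN container_id VARCHAR(128)");
                // When the transfer last made progress (restart marker updated);
                // a stale timestamp marks the request as stalled.
                stmt.executeUpdate(
                    "ALTER TABLE restart ADD COLUMN last_update_time BIGINT");
            }
        }
    }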
New Tables
• Delegation Service: Resource ID, Caller DN, Local Name, Termination Time, Listener, Certificate, Container ID
• Persistent Subscription: Consumer, Producer, Policy, Precondition, Selector, Topic, Security Descriptor, …
RFT Fail-Over
• Based on time-outs
• Periodically query the database for pending requests with no recent activity (see the sketch after this list)
  – Stalled requests could be caused by an RFT service crash, hardware failure, RFT service overload, etc.
  – If found, obtain a DB write lock, query again, claim the stalled requests, and release the lock
• Configuration values:
  – Query interval (default: 30 seconds)
  – Recent interval (default: 60 seconds)
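
A minimal sketch of that periodic scan follows (this is not the actual RFT implementation): it assumes JDBC access to the shared database, the hypothetical column names from the schema sketch above, and, for brevity, a last_update_time column directly on the request table; the lock / re-query / claim sequence from the slide is collapsed here into a single transactional UPDATE. The intervals match the defaults above.

    // Hypothetical sketch of the time-out based fail-over scan.
    // Table and column names are assumptions, not the actual RFT schema.
    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class FailOverScanner implements Runnable {
        private static final long QUERY_INTERVAL_MS = 30_000;   // query interval (default: 30 seconds)
        private static final long RECENT_INTERVAL_MS = 60_000;  // recent interval (default: 60 seconds)

        private final Connection db;          // connection to the shared HA DBMS
        private final String myContainerId;   // identifies this RFT instance

        public FailOverScanner(Connection db, String myContainerId) {
            this.db = db;
            this.myContainerId = myContainerId;
        }

        @Override
        public void run() {
            while (true) {
                try {
                    claimStalledRequests();
                    Thread.sleep(QUERY_INTERVAL_MS);
                } catch (InterruptedException e) {
                    return;                    // container shutting down
                } catch (Exception e) {
                    e.printStackTrace();       // log and retry on the next interval
                }
            }
        }

        private void claimStalledRequests() throws Exception {
            long cutoff = System.currentTimeMillis() - RECENT_INTERVAL_MS;
            // Take over started requests whose last update is older than the
            // recent interval and that currently belong to another instance.
            String sql = "UPDATE request SET container_id = ? "
                       + "WHERE started_flag = 1 AND last_update_time < ? "
                       + "AND container_id <> ?";
            db.setAutoCommit(false);
            try (PreparedStatement ps = db.prepareStatement(sql)) {
                ps.setString(1, myContainerId);
                ps.setLong(2, cutoff);
                ps.setString(3, myContainerId);
                int claimed = ps.executeUpdate();
                db.commit();                   // make the claim visible and release locks
                if (claimed > 0) {
                    System.out.println("Claimed " + claimed + " stalled request(s)");
                    // Hand the claimed requests to this instance's transfer threads.
                }
            } catch (Exception e) {
                db.rollback();
                throw e;
            } finally {
                db.setAutoCommit(true);
            }
        }
    }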
Evaluation Environment
• Dedicated 12-node Linux cluster
  – Red Hat Enterprise Linux AS Release 3
  – Switched Gigabit Ethernet
  – 2 GB RAM
  – Dual 2 GHz Intel Xeon CPUs, 512 KB cache
• Globus Toolkit 4.0.3
• MySQL Standard 5.0.27
Evaluation
• Correctness / Effectiveness
  – Submitted multiple RFT requests of different sizes to 12 RFT instances
  – Verified fail-over and notification functionality
• Performance
  – Evaluate overhead of the shared DBMS
  – Stress test: transfer many small files
[Chart: files transferred per second over time in seconds, with a 60 second fail-over interval; annotations mark where the web services container was stopped and where fail-over occurred]
[Chart: GT4 submit time vs. cluster submit time, in total seconds (0 to 6), by number of nodes (1 to 10)]
[Chart: cluster transfer time vs. GT4 transfer time, in total seconds (0 to 200), by number of nodes (1 to 10), with percentage annotations ranging from 4% to 95%]
Related Work
• HAND: Highly Available Dynamic Deployment Infrastructure for GT4
  – Migrates services between containers to maintain availability during planned outages
  – Does not address management of persistent service state or fail-over for unplanned outages
• myGrid
  – DBMS persistence of WS-ResourceProperties in Apache WSRF
  – Points to a general-purpose approach for DBMS-based persistence of stateful WSRF services
Conclusion
• Clustering RFT provides load-balancing and fail-over with acceptable performance for small clusters
• Clustering is a promising approach for other grid services as well
Future Work
• Correctly handle replay of FTP deletes
• Implement credentialRefreshListener
• Evaluate use of different DBMS solutions
• Investigate GT4 DBMS persistence in general
• Investigate use of WS-Naming
Thanks!
• Questions? Comments?
• This material is based upon work supported by the National Science Foundation under Grant No. 0426972.
• Performance experiments were conducted on computers at the Technology Research, Education, and Commercialization Center (TRECC), a program of the University of Illinois at Urbana-Champaign, funded by the Office of Naval Research and administered by the National Center for Supercomputing Applications. We thank Tom Roney for his assistance with the TRECC cluster.
• We also thank Ravi Madduri from the Globus project for answering our questions about RFT.