MSDC - GFZ Potsdam


MSDC
MiniSeed Data Completeness
S. Pintore
Scenario




A network of remote SeisComP servers archiving data on mass storage, each creating a Peripheral Archive (PA)
A server creating a Central Archive (CA)
Telecommunication network availability < 100%
Limited bandwidth
Incomplete data

Data in the CA could be incomplete if the network becomes and stays unreachable for a long time:
• Some files missing in the CA
• Data gaps in the files of the CA
The Mednet SeisComP server network
Data are stored in 24-hour files
• The segsize parameter is set to 5000 blocks of 512 bytes
• If the link stays down longer than about an hour, the data will contain gaps.
• If the link stays down longer than 1-2 days, some files will be missing.
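As a rough check of the figures above, the buffer size follows from simple arithmetic. The filling rate below is only backed out from the "about an hour" figure, under the assumption that this buffer is exactly what bridges such an outage; the slides do not state the actual data rate.

```bash
#!/usr/bin/env bash
# Buffer size implied by the slide: segsize blocks of 512 bytes each.
segsize=5000        # value quoted on the slide
block_bytes=512     # block size quoted on the slide

buffer_bytes=$(( segsize * block_bytes ))
echo "buffer: ${buffer_bytes} bytes"            # prints: buffer: 2560000 bytes

# ASSUMPTION: the buffer is filled in about 3600 s of link outage.
outage_seconds=3600
rate=$(( buffer_bytes / outage_seconds ))       # integer division
echo "implied filling rate: about ${rate} bytes/s"
```

A larger segsize would lengthen the outage that can be bridged without gaps, at the cost of more buffer space on each remote server.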

Telecommunication network availability is generally good
• Network faults lasting more than 1 day are more frequent than faults lasting between an hour and 1 day.
Retransmit or Integrate?

To ensure data quality, an integrity check is necessary
Because of the bandwidth limit, you must choose between:
• retransmitting the whole file containing a gap
• integrating the file by transmitting only the data needed to fill the gap

These two execution steps aren’t necessarily distinct
Respect the environment

The procedure to rebuild the correct data should have a low impact on the systems. It should:
• run on Linux using low resources
• offer link security
• permit control on bandwidth use
• not need specific firewall rules
MSDC solution



MSDC uses the rsync tool, which is already available, optimised for similar problems and well tested
The data check is made by rsync, comparing the files in the CA with those in the PA
It uses rsync over ssh to:
• secure the connection
• avoid using the rsync port (873)
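A minimal sketch of what such an rsync-over-ssh call could look like, assuming it is run on the CA host and pulls from one PA. The host name, paths, key file and bandwidth cap are hypothetical; the slides do not show the options msdc.sh actually uses. The function only prints the argument list, it does not start a transfer.

```bash
#!/usr/bin/env bash
# Sketch only: host, paths, key file and bandwidth cap are hypothetical.
# Prints the rsync argument list, one argument per line, without running it.
build_rsync_cmd() {
    local key=$1 remote=$2 dest=$3 kbps=$4
    local -a cmd
    cmd=(
        rsync
        --archive             # preserve times and permissions
        --compress            # compress data on the wire
        --partial             # keep partially transferred files for a later retry
        --bwlimit="$kbps"     # cap the transfer rate (KBytes per second)
        --itemize-changes     # list the files that were updated
        -e "ssh -i $key"      # run over ssh, so port 873 is never used
        "$remote" "$dest"
    )
    printf '%s\n' "${cmd[@]}"
}

build_rsync_cmd "$HOME/.ssh/msdc_key" "sysop@pa-host:archive/" "/data/ca/pa-host/" 64
```

Because rsync transfers only the differing parts of a file, the same call covers both cases: a file missing in the CA is sent whole, a file with gaps is integrated. Adding --dry-run to the list is a safe way to try the options first.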
What does rsync offer?

The features of the rsync algorithm:
• it works on arbitrary data
• the total data transferred is about the size of a compressed diff file
• it is fast for large files and large collections of files
• it doesn’t assume any prior knowledge of the two files, but takes advantage of similarities
• it is computationally inexpensive
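One reason the algorithm is cheap is its weak rolling checksum: once computed for a block, it can be updated for the block shifted by one byte in constant time. The sketch below illustrates the idea on a short list of byte values; it is not rsync's own code, and the function names are made up for this example.

```bash
#!/usr/bin/env bash
# Illustration of the weak rolling checksum idea on a list of byte values.
M=65536

# weak_sum BYTE... : prints "a b" computed directly over the whole window.
#   a = sum of the bytes, b = sum of the bytes weighted by distance from the end
weak_sum() {
    local a=0 b=0 n=$# i=0 x
    for x in "$@"; do
        a=$(( (a + x) % M ))
        b=$(( (b + (n - i) * x) % M ))
        i=$(( i + 1 ))
    done
    echo "$a $b"
}

# roll A B LEN OLD NEW : slides the window one byte to the right,
# dropping byte OLD at the front and appending byte NEW at the end.
roll() {
    local a=$1 b=$2 len=$3 old=$4 new=$5
    a=$(( (a - old + new + M) % M ))
    b=$(( ((b - len * old + a) % M + M) % M ))
    echo "$a $b"
}

weak_sum 1 2 3      # window [1 2 3]          -> prints "6 10"
roll 6 10 3 1 4     # slide to window [2 3 4] -> prints "9 16"
weak_sum 2 3 4      # direct computation      -> prints "9 16"
```

The rolled result matches the direct computation, which is what lets a sender test every byte offset of a file against the receiver's block checksums without recomputing each sum from scratch.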
MSDC main features





The msdc.sh script can be run from the command line or from a crontab line
It is a bash script
It avoids conflicts between concurrent runs by using a simple locking mechanism
It logs events and the names of the files corrected or definitively lost
The installation is done by the sysop user in his home directory
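The slides do not describe how the lock is implemented. One common way to get a simple lock in bash, shown here purely as an assumption, is to rely on mkdir being atomic: only one process can create the directory. The lock path, the function names and the cron schedule are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical lock location; mkdir either creates it or fails, atomically.
LOCKDIR=${LOCKDIR:-/tmp/msdc.lock}

acquire_lock() { mkdir "$LOCKDIR" 2>/dev/null; }
release_lock() { rmdir "$LOCKDIR" 2>/dev/null; }

# run_once COMMAND [ARGS...] : run the command only if no other instance
# holds the lock, and release the lock afterwards.
run_once() {
    if ! acquire_lock; then
        echo "another instance is running, skipping"
        return 1
    fi
    "$@"
    local status=$?
    release_lock
    return "$status"
}

# Example crontab line (hypothetical schedule), run as the sysop user:
# 30 2 * * * $HOME/msdc/bin/msdc.sh
```

With a lock like this, a cron run that starts while a previous run is still transferring data exits immediately instead of competing for the limited bandwidth.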
Security




MSDC uses an ssh key pair to automate the ssh connection
This key pair is dedicated to MSDC; no other connections are possible using it
MSDC doesn’t interfere with other keys used to automate ssh connections
It doesn’t need an rsync server running
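A dedicated key is usually restricted with a forced command in authorized_keys, so that sshd runs a checking script instead of whatever the client asked for. The sketch below shows the kind of check such a script might perform. It is an assumption for illustration and not the content of the validate_rsync file shipped in the package; the function name and the key line are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical authorized_keys entry on the remote side (one line):
#   command="$HOME/msdc/bin/validate_rsync",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-rsa AAAA... msdc
# With a forced command, sshd puts the client's request in SSH_ORIGINAL_COMMAND.

# is_allowed_command STRING : succeed only for plain "rsync --server ..." requests.
is_allowed_command() {
    local cmd=$1
    case "$cmd" in
        *'&'*|*';'*|*'|'*|*'<'*|*'>'*|*'`'*|*'$'*|*'('*|*')'*)
            return 1 ;;                 # reject shell metacharacters
        "rsync --server"*)
            return 0 ;;                 # what an rsync client sends over ssh
        *)
            return 1 ;;                 # anything else is refused
    esac
}

# A wrapper script would then do something like:
#   is_allowed_command "$SSH_ORIGINAL_COMMAND" && exec $SSH_ORIGINAL_COMMAND
```

This is how a key can be made useless for interactive logins or arbitrary commands while still allowing the automated rsync transfer.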
The MSDC package






The MSDC package msdc.tgz contains the following files:
msdc/bin/msdc.sh
msdc/bin/validate_rsync
msdc/bin/rsync
msdc/doc/README.msdc - Documentation
msdc/doc/COPYING - GPL License
msdc/ssh
TODO

Option to use a different date
Alternative solutions: after the check
The data check could be done using the SeedStuff utilities (check_file, extr_file, etc.) or the qlib ones (qmerge, etc.).
For the incomplete files you can either:
• retransmit the whole file
or:
• use qmerge to extract the data needed to fill the gaps, then transmit these “patches”, possibly using qmerge again to fill the gaps.
Transmission: you should use a tool offering security, such as scp or sftp
You should then automate this procedure
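As a sketch of how the simplest part of this alternative could be automated, the functions below compare two file listings and print one scp command per file that is missing in the CA. The listing files, host and paths are hypothetical, the commands are printed rather than executed, and the gap-filling step with qmerge is left out because its invocation is not given in the slides.

```bash
#!/usr/bin/env bash
# missing_in_ca PA_LIST CA_LIST : print names found in the PA listing but not
# in the CA listing. Both arguments are text files with one file name per line.
missing_in_ca() {
    local pa_list=$1 ca_list=$2
    comm -23 <(sort "$pa_list") <(sort "$ca_list")
}

# plan_retransmit PA_LIST CA_LIST REMOTE DEST : print one scp command per
# missing file instead of running it.
plan_retransmit() {
    local pa_list=$1 ca_list=$2 remote=$3 dest=$4 name
    missing_in_ca "$pa_list" "$ca_list" | while IFS= read -r name; do
        echo "scp $remote/$name $dest/"
    done
}

# Example (hypothetical listings and host):
#   plan_retransmit pa_files.txt ca_files.txt sysop@pa-host:archive /data/ca
```

Printing the plan first makes it easy to review the bandwidth cost before anything is sent over the link.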