Data Management in Large-scale P2P Systems
Patrick Valduriez, Esther Pacitti
Atlas group, INRIA and LINA
University of Nantes, France
Motivations

P2P systems
  Decentralized control, large scale
  Low-level, simple services
    File sharing, computation sharing, com. sharing

Distributed database systems
  High-level data management services
    queries, transactions, consistency, security, etc.
  Centralized control, limited scale

P2P + distributed database
  Why? How?
2/26
Why high-level P2P data sharing?

Professional community example
  Medical doctors in a hospital may want to share (some of) their patient data for an epidemiological study
  They have their own, independent patient descriptions
  They want to ask queries such as “age and weight of male patients diagnosed with disease X …” over their own descriptions
  They don’t want to create a database and buy a server
3/26
Problem definition

P2P system
  No centralized control, very large scale
  Very dynamic: peers can join and leave the network at any time
  Peers can be autonomous and unreliable

Techniques designed for distributed data management no longer apply
  Too static, need to be decentralized, dynamic and self-adaptive
4/26
Outline




Data management in distributed systems
P2P systems
Data management in P2P systems
Data management in APPA
5/26
Data management basic principle

Data independence
  Hide implementation details
  Provision for high-level services
    Schema
    Queries (SQL, XQuery)
    Automatic optimization
    Transactions
    Consistency
    Access control
    …

[Figure: applications access storage through a logical view (schema)]
6/26
Distributed database system (DDBS)

[Figure: queries and transactions are submitted to the distributed database system, which provides distribution transparency over Site 1, Site 2 and Site 3 (DBMS1, DBMS2)]

Global schema
  Common data descriptions
  Distributed data placement
Centralized control through global catalog
Distributed functions
  Schema mapping
  Query processing
  Transaction management
  Access control
  Etc.
7/26
Scaling up DDBS

Distributed database systems
  Enterprise information systems
  Scale up to tens of databases

Data integration systems
  Strong heterogeneity and autonomy of data sources (files, databases, XML documents, ..)
  Limited functionality (queries)
  Scale up to hundreds of data sources

Parallel database systems
  Focus on high-performance and high-availability
  Strong homogeneity
  Scale up to hundreds of data nodes
8/26
A generic P2P system

A user at a peer may access sharable data at remote peers

[Figure: three peers, each running P2P software over its own private and sharable data]
9/26
Potential benefits of P2P systems





Scale up to very large numbers of peers
Dynamic self-organization
Load balancing
Parallel processing
High availability through massive replication
10/26
P2P vs DDBS
                     P2P                        DDBS
Joining the network  Upon peer’s initiative     Controlled by DBA
Queries              No schema, key-word based  Global schema, static optimization
Query answers        Partial                    Complete
Content location     Using neighbors or DHT     Using directory
11/26
Requirements for P2P data management (1)

Autonomy of peers
  Peers should be able to join/leave at any time, control their data wrt other (trusted) peers

Query expressiveness
  Key-lookup, key-word search, SQL-like

Efficiency
  Efficient use of bandwidth, computing power, storage
12/26
Requirements for P2P data management (2)

Quality of service (QoS)
  User-perceived efficiency: completeness of results, response time, data consistency, …

Fault-tolerance
  Efficiency and QoS despite failures

Security
  Data access control in the context of very open systems
13/26
P2P network topologies

Unstructured systems
  e.g. SETI@home

Structured (DHT) systems
  e.g. CAN, CHORD

Super-peer (hybrid) systems
  e.g. Napster
14/26
P2P unstructured network

[Figure: four peers (peer 1 to peer 4), each holding its own data, connected by p2p links]

High autonomy (peer needs to know a neighbor to log in)
Searching by flooding the network
  general, inefficient
High fault-tolerance with replication
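
Searching by flooding can be illustrated with a small sketch. It is only an illustration under assumptions not found in the slides: the overlay is a hypothetical in-memory adjacency map, each query carries a hop limit (`ttl`), and every forwarded copy is counted as one message. The message count shows why flooding is general but inefficient.

```python
from collections import deque


def flood_search(neighbors, data, start, keyword, ttl):
    """Breadth-first flooding from `start`, limited to `ttl` hops.

    neighbors: dict peer -> list of neighbor peers (overlay links)
    data:      dict peer -> set of keywords shared by that peer
    Returns (set of peers holding the keyword, number of messages sent).
    """
    hits = set()
    messages = 0
    visited = {start}
    queue = deque([(start, ttl)])
    while queue:
        peer, hops_left = queue.popleft()
        if keyword in data.get(peer, set()):
            hits.add(peer)
        if hops_left == 0:
            continue  # hop limit reached, do not forward further
        for nxt in neighbors.get(peer, []):
            messages += 1  # every forwarded copy costs one message
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, hops_left - 1))
    return hits, messages


# Hypothetical 4-peer overlay and shared keywords (illustrative only).
neighbors = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
data = {1: set(), 2: {"x"}, 3: set(), 4: {"x", "y"}}
```

With a hop limit of 1, `flood_search(neighbors, data, 1, "x", 1)` reaches only the direct neighbors, so the answer is partial; raising the limit finds more peers at the price of more messages.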
15/26
P2P structured network
Distributed Hash Table (DHT)

[Figure: four peers connected by p2p links; the hash function assigns keys to peers, h(k1)=p1, h(k2)=p2, h(k3)=p3, h(k4)=p4, and peer i stores the data d(ki)]

Efficient exact-match search
  O(log n) for put(key,value), get(key)
Limited autonomy since a peer is responsible for a range of keys
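
The put/get interface and the idea that each peer is responsible for a range of keys can be sketched as follows. This is a toy, single-process sketch under stated assumptions: `ToyDHT`, the 16-bit identifier ring and the SHA-1 based `h` are all hypothetical choices, and the responsible peer is found by a local binary search over a global view of the ring. Real DHTs such as Chord have no global view and instead route hop by hop, which is where the O(log n) message cost comes from.

```python
import bisect
import hashlib


def h(value, bits=16):
    """Hash a string onto a ring of 2**bits identifiers."""
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    def __init__(self, peer_names):
        # Peers are placed on the ring by hashing their names.
        self.ring = sorted((h(name), name) for name in peer_names)
        self.ids = [pid for pid, _ in self.ring]
        self.store = {name: {} for name in peer_names}

    def responsible_peer(self, key):
        # First peer whose identifier is >= h(key), wrapping around the ring.
        i = bisect.bisect_left(self.ids, h(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key, value):
        self.store[self.responsible_peer(key)][key] = value

    def get(self, key):
        return self.store[self.responsible_peer(key)].get(key)


dht = ToyDHT(["p1", "p2", "p3", "p4"])
dht.put("k1", "d(k1)")
```

Only exact keys can be looked up this way: `dht.get("k1")` finds the value, but there is no direct way to ask for keys matching a keyword or a range, which is the limited query expressiveness of plain DHTs.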
16/26
Super-peer network


[Figure: super-peers connected to each other by sp2sp links; peers 1 to 4, each holding its own data, connected to super-peers by p2sp and sp2p links]

Super-peers can perform complex functions (meta-data management, indexing, access control, etc.)
  Efficiency and QoS
  Restricted autonomy
  SP = single point of failure => use several
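
As an illustration of indexing at super-peers and of using several of them to avoid a single point of failure, here is a minimal sketch. The `SuperPeer` class, the keyword index and the convention that `None` stands for a failed super-peer are assumptions made for the example, not part of the slides.

```python
class SuperPeer:
    """Keeps an index of which peers share which keywords."""

    def __init__(self):
        self.index = {}  # keyword -> set of peer names

    def register(self, peer, keywords):
        for kw in keywords:
            self.index.setdefault(kw, set()).add(peer)

    def unregister(self, peer):
        for peers in self.index.values():
            peers.discard(peer)

    def lookup(self, keyword):
        return set(self.index.get(keyword, set()))


def lookup_any(super_peers, keyword):
    """Query several super-peers and merge their answers.

    A `None` entry stands for a failed super-peer and is skipped.
    """
    result = set()
    for sp in super_peers:
        if sp is not None:
            result |= sp.lookup(keyword)
    return result


# Two super-peers holding the same (replicated) index.
sp1, sp2 = SuperPeer(), SuperPeer()
for sp in (sp1, sp2):
    sp.register("peer 1", ["x"])
    sp.register("peer 2", ["x", "y"])
```

If the first super-peer fails, `lookup_any([None, sp2], "x")` still returns both peers, at the cost of keeping the index on more than one super-peer.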
17/26
P2P systems comparison
Requirements     Unstructured  DHT   Super-peer
Autonomy         high          low   avg
Query exp.       high          low   high
Efficiency       low           high  high
QoS              low           high  high
Fault-tolerance  high          high  low
Security         low           low   high
18/26
Data management in P2P systems

Current research focuses on

Decentralized schema mappings
Extending DHT for complex querying
  PIER: exact-match and join queries
Query reformulation
  PeerDB: unstruct. network, keyword search only
  Edutella: super-peer, RDF-based schemas
  Piazza: graph of pair-wise schema mappings
Replication
  generally limited to static read-only files
  P-Grid addresses updates in structured networks
19/26
Data management in APPA (Atlas P2P Architecture)

Objectives
  Scalability, availability and performance

Main features
  Network-independent architecture
  Layered, service-based architecture
  Replication with semantics-based reconciliation
  Decentralized schema management
  Schema-based query support and optimization
  Peer data caching

Prototype on JXTA
  Network-independent P2P services
20/26
Network independent APPA

[Figure: layered architecture, from top to bottom]

Advanced Services
  Query Processing, Replication, Cache Management, Security, ...
Basic Services
  Group Membership Management, Consensus Management, P2P Data Management, Peer Management, Peer Communication, ...
P2P Network
  Key-based Storage and Retrieval, Peer ID Assignment, Peer Linking
Internet
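
One way to picture network independence is to write the upper layers against a small network-layer interface only. The sketch below assumes a hypothetical `P2PNetwork` interface with just key-based storage and retrieval; the class and method names are illustrative and are not APPA's actual API.

```python
from abc import ABC, abstractmethod


class P2PNetwork(ABC):
    """Hypothetical network-layer interface: key-based storage and retrieval."""

    @abstractmethod
    def store(self, key, value):
        ...

    @abstractmethod
    def retrieve(self, key):
        ...


class InMemoryNetwork(P2PNetwork):
    """Stand-in for a real overlay (DHT, super-peer, ...)."""

    def __init__(self):
        self._data = {}

    def store(self, key, value):
        self._data[key] = value

    def retrieve(self, key):
        return self._data.get(key)


class P2PDataManagement:
    """Basic service that depends only on the P2PNetwork interface."""

    def __init__(self, network):
        self.network = network

    def publish(self, name, obj):
        self.network.store("data:" + name, obj)

    def fetch(self, name):
        return self.network.retrieve("data:" + name)


service = P2PDataManagement(InMemoryNetwork())
service.publish("mapping", {"owner": "p"})
```

Swapping `InMemoryNetwork` for another implementation of the same interface would leave `P2PDataManagement`, and any advanced service built on it, unchanged.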
21/26
Different APPA architectures

[Figure: alternative APPA architectures. Labels: Peer, Super-peer, Advanced services, Basic services, local data, P2P data, P2P network, DHT network]
22/26
Schema management in APPA

Takes advantage of the collaborative nature of the applications
  Peers that wish to cooperate agree on a Common Schema Description (CSD)
Given 2 CSD relation definitions, an example of peer mapping at peer p is:
  p:r(A,B,D) csd:r1(A,B,C), csd:r2(C,D,E)
Peer mappings stored as P2P data
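
The peer mapping can be read as data that relates the local relation to the CSD relations. The sketch below rests on an assumption: the symbol linking the two sides of the mapping does not appear in the transcript, so the code assumes that p:r(A,B,D) corresponds to the join of csd:r1 and csd:r2 on their shared attribute C, projected onto A, B and D. The sample tuples and the `evaluate` helper are hypothetical.

```python
# CSD relations as lists of dicts (hypothetical sample tuples).
r1 = [{"A": 1, "B": "x", "C": 10}, {"A": 2, "B": "y", "C": 20}]
r2 = [{"C": 10, "D": "u", "E": 0}, {"C": 30, "D": "v", "E": 1}]

# The peer mapping itself is plain data, so it can be stored like any other item.
mapping = {
    "head": ("p:r", ["A", "B", "D"]),
    "body": [("csd:r1", ["A", "B", "C"]), ("csd:r2", ["C", "D", "E"])],
}


def evaluate(mapping, relations):
    """Natural-join the body relations, then project onto the head attributes."""
    rows = [{}]
    for name, _attrs in mapping["body"]:
        joined = []
        for partial in rows:
            for t in relations[name]:
                # Tuples join when they agree on every shared attribute.
                if all(partial[a] == t[a] for a in partial.keys() & t.keys()):
                    joined.append({**partial, **t})
        rows = joined
    _, head_attrs = mapping["head"]
    return [{a: row[a] for a in head_attrs} for row in rows]


result = evaluate(mapping, {"csd:r1": r1, "csd:r2": r2})
```

Here only the tuples with C = 10 join, so `result` holds a single tuple over A, B and D.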
23/26
Replication in APPA


Small-world assumption: peers work in smaller groups with time locality
Lazy multi-master replication
  n peers can update the same replica
  Improves read performance and availability
Replica divergence solved by distributed log-based reconciliation
  Exploit P2P data management service
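
To make log-based reconciliation concrete, here is a deliberately simplified sketch. It assumes each master keeps a log of timestamped updates and that divergence is resolved by replaying all logs in one deterministic order (timestamp, then peer name), so the last writer wins. This is much simpler than the semantics-based reconciliation the slides refer to; `reconcile` and the log format are hypothetical.

```python
def reconcile(logs):
    """Merge per-peer update logs and replay them in one global order.

    logs: dict peer -> list of (timestamp, key, value) updates.
    Ties on timestamp are broken by peer name, so every peer that
    runs this function on the same logs computes the same state.
    """
    merged = sorted(
        (ts, peer, key, value)
        for peer, log in logs.items()
        for ts, key, value in log
    )
    state = {}
    for _ts, _peer, key, value in merged:
        state[key] = value  # later updates overwrite earlier ones
    return state


# Two masters updated the same replica independently (illustrative data).
logs = {
    "p1": [(1, "doc", "a"), (3, "doc", "c")],
    "p2": [(2, "doc", "b"), (3, "title", "t")],
}
state = reconcile(logs)
```

Because the replay order depends only on the log contents, every peer that gathers the same logs converges to the same replica state.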
24/26
Query processing in APPA

Given a SQL-like query on peer schema, performs
  query reformulation
    Maps the query on CSD schemas
  query matching
    Finds relevant peers
  query optimization
    Selects best peers, taking replication into account
  query decomposition and execution
    Exploits parallelism
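
The four steps above can be sketched as a small pipeline. Everything in it is an assumption made for illustration: a query is reduced to a list of relation names, the peer mapping and the catalog of replica holders are plain dictionaries, "best peer" means the least-loaded one, and sub-queries are simulated by a local `fetch` function run in a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def reformulate(query, peer_mapping):
    """Rewrite peer-schema relation names into CSD relation names."""
    return [csd for rel in query for csd in peer_mapping[rel]]


def match(csd_relations, catalog):
    """Find, for each CSD relation, the peers that hold a copy."""
    return {rel: catalog.get(rel, []) for rel in csd_relations}


def optimize(candidates, load):
    """Pick the least-loaded replica holder for each relation."""
    return {
        rel: min(peers, key=lambda p: load[p])
        for rel, peers in candidates.items()
        if peers
    }


def execute(plan, fetch):
    """Send one sub-query per relation, in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {rel: pool.submit(fetch, peer, rel) for rel, peer in plan.items()}
        return {rel: f.result() for rel, f in futures.items()}


# Hypothetical inputs.
peer_mapping = {"r": ["r1", "r2"]}
catalog = {"r1": ["p1", "p2"], "r2": ["p3"]}
load = {"p1": 5, "p2": 1, "p3": 2}

plan = optimize(match(reformulate(["r"], peer_mapping), catalog), load)
answers = execute(plan, lambda peer, rel: f"{rel}@{peer}")
```

Replication shows up in the optimization step: r1 is available at two peers, and the plan picks the less loaded one.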
25/26
Conclusion


Advanced P2P applications will need high-level data management services
Various P2P networks will improve
  Network-independence crucial to exploit and combine them
Many technical issues
Important to characterize applications that can most benefit from P2P wrt other distributed architectures
26/26