XtreemFS – A Cloud File System
Michael Berlin, Zuse Institute Berlin
Contrail Summer School, Almere, 24.07.2012
Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & Virtualization (ICT-2009.1.2)
Project reference: 257438
Motivation: Cloud Storage / Cloud File System
• Cloud storage requirements
  • highly available
  • scalable
  • elastic: add and remove capacity
  • suitable for wide area networks
• Support for legacy applications
  • POSIX-compatible file system required
• Google for "cloud file system": www.XtreemFS.org
Outline
• XtreemFS Architecture
• Replication in XtreemFS
  • Read-Only File Replication
  • Read/Write File Replication
  • Custom Replica Placement and Selection
  • Metadata Replication
• XtreemFS Use Cases
  • XtreemFS and OpenNebula
XtreemFS – A Cloud File System
• History
  • 2006: initial development in the XtreemOS project
  • 2010: further development in the Contrail project
  • August 2012: release 1.3.2
• Features
  • distributed file system
  • POSIX compatible
  • replication
  • X.509 certificates and SSL support
• Software
  • open source: www.xtreemfs.org
  • client software (C++) runs on Linux & OS X (FUSE) and Windows (Dokan)
  • server software (Java)
XtreemFS Architecture
• Separation of metadata and file content (sketched below):
  • Metadata and Replica Catalog (MRC)
    – stores metadata per volume
  • Object Storage Devices (OSDs)
    – directly accessed by clients
    – file content split into objects
• → object-based file system
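To make the separation concrete, here is a minimal Java sketch of how a client-side read is split between the two services; the types (Mrc, Osd, FileInfo) and method names are illustrative assumptions, not the actual XtreemFS client API:

    import java.util.List;

    // Minimal sketch of an object-based read (hypothetical types): metadata
    // comes from the MRC, file content is fetched object by object directly
    // from an OSD.
    public class ObjectBasedReadSketch {

        interface Mrc {                                    // metadata service
            FileInfo open(String volume, String path);
        }
        interface Osd {                                    // storage service
            byte[] readObject(String fileId, long objectNumber);
        }
        record FileInfo(String fileId, long objectCount, List<Osd> replicas) {}

        static byte[][] readAllObjects(Mrc mrc, String volume, String path) {
            FileInfo info = mrc.open(volume, path);        // 1. metadata lookup (MRC)
            Osd osd = info.replicas().get(0);              // 2. pick one replica
            byte[][] objects = new byte[(int) info.objectCount()][];
            for (int i = 0; i < objects.length; i++) {
                objects[i] = osd.readObject(info.fileId(), i);  // 3. data path (OSD only)
            }
            return objects;
        }
    }

The point of the design is that the MRC only appears in step 1; all file content flows directly between the client and the OSDs.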
Scalability
• Storage capacity
  • addition and removal of OSDs possible
  • OSDs may be used by multiple volumes
• File I/O throughput
  • scales with the number of OSDs
• Metadata throughput
  • limited by the MRC hardware
  • → use many volumes spread over multiple MRCs
Read-Only Replication (1)
• Only for "write-once" files
  • file must be marked as "read-only"
  • done automatically after close()
• Use case: CDN
• Replica types:
  1. Full replicas
     • complete copy, fills itself as fast as possible
  2. Partial replicas
     • initially empty
     • on-demand fetching of missing objects (sketched below)
• P2P-like efficient transfer between all replicas
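A minimal Java sketch of the partial-replica behaviour, using hypothetical types (RemoteReplica, PartialReplicaSketch) rather than the real OSD code:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of on-demand fetching in a partial read-only replica: objects not
    // yet present locally are pulled from another replica on first access and
    // kept, so the replica fills up lazily.
    public class PartialReplicaSketch {

        interface RemoteReplica {
            byte[] fetchObject(long objectNumber);   // read from a peer replica
        }

        private final Map<Long, byte[]> localObjects = new HashMap<>();
        private final List<RemoteReplica> peers;

        public PartialReplicaSketch(List<RemoteReplica> peers) {
            this.peers = peers;
        }

        // Serve an object; fetch it from a peer on a local miss. Files are
        // read-only, so a fetched object can never become stale.
        public byte[] readObject(long objectNumber) {
            return localObjects.computeIfAbsent(objectNumber,
                    n -> peers.get((int) (n % peers.size())).fetchObject(n));
        }
    }

Because the files never change, cached objects can be kept indefinitely, which is what makes the CDN-style, P2P-like distribution cheap.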
Read-Only Replication (2) [figure]
Read/Write Replication (1)
• Primary/backup scheme
  • POSIX requires a total order of update operations → primary/backup
  • primary fail-over?
• Leases
  • a lease grants access to a resource (here: the primary role) for a predefined period of time
  • fail-over after the lease timeout possible
  • assumption: loosely synchronized clocks with a maximum drift ε (see the sketch below)
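A minimal sketch, assuming only the drift bound ε from the slide, of how lease validity can be checked conservatively on both sides; the class and method names are made up for illustration and are not the actual Flease implementation:

    // Lease-based primary fail-over under loosely synchronized clocks: the
    // holder under-estimates its remaining lease time by epsilon, other
    // replicas wait an extra epsilon before trying to take over, so two
    // primaries can never be active at the same time.
    public class LeaseSketch {

        private final long epsilonMs;     // assumed maximum clock drift

        public LeaseSketch(long epsilonMs) {
            this.epsilonMs = epsilonMs;
        }

        // Holder side: be conservative, assume the local clock runs slow.
        public boolean mayActAsPrimary(long leaseTimeoutMs) {
            return System.currentTimeMillis() + epsilonMs < leaseTimeoutMs;
        }

        // Contender side: assume the local clock runs fast, and only try to
        // acquire the lease after the timeout plus the drift margin.
        public boolean mayTryToAcquire(long leaseTimeoutMs) {
            return System.currentTimeMillis() - epsilonMs > leaseTimeoutMs;
        }
    }

The holder pretends its clock runs slow and the contenders pretend theirs runs fast, so even with a drift of up to ε the lease periods never overlap.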
Read/Write Replication (2–4): Replicated write() [figure slides] (combined in the sketch below)
  1. Lease Acquisition
  2. Data Dissemination

Read/Write Replication (5): Replicated read() [figure slide]
  1. Lease Acquisition
  1b. "Replica Reset": update the primary's replica
  2. Respond to read() using the local replica
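The following Java sketch ties the figures together from the primary's point of view; the types and the simplified flow are assumptions made for illustration, not the actual XtreemFS replication code:

    import java.util.List;

    // Sketch of the primary's write path: 1. lease acquisition, 1b. replica
    // reset to bring the new primary up to date, 2. dissemination of the
    // update to a majority of replicas before acknowledging.
    public class ReplicatedWriteSketch {

        interface Backup {
            boolean applyUpdate(long objectNumber, byte[] data);  // remote update
        }

        private final List<Backup> backups;
        private boolean isPrimary = false;

        public ReplicatedWriteSketch(List<Backup> backups) {
            this.backups = backups;
        }

        public void write(long objectNumber, byte[] data) {
            if (!isPrimary) {
                acquireLease();   // 1. become primary (Flease, majority-based)
                replicaReset();   // 1b. bring the local replica up to date
                isPrimary = true;
            }
            // 2. data dissemination: the write succeeds once a majority of all
            //    N replicas (this primary plus its backups) has applied it.
            int acks = 1;                       // the primary's own local copy
            for (Backup backup : backups) {
                if (backup.applyUpdate(objectNumber, data)) {
                    acks++;
                }
            }
            int n = backups.size() + 1;
            if (acks < n / 2 + 1) {
                throw new IllegalStateException("update not applied by a majority");
            }
            // A subsequent read() on the primary is answered from the local replica.
        }

        private void acquireLease() { /* distributed lease acquisition via Flease */ }
        private void replicaReset() { /* fetch the latest object versions from backups */ }
    }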
Read/Write Replication: Distributed Lease Acquisition with Flease
[figure: central lock service vs. Flease]
• Flease
  • failure tolerant: majority-based
  • scalable: one lease per file
• Experiment
  • ZooKeeper: 3 servers
  • Flease: 3 nodes (2 randomly selected)
Read/Write Replication: Data Dissemination
• Ensuring consistency with a quorum protocol: R + W > N
  • R: number of replicas that have to be read from
  • W: number of replicas that have to be updated
  • quorum intersection property: the intersection is never empty (see the sketch below)
• Write All, Read 1 (W = N, R = 1)
  • reads from backup replicas allowed
  • no availability: writes block if a single replica is unreachable
• Write Quorum, Read Quorum
  • available if a majority is reachable
  • the quorum read is covered by the "Replica Reset" phase
[figure: example with N = 3 replicas (W = 2, R = 2), a) write, b) read]
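A tiny Java sketch of the quorum condition stated above; the class and method names are made up for illustration:

    // A read quorum of size R and a write quorum of size W out of N replicas
    // always intersect iff R + W > N, so every read sees at least one replica
    // that holds the latest write.
    public class QuorumSketch {

        static boolean quorumsIntersect(int n, int r, int w) {
            return r + w > n;
        }

        // Majority quorums (R = W = floor(N/2) + 1) satisfy the condition and
        // stay available as long as a majority of the replicas is reachable.
        static int majority(int n) {
            return n / 2 + 1;
        }

        public static void main(String[] args) {
            int n = 3;
            System.out.println(quorumsIntersect(n, 1, n));            // true: write all, read 1
            System.out.println(quorumsIntersect(n, 2, 2));            // true: the example above (2 + 2 > 3)
            System.out.println(quorumsIntersect(n, 1, majority(n)));  // false: R = 1 with W = 2 is unsafe
        }
    }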
Read/Write Replication: Summary
• High up-front cost (for the first access to an inactive file)
  • 3+ round-trips: 2 for Flease (lease acquisition), 1 for Replica Reset
  • + further round-trips when fetching missing objects
• Minimal cost for subsequent operations
  • read: identical to the non-replicated case
  • write: latency increases by the time to update a majority of the backups
• Works at file level: scales with the number of OSDs and files
• Flease: no I/O to stable storage needed for crash recovery
Custom Replica Placement and Selection
• Policies
  • filter and sort the available OSDs/replicas
  • evaluate client information (IP address/hostname, estimated latency)
  • e.g., "create the file on an OSD close to me", "access the closest replica"
• Available default policies: Server ID, DNS, Datacenter Map, Vivaldi
• Own policies possible (Java); see the sketch below
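As a sketch of what such a policy can look like, the Java snippet below filters and sorts OSDs by proximity to the client; the OsdInfo type, the select() signature, and the same-subnet heuristic are assumptions for illustration and differ from the real XtreemFS policy plug-in interface:

    import java.net.InetAddress;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical placement/selection policy: drop unusable OSDs, then sort
    // the rest so that OSDs "close" to the client come first.
    public class PreferSameSubnetPolicySketch {

        record OsdInfo(String uuid, InetAddress address, boolean online) {}

        // "Close" is approximated here as "same /24 subnet as the client";
        // real policies can use a datacenter map or Vivaldi coordinates instead.
        static List<OsdInfo> select(List<OsdInfo> candidates, InetAddress client) {
            return candidates.stream()
                    .filter(OsdInfo::online)
                    .sorted(Comparator.comparingInt(
                            (OsdInfo o) -> sameSubnet(o.address(), client) ? 0 : 1))
                    .toList();
        }

        private static boolean sameSubnet(InetAddress a, InetAddress b) {
            byte[] x = a.getAddress();
            byte[] y = b.getAddress();
            return x.length == 4 && y.length == 4
                    && x[0] == y[0] && x[1] == y[1] && x[2] == y[2];
        }
    }

The default policies listed on the slide follow the same filter-and-sort pattern, only with different distance estimates.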
Replica Placement/Selection: Vivaldi Visualization [figure]
Metadata Replication
• Replication at the database level
  • same approach as file read/write replication
• Loosened consistency possible
  • allow stale reads
• All services replicated → no single point of failure
XtreemFS Use Cases
• Storage of VM images for IaaS solutions (OpenNebula, ...)
• Storage-as-a-Service: volumes per user
• XtreemFS as an HDFS replacement in Hadoop
• XtreemFS in ConPaaS: storage on demand for other services
XtreemFS and OpenNebula (1)
• Use case: VM images in an OpenNebula cluster
  • without a distributed file system: scp VM images to the hosts
  • with a distributed file system: shared storage, available on all nodes
    • support for live migration
• Fault-tolerant storage of VM images
  • resume a VM on another node after a crash
  • → use XtreemFS read/write file replication
XtreemFS and OpenNebula (2)
• VM deployment
  • create a copy (clone) of the original VM image
  • run the cloned VM image on the scheduled host
  • (discard the cloned image after VM shutdown)
• Problems
  1. cloning is time-consuming
  2. waste of space
  3. increasing total boot time when starting multiple VMs (e.g., the ConPaaS image)
XtreemFS and OpenNebula: qcow2 + Replication
• qcow2 VM image format allows snapshots (layering sketched below):
  1. immutable backing file
  2. mutable, initially empty snapshot file
• Instead of cloning, snapshot the original VM image (< 1 second)
  • use read/write replication for the snapshot file
• Remaining problem: running multiple VMs simultaneously
  • snapshot files: read/write replication scales with the number of OSDs and files
  • backing file: bottleneck
  • → use read-only replication for the backing file
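A conceptual Java sketch of the qcow2-style layering described above (hypothetical types; real qcow2 images work at a different level of detail): reads fall back to the shared backing file, writes only ever touch the per-VM snapshot file.

    import java.util.HashMap;
    import java.util.Map;

    // Copy-on-write layering: the immutable backing image is shared by all VMs
    // (read-only replicated), the mutable overlay is private to one VM
    // (read/write replicated) and starts out empty.
    public class CopyOnWriteImageSketch {

        interface BackingImage {
            byte[] readBlock(long blockNumber);   // shared, never modified
        }

        private final BackingImage backing;                        // read-only replication
        private final Map<Long, byte[]> overlay = new HashMap<>(); // snapshot file

        public CopyOnWriteImageSketch(BackingImage backing) {
            this.backing = backing;
        }

        public byte[] readBlock(long blockNumber) {
            byte[] local = overlay.get(blockNumber);
            return local != null ? local : backing.readBlock(blockNumber);
        }

        public void writeBlock(long blockNumber, byte[] data) {
            overlay.put(blockNumber, data);       // the backing file stays untouched
        }
    }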
XtreemFS and OpenNebula: Benchmark (1)
• OpenNebula test cluster
  • frontend + 30 worker nodes
  • Gigabit Ethernet (100 MB/s)
  • SATA disk (70 MB/s)
• Setup
  • frontend: MRC, OSD (holds the ConPaaS VM image)
  • each worker node: OSD, XtreemFS FUSE client, OpenNebula node
  • replica placement + replica selection: prefer the local OSD/replica
XtreemFS and OpenNebula: Benchmark (2)

  Setup                                       | Total Boot Time
  --------------------------------------------|---------------------------------
  copy (1.6 GB image file)                    | 82 seconds (69 seconds for copy)
  qcow2, 1 VM                                 | 13.6 seconds
  qcow2, 30 VMs                               | 20.8 seconds
  qcow2, 30 VMs, 30 partial replicas          | 142.8 seconds
    - second run                              | 20.1 seconds
    - after second run                        | 17.5 seconds
  + Read/Write Replication on snapshot file   | 19.5 seconds

• few read()s on the image, no bottleneck yet
• Replication: object granularity vs. small reads/writes
Future Research & Work
• Deduplication
• Improved elasticity
• Fault tolerance
• Optimize storage cost: erasure codes
• Self-*
• Client cache
• Less POSIX: replace the MRC with a scalable service
Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & Virtualization (ICT-2009.1.2)
Project reference: 257438
Total cost: 11.29 million euro
EU contribution: 8.3 million euro
Execution: from 2010-10-01 to 2013-09-30
Duration: 36 months
Contract type: Collaborative project (generic)