XtreemFS – A Cloud File System
Michael Berlin, Zuse Institute Berlin
Contrail Summer School, Almere, 24.07.2012
Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & Virtualization (ICT-2009.1.2)
Project reference: 257438
Motivation: Cloud Storage / Cloud File System
• Cloud storage requirements
  • highly available
  • scalable
  • elastic: add and remove capacity
  • suitable for wide area networks
• Support for legacy applications
  • POSIX-compatible file system required
• Google for "cloud file system": www.XtreemFS.org
Outline
• XtreemFS Architecture
• Replication in XtreemFS
  • Read-Only File Replication
  • Read/Write File Replication
  • Custom Replica Placement and Selection
  • Metadata Replication
• XtreemFS Use Cases
  • XtreemFS and OpenNebula
XtreemFS – A Cloud File System
• History
  • 2006: initial development in the XtreemOS project
  • 2010: further development in the Contrail project
  • August 2012: release 1.3.2
• Features
  • distributed file system
  • POSIX compatible
  • replication
  • X.509 certificates and SSL support
• Software
  • open source: www.xtreemfs.org
  • client software (C++) runs on Linux & OS X (FUSE) and Windows (Dokan)
  • server software (Java)
XtreemFS Architecture
• Separation of metadata and file content (sketched below):
  • Metadata and Replica Catalog (MRC)
    – stores metadata per volume
  • Object Storage Devices (OSDs)
    – directly accessed by clients
    – file content split into objects
• → object-based file system
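To make the separation concrete, here is a minimal Java sketch of how a client-side read is split between the two services; the types (Mrc, Osd, FileInfo) and method names are illustrative assumptions, not the actual XtreemFS client API:

    import java.util.List;

    // Minimal sketch of an object-based read (hypothetical types): metadata
    // comes from the MRC, file content is fetched object by object directly
    // from an OSD.
    public class ObjectBasedReadSketch {

        interface Mrc {                                    // metadata service
            FileInfo open(String volume, String path);
        }
        interface Osd {                                    // storage service
            byte[] readObject(String fileId, long objectNumber);
        }
        record FileInfo(String fileId, long objectCount, List<Osd> replicas) {}

        static byte[][] readAllObjects(Mrc mrc, String volume, String path) {
            FileInfo info = mrc.open(volume, path);        // 1. metadata lookup (MRC)
            Osd osd = info.replicas().get(0);              // 2. pick one replica
            byte[][] objects = new byte[(int) info.objectCount()][];
            for (int i = 0; i < objects.length; i++) {
                objects[i] = osd.readObject(info.fileId(), i);  // 3. data path (OSD only)
            }
            return objects;
        }
    }

The point of the design is that the MRC only appears in step 1; all file content flows directly between the client and the OSDs.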
Scalability
• Storage capacity
  • addition and removal of OSDs possible
  • OSDs may be used by multiple volumes
• File I/O throughput
  • scales with the number of OSDs
• Metadata throughput
  • limited by the MRC hardware
  • → use many volumes spread over multiple MRCs
Read-Only Replication (1)
• Only for "write-once" files
  • file must be marked as "read-only"
  • done automatically after close()
• Use case: CDN
• Replica types:
  1. Full replicas
     • complete copy, fills itself as fast as possible
  2. Partial replicas
     • initially empty
     • on-demand fetching of missing objects (sketched below)
• P2P-like efficient transfer between all replicas
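A minimal Java sketch of the partial-replica behaviour, using hypothetical types (RemoteReplica, PartialReplicaSketch) rather than the real OSD code:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of on-demand fetching in a partial read-only replica: objects not
    // yet present locally are pulled from another replica on first access and
    // kept, so the replica fills up lazily.
    public class PartialReplicaSketch {

        interface RemoteReplica {
            byte[] fetchObject(long objectNumber);   // read from a peer replica
        }

        private final Map<Long, byte[]> localObjects = new HashMap<>();
        private final List<RemoteReplica> peers;

        public PartialReplicaSketch(List<RemoteReplica> peers) {
            this.peers = peers;
        }

        // Serve an object; fetch it from a peer on a local miss. Files are
        // read-only, so a fetched object can never become stale.
        public byte[] readObject(long objectNumber) {
            return localObjects.computeIfAbsent(objectNumber,
                    n -> peers.get((int) (n % peers.size())).fetchObject(n));
        }
    }

Because the files never change, cached objects can be kept indefinitely, which is what makes the CDN-style, P2P-like distribution cheap.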
Read-Only Replication (2) [figure]
Read/Write Replication (1)
• Primary/backup scheme
  • POSIX requires a total order of update operations → primary/backup
  • primary fail-over?
• Leases
  • a lease grants access to a resource (here: the primary role) for a predefined period of time
  • fail-over after the lease timeout possible
  • assumption: loosely synchronized clocks with a maximum drift ε (see the sketch below)
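A minimal sketch, assuming only the drift bound ε from the slide, of how lease validity can be checked conservatively on both sides; the class and method names are made up for illustration and are not the actual Flease implementation:

    // Lease-based primary fail-over under loosely synchronized clocks: the
    // holder under-estimates its remaining lease time by epsilon, other
    // replicas wait an extra epsilon before trying to take over, so two
    // primaries can never be active at the same time.
    public class LeaseSketch {

        private final long epsilonMs;     // assumed maximum clock drift

        public LeaseSketch(long epsilonMs) {
            this.epsilonMs = epsilonMs;
        }

        // Holder side: be conservative, assume the local clock runs slow.
        public boolean mayActAsPrimary(long leaseTimeoutMs) {
            return System.currentTimeMillis() + epsilonMs < leaseTimeoutMs;
        }

        // Contender side: assume the local clock runs fast, and only try to
        // acquire the lease after the timeout plus the drift margin.
        public boolean mayTryToAcquire(long leaseTimeoutMs) {
            return System.currentTimeMillis() - epsilonMs > leaseTimeoutMs;
        }
    }

The holder pretends its clock runs slow and the contenders pretend theirs runs fast, so even with a drift of up to ε the lease periods never overlap.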
Read/Write Replication (2–4): Replicated write() [figure slides] (combined in the sketch below)
  1. Lease Acquisition
  2. Data Dissemination

Read/Write Replication (5): Replicated read() [figure slide]
  1. Lease Acquisition
  1b. "Replica Reset": update the primary's replica
  2. Respond to read() using the local replica
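The following Java sketch ties the figures together from the primary's point of view; the types and the simplified flow are assumptions made for illustration, not the actual XtreemFS replication code:

    import java.util.List;

    // Sketch of the primary's write path: 1. lease acquisition, 1b. replica
    // reset to bring the new primary up to date, 2. dissemination of the
    // update to a majority of replicas before acknowledging.
    public class ReplicatedWriteSketch {

        interface Backup {
            boolean applyUpdate(long objectNumber, byte[] data);  // remote update
        }

        private final List<Backup> backups;
        private boolean isPrimary = false;

        public ReplicatedWriteSketch(List<Backup> backups) {
            this.backups = backups;
        }

        public void write(long objectNumber, byte[] data) {
            if (!isPrimary) {
                acquireLease();   // 1. become primary (Flease, majority-based)
                replicaReset();   // 1b. bring the local replica up to date
                isPrimary = true;
            }
            // 2. data dissemination: the write succeeds once a majority of all
            //    N replicas (this primary plus its backups) has applied it.
            int acks = 1;                       // the primary's own local copy
            for (Backup backup : backups) {
                if (backup.applyUpdate(objectNumber, data)) {
                    acks++;
                }
            }
            int n = backups.size() + 1;
            if (acks < n / 2 + 1) {
                throw new IllegalStateException("update not applied by a majority");
            }
            // A subsequent read() on the primary is answered from the local replica.
        }

        private void acquireLease() { /* distributed lease acquisition via Flease */ }
        private void replicaReset() { /* fetch the latest object versions from backups */ }
    }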
Read/Write Replication: Distributed Lease Acquisition with Flease
[figure: central lock service vs. Flease]
• Flease
  • failure tolerant: majority-based
  • scalable: one lease per file
• Experiment
  • ZooKeeper: 3 servers
  • Flease: 3 nodes (2 randomly selected)
Read/Write Replication: Data Dissemination
• Ensuring consistency with a quorum protocol: R + W > N
  • R: number of replicas that have to be read from
  • W: number of replicas that have to be updated
  • quorum intersection property: the intersection is never empty (see the sketch below)
• Write All, Read 1 (W = N, R = 1)
  • reads from backup replicas allowed
  • no availability: writes block if a single replica is unreachable
• Write Quorum, Read Quorum
  • available if a majority is reachable
  • the quorum read is covered by the "Replica Reset" phase
[figure: example with N = 3 replicas (W = 2, R = 2), a) write, b) read]
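A tiny Java sketch of the quorum condition stated above; the class and method names are made up for illustration:

    // A read quorum of size R and a write quorum of size W out of N replicas
    // always intersect iff R + W > N, so every read sees at least one replica
    // that holds the latest write.
    public class QuorumSketch {

        static boolean quorumsIntersect(int n, int r, int w) {
            return r + w > n;
        }

        // Majority quorums (R = W = floor(N/2) + 1) satisfy the condition and
        // stay available as long as a majority of the replicas is reachable.
        static int majority(int n) {
            return n / 2 + 1;
        }

        public static void main(String[] args) {
            int n = 3;
            System.out.println(quorumsIntersect(n, 1, n));            // true: write all, read 1
            System.out.println(quorumsIntersect(n, 2, 2));            // true: the example above (2 + 2 > 3)
            System.out.println(quorumsIntersect(n, 1, majority(n)));  // false: R = 1 with W = 2 is unsafe
        }
    }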
Read/Write Replication: Summary
• High up-front cost (for the first access to an inactive file)
  • 3+ round-trips: 2 for Flease (lease acquisition), 1 for Replica Reset
  • + further round-trips when fetching missing objects
• Minimal cost for subsequent operations
  • read: identical to the non-replicated case
  • write: latency increases by the time to update a majority of the backups
• Works at file level: scales with the number of OSDs and files
• Flease: no I/O to stable storage needed for crash recovery
Custom Replica Placement and Selection
• Policies
  • filter and sort the available OSDs/replicas
  • evaluate client information (IP address/hostname, estimated latency)
  • e.g., "create the file on an OSD close to me", "access the closest replica"
• Available default policies: Server ID, DNS, Datacenter Map, Vivaldi
• Own policies possible (Java); see the sketch below
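As a sketch of what such a policy can look like, the Java snippet below filters and sorts OSDs by proximity to the client; the OsdInfo type, the select() signature, and the same-subnet heuristic are assumptions for illustration and differ from the real XtreemFS policy plug-in interface:

    import java.net.InetAddress;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical placement/selection policy: drop unusable OSDs, then sort
    // the rest so that OSDs "close" to the client come first.
    public class PreferSameSubnetPolicySketch {

        record OsdInfo(String uuid, InetAddress address, boolean online) {}

        // "Close" is approximated here as "same /24 subnet as the client";
        // real policies can use a datacenter map or Vivaldi coordinates instead.
        static List<OsdInfo> select(List<OsdInfo> candidates, InetAddress client) {
            return candidates.stream()
                    .filter(OsdInfo::online)
                    .sorted(Comparator.comparingInt(
                            (OsdInfo o) -> sameSubnet(o.address(), client) ? 0 : 1))
                    .toList();
        }

        private static boolean sameSubnet(InetAddress a, InetAddress b) {
            byte[] x = a.getAddress();
            byte[] y = b.getAddress();
            return x.length == 4 && y.length == 4
                    && x[0] == y[0] && x[1] == y[1] && x[2] == y[2];
        }
    }

The default policies listed on the slide follow the same filter-and-sort pattern, only with different distance estimates.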
Replica Placement/Selection: Vivaldi Visualization [figure]
Metadata Replication
• Replication at the database level
  • same approach as file read/write replication
• Loosened consistency possible
  • allow stale reads
• All services replicated → no single point of failure
XtreemFS Use Cases
• Storage of VM images for IaaS solutions (OpenNebula, ...)
• Storage-as-a-Service: volumes per user
• XtreemFS as an HDFS replacement in Hadoop
• XtreemFS in ConPaaS: storage on demand for other services
XtreemFS and OpenNebula (1)
• Use case: VM images in an OpenNebula cluster
  • without a distributed file system: scp VM images to the hosts
  • with a distributed file system: shared storage, available on all nodes
    • support for live migration
• Fault-tolerant storage of VM images
  • resume a VM on another node after a crash
  • → use XtreemFS read/write file replication
XtreemFS and OpenNebula (2)
• VM deployment
  • create a copy (clone) of the original VM image
  • run the cloned VM image on the scheduled host
  • (discard the cloned image after VM shutdown)
• Problems
  1. cloning is time-consuming
  2. waste of space
  3. increasing total boot time when starting multiple VMs (e.g., the ConPaaS image)
XtreemFS and OpenNebula: qcow2 + Replication
• qcow2 VM image format allows snapshots (layering sketched below):
  1. immutable backing file
  2. mutable, initially empty snapshot file
• Instead of cloning, snapshot the original VM image (< 1 second)
  • use read/write replication for the snapshot file
• Remaining problem: running multiple VMs simultaneously
  • snapshot files: read/write replication scales with the number of OSDs and files
  • backing file: bottleneck
  • → use read-only replication for the backing file
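A conceptual Java sketch of the qcow2-style layering described above (hypothetical types; real qcow2 images work at a different level of detail): reads fall back to the shared backing file, writes only ever touch the per-VM snapshot file.

    import java.util.HashMap;
    import java.util.Map;

    // Copy-on-write layering: the immutable backing image is shared by all VMs
    // (read-only replicated), the mutable overlay is private to one VM
    // (read/write replicated) and starts out empty.
    public class CopyOnWriteImageSketch {

        interface BackingImage {
            byte[] readBlock(long blockNumber);   // shared, never modified
        }

        private final BackingImage backing;                        // read-only replication
        private final Map<Long, byte[]> overlay = new HashMap<>(); // snapshot file

        public CopyOnWriteImageSketch(BackingImage backing) {
            this.backing = backing;
        }

        public byte[] readBlock(long blockNumber) {
            byte[] local = overlay.get(blockNumber);
            return local != null ? local : backing.readBlock(blockNumber);
        }

        public void writeBlock(long blockNumber, byte[] data) {
            overlay.put(blockNumber, data);       // the backing file stays untouched
        }
    }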
XtreemFS and OpenNebula: Benchmark (1)
• OpenNebula test cluster
  • frontend + 30 worker nodes
  • Gigabit Ethernet (100 MB/s)
  • SATA disk (70 MB/s)
• Setup
  • frontend: MRC, OSD (holds the ConPaaS VM image)
  • each worker node: OSD, XtreemFS FUSE client, OpenNebula node
  • replica placement + replica selection: prefer the local OSD/replica
XtreemFS and OpenNebula: Benchmark (2)

  Setup                                       | Total Boot Time
  --------------------------------------------|---------------------------------
  copy (1.6 GB image file)                    | 82 seconds (69 seconds for copy)
  qcow2, 1 VM                                 | 13.6 seconds
  qcow2, 30 VMs                               | 20.8 seconds
  qcow2, 30 VMs, 30 partial replicas          | 142.8 seconds
    - second run                              | 20.1 seconds
    - after second run                        | 17.5 seconds
  + Read/Write Replication on snapshot file   | 19.5 seconds

• few read()s on the image, no bottleneck yet
• Replication: object granularity vs. small reads/writes
Future Research & Work
• Deduplication
• Improved elasticity
• Fault tolerance
• Optimize storage cost: erasure codes
• Self-*
• Client cache
• Less POSIX: replace the MRC with a scalable service
Funded under: FP7 (Seventh Framework Programme)
Area: Internet of Services, Software & Virtualization (ICT-2009.1.2)
Project reference: 257438
Total cost: 11.29 million euro
EU contribution: 8.3 million euro
Execution: from 2010-10-01 to 2013-09-30
Duration: 36 months
Contract type: Collaborative project (generic)