Introduction to PySpark


Intro to PySpark
Jason White, Shopify

What is PySpark?

• Python interface to Apache Spark
• Map/Reduce-style distributed computing
• Spark is natively Scala; the Python and R interfaces are well maintained
• Uses Py4J to bridge Python and the Java/Scala core (sketch below)
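A one-line taste in the pyspark shell (a sketch; the shell provides `sc`, a SparkContext, and Py4J relays calls to the JVM):

# `sc` already exists in the pyspark shell; lambdas are pickled on the
# driver and executed in Python worker processes alongside the JVM
squares = sc.parallelize(range(10)).map(lambda x: x * x)
print(squares.sum())  # 285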

PySpark Basics

• The basic premise of distributed computing:
  • Data is big
  • The program to process the data is relatively small
  • Send the program to where the data lives
  • Leave the data on multiple nodes; scale horizontally

PySpark Basics

• Driver: e.g. my laptop
• Cluster Manager: YARN, Mesos, etc.
• Workers: containers spun up by the Cluster Manager
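A minimal sketch of wiring these together in a standalone script; the app name and master URL here are illustrative ('local[*]' runs everything on one machine, 'yarn' delegates resource management to YARN):

from pyspark import SparkConf, SparkContext

# illustrative settings: swap the master for your cluster manager
conf = SparkConf().setAppName('intro-to-pyspark').setMaster('local[*]')
sc = SparkContext(conf=conf)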

PySpark Basics

• RDD: Resilient Distributed Dataset
  • Interface to parallelized data in the cluster
  • Map, filter, and reduce functions are sent from the driver and executed by workers on chunks of data in parallel (sketch below)
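For instance (a sketch, runnable in the shell):

numbers = sc.parallelize(range(1, 101))          # distributed across workers
evens = numbers.filter(lambda n: n % 2 == 0)     # runs per-partition
total = evens.map(lambda n: n * n).reduce(lambda a, b: a + b)
print(total)  # 171700: sum of squares of the even numbers 2..100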

PySpark: Hello World

• The classic word-count problem
  • How many times does each word appear in a given text?
• Approach: each worker computes word counts independently, then the results are aggregated together

PySpark: Hello World

Map:
  worker 1: "The brown dog"  →  (The, 1) (brown, 1) (dog, 1)
  worker 2: "brown dog"      →  (brown, 1) (dog, 1)

Shuffle & Reduce:
  worker 1: (The, 1) (dog, 2)
  worker 2: (brown, 2)

Collect:
  driver: (The, 1) (dog, 2) (brown, 2)

Demo

# example 1
text = "the brown dog jumped over the other brown dog"
text_rdd = sc.parallelize(text.split(' '))
text_rdd.map(lambda word: (word, 1)) \
    .reduceByKey(lambda left, right: left + right) \
    .collect()

# example 2
import string
time_machine = sc.textFile('/user/jasonwhite/time_machine')
time_machine_tuples = time_machine.flatMap(lambda line: line.lower().split(' ')) \
    .map(lambda word: ''.join(ch for ch in word
                              if ch in string.ascii_letters)) \
    .filter(lambda word: word != '') \
    .map(lambda word: (word, 1))
word_counts = time_machine_tuples.reduceByKey(lambda left, right: left + right)
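To inspect the result without collecting the whole RDD, takeOrdered pulls back only what is asked for (a sketch):

# only the 10 most frequent words reach the driver
top_words = word_counts.takeOrdered(10, key=lambda pair: -pair[1])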

Monoids

• A monoid combines:
  • a set of data; and
  • an associative binary operation (with an identity element)
• Spark's reduces also assume commutativity, since row order isn't guaranteed
• Very efficient in M/R; strongly preferred
• Examples:
  • addition of integers
  • min/max of records by timestamp
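Averages are the classic non-example: averaging partial averages is not associative, which is exactly why example 3 below reduces over (sum, count) pairs (a monoid) and divides only at the end. A quick sketch of the failure:

# naive averaging is not associative: avg(avg(1, 2), 6) != avg(1, 2, 6)
avg = lambda a, b: (a + b) / 2.0
print(avg(avg(1, 2), 6))     # 3.75 -- wrong
print((1 + 2 + 6) / 3.0)     # 3.0  -- right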

Demo

# example 3
dataset = sc.parallelize([
    {'id': 1, 'value': 1},
    {'id': 2, 'value': 2},
    {'id': 2, 'value': 6}
])

def add_tuples(left, right):
    left_sum, left_count = left
    right_sum, right_count = right
    return (left_sum + right_sum, left_count + right_count)

# global (sum, count); divide once at the end to get the average
total, count = dataset.map(lambda d: (d['value'], 1)).reduce(add_tuples)
average = total * 1.0 / count

# per-key averages
averages_by_key = dataset.map(lambda d: (d['id'], (d['value'], 1))) \
    .reduceByKey(add_tuples) \
    .map(lambda pair: (pair[0], pair[1][0] * 1.0 / pair[1][1]))

Demo

# example 4
from datetime import date

dataset = sc.parallelize([
    {'id': 1, 'group_id': 10, 'timestamp': date(1978, 3, 2)},
    {'id': 2, 'group_id': 10, 'timestamp': date(1984, 3, 24)},
    {'id': 3, 'group_id': 10, 'timestamp': date(1986, 5, 19)},
    {'id': 4, 'group_id': 11, 'timestamp': date(1956, 6, 5)},
    {'id': 5, 'group_id': 11, 'timestamp': date(1953, 2, 21)}
])

def calculate_age(d):
    d['age'] = (date.today() - d['timestamp']).days  # .days is an attribute
    return d

def calculate_group_stats(left, right):
    return {
        'earliest': min(left['earliest'], right['earliest']),
        'latest': max(left['latest'], right['latest']),
        'total_age': left['total_age'] + right['total_age'],
        'count': left['count'] + right['count']
    }

group_stats = dataset.map(calculate_age) \
    .map(lambda d: (d['group_id'], {'earliest': d['timestamp'],
                                    'latest': d['timestamp'],
                                    'total_age': d['age'],
                                    'count': 1})) \
    .reduceByKey(calculate_group_stats)
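One more map turns those monoid outputs into per-group averages (a sketch built on `group_stats` above):

average_ages = group_stats.map(
    lambda pair: (pair[0], pair[1]['total_age'] * 1.0 / pair[1]['count']))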

Joining RDDs

• Like many RDD operations, join works on (k, v) pairs
• Each side is shuffled using the common keys
• Each node builds its part of the joined dataset

Joining RDDs

left:  {'id': 1, 'field1': 'foo'}   {'id': 2, 'field1': 'bar'}
right: {'id': 1, 'field2': 'baz'}   {'id': 2, 'field2': 'baz'}

keyed left:  (1, {'id': 1, 'field1': 'foo'})   (2, {'id': 2, 'field1': 'bar'})
keyed right: (1, {'id': 1, 'field2': 'baz'})   (2, {'id': 2, 'field2': 'baz'})

joined: (1, ({'id': 1, 'field1': 'foo'}, {'id': 1, 'field2': 'baz'}))
        (2, ({'id': 2, 'field1': 'bar'}, {'id': 2, 'field2': 'baz'}))

Demo

# example 5
first_dataset = sc.parallelize([
    {'id': 1, 'field1': 'foo'},
    {'id': 2, 'field1': 'bar'},
    {'id': 2, 'field1': 'baz'},
    {'id': 3, 'field1': 'foo'}
])
first_dataset = first_dataset.map(lambda d: (d['id'], d))

second_dataset = sc.parallelize([
    {'id': 1, 'field2': 'abc'},
    {'id': 2, 'field2': 'def'}
])
second_dataset = second_dataset.map(lambda d: (d['id'], d))

output = first_dataset.join(second_dataset)
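join is an inner join, so id 3 above is dropped. PySpark also offers leftOuterJoin, rightOuterJoin, and fullOuterJoin, which pad the unmatched side with None; for example:

# keeps (3, ({'id': 3, 'field1': 'foo'}, None)) in the output
left_output = first_dataset.leftOuterJoin(second_dataset)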

Key Skew

• The Achilles' heel of M/R: key skew
• The shuffle phase distributes like keys to like nodes
• If billions of rows are shuffled to the same node, it may cause slight memory issues
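A cheap way to spot skew before it bites (a sketch; `pair_rdd` is a stand-in for any (k, v) RDD): countByKey returns a plain dict to the driver, so it only suits modest key cardinality.

counts = pair_rdd.countByKey()                            # {key: row_count}
heaviest = sorted(counts.items(), key=lambda kv: -kv[1])[:10]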

Joining RDDs w/ Skew

• When joining to a small RDD, an alternative is to "broadcast" the RDD
  • Instead of shuffling, the entire RDD is sent to each worker
  • Now each worker has all the data it needs
  • Each join is now just a map. No shuffle needed!

Demo

# example 6
first_dataset = sc.parallelize([
    {'id': 1, 'field1': 'foo'},
    {'id': 2, 'field1': 'bar'},
    {'id': 2, 'field1': 'baz'},
    {'id': 3, 'field1': 'foo'}
])
first_dataset = first_dataset.map(lambda d: (d['id'], d))

second_dataset = sc.parallelize([
    {'id': 1, 'field2': 'abc'},
    {'id': 2, 'field2': 'def'}
])
second_dataset = second_dataset.map(lambda d: (d['id'], d))

# broadcast the small side as a plain dict to every worker
second_dict = sc.broadcast(second_dataset.collectAsMap())

def join_records(pair):
    key, record = pair
    if key in second_dict.value:
        yield (key, (record, second_dict.value[key]))

output = first_dataset.flatMap(join_records)

Ordering

• Row order isn't guaranteed unless you explicitly sort the RDD
• But: sometimes you need to process events in order!
• Solution: repartitionAndSortWithinPartitions

Ordering

input:  {'id': 1, 'value': 10}  {'id': 2, 'value': 10}  {'id': 3, 'value': 20}
        {'id': 1, 'value': 12}  {'id': 1, 'value': 5}   {'id': 2, 'value': 15}

Shuffle & Sort:
  partition 1: {'id': 1, 'value': 5} {'id': 1, 'value': 10} {'id': 1, 'value': 12} {'id': 3, 'value': 20}
  partition 2: {'id': 2, 'value': 10} {'id': 2, 'value': 15}

MapPartitions:
  partition 1: {'id': 1, 'interval': 5} {'id': 1, 'interval': 2}
  partition 2: {'id': 2, 'interval': 5}
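A sketch reproducing the diagram above (the keying scheme is one common pattern, not the only one): key each event by (id, value), partition on id alone, and let the sort order the full key, so each partition sees one id's events in value order.

events = sc.parallelize([
    {'id': 1, 'value': 10}, {'id': 2, 'value': 10}, {'id': 3, 'value': 20},
    {'id': 1, 'value': 12}, {'id': 1, 'value': 5},  {'id': 2, 'value': 15},
])

keyed = events.map(lambda d: ((d['id'], d['value']), d))
shuffled = keyed.repartitionAndSortWithinPartitions(
    numPartitions=2,
    partitionFunc=lambda key: hash(key[0]))   # partition by id only

def to_intervals(partition):
    # events arrive sorted by (id, value); emit gaps between consecutive values
    last_value = {}
    for _, d in partition:
        if d['id'] in last_value:
            yield {'id': d['id'], 'interval': d['value'] - last_value[d['id']]}
        last_value[d['id']] = d['value']

intervals = shuffled.mapPartitions(to_intervals)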

Thanks!
