Introduction to PySpark
Intro to PySpark
Jason White, Shopify
What is PySpark?
• Python interface to Apache Spark
• Map/Reduce-style distributed computing
• Natively Scala
• Interfaces to Python and R are well-maintained
• Uses Py4J for the Java/Scala interface
PySpark Basics
• Distributed computing's basic premise:
  • Data is big
  • The program to process the data is relatively small
  • Send the program to where the data lives
  • Leave data on multiple nodes, scale horizontally
PySpark Basics
• Driver: e.g. my laptop
• Cluster Manager: YARN, Mesos, etc.
• Workers: Containers spun up by the Cluster Manager
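
Not from the original slides, but a minimal sketch of how these roles show up in code: the driver is just the Python process that creates the SparkContext, and the master URL it is given decides which cluster manager (if any) spins up the workers. The app name and master values below are placeholders.

from pyspark import SparkConf, SparkContext

# The master URL chooses the cluster manager:
#   'local[*]'          -- no cluster manager; run everything in this process
#   'yarn'              -- ask YARN for executor containers
#   'mesos://host:5050' -- ask Mesos (host is a placeholder)
conf = SparkConf().setAppName('pyspark-intro').setMaster('local[*]')
sc = SparkContext(conf=conf)  # this Python process is now the driver

Under the hood, PySpark uses Py4J to forward these calls from the Python driver to the JVM-side Spark objects.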
PySpark Basics
• RDD: Resilient Distributed Dataset
• Interface to parallelized data in the cluster
• Map, Filter, Reduce functions sent from the driver, executed by workers on chunks of data in parallel
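
A small sketch (not in the original slides) of that division of labour: map and filter are lazy transformations that only describe work, and nothing is shipped to the workers until an action such as reduce or collect runs.

# build an RDD spread across 4 partitions
numbers = sc.parallelize(range(1, 101), 4)

# transformations are lazy -- this only records the lineage of the RDD
evens_squared = numbers.filter(lambda n: n % 2 == 0) \
    .map(lambda n: n * n)

# the action triggers the job: the functions above are serialized,
# sent to the workers, run on each partition, and the results combined
total = evens_squared.reduce(lambda a, b: a + b)
print(total)  # 171700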
PySpark: Hello World
• Classic Word Count problem
• How many times does each word appear in a given text?
• Approach: each worker computes word counts independently, then the results are aggregated together
PySpark: Hello World

Map (each worker counts its own chunk):
  "The brown dog"  →  (The, 1) (brown, 1) (dog, 1)
  "brown dog"      →  (brown, 1) (dog, 1)

Shuffle & Reduce (like keys combined):
  (The, 1)  (dog, 2)  (brown, 2)

Collect (results returned to the driver):
  (The, 1) (dog, 2) (brown, 2)
Demo

# example 1
text = "the brown dog jumped over the other brown dog"
text_rdd = sc.parallelize(text.split(' '))
text_rdd.map(lambda word: (word, 1)) \
    .reduceByKey(lambda left, right: left + right) \
    .collect()

# example 2
import string

time_machine = sc.textFile('/user/jasonwhite/time_machine')
time_machine_tuples = time_machine.flatMap(lambda line: line.lower().split(' ')) \
    .map(lambda word: ''.join(ch for ch in word if ch in string.ascii_letters)) \
    .filter(lambda word: word != '') \
    .map(lambda word: (word, 1))
word_counts = time_machine_tuples.reduceByKey(lambda left, right: left + right)
Monoids
• Monoids are combinations of:
  • a set of data; and
  • associative, commutative functions
• Very efficient in M/R, strongly preferred
• Examples:
  • addition of integers
  • min/max of records by timestamp
Demo

# example 3
dataset = sc.parallelize([
    {'id': 1, 'value': 1},
    {'id': 2, 'value': 2},
    {'id': 2, 'value': 6},
])

def add_tuples(left, right):
    # (sum, count) pairs combine associatively and commutatively -- a monoid
    left_sum, left_count = left
    right_sum, right_count = right
    return (left_sum + right_sum, left_count + right_count)

# overall (sum, count) across the whole dataset
averages = dataset.map(lambda d: (d['value'], 1)) \
    .reduce(add_tuples)

# per-key averages: reduce to (sum, count) per id, then divide
averages_by_key = dataset.map(lambda d: (d['id'], (d['value'], 1))) \
    .reduceByKey(add_tuples) \
    .mapValues(lambda sum_count: sum_count[0] * 1.0 / sum_count[1])
Demo

# example 4
from datetime import date

dataset = sc.parallelize([
    {'id': 1, 'group_id': 10, 'timestamp': date(1978, 3, 2)},
    {'id': 2, 'group_id': 10, 'timestamp': date(1984, 3, 24)},
    {'id': 3, 'group_id': 10, 'timestamp': date(1986, 5, 19)},
    {'id': 4, 'group_id': 11, 'timestamp': date(1956, 6, 5)},
    {'id': 5, 'group_id': 11, 'timestamp': date(1953, 2, 21)},
])

def calculate_age(d):
    # age in days (timedelta.days is an attribute)
    d['age'] = (date.today() - d['timestamp']).days
    return d

def calculate_group_stats(left, right):
    # min, max and sum are all monoids, so they combine safely per group
    return {
        'earliest': min(left['earliest'], right['earliest']),
        'latest': max(left['latest'], right['latest']),
        'total_age': left['total_age'] + right['total_age'],
        'count': left['count'] + right['count'],
    }

group_stats = dataset.map(calculate_age) \
    .map(lambda d: (d['group_id'], {'earliest': d['timestamp'],
                                    'latest': d['timestamp'],
                                    'total_age': d['age'],
                                    'count': 1})) \
    .reduceByKey(calculate_group_stats)
Joining RDDs
• Like many RDD operations, joins work on (k, v) pairs
• Each side is shuffled using common keys
• Each node builds its part of the joined dataset
Joining RDDs

Left:  {'id': 1, 'field1': 'foo'}   {'id': 2, 'field1': 'bar'}
Right: {'id': 1, 'field2': 'baz'}   {'id': 2, 'field2': 'baz'}

Keyed by id:
  (1, {'id': 1, 'field1': 'foo'})   (2, {'id': 2, 'field1': 'bar'})
  (1, {'id': 1, 'field2': 'baz'})   (2, {'id': 2, 'field2': 'baz'})

Joined:
  (1, ({'id': 1, 'field1': 'foo'}, {'id': 1, 'field2': 'baz'}))
  (2, ({'id': 2, 'field1': 'bar'}, {'id': 2, 'field2': 'baz'}))
Demo

# example 5
first_dataset = sc.parallelize([
    {'id': 1, 'field1': 'foo'},
    {'id': 2, 'field1': 'bar'},
    {'id': 2, 'field1': 'baz'},
    {'id': 3, 'field1': 'foo'},
])
# key each record by id so the join can match rows across the two RDDs
first_dataset = first_dataset.map(lambda d: (d['id'], d))

second_dataset = sc.parallelize([
    {'id': 1, 'field2': 'abc'},
    {'id': 2, 'field2': 'def'},
])
second_dataset = second_dataset.map(lambda d: (d['id'], d))

output = first_dataset.join(second_dataset)
Key Skew
• Achilles' heel of M/R: key skew
• The shuffle phase sends rows with the same key to the same node
• If billions of rows are shuffled to the same node, that node can run out of memory
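
The slides don't include code for this, but one quick way to spot skew before an expensive shuffle is to count rows per key and inspect the heaviest keys. A sketch, assuming keyed_rdd is a hypothetical RDD of (key, value) pairs:

# count rows per key, then pull the ten largest counts back to the driver
key_counts = keyed_rdd.map(lambda kv: (kv[0], 1)) \
    .reduceByKey(lambda a, b: a + b)
print(key_counts.top(10, key=lambda kv: kv[1]))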
Joining RDDs w/ Skew
• When joining to a small RDD, an alternative is to "broadcast" the small RDD
• Instead of shuffling, the entire RDD is sent to each worker
• Now each worker has all the data it needs
• Each join is now just a map. No shuffle needed!
Demo

# example 6
first_dataset = sc.parallelize([
    {'id': 1, 'field1': 'foo'},
    {'id': 2, 'field1': 'bar'},
    {'id': 2, 'field1': 'baz'},
    {'id': 3, 'field1': 'foo'},
])
first_dataset = first_dataset.map(lambda d: (d['id'], d))

second_dataset = sc.parallelize([
    {'id': 1, 'field2': 'abc'},
    {'id': 2, 'field2': 'def'},
])
second_dataset = second_dataset.map(lambda d: (d['id'], d))

# collect the small RDD as a dict and ship a copy to every worker
second_dict = sc.broadcast(second_dataset.collectAsMap())

def join_records(key_record):
    key, record = key_record
    if key in second_dict.value:
        yield (key, (record, second_dict.value[key]))

output = first_dataset.flatMap(join_records)
Ordering
• Row order isn't guaranteed unless you explicitly sort the RDD
• But: sometimes you need to process events in order!
• Solution: repartitionAndSortWithinPartitions
Ordering

Input partitions:
  {'id': 1, 'value': 10}  {'id': 2, 'value': 10}  {'id': 3, 'value': 20}
  {'id': 1, 'value': 12}  {'id': 1, 'value': 5}   {'id': 2, 'value': 15}

Shuffle & Sort:
  {'id': 1, 'value': 5}  {'id': 1, 'value': 10}  {'id': 1, 'value': 12}  {'id': 3, 'value': 20}
  {'id': 2, 'value': 10}  {'id': 2, 'value': 15}

MapPartitions:
  {'id': 1, 'interval': 5}  {'id': 1, 'interval': 2}
  {'id': 2, 'interval': 5}
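
There is no demo slide for this one, so here is a hedged sketch of the diagram above: key each row by (id, value), partition on id alone, let Spark sort each partition, then walk the sorted partitions with mapPartitions to compute per-id intervals. The partition count and field names simply mirror the toy data in the diagram.

events = sc.parallelize([
    {'id': 1, 'value': 10}, {'id': 2, 'value': 10}, {'id': 3, 'value': 20},
    {'id': 1, 'value': 12}, {'id': 1, 'value': 5},  {'id': 2, 'value': 15},
])

# composite key: partition on id, sort within each partition by (id, value)
keyed = events.map(lambda d: ((d['id'], d['value']), d))
sorted_rdd = keyed.repartitionAndSortWithinPartitions(
    numPartitions=2,
    partitionFunc=lambda key: key[0],  # all rows for an id land in one partition
)

def intervals(partition):
    # partition is an iterator of ((id, value), row) pairs in sorted order
    previous = {}
    for (row_id, value), _ in partition:
        if row_id in previous:
            yield {'id': row_id, 'interval': value - previous[row_id]}
        previous[row_id] = value

output = sorted_rdd.mapPartitions(intervals)
# output.collect() -> intervals 5 and 2 for id 1, and 5 for id 2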
Thanks!