spark cluster
runs a driver program and a set of executor programs
driver
maintains the spark application
responds to the user's program
schedules work on executors
executor
process data assigned by driver
read and write data to external sources
store computation results
interact with storage
RDD
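resilient: fault tolerant; lost partitions can be recomputed from the rdd's lineage
distributed: different rdd parts (partitions) are stored on different nodes, so operations run in parallel
immutable: an rdd cannot be changed once created; transformations make a new rdd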
DAG
directed acyclic graph of the operations applied to rdds
lets independent steps run in parallel
splits the computation into stages
DAG lazy evaluation
optimises the execution plan by recording and combining transformations
transformations are only evaluated when an action is called
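A minimal sketch of the lazy evaluation idea in plain Python. `LazyDataset` is a hypothetical toy class, not the Spark API; it only assumes that transformations are recorded and that an action (here `collect`) runs them. In PySpark the comparable calls would be `rdd.map(...)` followed by `rdd.collect()`.

```python
class LazyDataset:
    """Toy stand-in for an rdd: records transformations, runs them only on an action."""

    def __init__(self, data, plan=()):
        self._data = list(data)
        self._plan = tuple(plan)  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a new dataset with a longer plan; no work is done here.
        return LazyDataset(self._data, self._plan + (fn,))

    def collect(self):
        # Action: apply every recorded function in a single pass over the data.
        out = []
        for item in self._data:
            for fn in self._plan:
                item = fn(item)
            out.append(item)
        return out


calls = []  # records every time the first transformation actually runs


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset([1, 2, 3]).map(double).map(lambda x: x + 1)
calls_before_action = len(calls)  # still 0: nothing has been evaluated
result = ds.collect()             # the action triggers evaluation
```

Both `map` steps are combined into one pass over the data when `collect` runs, which is the "records and combines" part of the card in miniature.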
job scheduler
coordinates execution across nodes
rdd operations
transformation:
applied to rdd to make new rdd
e.g. map makes a new rdd by applying function to elements
actions:
creates result from rdd and returns or stores a value
map
applies function to elements in rdd and makes a new rdd with results
flatMap
applies function to each rdd element and flattens the results into 1 collection
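The difference between the map and flatMap cards can be sketched on ordinary Python lists. The helper names `map_elements` and `flat_map_elements` are hypothetical and only mimic the behaviour; in PySpark the real methods are `rdd.map` and `rdd.flatMap`.

```python
from itertools import chain

lines = ["hello world", "spark rdd"]


def map_elements(fn, elements):
    # One output per input element; nested results stay nested.
    return [fn(e) for e in elements]


def flat_map_elements(fn, elements):
    # Each input may give several outputs; they are flattened into one collection.
    return list(chain.from_iterable(fn(e) for e in elements))


mapped = map_elements(str.split, lines)
flattened = flat_map_elements(str.split, lines)
```

Here `mapped` keeps one list of words per line, while `flattened` is a single list of all the words.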
narrow transformation
each partition of the parent rdd is used by at most 1 partition of the child rdd
wide transformation
multiple child rdd partitions may rely on 1 parent rdd partition, so data is shuffled between nodes
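A rough sketch of narrow versus wide dependencies, modelling partitions as plain Python lists. The data, the `shuffle_by_key` helper, and its byte-sum partitioner are all assumptions made for illustration; Spark's real partitioning and shuffle are more involved.

```python
from collections import defaultdict

# Two parent partitions holding (key, value) pairs.
parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow: each child partition is built from exactly one parent partition,
# so no data has to move between partitions.
narrow_child = [[(k, v * 10) for k, v in part] for part in parent]


def shuffle_by_key(partitions, num_partitions):
    # Wide: values are regrouped by key, so one parent partition can feed
    # several child partitions (a shuffle).
    child = [defaultdict(list) for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            target = sum(key.encode()) % num_partitions  # deterministic toy partitioner
            child[target][key].append(value)
    return [dict(p) for p in child]


wide_child = shuffle_by_key(parent, 2)
```

In this toy run the first parent partition sends `"b"` to child partition 0 and `"a"` to child partition 1, and the values for `"a"` arrive from both parent partitions.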
dataframe
distributed data stored in columns
rows are facts: ‘john doe’
columns are properties: name
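The row and column idea can be pictured with plain Python structures. This is not a Spark dataframe; the second row and the `age` column are made-up values added only to show the shape.

```python
# Each row is one fact (one record).
rows = [
    {"name": "John Doe", "age": 30},   # age is a hypothetical value
    {"name": "Jane Roe", "age": 25},   # hypothetical second row
]

# Each column is one property, collected across all rows.
columns = {key: [row[key] for row in rows] for key in rows[0]}
```

Reading `columns["name"]` gives the name property for every row at once.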