RDDs can hold ____ elements
primitive types: integers, character, booleans
sequence types: strings, lists, arrays, tuples, dicts (including nested)
scala/java objects (if serializable)
mixed types
Pair RDDs
RDDs with key-value pairs
double RDDs
RDDs with numeric data
we can add, multiply, find mean, standard deviation, etc
RDDs are resilient because
if the data is lost, it can be reconstructed
if a partition fails, we can reconstruct the data sitting on that partition; if the whole node fails, we can reconstruct it on another node
lazy execution
memory is also freed at the record level: data from the parent is freed once the child gets the data
transformations create
a new RDD
you can create RDDs from collections
sc.parallelize(collection)
**make sure to call pyspark first
collection
array of mixed types
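A minimal sketch (the list here is made-up sample data; assumes the sc context from the pyspark shell):
nums = [1, 2, 3, "four", True]         # small, invented collection (mixed types are fine)
numsRDD = sc.parallelize(nums)
numsRDD.count()                        # 5
numsRDD.collect()                      # [1, 2, 3, 'four', True]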
creating RDDs from collections is useful when
testing
generating data programmatically
integrating
creating RDDs from files
sc.textFile("filename")
accepts a single file, a wildcard list of files, or comma-separated list of files
each line in the file is a separate record in the RDD
files are referenced by ____ URI
absolute URI
relative URI
what happens when we do
MyweblogsRDD1= sc.textFile("file:/home/training/training_materials/data/weblogs/")
nothing happens; it returns quickly because no action has been performed yet
what happens when we do
MyweblogsRDD1.count()
it will take a long time and count all the number of lines (records) in all the files in that directory
when creating RDDs from files, textFile maps
each line into a separate RDD element
textFile only works with _________ text files
line-delimited
sc.wholeTextFiles(directory)
maps entire contents of each file in a directory to a single RDD element
works only for small files (element must fit in memory)
flatMap
maps each element in the base RDD to multiple output elements
distinct
filter out duplicates
sortBy
use provided function to sort
intersection
create a new RDD with the elements that appear in both original RDDs (must be common to both)
union
add all elements of 2 RDDs into a single new RDD
zip
pair each element of the first RDD with the corresponding element of the second
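A rough pyspark sketch of these transformations (rddA/rddB and their data are invented for illustration; assumes sc from the shell; output order may vary):
rddA = sc.parallelize([1, 2, 2, 3])
rddB = sc.parallelize([3, 4, 5, 6])
rddA.distinct().collect()                             # [1, 2, 3]
rddA.sortBy(lambda x: x, ascending=False).collect()   # [3, 2, 2, 1]
rddA.intersection(rddB).collect()                     # [3]
rddA.union(rddB).collect()                            # [1, 2, 2, 3, 3, 4, 5, 6]
rddA.zip(rddB).collect()                              # [(1, 3), (2, 4), (2, 5), (3, 6)] -- same element count (and partitioning) required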
what does rdd1 = sc.textFile("purplecow.txt").flatMap(lambda line: line.split()) do
Takes the purple cow file and creates the RDD all in one line with a lambda, instead of building each RDD manually; split() with no argument splits on whitespace, so it iterates record by record and splits each line into words (each element is a word, not a line)
flatmap vs map
map- number of output records = number of input records
flatMap- number of output records can differ from the number of input records
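A quick illustrative sketch (the two lines are made up; assumes sc from the shell):
lines = sc.parallelize(["I never saw a purple cow", "I never hope to see one"])
lines.map(lambda line: line.split()).count()      # 2 -- one output record (a list of words) per input line
lines.flatMap(lambda line: line.split()).count()  # 12 -- one output record per word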
.distinct()
creates a new RDD
.collect()
returns the values as a list
what happens if you do .collect().distinct()
error, because Python lists don't have a distinct attribute
what does rdd1.distinct().count() do
counts the number of distinct elements
what does rdd1 = sc.textFile("purplecow.txt").map(lambda line: line.split()) do
Gives 4 because map keeps the number of output records equal to the number of inputs (the file has 4 lines)
It still splits, but each record is now a list of the words in that line (one element per line) rather than one element per word
rdd1.subtract(rdd2)
returns the elements that are in rdd1 but not in rdd2
rdd1.zip(rdd2)
error if rdd1 and rdd2 do not have the same number of elements
otherwise returns
one (rdd1 element, rdd2 element) pair for each position
rdd1.union(rdd2)
adds all elements of rdd1 and rdd2 into a single new RDD
first
returns the first element of the RDD (the same as take(1); not sorted by natural ordering)
foreach
apply a function to each element in an RDD
top(n)
returns the largest n elements using natural ordering
ex: A, C, D in RDD2; rdd2.top(1) will return D
sample
create a new RDD with sampling of elements
takeSample
return an array of sampled elements
double RDD operations
statistical functions like mean, sum, variance, stdev
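A small sketch with invented numeric data (assumes sc from the shell):
doubles = sc.parallelize([1.0, 2.0, 3.0, 4.0])
doubles.sum()        # 10.0
doubles.mean()       # 2.5
doubles.variance()   # 1.25
doubles.stdev()      # ~1.118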
takeSample is
an action, not a transformation, which is why takeSample(...).count() gives an error: it returns a Python list, not an RDD
in takeSample and sample, True = ____ and False = _______
with replacement
without replacement
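A minimal sketch (sample data invented; assumes sc from the shell):
rdd = sc.parallelize(range(100))
sampled = rdd.sample(True, 0.1, seed=42)    # transformation: True = with replacement, ~10% fraction
sampled.count()                             # roughly 10 (the fraction is approximate)
rdd.takeSample(False, 5, seed=42)           # action: False = without replacement; returns a plain Python list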
RDDs can be created from:
files
parallelized data in memory
other RDDs
parallelizing data in memory
When you parallelize data, you're telling Spark to:
Split a collection (like a list or array) into smaller chunks and spread them across multiple workers/nodes.
Each chunk becomes a partition, and each partition is processed in parallel by different cores or machines.
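A small sketch showing the partitions (the data and partition count are invented; assumes sc from the shell):
data = list(range(8))
rdd = sc.parallelize(data, 4)         # request 4 partitions explicitly
rdd.getNumPartitions()                # 4
rdd.glom().collect()                  # [[0, 1], [2, 3], [4, 5], [6, 7]] -- one list per partition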
pair RDDs
each element must be a key-value pair (2-element tuple)
keys and values can be any type
why do we use pair RDDs
use with map-reduce algorithms
many additional functions are available like sorting, joining, grouping, counting
creating pair RDD from a tab-separated file
what happens if we do
file = "/home/training/mysparksolutions/Lecture12/127.txt"
rdd1 = sc.textFile(file)
error (once an action runs): Spark looks for the path in HDFS, but the file is in the local file system, so we do
file = "file:/home/training/mysparksolutions/Lecture12/127.txt"
rdd1 = sc.textFile(file)
file = "file:/home/training/mysparksolutions/Lecture12/127.txt"
rdd1 = sc.textFile(file)
rdd1.count()
3
rdd2 = rdd1.map(lambda line: line.split('\t'))
rdd2.count()
Gives 3 (map keeps one output record per input line)
rdd2.collect()
Does not give us key-value pairs yet: each element is a list of fields, not a 2-element tuple
rdd3 = rdd2.map(lambda fields: (fields[0], fields[1]))
The parameter doesn't have to be called fields; lambda x: (x[0], x[1]) works the same
rdd3.collect()
Now each element is a 2-element (key, value) tuple, so rdd3 is a pair RDD
keyBy
transforms an RDD into a key-value pair RDD using a keying function; similar to map, but it creates KV pairs by computing the key from each element in the RDD
whatever function you pass to it produces the key
the whole original element becomes the value
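A minimal keyBy sketch (the names RDD is invented; assumes sc from the shell):
names = sc.parallelize(["alice smith", "bob jones"])
byInitial = names.keyBy(lambda name: name[0])   # key = first letter, value = the whole original element
byInitial.collect()                             # [('a', 'alice smith'), ('b', 'bob jones')]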
mapping single rows to multiple pairs
mapreduce
a programming model for processing large datasets with a distributed algorithm on a cluster. It includes a 'Map' step that transforms input data into key-value pairs, followed by a 'Reduce' step that combines these pairs based on their keys.
reduceByKey
the function may be called in any order, so it must be commutative and associative (e.g., addition and multiplication qualify; subtraction and division do not)
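A word-count sketch using reduceByKey (reuses the purplecow.txt name from the earlier cards; assumes sc from the shell):
lines = sc.textFile("purplecow.txt")
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)   # a + b is commutative and associative
counts.take(3)                                   # e.g. a few (word, count) pairs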
why do we care about counting words
word count is challenging over massive amounts of data; using a single compute node would be too time-consuming, and the number of unique words could exceed the available memory
statistics are often simple aggregate functions that are distributive in nature
mapreduce breaks complex tasks down into smaller elements which can be executed in parallel
many common tasks are very similar to word count
operations specific to pair RDDS
countByKey
groupByKey
sortByKey
join
countByKey
returns a map with the count of occurrences of each key
groupByKey
groups all the values for each key in an RDD
sortByKey
sort in ascending or descending order
join
returns an RDD containing all pairs with matching keys from the 2 RDDs, combining the records from both RDDs based on the key
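A rough sketch of these pair RDD operations (sales/prices and their data are made up; assumes sc from the shell; output order may vary):
sales = sc.parallelize([("apples", 2), ("pears", 1), ("apples", 5)])
prices = sc.parallelize([("apples", 0.5), ("pears", 0.8)])
sales.countByKey()                               # {'apples': 2, 'pears': 1} -- counts occurrences of each key
sales.groupByKey().mapValues(list).collect()     # [('apples', [2, 5]), ('pears', [1])]
sales.sortByKey().collect()                      # [('apples', 2), ('apples', 5), ('pears', 1)]
sales.join(prices).collect()                     # [('apples', (2, 0.5)), ('apples', (5, 0.5)), ('pears', (1, 0.8))]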
keys
return an RDD of just the keys without the values
values
return an RDD of just the values without the keys
lookup(key)
returns values for a key
leftOuterJoin, rightOuterjoin, fullOuterJoin
joins that also include keys that appear only in the left RDD, only in the right RDD, or in either (full)
mapValues, flatMapValues
execute a function on just the values, keeping the key the same
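A small sketch of these operations (the pairs RDD is invented; assumes sc from the shell):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.keys().collect()                               # ['a', 'b', 'a']
pairs.values().collect()                             # [1, 2, 3]
pairs.lookup("a")                                    # [1, 3]
pairs.mapValues(lambda v: v * 10).collect()          # [('a', 10), ('b', 20), ('a', 30)]
pairs.flatMapValues(lambda v: range(v)).collect()    # [('a', 0), ('b', 0), ('b', 1), ('a', 0), ('a', 1), ('a', 2)]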
mapreduce is a generic programming model for distributed processing
apache spark is
a fast and general engine for large-scale data processing
spark shell
interactive for learning or data exploration
python or scala
spark applications
for large scale data processing
written in Python, Scala, or Java; they run on the Spark framework and use its libraries for tasks such as data analysis and machine learning
what does the spark shell do
provides interactive data exploration (REPL)
every spark application requires
a spark context
the main entry point to the spark API
the object that allows you to start working in Spark
RDD (resilient distributed dataset)
resilient- if data in memory is lost, it can be recreated from the storage device
distributed- processed across the cluster; an RDD puts the file's data in memory rather than on disk, and once in memory it is split into partitions instead of blocks; a collection of data elements that can be processed in parallel
dataset- initial data can come from a file or be created programmatically
____________ are the fundamental unit of data in Spark
RDDs
most spark programming consists of
performing operations on RDDs
3 ways to create an RDD
file or set of files
data in memory
from another RDD
actions
returns values
transformations
define a new RDD based on the current ones
how to search for file in local file system
[training@localhost ~]$ sudo find / -name purplecow.txt
how to get into python spark
pyspark
in RDD we look for file in
the HDFS home directory, /user/training, in the Hadoop Distributed File System (HDFS)
actions
count()
take()
collect()
saveAsTextFile(file)
count()
return the number of elements
take(n)
return an array of the first n elements
collect()
return an array of all elements from a distributed dataset to the driver program.
saveAsTextFile(file)
save to text file(s)
rdd1.saveAsTextFile("rdd1_saved")
creates a directory containing a success-flag file plus part file(s) holding the actual data from the RDD (one part file per partition)
rdd1.count()
will show the count of the lines
RDDs are mutable or immutable?
immutable
data in an RDD is never changed
transform in sequence to modify the data as needed
common transformations
map(function)
filter(function)
map(function)
creates a new RDD by performing a function on each record in the base RDD
filter(function)
creates a new RDD including or excluding each record in the base RDD according to a boolean function
anonymous functions
functions with no name that must start with lambda (indicates that it is an in-line function)
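A minimal sketch comparing a lambda with an equivalent named function (reuses purplecow.txt from the earlier cards; has_purple is a name invented here):
lines = sc.textFile("purplecow.txt")
lines.filter(lambda line: "purple" in line).count()   # anonymous (lambda) form

def has_purple(line):                                 # equivalent named-function form
    return "purple" in line

lines.filter(has_purple).count()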
lazy execution
data in RDDs is not processed until an action is performed
the RDD is not actually loaded with data until an action is performed
if we have an RDD and apply a transformation, we get a new RDD
in Spark everything is a transformation until an action produces the result we want
what happens when we do
rdd= sc.textFile("fjhskjhfssdjhdfjshsjfsk")
Once we press enter it will not give an error
rdd.count()
Now there will be an error because it is actually looking for the file and loading the data
rdd5= rdd.map(lambda line:line.upper())
This is a transformation
No error
rdd5.count()
Now you will get an error because we tried to load data from a file that does not exist
This is intentional as Spark would not work otherwise
when we create a variable
it will be empty because nothing has been loaded
when we take a variable and apply a transformation like map or filter
it's still empty
data in RDDs is not processed until
an action is performed
If each file loaded 10 GB at a time, a lot of memory would be used by a single application, which could eat all of your memory
Loading all at once is not a good idea, and loading piece by piece is even worse
When you call an action, every RDD is still empty; the bottom-most RDD asks its parent for data, and the request goes up the whole tree until the top RDD loads the file into memory (now 10 GB are occupied); then the map produces the next RDD down, and the memory held by the parent is freed
In practice it goes one record at a time: apply what is needed, free the memory of the previous step, then move to the next record
Once the action is completed the memory is freed, and the same process repeats if the action is called again
spark depends heavily on the concepts of
functional programming
functions are the fundamental unit of programming
functions have input and output only
no state or side effects
in functional programming in spark, we pass ___ as input to other functions
functions
map is a function (a transformation); transformations are functions
in rdd.map() we pass a function into map
we don't have to create a spark context in the shell because
the shell creates it for us; if we want to write a spark application, we have to create the spark context ourselves
execution mode determines
whether Spark runs locally (in a single process) or distributed on a cluster
2 forms of execution mode
local
cluster