Big Data Final Exam


1
New cards

RDDs can hold ____ elements

primitive types: integers, characters, booleans

sequence types: strings, lists, arrays, tuples, dicts (including nested)

scala/java objects (if serializable)

mixed types

2
New cards

Pair RDDs

RDDs with key-value pairs

3
New cards

double RDDs

RDDs with numeric data

we can add, multiply, find mean, standard deviation, etc

4
New cards

RDDs are resilient because

if the data in memory is lost, it can be reconstructed

5
New cards

if a partition fails we can reconstruct the data

sitting on that partition; if the whole node fails, we can reconstruct the data on another node

6
New cards

lazy execution

we also free data at the record level: data from the parent is freed once the child RDD has received it

7
New cards

transformations create

a new RDD

8
New cards

you can create RDDs from collections

sc.parallelize(collection)

**make sure to call pyspark first

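A minimal sketch, assuming the pyspark shell (where sc already exists) and a made-up collection:

data = ["a", "b", "c", 1, 2, 3]      # a collection of mixed types
rdd = sc.parallelize(data)           # distribute the collection as an RDD
rdd.count()                          # action: returns 6
rdd.collect()                        # action: returns all elements as a list
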
9
New cards

collection

array of mixed types

10
New cards

creating RDDs from collections is useful when

testing

generating data programmatically

integrating

11
New cards

creating RDDs from files

sc.textFile("filename")

accepts a single file, a wildcard list of files, or comma-separated list of files

each line in the file is a separate record in the RDD

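A small sketch, assuming the pyspark shell; the file and directory names are illustrative:

logs = sc.textFile("mydata/*.log")                        # wildcard: one RDD element per line across matching files
poem = sc.textFile("file:/home/training/purplecow.txt")   # a single local file via an absolute URI
poem.take(2)                                              # action: the first two lines, as a list
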
12
New cards

files are referenced by ____ URI

absolute URI

relative URI

13
New cards

what happens when we do

MyweblogsRDD1= sc.textFile("file:/home/training/training_materials/data/weblogs/") 

nothing; it returns quickly because no action has been performed yet (lazy execution)

14
New cards

what happens when we do

MyweblogsRDD1.count()

it will take a long time, because it counts all the lines (records) in all the files in that directory

15
New cards

when creating RDDs from files, textFile maps

each line into a separate RDD element

16
New cards

textFile only works with _________ text files

line-delimited

17
New cards

sc.wholeTextFiles(directory)

maps entire contents of each file in a directory to a single RDD element

works only for small files (element must fit in memory)

18
New cards

flatMap

maps one element in the base RDD to multiple elements

19
New cards

distinct

filter out duplicates

20
New cards

sortBy

use provided function to sort

21
New cards

intersection

create a new RDD with only the elements present in both original RDDs (must be common to both)

22
New cards

union

add all elements of 2 RDDs into a single new RDD

23
New cards

zip

pair each element of the first RDD with the corresponding element of the second

24
New cards

what does Rdd1= sc.textFile("purplecow.txt").flatMap(lambda line:line.split()) do

Takes the purplecow.txt file and creates an RDD, all in one line with a lambda instead of creating an RDD per step manually; split() with no argument splits on whitespace, so it iterates record by record and splits each line on whitespace (each element is a word, not a line)

25
New cards

flatmap vs map

map: number of input records = number of output records

flatMap: the number of input and output records need not be equal
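
A sketch of the difference, assuming purplecow.txt has 4 lines:

lines = sc.textFile("purplecow.txt")
lines.map(lambda line: line.split()).count()      # 4: one output record (a list of words) per input line
lines.flatMap(lambda line: line.split()).count()  # the total number of words: one output record per word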

26
New cards

.distinct()

creates a new RDD

27
New cards

.collect()

returns the values as a list (not an RDD)

28
New cards

what happens if you do .collect().distinct()

an error, because lists don't have a distinct attribute

29
New cards

what does rdd1.distinct().count() do

counts the number of distinct elements

30
New cards

what does Rdd1= sc.textFile("purplecow.txt").map(lambda line:line.split()) do

counting it will give 4, because the file has 4 lines and with map the number of output records equals the number of input records

the splitting still happens, but each record is now a list of multiple elements (the words of one line) rather than one word per record

31
New cards

rdd1.subtract(rdd2)

returns the elements that are in rdd1 but not in rdd2

32
New cards

rdd1.zip(rdd2)

gives an error if rdd1 and rdd2 do not have the same number of elements

otherwise it will return

an (rdd1 element, rdd2 element) pair for each position
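
A small sketch pulling these set-style transformations together, assuming two tiny parallelized RDDs (illustrative data; result order may vary):

rdd1 = sc.parallelize(["a", "b", "c"])
rdd2 = sc.parallelize(["c", "d", "e"])
rdd1.union(rdd2).collect()         # ['a', 'b', 'c', 'c', 'd', 'e']  (duplicates kept)
rdd1.intersection(rdd2).collect()  # ['c']
rdd1.subtract(rdd2).collect()      # ['a', 'b']
rdd1.zip(rdd2).collect()           # [('a', 'c'), ('b', 'd'), ('c', 'e')]  (same number of elements required)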

33
New cards

rdd1.union(rdd2)

returns a new RDD containing all elements of both rdd1 and rdd2 (duplicates are kept)
34
New cards

first

returns the first element of the RDD in natural ordering 123, abc

35
New cards

foreach

apply a function to each element in an RDD

36
New cards

top(n)

returns the largest n elements using natural ordering

ex: A, C, D in RDD2; rdd2.top(1) will return D

37
New cards

sample

create a new RDD with sampling of elements

38
New cards

takeSample

return an array of sampled elements

39
New cards

double RDD operations

statistical functions like mean, sum, variance, stdev

40
New cards

takeSample is

an action, not a transformation, which is why takeSample().count() will give an error (the result is a list, not an RDD)

41
New cards

in takeSample and sample, True = ____ and False = _______

replacement

no replacement
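
A sketch of both, where the first argument is the with-replacement flag (True/False); numbers are illustrative:

rdd = sc.parallelize(range(100))
sampled = rdd.sample(False, 0.1)   # transformation: new RDD with ~10% of elements, without replacement
sampled.count()                    # an action is still needed to materialize the sample
rdd.takeSample(True, 5)            # action: a list of 5 elements sampled with replacement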

42
New cards

RDDs can be created from:

files

parallelized data in memory

other RDDs

43
New cards

parallelizing data in memory

When you parallelize data, you're telling Spark to:

Split a collection (like a list or array) into smaller chunks and spread them across multiple workers/nodes.

Each chunk becomes a partition, and each partition is processed in parallel by different cores or machines.

44
New cards

pair RDDs

each element must be a key-value pair (2-element tuple)

keys and values can be any type

45
New cards

why do we use pair RDDs

use with map-reduce algorithms

many additional functions are available like sorting, joining, grouping, counting

46
New cards

creating pair RDD from a tab-separated file

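A sketch of the usual pattern, assuming a hypothetical tab-separated file whose first field is the key:

users = sc.textFile("userdata.tsv") \
          .map(lambda line: line.split('\t')) \
          .map(lambda fields: (fields[0], fields[1]))   # one (key, value) pair per line
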
47
New cards

what happens if we do

file = "/home/training/mysparksolutions/Lecture12/127.txt"

rdd1 = sc.textFile(file)

an error (once an action runs): it will look in HDFS, but the file is in the local file system, so we do

file = "file:/home/training/mysparksolutions/Lecture12/127.txt"

rdd1 = sc.textFile(file)

rdd1.count()

gives 3

rdd2 = rdd1.map(lambda line: line.split('\t'))

rdd2.count()

gives 3

rdd2.collect()

does not give us key-value pairs because there are no tuples (parentheses) yet

rdd3 = rdd2.map(lambda fields: (fields[0], fields[1]))

the argument doesn't have to be called fields; it could have been lambda x: (x[0], x[1])

rdd3.collect()

48
New cards

keyBy

transforms an RDD into a key-value pair RDD based on a keying function; similar to map, but it creates KV pairs by extracting the key from each element of the RDD

whatever the function you pass returns becomes the key

the whole original element becomes the value

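A sketch of keyBy, assuming an RDD of log lines where the first whitespace-separated field is the user ID (illustrative):

logs = sc.textFile("weblogs/*")
byUser = logs.keyBy(lambda line: line.split()[0])
# key = whatever the function returns (here, the first field); value = the whole original line
byUser.take(1)   # e.g. [('userid', 'userid ...rest of the line...')]
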
49
New cards

mapping single rows to multiple pairs

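A sketch of one common pattern for this, assuming a pair RDD whose value is a delimited list (hypothetical data); flatMapValues turns one row into several pairs:

orders = sc.parallelize([("00001", "sku010:sku933:sku022"),
                         ("00002", "sku912:sku331")])
pairs = orders.flatMapValues(lambda skus: skus.split(':'))
pairs.collect()
# [('00001', 'sku010'), ('00001', 'sku933'), ('00001', 'sku022'), ('00002', 'sku912'), ('00002', 'sku331')]
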
50
New cards

mapreduce

a programming model for processing large datasets: a map phase transforms input records into key-value pairs, and a reduce phase aggregates the values for each key
51
New cards

mapreduce in spark; map

a programming model for processing large datasets with a distributed algorithm on a cluster. It includes a 'Map' step that transforms input data into key-value pairs, followed by a 'Reduce' step that combines these pairs based on their keys.

52
New cards

reduceByKey

the function might be called in any order, therefore it must be commutative and associative (addition and multiplication qualify; subtraction and division do not)

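A word-count sketch using reduceByKey; addition is commutative and associative, so it is safe here (output shown is illustrative):

counts = sc.textFile("purplecow.txt") \
           .flatMap(lambda line: line.split()) \
           .map(lambda word: (word, 1)) \
           .reduceByKey(lambda a, b: a + b)   # sum the 1s for each word, in any order
counts.take(3)                                # e.g. [('I', 2), ('purple', 2), ('cow', 3)]
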
53
New cards

why do we care about counting words

word count is challenging over massive amounts of data; using a single compute node would be too time-consuming and the number of unique words could exceed the available memory

statistics are often simple aggregate functions that are distributive in nature

mapreduce breaks complex tasks down into smaller elements which can be executed in parallel

many common tasks are very similar to word count

54
New cards

operations specific to pair RDDS

countByKey

groupByKey

sortByKey

join

55
New cards

countByKey

returns a map with the count of occurrences of each key

56
New cards

groupByKey

groups all the values for each key in an RDD

57
New cards

sortByKey

sort in ascending or descending order

58
New cards

join

returns an RDD containing all pairs with matching keys from the 2 RDDs, combining records from both RDDs based on the key

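A small sketch of these pair-RDD operations on illustrative data (output order may vary):

users  = sc.parallelize([("u1", "Alice"), ("u2", "Bob")])
visits = sc.parallelize([("u1", "home"), ("u1", "cart"), ("u3", "home")])
visits.countByKey()                             # {'u1': 2, 'u3': 1} (returned to the driver as a dict)
visits.groupByKey().mapValues(list).collect()   # [('u1', ['home', 'cart']), ('u3', ['home'])]
users.sortByKey(ascending=False).collect()      # [('u2', 'Bob'), ('u1', 'Alice')]
users.join(visits).collect()                    # [('u1', ('Alice', 'home')), ('u1', ('Alice', 'cart'))]
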
59
New cards

keys

return an RDD of just the keys without the values

60
New cards

values

return an RDD of just the values, without the keys

61
New cards

lookup(key)

returns values for a key

62
New cards

leftOuterJoin, rightOuterjoin, fullOuterJoin

joins that also include keys present only in the left RDD, only in the right RDD, or in either RDD, respectively

63
New cards

mapValues, flatMapValues

execute a function on just the values, keeping the key the same
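
A sketch of mapValues and flatMapValues on illustrative data:

kv = sc.parallelize([("a", 1), ("b", 2)])
kv.mapValues(lambda v: v * 10).collect()              # [('a', 10), ('b', 20)]  -- keys unchanged

kv2 = sc.parallelize([("a", "1,2"), ("b", "3")])
kv2.flatMapValues(lambda v: v.split(',')).collect()   # [('a', '1'), ('a', '2'), ('b', '3')]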

64
New cards

mapreduce is a generic programming model for distributed processing

the same map/reduce pattern is not specific to Hadoop; Spark implements it with transformations such as map and reduceByKey
65
New cards

apache spark is

a fast and general engine for large-scale data processing

66
New cards

spark shell

interactive for learning or data exploration

python or scala

67
New cards

spark applications

for large scale data processing

python, scala, or java; these applications run on the Spark framework and use its libraries for tasks such as data analysis and machine learning

68
New cards

what does the spark shell do

provides interactive data exploration (REPL)

69
New cards

every spark application requires

a spark context

the main entry point to the spark API

the object that allows you to start working in Spark

70
New cards

RDD (resilient distributed dataset)

resilient- if data in memory is lost, it can be recreated from the storage device

distributed- processed across the cluster; an RDD puts the data in memory rather than on disk (when a file is in memory it is an RDD), and we have partitions instead of blocks: a collection of data elements that can be processed in parallel

dataset- initial data can come from a file or be created programmatically

71
New cards

____________ are the fundamental unit of data in Spark

RDDs

72
New cards

most spark programming consists of

performing operations on RDDs

73
New cards

3 ways to create an RDD

file or set of files

data in memory

from another RDD

74
New cards

actions

returns values

75
New cards

transformations

define a new RDD based on the current ones

76
New cards

how to search for file in local file system

[training@localhost ~]$ sudo find / -name purplecow.txt 

77
New cards

how to get into python spark

pyspark

78
New cards

in RDD we look for file in

the HDFS home directory /user/training (HDFS = Hadoop Distributed File System)

79
New cards

actions

count()

take()

collect()

saveAsTextFile(file)

80
New cards

count()

return the number of elements

81
New cards

take(n)

return an array of the first n elements

82
New cards

collect()

return an array of all elements from a distributed dataset to the driver program.

83
New cards

saveAsTextFile(file)

save to text file(s)

84
New cards

rdd1.saveAsTextFile("rdd1_saved")

creates a directory with 2 files: one a flag file and the other with the actual content (the data from the RDD)
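
A minimal sketch of these actions in the pyspark shell, using a small parallelized RDD:

rdd = sc.parallelize(["line one", "line two", "line three"])
rdd.count()                      # 3
rdd.take(2)                      # ['line one', 'line two']
rdd.collect()                    # all elements returned to the driver as a list
rdd.saveAsTextFile("rdd_saved")  # writes a directory (part files plus a flag file), not a single file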

85
New cards

rdd1.count()

will show the count of the lines

86
New cards

RDDs are mutable or immutable?

immutable

data in an RDD is never changed

transform in sequence to modify the data as needed

87
New cards

common transformations

map(function)

filter(function)

88
New cards

map(function)

creates a new RDD by performing a function on each record in the base RDD

89
New cards

filter(function)

creates a new RDD including or excluding each record in the base RDD according to a boolean function

90
New cards

anonymous functions

functions with no name that must start with lambda (indicates that it is an in-line function)
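
A sketch combining filter and map with lambda (anonymous) functions; the file path is illustrative:

logs  = sc.textFile("weblogs/*")
jpgs  = logs.filter(lambda line: ".jpg" in line)   # keep only records where the boolean function is True
upper = jpgs.map(lambda line: line.upper())        # apply a function to each record
upper.take(2)                                      # action: nothing is processed until here (lazy execution)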

91
New cards

lazy execution

data in RDDs is not processed until an action is performed

the RDD is not truly loaded with the data until an action is performed

If we have an RDD and do a transformation, we get a new RDD

In Spark everything is a transformation until we get what we want

92
New cards

what happens when we do

rdd = sc.textFile("fjhskjhfssdjhdfjshsjfsk")

Once we press enter it will not give an error

rdd.count()

Now there will be an error, because it is actually looking for the file and loading the data

rdd5 = rdd.map(lambda line: line.upper())

This is a transformation

No error

rdd5.count()

NOW you will get an error, because we tried to load data from a file that does not exist

This is intentional, as Spark would not work otherwise

93
New cards

when we create a variable

it will be empty because nothing has been loaded

94
New cards

when we take a variable and apply a transformation like map or filter

it's still empty

95
New cards

data in RDDs is not processed until

an action is performed

Each file will load 10 GB at a time -> a lot of memory gets utilized by one application -> could eat all of your memory

Loading all at once is not a good idea; loading piece by piece is even worse

When you call an action, every RDD is empty; the bottom-most RDD asks its parent for data, and the request goes up the whole tree until the top RDD loads the file's data into memory (now occupying 10 GB); then the map creates the next RDD down and the memory of the previous one is freed

It goes one record at a time: applies what it needs, frees up the memory of the previous one, then goes on to the next one

Once the action is completed the memory is freed, and it goes through the same process again if the action is called again

96
New cards

spark depends heavily on the concepts of

functional programming

functions are the fundamental unit of programming

functions have input and output only

no state or side effects

97
New cards

in functional programming in spark, we pass ___ as input to other functions

functions

Map is a function (a transformation); transformations are functions

rdd.map(): we pass functions into map

98
New cards

we don’t have to create spark context because

the shell does it for us; if we want to write a Spark application, we have to create a Spark context ourselves
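
A minimal sketch of a stand-alone application creating its own SparkContext (the shell creates sc automatically); the app name and file are illustrative:

from pyspark import SparkContext

sc = SparkContext(appName="CountLines")       # the main entry point to the Spark API
print(sc.textFile("purplecow.txt").count())   # run one action
sc.stop()                                     # release resources when done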

99
New cards

execution mode determines

which mode Spark will run in (local or cluster)

100
New cards

2 forms of execution mode

local

cluster