Lecture 7 - Spark Data Management

96 Terms

1
what is apache spark

fast engine for large scale data processing

2
what scale is best for apache spark

Large scale

3
how big is the system footprint for apache spark

very large

4
what is used to write spark applications

spark shell

5
what shell is used for interactive data analysis

PySpark shell (the Python shell, started with the pyspark command)

6
what is spark?

Application and ecosystem for exploring very large data sets

7
what does spark perform?

all necessary data warehouse functions like loading, storing, filtering, grouping, and joining

8
spark is an engine for executing what?

data flows in parallel

9
what allows for very fast query response times?

in memory computing

10
what does spark architecture provide?

iterative processing necessary for algorithmic processing

11
Spark use cases

  • Traditional Extract Transform and Loading (ETL)

  • Research on Raw data

  • statistical modeling

  • algorithmic processing

12
ETL

Processing of all kinds of information system data flows

13
what kind of data can spark handle?

unstructured, structured, and unknown

14
what systems does spark work well with?

HDFS and Linux file systems

15
Statistical Modeling

Provides native processing of R functions at a petabyte scale

16
spark applications consist of what?

driver process and executors

17
what is typically used as the cluster manager?

YARN or Spark's own standalone cluster manager

18
what runs the main() function

Spark Driver

19
what maintains information about the spark application?

spark driver

20
what does a spark driver prompt respond to?

user’s input or program

21
what does the spark driver handle?

analyzing, distributing, and scheduling work across the executors

22
what do executors do?

carry out the tasks that the driver assigns them

23
where do executors get code from?

driver

24
what do executors report?

status of computation on that executor back to the driver
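
The driver and executor roles on the cards above can be pictured with a small plain-Python sketch. It is only an analogy, not Spark code: a thread pool stands in for the executors, and the names driver and executor_task are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def executor_task(task_id, rows):
    """Carry out the assigned work and report a status back to the driver."""
    return {"task": task_id, "status": "done", "result": sum(rows)}


def driver(data, num_executors=3):
    """Split the work, distribute it to executors, and collect their reports."""
    # Divide the data into one slice per executor (round-robin).
    slices = [data[i::num_executors] for i in range(num_executors)]
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        reports = list(pool.map(executor_task, range(num_executors), slices))
    # Combine the partial results the executors reported.
    total = sum(report["result"] for report in reports)
    return reports, total


reports, total = driver(list(range(10)))
print(total)  # 45
```

In real Spark the executors are separate processes on the cluster nodes, and the cluster manager decides where they run.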

25
Ways to start a spark application

  • scala

  • scripted

  • python

26
how to engage scala?

spark-shell command

27
how to engage python?

pyspark command

28
how to engage scripted applications?

spark-submit command

29
what does spark-submit let you do?

create and submit your own custom program to the cluster

30
spark-submit allows

you to specify resources the application needs

31
when can we create a dataframe?

after engaging pyspark shell

32
what identifies the application?

the spark variable (the SparkSession object the shell creates)

33
what is the command that creates a list of values?

.range()

34
what does this command do? .toDF()

converts to a data frame; put the name of the column in the parentheses

35
where are data frame records saved

multiple servers in the cluster

36
What is a partition

collection of rows saved to one server

37
are partitions manipulated manually?

no
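
A rough picture of partitioning, in plain Python rather than Spark: the rows are cut into chunks and each chunk would live on one server. The function partition_rows and the even-split rule are assumptions for the illustration; Spark chooses its own partitioning.

```python
def partition_rows(rows, num_partitions):
    """Split rows into contiguous chunks, one chunk per (hypothetical) server."""
    size, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional row.
        end = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


partitions = partition_rows(list(range(10)), 3)
print(partitions)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```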

38
what are transformations?

Instructions for modifying a dataframe

39
what provides logic for an action?

transformations

40
what is used to trigger computation in spark?

actions

41
what do actions trigger?

transformations to execute their logic

42
what allows users to view data in the console?

actions

43
what collects data for spark objects?

actions

44
what do actions do to output data sources?

write

45
what is lazy evaluation

spark will wait until it receives an action to execute
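
Lazy evaluation can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: the class LazyRows and its methods where, apply, and collect are hypothetical names. Transformations only record a step; the action runs the whole recorded plan.

```python
class LazyRows:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    # Transformations: return a new object with one more recorded step.
    def where(self, predicate):
        return LazyRows(self._rows, self._steps + (("where", predicate),))

    def apply(self, func):
        return LazyRows(self._rows, self._steps + (("apply", func),))

    # Action: executes every recorded step and returns real data.
    def collect(self):
        rows = list(self._rows)
        for kind, func in self._steps:
            if kind == "where":
                rows = [row for row in rows if func(row)]
            else:
                rows = [func(row) for row in rows]
        return rows


calls = []


def double(x):
    calls.append(x)  # lets us see when the work really happens
    return x * 2


plan = LazyRows(range(6)).where(lambda x: x % 2 == 0).apply(double)
ran_before_action = len(calls)  # 0: nothing has executed yet
result = plan.collect()         # the action triggers the transformations
print(ran_before_action, result)  # 0 [0, 4, 8]
```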

46
how can spark read data files?

the command .read

47
what is Comma Separated Values? (CSV)

where a comma separates each field in a row
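
A small illustration of the CSV layout, using Python's standard csv module instead of Spark. The sample rows are invented for the example.

```python
import csv
import io

# Hypothetical sample data; the first row holds the column names.
text = "name,age,city\nkaren,25,denver\nbob,31,dallas\n"

# Each line is one row, and each comma separates one field from the next.
records = list(csv.DictReader(io.StringIO(text)))

print(records[0]["city"])  # denver
print(records[1]["age"])   # 31 (still a string until it is cast)
```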

48
what is a parquet

column oriented storage format designed for analytics

49
what can we use to compress columns and save storage space?

parquet

50
what do parquet formats provide

highly efficient column queries
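
Why a column-oriented layout helps column queries and compression can be sketched in plain Python. This shows only the idea; Parquet's actual file format and encodings are more elaborate, and the sample data is invented.

```python
rows = [
    {"customer": "karen", "product": "laptop", "amount": 1200},
    {"customer": "bob", "product": "phone", "amount": 800},
    {"customer": "sue", "product": "laptop", "amount": 1100},
]

# Column-oriented layout: one list per column instead of one record per row.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query on one column reads only that column's values.
total_amount = sum(columns["amount"])

# Repeated values within a column compress well; here a simple
# dictionary encoding replaces each value with a small integer code.
distinct = sorted(set(columns["product"]))
encoded = [distinct.index(value) for value in columns["product"]]

print(total_amount)       # 3100
print(distinct, encoded)  # ['laptop', 'phone'] [0, 1, 0]
```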

51
what is Javascript Object Notation (json)

a text format for structured data; files can be read line by line (one object per line) or, with the multiLine option, as one overall object spread across several lines
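
The two JSON layouts can be compared with Python's standard json module. This is a sketch of the file shapes only, not of Spark's reader, and the sample records are made up.

```python
import json

# Line-delimited JSON: every line is a complete object.
line_delimited = '{"name": "bob", "age": 31}\n{"name": "sue", "age": 28}\n'
records = [json.loads(line) for line in line_delimited.splitlines() if line.strip()]

# Multi-line JSON: one overall document spread across several lines.
multi_line = """[
  {"name": "bob", "age": 31},
  {"name": "sue", "age": 28}
]"""
document = json.loads(multi_line)

print(records == document)  # True: same data, different file layout
```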

52
what to use to save files in spark?

DataFrame Writer

53
what file format does the DataFrame Writer default to when none is specified

parquet

54
how to distribute files?

.partitionBy and .bucketBy

55
what allows users to order the data records in the file?

.sort on the DataFrame (or .sortBy on the writer when bucketing)

56
in pyspark what does csvFile.write.format do

specify the file format

57
what is the mode that replaces any existing output with a new file (Spark's own default mode is errorifexists)

overwrite

58
what does /tmp/my-csv-file.csv do

specifies the directory and file name for saving

59
Pyspark Data Types

  • Binary

  • Integer

  • Long

  • Float

  • Double

  • String

  • Boolean

  • Date

  • Map

  • Array

  • Struct

60
MapType()

key-value pairs separated by a comma

61
BinaryType()

an array of bytes like a long binary string, image, video, or other large object file

62
IntegerType()

4-byte integer

63
what data type is 2,147,483,647

IntegerType() (this is the largest value it can hold; 2,147,483,648 needs LongType())

64
LongType()

8-byte integer

65
what data type is -9,223,372,036,854,775,808

LongType()
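
The 4-byte and 8-byte limits can be checked in plain Python. A signed 4-byte integer runs from -2,147,483,648 to 2,147,483,647 and a signed 8-byte integer from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. The helper names below are hypothetical; the type names are the ones from the cards.

```python
def fits_in_bytes(value, num_bytes):
    """Return True if value can be stored as a signed integer of that size."""
    try:
        value.to_bytes(num_bytes, byteorder="big", signed=True)
        return True
    except OverflowError:
        return False


def smallest_type(value):
    """Name the smallest integer type from the cards that can hold value."""
    if fits_in_bytes(value, 4):
        return "IntegerType"
    if fits_in_bytes(value, 8):
        return "LongType"
    return None  # too large for either type


print(smallest_type(2_147_483_647))               # IntegerType
print(smallest_type(2_147_483_648))               # LongType
print(smallest_type(-9_223_372_036_854_775_808))  # LongType
```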

66
FloatType()

4-byte floating-point number (single precision)

67
what data type is 2.64

FloatType()

68
DoubleType()

8-byte floating-point number (double precision)

69
what data type is 9.223372036854775808

DoubleType()
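
The difference between 4-byte and 8-byte storage shows up as precision. The sketch below uses Python's struct module to store the same value both ways; the helper round_trip is a made-up name.

```python
import struct


def round_trip(value, fmt):
    """Store value in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


as_float = round_trip(2.64, ">f")   # 4 bytes, single precision
as_double = round_trip(2.64, ">d")  # 8 bytes, double precision

print(struct.calcsize(">f"), struct.calcsize(">d"))  # 4 8
print(as_double == 2.64)  # True: 8 bytes keep the value unchanged
print(as_float == 2.64)   # False: 4 bytes keep only about 7 significant digits
```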

70
StringType()

a string of characters

71
what data type is “JSOM excellent students”

StringType()

72
BooleanType()

a logical value: true or false

73
DateType()

a calendar date (year, month, day) with no time of day; TimestampType() holds date and time

74
what data type is [name#bob, name#sue, name#vinay]

MapType()

75
ArrayType()

ordered collection of values

76
what data type is (karen,25,denver)

ArrayType()

77
StructType()

distinct list of field names

78
what data type is (customer,product,region)

StructType()

79
what data type is cheaper (less overhead) than FloatType

IntegerType

80
when choosing data types what reduces processing overhead

specifying integers

81
what is the default metadata type?

BinaryType

82
Can you easily go from one data type to another

no, only cast when you need it

83
.take(n)

returns the first n rows

84
.explain

provides the (lazy) plan of execution

85
what happens if Spark finds ways to improve processing?

it optimizes the plan

86
Dataframes can be turned into what kind of table?

SQL table

87
how does spark query SQL tables

native SQL commands

88
.show()

sends the results to the screen

89
spark.sql

looks for SQL commands between triple quotes (""" and """)

90
what can be used to query the dataframe?

Spark statements

91
.groupBy, .sum, and .sort

perform equivalent functions to SQL
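
The equivalence between a SQL query and group, sum, and sort steps can be shown without Spark. In this sketch an in-memory SQLite database stands in for the SQL engine and plain Python stands in for the method calls; the sales data is invented.

```python
import sqlite3
from collections import defaultdict

sales = [("east", 100), ("west", 250), ("east", 300), ("north", 50)]

# SQL version: the command sits between triple quotes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_result = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
    """
).fetchall()
conn.close()

# Method-style version: group by region, sum the amounts, then sort.
totals = defaultdict(int)
for region, amount in sales:
    totals[region] += amount
method_result = sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(sql_result)                   # [('east', 400), ('west', 250), ('north', 50)]
print(sql_result == method_result)  # True
```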

92
what is Relational Database Data Import

Client side application that imports data from a database and writes that data into the cluster

93
what does Relational Database Data Import use to extract rows from a table

MapReduce job

94
what does Relational Database Data Import use to access data in the RDBMS

Java JDBC API

95
what are the steps of the Import Process

  1. examine table details

  2. create and submit a job to the cluster

  3. Fetch records from the relational database table and write this data to HDFS
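
The three import steps above can be walked through with stand-ins. This is a sketch under loud assumptions: SQLite replaces the relational database, a CSV file in a temporary directory replaces HDFS, and an ordinary function replaces the cluster job. The table customers and the function import_job are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile

# Stand-in source database with a hypothetical "customers" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "karen"), (2, "bob")])

# Step 1: examine table details (here, the column names).
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]


# Step 2: create the job; here it is just a function we call directly.
def import_job(connection, table, column_names, out_path):
    # Step 3: fetch the records and write them to the target file.
    rows = connection.execute(f"SELECT * FROM {table}").fetchall()
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(column_names)
        writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as target_dir:
    out_path = os.path.join(target_dir, "customers.csv")
    written = import_job(conn, "customers", columns, out_path)
    with open(out_path, newline="") as handle:
        imported = list(csv.reader(handle))
conn.close()

print(written)   # 2
print(imported)  # [['id', 'name'], ['1', 'karen'], ['2', 'bob']]
```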

96
newPath

provides the address of the database we are updating
