what is apache spark
fast engine for large-scale data processing
what scale is best for apache spark
Large scale
how big is the system footprint for apache spark
very large
what is used to write spark applications
spark shell
what shell is used for interactive data analysis
python shell
what is spark?
Application and ecosystem for exploring very large data sets
what does spark perform?
all necessary data warehouse functions, like loading, storing, filtering, grouping, and joining
spark is an engine for executing what?
data flows in parallel
what allows for very fast query response times?
in memory computing
what does spark architecture provide?
iterative processing necessary for algorithmic processing
Spark use cases
Traditional Extract Transform and Loading (ETL)
Research on Raw data
statistical modeling
algorithmic processing
ETL
Processing of all kinds of information system data flows
what kind of data can spark handle?
unstructured, structured, and unknown
what systems does spark work well with?
HDFS and Linux file systems
Statistical Modeling
Provides native processing of R functions at a petabyte scale
spark applications consist of what?
driver process and executors
what is typically used as the cluster manager?
YARN or Spark's standalone cluster manager
what runs the main() function
Spark Driver
what maintains information about the spark application?
spark driver
what does a spark driver prompt respond to?
user’s input or program
what does the spark driver handle?
analyzing, distributing, and scheduling work across the executors
what do executors do?
carry out the tasks that the driver assigns them
where do executors get code from?
driver
what do executors report?
status of computation on that executor back to the driver
Ways to start a spark application
scala
scripted
python
how to engage scala?
spark-shell command
how to engage python?
pyspark command
how to engage scripted applications?
spark-submit command
what does spark-submit provide?
create and submit your own custom program to the cluster
spark-submit allows
you to specify resources the application needs
when can we create a dataframe?
after engaging pyspark shell
what identifies the application?
spark
what is the command that creates a list of values?
.range()
what does this command do? .toDF()
converts the values to a DataFrame
put the name of the column in the parentheses
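A minimal sketch of these two commands together, run inside the pyspark shell (the column name "number" is only an illustration):

    # `spark` is the SparkSession the pyspark shell creates automatically
    df = spark.range(10)      # a DataFrame of 10 values, 0 through 9
    df = df.toDF("number")    # rename the column to "number"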
where are data frame records saved
multiple servers in the cluster
What is a partition
collection of rows saved to one server
are partitions manipulated manually?
no
what are transformations?
Instructions for modifying a dataframe
what provides logic for an action?
transformations
what is used to trigger computation in spark?
actions
what do actions trigger?
transformations to execute their logic
what allows users to view data in the console?
actions
what collects data for spark objects?
actions
what do actions do to output data sources?
write
what is lazy evaluation
spark will wait until it receives an action to execute
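A short pyspark sketch of the transformation/action split, reusing the df from the earlier .range() example:

    filtered = df.where("number > 5")   # transformation: builds logic, nothing runs
    filtered.show()                     # action: triggers execution and prints rows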
how can spark read data files?
the command .read
what is Comma Separated Values? (CSV)
where a comma separates each field in a row
what is a parquet
column oriented storage format designed for analytics
what can we use to compress columns and save storage space?
parquet
what do parquet formats provide
highly efficient column queries
what is Javascript Object Notation (json)
files can be read in line by line or, with the multiLine option, as an overall object
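A hedged sketch of .read for the three formats; the file paths and the header option are assumptions:

    csv_df  = spark.read.option("header", "true").csv("/tmp/people.csv")
    pq_df   = spark.read.parquet("/tmp/people.parquet")    # columnar, compressed
    json_df = spark.read.option("multiLine", "true").json("/tmp/people.json")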
what is used to save files in spark?
DataFrame Writer
what does .format default to
parquet
how to distribute files?
.partitionBy and .bucketBy
what allows users to order the data records in the file?
.sortBy (used with .bucketBy)
in pyspark what does csvFile.write.format do
specifies the file format
what mode creates a new file, replacing any existing one
overwrite
what does /tmp/my-csv-file.csv do
specifies the directory and file name for saving
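Putting the DataFrame Writer pieces together, a minimal sketch (csvFile stands for an existing DataFrame):

    csvFile.write.format("csv") \
        .mode("overwrite") \
        .save("/tmp/my-csv-file.csv")   # writes to the given path

    # distributing output across files by a column value (column name is illustrative)
    csvFile.write.partitionBy("region").parquet("/tmp/partitioned-output")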
PySpark Data Types
Binary
Integer
Long
Float
Double
String
Boolean
Date
Map
Array
Struct
MapType()
key-value pairs separated by a comma
BinaryType()
an array of bytes like a long binary string, image, video, or other large object file
IntegerType()
4-byte integer
what data type is 2,147,483,647
IntegerType()
LongType()
8-byte integer
what data type is -9,223,372,036,854,775,808
LongType()
FloatType()
4-byte decimal
what data type is 2.64
FloatType()
DoubleType()
8-byte decimal
what data type is 9.223372036854775808
DoubleType()
StringType()
a string of characters
what data type is “JSOM excellent students”
StringType()
BooleanType()
a logical value: true or false
DateType()
provides a calendar date; time and timezone require TimestampType()
what data type is [name#bob, name#sue, name#vinay]
MapType()
ArrayType()
ordered collection of values
what data type is (karen,25,denver)
ArrayType()
StructType()
distinct list of field names
what data type is (customer,product,region)
StructType()
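A brief sketch of declaring a schema from these types; the field names are illustrative:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

    schema = StructType([
        StructField("customer", StringType(), True),
        StructField("age", IntegerType(), True),    # 4-byte integer keeps overhead low
        StructField("signup", DateType(), True),
    ])
    df2 = spark.read.schema(schema).csv("/tmp/customers.csv")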
what data type is cheaper (less overhead) than FloatType
IntegerType
when choosing data types what reduces processing overhead
specifying integers
what is the default metadata type?
BinaryType
Can you easily go from one data type to another
no, only cast when you need it
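A minimal casting sketch (the column name is an assumption):

    from pyspark.sql.functions import col
    df2 = df2.withColumn("age", col("age").cast("long"))   # cast only when you need it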
.take
returns the specified number of rows
.explain
provides the (lazy) plan of execution
what happens if Spark finds ways to improve processing?
it optimizes the plan
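A quick sketch of both commands, again using the earlier df:

    df.take(5)                   # returns the first 5 rows to the driver
    df.sort("number").explain()  # prints the lazy plan Spark intends to run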
Dataframes can be turned into what kind of table?
SQL table
how does spark query SQL tables
native SQL commands
.show()
sends the results to the screen
spark.sql
looks for SQL commands between triple quotes (""" and """)
what can be used to query the dataframe?
Spark statements
.groupBy, .sum, and .sort
perform equivalent functions to SQL
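A short sketch tying the SQL and DataFrame paths together; the view name is illustrative:

    df.createOrReplaceTempView("numbers")   # expose the DataFrame as a SQL table

    spark.sql("""
        SELECT number, COUNT(*) AS cnt
        FROM numbers
        GROUP BY number
        ORDER BY cnt DESC
    """).show()                             # send the results to the screen

    # the equivalent Spark statements
    from pyspark.sql.functions import count
    df.groupBy("number").agg(count("*").alias("cnt")).sort("cnt", ascending=False).show()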
what is Relational Database Data Import
Client-side application that imports data from a database and writes that data into the cluster
what does Relational Database Data Import use to extract rows from a table
MapReduce job
what does Relational Database Data Import use to access data in the RDBMs
Java JDBC API
what are the steps of the Import Process
examine table details
create and submit a job to the cluster
Fetch records from the relational database table and write this data to HDFS
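The deck doesn't name the tool, but this description matches Apache Sqoop; assuming so, a minimal import sketch (connection details are placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username dbuser \
      --table orders \
      --target-dir /user/data/orders    # rows fetched over JDBC, written to HDFS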
newPath
provides the address of the database we are updating