what is apache spark
fast engine for large-scale data processing
what scale is best for apache spark
Large scale
how big is the system footprint for apache spark
very large
what is used to write spark applications
spark shell
what shell is used for interactive data analysis
python shell
what is spark?
Application and ecosystem for exploring very large data sets
what does spark perform?
all necessary data warehouse functions, like loading, storing, filtering, grouping, and joining
spark is an engine for executing what?
data flows in parallel
what allows for very fast query response times?
in memory computing
what does spark architecture provide?
iterative processing necessary for algorithmic processing
Spark use cases
Traditional Extract Transform and Loading (ETL)
Research on Raw data
statistical modeling
algorithmic processing
ETL
Processing of all kinds of information system data flows
what kind of data can spark handle?
unstructured, structured, and unknown
what systems does spark work well with?
HDFS and Linux file systems
Statistical Modeling
Provides native processing of R functions at a petabyte scale
spark applications consist of what?
driver process and executors
what is typically used as the cluster manager?
YARN or Spark's standalone cluster manager
what runs the main() function
Spark Driver
what maintains information about the spark application?
spark driver
what does a spark driver prompt respond to?
user’s input or program
what does the spark driver handle?
analyzing, distributing, and scheduling work across the executors
what do executors do?
carry out the tasks that the driver assigns them
where do executors get code from?
driver
what do executors report?
status of computation on that executor back to the driver
Ways to start a spark application
scala
scripted
python
how to engage scala?
spark-shell command
how to engage python?
pyspark command
how to engage scripted applications?
spark-submit command
what does spark-submit provide?
create and submit your own custom program to the cluster
spark-submit allows
you to specify resources the application needs
when can we create a dataframe?
after engaging pyspark shell
what identifies the application?
spark
what is the command that creates a list of values?
.range()
what does this command do? .toDF()
converts the values to a DataFrame
put the name of the column in the parentheses
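A minimal sketch of these two commands together, run inside the pyspark shell (the column name "number" is only an illustration):

    # `spark` is the SparkSession the pyspark shell creates automatically
    df = spark.range(10)      # a DataFrame of 10 values, 0 through 9
    df = df.toDF("number")    # rename the column to "number"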
where are data frame records saved
multiple servers in the cluster
What is a partition
collection of rows saved to one server
are partitions manipulated manually?
no
what are transformations?
Instructions for modifying a dataframe
what provides logic for an action?
transformations
what is used to trigger computation in spark?
actions
what do actions trigger?
transformations to execute their logic
what allows users to view data in the console?
actions
what collects data for spark objects?
actions
what do actions do to output data sources?
write
what is lazy evaluation
spark will wait until it receives an action to execute
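A short pyspark sketch of the transformation/action split, reusing the df from the earlier .range() example:

    filtered = df.where("number > 5")   # transformation: builds logic, nothing runs
    filtered.show()                     # action: triggers execution and prints rows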
how can spark read data files?
the command .read
what is Comma Separated Values? (CSV)
where a comma separates each field in a row
what is a parquet
column oriented storage format designed for analytics
what can we use to compress columns and save storage space?
parquet
what do parquet formats provide
highly efficient column queries
what is Javascript Object Notation (json)
files can be read in line by line or, with the multiLine option, as an overall object
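A hedged sketch of .read for the three formats; the file paths and the header option are assumptions:

    csv_df  = spark.read.option("header", "true").csv("/tmp/people.csv")
    pq_df   = spark.read.parquet("/tmp/people.parquet")    # columnar, compressed
    json_df = spark.read.option("multiLine", "true").json("/tmp/people.json")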
what is used to save files in spark?
DataFrame Writer
what does .format default to
parquet
how to distribute files?
.partitionBy and .bucketBy
what allows users to order the data records in the file?
.sortBy (used with .bucketBy)
in pyspark what does csvFile.write.format do
specifies the file format
what mode creates a new file, replacing any existing one
overwrite
what does /tmp/my-csv-file.csv do
specifies the directory and file name for saving
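Putting the DataFrame Writer pieces together, a minimal sketch (csvFile stands for an existing DataFrame):

    csvFile.write.format("csv") \
        .mode("overwrite") \
        .save("/tmp/my-csv-file.csv")   # writes to the given path

    # distributing output across files by a column value (column name is illustrative)
    csvFile.write.partitionBy("region").parquet("/tmp/partitioned-output")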
PySpark Data Types
Binary
Integer
Long
Float
Double
String
Boolean
Date
Map
Array
Struct
MapType()
key-value pairs separated by a comma
BinaryType()
an array of bytes like a long binary string, image, video, or other large object file
IntegerType()
4-byte integer
what data type is 2,147,483,647
IntegerType()
LongType()
8-byte integer
what data type is -9,223,372,036,854,775,808
LongType()
FloatType()
4-byte decimal
what data type is 2.64
FloatType()
DoubleType()
8-byte decimal
what data type is 9.223372036854775808
DoubleType()
StringType()
a string of characters
what data type is “JSOM excellent students”
StringType()
BooleanType()
a logical value: true or false
DateType()
provides a calendar date; time and timezone require TimestampType()
what data type is [name#bob, name#sue, name#vinay]
MapType()
ArrayType()
ordered collection of values
what data type is (karen,25,denver)
ArrayType()
StructType()
distinct list of field names
what data type is (customer,product,region)
StructType()
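A brief sketch of declaring a schema from these types; the field names are illustrative:

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

    schema = StructType([
        StructField("customer", StringType(), True),
        StructField("age", IntegerType(), True),    # 4-byte integer keeps overhead low
        StructField("signup", DateType(), True),
    ])
    df2 = spark.read.schema(schema).csv("/tmp/customers.csv")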
what data type is cheaper (less overhead) than FloatType
IntegerType
when choosing data types what reduces processing overhead
specifying integers
what is the default metadata type?
BinaryType
Can you easily go from one data type to another
no, only cast when you need it
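A minimal casting sketch (the column name is an assumption):

    from pyspark.sql.functions import col
    df2 = df2.withColumn("age", col("age").cast("long"))   # cast only when you need it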
.take
returns the specified number of rows
.explain
provides the (lazy) plan of execution
what happens if Spark finds ways to improve processing?
it optimizes the plan
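A quick sketch of both commands, again using the earlier df:

    df.take(5)                   # returns the first 5 rows to the driver
    df.sort("number").explain()  # prints the lazy plan Spark intends to run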
Dataframes can be turned into what kind of table?
SQL table
how does spark query SQL tables
native SQL commands
.show()
sends the results to the screen
spark.sql
looks for SQL commands between triple quotes (""" and """)
what can be used to query the dataframe?
Spark statements
.groupBy, .sum, and .sort
perform equivalent functions to SQL
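A short sketch tying the SQL and DataFrame paths together; the view name is illustrative:

    df.createOrReplaceTempView("numbers")   # expose the DataFrame as a SQL table

    spark.sql("""
        SELECT number, COUNT(*) AS cnt
        FROM numbers
        GROUP BY number
        ORDER BY cnt DESC
    """).show()                             # send the results to the screen

    # the equivalent Spark statements
    from pyspark.sql.functions import count
    df.groupBy("number").agg(count("*").alias("cnt")).sort("cnt", ascending=False).show()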
what is Relational Database Data Import
Client-side application that imports data from a database and writes that data into the cluster
what does Relational Database Data Import use to extract rows from a table
MapReduce job
what does Relational Database Data Import use to access data in the RDBMs
Java JDBC API
what are the steps of the Import Process
examine table details
create and submit a job to the cluster
Fetch records from the relational database table and write this data to HDFS
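The deck doesn't name the tool, but this description matches Apache Sqoop; assuming so, a minimal import sketch (connection details are placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username dbuser \
      --table orders \
      --target-dir /user/data/orders    # rows fetched over JDBC, written to HDFS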
newPath
provides the address of the database we are updating