what is MapReduce?
a programming framework centered around map and reduce functions to process big data
what is the map function vs. reduce function?
map turns input records into intermediate key-value pairs; reduce merges all the values that share the same key into a single result
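A minimal sketch of the two steps, written in plain Python rather than Hadoop. The function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for illustration; in a real framework the grouping between map and reduce is done for you.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: merge all values that share one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big cluster", "data node"])
print(counts)
```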
what do you use MapReduce for?
analyzing links between webpages, counting word occurrences in large text
what is Apache Hadoop?
open-source framework for distributed storage and MapReduce processing of big data
what is a Hadoop node?
a single machine in the cluster that stores and processes data
what is a Hadoop cluster?
multiple nodes that are configured to perform big data operations
What is the structure of a Hadoop cluster?
one master node called the namenode that stores the metadata, and several worker nodes (datanodes) that store the actual data
what is the Hadoop Distributed File System (HDFS)?
a cluster-based storage service for Hadoop apps and big data
what is Apache YARN?
a resource/cluster manager that allocates cluster resources and schedules jobs such as MapReduce
why use Apache YARN with HDFS?
it makes big data processing more efficient by letting different types of processing (batch, graph, streaming) run on the data stored in HDFS
What is Apache Pig?
high-level platform whose scripts (written in Pig Latin) run as MapReduce jobs on a Hadoop cluster
What is Apache Spark?
cluster computing framework for processing data workloads in parallel
Why is Apache Spark faster than Hadoop MapReduce?
because Spark performs computations in memory and in parallel, instead of writing intermediate results to disk between steps
What does an Apache Spark setup need?
a cluster manager (such as Hadoop YARN or Kubernetes) and a distributed storage system (such as HDFS, HBase, or Cassandra)
What is a distributed streaming platform?
a distributed system that allows you to publish or subscribe to streams of records
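A rough sketch of the publish/subscribe idea, using a single in-memory object. `TinyBroker` and its methods are hypothetical and are not Kafka's API; a real platform spreads the streams across many machines.

```python
from collections import defaultdict

class TinyBroker:
    """Hypothetical in-memory stand-in for a streaming platform."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic name -> append-only list of records
        self._offsets = defaultdict(int)   # (subscriber, topic) -> next index to read

    def publish(self, topic, record):
        """Append a record to the end of a topic's stream."""
        self._topics[topic].append(record)

    def poll(self, subscriber, topic):
        """Return records this subscriber has not seen yet, then advance its offset."""
        start = self._offsets[(subscriber, topic)]
        records = self._topics[topic][start:]
        self._offsets[(subscriber, topic)] = len(self._topics[topic])
        return records

broker = TinyBroker()
broker.publish("clicks", "page-a")
broker.publish("clicks", "page-b")
first = broker.poll("analytics", "clicks")    # both records so far
broker.publish("clicks", "page-c")
second = broker.poll("analytics", "clicks")   # only the new record
other = broker.poll("billing", "clicks")      # a new subscriber reads from the start
```

Each subscriber keeps its own position in the stream, so several readers can consume the same records independently.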
What is Apache Kafka?
a cluster-based distributed streaming platform
What are the 4 APIs of Apache Kafka?
producer, consumer, streams, connector
What is Apache Beam?
a programming model for defining and executing data processing pipelines, both batch and streaming
What does Apache Beam ParDo transform do?
generic parallel-processing transform that applies a user-specified function to each element of a PCollection (PC)
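A sketch of what ParDo does, in plain Python rather than the Beam SDK. The helper `par_do` and the two user functions are hypothetical names; the point is that the user-specified function may produce zero, one, or many outputs per input element.

```python
def par_do(elements, fn):
    """Apply fn to every element; fn may return zero, one, or many outputs."""
    results = []
    for element in elements:
        results.extend(fn(element))
    return results

def split_words(line):
    # User-specified logic: one input line becomes several output words.
    return line.split()

def keep_long(word):
    # Returning nothing for an element acts as a filter.
    return [word] if len(word) > 3 else []

words = par_do(["big data", "hadoop cluster"], split_words)
long_words = par_do(words, keep_long)
```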
What does Apache Beam GroupByKey transform do?
collects all the values that share the same key into one (key, values) group
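A plain-Python sketch of the grouping (not the Beam SDK; `group_by_key` is a hypothetical helper):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect every value that shares a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

grouped = group_by_key([("cat", 1), ("dog", 5), ("cat", 9), ("dog", 2)])
```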
What does Apache Beam CoGroupByKey transform do?
performs a relational join of two or more keyed PCs, matching up values that have the same key
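A plain-Python sketch of the join (not the Beam SDK; `co_group_by_key` and the sample data are made up for illustration). Each key ends up with one list of values per input, and an input that lacks the key contributes an empty list.

```python
def co_group_by_key(**tagged):
    """For each key, collect the values from every tagged input into one dict."""
    keys = {key for pairs in tagged.values() for key, _ in pairs}
    return {
        key: {tag: [v for k, v in pairs if k == key] for tag, pairs in tagged.items()}
        for key in keys
    }

emails = [("amy", "amy@example.com"), ("carl", "carl@example.com")]
phones = [("amy", "111-222"), ("james", "222-333")]
joined = co_group_by_key(emails=emails, phones=phones)
```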
What does Apache Beam Combine transform do?
combines the elements (or the values per key) of a PC using a user-defined combining function, such as sum or max
What does Apache Beam Flatten transform do?
merges multiple input PCs of the same type into a single PC
What does Apache Beam Partition transform do?
splits one PC into a fixed number of smaller PCs, using a user-provided function that decides which partition each element goes to
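Flatten and Partition are roughly opposites, which a plain-Python sketch can show (not the Beam SDK; `flatten` and `partition` are hypothetical helpers). The partition function here takes the element and the number of partitions and returns a partition index.

```python
def flatten(*collections):
    """Merge several collections of the same element type into one."""
    merged = []
    for collection in collections:
        merged.extend(collection)
    return merged

def partition(elements, num_partitions, partition_fn):
    """Split one collection into num_partitions lists chosen by partition_fn."""
    parts = [[] for _ in range(num_partitions)]
    for element in elements:
        parts[partition_fn(element, num_partitions)].append(element)
    return parts

scores = flatten([12, 85], [47, 99], [60])
# Partition 0 holds scores under 50, partition 1 holds the rest.
low, high = partition(scores, 2, lambda score, n: 0 if score < 50 else 1)
```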
What is Zookeeper?
open-source coordination service for distributed systems
What is Hive?
data warehouse system on Hadoop for querying and analyzing structured data with SQL-like queries
What is Tez?
framework, built on YARN, for building batch and interactive data processing apps that run as a directed acyclic graph (DAG) of tasks