Ch. 3 Big Data Ecosystem

27 Terms

What is MapReduce?

A programming model built around map and reduce functions for processing big data in parallel across a cluster.

What is the map function vs. the reduce function?

Map turns the input into intermediate key-value pairs; reduce merges all the values that share the same key.
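
The map and reduce roles can be sketched with a word count in plain Python. This is a single-machine illustration only, not Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this example, and a real framework would run each phase across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key (the framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: merge all values that share a key into one result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data big ideas", "data moves fast"])
print(counts)
```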

What do you use MapReduce for?

Large batch jobs, such as analyzing the links between web pages or counting the words in a large body of text.

What is Apache Hadoop?

An open-source framework for storing and processing big data across a cluster, built around an implementation of MapReduce.

What is a Hadoop node?

A single machine that stores and processes data.

What is a Hadoop cluster?

Multiple nodes configured to work together on big data operations.

What is the structure of a Hadoop cluster?

One master node, called the NameNode, that stores the metadata, and several worker nodes (DataNodes) that store the actual data.
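
The split between metadata and data can be sketched as a toy model in Python. Everything here is hypothetical (the class names, `write_file`, `read_file`, the round-robin placement, and the 4-byte block size); real HDFS also replicates each block and uses far larger blocks.

```python
class NameNode:
    """Master: holds only metadata, i.e. which blocks form a file and where they live."""
    def __init__(self):
        self.block_map = {}  # file name -> list of (block_id, worker_name)

class DataNode:
    """Worker: holds the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

def write_file(namenode, datanodes, filename, data, block_size=4):
    """Split data into blocks, place them round-robin, record locations on the master."""
    locations = []
    for start in range(0, len(data), block_size):
        index = start // block_size
        block_id = f"{filename}#{index}"
        node = datanodes[index % len(datanodes)]
        node.blocks[block_id] = data[start:start + block_size]
        locations.append((block_id, node.name))
    namenode.block_map[filename] = locations

def read_file(namenode, datanodes, filename):
    """Ask the master where the blocks are, then fetch them from the workers."""
    by_name = {node.name: node for node in datanodes}
    return b"".join(by_name[worker].blocks[block_id]
                    for block_id, worker in namenode.block_map[filename])

master = NameNode()
workers = [DataNode("w1"), DataNode("w2")]
write_file(master, workers, "log.txt", b"hello world")
print(master.block_map["log.txt"])
print(read_file(master, workers, "log.txt"))
```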

What is the Hadoop Distributed File System (HDFS)?

A cluster-based storage service for Hadoop applications and big data.

What is Apache YARN?

A cluster resource manager that allocates resources to jobs such as MapReduce and schedules them.

Why use Apache YARN with HDFS?

It makes big data processing more efficient by letting different kinds of processing (batch, graph, and streaming) run on the data stored in HDFS.

What is Apache Pig?

A high-level framework for running MapReduce jobs on a Hadoop cluster.

What is Apache Spark?

A cluster computing framework for processing data workloads in parallel.

Why is Apache Spark faster than Hadoop?

Because Spark keeps intermediate results in memory while computing in parallel, whereas Hadoop MapReduce writes them to disk between steps.

What does an Apache Spark setup need?

A cluster manager, such as Hadoop YARN or Kubernetes, and a distributed storage system, such as HDFS, HBase, or Cassandra.

What is a distributed streaming platform?

A distributed system that lets you publish and subscribe to streams of records.
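
Publishing and subscribing to a stream of records can be sketched in memory with plain Python. The `StreamBroker` class and its `publish` and `poll` methods are invented for this illustration and are not the API of Kafka or any other platform; a real platform spreads topics across many machines.

```python
from collections import defaultdict

class StreamBroker:
    """Toy single-process broker: topics are append-only lists of records."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> records in arrival order
        self.offsets = defaultdict(int)   # (subscriber, topic) -> next unread index

    def publish(self, topic, record):
        """A publisher appends a record to the end of a topic."""
        self.topics[topic].append(record)

    def poll(self, subscriber, topic):
        """A subscriber reads everything it has not yet seen on a topic."""
        start = self.offsets[(subscriber, topic)]
        records = self.topics[topic][start:]
        self.offsets[(subscriber, topic)] = start + len(records)
        return records

broker = StreamBroker()
broker.publish("clicks", "page-a")
broker.publish("clicks", "page-b")
first_read = broker.poll("reader-1", "clicks")
broker.publish("clicks", "page-c")
second_read = broker.poll("reader-1", "clicks")
late_reader = broker.poll("reader-2", "clicks")
```

Each subscriber tracks its own position, so a reader that joins late still sees the whole stream.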

What is Apache Kafka?

A cluster-based distributed streaming platform.

What are the four APIs of Apache Kafka?

Producer, Consumer, Streams, and Connector.

What is Apache Beam?

A programming model for defining and executing data processing pipelines.

What does the Apache Beam ParDo transform do?

A generic parallel-processing transform: it applies a user-specified function to each element of a PCollection.

What does the Apache Beam GroupByKey transform do?

Collects all the values associated with each unique key.

What does the Apache Beam CoGroupByKey transform do?

Performs a relational join of two or more PCollections (PC) of key-value pairs, bringing together the values that share the same key.
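
The join can be sketched in plain Python with lists standing in for PCollections. This is an illustration of the idea, not Beam code; the helper `co_group_by_key` and the sample data are made up for this example.

```python
from collections import defaultdict

def co_group_by_key(**tagged_inputs):
    """Join several lists of (key, value) pairs.

    Returns {key: {input_name: [values from that input]}}; an input with no
    values for a key contributes an empty list.
    """
    joined = defaultdict(lambda: {tag: [] for tag in tagged_inputs})
    for tag, pairs in tagged_inputs.items():
        for key, value in pairs:
            joined[key][tag].append(value)
    return dict(joined)

emails = [("amy", "amy@example.org"), ("bob", "bob@example.org")]
phones = [("amy", "555-0100")]
contacts = co_group_by_key(emails=emails, phones=phones)
print(contacts["amy"])
```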

What does the Apache Beam Combine transform do?

Combines the elements of a PCollection (PC) using a user-defined combining function.
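
A user-defined combining function can be sketched in plain Python as a mean calculation. The `MeanCombiner` class and the `combine` driver are hypothetical stand-ins that run on one machine; the four method names are modeled on Beam's `CombineFn`, but nothing here imports or calls Beam.

```python
class MeanCombiner:
    """Describes how to combine numbers into their mean, one step at a time."""
    def create_accumulator(self):
        return (0.0, 0)  # (running total, count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        accumulators = list(accumulators)
        return (sum(a[0] for a in accumulators), sum(a[1] for a in accumulators))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")

def combine(bundles, combiner):
    """Combine each bundle separately (as parallel workers would), then merge the partials."""
    partials = []
    for bundle in bundles:
        accumulator = combiner.create_accumulator()
        for value in bundle:
            accumulator = combiner.add_input(accumulator, value)
        partials.append(accumulator)
    return combiner.extract_output(combiner.merge_accumulators(partials))

mean = combine([[1, 2], [3, 4, 5]], MeanCombiner())
print(mean)
```

Splitting the work into accumulators is what lets partial results from different workers be merged in any order.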

What does the Apache Beam Flatten transform do?

Merges multiple input PCollections (PC) into a single PCollection.

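What does the Apache Beam Partition transform do?

Splits a PCollection (PC) into a fixed number of smaller PCollections, using logic the user provides to decide which partition each element goes to.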
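
The partitioning logic can be sketched in plain Python with a list standing in for a PCollection. The helper `partition` and the function `by_remainder` are made up for this illustration and are not Beam code.

```python
def partition(elements, partition_fn, num_partitions):
    """Split elements into num_partitions lists; partition_fn picks the index for each one."""
    parts = [[] for _ in range(num_partitions)]
    for element in elements:
        index = partition_fn(element, num_partitions)
        if not 0 <= index < num_partitions:
            raise ValueError(f"partition index {index} is out of range")
        parts[index].append(element)
    return parts

def by_remainder(number, num_partitions):
    """Example logic: route each number by its remainder."""
    return number % num_partitions

parts = partition(range(6), by_remainder, 3)
print(parts)
```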

What is ZooKeeper?

An open-source coordination service for distributed systems.

What is Hive?

A data warehouse system for analyzing structured data.

What is Tez?

A framework for building batch and interactive data processing applications.