Ch. 3 Big Data Ecosystem


27 Terms

1

What is MapReduce?

a programming framework centered around map and reduce functions to process big data

2

What is the map function vs. the reduce function?

map turns each input record into key-value pairs; reduce merges all the values that share the same key into a single result
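
The map/reduce split can be sketched with a word count in plain Python. This is only an illustration of the idea on one machine, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this card, and the `shuffle` step stands in for the grouping that a real framework does between the two phases.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: merge all values that share one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Run map over every line, group by key, then reduce each group.
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big ideas", "data moves fast"])
print(counts)
```

In a real cluster the map calls run on many nodes at once and each reducer receives only the keys assigned to it.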

3

What do you use MapReduce for?

batch jobs over very large datasets, e.g. analyzing the links between web pages or counting the words in a large body of text

4

What is Apache Hadoop?

an open-source framework for distributed storage and MapReduce processing of big data

5

What is a Hadoop node?

a single machine in the cluster that stores and processes data

6

What is a Hadoop cluster?

multiple nodes configured to work together on big data operations

7

What is the structure of a Hadoop cluster?

one master node, called the NameNode, that stores the metadata, and several worker nodes (DataNodes) that store the actual data
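
The metadata/data split can be illustrated with a toy model in plain Python. Everything here is hypothetical (`ToyNameNode`, `write_file`, `read_file`, the round-robin placement, the 4-character block size); real HDFS also replicates each block across several workers, which this sketch leaves out.

```python
class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where each block lives."""

    def __init__(self, worker_names):
        self.worker_names = list(worker_names)
        self.block_map = {}  # filename -> list of (block_id, worker_name)

    def register_file(self, filename, num_blocks):
        # Assumed placement policy: spread blocks round-robin over the workers.
        placements = []
        for i in range(num_blocks):
            worker = self.worker_names[i % len(self.worker_names)]
            placements.append((f"{filename}#blk{i}", worker))
        self.block_map[filename] = placements
        return placements

def write_file(namenode, workers, filename, data, block_size):
    """Split data into blocks; the master records locations, workers store the bytes."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placements = namenode.register_file(filename, len(chunks))
    for (block_id, worker), chunk in zip(placements, chunks):
        workers[worker][block_id] = chunk

def read_file(namenode, workers, filename):
    """Ask the master where the blocks are, then fetch them from the workers."""
    return "".join(workers[w][b] for b, w in namenode.block_map[filename])

workers = {"worker-1": {}, "worker-2": {}}  # worker name -> {block_id: data}
namenode = ToyNameNode(workers)
write_file(namenode, workers, "log.txt", "abcdefghij", block_size=4)
print(namenode.block_map["log.txt"])
print(read_file(namenode, workers, "log.txt"))
```

Note that the master never holds file contents, only the map from file name to block locations.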

8

What is the Hadoop Distributed File System (HDFS)?

a cluster-based storage service for Hadoop apps and big data

9

What is Apache YARN?

a resource/cluster manager that allocates cluster resources and schedules jobs such as MapReduce jobs

10

Why use Apache YARN with HDFS?

it makes big data processing more efficient by letting different types of processing (batch, graph, and streaming) run on the data stored in HDFS

11

What is Apache Pig?

a high-level framework for writing and running MapReduce jobs on a Hadoop cluster

12

What is Apache Spark?

cluster computing framework for processing data workloads in parallel

13

Why is Apache Spark faster than Hadoop MapReduce?

because Spark performs its parallel computations in memory, instead of writing intermediate results to disk between steps as MapReduce does

14

What does an Apache Spark setup need?

a cluster manager (e.g. Hadoop YARN or Kubernetes) and a distributed storage system (e.g. HDFS, HBase, or Cassandra)

15

What is a distributed streaming platform?

a distributed system that allows you to publish or subscribe to streams of records
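
Publish/subscribe over a stream of records can be sketched in plain Python as an append-only log per topic plus a read position per subscriber. `ToyBroker`, `publish`, and `poll` are hypothetical names for this card only; a real platform distributes the log over many machines, which this single-process sketch does not attempt.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a streaming platform (single process, no persistence)."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic -> append-only list of records
        self.offsets = defaultdict(int)  # (subscriber, topic) -> next offset to read

    def publish(self, topic, record):
        """Append a record to the topic's log and return its offset."""
        self.topics[topic].append(record)
        return len(self.topics[topic]) - 1

    def poll(self, subscriber, topic):
        """Return the records this subscriber has not seen yet, then advance its offset."""
        start = self.offsets[(subscriber, topic)]
        records = self.topics[topic][start:]
        self.offsets[(subscriber, topic)] = start + len(records)
        return records

broker = ToyBroker()
broker.publish("clicks", {"user": "a"})
broker.publish("clicks", {"user": "b"})
first = broker.poll("analytics", "clicks")   # sees the first two records
broker.publish("clicks", {"user": "c"})
second = broker.poll("analytics", "clicks")  # sees only the new record
other = broker.poll("billing", "clicks")     # a new subscriber reads from the start
print(first, second, other)
```

Because records stay in the log after being read, each subscriber consumes the stream at its own pace.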

16

What is Apache Kafka?

a cluster-based distributed streaming platform

17

What are the 4 APIs of Apache Kafka?

Producer, Consumer, Streams, Connector

18

What is Apache Beam?

a programming model for defining and executing data processing pipelines

19

What does Apache Beam ParDo transform do?

a generic parallel-processing transform that applies a user-specified function to every element of a PCollection
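
The behavior can be sketched in plain Python. This is not the Beam API: `par_do`, `split_words`, and `keep_long` are hypothetical names, and a list stands in for a PCollection. The point it shows is that the user function may emit zero, one, or many outputs per input element.

```python
def par_do(pcollection, do_fn):
    """Apply do_fn to each element and collect everything it yields."""
    return [out for element in pcollection for out in do_fn(element)]

def split_words(line):
    # One input element can produce many outputs.
    for word in line.split():
        yield word

def keep_long(word):
    # One input element can also produce no output at all (a filter).
    if len(word) > 3:
        yield word

words = par_do(["big data", "apache beam"], split_words)
long_words = par_do(words, keep_long)
print(words, long_words)
```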

20

What does Apache Beam GroupByKey transform do?

collects all the values associated with each unique key
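
A plain-Python sketch of the same grouping, assuming the input is a list of (key, value) pairs. `group_by_key` is a hypothetical helper for this card, not the Beam transform itself.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Gather the values of every (key, value) pair under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

grouped = group_by_key([("cat", 1), ("dog", 5), ("cat", 9)])
print(grouped)
```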

21

What does Apache Beam CoGroupByKey transform do?

performs a relational join of two or more keyed PCollections, grouping the values from each input that share the same key
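
The join can be sketched in plain Python with lists of (key, value) pairs standing in for the inputs. `co_group_by_key` and the sample `emails`/`phones` data are hypothetical, and the output shape (one dict of tagged value lists per key) is an assumption chosen for readability.

```python
def co_group_by_key(**tagged):
    """Join several keyed collections: for each key, list the values from each input."""
    keys = {key for pairs in tagged.values() for key, _ in pairs}
    # Every key gets an entry for every input, even if that input has no values for it.
    result = {key: {tag: [] for tag in tagged} for key in keys}
    for tag, pairs in tagged.items():
        for key, value in pairs:
            result[key][tag].append(value)
    return result

emails = [("amy", "amy@example.com"), ("carl", "carl@example.com")]
phones = [("amy", "555-0100"), ("amy", "555-0101"), ("james", "555-0102")]
joined = co_group_by_key(emails=emails, phones=phones)
print(joined["amy"])
print(joined["james"])
```

A key that appears in only one input still shows up in the result, with an empty list for the other input.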

22

What does Apache Beam Combine transform do?

combines the elements of a PCollection (or the values for each key) using a user-defined combining function
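
Both flavors, combining a whole collection and combining per key, can be sketched in plain Python. `combine_globally` and `combine_per_key` are hypothetical helpers rather than Beam calls; the sketch assumes a non-empty input and a combining function that is associative and commutative, so the order of combination does not matter.

```python
from collections import defaultdict
from functools import reduce

def combine_globally(values, combine_fn):
    """Fold every element of the collection into one result."""
    return reduce(combine_fn, values)

def combine_per_key(pairs, combine_fn):
    """Fold the values of each key into one result per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(combine_fn, vals) for key, vals in grouped.items()}

total = combine_globally([3, 4, 5], lambda a, b: a + b)
best = combine_per_key([("x", 2), ("y", 7), ("x", 9)], max)
print(total, best)
```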

23

What does Apache Beam Flatten transform do?

merges multiple input PCollections of the same type into a single PCollection
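
In plain Python, with lists standing in for the inputs, the merge amounts to chaining the collections together. `flatten` is a hypothetical helper, not the Beam transform.

```python
from itertools import chain

def flatten(*pcollections):
    """Merge several collections into one without changing the elements."""
    return list(chain.from_iterable(pcollections))

merged = flatten([1, 2], [3], [4, 5])
print(merged)
```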

24

What does Apache Beam Partition transform do?

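splits one PCollection into a fixed number of smaller PCollections, using a user-provided function that decides which partition each element goes to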
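
A plain-Python sketch of the idea, with a list standing in for the input. `partition` and `by_percentile` are hypothetical names; the partitioning rule (scores from 0 to 100 split into equal bands) is only an example.

```python
def partition(pcollection, partition_fn, num_partitions):
    """Send each element to the partition index chosen by partition_fn."""
    parts = [[] for _ in range(num_partitions)]
    for element in pcollection:
        index = partition_fn(element, num_partitions)
        if not 0 <= index < num_partitions:
            raise ValueError(f"partition index {index} out of range")
        parts[index].append(element)
    return parts

def by_percentile(score, num_partitions):
    # Map a 0-100 score to a band; a score of exactly 100 goes in the top band.
    return min(score * num_partitions // 100, num_partitions - 1)

low, mid, high = partition([5, 40, 70, 99, 100], by_percentile, 3)
print(low, mid, high)
```

The number of partitions is fixed up front; only the assignment of elements is decided by the function.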

25

What is ZooKeeper?

an open-source coordination service for distributed systems

26

What is Hive?

data warehouse system to analyze structured data

27

What is Tez?

a framework, built on YARN, for building batch and interactive data processing apps