chapter 11

52 Terms

What is Apache Storm?

A distributed, fault-tolerant framework for real-time computation that processes data streams from sources like Kafka, Kinesis, and RabbitMQ.

What are the main Storm Concepts?

Topology, Stream, Spouts, Bolts, Tasks.

What is a Topology in Apache Storm?

A graph of computations that defines how data flows and is processed across the cluster.

What is a Stream in Apache Storm?

An unbounded sequence of tuples (data records) that flows through the topology.

What are Spouts in Apache Storm?

Components that act as sources of streams, emitting tuples into the topology.

What are Bolts in Apache Storm?

Components that process incoming tuples, performing operations like filtering, aggregating, or joining data.
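
As a rough illustration of how spouts and bolts fit together, here is a toy word-count pipeline in plain Python. It does not use the Storm API: `sentence_spout`, `split_bolt`, `count_bolt`, and `run_topology` are hypothetical names, and generators stand in for streams.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def sentence_spout() -> Iterator[Tuple[str]]:
    """Hypothetical spout: emits one-field tuples (a real spout never ends)."""
    sentences = [
        "storm processes streams",
        "bolts process tuples",
        "storm runs bolts",
    ]
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Hypothetical bolt: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Hypothetical bolt: keeps a running count and emits (word, count)."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology() -> List[Tuple[str, int]]:
    """Wire spout -> split bolt -> count bolt, like edges in a topology."""
    return list(count_bolt(split_bolt(sentence_spout())))


if __name__ == "__main__":
    for word, count in run_topology():
        print(word, count)
```

In a real topology the spout would never finish and each bolt would run as many parallel tasks; this sketch only shows the direction of data flow.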

What are Tasks in Apache Storm?

Instances of a spout or bolt that do the actual data processing. Each task runs inside an executor thread, and executors run within worker processes.

What are the different Stream Groupings in Storm?

Shuffle, Fields, All, Global, and Direct grouping.

What are the main components of a Storm Cluster?

Nimbus, Supervisors, and ZooKeeper.

What is Nimbus in a Storm Cluster?

The master node that manages topologies and distributes tasks to supervisors.

What is the role of the Supervisor in Storm?

Runs on each worker node and starts or stops the worker processes that execute a topology's spouts and bolts, as assigned by Nimbus.

What is ZooKeeper used for in a Storm Cluster?

Coordinates the cluster: Nimbus and the Supervisors communicate through it, and it holds their shared state and configuration.

What is Spark Streaming?

A high-throughput, fault-tolerant stream processing component of Apache Spark.

What are DStreams in Spark Streaming?

Sequences of Resilient Distributed Datasets (RDDs) representing data from specific time intervals.
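
To make the idea of "one RDD per time interval" concrete, here is a minimal sketch in plain Python. It does not use Spark: the function name `discretize`, the record layout `(timestamp, value)`, and the use of ordinary lists in place of RDDs are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def discretize(
    records: List[Tuple[float, str]], batch_interval: float
) -> List[List[str]]:
    """Group (arrival_time, value) records into consecutive batches.

    Batch i holds every record that arrived in
    [i * batch_interval, (i + 1) * batch_interval).
    Intervals with no data still produce an (empty) batch.
    """
    if not records:
        return []
    batches: Dict[int, List[str]] = defaultdict(list)
    for arrival_time, value in records:
        batches[int(arrival_time // batch_interval)].append(value)
    last_batch = max(batches)
    return [batches.get(i, []) for i in range(last_batch + 1)]


if __name__ == "__main__":
    stream = [(0.2, "a"), (0.9, "b"), (1.5, "c"), (3.1, "d")]
    for i, batch in enumerate(discretize(stream, batch_interval=1.0)):
        print(f"batch {i}: {batch}")
```

Each inner list plays the role of one RDD in the sequence, so an operation applied to the stream becomes the same operation applied to every batch.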

What are the primary sources for Spark Streaming?

Kafka, HDFS, custom connectors, and other streaming data sources.

How are DStreams similar to RDDs in Spark?

A DStream is a sequence of RDDs, one per time interval, so it supports RDD-like transformations and lets streaming data be processed batch by batch.

What are the two types of DStream Transformations?

Stateless Transformations and Stateful Transformations.
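
The difference between the two kinds can be sketched in plain Python, with lists of batches standing in for a DStream. This is not Spark code: `stateless_map` and `running_counts` are hypothetical helpers chosen for the illustration.

```python
from typing import Callable, Dict, List


def stateless_map(
    batches: List[List[str]], fn: Callable[[str], str]
) -> List[List[str]]:
    """Stateless: each batch is transformed on its own, with no memory."""
    return [[fn(item) for item in batch] for batch in batches]


def running_counts(batches: List[List[str]]) -> List[Dict[str, int]]:
    """Stateful: counts carry over from one batch to the next.

    Returns a snapshot of the accumulated state after every batch.
    """
    state: Dict[str, int] = {}
    snapshots: List[Dict[str, int]] = []
    for batch in batches:
        for key in batch:
            state[key] = state.get(key, 0) + 1
        snapshots.append(dict(state))
    return snapshots


if __name__ == "__main__":
    batches = [["a", "b"], ["a"], ["b", "b"]]
    print(stateless_map(batches, str.upper))
    print(running_counts(batches))
```

The stateless result for a batch depends only on that batch, while the stateful result for the third batch depends on everything seen before it.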

What are Window Operations in Spark Streaming?

Operations that compute over sliding data windows, allowing aggregation and analysis within specified time frames.
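
A sliding window can be sketched in plain Python as follows. This is a simplified model rather than Spark's implementation: `reduce_by_window` is a hypothetical helper, the window length and slide interval are measured in number of batches instead of seconds, and only complete windows are emitted.

```python
from functools import reduce
from typing import Callable, List


def reduce_by_window(
    batches: List[List[int]],
    window_len: int,
    slide: int,
    reduce_fn: Callable[[int, int], int],
    initial: int = 0,
) -> List[int]:
    """Reduce all items in each sliding window of `window_len` batches.

    The window advances by `slide` batches between results.
    """
    results: List[int] = []
    for end in range(window_len, len(batches) + 1, slide):
        window_items = [
            item for batch in batches[end - window_len:end] for item in batch
        ]
        results.append(reduce(reduce_fn, window_items, initial))
    return results


if __name__ == "__main__":
    batches = [[1, 2], [3], [4, 5], [6]]
    # Window of 2 batches, sliding by 1 batch: sums over overlapping windows.
    print(reduce_by_window(batches, 2, 1, lambda a, b: a + b))
    # Window of 2 batches, sliding by 2 batches: non-overlapping windows.
    print(reduce_by_window(batches, 2, 2, lambda a, b: a + b))
```

When the slide is smaller than the window length, consecutive windows overlap and the same batch contributes to more than one result.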

Name some Window Operations in Spark Streaming.

window, countByWindow, reduceByWindow, reduceByKeyAndWindow, and countByValueAndWindow. updateStateByKey is often listed with them, but it is a stateful transformation rather than a window operation.

What is Apache Flink?

A framework for real-time, stateful stream processing that supports both bounded and unbounded data streams.

What are the main APIs provided by Apache Flink?

DataStream, DataSet, Table, CEP, Gelly, FlinkML.

What are Streaming Dataflows in Apache Flink?

Directed Acyclic Graphs (DAGs) consisting of sources, transformations, and sinks.
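
The source, transformation, sink shape can be sketched with Python generators. This is only an analogy and does not use Flink's API: `number_source`, `keep_even`, `square`, and `list_sink` are hypothetical names, and the dataflow here is a simple linear chain, which is the smallest possible DAG.

```python
from typing import Iterable, Iterator, List


def number_source() -> Iterator[int]:
    """Source: produces records (bounded here; a stream may be unbounded)."""
    yield from range(1, 7)


def keep_even(stream: Iterable[int]) -> Iterator[int]:
    """Transformation: filter."""
    return (x for x in stream if x % 2 == 0)


def square(stream: Iterable[int]) -> Iterator[int]:
    """Transformation: map."""
    return (x * x for x in stream)


def list_sink(stream: Iterable[int], out: List[int]) -> None:
    """Sink: consumes records one at a time as they arrive."""
    for record in stream:
        out.append(record)


if __name__ == "__main__":
    collected: List[int] = []
    list_sink(square(keep_even(number_source())), collected)
    print(collected)
```

Because generators are lazy, each record travels through the whole chain before the next one is read, which loosely mirrors record-at-a-time streaming.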

What deployment options does Flink Architecture support?

Local, cluster, and cloud deployments.

What libraries does Apache Flink provide?

Libraries for graph processing, machine learning, and event processing.

How does Apache Flink handle stateful stream processing?

By maintaining and managing state information across events.

What makes Apache Flink suitable for both bounded and unbounded data streams?

Its flexible architecture supporting batch and real-time processing paradigms.

Compare Apache Storm and Spark Streaming.

Storm processes tuples one at a time through a topology of spouts and bolts; Spark Streaming groups incoming data into DStreams and processes it in micro-batches.

Compare Apache Flink with Spark Streaming.

Flink provides true stream processing with low latency; Spark Streaming uses micro-batching.

What is the primary advantage of using Apache Flink for real-time analytics?

Its ability to handle both batch and stream processing seamlessly.

What is a Tuple in Apache Storm?

A data record emitted by spouts and processed by bolts.

How does Shuffle Grouping work in Storm?

Distributes tuples randomly across the tasks of the target bolt so that each task receives roughly the same number of tuples.

What is Fields Grouping in Storm?

Partitions the stream by the values of the specified fields, so tuples with the same field values always go to the same task.

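What is the purpose of All and Global Grouping in Storm?

All grouping replicates each tuple to every task of the target bolt. Global grouping sends the entire stream to a single task of the bolt (the one with the lowest ID).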

What does Direct Grouping enable in Storm?

Allows the sender to specify the exact bolt instance for each tuple.
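
The grouping strategies can be compared side by side with a small routing sketch in plain Python. It is not Storm code: every function name here is hypothetical, each function returns the list of target task indices for one tuple, a CRC32 hash stands in for whatever hashing Storm uses internally, and shuffle grouping is modeled as a simple random choice.

```python
import random
import zlib
from typing import Dict, List, Sequence


def shuffle_grouping(num_tasks: int, rng: random.Random) -> List[int]:
    """Pick one task at random, so load evens out over many tuples."""
    return [rng.randrange(num_tasks)]


def fields_grouping(
    tup: Dict[str, object], fields: Sequence[str], num_tasks: int
) -> List[int]:
    """Same values in `fields` always map to the same task."""
    key = repr(tuple(tup[f] for f in fields)).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(num_tasks: int) -> List[int]:
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(num_tasks: int) -> List[int]:
    """Send every tuple to one task: the one with the lowest ID."""
    return [0]


def direct_grouping(chosen_task: int, num_tasks: int) -> List[int]:
    """The producer names the receiving task explicitly."""
    if not 0 <= chosen_task < num_tasks:
        raise ValueError("chosen_task is out of range")
    return [chosen_task]


if __name__ == "__main__":
    rng = random.Random(42)
    tup = {"word": "storm", "count": 1}
    print("shuffle:", shuffle_grouping(4, rng))
    print("fields: ", fields_grouping(tup, ["word"], 4))
    print("all:    ", all_grouping(4))
    print("global: ", global_grouping(4))
    print("direct: ", direct_grouping(2, 4))
```

Fields grouping is the one to reach for when a bolt keeps per-key state, such as a word count, because all tuples for a key land on the same task.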

What ensures fault tolerance in Apache Storm?

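Its distributed architecture: Supervisors restart failed worker processes, Nimbus reassigns work when a node dies, and ZooKeeper holds the coordination state so the daemons themselves can be restarted safely.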

What is the main use case for Spark Streaming?

High-throughput, fault-tolerant stream processing for real-time data analytics.

How do DStreams achieve fault tolerance in Spark Streaming?

By using RDD lineage information and checkpointing.

What is checkpointing in Spark Streaming?

A mechanism to save the state of DStreams to reliable storage.
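
The general idea of checkpointing can be sketched in plain Python: save the state after each batch, and on restart resume from the last saved batch. This is not Spark's mechanism or API. The helpers `save_checkpoint`, `load_checkpoint`, and `process`, the JSON file format, and the `fail_at` parameter used to simulate a crash are all assumptions made for the illustration.

```python
import json
import os
from typing import Dict, List, Optional, Tuple


def save_checkpoint(path: str, batch_id: int, state: Dict[str, int]) -> None:
    """Write state to a temp file, then rename it into place atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"batch_id": batch_id, "state": state}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> Tuple[int, Dict[str, int]]:
    """Return (last completed batch id, state), or (-1, {}) if none exists."""
    if not os.path.exists(path):
        return -1, {}
    with open(path) as f:
        data = json.load(f)
    return data["batch_id"], data["state"]


def process(
    batches: List[List[str]], path: str, fail_at: Optional[int] = None
) -> Dict[str, int]:
    """Count keys batch by batch, checkpointing after each batch."""
    last_done, state = load_checkpoint(path)
    for batch_id in range(last_done + 1, len(batches)):
        if batch_id == fail_at:
            raise RuntimeError("simulated failure")
        for key in batches[batch_id]:
            state[key] = state.get(key, 0) + 1
        save_checkpoint(path, batch_id, state)
    return state
```

If a run fails partway through, calling `process` again with the same path picks up after the last checkpointed batch instead of recounting from the start.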

How does Apache Flink achieve high performance in stream processing?

Through advanced scheduling, efficient state management, and support for event-time processing.

What is CEP in Apache Flink?

Complex Event Processing API for detecting patterns in event streams.
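
As a small example of what "detecting a pattern in an event stream" can mean, the sketch below flags users with several consecutive failed logins. It is plain Python, not Flink's CEP API: the function `detect_consecutive_failures`, the `(user, outcome)` event layout, and the threshold of three are hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def detect_consecutive_failures(
    events: Iterable[Tuple[str, str]], threshold: int = 3
) -> List[str]:
    """Return users who reach `threshold` failures in a row.

    An alert is raised once, at the moment the streak hits the threshold;
    any non-failure event resets that user's streak.
    """
    streak: Dict[str, int] = defaultdict(int)
    alerts: List[str] = []
    for user, outcome in events:
        if outcome == "fail":
            streak[user] += 1
            if streak[user] == threshold:
                alerts.append(user)
        else:
            streak[user] = 0
    return alerts


if __name__ == "__main__":
    events = [
        ("ann", "fail"),
        ("bob", "fail"),
        ("ann", "fail"),
        ("ann", "fail"),
        ("bob", "ok"),
        ("bob", "fail"),
    ]
    print(detect_consecutive_failures(events))
```

The per-user streak counter is a tiny piece of keyed state, which is why pattern detection and stateful stream processing tend to go together.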

What is Gelly in Apache Flink?

Flink’s API for graph processing.

What is FlinkML?

Flink’s machine learning library for scalable algorithms.

How do Flink's Streaming Dataflows differ from traditional batch processing?

They process data continuously as it arrives.

What is the Driver in a Spark Cluster?

The program that creates a SparkContext to coordinate task execution.

What is the role of the Cluster Manager in Spark?

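Allocates cluster resources (such as executors) to Spark applications; examples include Spark's standalone manager, YARN, and Kubernetes. Scheduling tasks onto executors is the driver's job.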

What are Executors in Spark?

Processes on worker nodes that run application code.

What is the Driver Program in Spark?

The process that runs the main function of the application.

How does Apache Flink support multi-tenancy?

By managing resources and isolating jobs from different users.

What is the primary difference between Apache Storm and Apache Flink?

Storm focuses on unbounded stream processing; Flink supports both bounded and unbounded with advanced state management.

What is event-time processing in Apache Flink?

Processing events based on the time they occurred.
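
The sketch below shows why event time matters when events arrive out of order. It is plain Python rather than Flink code: `tumbling_event_time_counts` is a hypothetical helper, events are `(event_time, payload)` pairs with integer timestamps, and watermarks and late-data handling are deliberately left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_event_time_counts(
    events: Iterable[Tuple[int, str]], window_size: int
) -> Dict[int, int]:
    """Count events per tumbling window, keyed by window start time.

    Each event is assigned to a window by the timestamp it carries,
    not by the order in which it arrives.
    """
    counts: Dict[int, int] = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Arrival order differs from event-time order (3 arrives after 12).
    arrivals = [(1, "a"), (12, "b"), (3, "c"), (11, "d"), (25, "e")]
    print(tumbling_event_time_counts(arrivals, window_size=10))
```

The event with timestamp 3 is still counted in the window starting at 0 even though it arrived after an event from the next window.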

What is a DAG in the context of Apache Flink?

A directed acyclic graph that represents the flow of data through sources, transformations, and sinks.

What are the main advantages of using Apache Flink for real-time analytics?

True stream processing, robust state management, support for complex event processing, and the flexibility to handle both bounded and unbounded data streams.