Chapter 11
Q: What is Apache Storm?
A: A distributed, fault-tolerant framework for scalable real-time computation that processes unbounded data streams from sources such as Kafka, Kinesis, and RabbitMQ.
Q: What are the main Storm Concepts?
A:
Topology: Graph of computations executed across a cluster.
Stream: Unbounded sequence of tuples.
Spouts: Sources of streams, a type of node.
Bolts: Process tuples, a type of node.
Tasks: Instances of spouts and bolts executed in parallel within worker processes.
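The concepts above can be sketched as a toy pipeline: a spout emits tuples, and bolts transform and aggregate them. This is plain Python illustrating the ideas only, not Storm's actual Java API; the sentences and function names are made up for the example.

```python
# Toy model of a Storm topology: spout -> split bolt -> count bolt.
# Plain Python illustrating the concepts, NOT the real Storm API.

def sentence_spout():
    """Spout: source of the stream, emitting sentence tuples."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: aggregates a running count per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology together; data flows spout -> bolt -> bolt.
counts = count_bolt(split_bolt(sentence_spout()))
```

In real Storm, each of these components would run as many parallel tasks across the cluster, and a stream grouping would decide which task receives each tuple.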
Q: What is a Topology in Apache Storm?
A: A graph of computations that defines how data flows and is processed across the cluster.
Q: What is a Stream in Apache Storm?
A: An unbounded sequence of tuples (data records) that flows through the topology.
Q: What are Spouts in Apache Storm?
A: Components that act as sources of streams, emitting tuples into the topology.
Q: What are Bolts in Apache Storm?
A: Components that process incoming tuples, performing operations like filtering, aggregating, or joining data.
Q: What are Tasks in Apache Storm?
A: Instances of spouts and bolts that execute in parallel inside worker processes to carry out the actual data processing.
Q: What are the different Stream Groupings in Storm?
A:
Shuffle: Even distribution of tuples across bolts.
Field Grouping: Groups tuples based on specific fields.
All/Global: All broadcasts each tuple to every bolt instance; Global routes the entire stream to the single task with the lowest task ID.
Direct: Sender determines the destination bolt.
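The routing decision each grouping makes can be sketched in a few lines: given a tuple and N bolt instances, the grouping picks a target index. Plain Python, not Storm's API; `NUM_BOLTS` and the sample tuples are invented for the example.

```python
# Toy sketch of stream groupings: routing a tuple to one of N bolt instances.
import random

NUM_BOLTS = 4

def shuffle_grouping(tup):
    # Shuffle: pick a target at random, balancing load across instances.
    return random.randrange(NUM_BOLTS)

def field_grouping(tup, field):
    # Field grouping: hash the field so equal values always hit the same instance.
    return hash(tup[field]) % NUM_BOLTS

# Tuples with the same "user" field land on the same bolt instance:
a = field_grouping({"user": "alice", "action": "click"}, "user")
b = field_grouping({"user": "alice", "action": "buy"}, "user")
assert a == b
```

This consistency is what makes field grouping suitable for per-key aggregations such as word counts, where all tuples for one key must reach the same aggregating task.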
Q: What are the main components of a Storm Cluster?
A:
Nimbus: Manages topologies and task distribution.
Supervisor: Executes worker processes on cluster nodes.
Zookeeper: Coordinates the cluster, managing configuration and synchronization.
Q: What is Nimbus in a Storm Cluster?
A: The master node that manages topologies and distributes tasks to supervisors.
Q: What is the role of the Supervisor in Storm?
A: Executes worker processes that run spouts and bolts as part of the topology.
Q: What is Zookeeper used for in a Storm Cluster?
A: Coordinates the cluster by managing configuration, synchronization, and leader election.
Q: What is Spark Streaming?
A: A high-throughput, fault-tolerant stream processing component of Apache Spark that processes real-time data streams.
Q: What are DStreams in Spark Streaming?
A: Discretized Streams, which are sequences of Resilient Distributed Datasets (RDDs) representing data from specific time intervals.
Q: What are the primary sources for Spark Streaming?
A: Kafka, HDFS, custom connectors, and other streaming data sources.
Q: How are DStreams similar to RDDs in Spark?
A: DStreams are like RDDs but represent data over time intervals, allowing for batch-like processing of streaming data.
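The discretization idea can be shown directly: timestamped events are grouped into fixed-length intervals, and each interval's batch is then processed like a small RDD. Plain Python, not the Spark API; the event data is invented for the example.

```python
# Toy model of a DStream: a stream discretized into per-interval batches.
# Plain Python illustrating the concept, NOT the Spark API.

def discretize(events, interval):
    """Group (timestamp, value) events into batches of length `interval`."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
batches = discretize(events, interval=2)   # -> [[1, 2], [3, 4], [5]]

# A stateless "transformation" applies independently to each batch, RDD-style:
doubled = [[v * 2 for v in batch] for batch in batches]
```

Each inner list plays the role of one RDD in the DStream, which is why familiar RDD operations carry over to streaming almost unchanged.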
Q: What are the two types of DStream Transformations?
A:
Stateless Transformations: Operations like map, filter, and reduceByKey applied independently on each RDD.
Stateful Transformations: Operations like windowing and updateStateByKey that maintain state across RDDs and require checkpointing.
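The distinction can be made concrete with a sketch in the spirit of `updateStateByKey`: per-key state is carried forward and folded together with each new micro-batch. Plain Python, not Spark's API; the batches and key names are invented.

```python
# Toy sketch of a stateful transformation: per-key state carried across
# micro-batches, in the spirit of updateStateByKey. NOT the Spark API.

def update_state(state, batch):
    """Fold one batch of (key, value) pairs into the running per-key totals."""
    new_state = dict(state)               # keep the old state immutable
    for key, value in batch:
        new_state[key] = new_state.get(key, 0) + value
    return new_state

state = {}
for batch in [[("a", 1), ("b", 2)], [("a", 3)]]:
    state = update_state(state, batch)
# state now holds totals accumulated across both batches
```

A stateless transformation would instead see each batch in isolation; it is precisely this carried-forward state that must be checkpointed so it survives failures.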
Q: What are Window Operations in Spark Streaming?
A: Operations that compute over sliding data windows, allowing aggregation and analysis of data within specified time frames.
Q: Name some Window Operations in Spark Streaming.
A:
Window: Returns a new DStream containing only the data within the current window.
CountByWindow/ReduceByWindow: Counts or reduces all elements within the window.
ReduceByKeyAndWindow: Performs per-key aggregation within a window.
CountByValueAndWindow: Counts occurrences of each distinct value within a window.
UpdateStateByKey: Tracks and updates state information for each key across batches.
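A sliding-window aggregation like `reduceByWindow` can be sketched over the batch model: each result covers the last `window` batches, advancing by `slide` batches at a time. Plain Python, not the Spark API; the batch contents are invented.

```python
# Toy sketch of a sliding-window reduce over micro-batches. NOT the Spark API.

def reduce_by_window(batches, window, slide):
    """Sum the elements of each sliding window of `window` batches."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        window_batches = batches[end - window:end]
        results.append(sum(sum(b) for b in window_batches))
    return results

batches = [[1], [2], [3], [4]]                 # one batch per interval
reduce_by_window(batches, window=2, slide=1)   # -> [3, 5, 7]
```

With `slide` equal to `window` the windows are tumbling (non-overlapping); with `slide` smaller than `window` they overlap, so each element contributes to several results.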
Q: What is Apache Flink?
A: A framework for real-time, stateful stream processing that supports both bounded and unbounded data streams.
Q: What are the main APIs provided by Apache Flink?
A:
DataStream
DataSet
Table
CEP (Complex Event Processing)
Gelly
FlinkML
Q: What are Streaming Dataflows in Apache Flink?
A: Directed Acyclic Graphs (DAGs) consisting of sources, transformations, and sinks that define the flow of data processing.
Q: What deployment options does Flink Architecture support?
A: Local, cluster, and cloud deployments.
Q: What libraries does Apache Flink provide?
A: Libraries for graph processing, machine learning (ML), and event processing.
Q: How does Apache Flink handle stateful stream processing?
A: By maintaining and managing state information across events, enabling complex event processing and real-time analytics.
Q: What makes Apache Flink suitable for both bounded and unbounded data streams?
A: Its flexible architecture that supports batch and real-time processing paradigms within the same framework.
Q: Compare Apache Storm and Spark Streaming.
A:
Storm: Focuses on real-time computation with a topology-based approach using spouts and bolts.
Spark Streaming: Integrates with Spark's ecosystem, using DStreams for micro-batch processing with in-memory capabilities for higher throughput.
Q: Compare Apache Flink with Spark Streaming.
A:
Flink: Provides true stream processing with low latency and stateful computations, supporting event-time processing.
Spark Streaming: Utilizes micro-batching, which introduces slight latency but benefits from Spark's in-memory processing.
Q: What is the primary advantage of using Apache Flink for real-time analytics?
A: Its ability to handle both batch and stream processing seamlessly with low latency and robust state management.
Q: What is a Tuple in Apache Storm?
A: A data record emitted by spouts and processed by bolts within a topology.
Q: How does Shuffle Grouping work in Storm?
A: Distributes tuples evenly across all target bolts to balance the load.
Q: What is Field Grouping in Storm?
A: Groups tuples based on specific fields, ensuring that tuples with the same field values are sent to the same bolt instance.
Q: What is the purpose of All/Global Grouping in Storm?
A: All grouping broadcasts each tuple to every bolt instance; Global grouping routes the entire stream to the single bolt task with the lowest task ID.
Q: What does Direct Grouping enable in Storm?
A: Allows the sender to specify the exact bolt instance that should receive each tuple.
Q: What ensures fault tolerance in Apache Storm?
A: The distributed architecture with supervisors and the use of Zookeeper for coordination, allowing automatic task reassignment in case of failures.
Q: What is the main use case for Spark Streaming?
A: High-throughput, fault-tolerant stream processing for real-time data analytics and event-driven applications.
Q: How do DStreams achieve fault tolerance in Spark Streaming?
A: By using RDD lineage information and checkpointing to recover lost data in case of failures.
Q: What is checkpointing in Spark Streaming?
A: A mechanism to save the state of DStreams periodically to reliable storage, enabling recovery from failures.
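The mechanism can be sketched in a few lines: periodically persist the streaming state to durable storage, and on restart recover from the latest checkpoint instead of replaying the whole lineage. Plain Python with a JSON file standing in for reliable storage; the state contents and file name are invented.

```python
# Toy sketch of checkpointing: persist state, then recover it after a restart.
import json
import os
import tempfile

def checkpoint(state, path):
    """Write the current streaming state to 'reliable storage'."""
    with open(path, "w") as f:
        json.dump(state, f)

def recover(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
checkpoint({"word_counts": {"a": 4}}, path)
state = recover(path)   # after a failure, processing resumes from this state
```

Real Spark Streaming checkpoints both metadata (the DStream graph, offsets) and accumulated state, and writes them to fault-tolerant storage such as HDFS rather than a local file.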
Q: How does Apache Flink achieve high performance in stream processing?
A: Through its advanced scheduling, efficient state management, and support for event-time processing, allowing low-latency and high-throughput data handling.
Q: What is CEP in Apache Flink?
A: Complex Event Processing API for detecting patterns and relationships within event streams.
Q: What is Gelly in Apache Flink?
A: Flink’s API for graph processing, enabling the execution of graph algorithms on large-scale data.
Q: What is FlinkML?
A: Flink’s machine learning library that provides scalable algorithms for various ML tasks.
Q: How do Flink's Streaming Dataflows differ from traditional batch processing?
A: They process data continuously as it arrives, maintaining state and handling events in real-time rather than processing data in large, discrete batches.
Q: What is the Driver in a Spark Cluster?
A: The program that contains the main function and creates a SparkContext to coordinate the execution of tasks across the cluster.
Q: What is the role of the Cluster Manager in Spark?
A: Allocates resources across the cluster and manages the distribution of tasks to Executors.
Q: What are Executors in Spark?
A: Processes allocated on worker nodes that run application code and perform tasks such as transformations and actions on RDDs.
Q: What is the Driver Program in Spark?
A: The process that runs the main function of the application and schedules tasks to be executed on the cluster.
Q: How does Apache Flink support multi-tenancy?
A: By managing resources and isolating jobs from different users or applications within the same cluster to ensure fair resource sharing and performance.
Q: What is the primary difference between Apache Storm and Apache Flink?
A: Storm focuses on unbounded stream processing with a topology-based model, while Flink provides both bounded and unbounded stream processing with advanced state management and event-time processing.
Q: What is event-time processing in Apache Flink?
A: Processing events based on the time they occurred, allowing accurate handling of out-of-order events and late data.
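The key point is that window assignment uses the timestamp carried by the event, not the moment it arrives, so late or reordered events still land in the right window. A plain-Python sketch of the idea (not the Flink API; the arrival sequence is invented):

```python
# Toy sketch of event-time windowing: events carry their own timestamps and
# may arrive out of order. NOT the Flink API.

def event_time_windows(events, window):
    """Count events per event-time window, regardless of arrival order."""
    counts = {}
    for event_time, _value in events:      # events listed in ARRIVAL order
        bucket = event_time // window
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

# The event with timestamp 1 arrives late, after timestamps 11 and 12,
# yet it is still counted in the first window:
arrivals = [(0, "a"), (11, "b"), (12, "c"), (1, "d")]
event_time_windows(arrivals, window=10)    # -> {0: 2, 1: 2}
```

What this sketch omits is the hard part Flink actually solves: watermarks, which tell the system how long to wait for stragglers before emitting a window's result.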
Q: What is a DAG in the context of Apache Flink?
A: Directed Acyclic Graphs that represent the flow of data and transformations in streaming dataflows.
Q: What are the main advantages of using Apache Flink for real-time analytics?
A:
True stream processing with low latency.
Robust state management and fault tolerance.
Support for complex event processing and machine learning.
Flexibility to handle both bounded and unbounded data streams.