Chapter 11
Q: What is Apache Storm?
A: A distributed, fault-tolerant framework for scalable real-time computation that processes unbounded data streams from sources such as Kafka, Kinesis, and RabbitMQ.
Q: What are the main Storm Concepts?
A:
Topology: Graph of computations executed across a cluster.
Stream: Unbounded sequence of tuples.
Spouts: Sources of streams, a type of node.
Bolts: Process tuples, a type of node.
Tasks: Instances of spouts and bolts executed in parallel within worker processes.
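The concepts above can be sketched as a toy pipeline: a spout emits tuples, and bolts transform and aggregate them. This is plain Python illustrating the ideas only, not Storm's actual Java API; the sentences and function names are made up for the example.

```python
# Toy model of a Storm topology: spout -> split bolt -> count bolt.
# Plain Python illustrating the concepts, NOT the real Storm API.

def sentence_spout():
    """Spout: source of the stream, emitting sentence tuples."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: aggregates a running count per word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology together; data flows spout -> bolt -> bolt.
counts = count_bolt(split_bolt(sentence_spout()))
```

In real Storm, each of these components would run as many parallel tasks across the cluster, and a stream grouping would decide which task receives each tuple.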
Q: What is a Topology in Apache Storm?
A: A graph of computations that defines how data flows and is processed across the cluster.
Q: What is a Stream in Apache Storm?
A: An unbounded sequence of tuples (data records) that flows through the topology.
Q: What are Spouts in Apache Storm?
A: Components that act as sources of streams, emitting tuples into the topology.
Q: What are Bolts in Apache Storm?
A: Components that process incoming tuples, performing operations like filtering, aggregating, or joining data.
Q: What are Tasks in Apache Storm?
A: Instances of spouts and bolts that execute in parallel inside worker processes to carry out the actual data processing.
Q: What are the different Stream Groupings in Storm?
A:
Shuffle: Even distribution of tuples across bolts.
Field Grouping: Groups tuples based on specific fields.
All/Global: All broadcasts each tuple to every bolt instance; Global routes the entire stream to the single task with the lowest task ID.
Direct: Sender determines the destination bolt.
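The routing decision each grouping makes can be sketched in a few lines: given a tuple and N bolt instances, the grouping picks a target index. Plain Python, not Storm's API; `NUM_BOLTS` and the sample tuples are invented for the example.

```python
# Toy sketch of stream groupings: routing a tuple to one of N bolt instances.
import random

NUM_BOLTS = 4

def shuffle_grouping(tup):
    # Shuffle: pick a target at random, balancing load across instances.
    return random.randrange(NUM_BOLTS)

def field_grouping(tup, field):
    # Field grouping: hash the field so equal values always hit the same instance.
    return hash(tup[field]) % NUM_BOLTS

# Tuples with the same "user" field land on the same bolt instance:
a = field_grouping({"user": "alice", "action": "click"}, "user")
b = field_grouping({"user": "alice", "action": "buy"}, "user")
assert a == b
```

This consistency is what makes field grouping suitable for per-key aggregations such as word counts, where all tuples for one key must reach the same aggregating task.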
Q: What are the main components of a Storm Cluster?
A:
Nimbus: Manages topologies and task distribution.
Supervisor: Executes worker processes on cluster nodes.
Zookeeper: Coordinates the cluster, managing configuration and synchronization.
Q: What is Nimbus in a Storm Cluster?
A: The master node that manages topologies and distributes tasks to supervisors.
Q: What is the role of the Supervisor in Storm?
A: Executes worker processes that run spouts and bolts as part of the topology.
Q: What is Zookeeper used for in a Storm Cluster?
A: Coordinates the cluster by managing configuration, synchronization, and leader election.
Q: What is Spark Streaming?
A: A high-throughput, fault-tolerant stream processing component of Apache Spark that processes real-time data streams.
Q: What are DStreams in Spark Streaming?
A: Discretized Streams, which are sequences of Resilient Distributed Datasets (RDDs) representing data from specific time intervals.
Q: What are the primary sources for Spark Streaming?
A: Kafka, HDFS, custom connectors, and other streaming data sources.
Q: How are DStreams similar to RDDs in Spark?
A: DStreams are like RDDs but represent data over time intervals, allowing for batch-like processing of streaming data.
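The discretization idea can be shown directly: timestamped events are grouped into fixed-length intervals, and each interval's batch is then processed like a small RDD. Plain Python, not the Spark API; the event data is invented for the example.

```python
# Toy model of a DStream: a stream discretized into per-interval batches.
# Plain Python illustrating the concept, NOT the Spark API.

def discretize(events, interval):
    """Group (timestamp, value) events into batches of length `interval`."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    return [batches[k] for k in sorted(batches)]

events = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
batches = discretize(events, interval=2)   # -> [[1, 2], [3, 4], [5]]

# A stateless "transformation" applies independently to each batch, RDD-style:
doubled = [[v * 2 for v in batch] for batch in batches]
```

Each inner list plays the role of one RDD in the DStream, which is why familiar RDD operations carry over to streaming almost unchanged.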
Q: What are the two types of DStream Transformations?
A:
Stateless Transformations: Operations like map, filter, and reduceByKey applied independently on each RDD.
Stateful Transformations: Operations like windowing and updateStateByKey that maintain state across RDDs and require checkpointing.
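The distinction can be made concrete with a sketch in the spirit of `updateStateByKey`: per-key state is carried forward and folded together with each new micro-batch. Plain Python, not Spark's API; the batches and key names are invented.

```python
# Toy sketch of a stateful transformation: per-key state carried across
# micro-batches, in the spirit of updateStateByKey. NOT the Spark API.

def update_state(state, batch):
    """Fold one batch of (key, value) pairs into the running per-key totals."""
    new_state = dict(state)               # keep the old state immutable
    for key, value in batch:
        new_state[key] = new_state.get(key, 0) + value
    return new_state

state = {}
for batch in [[("a", 1), ("b", 2)], [("a", 3)]]:
    state = update_state(state, batch)
# state now holds totals accumulated across both batches
```

A stateless transformation would instead see each batch in isolation; it is precisely this carried-forward state that must be checkpointed so it survives failures.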
Q: What are Window Operations in Spark Streaming?
A: Operations that compute over sliding data windows, allowing aggregation and analysis of data within specified time frames.
Q: Name some Window Operations in Spark Streaming.
A:
Window: Returns a new DStream containing only the data within the current window.
CountByWindow/ReduceByWindow: Counts or reduces all elements within the window.
ReduceByKeyAndWindow: Performs per-key aggregation within a window.
CountByValueAndWindow: Counts occurrences of each distinct value within a window.
UpdateStateByKey: Tracks and updates state information for each key across batches.
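A sliding-window aggregation like `reduceByWindow` can be sketched over the batch model: each result covers the last `window` batches, advancing by `slide` batches at a time. Plain Python, not the Spark API; the batch contents are invented.

```python
# Toy sketch of a sliding-window reduce over micro-batches. NOT the Spark API.

def reduce_by_window(batches, window, slide):
    """Sum the elements of each sliding window of `window` batches."""
    results = []
    for end in range(window, len(batches) + 1, slide):
        window_batches = batches[end - window:end]
        results.append(sum(sum(b) for b in window_batches))
    return results

batches = [[1], [2], [3], [4]]                 # one batch per interval
reduce_by_window(batches, window=2, slide=1)   # -> [3, 5, 7]
```

With `slide` equal to `window` the windows are tumbling (non-overlapping); with `slide` smaller than `window` they overlap, so each element contributes to several results.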
Q: What is Apache Flink?
A: A framework for real-time, stateful stream processing that supports both bounded and unbounded data streams.
Q: What are the main APIs provided by Apache Flink?
A:
DataStream
DataSet
Table
CEP (Complex Event Processing)
Gelly
FlinkML
Q: What are Streaming Dataflows in Apache Flink?
A: Directed Acyclic Graphs (DAGs) consisting of sources, transformations, and sinks that define the flow of data processing.
Q: What deployment options does Flink Architecture support?
A: Local, cluster, and cloud deployments.
Q: What libraries does Apache Flink provide?
A: Libraries for graph processing, machine learning (ML), and event processing.
Q: How does Apache Flink handle stateful stream processing?
A: By maintaining and managing state information across events, enabling complex event processing and real-time analytics.
Q: What makes Apache Flink suitable for both bounded and unbounded data streams?
A: Its flexible architecture that supports batch and real-time processing paradigms within the same framework.
Q: Compare Apache Storm and Spark Streaming.
A:
Storm: Focuses on real-time computation with a topology-based approach using spouts and bolts.
Spark Streaming: Integrates with Spark's ecosystem, using DStreams for micro-batch processing with in-memory capabilities for higher throughput.
Q: Compare Apache Flink with Spark Streaming.
A:
Flink: Provides true stream processing with low latency and stateful computations, supporting event-time processing.
Spark Streaming: Utilizes micro-batching, which introduces slight latency but benefits from Spark's in-memory processing.
Q: What is the primary advantage of using Apache Flink for real-time analytics?
A: Its ability to handle both batch and stream processing seamlessly with low latency and robust state management.
Q: What is a Tuple in Apache Storm?
A: A data record emitted by spouts and processed by bolts within a topology.
Q: How does Shuffle Grouping work in Storm?
A: Distributes tuples evenly across all target bolts to balance the load.
Q: What is Field Grouping in Storm?
A: Groups tuples based on specific fields, ensuring that tuples with the same field values are sent to the same bolt instance.
Q: What is the purpose of All/Global Grouping in Storm?
A: All grouping broadcasts each tuple to every bolt instance; Global grouping routes the entire stream to the single bolt task with the lowest task ID.
Q: What does Direct Grouping enable in Storm?
A: Allows the sender to specify the exact bolt instance that should receive each tuple.
Q: What ensures fault tolerance in Apache Storm?
A: The distributed architecture with supervisors and the use of Zookeeper for coordination, allowing automatic task reassignment in case of failures.
Q: What is the main use case for Spark Streaming?
A: High-throughput, fault-tolerant stream processing for real-time data analytics and event-driven applications.
Q: How do DStreams achieve fault tolerance in Spark Streaming?
A: By using RDD lineage information and checkpointing to recover lost data in case of failures.
Q: What is checkpointing in Spark Streaming?
A: A mechanism to save the state of DStreams periodically to reliable storage, enabling recovery from failures.
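The mechanism can be sketched in a few lines: periodically persist the streaming state to durable storage, and on restart recover from the latest checkpoint instead of replaying the whole lineage. Plain Python with a JSON file standing in for reliable storage; the state contents and file name are invented.

```python
# Toy sketch of checkpointing: persist state, then recover it after a restart.
import json
import os
import tempfile

def checkpoint(state, path):
    """Write the current streaming state to 'reliable storage'."""
    with open(path, "w") as f:
        json.dump(state, f)

def recover(path):
    """Resume from the last checkpoint, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
checkpoint({"word_counts": {"a": 4}}, path)
state = recover(path)   # after a failure, processing resumes from this state
```

Real Spark Streaming checkpoints both metadata (the DStream graph, offsets) and accumulated state, and writes them to fault-tolerant storage such as HDFS rather than a local file.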
Q: How does Apache Flink achieve high performance in stream processing?
A: Through its advanced scheduling, efficient state management, and support for event-time processing, allowing low-latency and high-throughput data handling.
Q: What is CEP in Apache Flink?
A: Complex Event Processing API for detecting patterns and relationships within event streams.
Q: What is Gelly in Apache Flink?
A: Flink’s API for graph processing, enabling the execution of graph algorithms on large-scale data.
Q: What is FlinkML?
A: Flink’s machine learning library that provides scalable algorithms for various ML tasks.
Q: How do Flink's Streaming Dataflows differ from traditional batch processing?
A: They process data continuously as it arrives, maintaining state and handling events in real-time rather than processing data in large, discrete batches.
Q: What is the Driver in a Spark Cluster?
A: The program that contains the main function and creates a SparkContext to coordinate the execution of tasks across the cluster.
Q: What is the role of the Cluster Manager in Spark?
A: Allocates resources across the cluster and manages the distribution of tasks to Executors.
Q: What are Executors in Spark?
A: Processes allocated on worker nodes that run application code and perform tasks such as transformations and actions on RDDs.
Q: What is the Driver Program in Spark?
A: The process that runs the main function of the application and schedules tasks to be executed on the cluster.
Q: How does Apache Flink support multi-tenancy?
A: By managing resources and isolating jobs from different users or applications within the same cluster to ensure fair resource sharing and performance.
Q: What is the primary difference between Apache Storm and Apache Flink?
A: Storm focuses on unbounded stream processing with a topology-based model, while Flink provides both bounded and unbounded stream processing with advanced state management and event-time processing.
Q: What is event-time processing in Apache Flink?
A: Processing events based on the time they occurred, allowing accurate handling of out-of-order events and late data.
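The key point is that window assignment uses the timestamp carried by the event, not the moment it arrives, so late or reordered events still land in the right window. A plain-Python sketch of the idea (not the Flink API; the arrival sequence is invented):

```python
# Toy sketch of event-time windowing: events carry their own timestamps and
# may arrive out of order. NOT the Flink API.

def event_time_windows(events, window):
    """Count events per event-time window, regardless of arrival order."""
    counts = {}
    for event_time, _value in events:      # events listed in ARRIVAL order
        bucket = event_time // window
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

# The event with timestamp 1 arrives late, after timestamps 11 and 12,
# yet it is still counted in the first window:
arrivals = [(0, "a"), (11, "b"), (12, "c"), (1, "d")]
event_time_windows(arrivals, window=10)    # -> {0: 2, 1: 2}
```

What this sketch omits is the hard part Flink actually solves: watermarks, which tell the system how long to wait for stragglers before emitting a window's result.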
Q: What is a DAG in the context of Apache Flink?
A: Directed Acyclic Graphs that represent the flow of data and transformations in streaming dataflows.
Q: What are the main advantages of using Apache Flink for real-time analytics?
A:
True stream processing with low latency.
Robust state management and fault tolerance.
Support for complex event processing and machine learning.
Flexibility to handle both bounded and unbounded data streams.