
Chapter 11

Flashcard 1

Q: What is Apache Storm? 

A: A distributed, fault-tolerant framework for real-time computation that processes data streams from sources like Kafka, Kinesis, and RabbitMQ, designed for scalable and reliable real-time data processing.


Flashcard 2

Q: What are the main Storm Concepts? 

A:

  • Topology: Graph of computations executed across a cluster.

  • Stream: Unbounded sequence of tuples.

  • Spouts: Sources of streams, a type of node.

  • Bolts: Process tuples, a type of node.

  • Tasks: Instances of spouts and bolts executed by parallel threads within worker processes.
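The spout-to-bolt flow above can be sketched in plain Python. This is a conceptual simulation, not the real Storm API; the names `sentence_spout`, `split_bolt`, and `count_bolt` are illustrative:

```python
# Conceptual sketch of a Storm topology: a spout emits tuples,
# bolts transform and aggregate them. Plain Python, not the Storm API.

def sentence_spout():
    """Spout: source of the stream, emits tuples."""
    for sentence in ["storm is fast", "storm is distributed"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: keeps a running count per word."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt
counts = count_bolt(split_bolt(sentence_spout()))
print(counts["storm"])  # 2
```

In a real topology, each of these components would run as many parallel tasks across the cluster, connected by stream groupings.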


Flashcard 3

Q: What is a Topology in Apache Storm? 

A: A graph of computations that defines how data flows and is processed across the cluster.


Flashcard 4

Q: What is a Stream in Apache Storm? 

A: An unbounded sequence of tuples (data records) that flows through the topology.


Flashcard 5

Q: What are Spouts in Apache Storm? 

A: Components that act as sources of streams, emitting tuples into the topology.

Flashcard 6

Q: What are Bolts in Apache Storm? 

A: Components that process incoming tuples, performing operations like filtering, aggregating, or joining data.


Flashcard 7

Q: What are Tasks in Apache Storm? 

A: Parallel threads that execute spouts and bolts within worker processes to handle data processing.


Flashcard 8

Q: What are the different Stream Groupings in Storm? 

A:

  • Shuffle: Even distribution of tuples across bolts.

  • Field Grouping: Groups tuples based on specific fields.

  • All/Global: All broadcasts each tuple to every bolt task; Global routes all tuples to the single task with the lowest ID.

  • Direct: Sender determines the destination bolt.
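The routing behavior of shuffle and field groupings can be illustrated with a small pure-Python sketch (hypothetical helper names, not the Storm API):

```python
import itertools

# Conceptual sketch of two Storm stream groupings, routing tuples
# to bolt task indices. Plain Python, not the Storm API.

NUM_TASKS = 3

def shuffle_grouping(tuples):
    """Round-robin tuples across tasks for even load."""
    rr = itertools.cycle(range(NUM_TASKS))
    return [(next(rr), t) for t in tuples]

def fields_grouping(tuples, key_index=0):
    """Hash a chosen field so equal keys always reach the same task."""
    return [(hash(t[key_index]) % NUM_TASKS, t) for t in tuples]

tuples = [("user1", 5), ("user2", 7), ("user1", 2)]

# Fields grouping: both "user1" tuples land on the same task index.
routed = fields_grouping(tuples)
assert routed[0][0] == routed[2][0]
```

Fields grouping is what makes per-key aggregations correct: all tuples for a key are processed by one task, so its local state sees every update for that key.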


Flashcard 9

Q: What are the main components of a Storm Cluster? 

A:

  • Nimbus: Manages topologies and task distribution.

  • Supervisor: Executes worker processes on cluster nodes.

  • Zookeeper: Coordinates the cluster, managing configuration and synchronization.


Flashcard 10

Q: What is Nimbus in a Storm Cluster? 

A: The master node that manages topologies and distributes tasks to supervisors.


Flashcard 11

Q: What is the role of the Supervisor in Storm? 

A: Executes worker processes that run spouts and bolts as part of the topology.


Flashcard 12

Q: What is Zookeeper used for in a Storm Cluster? 

A: Coordinates the cluster by managing configuration, synchronization, and leader election.


Flashcard 13

Q: What is Spark Streaming? 

A: A high-throughput, fault-tolerant stream processing component of Apache Spark that processes real-time data streams.


Flashcard 14

Q: What are DStreams in Spark Streaming? 

A: Discretized Streams, which are sequences of Resilient Distributed Datasets (RDDs) representing data from specific time intervals.


Flashcard 15

Q: What are the primary sources for Spark Streaming? 

A: Kafka, HDFS, custom connectors, and other streaming data sources.




Flashcard 16

Q: How are DStreams similar to RDDs in Spark? 

A: DStreams are like RDDs but represent data over time intervals, allowing for batch-like processing of streaming data.


Flashcard 17

Q: What are the two types of DStream Transformations? 

A:

  • Stateless Transformations: Operations like map, filter, and reduceByKey applied independently on each RDD.

  • Stateful Transformations: Operations like windowing and updateStateByKey that maintain state across RDDs and require checkpointing.
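The distinction can be shown by modeling a DStream as a list of micro-batches in plain Python (a conceptual sketch, not the PySpark API):

```python
# Conceptual sketch of stateless vs stateful DStream transformations,
# with a DStream modeled as a list of micro-batches. Not the PySpark API.

batches = [["a", "b", "a"], ["b", "b"]]

# Stateless: each batch is transformed independently (map/filter-style).
upper_batches = [[w.upper() for w in batch] for batch in batches]

# Stateful: updateStateByKey-style running count carried across batches.
state = {}
for batch in batches:
    for w in batch:
        state[w] = state.get(w, 0) + 1

print(state)  # {'a': 2, 'b': 3}
```

The stateless result depends only on the current batch; the stateful result accumulates across batches, which is why Spark requires checkpointing to make that state recoverable.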


Flashcard 18

Q: What are Window Operations in Spark Streaming? 

A: Operations that compute over sliding data windows, allowing aggregation and analysis of data within specified time frames.


Flashcard 19

Q: Name some Window Operations in Spark Streaming. 

A:

  • Window: Returns a new DStream representing data within a window.

  • CountByWindow/ReduceByWindow: Aggregates data within the window.

  • ReduceByKeyAndWindow: Performs key-based aggregation within a window.

  • CountByValueAndWindow: Counts occurrences of each distinct value within the window.

  • UpdateStateByKey: Maintains and updates per-key state across batches (a stateful operation rather than a strictly windowed one).
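A countByWindow-style operation can be sketched in plain Python by keeping the last few micro-batches in a bounded buffer (conceptual only, not the Spark Streaming API):

```python
from collections import deque

# Conceptual sketch of a sliding-window count (countByWindow-style):
# window length = 3 batches, sliding forward by 1 batch at a time.
# Plain Python, not the Spark Streaming API.

batches = [[1, 2], [3], [4, 5, 6], [7]]
window = deque(maxlen=3)   # keeps only the most recent 3 batches
window_counts = []

for batch in batches:
    window.append(batch)
    window_counts.append(sum(len(b) for b in window))

print(window_counts)  # [2, 3, 6, 5]
```

Each emitted count covers the current batch plus the two before it; once the buffer is full, the oldest batch falls out of the window, which is exactly the sliding behavior the window operations above provide.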


Flashcard 20

Q: What is Apache Flink? 

A: A framework for real-time, stateful stream processing that supports both bounded and unbounded data streams.

Flashcard 21

Q: What are the main APIs provided by Apache Flink? 

A:

  • DataStream

  • DataSet

  • Table

  • CEP (Complex Event Processing)

  • Gelly

  • FlinkML


Flashcard 22

Q: What are Streaming Dataflows in Apache Flink? 

A: Directed Acyclic Graphs (DAGs) consisting of sources, transformations, and sinks that define the flow of data processing.


Flashcard 23

Q: What deployment options does Flink Architecture support? 

A: Local, cluster, and cloud deployments.


Flashcard 24

Q: What libraries does Apache Flink provide? 

A: Libraries for graph processing, machine learning (ML), and event processing.


Flashcard 25

Q: How does Apache Flink handle stateful stream processing? 

A: By maintaining and managing state information across events, enabling complex event processing and real-time analytics.

Flashcard 26

Q: What makes Apache Flink suitable for both bounded and unbounded data streams? 

A: Its flexible architecture that supports batch and real-time processing paradigms within the same framework.


Flashcard 27

Q: Compare Apache Storm and Spark Streaming. 

A:

  • Storm: Focuses on real-time computation with a topology-based approach using spouts and bolts.

  • Spark Streaming: Integrates with Spark's ecosystem, using DStreams for micro-batch processing with in-memory capabilities for higher throughput.


Flashcard 28

Q: Compare Apache Flink with Spark Streaming. 

A:

  • Flink: Provides true stream processing with low latency and stateful computations, supporting event-time processing.

  • Spark Streaming: Utilizes micro-batching, which introduces slight latency but benefits from Spark's in-memory processing.


Flashcard 29

Q: What is the primary advantage of using Apache Flink for real-time analytics? 

A: Its ability to handle both batch and stream processing seamlessly with low latency and robust state management.


Flashcard 30

Q: What is a Tuple in Apache Storm? 

A: A data record emitted by spouts and processed by bolts within a topology.

Flashcard 31

Q: How does Shuffle Grouping work in Storm? 

A: Distributes tuples evenly across all target bolts to balance the load.


Flashcard 32

Q: What is Field Grouping in Storm? 

A: Groups tuples based on specific fields, ensuring that tuples with the same field values are sent to the same bolt instance.


Flashcard 33

Q: What is the purpose of All/Global Grouping in Storm? 

A: All grouping broadcasts each tuple to every bolt instance; Global grouping routes every tuple to a single bolt task (the one with the lowest ID).


Flashcard 34

Q: What does Direct Grouping enable in Storm? 

A: Allows the sender to specify the exact bolt instance that should receive each tuple.


Flashcard 35

Q: What ensures fault tolerance in Apache Storm? 

A: The distributed architecture with supervisors and the use of Zookeeper for coordination, allowing automatic task reassignment in case of failures.




Flashcard 36

Q: What is the main use case for Spark Streaming? 

A: High-throughput, fault-tolerant stream processing for real-time data analytics and event-driven applications.


Flashcard 37

Q: How do DStreams achieve fault tolerance in Spark Streaming? 

A: By using RDD lineage information and checkpointing to recover lost data in case of failures.


Flashcard 38

Q: What is checkpointing in Spark Streaming? 

A: A mechanism to save the state of DStreams periodically to reliable storage, enabling recovery from failures.
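The recovery idea behind checkpointing can be sketched in plain Python, with a dict standing in for reliable storage such as HDFS (a conceptual sketch, not the Spark API):

```python
import copy

# Conceptual sketch of checkpointing: periodically snapshot state so a
# restarted job resumes from the last checkpoint instead of from zero.
# The `storage` dict stands in for reliable storage such as HDFS.

storage = {}
state = {"count": 0}
CHECKPOINT_EVERY = 3

for i in range(1, 8):          # process 7 events
    state["count"] += 1
    if i % CHECKPOINT_EVERY == 0:
        storage["checkpoint"] = copy.deepcopy(state)

# Simulated crash: in-memory state is lost; recover from the checkpoint.
state = copy.deepcopy(storage["checkpoint"])
print(state["count"])  # 6 (the last checkpoint was taken after event 6)
```

The trade-off is between checkpoint frequency (overhead per batch) and recovery cost (how much work must be redone after a failure).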


Flashcard 39

Q: How does Apache Flink achieve high performance in stream processing? 

A: Through its advanced scheduling, efficient state management, and support for event-time processing, allowing low-latency and high-throughput data handling.


Flashcard 40

Q: What is CEP in Apache Flink? 

A: Complex Event Processing API for detecting patterns and relationships within event streams.




Flashcard 41

Q: What is Gelly in Apache Flink? 

A: Flink’s API for graph processing, enabling the execution of graph algorithms on large-scale data.


Flashcard 42

Q: What is FlinkML? 

A: Flink’s machine learning library that provides scalable algorithms for various ML tasks.


Flashcard 43

Q: How do Flink's Streaming Dataflows differ from traditional batch processing? 

A: They process data continuously as it arrives, maintaining state and handling events in real-time rather than processing data in large, discrete batches.


Flashcard 44

Q: What is the Driver in a Spark Cluster? 

A: The program that contains the main function and creates a SparkContext to coordinate the execution of tasks across the cluster.


Flashcard 45

Q: What is the role of the Cluster Manager in Spark? 

A: Allocates resources across the cluster and manages the distribution of tasks to Executors.




Flashcard 46

Q: What are Executors in Spark? 

A: Processes allocated on worker nodes that run application code and perform tasks such as transformations and actions on RDDs.


Flashcard 47

Q: What is the Driver Program in Spark? 

A: The process that runs the main function of the application and schedules tasks to be executed on the cluster.


Flashcard 48

Q: How does Apache Flink support multi-tenancy? 

A: By managing resources and isolating jobs from different users or applications within the same cluster to ensure fair resource sharing and performance.


Flashcard 49

Q: What is the primary difference between Apache Storm and Apache Flink? 

A: Storm focuses on unbounded stream processing with a topology-based model, while Flink provides both bounded and unbounded stream processing with advanced state management and event-time processing.


Flashcard 50

Q: What is event-time processing in Apache Flink? 

A: Processing events based on the time they occurred, allowing accurate handling of out-of-order events and late data.
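The key property of event time, that out-of-order arrival does not change the result, can be shown with a small pure-Python sketch of tumbling windows (conceptual only, not the Flink API):

```python
from collections import defaultdict

# Conceptual sketch of event-time processing: events are bucketed by the
# timestamp at which they OCCURRED, so arrival order does not affect the
# result. Plain Python, not the Flink API.

WINDOW_SIZE = 10  # seconds, tumbling windows [0,10), [10,20), ...

# (event_time, value) pairs, arriving out of order.
events = [(12, "a"), (3, "b"), (15, "c"), (7, "d")]

windows = defaultdict(list)
for event_time, value in events:
    window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
    windows[window_start].append(value)

print(sorted(windows.items()))  # [(0, ['b', 'd']), (10, ['a', 'c'])]
```

A processing-time system would have grouped these events by when they arrived; event time groups them by when they happened, which is what makes late and out-of-order data tractable (Flink pairs this with watermarks to decide when a window can be finalized).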


Flashcard 51

Q: What is a DAG in the context of Apache Flink? 

A: Directed Acyclic Graphs that represent the flow of data and transformations in streaming dataflows.


Flashcard 52

Q: What are the main advantages of using Apache Flink for real-time analytics? 

A:

  • True stream processing with low latency.

  • Robust state management and fault tolerance.

  • Support for complex event processing and machine learning.

  • Flexibility to handle both bounded and unbounded data streams.
