Chapter 9
Q: What are the primary source types for data acquisition in big data systems?
A: Sources that publish data in batches, in micro-batches, or as real-time streams.
Q: How is velocity defined in the context of data acquisition?
A: The speed at which data is generated and how frequently it is produced.
Q: What characterizes high velocity data?
A: Real-time or streaming data generated and processed continuously.
Q: What are the two main ingestion mechanisms for data?
A: Push and pull mechanisms: push is driven by the data producer, which sends data to the ingestion system, while pull is driven by the data consumer, which requests data from the source.
Q: What is Kafka?
A: A high-throughput distributed messaging system used for building real-time data pipelines and streaming applications.
Q: In Kafka, what is a broker?
A: A server that manages topics and handles the persistence, partitioning, and replication of data.
Q: What is a topic in Kafka?
A: A stream of messages of a particular type, similar to tables in databases.
Q: How does Kafka store messages?
A: On disk using partitioned commit logs.
Q: What roles do producers and consumers play in Kafka?
A: Producers publish messages to topics, while consumers subscribe to topics and process messages.
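Example: a minimal producer/consumer sketch using the third-party kafka-python client (pip install kafka-python). The broker address and topic name are illustrative assumptions, not values from this chapter.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish one message to the "sensor-readings" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"device": "d1", "temp": 22.5}')
producer.flush()  # block until buffered messages are actually sent

# Consumer: subscribe to the topic and process messages as they arrive.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start from the oldest retained message
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```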
Q: What is a partition in Kafka?
A: A division of a Kafka topic that allows messages to be consumed in parallel, maintaining an ordered and immutable sequence of messages.
Q: How does Kafka achieve parallel consumption of messages?
A: By dividing topics into multiple partitions and allowing multiple consumers to read from different partitions simultaneously.
Q: What is the role of the leader in Kafka?
A: For each partition, one server acts as the leader and handles all read and write operations for that partition.
Q: What are replicas in Kafka?
A: Followers that replicate the data of the leader to ensure fault tolerance.
Q: Describe Kafka's publish-subscribe messaging framework.
A: Producers publish messages to topics, and consumers subscribe to those topics to receive messages.
Q: What is an offset in Kafka?
A: A unique sequence ID assigned to each message within a partition.
Q: What is a consumer group in Kafka?
A: A group of consumers that work together to consume messages from one or more topics, ensuring each message is processed by only one consumer in the group.
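Example: a consumer-group sketch with kafka-python. Every consumer started with the same group_id shares the topic's partitions, so each message is handled by exactly one group member; all names are illustrative.

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",
    group_id="analytics-workers",       # same group_id = share the partitions
    bootstrap_servers="localhost:9092",
    enable_auto_commit=True,            # periodically commit offsets to Kafka
)
# Run a second copy of this script: Kafka automatically rebalances the
# topic's partitions between the two group members.
for record in consumer:
    print("processing", record.value)
```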
Q: How does Kafka handle log storage?
A: Messages are stored in append-only, ordered, and immutable logs.
Q: What is log compaction in Kafka?
A: A process that cleans out obsolete records by retaining only the latest message for each key in a topic partition's log.
Q: What are log segments in Kafka?
A: Portions of a topic partition's log; each partition is stored as a directory of segment files.
Q: What determines when log segments are deleted in Kafka?
A: Reaching size or time limits as defined by the delete policy.
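Example: a sketch of setting these cleanup policies per topic with kafka-python's admin client. cleanup.policy, segment.bytes, retention.ms, and retention.bytes are standard Kafka topic configs; the topic name and values are illustrative.

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(
        name="device-state",
        num_partitions=3,
        replication_factor=1,
        topic_configs={
            "cleanup.policy": "compact",    # keep only the latest message per key
            "segment.bytes": "1073741824",  # roll a new segment file at 1 GiB
            # with "cleanup.policy": "delete", retention.ms and retention.bytes
            # set the time and size limits after which segments are deleted
        },
    )
])
```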
Q: What is Amazon Kinesis?
A: A managed service for ingesting, processing, and analyzing real-time streaming data on AWS.
Q: What are Kinesis Data Streams?
A: A service for ingesting and processing streaming data in real time.
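Example: writing a single record to a data stream with boto3 (pip install boto3; AWS credentials are assumed to be configured). The stream name is illustrative.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user": "u42", "action": "click"}).encode(),
    PartitionKey="u42",  # records with the same key land on the same shard
)
```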
Q: What are Firehose Delivery Streams in Kinesis?
A: Services that collect, transform, and load (ETL) streaming data into destinations such as S3, Redshift, and Splunk.
Q: What does Kinesis Analytics do?
A: Runs continuous SQL queries on streaming data from Kinesis Data Streams and Firehose Delivery Streams.
Q: What are Kinesis Video Streams used for?
A: Streaming live video from devices to the AWS cloud for real-time video processing and batch-oriented analytics.
Q: What is AWS IoT?
A: A service for collecting and managing data from Internet of Things (IoT) devices.
Q: What is the Device Gateway in AWS IoT?
A: Enables IoT devices to securely communicate with AWS IoT.
Q: What is the Device Registry in AWS IoT?
A: Maintains resources and information associated with each IoT device.
Q: What is a Device Shadow in AWS IoT?
A: Maintains the state of a device as a JSON document, allowing applications to interact with devices even when they are offline.
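Example: setting a shadow's desired state with boto3's iot-data client. The thing name and state fields are illustrative; credentials and an AWS IoT endpoint are assumed to be configured.

```python
import json
import boto3

iot = boto3.client("iot-data", region_name="us-east-1")
iot.update_thing_shadow(
    thingName="thermostat-01",
    payload=json.dumps({"state": {"desired": {"setpoint": 21}}}).encode(),
)
# When the offline device reconnects, it reads the desired state from its
# shadow, applies it, and reports its actual state back.
```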
Q: What does the Rules Engine in AWS IoT do?
A: Applies user-defined rules to incoming device messages and routes matching messages to other AWS services for processing.
Q: What is Apache Flume?
A: A distributed system for collecting, aggregating, and moving large amounts of data from various sources to a centralized data store.
Q: What is a checkpoint file in Flume?
A: A file that records the last committed transactions, acting as a snapshot of the file channel's state so that data is not lost if the agent restarts.
Q: What are the main components of Flume Architecture?
A: Source, Channel, Sink, and Agent.
Q: What is a Source in Flume?
A: The component that receives or polls data from external sources.
Q: What is a Channel in Flume?
A: A buffer that holds events between the source and the sink until the sink consumes them.
Q: What is a Sink in Flume?
A: Drains data from the channel to the final data store.
Q: What is an Agent in Flume?
A: A collection of sources, channels, and sinks that moves data from external sources to destinations.
Q: What is an Event in Flume?
A: A unit of data flow, consisting of a payload and optional attributes.
Q: Name the types of Flume Channels.
A: Memory channel, File channel, JDBC channel, and Spillable Memory channel.
Q: What is a Memory Channel in Flume?
A: Stores events in an in-memory queue, offering high throughput at the cost of losing events if the agent process fails.
Q: What is a File Channel in Flume?
A: Stores events in files on the local filesystem for durability.
Q: What is a JDBC Channel in Flume?
A: Stores events in an embedded Derby database for durable storage.
Q: What is a Spillable Memory Channel in Flume?
A: Stores events in an in-memory queue and spills to disk when the queue is full.
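Example: a Flume agent is wired together in a properties file. A minimal sketch with a netcat source, a durable file channel, and a logger sink; all component names are illustrative.

```properties
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-separated events on a TCP port.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events durably on the local filesystem.
a1.channels.c1.type = file

# Sink: drain events from the channel to the agent's log.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```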
Q: What is Apache Sqoop?
A: A tool for importing data from relational databases into Hadoop Distributed File System (HDFS), Hive, or HBase, and exporting data back to RDBMS.
Q: How does Sqoop import data?
A: By launching multiple map tasks to transfer data as delimited text files, binary Avro files, or Hadoop sequence files.
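Example: a representative import command; the connection details and table name are illustrative. Four parallel map tasks copy the table into HDFS as Avro files.

```sh
sqoop import \
  --connect jdbc:mysql://dbhost:3306/shop \
  --username etl_user -P \
  --table orders \
  --num-mappers 4 \
  --as-avrodatafile \
  --target-dir /data/orders
```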
Q: What are Hadoop Sequence Files?
A: A binary file format specific to Hadoop for storing sequences of key-value pairs.
Q: What is Apache Avro?
A: A serialization framework that provides rich data structures, a compact binary data format, and container files for data storage and processing.
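Example: writing and reading an Avro container file with the third-party fastavro library (pip install fastavro); the schema and record are illustrative.

```python
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "type": "record",
    "name": "Reading",
    "fields": [
        {"name": "device", "type": "string"},
        {"name": "temp", "type": "float"},
    ],
})

# Write records to a container file; the schema travels with the data.
with open("readings.avro", "wb") as out:
    writer(out, schema, [{"device": "d1", "temp": 22.5}])

# Read them back; no external schema is needed.
with open("readings.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```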
Q: What is RabbitMQ?
A: A message broker that implements the Advanced Message Queuing Protocol (AMQP) for exchanging messages between systems.
Q: What is the Advanced Message Queuing Protocol (AMQP)?
A: A protocol that defines the exchange of messages between systems, specifying roles like producers, consumers, and brokers.
Q: In RabbitMQ, what are producers and consumers?
A: Producers publish messages to exchanges, and consumers receive messages from queues based on bindings and routing rules.
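Example: a sketch using the pika client (pip install pika). The producer publishes to an exchange, and a queue bound with a matching routing key receives the message; all names are illustrative.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Declare an exchange and a queue, and bind them with a routing key.
ch.exchange_declare(exchange="events", exchange_type="direct")
ch.queue_declare(queue="orders")
ch.queue_bind(queue="orders", exchange="events", routing_key="order.created")

# Producer side: publish to the exchange, never directly to the queue.
ch.basic_publish(exchange="events", routing_key="order.created",
                 body=b'{"order_id": 7}')

# Consumer side: fetch one message from the bound queue.
method, properties, body = ch.basic_get(queue="orders", auto_ack=True)
print(body)
conn.close()
```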
Q: What is ZeroMQ?
A: A high-performance messaging library that provides tools to build custom messaging systems without requiring a message broker.
Q: What messaging patterns does ZeroMQ support?
A: Request-Reply, Publish-Subscribe, Push-Pull, and Exclusive Pair.
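Example: the Publish-Subscribe pattern with pyzmq (pip install pyzmq). There is no broker: the publisher binds a socket and subscribers connect to it directly; the port and topic prefix are illustrative.

```python
import time
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"weather")  # prefix filter on message bytes

time.sleep(0.5)  # let the subscription propagate (ZeroMQ's "slow joiner")
pub.send(b"weather 18.5C")  # pub and sub would normally be separate processes
print(sub.recv())
```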
Q: What is RestMQ?
A: A message queue based on a simple JSON-based protocol using HTTP as the transport, organized as REST resources.
Q: How do producers interact with RestMQ?
A: By making HTTP POST requests with data payloads to publish messages to queues.
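Example: publishing with the requests library. The host, port, and /q/<queue> path are illustrative assumptions about a local RestMQ deployment, not values from this chapter.

```python
import requests

resp = requests.post(
    "http://localhost:8888/q/orders",   # hypothetical queue resource
    data={"value": '{"order_id": 7}'},  # message payload as form data
)
print(resp.status_code)
```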
Q: What is Amazon SQS?
A: A scalable and reliable hosted queue service that stores messages for distributed applications.
Q: What are the two types of queues in Amazon SQS?
A: Standard queues and FIFO (First-In-First-Out) queues.
Q: What are the characteristics of Standard Queues in Amazon SQS?
A:
Guarantees message delivery but only best-effort ordering; messages may arrive out of order.
Supports a nearly unlimited number of transactions per second.
Operates on an at-least-once delivery model, so duplicate messages are occasionally delivered.
Q: What are the characteristics of FIFO Queues in Amazon SQS?
A:
Ensures messages are received in the exact order they were sent.
Supports up to 3,000 messages per second with batching or 300 messages per second without batching.
Provides exactly-once processing.
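Example: sending to and receiving from a FIFO queue with boto3; the queue URL is illustrative. MessageGroupId sets the ordering scope, and MessageDeduplicationId is what enables exactly-once processing.

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"

sqs.send_message(
    QueueUrl=queue_url,
    MessageBody='{"order_id": 7}',
    MessageGroupId="store-42",         # messages in a group arrive in order
    MessageDeduplicationId="order-7",  # duplicates suppressed for 5 minutes
)

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in resp.get("Messages", []):
    print(msg["Body"])
    # A message must be deleted explicitly, or it reappears after the
    # visibility timeout expires.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```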
Q: What are Connectors in the context of messaging systems?
A: Interfaces that allow data to be published to and consumed from messaging queues, often exposing REST web services or other protocols.
Q: How does a REST-based Connector work?
A: Producers publish data using HTTP POST requests with data payloads, and the connector processes the requests and stores the data to the sink.
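Example: a minimal sketch of such a connector using Flask (pip install flask). The endpoint path is an assumption, and the in-memory list stands in for whatever sink a real connector would write to.

```python
from flask import Flask, request

app = Flask(__name__)
messages = []  # stand-in sink; a real connector writes to a queue or store

@app.route("/messages", methods=["POST"])
def publish():
    messages.append(request.get_json())  # parse the POSTed JSON payload
    return {"status": "queued"}, 202

if __name__ == "__main__":
    app.run(port=8080)
```

A producer would then publish with an HTTP POST, e.g. requests.post("http://localhost:8080/messages", json={"temp": 22.5}).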
Q: What is a WebSocket-based Connector?
A: A connector that uses full-duplex communication, allowing continuous data exchange without setting up new connections for each message.
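Example: a WebSocket producer sketch using the third-party websockets library (pip install websockets). One connection is opened and reused for every message, which is what distinguishes WebSockets from per-request HTTP; the URI is illustrative, and a connector listening there that echoes an acknowledgment is assumed.

```python
import asyncio
import websockets

async def produce():
    # Open one full-duplex connection and reuse it for all messages.
    async with websockets.connect("ws://localhost:8765/ingest") as ws:
        for i in range(3):
            await ws.send(f'{{"seq": {i}}}')  # send over the same connection
            print(await ws.recv())            # read the connector's reply

asyncio.run(produce())
```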
Q: What is an MQTT-based Connector?
A: A connector built on MQTT, a lightweight publish-subscribe messaging protocol designed for constrained devices, which makes it well suited to IoT applications.
Q: In MQTT-based systems, what are the main entities?
A: Publisher, Broker/Server, and Subscriber.
Q: What is the role of a Publisher in MQTT?
A: Publishes data to topics managed by the broker.
Q: What does the Broker/Server do in MQTT?
A: Manages topics and forwards published messages to the clients subscribed to those topics.
Q: What is the role of a Subscriber in MQTT?
A: Receives data from topics to which it has subscribed.
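Example: an MQTT publisher and subscriber sketch using paho-mqtt 1.x (pip install "paho-mqtt<2"). The broker address and topic are illustrative, and a broker such as Mosquitto is assumed to be running locally.

```python
import time
import paho.mqtt.client as mqtt
import paho.mqtt.publish as publish

def on_message(client, userdata, msg):
    print(msg.topic, msg.payload)

# Subscriber: receives data from the topic it subscribes to.
sub = mqtt.Client()
sub.on_message = on_message
sub.connect("localhost", 1883)
sub.subscribe("sensors/temperature", qos=1)
sub.loop_start()  # handle network traffic on a background thread

# Publisher: the one-shot helper connects, publishes, and disconnects.
publish.single("sensors/temperature", payload="22.5", qos=1,
               hostname="localhost")

time.sleep(1)  # give the broker time to forward the message
sub.loop_stop()
```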