Chapter 9
Flashcard 1
Q: What are the primary source types for data acquisition in big data systems?
A: Sources that publish data in batches, micro-batches, or as real-time streams.
Flashcard 2
Q: How is velocity defined in the context of data acquisition?
A: The speed at which data is generated and how frequently it is produced.
Flashcard 3
Q: What characterizes high velocity data?
A: Real-time or streaming data generated and processed continuously.
Flashcard 4
Q: What are the two main ingestion mechanisms for data?
A: Push, where the data source sends data to the collector, and pull, where the data consumer requests data from the source.
Flashcard 5
Q: What is Kafka?
A: A high-throughput distributed messaging system used for building real-time data pipelines and streaming applications.
Flashcard 6
Q: In Kafka, what is a broker?
A: A server that manages topics and their partitions, persists messages to disk, and replicates data.
Flashcard 7
Q: What is a topic in Kafka?
A: A stream of messages of a particular type, similar to tables in databases.
Flashcard 8
Q: How does Kafka store messages?
A: On disk using partitioned commit logs.
Flashcard 9
Q: What roles do producers and consumers play in Kafka?
A: Producers publish messages to topics, while consumers subscribe to topics and process messages.
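Illustrative sketch of this publish/consume flow with the kafka-python client (the broker address, the "sensor-data" topic, and the group id are assumptions, not values from the text):
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publish a message to the assumed "sensor-data" topic
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("sensor-data", b'{"device": "d1", "temp": 22.5}')
    producer.flush()

    # Consumer: subscribe to the topic and process messages
    consumer = KafkaConsumer("sensor-data",
                             bootstrap_servers="localhost:9092",
                             group_id="analytics",
                             auto_offset_reset="earliest")
    for msg in consumer:
        print(msg.partition, msg.offset, msg.value)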
Flashcard 10
Q: What is a partition in Kafka?
A: A division of a Kafka topic that allows messages to be consumed in parallel, maintaining an ordered and immutable sequence of messages.
Flashcard 11
Q: How does Kafka achieve parallel consumption of messages?
A: By dividing topics into multiple partitions and allowing multiple consumers to read from different partitions simultaneously.
Flashcard 12
Q: What is the role of a partition leader in Kafka?
A: The server that handles all read and write operations for the partition.
Flashcard 13
Q: What are replicas in Kafka?
A: Followers that replicate the data of the leader to ensure fault tolerance.
Flashcard 14
Q: Describe Kafka's publish-subscribe messaging framework.
A: Producers publish messages to topics, and consumers subscribe to those topics to receive messages.
Flashcard 15
Q: What is an offset in Kafka?
A: A unique sequence ID assigned to each message within a partition.
Flashcard 16
Q: What is a consumer group in Kafka?
A: A group of consumers that work together to consume messages from one or more topics, ensuring each message is processed by only one consumer in the group.
Flashcard 17
Q: How does Kafka handle log storage?
A: Messages are stored in append-only, ordered, and immutable logs.
Flashcard 18
Q: What is log compaction in Kafka?
A: A cleanup policy that removes obsolete records by retaining only the latest message for each key in a partition's log.
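As a sketch, compaction is enabled per topic by setting the cleanup.policy config to compact, shown here with kafka-python's admin client (topic name, partition count, and broker address are assumptions):
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    # Keep only the latest message per key by enabling log compaction
    admin.create_topics([NewTopic(name="device-state",
                                  num_partitions=3,
                                  replication_factor=1,
                                  topic_configs={"cleanup.policy": "compact"})])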
Flashcard 19
Q: What are log segments in Kafka?
A: Portions of a topic partition's log stored as directories of segment files.
Flashcard 20
Q: What determines when log segments are deleted in Kafka?
A: Reaching size or time limits as defined by the delete policy.
Flashcard 21
Q: What is Amazon Kinesis?
A: A managed service for ingesting, processing, and analyzing real-time streaming data on AWS.
Flashcard 22
Q: What are Kinesis Data Streams?
A: Services that allow ingestion and processing of streaming data in real-time.
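A minimal sketch of writing to and reading from a Kinesis Data Stream with boto3 (the stream name, region, and partition key are assumptions):
    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")

    # Ingest one record; the partition key determines the target shard
    kinesis.put_record(StreamName="sensor-stream",
                       Data=b'{"device": "d1", "temp": 22.5}',
                       PartitionKey="d1")

    # Read records back from the first shard of the stream
    shard = kinesis.describe_stream(StreamName="sensor-stream")["StreamDescription"]["Shards"][0]
    it = kinesis.get_shard_iterator(StreamName="sensor-stream",
                                    ShardId=shard["ShardId"],
                                    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    print(kinesis.get_records(ShardIterator=it)["Records"])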
Flashcard 23
Q: What are Firehose Delivery Streams in Kinesis?
A: Services that collect, transform, and load (ETL) streaming data into destinations such as S3, Redshift, and Splunk.
Flashcard 24
Q: What does Kinesis Analytics do?
A: Runs continuous SQL queries on streaming data from Kinesis Data Streams and Firehose Delivery Streams.
Flashcard 25
Q: What are Kinesis Video Streams used for?
A: Streaming live video from devices to the AWS cloud for real-time video processing and batch-oriented analytics.
Flashcard 26
Q: What is AWS IoT?
A: A service for collecting and managing data from Internet of Things (IoT) devices.
Flashcard 27
Q: What is the Device Gateway in AWS IoT?
A: Enables IoT devices to securely communicate with AWS IoT.
Flashcard 28
Q: What is the Device Registry in AWS IoT?
A: Maintains resources and information associated with each IoT device.
Flashcard 29
Q: What is a Device Shadow in AWS IoT?
A: Maintains the state of a device as a JSON document, allowing applications to interact with devices even when they are offline.
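A sketch of reading and updating a shadow with the boto3 iot-data client (the thing name "thermostat-1" and the desired state are assumptions):
    import json
    import boto3

    iot = boto3.client("iot-data", region_name="us-east-1")

    # Read the shadow document (works even if the device is offline)
    shadow = json.loads(iot.get_thing_shadow(thingName="thermostat-1")["payload"].read())
    print(shadow["state"])

    # Request a new desired state; the device applies it when it reconnects
    iot.update_thing_shadow(thingName="thermostat-1",
                            payload=json.dumps({"state": {"desired": {"setpoint": 21}}}))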
Flashcard 30
Q: What does the Rules Engine in AWS IoT do?
A: Defines rules for processing incoming messages from devices.
Flashcard 31
Q: What is Apache Flume?
A: A distributed system for collecting, aggregating, and moving large amounts of data from various sources to a centralized data store.
Flashcard 32
Q: What is a checkpoint file in Flume?
A: Keeps track of the last committed transactions, acting as a snapshot for data reliability.
Flashcard 33
Q: What are the main components of Flume Architecture?
A: Source, Channel, Sink, and Agent.
Flashcard 34
Q: What is a Source in Flume?
A: The component that receives or polls data from external sources.
Flashcard 35
Q: What is a Channel in Flume?
A: A buffer that holds events received from the source until they are drained by the sink.
Flashcard 36
Q: What is a Sink in Flume?
A: Drains data from the channel to the final data store.
Flashcard 37
Q: What is an Agent in Flume?
A: A collection of sources, channels, and sinks that moves data from external sources to destinations.
Flashcard 38
Q: What is an Event in Flume?
A: A unit of data flow, consisting of a payload and optional attributes.
Flashcard 39
Q: Name the types of Flume Channels.
A: Memory channel, File channel, JDBC channel, and Spillable Memory channel.
Flashcard 40
Q: What is a Memory Channel in Flume?
A: Stores events in memory for fast access.
Flashcard 41
Q: What is a File Channel in Flume?
A: Stores events in files on the local filesystem for durability.
Flashcard 42
Q: What is a JDBC Channel in Flume?
A: Stores events in an embedded Derby database for durable storage.
Flashcard 43
Q: What is a Spillable Memory Channel in Flume?
A: Stores events in an in-memory queue and spills to disk when the queue is full.
Flashcard 44
Q: What is Apache Sqoop?
A: A tool for importing data from relational databases into Hadoop Distributed File System (HDFS), Hive, or HBase, and exporting data back to RDBMS.
Flashcard 45
Q: How does Sqoop import data?
A: By launching multiple map tasks to transfer data as delimited text files, binary Avro files, or Hadoop sequence files.
Flashcard 46
Q: What are Hadoop Sequence Files?
A: A binary file format specific to Hadoop for storing sequences of key-value pairs.
Flashcard 47
Q: What is Apache Avro?
A: A serialization framework that provides rich data structures, a compact binary data format, and container files for data storage and processing.
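A small sketch of Avro serialization using the fastavro library (the schema, record values, and file name are assumptions):
    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "name": "Reading",
        "type": "record",
        "fields": [{"name": "device", "type": "string"},
                   {"name": "temp", "type": "float"}],
    })

    # Write records to an Avro container file, then read them back
    with open("readings.avro", "wb") as out:
        writer(out, schema, [{"device": "d1", "temp": 22.5}])
    with open("readings.avro", "rb") as f:
        for record in reader(f):
            print(record)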
Flashcard 48
Q: What is RabbitMQ?
A: A messaging queue that implements the Advanced Message Queuing Protocol (AMQP) for exchanging messages between systems.
Flashcard 49
Q: What is the Advanced Message Queuing Protocol (AMQP)?
A: A protocol that defines the exchange of messages between systems, specifying roles like producers, consumers, and brokers.
Flashcard 50
Q: In RabbitMQ, what are producers and consumers?
A: Producers publish messages to exchanges, and consumers receive messages from queues based on bindings and routing rules.
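A minimal sketch of the exchange, queue, and binding flow with the pika client (exchange, queue, and routing key names are assumptions):
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()

    # Declare an exchange and a queue, and bind them with a routing key
    channel.exchange_declare(exchange="sensors", exchange_type="direct")
    channel.queue_declare(queue="temperature")
    channel.queue_bind(exchange="sensors", queue="temperature", routing_key="temp")

    # Producer side: publish to the exchange
    channel.basic_publish(exchange="sensors", routing_key="temp",
                          body=b'{"device": "d1", "temp": 22.5}')

    # Consumer side: fetch one message from the bound queue
    method, properties, body = channel.basic_get(queue="temperature", auto_ack=True)
    print(body)
    connection.close()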
Flashcard 51
Q: What is ZeroMQ?
A: A high-performance messaging library that provides tools to build custom messaging systems without requiring a message broker.
Flashcard 52
Q: What messaging patterns does ZeroMQ support?
A: Request-Reply, Publish-Subscribe, Push-Pull, and Exclusive Pair.
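A sketch of one of these patterns, Publish-Subscribe, using pyzmq (the endpoint and the "sensors" topic prefix are assumptions):
    import time
    import zmq

    context = zmq.Context()

    # Publisher: bind a PUB socket and prefix each message with a topic
    publisher = context.socket(zmq.PUB)
    publisher.bind("tcp://*:5556")

    # Subscriber: connect a SUB socket and filter on the topic prefix
    subscriber = context.socket(zmq.SUB)
    subscriber.connect("tcp://localhost:5556")
    subscriber.setsockopt_string(zmq.SUBSCRIBE, "sensors")

    time.sleep(0.5)  # give the subscription time to propagate to the publisher
    publisher.send_string('sensors {"device": "d1", "temp": 22.5}')
    print(subscriber.recv_string())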
Flashcard 53
Q: What is RestMQ?
A: A message queue based on a simple JSON-based protocol using HTTP as the transport, organized as REST resources.
Flashcard 54
Q: How do producers interact with RestMQ?
A: By making HTTP POST requests with data payloads to publish messages to queues.
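A sketch of this interaction with the requests library; the server address and the /q/<queue> resource path are assumptions about a default RestMQ setup, not taken from the text:
    import requests

    # Publish a message to the assumed "sensors" queue resource
    requests.post("http://localhost:8888/q/sensors",
                  data={"value": '{"device": "d1", "temp": 22.5}'})

    # Consume one message from the same queue resource
    print(requests.get("http://localhost:8888/q/sensors").text)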
Flashcard 55
Q: What is Amazon SQS?
A: A scalable and reliable hosted queue service that stores messages for distributed applications.
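A minimal send/receive/delete cycle on a standard queue with boto3 (the queue name and region are assumptions):
    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = sqs.create_queue(QueueName="events")["QueueUrl"]

    # Producer: store a message in the queue
    sqs.send_message(QueueUrl=queue_url, MessageBody='{"event": "click"}')

    # Consumer: receive, process, then delete the message
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=5)
    for msg in resp.get("Messages", []):
        print(msg["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])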
Flashcard 56
Q: What are the two types of queues in Amazon SQS?
A: Standard queues and FIFO (First-In-First-Out) queues.
Flashcard 57
Q: What are the characteristics of Standard Queues in Amazon SQS?
A:
- Guarantees message delivery but not message order.
- Supports a nearly unlimited number of transactions per second.
- Uses at-least-once delivery, so duplicate messages are occasionally delivered.
Flashcard 58
Q: What are the characteristics of FIFO Queues in Amazon SQS?
A:
- Ensures messages are received in the exact order they were sent.
- Supports up to 3,000 messages per second with batching, or 300 messages per second without batching.
- Provides exactly-once processing.
Flashcard 59
Q: What are Connectors in the context of messaging systems?
A: Interfaces that allow data to be published to and consumed from messaging queues, often exposing REST web services or other protocols.
Flashcard 60
Q: How does a REST-based Connector work?
A: Producers publish data using HTTP POST requests with data payloads, and the connector processes the requests and stores the data to the sink.
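A sketch of the connector side using Flask: it accepts HTTP POST requests and hands the payload to a sink (the route, port, and in-memory sink are assumptions):
    from flask import Flask, request

    app = Flask(__name__)
    sink = []  # stand-in for a real data store or messaging queue

    @app.route("/ingest", methods=["POST"])
    def ingest():
        # Store the posted JSON payload in the sink
        sink.append(request.get_json(force=True))
        return "", 204

    if __name__ == "__main__":
        app.run(port=8080)
A producer would then publish by POSTing JSON to that route, for example with requests.post("http://localhost:8080/ingest", json={"temp": 22.5}).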
Flashcard 61
Q: What is a WebSocket-based Connector?
A: A connector that uses full-duplex communication, allowing continuous data exchange without setting up new connections for each message.
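A sketch of a producer pushing several messages over one long-lived connection with the websockets library (the endpoint URL and the acknowledgement reply are assumptions about the connector):
    import asyncio
    import websockets

    async def produce():
        # One full-duplex connection is reused for many messages
        async with websockets.connect("ws://localhost:8765/ingest") as ws:
            for reading in ('{"temp": 22.5}', '{"temp": 22.7}'):
                await ws.send(reading)
                print(await ws.recv())  # assumed acknowledgement from the connector

    asyncio.run(produce())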
Flashcard 62
Q: What is an MQTT-based Connector?
A: A connector that uses MQTT, a lightweight publish-subscribe messaging protocol designed for constrained devices, making it well suited to IoT applications.
Flashcard 63
Q: In MQTT-based systems, what are the main entities?
A: Publisher, Broker/Server, and Subscriber.
Flashcard 64
Q: What is the role of a Publisher in MQTT?
A: Publishes data to topics managed by the broker.
Flashcard 65
Q: What does the Broker/Server do in MQTT?
A: Manages topics and forwards published data to subscribed subscribers.
Flashcard 66
Q: What is the role of a Subscriber in MQTT?
A: Receives data from topics to which it has subscribed.
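A sketch of the publisher/broker/subscriber interaction using the paho-mqtt client, in the paho-mqtt 1.x style (the broker address and topic are assumptions):
    import paho.mqtt.client as mqtt

    def on_message(client, userdata, msg):
        # Subscriber role: called for every message on a subscribed topic
        print(msg.topic, msg.payload)

    client = mqtt.Client()
    client.on_message = on_message
    client.connect("localhost", 1883)          # broker/server
    client.subscribe("sensors/temperature")    # subscriber role
    client.publish("sensors/temperature",      # publisher role
                   '{"device": "d1", "temp": 22.5}')
    client.loop_forever()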