What are the primary source types for data acquisition in big data systems?
Sources that publish data in batches, in micro-batches, or as real-time streams.
How is velocity defined in the context of data acquisition?
The speed at which data is generated and how frequently it is produced.
What characterizes high velocity data?
Real-time or streaming data generated and processed continuously.
What are the two main ingestion mechanisms for data?
Push, driven by the data producer, and pull, driven by the data consumer.
What is Kafka?
A high-throughput distributed messaging system used for building real-time data pipelines and streaming applications.
In Kafka, what is a broker?
A server that manages topics, handles persistence, partitions, and replicates data.
What is a topic in Kafka?
A stream of messages of a particular type, similar to tables in databases.
How does Kafka store messages?
On disk using partitioned commit logs.
What roles do producers and consumers play in Kafka?
Producers publish messages to topics, while consumers subscribe to topics and process messages.
What is a partition in Kafka?
A division of a Kafka topic that allows messages to be consumed in parallel, maintaining an ordered and immutable sequence of messages.
How does Kafka achieve parallel consumption of messages?
By dividing topics into multiple partitions and allowing multiple consumers to read from different partitions simultaneously.
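The partition and offset cards above can be sketched as a toy model in plain Python. This is an illustration of the partitioned-log idea, not the Kafka client API; the `Topic` class and the key-hash partitioner are assumptions made for the sketch.

```python
class Topic:
    """Toy model of a Kafka topic: a fixed set of append-only partition logs."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Like Kafka's default partitioner: hash the key to pick a partition,
        # so all messages with the same key land in the same ordered log.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        # Each partition is read independently, so one consumer per
        # partition gives parallel consumption with per-partition ordering.
        return self.partitions[partition][offset]

topic = Topic(num_partitions=3)
p, off = topic.produce("sensor-1", "reading=21.5")
assert topic.consume(p, off) == "reading=21.5"
```

Because ordering is guaranteed only within a partition, choosing the partition by key is what keeps all messages for one key in order.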
What is the role of a partition leader in Kafka?
The server that handles all read and write operations for a given partition; followers replicate it.
What are replicas in Kafka?
Followers that replicate the data of the leader to ensure fault tolerance.
Describe Kafka's publish-subscribe messaging framework.
Producers publish messages to topics, and consumers subscribe to those topics to receive messages.
What is an offset in Kafka?
A unique sequence ID assigned to each message within a partition.
What is a consumer group in Kafka?
A group of consumers that work together to consume messages from one or more topics, ensuring each message is processed by only one consumer in the group.
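The consumer-group card can be illustrated with a simple round-robin partition assignment. This is a sketch of the assignment idea only, not Kafka's actual group-rebalancing protocol; the function name is an assumption.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: every partition goes to exactly one consumer
    in the group, so each message is processed by only one group member."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions shared by a two-consumer group:
assert assign_partitions([0, 1, 2, 3], ["c1", "c2"]) == {"c1": [0, 2], "c2": [1, 3]}
```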
How does Kafka handle log storage?
Messages are stored in append-only, ordered, and immutable logs.
What is log compaction in Kafka?
A process to clean out obsolete records by retaining only the latest message for each key within a log segment.
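The effect of log compaction can be sketched in a few lines: only the newest record per key survives. This is a simplified model of the outcome, not Kafka's incremental log cleaner.

```python
def compact(log):
    """Keep only the newest record per key, preserving the relative order
    of the surviving records (the effect of Kafka's log cleaner)."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)        # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]

log = [("user1", "a"), ("user2", "b"), ("user1", "c")]
assert compact(log) == [("user2", "b"), ("user1", "c")]
```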
What are log segments in Kafka?
Portions of a topic partition's log stored as directories of segment files.
What determines when log segments are deleted in Kafka?
Reaching size or time limits as defined by the delete policy.
What is Amazon Kinesis?
A managed service for ingesting, processing, and analyzing real-time streaming data on AWS.
What are Kinesis Data Streams?
A service for ingesting and processing streaming data in real time.
What are Firehose Delivery Streams in Kinesis?
Services that collect, transform, and load (ETL) streaming data into destinations such as S3, Redshift, and Splunk.
What does Kinesis Analytics do?
Runs continuous SQL queries on streaming data from Kinesis Data Streams and Firehose Delivery Streams.
What are Kinesis Video Streams used for?
Streaming live video from devices to the AWS cloud for real-time video processing and batch-oriented analytics.
What is AWS IoT?
A service for collecting and managing data from Internet of Things (IoT) devices.
What is the Device Gateway in AWS IoT?
Enables IoT devices to securely communicate with AWS IoT.
What is the Device Registry in AWS IoT?
Maintains resources and information associated with each IoT device.
What is a Device Shadow in AWS IoT?
Maintains the state of a device as a JSON document, allowing applications to interact with devices even when they are offline.
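The shadow card can be illustrated with a toy JSON document: applications set `desired` state, devices report `reported` state, and the delta is what the device still has to apply. The document shape and `delta` helper here are illustrative assumptions, not the AWS IoT SDK.

```python
import json

# Illustrative shadow document: "desired" is set by applications,
# "reported" by the device when it is online.
shadow = json.loads("""
{
  "state": {
    "desired":  {"led": "on",  "brightness": 80},
    "reported": {"led": "off", "brightness": 80}
  }
}
""")

def delta(doc):
    """Fields where the desired state differs from what the device reports."""
    desired = doc["state"]["desired"]
    reported = doc["state"]["reported"]
    return {k: v for k, v in desired.items() if reported.get(k) != v}

assert delta(shadow) == {"led": "on"}
```

Because the shadow persists in the cloud, an application can write `desired` while the device is offline; the device reads the delta when it reconnects.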
What does the Rules Engine in AWS IoT do?
Defines rules for processing incoming messages from devices.
What is Apache Flume?
A distributed system for collecting, aggregating, and moving large amounts of data from various sources to a centralized data store.
What is a checkpoint file in Flume?
Keeps track of the last committed transactions, acting as a snapshot for data reliability.
What are the main components of Flume Architecture?
Source, Channel, Sink, and Agent.
What is a Source in Flume?
The component that receives or polls data from external sources.
What is a Channel in Flume?
Buffers events between the source and the sink, decoupling how fast data arrives from how fast it is drained.
What is a Sink in Flume?
Drains data from the channel to the final data store.
What is an Agent in Flume?
A collection of sources, channels, and sinks that moves data from external sources to destinations.
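An agent wiring a source, channel, and sink together is declared in a properties file. This is a hedged sketch: the agent name `a1`, the HDFS path, and the port are hypothetical values, though the property keys follow Flume's standard configuration format.

```properties
# Hypothetical agent "a1": one netcat source, one memory channel, one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1
```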
What is an Event in Flume?
A unit of data flow, consisting of a payload and optional attributes.
Name the types of Flume Channels.
Memory channel, File channel, JDBC channel, and Spillable Memory channel.
What is a Memory Channel in Flume?
Stores events in memory for fast access.
What is a File Channel in Flume?
Stores events in files on the local filesystem for durability.
What is a JDBC Channel in Flume?
Stores events in an embedded Derby database for durable storage.
What is a Spillable Memory Channel in Flume?
Stores events in an in-memory queue and spills to disk when the queue is full.
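The spill behavior can be sketched with an in-memory queue plus an overflow file. This toy class is illustrative only; Flume's actual spillable channel is transactional and far more careful.

```python
import tempfile
from collections import deque

class SpillableChannel:
    """Toy spillable memory channel: events are queued in memory up to a
    capacity, and overflow events are spilled to a file on disk."""
    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.queue = deque()
        self.overflow = tempfile.NamedTemporaryFile("w+", delete=False)

    def put(self, event):
        if len(self.queue) < self.memory_capacity:
            self.queue.append(event)              # fast path: in-memory
        else:
            self.overflow.write(event + "\n")     # slow path: spill to disk
            self.overflow.flush()

ch = SpillableChannel(memory_capacity=2)
for event in ["e1", "e2", "e3"]:
    ch.put(event)
assert len(ch.queue) == 2   # "e3" was spilled to disk
```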
What is Apache Sqoop?
A tool for importing data from relational databases into Hadoop Distributed File System (HDFS), Hive, or HBase, and exporting data back to RDBMS.
How does Sqoop import data?
By launching multiple map tasks to transfer data as delimited text files, binary Avro files, or Hadoop sequence files.
What are Hadoop Sequence Files?
A binary file format specific to Hadoop for storing sequences of key-value pairs.
What is Apache Avro?
A serialization framework that provides rich data structures, a compact binary data format, and container files for data storage and processing.
What is RabbitMQ?
A messaging queue that implements the Advanced Message Queuing Protocol (AMQP) for exchanging messages between systems.
What is the Advanced Message Queuing Protocol (AMQP)?
A protocol that defines the exchange of messages between systems, specifying roles like producers, consumers, and brokers.
In RabbitMQ, what are producers and consumers?
Producers publish messages to exchanges, and consumers receive messages from queues based on bindings and routing rules.
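The exchange/binding/queue flow can be sketched for a direct exchange, where a message is routed to every queue bound with its routing key. This is a toy model of the AMQP routing idea, not the RabbitMQ client library.

```python
from collections import defaultdict

class DirectExchange:
    """Toy direct exchange: bindings map a routing key to queues, and a
    published message is copied to every queue bound with that key."""
    def __init__(self):
        self.bindings = defaultdict(list)   # routing_key -> [queue, ...]

    def bind(self, queue, routing_key):
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        for queue in self.bindings[routing_key]:
            queue.append(message)

orders, audit = [], []
ex = DirectExchange()
ex.bind(orders, "order.created")
ex.bind(audit, "order.created")
ex.publish("order.created", {"id": 42})
assert orders == [{"id": 42}] and audit == [{"id": 42}]
```

Consumers then receive from the queues; the producer never needs to know which queues exist, only the exchange and routing key.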
What is ZeroMQ?
A high-performance messaging library that provides tools to build custom messaging systems without requiring a message broker.
What messaging patterns does ZeroMQ support?
Request-Reply, Publish-Subscribe, Push-Pull, and Exclusive Pair.
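In ZeroMQ's Publish-Subscribe pattern, a SUB socket filters messages by prefix: a message is delivered if it starts with any subscribed prefix, and the empty subscription matches everything. The helper below sketches that filtering rule in plain Python; it is not the pyzmq API.

```python
def matches(subscription, message):
    """ZeroMQ-style subscription filtering: a SUB socket delivers a message
    only if it starts with a subscribed prefix ("" matches everything)."""
    return message.startswith(subscription)

assert matches("weather.", "weather.london 14C")
assert not matches("weather.", "sports.score 2-1")
assert matches("", "anything")
```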
What is RestMQ?
A message queue based on a simple JSON-based protocol using HTTP as the transport, organized as REST resources.
How do producers interact with RestMQ?
By making HTTP POST requests with data payloads to publish messages to queues.
What is Amazon SQS?
A scalable and reliable hosted queue service that stores messages for distributed applications.
What are the two types of queues in Amazon SQS?
Standard queues and FIFO (First-In-First-Out) queues.
What are the characteristics of Standard Queues in Amazon SQS?
Guarantees message delivery but not order. Supports unlimited transactions per second. Operates on an at-least-once delivery model, occasionally delivering duplicate messages.
What are the characteristics of FIFO Queues in Amazon SQS?
Ensures messages are received in the exact order they were sent. Supports up to 3,000 messages per second with batching or 300 messages per second without batching. Provides exactly-once processing.
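The FIFO guarantees can be sketched with a queue that preserves send order and drops retries by deduplication id, which is roughly how exactly-once processing is achieved. This is a toy model, not the boto3 SQS client; the class and method names are assumptions.

```python
class FifoQueue:
    """Toy FIFO queue: preserves send order and drops duplicates by
    message deduplication id (sketching SQS FIFO's exactly-once idea)."""
    def __init__(self):
        self.messages = []
        self.seen_ids = set()

    def send(self, dedup_id, body):
        if dedup_id in self.seen_ids:
            return False                 # duplicate retry: silently dropped
        self.seen_ids.add(dedup_id)
        self.messages.append(body)       # appended in arrival order
        return True

q = FifoQueue()
q.send("m1", "first")
q.send("m1", "first")   # producer retry of the same message
q.send("m2", "second")
assert q.messages == ["first", "second"]
```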
What are Connectors in the context of messaging systems?
Interfaces that allow data to be published to and consumed from messaging queues, often exposing REST web services or other protocols.
How does a REST-based Connector work?
Producers publish data using HTTP POST requests with data payloads, and the connector processes the requests and stores the data to the sink.
What is a WebSocket-based Connector?
A connector that uses full-duplex communication, allowing continuous data exchange without setting up new connections for each message.
What is an MQTT-based Connector?
A lightweight, publish-subscribe messaging protocol designed for constrained devices, suitable for IoT applications.
In MQTT-based systems, what are the main entities?
Publisher, Broker/Server, and Subscriber.
What is the role of a Publisher in MQTT?
Publishes data to topics managed by the broker.
What does the Broker/Server do in MQTT?
Manages topics and forwards published data to subscribed subscribers.
What is the role of a Subscriber in MQTT?
Receives data from topics to which it has subscribed.
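Subscriptions in MQTT are topic filters in which `+` matches exactly one level and `#` matches all remaining levels. The function below sketches that matching rule in plain Python; it is a simplified illustration (for example, it ignores the corner case where `sport/#` also matches `sport` itself), not an MQTT client library.

```python
def topic_matches(filter_, topic):
    """Simplified MQTT topic filter matching: '+' matches one level,
    '#' (last level of the filter) matches the rest of the topic."""
    f_parts, t_parts = filter_.split("/"), topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True                  # multi-level wildcard ends the check
        if i >= len(t_parts):
            return False                 # filter is longer than the topic
        if f != "+" and f != t_parts[i]:
            return False                 # literal level mismatch
    return len(f_parts) == len(t_parts)

assert topic_matches("home/+/temperature", "home/kitchen/temperature")
assert topic_matches("home/#", "home/kitchen/humidity")
assert not topic_matches("home/+/temperature", "home/kitchen/humidity")
```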