What are the primary source types for data acquisition in big data systems?
Sources that publish data in batches, in micro-batches, or as real-time streams.
How is velocity defined in the context of data acquisition?
The speed at which data is generated and how frequently it is produced.
What characterizes high velocity data?
Real-time or streaming data generated and processed continuously.
What are the two main ingestion mechanisms for data?
Push, driven by the data producer, and pull, driven by the data consumer.
What is Kafka?
A high-throughput distributed messaging system used for building real-time data pipelines and streaming applications.
In Kafka, what is a broker?
A server that manages topics, handles persistence, partitions, and replicates data.
What is a topic in Kafka?
A stream of messages of a particular type, similar to tables in databases.
How does Kafka store messages?
On disk using partitioned commit logs.
What roles do producers and consumers play in Kafka?
Producers publish messages to topics, while consumers subscribe to topics and process messages.
What is a partition in Kafka?
A division of a Kafka topic that allows messages to be consumed in parallel, maintaining an ordered and immutable sequence of messages.
How does Kafka achieve parallel consumption of messages?
By dividing topics into multiple partitions and allowing multiple consumers to read from different partitions simultaneously.
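The partition and offset cards above can be sketched as a toy model in plain Python. This is an illustration of the partitioned-log idea, not the Kafka client API; the `Topic` class and the key-hash partitioner are assumptions made for the sketch.

```python
class Topic:
    """Toy model of a Kafka topic: a fixed set of append-only partition logs."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Like Kafka's default partitioner: hash the key to pick a partition,
        # so all messages with the same key land in the same ordered log.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def consume(self, partition, offset):
        # Each partition is read independently, so one consumer per
        # partition gives parallel consumption with per-partition ordering.
        return self.partitions[partition][offset]

topic = Topic(num_partitions=3)
p, off = topic.produce("sensor-1", "reading=21.5")
assert topic.consume(p, off) == "reading=21.5"
```

Because ordering is guaranteed only within a partition, choosing the partition by key is what keeps all messages for one key in order.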
What is the role of a partition leader in Kafka?
The server that handles all read and write operations for a given partition; followers replicate it.
What are replicas in Kafka?
Followers that replicate the data of the leader to ensure fault tolerance.
Describe Kafka's publish-subscribe messaging framework.
Producers publish messages to topics, and consumers subscribe to those topics to receive messages.
What is an offset in Kafka?
A unique sequence ID assigned to each message within a partition.
What is a consumer group in Kafka?
A group of consumers that work together to consume messages from one or more topics, ensuring each message is processed by only one consumer in the group.
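The consumer-group card can be illustrated with a simple round-robin partition assignment. This is a sketch of the assignment idea only, not Kafka's actual group-rebalancing protocol; the function name is an assumption.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: every partition goes to exactly one consumer
    in the group, so each message is processed by only one group member."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions shared by a two-consumer group:
assert assign_partitions([0, 1, 2, 3], ["c1", "c2"]) == {"c1": [0, 2], "c2": [1, 3]}
```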
How does Kafka handle log storage?
Messages are stored in append-only, ordered, and immutable logs.
What is log compaction in Kafka?
A process to clean out obsolete records by retaining only the latest message for each key within a log segment.
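The effect of log compaction can be sketched in a few lines: only the newest record per key survives. This is a simplified model of the outcome, not Kafka's incremental log cleaner.

```python
def compact(log):
    """Keep only the newest record per key, preserving the relative order
    of the surviving records (the effect of Kafka's log cleaner)."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)        # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (offset, value) in survivors]

log = [("user1", "a"), ("user2", "b"), ("user1", "c")]
assert compact(log) == [("user2", "b"), ("user1", "c")]
```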
What are log segments in Kafka?
Portions of a topic partition's log stored as directories of segment files.
What determines when log segments are deleted in Kafka?
Reaching size or time limits as defined by the delete policy.
What is Amazon Kinesis?
A managed service for ingesting, processing, and analyzing real-time streaming data on AWS.
What are Kinesis Data Streams?
A service for ingesting and processing streaming data in real time.
What are Firehose Delivery Streams in Kinesis?
Services that collect, transform, and load (ETL) streaming data into destinations such as S3, Redshift, and Splunk.
What does Kinesis Analytics do?
Runs continuous SQL queries on streaming data from Kinesis Data Streams and Firehose Delivery Streams.
What are Kinesis Video Streams used for?
Streaming live video from devices to the AWS cloud for real-time video processing and batch-oriented analytics.
What is AWS IoT?
A service for collecting and managing data from Internet of Things (IoT) devices.
What is the Device Gateway in AWS IoT?
Enables IoT devices to securely communicate with AWS IoT.
What is the Device Registry in AWS IoT?
Maintains resources and information associated with each IoT device.
What is a Device Shadow in AWS IoT?
Maintains the state of a device as a JSON document, allowing applications to interact with devices even when they are offline.
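The shadow card can be illustrated with a toy JSON document: applications set `desired` state, devices report `reported` state, and the delta is what the device still has to apply. The document shape and `delta` helper here are illustrative assumptions, not the AWS IoT SDK.

```python
import json

# Illustrative shadow document: "desired" is set by applications,
# "reported" by the device when it is online.
shadow = json.loads("""
{
  "state": {
    "desired":  {"led": "on",  "brightness": 80},
    "reported": {"led": "off", "brightness": 80}
  }
}
""")

def delta(doc):
    """Fields where the desired state differs from what the device reports."""
    desired = doc["state"]["desired"]
    reported = doc["state"]["reported"]
    return {k: v for k, v in desired.items() if reported.get(k) != v}

assert delta(shadow) == {"led": "on"}
```

Because the shadow persists in the cloud, an application can write `desired` while the device is offline; the device reads the delta when it reconnects.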
What does the Rules Engine in AWS IoT do?
Defines rules for processing incoming messages from devices.
What is Apache Flume?
A distributed system for collecting, aggregating, and moving large amounts of data from various sources to a centralized data store.
What is a checkpoint file in Flume?
Keeps track of the last committed transactions, acting as a snapshot for data reliability.
What are the main components of Flume Architecture?
Source, Channel, Sink, and Agent.
What is a Source in Flume?
The component that receives or polls data from external sources.
What is a Channel in Flume?
Buffers events between the source and the sink, decoupling how fast data arrives from how fast it is drained.
What is a Sink in Flume?
Drains data from the channel to the final data store.
What is an Agent in Flume?
A collection of sources, channels, and sinks that moves data from external sources to destinations.
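An agent wiring a source, channel, and sink together is declared in a properties file. This is a hedged sketch: the agent name `a1`, the HDFS path, and the port are hypothetical values, though the property keys follow Flume's standard configuration format.

```properties
# Hypothetical agent "a1": one netcat source, one memory channel, one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1
```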
What is an Event in Flume?
A unit of data flow, consisting of a payload and optional attributes.
Name the types of Flume Channels.
Memory channel, File channel, JDBC channel, and Spillable Memory channel.
What is a Memory Channel in Flume?
Stores events in memory for fast access.
What is a File Channel in Flume?
Stores events in files on the local filesystem for durability.
What is a JDBC Channel in Flume?
Stores events in an embedded Derby database for durable storage.
What is a Spillable Memory Channel in Flume?
Stores events in an in-memory queue and spills to disk when the queue is full.
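The spill behavior can be sketched with an in-memory queue plus an overflow file. This toy class is illustrative only; Flume's actual spillable channel is transactional and far more careful.

```python
import tempfile
from collections import deque

class SpillableChannel:
    """Toy spillable memory channel: events are queued in memory up to a
    capacity, and overflow events are spilled to a file on disk."""
    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.queue = deque()
        self.overflow = tempfile.NamedTemporaryFile("w+", delete=False)

    def put(self, event):
        if len(self.queue) < self.memory_capacity:
            self.queue.append(event)              # fast path: in-memory
        else:
            self.overflow.write(event + "\n")     # slow path: spill to disk
            self.overflow.flush()

ch = SpillableChannel(memory_capacity=2)
for event in ["e1", "e2", "e3"]:
    ch.put(event)
assert len(ch.queue) == 2   # "e3" was spilled to disk
```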
What is Apache Sqoop?
A tool for importing data from relational databases into Hadoop Distributed File System (HDFS), Hive, or HBase, and exporting data back to RDBMS.
How does Sqoop import data?
By launching multiple map tasks to transfer data as delimited text files, binary Avro files, or Hadoop sequence files.
What are Hadoop Sequence Files?
A binary file format specific to Hadoop for storing sequences of key-value pairs.
What is Apache Avro?
A serialization framework that provides rich data structures, a compact binary data format, and container files for data storage and processing.
What is RabbitMQ?
A messaging queue that implements the Advanced Message Queuing Protocol (AMQP) for exchanging messages between systems.
What is the Advanced Message Queuing Protocol (AMQP)?
A protocol that defines the exchange of messages between systems, specifying roles like producers, consumers, and brokers.
In RabbitMQ, what are producers and consumers?
Producers publish messages to exchanges, and consumers receive messages from queues based on bindings and routing rules.
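The exchange/binding/queue flow can be sketched for a direct exchange, where a message is routed to every queue bound with its routing key. This is a toy model of the AMQP routing idea, not the RabbitMQ client library.

```python
from collections import defaultdict

class DirectExchange:
    """Toy direct exchange: bindings map a routing key to queues, and a
    published message is copied to every queue bound with that key."""
    def __init__(self):
        self.bindings = defaultdict(list)   # routing_key -> [queue, ...]

    def bind(self, queue, routing_key):
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        for queue in self.bindings[routing_key]:
            queue.append(message)

orders, audit = [], []
ex = DirectExchange()
ex.bind(orders, "order.created")
ex.bind(audit, "order.created")
ex.publish("order.created", {"id": 42})
assert orders == [{"id": 42}] and audit == [{"id": 42}]
```

Consumers then receive from the queues; the producer never needs to know which queues exist, only the exchange and routing key.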
What is ZeroMQ?
A high-performance messaging library that provides tools to build custom messaging systems without requiring a message broker.
What messaging patterns does ZeroMQ support?
Request-Reply, Publish-Subscribe, Push-Pull, and Exclusive Pair.
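In ZeroMQ's Publish-Subscribe pattern, a SUB socket filters messages by prefix: a message is delivered if it starts with any subscribed prefix, and the empty subscription matches everything. The helper below sketches that filtering rule in plain Python; it is not the pyzmq API.

```python
def matches(subscription, message):
    """ZeroMQ-style subscription filtering: a SUB socket delivers a message
    only if it starts with a subscribed prefix ("" matches everything)."""
    return message.startswith(subscription)

assert matches("weather.", "weather.london 14C")
assert not matches("weather.", "sports.score 2-1")
assert matches("", "anything")
```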
What is RestMQ?
A message queue based on a simple JSON-based protocol using HTTP as the transport, organized as REST resources.
How do producers interact with RestMQ?
By making HTTP POST requests with data payloads to publish messages to queues.
What is Amazon SQS?
A scalable and reliable hosted queue service that stores messages for distributed applications.
What are the two types of queues in Amazon SQS?
Standard queues and FIFO (First-In-First-Out) queues.
What are the characteristics of Standard Queues in Amazon SQS?
Guarantees message delivery but not order. Supports unlimited transactions per second. Operates on an at-least-once delivery model, occasionally delivering duplicate messages.
What are the characteristics of FIFO Queues in Amazon SQS?
Ensures messages are received in the exact order they were sent. Supports up to 3,000 messages per second with batching or 300 messages per second without batching. Provides exactly-once processing.
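The FIFO guarantees can be sketched with a queue that preserves send order and drops retries by deduplication id, which is roughly how exactly-once processing is achieved. This is a toy model, not the boto3 SQS client; the class and method names are assumptions.

```python
class FifoQueue:
    """Toy FIFO queue: preserves send order and drops duplicates by
    message deduplication id (sketching SQS FIFO's exactly-once idea)."""
    def __init__(self):
        self.messages = []
        self.seen_ids = set()

    def send(self, dedup_id, body):
        if dedup_id in self.seen_ids:
            return False                 # duplicate retry: silently dropped
        self.seen_ids.add(dedup_id)
        self.messages.append(body)       # appended in arrival order
        return True

q = FifoQueue()
q.send("m1", "first")
q.send("m1", "first")   # producer retry of the same message
q.send("m2", "second")
assert q.messages == ["first", "second"]
```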
What are Connectors in the context of messaging systems?
Interfaces that allow data to be published to and consumed from messaging queues, often exposing REST web services or other protocols.
How does a REST-based Connector work?
Producers publish data using HTTP POST requests with data payloads, and the connector processes the requests and stores the data to the sink.
What is a WebSocket-based Connector?
A connector that uses full-duplex communication, allowing continuous data exchange without setting up new connections for each message.
What is an MQTT-based Connector?
A lightweight, publish-subscribe messaging protocol designed for constrained devices, suitable for IoT applications.
In MQTT-based systems, what are the main entities?
Publisher, Broker/Server, and Subscriber.
What is the role of a Publisher in MQTT?
Publishes data to topics managed by the broker.
What does the Broker/Server do in MQTT?
Manages topics and forwards published data to subscribed subscribers.
What is the role of a Subscriber in MQTT?
Receives data from topics to which it has subscribed.
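Subscriptions in MQTT are topic filters in which `+` matches exactly one level and `#` matches all remaining levels. The function below sketches that matching rule in plain Python; it is a simplified illustration (for example, it ignores the corner case where `sport/#` also matches `sport` itself), not an MQTT client library.

```python
def topic_matches(filter_, topic):
    """Simplified MQTT topic filter matching: '+' matches one level,
    '#' (last level of the filter) matches the rest of the topic."""
    f_parts, t_parts = filter_.split("/"), topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True                  # multi-level wildcard ends the check
        if i >= len(t_parts):
            return False                 # filter is longer than the topic
        if f != "+" and f != t_parts[i]:
            return False                 # literal level mismatch
    return len(f_parts) == len(t_parts)

assert topic_matches("home/+/temperature", "home/kitchen/temperature")
assert topic_matches("home/#", "home/kitchen/humidity")
assert not topic_matches("home/+/temperature", "home/kitchen/humidity")
```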