
chapter 9

Flashcard 1

Q: What are the primary source types for data acquisition in big data systems? 

A: Sources that publish data in batches, in micro-batches, or as continuous real-time streams.


Flashcard 2

Q: How is velocity defined in the context of data acquisition? 

A: The speed at which data is generated and how frequently it is produced.


Flashcard 3

Q: What characterizes high velocity data? 

A: Real-time or streaming data generated and processed continuously.


Flashcard 4

Q: What are the two main ingestion mechanisms for data? 

A: Push and pull mechanisms: in push, the data source sends data to the ingestion system; in pull, the data consumer retrieves data from the source.


Flashcard 5

Q: What is Kafka? 

A: A high-throughput distributed messaging system used for building real-time data pipelines and streaming applications.



Flashcard 6

Q: In Kafka, what is a broker?

A: A server that manages topics, persists messages, and handles partitioning and replication of data.


Flashcard 7

Q: What is a topic in Kafka? 

A: A stream of messages of a particular type, similar to tables in databases.


Flashcard 8

Q: How does Kafka store messages? 

A: On disk using partitioned commit logs.


Flashcard 9

Q: What roles do producers and consumers play in Kafka? 

A: Producers publish messages to topics, while consumers subscribe to topics and process messages.


Flashcard 10

Q: What is a partition in Kafka? 

A: A division of a Kafka topic that allows messages to be consumed in parallel, maintaining an ordered and immutable sequence of messages.
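
The partition mechanics described in cards 8, 10, and 15 can be tied together with a small in-memory sketch (plain Python, not the Kafka client library; `Topic` and `Partition` are illustrative names, and real Kafka partitioners use a different key hash):

```python
class Partition:
    """An ordered, append-only sequence of messages; the list index is the offset."""
    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset assigned to this message

class Topic:
    """Toy model of a Kafka topic: messages are routed to partitions by
    key hash, and each message gets a unique per-partition offset."""
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def publish(self, key, value):
        p = hash(key) % len(self.partitions)  # same key -> same partition
        offset = self.partitions[p].append(value)
        return p, offset

topic = Topic(num_partitions=3)
p1, o1 = topic.publish("sensor-1", "t=21.5")
p2, o2 = topic.publish("sensor-1", "t=21.7")
# Messages with the same key land in the same partition, in publish order,
# with consecutive offsets.
```

Ordering is guaranteed only within a partition, which is why keyed messages that must stay ordered share a key.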




Flashcard 11

Q: How does Kafka achieve parallel consumption of messages? 

A: By dividing topics into multiple partitions and allowing multiple consumers to read from different partitions simultaneously.


Flashcard 12

Q: What is the role of a partition leader in Kafka? 

A: The broker that handles all read and write operations for a given partition.


Flashcard 13

Q: What are replicas in Kafka? 

A: Followers that replicate the data of the leader to ensure fault tolerance.


Flashcard 14

Q: Describe Kafka's publish-subscribe messaging framework. 

A: Producers publish messages to topics, and consumers subscribe to those topics to receive messages.


Flashcard 15

Q: What is an offset in Kafka? 

A: A unique sequence ID assigned to each message within a partition.




Flashcard 16

Q: What is a consumer group in Kafka? 

A: A group of consumers that work together to consume messages from one or more topics, ensuring each message is processed by only one consumer in the group.
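
The consumer-group invariant (each partition is owned by exactly one consumer in the group) can be sketched as a simple round-robin assignment. Real Kafka rebalancing protocols are considerably more involved; `assign_partitions` is an illustrative name:

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: every partition goes to exactly one
    consumer in the group, so each message is processed by one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

assignment = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
# assignment == {"c1": [0, 2], "c2": [1, 3]}
```

Because ownership is exclusive within a group, adding consumers beyond the partition count leaves some consumers idle.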


Flashcard 17

Q: How does Kafka handle log storage?

A: Messages are stored in append-only, ordered, and immutable logs.


Flashcard 18

Q: What is log compaction in Kafka? 

A: A process to clean out obsolete records by retaining only the latest message for each key within a log segment.
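
A minimal sketch of the compaction rule (keep only the latest record per key, at its original log position). This is the retention semantics only, not Kafka's actual segment-by-segment cleaner:

```python
def compact(log):
    """Keep, for each key, only its latest record, preserving the
    offset order of the surviving records."""
    latest_index = {}
    for i, (key, _value) in enumerate(log):
        latest_index[key] = i  # later records for a key win
    return [log[i] for i in sorted(latest_index.values())]

log = [("user1", "a"), ("user2", "b"), ("user1", "c")]
compacted = compact(log)
# compacted == [("user2", "b"), ("user1", "c")]
```

After compaction the log still contains at least the final value for every key, which is what lets a consumer rebuild current state by replaying it.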


Flashcard 19

Q: What are log segments in Kafka? 

A: Portions of a topic partition's log stored as directories of segment files.


Flashcard 20

Q: What determines when log segments are deleted in Kafka? 

A: Reaching size or time limits as defined by the delete policy.




Flashcard 21

Q: What is Amazon Kinesis?

A: A managed service for ingesting, processing, and analyzing real-time streaming data on AWS.


Flashcard 22

Q: What are Kinesis Data Streams?

A: Services that allow ingestion and processing of streaming data in real-time.


Flashcard 23

Q: What are Firehose Delivery Streams in Kinesis? 

A: Services that collect, transform, and load (ETL) streaming data into destinations such as S3, Redshift, and Splunk.


Flashcard 24

Q: What does Kinesis Analytics do? 

A: Runs continuous SQL queries on streaming data from Kinesis Data Streams and Firehose Delivery Streams.


Flashcard 25

Q: What are Kinesis Video Streams used for? 

A: Streaming live video from devices to the AWS cloud for real-time video processing and batch-oriented analytics.



Flashcard 26

Q: What is AWS IoT?

A: A service for collecting and managing data from Internet of Things (IoT) devices.


Flashcard 27

Q: What is the Device Gateway in AWS IoT? 

A: Enables IoT devices to securely communicate with AWS IoT.


Flashcard 28

Q: What is the Device Registry in AWS IoT? 

A: Maintains resources and information associated with each IoT device.


Flashcard 29

Q: What is a Device Shadow in AWS IoT? 

A: Maintains the state of a device as a JSON document, allowing applications to interact with devices even when they are offline.
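
A Device Shadow's JSON document carries a desired state and a reported state; the fields where they differ form the delta the device must act on. A minimal sketch of that comparison (hand-rolled; the `shadow_delta` helper is illustrative, not an AWS SDK call, and real shadows handle nesting and metadata):

```python
import json

def shadow_delta(desired, reported):
    """Fields where the desired state differs from the reported state."""
    return {k: v for k, v in desired.items() if reported.get(k) != v}

shadow = {
    "state": {
        "desired":  {"led": "on",  "interval": 30},
        "reported": {"led": "off", "interval": 30},
    }
}
document = json.dumps(shadow)  # the shadow is stored as a JSON document
delta = shadow_delta(shadow["state"]["desired"], shadow["state"]["reported"])
# delta == {"led": "on"}: the device should turn the LED on when it reconnects
```

Because the shadow persists in the cloud, an application can set `desired` while the device is offline and the device reconciles the delta later.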


Flashcard 30

Q: What does the Rules Engine in AWS IoT do? 

A: Defines rules for processing incoming messages from devices.





Flashcard 31

Q: What is Apache Flume?

A: A distributed system for collecting, aggregating, and moving large amounts of data from various sources to a centralized data store.


Flashcard 32

Q: What is a checkpoint file in Flume? 

A: Keeps track of the last committed transactions, acting as a snapshot for data reliability.


Flashcard 33

Q: What are the main components of the Flume architecture?

A: Source, Channel, Sink, and Agent.


Flashcard 34

Q: What is a Source in Flume? 

A: The component that receives or polls data from external sources.


Flashcard 35

Q: What is a Channel in Flume? 

A: A buffer that holds events between the source and the sink until the sink drains them.





Flashcard 36

Q: What is a Sink in Flume? 

A: Drains data from the channel to the final data store.


Flashcard 37

Q: What is an Agent in Flume? 

A: A collection of sources, channels, and sinks that moves data from external sources to destinations.


Flashcard 38

Q: What is an Event in Flume? 

A: A unit of data flow, consisting of a payload and optional attributes.
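
Cards 34–38 fit together as one pipeline: source → channel → sink, moving events (payload plus optional attributes). A toy sketch of that flow in plain Python (illustrative names; a real agent is configured declaratively and runs sources and sinks concurrently):

```python
from collections import deque

class Agent:
    """Toy Flume-style agent: a source puts events into a channel
    (a queue) and a sink drains them to a destination store."""
    def __init__(self, sink_store):
        self.channel = deque()        # stands in for a memory channel
        self.sink_store = sink_store  # stands in for HDFS, etc.

    def source_receive(self, payload, **attributes):
        event = {"payload": payload, "attributes": attributes}
        self.channel.append(event)    # source -> channel

    def sink_drain(self):
        while self.channel:           # channel -> sink -> store
            self.sink_store.append(self.channel.popleft())

store = []
agent = Agent(store)
agent.source_receive("192.168.0.1 GET /index.html", host="web-1")
agent.sink_drain()
# store now holds the event that flowed through the channel
```

The channel decouples the two ends: the source can keep accepting events while the sink is slow or temporarily down.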


Flashcard 39

Q: What are the types of Flume channels?

A: Memory channel, File channel, JDBC channel, and Spillable Memory channel.


Flashcard 40

Q: What is a Memory Channel in Flume?

A: Stores events in memory for fast access.





Flashcard 41

Q: What is a File Channel in Flume? 

A: Stores events in files on the local filesystem for durability.


Flashcard 42

Q: What is a JDBC Channel in Flume?

A: Stores events in an embedded Derby database for durable storage.


Flashcard 43

Q: What is a Spillable Memory Channel in Flume? 

A: Stores events in an in-memory queue and spills to disk when the queue is full.
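
The spill behavior can be sketched as a bounded in-memory queue with an overflow store. Here a plain list stands in for the on-disk overflow; Flume's real spillable channel uses an embedded file channel and transactional semantics:

```python
class SpillableChannel:
    """Toy spillable memory channel: events live in a bounded in-memory
    queue and overflow to a slower 'disk' store when the queue is full."""
    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = []
        self.disk = []  # stand-in for the on-disk overflow

    def put(self, event):
        if len(self.memory) < self.memory_capacity:
            self.memory.append(event)
        else:
            self.disk.append(event)  # spill once memory is full

    def take(self):
        # Drain memory first, then the spilled events.
        return self.memory.pop(0) if self.memory else self.disk.pop(0)

ch = SpillableChannel(memory_capacity=2)
for e in ["e1", "e2", "e3"]:
    ch.put(e)
# "e1" and "e2" sit in memory; "e3" spilled to disk
```

This trades the speed of the memory channel against the durability of the file channel, degrading gracefully under bursts.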


Flashcard 44

Q: What is Apache Sqoop?

A: A tool for importing data from relational databases into Hadoop Distributed File System (HDFS), Hive, or HBase, and exporting data back to RDBMS.


Flashcard 45

Q: How does Sqoop import data? 

A: By launching multiple map tasks to transfer data as delimited text files, binary Avro files, or Hadoop sequence files.




Flashcard 46

Q: What are Hadoop Sequence Files?

A: A binary file format specific to Hadoop for storing sequences of key-value pairs.


Flashcard 47

Q: What is Apache Avro?

A: A serialization framework that provides rich data structures, a compact binary data format, and container files for data storage and processing.
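
An Avro schema is itself a JSON document describing a record type. The schema below follows Avro's record syntax; the field check is a hand-rolled sketch of what the `avro` library does when (de)serializing, not the library itself:

```python
# Illustrative Avro record schema ("Measurement" is an invented name).
schema = {
    "type": "record",
    "name": "Measurement",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "value", "type": "double"},
    ],
}

def matches(record, schema):
    """Check that a dict supplies exactly the fields the schema names."""
    expected = {f["name"] for f in schema["fields"]}
    return set(record) == expected

ok = matches({"device_id": "d1", "value": 21.5}, schema)
# ok is True; a record missing "value" would not match
```

Because the schema travels with Avro container files, readers can decode the compact binary data without out-of-band agreements.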


Flashcard 48

Q: What is RabbitMQ?

A: A messaging queue that implements the Advanced Message Queuing Protocol (AMQP) for exchanging messages between systems.


Flashcard 49

Q: What is the Advanced Message Queuing Protocol (AMQP)?

A: A protocol that defines the exchange of messages between systems, specifying roles like producers, consumers, and brokers.


Flashcard 50

Q: In RabbitMQ, what are producers and consumers?

A: Producers publish messages to exchanges, and consumers receive messages from queues based on bindings and routing rules.
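
The exchange–binding–queue routing can be sketched with an in-memory model of an AMQP direct exchange (plain Python, not the `pika` client; in real RabbitMQ the consumers would then read from the queues over AMQP):

```python
class DirectExchange:
    """Toy AMQP direct exchange: queues are bound with a routing key,
    and a published message is copied to every matching queue."""
    def __init__(self):
        self.bindings = {}  # routing key -> list of bound queues

    def bind(self, queue, routing_key):
        self.bindings.setdefault(routing_key, []).append(queue)

    def publish(self, routing_key, message):
        for queue in self.bindings.get(routing_key, []):
            queue.append(message)

error_queue, info_queue = [], []
exchange = DirectExchange()
exchange.bind(error_queue, "error")
exchange.bind(info_queue, "info")
exchange.publish("error", "disk full")  # producer -> exchange -> queue
# only error_queue receives the message
```

The indirection matters: producers never address queues directly, so routing topology can change without touching producer code.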



Flashcard 51

Q: What is ZeroMQ?

A: A high-performance messaging library that provides tools to build custom messaging systems without requiring a message broker.


Flashcard 52

Q: What messaging patterns does ZeroMQ support? 

A: Request-Reply, Publish-Subscribe, Push-Pull, and Exclusive Pair.


Flashcard 53

Q: What is RestMQ?

A: A message queue based on a simple JSON-based protocol using HTTP as the transport, organized as REST resources.


Flashcard 54

Q: How do producers interact with RestMQ?

A: By making HTTP POST requests with data payloads to publish messages to queues.


Flashcard 55

Q: What is Amazon SQS?

A: A scalable and reliable hosted queue service that stores messages for distributed applications.




Flashcard 56

Q: What are the two types of queues in Amazon SQS?

A: Standard queues and FIFO (First-In-First-Out) queues.


Flashcard 57

Q: What are the characteristics of Standard Queues in Amazon SQS? 

A:

  • Guarantees message delivery but not order.

  • Supports a nearly unlimited number of transactions per second.

  • Operates on an at-least-once delivery model, occasionally delivering duplicate messages.


Flashcard 58

Q: What are the characteristics of FIFO Queues in Amazon SQS? 

A:

  • Ensures messages are received in the exact order they were sent.

  • Supports up to 3,000 messages per second with batching or 300 messages per second without batching.

  • Provides exactly-once processing.
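
The FIFO guarantees (strict ordering plus exactly-once processing via deduplication) can be sketched in memory. This is a toy model, not the boto3 API; real SQS deduplicates within a five-minute window rather than forever:

```python
class FifoQueue:
    """Toy SQS-FIFO-style queue: strict send order plus deduplication
    by message deduplication ID."""
    def __init__(self):
        self.messages = []
        self.seen_ids = set()

    def send(self, dedup_id, body):
        if dedup_id in self.seen_ids:
            return False  # duplicate send is silently dropped
        self.seen_ids.add(dedup_id)
        self.messages.append(body)
        return True

    def receive(self):
        return self.messages.pop(0)  # always the oldest message first

q = FifoQueue()
q.send("m1", "first")
q.send("m2", "second")
q.send("m1", "first")  # a producer retry of m1: ignored
# receive() yields "first" then "second", each exactly once
```

A standard queue, by contrast, would be free to deliver the retry as a duplicate and to reorder the two messages.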


Flashcard 59

Q: What are Connectors in the context of messaging systems? 

A: Interfaces that allow data to be published to and consumed from messaging queues, often exposing REST web services or other protocols.


Flashcard 60

Q: How does a REST-based Connector work? 

A: Producers publish data using HTTP POST requests with data payloads, and the connector processes the requests and stores the data to the sink.


Flashcard 61

Q: What is a WebSocket-based Connector?

A: A connector that uses full-duplex communication, allowing continuous data exchange without setting up new connections for each message.


Flashcard 62

Q: What is an MQTT-based Connector? 

A: A connector built on MQTT, a lightweight publish-subscribe messaging protocol designed for constrained devices and well suited to IoT applications.


Flashcard 63

Q: In MQTT-based systems, what are the main entities?

A: Publisher, Broker/Server, and Subscriber.
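
The three roles in cards 63–66 can be wired together in a few lines. This is an in-memory sketch of the publish-subscribe flow, not a real MQTT client such as paho-mqtt, and it omits QoS levels and topic wildcards:

```python
class Broker:
    """Toy MQTT-style broker: manages topics and forwards each
    published message to every subscriber of that topic."""
    def __init__(self):
        self.subscriptions = {}  # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self.subscriptions.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for callback in self.subscriptions.get(topic, []):
            callback(topic, payload)

received = []
broker = Broker()
# Subscriber: registers interest in a topic with the broker.
broker.subscribe("home/temperature", lambda t, p: received.append((t, p)))
# Publisher: sends data to a topic; the broker forwards it.
broker.publish("home/temperature", "21.5")
broker.publish("home/humidity", "40")  # no subscribers: nothing delivered
```

Publisher and subscriber never talk to each other directly, which is what makes the pattern a good fit for fleets of intermittently connected devices.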


Flashcard 64

Q: What is the role of a Publisher in MQTT? 

A: Publishes data to topics managed by the broker.


Flashcard 65

Q: What does the Broker/Server do in MQTT? 

A: Manages topics and forwards published data to the subscribers of those topics.


Flashcard 66

Q: What is the role of a Subscriber in MQTT? 

A: Receives data from topics to which it has subscribed.

