Data Streams: System Design and Information Management Notes

Introduction to Data Stream Management Systems (DSMS)

Data streams represent a fundamental shift in how data is managed and processed compared to traditional database systems. In the context of modern informatics, particularly in course CSE4202: Database System Design and Information Management II, the focus moves from static data storage to the real-time analysis of information in motion.

  • Conceptual Shift: Traditional Database Management Systems (DBMS) rely on persistent storage where records are relatively static and there is typically no predefined notion of time. Queries are usually complex and executed as one-off operations.

  • Modern Requirements: Certain applications necessitate a different approach because:

    • Data arrives in real-time.

    • Data is ordered, either implicitly by arrival time or explicitly via a timestamp.

    • The volume of data is too large to store in its entirety.

    • Data arrival is continuous and never-ending.

    • Analysis must be ongoing for rapidly changing data sets.

Big Data: The 4 Vs

The management of data streams is closely linked to the four pillars of Big Data. Understanding these is essential for designing effective DSMS.

  • Volume: Refers to the sheer amount of data being generated.

  • Variety: Data comes in many forms, including semi-structured, unstructured, and schema-free formats.

  • Veracity: Pertains to the reliability of the data, accounting for untrusted, inaccurate, or noisy information.

  • Velocity: The speed of operation and the rate at which analysis must occur to keep up with incoming data.

Application Domains for Data Streams

Data stream management is critical across various industries where data is generated as a continuous sequence of tuples.

  • Measurement Data Streams: These monitor the evolution of entity states.

    • Sensor Networks: Capturing physical phenomena or road traffic data.

    • IP Networks: Monitoring traffic at router interfaces.

    • Earth Climate: Recording temperature and moisture levels at weather stations.

  • Transactional Data Streams: These log interactions between entities.

    • Credit Cards: Tracking purchases made by consumers from various merchants.

    • Telecommunications: Logging phone calls between callers and dialed parties (Call Detail Records).

    • Web Services: Monitoring accesses by clients to resources hosted on servers (clickstreams).

  • Additional Domains:

    • Network monitoring and traffic engineering.

    • RFID tags.

    • Financial applications (e.g., stock market feeds).

    • Manufacturing processes.

Example: Intelligent Transport Systems (ITS)

ITS utilizes data streams to manage a complex ecosystem of communication involving:

  • Satellite communications and terrestrial broadcast.

  • Navigation and Trip Planning.

  • Mobile and Intermodal communications.

  • Vehicle-to-Vehicle safety systems.

  • Passenger information and Travel assistance.

  • Infrastructure like Traffic Signs and WLAN.

  • Adaptive Cruise Control and Fleet Management.

  • Toll collection and emergency services (e.g., Fire).

Example: Advanced Metering Infrastructure (AMI)

AMI represents a smart grid application with a Consumers Side and an Electricity Utilities Side:

  • Consumers Side: Smart meters collect consumption data.

  • Communication Network: A Data Concentrator facilitates the flow of information.

  • Utilities Side: Data is processed through a Meter Data Management System (MDMS), Outage Management System (OMS), and Distribution Management System (DMS).

Definition and Paradigm of Stream Processing

Streaming data is an analytical computing platform specifically focused on speed. It allows data to be continuously analyzed and transformed in memory before being stored on disk.

  • Definition: Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion.

  • Related Paradigms: It is equivalent to dataflow programming, event stream processing, and reactive programming.

  • Mechanism: Processing works by evaluating "time windows" of data in memory across a cluster of servers.

Comparative Analysis: DBMS vs. DSMS

There are significant architectural and functional differences between traditional DBMS and DSMS.

Query Types

  • One-time Queries (DBMS): These are run once to completion over the current, static data set.

  • Continuous Queries (DSMS): These are issued once and then continuously evaluated over the data stream. Examples include:

    • "Notify me when the temperature drops below XX."

    • "Tell me when the prices of stock Y > 300."

Data and Storage

  • DBMS: Focuses on persistent relations that are stored and relatively static. It uses random access and an effectively "unbounded" disk store. Only the current state of the data is typically relevant.

  • DSMS: Handles transient streams for online analysis. It relies on sequential access and is constrained by bounded main memory. Historical data is considered important in the context of the stream.

Performance and Requirements

  • DBMS: No inherent real-time service requirements. It handles a relatively low update rate and assumes precise data. The query processor and physical DB design determine the access plan.

  • DSMS: Has strict real-time requirements and must handle arrival rates that can reach multi-GB levels. Data is handled at a fine granularity, though it may be stale or imprecise. Data arrival and characteristics are often unpredictable and variable.

Resource and Plan Management

  • DBMS: Resource-rich (memory, disk, computation). It uses sophisticated, fixed query plans.

  • DSMS: Resource-limited (memory and per-tuple computation). It uses adaptive query plans and one-pass query evaluation. It is often used to identify which specific data should eventually populate a traditional database.

Query Processing in Data Streams

Continuous Query Language (CQL) extends SQL to handle the unique requirements of streams.

  • New Features: Adds Streams as a new data type, continuous semantics, windows on streams (derived from SQL-99), and basic sampling.

  • Relational Operators: Construction of a query plan is based on selection, projection, join, and aggregation (group by). Plans from multiple continuous queries are combined to reduce redundancy.

Operator Classifications

  • Tuple-at-a-time Operators: These only require consideration of a single tuple to produce an output. Examples include Selection and Projection.

  • Full Relation Operators:

    • Some can work on a tuple-at-a-time using an accumulator, such as Count, Sum, Average, Max, and Min (even with Group By).

    • Others, such as Order By, cannot work on a tuple-at-a-time basis.

    • Binary operators like Intersection, Difference, Product, and Join can block because they may require the entire input to be seen (which is impossible for unbounded streams) or may need to join tuples that are widely separated in the stream. Union, however, can be evaluated tuple-by-tuple.

Windowing Mechanisms

Windows allow for the extraction of a finite relation (a synopsis) from an infinite stream to restrict the scope of operators.

  • Ordering Attribute Windows: Based on time, such as the "last 5 minutes of tuples."

  • Tuple Count Windows: Based on count, such as the "last 1000 tuples."

  • Explicit Marker Windows: Based on punctuations in the stream.

  • Window Behaviors:

    • Sliding Windows: The window moves forward as new data arrives, maintaining a set size (e.g., Window 1, Window 2, Window 3). Defined by "window size" and "window slide."

    • Tumbling Windows: The window "tumbles" forward, creating discrete, non-overlapping segments (e.g., Window 1 followed by Window 2).

Relation/Stream Translation Operators

Conversion between streams and relations is necessary for complex query processing.

  • Stream-to-Relation: Handled via Windows.

  • Relation-to-Stream:

    • Istream (Insert Stream): Whenever a tuple is inserted into a relation, it is emitted on the stream.

    • Dstream (Delete Stream): Whenever a tuple is deleted from a relation, it is emitted on the stream.

    • Rstream (Relation Stream): At every time instant, every tuple in the relation is emitted on the stream.

Example CQL Queries

  • Selection on Stream:     SELECT Istream(*) FROM S [rows unbounded] WHERE S.A > 10
        Here, stream SS is converted to a relation of unbounded size, filtered, and converted back via Istream. If only selection is involved, the window specification may be unnecessary: SELECT * FROM S WHERE S.A > 10.

  • Joining Windows:     SELECT * FROM S1 [rows 1000], S2 [range 2 minutes] WHERE S1.A = S2.A AND S1.A > 10
        This combines a tuple-based sliding window ([rows 1000]) and a time-based sliding window ([range 2 minutes]).

  • Probing Stored Tables:     SELECT Rstream(S.A, R.B) FROM S [now], R WHERE S.A = R.A
        This uses a [now] window (tuples from the last time step) to probe a stored table RR.

Scalability and Optimisation

In DSMS, there is a trade-off between resource use and result completeness.

  • Completeness: While DBMS must produce all results for finite relations, DSMS may not be able to return all results for unbounded streams.

  • Buffer Management: The size of buffers for windows affects both resource consumption and completeness. Random sampling can further reduce resource use.

  • Optimisation Objectives: Since streams are infinite, traditional cardinality-based optimization (minimizing intermediate results) is problematic. Novel objectives include:

    • Stream Rate Based: e.g., NiagaraCQ.

    • Resource Based: e.g., STREAM.

    • Quality of Service (QoS) Based: e.g., Aurora.

    • Continuous Adaptive Optimisation: Re-evaluating plans as stream characteristics change.

Stream Processing Architectures and Frameworks

Frameworks

  • Open Source:

    • Apache Flink

    • Apache Kafka (developed by LinkedIn)

    • Apache Storm (developed by Twitter)

  • Cloud-Based:

    • AWS Kinesis

    • Google Cloud Dataflow

Architectural Styles

  • Event-Driven Architecture (EDA): Uses components like relays and event-driven dashboards (e.g., using Socket.io and Postgres for stats).

  • Message Queues (MQS): Utilizes brokers like RabbitMQ with dedicated queues for users, store operations, and notifications.

  • Complex Event Processing (CEP): Processes multiple data streams (PMU, SCADA, AMI, Weather) through a rules engine to make decisions (Demand Response, Load Forecast, Billing).

  • Lambda Architecture: Features two paths for data—a Real-Time Layer (Data Processing in Motion) for low latency and a Batch Layer (Data Processing at Rest) for high accuracy, both feeding into a unified Serving Layer.

  • Microservices: Connects various services (Catalog, Shopping Cart, Discount, Ordering) through a Message Broker.

Deep Dive: Analytical Applications

Click Stream Analysis

  • Definition: Tracking, collecting, and reporting on the sequence and count of visits to website components.

  • Identification: Visitors are typically identified via cookies.

  • Use Cases: Identifying "hot links," tracking frequent/unique visitors, and determining the probability of engagement with new content.

Fraud Detection

  • Mechanism: Monitoring a stream of actions (any possible transaction or interaction).

  • Use Cases: Detecting credit card fraud, identifying stock trading irregularities, and mitigating cyber security risks.

  • Real-world Example: Banks in Thailand utilize Apache Kafka for these purposes.