Chapter 7 - Data processing architectures

16 Terms

1

Lambda architecture

  • figure

  • streaming and batch layer

  • serving layer

  • Streaming (speed) layer

    • Handles all requests that require low latency

    • Recent data only, providing a real-time view on that recent data

    • Technology used e.g. Apache Storm or Spark Streaming

    • Output usually stored in NoSQL DBs (e.g. Cassandra)

  • Batch layer

    • Manages master dataset (all data) - Append-only set of raw data usually on distributed filesystem

    • Computes batch views (e.g. via Apache Hadoop)

    • Accurate → processes all data when generating views

    • Output usually stored in a read-only database

  • serving layer

    • Serves output from both batch and speed layer for querying

    • Stored on e.g. HBase (open-source, non-relational distributed database modeled after Google Bigtable)
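
The interplay of the three layers can be sketched in a few lines of plain Python. This is only a toy illustration, not how Hadoop, Storm or HBase are actually programmed: the event format, the page-view counting task and the function names (`batch_view`, `speed_view`, `serve`) are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical events: (timestamp, page). The master dataset is append-only.
master_dataset = [(1, "home"), (2, "cart"), (3, "home"), (4, "home")]

def batch_view(events, up_to_ts):
    """Batch layer: recompute the view from ALL data up to the last batch run."""
    return Counter(page for ts, page in events if ts <= up_to_ts)

def speed_view(events, after_ts):
    """Streaming (speed) layer: only recent data not yet covered by a batch view."""
    return Counter(page for ts, page in events if ts > after_ts)

def serve(batch, speed, page):
    """Serving layer: answer a query by merging both views."""
    return batch[page] + speed[page]

last_batch_ts = 2                      # assume the batch job last ran at timestamp 2
batch = batch_view(master_dataset, last_batch_ts)
speed = speed_view(master_dataset, last_batch_ts)
print(serve(batch, speed, "home"))     # 3
```

The query result is only complete because both views are merged, which is why the same logic has to exist in two systems.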

2

Lambda architecture

  • downsides

  • Need to maintain two complex distributed systems (batch and stream)

  • Need to create apps for two different systems

    • Streaming: Storm / Spark Streaming

    • Batch: Hadoop

  • Debugging and interaction with the products differ between the two systems

  • Implemented in Spring XD, which was given End-of-Life (EOL) in 2017

3

Kappa architecture

  • figure

  • general

  • Simplification of Lambda architecture (Kappa origin: LinkedIn)

    • In short: remove the batch processing system

    • Make the streaming system also deal with historical data

  • Data stored in a Kappa architecture is an append-only immutable log (e.g. Kafka)

  • The log streams data through a stream processing system, e.g. Apache Storm, Spark Structured Streaming, Kafka Streams, Flink

    • Only one (streaming) code set needs to be maintained

  • Output stored in auxiliary stores for serving (any type of DB suffices)

    • Can be deleted & regenerated from source of truth (data on immutable log)
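
A minimal sketch of the idea, assuming an in-memory list as a stand-in for the immutable log (in practice e.g. a Kafka topic). The event format and the names `append` and `build_view` are hypothetical and chosen only for illustration.

```python
# Hypothetical in-memory stand-in for an immutable log such as a Kafka topic.
log = []

def append(event):
    """Events are only ever appended; existing entries are never modified."""
    log.append(event)

def build_view(events):
    """The single (streaming) code path: fold events into a serving view."""
    view = {}
    for user, amount in events:
        view[user] = view.get(user, 0) + amount
    return view

for event in [("ann", 5), ("bob", 2), ("ann", 1)]:
    append(event)

view = build_view(log)        # auxiliary store used for serving
view = None                   # the store is lost or deliberately deleted
view = build_view(log)        # ...and regenerated from the source of truth
print(view)                   # {'ann': 6, 'bob': 2}
```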

4

Kappa architecture

  • dealing with processing changes

  • figure

  • Use e.g. Kafka to store incoming data in an immutable log

    • Retain full log of data you wish to reprocess (retention interval)

    • Allows multiple subscribers

  • When wishing to reprocess

    • Start second instance of stream processing job

      • Begin processing from beginning of retained data

      • Direct output to a new output table

    • When second job catches up

      • Switch application to read from new table

      • Stop old version of the stream processing job and optionally delete old output table
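
The reprocessing steps above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the retained log is a plain list, the output tables are a dictionary, and `job_v1`, `job_v2` and `run` are hypothetical names rather than part of any real stream processor.

```python
# Hypothetical retained log and output tables; all names are illustrative.
retained_log = [3, 8, 5, 12]
tables = {"output_v1": [], "output_v2": []}

def job_v1(x):
    return x * 2          # old processing logic

def job_v2(x):
    return x * 2 + 1      # changed processing logic

def run(job, table, start_offset=0):
    """Process the retained log from start_offset and write to the given table."""
    for x in retained_log[start_offset:]:
        tables[table].append(job(x))
    return len(retained_log)          # offset reached

run(job_v1, "output_v1")              # the job that is currently live
active_table = "output_v1"

# Reprocess: start a second instance from the beginning, writing to a new table.
offset_v2 = run(job_v2, "output_v2", start_offset=0)

# Once the second job has caught up, switch readers and clean up.
if offset_v2 == len(retained_log):
    active_table = "output_v2"
    del tables["output_v1"]           # optional: drop the old output table

print(tables[active_table])           # [7, 17, 11, 25]
```

Both tables exist at the same time until the switch, which is the extra temporary storage this approach requires.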

5

Kappa architecture

  • conclusion

  • Four pillars

    1. Data is immutable

    2. Everything is a stream

    3. Single stream engine is used

    4. Data can be replayed

  • Only need to do reprocessing when you change the processing code

  • Downsides

    • Storing and managing large volumes of logs over time can be resource-intensive

    • Extra temporary storage required when reprocessing

    • Not ideal for cases where batch processing is clearly better suited (e.g. nightly analytical summary jobs), because of the lower throughput of stream processing

6

Apache Samza

  • general

  • figure

  • Stream processing based on Kappa architecture, originated at LinkedIn (2013)

    • Optimised for fast, near-realtime processing (true streaming)

    • Inherent support for state

  • Scalable

    • LinkedIn processes more than a million messages per second

    • Kafka + YARN / standalone

    • Not really tailored to Kubernetes

  • APIs: low-level + stream + Samza SQL

  • Fault tolerant

  • At-least-once message guarantees

  • Notable deployments

    • LinkedIn (origin), Slack, eBay, TripAdvisor
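
What an at-least-once guarantee means in practice can be shown with a toy model. This is not the Samza API: the message list, the checkpoint interval of two messages and the `run` function are assumptions for illustration. The point is that offsets are committed only periodically, so a failure between two checkpoints causes some messages to be processed again after restart.

```python
# Toy model of at-least-once delivery; this is not the Samza API.
messages = ["a", "b", "c", "d"]
processed = []          # side effects of processing (may contain duplicates)
checkpoint = 0          # last committed offset

def run(crash_before_checkpoint_at=None):
    """Process from the last checkpoint; commit the offset only every 2 messages."""
    global checkpoint
    offset = checkpoint
    while offset < len(messages):
        processed.append(messages[offset])
        offset += 1
        if offset == crash_before_checkpoint_at:
            return False                    # simulated failure: offset not committed
        if offset % 2 == 0:
            checkpoint = offset             # periodic checkpoint
    checkpoint = offset
    return True

run(crash_before_checkpoint_at=3)   # crashes after processing "c"
run()                               # restart resumes from checkpoint 2
print(processed)                    # ['a', 'b', 'c', 'c', 'd']
```

No message is lost, but "c" is handled twice, so downstream logic has to tolerate duplicates (for example by being idempotent).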

7

Apache Samza

  • pros and cons

pros:

  • Fault tolerant and high performance due to reliance on Kafka

  • If you already use YARN and Kafka, Samza can be a natural choice

  • Low latency, high throughput, mature and tested at scale

cons:

  • Tightly coupled with Kafka and YARN, Kubernetes not a target

  • At-least-once guarantees only

  • Lack of advanced streaming features like watermarks, sessions, triggers, etc.

  • Not seeing much activity lately

8

Zeta architecture

  • 7 components (mnemonic)

  • figure!

ESC DRCG

  • 7 components (Zeta = 7)

    • Enterprise applications

    • Solution architecture

    • Compute model / execution engine

    • Distributed file system

    • Real-time data storage

    • Container system

    • Global Resource Management

  • Google one of the pioneers

  • Properties

    • All servers under supervision of global resource management and participate in Distributed File System

    • Dynamic allocation of resources: resources do not have to be hardwired to specific applications, reducing cost

    • Data locality: store and process data where it was created

9

Google

  • Cloud Dataproc

  • Managed Spark, Flink, Presto and Hadoop service integrated with Google Cloud Platform

  • Ease-of-use

    • Create, monitor, delete Cloud Dataproc clusters and jobs through Google Developers console

    • Latest stable Spark, Flink and Hadoop software releases

  • Low-cost

    • Dataproc costs $0.01 per virtual CPU in your cluster per hour

    • But this comes on top of Compute Engine + disk storage + Cloud Storage + monitoring (also billed per hour)

  • Speed: cluster start and stop operations take 90 seconds or less
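
As a worked example of the pricing rule, the sketch below computes only the Dataproc surcharge. The cluster size (6 nodes with 4 vCPUs each, running for 2 hours) and the function name `dataproc_fee` are assumptions for illustration; the underlying compute, disk and storage costs are not included.

```python
DATAPROC_FEE_PER_VCPU_HOUR = 0.01   # rate quoted in this card

def dataproc_fee(nodes, vcpus_per_node, hours):
    """Dataproc surcharge only; Compute Engine, disks, etc. are billed separately."""
    return nodes * vcpus_per_node * hours * DATAPROC_FEE_PER_VCPU_HOUR

# Assumed example cluster: 6 nodes with 4 vCPUs each, running for 2 hours.
fee = dataproc_fee(nodes=6, vcpus_per_node=4, hours=2)
print(f"${fee:.2f}")                # $0.48
```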

10

Google

  • Cloud Dataflow

  • Unified programming model (donated to Apache Beam)

    • For building batch and streaming data processing pipelines

    • Monitor their execution

  • Google-proprietary solution, integrated with Google Cloud Platform

    • Fully managed service

    • Handles resource lifetime (serverless)

    • Dynamically provision resources to reach latency goal or high utilization efficiency

  • Competitive pricing (retrieved 26/04/2025)
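
The idea of a unified programming model can be illustrated with plain Python generators. This is deliberately not the Apache Beam or Dataflow API; the `pipeline` function and the number sources are hypothetical. It only shows that one pipeline definition can be applied to a bounded (batch) source and to an unbounded (streaming) source.

```python
from itertools import count, islice

def pipeline(source):
    """One pipeline definition: keep even readings and square them."""
    return (x * x for x in source if x % 2 == 0)

# Batch: a bounded source that is fully available up front.
batch_result = list(pipeline([1, 2, 3, 4]))

# Streaming: an unbounded source; results are produced as elements arrive.
stream = pipeline(count(1))                # 1, 2, 3, ... never ends
stream_result = list(islice(stream, 3))    # take the first three outputs

print(batch_result)    # [4, 16]
print(stream_result)   # [4, 16, 36]
```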

11

Microsoft

  • Azure HDInsight

  • User-friendly set up of open-source big data solutions + cluster

  • Managed Hadoop, Spark service (PaaS)

    • Hive, Kafka, etc. available

    • .NET SDK and PowerShell integration

  • On top of Microsoft Azure Cloud, integrating with Data Factory and Data Lake Storage

12

Microsoft

  • Azure Stream Analytics

  • Microsoft Azure-based stream processing solution

  • Standalone Microsoft solution

  • SQL-like language to transform, enrich and correlate data

  • Pro: easy to use, easy to scale, good integration with Azure, cheap

  • Con: proprietary / vendor lock-in, limited integration with non-Azure products, aimed at non-complex streaming applications

13

Amazon

  • Elastic MapReduce (EMR)

  • Amazon managed Spark, Hive, Presto framework on EC2

  • Processes data from HDFS, Amazon S3, DynamoDB, Kinesis

  • CloudWatch monitors performance and can trigger scale-up / scale-down

  • Notable users: Netflix, Expedia, Yelp

14

Amazon

  • Kinesis

  • Amazon’s AWS stream processing solution

  • Collect and process data streams in real time (proprietary or Spark on EMR)

  • Tread with care when choosing the proprietary solution

15

Amazon Lambda (AWS Lambda)

  • serverless architecture

  • Build and run applications without thinking about servers (function-as-a-service)

  • Run code without provisioning / managing servers, just provide triggers (e.g. response time) which will be monitored

  • Other Cloud vendors soon followed and released their own serverless solutions
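
A function-as-a-service unit is typically just a single handler function. The sketch below follows the `handler(event, context)` convention used for Python functions on AWS Lambda; the `name` field in the event and the response shape are assumptions made for this example, and the local call at the end only imitates what the platform would do when a trigger fires.

```python
import json

def handler(event, context):
    """Entry point invoked by the platform each time the trigger fires."""
    name = event.get("name", "world")      # "name" is a made-up field
    return {"statusCode": 200, "body": json.dumps({"message": f"Hello, {name}!"})}

# Local invocation for illustration; in the cloud the platform makes this call.
response = handler({"name": "Kappa"}, context=None)
print(response["body"])                    # {"message": "Hello, Kappa!"}
```

There is no server setup, process management or scaling logic in the code itself; that part is left to the provider.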

16

Hortonworks + Cloudera

  • Hortonworks: US-based company offering enterprise data processing solutions (Hadoop/Spark)

  • Cloudera: US-based company focussed on enterprise data cloud solutions (private + public)

  • Merged in January 2019, considered one of the larger players now in enterprise data solutions

  • High focus on open-source solutions and paid-for support

    • Most likely survives because the big cloud vendors initially did not focus on the on-premises market (note: this is changing!)