Data System Foundations


1
New cards

problems of Data-Intensive Applications

Problems

  • Amount of data

  • Complexity of data

  • Speed at which data is changing

Typically built from standard building blocks

  • Store data so it can be retrieved again later (databases)

  • Remember the result of an expensive operation, speeding up reads (cache)

  • Allow users to search data by keyword or filter it in various ways (search indexes)

  • Send a message to another process, to be handled asynchronously (stream processing)

  • Crunch large amount of accumulated data (batch processing)

Need to figure out which tools and which approaches are most appropriate for task at hand

2
New cards

Definitions:

  • systems of record

  • derived data systems

Systems that store and process data grouped into two broad categories

Systems of record

  • A.k.a. source of truth, holds authoritative version of your data

  • When new data comes in → first written here

  • Each fact represented exactly once

  • In case of discrepancy between another system and system of record, value in system of record is correct one

Derived data systems

  • Result of taking existing data from another system and transforming or processing it in some way

  • If you lose derived data → can recreate it from original source

  • Technically speaking derived data is redundant, however often essential to get good performance on read queries

3
New cards

Three Main Concerns for Data Systems (EX)

  • mnemonic

ReScMa

  1. Reliability
    The system should continue to work correctly (correct functionality at the desired level of performance) even in the face of adversity (hardware or software faults or even human errors)

  2. Scalability
    As the system grows (in data volume, traffic or complexity), there should be reasonable ways of dealing with that growth

  3. Maintainability
    Over time many different people will work on the system and they should all be able to work on it productively

4
New cards

Concern 1: Reliability – Faults and Failures (EX)

Continuing to work correctly, even when things go wrong

  • Things that can go wrong = faults

  • Systems that anticipate faults and can cope with them = fault-tolerant or resilient systems

  • Note: not all faults can be tolerated: if planet Earth is swallowed by a black hole, tolerance of that fault would require web hosting in space

Faults are not failures

  • Fault = one component of the system deviates from its spec

  • Failure = system as a whole stops providing the required service to the user

Systems can be tested in terms of fault-tolerance by deliberately inducing faults

5
New cards

Concern 2: Scalability (EX): Describing load e.g. Twitter

Describing load, e.g. Twitter (now called X)

  • Post tweet (4.6k write requests/s on average, 12k write requests/s peak)

  • View home timeline (300k read requests/s)

Handling 12k writes/s is not that hard

  • Difficulty is fan-out: each user follows many people and is followed by many people

Two ways of handling this

  1. Posting a tweet inserts the new tweet into a global collection of tweets

    • User requesting their home timeline: look up all the people they follow, find all tweets for each of those users and merge them (sorted by time)

  2. Maintain a cache for each user’s home timeline

    • When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches

    • Request to read the home timeline is then cheap, because result has been computed ahead of time

First version of Twitter used approach 1

  • Systems struggled to keep up with load of queries

Switch to approach 2

  • Downside: posting a tweet requires a lot of extra work

  • On average: a tweet is delivered to 75 followers

    • 4.6k tweets per second becomes 345k writes per second to home timeline caches

  • BUT: some users have over 30 million followers

    • Would result in single tweet → 30 million writes to home timelines

Thus, for Twitter distribution of followers per user (weighted by how often a user tweets) is a key load parameter for scalability

  • Nowadays Twitter uses a hybrid approach (approach 2 for regular users combined with approach 1 for celebrities)
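
A minimal in-memory sketch of the two approaches (hypothetical Python data structures, not Twitter's actual implementation): approach 1 merges timelines at read time, approach 2 fans the tweet out to follower caches at write time.

    # Hypothetical sketch only; illustrates read-time merge vs write-time fan-out
    from collections import defaultdict

    tweets = []                          # approach 1: one global collection of (author, text)
    follows = defaultdict(set)           # user -> users they follow
    followers = defaultdict(set)         # user -> users following them
    timeline_cache = defaultdict(list)   # approach 2: precomputed home timeline per user

    def post_tweet_v1(author, text):
        tweets.append((author, text))    # cheap write

    def home_timeline_v1(user):
        return [t for t in tweets if t[0] in follows[user]]   # expensive read-time merge

    def post_tweet_v2(author, text):
        for f in followers[author]:      # one insert per follower: expensive for celebrities
            timeline_cache[f].append((author, text))

    def home_timeline_v2(user):
        return timeline_cache[user]      # cheap read, result precomputed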

6
New cards

Concern 2: Scalability (EX): Describing performance

  • mean

  • percentiles (and in practice)

DESCRIBING PERFORMANCE
Mean (given n values, add them up and divide by n)

  • Usually not a very good metric if you want to know typical response time, because it does not tell you how many users actually experienced that delay

Percentiles

  • Median or 50th percentile: take sorted list of response times, median is halfway

    • e.g. if median response time is 200ms, half of requests return in less than 200ms, and half of requests take longer than that

  • To know how bad outliers are: look at 95th, 99th or 99.9th percentile

    • 950, 990 or 999 out of 1000 requests take less than given time to complete

  • Tail latencies are important: directly affect user experience of service

Often used in SLAs (Service Level Agreements), typically quantified in terms of both the median and the 99th percentile

Percentiles in practice

  • When several backend calls are needed to serve a request, it takes a single slow backend request to slow down the entire request

  • Add response time percentiles to monitoring dashboards for services

    • Rolling window of response times of requests in the last timeframe (e.g. median, different percentiles over last 10 minutes).
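
A small sketch of computing the median and tail percentiles over a window of response times, using the nearest-rank method (the response times below are made up).

    import math

    def percentile(times_ms, p):
        # nearest-rank percentile over a sorted copy of the window
        ordered = sorted(times_ms)
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    window = [30, 42, 45, 51, 200, 210, 230, 450, 980, 1500]   # e.g. last 10 minutes of requests
    print("p50:", percentile(window, 50))    # median
    print("p95:", percentile(window, 95))
    print("p99:", percentile(window, 99))    # tail latency, often referenced in SLAs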

7
New cards

Concern 2: Scalability (EX): handling load increases

  • horizontal vs vertical scaling

  • automated vs manual scaling

HANDLING LOAD INCREASES


Vertical scaling = moving to more powerful hardware
Horizontal scaling = distributing load over more machines

Automated vs manual scaling

  • Elastic / autoscaling systems: detect load increases or decreases and automatically add or remove resources

    • Drawback: complex to implement correctly

  • Manual scaling: human analyses capacity and decides upon adding more machines to the system

  • Scaling can be proactive or reactive

There is no generic magic scaling sauce

  • Volume of reads, volume of writes, volume of data to store, complexity of the data, response time requirements, access patterns, security, etc.

    • 100,000 requests/second, each 1 kB vs 3 requests/minute, each 2 GB in size

    • Both systems deal with 0.1GB/s data throughput but should look very different

    • the second will be easier to implement

8
New cards

Concern 3: Maintainability

  • mnemonic

OSE

Operability

  • Make it easy for operations team to keep system running smoothly

Simplicity

  • Make it easy for new engineers to understand system, removing as much complexity as possible from the system

Evolvability

  • Make it easy for engineers to make changes to system in the future, adapting it for unanticipated use cases as requirements change

  • Impacts velocity of integrating new functionality

9
New cards

Data Model Layering

Most applications built layering one data model on top of another
For each layer the key question is: how is it represented in terms of the next-lower layer?

E.g.:

App developer: model real world in terms of objects and data structures and APIs that manipulate them

  • Structures often application-specific

10
New cards

SQL relational model

Data organized in relations (tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

Most application development nowadays is in object-oriented programming languages

  • Awkward translation layer needed between objects in application code and the DB model of tables, rows and columns

  • Partially solved through Object-Relational Mapping (ORM) frameworks like ActiveRecord (Ruby) or Hibernate (Java)
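
A minimal sketch of that translation layer written by hand, using only Python's standard sqlite3 module (class and table names are illustrative); ORM frameworks such as ActiveRecord or Hibernate automate this object-to-row and row-to-object mapping.

    import sqlite3
    from dataclasses import dataclass

    @dataclass
    class User:          # the application's object model
        id: int
        name: str

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

    def save(user):      # object -> row
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user.id, user.name))

    def load(user_id):   # row -> object
        row = conn.execute("SELECT id, name FROM users WHERE id = ?", (user_id,)).fetchone()
        return User(*row) if row else None

    save(User(1, "Ada"))
    print(load(1))       # User(id=1, name='Ada')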

11
New cards

NoSQL (general)

  • driving forces

  • 2 movements

  • different types of NoSQL datastores (EX)

NoSQL = Not Only SQL

Driving forces

  • Need for greater scalability: storing more and more data

  • Need for greater availability: if one or more nodes are no longer responsive the rest of the system can keep going

Also

  • Preference for free and open-source software over commercial costly DB products

  • Specialised query operations e.g. undefined number of joins not well supported by the relational model

  • Frustration with restrictiveness of relational schemas

Two movements

  • Data stores that could be distributed better and without limitations

  • Availability more important than overall consistency

In the foreseeable future, relational DBs will continue to be used alongside a broad variety of non-relational datastores

Different types of NoSQL datastores: document store, graph store, time series DB, triple/quad store, etc.

12
New cards

NoSQL: Key-Value Store (EX)

  • what

  • pros and cons

  • used for

  • examples

One of the simplest NoSQL DBs


Datastore focused on storing, retrieving and managing data by key-value pairs

  • Keys are unique identifiers

  • Both keys and values range from simple data types to complex composed objects (unformatted text, documents, photos, music, video)

  • No schemas, no relationships between data

Pros and cons:

- Drawback: can only use the key to retrieve elements, typically unable to do things like filtering or retrieval of elements with common attributes

- Only supports simple commands like get, put, delete

+ High horizontal scalability
+ Very fast for both read and write operations


Good for

  • Real-time recommendations and advertising: rapid access to new recommendations as web site visitor moves through site

  • In-memory datastore (e.g. caching or web session details / preferences)

Examples: Redis, Memcached, TiKV
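
A minimal in-memory sketch of the key-value interface (a plain Python class, not a real driver); stores such as Redis or Memcached expose essentially the same get/put/delete operations, just over the network and with persistence or expiry added.

    class KeyValueStore:
        def __init__(self):
            self._data = {}                 # key -> opaque value; no schema, no relationships

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)      # lookup by key only; no filtering on value attributes

        def delete(self, key):
            self._data.pop(key, None)

    store = KeyValueStore()
    store.put("session:42", {"user": "ada", "theme": "dark"})
    print(store.get("session:42"))
    store.delete("session:42")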

13
New cards

NoSQL: Document Store

Document store focusses on storing document-oriented information (XML, JSON, etc.)

  • Seen as extension to key-value stores, value being a structured document

  • No rigid schema (like relational model)

  • All information of an object stored together

  • Contrary to a key-value store, can filter / aggregate on document value

Examples: MongoDB, Couchbase
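
A hedged sketch using the pymongo driver, assuming a MongoDB instance on localhost:27017 (collection and field names are made up); note how, unlike a key-value store, the query filters on a field inside the document value.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    products = client["shop"]["products"]                # database and collection created lazily

    products.insert_one({"name": "kettle", "price": 25, "tags": ["kitchen"]})
    products.insert_one({"name": "mug", "price": 8})     # no rigid schema: documents may differ

    cheap = list(products.find({"price": {"$lt": 10}}))  # filter on a document field
    print(cheap)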

14
New cards

NoSQL: Wide Column store (EX)

  • what

  • pros and cons

  • used for

  • real-life examples

Seen as a 2-dimensional key-value store

  • Uses tables, rows (identified by key) and columns (identified by column keys)

  • Each row does not have to contain the same columns

  • Values can be complex data types like unformatted text and imagery

Combines strict table structure of relational DB with flexibility of document DB

Pros and cons

- Not optimized for joins

+ Designed for very high horizontal scalability

  • Each row can be stored as individual file, addressable by its key

  • Up to every attribute of a single row, addressable by key and column key, could be stored separately

  • Such extensive splitting of data could lead to very bad query performance

    • Solution: Column families, describing which attributes belong together and will most likely be queried together

    • Downside: makes schema stricter, need to know and define columns and column families up front

+ Optimised for

  • High volumes of data

  • High write throughput with slower reads

Examples: Cassandra, HBase, ScyllaDB
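
A conceptual sketch in plain Python (not a real driver) of the two-dimensional key-value idea: each value is addressed by row key plus (column family, column), rows need not share columns, and a column family groups the attributes usually queried together.

    from collections import defaultdict

    table = defaultdict(dict)    # row key -> {(column_family, column): value}

    def put(row_key, family, column, value):
        table[row_key][(family, column)] = value

    put("user:1", "profile", "name", "Ada")
    put("user:1", "activity", "last_login", "2024-01-05")
    put("user:2", "profile", "name", "Grace")        # different rows, different columns

    # read only the 'profile' column family of one row
    profile = {col: val for (fam, col), val in table["user:1"].items() if fam == "profile"}
    print(profile)                                   # {'name': 'Ada'}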

15
New cards

NoSql: Time Series DB (EX)

  • what

  • pros and cons

  • used for

  • real-life examples

Database optimized for time-stamped or time series data

  • Metrics (regular time series) or events (irregular time series) that are tracked, monitored, aggregated over time

  • Time becomes the key index (when plotted, one of the axes is always time)

Time series DBs have difficulty with handling high cardinality data

  • Each table can have many indexed columns (dimensions) each containing many unique values

  • Data ingest rates plummet as cardinality rises exponentially

Typically support for dropping older data: retention policies

Specialized compression algorithms can operate over time series data, offering higher compression rates than more generic data

Many types of data: sensor data, performance monitoring data, server metrics, clicks, financial data, etc.

Optimized for measuring change over time: IoT, DevOps monitoring real-time analytics

Examples: InfluxDB, TimescaleDB

16
New cards

NoSQL: Graph DBs (EX)

  • what

  • used for

  • real-life example

Data can be modeled as a graph

  • Vertices (nodes or entities) + Edges (relationships or arcs)

Important aspects

  • Any vertex can have an edge connecting it with any other vertex

    • No schema restricting which kind of things can or cannot be associated

  • Given any vertex, efficiently find both incoming and outgoing edges and thus traverse graph

  • By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while maintaining a clean data model

Good choice for data that can evolve or in which there are a lot of many-to-many relationships

Examples: Neo4J, ArangoDB

17
New cards

NoSQL: Search Engines (EX)

  • what

  • used for

  • real-life example

Datastore designed for finding content (typically full text) through search functionality

  • Data does not have to conform to a rigid schema

  • Typically heavily indexes content (grouping similar characteristics) in order to offer search results in a rapid manner

Indexing through tokenization (chopping up data into small pieces) and stemming (bringing words back to their root form)

Offers different methods for full-text search, complex search expressions and rankings of search results

Good choice for

  • Autocompletion / suggestions based on customer input

  • Searching through logs for analysis

Examples: ElasticSearch
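
A minimal sketch of indexing through tokenization and a deliberately crude stemmer feeding an inverted index (real engines such as ElasticSearch use far more sophisticated analyzers and ranking).

    import re
    from collections import defaultdict

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())      # chop text into small pieces

    def stem(token):
        return re.sub(r"(ing|ed|s)$", "", token)        # crude suffix stripping, illustration only

    index = defaultdict(set)                            # term -> ids of documents containing it

    def add_document(doc_id, text):
        for token in tokenize(text):
            index[stem(token)].add(doc_id)

    add_document(1, "Searching through logs")
    add_document(2, "The log was searched yesterday")

    print(index[stem("search")])                        # {1, 2}: both documents match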

18
New cards

Relational vs document model

  • many to one relationships

  • many to many relationships

  • relational vs document model

  • choice?

MANY-TO-ONE RELATIONSHIPS

  • Relational DBs: normal to refer to rows in other tables by ID, because joins are easy

  • Document DBs: support for many-to-one relations is weak

    • Joins are not needed for one-to-many tree structures and support for joins is often weak

    • Work of creating join is typically moved to application instead of DB (e.g. in-memory list of regions or industries)

MANY-TO-MANY RELATIONSHIPS

Well-supported by SQL datastores and NoSQL graph stores, not so well by NoSQL document datastores

RELATIONAL VS DOCUMENT DATA MODEL
Document data model

  • Schema flexibility

    • Most document DBs do not enforce a schema on data in the documents

  • Better performance due to locality (see JSON example)

  • For some applications closer to the data structures used by the application

  • Concept of foreign key: document reference

    • Resolved at read time by join or follow-up queries

Relational data model

  • Better support for joins

  • Good support for many-to-one and many-to-many relationships

CHOICE FOR RELATIONAL OR DOCUMENT MODEL
If data in application has document-like structure (tree of one-to-many relationships, where typically entire tree is loaded)

  • Good idea to use document model

  • Splitting a document-like structure into multiple tables can lead to cumbersome schemas and complicated app code

  • Limitations

    • Cannot refer directly to a nested item within a document but need to say something like “second item in list of positions for user X251”

    • If documents are not too deeply nested usually not an issue

If data in application has many-to-many relationships

  • Document model less appealing

  • Possible to reduce need for joins by denormalizing, but then app code needs to keep denormalized data consistent

    • Normalization: process of organizing a DB to reduce redundancy and improve data integrity by dividing tables into smaller, related tables and defining relationships between them

    • Denormalization: process of combining related tables into a single table to improve read performance at the expense of data redundancy and update / write performance

      • Duplicates data that would otherwise require joins

  • Joins can be emulated by making multiple requests to DB, but complicates application code and usually slower than join made inside DB

Highly interconnected data: graph model may be most appropriate (see later)

19
New cards

Document DBs

  • schema flexibility (2)

  • Data locality for queries

SCHEMA FLEXIBILITY

  • Document DBs and JSON support in relational DBs usually do not enforce any schema on data in documents

  • Schema-on-read

    • Structure of the data is implicit and only interpreted when the data is read

    • Similar to dynamic (runtime) type checking

  • Schema-on-write

    • Approach of relational DBs with explicit schema

      • DB ensures all written data conforms to explicit schema

    • Similar to static (compile-time) type checking

  • Difference between approaches noticeable when application wants to change format of its data

  • Schema-on-read approach is advantageous if items in collection do not all have same structure

DATA LOCALITY FOR QUERIES
Document usually stored as a single continuous string (JSON, XML or binary variant)

  • If application often needs to access entire document → performance advantage to storage locality

  • Locality advantage only applies if you need large parts of document at once

    • DB needs to load entire document

    • On updates entire document usually needs to be rewritten

  • Recommended to keep documents small and avoid writes increasing size

    • These limitations reduce the set of situations in which document DBs are useful

Grouping related data for locality is not limited to document model
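
A small sketch of schema-on-read: documents with differing structure are stored as raw JSON and only interpreted by the reading code (field names are illustrative); a schema-on-write database would have rejected the second record or forced a migration first.

    import json

    raw_documents = [
        json.dumps({"name": "Ada", "positions": [{"title": "Engineer", "since": 2019}]}),
        json.dumps({"full_name": "Grace Hopper"}),          # older record, different structure
    ]

    for raw in raw_documents:
        doc = json.loads(raw)                               # structure interpreted at read time
        name = doc.get("full_name") or doc.get("name")
        positions = doc.get("positions", [])
        print(name, len(positions))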

20
New cards

Graph like data models

  • when

  • Property graph

  • advantages of graph data models (EX)

  • querying data

What if many-to-many relations are very common in your data?
-> More natural to model data as a graph

  • Vertices (nodes or entities)

  • Edges (relationships or arcs)

PROPERTY GRAPHS
Each vertex

  • Unique identifier, Set of outgoing edges, Set of incoming edges, Collection of properties (key-value pairs)

Each edge

  • Unique identifier, Vertex at which edge starts (tail vertex), Vertex at which edge ends (head vertex), Label to describe type of relationship between two vertices, Collection of properties (key-value pairs)

Graph store can be thought of as consisting of two relational tables (vertices and edges)

Important Properties

  • Any vertex can have an edge connecting it with any other vertex

    • No schema restricting which kind of things can or cannot be associated

  • Given any vertex, efficiently find both incoming and outgoing edges and thus traverse graph

    • Indexes on both tail_vertex and head_vertex

  • By using different labels for different kinds of relationships, able to store several different kinds of information in a single graph, while maintaining a clean data model

Advantages of Graph Data Models

Noteworthy difficulties for relational models

  • Different kinds of regional structures in different countries

  • Quirks of history: country within a country

  • Varying granularity of data

Graphs good for evolvability

  • As features get added: easily extend the graph to accommodate changes in application’s data structures

Querying data

  • Cypher query language:

    • Declarative query language for property graphs

  • As graph data can be put in relational tables, can it be queried using SQL?

    • Yes, but

      • In relational DB usually known in advance which joins are needed in query

      • In graph DB query may need to traverse a variable number of edges before finding vertex you are looking for

      • Number of joins is not fixed in advance

    • Variable-length traversal in SQL = recursive common table expressions (WITH RECURSIVE)
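
A sketch of such a variable-length traversal as a recursive common table expression, run here on SQLite for convenience; the tables and the 'within' label are illustrative, loosely following the country-within-a-country example above.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE vertices (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE edges (tail_vertex INTEGER, head_vertex INTEGER, label TEXT);
        INSERT INTO vertices VALUES (1, 'Idaho'), (2, 'United States'), (3, 'North America');
        INSERT INTO edges VALUES (1, 2, 'within'), (2, 3, 'within');
    """)

    # every region vertex 1 lies within, however many 'within' edges away
    rows = conn.execute("""
        WITH RECURSIVE in_region(id) AS (
            SELECT head_vertex FROM edges WHERE tail_vertex = 1 AND label = 'within'
            UNION
            SELECT e.head_vertex
            FROM edges e JOIN in_region r ON e.tail_vertex = r.id
            WHERE e.label = 'within'
        )
        SELECT v.name FROM vertices v JOIN in_region r ON v.id = r.id
    """).fetchall()
    print(rows)    # e.g. [('United States',), ('North America',)]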

21
New cards

OLTP vs OLAP (EX)

In early days writes to DB typically corresponded to commercial transactions (e.g. making a sale, placing an order, paying a salary)
DBs started being used for many different kinds of data

  • E.g. comments on posts, actions in a game, contacts in an address book

  • Basic access pattern remained similar to processing business transactions

    • App looks up small number of records by some key, using an index

    • Records are inserted or updated based on user input

  • Because these apps are interactive, access pattern became known as Online Transaction Processing (OLTP)

DBs also started being used for data analytics with very different access patterns

  • Analytic query scans over a huge number of records, reading few columns per record and calculating aggregate statistics, e.g. what was the total revenue of each of the stores in January?

    • E.g. how many more products than usual were sold during our last promotion?

  • Analytic queries often written by business analysts

  • Query results feed reports to help management make informed decisions → business intelligence

  • Access pattern became known as Online Analytic Processing (OLAP)

At first same DBs were used for OLTP and OLAP queries
In the early 1990s, however, OLAP queries started being run on a separate DB called the data warehouse

22
New cards

data warehouse (EX)

  • why

  • what

  • ETL principle → figure!

OLTP systems usually expected to be highly available and to process transactions with low latency as they are critical to business

  • Reluctance to let business analysts run ad hoc analytic queries on OLTP databases (queries usually expensive, harming performance)

Introduction of data warehouse, allowing queries without affecting OLTP operations

  • Contains read-only copy of data in various OLTP systems

    • Extracted from OLTP systems by periodic data dump or stream of updates

    • Transformed into analysis friendly schema

    • Cleaned up

    • Loaded into data warehouse

  • Process of getting data into warehouse is known as Extract-Transform-Load (ETL)

23
New cards

Data warehouse vs Data mart

Data warehouse

  • Centralised location for data

  • Holds multiple subject areas / very detailed information

  • Works to integrate all data sources

  • Source of data for reporting, analytics and offline operational processes

  • Typically employs expensive DB technology

Data mart

  • DBs used to provide fast, independent access to a subset of warehouse data

  • Often created for departments, projects, users

  • Compared to data warehouse

    • Similar technology, subset of data, relieves pressure on enterprise data warehouse, provides sandbox for analysis

24
New cards

Data flow in the data warehouse (5) (EX)

→ figure!

DQ: Data Quality Tools

  • Defining quality rules, applying those rules to data to detect violations/exceptions and fixing those exceptions

  • Data profiling automatically gathers statistics about data to ascertain its quality e.g. how many values are empty, minimum and maximum values, most frequent values, etc.

ETL: Extract Transform Load (see earlier)

MDM: Master Data Management systems

  • Creation of master lists of various entities: customers, products, suppliers, etc. and detecting multiple records that apply to the same entity and fixing ambiguities

OLAP: Online Analytical Processing Tools (technology behind BI)

BI: Business Intelligence

25
New cards

Metadata flow in the data warehouse

→ figure!

Metadata repositories

  • Contain technical metadata across the data assets

  • Three main use cases

    • Finding data assets

    • Tracking lineage (provenance): where data came from and how it was generated/transformed a.k.a. data trail auditing

    • Impact analysis: allows developers to see all data assets that rely on a particular field or integration job before making a change

26
New cards

Data warehouse data model

  • general (difference between OLTP and OLAP)

  • star schema

  • slowly changing dimensions

DB at the heart of a data warehouse

  • Usually relational DB optimized for analytics-type processing

  • DB usually heavily indexed and tuned to ensure optimal performance for most common queries

When relational DBs are used to support operational systems and applications (OLTP) data is usually stored in highly normalized data models

  • Create tables with minimum redundancy and smallest possible number of fields

  • Makes updates (writes) very fast

Data warehouses however favour denormalized data models (for OLAP)

  • Each table contains as many related attributes as possible

  • Typically contains data from many sources and applications each with their own schema

    • Data from those sources has to be converted to a single schema

  • Typically uses a star schema (dimensions / fact tables)

    • Many DBs include specialized query optimizations for handling star schema-like joins

STAR SCHEMA
Data warehouses use star schemas a.k.a. dimensional modelling

  • Centre of schema is fact table

    • Each row represents an event occurring at a particular time

    • Fact tables can get very large

    • Some columns are attributes

    • Other columns foreign key references to other tables: dimension tables

Why is it called star schema?

  • When table relationships are visualized, fact table is in middle surrounded by dimension tables

Typical data warehouses often have fact tables that have over 100 columns, up to several 100s
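
A minimal star-schema sketch (SQLite for brevity; the dimension and fact tables are made up) showing the typical pattern: join the fact table to a dimension table, then aggregate, as in the revenue-per-store example from the OLTP vs OLAP card.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE fact_sales  (sale_date TEXT, store_id INTEGER, product_id INTEGER,
                                  quantity INTEGER, net_price REAL);
        INSERT INTO dim_store VALUES (1, 'Ghent'), (2, 'Antwerp');
        INSERT INTO dim_product VALUES (10, 'coffee'), (11, 'tea');
        INSERT INTO fact_sales VALUES ('2024-01-05', 1, 10, 3, 2.5),
                                      ('2024-01-05', 2, 10, 1, 2.5),
                                      ('2024-01-06', 1, 11, 2, 2.0);
    """)

    # total revenue per store in January
    for row in conn.execute("""
        SELECT s.city, SUM(f.quantity * f.net_price)
        FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
        WHERE f.sale_date BETWEEN '2024-01-01' AND '2024-01-31'
        GROUP BY s.city
    """):
        print(row)    # e.g. ('Antwerp', 2.5), ('Ghent', 11.5)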

SLOWLY CHANGING DIMENSIONS
To allow accurate data analysis sometimes needed to keep track of e.g. a person’s state over time

  • Ensures each transaction corresponds to person’s state at time of transaction

  • Most common type of slowly changing dimension tracks historical data by creating multiple records

Example

  • Keep track of customer purchases along with customer demographics (single, married, parent, divorced, etc.)

  • Without slowly changing dimensions we would have one single record for a customer reflecting only current demographic state

    • Difficult to analyse how many people with children spend money on specific items

Add complexity to ETL jobs and analytic queries

27
New cards

Data warehouse: Column Oriented storage (EX)

  • compare with traditional (row-oriented) storage

  • own example

Typical data warehouse query only accesses 4 or 5 columns at one time

In most OLTP DBs storage is laid out in row-oriented fashion

  • All values from one row of a table are stored next to each other

Document DBs are similar: entire document is stored as one continuous sequence of bytes

Performance issues when loading all these rows from disk into memory, parsing them and filtering out those that don’t meet required conditions

Column-oriented storage: do not store all values from one row together, but store all values from each column together instead

  • If each column is stored in a separate file, a query needs only read and parse those columns used in the query

Example

  • Customers table with 300 columns, average of 5 bytes per attribute stored

    • 1.5KB for each user

    • 1 billion customers

  • Relational

    • 1.5TB storage for DB

  • Columnar

    • If age takes 2 bytes and a record identifier 6 bytes, the DB needs 8 bytes per field, or 8GB for 1 billion customers

    • Queries like: “how many customers are under 30?” can be answered very fast
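
A conceptual sketch of the difference in layout (plain Python structures; a real columnar engine adds compression and vectorised execution on top of this access pattern).

    rows = [                                   # row-oriented: all values of one row together
        {"id": 1, "age": 28, "city": "Ghent"},
        {"id": 2, "age": 54, "city": "Antwerp"},
        {"id": 3, "age": 23, "city": "Ghent"},
    ]

    columns = {                                # column-oriented: all values of one column together
        "id":   [1, 2, 3],
        "age":  [28, 54, 23],
        "city": ["Ghent", "Antwerp", "Ghent"],
    }

    # "how many customers are under 30?" only touches the age column in the columnar layout
    print(sum(1 for age in columns["age"] if age < 30))    # 2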

28
New cards

ETL vs ELT

→ figure!

Extract Transform Load (ETL)

  • Data is extracted, transformed and loaded into data warehouse

  • Sometimes data from different systems used to create or update a single record = conforming dimension

Extract Load Transform (ELT)

  • Data is extracted, loaded into data warehouse and transformed

29
New cards

Virtualised data warehouse

  • what

  • when

  • drawbacks

  • figure!

Data warehouse approach is to bring all data together in one place, integrate it into a single conforming schema and use it for analytics queries

Alternative

  • Create logical or virtual schema across multiple systems and issue queries against that schema

    • A.k.a. Federation, Enterprise Information Integration (EII) and data virtualisation

Approach is more appropriate than using a data warehouse when

  • Data must be kept fresh in the face of changes: execute queries against original sources, results always up to date

  • Data access is infrequent: building very expensive data warehouses for data infrequently used is not cost-effective

  • Compliance and data residency clauses constrain data from being copied to a data warehouse

Drawbacks

  • Labor-intensive manual process: virtual tables must be manually defined across disparate systems

  • Schema and logic changes: schema change can break queries and make all data unavailable until queries are fixed

  • Performance: queries that span multiple systems (federated queries) have significant performance challenges

  • Frequency: if there are a lot of queries against virtual schema, it becomes more advantageous to extract data once, store it in a data warehouse and query it there: lowers load on source systems

30
New cards

What is a data lake

Self-service is taking over from carefully crafted and labor-intensive approaches of the past

  • IT professionals created well-governed data warehouses and data marts but took months to make changes

Data lake focusses on self-service

  • Key take-away: data kept in its original form and format and used by various users

Allow analysts to analyse data without having to get help from IT

  • Data preparation tools that help analysts shape data for analytics

  • Catalog tools that help analysts find the data they need

  • Data science tools that help perform advanced analytics

Challenge with self-service is governance and data security

  • How to make data available to analysts without violating internal and external data compliance regulations?

31
New cards

Data lake maturity (4)

(think of the figure)

Data puddle

  • Single purpose or single project data mart built using big data / cost-efficient technology

Data pond

  • Collection of data puddles

  • Focus on scalable / cost-efficient technology compared to data warehouse / mart

Data lake

  • Supports self-service: business users able to find and use data sets that they want to use without having to rely on help from IT department

  • Aims to contain data that business users may possibly want even if there is no project requiring it at the time

Data ocean

  • Expands self-service data and data-driven decision making to all enterprise data, wherever it may be


32
New cards

Data lake success factors: Platform

  • 4 aspects

  • mnemonic

VCV FP

Big data technologies like Hadoop, cloud solutions like AWS, Azure and Google Cloud Platform

  • Volume: designed to scale out

  • Cost: ability to store and process huge volumes of data inexpensively

    • Usually at one-tenth to one-hundredth the cost of a commercial relational DB

  • Variety: filesystems or object stores that allow to store all sorts of files e.g. Hadoop HDFS, MAPR FS, AWS S3

    • Unlike relational DB requiring schema on write, a filesystem or object store does not care what you write (schema on read)

  • Future-proofing

    • If data is stored in a relational DB it can only be accessed by that relational DB

    • Hadoop and big data platforms very modular: Hive can provide a SQL interface to Hadoop files, to Pig scripts, to Spark, to MapReduce, etc. (seen in later slidedecks)

33
New cards

Data lake success factors: Data

Most data collected by enterprises today is thrown away

  • Small percentage of data aggregated and kept in data warehouse for a few years

Issue with data silos: different departments hoard their data, making it difficult to use data cross-groups

Data lake: store as much data as possible for future use

  • Typically, no known reason for storing the data, but do so in case data is needed one day

  • Makes no sense to convert or treat data prematurely → save data in its native format

  • Consumes raw data and prevents data silos

34
New cards

Data lake success factors: Interface

To gain wide adoption and reap benefits of helping business users make data driven decisions, solutions companies provide must be self-service

  • Users must be able to find, understand and use data without help from IT department

    • IT unable to scale support to such a large user community and such a large variety of data

Interface at the right level of expertise

  • Analysts

    • Often do not possess skills to use raw data: too much detail, too granular, too many quality issues

    • Data has to be harmonized: put in same schema with same field names and units of measure

    • They want “cooked” data, not raw data

  • Data scientists

    • Want access to raw data, not the cooked data

  • Possible by setting up multiple zones / areas that meet particular needs

35
New cards

Setting up data lake

  • 4 steps

  1. Set up the infrastructure

    • E.g. get the Hadoop cluster up and running

  2. Organize the data lake

    • Create zones for use by various user communities and ingest data

  3. Set the data lake up for self-service

    • Create catalogue of data assets, set up permissions and provide tools for analysts to use

  4. Open data lake up to users

36
New cards

Setting up data lake: 1 Set up the infrastructure (EX)

  • 3 options

  • figure!

Initially data lakes were built mostly on-premise using open-source or commercial Hadoop distributions

Later cloud-based data lakes or hybrid cloud/on-prem data lakes started being used

Latest evolution: logical data lake: a virtual data lake layer across multiple heterogeneous systems

  • Hadoop, relational, NoSQL DBs, on-premise and in the cloud

All use a catalogue to find data assets

37
New cards

Setting up data lake: 2 Zones: organizing the data lake

  • zones → figure!

  • governance

Raw or landing zone where data is ingested and kept as close as possible to the original state

Gold or production zone where clean, processed data is kept

Dev or work zone where more technical users (data scientists / engineers) work

  • Can be organised by user, by project, by subject, etc.

  • Once analytics performed in work zone get productized → moved into gold zone

Sensitive zone that contains sensitive data

Governance should reflect data usage and user community requirements

Different zones have different levels of governance and SLAs

  • Data in the gold zone is strongly governed, well curated, well documented, carries quality and freshness SLAs

  • Data in the work area has minimal governance (mostly making sure there is no sensitive data) and SLAs that vary from project to project

38
New cards

Setting up data lake: 3 setting up for self service

  • 4 steps

Analysts typically go through four steps to do their job

  1. Find and understand

  2. Provision (get access to the data)

  3. Prepare (clean and convert)

  4. Analyse (answer questions, create visualisations and report)

First three steps typically take 80% of an analyst’s time

  • 60% is in the first step of finding and understanding data

39
New cards

setting up for self service: 1 finding and understanding data

Variety and complexity of available data far exceeds human ability to remember it

  • Typical companies have 1000s of DBs, with typically many 100s of tables with each table having many 100s of fields

Typical project

  1. Ask around to see whether anyone has ever used a particular type of data

  2. Stumble onto data set that someone has already used

  3. Once decided to use that data set, spend a lot of time deciphering what the data it contains means

    • Knowledge about what data means is usually spread throughout the company

    • Analyst crowdsourcing tools collect this information to document data sets using simple descriptions composed of business terms and build a search index to help analysts find what they are looking for

40
New cards

setting up for self service: 2 Accessing and provisioning the data

Once right data sets identified → analysts need to be able to use them

Publish information about all data sets in a metadata catalogue so analysts can find useful data sets and request access as needed

  • Requests include justification for access, project that requires the data and duration of access required

  • Incoming request may trigger work to deidentify sensitive data

Provisioning or physical access can be granted in a number of ways

  • Users can be granted read access to the entire data set

  • If only partial access should be granted, a copy of the file containing just the data appropriate to the user can be created (and kept up to date) or a table or view can be created that contains only the fields and rows that the analyst should see

  • If needed, a de-identified version of the data set can be generated that replaces sensitive information with randomly generated equivalent information, so all applications still work, but no sensitive data is leaked

41
New cards

setting up for self service: 3 Preparing the data

Most of the time data needs work to render it appropriate for analysts. This generally involves the following operations

  • Shaping

    • Selecting a subset of fields and rows to work on, combining multiple files and tables into one (joining), transforming and aggregating, converting variables to features (e.g. converting age into a feature that has a value of 0 if a person is over 65 and 1 if not)

  • Cleaning

    • Filling in missing values, correcting bad values, resolving conflicting data, normalizing units of measure and codes to common units

  • Blending

    • Harmonizing different data sets to the same schema, same units of measure, same codes, etc.
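
A hedged sketch of the three operations with pandas (all column names and values are hypothetical); dedicated data-preparation tools wrap similar steps in a self-service interface.

    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2], "age": [70, None], "country": ["BE", "be"]})
    orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount_eur": [10.0, 25.0, 7.5]})

    # shaping: join, aggregate, convert a variable into a feature
    spend = orders.groupby("customer_id", as_index=False)["amount_eur"].sum()
    df = customers.merge(spend, on="customer_id", how="left")
    df["senior"] = (df["age"] > 65).astype(int)          # age -> 0/1 feature

    # cleaning: fill missing values, normalize codes to a common form
    df["age"] = df["age"].fillna(df["age"].median())
    df["country"] = df["country"].str.upper()

    # blending: harmonize another source to the same schema before combining
    other = pd.DataFrame({"cust": [3], "age": [41], "country": ["NL"],
                          "amount_eur": [12.0], "senior": [0]})
    combined = pd.concat([df, other.rename(columns={"cust": "customer_id"})], ignore_index=True)
    print(combined)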

42
New cards

Data lake architectures

Initially companies thought one huge on-premise data lake would contain all their data

In practice this was not ideal; multiple data lakes have typically proven to be a better solution

  • Different data sovereignty regulations

  • Organizational pressures

  • Difficulty finding experienced administrators for setting up and maintaining complex Hadoop (and other big data technology) clusters

High interest in Cloud-based data lakes where most hardware and platform components are managed by experts that work for Amazon, Microsoft, Google and others

DATA LAKES IN THE PUBLIC CLOUD
Access to big data technology expertise

Short deployment times

Low cost of storage and elastic nature of cloud computing make it an attractive option for implementing a data lake

  • A lot of data is being stored for future use, so makes sense to store it as inexpensively as possible

    • Cost optimization possible through various storage tiers offered by public cloud vendors: from high-speed to glacial, with slower-access media being significantly cheaper

  • Elasticity of cloud computing allows a very large cluster to be spun up on demand when needed

    • Compared to on-premise: as nodes fill up with data, new nodes need to be added just for storage

      • If analytic loads are CPU-heavy and need more compute power, you need to add even more nodes, even though you may only use them for a short time

    • In the cloud you pay for the storage and processing that you need

43
New cards

Data lake virtualization: logical data lakes

→ figure

Instead of loading all data into the data lake in case someone may actually need it, it is made available to analysts through a central catalogue or through data virtualization software

Address the issues of completeness and redundancy

  • Completeness

    • If analysts can find only data that is already in the data lake other data that has not been ingested into the data lake won’t be found

  • Redundancy

    • If we ingest all data into the data lake, we will have redundancy between the sources of data and the data lake

    • With multiple data lakes, to achieve completeness we would need to ingest the same data into each data lake

    • Already a lot of redundancy in the enterprise

44
New cards

Data lake virtualization: managing data in the logical data lake

→ figure

Handling completeness

  • Create a catalog of all the data assets so the analysts can find and request any data set that is available in the enterprise

Handling redundancy

  • Store data that is not stored anywhere else in the data lake

  • Bring data that is stored in other systems into the data lake if and when it is needed, and keep it in sync while it is needed

  • Bring each data set in only once for all users

45
New cards

Data lake virtualization: Virtualisation vs catalog-based data lake

→ figure

Virtualization a.k.a. federation or EII (Enterprise Information Integration)

  • Technology developed in 1980s improved into the 2010s

  • Creates virtual view or table that hides location and implementation of the physical tables

    • E.g. view created by joining two tables from different DBs

  • For logical data lake: would require every data set to be published as a virtual table and kept up to date as underlying table schemas change

  • Views present significant problems

    • Creating a virtual view does not make data any easier to find

    • Joining data from multiple heterogeneous systems is complex and compute-intensive, causing massive loads on the systems and long execution cycles

By contrast: in catalog-driven approach only metadata about each data set is published in order to make it findable

  • Datasets then provisioned to the same system (e.g. Hadoop cluster) to be processed locally

46
New cards

Data lake virtualization: catalog based logical data lake

Makes all data findable and accessible to analysts

Can serve as a single point of access, governance and auditing

47
New cards

Data swamp

Data pond that has grown to size of a data lake but failed to attract a wide analyst community

Usually due to lack of self-service and governance facilities

Figure

  • Various teams use small areas of lake for projects (white data pond area)

  • Majority of data is dark, undocumented / unusable

Historically a lot of companies rushed to buy Hadoop clusters and filled them with raw data

  • Millions of files containing petabytes of data and no way to make sense of that data

  • No one could tell where the sensitive data was, so users could not be given access and data remained largely unusable and unused

Example: a company built a data lake, encrypted all data in the lake to protect it, and required data scientists to prove that the data they wanted was not sensitive before it would decrypt it and let them use it

  • Because everything was encrypted, data scientists could not find anything, much less prove it was not sensitive

  • No one used the data lake / swamp

48
New cards

Data Warehouse vs Data Lake (EX)