Problems of Data-Intensive Applications
Problems
Amount of data
Complexity of data
Speed at which data is changing
Typically built from standard building blocks
Store data so it can be retrieved again later (databases)
Remember the result of an expensive operation, speeding up reads (cache)
Allow users to search data by keyword or filter it in various ways (search indexes)
Send a message to another process, to be handled asynchronously (stream processing)
Crunch large amounts of accumulated data (batch processing)
Need to figure out which tools and which approaches are most appropriate for task at hand
Definitions:
systems of record
derived data systems
Systems that store and process data grouped into two broad categories
Systems of record
A.k.a. source of truth, holds authoritative version of your data
When new data comes in → first written here
Each fact represented exactly once
In case of discrepancy between another system and system of record, value in system of record is correct one
Derived data systems
Result of taking existing data from another system and transforming or processing it in some way
If you lose derived data → can recreate it from original source
Technically speaking derived data is redundant, however often essential to get good performance on read queries
Three Main Concerns for Data Systems (EX)
mnemonic
ReScMa
Reliability
The system should continue to work correctly (correct functionality at the desired level of performance) even in the face of adversity (hardware or software faults or even human errors)
Scalability
As the system grows (in data volume, traffic or complexity), there should be reasonable ways of dealing with that growth
Maintainability
Over time many different people will work on the system and they should all be able to work on it productively
Concern 1: Reliability – Faults and Failures (EX)
Continuing to work correctly, even when things go wrong
Things that can go wrong = faults
Systems that anticipate faults and can cope with them = fault-tolerant or resilient systems
Note: not all faults can be tolerated: if planet Earth is swallowed by a black hole, tolerance of that fault would require web hosting in space
Faults are not failures
Fault = one component of the system deviates from its spec
Failure = system as a whole stops providing the required service to the user
Systems can be tested in terms of fault-tolerance by deliberately inducing faults
Concern 2: Scalability (EX): Describing load e.g. Twitter
Describing load, e.g. Twitter (now called X)
Post tweet (4.6k write requests/s on average, 12k write requests/s peak)
View home timeline (300k read requests/s)
Handling 12k writes/s is not that hard
Difficulty is fan-out: each user follows many people and is followed by many people
Two ways of handling this
Posting a tweet inserts the new tweet into a global collection of tweets
User requesting their home timeline: look up all the people they follow, find all tweets for each of those users and merge them (sorted by time)
Maintain a cache for each user’s home timeline
When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches
Request to read the home timeline is then cheap, because result has been computed ahead of time
First version of Twitter used approach 1
Systems struggled to keep up with load of queries
Switch to approach 2
Downside: posting a tweet requires a lot of extra work
On average: a tweet is delivered to 75 followers
4.6k tweets per second becomes 345k writes per second to home timeline caches
BUT: some users have over 30 million followers
Would result in single tweet → 30 million writes to home timelines
Thus, for Twitter distribution of followers per user (weighted by how often a user tweets) is a key load parameter for scalability
Nowadays Twitter uses a hybrid approach (approach 2 for regular users combined with approach 1 for celebrities)
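A minimal sketch (plain Python, with in-memory dicts standing in for the actual tweet store and timeline caches; all names are illustrative) contrasting the two approaches:

```python
from collections import defaultdict

tweets = []                          # approach 1: one global collection of (author, text, ts)
followers = defaultdict(set)         # author -> set of follower ids
timeline_cache = defaultdict(list)   # approach 2: user -> precomputed home timeline

# Approach 1: cheap write, expensive read (merge at query time)
def post_tweet_v1(author, text, ts):
    tweets.append((author, text, ts))

def home_timeline_v1(following):
    return sorted((t for t in tweets if t[0] in following),
                  key=lambda t: t[2], reverse=True)

# Approach 2: expensive write (fan-out to every follower's cache), cheap read
def post_tweet_v2(author, text, ts):
    for follower in followers[author]:
        timeline_cache[follower].append((author, text, ts))

def home_timeline_v2(user):
    return sorted(timeline_cache[user], key=lambda t: t[2], reverse=True)
```

In the hybrid scheme described above, regular users would go through something like post_tweet_v2, while tweets from accounts with huge follower counts would be merged in at read time, as in home_timeline_v1.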
Concern 2: Scalability (EX): Describing performance
mean
percentiles (and in practice)
DESCRIBING PERFORMANCE
Mean (given n values, add them up and divide by n)
Usually not a very good metric if you want to know typical response time, because it does not tell you how many users actually experienced that delay
Percentiles
Median or 50th percentile: take sorted list of response times, median is halfway
e.g. if median response time is 200ms, half of requests return in less than 200ms, and half of requests take longer than that
To know how bad outliers are: look at 95th, 99th or 99.9th percentile
950, 990 or 999 out of 1000 requests take less than given time to complete
Tail latencies are important: directly affect user experience of service
Often used in SLAs (Service Level Agreements), typically specified in terms of both the median and the 99th percentile
Percentiles in practice
When several backend calls are needed to serve a request, it takes a single slow backend request to slow down the entire request
Add response time percentiles to monitoring dashboards for services
Rolling window of response times of requests in the last timeframe (e.g. median, different percentiles over last 10 minutes).
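A small sketch of how mean and percentiles behave on the same window of response times (nearest-rank percentile, hypothetical numbers):

```python
# Mean vs. percentiles over a window of response times (in ms).
def percentile(values, p):
    """Nearest-rank percentile for p in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

response_times = [30, 32, 35, 40, 42, 45, 50, 200, 800, 2500]  # hypothetical window
mean = sum(response_times) / len(response_times)
print(mean)                              # 377.4 ms, dominated by two outliers
print(percentile(response_times, 50))    # median: 42 ms (what a typical user sees)
print(percentile(response_times, 99))    # tail latency: 2500 ms
```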
Concern 2: Scalability (EX): handling load increases
horizontal vs vertical scaling
automated vs manual scaling
HANDLING LOAD INCREASES
Vertical scaling = moving to more powerful hardware
Horizontal scaling = distributing load over more machines
Automated vs manual scaling
Elastic / autoscaling systems: detect load increases or decreases and automatically add or remove resources
Drawback: complex to implement correctly
Manual scaling: human analyses capacity and decides upon adding more machines to the system
Scaling can be proactive or reactive
There is no generic magic scaling sauce
Volume of reads, volume of writes, volume of data to store, complexity of the data, response time requirements, access patterns, security, etc.
100,000 requests/second, each 1 kB in size, vs. 3 requests/minute, each 2 GB in size
Both systems deal with 0.1 GB/s data throughput but should look very different
The second will be easier to implement
Concern 3: Maintainability
mnemonic
OSE
Operability
Make it easy for operations team to keep system running smoothly
Simplicity
Make it easy for new engineers to understand system, removing as much complexity as possible from the system
Evolvability
Make it easy for engineers to make changes to system in the future, adapting it for unanticipated use cases as requirements change
Impacts velocity of integrating new functionality
Data Model Layering
Most applications built layering one data model on top of another
For each layer key question: representation in terms of next-lower layer?
Example:
App developer: model real world in terms of objects and data structures and APIs that manipulate them
Structures often application-specific
SQL relational model
Data organized in relations (tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).
Most application development nowadays is in object-oriented programming languages
Awkward translation layer needed between objects in application code and the DB model of tables, rows and columns
Partially solved through Object-Relational Mapping (ORM) frameworks like ActiveRecord (Ruby) or Hibernate (Java)
NoSQL (general)
driving forces
2 movements
different types of NoSQL datastores (EX)
NoSQL = Not Only SQL
Driving forces
Need for greater scalability: storing more and more data
Need for greater availability: if one or more nodes is no longer responsive the rest of the system can keep going
Also
Preference for free and open-source software over commercial costly DB products
Specialised query operations (e.g. an undefined number of joins) not well supported by the relational model
Frustration with restrictiveness of relational schemas
Two movements
Data stores that could be distributed better and without limitations
Availability more important than overall consistency
In the foreseeable future, relational DBs will continue to be used alongside a broad variety of non-relational datastores
Different types of NoSQL datastores: document store, graph store, time series DB, triple/quad store, etc.
NoSQL: Key-Value Store (EX)
what
pros and cons
used for
examples
One of the simplest NoSQL DBs
Datastore focused on storing, retrieving and managing data by key – value pairs
Keys are unique identifiers
Both keys and values range from simple data types to complex composed objects (unformatted text, documents, photos, music, video)
No schemas, no relationships between data
Pros and cons:
- Drawback: can only use the key to retrieve elements, typically unable to do things like filtering or retrieval of elements with common attributes
- Only supports simple commands like get, put, delete
+ High horizontal scalability
+ Very fast for both read and write operations
Good for
Real-time recommendations and advertising: rapid access to new recommendations as web site visitor moves through site
In-memory datastore (e.g. caching or web session details / preferences)
Examples: Redis, Memcached, TiKV
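A sketch of the key-value interface, assuming a local Redis instance and the redis-py client; the key names and values are made up:

```python
# Only get/put/delete-style commands are available; there is no query by attribute.
import redis

r = redis.Redis(host="localhost", port=6379)

r.set("session:42", '{"user": "alice", "theme": "dark"}')   # put
print(r.get("session:42"))                                  # get by key only (returns bytes)
r.delete("session:42")                                      # delete
# There is no way to ask e.g. "all sessions with theme = dark" without scanning every key.
```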
NoSQL: Document Store
Document store focuses on storing document-oriented information (XML, JSON, etc.)
Seen as extension to key-value stores, value being a structured document
No rigid schema (like relational model)
All information of an object stored together
Contrary to a key-value store, can filter / aggregate on document value
Examples: MongoDB, Couchbase
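A sketch assuming a local MongoDB instance and the pymongo client (database, collection and field names are made up), showing that filtering on the document value is possible, unlike in a plain key-value store:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["demo"]["people"]                 # hypothetical db/collection names

# All information of the object is stored together, no rigid schema:
people.insert_one({"name": "Alice", "city": "Ghent",
                   "positions": [{"title": "Engineer", "since": 2021}]})

# Filter on a nested field of the document value:
for doc in people.find({"positions.title": "Engineer"}):
    print(doc["name"])
```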
NoSQL: Wide Column store (EX)
what
pros and cons
used for
real-life examples
Seen as a 2-dimensional key-value store
Uses tables, rows (identified by key), columns (identified by column keys)
Each row does not have to contain the same columns
Values can be complex data types like unformatted text and imagery
Combines strict table structure of relational DB with flexibility of document DB
Pros and cons
- Not optimized for joins
+ Designed for very high horizontal scalability
Each row can be stored as individual file, addressable by its key
Up to every attribute of a single row, addressable by key and column key, could be stored separately
Such extensive splitting of data could lead to very bad query performance
Solution: Column families, describing which attributes belong together and will most likely be queried together
Downside: makes schema stricter, need to know and define columns and column families up front
+ Optimised for
High volumes of data
High write throughput with slower reads
Examples: Cassandra, HBase, ScyllaDB
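A plain-Python sketch of the wide-column idea (row key → column family → column → value), with illustrative names; no real wide-column client is used:

```python
# Rows are addressed by key, values grouped into column families;
# rows need not share the same columns.
table = {
    "user#1": {
        "profile":  {"name": "Alice", "city": "Ghent"},   # column family "profile"
        "activity": {"last_login": "2024-05-01"},         # column family "activity"
    },
    "user#2": {
        "profile": {"name": "Bob"},                       # different set of columns
    },
}

# Typical read: one column family for one row key (what the layout is optimised for).
print(table["user#1"]["profile"])

# A single cell is addressable by (row key, column family, column key).
print(table["user#2"]["profile"]["name"])
# A join with e.g. an orders table would have to be done in application code.
```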
NoSQL: Time Series DB (EX)
what
pros and cons
used for
real-life examples
Database optimized for time-stamped or time series data
Metrics (regular time series) or events (irregular time series) that are tracked, monitored, aggregated over time
Time becomes the key index (when plotted, one of the axis is always time)
Time series DBs have difficulty with handling high cardinality data
Each table can have many indexed columns (dimensions) each containing many unique values
Data ingest rates plummet as cardinality rises exponentially
Typically support for dropping older data: retention policies
Specialized compression algorithms can operate over time series data, offering higher compression rates than more generic data
Many types of data: sensor data, performance monitoring data, server metrics, clicks, financial data, etc.
Optimized for measuring change over time: IoT, DevOps monitoring real-time analytics
Examples: InfluxDB, TimescaleDB
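A plain-Python sketch of two operations time series DBs optimise for, retention and time-bucketed aggregation; the numbers and the 30-day retention window are assumptions:

```python
import time

now = time.time()
points = [(now - 3600, 21.5), (now - 90, 21.7), (now - 30, 22.0), (now, 22.4)]  # (timestamp, value)

# Retention policy: drop points older than the assumed 30-day window.
RETENTION_SECONDS = 30 * 24 * 3600
points = [(ts, v) for ts, v in points if now - ts <= RETENTION_SECONDS]

# Time is the index: aggregate into one average value per minute bucket.
buckets = {}
for ts, v in points:
    buckets.setdefault(int(ts // 60), []).append(v)
per_minute = {minute * 60: sum(vs) / len(vs) for minute, vs in buckets.items()}
print(per_minute)
```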
NoSQL: Graph DBs (EX)
what
for what
real-life example
Data can be modeled as a graph
Vertices (nodes or entities) + Edges (relationships or arcs)
Important aspects
Any vertex can have an edge connecting it with any other vertex
No schema restricting which kind of things can or cannot be associated
Given any vertex, efficiently find both incoming and outgoing edges and thus traverse graph
By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while maintaining a clean data model
Good choice for data that can evolve or in which there are a lot of many-to-many relationships
Examples: Neo4J, ArangoDB
NoSQL: Search Engines (EX)
what
used for
real-life example
Datastore designed for finding content (typically full text) through search functionality
Data does not have to conform to a rigid schema
Typically heavily indexes content (grouping similar characteristics) in order to offer search results in a rapid manner
Indexing through tokenization (chopping up data in small pieces) and stemming (bringing words back to their root form)
Offers different methods for full-text search, complex search expressions and rankings of search results
Good choice for
Autocompletion / suggestions based on customer input
Searching through logs for analysis
Examples: Elasticsearch
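A small sketch of the indexing idea: tokenization plus a toy stemmer feeding an inverted index (a real engine such as Elasticsearch uses far more sophisticated analysis):

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Toy stemmer: strip a few common suffixes (real engines use e.g. Porter stemming).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

docs = {1: "Searching through logs", 2: "The search completed", 3: "Log rotation"}
index = defaultdict(set)            # inverted index: term -> set of doc ids
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[stem(token)].add(doc_id)

print(index["search"])   # {1, 2} -- "Searching" and "search" map to the same term
print(index["log"])      # {1, 3}
```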
Relational vs document model
many to one relationships
many to many relationships
relational vs document model
choice?
MANY-TO-ONE RELATIONSHIPS
Relational DBs: normal to refer to rows in other tables by ID, because joins are easy
Document DBs: support for many-to-one relations is weak
Joins are not needed for one-to-many tree structures, and support for joins is often weak
Work of creating join is typically moved to application instead of DB (e.g. in-memory list of regions or industries)
MANY-TO-MANY RELATIONSHIPS
Well-supported by SQL datastores and NoSQL graph stores, not so well by NoSQL document datastores
RELATIONAL VS DOCUMENT DATA MODEL
Document data model
Schema flexibility
Most document DBs do not enforce a schema on data in the documents
Better performance due to locality (see JSON example)
For some applications closer to the data structures used by the application
Concept of foreign key: document reference
Resolved at read time by join or follow-up queries
Relational data model
Better support for joins
Good support for many-to-one and many-to-many relationships
CHOICE FOR RELATIONAL OR DOCUMENT MODEL
If data in application has document-like structure (tree of one-to-many relationships, where typically entire tree is loaded)
Good idea to use document model
Splitting a document-like structure into multiple tables can lead to cumbersome schemas and complicated app code
Limitations
Cannot refer directly to a nested item within a document but need to say something like “second item in list of positions for user X251”
If documents are not too deeply nested usually not an issue
If data in application has many-to-many relationships
Document model less appealing
Possible to reduce need for joins by denormalizing, but then app code needs to keep denormalized data consistent
Normalization: process of organizing a DB to reduce redundancy and improve data integrity by dividing tables into smaller, related tables and defining relationships between them
Denormalization: process of combining related tables into a single table to improve read performance at the expense of data redundancy and update / write performance
Duplicates data that would otherwise require joins
Joins can be emulated by making multiple requests to DB, but complicates application code and usually slower than join made inside DB
Highly interconnected data: graph model may be most appropriate (see later)
Document DBs
schema flexibility (2)
Data locality for queries
SCHEMA FLEXIBILITY
Document DBs and JSON support in relational DBs usually do not enforce any schema on data in documents
Schema-on-read
Structure of the data is implicit and only interpreted when the data is read
Similar to dynamic (runtime) type checking
Schema-on-write
Approach of relational DBs with explicit schema
DB ensures all written data conforms to explicit schema
Similar to static (compile-time) type checking
Difference between approaches noticeable when application wants to change format of its data
Schema-on-read approach is advantageous if items in collection do not all have same structure
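A sketch contrasting the two approaches in plain Python; the schema and record contents are made up:

```python
REQUIRED = {"name": str, "age": int}          # hypothetical explicit schema

def write_with_schema(store, record):
    """Schema-on-write: data is validated before storing (like static typing)."""
    for field, ftype in REQUIRED.items():
        if not isinstance(record.get(field), ftype):
            raise ValueError(f"record violates schema on field {field!r}")
    store.append(record)

def read_schema_on_read(store):
    """Schema-on-read: structure is implicit, interpreted only when reading (like dynamic typing)."""
    for record in store:
        yield record.get("name", "<unknown>"), record.get("age")

docs = [{"name": "Alice", "age": 29}, {"name": "Bob", "nickname": "bobby"}]
print(list(read_schema_on_read(docs)))                        # copes with differing structure

validated = []
write_with_schema(validated, {"name": "Alice", "age": 29})    # conforms, accepted
# write_with_schema(validated, {"name": "Bob"})               # would raise ValueError
```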
DATA LOCALITY FOR QUERIES
Document usually stored as a single continuous string (JSON, XML or binary variant)
If application often needs to access entire document → performance advantage to storage locality
Locality advantage only applies if you need large parts of document at once
DB needs to load entire document
On updates entire document usually needs to be rewritten
Recommended to keep documents small and avoid writes increasing size
These limitations reduce the set of situations in which document DBs are useful
Grouping related data for locality is not limited to document model
Graph like data models
when
Property graph
advantages of graph data models (EX)
querying data
What if many-to-many relations are very common in your data?
-> More natural to model data as a graph
Vertices (nodes or entities)
Edges (relationships or arcs)
PROPERTY GRAPHS
Each vertex
Unique identifier, Set of outgoing edges, Set of incoming edges, Collection of properties (key-value pairs)
Each edge
Unique identifier, Vertex at which edge starts (tail vertex), Vertex at which edge ends (head vertex), Label to describe type of relationship between two vertices, Collection of properties (key-value pairs)
Graph store can be thought of as consisting of two relational tables (vertices and edges)
Important Properties
Any vertex can have an edge connecting it with any other vertex
No schema restricting which kind of things can or cannot be associated
Given any vertex, efficiently find both incoming and outgoing edges and thus traverse graph
Indexes on both tail_vertex and head_vertex
By using different labels for different kinds of relationships, able to store several different kinds of information in a single graph, while maintaining a clean data model
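A plain-Python sketch of the two-table view (vertices and edges) with indexes on tail_vertex and head_vertex; the vertices, edges and labels are illustrative:

```python
from collections import defaultdict

vertices = {
    1: {"label": "Person", "name": "Lucy"},
    2: {"label": "City", "name": "London"},
    3: {"label": "Country", "name": "United Kingdom"},
}
edges = [
    {"id": 10, "tail": 1, "head": 2, "label": "LIVES_IN", "props": {}},
    {"id": 11, "tail": 2, "head": 3, "label": "WITHIN", "props": {}},
]

out_index = defaultdict(list)   # index on tail_vertex: outgoing edges per vertex
in_index = defaultdict(list)    # index on head_vertex: incoming edges per vertex
for e in edges:
    out_index[e["tail"]].append(e)
    in_index[e["head"]].append(e)

# Traverse outgoing edges from Lucy:
for e in out_index[1]:
    print(e["label"], "->", vertices[e["head"]]["name"])
```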
Advantages of Graph Data Models
Noteworthy difficulties for relational models
Different kinds of regional structures in different countries
Quirks of history: country within a country
Varying granularity of data
Graphs good for evolvability
As features get added: easily extend the graph to accommodate changes in application’s data structures
Querying data
Cypher query language:
Declarative query language for property graphs
As graph data can be put in relational tables, can it be queried using SQL?
Yes, but
In relational DB usually known in advance which joins are needed in query
In graph DB query may need to traverse a variable number of edges before finding vertex you are looking for
Number of joins is not fixed in advance
Variable-length traversal in SQL = recursive common table expressions (WITH RECURSIVE)
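A runnable sketch of such a variable-length traversal using SQLite (which supports WITH RECURSIVE); the tables and the WITHIN hierarchy are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edges (tail INTEGER, head INTEGER, label TEXT);
    INSERT INTO vertices VALUES (1,'Ghent'), (2,'East Flanders'), (3,'Belgium'), (4,'Europe');
    INSERT INTO edges VALUES (1,2,'WITHIN'), (2,3,'WITHIN'), (3,4,'WITHIN');
""")

# Find every location that vertex 1 lies WITHIN, however many hops away:
rows = conn.execute("""
    WITH RECURSIVE reachable(id) AS (
        SELECT head FROM edges WHERE tail = 1 AND label = 'WITHIN'
        UNION
        SELECT e.head FROM edges e JOIN reachable r ON e.tail = r.id
        WHERE e.label = 'WITHIN'
    )
    SELECT v.name FROM reachable JOIN vertices v ON v.id = reachable.id
""").fetchall()
print(rows)   # East Flanders, Belgium, Europe (order may vary)
```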
OLTP vs OLAP (EX)
In early days writes to DB typically corresponded to commercial transactions (e.g. making a sale, placing an order, paying a salary)
DBs started being used for many different kinds of data
E.g. comments on posts, actions in a game, contacts in an address book
Basic access pattern remained similar to processing business transactions
App looks up small number of records by some key, using an index
Records are inserted or updated based on user input
Because these apps are interactive, access pattern became known as Online Transaction Processing (OLTP)
DBs also started being used for data analytics with very different access patterns
Analytic query scans over a huge number of records, reading few columns per record and calculating aggregate statistics
E.g. what was the total revenue of each of the stores in January?
E.g. how many more products than usual were sold during our last promotion?
Analytic queries often written by business analysts
Query results feed reports to help management make informed decisions → business intelligence
Access pattern became known as Online Analytic Processing (OLAP)
At first same DBs were used for OLTP and OLAP queries
Early 1990s however OLAP queries started being run on a separate DB called the data warehouse
data warehouse (EX)
why
what
ETL principle → figure!
OLTP systems usually expected to be highly available and to process transactions with low latency as they are critical to business
Reluctance to let business analysts run ad hoc analytic queries on OLTP databases (queries usually expensive, harming performance)
Introduction of data warehouse, allowing queries without affecting OLTP operations
Contains read-only copy of data in various OLTP systems
Extracted from OLTP systems by periodic data dump or stream of updates
Transformed into analysis friendly schema
Cleaned up
Loaded into data warehouse
Process of getting data into warehouse is known as Extract-Transform-Load (ETL)
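A minimal ETL sketch using SQLite in-memory databases as stand-ins for an OLTP source and the warehouse; table and column names are made up:

```python
import sqlite3

oltp = sqlite3.connect(":memory:")       # stand-in for the operational (OLTP) system
oltp.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER, placed_at TEXT);
    INSERT INTO orders VALUES (1,'alice',1250,'2024-01-05T10:00:00'),
                              (2,'bob',  4000,'2024-01-06T12:30:00');
""")

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER, customer TEXT, amount_eur REAL, order_date TEXT)")

# Extract: periodic dump from the OLTP source.
rows = oltp.execute("SELECT id, customer, amount_cents, placed_at FROM orders").fetchall()
# Transform: clean up and reshape into an analysis-friendly schema (cents -> EUR, timestamp -> date).
transformed = [(oid, cust.title(), cents / 100.0, ts[:10]) for oid, cust, cents, ts in rows]
# Load into the warehouse table.
warehouse.executemany("INSERT INTO fact_orders VALUES (?,?,?,?)", transformed)
print(warehouse.execute("SELECT * FROM fact_orders").fetchall())
```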
Data warehouse vs Data mart
Data warehouse
Centralised location for data
Holds multiple subject areas / very detailed information
Works to integrate all data sources
Source of data for reporting, analytics and offline operational processes
Typically employs expensive DB technology
Data mart
DBs used to provide fast, independent access to a subset of warehouse data
Often created for departments, projects, users
Compared to data warehouse
Similar technology, subset of data, relieves pressure on enterprise data warehouse, provides sandbox for analysis
Data flow in the data warehouse (5) (EX)
→ figure!
DQ: Data Quality Tools
Defining quality rules, applying those rules to data to detect violations/exceptions and fixing those exceptions
Data profiling automatically gathers statistics about data to ascertain its quality e.g. how many values are empty, minimum and maximum values, most frequent values, etc.
ETL: Extract Transform Load (zie eerder)
MDM: Master Data Management systems
Creation of master lists of various entities: customers, products, suppliers, etc. and detecting multiple records that apply to the same entity and fixing ambiguities
OLAP: Online Analytical Processing Tools (technology behind BI)
BI: Business Intelligence
Metadata flow in the data warehouse
→ figure!
Metadata repositories
Contain technical metadata across the data assets
Three main use cases
Finding data assets
Tracking lineage (provenance): where data came from and how it was generated/transformed a.k.a. data trail auditing
Impact analysis: allows developers to see all data assets that rely on a particular field or integration job before making a change
Data warehouse data model
general (difference between OLTP and OLAP)
star schema
slowly changing dimensions
DB at the heart of a data warehouse
Usually relational DB optimized for analytics-type processing
DB usually heavily indexed and tuned to ensure optimal performance for most common queries
When relational DBs are used to support operational systems and applications (OLTP) data is usually stored in highly normalized data models
Create tables with minimum redundancy and smallest possible number of fields
Makes updates (writes) very fast
Data warehouses however favour denormalized data models (for OLAP)
Each table contains as many related attributes as possible
Typically contains data from many sources and applications each with their own schema
Data from those sources has to be converted to a single schema
Typically uses a star schema (dimensions / fact tables)
Many DBs include specialized query optimizations for handling star schema-like joins
STAR SCHEMA
Data warehouses use star schemas a.k.a. dimensional modelling
Centre of schema is fact table
Each row represents an event occurring at a particular time
Fact tables can get very large
Some columns are attributes
Other columns foreign key references to other tables: dimension tables
Why is it called star schema?
When table relationships are visualized, fact table is in middle surrounded by dimension tables
Typical data warehouses often have fact tables with over 100 columns, sometimes several hundred
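A tiny star schema sketch in SQLite with one fact table and two dimension tables (all names and numbers are illustrative), plus the kind of analytic query it serves:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_sk   INTEGER PRIMARY KEY, city TEXT);
    -- fact table: each row is a sales event, with foreign keys into the dimensions
    CREATE TABLE fact_sales  (date TEXT, product_sk INTEGER, store_sk INTEGER,
                              quantity INTEGER, net_price REAL);
    INSERT INTO dim_product VALUES (1,'Espresso','coffee'), (2,'Croissant','bakery');
    INSERT INTO dim_store   VALUES (1,'Ghent'), (2,'Antwerp');
    INSERT INTO fact_sales  VALUES ('2024-01-05',1,1,2,2.50), ('2024-01-06',2,2,1,1.80),
                                   ('2024-01-07',1,2,3,2.50);
""")

# "Total January revenue per store": scan the fact table, group by a dimension attribute.
for row in conn.execute("""
    SELECT s.city, SUM(f.quantity * f.net_price)
    FROM fact_sales f JOIN dim_store s ON f.store_sk = s.store_sk
    WHERE f.date LIKE '2024-01-%' GROUP BY s.city
"""):
    print(row)
```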
SLOWLY CHANGING DIMENSIONS
To allow accurate data analysis sometimes needed to keep track of e.g. a person’s state over time
Ensures each transaction corresponds to person’s state at time of transaction
Most common type of slowly changing dimension tracks historical data by creating multiple records
Example
Keep track of customer purchases along with customer demographics (single, married, parent, divorced, etc.)
Without slowly changing dimensions we would have one single record for a customer reflecting only current demographic state
Difficult to analyse how many people with children spend money on specific items
Add complexity to ETL jobs and analytic queries
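A plain-Python sketch of this pattern: each demographic change adds a new dimension record with validity dates, so a purchase can be matched to the customer's state at transaction time (data is made up):

```python
customer_dim = [
    # (customer_id, marital_status, valid_from, valid_to)
    ("C42", "single",  "2020-01-01", "2022-06-30"),
    ("C42", "married", "2022-07-01", "9999-12-31"),   # current record
]

def state_at(customer_id, date):
    """Return the customer's demographic state on a given ISO date."""
    for cid, status, start, end in customer_dim:
        if cid == customer_id and start <= date <= end:
            return status

print(state_at("C42", "2021-03-15"))   # single  -- state at the time of that purchase
print(state_at("C42", "2023-01-02"))   # married
```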
Data warehouse: Column Oriented storage (EX)
compare with traditional
own example
Typical data warehouse query only accesses 4 or 5 columns at one time
In most OLTP DBs storage is laid out in row-oriented fashion
All values from one row of a table are stored next to each other
Document DBs are similar: entire document is stored as one continuous sequence of bytes
Performance issues when loading all these rows from disk into memory, parsing them and filtering out those that don’t meet required conditions
Column-oriented storage: do not store all values from one row together, but store all values from each column together instead
If each column is stored in a separate file, a query needs only read and parse those columns used in the query
Example
Customers table with 300 columns, average of 5 bytes per attribute stored
1.5 kB for each customer
1 billion customers
Row-oriented (traditional relational)
1.5 TB of storage for the table
Columnar
If age takes 2 bytes and a record identifier 6 bytes, the DB needs 8 bytes per field, or 8 GB for 1 billion customers
Queries like: “how many customers are under 30?” can be answered very fast
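A plain-Python sketch of the difference in layout; in reality each column would live in its own file, but the effect on the "customers under 30" query is the same:

```python
rows = [  # row-oriented: all attributes of a customer stored together
    {"id": 1, "name": "Alice", "age": 29},   # ... plus ~297 more columns in the example above
    {"id": 2, "name": "Bob",   "age": 41},
    {"id": 3, "name": "Carol", "age": 25},
]

columns = {  # column-oriented: each column stored (and read) separately
    "id":   [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "age":  [29, 41, 25],
}

print(sum(1 for r in rows if r["age"] < 30))         # must load/parse every full row
print(sum(1 for age in columns["age"] if age < 30))  # reads only the age column
```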
ETL vs ELT
→ figure!
Extract Transform Load (ETL)
Data is extracted, transformed and loaded into data warehouse
Sometimes data from different systems is used to create or update a single record = conformed dimension
Extract Load Transform (ELT)
Data is extracted, loaded into data warehouse and transformed
Virtualised data warehouse
what
when
drawbacks
→ figure!
Data warehouse approach is to bring all data together in one place, integrate it into a single conforming schema and use it for analytics queries
Alternative
Create logical or virtual schema across multiple systems and issue queries against that schema
A.k.a. Federation, Enterprise Information Integration (EII) and data virtualisation
Approach is more appropriate than using a data warehouse when
Data must be kept fresh in the face of changes: execute queries against original sources, results always up to date
Data access is infrequent: building very expensive data warehouses for data infrequently used is not cost-effective
Compliance and data residency clauses constrain data from being copied to a data warehouse
Drawbacks
Labor-intensive manual process: virtual tables must be manually defined across disparate systems
Schema and logic changes: schema change can break queries and make all data unavailable until queries are fixed
Performance: queries that span multiple systems (federated queries) have significant performance challenges
Frequency: if there are a lot of queries against virtual schema, it becomes more advantageous to extract data once, store it in a data warehouse and query it there: lowers load on source systems
What is a data lake
Self-service is taking over from carefully crafted and labor-intensive approaches of the past
IT professionals created well-governed data warehouses and data marts but took months to make changes
Data lake focusses on self-service
Take-aways: data in original form and format and used by various users
Allow analysts to analyse data without having to get help from IT
Data preparation tools that help analysts shape data for analytics
Catalog tools that help analysts find the data they need
Data science tools that help perform advanced analytics
Challenge with self-service is governance and data security
How to make data available to analysts without violating internal and external data compliance regulations?
Data lake maturity (4)
(think of the figure)
Data puddle
Single purpose or single project data mart built using big data / cost-efficient technology
Data pond
Collection of data puddles
Focus on scalable / cost-efficient technology compared to data warehouse / mart
Data lake
Supports self-service: business users able to find and use data sets that they want to use without having to rely on help from IT department
Aims to contain data that business users may possibly want even if there is no project requiring it at the time
Data ocean
Expands self-service data and data-driven decision making to all enterprise data, wherever it may be
Data lake success factors: Platform
4 items
mnemonic
VCV FP
Big data technologies like Hadoop, cloud solutions like AWS, Azure and Google Cloud Platform
Volume: designed to scale out
Cost: ability to store and process huge volumes of data inexpensively
Usually at one-tenth to one-hundredth the cost of a commercial relational DB
Variety: filesystems or object stores that allow storing all sorts of files e.g. Hadoop HDFS, MapR FS, AWS S3
Unlike relational DB requiring schema on write, a filesystem or object store does not care what you write (schema on read)
Future-proofing
If data is stored in a relational DB it can only be accessed by that relational DB
Hadoop and big data platforms very modular: Hive can provide a SQL interface to Hadoop files, to Pig scripts, to Spark, to MapReduce, etc. (seen in later slidedecks)
Data lake success factors: Data
Most data collected by enterprises today is thrown away
Small percentage of data aggregated and kept in data warehouse for a few years
Issue with data silos: different departments hoard their data, making it difficult to use data cross-groups
Data lake: store as much data as possible for future use
Typically, no known reason for storing the data, but do so in case data is needed one day
Makes no sense to convert or treat data prematurely → save data in its native format
Consumes raw data and prevents data silos
Data lake success factors: Interface
To gain wide adoption and reap benefits of helping business users make data driven decisions, solutions companies provide must be self-service
Users must be able to find, understand and use data without help from IT department
IT unable to scale support to such a large user community and such a large variety of data
Interface at the right level of expertise
Analysts
Often do not possess skills to use raw data: too much detail, too granular, too many quality issues
Data has to be harmonized: put in same schema with same field names and units of measure
They want “cooked” data, not raw data
Data scientists
Want access to raw data, not the cooked data
Possible by setting up multiple zones / areas that meet particular needs
Setting up data lake
4 steps
Set up the infrastructure
E.g. get the Hadoop cluster up and running
Organize the data lake
Create zones for use by various user communities and ingest data
Set the data lake up for self-service
Create catalogue of data assets, set up permissions and provide tools for analysts to use
Open data lake up to users
Setting up data lake: 1 Set up the infrastructure (EX)
3 options
→ figure!
Initially data lakes were built mostly on-premise using open-source or commercial Hadoop distributions
Later cloud-based data lakes or hybrid cloud/on-prem data lakes started being used
Latest evolution: logical data lake: a virtual data lake layer across multiple heterogeneous systems
Hadoop, relational, NoSQL DBs, on-premise and in the cloud
All use a catalogue to find data assets
Setting up data lake: 2 Zones: organizing the data lake
zones → figure!
governance
Raw or landing zone where data is ingested and kept as close as possible to the original state
Gold or production zone where clean, processed data is kept
Dev or work zone where more technical users (data scientists / engineers) work
Can be organised by user, by project, by subject, etc.
Once analytics performed in work zone get productized → moved into gold zone
Sensitive zone that contains sensitive data
Governance should reflect data usage and user community requirements
Different zones have different levels of governance and SLAs
Data in the gold zone is strongly governed, well curated, well documented, carries quality and freshness SLAs
Data in the work area has minimal governance (mostly making sure there is no sensitive data) and SLAs that vary from project to project
Setting up data lake: 3 setting up for self service
4 steps
Analysts typically go through four steps to do their job
Find and understand
Provision (get access to the data)
Prepare (clean and convert)
Analyse (answer questions, create visualisations and report)
First three steps typically take 80% of an analyst’s time
60% is in the first step of finding and understanding data
setting up for self service: 1 finding and understanding data
Variety and complexity of available data far exceeds human ability to remember it
Typical companies have 1000s of DBs, with typically many 100s of tables with each table having many 100s of fields
Typical project
Ask around to see whether anyone has ever used a particular type of data
Stumble onto data set that someone has already used
Once decided to use that data set spend a lot of time deciphering what the data it contains means
Knowledge about what data means is usually spread throughout the company
Analyst crowdsourcing tools collect this information to document data sets using simple descriptions composed of business terms and build a search index to help analysts find what they are looking for
setting up for self service: 2 Accessing and provisioning the data
Once right data sets identified → analysts need to be able to use them
Publish information about all data sets in a metadata catalogue so analysts can find useful data sets and request access as needed
Requests include justification for access, project that requires the data and duration of access required
Incoming request may trigger work to deidentify sensitive data
Provisioning or physical access can be granted in a number of ways
Users can be granted read access to the entire data set
If only partial access should be granted, a copy of the file containing just the data appropriate to the user can be created (and kept up to date) or a table or view can be created that contains only the fields and rows that the analyst should see
If needed, a de-identified version of the data set can be generated that replaces sensitive information with randomly generated equivalent information, so all applications still work, but no sensitive data is leaked
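A sketch of the de-identification step: sensitive identifiers are replaced by consistent pseudonyms so joins and aggregations still work; the salt and field names are assumptions:

```python
import hashlib

def pseudonymize(value, salt="project-salt"):   # salt is an assumed project secret
    """Replace a sensitive value with a stable, non-reversible pseudonym."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

records = [{"email": "alice@example.com", "spend": 120},
           {"email": "bob@example.com",   "spend": 95}]

deidentified = [{"customer_key": pseudonymize(r["email"]), "spend": r["spend"]}
                for r in records]
print(deidentified)   # the same email always maps to the same key, so aggregation still works
```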
setting up for self service: 3 Preparing the data
Most of the time data needs work to render it appropriate for analysts. This generally involves the following operations
Shaping
Selecting a subset of fields and rows to work on, combining multiple files and tables into one (joining), transforming and aggregating, converting variables to features (e.g. converting age into a feature that has a value of 0 if a person is over 65 and 1 if not)
Cleaning
Filling in missing values, correcting bad values, resolving conflicting data, normalizing units of measure and codes to common units
Blending
Harmonizing different data sets to the same schema, same units of measure, same codes, etc.
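A plain-Python sketch of shaping, cleaning and blending on made-up records (the fallback age and conversion rate are assumptions):

```python
raw = [
    {"name": "Alice", "age": 70,   "spend": "120", "currency": "EUR"},
    {"name": "Bob",   "age": None, "spend": "95",  "currency": "USD"},
]

FALLBACK_AGE = 40        # assumed fill-in value for missing ages (cleaning)
USD_TO_EUR = 0.9         # assumed conversion rate (blending to one unit of measure)

prepared = []
for rec in raw:
    age = rec["age"] if rec["age"] is not None else FALLBACK_AGE          # cleaning: fill missing value
    spend_eur = float(rec["spend"]) * (USD_TO_EUR if rec["currency"] == "USD" else 1.0)
    prepared.append({
        "name": rec["name"],
        "age_feature": 0 if age > 65 else 1,    # shaping: convert variable to feature (as in the example)
        "spend_eur": round(spend_eur, 2),       # blending: one unit of measure for all records
    })
print(prepared)
```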
Data lake architectures
Initially thought one huge on-premise data lake would contain all their data
Not ideal, multiple data lakes typically proven to be a better solution
Different data sovereignty regulations
Organizational pressures
Frustration at finding experienced administrators for setting up and maintaining complex Hadoop (and other big data technologies) cluster
High interest in Cloud-based data lakes where most hardware and platform components are managed by experts that work for Amazon, Microsoft, Google and others
DATA LAKES IN THE PUBLIC CLOUD
Access to big data technology expertise
Short deployment times
Low cost of storage and elastic nature of cloud computing make it an attractive option for implementing a data lake
A lot of data is being stored for future use, so makes sense to store it as inexpensively as possible
Cost optimization possible through various storage tiers offered by public cloud vendors: from high-speed to glacial, with slower-access media being significantly cheaper
Elasticity of cloud computing allows a very large cluster to be spun up on demand when needed
Compared to on-premise: as nodes fill up with data, new nodes need to be added just for storage
If analytic loads are CPU-heavy and need more compute power, you need to add even more nodes, even though you may only use them for a short time
In the cloud you pay for the storage and processing that you need
Data lake virtualization: logical data lakes
→ figure
Instead of loading all data into the data lake in case someone may actually need it, it is made available to analysts through a central catalogue or through data virtualization software
Address the issues of completeness and redundancy
Completeness
If analysts can find only data that is already in the data lake other data that has not been ingested into the data lake won’t be found
Redundancy
If we ingest all data into the data lake, we will have redundancy between the sources of data and the data lake
With multiple data lakes, to achieve completeness we would need to ingest the same data into each data lake
Already a lot of redundancy in the enterprise
Data lake virtualization: managing data in the logical data lake
→ figure
Handling completeness
Create a catalog of all the data assets so the analysts can find and request any data set that is available in the enterprise
Handling redundancy
Store data that is not stored anywhere else in the data lake
Bring data that is stored in other systems into the data lake if and when it is needed, and keep it in sync while it is needed
Bring each data set in only once for all users
Data lake virtualization: Virtualisation vs catalog based data lake
→ figure
Virtualization a.k.a. federation or EII (Enterprise Information Integration)
Technology developed in 1980s improved into the 2010s
Creates virtual view or table that hides location and implementation of the physical tables
E.g. view created by joining two tables from different DBs
For logical data lake: would require every data set to be published as a virtual table and kept up to date as underlying table schemas change
Views present significant problems
Creating a virtual view does not make data any easier to find
Joining data from multiple heterogeneous systems is complex and compute-intensive, causing massive loads on the systems and long execution cycles
By contrast: in catalog-driven approach only metadata about each data set is published in order to make it findable
Datasets then provisioned to the same system (e.g. Hadoop cluster) to be processed locally
Data lake virtualization: catalog based logical data lake
Makes all data findable and accessible to analysts
Can serve as a single point of access, governance and auditing
Data swamp
Data pond that has grown to size of a data lake but failed to attract a wide analyst community
Usually due to lack of self-service and governance facilities
Figure
Various teams use small areas of lake for projects (white data pond area)
Majority of data is dark, undocumented / unusable
Historically a lot of companies rushed to buy Hadoop clusters and filled them with raw data
Millions of files containing petabytes of data and no way to make sense of that data
No one could tell where the sensitive data was, so users could not be given access and data remained largely unusable and unused
Example: company built a data lake, encrypted all data in lake to protect it, required data scientists to prove that data they wanted was not sensitive before it would unencrypt it and let them use it
Because everything was encrypted, data scientists could not find anything, much less prove it was not sensitive
No one used the data lake / swamp
Data Warehouse vs Data Lake (EX)