H5 - Batch Processing

28 Terms

1

Data processing system types (h4-h6)

  • Services (online systems)

    • Service waits for request or instruction from a client to arrive

    • When request is received service tries to handle it as quickly as possible and sends response back

    • Response time usually primary measure of performance of a service

    • Availability often also important

  • Batch processing systems (offline systems)

    • Take a large amount of input data, run a job to process it and produce output data

    • Jobs often take a while (minutes/hours/days)

    • Often scheduled to run periodically

    • Primary performance measure is usually throughput

      • Time it takes to crunch through an input dataset of a certain size

  • Stream processing systems (near-real-time systems)

    • Between online and offline/batch processing

    • Stream processor consumes inputs and produces outputs (rather than responding to requests)

    • A stream job, however, operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data

    • Allows stream processing systems to have lower latency than equivalent batch systems

2

Compute intensive vs data intensive

figure(s)

Compute-intensive tasks (traditional HPC = high performance computing)

  • Input data is relatively small

  • Data is usually kept in-memory during job execution

  • Many computations need to be performed per data-element

  • Runtime is limited by compute or interconnection network capacity

Data-intensive tasks (“big data”)

  • Input data is (very) large

  • Data is usually not kept in-memory during job execution

  • Few computations need to be performed per data-element

  • Runtime is limited by data transfer

3

Apache Hadoop

  • general

Apache Hadoop

  • Open-source framework written in Java

  • Distributed storage and processing of very large data sets (Big Data)

  • Scales from single server to thousands of nodes

Framework consists of a number of modules

  • Hadoop Common – shared libraries and utilities

  • Hadoop Distributed File System (HDFS) - distributed file system for storage

  • Hadoop YARN – resource manager

  • Hadoop MapReduce – programming model

  • assume failures are common

4

Main ideas behind Hadoop (6)

  • mnemonic

DD SO PFL HD

MAIN IDEAS BEHIND HADOOP

  • Data-driven philosophy

    • Growing evidence that having more data and simple models outperforms complex features/algorithms with less data

  • Scale out instead of up

    • Prefer a larger number of low-end workstations over a smaller number of high-end workstations (for big-data problems)

      • Buying low-end workstations (“desktop-like machines”) is 4-12x as cost efficient as buying high-end workstations (“high-memory machines with a high number of CPUs / cores”)

    • Idea is also valid for network interconnects

      • E.g. Infiniband versus “commodity” Gigabit Ethernet

  • Move processing to where the data resides

  • assume failures are common

  • Access data linearly - avoid random access (much faster)

  • Hide system-level details from the developer

    • Parallel programming is difficult

      • Code runs in unpredictable order

      • Race conditions, deadlocks, etc.

    • Main idea is to be able to program the “what” (i.e. functionality of the code) and hide the “how” (i.e. low-level details)

5

MapReduce basics

→ figure

MapReduce programming model (algorithm published in 2004)

  • Divide & conquer strategy

    • Divide: partition dataset into smaller, independent chunks to be processed in parallel ( = “map”)

    • Conquer: combine, merge or otherwise aggregate the results from the previous step (= “reduce”)

  • Mappers and reducers

    • Map task processes a subset of input <key, value> pairs and puts out intermediate pairs

    • Reduce task processes a subset of the intermediate pairs

  • Map function operates on a single input pair and outputs a list of intermediate pairs

  • Reduce function operates on a single key and all of its values and outputs a list of resulting <key, value> pairs

  • Sorts intermediate keys between map and reduce phase

Input files not modified, output files created sequentially

File reads and writes on distributed filesystem
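
A minimal, self-contained sketch of this model in plain Python (not the Hadoop API): the classic word-count example, with illustrative function names and a tiny in-process "shuffle" standing in for the framework.

```python
from collections import defaultdict

# Map function: operates on a single input <key, value> pair
# (e.g. key = line offset, value = line of text) and outputs a
# list of intermediate <key, value> pairs.
def map_fn(_offset, line):
    return [(word, 1) for word in line.split()]

# Reduce function: operates on a single intermediate key and all of
# its values, and outputs a list of resulting <key, value> pairs.
def reduce_fn(word, counts):
    return [(word, sum(counts))]

lines = ["the quick brown fox", "the lazy dog"]

# Map phase: apply map_fn to every input pair.
intermediate = []
for offset, line in enumerate(lines):
    intermediate.extend(map_fn(offset, line))

# Shuffle/sort phase: sort intermediate pairs and group values by key.
groups = defaultdict(list)
for key, value in sorted(intermediate):
    groups[key].append(value)

# Reduce phase: apply reduce_fn once per key.
output = [pair for key, values in groups.items() for pair in reduce_fn(key, values)]
print(output)  # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```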

6

MapReduce: Distributed file system: HDFS

  • what

Hadoop’s implementation: Hadoop Distributed File System, open-source implementation of Google File System (GFS)

HDFS is based on shared-nothing principle: commodity hardware connected by conventional datacentre network

  • Daemon process running on each machine, exposing a network service that allows other nodes to access files stored on that machine

  • Central server (NameNode) keeps track of which file blocks are stored on which machine

  • Conceptually creates one big filesystem that can use space on disks of all machines running daemons

To handle machine and disk failures, file blocks are replicated on multiple machines

Some HDFS deployments scale to tens of thousands of machines, combining petabytes of storage

7

MapReduce: Distributed file system: HDFS

  • operation

  • figure

  • Typical usage pattern

    • Huge files (100s of GB to TBs)

    • Data rarely updated in place

    • Reads and appends are common

  • Chunk servers

    • File is split into chunks

    • Typically, each chunk is 16-64MB

    • Each chunk replicated (usually 2x or 3x)

    • Try to keep replicas in different racks

  • Master node

    • A.k.a. NameNode in Hadoop’s HDFS

    • Stores metadata about where files are stored

    • May be replicated

  • Client library for file access

    • Talks to master to find chunk servers

    • Connects directly to chunk servers to access data
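
As a toy illustration of the chunking arithmetic only (the real HDFS/GFS placement policy is more involved), assuming 64 MB chunks and 3x replication; all numbers and rack names are made up:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks (assumed)
REPLICATION = 3                 # assumed replication factor

def chunk_count(file_size_bytes):
    """Number of chunks a file is split into (last chunk may be partial)."""
    return -(-file_size_bytes // CHUNK_SIZE)  # ceiling division

def chunk_index(byte_offset):
    """Which chunk of the file holds a given byte offset."""
    return byte_offset // CHUNK_SIZE

def place_replicas(chunk_id, racks):
    """Toy placement: spread a chunk's replicas over different racks."""
    return [racks[(chunk_id + i) % len(racks)] for i in range(REPLICATION)]

file_size = 10 * 1024**3                  # a 10 GB file
print(chunk_count(file_size))             # 160 chunks of 64 MB
print(chunk_index(5 * 1024**3))           # byte offset 5 GB falls in chunk 80
print(place_replicas(0, ["rack-A", "rack-B", "rack-C", "rack-D"]))  # ['rack-A', 'rack-B', 'rack-C']
```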

8

MapReduce: Distributed file system: HDFS (EX)

  • client reading data

  • figure!

  • Client opens file

  • NameNode knows location / addresses of blocks in the file

    • Datanodes sorted according to proximity to client

  • Client receives an inputstream object which abstracts reading data from HDFS datanodes
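
One way to exercise this read path from Python is via a client library such as pyarrow's HDFS filesystem; a hedged sketch, assuming pyarrow is built with libhdfs support and that the NameNode host, port and file path (all placeholders here) are reachable:

```python
import pyarrow.fs as pafs

# Connect to the NameNode (placeholder host/port); the client library
# takes care of locating the DataNodes that hold the file's blocks.
hdfs = pafs.HadoopFileSystem(host="namenode.example.org", port=8020)

# Opening the file yields a stream object that abstracts away which
# DataNode each block is actually read from.
with hdfs.open_input_stream("/data/logs/part-00000") as stream:
    first_kb = stream.read(1024)
    print(len(first_kb))
```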

9

MapReduce: Distributed file system: HDFS (EX)

  • client writing data

  • figure!

  • Client creates file by calling create

  • NameNode makes a record of new file

  • Client receives an outputstream object abstracting block placement

    • Data is automatically replicated (default replication level is three)
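
And analogously for the write path (same hedged assumptions as the read sketch above); replication is handled by HDFS itself, not by the client code:

```python
import pyarrow.fs as pafs

hdfs = pafs.HadoopFileSystem(host="namenode.example.org", port=8020)  # placeholders

# The create call and block placement are hidden behind the output
# stream; HDFS replicates the written blocks according to the
# configured replication level (default three).
with hdfs.open_output_stream("/data/output/result.txt") as stream:
    stream.write(b"hello HDFS\n")
```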

10

MapReduce: Distributed file system: HDFS (EX)

  • Data flow

  • figure!

11

MapReduce

  • fixed steps + examples

  • figure!

  • Five MapReduce steps

    1. Sequentially read (a lot of) data

    2. Map: extract key, values you care about

    3. Group by key: sort and shuffle

    4. Reduce: aggregate, summarize, filter or transform

    5. Write result

  • MapReduce environment takes care of

    1. Partitioning the input data

    2. Scheduling the program’s execution across a set of machines

    3. Performing the shuffle / group-by-key step

    4. Handling machine failures

    5. Managing required inter-machine communication
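
A minimal pure-Python sketch that wires these five steps together for a toy job (maximum measurement per sensor); illustrative only: a real framework additionally partitions the input, schedules tasks across machines and handles failures.

```python
from itertools import groupby

def map_fn(record):
    # 2. Map: extract the key and value you care about.
    sensor, value = record.split(",")
    return (sensor, float(value))

def reduce_fn(sensor, values):
    # 4. Reduce: aggregate (here: the maximum value per sensor).
    return (sensor, max(values))

def run_mapreduce(input_lines, map_fn, reduce_fn):
    # 1. Sequentially read the input data.
    pairs = [map_fn(line) for line in input_lines]
    # 3. Group by key: sort, then gather the values per key.
    pairs.sort(key=lambda kv: kv[0])
    grouped = groupby(pairs, key=lambda kv: kv[0])
    # 4. + 5. Reduce each group and write out the result.
    return [reduce_fn(key, [v for _, v in group]) for key, group in grouped]

data = ["s1,20.5", "s2,19.0", "s1,22.3", "s2,18.4"]
print(run_mapreduce(data, map_fn, reduce_fn))  # [('s1', 22.3), ('s2', 19.0)]
```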


12

MapReduce

  • workflow

  • coordination: master (figure)

Workflow

  • Previous example showed need for two MapReduce passes

  • Very common for MapReduce jobs to be chained together into workflows so output of job becomes input to next job

Coordination: master

  • Master node takes care of coordination

    • Task status: idle, in-progress, completed

    • Idle tasks get scheduled as workers become available

    • When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer

    • Master pushes info to reducers

  • Master pings workers periodically to detect failures

    • Distributed implementation should produce same output as a non-faulty sequential execution

13

MapReduce

  • distributed processing parallelism

  • overview figure

14

MapReduce

  • how many map and reduce tasks

  • M map tasks, R reduce tasks

  • Rule of thumb

    • Make M much larger than the number of nodes in the cluster

    • One DFS chunk per map is common

    • Improves dynamic load balancing and speeds up recovery from worker failures

  • Usually, R is smaller than M

    • Because output is spread across R files
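
A back-of-the-envelope illustration of this rule of thumb, with assumed numbers (1 TB of input, 128 MB DFS chunks, a 50-node cluster, two reducers per node):

```python
input_size = 1 * 1024**4      # 1 TB of input data (assumed)
chunk_size = 128 * 1024**2    # 128 MB DFS chunks (assumed)
nodes = 50                    # cluster size (assumed)

M = input_size // chunk_size  # one map task per DFS chunk
R = 2 * nodes                 # far fewer reducers than map tasks (heuristic)

print(M, R)  # 8192 map tasks, 100 reduce tasks: M >> number of nodes, R < M
```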

15

MapReduce

  • advantages and disadvantages

  • Advantages

    • Model easy to use, hides details of parallelization and fault recovery

    • Many problems can be expressed in the MapReduce framework

    • Scales to thousands of machines

  • Disadvantages

    • 1-input, 2-stage data flow is rigid, hard to adapt to other scenarios

    • Custom code is needed for even the most common operations e.g. filtering

    • Opaque nature of the map/reduce functions impedes optimization

16

RDBMS vs Hadoop MapReduce (EX)

  • MapReduce suits applications where data is written once and read many times, whereas relational DBs are good for datasets that are continually updated

  • RDBMSs handle structured data (defined format); Hadoop works well on unstructured or semi-structured data → designed to interpret data at processing time

  • Relational data often normalized to retain integrity and remove redundancy. For Hadoop, normalization poses problems because it makes reading a record a non-local operation (requiring joins).

17

MapReduce advanced features

  • backup tasks

  • Problem: slow workers (so-called stragglers) significantly lengthen the job completion time

    • Other jobs on machine

    • Bad disks

    • Generic issues

  • Solution: spawn backup copies of tasks

    • Whichever one finishes first “wins”

  • Effect

    • Dramatically shortens job completion time, but must be able to eliminate redundant results

    • Increases load on cluster

18

MapReduce advanced features

  • Combiners

  • Often a Map task will produce many pairs of form (k,v1), (k,v2), ... for same key k

    • E.g., popular words in word count example

  • Save network time by pre-aggregating values in mapper

    • Combine(k, list(v1)) → v2

    • Combiner is usually the same as reduce function, but running after map task on map worker
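
A plain-Python sketch of what the combiner buys, continuing the word-count example: the map output is pre-aggregated locally before anything crosses the network (names and data are illustrative).

```python
from collections import Counter

line = "to be or not to be"

# Without a combiner: one (word, 1) pair per occurrence is shipped to the reducers.
raw_pairs = [(word, 1) for word in line.split()]  # 6 pairs

# With a combiner (same logic as the reduce function, run on the map worker):
# values for the same key are summed locally first.
combined = list(Counter(line.split()).items())    # 4 pairs

print(len(raw_pairs), len(combined))  # 6 vs 4 pairs sent over the network
print(combined)                       # [('to', 2), ('be', 2), ('or', 1), ('not', 1)]
```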

19

MapReduce advanced features

  • Custom partition function

  • Want to control how keys get partitioned over reducers

    • Inputs to map tasks are created by contiguous splits of input file

    • Reduce task requires records with same intermediate key end up at same worker

  • System uses a default partition function

    • hash(key) mod R, with R = number of reducers

  • Sometimes useful to override hash function

    • E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in same output file

    • With the default partitioner, URLs (keys) like www.ugent.be and www.ugent.be/ea/nl would end up at different reducers
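
A sketch of the default versus a custom partition function in Python, using the URL keys from the example above. A stable hash (crc32) stands in for the framework's hash, since Python's built-in hash() is randomized between processes; R = 4 is an assumption.

```python
import zlib
from urllib.parse import urlparse

R = 4  # number of reducers (assumed)

def stable_hash(s):
    return zlib.crc32(s.encode("utf-8"))

def default_partition(key):
    # Default partitioner: hash(key) mod R.
    return stable_hash(key) % R

def host_partition(key):
    # Custom partitioner: hash(hostname(URL)) mod R, so all URLs of one
    # host end up at the same reducer / in the same output file.
    url = key if "://" in key else "http://" + key
    return stable_hash(urlparse(url).netloc) % R

for url in ["www.ugent.be", "www.ugent.be/ea/nl"]:
    print(url, default_partition(url), host_partition(url))
# default_partition may send the two URLs to different reducers;
# host_partition always sends them to the same one.
```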

20

MapReduce advanced features

  • The need for joins

  • MapReduce joins and secondary sort (figure)

THE NEED FOR JOINS

  • In many datasets common for one record to have an association with another record

  • Denormalization can reduce need for joins

  • In a DB, when executing a query involving a small number of records, the DB will use an index to quickly locate the records of interest

    • If query involves joins → may require multiple index lookups

    • Issue: MapReduce has no concept of indexes

  • When MapReduce job is given set of files as input → reads entire content of those files

    • In DB terms: full table scan

    • Very expensive if you only want to read a small number of records

MAP-REDUCE JOINS AND SECONDARY SORT

  • Purpose of mapper is to extract a key and value from each input record

  • MapReduce framework partitions mapper output by key and sorts key-value pairs

  • Reducer performs actual join: called once for every user ID and thanks to secondary sort, first value is date-of-birth

  • Known as sort-merge joins

    • Mapper output sorted by key

    • Reducers merge together sorted lists
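
A toy reduce-side (sort-merge) join in plain Python, following the user-ID example: the mapper tags each record so that, after sorting, a user's date of birth reaches the reducer before that user's activity events. Datasets, tags and field names are illustrative.

```python
from itertools import groupby

users = [("u1", "1984-05-02"), ("u2", "1991-11-23")]            # (user_id, date_of_birth)
events = [("u1", "click"), ("u2", "purchase"), ("u1", "view")]  # (user_id, event)

# "Mapper" output: key = (user_id, tag); tag 0 makes the user record sort
# before the activity events of the same user (secondary sort).
mapped = [((uid, 0), dob) for uid, dob in users] + \
         [((uid, 1), ev) for uid, ev in events]

mapped.sort(key=lambda kv: kv[0])  # the framework sorts mapper output by key

# "Reducer": called once per user ID; thanks to the secondary sort the
# first value in each group is the date of birth.
joined = []
for uid, group in groupby(mapped, key=lambda kv: kv[0][0]):
    values = [v for _, v in group]
    dob, activity = values[0], values[1:]
    joined.extend((uid, dob, ev) for ev in activity)

print(joined)  # [('u1', '1984-05-02', 'click'), ('u1', '1984-05-02', 'view'), ('u2', '1991-11-23', 'purchase')]
```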

21

Apache Hive

  • Data warehouse software built on top of Hadoop

  • Declarative SQL-like interface to analyse data stored on Hadoop HDFS, Amazon S3

    • Hive QL, very similar to SQL (SELECT, JOIN, GROUP BY, etc.)

    • Querying from Hive shell, Web UI, applications

  • Queries broken down into MapReduce jobs and executed across a Hadoop cluster (note: can also use Spark for processing)

  • Best used for batch jobs over large sets of immutable data (e.g. log processing, analytics, creation of reports)

    • High latency

    • Read-based, inappropriate for transaction processing (OLTP)

  • Originated at Facebook

    • 2013: need to analyse 500.000.000 logs per day, 75k queries

    • Used at Netflix, Amazon Elastic MapReduce, etc.
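
To give a feel for such a query, here is a hedged PySpark sketch that runs a HiveQL-style batch aggregation through Spark's Hive support (the notes mention Spark can be used as the execution engine); the page_views table and the Hive-enabled Spark setup are assumptions, not part of the material.

```python
from pyspark.sql import SparkSession

# Hive-enabled Spark session; assumes Spark is configured to reach a Hive metastore.
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())

# SQL-like batch query over an (assumed) table of web logs.
top_pages = spark.sql("""
    SELECT url, COUNT(*) AS views
    FROM page_views
    WHERE dt = '2024-01-01'
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```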

22

Apache Pig

  • Pig Latin procedural scripting language to create programs that run on Apache Hadoop

    • Data model: nested bags of items -> ('key1', 222, {('value1'),('value2')})

    • Provides relational operators JOIN, GROUP BY, etc.

  • Compiler produces sequences of MapReduce programs (or Spark)

  • Compared to Hive

    • Quite similar, Pig focusses on programming, Hive focusses on declarative SQL-like language

    • Both Pig and Hive operate on top of Hadoop MapReduce and HDFS

    • Both have an easier learning curve than MapReduce and can do things in much less code

    • Pig: focus on data engineers/programmers, Hive: focus on analysts and SQL-knowledgeable users

  • Originated @Yahoo! around 2006

23

Apache Spark

  • general (RDD?)

  • figure

  • MapReduce issue = linear dataflow structure

    • Read → Process → Write

    • Reload data from stable storage on each iteration

  • Spark aims for in-memory processing; dealing with out-of-memory situations can be an issue

  • Apache Spark → concept of Resilient Distributed Dataset (RDD)

    • Resilient: if data in memory is lost, it can be recreated

    • Distributed: stored in memory across the cluster

    • Dataset: coming from file(s) or created programmatically

  • Spark facilitates implementation of

    • Iterative algorithms (e.g. machine learning)

    • Interactive / explorative data analysis

    • Machine learning
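
A small PySpark sketch of the in-memory idea: the RDD is cached after the first pass and reused by later iterations instead of being reread from stable storage. The file path and the toy "iterations" are placeholders.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-cache-example")  # entry point to the Spark API

# RDD defined from a file (placeholder path); nothing is read yet.
numbers = (sc.textFile("hdfs:///data/numbers.txt")
             .map(lambda line: float(line))
             .cache())                  # keep the parsed data in memory

# Several passes over the same dataset: only the first hits storage;
# later iterations reuse the in-memory partitions (recomputed from
# lineage if a partition is lost).
for threshold in (0.0, 10.0, 100.0):
    print(threshold, numbers.filter(lambda x: x > threshold).count())

sc.stop()
```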

24

Apache Spark

  • pillars

  • figure

  • Computing API

    • Scala (native), Java, Python, R, SQL

  • Resource management

    • Stand-alone, YARN, Mesos and Kubernetes

  • Data Storage

    • HDFS (or Cassandra, S3, HBase, etc.)

  • Spark interaction

    • Spark Shell: interactive shell

    • SparkContext = main entry point to the Spark API

      • sc handle in Spark Shell

    • Spark Applications: for large scale data processing

25

Apache Spark: RDD

  • two types (figure)

  • elements

  • additional functionality

Two types

  • Actions – returning values

    • Count

  • Transformations – define new RDDs based on current one

    • Filter

    • Map

    • Reduce

RDDs can hold any type of element

  • Primitive types: integers, characters, bools, strings, etc.

  • Sequence types: lists, arrays, tuples, etc.

  • Scala/Java serializable objects

  • Mixed types

Some types of RDDs have additional functionality

  • Double RDDs: consisting of numeric data, offering e.g. variance, stdev methods

  • Pair RDDs: RDDs consisting of key-value pairs

    • For use with map-reduce algorithms
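
A short PySpark sketch of the distinction (local master and the tiny dataset are only for illustration): transformations merely define new RDDs and are evaluated lazily, actions trigger the computation and return values; reduceByKey shows a pair RDD.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-types-example")

words = sc.parallelize(["spark", "hadoop", "spark", "hive"])

# Transformations: define new RDDs, nothing is computed yet (lazy).
long_words = words.filter(lambda w: len(w) > 4)
pairs = words.map(lambda w: (w, 1))             # a pair RDD of key-value pairs
counts = pairs.reduceByKey(lambda a, b: a + b)  # still a transformation

# Actions: trigger execution and return values to the driver.
print(long_words.count())  # 3
print(counts.collect())    # [('spark', 2), ('hadoop', 1), ('hive', 1)] (order may vary)

sc.stop()
```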

26

Apache Spark: RDD

  • newer features

  • RDD was the first and fundamental data structure of Spark

    • Low-level

  • DataFrame API (Spark >=v1.3)

    • Allows processing structured data

    • Data organised into named columns (e.g. relational database)

    • Why DataFrame over RDD?

      • Memory savings (binary format stored)

      • Optimized query execution

  • Dataset API (Spark >= v1.6)

    • Builds further on DataFrame API

    • Provides type safety and an object-oriented programming interface

    • More high-level expressions and queries available
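
A brief PySpark DataFrame sketch of the same kind of processing expressed over named columns (the in-line data is made up; note that in Python only the DataFrame API is available, while the typed Dataset API exists for Scala/Java):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("df-example").getOrCreate()

# Structured data organised into named columns, as in a relational table.
df = spark.createDataFrame(
    [("alice", "BE", 34), ("bob", "NL", 28), ("carol", "BE", 41)],
    ["name", "country", "age"],
)

# Declarative column operations; Spark can optimize the query plan.
df.groupBy("country").avg("age").show()

spark.stop()
```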

27

Apache Zeppelin

  • Apache project allowing rapid Spark script construction and visualisation of results

    • Web-based notebook for interactive data analysis

    • Currently Zeppelin supports Scala (Spark), Python (Spark), SparkSQL, Hive and more

28

Spark vs Hadoop MapReduce (EX)

  • On-disk versus in-memory processing (Hadoop vs Spark)

    • RAM access 10-100 times faster than disk-based access

    • Memory in the cluster will need to be at least as large as the amount of data to process

    • Memory cost >> disk cost

  • Fault-tolerance

    • In Hadoop MapReduce: intermediate files can be used as checkpoints

    • In Spark: recomputing RDD for the failed partition (RDDs have lineage information)

  • Type of jobs

    • For one-pass Big Data jobs MapReduce may be the better choice

    • For iterative jobs Spark is the better choice

  • Spark offers strongly reduced coding volume compared to MapReduce

  • Spark has streaming capabilities via Spark Streaming (see streaming section)

  • Spark maturity (2014 released but initiated in 2009) < MapReduce maturity (2006) (both are considered mature)

  • Ideally used symbiotically

    • Hadoop for distributed file system and processing of large immutable datasets (e.g. Extract Transform Load)

    • Spark for real-time in-memory processing of datasets that require it