Class 1: Elastic Map Reduce


66 Terms

1
New cards

Big Data Def

Describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management technologies

2
New cards

Velocity

the speed at which data is generated and travels

3
New cards

Volume

the storage space the data requires

4
New cards

variety

heterogeneous types of files

5
New cards

Velocity Examples

Internet of things

Clickstream data

Environmental data

6
New cards

Volume Example

Standard work year: 2016 hours

YouTube Content ID system: processes 250 years of content in 24 hours

7
New cards

Variety Examples

RDBMS

XML

Log

Unstructured text files

HTML

PDF

Video

8
New cards

Traditional Computing Model

  • data stored in a central location like a SAN

  • Data copied to processors at run time

    • Large volumes bottleneck on the transfer rate

9
New cards

Hadoop Computing model

  • bring program to data

  • Replicate and distribute data when the data is stored

    • Run the program where the data resides

10
New cards

Distribution

a collection of open source Apache applications that have been tested to work together

11
New cards

Prominent providers of distributions

Cloudera

Hortonworks

Amazon

Google

Microsoft Azure

12
New cards

What is the Apache Hadoop software library?

a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models

13
New cards

Hadoop Characteristics

Inexpensive data storage

Inexpensive servers

Combines up to 1000 distributed servers for massive performance

14
New cards

Storage Trends

commodity

Cheaper and more abundant

Normalization vs Denormalization

Data schema on read vs schema on write

Data lakes

Solid state

15
New cards

Memory Trends

commodity

Cheaper more abundant

More = merrier

In memory computing benefits from massive allocations of RAM

The Hadoop NameNode needs lots of RAM, depending on the size of the cluster

16
New cards

Distributed Processing

cheaper to store massive quantities of data using big data architecture

17
New cards

Hadoop Distributed File System

HDFS is the data storage layer for a Hadoop system

 Inexpensive reliable storage for massive amounts of data

 Uses low cost industry standard hardware

 Data is replicated and distributed to multiple nodes of storage

18
New cards

HDFS: Hadoop file system

distributes data blocks across the cluster in a redundant manner

Data is lost on cluster termination

19
New cards

YARN: Yet Another Resource Negotiator

Manages cluster resources for the collection of applications

20
New cards

MapReduce

base code that handles all data processing

Maps data to key/value pairs
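The key/value mapping idea can be sketched in plain Python (a toy word count for illustration, not EMR's actual API):

```python
from collections import defaultdict

# Map phase: emit a (word, 1) key/value pair for every word in the input
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Reduce phase: sum the values for each key
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data", "big cluster"]
result = reduce_phase(map_phase(lines))
print(result)  # {'big': 2, 'data': 1, 'cluster': 1}
```

In a real cluster the map and reduce phases run on many nodes in parallel, with the framework shuffling pairs by key between them.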

21
New cards

what is the mechanism for bringing the processing to the stored data?

Map Reduce

22
New cards

In Map Reduce where is data stored?

HDFS node

23
New cards

What does Map Reduce Contain?

a master job tracker that manages task resources

24
New cards

what executes tasks in map reduce for the HDFS node

task tracker

25
New cards

EMR: Elastic Map Reduce is what kind of service?

a managed Hadoop service provided by AWS

26
New cards

AWS distribution

provides support for the most popular open source applications like Spark, Hive, HDFS, Presto, and Flink

27
New cards

EMR Cluster Architecture

Master node

Core node

Task node

28
New cards

Master Node “leader node”

  • manages the cluster

  • Tracks status of tasks

  • Monitors cluster health

    • Single EC2 instance

29
New cards

Core Node

Saves HDFS data

Used in multi node clusters

Runs tasks

Can be scaled up or down

30
New cards

Task Node

runs tasks only

Doesn't store data

Spot instances can be used

31
New cards

Transient Clusters

terminate once all steps are complete

Loading data, processing data, and storing data

Perform work and then shut down to save costs

32
New cards

Long-running

manually terminated

Functions as data warehouse with periodic processing on large data sets

Task nodes can be scaled using spot instances

Set up with termination protection on and auto-termination off

33
New cards

Using EMR

frameworks and applications are part of cluster creation

Users connect directly to the master node to run jobs

Configure the steps in a cluster

Submit ordered steps via the console

34
New cards

EMR and AWS integration

EC2 provides the EMR nodes

VPC provides a virtual network for nodes

S3 stores input and output data

Data pipeline to schedule and start clusters

IAM to configure permissions

35
New cards

EMR capabilities

EMR is charged by the hour as a service

EC2 instances incur a separate set of charges

Auto provisions core nodes when they fail

Cluster core nodes can be resized on the fly

Core nodes can be removed but risk data loss

Task nodes can be added on the fly

36
New cards

Hive

It is a tool that provides SQL querying of data stored in HDFS or Hbase

Accessed using the HiveQL language

Allows for easy ad-hoc queries

Transforms log file data into structures like tables

Consists of a schema in the Metastore and data in HDFS

37
New cards

Success of Hive

Uses familiar SQL syntax for OLAP queries

Interactive and scalable on a big data cluster

Works very well for data warehouse applications

JDBC / ODBC driver

38
New cards

Hive Approach

Designed to organize data into tables

Build table structures over HDFS files

Table schemas and metadata are stored in a database called the Metastore

Hive table definitions are local to the machine they are created on

39
New cards

Hive Metastore and Glue

Shares schema across EMR and other AWS services

Used to create data lakes

40
New cards

Schema on Read

Verifies data organization when a query is issued

Provides much faster loading as structure is not validated

Multiple schemas serving different needs for the same data

Better option when schema is not known at loading time

41
New cards

Hive query example

INSERT OVERWRITE TABLE user_active

SELECT user.*

FROM user

WHERE user.active = 1;

Overwrites the user_active table

Selects all user columns

From a table called user

For rows that have an active indication

42
New cards

How to load data into Hive?

create a table called records
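A minimal HiveQL sketch of that step (the column names and the input path are illustrative assumptions, not from the card):

```sql
-- Hypothetical weather records table
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Load a local file into the table (path is a placeholder)
LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE records;
```

Because Hive is schema on read, the load is fast: the file is moved into the table's HDFS directory without validating it against the schema.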

43
New cards

Query Against a Hive Table

 SELECT using SQL familiar commands and specifications

 MAX defines the maximum temperature for each year

 FROM defines our records table

 WHERE ensures clean data selections

 GROUP BY groups the values by year
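The clauses above correspond to a query roughly like this (against a hypothetical records table; the 9999 missing-value sentinel is an assumption):

```sql
SELECT year, MAX(temperature)
FROM records
WHERE temperature != 9999   -- filter out missing readings
GROUP BY year;
```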

44
New cards

File storage Formats

--as-format is the nomenclature

--as-textfile

--as-sequencefile

--as-parquetfile

--as-avrodatafile

45
New cards

Binary Column Formats

 Column oriented formats work best when only a few columns are used in queries/calculations

 Hive provides native support for Parquet

 STORED AS … (PARQUET) or (ETC)
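For example (table and column names are illustrative):

```sql
CREATE TABLE records_parquet (year STRING, temperature INT)
STORED AS PARQUET;
```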

46
New cards

S3DistCP Copy

A tool for copying large amounts of data from S3 to HDFS

Copies in a distributed manner using MapReduce

Provides for parallel path copying across buckets
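Run on the EMR cluster, an invocation looks roughly like this (the bucket name and paths are placeholders):

```shell
# Copy logs from S3 into HDFS in parallel (runs as a MapReduce job)
s3-dist-cp --src s3://my-bucket/logs/ --dest hdfs:///logs/
```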

47
New cards

what is the encompassing big data service in AWS

EMR

48
New cards

in EMR service what applications are in distribution

Hadoop, Spark, Hive

49
New cards

clusters

a network of computers that work together

50
New cards

core nodes

the cluster nodes that store Hadoop data files

51
New cards

EMR release

the distribution of applications tested together

52
New cards

Configure Cluster Nodes

Primary node configuration is for the master node

Core node configuration is for storage and processing

Task node configuration is for processing

53
New cards

Cluster Configuration

master node

Core/task nodes

Spot instances

54
New cards

Master node

m5.xlarge is standard configuration

55
New cards

core/task nodes cluster config

m5.xlarge is a good choice

 External dependencies use t2.medium

 Improved performance with m4.xlarge

56
New cards

Spot instances cluster Configuration

Good choice for task nodes as they can scale

Avoid using on master and core nodes as it may cause data loss

57
New cards

VPC: Virtual Private Cloud

created as a protected network for the cluster

58
New cards

S3 Bucket

created to capture the processing logs for the cluster

Error log files are captured in the same bucket

59
New cards

IAM policies

grant or deny permissions to control cluster access

60
New cards

IAM roles

control access to EMRFS data based on user

61
New cards

IAM policies are attached to what?

IAM roles

62
New cards

SSH

provides a secure connection to the command line interface
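A typical connection to the master node looks like this (the key file and public DNS name are placeholders; hadoop is the default EMR login user):

```shell
ssh -i ~/MyKeyPair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```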

63
New cards

Kerberos

provides secure user authentication

64
New cards

Block Public Access

setting prevents public access to data stored on your EMR cluster

65
New cards

Apache Spark

fast and general engine for large scale data processing

In-Memory caching

Optimized query execution

Spark SQL Queries

Machine Learning MLlib

Spark Streaming
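A minimal PySpark sketch touching in-memory caching and Spark SQL (assumes a Spark runtime is available; the table name and data are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Build a small DataFrame and cache it in memory
df = spark.createDataFrame([(2015, 30), (2015, 35), (2016, 28)],
                           ["year", "temperature"])
df.cache()

# Register it as a view and run a Spark SQL query with optimized execution
df.createOrReplaceTempView("records")
spark.sql("SELECT year, MAX(temperature) FROM records GROUP BY year").show()
```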

66
New cards

EMR notebook

AWS Notebook (Jupyter)

Backed up to S3 data storage

Provision clusters from the notebook

Accessed via AWS Console

Hosted inside a VPC