Big Data Def
Describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management technologies
Velocity
The speed at which data arrives and must be processed
Volume
The amount of storage space the data requires
Variety
Heterogeneous types and formats of data
Velocity Examples
Internet of things
Clickstream data
Environmental data
Volume Example
Standard work year: 2016 hours
YouTube Content ID system: processes 250 years of content in 24 hours
Variety Examples
RDBMS
XML
Log
Unstructured text files
HTML
Video
Traditional Computing Model
Data stored in a central location like a SAN
Data copied to processors at run time
Large volumes bottleneck on the transfer rate
Hadoop Computing model
Bring the program to the data
Replicate and distribute the data as it is stored
Run the program where the data resides
Distribution
A collection of open source Apache applications that have been tested to work together
Prominent providers of distributions
Cloudera
Hortonworks
Amazon
Microsoft Azure
What is the Apache Hadoop software library?
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Hadoop Characteristics
Inexpensive data storage
Inexpensive servers
Combines up to 1000 distributed servers for massive performance
Storage Trends
commodity
Cheaper and more abundant
Normalization vs Denormalization
Data schema on read vs schema on write
Data lakes
Solid state
Memory Trends
Commodity
Cheaper and more abundant
More = merrier
In-memory computing benefits from massive allocations of RAM
The Hadoop NameNode needs lots of RAM, depending on the size of the cluster
Distributed Processing
It is cheaper to store and process massive quantities of data using a big data architecture
Hadoop Distributed File System
HDFS is the data storage layer for a Hadoop system
Inexpensive reliable storage for massive amounts of data
Uses low-cost, industry-standard hardware
Data is replicated and distributed across multiple storage nodes
HDFS: Hadoop Distributed File System
Distributes data blocks across the cluster in a redundant manner
Data is lost on cluster termination
YARN: Yet Another Resource Negotiator
Manages cluster resources for the collection of applications
MapReduce
Base code that handles all data processing
Maps data to key/value pairs, then reduces (aggregates) the values per key (see the word-count sketch below)
What is the mechanism for bringing the processing to the stored data?
MapReduce
In MapReduce, where is the data stored?
On HDFS nodes
What does MapReduce contain?
A master JobTracker that manages task resources
What executes tasks in MapReduce on each HDFS node?
The TaskTracker
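For intuition, the classic word count: the map phase emits (word, 1) pairs and the reduce phase sums them per key. Hive (covered later) compiles a query like this sketch into exactly such a MapReduce job; the docs table and line column are hypothetical.
SELECT word, COUNT(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
-- map: split each line into words (the keys); reduce: count rows per word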
EMR: Elastic MapReduce is what?
A managed Hadoop service provided by AWS
AWS distribution
Provides support for the most popular open source applications like Spark, Hive, HDFS, Presto, and Flink
EMR Cluster Architecture
Master node
Core node
Task node
Master Node “leader node”
Manages the cluster
Tracks status of tasks
Monitors cluster health
Single EC2 instance
Core Node
Saves HDFS data
Used in multi-node clusters
Runs tasks
Can be scaled up or down
Task Node
Runs tasks only
Doesn't store data
Spot instances can be used
Transient Clusters
Terminate once all steps are complete
Loading data, processing data, and storing the results
Perform work and then shut down to save costs
Long-running
Manually terminated
Functions as a data warehouse with periodic processing of large data sets
Task nodes can be scaled using Spot Instances
Set up with termination protection on and auto-termination off
Using EMR
Frameworks and applications are specified as part of cluster creation
Users connect directly to the master node to run jobs
Configure the steps in a cluster
Submit ordered steps via the console
EMR and AWS integration
EC2 provides the EMR nodes
VPC provides a virtual network for nodes
S3 stores input and output data
Data Pipeline to schedule and start clusters
IAM to configure permissions
EMR capabilities
EMR is charged by the hour as a service
EC2 instances incur a separate set of charges
Auto-provisions new core nodes if they fail
Cluster core nodes can be resized on the fly
Core nodes can be removed, but risk data loss
Task nodes can be added on the fly
Hive
It is a tool that provides SQL querying of data stored in HDFS or HBase
Accessed using the HiveQL language
Allows for easy ad-hoc queries
Transforms log file data into structures like tables
Consists of a schema in the Metastore and data in HDFS
Success of Hive
Uses familiar SQL syntax for OLAP queries
Interactive and scalable on a big data cluster
Works very well for data warehouse applications
Accessible via JDBC/ODBC drivers
Hive Approach
Designed to organize data into tables
Build table structures over HDFS files
Table schemas and metadata are stored in a database called the Metastore
Hive table definitions are local to the machine they are created on
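To inspect a table's Metastore schema and HDFS data location, a minimal sketch; the table name is hypothetical.
DESCRIBE FORMATTED records;
-- prints the column schema held in the Metastore
-- plus a Location: row pointing at the HDFS (or S3) directory that holds the data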
Hive Metastore and Glue
Shares schemas across EMR and other AWS services
Used to create data lakes
Schema on Read
Verifies data organization when a query is issued
Provides much faster loading, as structure is not validated at load time
Multiple schemas serving different needs for the same data
Better option when schema is not known at loading time
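A minimal schema-on-read sketch; the table, columns, and S3 path are hypothetical.
CREATE EXTERNAL TABLE clicks (
  user_id STRING,
  url STRING,
  ts STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/clickstream/';
-- nothing is copied or validated here; the schema is applied only when a query reads the files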
Hive query example
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;
Overwrites the existing table user_active
Selecting all user columns
From a table called user
For rows where the active flag is set
How to load data into Hive?
Create a table, here called records, and load data files into it
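A minimal sketch, following the classic weather-records example; the column names and input path are assumptions.
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'input/sample.txt'
OVERWRITE INTO TABLE records;
-- LOAD DATA only moves files into Hive's warehouse directory; no validation happens at load time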
Query Against a Hive Table
SELECT using familiar SQL commands and specifications
MAX finds the maximum temperature for each year
FROM names our records table
WHERE ensures clean data selections
GROUP BY groups the values by year
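Put together as a sketch; the sentinel value and quality codes are assumptions.
SELECT year, MAX(temperature) AS max_temp
FROM records
WHERE temperature != 9999
  AND quality IN (0, 1, 4, 5, 9)
GROUP BY year;
-- returns one (year, max_temp) row per distinct year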
File Storage Formats
--as-<format> is the nomenclature
--as-textfile
--as-sequencefile
--as-parquetfile
--as-avrodatafile
Binary Column Formats
Column-oriented formats work best when only a few columns are used in queries/calculations
Hive provides native support for Parquet
STORED AS PARQUET (or another supported format)
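A minimal sketch; the table names are hypothetical.
CREATE TABLE records_parquet (year STRING, temperature INT, quality INT)
STORED AS PARQUET;

INSERT OVERWRITE TABLE records_parquet
SELECT * FROM records;
-- rewrites the row-oriented text data in columnar Parquet format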
S3DistCp Copy
A tool for copying large amounts of data from S3 to HDFS
Copies in a distributed manner using MapReduce
Provides for parallel path copying across buckets
What is the encompassing big data service in AWS?
EMR
In the EMR service, what applications are in the distribution?
Hadoop, Spark, Hive
Clusters
A network of computers working together
Core nodes
The cluster nodes that store Hadoop data files
EMR release
The distribution of applications tested together
Configure Cluster Nodes
Primary node configuration is for the master node
Core node configuration is for storage and processing
Task node configuration is for processing
Cluster Configuration
master node
Core/task nodes
Spot instances
Master node
m5.xlarge is the standard configuration
Core/task nodes cluster configuration
m5.xlarge is a good choice
Use t2.medium when the cluster waits on external dependencies
Improved performance with m4.xlarge
Spot Instances cluster configuration
Good choice for task nodes as they can scale
Avoid using on master and core nodes as it may cause data loss
VPC: Virtual Private Cloud
Created as a protected network for the cluster
S3 Bucket
Created to capture the processing logs for the cluster
Error log files are captured in the same bucket
IAM policies
Grant or deny permissions to control cluster access
IAM roles
Control access to EMRFS data based on the user
IAM policies are attached to what?
IAM roles
SSH
Provides a secure connection to the command-line interface
Kerberos
Provides secure user authentication
Block Public Access
A setting that prevents public access to data stored on your EMR cluster
Apache Spark
A fast and general engine for large-scale data processing
In-memory caching
Optimized query execution
Spark SQL queries
Machine learning (MLlib)
Spark Streaming
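Spark SQL exposes the same familiar SQL surface over in-memory DataFrames; a minimal sketch, with a hypothetical events table.
SELECT user_id, COUNT(*) AS clicks
FROM events
GROUP BY user_id
ORDER BY clicks DESC
LIMIT 10;
-- planned by Spark's optimizer and executed with in-memory caching available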
EMR Notebook
AWS Notebook (Jupyter)
Backed up to S3 data storage
Provision clusters from the notebook
Accessed via AWS Console
Hosted inside a VPC