Big Data Def
Describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management technologies
Velocity
The speed at which data arrives and must be processed
Volume
The amount of storage space the data requires
Variety
Heterogeneous types and formats of data
Velocity Examples
Internet of things
Clickstream data
Environmental data
Volume Example
Standard work year: 2016 hours
YouTube Content ID system: processes 250 years of content in 24 hours
Variety Examples
RDBMS
XML
Log
Unstructured text files
HTML
Video
Traditional Computing Model
Data stored in a central location like a SAN
Data copied to processors at run time
Large volumes bottleneck on the transfer rate
Hadoop Computing model
Bring the program to the data
Replicate and distribute the data as it is stored
Run the program where the data resides
Distribution
A collection of open source Apache applications that have been tested to work together
Prominent providers of distributions
Cloudera
Hortonworks
Amazon
Microsoft Azure
What is the Apache Hadoop software library?
A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Hadoop Characteristics
Inexpensive data storage
Inexpensive servers
Combines up to 1000 distributed servers for massive performance
Storage Trends
commodity
Cheaper and more abundant
Normalization vs Denormalization
Data schema on read vs schema on write
Data lakes
Solid state
Memory Trends
Commodity
Cheaper and more abundant
More = merrier
In-memory computing benefits from massive allocations of RAM
The Hadoop NameNode needs lots of RAM, depending on the size of the cluster
Distributed Processing
It is cheaper to store and process massive quantities of data using a big data architecture
Hadoop Distributed File System
HDFS is the data storage layer for a Hadoop system
Inexpensive reliable storage for massive amounts of data
Uses low-cost, industry-standard hardware
Data is replicated and distributed across multiple storage nodes
HDFS: Hadoop Distributed File System
Distributes data blocks across the cluster in a redundant manner
Data is lost on cluster termination
YARN: Yet Another Resource Negotiator
Manages cluster resources for the collection of applications
MapReduce
Base code that handles all data processing
Maps data to key/value pairs, then reduces (aggregates) the values per key (see the word-count sketch below)
What is the mechanism for bringing the processing to the stored data?
MapReduce
In MapReduce, where is the data stored?
On HDFS nodes
What does MapReduce contain?
A master JobTracker that manages task resources
What executes tasks in MapReduce on each HDFS node?
The TaskTracker
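For intuition, the classic word count: the map phase emits (word, 1) pairs and the reduce phase sums them per key. Hive (covered later) compiles a query like this sketch into exactly such a MapReduce job; the docs table and line column are hypothetical.
SELECT word, COUNT(1) AS cnt
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word;
-- map: split each line into words (the keys); reduce: count rows per word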
EMR: Elastic MapReduce is what?
A managed Hadoop service provided by AWS
AWS distribution
Provides support for the most popular open source applications like Spark, Hive, HDFS, Presto, and Flink
EMR Cluster Architecture
Master node
Core node
Task node
Master Node “leader node”
Manages the cluster
Tracks status of tasks
Monitors cluster health
Single EC2 instance
Core Node
Saves HDFS data
Used in multi-node clusters
Runs tasks
Can be scaled up or down
Task Node
Runs tasks only
Doesn't store data
Spot instances can be used
Transient Clusters
Terminate once all steps are complete
Loading data, processing data, and storing the results
Perform work and then shut down to save costs
Long-running
Manually terminated
Functions as a data warehouse with periodic processing of large data sets
Task nodes can be scaled using Spot Instances
Set up with termination protection on and auto-termination off
Using EMR
Frameworks and applications are specified as part of cluster creation
Users connect directly to the master node to run jobs
Configure the steps in a cluster
Submit ordered steps via the console
EMR and AWS integration
EC2 provides the EMR nodes
VPC provides a virtual network for nodes
S3 stores input and output data
Data Pipeline to schedule and start clusters
IAM to configure permissions
EMR capabilities
EMR is charged by the hour as a service
EC2 instances incur a separate set of charges
Auto-provisions new core nodes if they fail
Cluster core nodes can be resized on the fly
Core nodes can be removed, but risk data loss
Task nodes can be added on the fly
Hive
It is a tool that provides SQL querying of data stored in HDFS or HBase
Accessed using the HiveQL language
Allows for easy ad-hoc queries
Transforms log file data into structures like tables
Consists of a schema in the Metastore and data in HDFS
Success of Hive
Uses familiar SQL syntax for OLAP queries
Interactive and scalable on a big data cluster
Works very well for data warehouse applications
Accessible via JDBC/ODBC drivers
Hive Approach
Designed to organize data into tables
Build table structures over HDFS files
Table schemas and metadata are stored in a database called the Metastore
Hive table definitions are local to the machine they are created on
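To inspect a table's Metastore schema and HDFS data location, a minimal sketch; the table name is hypothetical.
DESCRIBE FORMATTED records;
-- prints the column schema held in the Metastore
-- plus a Location: row pointing at the HDFS (or S3) directory that holds the data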
Hive Metastore and Glue
Shares schemas across EMR and other AWS services
Used to create data lakes
Schema on Read
Verifies data organization when a query is issued
Provides much faster loading, as structure is not validated at load time
Multiple schemas serving different needs for the same data
Better option when schema is not known at loading time
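A minimal schema-on-read sketch; the table, columns, and S3 path are hypothetical.
CREATE EXTERNAL TABLE clicks (
  user_id STRING,
  url STRING,
  ts STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-bucket/clickstream/';
-- nothing is copied or validated here; the schema is applied only when a query reads the files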
Hive query example
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;
Overwrites the existing table user_active
Selecting all user columns
From a table called user
For rows where the active flag is set
How to load data into Hive?
Create a table, here called records, and load data files into it
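A minimal sketch, following the classic weather-records example; the column names and input path are assumptions.
CREATE TABLE records (year STRING, temperature INT, quality INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH 'input/sample.txt'
OVERWRITE INTO TABLE records;
-- LOAD DATA only moves files into Hive's warehouse directory; no validation happens at load time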
Query Against a Hive Table
SELECT using familiar SQL commands and specifications
MAX finds the maximum temperature for each year
FROM names our records table
WHERE ensures clean data selections
GROUP BY groups the values by year
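Put together as a sketch; the sentinel value and quality codes are assumptions.
SELECT year, MAX(temperature) AS max_temp
FROM records
WHERE temperature != 9999
  AND quality IN (0, 1, 4, 5, 9)
GROUP BY year;
-- returns one (year, max_temp) row per distinct year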
File Storage Formats
--as-<format> is the nomenclature
--as-textfile
--as-sequencefile
--as-parquetfile
--as-avrodatafile
Binary Column Formats
Column-oriented formats work best when only a few columns are used in queries/calculations
Hive provides native support for Parquet
STORED AS PARQUET (or another supported format)
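A minimal sketch; the table names are hypothetical.
CREATE TABLE records_parquet (year STRING, temperature INT, quality INT)
STORED AS PARQUET;

INSERT OVERWRITE TABLE records_parquet
SELECT * FROM records;
-- rewrites the row-oriented text data in columnar Parquet format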
S3DistCp Copy
A tool for copying large amounts of data from S3 to HDFS
Copies in a distributed manner using MapReduce
Provides for parallel path copying across buckets
What is the encompassing big data service in AWS?
EMR
In the EMR service, what applications are in the distribution?
Hadoop, Spark, Hive
Clusters
A network of computers working together
Core nodes
The cluster nodes that store Hadoop data files
EMR release
The distribution of applications tested together
Configure Cluster Nodes
Primary node configuration is for the master node
Core node configuration is for storage and processing
Task node configuration is for processing
Cluster Configuration
master node
Core/task nodes
Spot instances
Master node
m5.xlarge is the standard configuration
Core/task nodes cluster configuration
m5.xlarge is a good choice
Use t2.medium when the cluster waits on external dependencies
Improved performance with m4.xlarge
Spot Instances cluster configuration
Good choice for task nodes as they can scale
Avoid using on master and core nodes as it may cause data loss
VPC: Virtual Private Cloud
Created as a protected network for the cluster
S3 Bucket
Created to capture the processing logs for the cluster
Error log files are captured in the same bucket
IAM policies
Grant or deny permissions to control cluster access
IAM roles
Control access to EMRFS data based on the user
IAM policies are attached to what?
IAM roles
SSH
Provides a secure connection to the command-line interface
Kerberos
Provides secure user authentication
Block Public Access
A setting that prevents public access to data stored on your EMR cluster
Apache Spark
A fast and general engine for large-scale data processing
In-memory caching
Optimized query execution
Spark SQL queries
Machine learning (MLlib)
Spark Streaming
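Spark SQL exposes the same familiar SQL surface over in-memory DataFrames; a minimal sketch, with a hypothetical events table.
SELECT user_id, COUNT(*) AS clicks
FROM events
GROUP BY user_id
ORDER BY clicks DESC
LIMIT 10;
-- planned by Spark's optimizer and executed with in-memory caching available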
EMR Notebook
AWS Notebook (Jupyter)
Backed up to S3 data storage
Provision clusters from the notebook
Accessed via AWS Console
Hosted inside a VPC