collecting and storing streaming data in real time (e.g. from IoT devices), with producers (such as the Kinesis Agent) and consumers (applications, Lambda, Firehose, Flink)
Kinesis Features
up to 1 year of retention with re-processing (replay) by consumers; data inserted into Kinesis cannot be deleted (it only expires); records up to 1 MB; ordering guarantee per partition key; at-rest and in-flight encryption.
Kinesis Producer Library
library for writing optimized, high-throughput producer applications
Kinesis Client Library
library for writing optimized consumer applications (handles shard coordination and checkpointing)
Kinesis Streams (Provisioned)
choose the number of shards (the capacity of the stream) when creating the stream; the more shards, the more throughput for inbound messages (1 MB/s IN and 2 MB/s OUT per shard); pay per shard provisioned per hour.
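As a rough illustration of the provisioned model, the sketch below uses boto3 to write one record into a hypothetical stream named sensor-stream; the partition key decides which shard (and so which slice of the per-shard IN capacity) receives the record.

```python
import json
import boto3

# Hypothetical stream name; the stream is assumed to already exist
# with some number of provisioned shards.
kinesis = boto3.client("kinesis")

record = {"device_id": "sensor-42", "temperature": 21.7}

# Records with the same partition key always land on the same shard,
# which is what gives Kinesis its per-key ordering guarantee.
kinesis.put_record(
    StreamName="sensor-stream",
    Data=json.dumps(record).encode("utf-8"),  # payload must stay under 1 MB
    PartitionKey=record["device_id"],
)
```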
Kinesis Streams (On-Demand)
no need to provision or manage stream capacity; default capacity of 4 MB/s IN (or 4,000 records/s) that scales automatically based on the observed peak of the last 30 days; pay per stream per hour plus data in/out per GB
Amazon Data Firehose
stores data into target destinations, batch-writing to the target destination (S3, Redshift via COPY from S3, OpenSearch, Splunk, 3rd-party partner destinations, and custom destinations). Near-real-time service due to the buffer time and size; pay for the amount of data going through Firehose.
Kinesis Data Transformations
data transformations can be performed with Lambda + Firehose, with support for compression when the target is S3 (only GZIP if the data is then loaded into Redshift)
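A minimal sketch of what such a transformation Lambda can look like, assuming the standard Firehose record format (base64-encoded data, a recordId, and a required result of Ok/Dropped/ProcessingFailed); the field names inside the payload are hypothetical.

```python
import base64
import json

def handler(event, context):
    """Firehose invokes this with a batch of records; each record must be
    returned with the same recordId, a result, and re-encoded data."""
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        # Hypothetical transformation: drop a noisy field and add a flag.
        payload.pop("debug", None)
        payload["transformed"] = True
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```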
Firehose + Spark
cannot be done; Spark and the KCL do not read from Firehose, they only read from Kinesis Data Streams
Transformation Failures
transformation or delivery failures (e.g. failed type conversion or compression in the Firehose/Lambda step) can be backed up to an S3 bucket as a built-in feature of Firehose
Firehose Size Rule
size limit on the Firehose buffer; once it is reached, the buffer is flushed and the data is delivered (the rule that typically triggers for high-throughput applications)
Firehose Time Rule
time limit on the Firehose buffer; once it elapses, the buffer is flushed even if it is not full (the rule that typically triggers for low-throughput applications)
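The two rules are configured together as buffering hints on the delivery stream. A sketch with boto3, assuming a hypothetical stream name, bucket, and IAM role; whichever limit is hit first flushes the buffer.

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical names: the bucket and IAM role are assumed to already exist.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-clickstream-bucket",
        # Size rule: flush once 5 MB have accumulated (high throughput).
        # Time rule: flush after 300 seconds even if not full (low throughput).
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",  # compression supported for the S3 target
    },
)
```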
Amazon Managed Service for Apache Flink
managed service for running the Flink framework to process data streams in real time (it can read from Kinesis Data Streams or MSK); AWS provisions the application on a managed cluster; Flink cannot read from Firehose.
Amazon Managed Streaming for Apache Kafka (MSK)
alternative to Kinesis; a fully managed Kafka cluster on AWS where you can create, update, and delete clusters on the fly. MSK creates and manages broker nodes and ZooKeeper nodes with automatic recovery, can be multi-AZ, and stores data on EBS volumes.
MSK Cluster
made up of broker nodes
MSK Producers
ingest data from sources and send it to a topic, which is replicated across the brokers
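Producers talk to MSK with standard Kafka clients; a sketch using the kafka-python library, assuming a hypothetical topic and a TLS bootstrap broker string copied from the MSK console.

```python
from kafka import KafkaProducer  # standard Kafka client, not AWS-specific

# Placeholder bootstrap broker taken from the MSK console.
producer = KafkaProducer(
    bootstrap_servers="b-1.mycluster.xxxx.kafka.us-east-1.amazonaws.com:9094",
    security_protocol="SSL",  # MSK TLS listener
)

# The message is appended to one partition of the topic and then replicated
# to the other brokers according to the topic's replication factor.
producer.send("iot-events", key=b"sensor-42", value=b'{"temperature": 21.7}')
producer.flush()
```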
MSK Consumers
read and process data from topics into multiple destinations: Flink, Glue, Lambda, or applications running on EC2/EKS/ECS
MSK vs Kinesis Data Streams
MSK can support messages larger than 1 MB (configurable), and uses topics with partitions instead of streams with shards.
AWS Batch
runs batch jobs as Docker images, on Fargate or on dynamically provisioned EC2 instances, with the optimal quantity and type of instances chosen based on job volume and requirements. You pay only for the underlying resources used, and jobs can be orchestrated with Step Functions or EventBridge.
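A sketch of submitting a job with boto3, assuming a job queue and a job definition (pointing at a Docker image) were created beforehand; all names are hypothetical.

```python
import boto3

batch = boto3.client("batch")

# The queue and job definition are assumed to exist; the job definition
# references the Docker image that actually does the work.
response = batch.submit_job(
    jobName="nightly-report-2024-01-01",
    jobQueue="reporting-queue",
    jobDefinition="reporting-job:3",
    containerOverrides={
        "command": ["python", "build_report.py", "--date", "2024-01-01"],
        "environment": [{"name": "REPORT_BUCKET", "value": "my-reports-bucket"}],
    },
)
print(response["jobId"])  # can be tracked or chained from Step Functions
```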
Managed Batch Compute Environment
managed capacity of the underlying instances, which can be On-Demand or Spot Instances (with a configurable maximum Spot price), within a set minimum and maximum vCPU; instances in a VPC must have a NAT Gateway/Instance to reach AWS services.
Batch Job Queue
will distribute jobs to managed batch compute environments.
Batch Multi-Node Mode
large-scale batch mode suited to HPC: a single job runs across multiple EC2/ECS instances at the same time (one main node and many child nodes), good for tightly coupled workloads; does not work with Spot Instances.
Elastic MapReduce (EMR)
creates Hadoop clusters (for big data) to analyze and process vast amounts of data; clusters are made of hundreds of EC2 instances that can be shut down once processing is completed. Supports Spark, HBase, Presto, Flink. EMR takes care of provisioning and auto scaling (with CloudWatch).
EMR Master Node
manages the cluster: coordinates and monitors the other nodes; a long-running node
EMR Core Node
long-running node that runs tasks and stores data
EMR Task Nodes
optional nodes that only run tasks (no data storage), intended for short-running workloads.
On-Demand + EMR
reliable workloads that won't be terminated (use for core nodes)
Reserved + EMR
minimum 1-year commitment in exchange for cost savings; EMR will automatically use Reserved Instances if available; good for master and core nodes
Spot Instance + EMR
cheap, less reliable instances that should only be used for task nodes.
Uniform Instance Groups
a single instance type and purchasing option is selected for each node group (with support for auto scaling)
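A sketch of launching a cluster with uniform instance groups via boto3, using On-Demand capacity for the master and core groups and Spot capacity for the task group; the release label, roles, and instance types are assumptions.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical release label, roles, and instance types.
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes only run tasks, so cheaper Spot capacity is acceptable.
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down once processing is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```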
Instance Fleet
you set a target capacity and can mix instance types and purchasing options, with no auto-scaling capability
AWS Glue
managed ETL (extract, transform, and load) service, useful to prepare and transform data for analytics.
AWS Glue Data Catalog
catalog of datasets: a Glue crawler extracts metadata from databases (or S3) and writes it into the Glue Data Catalog; once the metadata is persisted, other services can use it for data discovery.
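A sketch of pointing a crawler at an S3 prefix with boto3 so that its inferred schema lands in the Data Catalog; the bucket, role, and database names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical IAM role with read access to the bucket, and a catalog database
# that will receive the inferred table metadata.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-sales-bucket/raw/"}]},
)

# The crawler scans the files, infers the schema, and writes table metadata
# into the Glue Data Catalog for other services to discover.
glue.start_crawler(Name="sales-data-crawler")
```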
Redshift
data warehousing solution based on PostgreSQL; not used for OLTP, better for OLAP (online analytical processing). Around 10x the performance of other data warehouses, and scales to PBs of data.
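An OLAP-style query can be issued without managing a connection by using the Redshift Data API; a sketch with boto3, where the cluster, database, user, and table names are hypothetical.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Submit an analytical (OLAP) query to a hypothetical provisioned cluster.
statement = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="warehouse",
    DbUser="analyst",
    Sql="SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region",
)

# The Data API is asynchronous: check the statement status, then fetch rows.
status = redshift_data.describe_statement(Id=statement["Id"])
if status["Status"] == "FINISHED":
    rows = redshift_data.get_statement_result(Id=statement["Id"])["Records"]
```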
Redshift Features
columnar storage of data (instead of rows) combined with massively parallel query execution, spreading a query across many nodes
Provisioned Cluster Node
servers you provision yourself and pay for while they run; suited to sustained workloads
Serverless Cluster Nodes
Redshift manages the underlying infrastructure for you; there is no cluster to provision
Redshift Leader Node
used for query planning and result aggregation
Redshift Nodes
clusters of up to 128 compute nodes that can each hold up to 16 TB of data; Multi-AZ deployment is available for some cluster types
Redshift Compute Node
performs queries and sends the results back to the leader node
Redshift Limitations
enhanced VPC routing (COPY/UNLOAD traffic goes through the VPC), backup and restore functionality; it is a provisioned service, so it is better suited to sustained usage.
Redshift Snapshots
point-in-time incremental backups stored in S3 that can be restored into a new cluster; can be configured to automatically copy snapshots to another region
Cross Region Snapshot Recovery for Redshift
using a snapshot copy grant, Redshift can perform encryption operations in the designated destination region and copy snapshots there (required for KMS-encrypted Redshift databases)
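A sketch of wiring this up with boto3: the grant is created in the destination region, then cross-region copy is enabled on the source cluster; the cluster name, key, and regions are hypothetical.

```python
import boto3

# The snapshot copy grant lives in the DESTINATION region and names the KMS
# key Redshift may use there for encryption operations.
redshift_west = boto3.client("redshift", region_name="us-west-2")
redshift_west.create_snapshot_copy_grant(
    SnapshotCopyGrantName="analytics-copy-grant",
    KmsKeyId="arn:aws:kms:us-west-2:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)

# Cross-region copy is then enabled on the cluster in the SOURCE region.
redshift_east = boto3.client("redshift", region_name="us-east-1")
redshift_east.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep the copied snapshots
    SnapshotCopyGrantName="analytics-copy-grant",
)
```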
Redshift Spectrum
query data that is already in S3 without loading it into Redshift; a cluster must be available to start the query, which is then submitted to thousands of Spectrum nodes.
Workload Management
prevents short queries from getting stuck behind long-running queries by defining query queues and routing queries to the appropriate queue at runtime; can be automatic or manual
Redshift Concurrency Scaling
consistently fast performance with virtually unlimited concurrent users and queries; automatically adds additional cluster capacity to process an increase in requests, charged per second.
DocumentDB
managed MongoDB-compatible service used to store, query, and index JSON data. Fully managed and highly available with replication across 3 AZs; storage grows in 10 GB increments and the service automatically scales to millions of requests per second. Pay for usage.
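Because DocumentDB is MongoDB-compatible, standard MongoDB drivers work against it; a sketch with pymongo, where the cluster endpoint, credentials, and collection names are placeholders (the Amazon CA bundle must be downloaded separately for TLS).

```python
from pymongo import MongoClient

# Placeholder endpoint and credentials; global-bundle.pem is the Amazon-provided
# CA bundle required for TLS connections to the cluster.
client = MongoClient(
    "mongodb://appuser:secret@my-docdb-cluster.cluster-xxxxxxxx.us-east-1"
    ".docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem"
    "&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
)

orders = client["shop"]["orders"]        # hypothetical database/collection
orders.insert_one({"order_id": 1, "total": 42.5, "items": ["book", "pen"]})
print(orders.find_one({"order_id": 1}))  # query the JSON document back
```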
On-Demand DocumentDB
pay per second, minimum of 10 minutes, with primary and replica instances.
DocumentDB Backups
stored in S3 with per GB/month pricing
TimeStream
fully managed, scalable, serverless time series database that automatically scales up/down to adjust capacity. Stores and analyzes trillions of events per day, with cost savings and performance gains compared to other database offerings. Encryption in transit and at rest. Used for IoT applications, operational applications, and real-time applications.
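A sketch of ingesting one measurement with boto3, assuming a Timestream database and table already exist; the dimension and measure names are hypothetical.

```python
import time
import boto3

timestream = boto3.client("timestream-write")

# One record: dimensions identify the series (the device), the measure is the
# value, and Time is the event timestamp in milliseconds.
timestream.write_records(
    DatabaseName="iot_db",
    TableName="temperatures",
    Records=[{
        "Dimensions": [{"Name": "device_id", "Value": "sensor-42"}],
        "MeasureName": "temperature",
        "MeasureValue": "21.7",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),
        "TimeUnit": "MILLISECONDS",
    }],
)
```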
Athena
serverless query service to analyze data in S3, using SQL to query files, with support for CSV, JSON, ORC, Parquet, and other file formats. Priced at ~$5 per TB of data scanned.
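Athena queries are submitted asynchronously; a sketch with boto3, where the Glue/Athena database and the S3 results bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Start the query; Athena always writes results to an S3 output location.
query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "web_analytics"},        # hypothetical DB
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Check the query state, then read the result set once it has succeeded.
execution_id = query["QueryExecutionId"]
state = athena.get_query_execution(QueryExecutionId=execution_id)
if state["QueryExecution"]["Status"]["State"] == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
```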
Athena Performance Improvements
use columnar data formats (Parquet or ORC) for cost savings, compress data for smaller scans, partition datasets in S3 (by encoding partition values in the folder names), and prefer larger files (> 128 MB)
Athena Federated Queries
allows SQL queries across data stored in relational, non-relational, object, and custom data sources, using data connectors that run on Lambda; results are stored in S3.
QuickSight
serverless ML-based BI service to create interactive dashboards, with per-session pricing. Can integrate with RDS, Aurora, Athena, Redshift, S3, OpenSearch, Timestream, Salesforce, Jira, 3rd-party databases, and imported (static) data.
QuickSight SPICE Engine
in-memory computation engine used for data imported into QuickSight
QuickSight Enterprise
adds column-level security (CLS) to QuickSight
QuickSight Dashboards
users and groups are managed in QuickSight (not IAM) and get read-only access to a dashboard with its filtering, parameters, controls, and sorting preserved. Dashboards must be published to be seen, and users who can see a dashboard can also see the underlying data used to create it