collecting and storing streaming data in real time (e.g. from IoT devices), with producers (such as the Kinesis Agent) and consumers (applications, Lambda, Firehose, Flink)
Kinesis Features
up to 1 year of retention with re-processing (replay) by consumers; data inserted into Kinesis cannot be deleted (it only expires); records up to 1 MB; ordering guarantee per partition key; at-rest and in-flight encryption.
Kinesis Producer Library
library for writing optimized, high-throughput producer applications
Kinesis Client Library
library for writing optimized consumer applications (handles shard coordination and checkpointing)
Kinesis Streams (Provisioned)
choose the number of shards (the capacity of the stream) when creating the stream; the more shards, the more throughput for inbound messages (1 MB/s IN and 2 MB/s OUT per shard); pay per shard provisioned per hour.
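As a rough illustration of the provisioned model, the sketch below uses boto3 to write one record into a hypothetical stream named sensor-stream; the partition key decides which shard (and so which slice of the per-shard IN capacity) receives the record.

```python
import json
import boto3

# Hypothetical stream name; the stream is assumed to already exist
# with some number of provisioned shards.
kinesis = boto3.client("kinesis")

record = {"device_id": "sensor-42", "temperature": 21.7}

# Records with the same partition key always land on the same shard,
# which is what gives Kinesis its per-key ordering guarantee.
kinesis.put_record(
    StreamName="sensor-stream",
    Data=json.dumps(record).encode("utf-8"),  # payload must stay under 1 MB
    PartitionKey=record["device_id"],
)
```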
Kinesis Streams (On-Demand)
no need to provision or manage stream capacity; default capacity of 4 MB/s IN (or 4,000 records/s) that scales automatically based on the observed peak of the last 30 days; pay per stream per hour plus data in/out per GB
Amazon Data Firehose
stores data into target destinations, batch-writing to the target destination (S3, Redshift via COPY from S3, OpenSearch, Splunk, 3rd-party partner destinations, and custom destinations). Near-real-time service due to the buffer time and size; pay for the amount of data going through Firehose.
Kinesis Data Transformations
data transformations can be performed with Lambda + Firehose, with support for compression when the target is S3 (only GZIP if the data is then loaded into Redshift)
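A minimal sketch of what such a transformation Lambda can look like, assuming the standard Firehose record format (base64-encoded data, a recordId, and a required result of Ok/Dropped/ProcessingFailed); the field names inside the payload are hypothetical.

```python
import base64
import json

def handler(event, context):
    """Firehose invokes this with a batch of records; each record must be
    returned with the same recordId, a result, and re-encoded data."""
    output = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        # Hypothetical transformation: drop a noisy field and add a flag.
        payload.pop("debug", None)
        payload["transformed"] = True
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```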
Firehose + Spark
cannot be done; Spark and the KCL do not read from Firehose, they only read from Kinesis Data Streams
Transformation Failures
transformation or delivery failures (e.g. failed type conversion or compression in the Firehose/Lambda step) can be backed up to an S3 bucket as a built-in feature of Firehose
Firehose Size Rule
size limit on the Firehose buffer; once it is reached, the buffer is flushed and the data is delivered (the rule that typically triggers for high-throughput applications)
Firehose Time Rule
time limit on the Firehose buffer; once it elapses, the buffer is flushed even if it is not full (the rule that typically triggers for low-throughput applications)
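The two rules are configured together as buffering hints on the delivery stream. A sketch with boto3, assuming a hypothetical stream name, bucket, and IAM role; whichever limit is hit first flushes the buffer.

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical names: the bucket and IAM role are assumed to already exist.
firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-clickstream-bucket",
        # Size rule: flush once 5 MB have accumulated (high throughput).
        # Time rule: flush after 300 seconds even if not full (low throughput).
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",  # compression supported for the S3 target
    },
)
```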
Amazon Managed Service for Apache Flink
managed service for running the Flink framework to process data streams in real time (it can read from Kinesis Data Streams or MSK); AWS provisions the application on a managed cluster; Flink cannot read from Firehose.
Amazon Managed Streaming for Apache Kafka (MSK)
alternative to Kinesis; a fully managed Kafka cluster on AWS where you can create, update, and delete clusters on the fly. MSK creates and manages broker nodes and ZooKeeper nodes with automatic recovery, can be multi-AZ, and stores data on EBS volumes.
MSK Cluster
made up of broker nodes
MSK Producers
ingest data from sources and send it to a topic, which is replicated across the brokers
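Producers talk to MSK with standard Kafka clients; a sketch using the kafka-python library, assuming a hypothetical topic and a TLS bootstrap broker string copied from the MSK console.

```python
from kafka import KafkaProducer  # standard Kafka client, not AWS-specific

# Placeholder bootstrap broker taken from the MSK console.
producer = KafkaProducer(
    bootstrap_servers="b-1.mycluster.xxxx.kafka.us-east-1.amazonaws.com:9094",
    security_protocol="SSL",  # MSK TLS listener
)

# The message is appended to one partition of the topic and then replicated
# to the other brokers according to the topic's replication factor.
producer.send("iot-events", key=b"sensor-42", value=b'{"temperature": 21.7}')
producer.flush()
```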
MSK Consumers
read and process data from topics into multiple destinations: Flink, Glue, Lambda, or applications running on EC2/EKS/ECS
MSK vs Kinesis Data Streams
MSK can support messages larger than 1 MB (configurable), and uses topics with partitions instead of streams with shards.
AWS Batch
runs batch jobs as Docker images, on Fargate or on dynamically provisioned EC2 instances, with the optimal quantity and type of instances chosen based on job volume and requirements. You pay only for the underlying resources used, and jobs can be orchestrated with Step Functions or EventBridge.
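A sketch of submitting a job with boto3, assuming a job queue and a job definition (pointing at a Docker image) were created beforehand; all names are hypothetical.

```python
import boto3

batch = boto3.client("batch")

# The queue and job definition are assumed to exist; the job definition
# references the Docker image that actually does the work.
response = batch.submit_job(
    jobName="nightly-report-2024-01-01",
    jobQueue="reporting-queue",
    jobDefinition="reporting-job:3",
    containerOverrides={
        "command": ["python", "build_report.py", "--date", "2024-01-01"],
        "environment": [{"name": "REPORT_BUCKET", "value": "my-reports-bucket"}],
    },
)
print(response["jobId"])  # can be tracked or chained from Step Functions
```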
Managed Batch Compute Environment
managed capacity of the underlying instances, which can be On-Demand or Spot Instances (with a configurable maximum Spot price), within a set minimum and maximum vCPU; instances in a VPC must have a NAT Gateway/Instance to reach AWS services.
Batch Job Queue
will distribute jobs to managed batch compute environments.
Batch Multi-Node Mode
large-scale batch mode suited to HPC: a single job runs across multiple EC2/ECS instances at the same time (one main node and many child nodes), good for tightly coupled workloads; does not work with Spot Instances.
Elastic MapReduce (EMR)
creates Hadoop clusters (for big data) to analyze and process vast amounts of data; clusters are made of hundreds of EC2 instances that can be shut down once processing is completed. Supports Spark, HBase, Presto, Flink. EMR takes care of provisioning and auto scaling (with CloudWatch).
EMR Master Node
manages the cluster: coordinates and monitors the other nodes; a long-running node
EMR Core Node
long-running node that runs tasks and stores data
EMR Task Nodes
optional nodes that only run tasks (no data storage), intended for short-running workloads.
On-Demand + EMR
reliable workloads that won't be terminated (use for core nodes)
Reserved + EMR
minimum 1-year commitment in exchange for cost savings; EMR will automatically use Reserved Instances if available; good for master and core nodes
Spot Instance + EMR
cheap, less reliable instances that should only be used for task nodes.
Uniform Instance Groups
a single instance type and purchasing option is selected for each node group (with support for auto scaling)
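A sketch of launching a cluster with uniform instance groups via boto3, using On-Demand capacity for the master and core groups and Spot capacity for the task group; the release label, roles, and instance types are assumptions.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical release label, roles, and instance types.
emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
            # Task nodes only run tasks, so cheaper Spot capacity is acceptable.
            {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down once processing is done
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```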
Instance Fleet
you set a target capacity and can mix instance types and purchasing options, with no auto-scaling capability
AWS Glue
managed ETL (extract, transform, and load) service, useful to prepare and transform data for analytics.
AWS Glue Data Catalog
catalog of datasets: a Glue crawler extracts metadata from databases (or S3) and writes it into the Glue Data Catalog; once the metadata is persisted, other services can use it for data discovery.
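A sketch of pointing a crawler at an S3 prefix with boto3 so that its inferred schema lands in the Data Catalog; the bucket, role, and database names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical IAM role with read access to the bucket, and a catalog database
# that will receive the inferred table metadata.
glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="sales_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-sales-bucket/raw/"}]},
)

# The crawler scans the files, infers the schema, and writes table metadata
# into the Glue Data Catalog for other services to discover.
glue.start_crawler(Name="sales-data-crawler")
```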
Redshift
data warehousing solution based on PostgreSQL; not used for OLTP, better for OLAP (online analytical processing). Around 10x the performance of other data warehouses, and scales to PBs of data.
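An OLAP-style query can be issued without managing a connection by using the Redshift Data API; a sketch with boto3, where the cluster, database, user, and table names are hypothetical.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Submit an analytical (OLAP) query to a hypothetical provisioned cluster.
statement = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="warehouse",
    DbUser="analyst",
    Sql="SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region",
)

# The Data API is asynchronous: check the statement status, then fetch rows.
status = redshift_data.describe_statement(Id=statement["Id"])
if status["Status"] == "FINISHED":
    rows = redshift_data.get_statement_result(Id=statement["Id"])["Records"]
```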
Redshift Features
columnar storage of data (instead of rows) combined with massively parallel query execution, spreading a query across many nodes
Provisioned Cluster Node
servers you provision yourself and pay for while they run; suited to sustained workloads
Serverless Cluster Nodes
Redshift manages the underlying infrastructure for you; there is no cluster to provision
Redshift Leader Node
used for query planning and result aggregation
Redshift Nodes
clusters of up to 128 compute nodes that can each hold up to 16 TB of data; Multi-AZ deployment is available for some cluster types
Redshift Compute Node
performs queries and sends the results back to the leader node
Redshift Limitations
enhanced VPC routing (COPY/UNLOAD traffic goes through the VPC), backup and restore functionality; it is a provisioned service, so it is better suited to sustained usage.
Redshift Snapshots
point-in-time incremental backups stored in S3 that can be restored into a new cluster; can be configured to automatically copy snapshots to another region
Cross Region Snapshot Recovery for Redshift
using a snapshot copy grant, Redshift can perform encryption operations in the designated destination region and copy snapshots there (required for KMS-encrypted Redshift databases)
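A sketch of wiring this up with boto3: the grant is created in the destination region, then cross-region copy is enabled on the source cluster; the cluster name, key, and regions are hypothetical.

```python
import boto3

# The snapshot copy grant lives in the DESTINATION region and names the KMS
# key Redshift may use there for encryption operations.
redshift_west = boto3.client("redshift", region_name="us-west-2")
redshift_west.create_snapshot_copy_grant(
    SnapshotCopyGrantName="analytics-copy-grant",
    KmsKeyId="arn:aws:kms:us-west-2:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)

# Cross-region copy is then enabled on the cluster in the SOURCE region.
redshift_east = boto3.client("redshift", region_name="us-east-1")
redshift_east.enable_snapshot_copy(
    ClusterIdentifier="analytics-cluster",
    DestinationRegion="us-west-2",
    RetentionPeriod=7,  # days to keep the copied snapshots
    SnapshotCopyGrantName="analytics-copy-grant",
)
```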
Redshift Spectrum
query data that is already in S3 without loading it into Redshift; a cluster must be available to start the query, which is then submitted to thousands of Spectrum nodes.
Workload Management
prevents short queries from getting stuck behind long-running queries by defining query queues and routing queries to the appropriate queue at runtime; can be automatic or manual
Redshift Concurrency Scaling
consistently fast performance with virtually unlimited concurrent users and queries; automatically adds additional cluster capacity to process an increase in requests, charged per second.
DocumentDB
managed MongoDB-compatible service used to store, query, and index JSON data. Fully managed and highly available with replication across 3 AZs; storage grows in 10 GB increments and the service automatically scales to millions of requests per second. Pay for usage.
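Because DocumentDB is MongoDB-compatible, standard MongoDB drivers work against it; a sketch with pymongo, where the cluster endpoint, credentials, and collection names are placeholders (the Amazon CA bundle must be downloaded separately for TLS).

```python
from pymongo import MongoClient

# Placeholder endpoint and credentials; global-bundle.pem is the Amazon-provided
# CA bundle required for TLS connections to the cluster.
client = MongoClient(
    "mongodb://appuser:secret@my-docdb-cluster.cluster-xxxxxxxx.us-east-1"
    ".docdb.amazonaws.com:27017/"
    "?tls=true&tlsCAFile=global-bundle.pem"
    "&replicaSet=rs0&readPreference=secondaryPreferred&retryWrites=false"
)

orders = client["shop"]["orders"]        # hypothetical database/collection
orders.insert_one({"order_id": 1, "total": 42.5, "items": ["book", "pen"]})
print(orders.find_one({"order_id": 1}))  # query the JSON document back
```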
On-Demand DocumentDB
pay per second, minimum of 10 minutes, with primary and replica instances.
DocumentDB Backups
stored in S3 with per GB/month pricing
TimeStream
fully managed, scalable, serverless time series database that automatically scales up/down to adjust capacity. Stores and analyzes trillions of events per day, with cost savings and performance gains compared to other database offerings. Encryption in transit and at rest. Used for IoT applications, operational applications, and real-time applications.
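A sketch of ingesting one measurement with boto3, assuming a Timestream database and table already exist; the dimension and measure names are hypothetical.

```python
import time
import boto3

timestream = boto3.client("timestream-write")

# One record: dimensions identify the series (the device), the measure is the
# value, and Time is the event timestamp in milliseconds.
timestream.write_records(
    DatabaseName="iot_db",
    TableName="temperatures",
    Records=[{
        "Dimensions": [{"Name": "device_id", "Value": "sensor-42"}],
        "MeasureName": "temperature",
        "MeasureValue": "21.7",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),
        "TimeUnit": "MILLISECONDS",
    }],
)
```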
Athena
serverless query service to analyze data in S3, using SQL to query files, with support for CSV, JSON, ORC, Parquet, and other file formats. Priced at ~$5 per TB of data scanned.
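Athena queries are submitted asynchronously; a sketch with boto3, where the Glue/Athena database and the S3 results bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Start the query; Athena always writes results to an S3 output location.
query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "web_analytics"},        # hypothetical DB
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Check the query state, then read the result set once it has succeeded.
execution_id = query["QueryExecutionId"]
state = athena.get_query_execution(QueryExecutionId=execution_id)
if state["QueryExecution"]["Status"]["State"] == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
```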
Athena Performance Improvements
use columnar data formats (Parquet or ORC) for cost savings, compress data for smaller scans, partition datasets in S3 (by encoding partition values in the folder names), and prefer larger files (> 128 MB)
Athena Federated Queries
allows SQL queries across data stored in relational, non-relational, object, and custom data sources, using data connectors that run on Lambda; results are stored in S3.
QuickSight
serverless ML-based BI service to create interactive dashboards, with per-session pricing. Can integrate with RDS, Aurora, Athena, Redshift, S3, OpenSearch, Timestream, Salesforce, Jira, 3rd-party databases, and imported (static) data.
QuickSight SPICE Engine
in-memory computation engine used for data imported into QuickSight
QuickSight Enterprise
adds column-level security (CLS) to QuickSight
QuickSight Dashboards
users and groups are managed in QuickSight (not IAM) and get read-only access to a dashboard with its filtering, parameters, controls, and sorting preserved. Dashboards must be published to be seen, and users who can see a dashboard can also see the underlying data used to create it