What is Kinesis?
Data streaming service
Use for logs, metrics, IoT, clickstreams
Kinesis is automatically replicated synchronously to ____
3 AZ
Kinesis streams vs analytics vs firehose
streams: low latency streaming at scale
analytics: real-time analytics using SQL
firehose: load streams into S3, Redshift, OpenSearch (formerly Elasticsearch) & Splunk
kinesis streams are divided into ____ / _____
shards / partitions
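A minimal sketch of how a record's partition key picks a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash range contains it. The function name `shard_for_key` and the evenly split ranges are assumptions for illustration; real streams can have uneven ranges after shard splits and merges.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: the hash space is divided evenly across shard_count shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the remainder of an uneven division falls into the last shard.
    return min(hash_key // range_size, shard_count - 1)
```

The same partition key always lands on the same shard, which is why ordering is preserved per key and why a single hot key can overload one shard.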
Kinesis streams data retention is ___ by default, can go up to ___
24 hours, 365 days
T/F: Multiple applications can consume the same stream
True
Once data is inserted in kinesis, it can’t be ____
deleted
Two types of kinesis streams modes for capacity
on-demand, provisioned
3 types of Kinesis producers
AWS SDK
Kinesis Producer Library (KPL)
Kinesis agent
3 types of kinesis consumers
AWS SDK
Lambda (through event source mapping)
KCL
Kinesis producer limits
1MB/s or 1000 messages/s PER SHARD
ProvisionedThroughputExceededException otherwise
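The per-shard write limits translate directly into a shard count for provisioned mode. A rough sizing sketch, assuming the 1 MB/s and 1,000 records/s per-shard figures from the card above; the function name `shards_needed` is hypothetical.

```python
import math

MAX_MB_PER_SHARD = 1.0        # write limit per shard (MB/s)
MAX_RECORDS_PER_SHARD = 1000  # write limit per shard (records/s)


def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Smallest shard count that satisfies both write limits."""
    by_bytes = math.ceil(mb_per_sec / MAX_MB_PER_SHARD)
    by_records = math.ceil(records_per_sec / MAX_RECORDS_PER_SHARD)
    return max(1, by_bytes, by_records)
```

For example, `shards_needed(3.5, 2000)` is bound by bytes, while `shards_needed(0.5, 4500)` is bound by record count. Whichever limit is hit first determines the shard count.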
Kinesis Consumer Classic limits
2MB/s at read PER SHARD
5 API calls per second PER SHARD
Kinesis Consumer Enhanced Fan-Out limits
2MB/s at read PER SHARD, PER ENHANCED CONSUMER
No API calls needed (push model)
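The practical difference between the two consumer modes is whether the 2 MB/s per shard is shared or dedicated. A small sketch, assuming classic consumers split a shard's read throughput evenly (a simplification); the function name is hypothetical.

```python
READ_MB_PER_SHARD = 2.0  # read limit per shard (MB/s)


def per_consumer_read_mb(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput (MB/s) available to each consumer."""
    total = shards * READ_MB_PER_SHARD
    if enhanced_fan_out:
        # Each enhanced consumer gets its own 2 MB/s per shard.
        return total
    # Classic consumers share the same 2 MB/s per shard.
    return total / consumers
```

With 4 shards and 4 consumers, classic mode leaves about 2 MB/s per consumer, while enhanced fan-out gives each one the full 8 MB/s.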
Kinesis firehose destinations
S3
Redshift
OpenSearch
Datadog
Splunk
New Relic
MongoDB
Custom HTTP endpoint
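Firehose buffer size
Configurable, e.g. 32MB (flushes when the buffer fills)
Firehose buffer time
Configurable, e.g. 1 min (flushes when the interval elapses, whichever limit is reached first)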
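Buffer size and buffer time work together: a batch is delivered as soon as either threshold is reached. The class below is a simplified in-memory simulation of that rule, not Firehose's actual implementation. `BufferSketch` and its parameters are hypothetical names, and the time check only runs when a record arrives.

```python
class BufferSketch:
    """Toy buffer that flushes on a size threshold or a time threshold."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.started_at = None  # arrival time of the first buffered record

    def add(self, record: bytes, now: float):
        """Buffer a record; return the flushed batch if a threshold was hit, else None."""
        if self.started_at is None:
            self.started_at = now
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.started_at >= self.max_seconds
        if too_big or too_old:
            batch = self.records
            self.records, self.size, self.started_at = [], 0, None
            return batch
        return None
```

A high-throughput stream tends to flush on size, and a low-throughput stream tends to flush on time.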
dynamodb streams vs kinesis data streams
dynamodb much more expensive for streaming
What is Amazon MSK
Managed Streaming for Apache Kafka
An alternative to kinesis
kinesis vs MSK
Need kafka compatibility
Kafka gives longer term storage
kinesis is simpler, cost efficient for simple streams, serverless
What is AWS Batch?
Run batch jobs as docker images
AWS Batch can run with one of what two options?
Fargate
Dynamic provisioning (EC2 & Spot instances) - in VPC
T/F: AWS Batch is fully serverless
True only when running on Fargate; EC2 & Spot compute environments provision instances in your VPC
How can you trigger a batch job?
Event notification → lambda → api call
Eventbridge → batch trigger
batch vs lambda
lambda: time limit, limited runtimes, serverless, less temporary storage
batch: no time limit, any docker runtime, uses EBS, uses EC2 or fargate
AWS Batch managed vs unmanaged compute environment
Managed: choose on-demand or spot, set max-price for spot, launched within your VPC
Unmanaged: You control everything
What is AWS Batch multi node mode?
Large scale, good for HPC
Does not work with spot instances
What is Amazon EMR?
Elastic MapReduce
Helps create Hadoop clusters (big data)
Clusters can be made of hundreds of EC2 instances
How do you do long term storage with EMR?
EMRFS (native integration with S3)
What is an EMR master node?
Manage the cluster, coordinate, manage health
What is an EMR core node?
Run task and store data
What is an EMR task node?
Just to run tasks, usually spot. Optional.
EMR purchasing options?
On-demand
Reserved
Spot
What are EMR uniform instance groups?
Select a single instance type and purchasing option for each node
What are EMR instance fleets?
Select target capacity, mix instance types, and purchasing options
Which has auto scaling, uniform instance groups or instance fleet?
Uniform instance groups
What is AWS Glue?
Managed ‘extract, transform, and load’ (ETL) service
Used to prepare and transform data for analytics
What is AWS Redshift?
Data warehouse based on PostgreSQL; used for OLAP, not OLTP
What is OLAP?
Online analytical processing
Redshift can scale to ___ of data
PB
Is redshift row or column based?
column
Redshift uses ___ queries
SQL
How do you force Redshift COPY / UNLOAD traffic to go through your VPC?
Redshift enhanced VPC routing
T/F: Redshift is always provisioned
True
Since Redshift is always provisioned, if you’re only doing sporadic queries you should instead use _____
athena
Redshift automated snapshots are taken every ___ hours, every ___ GB of changes, or on a schedule
8 hours, 5 GB
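The automated snapshot rule is an "either threshold" condition. A tiny sketch, assuming the 8-hour and 5 GB figures from the card; the function and parameter names are hypothetical, and scheduled snapshots are left out.

```python
SNAPSHOT_INTERVAL_HOURS = 8
SNAPSHOT_CHANGE_GB = 5


def snapshot_due(hours_since_last: float, gb_changed: float) -> bool:
    """True when either the time threshold or the change threshold is reached."""
    return hours_since_last >= SNAPSHOT_INTERVAL_HOURS or gb_changed >= SNAPSHOT_CHANGE_GB
```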
How can you copy KMS-encrypted Redshift snapshots to another region?
A Redshift snapshot copy grant
What is redshift spectrum?
query data that is already in s3 without loading it
Query is submitted to thousands of redshift spectrum nodes
For redshift spectrum, you must have a ________
redshift cluster available
What is Redshift Workload Management? (WLM)
Manage query priorities within workloads
What is redshift concurrency scaling?
Enables consistently fast performance with virtually unlimited concurrent users and queries.
Redshift automatically adds additional cluster capacity (CONCURRENCY-SCALING CLUSTER)
What is DocumentDB?
AWS's MongoDB-compatible document database
What is Amazon Timestream?
Time series database
(a bunch of points with timestamps)
Store and analyze trillions of events / day
Good for IoT
What is Amazon Athena?
Serverless query service to analyze data stored in S3
Supports CSV, JSON, Parquet
Athena cost?
$5/TB of data scanned
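Because pricing is per byte scanned, the cost of a query is simple arithmetic. A rough estimate, assuming the $5/TB rate from the card and a binary terabyte; it ignores any per-query minimums or rounding, and the function name is hypothetical.

```python
PRICE_PER_TB_USD = 5.0
BYTES_PER_TB = 1024 ** 4  # assumption: 1 TB counted as 2**40 bytes


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a query that scans bytes_scanned bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Scanning half as much data halves the bill, which is why reducing the data scanned per query matters so much.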
Athena is commonly used with ________
Amazon Quicksight for reporting/dashboards
How to improve Athena performance?
Use columnar data
- Use glue to convert your data to parquet or ORC
Compress data
Partition datasets
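Partitioning usually means laying out S3 keys in Hive style (`column=value/` folders) so a query with a filter on those columns reads only the matching prefixes. A sketch of building such a prefix; the `year`/`month`/`day` layout and the function name are illustrative assumptions, not a required scheme.

```python
from datetime import date


def partition_prefix(table_prefix: str, day: date) -> str:
    """Build a Hive-style partition prefix such as logs/year=2024/month=03/day=07/."""
    return (
        f"{table_prefix}/year={day.year}"
        f"/month={day.month:02d}/day={day.day:02d}/"
    )
```

A query filtered to a single day then scans only objects under that one prefix instead of the whole dataset.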
What is an athena federated query?
Run SQL queries across data stored in various data sources
Uses a lambda data source connector
What is Amazon Quicksight?
Business Intelligence service
Create dashboards
Quicksight does in-memory computation using ____
SPICE engine
Quicksight enterprise has the possibility to set up _____
Column-Level Security (CLS)