SAP 10: Data Engineering

61 Terms

1

What is Kinesis?

Data streaming service
Used for logs, metrics, IoT, clickstreams

2

Kinesis is automatically replicated synchronously to ____

3 AZs

3

Kinesis Data Streams vs Data Analytics vs Firehose

Streams: low-latency streaming at scale
Analytics: real-time analytics on streams using SQL
Firehose: loads streams into S3, Redshift, OpenSearch (formerly Elasticsearch) & Splunk

4

Kinesis Data Streams are divided into ____ / _____

shards / partitions
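
As a rough illustration of how records end up on a shard: Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains that value. The sketch below assumes the hash space is split evenly across shards; `shard_for_key` and the shard count are hypothetical names used only for illustration, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit value


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard that would own this partition key.

    Assumption: the 128-bit hash space is divided into equal ranges,
    one per shard (a real stream can have uneven ranges after resharding).
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so the top of the hash space falls into the last shard.
    return min(hash_value // range_size, shard_count - 1)
```

Records with the same partition key always map to the same shard, which is why a hot key can overload a single shard.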

5

Kinesis Data Streams retention is ___ by default and can go up to ___

24 hours, 365 days

6

T/F: Multiple applications can consume the same stream

True

7

Once data is inserted in Kinesis, it can’t be ____

deleted (records are immutable)

8

Two capacity modes for Kinesis Data Streams

On-demand, provisioned

9

3 types of Kinesis producers

AWS SDK
Kinesis Producer Library (KPL)
Kinesis Agent

10

3 types of Kinesis consumers

AWS SDK
Lambda (through event source mapping)
Kinesis Client Library (KCL)

11

Kinesis producer limits

1 MB/s or 1,000 records/s write PER SHARD
ProvisionedThroughputExceededException otherwise
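
The per-shard write limits above translate directly into a shard-count estimate for provisioned mode. A minimal sketch, assuming steady traffic with no headroom for spikes; `shards_needed` is a hypothetical helper, not an AWS API.

```python
import math

MAX_MB_PER_SEC_PER_SHARD = 1.0        # write limit per shard
MAX_RECORDS_PER_SEC_PER_SHARD = 1000  # write limit per shard


def shards_needed(mb_per_sec: float, records_per_sec: float) -> int:
    """Smallest shard count that satisfies both per-shard write limits."""
    by_bytes = math.ceil(mb_per_sec / MAX_MB_PER_SEC_PER_SHARD)
    by_records = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_bytes, by_records)
```

For example, 3.5 MB/s at 2,000 records/s is bound by bytes, while 0.5 MB/s at 4,500 records/s is bound by record count.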

12

Kinesis consumer limits (classic)

2 MB/s read PER SHARD, shared across all consumers
5 GetRecords API calls per second PER SHARD

13

Kinesis consumer limits (enhanced fan-out)

2 MB/s read PER SHARD, PER ENHANCED CONSUMER
No polling needed (push model)
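
The difference between the two consumer modes is easiest to see as arithmetic: classic consumers split a shard's 2 MB/s between them, while each enhanced fan-out consumer gets its own 2 MB/s. A small sketch, assuming classic consumers share the limit evenly; `read_mb_per_consumer` is a hypothetical helper.

```python
CLASSIC_MB_PER_SEC_PER_SHARD = 2.0   # shared by all classic consumers
ENHANCED_MB_PER_SEC_PER_SHARD = 2.0  # dedicated to each enhanced consumer


def read_mb_per_consumer(consumers: int, enhanced: bool) -> float:
    """Read throughput (MB/s) each consumer can expect from one shard."""
    if consumers < 1:
        raise ValueError("need at least one consumer")
    if enhanced:
        return ENHANCED_MB_PER_SEC_PER_SHARD
    # Assumption: classic consumers divide the shared limit evenly.
    return CLASSIC_MB_PER_SEC_PER_SHARD / consumers
```

With four consumers on one shard, classic mode leaves 0.5 MB/s each, while enhanced fan-out still gives each consumer 2 MB/s.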

14

Kinesis Data Firehose destinations

S3
Redshift
OpenSearch

Datadog
Splunk
New Relic
MongoDB

Custom HTTP endpoint

15

Firehose buffer size

Configurable (e.g. 32 MB); the buffer is flushed when it reaches this size

16

Firehose buffer time

Configurable (e.g. 1 min); the buffer is flushed when this interval elapses
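
The two Firehose buffer cards combine into a single rule: a batch is delivered as soon as either the size or the time threshold is hit. The sketch below only models that rule using the example values from the cards; `BufferState` and `should_flush` are hypothetical names, and this is not how Firehose is implemented internally.

```python
from dataclasses import dataclass


@dataclass
class BufferState:
    buffered_mb: float          # data accumulated since the last flush
    seconds_since_flush: float  # time elapsed since the last flush


def should_flush(state: BufferState,
                 max_mb: float = 32.0,        # example buffer size
                 max_seconds: float = 60.0    # example buffer time
                 ) -> bool:
    """Flush when the buffer is full OR the interval has elapsed."""
    return (state.buffered_mb >= max_mb
            or state.seconds_since_flush >= max_seconds)
```

A low-volume stream is therefore flushed by the timer, and a high-volume stream by the size limit.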

17

DynamoDB Streams vs Kinesis Data Streams

Using DynamoDB with DynamoDB Streams is much more expensive for streaming than Kinesis Data Streams

18

What is Amazon MSK?

Managed Streaming for Apache Kafka
An alternative to Kinesis

19

Kinesis vs MSK

MSK: when you need Kafka compatibility
Kafka allows longer-term storage

Kinesis: simpler, cost-efficient for simple streams, serverless

20

What is AWS Batch?

Runs batch jobs packaged as Docker images

21

AWS Batch can run with one of what two options?

Fargate
Dynamic provisioning (EC2 & Spot instances) - in VPC

22

T/F: AWS Batch is fully serverless

True (no clusters to manage; you pay for the underlying EC2 or Fargate resources)

23

How can you trigger a Batch job?

Event notification → Lambda → Batch API call
EventBridge → Batch job as target

24

Batch vs Lambda

Lambda: time limit, limited runtimes, serverless, limited temporary disk space

Batch: no time limit, any runtime packaged as a Docker image, uses EBS for storage, runs on EC2 or Fargate

25

AWS Batch managed vs unmanaged compute environment

Managed: choose on-demand or spot, set a max price for spot, launched within your VPC

Unmanaged: You control everything

26

What is AWS Batch multi node mode?

Runs a single job across multiple EC2 instances
Large scale, good for HPC
Does not work with spot instances

27

What is Amazon EMR?

Elastic MapReduce
Helps create Hadoop clusters (big data)
Clusters can be made of hundreds of EC2 instances

28

How do you do long term storage with EMR?

EMRFS (stores data in S3)

29

What is an EMR master node?

Manages the cluster, coordinates work, monitors health

30

What is an EMR core node?

Runs tasks and stores data

31

What is an EMR task node?

Only runs tasks (stores no data), usually spot. Optional.

32

EMR purchasing options?

On-demand
Reserved
Spot

33

What are EMR uniform instance groups?

Select a single instance type and purchasing option for each node type (master, core, task)

34

What are EMR instance fleets?

Select target capacity, mix instance types, and purchasing options

35

Which has auto scaling, uniform instance groups or instance fleet?

Uniform instance groups

36

What is AWS Glue?

Managed ‘extract, transform, and load’ (ETL) service
Used to prepare and transform data for analytics

37

What is Amazon Redshift?

Data warehouse based on PostgreSQL; used for OLAP, not OLTP

38

What is OLAP?

Online analytical processing

39

Redshift can scale to ___ of data

Petabytes (PB)

40

Is Redshift row or column based?

Column based (columnar storage)

41

Redshift uses ___ queries

SQL

42

How do you route Redshift COPY / UNLOAD traffic through your VPC?

Redshift Enhanced VPC Routing

43

T/F: Redshift is always provisioned

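True for classic clusters, which are provisioned ahead of time (a separate Redshift Serverless option also exists)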

44

Since Redshift is always provisioned, if you’re only doing sporadic queries you should instead use _____

Athena

45

Redshift snapshots are automated every _ hours, every _ GB, or on a schedule

8 hours, 5 GB

46

How can you copy encrypted Redshift snapshots to another Region?

A Redshift snapshot copy grant

47

What is Redshift Spectrum?

Queries data that is already in S3 without loading it
The query is submitted to thousands of Redshift Spectrum nodes

48

For Redshift Spectrum, you must have a ________

Redshift cluster available

49

What is Redshift Workload Management (WLM)?

Manages query priorities within workloads

50

What is Redshift concurrency scaling?

Enables consistently fast performance with virtually unlimited concurrent users and queries.
Redshift automatically adds additional cluster capacity (CONCURRENCY-SCALING CLUSTER)

51

What is DocumentDB?

AWS's MongoDB-compatible document database

52

What is Amazon Timestream?

Time series database
(a bunch of points with timestamps)
Store and analyze trillions of events / day

Good for IoT

53

What is Amazon Athena?

Serverless query service to analyze data stored in S3
Supports CSV, JSON, Parquet

54

Athena cost?

$5/TB of data scanned
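
Because Athena bills by data scanned, a query's cost can be estimated from the bytes it reads. A minimal sketch using the $5/TB figure from the card; it assumes 1 TB = 1024^4 bytes and ignores any minimum charge or rounding, and `athena_query_cost` is a hypothetical helper rather than an AWS API.

```python
PRICE_PER_TB_USD = 5.0     # price from the card above
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabyte


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD
```

Anything that reduces the bytes scanned (columnar formats, compression, partitioning) lowers the bill proportionally.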

55

Athena is commonly used with ________

Amazon QuickSight for reporting/dashboards

56

How to improve Athena performance?

Use columnar data
- Use Glue to convert your data to Parquet or ORC
Compress data
Partition datasets

57

What is an Athena federated query?

Runs SQL queries across data stored in various data sources
Uses a Lambda data source connector

58

What is Amazon QuickSight?

Business Intelligence service
Create dashboards

59

QuickSight does in-memory computation using ____

SPICE engine

60

QuickSight Enterprise edition lets you set up _____

Column-Level Security (CLS)

61