data ingestion

49 Terms

1
New cards

What are the three types of data structures in AWS/ML context?

1. Structured Data: Organized with a defined schema (e.g., database tables, CSV)
2. Unstructured Data: No predefined structure (e.g., text, video, audio, images)
3. Semi-structured Data: Tagged/categorized elements (e.g., XML, JSON, log files)

2
New cards

What are the "3 V's" of data properties?

1. Volume: Amount/size of data
2. Velocity: Speed of data generation and processing
3. Variety: Different types, structures, and sources of data

3
New cards

What is a Data Warehouse and when should you use it?

Definition: Centralized repository optimized for analysis of structured data from different sources

Example: Amazon Redshift
Key features: Schema-on-write (ETL), star/snowflake schemas, complex read-heavy queries
Use when: You need structured data with fast complex queries, BI/analytics, integration from multiple sources

4
New cards

What is a Data Lake and when should you use it?

Definition: Storage repository for large amounts of raw data in native format
Key features: Schema-on-read (ELT), no preprocessing, flexible and agile

Example: Amazon S3
Use when: Mix of structured/unstructured data, large volumes where cost-effectiveness matters, flexibility for future needs

5
New cards
What is a Data Lakehouse?
Definition: Hybrid architecture combining data warehouse performance/reliability with data lake flexibility/scale/low-cost
Example: AWS Lake Formation (S3 + Redshift Spectrum)
Features: Supports both schema-on-write and schema-on-read
6
New cards
What is a Data Mesh?
Definition: Governance and organization paradigm where individual teams own 'data products' within a domain
Key features: Domain-based data management, federated governance with central standards
Note: More about data management paradigm than specific technology
7
New cards
What are the three steps of ETL and their purposes?
1. Extract: Retrieve raw data, ensure integrity (real-time or batches)
2. Transform: Convert to suitable format (cleansing, enrichment, encoding, format changes)
3. Load: Move transformed data into warehouse (batches or streamed)
Tools: AWS Glue, EventBridge, Lambda
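The three steps above can be sketched as plain functions (a minimal illustration: a CSV string stands in for the source and a Python list for the warehouse; the function names are illustrative, not an AWS API):

```python
import csv
import io

def extract(raw_csv: str) -> list[dict]:
    """Extract: retrieve raw rows from the source."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: cleanse and normalize (strip whitespace, cast price)."""
    return [
        {"sku": r["sku"].strip().upper(), "price": float(r["price"])}
        for r in rows
        if r["price"]  # drop rows with a missing price
    ]

def load(rows: list[dict], warehouse: list) -> None:
    """Load: move transformed rows into the target store."""
    warehouse.extend(rows)

warehouse: list = []
raw = "sku,price\n abc ,1.50\nxyz,\ndef,2.00\n"
load(transform(extract(raw)), warehouse)
print(warehouse)  # only the two rows with valid prices survive
```

In AWS, Glue would typically run the transform step as a managed job instead of inline code like this.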
8
New cards
What is the difference between JDBC and ODBC?
JDBC (Java Database Connectivity): Platform-independent, language-dependent (Java only)
ODBC (Open Database Connectivity): Platform-dependent, language-independent
9
New cards
When should you use CSV format for data?
Use cases:
- Small to medium datasets
- Human-readable format needed
- Importing/exporting from databases and spreadsheets
10
New cards
When should you use JSON format for data?
Use cases:
- Web server/client communication
- Configuration files
- Flexible/nested schema requirements
11
New cards
When should you use AVRO format for data?
Definition: Binary format storing both data and schema
Use cases:
- Big data and real-time processing
- Schema evolution needed
- Efficient data transport
12
New cards
When should you use Parquet format for data?
Definition: Columnar format optimized for analytics
Use cases:
- Large data analysis
- Reading specific columns instead of entire records
- Optimized I/O operations needed
13
New cards
What are S3 buckets and objects?
Buckets: Must have globally unique names, defined at region level
Objects: Have a key (full path), no concept of directories, max size 5TB
Example key: s3://my-bucket/my_file.txt
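Because a key is just the full path after the bucket name, splitting an S3 URI is a single partition (a small sketch; the bucket and key values are illustrative):

```python
def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Keys have no real
    directory concept; 'folders' are just key prefixes."""
    if not uri.startswith("s3://"):
        raise ValueError("not an s3:// URI")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

print(parse_s3_uri("s3://my-bucket/logs/2024/my_file.txt"))
# → ('my-bucket', 'logs/2024/my_file.txt')
```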
14
New cards
How does S3 security work?
User-based: IAM policies
Resource-based: Bucket policies, Object ACL, Bucket ACL
Access rule: Object accessible if (IAM allows OR resource policy allows) AND no explicit deny
Encryption: Using encryption keys
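The access rule above reduces to one boolean expression (a sketch of the stated rule only, not the full IAM policy-evaluation engine):

```python
def s3_access_allowed(iam_allows: bool, resource_allows: bool,
                      explicit_deny: bool) -> bool:
    """Object accessible if (IAM allows OR resource policy allows)
    AND there is no explicit deny."""
    return (iam_allows or resource_allows) and not explicit_deny

assert s3_access_allowed(True, False, False)      # IAM alone suffices
assert s3_access_allowed(False, True, False)      # bucket policy alone suffices
assert not s3_access_allowed(True, True, True)    # explicit deny always wins
assert not s3_access_allowed(False, False, False)
```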
15
New cards
What are the key features of S3 versioning?
Features:
- Enabled at bucket level
- Protects against unintended deletes
- Easy rollback
Notes: Files before enabling have version 'null', suspending doesn't delete previous versions
16
New cards
What are the requirements for S3 replication (CRR/SRR)?
Requirements:
- Versioning enabled in both source and destination
- Can be across different accounts
- Copying is asynchronous
Limitations: Can't chain to third bucket, existing items not replicated unless batch replication enabled
17
New cards
What are the use cases for S3 Cross-Region Replication (CRR)?
- Compliance requirements
- Low latency access
- Replication across accounts
18
New cards
What are the use cases for S3 Same-Region Replication (SRR)?
- Log aggregation
- Live replication between production and test environments
19
New cards
What is S3 Standard - General Purpose?
Availability: 99.99%
Use for: Frequently accessed data
Features: High throughput, low latency, sustains 2 concurrent facility failures
Use cases: Big data, mobile, gaming, content distribution
20
New cards
What is S3 Standard - Infrequent Access (IA)?
Standard IA:
- 99.99% availability
- Use cases: Disaster recovery, backups
One-Zone IA:
- 99.5% availability
- Data lost if AZ destroyed
- Use cases: Secondary backups, recreatable data
21
New cards

What are the three S3 Glacier tiers and their retrieval times?

Glacier Instant Retrieval: Millisecond, 90-day minimum billing period
Glacier Flexible Retrieval: Expedited (1-5 min), Standard (3-5 hr), Bulk (5-12 hr free), 90-day minimum
Glacier Deep Archive: Standard (12 hr), Bulk (48 hr), 180-day minimum

22
New cards
How does S3 Intelligent-Tiering work?
Purpose: Automatically moves objects between tiers based on usage
Cost: Small monthly monitoring fee, no retrieval charges
Tiers: Frequent Access (default) → Infrequent Access (30 days) → Archive Instant Access (90 days) → Archive Access (90-700+ days, optional) → Deep Archive Access (180-700+ days, optional)
23
New cards
What are S3 Lifecycle Rules?
Can apply to: Entire buckets, specific prefixes, object tags, non-current versions
Actions:
- Transition: Move to different storage class (e.g., Standard to IA after 60 days)
- Expiration: Delete objects (e.g., delete logs after 365 days)
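A lifecycle configuration covering both example actions can be written as the structure boto3's `put_bucket_lifecycle_configuration` expects (a sketch; the `logs/` prefix and rule ID are illustrative):

```python
lifecycle_config = {
    "Rules": [
        {
            "ID": "logs-standard-to-ia-then-expire",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            # Transition: Standard -> Standard-IA after 60 days
            "Transitions": [{"Days": 60, "StorageClass": "STANDARD_IA"}],
            # Expiration: delete after 365 days
            "Expiration": {"Days": 365},
        }
    ]
}

# Applying it would require credentials and an existing bucket:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
print(lifecycle_config["Rules"][0]["ID"])
```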
24
New cards
How do S3 Event Notifications work?
Trigger: Events like S3:ObjectCreated
Delivery: Typically seconds, sometimes up to a minute
Targets: SNS, SQS, Lambda (requires IAM/resource policies)
Advanced: All events can be sent to EventBridge, which can route them to 18+ AWS services
25
New cards
What are S3 performance characteristics?
Latency: 100-200ms
PUT/POST/COPY/DELETE: 3,500 requests/sec per prefix
GET/HEAD: 5,500 requests/sec per prefix
26
New cards
What are S3 file write optimization techniques?
Multi-part upload: Recommended for files >100MB
S3 Transfer Acceleration: Uses AWS edge locations to forward data to S3 bucket (maximizes private AWS network time, minimizes public internet time)
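A multi-part upload splits the object into byte ranges uploaded as numbered parts. A small sketch of computing those boundaries (the 100 MB part size is an assumption; S3 requires parts of at least 5 MiB except the last):

```python
def part_ranges(total_size: int, part_size: int = 100 * 1024 * 1024):
    """Yield (part_number, start, end_exclusive) for each upload part."""
    n = 1
    for start in range(0, total_size, part_size):
        yield n, start, min(start + part_size, total_size)
        n += 1

parts = list(part_ranges(250 * 1024 * 1024))  # a 250 MiB file
print(len(parts))  # 3 parts: 100 + 100 + 50 MiB
```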
27
New cards
What is S3 byte-range fetch?
Purpose: File read optimization
Method: Parallelize GETs by requesting specific byte ranges
Use case: Reading just the head of a file
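Byte-range fetches are expressed as HTTP `Range` headers; parallel GETs each request one range. A sketch (the chunk size is an assumption):

```python
def range_headers(total_size: int, chunk: int) -> list[str]:
    """One 'bytes=start-end' header per parallel GET (end inclusive)."""
    return [
        f"bytes={start}-{min(start + chunk, total_size) - 1}"
        for start in range(0, total_size, chunk)
    ]

print(range_headers(1000, 400))
# → ['bytes=0-399', 'bytes=400-799', 'bytes=800-999']

# Reading just the head of an object (requires credentials):
# s3.get_object(Bucket="my-bucket", Key="my_file.txt", Range="bytes=0-255")
```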
28
New cards
What are the three types of S3 Server-Side Encryption (SSE)?
SSE-S3: AWS-managed keys, enabled by default, AES-256
SSE-KMS: User control + CloudTrail auditing, may hit KMS API limits, includes DSSE-KMS option
SSE-C: Customer-provided keys NOT stored by AWS, requires HTTPS with key in header
29
New cards
What is S3 Client-Side Encryption?
Uses AWS Client-Side Encryption Library to encrypt data before sending to and after receiving from S3
30
New cards
What are S3 Access Points?
Purpose: Policies granting read/write for specific prefixes
Features: Each has own DNS name and access policy
VPC option: Can specify access point only accessible from VPC (requires VPC endpoint)
31
New cards
What is S3 Object Lambda?
Purpose: Modify objects before retrieval by caller application using Lambda functions
Use cases: Redacting information, format conversion, client-specific transformations
Requirements: S3 bucket, S3 Access Point, S3 Object Lambda Access Points
32
New cards
What is EBS (Elastic Block Store)?
Definition: Network-attached storage ('network USB stick') for EC2 instances
Features: Persist data, bound to specific AZ (unless snapshotted), billed by provisioned capacity
Types: gp2, gp3, io1/io2
Latency: Uses network, so some latency exists
33
New cards
What is the EBS Delete on Termination feature?
Purpose: Controls EBS behavior when EC2 instance terminates
Default: Root EBS deleted, other attached volumes preserved
Use case: Preserve root volume when instance terminated
34
New cards
What are EBS Elastic Volumes?
Feature: Change volume size and type without detaching volume or restarting instance
35
New cards
What is EFS (Elastic File System)?
Definition: Managed network filesystem mountable on multiple EC2 instances
Features: Multi-AZ, highly available, auto-scaling
Cost: Billed per use (3x gp2 price)
Requirements: Security group, Linux AMI only (not Windows)
36
New cards
What are the EFS Storage Tiers?
Storage Tiers: Standard (frequently accessed), EFS-IA (infrequent access), Archive (rarely accessed, 50% cheaper)
Availability: Standard (multi-AZ, production), One Zone (dev, with backup)
Cost savings: Lifecycle policies can achieve 90% savings
37
New cards
What is FSx for Windows File Server?
Purpose: Windows file system share drive
Features: Works on Linux EC2, supports MS Distributed File System, accessible from on-prem, multi-AZ, daily S3 backups, massive scale
Storage: HDD or SSD options
38
New cards
What is FSx for Lustre?
Purpose: Parallel distributed file system for HPC/video processing
Features: Accessible from on-prem, reads/writes to S3, massive scale and throughput
Storage options: SSD (low-latency, IOPS intensive), HDD (throughput-intensive)
39
New cards
What is FSx for NetApp ONTAP?
Purpose: Move ONTAP/NAS workloads to AWS
Protocols: NFS, SMB, iSCSI
Features: Works with all OS, auto-scaling, snapshots, replication, point-in-time cloning for testing
40
New cards
What is FSx for OpenZFS?
Purpose: Move ZFS workloads to AWS
Protocol: NFS
Features: Compatible with all OS, low latency, snapshots, replication, point-in-time cloning
41
New cards
What are the FSx for Lustre file system deployment options?
Scratch: Temporary storage, data not replicated, 6x faster burst, for short-term cost-optimized processing
Persistent: Long-term storage replicated in same AZ, failed files replaced in minutes, for long-term processing and sensitive data
42
New cards
What is Kinesis Data Streams?
Purpose: Collect and store streaming data in REAL-TIME
Retention: Up to 365 days
Features: Data can't be deleted (must expire), reprocess/replay capability, ordering guaranteed per partition key, 1MB max record size
Encryption: At-rest (KMS), in-flight (HTTPS)
43
New cards
What are Kinesis Data Streams capacity modes?
Provisioned: Manually scale shards, 1MB/s in + 2MB/s out per shard, pay per shard per hour
On-demand: Default 4MB/s in or 4000 records/s, auto-scales based on 30-day peak, pay per stream per hour + data in/out per GB
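Provisioned-mode sizing follows from the per-shard limits above; a sketch (the 1,000 records/s per-shard write limit is a documented Kinesis limit not listed on the card):

```python
import math

def shards_needed(in_mb_s: float, in_records_s: float,
                  out_mb_s: float) -> int:
    """Shards = max over each per-shard limit:
    1 MB/s in, 1,000 records/s in, 2 MB/s out."""
    return max(math.ceil(in_mb_s / 1),
               math.ceil(in_records_s / 1000),
               math.ceil(out_mb_s / 2))

print(shards_needed(in_mb_s=3, in_records_s=2500, out_mb_s=8))  # → 4
```

Here the 8 MB/s read requirement dominates (8 / 2 = 4 shards).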
44
New cards
What are common Kinesis Data Streams troubleshooting solutions?
Writing too slow: Service/shard limits exceeded, data not evenly distributed
Large producer: Use batching
500/503 errors: Implement retry with exponential backoff
Connection errors to Flink: Network/VPC issues
Throttling: Check for hot shards, use random partition key
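The retry-with-exponential-backoff fix for 500/503 errors looks like this (a sketch with jitter; the attempt limits, delays, and the flaky stand-in call are illustrative):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.05):
    """Retry fn on failure, doubling the delay each attempt (with jitter)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for a 500/503 response
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.random())

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503")
    return "ok"

print(call_with_backoff(flaky))  # succeeds after two retries
```

In practice the AWS SDKs (including the KPL) already implement this retry behavior.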
45
New cards
What is Kinesis Data Firehose?
Definition: Fully managed NEAR-REAL-TIME service with buffering
Features: Auto-scaling, pay-per-use, supports JSON/CSV/Parquet/AVRO/binary
Flow: Producers → Firehose buffer → (optional Lambda transform) → Destinations (S3, Redshift, OpenSearch, 3rd party, HTTP endpoints)
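The optional Lambda transform step receives base64-encoded records and must return them marked `Ok`, `Dropped`, or `ProcessingFailed`. A sketch of such a handler (the uppercasing transform and the sample event are illustrative):

```python
import base64
import json

def handler(event, context=None):
    out = []
    for rec in event["records"]:
        data = json.loads(base64.b64decode(rec["data"]))
        data["source"] = data["source"].upper()  # example transformation
        out.append({
            "recordId": rec["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(data).encode()).decode(),
        })
    return {"records": out}

event = {"records": [{"recordId": "1",
                      "data": base64.b64encode(b'{"source": "app"}').decode()}]}
print(handler(event)["records"][0]["result"])  # → Ok
```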
46
New cards
What is Managed Service for Apache Flink (MSAF)?
Definition: Managed, serverless Flink service on AWS
Flink: Framework for processing data streams
Features: Build custom apps from scratch or load application code from S3
Use cases: Streaming ETL, continuous metric generation
47
New cards
What is MSK (Managed Streaming for Apache Kafka)?
Definition: Alternative to Kinesis using Kafka producers/consumers
Features: Fully managed (create/update/delete clusters), multi-AZ (recommended 3), data stored in EBS
Flow: Source → Producers → MSK Cluster (topics/brokers) → Consumers → Destinations
48
New cards
What are MSK security features?
Encryption: TLS in-flight between brokers/clients, at-rest EBS with KMS
Network: Security groups
Authentication/Authorization: Mutual TLS, SASL/SCRAM (authentication), Kafka ACLs (authorization), IAM Access Control (both)
49
New cards
When should you use MSK over Kinesis?
Use MSK when:
- Message size > 1MB
- Need Kafka-specific features
- Existing Kafka infrastructure to migrate