Data Ingestion

49 Terms

1
New cards

What are the three types of data structures in AWS/ML context?

1. Structured Data: Organized with defined schema (e.g., database tables, CSV)
2. Unstructured Data: No predefined structure (e.g., text, video, audio, images)
3. Semi-structured Data: Tagged/categorized elements (e.g., XML, JSON, log files)

2
New cards

What are the "3 V's" of data properties?

1. Volume: Amount/size of data
2. Velocity: Speed of data generation and processing
3. Variety: Different types, structures, and sources of data

3
New cards

What is a Data Warehouse and when should you use it?

Definition: Centralized repository optimized for analysis of structured data from different sources

Example: Amazon Redshift
Key features: Schema-on-write (ETL), star/snowflake schemas, complex read-heavy queries
Use when: You need structured data with fast complex queries, BI/analytics, integration from multiple sources

4
New cards

What is a Data Lake and when should you use it?

Definition: Storage repository for large amounts of raw data in native format
Key features: Schema-on-read (ELT), no preprocessing, flexible and agile

Example: Amazon S3
Use when: Mix of structured/unstructured data, large volumes where cost-effectiveness matters, flexibility for future needs

5
New cards
What is a Data Lakehouse?
Definition: Hybrid architecture combining data warehouse performance/reliability with data lake flexibility/scale/low-cost
Example: AWS Lake Formation (S3 + Redshift Spectrum)
Features: Supports both schema-on-write and schema-on-read
6
New cards
What is a Data Mesh?
Definition: Governance and organization paradigm where individual teams own 'data products' within a domain
Key features: Domain-based data management, federated governance with central standards
Note: More about data management paradigm than specific technology
7
New cards
What are the three steps of ETL and their purposes?
1. Extract: Retrieve raw data, ensure integrity (real-time or batches)
2. Transform: Convert to suitable format (cleansing, enrichment, encoding, format changes)
3. Load: Move transformed data into warehouse (batches or streamed)
Tools: AWS Glue, EventBridge, Lambda
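The three steps above can be sketched as a minimal in-memory pipeline. All names and data here are illustrative, not any AWS API:

```python
# Minimal ETL sketch with hypothetical in-memory data (illustrative only).

def extract(source):
    # Extract: retrieve raw records, dropping malformed ones (integrity check).
    return [r for r in source if r is not None]

def transform(records):
    # Transform: cleanse and normalize (here: trim whitespace, coerce types).
    return [{"code": r["code"].strip().upper(), "value": float(r["value"])}
            for r in records]

def load(records, warehouse):
    # Load: append the transformed batch into the target store.
    warehouse.extend(records)
    return len(records)

warehouse = []
raw = [{"code": " us ", "value": "1.5"}, None, {"code": "eu", "value": "2"}]
loaded = load(transform(extract(raw)), warehouse)
```

In a real pipeline the extract/transform/load roles would be played by the tools on this card (e.g., Glue jobs triggered by EventBridge), but the three-phase shape is the same.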
8
New cards
What is the difference between JDBC and ODBC?
JDBC (Java Database Connectivity): Platform-independent, language-dependent (Java only)
ODBC (Open Database Connectivity): Platform-dependent, language-independent
9
New cards
When should you use CSV format for data?
Use cases:
- Small to medium datasets
- Human-readable format needed
- Importing/exporting from databases and spreadsheets
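The spreadsheet/database round-trip use case can be shown with the standard library alone (the row data below is made up):

```python
import csv
import io

# Round-trip a small dataset through CSV: human-readable text,
# directly importable into spreadsheets and databases.
rows = [{"id": "1", "name": "alpha"}, {"id": "2", "name": "beta"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)

text = buf.getvalue()                      # header line + one line per row
back = list(csv.DictReader(io.StringIO(text)))
```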
10
New cards
When should you use JSON format for data?
Use cases:
- Web server/client communication
- Configuration files
- Flexible/nested schema requirements
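The flexible/nested-schema point is easy to see in code; this hypothetical config fragment nests objects and arrays freely:

```python
import json

# JSON allows nested objects and arrays without a fixed schema,
# which is why it suits configuration files and web payloads.
config = {"service": "web", "limits": {"rps": 100}, "tags": ["prod", "eu"]}

payload = json.dumps(config)     # serialize for transport/storage
decoded = json.loads(payload)    # structure survives the round trip
```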
11
New cards
When should you use AVRO format for data?
Definition: Binary format storing both data and schema
Use cases:
- Big data and real-time processing
- Schema evolution needed
- Efficient data transport
12
New cards
When should you use Parquet format for data?
Definition: Columnar format optimized for analytics
Use cases:
- Large data analysis
- Reading specific columns instead of entire records
- Optimized I/O operations needed
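The column-oriented idea behind Parquet can be simulated with plain dicts; this sketch is an analogy, not the Parquet format itself:

```python
# Row layout vs columnar layout: a columnar store can answer an analytics
# query by reading exactly one column, which is the I/O saving Parquet exploits.
rows = [{"user": "a", "ms": 120, "ok": True},
        {"user": "b", "ms": 340, "ok": False}]

# The same data pivoted into a column-oriented layout.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# "Average latency" touches only the "ms" column, never "user" or "ok".
avg_ms = sum(columns["ms"]) / len(columns["ms"])
```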
13
New cards
What are S3 buckets and objects?
Buckets: Must have globally unique names, defined at region level
Objects: Have a key (full path), no concept of directories, max size 5TB
Example key: s3://my-bucket/my_file.txt
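Since the key is the full path after the bucket, splitting an s3:// URI is a single partition; the helper name below is hypothetical:

```python
# Split an s3:// URI into bucket and key. S3 has no real directories:
# everything after the bucket name is one flat key string.
def parse_s3_uri(uri):
    assert uri.startswith("s3://")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

bucket, key = parse_s3_uri("s3://my-bucket/my_folder/my_file.txt")
```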
14
New cards
How does S3 security work?
User-based: IAM policies
Resource-based: Bucket policies, Object ACL, Bucket ACL
Access rule: Object accessible if (IAM allows OR resource policy allows) AND no explicit deny
Encryption: Using encryption keys
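The access rule on this card is just a boolean expression; writing it out as one (purely illustrative) function makes the precedence explicit:

```python
# S3 access rule from the card: allowed if (IAM allows OR the resource
# policy allows) AND there is no explicit deny. Explicit deny always wins.
def object_accessible(iam_allows, resource_policy_allows, explicit_deny):
    return (iam_allows or resource_policy_allows) and not explicit_deny
```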
15
New cards
What are the key features of S3 versioning?
Features:
- Enabled at bucket level
- Protects against unintended deletes
- Easy rollback
Notes: Files before enabling have version 'null', suspending doesn't delete previous versions
16
New cards
What are the requirements for S3 replication (CRR/SRR)?
Requirements:
- Versioning enabled in both source and destination
- Can be across different accounts
- Copying is asynchronous
Limitations: Can't chain to third bucket, existing items not replicated unless batch replication enabled
17
New cards
What are the use cases for S3 Cross-Region Replication (CRR)?
- Compliance requirements
- Low latency access
- Replication across accounts
18
New cards
What are the use cases for S3 Same-Region Replication (SRR)?
- Log aggregation
- Live replication between production and test environments
19
New cards
What is S3 Standard - General Purpose?
Availability: 99.99%
Use for: Frequently accessed data
Features: High throughput, low latency, sustains 2 concurrent facility failures
Use cases: Big data, mobile, gaming, content distribution
20
New cards
What is S3 Standard - Infrequent Access (IA)?
Standard IA:
- 99.99% availability
- Use cases: Disaster recovery, backups
One-Zone IA:
- 99.5% availability
- Data lost if AZ destroyed
- Use cases: Secondary backups, recreatable data
21
New cards

What are the three S3 Glacier tiers and their retrieval times?

Glacier Instant Retrieval: Millisecond, 90-day minimum billing period
Glacier Flexible Retrieval: Expedited (1-5 min), Standard (3-5 hr), Bulk (5-12 hr, free), 90-day minimum
Glacier Deep Archive: Standard (12 hr), Bulk (48 hr), 180-day minimum

22
New cards
How does S3 Intelligent-Tiering work?
Purpose: Automatically moves objects between tiers based on usage
Cost: Small monthly monitoring fee, no retrieval charges
Tiers: Frequent Access (default) → Infrequent Access (30 days) → Archive Instant Access (90 days) → Archive Access (90-700+ days, optional) → Deep Archive Access (180-700+ days, optional)
23
New cards
What are S3 Lifecycle Rules?
Can apply to: Entire buckets, specific files, document tags, non-current versions
Actions:
- Transition: Move to different storage class (e.g., Standard to IA after 60 days)
- Expiration: Delete objects (e.g., delete logs after 365 days)
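The transition/expiration examples on this card amount to an age-based decision; the thresholds below copy the card's examples and the function is illustrative, not an AWS API:

```python
# Lifecycle rule sketch: choose an action from object age in days,
# mirroring the card's examples (IA after 60 days, delete after 365).
def lifecycle_action(age_days):
    if age_days >= 365:
        return "EXPIRED"        # expiration action: object is deleted
    if age_days >= 60:
        return "STANDARD_IA"    # transition action: cheaper storage class
    return "STANDARD"
```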
24
New cards
How do S3 Event Notifications work?
Trigger: Events like S3:ObjectCreated
Delivery: Typically seconds, sometimes up to a minute
Targets: SNS, SQS, Lambda (requires IAM/resource policies)
Advanced: All events can be sent to EventBridge, which can route them to 18+ AWS service destinations
25
New cards
What are S3 performance characteristics?
Latency: 100-200ms
PUT/POST/COPY/DELETE: 3,500 requests/sec per prefix
GET/HEAD: 5,500 requests/sec per prefix
26
New cards
What are S3 file write optimization techniques?
Multi-part upload: Recommended for files >100MB
S3 Transfer Acceleration: Uses AWS edge locations to forward data to S3 bucket (maximizes private AWS network time, minimizes public internet time)
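Before a multi-part upload, the client splits the object into part ranges; this sketch shows the splitting step only (the sizes are toy values, and real parts must be at least 5 MB except the last):

```python
# Multipart upload sketch: compute (start, end) byte ranges for each part,
# as a client would before uploading parts in parallel. Toy sizes only.
def split_parts(size_bytes, part_size):
    parts = []
    start = 0
    while start < size_bytes:
        end = min(start + part_size, size_bytes)
        parts.append((start, end))
        start = end
    return parts

parts = split_parts(250, 100)   # a 250-byte object in 100-byte parts
```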
27
New cards
What is S3 byte-range fetch?
Purpose: File read optimization
Method: Parallelize GETs by requesting specific byte ranges
Use case: Reading just the head of a file
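A byte-range fetch is driven by HTTP Range headers; this sketch builds the headers a client would send for parallel GETs (helper name is illustrative):

```python
# Byte-range fetch sketch: build inclusive HTTP Range headers so several
# GETs can be issued in parallel for one object.
def range_headers(size_bytes, chunk):
    headers = []
    for start in range(0, size_bytes, chunk):
        end = min(start + chunk, size_bytes) - 1   # Range end is inclusive
        headers.append(f"bytes={start}-{end}")
    return headers

headers = range_headers(1000, 400)
```

Requesting only the first range (e.g., `bytes=0-399`) is the "read just the head of a file" use case.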
28
New cards
What are the three types of S3 Server-Side Encryption (SSE)?
SSE-S3: AWS-managed keys, enabled by default, AES-256
SSE-KMS: User control + CloudTrail auditing, may hit KMS API limits, includes DSSE-KMS option
SSE-C: Customer-provided keys NOT stored by AWS, requires HTTPS with key in header
29
New cards
What is S3 Client-Side Encryption?
Uses a client-side encryption library (such as the Amazon S3 Encryption Client) to encrypt data before sending it to S3 and decrypt it after retrieval
30
New cards
What are S3 Access Points?
Purpose: Policies granting read/write for specific prefixes
Features: Each has own DNS name and access policy
VPC option: Can specify access point only accessible from VPC (requires VPC endpoint)
31
New cards
What is S3 Object Lambda?
Purpose: Modify objects before retrieval by caller application using Lambda functions
Use cases: Redacting information, format conversion, client-specific transformations
Requirements: S3 bucket, S3 Access Point, S3 Object Lambda Access Points
32
New cards
What is EBS (Elastic Block Store)?
Definition: Network-attached storage ('network USB stick') for EC2 instances
Features: Persist data, bound to specific AZ (unless snapshotted), billed by provisioned capacity
Types: gp2, gp3, io1/io2
Latency: Uses network, so some latency exists
33
New cards
What is the EBS Delete on Termination feature?
Purpose: Controls EBS behavior when EC2 instance terminates
Default: Root EBS deleted, other attached volumes preserved
Use case: Preserve root volume when instance terminated
34
New cards
What are EBS Elastic Volumes?
Feature: Change volume size and type without detaching volume or restarting instance
35
New cards
What is EFS (Elastic File System)?
Definition: Managed network filesystem mountable on multiple EC2 instances
Features: Multi-AZ, highly available, auto-scaling
Cost: Billed per use (3x gp2 price)
Requirements: Security group, Linux AMI only (not Windows)
36
New cards
What are the EFS Storage Tiers?
Storage Tiers: Standard (frequently accessed), EFS-IA (infrequent access), Archive (rarely accessed, 50% cheaper)
Availability: Standard (multi-AZ, production), One Zone (dev, with backup)
Cost savings: Lifecycle policies can achieve 90% savings
37
New cards
What is FSx for Windows File Server?
Purpose: Windows file system share drive
Features: Works on Linux EC2, supports MS Distributed File System, accessible from on-prem, multi-AZ, daily S3 backups, massive scale
Storage: HDD or SSD options
38
New cards
What is FSx for Lustre?
Purpose: Parallel distributed file system for HPC/video processing
Features: Accessible from on-prem, reads/writes to S3, massive scale and throughput
Storage options: SSD (low-latency, IOPS intensive), HDD (throughput-intensive)
39
New cards
What is FSx for NetApp ONTAP?
Purpose: Move ONTAP/NAS workloads to AWS
Protocols: NFS, SMB, iSCSI
Features: Works with all OS, auto-scaling, snapshots, replication, point-in-time cloning for testing
40
New cards
What is FSx for OpenZFS?
Purpose: Move ZFS workloads to AWS
Protocol: NFS
Features: Compatible with all OS, low latency, snapshots, replication, point-in-time cloning
41
New cards
What are FSx File System Deployment Options?
Scratch: Temporary storage, data not replicated, 6x faster burst, for short-term cost-optimized processing
Persistent: Long-term storage replicated in same AZ, failed files replaced in minutes, for long-term processing and sensitive data
42
New cards
What is Kinesis Data Streams?
Purpose: Collect and store streaming data in REAL-TIME
Retention: Up to 365 days
Features: Data can't be deleted (must expire), reprocess/replay capability, ordering guaranteed per partition key, max record size 1MB
Encryption: At-rest (KMS), in-flight (HTTPS)
43
New cards
What are Kinesis Data Streams capacity modes?
Provisioned: Manually scale shards, 1MB/s in + 2MB/s out per shard, pay per shard per hour
On-demand: Default 4MB/s in or 4000 records/s, auto-scales based on 30-day peak, pay per stream per hour + data in/out per GB
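Provisioned-mode sizing follows directly from the per-shard limits above; this small calculation (illustrative, not an AWS tool) picks the shard count that satisfies both directions:

```python
import math

# Provisioned-mode sizing sketch: shards needed for a target throughput,
# given 1 MB/s in and 2 MB/s out per shard (the limits on this card).
def shards_needed(in_mb_per_s, out_mb_per_s):
    return max(math.ceil(in_mb_per_s / 1.0), math.ceil(out_mb_per_s / 2.0))

n = shards_needed(in_mb_per_s=5, out_mb_per_s=12)   # out side dominates here
```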
44
New cards
What are common Kinesis Data Streams troubleshooting solutions?
Writing too slow: Service/shard limits exceeded, data not evenly distributed
Large producer: Use batching
500/503 errors: Implement retry with exponential backoff
Connection errors to Flink: Network/VPC issues
Throttling: Check for hot shards, use random partition key
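The "retry with exponential backoff" fix for 500/503 errors looks like this in outline; the jitter option is a common hardening step added here as an assumption, and the parameter values are illustrative:

```python
import random

# Exponential backoff sketch: each retry waits twice as long as the last,
# capped at a maximum. Optional jitter spreads out retries from many clients.
def backoff_delays(attempts, base=0.1, cap=5.0, jitter=False):
    delays = []
    for i in range(attempts):
        delay = min(cap, base * (2 ** i))
        if jitter:
            delay = random.uniform(0, delay)   # "full jitter" variant
        delays.append(delay)
    return delays

delays = backoff_delays(5)   # delays before each of 5 retries
```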
45
New cards
What is Kinesis Data Firehose?
Definition: Fully managed NEAR-REAL-TIME service with buffering
Features: Auto-scaling, pay-per-use, supports JSON/CSV/Parquet/AVRO/binary
Flow: Producers → Firehose buffer → (optional Lambda transform) → Destinations (S3, Redshift, OpenSearch, 3rd party, HTTP endpoints)
46
New cards
What is Managed Service for Apache Flink (MSAF)?
Definition: Managed, serverless Flink service on AWS
Flink: Framework for processing data streams
Features: Can custom-build apps from scratch and load from S3
Use cases: Streaming ETL, continuous metric generation
47
New cards
What is MSK (Managed Streaming for Apache Kafka)?
Definition: Alternative to Kinesis using Kafka producers/consumers
Features: Fully managed (create/update/delete clusters), multi-AZ (recommended 3), data stored in EBS
Flow: Source → Producers → MSK Cluster (topics/brokers) → Consumers → Destinations
48
New cards
What are MSK security features?
Encryption: TLS in-flight between brokers/clients, at-rest EBS with KMS
Network: Security groups
Authentication/Authorization: Mutual TLS, SASL/SCRAM (authentication), Kafka ACLs (authorization), IAM Access Control (both)
49
New cards
When should you use MSK over Kinesis?
Use MSK when:
- Message size > 1MB
- Need Kafka-specific features
- Existing Kafka infrastructure to migrate