Data Ingestion

49 Terms

1
New cards

What are the three types of data structures in AWS/ML context?

1. Structured Data: Organized with defined schema (e.g., database tables, CSV)
2. Unstructured Data: No predefined structure (e.g., text, video, audio, images)
3. Semi-structured Data: Tagged/categorized elements (e.g., XML, JSON, log files)

2
New cards

What are the "3 V's" of data properties?

1. Volume: Amount/size of data
2. Velocity: Speed of data generation and processing
3. Variety: Different types, structures, and sources of data

3
New cards

What is a Data Warehouse and when should you use it?

Definition: Centralized repository optimized for analysis of structured data from different sources

Example: Amazon Redshift
Key features: Schema-on-write (ETL), star/snowflake schemas, complex read-heavy queries
Use when: You need structured data with fast complex queries, BI/analytics, integration from multiple sources

4
New cards

What is a Data Lake and when should you use it?

Definition: Storage repository for large amounts of raw data in native format
Key features: Schema-on-read (ELT), no preprocessing, flexible and agile

Example: Amazon S3
Use when: Mix of structured/unstructured data, large volumes where cost-effectiveness matters, flexibility for future needs

5
New cards
What is a Data Lakehouse?
Definition: Hybrid architecture combining data warehouse performance/reliability with data lake flexibility/scale/low-cost
Example: AWS Lake Formation (S3 + Redshift Spectrum)
Features: Supports both schema-on-write and schema-on-read
6
New cards
What is a Data Mesh?
Definition: Governance and organization paradigm where individual teams own 'data products' within a domain
Key features: Domain-based data management, federated governance with central standards
Note: More about data management paradigm than specific technology
7
New cards
What are the three steps of ETL and their purposes?
1. Extract: Retrieve raw data, ensure integrity (real-time or batches)
2. Transform: Convert to suitable format (cleansing, enrichment, encoding, format changes)
3. Load: Move transformed data into warehouse (batches or streamed)
Tools: AWS Glue, EventBridge, Lambda
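The three steps above can be sketched as a minimal in-memory pipeline. All names and data here are illustrative, not any AWS API:

```python
# Minimal ETL sketch with hypothetical in-memory data (illustrative only).

def extract(source):
    # Extract: retrieve raw records, dropping malformed ones (integrity check).
    return [r for r in source if r is not None]

def transform(records):
    # Transform: cleanse and normalize (here: trim whitespace, coerce types).
    return [{"code": r["code"].strip().upper(), "value": float(r["value"])}
            for r in records]

def load(records, warehouse):
    # Load: append the transformed batch into the target store.
    warehouse.extend(records)
    return len(records)

warehouse = []
raw = [{"code": " us ", "value": "1.5"}, None, {"code": "eu", "value": "2"}]
loaded = load(transform(extract(raw)), warehouse)
```

In a real pipeline the extract/transform/load roles would be played by the tools on this card (e.g., Glue jobs triggered by EventBridge), but the three-phase shape is the same.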
8
New cards
What is the difference between JDBC and ODBC?
JDBC (Java Database Connectivity): Platform-independent, language-dependent (Java only)
ODBC (Open Database Connectivity): Platform-dependent, language-independent
9
New cards
When should you use CSV format for data?
Use cases:
- Small to medium datasets
- Human-readable format needed
- Importing/exporting from databases and spreadsheets
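The spreadsheet/database round-trip use case can be shown with the standard library alone (the row data below is made up):

```python
import csv
import io

# Round-trip a small dataset through CSV: human-readable text,
# directly importable into spreadsheets and databases.
rows = [{"id": "1", "name": "alpha"}, {"id": "2", "name": "beta"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(rows)

text = buf.getvalue()                      # header line + one line per row
back = list(csv.DictReader(io.StringIO(text)))
```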
10
New cards
When should you use JSON format for data?
Use cases:
- Web server/client communication
- Configuration files
- Flexible/nested schema requirements
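The flexible/nested-schema point is easy to see in code; this hypothetical config fragment nests objects and arrays freely:

```python
import json

# JSON allows nested objects and arrays without a fixed schema,
# which is why it suits configuration files and web payloads.
config = {"service": "web", "limits": {"rps": 100}, "tags": ["prod", "eu"]}

payload = json.dumps(config)     # serialize for transport/storage
decoded = json.loads(payload)    # structure survives the round trip
```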
11
New cards
When should you use AVRO format for data?
Definition: Binary format storing both data and schema
Use cases:
- Big data and real-time processing
- Schema evolution needed
- Efficient data transport
12
New cards
When should you use Parquet format for data?
Definition: Columnar format optimized for analytics
Use cases:
- Large data analysis
- Reading specific columns instead of entire records
- Optimized I/O operations needed
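The column-oriented idea behind Parquet can be simulated with plain dicts; this sketch is an analogy, not the Parquet format itself:

```python
# Row layout vs columnar layout: a columnar store can answer an analytics
# query by reading exactly one column, which is the I/O saving Parquet exploits.
rows = [{"user": "a", "ms": 120, "ok": True},
        {"user": "b", "ms": 340, "ok": False}]

# The same data pivoted into a column-oriented layout.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# "Average latency" touches only the "ms" column, never "user" or "ok".
avg_ms = sum(columns["ms"]) / len(columns["ms"])
```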
13
New cards
What are S3 buckets and objects?
Buckets: Must have globally unique names, defined at region level
Objects: Have a key (full path), no concept of directories, max size 5TB
Example key: s3://my-bucket/my_file.txt
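Since the key is the full path after the bucket, splitting an s3:// URI is a single partition; the helper name below is hypothetical:

```python
# Split an s3:// URI into bucket and key. S3 has no real directories:
# everything after the bucket name is one flat key string.
def parse_s3_uri(uri):
    assert uri.startswith("s3://")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

bucket, key = parse_s3_uri("s3://my-bucket/my_folder/my_file.txt")
```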
14
New cards
How does S3 security work?
User-based: IAM policies
Resource-based: Bucket policies, Object ACL, Bucket ACL
Access rule: Object accessible if (IAM allows OR resource policy allows) AND no explicit deny
Encryption: Using encryption keys
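The access rule on this card is just a boolean expression; writing it out as one (purely illustrative) function makes the precedence explicit:

```python
# S3 access rule from the card: allowed if (IAM allows OR the resource
# policy allows) AND there is no explicit deny. Explicit deny always wins.
def object_accessible(iam_allows, resource_policy_allows, explicit_deny):
    return (iam_allows or resource_policy_allows) and not explicit_deny
```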
15
New cards
What are the key features of S3 versioning?
Features:
- Enabled at bucket level
- Protects against unintended deletes
- Easy rollback
Notes: Files before enabling have version 'null', suspending doesn't delete previous versions
16
New cards
What are the requirements for S3 replication (CRR/SRR)?
Requirements:
- Versioning enabled in both source and destination
- Can be across different accounts
- Copying is asynchronous
Limitations: Can't chain to third bucket, existing items not replicated unless batch replication enabled
17
New cards
What are the use cases for S3 Cross-Region Replication (CRR)?
- Compliance requirements
- Low latency access
- Replication across accounts
18
New cards
What are the use cases for S3 Same-Region Replication (SRR)?
- Log aggregation
- Live replication between production and test environments
19
New cards
What is S3 Standard - General Purpose?
Availability: 99.99%
Use for: Frequently accessed data
Features: High throughput, low latency, sustains 2 concurrent facility failures
Use cases: Big data, mobile, gaming, content distribution
20
New cards
What is S3 Standard - Infrequent Access (IA)?
Standard IA:
- 99.99% availability
- Use cases: Disaster recovery, backups
One-Zone IA:
- 99.5% availability
- Data lost if AZ destroyed
- Use cases: Secondary backups, recreatable data
21
New cards

What are the three S3 Glacier tiers and their retrieval times?

Glacier Instant Retrieval: Millisecond, 90-day minimum billing period
Glacier Flexible Retrieval: Expedited (1-5 min), Standard (3-5 hr), Bulk (5-12 hr, free), 90-day minimum
Glacier Deep Archive: Standard (12 hr), Bulk (48 hr), 180-day minimum

22
New cards
How does S3 Intelligent-Tiering work?
Purpose: Automatically moves objects between tiers based on usage
Cost: Small monthly monitoring fee, no retrieval charges
Tiers: Frequent Access (default) → Infrequent Access (30 days) → Archive Instant Access (90 days) → Archive Access (90-700+ days, optional) → Deep Archive Access (180-700+ days, optional)
23
New cards
What are S3 Lifecycle Rules?
Can apply to: Entire buckets, specific files, document tags, non-current versions
Actions:
- Transition: Move to different storage class (e.g., Standard to IA after 60 days)
- Expiration: Delete objects (e.g., delete logs after 365 days)
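The transition/expiration examples on this card amount to an age-based decision; the thresholds below copy the card's examples and the function is illustrative, not an AWS API:

```python
# Lifecycle rule sketch: choose an action from object age in days,
# mirroring the card's examples (IA after 60 days, delete after 365).
def lifecycle_action(age_days):
    if age_days >= 365:
        return "EXPIRED"        # expiration action: object is deleted
    if age_days >= 60:
        return "STANDARD_IA"    # transition action: cheaper storage class
    return "STANDARD"
```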
24
New cards
How do S3 Event Notifications work?
Trigger: Events like S3:ObjectCreated
Delivery: Typically seconds, sometimes up to a minute
Targets: SNS, SQS, Lambda (requires IAM/resource policies)
Advanced: All events can be sent to EventBridge, which can route them to 18+ AWS service destinations
25
New cards
What are S3 performance characteristics?
Latency: 100-200ms
PUT/POST/COPY/DELETE: 3,500 requests/sec per prefix
GET/HEAD: 5,500 requests/sec per prefix
26
New cards
What are S3 file write optimization techniques?
Multi-part upload: Recommended for files >100MB
S3 Transfer Acceleration: Uses AWS edge locations to forward data to S3 bucket (maximizes private AWS network time, minimizes public internet time)
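Before a multi-part upload, the client splits the object into part ranges; this sketch shows the splitting step only (the sizes are toy values, and real parts must be at least 5 MB except the last):

```python
# Multipart upload sketch: compute (start, end) byte ranges for each part,
# as a client would before uploading parts in parallel. Toy sizes only.
def split_parts(size_bytes, part_size):
    parts = []
    start = 0
    while start < size_bytes:
        end = min(start + part_size, size_bytes)
        parts.append((start, end))
        start = end
    return parts

parts = split_parts(250, 100)   # a 250-byte object in 100-byte parts
```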
27
New cards
What is S3 byte-range fetch?
Purpose: File read optimization
Method: Parallelize GETs by requesting specific byte ranges
Use case: Reading just the head of a file
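A byte-range fetch is driven by HTTP Range headers; this sketch builds the headers a client would send for parallel GETs (helper name is illustrative):

```python
# Byte-range fetch sketch: build inclusive HTTP Range headers so several
# GETs can be issued in parallel for one object.
def range_headers(size_bytes, chunk):
    headers = []
    for start in range(0, size_bytes, chunk):
        end = min(start + chunk, size_bytes) - 1   # Range end is inclusive
        headers.append(f"bytes={start}-{end}")
    return headers

headers = range_headers(1000, 400)
```

Requesting only the first range (e.g., `bytes=0-399`) is the "read just the head of a file" use case.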
28
New cards
What are the three types of S3 Server-Side Encryption (SSE)?
SSE-S3: AWS-managed keys, enabled by default, AES-256
SSE-KMS: User control + CloudTrail auditing, may hit KMS API limits, includes DSSE-KMS option
SSE-C: Customer-provided keys NOT stored by AWS, requires HTTPS with key in header
29
New cards
What is S3 Client-Side Encryption?
Uses a client-side encryption library (such as the Amazon S3 Encryption Client) to encrypt data before sending it to S3 and decrypt it after retrieval
30
New cards
What are S3 Access Points?
Purpose: Policies granting read/write for specific prefixes
Features: Each has own DNS name and access policy
VPC option: Can specify access point only accessible from VPC (requires VPC endpoint)
31
New cards
What is S3 Object Lambda?
Purpose: Modify objects before retrieval by caller application using Lambda functions
Use cases: Redacting information, format conversion, client-specific transformations
Requirements: S3 bucket, S3 Access Point, S3 Object Lambda Access Points
32
New cards
What is EBS (Elastic Block Store)?
Definition: Network-attached storage ('network USB stick') for EC2 instances
Features: Persist data, bound to specific AZ (unless snapshotted), billed by provisioned capacity
Types: gp2, gp3, io1/io2
Latency: Uses network, so some latency exists
33
New cards
What is the EBS Delete on Termination feature?
Purpose: Controls EBS behavior when EC2 instance terminates
Default: Root EBS deleted, other attached volumes preserved
Use case: Preserve root volume when instance terminated
34
New cards
What are EBS Elastic Volumes?
Feature: Change volume size and type without detaching volume or restarting instance
35
New cards
What is EFS (Elastic File System)?
Definition: Managed network filesystem mountable on multiple EC2 instances
Features: Multi-AZ, highly available, auto-scaling
Cost: Billed per use (3x gp2 price)
Requirements: Security group, Linux AMI only (not Windows)
36
New cards
What are the EFS Storage Tiers?
Storage Tiers: Standard (frequently accessed), EFS-IA (infrequent access), Archive (rarely accessed, 50% cheaper)
Availability: Standard (multi-AZ, production), One Zone (dev, with backup)
Cost savings: Lifecycle policies can achieve 90% savings
37
New cards
What is FSx for Windows File Server?
Purpose: Windows file system share drive
Features: Works on Linux EC2, supports MS Distributed File System, accessible from on-prem, multi-AZ, daily S3 backups, massive scale
Storage: HDD or SSD options
38
New cards
What is FSx for Lustre?
Purpose: Parallel distributed file system for HPC/video processing
Features: Accessible from on-prem, reads/writes to S3, massive scale and throughput
Storage options: SSD (low-latency, IOPS intensive), HDD (throughput-intensive)
39
New cards
What is FSx for NetApp ONTAP?
Purpose: Move ONTAP/NAS workloads to AWS
Protocols: NFS, SMB, iSCSI
Features: Works with all OS, auto-scaling, snapshots, replication, point-in-time cloning for testing
40
New cards
What is FSx for OpenZFS?
Purpose: Move ZFS workloads to AWS
Protocol: NFS
Features: Compatible with all OS, low latency, snapshots, replication, point-in-time cloning
41
New cards
What are FSx File System Deployment Options?
Scratch: Temporary storage, data not replicated, 6x faster burst, for short-term cost-optimized processing
Persistent: Long-term storage replicated in same AZ, failed files replaced in minutes, for long-term processing and sensitive data
42
New cards
What is Kinesis Data Streams?
Purpose: Collect and store streaming data in REAL-TIME
Retention: Up to 365 days
Features: Data can't be deleted (must expire), reprocess/replay capability, ordering guaranteed per partition key, max record size 1MB
Encryption: At-rest (KMS), in-flight (HTTPS)
43
New cards
What are Kinesis Data Streams capacity modes?
Provisioned: Manually scale shards, 1MB/s in + 2MB/s out per shard, pay per shard per hour
On-demand: Default 4MB/s in or 4000 records/s, auto-scales based on 30-day peak, pay per stream per hour + data in/out per GB
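Provisioned-mode sizing follows directly from the per-shard limits above; this small calculation (illustrative, not an AWS tool) picks the shard count that satisfies both directions:

```python
import math

# Provisioned-mode sizing sketch: shards needed for a target throughput,
# given 1 MB/s in and 2 MB/s out per shard (the limits on this card).
def shards_needed(in_mb_per_s, out_mb_per_s):
    return max(math.ceil(in_mb_per_s / 1.0), math.ceil(out_mb_per_s / 2.0))

n = shards_needed(in_mb_per_s=5, out_mb_per_s=12)   # out side dominates here
```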
44
New cards
What are common Kinesis Data Streams troubleshooting solutions?
Writing too slow: Service/shard limits exceeded, data not evenly distributed
Large producer: Use batching
500/503 errors: Implement retry with exponential backoff
Connection errors to Flink: Network/VPC issues
Throttling: Check for hot shards, use random partition key
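The "retry with exponential backoff" fix for 500/503 errors looks like this in outline; the jitter option is a common hardening step added here as an assumption, and the parameter values are illustrative:

```python
import random

# Exponential backoff sketch: each retry waits twice as long as the last,
# capped at a maximum. Optional jitter spreads out retries from many clients.
def backoff_delays(attempts, base=0.1, cap=5.0, jitter=False):
    delays = []
    for i in range(attempts):
        delay = min(cap, base * (2 ** i))
        if jitter:
            delay = random.uniform(0, delay)   # "full jitter" variant
        delays.append(delay)
    return delays

delays = backoff_delays(5)   # delays before each of 5 retries
```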
45
New cards
What is Kinesis Data Firehose?
Definition: Fully managed NEAR-REAL-TIME service with buffering
Features: Auto-scaling, pay-per-use, supports JSON/CSV/Parquet/AVRO/binary
Flow: Producers → Firehose buffer → (optional Lambda transform) → Destinations (S3, Redshift, OpenSearch, 3rd party, HTTP endpoints)
46
New cards
What is Managed Service for Apache Flink (MSAF)?
Definition: Managed, serverless Flink service on AWS
Flink: Framework for processing data streams
Features: Can custom-build apps from scratch and load from S3
Use cases: Streaming ETL, continuous metric generation
47
New cards
What is MSK (Managed Streaming for Apache Kafka)?
Definition: Alternative to Kinesis using Kafka producers/consumers
Features: Fully managed (create/update/delete clusters), multi-AZ (recommended 3), data stored in EBS
Flow: Source → Producers → MSK Cluster (topics/brokers) → Consumers → Destinations
48
New cards
What are MSK security features?
Encryption: TLS in-flight between brokers/clients, at-rest EBS with KMS
Network: Security groups
Authentication/Authorization: Mutual TLS, SASL/SCRAM (authentication), Kafka ACLs (authorization), IAM Access Control (both)
49
New cards
When should you use MSK over Kinesis?
Use MSK when:
- Message size > 1MB
- Need Kafka-specific features
- Existing Kafka infrastructure to migrate