AWS Glue – File Conversion
Converts data files into optimized formats like Apache Parquet or ORC for analytics; improves query performance and reduces storage costs
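In practice the conversion runs as a Glue ETL job whose script reads the source files and writes them back out in Parquet. A minimal sketch of registering such a job with boto3's `create_job` — every name, ARN, and S3 path below is a hypothetical placeholder, not a real resource:

```python
def glue_job_params(name, role_arn, script_s3_path):
    """Build the arguments for glue.create_job. The job script at
    script_s3_path would read the source CSVs and write them out
    with format="parquet". All names/paths here are placeholders."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",              # Spark-based ETL job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "DefaultArguments": {"--job-language": "python"},
    }

params = glue_job_params(
    "csv-to-parquet",                                 # hypothetical job name
    "arn:aws:iam::123456789012:role/GlueJobRole",     # placeholder role ARN
    "s3://example-bucket/scripts/csv_to_parquet.py",  # placeholder script path
)
# import boto3
# boto3.client("glue").create_job(**params)  # uncomment to register for real
```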
AWS Glue Crawler
Scans data sources (S3, JDBC, etc.), infers schema, and populates the Glue Data Catalog automatically
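A hedged sketch of what configuring such a crawler looks like through boto3's `create_crawler` — the crawler name, role ARN, bucket path, and database below are illustrative placeholders:

```python
def crawler_params(name, role_arn, s3_path, database):
    """Build the arguments for glue.create_crawler targeting an S3 path.
    All names and ARNs here are hypothetical placeholders."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,          # catalog database to populate
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Cron schedule: run daily at 02:00 UTC to pick up new files
        "Schedule": "cron(0 2 * * ? *)",
    }

params = crawler_params(
    "sales-crawler",                                  # hypothetical name
    "arn:aws:iam::123456789012:role/GlueCrawlerRole", # placeholder role ARN
    "s3://example-bucket/raw/sales/",                 # placeholder data path
    "sales_db",
)
# import boto3
# boto3.client("glue").create_crawler(**params)  # uncomment to create it
```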
Apache Parquet
File format optimized for analytics; columnar storage, compressed, and efficient for big data queries
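The compression claim can be illustrated without Parquet itself: grouping each column's values together (as columnar formats do) makes the byte stream far more repetitive, and therefore more compressible, than the same values interleaved row by row. A toy sketch over synthetic data:

```python
import zlib

# Synthetic table: 1,000 rows with a low-cardinality "status" column
rows = [("user%d" % i, "ACTIVE" if i % 2 == 0 else "INACTIVE", i % 10)
        for i in range(1000)]

# Row-oriented layout (CSV-like): each row's mixed values stored together
row_layout = "\n".join(",".join(map(str, r)) for r in rows).encode()

# Column-oriented layout (Parquet-like): each column stored contiguously
col_layout = "\n".join(
    ",".join(str(r[c]) for r in rows) for c in range(3)
).encode()

row_size = len(zlib.compress(row_layout))
col_size = len(zlib.compress(col_layout))
# The columnar layout groups repeated values (e.g. the status column),
# so it typically compresses to fewer bytes than the row layout
print(row_size, col_size)
```

Real Parquet goes further (encodings, statistics, reading only the columns a query needs), but the grouping effect above is the core of the storage-cost argument.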
AWS Batch
Manages batch computing workloads; automatically provisions compute resources, schedules jobs, and handles scaling for large-scale processing
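A sketch of what submitting such work looks like through boto3's Batch `submit_job` — queue and job-definition names are placeholders. An array job is the usual fit for thousands of similar daily jobs: one submission fans out into many indexed child jobs:

```python
def batch_job_params(name, queue, job_definition, array_size):
    """Build the arguments for batch.submit_job as an array job.
    All names here are hypothetical placeholders."""
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": job_definition,
        # One submission fans out into array_size child jobs; each child
        # reads its index from the AWS_BATCH_JOB_ARRAY_INDEX env variable
        "arrayProperties": {"size": array_size},
        "containerOverrides": {
            "resourceRequirements": [
                {"type": "VCPU", "value": "4"},
                {"type": "MEMORY", "value": "8192"},  # MiB
            ]
        },
    }

params = batch_job_params("nightly-etl", "default-queue", "etl-job-def:1", 1000)
# import boto3
# boto3.client("batch").submit_job(**params)  # uncomment to submit for real
```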
Tip:
For storage-related exam questions, match the service to the data workflow:
Glue for ETL and cataloging, AWS Batch for large-scale job scheduling and compute provisioning, and columnar formats like Parquet or ORC for analytics efficiency.
You want to convert CSV files in S3 into a more efficient, columnar format for analytics. Which service and format should you use?
A) AWS Batch + JSON
B) AWS Glue + Apache Parquet
C) S3 Transfer Acceleration + ORC
D) Athena + CSV
Answer: B – Glue can convert files into Apache Parquet for analytics efficiency.
You need to automatically discover the schema of new files arriving in S3 and update your metadata catalog. What AWS service should you use?
A) AWS Batch
B) Glue Crawler
C) Athena
D) DataSync
Answer: B – Glue Crawler scans data sources and updates the Glue Data Catalog automatically.
Why would you store analytics data in Parquet format instead of CSV?
A) Parquet is row-based
B) Parquet is compressed and columnar, improving query performance
C) Parquet cannot be read by Athena
D) Parquet increases storage cost
Answer: B – Columnar and compressed format improves query performance and reduces storage costs.
You have thousands of data processing jobs that need to run daily with varying compute requirements. Which service is best suited?
A) Lambda
B) AWS Batch
C) Step Functions
D) EC2 Auto Scaling
Answer: B – AWS Batch manages compute provisioning, scheduling, and scaling for batch workloads.