77 question-and-answer flashcards covering foundational concepts, SQL, Amazon Redshift, Matillion, Python, and system-design topics for data-engineering interview preparation.
A Data Warehouse stores clean, structured data for reporting; a Data Lake stores raw structured & unstructured data; a Data Mart is a smaller, department-focused warehouse.
Load raw data cheaply into the Data Lake first, then transform and load only needed, structured data into Redshift for fast analytics.
OLTP handles many short transactions (fast reads/writes); OLAP handles large analytical queries on historical data. Redshift is an OLAP system.
ETL transforms data before loading; ELT loads first then transforms inside the warehouse. Matillion/Redshift uses ELT.
Fact table stores numeric measurements (e.g., property_views_fact); dimension table stores descriptive attributes (e.g., property_dimension).
Granularity is the detail level of each fact row (e.g., one view vs. daily views). It determines the questions you can answer.
Star: fact table directly linked to denormalized dimensions (fewer joins); Snowflake: dimensions are further normalized into additional tables.
It is simpler, involves fewer joins, and yields faster queries; storage is cheap so redundancy is acceptable.
A dimension that tracks attributes which change slowly and unpredictably over time, such as customer address.
Type 1 overwrites; Type 2 adds new rows and keeps history (most common); Type 3 adds columns for previous values.
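A minimal Type 2 sketch in SQL, assuming hypothetical customer_dim and customer_staging tables with an address attribute, valid_from/valid_to dates, and an is_current flag:

```sql
-- Step 1: expire the current row for customers whose address changed
-- (table and column names are hypothetical).
UPDATE customer_dim
SET is_current = FALSE,
    valid_to   = CURRENT_DATE
WHERE is_current = TRUE
  AND customer_id IN (
      SELECT s.customer_id
      FROM customer_staging s
      JOIN customer_dim d
        ON d.customer_id = s.customer_id AND d.is_current = TRUE
      WHERE d.address <> s.address
  );

-- Step 2: insert a new current row for new and changed customers.
INSERT INTO customer_dim (customer_id, address, valid_from, valid_to, is_current)
SELECT s.customer_id, s.address, CURRENT_DATE, NULL, TRUE
FROM customer_staging s
LEFT JOIN customer_dim d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL OR d.address <> s.address;
```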
Designing the blueprint for how data is stored; good models ensure consistency, reliability, and query performance.
Natural keys come from source data; surrogate keys are warehouse-generated integers—stable, fast to join, and unify multiple sources.
Data integrity is overall accuracy and consistency; referential integrity ensures every foreign key points to a valid primary key.
Pre-aggregated multidimensional structures that enable extremely fast slice-and-dice analysis, like an advanced pivot table.
DDL defines structure (CREATE); DML manipulates data (INSERT/UPDATE/DELETE); DQL queries data (SELECT); DCL controls permissions (GRANT/REVOKE).
DELETE removes chosen rows slowly; TRUNCATE instantly removes all rows; DROP deletes the entire table structure and data.
UNION removes duplicates; UNION ALL keeps duplicates and is faster because it skips de-duplication.
WHERE filters rows before grouping; HAVING filters groups after GROUP BY.
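A short SQL illustration, assuming a hypothetical orders table with customer_id, amount, and order_status columns:

```sql
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE order_status = 'completed'   -- row-level filter, applied before grouping
GROUP BY customer_id
HAVING SUM(amount) > 1000;         -- group-level filter, applied after aggregation
```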
A Common Table Expression (WITH) is a named temporary result set; it improves readability and can be reused within the query.
Use DENSE_RANK() over salaries in descending order and select rows where rank = 2.
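A sketch of that pattern, which also shows the CTE from the previous card; the employees table and its columns are hypothetical:

```sql
-- Rank salaries highest-first; DENSE_RANK handles ties at the top cleanly.
WITH ranked AS (
    SELECT employee_id, salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employees
)
SELECT employee_id, salary
FROM ranked
WHERE salary_rank = 2;
```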
INNER, LEFT, RIGHT, and FULL OUTER joins.
A table joined to itself; e.g., joining an employees table to itself to find each employee’s manager.
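A minimal self-join sketch, assuming a hypothetical employees table with a manager_id column:

```sql
-- Join each employee row to its manager's row in the same table.
SELECT e.employee_name, m.employee_name AS manager_name
FROM employees e
LEFT JOIN employees m
  ON e.manager_id = m.employee_id;
```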
Primary key uniquely identifies each row; foreign key references a primary key in another table to enforce relationships.
A lookup structure that lets the DB locate rows quickly without scanning the whole table, similar to a book index.
They compute values across related rows without collapsing them: ROW_NUMBER() gives a unique sequence; RANK() skips numbers after ties; DENSE_RANK() doesn't skip numbers.
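A side-by-side sketch on a hypothetical employees table; the comments show sample output assuming a tie for the second-highest salary:

```sql
SELECT employee_id, salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,    -- 1, 2, 3, 4 (always unique)
       RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- 1, 2, 2, 4 (gap after the tie)
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk   -- 1, 2, 2, 3 (no gap)
FROM employees;
```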
Use LAG(daily_sales) to fetch the previous day’s value in the current row and subtract to compute change.
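A sketch of the day-over-day calculation, assuming a hypothetical daily_sales_summary table:

```sql
-- LAG pulls the previous day's value into the current row.
SELECT sale_date,
       daily_sales,
       daily_sales - LAG(daily_sales) OVER (ORDER BY sale_date) AS change_vs_prev_day
FROM daily_sales_summary;
```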
It groups rows with identical values so aggregate functions can calculate results per group.
A temporary alternate name for a column or table (via AS) to make queries shorter and clearer.
Implements IF-THEN-ELSE logic to create conditional expressions or derived columns.
Use COALESCE to replace NULL, and IS NULL / IS NOT NULL in WHERE clauses to filter.
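A small sketch combining this card and the CASE card above, against a hypothetical listings table:

```sql
SELECT listing_id,
       CASE
           WHEN price >= 500000 THEN 'premium'
           WHEN price >= 200000 THEN 'standard'
           ELSE 'budget'
       END AS price_band,                            -- conditional derived column
       COALESCE(agent_name, 'unassigned') AS agent_name  -- replace NULL with a default
FROM listings
WHERE price IS NOT NULL;                             -- filter out missing prices
```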
Shows the query execution plan, helping diagnose and optimize slow queries.
SUM(), COUNT(), and AVG().
CASE is standard SQL and portable; IF() is dialect-specific, so prefer CASE.
A view is a stored SELECT that presents virtual data; unlike a table it stores no data itself.
A pre-compiled set of SQL statements stored in the DB; useful for encapsulating reusable business logic.
Leader node coordinates queries; multiple compute nodes store data and execute steps in parallel (MPP).
Massively Parallel Processing splits tasks across nodes working simultaneously, enabling fast analysis of billions of rows.
Data stored by column, enabling reading only needed columns and superior compression—both speed analytic queries.
Use the parallel COPY command from S3 rather than individual INSERTs.
Split large files into equal-sized parts (ideally a multiple of the cluster's slice count) and use a manifest file; always compress the files (e.g., GZIP/ZSTD).
A JSON file listing exact S3 objects to COPY, ensuring only specified files are loaded.
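A hedged sketch tying the last three cards together; the schema, bucket, IAM role, and file paths are placeholders:

```sql
-- The manifest is a JSON file in S3 listing the exact objects to load, e.g.:
-- {"entries": [{"url": "s3://example-bucket/loads/part_000.gz", "mandatory": true}, ...]}
COPY analytics.stg_property_views
FROM 's3://example-bucket/loads/property_views.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
FORMAT AS CSV
GZIP
MANIFEST;
```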
The distribution style (DISTSTYLE) controls how rows are spread across nodes: EVEN (round-robin), KEY (rows with the same DISTKEY value land on the same node), ALL (a full copy on every node).
Use ALL for small, frequently joined dimensions; use KEY for large fact tables and their largest dimension on the join key.
Uneven data across nodes; a poorly chosen DISTKEY with non-uniform values puts too much data on some nodes, slowing queries.
Defines on-disk sort order; filtering on sorted columns lets Redshift skip large data blocks, accelerating scans.
COMPOUND orders by priority column sequence—best when queries filter on first column; INTERLEAVED gives equal weight to all columns—best for varied filter patterns.
A COMPOUND SORTKEY with the date column first.
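A DDL sketch tying the distribution and sort-key cards together; table and column names are hypothetical:

```sql
-- Large fact table: distribute on the common join key, sort by date for range filters.
CREATE TABLE analytics.property_views_fact (
    view_date    DATE    NOT NULL,
    property_id  BIGINT  NOT NULL,
    agent_id     BIGINT,
    view_count   INTEGER
)
DISTSTYLE KEY
DISTKEY (property_id)
COMPOUND SORTKEY (view_date, property_id);

-- Small, frequently joined dimension: copy it to every node.
CREATE TABLE analytics.dim_agent (
    agent_id   BIGINT PRIMARY KEY,
    agent_name VARCHAR(256)
)
DISTSTYLE ALL;
```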
Reclaims space from deleted rows and re-sorts data to maintain performance.
Regular view runs its query each time; materialized view stores pre-computed results for faster repeated access.
Through standard GRANT and REVOKE privileges on users and groups for schemas, tables, and columns.
Feature that lets you query data in S3 directly without loading—great for huge, rarely accessed datasets.
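A sketch of querying S3 in place via Spectrum, assuming the external table already exists in the AWS Glue Data Catalog; the database, role, and table names are placeholders:

```sql
-- Register the Glue Data Catalog database as an external schema, then query it like a local table.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'raw_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role';

SELECT listing_id, COUNT(*) AS clicks
FROM spectrum.raw_click_events
WHERE event_date = '2024-01-01'
GROUP BY listing_id;
```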
Configures query queues to prioritize workloads so short dashboard queries aren’t blocked by long ETL jobs.
A cloud-native data integration tool; it is an ELT tool that pushes SQL to the warehouse.
Transformation logic is converted to SQL and executed inside Redshift rather than on an external engine.
Transformation Job manipulates data; Orchestration Job controls flow—running transformations, COPYs, and conditional logic.
Provides a drag-and-drop UI that visually builds jobs and auto-generates the underlying SQL.
Connect Matillion to Git to commit, push, and pull job changes collaboratively.
Databases (MySQL), SaaS apps (Salesforce, Google Analytics), and file storage (Amazon S3).
Use Matillion’s built-in scheduler to set a recurring trigger on the orchestration job.
Use the Detect Changes component to compare source vs. target and insert new/changed rows while flagging old ones inactive.
Acts as glue for tasks SQL is weak at: API ingestion, workflow automation, complex validation, etc.
psycopg2 (PostgreSQL adapter).
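A minimal psycopg2 sketch; the host, credentials, and queried table are placeholders:

```python
import psycopg2

# Connect to the cluster over the standard PostgreSQL protocol (port 5439 for Redshift).
conn = psycopg2.connect(
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="REPLACE_ME",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM analytics.property_views_fact;")
        print(cur.fetchone()[0])
finally:
    conn.close()
```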
pandas.
A class is a blueprint; an object is an instantiated instance of that blueprint.
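A tiny Python illustration; the Property class and its attributes are made up for the example:

```python
class Property:
    """Blueprint: defines the attributes and behaviour every property has."""
    def __init__(self, listing_id: int, price: float):
        self.listing_id = listing_id
        self.price = price

    def is_premium(self) -> bool:
        return self.price >= 500_000

flat = Property(listing_id=101, price=650_000)  # an object: one concrete instance
print(flat.is_premium())                        # True
```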
Encapsulation, Abstraction, Inheritance, Polymorphism.
Abstract class can include some implemented methods; interface defines only method signatures with no implementation.
Use pandas.read_csv with the chunksize parameter to process in manageable pieces.
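A short sketch of chunked processing; the file path and column name are assumptions:

```python
import pandas as pd

total_views = 0
# Each iteration yields a regular DataFrame of up to 100,000 rows,
# so memory stays bounded regardless of the file size.
for chunk in pd.read_csv("property_views.csv", chunksize=100_000):
    total_views += chunk["view_count"].sum()

print(total_views)
```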
An isolated environment per project that avoids dependency/version conflicts.
Ingest to S3 lake; ELT into Redshift staging; transform to Star Schema; serve to BI; orchestrate & monitor with scheduling/alerts.
Star schema: fact_listings_performance (daily metrics) + dimensions dim_date, dim_property, dim_agent, dim_location.
Use streaming: capture changes (DMS), push to Kinesis, process stream, store in fast store (e.g., Elasticsearch).
Check logs, isolate failing task, inspect common issues (schema change, credentials), reproduce manually.
Validation checks, reconciliation counts, automated tests, and alerting on failures.
Clarify issue, trace lineage back to raw data, validate at each step, communicate findings & fix.
Explore with Postman; write Python requests script with pagination & error handling; land JSON in S3; schedule via Airflow.
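A sketch of the ingestion script only (scheduling not shown); the endpoint, pagination scheme, bucket, and key layout are assumptions:

```python
import json
import boto3
import requests

s3 = boto3.client("s3")
url = "https://api.example.com/v1/listings"
page = 1

while True:
    resp = requests.get(url, params={"page": page, "per_page": 500}, timeout=30)
    resp.raise_for_status()                  # fail loudly on HTTP errors
    records = resp.json().get("results", [])
    if not records:
        break                                # empty page: no more data
    # Land the raw JSON in S3, partitioned by ingest date for later ELT.
    s3.put_object(
        Bucket="example-raw-bucket",
        Key=f"listings/ingest_date=2024-01-01/page_{page:05d}.json",
        Body=json.dumps(records),
    )
    page += 1
```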
Accept additive columns gracefully; detect breaking changes via monitoring and update transformations promptly.
Aggregate daily views, then apply AVG(total_views) over a window ordered by date with ROWS BETWEEN 29 PRECEDING AND CURRENT ROW to produce the rolling average, stored in a reporting table.
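A sketch of that query (persisting the result via CREATE TABLE AS or an insert is omitted); the fact table and column names are hypothetical:

```sql
-- Daily totals first, then a 30-day rolling average per property.
WITH daily AS (
    SELECT property_id, view_date, SUM(view_count) AS total_views
    FROM analytics.property_views_fact
    GROUP BY property_id, view_date
)
SELECT property_id,
       view_date,
       AVG(total_views) OVER (
           PARTITION BY property_id
           ORDER BY view_date
           ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
       ) AS rolling_30d_avg_views
FROM daily;
```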