77 question-and-answer flashcards covering foundational concepts, SQL, Amazon Redshift, Matillion, Python, and system-design topics for data-engineering interview preparation.
A Data Warehouse stores clean, structured data for reporting; a Data Lake stores raw structured & unstructured data; a Data Mart is a smaller, department-focused warehouse.
Load raw data cheaply into the Data Lake first, then transform and load only needed, structured data into Redshift for fast analytics.
OLTP handles many short transactions (fast reads/writes); OLAP handles large analytical queries on historical data. Redshift is an OLAP system.
ETL transforms data before loading; ELT loads first then transforms inside the warehouse. Matillion/Redshift uses ELT.
Fact table stores numeric measurements (e.g., property_views_fact); dimension table stores descriptive attributes (e.g., property_dimension).
Granularity is the detail level of each fact row (e.g., one view vs. daily views). It determines the questions you can answer.
Star: fact table directly linked to denormalized dimensions (fewer joins); Snowflake: dimensions are further normalized into additional tables.
It is simpler, involves fewer joins, and yields faster queries; storage is cheap so redundancy is acceptable.
A dimension that tracks attributes which change slowly and unpredictably over time, such as customer address.
Type 1 overwrites; Type 2 adds new rows and keeps history (most common); Type 3 adds columns for previous values.
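A minimal Type 2 sketch in SQL, assuming hypothetical customer_dim and customer_staging tables with an address attribute, valid_from/valid_to dates, and an is_current flag:

```sql
-- Step 1: expire the current row for customers whose address changed
-- (table and column names are hypothetical).
UPDATE customer_dim
SET is_current = FALSE,
    valid_to   = CURRENT_DATE
WHERE is_current = TRUE
  AND customer_id IN (
      SELECT s.customer_id
      FROM customer_staging s
      JOIN customer_dim d
        ON d.customer_id = s.customer_id AND d.is_current = TRUE
      WHERE d.address <> s.address
  );

-- Step 2: insert a new current row for new and changed customers.
INSERT INTO customer_dim (customer_id, address, valid_from, valid_to, is_current)
SELECT s.customer_id, s.address, CURRENT_DATE, NULL, TRUE
FROM customer_staging s
LEFT JOIN customer_dim d
  ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE d.customer_id IS NULL OR d.address <> s.address;
```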
Designing the blueprint for how data is stored; good models ensure consistency, reliability, and query performance.
Natural keys come from source data; surrogate keys are warehouse-generated integers—stable, fast to join, and unify multiple sources.
Data integrity is overall accuracy and consistency; referential integrity ensures every foreign key points to a valid primary key.
Pre-aggregated multidimensional structures that enable extremely fast slice-and-dice analysis, like an advanced pivot table.
DDL defines structure (CREATE); DML manipulates data (INSERT/UPDATE/DELETE); DQL queries data (SELECT); DCL controls permissions (GRANT/REVOKE).
DELETE removes chosen rows slowly; TRUNCATE instantly removes all rows; DROP deletes the entire table structure and data.
UNION removes duplicates; UNION ALL keeps duplicates and is faster because it skips de-duplication.
WHERE filters rows before grouping; HAVING filters groups after GROUP BY.
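A short SQL illustration, assuming a hypothetical orders table with customer_id, amount, and order_status columns:

```sql
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE order_status = 'completed'   -- row-level filter, applied before grouping
GROUP BY customer_id
HAVING SUM(amount) > 1000;         -- group-level filter, applied after aggregation
```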
A Common Table Expression (WITH) is a named temporary result set; it improves readability and can be reused within the query.
Use DENSE_RANK() over salaries in descending order and select rows where rank = 2.
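A sketch of that pattern, which also shows the CTE from the previous card; the employees table and its columns are hypothetical:

```sql
-- Rank salaries highest-first; DENSE_RANK handles ties at the top cleanly.
WITH ranked AS (
    SELECT employee_id, salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employees
)
SELECT employee_id, salary
FROM ranked
WHERE salary_rank = 2;
```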
INNER, LEFT, RIGHT, and FULL OUTER joins.
A table joined to itself; e.g., joining an employees table to itself to find each employee’s manager.
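A minimal self-join sketch, assuming a hypothetical employees table with a manager_id column:

```sql
-- Join each employee row to its manager's row in the same table.
SELECT e.employee_name, m.employee_name AS manager_name
FROM employees e
LEFT JOIN employees m
  ON e.manager_id = m.employee_id;
```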
Primary key uniquely identifies each row; foreign key references a primary key in another table to enforce relationships.
A lookup structure that lets the DB locate rows quickly without scanning the whole table, similar to a book index.
They compute values across related rows without collapsing them: ROW_NUMBER() gives a unique sequence; RANK() skips numbers after ties; DENSE_RANK() doesn't skip numbers.
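A side-by-side sketch on a hypothetical employees table; the comments show sample output assuming a tie for the second-highest salary:

```sql
SELECT employee_id, salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,    -- 1, 2, 3, 4 (always unique)
       RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- 1, 2, 2, 4 (gap after the tie)
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk   -- 1, 2, 2, 3 (no gap)
FROM employees;
```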
Use LAG(daily_sales) to fetch the previous day’s value in the current row and subtract to compute change.
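A sketch of the day-over-day calculation, assuming a hypothetical daily_sales_summary table:

```sql
-- LAG pulls the previous day's value into the current row.
SELECT sale_date,
       daily_sales,
       daily_sales - LAG(daily_sales) OVER (ORDER BY sale_date) AS change_vs_prev_day
FROM daily_sales_summary;
```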
It groups rows with identical values so aggregate functions can calculate results per group.
A temporary alternate name for a column or table (via AS) to make queries shorter and clearer.
Implements IF-THEN-ELSE logic to create conditional expressions or derived columns.
Use COALESCE to replace NULL, and IS NULL / IS NOT NULL in WHERE clauses to filter.
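A small sketch combining this card and the CASE card above, against a hypothetical listings table:

```sql
SELECT listing_id,
       CASE
           WHEN price >= 500000 THEN 'premium'
           WHEN price >= 200000 THEN 'standard'
           ELSE 'budget'
       END AS price_band,                            -- conditional derived column
       COALESCE(agent_name, 'unassigned') AS agent_name  -- replace NULL with a default
FROM listings
WHERE price IS NOT NULL;                             -- filter out missing prices
```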
Shows the query execution plan, helping diagnose and optimize slow queries.
SUM(), COUNT(), and AVG().
CASE is standard SQL and portable; IF() is dialect-specific, so prefer CASE.
A view is a stored SELECT that presents virtual data; unlike a table it stores no data itself.
A pre-compiled set of SQL statements stored in the DB; useful for encapsulating reusable business logic.
Leader node coordinates queries; multiple compute nodes store data and execute steps in parallel (MPP).
Massively Parallel Processing splits tasks across nodes working simultaneously, enabling fast analysis of billions of rows.
Data stored by column, enabling reading only needed columns and superior compression—both speed analytic queries.
Use the parallel COPY command from S3 rather than individual INSERTs.
Split large files into equal-sized parts (ideally a multiple of the cluster's slice count) and use a manifest file; always compress the files (e.g., GZIP/ZSTD).
A JSON file listing exact S3 objects to COPY, ensuring only specified files are loaded.
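A hedged sketch tying the last three cards together; the schema, bucket, IAM role, and file paths are placeholders:

```sql
-- The manifest is a JSON file in S3 listing the exact objects to load, e.g.:
-- {"entries": [{"url": "s3://example-bucket/loads/part_000.gz", "mandatory": true}, ...]}
COPY analytics.stg_property_views
FROM 's3://example-bucket/loads/property_views.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load-role'
FORMAT AS CSV
GZIP
MANIFEST;
```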
The distribution style (DISTSTYLE) controls how rows are spread across nodes: EVEN (round-robin), KEY (rows with the same DISTKEY value land on the same node), ALL (a full copy on every node).
Use ALL for small, frequently joined dimensions; use KEY for large fact tables and their largest dimension on the join key.
Uneven data across nodes; a poorly chosen DISTKEY with non-uniform values puts too much data on some nodes, slowing queries.
Defines on-disk sort order; filtering on sorted columns lets Redshift skip large data blocks, accelerating scans.
COMPOUND orders by priority column sequence—best when queries filter on first column; INTERLEAVED gives equal weight to all columns—best for varied filter patterns.
A COMPOUND SORTKEY with the date column first.
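A DDL sketch tying the distribution and sort-key cards together; table and column names are hypothetical:

```sql
-- Large fact table: distribute on the common join key, sort by date for range filters.
CREATE TABLE analytics.property_views_fact (
    view_date    DATE    NOT NULL,
    property_id  BIGINT  NOT NULL,
    agent_id     BIGINT,
    view_count   INTEGER
)
DISTSTYLE KEY
DISTKEY (property_id)
COMPOUND SORTKEY (view_date, property_id);

-- Small, frequently joined dimension: copy it to every node.
CREATE TABLE analytics.dim_agent (
    agent_id   BIGINT PRIMARY KEY,
    agent_name VARCHAR(256)
)
DISTSTYLE ALL;
```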
Reclaims space from deleted rows and re-sorts data to maintain performance.
Regular view runs its query each time; materialized view stores pre-computed results for faster repeated access.
Through standard GRANT and REVOKE privileges on users and groups for schemas, tables, and columns.
Feature that lets you query data in S3 directly without loading—great for huge, rarely accessed datasets.
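A sketch of querying S3 in place via Spectrum, assuming the external table already exists in the AWS Glue Data Catalog; the database, role, and table names are placeholders:

```sql
-- Register the Glue Data Catalog database as an external schema, then query it like a local table.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'raw_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-spectrum-role';

SELECT listing_id, COUNT(*) AS clicks
FROM spectrum.raw_click_events
WHERE event_date = '2024-01-01'
GROUP BY listing_id;
```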
Configures query queues to prioritize workloads so short dashboard queries aren’t blocked by long ETL jobs.
A cloud-native data integration tool; it is an ELT tool that pushes SQL to the warehouse.
Transformation logic is converted to SQL and executed inside Redshift rather than on an external engine.
Transformation Job manipulates data; Orchestration Job controls flow—running transformations, COPYs, and conditional logic.
Provides a drag-and-drop UI that visually builds jobs and auto-generates the underlying SQL.
Connect Matillion to Git to commit, push, and pull job changes collaboratively.
Databases (MySQL), SaaS apps (Salesforce, Google Analytics), and file storage (Amazon S3).
Use Matillion’s built-in scheduler to set a recurring trigger on the orchestration job.
Use the Detect Changes component to compare source vs. target and insert new/changed rows while flagging old ones inactive.
Acts as glue for tasks SQL is weak at: API ingestion, workflow automation, complex validation, etc.
psycopg2 (PostgreSQL adapter).
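A minimal psycopg2 sketch; the host, credentials, and queried table are placeholders:

```python
import psycopg2

# Connect to the cluster over the standard PostgreSQL protocol (port 5439 for Redshift).
conn = psycopg2.connect(
    host="example-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="REPLACE_ME",
)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM analytics.property_views_fact;")
        print(cur.fetchone()[0])
finally:
    conn.close()
```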
pandas.
A class is a blueprint; an object is an instantiated instance of that blueprint.
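A tiny Python illustration; the Property class and its attributes are made up for the example:

```python
class Property:
    """Blueprint: defines the attributes and behaviour every property has."""
    def __init__(self, listing_id: int, price: float):
        self.listing_id = listing_id
        self.price = price

    def is_premium(self) -> bool:
        return self.price >= 500_000

flat = Property(listing_id=101, price=650_000)  # an object: one concrete instance
print(flat.is_premium())                        # True
```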
Encapsulation, Abstraction, Inheritance, Polymorphism.
Abstract class can include some implemented methods; interface defines only method signatures with no implementation.
Use pandas.read_csv with the chunksize parameter to process in manageable pieces.
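A short sketch of chunked processing; the file path and column name are assumptions:

```python
import pandas as pd

total_views = 0
# Each iteration yields a regular DataFrame of up to 100,000 rows,
# so memory stays bounded regardless of the file size.
for chunk in pd.read_csv("property_views.csv", chunksize=100_000):
    total_views += chunk["view_count"].sum()

print(total_views)
```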
An isolated environment per project that avoids dependency/version conflicts.
Ingest to S3 lake; ELT into Redshift staging; transform to Star Schema; serve to BI; orchestrate & monitor with scheduling/alerts.
Star schema: fact_listings_performance (daily metrics) + dimensions dim_date, dim_property, dim_agent, dim_location.
Use streaming: capture changes (DMS), push to Kinesis, process stream, store in fast store (e.g., Elasticsearch).
Check logs, isolate failing task, inspect common issues (schema change, credentials), reproduce manually.
Validation checks, reconciliation counts, automated tests, and alerting on failures.
Clarify issue, trace lineage back to raw data, validate at each step, communicate findings & fix.
Explore with Postman; write Python requests script with pagination & error handling; land JSON in S3; schedule via Airflow.
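A sketch of the ingestion script only (scheduling not shown); the endpoint, pagination scheme, bucket, and key layout are assumptions:

```python
import json
import boto3
import requests

s3 = boto3.client("s3")
url = "https://api.example.com/v1/listings"
page = 1

while True:
    resp = requests.get(url, params={"page": page, "per_page": 500}, timeout=30)
    resp.raise_for_status()                  # fail loudly on HTTP errors
    records = resp.json().get("results", [])
    if not records:
        break                                # empty page: no more data
    # Land the raw JSON in S3, partitioned by ingest date for later ELT.
    s3.put_object(
        Bucket="example-raw-bucket",
        Key=f"listings/ingest_date=2024-01-01/page_{page:05d}.json",
        Body=json.dumps(records),
    )
    page += 1
```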
Accept additive columns gracefully; detect breaking changes via monitoring and update transformations promptly.
Aggregate daily views, then apply AVG(total_views) over a window ordered by date with ROWS BETWEEN 29 PRECEDING AND CURRENT ROW to produce the rolling average, stored in a reporting table.
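A sketch of that query (persisting the result via CREATE TABLE AS or an insert is omitted); the fact table and column names are hypothetical:

```sql
-- Daily totals first, then a 30-day rolling average per property.
WITH daily AS (
    SELECT property_id, view_date, SUM(view_count) AS total_views
    FROM analytics.property_views_fact
    GROUP BY property_id, view_date
)
SELECT property_id,
       view_date,
       AVG(total_views) OVER (
           PARTITION BY property_id
           ORDER BY view_date
           ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
       ) AS rolling_30d_avg_views
FROM daily;
```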