Data Engineering Interview Prep Guide


Description and Tags

77 question-and-answer flashcards covering foundational concepts, SQL, Amazon Redshift, Matillion, Python, and system-design topics for data-engineering interview preparation.


77 Terms

1. What is a Data Warehouse and how does it differ from a Data Lake and a Data Mart?

A Data Warehouse stores clean, structured data for reporting; a Data Lake stores raw structured & unstructured data; a Data Mart is a smaller, department-focused warehouse.

2. When would you use a Data Lake together with a Data Warehouse such as Amazon Redshift?

Load raw data cheaply into the Data Lake first, then transform and load only needed, structured data into Redshift for fast analytics.

3. Explain the difference between OLTP and OLAP systems. Where does Redshift fit?

OLTP handles many short transactions (fast reads/writes); OLAP handles large analytical queries on historical data. Redshift is an OLAP system.

4. What are the key differences between ETL and ELT, and which does the Matillion/Redshift stack follow?

ETL transforms data before loading; ELT loads first then transforms inside the warehouse. Matillion/Redshift uses ELT.

5. What is a fact table and what is a dimension table? Give a dubizzle example of each.

Fact table stores numeric measurements (e.g., property_views_fact); dimension table stores descriptive attributes (e.g., property_dimension).

6. What is the granularity of a fact table and why is it important?

Granularity is the detail level of each fact row (e.g., one view vs. daily views). It determines the questions you can answer.

7. What is the difference between a Star Schema and a Snowflake Schema?

Star: fact table directly linked to denormalized dimensions (fewer joins); Snowflake: dimensions are further normalized into additional tables.

8. Why is a Star Schema generally preferred for Redshift?

It is simpler, involves fewer joins, and yields faster queries; storage is cheap so redundancy is acceptable.

9. What is a Slowly Changing Dimension (SCD)?

A dimension that tracks attributes which change slowly and unpredictably over time, such as customer address.

10. Describe SCD Type 1, Type 2, and Type 3. Which is most common for historical analysis?

Type 1 overwrites; Type 2 adds new rows and keeps history (most common); Type 3 adds columns for previous values.

11. What is data modeling and why is it important?

Designing the blueprint for how data is stored; good models ensure consistency, reliability, and query performance.

12. What are natural keys vs. surrogate keys and why use surrogates in a warehouse?

Natural keys come from source data; surrogate keys are warehouse-generated integers—stable, fast to join, and unify multiple sources.

13. Define data integrity and referential integrity.

Data integrity is overall accuracy and consistency; referential integrity ensures every foreign key points to a valid primary key.

14. What are OLAP cubes and why are they useful?

Pre-aggregated multidimensional structures that enable extremely fast slice-and-dice analysis, like an advanced pivot table.

15. What is the difference between DDL, DML, DQL, and DCL?

DDL defines structure (CREATE); DML manipulates data (INSERT/UPDATE/DELETE); DQL queries data (SELECT); DCL controls permissions (GRANT/REVOKE).

16. Contrast DELETE, TRUNCATE, and DROP.

DELETE removes selected rows one at a time and can be slow; TRUNCATE instantly removes all rows but keeps the table; DROP removes the entire table, both structure and data.

17. UNION vs. UNION ALL—what’s the difference and which is faster?

UNION removes duplicates; UNION ALL keeps duplicates and is faster because it skips de-duplication.

18. Difference between WHERE and HAVING clauses.

WHERE filters rows before grouping; HAVING filters groups after GROUP BY.
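A minimal sketch of the distinction using Python's built-in sqlite3; the orders table and its values are made up purely for illustration.

```python
import sqlite3

# Hypothetical orders table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 50), ("alice", 120), ("bob", 30), ("bob", 40), ("carol", 200)],
)

# WHERE filters individual rows before grouping;
# HAVING filters the aggregated groups afterwards.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 35          -- row-level filter (drops bob's 30)
    GROUP BY customer
    HAVING SUM(amount) > 100   -- group-level filter (drops bob's 40 total)
    ORDER BY customer
    """
).fetchall()
print(rows)  # [('alice', 170), ('carol', 200)]
```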

19. What is a CTE and why is it often better than a subquery?

A Common Table Expression (WITH) is a named temporary result set; it improves readability and can be reused within the query.
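A small CTE sketch via sqlite3; the sales table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100), ("north", 300), ("south", 50)])

# WITH names an intermediate result set, keeping the outer query
# readable; the same name could be referenced again below.
rows = conn.execute(
    """
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region FROM region_totals WHERE total > 200
    """
).fetchall()
print(rows)  # [('north',)]
```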

20. Provide a SQL pattern to find the second-highest salary.

Use DENSE_RANK() over salaries in descending order and select rows where rank = 2.
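The pattern can be sketched with sqlite3 (window functions need SQLite ≥ 3.25); the employees table and salaries are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("a", 90), ("b", 120), ("c", 120), ("d", 80)])

# DENSE_RANK() leaves no gaps after ties, so rank 2 is always the
# second-highest DISTINCT salary even when the top salary is shared.
rows = conn.execute(
    """
    SELECT name, salary
    FROM (
        SELECT name, salary,
               DENSE_RANK() OVER (ORDER BY salary DESC) AS rnk
        FROM employees
    )
    WHERE rnk = 2
    """
).fetchall()
print(rows)  # [('a', 90)]  -- b and c tie for rank 1 at 120
```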

21. List the main SQL join types.

INNER, LEFT, RIGHT, and FULL OUTER joins.

22. What is a self-join? Give a common use case.

A table joined to itself; e.g., joining an employees table to itself to find each employee’s manager.

23. Define primary key and foreign key.

Primary key uniquely identifies each row; foreign key references a primary key in another table to enforce relationships.

24. What is an index and how does it speed up queries?

A lookup structure that lets the DB locate rows quickly without scanning the whole table, similar to a book index.

25. What are window functions and how do ROW_NUMBER, RANK, and DENSE_RANK differ?

They compute values across related rows without collapsing them: ROW_NUMBER() gives a unique sequence; RANK() skips numbers after ties; DENSE_RANK() doesn’t skip numbers.
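The three functions side by side, sketched with sqlite3 on an invented scores table; note how the tie at 300 is handled by each.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 300), ("b", 300), ("c", 200)])

rows = conn.execute(
    """
    SELECT player,
           ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,  -- unique: 1, 2, 3
           RANK()       OVER (ORDER BY score DESC) AS rnk,      -- gap:    1, 1, 3
           DENSE_RANK() OVER (ORDER BY score DESC) AS drnk      -- no gap: 1, 1, 2
    FROM scores
    """
).fetchall()
for r in rows:
    print(r)
```

ROW_NUMBER breaks the a/b tie arbitrarily (1 and 2 in some order), while both ranking functions give them 1; only RANK then jumps to 3 for c.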

26. How would you use LAG() or LEAD() for day-over-day comparisons?

Use LAG(daily_sales) to fetch the previous day’s value in the current row and subtract to compute change.
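A day-over-day sketch with sqlite3; the daily_sales table and figures are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, total INTEGER)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)",
                 [("2024-01-01", 100), ("2024-01-02", 150), ("2024-01-03", 130)])

# LAG(total) pulls the previous day's value into the current row,
# so the day-over-day change is a simple subtraction.
rows = conn.execute(
    """
    SELECT day,
           total,
           total - LAG(total) OVER (ORDER BY day) AS change
    FROM daily_sales
    ORDER BY day
    """
).fetchall()
print(rows)
# [('2024-01-01', 100, None), ('2024-01-02', 150, 50), ('2024-01-03', 130, -20)]
```

The first row has no predecessor, so LAG yields NULL and the change is NULL too.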

27. How does GROUP BY work?

It groups rows with identical values so aggregate functions can calculate results per group.

28. What is an alias in SQL and why use it?

A temporary alternate name for a column or table (via AS) to make queries shorter and clearer.

29. Purpose of the CASE statement in SQL.

Implements IF-THEN-ELSE logic to create conditional expressions or derived columns.
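A CASE sketch with sqlite3; the listings table and price tiers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER, price INTEGER)")
conn.executemany("INSERT INTO listings VALUES (?, ?)",
                 [(1, 50000), (2, 250000), (3, 900000)])

# CASE builds a derived column from IF-THEN-ELSE style conditions.
rows = conn.execute(
    """
    SELECT id,
           CASE
               WHEN price < 100000 THEN 'budget'
               WHEN price < 500000 THEN 'mid'
               ELSE 'premium'
           END AS tier
    FROM listings
    ORDER BY id
    """
).fetchall()
print(rows)  # [(1, 'budget'), (2, 'mid'), (3, 'premium')]
```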

30. How do you handle NULL values in queries?

Use COALESCE to replace NULL, and IS NULL / IS NOT NULL in WHERE clauses to filter.
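Both techniques in one sqlite3 sketch; the agents table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agents (name TEXT, phone TEXT)")
conn.executemany("INSERT INTO agents VALUES (?, ?)",
                 [("ana", "555-0100"), ("ben", None)])

# COALESCE substitutes a fallback value for NULL in the output;
# IS NULL filters on missing values in WHERE.
rows = conn.execute(
    "SELECT name, COALESCE(phone, 'unknown') FROM agents ORDER BY name"
).fetchall()
missing = conn.execute(
    "SELECT name FROM agents WHERE phone IS NULL"
).fetchall()
print(rows)     # [('ana', '555-0100'), ('ben', 'unknown')]
print(missing)  # [('ben',)]
```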

31. What does the EXPLAIN command do?

Shows the query execution plan, helping diagnose and optimize slow queries.

32. Name three aggregate functions.

SUM(), COUNT(), and AVG().

33. IF vs. CASE in SQL—why prefer CASE?

CASE is standard SQL and portable; IF() is dialect-specific, so prefer CASE.

34. What is a view and how does it differ from a table?

A view is a stored SELECT that presents virtual data; unlike a table it stores no data itself.

35. What is a stored procedure and when is it useful?

A pre-compiled set of SQL statements stored in the DB; useful for encapsulating reusable business logic.

36. Describe Amazon Redshift’s architecture.

Leader node coordinates queries; multiple compute nodes store data and execute steps in parallel (MPP).

37. What is MPP architecture and why is it vital for warehousing?

Massively Parallel Processing splits tasks across nodes working simultaneously, enabling fast analysis of billions of rows.

38. Explain columnar storage and its benefits.

Data stored by column, enabling reading only needed columns and superior compression—both speed analytic queries.

39. Recommended way to bulk-load data into Redshift.

Use the parallel COPY command from S3 rather than individual INSERTs.

40. Give two best practices for the COPY command.

Split large files into equal parts and use a manifest file; always compress files (e.g., GZIP/ZSTD).

41. What is a manifest file in Redshift loads?

A JSON file listing exact S3 objects to COPY, ensuring only specified files are loaded.

42. What is a Distribution Key (DISTKEY) and its styles?

The distribution style controls row placement across nodes: EVEN (round-robin), KEY (rows with the same DISTKEY value land on the same node), ALL (full copy on every node).

43. When would you choose DISTSTYLE ALL vs. DISTSTYLE KEY?

Use ALL for small, frequently joined dimensions; use KEY for large fact tables and their largest dimension on the join key.

44. What is data skew in Redshift and how can a bad DISTKEY cause it?

Uneven data across nodes; a poorly chosen DISTKEY with non-uniform values puts too much data on some nodes, slowing queries.

45. What is a Sort Key and how does it speed queries?

Defines on-disk sort order; filtering on sorted columns lets Redshift skip large data blocks, accelerating scans.

46. Compare COMPOUND and INTERLEAVED Sort Keys.

COMPOUND orders by priority column sequence—best when queries filter on first column; INTERLEAVED gives equal weight to all columns—best for varied filter patterns.

47. Best SORTKEY for a large fact table queried by date.

A COMPOUND SORTKEY with the date column first.

48. Purpose of the VACUUM command in Redshift.

Reclaims space from deleted rows and re-sorts data to maintain performance.

49. Materialized view vs. regular view in Redshift.

Regular view runs its query each time; materialized view stores pre-computed results for faster repeated access.

50. How does Redshift enforce security?

Through standard GRANT and REVOKE privileges on users and groups for schemas, tables, and columns.

51. What is Redshift Spectrum and when is it useful?

Feature that lets you query data in S3 directly without loading—great for huge, rarely accessed datasets.

52. What is Workload Management (WLM) in Redshift?

Configures query queues to prioritize workloads so short dashboard queries aren’t blocked by long ETL jobs.

53. What is Matillion and is it ETL or ELT?

A cloud-native data integration tool; it is an ELT tool that pushes SQL to the warehouse.

54. Define a “push-down” transformation in Matillion.

Transformation logic is converted to SQL and executed inside Redshift rather than on an external engine.

55. Difference between an Orchestration Job and a Transformation Job in Matillion.

Transformation Job manipulates data; Orchestration Job controls flow—running transformations, COPYs, and conditional logic.

56. How does Matillion enable building pipelines with little coding?

Provides a drag-and-drop UI that visually builds jobs and auto-generates the underlying SQL.

57. How do you handle version control for Matillion jobs?

Connect Matillion to Git to commit, push, and pull job changes collaboratively.

58. Name three types of data sources Matillion can connect to.

Databases (MySQL), SaaS apps (Salesforce, Google Analytics), and file storage (Amazon S3).

59. How would you schedule a daily job in Matillion?

Use Matillion’s built-in scheduler to set a recurring trigger on the orchestration job.

60. How can you implement an SCD Type 2 dimension in Matillion?

Use the Detect Changes component to compare source vs. target and insert new/changed rows while flagging old ones inactive.

61. What role does Python play in a mostly SQL data-engineering stack?

Acts as glue for tasks SQL is weak at: API ingestion, workflow automation, complex validation, etc.

62. Which Python library is standard for connecting to Redshift?

psycopg2 (PostgreSQL adapter).

63. Which Python library is most common for data manipulation?

pandas.

64. Explain class vs. object in Python.

A class is a blueprint; an object is an instantiated instance of that blueprint.
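A tiny illustration; the Pipeline class and its names are invented for the example.

```python
class Pipeline:
    """Blueprint: defines the attributes and behavior every pipeline has."""

    def __init__(self, name):
        self.name = name

    def run(self):
        return f"running {self.name}"

# Objects: two independent instances built from the same blueprint.
daily = Pipeline("daily_load")
hourly = Pipeline("hourly_load")
print(daily.run())   # running daily_load
print(hourly.run())  # running hourly_load
```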

65. List the four main OOP principles.

Encapsulation, Abstraction, Inheritance, Polymorphism.

66. Difference between an abstract class and an interface.

Abstract class can include some implemented methods; interface defines only method signatures with no implementation.
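Python has no separate interface keyword, but the abc module can sketch the distinction; Connector and S3Connector are made-up names.

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    # Abstract method: only a signature, like an interface member.
    @abstractmethod
    def fetch(self):
        ...

    # Concrete method: shared implementation that a pure interface
    # could not provide.
    def fetch_twice(self):
        return [self.fetch(), self.fetch()]

class S3Connector(Connector):
    def fetch(self):
        return "rows from S3"

print(S3Connector().fetch_twice())  # ['rows from S3', 'rows from S3']
# Connector() itself raises TypeError: abstract methods block instantiation.
```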

67. How would you read a large CSV efficiently in Python?

Use pandas.read_csv with the chunksize parameter to process in manageable pieces.

68. What is a Python virtual environment and why use it?

An isolated environment per project that avoids dependency/version conflicts.

69. Outline an end-to-end data pipeline design.

Ingest to S3 lake; ELT into Redshift staging; transform to Star Schema; serve to BI; orchestrate & monitor with scheduling/alerts.

70. Design a data warehouse schema for property listings on dubizzle.

Star schema: fact_listings_performance (daily metrics) + dimensions dim_date, dim_property, dim_agent, dim_location.

71. How would you design near real-time dashboard updates?

Use streaming: capture changes (DMS), push to Kinesis, process stream, store in fast store (e.g., Elasticsearch).

72. Steps to debug an intermittently failing data pipeline.

Check logs, isolate failing task, inspect common issues (schema change, credentials), reproduce manually.

73. How do you ensure data quality in pipelines?

Validation checks, reconciliation counts, automated tests, and alerting on failures.

74. A dashboard number is wrong—what’s your investigation process?

Clarify issue, trace lineage back to raw data, validate at each step, communicate findings & fix.

75. High-level design to ingest data from a new REST API.

Explore with Postman; write Python requests script with pagination & error handling; land JSON in S3; schedule via Airflow.
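The pagination loop can be sketched as below; fetch_page and its next_page field stand in for a real HTTP call (e.g. with the requests library) and are purely hypothetical.

```python
# Canned pages simulating a paginated API response, for illustration only.
SAMPLE_PAGES = {
    1: {"items": [{"id": 1}, {"id": 2}], "next_page": 2},
    2: {"items": [{"id": 3}], "next_page": None},
}

def fetch_page(page_number):
    # Real code would GET the API here, with retries and error handling.
    return SAMPLE_PAGES[page_number]

def ingest_all():
    records, page = [], 1
    while page is not None:            # follow next_page until exhausted
        payload = fetch_page(page)
        records.extend(payload["items"])
        page = payload["next_page"]
    return records                     # would be written to S3 as JSON

print(ingest_all())  # [{'id': 1}, {'id': 2}, {'id': 3}]
```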

76. How do you handle schema changes from a source system?

Accept additive columns gracefully; detect breaking changes via monitoring and update transformations promptly.

77. How would you compute a 30-day rolling average of property views per neighborhood?

Aggregate daily views, then use AVG(total_views) OVER (ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) to produce rolling average stored in a reporting table.
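The window frame can be sketched with sqlite3 (window functions need SQLite ≥ 3.25); the daily_views table holds only three invented days, so the frame simply averages whatever rows exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_views (day TEXT, neighborhood TEXT, total_views INTEGER)"
)
conn.executemany(
    "INSERT INTO daily_views VALUES (?, ?, ?)",
    [("2024-01-01", "marina", 10),
     ("2024-01-02", "marina", 20),
     ("2024-01-03", "marina", 60)],
)

# Per neighborhood, the frame covers the current row plus the
# 29 rows before it, i.e. a 30-day window once enough days exist.
rows = conn.execute(
    """
    SELECT day, neighborhood,
           AVG(total_views) OVER (
               PARTITION BY neighborhood
               ORDER BY day
               ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
           ) AS rolling_avg
    FROM daily_views
    ORDER BY day
    """
).fetchall()
print(rows)
# [('2024-01-01', 'marina', 10.0), ('2024-01-02', 'marina', 15.0), ('2024-01-03', 'marina', 30.0)]
```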