A comprehensive set of question-and-answer flashcards covering the core concepts, technologies, and interview strategies discussed in the lecture notes for the Bayut & dubizzle Data Engineer internship.
What does the phrase “raw data integrations with primary and third-party systems” primarily refer to in a data pipeline?
The Extract phase, where data is pulled from sources such as production databases, micro-services, or external APIs in formats like JSON, CSV, or Parquet.
Which data-warehouse design activity is implied by the responsibility “data-warehouse modelling for operational and application data layers”?
Designing optimized schemas (e.g., dimensional modelling with fact and dimension tables) for analytical queries, often implementing a bronze-silver-gold layer architecture.
What three core technologies constitute the technical stack called out in the Bayut & dubizzle job description?
Amazon Redshift for warehousing, SQL for data manipulation, and Python for scripting/automation.
Which ETL/ELT tool does Bayut & dubizzle use to orchestrate data pipelines?
Matillion ETL (which actually follows an ELT, push-down approach).
Which BI platform (formerly Periscope Data) is used for dashboards at Bayut & dubizzle?
Sisense for Cloud Data Teams (formerly Periscope Data).
In the job description, what soft skill is highlighted by the phrase “liaise with other departments”?
Strong communication skills to translate business needs into technical specifications and explain technical concepts to non-technical stakeholders.
What are the three main storage layers of the modern data stack described (in order of raw to curated)?
Data Lake (raw), Data Warehouse (cleaned/structured), and Data Mart (department-specific curated subset).
Define a Data Warehouse in one sentence.
A centralized repository of cleansed, structured, and historically stored data optimized for analytical querying (OLAP).
How does a Data Lake differ from a Data Warehouse?
A Data Lake stores raw structured, semi-structured, and unstructured data in native format without mandatory schema-on-write, giving flexibility for exploratory analysis.
What is a Data Mart?
A smaller, business-line-specific subset of a Data Warehouse that improves query performance and simplifies access for a particular user group.
What is the key distinction between OLTP and OLAP systems?
OLTP supports many fast, atomic transactions for day-to-day operations; OLAP supports complex, large-scale analytical queries over historical data.
Which paradigm—ETL or ELT—do Matillion and Amazon Redshift together implement?
ELT: data is Extracted, Loaded into Redshift in raw form, and Transformed inside the warehouse using its MPP engine.
In dimensional modelling, what does a Fact table store?
Quantitative measurements (metrics) plus foreign keys pointing to related dimension tables.
What does a Dimension table provide in a star schema?
Descriptive, contextual attributes (the who, what, where, when, why) that qualify facts.
What is meant by the ‘grain’ of a fact table?
The level of detail represented by each row, e.g., “one row per listing per day.”
Name the two most common schema architectures in data warehousing.
Star Schema and Snowflake Schema.
Why is a Star Schema usually preferred in Amazon Redshift?
It minimizes joins, boosting query performance, and modern columnar compression reduces redundancy concerns.
What is a Slowly Changing Dimension (SCD) Type 2?
A technique that tracks historical changes by adding a new row for each change and marking old rows as expired, enabling point-in-time analysis.
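A minimal sketch of the Type 2 mechanics in plain Python may help. The `agent_id`, `city`, `valid_from`, `valid_to`, and `is_current` column names are hypothetical, and a real warehouse would do this with SQL against the dimension table rather than an in-memory list.

```python
from datetime import date

def apply_scd2(rows, agent_id, new_attrs, change_date):
    """Expire the current row for agent_id and append a new current row."""
    for row in rows:
        if row["agent_id"] == agent_id and row["is_current"]:
            if all(row[k] == v for k, v in new_attrs.items()):
                return rows  # attributes unchanged, so history stays as it is
            row["valid_to"] = change_date   # mark the old version as expired
            row["is_current"] = False
    rows.append({"agent_id": agent_id, **new_attrs,
                 "valid_from": change_date, "valid_to": None, "is_current": True})
    return rows

# Hypothetical dimension with one agent who later moves city.
dim_agent = [{"agent_id": 7, "city": "Dubai", "valid_from": date(2023, 1, 1),
              "valid_to": None, "is_current": True}]
apply_scd2(dim_agent, 7, {"city": "Abu Dhabi"}, date(2024, 6, 1))
print(len(dim_agent))  # 2
```

Point-in-time analysis then means picking the row whose `valid_from`/`valid_to` window contains the date of interest.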
Which SQL category contains commands like CREATE and ALTER?
DDL – Data Definition Language.
Which SQL statement category consists almost solely of SELECT queries?
DQL – Data Query Language.
Give one key difference between DELETE and TRUNCATE.
DELETE is a logged, row-by-row DML operation that can use a WHERE clause; TRUNCATE is a fast DDL operation that removes all rows and, in Redshift, commits immediately and so cannot be rolled back.
When should you use UNION ALL instead of UNION?
Whenever duplicate removal is not explicitly required, because UNION ALL is faster and avoids a costly sort/distinct step.
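The difference in results can be seen in a small sketch that uses Python's built-in `sqlite3` as a stand-in for the warehouse. The `leads_web` and `leads_app` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE leads_web (email TEXT);
    CREATE TABLE leads_app (email TEXT);
    INSERT INTO leads_web VALUES ('a@example.com'), ('b@example.com');
    INSERT INTO leads_app VALUES ('b@example.com'), ('c@example.com');
""")

# UNION ALL simply concatenates both inputs.
union_all_rows = conn.execute(
    "SELECT email FROM leads_web UNION ALL SELECT email FROM leads_app"
).fetchall()

# UNION adds a de-duplication step on top of the concatenation.
union_rows = conn.execute(
    "SELECT email FROM leads_web UNION SELECT email FROM leads_app"
).fetchall()

print(len(union_all_rows), len(union_rows))  # 4 3
```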
What clause is used to filter aggregated groups rather than individual rows?
HAVING.
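As an illustration, the sketch below runs WHERE and HAVING in the same query, using `sqlite3` as a stand-in engine and a hypothetical `listings` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listings (city TEXT, price INTEGER);
    INSERT INTO listings VALUES
        ('Dubai', 100), ('Dubai', 200), ('Dubai', 300),
        ('Sharjah', 50), ('Sharjah', 5000);
""")

busy_cities = conn.execute("""
    SELECT city, COUNT(*) AS n
    FROM listings
    WHERE price >= 100      -- row filter, applied before grouping
    GROUP BY city
    HAVING COUNT(*) >= 2    -- group filter, applied after aggregation
""").fetchall()

print(busy_cities)  # [('Dubai', 3)]
```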
Why are Common Table Expressions (CTEs) often preferred over deeply nested subqueries?
They improve readability, modularity, and maintainability by allowing step-by-step logical breakdowns.
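A short sketch of the step-by-step style, run through `sqlite3` (assuming a bundled SQLite of 3.25 or later for the window function). The `listings` table and the CTE names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listings (city TEXT, price INTEGER);
    INSERT INTO listings VALUES
        ('Dubai', 100), ('Dubai', 200), ('Sharjah', 150), ('Ajman', 50);
""")

top_city = conn.execute("""
    WITH city_totals AS (          -- step 1: aggregate per city
        SELECT city, SUM(price) AS total
        FROM listings
        GROUP BY city
    ),
    ranked AS (                    -- step 2: rank the aggregated rows
        SELECT city, total, RANK() OVER (ORDER BY total DESC) AS rnk
        FROM city_totals
    )
    SELECT city, total FROM ranked WHERE rnk = 1   -- step 3: keep the winner
""").fetchall()

print(top_city)  # [('Dubai', 300)]
```

Each named step can be queried on its own while debugging, which is harder with the equivalent nested subqueries.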
Name two ranking window functions.
ROW_NUMBER() and RANK() (also DENSE_RANK()).
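The three ranking functions differ only in how they treat ties. The sketch below shows this with `sqlite3` (assuming SQLite 3.25 or later) and a hypothetical `agent_deals` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent_deals (agent TEXT, deals INTEGER);
    INSERT INTO agent_deals VALUES ('Aisha', 10), ('Bilal', 10), ('Chen', 7);
""")

ranked = conn.execute("""
    SELECT agent, deals,
           ROW_NUMBER() OVER (ORDER BY deals DESC, agent) AS row_num,
           RANK()       OVER (ORDER BY deals DESC)        AS rnk,
           DENSE_RANK() OVER (ORDER BY deals DESC)        AS dense_rnk
    FROM agent_deals
    ORDER BY deals DESC, agent
""").fetchall()

for row in ranked:
    print(row)
# ('Aisha', 10, 1, 1, 1)
# ('Bilal', 10, 2, 1, 1)
# ('Chen', 7, 3, 3, 2)
```

ROW_NUMBER() always yields unique numbers, RANK() leaves a gap after ties, and DENSE_RANK() does not.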
Which window function would you use to access the previous row’s value?
LAG().
What does the SQL EXPLAIN command return?
The query execution plan, showing steps, join methods, estimated rows, and costs to aid performance tuning.
What architectural principle gives Amazon Redshift its parallel processing power?
MPP – Massively Parallel Processing, distributing data and query workload across multiple nodes and slices.
What role does the Redshift Leader Node play?
It parses, optimizes, compiles, and coordinates query execution across compute nodes, then aggregates results.
Why does Redshift store data in columnar format?
Columnar storage allows reading only needed columns, drastically cutting I/O for analytical queries and enabling better compression.
What is the recommended command for bulk-loading data into Redshift?
COPY.
Name two COPY-command best practices for performance.
Stage data in S3 and split it into multiple equal-sized compressed files (ideally 1 MB–1 GB each after compression, with the file count a multiple of the number of slices).
What is a Redshift manifest file used for?
To list the exact S3 objects a COPY should load, so that only the intended files are loaded and, when entries are marked mandatory, the load fails if any listed file is missing.
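A manifest is a small JSON document with an `entries` list of `url` and `mandatory` fields. The sketch below builds one in Python; the bucket and file names are hypothetical, and uploading the file to S3 is left out.

```python
import json

def build_manifest(s3_urls):
    """Return a COPY manifest that lists each S3 object explicitly."""
    return {"entries": [{"url": url, "mandatory": True} for url in s3_urls]}

# Hypothetical staged files for one load.
manifest = build_manifest([
    "s3://example-bucket/listings/part-0000.csv.gz",
    "s3://example-bucket/listings/part-0001.csv.gz",
])
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The COPY statement would then point at the manifest object and include the MANIFEST option, instead of pointing at a key prefix.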
What does a DISTKEY do in Redshift?
Determines how rows are distributed across nodes by hashing a chosen column, impacting join performance and data movement.
When should you choose DISTSTYLE ALL?
For small dimension tables frequently joined to large facts, placing a full copy on every node to eliminate network shuffling.
What is the primary purpose of a SORTKEY in Redshift?
Ordering rows on disk to enable zone-map pruning, reducing I/O for queries with filters on the sorted columns.
Contrast COMPOUND and INTERLEAVED sort keys.
COMPOUND sorts by columns in listed order (good for prefix filters); INTERLEAVED gives equal weight to each column (better for varied filter columns but costlier to maintain).
What maintenance command re-sorts data and reclaims space in Redshift?
VACUUM.
In Matillion, what is an Orchestration Job?
A high-level workflow that schedules, sequences, and controls other jobs, scripts, and conditional logic.
What is a Transformation Job in Matillion?
A visual data-flow that defines actual transformations which Matillion converts into optimized SQL executed inside Redshift.
Explain Matillion’s ‘staging-first’ best practice.
Load raw data into a staging table with minimal changes, then transform from staging to cleaned fact/dimension tables in a separate job, improving robustness and debug-ability.
Which version-control system can Matillion natively integrate with?
Git.
Give two typical roles for Python in this stack.
Automating tasks (e.g., triggering Matillion via API) and extracting data from APIs to stage in S3.
Which Python library is commonly used to connect to Redshift?
psycopg2 (PostgreSQL adapter).
List the four pillars of Object-Oriented Programming.
Encapsulation, Inheritance, Polymorphism, and Abstraction.
What high-level steps make up the recommended system-design interview framework?
(1) Clarify requirements, (2) design the high-level architecture, (3) deep-dive into the data model, (4) detail the pipeline implementation, (5) discuss data quality and monitoring, (6) address scalability and bottlenecks.
During a system-design round, why should you always start by clarifying requirements?
To understand scope, data volume, velocity, users, and latency needs, ensuring the proposed solution addresses the actual problem.
In STAR behavioral answers, what does each letter stand for?
Situation, Task, Action, Result.
What makes a good ‘Result’ in a STAR story?
Concrete, ideally quantified impact that shows success and learning (e.g., “reduced query time by 70%”).
Give an example of an insightful question to ask interviewers about their ELT stack.
“How does the team balance SQL optimization versus Redshift cluster tuning (e.g., WLM, node types) when improving performance?”
Why is UNION ALL usually faster than UNION?
UNION ALL skips the duplicate-removal step, avoiding an extra sort/distinct operation.
Which SQL clause would you use to rename a column or table temporarily in a query for readability?
AS (alias).
What is a materialized view in Redshift?
A view whose query result is pre-computed and stored on disk, enabling near-instant query responses, periodically refreshed by the pipeline.
How does Redshift Spectrum extend Redshift’s capabilities?
It allows querying data directly in S3 without loading it into Redshift tables, blending external and internal data sources.
What is Workload Management (WLM) in Redshift used for?
Allocating cluster resources among query queues to balance and prioritize workloads for consistent performance.
In Redshift, what problem does data skew cause?
Uneven data distribution leads to some nodes doing more work, slowing down parallel queries and reducing performance.
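Skew can be simulated by hashing a distribution key across slices. In this sketch MD5 stands in for Redshift's internal hash function, which is an assumption made only for illustration, and the key values are hypothetical.

```python
import hashlib
from collections import Counter

def rows_per_slice(keys, num_slices):
    """Assign each row to a slice by hashing its distribution key."""
    counts = Counter(
        int(hashlib.md5(str(k).encode()).hexdigest(), 16) % num_slices
        for k in keys
    )
    return [counts.get(s, 0) for s in range(num_slices)]

def skew_ratio(counts):
    """Rows on the busiest slice divided by the average rows per slice."""
    return max(counts) / (sum(counts) / len(counts))

# High-cardinality key (e.g. a listing id) versus a low-cardinality country code.
even_counts = rows_per_slice(range(10_000), 4)
skewed_counts = rows_per_slice(["AE"] * 9_000 + ["SA"] * 500 + ["EG"] * 500, 4)

print(even_counts, round(skew_ratio(even_counts), 2))
print(skewed_counts, round(skew_ratio(skewed_counts), 2))
```

A ratio near 1 means the slices share the work evenly; with the country-code key, one slice holds at least 90% of the rows and the whole query waits for it.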
What is the purpose of a zone map in Redshift?
Metadata storing min/max values for each 1-MB block to enable skipping irrelevant blocks during scans, leveraging sort keys.
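The pruning idea can be sketched in a few lines of Python. Blocks of 10 values stand in for the 1-MB blocks, purely for illustration.

```python
def build_zone_map(values, block_size):
    """Split values into fixed-size blocks and record each block's min and max."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b), b) for b in blocks]

def range_scan(zone_map, low, high):
    """Return matching values and the number of blocks that had to be read."""
    hits, blocks_read = [], 0
    for block_min, block_max, block in zone_map:
        if block_max < low or block_min > high:
            continue  # min/max metadata proves no row in this block can match
        blocks_read += 1
        hits.extend(v for v in block if low <= v <= high)
    return hits, blocks_read

values = list(range(1, 101))

# Data stored in sort-key order: matching rows sit in a few adjacent blocks.
sorted_hits, sorted_reads = range_scan(build_zone_map(values, 10), 35, 42)

# Same data stored in a scattered order: every block spans almost the full range.
scattered = sorted(values, key=lambda v: (v % 10, v))
scattered_hits, scattered_reads = range_scan(build_zone_map(scattered, 10), 35, 42)

print(sorted_reads, scattered_reads)  # 2 10
```

Both scans return the same rows, but only the sorted layout lets the min/max metadata skip blocks.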
Why should INSERT statements be avoided for bulk loads into Redshift?
They process rows serially, bypass parallelism, and are far less efficient than the COPY command.
Which SQL command resets identity columns while rapidly removing all rows?
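TRUNCATE (in engines such as SQL Server; PostgreSQL needs TRUNCATE ... RESTART IDENTITY, so confirm the behavior on your own warehouse).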
What does the COALESCE function do in SQL?
Returns the first non-NULL value in a list of expressions, useful for handling NULLs.
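A small sketch with `sqlite3` and a hypothetical `contacts` table shows the fallback behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contacts (name TEXT, mobile TEXT, landline TEXT);
    INSERT INTO contacts VALUES
        ('Aisha', NULL, '04-555'),
        ('Bilal', '050-111', NULL),
        ('Chen', NULL, NULL);
""")

# Prefer mobile, fall back to landline, then to a placeholder.
best_phone = conn.execute("""
    SELECT name, COALESCE(mobile, landline, 'n/a') AS phone
    FROM contacts
    ORDER BY name
""").fetchall()

print(best_phone)
# [('Aisha', '04-555'), ('Bilal', '050-111'), ('Chen', 'n/a')]
```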
Describe the difference between a View and a Table.
A table stores data physically; a view stores only a SQL definition and returns results dynamically (unless materialized).
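The contrast shows up as soon as the base data changes. The sketch below uses `sqlite3` and hypothetical `sales`, `v_total`, and `t_total` objects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (amount INTEGER);
    INSERT INTO sales VALUES (10), (20);

    -- The view stores only the query text.
    CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sales;
    -- The table stores the result as it was at creation time.
    CREATE TABLE t_total AS SELECT SUM(amount) AS total FROM sales;

    INSERT INTO sales VALUES (30);
""")

view_total = conn.execute("SELECT total FROM v_total").fetchall()[0][0]
table_total = conn.execute("SELECT total FROM t_total").fetchall()[0][0]
print(view_total, table_total)  # 60 30
```

The view reflects the new row because it is re-evaluated on every read, while the physical table keeps its stored snapshot.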
What is dbt (mentioned indirectly under data quality frameworks)?
A transformation/workflow tool that enables version-controlled, testable, SQL-based ELT pipelines (not used directly at Bayut but conceptually relevant).
In Redshift COPY, why is file compression (e.g., GZIP) recommended?
It lowers S3 storage costs, reduces network transfer size, and accelerates load times since Redshift can decompress on the fly.
What does the ‘Write-Audit-Publish’ pattern ensure in data pipelines?
Data is loaded to a temp table, validated, and only then swapped into production tables, preventing bad data from reaching consumers.
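One way the pattern might look, sketched with `sqlite3` in place of the warehouse. The table names, the audit rule (no NULL ids, positive prices), and the rename-based swap are all assumptions for illustration.

```python
import sqlite3

def write_audit_publish(conn, rows):
    """Load rows to staging, validate them, and publish only if the audit passes."""
    # Write: land the batch in a staging table that consumers never read.
    conn.execute("DROP TABLE IF EXISTS listings_staging")
    conn.execute("CREATE TABLE listings_staging (listing_id INTEGER, price REAL)")
    conn.executemany("INSERT INTO listings_staging VALUES (?, ?)", rows)

    # Audit: count rows that break the (hypothetical) quality rules.
    bad = conn.execute(
        "SELECT COUNT(*) FROM listings_staging "
        "WHERE listing_id IS NULL OR price <= 0"
    ).fetchall()[0][0]
    if bad or not rows:
        conn.execute("DROP TABLE listings_staging")
        conn.commit()
        return False  # production table is left untouched

    # Publish: swap the validated table into place.
    conn.execute("DROP TABLE IF EXISTS listings")
    conn.execute("ALTER TABLE listings_staging RENAME TO listings")
    conn.commit()
    return True

conn = sqlite3.connect(":memory:")
first = write_audit_publish(conn, [(1, 100.0), (2, 250.0)])
second = write_audit_publish(conn, [(3, -5.0)])  # fails the audit
published = conn.execute("SELECT COUNT(*) FROM listings").fetchall()[0][0]
print(first, second, published)  # True False 2
```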
Which Redshift distribution style is default when none is specified?
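DISTSTYLE AUTO on current Redshift versions, where Redshift chooses the style based on table size; EVEN was the default before AUTO was introduced.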
What does the LAG() window function commonly help calculate?
Differences or changes by comparing current row values with previous row values (e.g., day-over-day metrics).
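A day-over-day calculation might look like the sketch below, run through `sqlite3` (assuming SQLite 3.25 or later) on a hypothetical `daily_leads` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE daily_leads (day TEXT, leads INTEGER);
    INSERT INTO daily_leads VALUES
        ('2024-01-01', 100), ('2024-01-02', 120), ('2024-01-03', 90);
""")

day_over_day = conn.execute("""
    SELECT day,
           leads,
           leads - LAG(leads) OVER (ORDER BY day) AS change
    FROM daily_leads
    ORDER BY day
""").fetchall()

for row in day_over_day:
    print(row)
# ('2024-01-01', 100, None)
# ('2024-01-02', 120, 20)
# ('2024-01-03', 90, -30)
```

The first row has no predecessor, so LAG() returns NULL and the difference is NULL as well.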
Give a use case for a CROSS JOIN.
Generating every possible combination of two small tables, such as a date-by-product scaffold with one row per product per day.
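A sketch of such a scaffold, using `sqlite3` and hypothetical `dates` and `products` tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dates (day TEXT);
    CREATE TABLE products (product TEXT);
    INSERT INTO dates VALUES ('2024-01-01'), ('2024-01-02'), ('2024-01-03');
    INSERT INTO products VALUES ('apartment'), ('villa');
""")

# Every date paired with every product: 3 x 2 = 6 rows.
scaffold = conn.execute("""
    SELECT d.day, p.product
    FROM dates AS d
    CROSS JOIN products AS p
    ORDER BY d.day, p.product
""").fetchall()

print(len(scaffold))  # 6
```

Left-joining facts onto this scaffold yields a row for every day and product, including days with no activity.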
Why is the COPY file count ideally a multiple of total slices in a Redshift cluster?
So each slice loads one file in parallel, fully utilizing cluster resources and maximizing throughput.
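The arithmetic can be sketched as a small helper. The 128 MiB target file size is an assumption chosen from within the 1 MB–1 GB range, and the function name is hypothetical.

```python
import math

def copy_file_count(total_bytes, slices, target_file_bytes=128 * 1024**2):
    """Pick a file count near the target size, rounded up to a multiple of slices."""
    needed = max(1, math.ceil(total_bytes / target_file_bytes))
    return math.ceil(needed / slices) * slices

# 10 GiB on an 8-slice cluster: 80 files, i.e. 10 files per slice.
print(copy_file_count(10 * 1024**3, 8))  # 80

# 1 GiB on a 6-slice cluster: 8 files are needed, rounded up to 12.
print(copy_file_count(1024**3, 6))  # 12
```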
What is the key benefit of staging raw data in Amazon S3 before loading to Redshift?
Decouples data ingestion from warehouse processing and leverages COPY’s parallelism from S3, a highly integrated, scalable storage layer.
What is an alias for Periscope Data after rebranding?
Sisense for Cloud Data Teams.
How can materialized views improve dashboard performance in Periscope/Sisense?
By precomputing heavy aggregations so dashboards query the fast materialized result instead of large base tables.
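What Redshift command would you run when tuning long-running queries?
EXPLAIN, which shows the estimated plan; Redshift does not support EXPLAIN ANALYZE, so actual run-time statistics come from system views such as SVL_QUERY_SUMMARY and SVL_QUERY_REPORT.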
Name two Python libraries commonly used for data manipulation and visualization in exploratory work.
Pandas and Matplotlib (or Seaborn).
What is Encapsulation in OOP?
The concept of bundling data and methods operating on that data within one class, hiding implementation details from outside access.
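A minimal Python sketch of the idea, using a hypothetical `ListingPrice` class whose internal state can only change through a validated method:

```python
class ListingPrice:
    """Bundles a price with the rule that guards it."""

    def __init__(self, amount_aed):
        self._amount_aed = 0.0      # leading underscore marks it as internal
        self.update(amount_aed)

    @property
    def amount_aed(self):
        """Read-only access to the current price."""
        return self._amount_aed

    def update(self, amount_aed):
        """Change the price, rejecting values that make no sense."""
        if amount_aed <= 0:
            raise ValueError("price must be positive")
        self._amount_aed = float(amount_aed)

price = ListingPrice(1_500_000)
price.update(1_450_000)
print(price.amount_aed)  # 1450000.0
```

Callers work with `amount_aed` and `update()` and never need to know how the value is stored or validated.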
Which Redshift key strategy is recommended for a small dimension table frequently joined to large facts?
DISTSTYLE ALL plus a simple COMPOUND SORTKEY on its primary key.
Why are date columns often chosen as leading SORTKEYs?
Most analytical queries filter by time ranges, so sorting by date maximizes block pruning and speeds scans.
What should you include in Matillion to alert teams on pipeline failures?
Notification components (e.g., email or Slack) tied to job success/failure paths.
What AWS service can stream website events into S3 for later Redshift loading (mentioned in system-design scenario)?
Amazon Kinesis Data Firehose.
How does INTERLEAVED SORTKEY affect VACUUM operations?
It requires more processing during VACUUM, making maintenance heavier than for COMPOUND keys.
What does the Redshift VACUUM DELETE phase accomplish?
Reclaims disk space from rows marked for deletion by earlier UPDATE and DELETE operations; re-sorting is handled by the separate sort phase.
In Python, what is a virtual environment and why use it?
An isolated environment that keeps project-specific package versions separate, preventing dependency conflicts.
When presenting system design, why is it critical to mention data-quality checks?
They demonstrate foresight in ensuring reliability, preventing bad data from propagating, and reflect real-world production considerations.
What SQL clause lets you define named intermediate result sets that later parts of the same statement can reference, improving readability?
WITH, which defines a Common Table Expression (CTE).