Bayut & dubizzle Data Engineer Internship – Interview Prep Flashcards

0.0(0)

Studied by 0 people

View linked note

Call Kai

Learn

Practice Test

Spaced Repetition

Match

Flashcards

Knowt Play

Card Sorting

1/82

Earn XP

Description and Tags

A comprehensive set of question-and-answer flashcards covering the core concepts, technologies, and interview strategies discussed in the lecture notes for the Bayut & dubizzle Data Engineer internship.

Last updated 3:35 PM on 7/13/25

Name	Mastery	Learn	Test	Matching	Spaced	Call with Kai

No analytics yet

Send a link to your students to track their progress

83 Terms

New cards

What does the phrase “raw data integrations with primary and third-party systems” primarily refer to in a data pipeline?

The Extract phase, where data is pulled from sources such as production databases, micro-services, or external APIs in formats like JSON, CSV, or Parquet.

New cards

Which data-warehouse design activity is implied by the responsibility “data-warehouse modelling for operational and application data layers”?

Designing optimized schemas (e.g., dimensional modelling with fact and dimension tables) for analytical queries, often implementing a bronze-silver-gold layer architecture.

New cards

What three core technologies constitute the technical stack called out in the Bayut & dubizzle job description?

Amazon Redshift for warehousing, SQL for data manipulation, and Python for scripting/automation.

New cards

Which ETL/ELT tool does Bayut & dubizzle use to orchestrate data pipelines?

Matillion ETL (which actually follows an ELT, push-down approach).

New cards

Which BI platform (formerly Periscope Data) is used for dashboards at Bayut & dubizzle?

Sisense for Cloud Data Teams (formerly Periscope Data).

New cards

In the job description, what soft-skill is highlighted by the phrase “liaise with other departments”?

Strong communication skills to translate business needs into technical specifications and explain technical concepts to non-technical stakeholders.

New cards

What are the three main storage layers of the modern data stack described (in order of raw to curated)?

Data Lake (raw), Data Warehouse (cleaned/structured), and Data Mart (department-specific curated subset).

New cards

Define a Data Warehouse in one sentence.

A centralized repository of cleansed, structured, and historically stored data optimized for analytical querying (OLAP).

New cards

How does a Data Lake differ from a Data Warehouse?

A Data Lake stores raw structured, semi-structured, and unstructured data in native format without mandatory schema-on-write, giving flexibility for exploratory analysis.

New cards

What is a Data Mart?

A smaller, business-line-specific subset of a Data Warehouse that improves query performance and simplifies access for a particular user group.

New cards

What is the key distinction between OLTP and OLAP systems?

OLTP supports many fast, atomic transactions for day-to-day operations; OLAP supports complex, large-scale analytical queries over historical data.

New cards

Which paradigm—ETL or ELT—do Matillion and Amazon Redshift together implement?

ELT: data is Extracted, Loaded into Redshift in raw form, and Transformed inside the warehouse using its MPP engine.

New cards

In dimensional modelling, what does a Fact table store?

Quantitative measurements (metrics) plus foreign keys pointing to related dimension tables.

New cards

What does a Dimension table provide in a star schema?

Descriptive, contextual attributes (the who, what, where, when, why) that qualify facts.

New cards

What is meant by the ‘grain’ of a fact table?

The level of detail represented by each row, e.g., “one row per listing per day.”

New cards

Name the two most common schema architectures in data warehousing.

Star Schema and Snowflake Schema.

New cards

Why is a Star Schema usually preferred in Amazon Redshift?

It minimizes joins, boosting query performance, and modern columnar compression reduces redundancy concerns.

New cards

What is a Slowly Changing Dimension (SCD) Type 2?

A technique that tracks historical changes by adding a new row for each change and marking old rows as expired, enabling point-in-time analysis.

New cards

Which SQL category contains commands like CREATE and ALTER?

DDL – Data Definition Language.

New cards

Which SQL statement category consists almost solely of SELECT queries?

DQL – Data Query Language.

New cards

Give one key difference between DELETE and TRUNCATE.

DELETE is a logged, row-by-row DML operation that can use a WHERE clause; TRUNCATE is a fast DDL operation that removes all rows and usually cannot be rolled back.

New cards

When should you use UNION ALL instead of UNION?

Whenever duplicate removal is not explicitly required, because UNION ALL is faster and avoids a costly sort/distinct step.

New cards

What clause is used to filter aggregated groups rather than individual rows?

HAVING.

New cards

Why are Common Table Expressions (CTEs) often preferred over deeply nested subqueries?

They improve readability, modularity, and maintainability by allowing step-by-step logical breakdowns.

New cards

Name two ranking window functions.

ROWNUMBER() and RANK() (also DENSERANK()).

New cards

Which window function would you use to access the previous row’s value?

LAG().

New cards

What does the SQL EXPLAIN command return?

The query execution plan, showing steps, join methods, estimated rows, and costs to aid performance tuning.

New cards

What architectural principle gives Amazon Redshift its parallel processing power?

MPP – Massively Parallel Processing, distributing data and query workload across multiple nodes and slices.

New cards

What role does the Redshift Leader Node play?

It parses, optimizes, compiles, and coordinates query execution across compute nodes, then aggregates results.

New cards

Why does Redshift store data in columnar format?

Columnar storage allows reading only needed columns, drastically cutting I/O for analytical queries and enabling better compression.

New cards

What is the recommended command for bulk-loading data into Redshift?

COPY.

New cards

Name two COPY-command best practices for performance.

Stage data in S3 and split it into multiple equal-sized compressed files (ideally 1 MB–1 GB, number of files ≈ number of slices).

New cards

What is a Redshift manifest file used for?

To list explicit S3 object paths for COPY, ensuring transactional, idempotent, and deterministic loads.

New cards

What does a DISTKEY do in Redshift?

Determines how rows are distributed across nodes by hashing a chosen column, impacting join performance and data movement.

New cards

When should you choose DISTSTYLE ALL?

For small dimension tables frequently joined to large facts, placing a full copy on every node to eliminate network shuffling.

New cards

What is the primary purpose of a SORTKEY in Redshift?

Ordering rows on disk to enable zone-map pruning, reducing I/O for queries with filters on the sorted columns.

New cards

Contrast COMPOUND and INTERLEAVED sort keys.

COMPOUND sorts by columns in listed order (good for prefix filters); INTERLEAVED gives equal weight to each column (better for varied filter columns but costlier to maintain).

New cards

What maintenance command re-sorts data and reclaims space in Redshift?

VACUUM.

New cards

In Matillion, what is an Orchestration Job?

A high-level workflow that schedules, sequences, and controls other jobs, scripts, and conditional logic.

New cards

What is a Transformation Job in Matillion?

A visual data-flow that defines actual transformations which Matillion converts into optimized SQL executed inside Redshift.

New cards

Explain Matillion’s ‘staging-first’ best practice.

Load raw data into a staging table with minimal changes, then transform from staging to cleaned fact/dimension tables in a separate job, improving robustness and debug-ability.

New cards

Which version-control system can Matillion natively integrate with?

Git.

New cards

Give two typical roles for Python in this stack.

Automating tasks (e.g., triggering Matillion via API) and extracting data from APIs to stage in S3.

New cards

Which Python library is commonly used to connect to Redshift?

psycopg2 (PostgreSQL adapter).

New cards

List the four pillars of Object-Oriented Programming.

Encapsulation, Inheritance, Polymorphism, and Abstraction.

New cards

What high-level steps make up the recommended system-design interview framework?

1 Clarify requirements, 2 Design high-level architecture, 3 Deep-dive data model, 4 Detail pipeline implementation, 5 Discuss data quality/monitoring, 6 Scalability & bottlenecks.

New cards

During a system-design round, why should you always start by clarifying requirements?

To understand scope, data volume, velocity, users, and latency needs, ensuring the proposed solution addresses the actual problem.

New cards

In STAR behavioral answers, what does each letter stand for?

Situation, Task, Action, Result.

New cards

What makes a good ‘Result’ in a STAR story?

Concrete, ideally quantified impact that shows success and learning (e.g., “reduced query time by 70%”).

New cards

Give an example of an insightful question to ask interviewers about their ELT stack.

“How does the team balance SQL optimization versus Redshift cluster tuning (e.g., WLM, node types) when improving performance?”

New cards

Why is UNION ALL usually faster than UNION?

UNION ALL skips the duplicate-removal step, avoiding an extra sort/distinct operation.

New cards

Which SQL clause would you use to rename a column or table temporarily in a query for readability?

AS (alias).

New cards

What is a materialized view in Redshift?

A view whose query result is pre-computed and stored on disk, enabling near-instant query responses, periodically refreshed by the pipeline.

New cards

How does Redshift Spectrum extend Redshift’s capabilities?

It allows querying data directly in S3 without loading it into Redshift tables, blending external and internal data sources.

New cards

What is Workload Management (WLM) in Redshift used for?

Allocating cluster resources among query queues to balance and prioritize workloads for consistent performance.

New cards

In Redshift, what problem does data skew cause?

Uneven data distribution leads to some nodes doing more work, slowing down parallel queries and reducing performance.

New cards

What is the purpose of a zone map in Redshift?

Metadata storing min/max values for each 1-MB block to enable skipping irrelevant blocks during scans, leveraging sort keys.

New cards

Why should INSERT statements be avoided for bulk loads into Redshift?

They process rows serially, bypass parallelism, and are far less efficient than the COPY command.

New cards

Which SQL command resets identity columns while rapidly removing all rows?

TRUNCATE.

New cards

What does the COALESCE function do in SQL?

Returns the first non-NULL value in a list of expressions, useful for handling NULLs.

New cards

Describe the difference between a View and a Table.

A table stores data physically; a view stores only a SQL definition and returns results dynamically (unless materialized).

New cards

What is dbt (mentioned indirectly under data quality frameworks)?

A transformation/workflow tool that enables version-controlled, testable, SQL-based ELT pipelines (not used directly at Bayut but conceptually relevant).

New cards

In Redshift COPY, why is file compression (e.g., GZIP) recommended?

It lowers S3 storage costs, reduces network transfer size, and accelerates load times since Redshift can decompress on the fly.

New cards

What does the ‘Write-Audit-Publish’ pattern ensure in data pipelines?

Data is loaded to a temp table, validated, and only then swapped into production tables, preventing bad data from reaching consumers.

New cards

Which Redshift distribution style is default when none is specified?

DISTSTYLE EVEN.

New cards

What does the LAG() window function commonly help calculate?

Differences or changes by comparing current row values with previous row values (e.g., day-over-day metrics).

New cards

Give a use case for a CROSS JOIN.

Generating every possible combination of two small tables, such as creating a date-dimensional scaffold for all products.

New cards

Why is the COPY file count ideally a multiple of total slices in a Redshift cluster?

So each slice loads one file in parallel, fully utilizing cluster resources and maximizing throughput.

New cards

What is the key benefit of staging raw data in Amazon S3 before loading to Redshift?

Decouples data ingestion from warehouse processing and leverages COPY’s parallelism from S3, a highly integrated, scalable storage layer.

New cards

What is an alias for Periscope Data after rebranding?

Sisense for Cloud Data Teams.

New cards

How can materialized views improve dashboard performance in Periscope/Sisense?

By precomputing heavy aggregations so dashboards query the fast materialized result instead of large base tables.

New cards

What Redshift command would you inspect when tuning long-running queries?

EXPLAIN (and possibly EXPLAIN ANALYZE for actual run-time stats).

New cards

Name two Python libraries commonly used for data manipulation and visualization in exploratory work.

Pandas and Matplotlib (or Seaborn).

New cards

What is Encapsulation in OOP?

The concept of bundling data and methods operating on that data within one class, hiding implementation details from outside access.

New cards

Which Redshift key strategy is recommended for a small dimension table frequently joined to large facts?

DISTSTYLE ALL plus a simple COMPOUND SORTKEY on its primary key.

New cards

Why are date columns often chosen as leading SORTKEYs?

Most analytical queries filter by time ranges, so sorting by date maximizes block pruning and speeds scans.

New cards

What should you include in Matillion to alert teams on pipeline failures?

Notification components (e.g., email or Slack) tied to job success/failure paths.

New cards

What AWS service can stream website events into S3 for later Redshift loading (mentioned in system-design scenario)?

Amazon Kinesis Data Firehose.

New cards

How does INTERLEAVED SORTKEY affect VACUUM operations?

It requires more processing during VACUUM, making maintenance heavier than for COMPOUND keys.

New cards

What does the Redshift VACUUM DELETE phase accomplish?

Reclaims space from rows marked for deletion, freeing disk and restoring sort order.

New cards

In Python, what is a virtual environment and why use it?

An isolated environment that keeps project-specific package versions separate, preventing dependency conflicts.

New cards

When presenting system design, why is it critical to mention data-quality checks?

They demonstrate foresight in ensuring reliability, preventing bad data from propagating, and reflect real-world production considerations.

New cards

What SQL clause allows hierarchical query referencing within the same statement for readability?

WITH, which defines a Common Table Expression (CTE).