Bayut & dubizzle Data Engineer Internship – Interview Prep Flashcards

1. Role Overview & Job-Description Deconstruction

Bayut & dubizzle run a "modern, cloud-centric" environment; intern expected to be versatile, proactive, code-quality driven.
Core phrases unpacked:
- “Raw data integrations…” ⇒ Extract stage; sources = prod DBs (PostgreSQL/MySQL), internal micro-services, 3rd-party APIs (Google Analytics, Salesforce). Must know formats (JSON, CSV, Parquet) & connection methods (API, JDBC/ODBC); tackle rate-limits & schema drift.
- “Warehouse modelling for operational & application layers” ⇒ Intern will design schemas, not just move data; implies bronze-silver-gold layering & dimensional modelling.
- Stack anchors: Amazon Redshift, SQL, Python, Matillion ETL, Periscope Data (Sisense). Work under agile flow & coding standards.
- “ELT design in Matillion” ⇒ Visual orchestration; SQL pushed-down to Redshift.
- “Advanced visualisations in Periscope” ⇒ Quality of pipelines ↔ dashboard performance; requires end-to-end mindset.
- “Liaise with other departments” ⇒ Strong written/oral communication; translate vague biz questions → concrete specs.

2. Tech-Stack: Cloud Data Flow (Bird’s-Eye)

Sources: internal DBs/logs + 3rd-party SaaS.
Ingestion/Orchestration: Matillion—schedules, dependencies, connectors, error-handling.
Storage & Compute: Redshift (MPP, columnar). Raw often lands in S3 first.
BI/Consumption: Periscope connects live to Redshift.
Languages: SQL dominant; Python for scripting / API extractions / automation.

3. Culture Markers: Entrepreneurial Spirit & Curiosity

Take ownership beyond ticket scope; e.g., detect recurring data-quality issue, embed validation job.
Ask “why?” behind every request; leads to correct grains & definitions (e.g., “user” differs per dept).
Interview: prepare examples showing self-initiated improvements & domain curiosity.

4. Foundational Data-Engineering Concepts

4.1 Modern Repositories

Data Warehouse (DWH): Cleansed, structured, query-optimised; e.g., Redshift.
Data Lake: Raw, schema-on-read; usually S3; feeds DWH.
Data Mart: Department-specific slice of DWH.

4.2 Processing Paradigms

OLTP: Short, atomic transactions, highly normalised.
OLAP: Complex, read-heavy analytics, denormalised (star); Redshift.

4.3 Data-Flow Philosophies

ETL (Extract-Transform-Load): transform on separate server before warehouse.
ELT (Extract-Load-Transform): load raw to warehouse, transform in-place via its engine—Matillion + Redshift exemplify.

4.4 Dimensional Modelling

Fact tables: numeric metrics + FKs.
Dimension tables: descriptive context.
Granularity: single most critical design decision.

4.5 Schema Types

Star: central fact surrounded by denormalised dims—fewer joins, best for Redshift.
Snowflake: normalised dims—less storage, slower queries.

4.6 Slowly Changing Dimensions (SCD)

Type 1: overwrite.
Type 2: new row + date range & surrogate key (standard for history).
Type 3: add new column (limited).

5. SQL Toolkit for Data Engineers

5.1 Command Families

DDL (CREATE, ALTER, DROP, TRUNCATE)
DML (INSERT, UPDATE, DELETE)
DQL (SELECT)
DCL (GRANT, REVOKE)
TCL (COMMIT, ROLLBACK, SAVEPOINT)

5.2 Query Construction Essentials

CTEs vs subqueries; aliases; INNER / OUTER / CROSS joins.

5.3 Window Functions

Ranking: ROWNUMBER, RANK, DENSERANK.
Aggregates: SUM() OVER.
Offsets: LAG, LEAD.

5.4 Practical Distinctions

UNION vs UNION ALL (dedupe cost).
DELETE vs TRUNCATE vs DROP.
WHERE (row filter pre-agg) vs HAVING (group filter post-agg).
Prefer CASE over vendor-specific IF.

5.5 Performance Fundamentals

Index concept (B-Tree) & Redshift’s alternatives (sort keys, zone maps).
Use $\text{EXPLAIN}$ to inspect join type, estimated rows & cost.

6. Amazon Redshift Deep-Dive

6.1 Architecture

Leader Node parses, optimises, compiles query pieces → distributes to compute nodes.
Compute Nodes split into slices, execute in parallel (MPP).
Columnar storage reduces I/O for analytic columns.

6.2 Bulk Loading with COPY

Golden rule: never mass-INSERT; use COPY.
Stage files in S3; split into N files ≈ multiple of slices; 1 MB–1 GB compressed (GZIP/BZIP2).
Use manifest JSON for idempotent loads.

6.3 Table-Design Levers

DISTKEY: KEY, ALL, EVEN—choose to minimise network shuffle.
SORTKEY: COMPOUND (prefix-friendly) vs INTERLEAVED.
VACUUM: re-sort, reclaim space.

6.4 Decision Matrix (quick rules)

Large fact joined on userid, date filters ⇒ DISTKEY(userid), COMPOUND SORTKEY(date).
Small dim ⇒ DISTSTYLE ALL.
Temp staging ⇒ EVEN, no sort key.

7. Matillion Orchestration Principles

ELT philosophy: GUI → auto-generated SQL pushed to Redshift.
Job types:
- Orchestration (control flow, branching, notifications).
- Transformation (business logic; compiles to single CTAS / INSERT … SELECT).
Best practices: load raw to staging first; separate transform job; integrate with Git; leverage connectors.

8. Python’s Supporting Role

Automation scripts, API ingestion, pre-load validation.
Libraries: pandas for data munging; psycopg2 for Redshift/PostgreSQL.
OOP basics: class/object, encapsulation, inheritance, polymorphism, abstraction; abstract class vs interface.

9. Periscope Data (Sisense) – BI Layer

Engineers ensure star schemas & materialised views enable fast dashboards.
Provide views/materialised views for analysts; schedule refreshes via Matillion.

10. System-Design Interview Framework

Clarify requirements: volume, velocity, variety, latency, consumers.
Sketch high-level architecture (Source → Ingest → Storage/Processing → Serving).
Model data: define facts/dims; pick DISTKEY & SORTKEY.
Detail ELT pipeline (COPY → stage → transform → publish).
Plan data-quality & monitoring (Write-Audit-Publish, alerts, idempotency).
Discuss scalability & bottlenecks.

Example: User-Activity Pipeline (Daily Batch)

Kinesis Firehose → S3 (partitioned dt=YYYY-MM-DD) → Matillion COPY → stagingevents (SUPER) → sessionisation transform → factuseractivity (DISTKEY userkey, SORTKEY(eventtimestamp,userkey)).
Dashboards read via Periscope; latency ≤ 24 h.

Example: "Property Insights" Data Model

factlistingdailyperformance (dist=agentkey, sort=(datekey,agentkey)).
dimlisting (SCD-2), dimagent (ALL), dim_date (ALL).

11. Behavioral Preparation (STAR Method)

Prepare ≥ 3 stories: technical challenge, rapid learning, cross-functional communication.
Emphasise ownership, curiosity, measurable results (e.g., "reduced query time by 70% using correct SORTKEY").
Sample questions to ask interviewers: performance-tuning focus, data-quality frameworks, onboarding process, upcoming data challenges.

12. Consolidated Question Bank & 48-Hour Checklist

90+ technical & behavioral questions covering: warehouses, SCDs, SQL joins, Redshift COPY best practices, Matillion push-down, Python OOP.
Last-minute actions:
- Re-whiteboard two system-design scenarios.
- Practise window-function query on paper.
- Rehearse STAR stories.
- Memorise key DIST/SORT decision rules.
- Finalise insightful interviewer questions.
- Test interview tech; sleep well.

13. Key Formulae & Definitions (Quick-Glance)

Time on Market: \text{TOM} = \text{Off_Market_Date} - \text{Listing_Date}
Running total example: $\text{SUM(sales)}\;OVER\;(PARTITION\;BY\;country)$
ROW_NUMBER window: $ROW_NUMBER()\;OVER\;(PARTITION\;BY\;dept\;ORDER\;BY\;salary\;DESC)$

Use this structured outline as a full-fledged study notebook; every bullet links back to explicit statements in the transcript while organising them for rapid revision.