Bayut & dubizzle Data Engineer Internship – Interview Prep Flashcards

1. Role Overview & Job-Description Deconstruction

  • Bayut & dubizzle run a "modern, cloud-centric" environment; intern expected to be versatile, proactive, code-quality driven.
  • Core phrases unpacked:
    • “Raw data integrations…” ⇒ Extract stage; sources = prod DBs (PostgreSQL/MySQL), internal micro-services, 3rd-party APIs (Google Analytics, Salesforce). Must know formats (JSON, CSV, Parquet) & connection methods (API, JDBC/ODBC); tackle rate-limits & schema drift.
    • “Warehouse modelling for operational & application layers” ⇒ Intern will design schemas, not just move data; implies bronze-silver-gold layering & dimensional modelling.
    • Stack anchors: Amazon Redshift, SQL, Python, Matillion ETL, Periscope Data (Sisense). Work under agile flow & coding standards.
    • “ELT design in Matillion” ⇒ Visual orchestration; SQL pushed-down to Redshift.
    • “Advanced visualisations in Periscope” ⇒ Quality of pipelines ↔ dashboard performance; requires end-to-end mindset.
    • “Liaise with other departments” ⇒ Strong written/oral communication; translate vague biz questions → concrete specs.

2. Tech-Stack: Cloud Data Flow (Bird’s-Eye)

  • Sources: internal DBs/logs + 3rd-party SaaS.
  • Ingestion/Orchestration: Matillion—schedules, dependencies, connectors, error-handling.
  • Storage & Compute: Redshift (MPP, columnar). Raw often lands in S3 first.
  • BI/Consumption: Periscope connects live to Redshift.
  • Languages: SQL dominant; Python for scripting / API extractions / automation.

3. Culture Markers: Entrepreneurial Spirit & Curiosity

  • Take ownership beyond ticket scope; e.g., detect recurring data-quality issue, embed validation job.
  • Ask “why?” behind every request; leads to correct grains & definitions (e.g., “user” differs per dept).
  • Interview: prepare examples showing self-initiated improvements & domain curiosity.

4. Foundational Data-Engineering Concepts

4.1 Modern Repositories

  • Data Warehouse (DWH): Cleansed, structured, query-optimised; e.g., Redshift.
  • Data Lake: Raw, schema-on-read; usually S3; feeds DWH.
  • Data Mart: Department-specific slice of DWH.

4.2 Processing Paradigms

  • OLTP: Short, atomic transactions, highly normalised.
  • OLAP: Complex, read-heavy analytics, denormalised (star); Redshift.

4.3 Data-Flow Philosophies

  • ETL (Extract-Transform-Load): transform on separate server before warehouse.
  • ELT (Extract-Load-Transform): load raw to warehouse, transform in-place via its engine—Matillion + Redshift exemplify.

4.4 Dimensional Modelling

  • Fact tables: numeric metrics + FKs.
  • Dimension tables: descriptive context.
  • Granularity: single most critical design decision.

4.5 Schema Types

  • Star: central fact surrounded by denormalised dims—fewer joins, best for Redshift.
  • Snowflake: normalised dims—less storage, slower queries.

4.6 Slowly Changing Dimensions (SCD)

  • Type 1: overwrite.
  • Type 2: new row + date range & surrogate key (standard for history).
  • Type 3: add new column (limited).

5. SQL Toolkit for Data Engineers

5.1 Command Families

  • DDL (CREATE, ALTER, DROP, TRUNCATE)
  • DML (INSERT, UPDATE, DELETE)
  • DQL (SELECT)
  • DCL (GRANT, REVOKE)
  • TCL (COMMIT, ROLLBACK, SAVEPOINT)

5.2 Query Construction Essentials

  • CTEs vs subqueries; aliases; INNER / OUTER / CROSS joins.

5.3 Window Functions

  • Ranking: ROWNUMBER, RANK, DENSERANK.
  • Aggregates: SUM() OVER.
  • Offsets: LAG, LEAD.

5.4 Practical Distinctions

  • UNION vs UNION ALL (dedupe cost).
  • DELETE vs TRUNCATE vs DROP.
  • WHERE (row filter pre-agg) vs HAVING (group filter post-agg).
  • Prefer CASE over vendor-specific IF.

5.5 Performance Fundamentals

  • Index concept (B-Tree) & Redshift’s alternatives (sort keys, zone maps).
  • Use EXPLAIN\text{EXPLAIN} to inspect join type, estimated rows & cost.

6. Amazon Redshift Deep-Dive

6.1 Architecture

  • Leader Node parses, optimises, compiles query pieces → distributes to compute nodes.
  • Compute Nodes split into slices, execute in parallel (MPP).
  • Columnar storage reduces I/O for analytic columns.

6.2 Bulk Loading with COPY

  • Golden rule: never mass-INSERT; use COPY.
  • Stage files in S3; split into N files ≈ multiple of slices; 1 MB–1 GB compressed (GZIP/BZIP2).
  • Use manifest JSON for idempotent loads.

6.3 Table-Design Levers

  • DISTKEY: KEY, ALL, EVEN—choose to minimise network shuffle.
  • SORTKEY: COMPOUND (prefix-friendly) vs INTERLEAVED.
  • VACUUM: re-sort, reclaim space.

6.4 Decision Matrix (quick rules)

  • Large fact joined on userid, date filters ⇒ DISTKEY(userid), COMPOUND SORTKEY(date).
  • Small dim ⇒ DISTSTYLE ALL.
  • Temp staging ⇒ EVEN, no sort key.

7. Matillion Orchestration Principles

  • ELT philosophy: GUI → auto-generated SQL pushed to Redshift.
  • Job types:
    • Orchestration (control flow, branching, notifications).
    • Transformation (business logic; compiles to single CTAS / INSERT … SELECT).
  • Best practices: load raw to staging first; separate transform job; integrate with Git; leverage connectors.

8. Python’s Supporting Role

  • Automation scripts, API ingestion, pre-load validation.
  • Libraries: pandas for data munging; psycopg2 for Redshift/PostgreSQL.
  • OOP basics: class/object, encapsulation, inheritance, polymorphism, abstraction; abstract class vs interface.

9. Periscope Data (Sisense) – BI Layer

  • Engineers ensure star schemas & materialised views enable fast dashboards.
  • Provide views/materialised views for analysts; schedule refreshes via Matillion.

10. System-Design Interview Framework

  1. Clarify requirements: volume, velocity, variety, latency, consumers.
  2. Sketch high-level architecture (Source → Ingest → Storage/Processing → Serving).
  3. Model data: define facts/dims; pick DISTKEY & SORTKEY.
  4. Detail ELT pipeline (COPY → stage → transform → publish).
  5. Plan data-quality & monitoring (Write-Audit-Publish, alerts, idempotency).
  6. Discuss scalability & bottlenecks.

Example: User-Activity Pipeline (Daily Batch)

  • Kinesis Firehose → S3 (partitioned dt=YYYY-MM-DD) → Matillion COPY → stagingevents (SUPER) → sessionisation transform → factuseractivity (DISTKEY userkey, SORTKEY(eventtimestamp,userkey)).
  • Dashboards read via Periscope; latency ≤ 24 h.

Example: "Property Insights" Data Model

  • factlistingdailyperformance (dist=agentkey, sort=(datekey,agentkey)).
  • dimlisting (SCD-2), dimagent (ALL), dim_date (ALL).

11. Behavioral Preparation (STAR Method)

  • Prepare ≥ 3 stories: technical challenge, rapid learning, cross-functional communication.
  • Emphasise ownership, curiosity, measurable results (e.g., "reduced query time by 70% using correct SORTKEY").
  • Sample questions to ask interviewers: performance-tuning focus, data-quality frameworks, onboarding process, upcoming data challenges.

12. Consolidated Question Bank & 48-Hour Checklist

  • 90+ technical & behavioral questions covering: warehouses, SCDs, SQL joins, Redshift COPY best practices, Matillion push-down, Python OOP.
  • Last-minute actions:
    • Re-whiteboard two system-design scenarios.
    • Practise window-function query on paper.
    • Rehearse STAR stories.
    • Memorise key DIST/SORT decision rules.
    • Finalise insightful interviewer questions.
    • Test interview tech; sleep well.

13. Key Formulae & Definitions (Quick-Glance)

  • Time on Market: \text{TOM} = \text{Off_Market_Date} - \text{Listing_Date}
  • Running total example: SUM(sales)  OVER  (PARTITION  BY  country)\text{SUM(sales)}\;OVER\;(PARTITION\;BY\;country)
  • ROW_NUMBER window: ROWNUMBER()  OVER  (PARTITION  BY  dept  ORDER  BY  salary  DESC)ROW_NUMBER()\;OVER\;(PARTITION\;BY\;dept\;ORDER\;BY\;salary\;DESC)

Use this structured outline as a full-fledged study notebook; every bullet links back to explicit statements in the transcript while organising them for rapid revision.