Bayut & dubizzle Data Engineer Internship – Interview Prep Flashcards
1. Role Overview & Job-Description Deconstruction
- Bayut & dubizzle run a "modern, cloud-centric" environment; intern expected to be versatile, proactive, code-quality driven.
- Core phrases unpacked:
- “Raw data integrations…” ⇒ Extract stage; sources = prod DBs (PostgreSQL/MySQL), internal micro-services, 3rd-party APIs (Google Analytics, Salesforce). Must know formats (JSON, CSV, Parquet) & connection methods (API, JDBC/ODBC); tackle rate-limits & schema drift.
- “Warehouse modelling for operational & application layers” ⇒ Intern will design schemas, not just move data; implies bronze-silver-gold layering & dimensional modelling.
- Stack anchors: Amazon Redshift, SQL, Python, Matillion ETL, Periscope Data (Sisense). Work under agile flow & coding standards.
- “ELT design in Matillion” ⇒ Visual orchestration; SQL pushed-down to Redshift.
- “Advanced visualisations in Periscope” ⇒ Quality of pipelines ↔ dashboard performance; requires end-to-end mindset.
- “Liaise with other departments” ⇒ Strong written/oral communication; translate vague biz questions → concrete specs.
2. Tech-Stack: Cloud Data Flow (Bird’s-Eye)
- Sources: internal DBs/logs + 3rd-party SaaS.
- Ingestion/Orchestration: Matillion—schedules, dependencies, connectors, error-handling.
- Storage & Compute: Redshift (MPP, columnar). Raw often lands in S3 first.
- BI/Consumption: Periscope connects live to Redshift.
- Languages: SQL dominant; Python for scripting / API extractions / automation.
3. Culture Markers: Entrepreneurial Spirit & Curiosity
- Take ownership beyond ticket scope; e.g., detect recurring data-quality issue, embed validation job.
- Ask “why?” behind every request; leads to correct grains & definitions (e.g., “user” differs per dept).
- Interview: prepare examples showing self-initiated improvements & domain curiosity.
4. Foundational Data-Engineering Concepts
4.1 Modern Repositories
- Data Warehouse (DWH): Cleansed, structured, query-optimised; e.g., Redshift.
- Data Lake: Raw, schema-on-read; usually S3; feeds DWH.
- Data Mart: Department-specific slice of DWH.
4.2 Processing Paradigms
- OLTP: Short, atomic transactions, highly normalised.
- OLAP: Complex, read-heavy analytics, denormalised (star); Redshift.
4.3 Data-Flow Philosophies
- ETL (Extract-Transform-Load): transform on separate server before warehouse.
- ELT (Extract-Load-Transform): load raw to warehouse, transform in-place via its engine—Matillion + Redshift exemplify.
4.4 Dimensional Modelling
- Fact tables: numeric metrics + FKs.
- Dimension tables: descriptive context.
- Granularity: single most critical design decision.
4.5 Schema Types
- Star: central fact surrounded by denormalised dims—fewer joins, best for Redshift.
- Snowflake: normalised dims—less storage, slower queries.
4.6 Slowly Changing Dimensions (SCD)
- Type 1: overwrite.
- Type 2: new row + date range & surrogate key (standard for history).
- Type 3: add new column (limited).
5. SQL Toolkit for Data Engineers
5.1 Command Families
- DDL (CREATE, ALTER, DROP, TRUNCATE)
- DML (INSERT, UPDATE, DELETE)
- DQL (SELECT)
- DCL (GRANT, REVOKE)
- TCL (COMMIT, ROLLBACK, SAVEPOINT)
5.2 Query Construction Essentials
- CTEs vs subqueries; aliases; INNER / OUTER / CROSS joins.
5.3 Window Functions
- Ranking: ROWNUMBER, RANK, DENSERANK.
- Aggregates: SUM() OVER.
- Offsets: LAG, LEAD.
5.4 Practical Distinctions
- UNION vs UNION ALL (dedupe cost).
- DELETE vs TRUNCATE vs DROP.
- WHERE (row filter pre-agg) vs HAVING (group filter post-agg).
- Prefer CASE over vendor-specific IF.
5.5 Performance Fundamentals
- Index concept (B-Tree) & Redshift’s alternatives (sort keys, zone maps).
- Use to inspect join type, estimated rows & cost.
6. Amazon Redshift Deep-Dive
6.1 Architecture
- Leader Node parses, optimises, compiles query pieces → distributes to compute nodes.
- Compute Nodes split into slices, execute in parallel (MPP).
- Columnar storage reduces I/O for analytic columns.
6.2 Bulk Loading with COPY
- Golden rule: never mass-INSERT; use COPY.
- Stage files in S3; split into N files ≈ multiple of slices; 1 MB–1 GB compressed (GZIP/BZIP2).
- Use manifest JSON for idempotent loads.
6.3 Table-Design Levers
- DISTKEY: KEY, ALL, EVEN—choose to minimise network shuffle.
- SORTKEY: COMPOUND (prefix-friendly) vs INTERLEAVED.
- VACUUM: re-sort, reclaim space.
6.4 Decision Matrix (quick rules)
- Large fact joined on userid, date filters ⇒ DISTKEY(userid), COMPOUND SORTKEY(date).
- Small dim ⇒ DISTSTYLE ALL.
- Temp staging ⇒ EVEN, no sort key.
7. Matillion Orchestration Principles
- ELT philosophy: GUI → auto-generated SQL pushed to Redshift.
- Job types:
- Orchestration (control flow, branching, notifications).
- Transformation (business logic; compiles to single CTAS / INSERT … SELECT).
- Best practices: load raw to staging first; separate transform job; integrate with Git; leverage connectors.
8. Python’s Supporting Role
- Automation scripts, API ingestion, pre-load validation.
- Libraries: pandas for data munging; psycopg2 for Redshift/PostgreSQL.
- OOP basics: class/object, encapsulation, inheritance, polymorphism, abstraction; abstract class vs interface.
9. Periscope Data (Sisense) – BI Layer
- Engineers ensure star schemas & materialised views enable fast dashboards.
- Provide views/materialised views for analysts; schedule refreshes via Matillion.
10. System-Design Interview Framework
- Clarify requirements: volume, velocity, variety, latency, consumers.
- Sketch high-level architecture (Source → Ingest → Storage/Processing → Serving).
- Model data: define facts/dims; pick DISTKEY & SORTKEY.
- Detail ELT pipeline (COPY → stage → transform → publish).
- Plan data-quality & monitoring (Write-Audit-Publish, alerts, idempotency).
- Discuss scalability & bottlenecks.
Example: User-Activity Pipeline (Daily Batch)
- Kinesis Firehose → S3 (partitioned dt=YYYY-MM-DD) → Matillion COPY → stagingevents (SUPER) → sessionisation transform → factuseractivity (DISTKEY userkey, SORTKEY(eventtimestamp,userkey)).
- Dashboards read via Periscope; latency ≤ 24 h.
Example: "Property Insights" Data Model
- factlistingdailyperformance (dist=agentkey, sort=(datekey,agentkey)).
- dimlisting (SCD-2), dimagent (ALL), dim_date (ALL).
11. Behavioral Preparation (STAR Method)
- Prepare ≥ 3 stories: technical challenge, rapid learning, cross-functional communication.
- Emphasise ownership, curiosity, measurable results (e.g., "reduced query time by 70% using correct SORTKEY").
- Sample questions to ask interviewers: performance-tuning focus, data-quality frameworks, onboarding process, upcoming data challenges.
12. Consolidated Question Bank & 48-Hour Checklist
- 90+ technical & behavioral questions covering: warehouses, SCDs, SQL joins, Redshift COPY best practices, Matillion push-down, Python OOP.
- Last-minute actions:
- Re-whiteboard two system-design scenarios.
- Practise window-function query on paper.
- Rehearse STAR stories.
- Memorise key DIST/SORT decision rules.
- Finalise insightful interviewer questions.
- Test interview tech; sleep well.
13. Key Formulae & Definitions (Quick-Glance)
- Time on Market: \text{TOM} = \text{Off_Market_Date} - \text{Listing_Date}
- Running total example:
- ROW_NUMBER window:
Use this structured outline as a full-fledged study notebook; every bullet links back to explicit statements in the transcript while organising them for rapid revision.