An intro to all the different data science/engineering things you might not fully understand right now. Mainly going through theory, as SQL and Python have a separate library.
What is DAP?
DAP (Data Analytics Platform): a centralised environment/infrastructure that houses all the tools, data, and pipelines needed to collect, process, store, and analyse data at scale.
Think of it as the "ecosystem" your data lives and moves through — from raw ingestion all the way to final reporting.
Ties back to DEV → UAT → PROD:
DAP is the overarching platform that hosts all three environments.
Think of it as the "stadium": DEV, UAT, and PROD are just different rooms inside it.
What is a CSV?
CSV (Comma-Separated Values): a plain text file format that stores tabular data, where each line is a row and each value is separated by a comma.
Example:
name, age, city
Alice, 30, New York
Bob, 25, Austin
Think of it as a stripped-down spreadsheet: no formatting, no formulas, just raw data.
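A minimal sketch of reading the example above with Python's built-in `csv` module. The sample text is inlined with `io.StringIO` so no file has to exist on disk. `skipinitialspace=True` is used because the example has a space after each comma, which would otherwise become part of each value.

```python
import csv
import io

# The example rows from above, inlined so no file is needed (illustrative data).
raw = "name, age, city\nAlice, 30, New York\nBob, 25, Austin\n"

# skipinitialspace=True drops the space that follows each comma.
reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
rows = list(reader)

# Every value arrives as a string; a CSV file carries no data types.
ages = [int(row["age"]) for row in rows]

print(rows[0]["city"])  # New York
print(ages)             # [30, 25]
```

Note that the ages have to be converted with `int()` by hand: the file only stores text, so types are something the reader decides.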
What is DEV?
DEV (Development Environment): a sandbox environment where data engineers/developers build and test code, pipelines, or models before pushing to production. It's where you experiment freely without breaking anything real.
DAP (Data Analytics Platform) connection: DEV is typically the first stage in a DAP pipeline:
DEV → UAT (user acceptance testing) → PROD (live)
Think of it as the "draft mode" of your data workflow.
What is a SCHEMA?
A defined structure or blueprint that describes how data is organised — including table names, column names, data types, and relationships.
Example:
users table:
- id (integer)
- name (varchar)
- sign_up (date)
Think of it as the "rules of the house": it tells the database exactly what kind of data is allowed and where it lives.
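To make the `users` example concrete, here is a small sketch using Python's built-in `sqlite3` module with a throwaway in-memory database. The `VARCHAR(100)` length and the `NOT NULL` rule on `name` are assumptions added for illustration, and SQLite is only a stand-in here: it records declared column types but enforces them more loosely than most warehouse databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema: table name, column names, data types, and one rule (NOT NULL).
conn.execute(
    """
    CREATE TABLE users (
        id      INTEGER PRIMARY KEY,
        name    VARCHAR(100) NOT NULL,
        sign_up DATE
    )
    """
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(users)")]
print(columns)

# The schema rejects data that breaks its rules, e.g. a user with no name.
try:
    conn.execute("INSERT INTO users (id, name) VALUES (1, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)  # True
conn.close()
```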
PySpark
The Python interface for Apache Spark — a framework used to process and analyze massive datasets across multiple machines (distributed computing).
Why it matters in data engineering:
Handles data too large for pandas to process on a single machine
Runs operations in parallel across a cluster for speed
Quick comparison:
Pandas → works on your laptop, small-medium data
PySpark → works across many machines, big data
Think of it as pandas on steroids: built for when your data is too big to fit in memory on one machine.
Hive
A data warehouse tool built on top of Hadoop that lets you query large datasets stored in distributed storage using a SQL-like language called HiveQL.
Key points:
Designed for big data — reads data stored across many machines
Uses HiveQL which looks almost identical to SQL
sql
SELECT name, age FROM users WHERE age > 30;
Where it fits:
Raw Data → Hive (query & organize) → Analytics/Reporting
Think of it as SQL for big data: instead of querying a traditional database, you're querying massive files sitting in a distributed system like HDFS.
What is a dataframe?
A two-dimensional data structure that organises data into rows and columns — like a table in a spreadsheet or database, but live in memory and ready to manipulate with code.
Example (in PySpark or Pandas):
| id | name | age |
|----|-------|-----|
| 1 | Alice | 30 |
| 2 | Bob | 25 |
Key traits:
Each column has a name and data type
Each row is a single record
Think of it as a programmable spreadsheet — you can filter, sort, join, and transform it with just a few lines of code.
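As a rough illustration, the table above can be built as a pandas DataFrame and then filtered and sorted in a line each. The variable names (`df`, `over_26`, `by_age`) and the age threshold are arbitrary choices for this sketch.

```python
import pandas as pd

# Build the example table: each key is a column name, each list is that column.
df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "age": [30, 25]})

# Each column has its own data type.
print(df.dtypes)

# Filter: keep only the rows where age is above 26.
over_26 = df[df["age"] > 26]

# Sort: order the rows by age, youngest first.
by_age = df.sort_values("age")

print(over_26["name"].tolist())  # ['Alice']
print(by_age["name"].tolist())   # ['Bob', 'Alice']
```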
Pandas
A Python library used for data manipulation and analysis — built around the DataFrame structure, making it easy to clean, explore, and transform data.
Common operations:
python
import pandas as pd
df = pd.read_csv("file.csv") # load data into a DataFrame
df.head() # preview first 5 rows
df.dropna() # return a copy with rows containing missing values removed
df.groupby("category").mean(numeric_only=True) # average the numeric columns per category
Where it fits:
Raw Data → Pandas (clean & transform) → Analysis/Visualisation
Think of it as your Swiss army knife for small-to-medium datasets: the go-to first tool for any data professional working in Python.
lil note: ⚠ When data gets too large for Pandas → switch to PySpark
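Here is a small end-to-end sketch of the load, clean, and aggregate flow. The data is made up and inlined with `io.StringIO` so it runs without a file; the `category` and `price` column names are assumptions for illustration.

```python
import io

import pandas as pd

# Made-up sample data; the last row is missing its price on purpose.
raw = io.StringIO(
    "category,price\n"
    "fruit,2.0\n"
    "fruit,4.0\n"
    "veg,3.0\n"
    "veg,\n"
)

df = pd.read_csv(raw)   # load: the empty price becomes NaN
print(df.head())        # preview

clean = df.dropna()     # clean: returns a new DataFrame, df itself is unchanged

# Aggregate: average price per category.
avg_price = clean.groupby("category")["price"].mean()
print(avg_price.to_dict())  # {'fruit': 3.0, 'veg': 3.0}
```

One thing worth noticing: `dropna()` does not modify `df` in place, so the result has to be assigned to a name to be kept.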
What is a cache?
A temporary storage layer that saves the results of an operation so it doesn't have to be recomputed or re-fetched every time it's needed — making processes faster and more efficient.
Example in PySpark:
python
df.cache() # marks the DataFrame for caching; Spark stores it the first time an action runs
Why it matters:
Without cache → recalculates from scratch every time (slow)
With cache → pulls saved result from memory (fast)
Think of it as a shortcut memory: instead of redoing expensive work, it keeps the result on standby for quick access.
lil note: ⚠ Caches use memory, so always clear them when no longer needed to avoid overload (in PySpark: df.unpersist()).
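The same idea can be tried without a Spark cluster. This is a general-purpose sketch using `functools.lru_cache` from Python's standard library, not PySpark's caching; `slow_square` is a hypothetical stand-in for an expensive operation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_square(n: int) -> int:
    """Hypothetical expensive operation; the result is cached per argument."""
    return n * n


first = slow_square(12)   # computed (cache miss)
second = slow_square(12)  # returned from the cache (cache hit)

info = slow_square.cache_info()
print(info.hits, info.misses)  # 1 1

# Clear the cache once the results are no longer needed.
slow_square.cache_clear()
size_after_clear = slow_square.cache_info().currsize
print(size_after_clear)  # 0
```

The second call never runs the function body, which is the whole point of a cache, and `cache_clear()` plays the role of freeing the memory afterwards.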
What is numpy?
A Python library used for numerical computing — built around arrays, making it fast and efficient at performing mathematical operations on large sets of numbers.
Common operations:
python
import numpy as np
np.array([1, 2, 3]) # create an array
np.mean([10, 20, 30]) # calculate mean
np.sum([10, 20, 30]) # sum values
np.sqrt(144) # square root → 12.0
Key difference from Pandas:
NumPy → works with numbers & mathematical operations (arrays)
Pandas → works with structured tabular data (DataFrames)
Think of it as the mathematical engine under the hood: in fact, Pandas itself is built on top of NumPy.
lil note: 💡 If Pandas is the spreadsheet, NumPy is the calculator powering it.
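A short sketch of what "built around arrays" buys you: arithmetic on a NumPy array applies to every element at once, while the same operator on a plain Python list does something quite different. The numbers are arbitrary.

```python
import numpy as np

values = np.array([10, 20, 30])

# On an array, * 2 multiplies every element, with no explicit loop.
doubled = values * 2

# On a plain list, * 2 repeats the list instead.
as_list = [10, 20, 30] * 2

print(doubled.tolist())  # [20, 40, 60]
print(as_list)           # [10, 20, 30, 10, 20, 30]
print(values.mean())     # 20.0
```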