Data Science pt.1 (INTRO)

Description and Tags

An intro to the data science/engineering terms you might not fully understand right now, mainly going through theory, as SQL and Python have a different library

Last updated 7:31 PM on 5/17/26

10 Terms

What is DAP?

DAP (Data Analytics Platform): A centralised environment/infrastructure that houses all the tools, data, and pipelines needed to collect, process, store, and analyse data at scale.

Think of it as the "ecosystem" your data lives and moves through — from raw ingestion all the way to final reporting.

Ties back to DEV → UAT → PROD:

DAP is the overarching platform that hosts all three environments

Think of it as the "stadium" — DEV, UAT, and PROD are just different rooms inside it.

What is a CSV?

CSV (Comma-Separated Values): A plain text file format that stores tabular data where each line is a row and each value is separated by a comma.

Example:

name,age,city
Alice,30,New York
Bob,25,Austin

Think of it as a stripped-down spreadsheet — no formatting, no formulas, just raw data.
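A quick sketch of reading that example with Python's built-in csv module. The data sits in a string here just to keep the example self-contained; a real file would be opened with open() instead.

```python
import csv
import io

# Same data as the card's example, held in a string instead of a file.
raw = "name,age,city\nAlice,30,New York\nBob,25,Austin\n"

# DictReader uses the first line as the column names.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    # Every value arrives as text, so convert types yourself (age -> int).
    print(row["name"], int(row["age"]), row["city"])
```

Note that a CSV carries no data types: "30" is just text until the code converts it.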

What is DEV?

DEV (Development Environment): A sandbox environment where data engineers/developers build and test code, pipelines, or models before pushing to production. It's where you experiment freely without breaking anything real.

DAP (Data Analytics Platform) connection — DEV is typically the first stage in a DAP pipeline:

DEV → UAT (User Acceptance Testing) → PROD (Production, live)

Think of it as the "draft mode" of your data workflow.

What is a SCHEMA?

A defined structure or blueprint that describes how data is organised — including table names, column names, data types, and relationships.

Example:

users table:
- id        (integer)
- name      (varchar)
- sign_up   (date)

Think of it as the "rules of the house" — it tells the database exactly what kind of data is allowed and where it lives.
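To make the "rules of the house" idea concrete, here is a minimal sketch in plain Python. The USERS_SCHEMA dict and the validate function are hypothetical names made up for illustration, and a real database enforces its schema itself rather than through code like this.

```python
from datetime import date

# Hypothetical schema for the users table: column name -> expected Python type.
USERS_SCHEMA = {"id": int, "name": str, "sign_up": date}

def validate(record, schema):
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = []
    for column, expected in schema.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            problems.append(f"{column} should be {expected.__name__}")
    for column in record:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems

# Made-up records: one that follows the rules, one that breaks them.
good = {"id": 1, "name": "Alice", "sign_up": date(2024, 1, 15)}
bad = {"id": "one", "name": "Bob"}

print(validate(good, USERS_SCHEMA))  # []
print(validate(bad, USERS_SCHEMA))   # ['id should be int', 'missing column: sign_up']
```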

PySpark

The Python interface for Apache Spark — a framework used to process and analyze massive datasets across multiple machines (distributed computing).

Why it matters in data engineering:

Handles data too large for pandas to process on a single machine
Runs operations in parallel across a cluster for speed

Quick comparison:

Pandas   → works on your laptop, small-medium data
PySpark  → works across many machines, big data

Think of it as pandas on steroids — built for when your data is too big to fit in memory.
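The "split the data, work on the pieces in parallel, combine the results" idea can be sketched without Spark at all. This is not PySpark: it uses threads on one machine from Python's standard library as a stand-in, and the chunk and partial_sum helpers are hypothetical names. Spark applies the same pattern across the machines of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(values, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def partial_sum(piece):
    """Work done independently on one piece (think: one partition)."""
    return sum(piece)

data = list(range(1, 101))   # stand-in for a much larger dataset
pieces = chunk(data, 25)     # 4 pieces of 25 numbers each

# Each piece is summed separately, then the partial results are combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, pieces))

total = sum(partials)
print(partials, total)       # [325, 950, 1575, 2200] 5050
```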

Hive

A data warehouse tool built on top of Hadoop that lets you query large datasets stored in distributed storage using a SQL-like language called HiveQL.

Key points:

Designed for big data — reads data stored across many machines
Uses HiveQL which looks almost identical to SQL

sql

SELECT name, age FROM users WHERE age > 30;

Where it fits:

Raw Data → Hive (query & organize) → Analytics/Reporting

Think of it as SQL for big data — instead of querying a traditional database, you're querying massive files sitting in a distributed system like HDFS.
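To see what the query above returns, here is a small sketch that runs it with Python's built-in sqlite3 module. SQLite is only a stand-in: Hive would run the same kind of statement over files in distributed storage, and the three rows below are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a tiny stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Alice", 30), ("Bob", 25), ("Carol", 41)],  # made-up rows
)

# The same query as on the card.
result = conn.execute("SELECT name, age FROM users WHERE age > 30").fetchall()
print(result)  # [('Carol', 41)]
conn.close()
```

Alice is left out because the condition is strictly greater than 30.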

What is a dataframe?

A two-dimensional data structure that organises data into rows and columns — like a table in a spreadsheet or database, but live in memory and ready to manipulate with code.

Example (in PySpark or Pandas):

| id | name  | age |
|----|-------|-----|
| 1  | Alice | 30  |
| 2  | Bob   | 25  |

Key traits:

Each column has a name and data type
Each row is a single record

Think of it as a programmable spreadsheet — you can filter, sort, join, and transform it with just a few lines of code.
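A short pandas sketch of the filter, sort, join, and transform operations mentioned above, using the table from this card. The cities table and the age_next_year column are hypothetical additions, included only so there is something to join and something to derive.

```python
import pandas as pd

# The table from the card, built as a pandas DataFrame.
df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"], "age": [30, 25]})

# Hypothetical second table, used only to show a join.
cities = pd.DataFrame({"id": [1, 2], "city": ["New York", "Austin"]})

older = df[df["age"] > 26]            # filter: keep rows where age > 26
by_age = df.sort_values("age")        # sort: youngest first
joined = df.merge(cities, on="id")    # join: match rows on the shared id column
df["age_next_year"] = df["age"] + 1   # transform: add a derived column

print(joined)
```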

Pandas

A Python library used for data manipulation and analysis — built around the DataFrame structure, making it easy to clean, explore, and transform data.

Common operations:

python

import pandas as pd

df = pd.read_csv("file.csv")                    # load data into a DataFrame
df.head()                                       # preview first 5 rows
df.dropna()                                     # return a copy without rows that have missing values
df.groupby("category").mean(numeric_only=True)  # average the numeric columns per category

Where it fits:

Raw Data → Pandas (clean & transform) → Analysis/Visualisation

Think of it as your Swiss army knife for small-to-medium datasets — the go-to first tool for any data professional working in Python.

lil note:⚠ When data gets too large for Pandas → switch to PySpark

What is a cache?

A temporary storage layer that saves the results of an operation so it doesn't have to be recomputed or re-fetched every time it's needed — making processes faster and more efficient.

Example in PySpark:

python

df.cache()  # marks the DataFrame for caching; Spark stores it the first time an action runs, then reuses it

Why it matters:

Without cache → recalculates from scratch every time (slow)
With cache    → pulls saved result from memory (fast)

Think of it as a shortcut memory — instead of redoing expensive work, it keeps the result on standby for quick access.
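The with/without contrast can also be seen outside Spark. A minimal sketch using functools.lru_cache from Python's standard library, where slow_square is a hypothetical stand-in for an expensive computation:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def slow_square(n):
    """Stand-in for an expensive computation."""
    return n * n

slow_square(12)   # first call: computed (a cache miss)
slow_square(12)   # same input again: served from the cache (a hit)
slow_square(5)    # new input: computed (another miss)

info = slow_square.cache_info()
print(info.hits, info.misses)  # 1 2

slow_square.cache_clear()      # release the cached results when done
```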

lil note:⚠ Caches use memory, so release them when no longer needed (in PySpark: df.unpersist()) to avoid overload.

What is NumPy?

A Python library used for numerical computing — built around arrays, making it fast and efficient at performing mathematical operations on large sets of numbers.

Common operations:

python

import numpy as np

np.array([1, 2, 3])        # create an array
np.mean([10, 20, 30])      # calculate mean
np.sum([10, 20, 30])       # sum values
np.sqrt(144)               # square root → 12.0

Key difference from Pandas:

NumPy  → works with numbers & mathematical operations (arrays)
Pandas → works with structured tabular data (DataFrames)

Think of it as the mathematical engine under the hood — in fact, Pandas itself is built on top of NumPy.

lil note:💡 If Pandas is the spreadsheet, NumPy is the calculator powering it.
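What makes NumPy the "calculator" is that one expression applies to every element of an array at once, with no explicit loop. A small sketch with made-up numbers:

```python
import numpy as np

prices = np.array([10.0, 20.0, 30.0])   # made-up numbers

doubled = prices * 2                    # applies to every element at once
centered = prices - prices.mean()       # subtract the mean (20.0) from each value

print(doubled)    # [20. 40. 60.]
print(centered)   # [-10.   0.  10.]
```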