Computer Abstractions & Technology + Machine Learning Practice Flashcards

Chapter 1 — Computer Abstractions and Technology

1.1 Introduction & Classes of Computers

Pervasiveness of Computing: Computing is embedded in automobiles, mobile phones, the genome project, the Web, and search engines. Progress is driven by technology improvements and domain-specific accelerators.
Personal Computers (PCs): General-purpose systems designed to run a variety of software, subject to a cost/performance tradeoff.
Server Computers: Network-based systems that emphasize high capacity, performance, and reliability. They range in size from small servers to building-sized facilities.
Supercomputers: A specialized type of server designed for high-end scientific and engineering calculations. They possess the highest capability but represent a tiny fraction of the market.
Embedded Computers: Computers hidden inside other systems (e.g., in cars or appliances), characterized by stringent power, performance, and cost constraints.
The PostPC Era:
- Personal Mobile Devices (PMDs): Battery-operated, internet-connected devices costing hundreds of dollars (e.g., smartphones, tablets, e-glasses).
- Cloud Computing: Runs on Warehouse-Scale Computers (WSC). It delivers Software as a Service (SaaS), where applications are split between the PMD and the cloud (e.g., Amazon, Google).

1.2 Seven Great Ideas in Computer Architecture

Use abstraction to simplify design: Hiding details to make the system easier to understand.
Make the common case fast: Improving performance where it matters most.
Performance via parallelism: Executing multiple tasks simultaneously.
Performance via pipelining: Overlapping the execution of instructions.
Performance via prediction: Guessing the outcome of a decision to avoid delays.
Hierarchy of memories: Using different levels of memory to balance speed, size, and cost.
Dependability via redundancy: Including extra components to protect against failure.

1.3 Below Your Program

Software Layers:
- Application Software: Written in high-level languages (HLL).
- System Software: Includes the Compiler (translates HLL into machine code) and the Operating System (manages I/O, memory, storage, task scheduling, and resource sharing).
- Hardware: Includes the processor, memory, and I/O controllers.
Levels of Program Code:
- High-level language: Provides abstraction near the problem; offers productivity and portability.
- Assembly language: The textual form of instructions.
- Hardware representation: Binary digits (bits) encoding instructions and data.

1.4 Under the Covers

Five Classic Components: All computers share input, output, memory, datapath, and control.
The Processor: Composed of the Datapath (performs operations on data) and Control (sequences the datapath and memory).
Cache Memory: Small, fast SRAM used for immediate data access.
Displays and Interfaces:
- Touchscreens: Resistive vs. capacitive; capacitive is standard for tablets/phones as it allows multi-touch.
- LCD: Made of pixels; mirrors the contents of the frame buffer memory.
Abstraction Concepts:
- Instruction Set Architecture (ISA): The hardware/software interface.
- Application Binary Interface (ABI): The combination of the ISA and the system-software interface.
Memory Types:
- Volatile Main Memory: Loses contents when power is removed.
- Non-volatile Secondary Memory: Magnetic disks, flash memory, and optical disks (CD/DVD).
Networks: Provide communication and resource sharing. Types include LAN (Ethernet), WAN (the Internet), and wireless (WiFi, Bluetooth).

1.5 Building Processors and Memory

Technology Progress:
- 1951: Vacuum tube (Relative performance/cost: $1$ ).
- 1965: Transistor (Relative performance/cost: $35$ ).
- 1975: Integrated circuit (IC) (Relative performance/cost: $900$ ).
- 1995: Very-large-scale IC (VLSI) (Relative performance/cost: $2,400,000$ ).
- 2013: Ultra-large-scale IC (Relative performance/cost: $250,000,000,000$ ).
Manufacturing: Silicon is a semiconductor. Yield is the proportion of working dies per wafer. IC cost relates non-linearly to die area and defect rate.
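The non-linear relation between die area and cost can be seen in a small sketch. It assumes the commonly taught approximations, which are not stated in these notes: dies per wafer ≈ wafer area / die area, yield = 1 / (1 + defects per area × die area / 2)^2, and cost per die = cost per wafer / (dies per wafer × yield). The function names and sample numbers are hypothetical.

```python
def die_yield(defects_per_cm2, die_area_cm2):
    # Assumed yield model: 1 / (1 + defects_per_area * die_area / 2)^2
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2


def cost_per_die(wafer_cost, wafer_area_cm2, die_area_cm2, defects_per_cm2):
    # Approximate dies per wafer by dividing areas (ignores edge losses).
    dies_per_wafer = wafer_area_cm2 / die_area_cm2
    return wafer_cost / (dies_per_wafer * die_yield(defects_per_cm2, die_area_cm2))


# Hypothetical wafer: cost 5000, area 700 cm^2, 0.5 defects per cm^2.
small_die_cost = cost_per_die(5000.0, 700.0, 1.0, 0.5)
large_die_cost = cost_per_die(5000.0, 700.0, 2.0, 0.5)
cost_ratio = large_die_cost / small_die_cost
print(round(cost_ratio, 2))  # doubling the area more than doubles the cost
```

Under these assumptions, doubling the die area raises the cost per die by a factor of about 2.88, because fewer dies fit on the wafer and a smaller share of them work.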

1.6 Performance Metrics

Definitions:
- Response time (Latency): Time taken for a single task.
- Throughput: Total work done per unit time.
Relative Performance Formula:
- $\text{Performance} = \frac{1}{\text{Execution Time}}$
- $\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n$
Worked Example 1: If Machine A takes $10\,s$ and Machine B takes $15\,s$ , then $n = 15\,s / 10\,s = 1.5$ , so Machine A is $1.5\times$ as fast as Machine B.
Measuring Time:
- Elapsed (wall-clock) time: Total response time including I/O and OS overhead.
- CPU time: Time spent purely on the job; divided into user CPU time and system CPU time.
CPU Clocking Equations:
- $\text{Clock rate (frequency)} = \frac{1}{\text{Clock period}}$
- $\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$
Worked Example 2: Computer A ( $2\,GHz$ ) runs a program in $10\,s$ of CPU time. Computer B must run it in $6\,s$ but needs $1.2\times$ as many clock cycles as A. What clock rate does B need?
1. $\text{Cycles}_A = 10\,s \times 2 \times 10^{9} = 20 \times 10^{9}$ .
2. $\text{Cycles}_B = 1.2 \times 20 \times 10^{9} = 24 \times 10^{9}$ .
3. $\text{Clock Rate}_B = 24 \times 10^{9} / 6\,s = 4.0\,GHz$ .
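The three steps of Worked Example 2 can be checked with a short sketch. The helper name `cpu_time` is an illustrative choice, not something from the notes.

```python
def cpu_time(clock_cycles, clock_rate_hz):
    # CPU Time = CPU Clock Cycles / Clock Rate
    return clock_cycles / clock_rate_hz


clock_rate_a = 2e9                    # 2 GHz
cycles_a = 10.0 * clock_rate_a        # step 1: cycles = time * rate
cycles_b = 1.2 * cycles_a             # step 2: B needs 1.2x the cycles
clock_rate_b = cycles_b / 6.0         # step 3: rate = cycles / time

print(clock_rate_b / 1e9, "GHz")      # approximately 4.0
print(cpu_time(cycles_b, clock_rate_b), "s")
```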
Instruction Count and CPI:
- $\text{Clock Cycles} = \text{Instruction Count} \times CPI$
- $\text{CPU Time} = IC \times CPI \times \text{Clock Cycle Time} = \frac{IC \times CPI}{\text{Clock Rate}}$
CPI (Cycles Per Instruction): The average cycles used per instruction.
- Weighted CPI: $\text{Clock Cycles} = \sum_{i=1}^{n} (CPI_i \times IC_i)$
- $\text{Average CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$
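Worked Example 3: Computers A and B implement the same ISA and run the same program of $I$ instructions. A has a $250\,ps$ clock cycle time and $CPI = 2.0$ ; B has a $500\,ps$ clock cycle time and $CPI = 1.2$ .
- $Time_A = I \times 2.0 \times 250\,ps = I \times 500\,ps$ .
- $Time_B = I \times 1.2 \times 500\,ps = I \times 600\,ps$ .
- A is $600 / 500 = 1.2\times$ as fast as B.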
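A minimal sketch of the CPI equations, reproducing Worked Example 3 and adding a weighted-CPI calculation. The instruction count and the instruction mix are hypothetical values chosen only for illustration.

```python
def cpu_time_ps(instruction_count, cpi, cycle_time_ps):
    # CPU Time = IC * CPI * Clock Cycle Time
    return instruction_count * cpi * cycle_time_ps


def average_cpi(mix):
    # mix is a list of (CPI_i, IC_i) pairs, one per instruction class.
    total_cycles = sum(cpi * count for cpi, count in mix)
    total_instructions = sum(count for _, count in mix)
    return total_cycles / total_instructions


instruction_count = 1_000_000                    # hypothetical I
time_a = cpu_time_ps(instruction_count, 2.0, 250)
time_b = cpu_time_ps(instruction_count, 1.2, 500)
ratio = time_b / time_a                          # how much faster A is
print(round(ratio, 2))                           # 1.2

# Hypothetical mix: 2 instructions at CPI 1, 1 at CPI 2, 2 at CPI 3.
mix = [(1, 2), (2, 1), (3, 2)]
mix_cpi = average_cpi(mix)
print(mix_cpi)                                   # (2 + 2 + 6) / 5 = 2.0
```

The ratio does not depend on the chosen instruction count, since $I$ cancels when both machines run the same program.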

1.7 The Power Wall

Dynamic Power Formula: $Power = \text{Capacitive load} \times Voltage^{2} \times Frequency$
Worked Example 5: A new CPU has $85\%$ of the capacitive load of the old one, with voltage and frequency each reduced by $15\%$ (to $0.85\times$ ).
- $\text{Ratio} = 0.85 \times 0.85^{2} \times 0.85 = 0.85^{4} \approx 0.52$ ( $48\%$ reduction).
The "power wall" stopped single-core frequency scaling because voltage cannot be lowered further and heat cannot be removed efficiently.
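The dynamic power formula and the ratio from the worked example can be sketched as follows. The old CPU is normalized to 1.0 for every quantity, which is an assumption made only to express the result as a ratio.

```python
def dynamic_power(capacitive_load, voltage, frequency):
    # Power = Capacitive load * Voltage^2 * Frequency
    return capacitive_load * voltage ** 2 * frequency


old_power = dynamic_power(1.0, 1.0, 1.0)       # normalized baseline
new_power = dynamic_power(0.85, 0.85, 0.85)
power_ratio = new_power / old_power
print(round(power_ratio, 2))                   # 0.52
```

Because voltage enters squared, lowering it was historically the most effective lever, which is why the inability to lower it further matters so much.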

1.8 Multiprocessors

The response to the power wall is multicore microprocessors, placing multiple processors on one chip. This requires explicitly parallel programming, unlike Instruction-Level Parallelism (ILP) which is handled by hardware.

1.9 Benchmarks and Laws

SPEC (Standard Performance Evaluation Corporation): Publishes benchmark suites such as CINT (integer) and CFP (floating-point). Its power benchmark reports performance in $\text{ssj\_ops/sec}$ against power in watts.
Amdahl's Law: Overall speed-up is limited by the part that cannot be improved.
- $T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$
Worked Example 6: Multiply takes $80\,s$ of a $100\,s$ program. To get a $5\times$ speed-up ( $100\,s \rightarrow 20\,s$ ):
- $20 = 80/n + (100 - 80) \Rightarrow 20 = 80/n + 20 \Rightarrow 0 = 80/n$ . No finite $n$ works.
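A short sketch of Amdahl's Law applied to this example shows the overall speed-up approaching, but never reaching, $5\times$ as the multiply improvement grows. The function name and the sample improvement factors are illustrative.

```python
def improved_time(t_affected, t_unaffected, improvement_factor):
    # T_improved = T_affected / improvement factor + T_unaffected
    return t_affected / improvement_factor + t_unaffected


speedups = {}
for factor in (2, 10, 1000):
    new_time = improved_time(80.0, 20.0, factor)
    speedups[factor] = 100.0 / new_time
    print(factor, new_time, round(speedups[factor], 3))
```

The unaffected $20\,s$ puts a ceiling of $100 / 20 = 5\times$ on the overall speed-up, so every finite factor lands below it.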
MIPS (Millions of Instructions Per Second):
- $\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}}$
- Pitfall: MIPS ignores ISA differences and instruction complexity; CPI varies between programs.
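The pitfall can be illustrated with two hypothetical machines running the same program compiled for different ISAs. All instruction counts and times below are invented for the illustration.

```python
def mips(instruction_count, execution_time_s):
    # MIPS = Instruction Count / (Execution Time * 10^6)
    return instruction_count / (execution_time_s * 1e6)


# Machine X: many simple instructions; Machine Y: fewer, more complex ones.
time_x, time_y = 5.0, 4.0
mips_x = mips(10e9, time_x)
mips_y = mips(4e9, time_y)
print(mips_x, mips_y)

# X has the higher MIPS rating, yet Y finishes the program sooner.
x_wins_on_mips = mips_x > mips_y
y_wins_on_time = time_y < time_x
print(x_wins_on_mips, y_wins_on_time)
```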

Chapter 2 — Python for Data Science & Machine Learning

Python Characteristics: Simple, versatile, and one of the most widely used languages for data science.
Operators:
- Arithmetic: +, -, *, /, //, %, **
- Logical: and, or, not
Data Types: int, float, str (string), bool, list, tuple, dict
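A few lines are enough to see how the less obvious operators behave and what the built-in types are called. The variable names are illustrative.

```python
true_division = 7 / 2       # always returns a float: 3.5
floor_division = 7 // 2     # rounds down to an integer: 3
remainder = 7 % 2           # 1
power = 2 ** 10             # 1024
negative_floor = -7 // 2    # floors toward negative infinity: -4

# Logical operators combine boolean expressions.
both = (3 > 2) and not (2 > 3)   # True

# One sample value per data type listed above.
samples = (3, 3.0, "a", True, [1], (1,), {"k": 1})
type_names = [type(value).__name__ for value in samples]
print(type_names)
```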
Core Libraries:
- NumPy: Numerical arrays and statistics.
- pandas: Data loading (e.g., pd.read_csv()) and table manipulation.
- matplotlib: Plotting and visualization (scatter, bar, plot).
- scikit-learn (sklearn): Machine-learning models.

Chapter 4 — Regression

4.1 Supervised Learning and Types

Supervised Learning: Learning from data where correct labels/outcomes are provided.
Classification: Outcome variable is discrete/categorical.
Regression: Outcome variable is continuous.

4.3 Linear Regression

Equation: $y = c + m \cdot X$ (where $c$ is intercept, $m$ is slope).
Fitting: Uses Ordinary Least Squares (OLS) to minimize squared error.
Mean Squared Error (MSE):
- $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}$
Root Mean Squared Error (RMSE): $RMSE = \sqrt{MSE}$ . Interpreted in the same units as $y$ .
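As a minimal sketch, the OLS line for a single feature can be computed in closed form, with slope $m = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and intercept $c = \bar{y} - m\bar{x}$, and then scored with MSE and RMSE. The four data points are hypothetical.

```python
import math


def fit_line(xs, ys):
    # Closed-form OLS for y = c + m * x with one feature.
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c


def mse(ys, preds):
    # Mean of the squared residuals.
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 10.0]
m, c = fit_line(xs, ys)
preds = [c + m * x for x in xs]
error = mse(ys, preds)
rmse = math.sqrt(error)
print(round(m, 3), round(c, 3), round(error, 3), round(rmse, 3))
```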

4.4 Multiple Linear Regression

Generalizes a line to a hyperplane: $\hat{y} = c + m_{1}X_{1} + m_{2}X_{2} + \dots + m_{k}X_{k}$

4.5 Regularization (Ridge and Lasso)

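Used to reduce overfitting by penalizing large coefficients. In the formulas below, $\theta_{i}$ are the model coefficients and $\alpha \ge 0$ sets the penalty strength ( $\alpha = 0$ leaves the plain MSE).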
Ridge (L2 Penalty): $\text{Loss} = MSE + \alpha \cdot \sum_{i=1}^{m} \theta_{i}^{2}$
Lasso (L1 Penalty): $\text{Loss} = MSE + \alpha \cdot \sum_{i=1}^{m} |\theta_{i}|$
- Lasso can drive coefficients to zero, performing automatic feature selection.
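The two penalties can be compared directly by evaluating both loss formulas on the same coefficients. This sketch only evaluates the losses as written above; it does not fit a model, and the MSE value and coefficients are hypothetical.

```python
def ridge_loss(mse_value, coefficients, alpha):
    # Loss = MSE + alpha * sum of squared coefficients (L2 penalty)
    return mse_value + alpha * sum(t ** 2 for t in coefficients)


def lasso_loss(mse_value, coefficients, alpha):
    # Loss = MSE + alpha * sum of absolute coefficients (L1 penalty)
    return mse_value + alpha * sum(abs(t) for t in coefficients)


theta = [3.0, -0.5, 0.0]     # hypothetical coefficients
ridge = ridge_loss(2.0, theta, 0.1)
lasso = lasso_loss(2.0, theta, 0.1)
print(round(ridge, 3), round(lasso, 3))
```

The squared penalty is dominated by the large coefficient, while the absolute-value penalty charges small coefficients proportionally more, which is what pushes them all the way to zero.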

4.6 Gradient Descent

An iterative procedure for minimizing error by repeatedly moving "downhill" on the error surface, in the direction opposite to the gradient.
Steps:
1. Initialize parameters (m, c).
2. Compute cost (e.g., MSE).
3. Compute the gradient.
4. Update parameters: $\text{param} = \text{param} - (\text{learning\_rate} \times \text{gradient})$ .
5. Repeat until convergence.
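The five steps above can be sketched for a single-feature linear model with MSE as the cost. The learning rate, tolerance, step limit, and data are assumptions chosen so that this small example converges; they are not recommended defaults.

```python
def gradient_descent(xs, ys, learning_rate=0.05, tolerance=1e-12, max_steps=10_000):
    m, c = 0.0, 0.0                                  # step 1: initialize
    n = len(xs)
    previous_cost = float("inf")
    for _ in range(max_steps):
        errors = [(c + m * x) - y for x, y in zip(xs, ys)]
        cost = sum(e * e for e in errors) / n        # step 2: MSE
        if abs(previous_cost - cost) < tolerance:    # step 5: converged
            break
        previous_cost = cost
        # step 3: partial derivatives of MSE with respect to m and c
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2.0 / n) * sum(errors)
        m -= learning_rate * grad_m                  # step 4: update
        c -= learning_rate * grad_c
    return m, c


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]        # generated from y = 1 + 2x
m, c = gradient_descent(xs, ys)
print(round(m, 3), round(c, 3))
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small makes convergence slow.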

Chapter 5 — Classification — Part 1

5.2 k-Nearest Neighbours (k-NN)

Mechanism: Uses majority voting of the $k$ closest points in the feature space.
Worked Example: A vehicle classified by length/weight. If $k=3$ and neighbors are 1 Sedan and 2 SUVs, the vehicle is classified as an SUV.
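A minimal k-NN sketch using Euclidean distance and majority voting. The vehicle coordinates (length in metres, weight in tonnes) are hypothetical, chosen so that the three nearest neighbours are two SUVs and one Sedan, as in the worked example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


vehicles = [
    ((4.5, 1.4), "Sedan"),
    ((4.7, 1.5), "Sedan"),
    ((4.9, 2.0), "SUV"),
    ((5.0, 2.1), "SUV"),
    ((4.2, 1.2), "Sedan"),
]
label = knn_predict(vehicles, (4.85, 1.9), k=3)
print(label)  # SUV
```

In practice the features are usually scaled first, since a feature with a larger numeric range would otherwise dominate the distance.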

5.3 Decision Trees

Tree Structure: Composed of decision nodes (splits) and leaf nodes (class labels).
Entropy Formula: $E(S) = - \sum_{i} p_{i} \log_{2} p_{i}$
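Conditional Entropy: $E(A | B) = \sum_{k \in B} P(k) \cdot E(A | B = k)$ , the entropy of $A$ within each value $k$ of attribute $B$ , weighted by how often that value occurs.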
Information Gain: $Gain = E(\text{before split}) - E(\text{after split})$
Worked Example (Balloons Dataset):
- Class variable "Inflated": $True=8, False=12$ . Total $20$ .
- $E(Inflated) = - (0.6 \log_{2} 0.6) - (0.4 \log_{2} 0.4) = 0.9710$
- Split on "Act": "Dip" ( $T=0, F=8$ ), "Stretch" ( $T=8, F=4$ ).
- $E(Inflated | Act) = (8/20) \cdot 0 + (12/20) \cdot [-(8/12) \log_{2} (8/12) - (4/12) \log_{2} (4/12)] = 0.6 \times 0.9183 = 0.5510$
- $\text{Information Gain} = 0.9710 - 0.5510 = 0.4200$ .
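The entropy and information-gain calculation can be reproduced from the class counts alone. This sketch takes the True/False counts listed above for the whole dataset and for each value of "Act"; the function names are illustrative.

```python
import math


def entropy(counts):
    # E(S) = -sum(p_i * log2(p_i)); empty classes contribute nothing.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def conditional_entropy(groups):
    # Weighted average of the entropy inside each branch of the split.
    total = sum(sum(group) for group in groups)
    return sum((sum(group) / total) * entropy(group) for group in groups)


parent = [8, 12]                    # Inflated: True, False
split_on_act = [[0, 8], [8, 4]]     # Dip, Stretch
before = entropy(parent)
after = conditional_entropy(split_on_act)
gain = before - after
print(round(before, 4), round(after, 4), round(gain, 4))
```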

5.4 Random Forest

An ensemble of many decision trees whose combined prediction reduces the overfitting of any single tree.
Bootstrap sampling: Sampling N records with replacement for each tree.
Random feature selection: Selecting a random subset of $m$ features out of $M$ at each node split.
Aggregation: Majority vote for classification, average for regression.
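The three ingredients can be sketched without training any trees. The helper names, the feature names, and the per-tree predictions are hypothetical; a real forest would fit one tree per bootstrap sample and draw a fresh feature subset at each node.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    # Draw len(records) items with replacement; duplicates are expected.
    return [rng.choice(records) for _ in records]


def random_feature_subset(feature_names, m, rng):
    # Pick m distinct features out of the M available.
    return rng.sample(feature_names, m)


def majority_vote(predictions):
    # Classification aggregate: the most common label across trees.
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)                      # seeded for repeatability
records = list(range(10))
sample = bootstrap_sample(records, rng)
features = random_feature_subset(["length", "weight", "height", "doors"], 2, rng)
vote = majority_vote(["SUV", "Sedan", "SUV"])
average = sum([2.0, 3.0, 4.0]) / 3          # regression aggregate
print(sample, features, vote, average)
```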

Chapter 6 — Classification — Part 2

6.1 Logistic Regression

Used for two-class problems: a linear score $z = \theta^{T} x$ is passed through the sigmoid to turn a continuous output into a probability.
Sigmoid Function: $g(z) = \frac{1}{1 + e^{-z}}$
Probability Model: $P(y=1 | x; \theta) = h_{\theta}(x) = g(\theta^{T} x)$ . Usually thresholded at $0.5$ .
Log-Likelihood: Maximized during training to find best parameters.
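ROC and AUC: The ROC curve plots True-Positive Rate (TPR) against False-Positive Rate (FPR) across classification thresholds, and AUC is the area under that curve. An AUC close to $1$ signifies a good classifier; $0.5$ corresponds to random guessing.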
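A minimal sketch of the sigmoid and the thresholded prediction. The parameter vector and the input (with a leading 1.0 acting as the intercept term) are hypothetical.

```python
import math


def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); maps any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(theta, x, threshold=0.5):
    # Linear score followed by the sigmoid, then a hard threshold.
    z = sum(t * xi for t, xi in zip(theta, x))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


theta = [-4.0, 1.0]                                  # intercept, weight
probability, label = predict_class(theta, [1.0, 6.0])  # z = 2.0
print(round(probability, 4), label)                  # 0.8808 1
```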

6.2 Softmax Regression (Multinomial)

Generalizes logistic regression for multiple categories.
Softmax Function: $softmax(z_{i}) = \frac{e^{z_{i}}}{\sum_{j=1}^{n} e^{z_{j}}}$
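A short softmax sketch. Subtracting the maximum score before exponentiating is a common numerical-stability step that is not part of the formula above; it leaves the result unchanged because the common factor cancels. The scores are hypothetical.

```python
import math


def softmax(scores):
    # Shift by the max so the largest exponent is e^0 = 1 (avoids overflow).
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probabilities])  # [0.659, 0.242, 0.099]
```

The outputs are positive and sum to 1, so they can be read as class probabilities, and the predicted class is the one with the largest value.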

6.3 Naïve Bayes

Based on Bayes' Theorem with the "naïve" assumption of independence between predictors.
Formula: $P(c | x) = \frac{P(x | c) \cdot P(c)}{P(x)}$
Variants:
- MultinomialNB: For counts/categories.
- GaussianNB: For continuous features assuming a normal distribution.
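Bayes' theorem with the independence assumption can be sketched for categorical features: the score for each class is its prior times the product of per-feature likelihoods, and dividing by the sum of the scores plays the role of $P(x)$. The classes, priors, and likelihoods below are invented for illustration.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    scores = {}
    for cls, prior in priors.items():
        score = prior                          # P(c)
        for feature in observed:
            score *= likelihoods[cls][feature]  # P(x_j | c), assumed independent
        scores[cls] = score
    evidence = sum(scores.values())            # P(x)
    return {cls: score / evidence for cls, score in scores.items()}


priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.5, "meeting": 0.1},
    "ham": {"offer": 0.1, "meeting": 0.4},
}
posterior = naive_bayes_posterior(priors, likelihoods, ["offer"])
print({cls: round(p, 3) for cls, p in posterior.items()})
```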

6.4 Support Vector Machine (SVM)

Searches for the separating hyperplane with the largest margin; with a kernel, the separation is carried out in a higher-dimensional feature space.
Support Vectors: The points closest to the hyperplane.
Margin: The distance between the hyperplane and the support vectors; SVM maximizes this margin.
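To make the margin concrete, the distance from a point $x$ to a hyperplane $w \cdot x + b = 0$ is $|w \cdot x + b| / \|w\|$. The sketch below measures it for a fixed, hypothetical hyperplane and set of points; it does not train an SVM, which would instead choose $w$ and $b$ to make the smallest of these distances as large as possible.

```python
import math


def distance_to_hyperplane(w, b, x):
    # |w . x + b| / ||w||
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / math.sqrt(sum(wi * wi for wi in w))


w, b = [3.0, 4.0], -5.0                        # hypothetical hyperplane
points = [[3.0, 4.0], [0.0, 0.0], [1.0, 3.0]]  # hypothetical data
distances = [distance_to_hyperplane(w, b, p) for p in points]
margin = min(distances)                        # set by the closest point
print(distances, margin)
```

The point that attains the minimum distance is the one that would act as a support vector for this hyperplane.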

Appendix — Python Coding Reference

A.5 The Universal scikit-learn ML Recipe

This 7-step pattern applies to virtually all models in Chapters 4-6:

1. Import: Get the model and tools (e.g., from sklearn.linear_model import LinearRegression, from sklearn.model_selection import train_test_split).
2. Load: Load the CSV using pandas: df = pd.read_csv("data.csv").
3. Split X/y: X = df.drop(columns=["target"]), y = df["target"].
4. Train/Test Split: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30).
5. Create Model: Instantiate the specific model (e.g., model = KNeighborsClassifier(n_neighbors=3)).
6. Train: model.fit(X_train, y_train).
7. Predict & Score: preds = model.predict(X_test), then print(accuracy_score(y_test, preds)) for classifiers or print(mean_squared_error(y_test, preds)) for regressors.
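The recipe can be run end to end as a single script. This sketch swaps the placeholder data.csv for the iris dataset bundled with scikit-learn, so the pandas loading step is replaced by load_iris; the random_state and stratify arguments are additions to make the split repeatable and balanced.

```python
# Import the model and tools
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load and split into features X and target y (stands in for read_csv)
X, y = load_iris(return_X_y=True)

# Train/test split: hold out 30% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Create and train the model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict on unseen rows and score
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print(accuracy)
```

Swapping in another classifier from A.6 only changes the import and the model-creation line; the remaining steps stay the same.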

A.6 Model Creation Lines

Linear Regression: model = LinearRegression()
Ridge/Lasso: model = Ridge(alpha=1.0) or model = Lasso(alpha=1.0)
k-NN: model = KNeighborsClassifier(n_neighbors=3)
Decision Tree: model = DecisionTreeClassifier(criterion="entropy")
Random Forest: model = RandomForestClassifier(n_estimators=100)
Logistic Regression: model = LogisticRegression()
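Softmax: model = LogisticRegression() (recent scikit-learn releases fit a multinomial model by default on multi-class targets; older releases used multi_class="multinomial", which is now deprecated)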
SVM: model = SVC(kernel="linear")