Computer Abstractions & Technology + Machine Learning Practice Flashcards

Chapter 1 — Computer Abstractions and Technology

1.1 Introduction & Classes of Computers
  • Pervasiveness of Computing: Computing is embedded in automobiles, mobile phones, the genome project, the Web, and search engines. Progress is driven by technology improvements and domain-specific accelerators.
  • Personal Computers (PCs): General-purpose systems designed to run variety of software. They are designed around a cost/performancecost/performance tradeoff.
  • Server Computers: Network-based systems that emphasize high capacity, performance, and reliability. They range in size from small servers to building-sized facilities.
  • Supercomputers: A specialized type of server designed for high-end scientific and engineering calculations. They possess the highest capability but represent a tiny fraction of the market.
  • Embedded Computers: Computers hidden inside other systems (e.g., in cars or appliances), characterized by stringent power, performance, and cost constraints.
  • The PostPC Era:
    • Personal Mobile Devices (PMDs): Battery-operated, internet-connected devices costing hundreds of dollars (e.g., smartphones, tablets, e-glasses).
    • Cloud Computing: Runs on Warehouse-Scale Computers (WSC). It delivers Software as a Service (SaaS), where applications are split between the PMD and the cloud (e.g., Amazon, Google).
1.2 Seven Great Ideas in Computer Architecture
  • Use abstraction to simplify design: Hiding details to make the system easier to understand.
  • Make the common case fast: Improving performance where it matters most.
  • Performance via parallelism: Executing multiple tasks simultaneously.
  • Performance via pipelining: Overlapping the execution of instructions.
  • Performance via prediction: Guessing the outcome of a decision to avoid delays.
  • Hierarchy of memories: Using different levels of memory to balance speed, size, and cost.
  • Dependability via redundancy: Including extra components to protect against failure.
1.3 Below Your Program
  • Software Layers:
    • Application Software: Written in high-level languages (HLL).
    • System Software: Includes the Compiler (translates HLL into machine code) and the Operating System (manages I/O, memory, storage, task scheduling, and resource sharing).
    • Hardware: Includes the processor, memory, and I/O controllers.
  • Levels of Program Code:
    • High-level language: Provides abstraction near the problem; offers productivity and portability.
    • Assembly language: The textual form of instructions.
    • Hardware representation: Binary digits (bits) encoding instructions and data.
1.4 Under the Covers
  • Five Classic Components: All computers share input, output, memory, datapath, and control.
  • The Processor: Composed of the Datapath (performs operations on data) and Control (sequences the datapath and memory).
  • Cache Memory: Small, fast SRAM used for immediate data access.
  • Displays and Interfaces:
    • Touchscreens: Resistive vs. capacitive; capacitive is standard for tablets/phones as it allows multi-touch.
    • LCD: Made of pixels; mirrors the contents of the frame buffer memory.
  • Abstraction Concepts:
    • Instruction Set Architecture (ISA): The hardware/software interface.
    • Application Binary Interface (ABI): The combination of the ISA and the system-software interface.
  • Memory Types:
    • Volatile Main Memory: Loses contents when power is removed.
    • Non-volatile Secondary Memory: Magnetic disks, flash memory, and optical disks (CD/DVD).
  • Networks: Provide communication and resource sharing. Types include LAN (Ethernet), WAN (the Internet), and wireless (WiFi, Bluetooth).
1.5 Building Processors and Memory
  • Technology Progress:
    • 1951: Vacuum tube (Relative performance/cost: 11).
    • 1965: Transistor (Relative performance/cost: 3535).
    • 1975: Integrated circuit (IC) (Relative performance/cost: 900900).
    • 1995: Very-large-scale IC (VLSI) (Relative performance/cost: 2,400,0002,400,000).
    • 2013: Ultra-large-scale IC (Relative performance/cost: 250,000,000,000250,000,000,000).
  • Manufacturing: Silicon is a semiconductor. Yield is the proportion of working dies per wafer. IC cost relates non-linearly to die area and defect rate.
1.6 Performance Metrics
  • Definitions:
    • Response time (Latency): Time taken for a single task.
    • Throughput: Total work done per unit time.
  • Relative Performance Formula:
    • Performance=1Execution Time\text{Performance} = \frac{1}{\text{Execution Time}}
    • PerformanceXPerformanceY=Execution TimeYExecution TimeX=n\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution Time}_Y}{\text{Execution Time}_X} = n
  • Worked Example 1: If Machine A takes 10s10\,s and Machine B takes 15s15\,s, n=15s/10s=1.5n = 15\,s / 10\,s = 1.5. Machine A is 1.5×1.5\times faster.
  • Measuring Time:
    • Elapsed (wall-clock) time: Total response time including I/O and OS overhead.
    • CPU time: Time spent purely on the job; divided into user CPU time and system CPU time.
  • CPU Clocking Equations:
    • Clock rate (frequency)=1Clock period\text{Clock rate (frequency)} = \frac{1}{\text{Clock period}}
    • CPU Time=CPU Clock Cycles×Clock Cycle Time=CPU Clock CyclesClock Rate\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}
  • Worked Example 2: Computer A (2GHz2\,GHz, 10s10\,s CPU time). Computer B needs 6s6\,s CPU time but has 1.2×1.2\times the cycles of A.
    1. CyclesA=10s×2×109=20×109\text{Cycles}_A = 10\,s \times 2 \times 10^{9} = 20 \times 10^{9}.
    2. CyclesB=1.2×20×109=24×109\text{Cycles}_B = 1.2 \times 20 \times 10^{9} = 24 \times 10^{9}.
    3. Clock RateB=24×109/6s=4.0GHz\text{Clock Rate}_B = 24 \times 10^{9} / 6\,s = 4.0\,GHz.
  • Instruction Count and CPI:
    • Clock Cycles=Instruction Count×CPI\text{Clock Cycles} = \text{Instruction Count} \times CPI
    • CPU Time=IC×CPI×Clock Cycle Time=IC×CPIClock Rate\text{CPU Time} = IC \times CPI \times \text{Clock Cycle Time} = \frac{IC \times CPI}{\text{Clock Rate}}
  • CPI (Cycles Per Instruction): The average cycles used per instruction.
    • Weighted CPI: Clock Cycles=i=1n(CPIi×ICi)\text{Clock Cycles} = \sum_{i=1}^{n} (CPI_i \times IC_i)
    • Average CPI=Clock CyclesInstruction Count\text{Average CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}
  • Worked Example 3: Computer A (250ps250\,ps, CPI=2.0CPI = 2.0) vs. Computer B (500ps500\,ps, CPI=1.2CPI = 1.2).
    • TimeA=I×2.0×250ps=I×500psTime_A = I \times 2.0 \times 250\,ps = I \times 500\,ps.
    • TimeB=I×1.2×500ps=I×600psTime_B = I \times 1.2 \times 500\,ps = I \times 600\,ps.
    • A is 1.2×1.2\times faster.
1.7 The Power Wall
  • Dynamic Power Formula: Power=Capacitive load×Voltage2×FrequencyPower = \text{Capacitive load} \times Voltage^{2} \times Frequency
  • Worked Example 5: New CPU has 85%85\% capacitive load, 15%15\% reduction in voltage (0.85×0.85\times), and 15%15\% reduction in frequency (0.85×0.85\times).
    • Ratio=0.85×0.852×0.85=0.8540.52\text{Ratio} = 0.85 \times 0.85^{2} \times 0.85 = 0.85^{4} \approx 0.52 (48%48\% reduction).
  • The "power wall" stopped single-core frequency scaling because voltage cannot be lowered further and heat cannot be removed efficiently.
1.8 Multiprocessors
  • The response to the power wall is multicore microprocessors, placing multiple processors on one chip. This requires explicitly parallel programming, unlike Instruction-Level Parallelism (ILP) which is handled by hardware.
1.9 Benchmarks and Laws
  • SPEC (Standard Performance Evaluation Corp): Uses benchmark suites like CINT (integer) and CFP (floating-point). Includes SPEC power (ssj_ops/sec\text{ssj\_ops/sec} vs Watts).
  • Amdahl's Law: Overall speed-up is limited by the part that cannot be improved.
    • Timproved=Taffectedimprovement factor+TunaffectedT_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}
  • Worked Example 6: Multiply takes 80s80\,s of a 100s100\,s program. To get a 5×5\times speed-up (100s20s100\,s \rightarrow 20\,s):
    • 20=80/n+(10080)20=80/n+200=80/n20 = 80/n + (100 - 80) \Rightarrow 20 = 80/n + 20 \Rightarrow 0 = 80/n. No finite nn works.
  • MIPS (Millions of Instructions Per Second):
    • MIPS=Instruction CountExecution Time×106\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}}
    • Pitfall: MIPS ignores ISA differences and instruction complexity; CPI varies between programs.

Chapter 2 — Python for Data Science & Machine Learning

  • Python Characteristics: Simple, versatile, and the most widely used language for data science.
  • Operators:
    • Arithmetic: +,,,/,//,%,+\,, -\,, *\,, /\,, //\,, \%\,, **
    • Logical: and,or,notand\,, or\,, not
  • Data Types: int,float,string,bool,list,tuple,dictint\,, float\,, string\,, bool\,, list\,, tuple\,, dict
  • Core Libraries:
    • NumPy: Numerical arrays and statistics.
    • pandas: Data loading (e.g., pd.read_csv()) and table manipulation.
    • matplotlib: Plotting and visualization (scatter, bar, plot).
    • scikit-learn (sklearn): Machine-learning models.

Chapter 4 — Regression

4.1 Supervised Learning and Types
  • Supervised Learning: Learning from data where correct labels/outcomes are provided.
  • Classification: Outcome variable is discrete/categorical.
  • Regression: Outcome variable is continuous.
4.3 Linear Regression
  • Equation: y=c+mXy = c + m \cdot X (where cc is intercept, mm is slope).
  • Fitting: Uses Ordinary Least Squares (OLS) to minimize squared error.
  • Mean Squared Error (MSE):
    • MSE=1ni=1n(yiy^i)2MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}
  • Root Mean Squared Error (RMSE): RMSE=MSERMSE = \sqrt{MSE}. Interpreted in the same units as yy.
4.4 Multiple Linear Regression
  • Generalizes a line to a hyperplane: y^=c+m1X1+m2X2++mkXk\hat{y} = c + m_{1}X_{1} + m_{2}X_{2} + \dots + m_{k}X_{k}
4.5 Regularization (Ridge and Lasso)
  • Used to prevent overfitting by penalizing large coefficients.
  • Ridge (L2 Penalty): Loss=MSE+αi=1mθi2\text{Loss} = MSE + \alpha \cdot \sum_{i=1}^{m} \theta_{i}^{2}
  • Lasso (L1 Penalty): Loss=MSE+αi=1mθi\text{Loss} = MSE + \alpha \cdot \sum_{i=1}^{m} |\theta_{i}|
    • Lasso can drive coefficients to zero, performing automatic feature selection.
4.6 Gradient Descent
  • A robust procedure for minimizing error by moving "downhill" on the error surface.
  • Steps:
    1. Initialize parameters (m, c).
    2. Compute cost (e.g., MSE).
    3. Compute the gradient.
    4. Update parameters: param=param(learning_rate×gradient)\text{param} = \text{param} - (\text{learning\_rate} \times \text{gradient}).
    5. Repeat until convergence.

Chapter 5 — Classification — Part 1

5.2 k-Nearest Neighbours (k-NN)
  • Mechanism: Uses majority voting of the kk closest points in the feature space.
  • Worked Example: A vehicle classified by length/weight. If k=3k=3 and neighbors are 1 Sedan and 2 SUVs, the vehicle is classified as an SUV.
5.3 Decision Trees
  • Tree Structure: Composed of decision nodes (splits) and leaf nodes (class labels).
  • Entropy Formula: E(S)=ipilog2piE(S) = - \sum_{i} p_{i} \log_{2} p_{i}
  • Conditional Entropy: E(A,B)=kBP(k)E(k)E(A, B) = \sum_{k \in B} P(k) \cdot E(k)
  • Information Gain: Gain=E(before split)E(after split)Gain = E(\text{before split}) - E(\text{after split})
  • Worked Example (Balloons Dataset):
    • Class variable "Inflated": True=8,False=12True=8, False=12. Total 2020.
    • E(Inflated)=(0.6log20.6)(0.4log20.4)=0.9710E(Inflated) = - (0.6 \log_{2} 0.6) - (0.4 \log_{2} 0.4) = 0.9710
    • Split on "Act": "Dip" (T=0,F=8T=0, F=8), "Stretch" (T=8,F=4T=8, F=4).
    • E(InflatedAct)=(8/20)0+(12/20)[(0.3log20.3)(0.7log20.7)]=0.5454E(Inflated | Act) = (8/20) \cdot 0 + (12/20) \cdot [-(0.3 \log_{2} 0.3) - (0.7 \log_{2} 0.7)] = 0.5454
    • Information Gain=0.97100.5454=0.4256\text{Information Gain} = 0.9710 - 0.5454 = 0.4256.
5.4 Random Forest
  • An ensemble of many decision trees to prevent overfitting.
  • Bootstrap sampling: Sampling N records with replacement for each tree.
  • Random feature selection: Selecting a random subset of mm features out of MM at each node split.
  • Aggregation: Majority vote for classification, average for regression.

Chapter 6 — Classification — Part 2

6.1 Logistic Regression
  • Used for two-class problems to convert continuous output to probability.
  • Sigmoid Function: g(z)=11+ezg(z) = \frac{1}{1 + e^{-z}}
  • Probability Model: P(y=1x;θ)=hθ(x)P(y=1 | x; \theta) = h_{\theta}(x). Thresholded usually at 0.50.5.
  • Log-Likelihood: Maximized during training to find best parameters.
  • ROC and AUC: Plotting True-Positive Rate (TPR) vs. False-Positive Rate (FPR). AUC closer to 11 signifies a good classifier.
6.2 Softmax Regression (Multinomial)
  • Generalizes logistic regression for multiple categories.
  • Softmax Function: softmax(zi)=ezij=1nezjsoftmax(z_{i}) = \frac{e^{z_{i}}}{\sum_{j=1}^{n} e^{z_{j}}}
6.3 Naïve Bayes
  • Based on Bayes' Theorem with the "naïve" assumption of independence between predictors.
  • Formula: P(cx)=P(xc)P(c)P(x)P(c | x) = \frac{P(x | c) \cdot P(c)}{P(x)}
  • Variants:
    • MultinomialNB: For counts/categories.
    • GaussianNB: For continuous features assuming a normal distribution.
6.4 Support Vector Machine (SVM)
  • Searches for the optimal separating hyperplane in a higher dimension.
  • Support Vectors: The points closest to the hyperplane.
  • Margin: The distance between the hyperplane and the support vectors; SVM maximizes this margin.

Appendix — Python Coding Reference

A.5 The Universal scikit-learn ML Recipe

This 7-step pattern applies to virtually all models in Chapters 4-6:

  1. Import: Get the model and tools (e.g., from sklearn.linear_model import LinearRegression).
  2. Load: Load CSV using pandas: df = pd.read_csv("data.csv").
  3. Split X/y: X = df.drop(columns=["target"]), y = df["target"].
  4. Train/Test Split: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30).
  5. Create Model: Instantiate the specific model (e.g., model = KNeighborsClassifier(n_neighbors=3)).
  6. Train: model.fit(X_train, y_train).
  7. Predict & Score: preds = model.predict(X_test), print(accuracy_score(y_test, preds)).
A.6 Model Creation Lines
  • Linear Regression: model = LinearRegression()
  • Ridge/Lasso: model = Ridge(alpha=1.0) or model = Lasso(alpha=1.0)
  • k-NN: model = KNeighborsClassifier(n_neighbors=3)
  • Decision Tree: model = DecisionTreeClassifier(criterion="entropy")
  • Random Forest: model = RandomForestClassifier(n_estimators=100)
  • Logistic Regression: model = LogisticRegression()
  • Softmax: model = LogisticRegression(multi_class="multinomial")
  • SVM: model = SVC(kernel="linear")