DS/CMPSC 410: Programming Models for Big Data Final Exam Comprehensive Study Guide

Big Data Opportunities and Programming Model Needs

Limitations of Conventional Programming Models for Big Data Analytics - Conventional models often rely on a single-node architecture where the dataset must fit within the memory ( $RAM$ ) or local disk space of a single machine. - They lack inherent mechanisms for massive parallelism, requiring manual management of concurrency, data synchronization, and race conditions. - Scalability is limited to the physical constraints of a single unit of hardware (vertical scaling), which becomes cost-prohibitive compared to horizontal scaling.
Streaming Data and Time Requirements - The velocity of streaming data requires real-time or near-real-time processing. - New programming models are needed to provide low-latency analytics, as traditional batch-processing models accumulate data before processing, which delays insight generation beyond the window of utility.
Clusters and Their Role - A cluster is a collection of interconnected computers (nodes) that work together as a single system to execute tasks. - Its role is to provide the underlying hardware infrastructure that supports distributed storage and distributed computing, enabling the programming model to partition data across many nodes.
History of Big Data Models - MapReduce is widely considered the first major programming model specifically designed for Big Data processing at a massive scale. - It was motivated by the need to process vast amounts of unstructured web data, such as indexing the web for search engines and generating web graphs.

MapReduce and Foundations of Distributed Processing

Exemplar Problems for MapReduce - Counting word frequencies in a library of digital books. - Analyzing log files for security or performance monitoring. - Distributed grep (searching for a pattern across massive text files).
Innovative Ideas in MapReduce for Scalability - Partitioning Strategy: Each mapper partitions its output Key-Value pairs into $R$ subsets, where $R$ is the number of reducers specified for the job. - Consistency in Hashing: The hash function used for key-based partitioning is identical across all mappers, ensuring that all occurrences of the same key from any mapper are mapped to the same reducer. - Shuffling Efficiency: The $i^{th}$ Reducer pulls only the $i^{th}$ output partition from each mapper. This drastically reduces the time required for searching and gathering keys assigned to that specific reducer from the intermediate files.
Role of the Key-Value Table - Key-value pairs serve as the fundamental data structure, providing a universal interface for the data flow between the Map phase and the Reduce phase.
The "Plan-Ahead" Mechanism - Aims to solve resource allocation and task scheduling conflicts by determining the number of workers and task distribution before execution begins. - It requires information regarding data locality (where the data is physically stored) to minimize network traffic by scheduling mappers on nodes where the data blocks reside.
Cost Reduction Focus - The primary cost reduction achieved by MapReduce is the optimized communication and search time for the Reduce phase (specifically, the time for each reduce worker to read and process its input through the coordinated multi-partition shuffle).
Fault Tolerance and the Master Node - The Master node monitors the status of all mappers and reducers. - If a node crashes, the Master reassigns its tasks to other available nodes. This is vital because, in large clusters, hardware failure is a statistical certainty, and the system must be able to complete a job without a total restart.
Challenges with Iterative Analytics - MapReduce is inefficient for iterative processes (e.g., Machine Learning) because it writes intermediate data to the disk at the end of every Map and Reduce step, inducing significant I/O overhead compared to in-memory processing.
Transition to Spark - Spark was adopted rapidly by the industry because it supports multiple languages (Scala, Python, Java, R), offers high-level APIs that simplify complex logic, and provides significant speed improvements via in-memory computing. - The main advantage of Spark over MapReduce is Spark's ability to keep data in memory ( $RAM$ ) across multiple stages of a pipeline, avoiding slow disk I/O.

Spark, RDDs, and Lazy Evaluation

Parsing Data in Spark - sc.textFile("path") reads a file into a Resilient Distributed Dataset ( $RDD$ ). - To parse a file into a list of tokens per line: use RDD.map(lambda x: x.split(" ")). - To parse a file into a single flat list of tokens from all lines: use RDD.flatMap(lambda x: x.split(" ")).
Error Detection Timing - If there is a typo in the input filename, the error will not be detected when sc.textFile is called. It will only be detected when an Action (e.g., count(), collect(), saveAsTextFile()) is performed. - This occurs because of Lazy Evaluation, where Spark only records the transformation plan (Lineage) and does not attempt to access the data until it is absolutely necessary for an output.
Lazy Evaluation Principles - Spark builds a Directed Acyclic Graph ( $DAG$ ) of transformations rather than executing code immediately. - Benefits: It allows the Spark Optimizer to view the entire workflow and optimize the execution plan (e.g., pipelining operations) to minimize data shuffling and maximize cluster utilization.
RDD Properties - Immutability: RDDs cannot be changed once created. Any transformation results in a new RDD. - This supports fault tolerance; if a partition is lost, the system can recompute it using the lineage of transformations from the original data source.
Variables in Spark vs. Other Languages - In standard languages, a variable holds a value or a reference to data in memory. In Spark, a variable assigned to an RDD holds a description of how to compute the data (the lineage/recipe).
Local vs. Cluster Mode - Local mode runs Spark on a single machine (useful for debugging and testing code logic with small data). - Cluster mode runs Spark across many nodes for processing massive datasets using distributed resources.
Case Study: Iran Tweet Counts - Code completion example: - RDD1 = sc.textFile("/storage/home/juy1/work/US_Iran.csv") - RDD2 = RDD1.flatMap(lambda x: x.split(",") ) (assuming comma-separated) - RDD3 = RDD2.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b, 5) (where 5 is the number of partitions). - RDD3.saveAsTextFile("/storage/home/juy1/work/US_IranWordCounts.txt")
Spark Context and Re-initialization - Running a SparkContext statement again in local mode after one is already active will typically result in an error, as Spark generally allows only one SparkContext per JVM.
RDD Features and Limitations - RDDs support: Recomputation on failure, key-value storage (even if the key is a list), persistence (persist), lazy evaluation, and joining keys from different RDDs. - Sorting: To sort a key-value RDD on its value, one can use sortBy(lambda x: x[1]). To sort in descending order, use sortBy(lambda x: x[1], ascending=False).

RDD-Based Integration and Join Operations

Join Requirements - To use the join method, both RDDs must consist of key-value pairs (tuples), where the matching elements are located in the first position of the tuple: $(K, V1).join((K, V2)) \rightarrow (K, (V1, V2))$ .
Join Types and Missing Values - Inner Join: Returns only keys present in both RDDs. - Left Outer Join: Returns all keys from the left RDD; if a key is missing in the right RDD, the value is returned as None. - Right Outer Join: Returns all keys from the right RDD; missing left values result in None. - Handling Missing Values: Use .fillna() or .map() with conditional logic to replace None with default values (e.g., $0$ ).
Hashtag Change Analysis - To find hashtags that increased significantly on day 2 (including new ones), use KV1.rightOuterJoin(KV2). - Apply a map function to handle None values for day 1 counts, treating them as $0$ , then subtract Day 1 count from Day 2 count.

Running Spark in Cluster Mode

Cluster Mode vs. Local Mode - Cluster mode allows the workload to be distributed across hundreds or thousands of nodes, enabling processing of petabyte-scale data that a single local machine cannot handle.
Code Changes for Cluster Deployment - Remove hardcoded local paths and replace them with shared filesystem paths (e.g., $HDFS$ or research storage clusters). - Ensure specific worker configuration or master URL settings are compatible with the cluster manager (YARN, Mesos, or Standalone).
Task Submission with pbs-spark-submit - Steps include initializing the environment, ensuring the script is accessible to all nodes, and checking for existing output directories. - Dependency: You must delete or rename existing output directories before re-running a script, as Spark will not overwrite them to prevent accidental data loss.
Execution Verification - Success is verified by checking the log file for the absence of StackTrace errors and finding the generated _SUCCESS flag file in the output directory.
Log File Analysis - Look for: Python errors, resource allocation warnings, execution time statistics, and task progress indicators.

Spark DataFrames, SQL Functions, and Aggregation

SparkSession vs. SparkContext - SparkContext is the entry point for the low-level RDD API. - SparkSession is the modern unified entry point that encompasses SparkContext and allows for DataFrame and Dataset APIs with optimized SQL execution.
DataFrame Schemas - A schema defines: Name of columns, Type of columns (e.g., Integer, String), and Nullability (whether the column can have null values). - Advantage of Explicit Schema: It saves the time required to infer the schema (which scans the dataset) and prevents data type errors during ingestion.
DataFrame Operations - Join: Uses df1.join(df2, "common_column"). Unlike RDD joins, it does not require the data to be explicitly in key-value format beforehand. - Creating Columns: Use .withColumn("new_name", function). Example: withColumn("GenresArray", split(col("Genres"), "|")).
Movie Analytical Queries - Count $\ge 100$: df.groupBy("title").count().filter("count >= 100"). - Avg rating $> 3.5$: df.groupBy("title").avg("rating").filter("avg(rating) > 3.5"). - Finding Adventure movies with high review counts: Use filter(array_contains(col("GenresArray"), "Adventure")) in combination with logical filters on review counts and average ratings.
Extracting Data from DataFrames - Three ways to extract data from an RDD created via df.rdd: Using index-based access row[0], name-based access row.columnName, or converting back to a dictionary with asDict(). - Row Object: A generic record format in Spark SQL. Use row.__fields__ to see available elements.
Aggregation Patterns - Total rows per group: groupBy("column").count(). - Sum logic: groupBy("column").sum("target_column"). - Average rating generation: df.withColumn("AvgRating", col("SumRating") / col("ReviewCount")).

ALS-Based Recommendation Systems and Hyperparameter Tuning

Principle of Alternating Least Squares (ALS) - Matrix Factorization technique that decomposes the large, sparse user-item interaction matrix into two low-rank dense matrices (user factors and item factors) to predict missing ratings.
Hyperparameters - rank: The number of latent factors (dimensionality of user/item vectors). - regParam (lambda): Regularization parameter to prevent overfitting. - alpha: Scaling factor for implicit feedback.
Overfitting Risks - An over-fitted ALS model performs exceptionally well on training data but poorly on unseen validation data. - Avoiding Overfitting: Use regularization and validate performance on a separate validation set during tuning.
Data Splitting Roles - Training Data: Used to build the model. - Validation Data: Used to tune hyperparameters and select the best model. - Testing Data: Used at the very end to give an unbiased estimate of the final model performance.
Spark vs. Pandas - Constraint: Pandas DataFrames and Spark SQL DataFrames are different. However, they can be converted: spark.createDataFrame(pandas_df) or spark_df.toPandas().
Evaluation - To calculate Root Mean Square Error ( $RMSE$ ), join predicted ratings with actual validation ratings and compute: $RMSE = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n}}$ . - The output of Lab 6 indicates the best hyperparameter by identifying the combination that yielded the minimum $RMSE$ on the validation set.

Persist and Unpersist Strategies

Usage in Iterative Tuning - RDDs created outside a loop but used inside multiple tuning iterations (like training/validation sets) must be persisted to avoid re-reading and re-parsing data from disk for every single hyperparameter combination.
Key Rules of persist() - Executing persist() is Lazy: It does not save the data until an action is called. - Persistence makes a "best effort" but is not guaranteed if the cluster runs out of memory; it may spill to disk or recompute later. - Adding persist() indiscriminately can lead to memory exhaustion; use it strategically for data reused multiple times.
Role of unpersist() - Manually releases the memory/disk resources used by an RDD or DataFrame. This is critical for long-running iterative loops to prevent the cluster from running out of storage.

Machine Learning Pipelines and Decision Tree Learning

ML Pipelines - They organize multiple steps (preprocessing, indexing, training) into a single workflow, making development more modular and reproducible.
Stages: Fit vs. Transform - StringIndexer: fit learns the mapping from categories to indices; transform applies it. - Decision Tree: fit trains the model using data; transform uses the trained model to make predictions.
Order of Stages - VectorAssembler must occur after all scaling or indexing of individual features but before the Decision Tree Classifier. - StringIndexers must occur before the VectorAssembler.
Accessing Models - After a pipeline is fit, it produces a PipelineModel. You can access the specific Decision Tree stage using the .stages attribute (e.g., model.stages[-1]).
Tree Depth and Overfitting - A deeper tree can capture more complex patterns but is prone to overfitting (learning noise in training data). Shallower trees generalize better but may underfit.
Evaluation - Decision trees in Spark can be visualized using the .toDebugString property to see the logical flow of splits.

Frequent Patterns Mining (Apriori)

Benefits of Identifying Patterns - Market Basket Analysis (e.g., if customers buy Apple, they also buy Chargers). - Course Scheduling: Identifying which courses students frequently take together to avoid scheduling conflicts.
Apriori Principle - If an itemset is frequent, all of its subsets must also be frequent. Conversely, if an itemset is infrequent, all of its supersets are also infrequent. This principle is used to prune the search space and reduce computational costs.
Column Conversion - To convert a string set like "Port1-Port2" into an array, use split(col("PortSet"), "-").
Top-k Impact - Choosing only the top $k$ items simplifies the computation but may miss long-tail frequent patterns involving less common segments.

Clustering and PCA

Distance Measures - Euclidean Distance: Standard straight-line distance; sensitive to the scale of variables. - Cosine Similarity: Measures the angle between vectors; useful for high-dimensional text data where magnitude is less important than direction.
One Hot Encoding (OHE) - OHE converts categorical variables into binary vectors. It should be used for clustering when categorical categories lack a meaningful ordinal relationship. - Impact: It significantly increases the dimensionality of the feature space (one new column per category).
High-Dimensional Challenges - "The Curse of Dimensionality": As dimensions increase, data becomes sparse, and the distance between any two points tends to converge, making distance-based clustering ineffective.
Principal Component Analysis (PCA) - Principle: Finds the first principal component by identifying the direction of maximum variance in the data. - Usage in Spark: Use pyspark.ml.feature.PCA to reduce high-dimensional OHE data into a smaller set of dense features. - Interpreting Results: To find cluster centers in the original space after PCA-reduced clustering, use the cluster labels to groupBy the original high-dimensional DataFrame and calculate the mean of all original features for each cluster.

Accelerating Deep Learning and Spark

GPU and DNN Acceleration - GPUs accelerate forward inference and backpropagation by parallelizing matrix multiplications ( $C = A \times B$ ), which are the fundamental operations in neural networks. - Epochs and Overfit: Running too many epochs can lead to overfilling. Early stopping is a common technique to avoid this.
Spark GPU Acceleration - CUdf: Accelerates column-oriented DataFrame operations. GPUs are generally not suitable for RDDs because RDDs are row-oriented. - Bottlenecks: A common bottleneck occurs when moving data between CPU-based preprocessing (RDD/DF) and GPU-based training (Matrix operations). The transfer over the PCIe bus can be slow. - Improvement: Keep the entire pipeline (including feature transformation) on the GPU using libraries like NVidia Rapids to minimize data movement.

Large Language Models (LLM) and Fine-tuning

Evaluation - Performance for summarization is often evaluated using metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation).
Fine-Tuning Risks - Catastrophic Forgetting: The risk that a model loses the general knowledge it gained during pre-training while learning a specific task. - Mitigation: Use Lower Learning Rates or methods like LoRA (Low-Rank Adaptation).
LoRA (Low-Rank Adaptation) - Principle: Instead of updating all billions of parameters, LoRA injects trainable low-rank matrices into each layer of the Transformer architecture, freezing the original weights. - Benefit: Drastically reduces the number of trainable parameters and memory requirements for fine-tuning.
Quantization (4-bit Floating Point) - Reduces the precision of weights to save memory. - A 4-bit float with 1 sign bit, 1 mantissa bit, and 2 bits for exponent allows for a limited range of values compared to 16-bit or 32-bit floats but provides more dynamic range than 4-bit integers.