DS/CMPSC 410 Programming Models for Big Data Spring 2026 Midterm Exam Study Guide

Instructor: Prof. John Yen

Important Note

While the self-study questions are grouped by topic/lab, a question in the exam can be related to multiple topic areas and/or labs.

Topic 1: Big Data Opportunities and Needs of Programming Models for Big Data

Limitations of Conventional Programming Models:
- Conventional programming models assume that data and computation reside on a single machine, so they do not scale to datasets that exceed one machine's memory, storage, or processing capacity, and they adapt poorly to data that changes continuously.
Time Requirement of Generating Data Analytics using Streaming Data:
- Analytics over streaming data are useful only if they are produced within a short time of the data arriving. This time requirement motivates programming models that can process data continuously and in parallel, in (near) real time.
Definition and Role of a Cluster:
- A cluster is defined as a set of connected computers that work together. It supports programming models for Big Data by distributing workload across multiple nodes, allowing for parallel processing and better resource utilization.
First Programming Model for Big Data:
- The first prominent programming model for Big Data is MapReduce, which introduced a paradigm for processing large amounts of data in a distributed fashion.
Big Data Analytic Problems that Motivated MapReduce:
- Problems requiring efficient processing of large datasets, such as sorting, filtering, and aggregating, were the drivers behind the innovation of MapReduce.

Topic 2: MapReduce (and Motivations for Developing Spark)

Exemplar Problem for MapReduce:
- An example problem suitable for MapReduce is counting the frequency of words in large text files, excluding the pre-processing of web pages.
Innovative Ideas in MapReduce Enabling Scalable Big Data Analytics:
- Each mapper partitions its output (key-value pairs) into R subsets/files, where R is the number of reducers.
- The hash function for key-based partitioning is the same across all mappers, ensuring that a given key is always sent to the same partition (and therefore the same reducer), no matter which mapper emits it.
- The ith reducer processes the ith output partition from each mapper, reducing time in searching for assigned keys from input files.
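The partitioning idea above can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop's implementation: the function names (partition_for, run_mapper, run_reducer) are hypothetical, and zlib.crc32 stands in for whatever hash function a real framework uses.

```python
import zlib
from collections import defaultdict

def partition_for(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index; every mapper uses this same function."""
    # crc32 is deterministic across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_mapper(lines, num_reducers):
    """Emit (word, 1) pairs, split into num_reducers partitions."""
    partitions = [[] for _ in range(num_reducers)]
    for line in lines:
        for word in line.split():
            partitions[partition_for(word, num_reducers)].append((word, 1))
    return partitions

def run_reducer(i, all_mapper_outputs):
    """Reducer i reads only partition i of each mapper's output."""
    counts = defaultdict(int)
    for mapper_output in all_mapper_outputs:
        for word, n in mapper_output[i]:
            counts[word] += n
    return dict(counts)

R = 3  # number of reducers (assumed value for the example)
mapper_outputs = [
    run_mapper(["big data big ideas"], R),
    run_mapper(["data about data"], R),
]
reduced = [run_reducer(i, mapper_outputs) for i in range(R)]
word_counts = {w: c for part in reduced for w, c in part.items()}
```

Because both mappers share partition_for, every occurrence of a word lands in the same partition index, so no reducer has to search the other partitions for its keys.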
Role of Key-Value Table in MapReduce:
- The key-value table acts as the intermediary for data, allowing mappers to emit key-value pairs which reducers can process.
Problem Addressed by MapReduce’s ‘Plan-Ahead’:
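- The ‘plan-ahead’ strategy addresses the cost of each reducer having to search all mapper output for the keys assigned to it. The assignment of keys to reducers is decided before the map phase runs, so each mapper can write its output already partitioned by reducer.
- The information needed is the number of reducers (R) and a key-partitioning hash function shared by all mappers.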
Primary Cost Reduction Achieved by MapReduce:
- The primary cost reduction is in the time each reduce worker spends reading and processing its input: it reads only the partition assigned to it from each mapper rather than scanning all mapper output.
Role of Master in MapReduce when a Node Crashes:
- The Master node is responsible for monitoring the status of all nodes and reallocating tasks if a mapper or reducer fails. This role is crucial for maintaining reliability and continuity in operations.
Challenges of Using MapReduce for Iterative Big Data Analytics:
- MapReduce is inefficient for iterative algorithms (like those used in machine learning) because they require multiple passes over the data, and each pass is a separate job that writes its results to disk and reads them back at the start of the next pass.
Factors Enabling Rapid Industry Adoption of Spark:
- The speed of processing due to in-memory computations and the flexibility of operations (such as real-time streaming) significantly contributed to Spark's swift industry adoption.
Main Advantage of Spark over MapReduce:
- Spark's primary advantage lies in its faster processing capabilities using in-memory computation, making it better suited for iterative tasks and complex analytics.

Topic 3: Spark, RDD, and Lazy Evaluation

Definition of Lazy Evaluation in Spark:
- Lazy evaluation in Spark refers to the practice of postponing the execution of transformations until an action is invoked, optimizing the computation pipeline and conserving resources.
Benefits of Lazy Evaluation in Data Distribution and Computation:
- By delaying execution, Spark can optimize the execution plan, reducing network I/O and allowing effective distribution of data across the cluster.
RDD-Based Methods that are Actions:
- Actions in RDD are operations such as count(), collect(), or saveAsTextFile(), which trigger the execution of transformations up to that point.
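The distinction between transformations and actions can be illustrated with a toy class in plain Python. This is only an analogy under stated assumptions: LazyDataset is a hypothetical stand-in for an RDD, not Spark's implementation, and it omits distribution, partitioning, and fault tolerance entirely.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source  # zero-argument callable that returns the input records
        self._steps = steps    # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: returns a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Reads the source and applies every recorded step to each record.
        for record in self._source():
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    def collect(self):
        # Action: only now is the source read and the pipeline executed.
        return list(self._run())

    def count(self):
        # Action: also forces execution.
        return sum(1 for _ in self._run())

reads = []  # records each time the input is actually read

def source():
    reads.append("read")
    return iter([1, 2, 3, 4, 5, 6])

pipeline = LazyDataset(source).map(lambda x: x * 10).filter(lambda x: x > 20)
reads_before_action = len(reads)  # nothing has been read yet
result = pipeline.collect()       # the action triggers the read and both steps
reads_after_action = len(reads)
```

Building the pipeline reads nothing; the input is touched only when collect() runs, which is also the point at which a bad input would first cause an error.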
Difference Between RDD in Spark and Variables in Other Languages:
- An RDD is a resilient distributed dataset designed to handle data across clusters, unlike typical variables, which are often confined to a single computational environment.
Mutability of RDD:
- RDDs are immutable, meaning once created, they cannot be altered, which helps prevent inconsistencies and allows fault tolerance throughout the computation process.
Benefits of Running Spark in Local Mode Before Cluster Mode:
- Testing in local mode allows for debugging and validation under controlled conditions, which minimizes errors before deploying in a cluster environment.
Understanding Run-Time Error Messages via Lazy Evaluation:
- Because of lazy evaluation, a run-time error in a transformation (for example, one caused by bad input) is not reported until an action forces the computation. The error message therefore appears at the action, and the faulty step must be traced back through the chain of transformations that precede it.
Comparison of map and flatMap:
- The two following statements yield different results due to their behavior:
- tweets_lines.map(lambda x: x.split(" ")) returns an RDD of lists from each line.
- tweets_lines.flatMap(lambda x: x.split(" ")) returns a flattened RDD of words from all lines.
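The difference can be seen without Spark by applying the same lambda to an ordinary list. In this sketch tweets_lines is a small made-up list rather than an RDD, and list comprehensions and itertools.chain stand in for map and flatMap.

```python
from itertools import chain

# Made-up sample data; in the lab tweets_lines would be an RDD of text lines.
tweets_lines = ["spark is fast", "lazy evaluation"]

# Like RDD.map: exactly one output element per input element (here, a list per line).
mapped = [line.split(" ") for line in tweets_lines]

# Like RDD.flatMap: each per-line list is unpacked into the output.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in tweets_lines))
```

Here mapped has two elements (one list per line) while flat_mapped has five (one per word), which is why word counting starts from flatMap.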
Definition of RDD-Based Filter in Spark:
- The filter operation in Spark keeps only those elements in the RDD that satisfy a given predicate. It's suitable for tasks like data cleansing and subsetting.
Example PySpark Code for Word Counts:

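  # Assumes an existing SparkContext named sc.
  RDD1 = sc.textFile("/storage/home/juy1/work/US_Iran.csv")
  # Split each line into words, flattening the result into a single RDD of words.
  RDD2 = RDD1.flatMap(lambda x: x.split(" "))
  # Pair each word with 1, then sum the counts per word using 5 partitions.
  RDD3 = RDD2.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b, 5)
  # saveAsTextFile is the action; it writes a directory of part files at this path.
  RDD3.saveAsTextFile("/storage/home/juy1/work/US_IranWordCounts.txt")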

Generating Error Messages with Incorrect Input Filename:
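- An incorrect filename in the sc.textFile command does not produce an error when that statement runs, because textFile is evaluated lazily. In the code above, the error appears only when the first action, saveAsTextFile, forces Spark to read the input file.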
Triggering Generation of Error Messages:
- Adding RDD3.take(3) before saveAsTextFile makes the error message for the incorrect filename appear earlier, because take() is an action and forces evaluation of the transformations that precede it.
Specifying Local Mode vs. Cluster Mode in Spark:
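- The mode is set through the master setting when the SparkContext or SparkSession is created: SparkConf().setMaster("local") selects local mode, while cluster mode uses the cluster manager's master (for example "yarn" or a spark://host:port URL), given either in the code or at job submission.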
Consequences of Running SparkContext Again in Local Mode:
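- Running the SparkContext initialization again while a context is still active raises an error (in PySpark, "ValueError: Cannot run multiple SparkContexts at once"), because only one SparkContext can be active in an application at a time. Stop the existing context with sc.stop() first, or reuse it with SparkContext.getOrCreate().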
Spark Features NOT Supported by RDDs:
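- Not supported: storing key-value pairs whose keys are lists of multiple items. Key-based operations such as reduceByKey hash the key, and Python lists are not hashable; a tuple can serve as a multi-item key instead.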
- Other features are supported: recomputation, persistence, lazy evaluation, and joins.
Sorting Key-Value RDDs on Values:
- To sort a key-value RDD by value: use RDD.sortBy(lambda x: x[1]).
- For descending order: RDD.sortBy(lambda x: x[1], ascending=False).

Topic 4: RDD-Based Integration and Lab 4 Hashtag Change

Requirements for Joining Two RDDs:
- Both RDDs must be key-value pair RDDs. A join matches elements whose keys are equal and pairs up their values.
Differences in Join Types:
- Inner Join: Includes only keys present in both RDDs.
- Left Outer Join: Includes all keys from the first RDD, matching keys from the second RDD where applicable.
- Right Outer Join: Opposite of left, includes all keys from the second RDD.
- Full Outer Join: Includes all keys from both RDDs, with None in place of the value from whichever RDD lacks the key.
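The four join types can be emulated on small lists of (key, value) pairs to see which keys survive. This is a simplified sketch: the join helper below is hypothetical, assumes each key appears at most once per input, and sorts its output for readability, whereas Spark's join, leftOuterJoin, rightOuterJoin, and fullOuterJoin emit one result per matching pair and guarantee no order.

```python
def join(left, right, how="inner"):
    """Emulate RDD join semantics on lists of (key, value) pairs with unique keys."""
    l, r = dict(left), dict(right)
    if how == "inner":
        keys = l.keys() & r.keys()   # keys present in both
    elif how == "left":
        keys = l.keys()              # every key of the first input
    elif how == "right":
        keys = r.keys()              # every key of the second input
    elif how == "full":
        keys = l.keys() | r.keys()   # every key of either input
    else:
        raise ValueError(f"unknown join type: {how}")
    # dict.get returns None where a key is missing, mirroring the outer joins.
    return sorted((k, (l.get(k), r.get(k))) for k in keys)

# Made-up sample data.
kv1 = [("#a", 5), ("#b", 2)]
kv2 = [("#b", 7), ("#c", 1)]

inner = join(kv1, kv2, "inner")
left_outer = join(kv1, kv2, "left")
right_outer = join(kv1, kv2, "right")
full_outer = join(kv1, kv2, "full")
```

Only the inner join is free of None values; each outer join introduces None on the side where the key is absent.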
Join Operations Resulting in Missing Values:
- Outer joins may produce RDDs with missing values where a key appears in only one of the RDDs.
- Missing values can be transformed by defining default values or filtering them out using .filter().
Finding Significant Increases in Hashtags Over Two Days:
- Using an outer join on KV1 (day 1) and KV2 (day 2) can reveal which hashtags increased significantly.
- A mapping function can then calculate the change in counts between the two datasets.
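A plain-Python sketch of this computation follows. The function name hashtag_increases, the min_increase threshold, and the sample counts are all assumptions made for illustration; missing counts are treated as 0 so that hashtags appearing on only one day are still compared.

```python
def hashtag_increases(day1, day2, min_increase):
    """Return (hashtag, increase) pairs whose count rose by at least min_increase."""
    c1, c2 = dict(day1), dict(day2)
    # Full outer join: every hashtag from either day, None where it is missing.
    joined = [(tag, (c1.get(tag), c2.get(tag))) for tag in c1.keys() | c2.keys()]
    # Mapping step: replace None with 0, then compute day 2 minus day 1.
    changes = [(tag, (b or 0) - (a or 0)) for tag, (a, b) in joined]
    # Filter step: keep significant increases, largest first.
    significant = [(tag, delta) for tag, delta in changes if delta >= min_increase]
    return sorted(significant, key=lambda kv: kv[1], reverse=True)

# Made-up (hashtag, count) pairs for two days.
KV1 = [("#news", 10), ("#iran", 3)]
KV2 = [("#iran", 50), ("#breaking", 20), ("#news", 12)]

top_increases = hashtag_increases(KV1, KV2, min_increase=10)
```

An inner join would silently drop "#breaking" here, since it has no day 1 count; the outer join keeps it.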

Running Spark in Cluster Mode

Key Benefits of Cluster Mode:
- Cluster mode provides massive scalability and enables processing of large datasets across multiple machines.
Code Changes for Cluster Mode:
- The Spark application configuration must point to the cluster manager instead of "local", and any dependencies and libraries the job needs must be accessible on the cluster.
Steps before Running pbs-spark-submit:
- Configuration verification, checking filesystem paths, ensuring the cluster is running, and setting environment variables may be needed before submitting jobs.
- Most of these steps need not be repeated unless configurations change.
Checking Execution Success in Log Files:
- Successful execution can be determined by checking for completion messages in the log file indicating execution without errors.
Addressing Bugs After Partial Runs:
- Once bugs in the .py file have been fixed, the script can be run again. Output written by the partial run can interfere, however: saveAsTextFile fails if its output directory already exists, so remove or rename any output left by the earlier run before resubmitting.
Information to Look for in the Log File:
- Key metrics such as the start and end times of tasks, any error messages, resource usage, and task dependencies should be reviewed.

Topic 5: Lab 5 Data Frame-based Integration, Aggregation, and SQL Functions

Big Data Processing with Spark DataFrame:
- DataFrames are preferred for structured data processing and SQL-like operations that require schema enforcement and optimization.
Difference Between SparkSession and SparkContext:
- SparkContext is the entry point for low-level Spark operations, while SparkSession provides a unified entry point for DataFrame APIs and SQL.
Defining a Schema for a DataFrame:
- A schema is defined as a StructType containing one StructField per column; each StructField gives the column's name, data type, and nullability.
Information Specified in a DataFrame Schema:
- Information includes:
- A. Name of each column.
- B. Type of each column.
- C. Nullability of each column.
- D. Whether the column is an array.
Advantage of Defining a Schema Over Inferring It:
- Explicitly defining a schema ensures data correctness, improves performance by avoiding inferring types from large datasets, and allows for greater control over data import.
Joining Two DataFrames:
- The join operation in DataFrames allows the merging of datasets based on common columns, differing from RDD-based joins that merge using keys only.
Generating Titles of Movies Over 100 Reviews:

  df.filter(df.ReviewCount > 100).select("Title")

Generating Titles of Movies with Average Reviews above 3.5:

  df.filter(df.AvgRating > 3.5).select("Title")

Creating Array of Genres from “Genres” Column:

  from pyspark.sql.functions import split
  df.withColumn("GenresArray", split(df.Genres, "\\|"))  # split takes a regex, so "|" must be escaped

Adventure Movies with Review Count above Mean and Avg Rating above 3.5:

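  # array_contains (from pyspark.sql.functions) tests array membership; Column.contains is a string match.
  df.filter(array_contains(df.GenresArray, "Adventure") & (df.ReviewCount > df.AvgReviewCount) & (df.AvgRating > 3.5)).select("Title")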

Extracting RDD Elements from a DataFrame:
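- Select the columns of interest with select(), convert the result with .rdd (which yields an RDD of Row objects), and retrieve elements with an action such as .collect() or .take().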
Definition of Row Object in DataFrame:
- The Row object represents a row of data within a DataFrame and can be accessed using indexing or field name.
Filtering Movies of Specific Genre from DataFrame:

  df.filter(array_contains(df.GenresArray, "Animation"))

Joining DataFrames with a Common Column Name:
- Use join(df2, "commonColumn") for joining based on shared column names.
Calculating Sum Using GroupBy:

  df.groupBy("columnName").sum("valueColumn")

Counting Total Rows for Each Group:

  df.groupBy("columnName").count()

Calculating Average Rating for Movies:

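  # "MovieID" is an assumed name for the column that identifies each movie.
  df.groupBy("MovieID").avg("Rating").withColumnRenamed("avg(Rating)", "AvgRating")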

Topic 6: Lab 6 ALS-Based Recommendation Systems and Hyperparameter Tuning

Principle of Alternating Least Squares (ALS):
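- ALS factorizes the user-item rating matrix into a user-factor matrix and an item-factor matrix. It alternates between holding the item factors fixed while solving a least-squares problem for the user factors, and holding the user factors fixed while solving for the item factors, reducing the regularized squared error on the observed ratings at each step.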
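The alternating idea can be shown with a rank-1 factorization in plain Python, where each least-squares solve has a simple closed form. This is a teaching sketch under stated assumptions, not Spark's ALS: the function names, the sample ratings, and the parameter names iterations and reg are hypothetical, and real ALS uses a rank greater than 1 with a small linear solve per user and per item.

```python
def loss(ratings, x, y, reg):
    """Regularized squared error over the observed ratings."""
    err = sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items())
    return err + reg * (sum(v * v for v in x) + sum(v * v for v in y))

def als_rank1(ratings, num_users, num_items, iterations=10, reg=0.1):
    """Rank-1 ALS: one latent factor per user (x) and per item (y)."""
    x = [1.0] * num_users
    y = [1.0] * num_items
    history = []
    for _ in range(iterations):
        # Fix item factors y; solve the least-squares problem for each user factor.
        for u in range(num_users):
            rated = [(i, r) for (uu, i), r in ratings.items() if uu == u]
            x[u] = sum(r * y[i] for i, r in rated) / (reg + sum(y[i] ** 2 for i, r in rated))
        # Fix user factors x; solve the least-squares problem for each item factor.
        for i in range(num_items):
            rated = [(u, r) for (u, ii), r in ratings.items() if ii == i]
            y[i] = sum(r * x[u] for u, r in rated) / (reg + sum(x[u] ** 2 for u, r in rated))
        history.append(loss(ratings, x, y, reg))
    return x, y, history

# Made-up observed ratings: (user, item) -> rating.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 2): 1.0, (2, 1): 2.0, (2, 2): 1.0}
x, y, history = als_rank1(ratings, num_users=3, num_items=3)
```

Each half-step minimizes the objective exactly with respect to one set of factors, so the recorded loss in history never increases from one iteration to the next.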
Hyperparameters in ALS Training:
- Key hyperparameters include the rank of the factorization, number of iterations, and regularization parameter.
Risk of Overfitting in ALS Models:
- Overfitting occurs when a model is overly complex and performs poorly on unseen data. An example includes a model that accurately predicts training data but fails to generalize, such as memorizing user interactions.
Intuition of Rank in ALS Model:
- The rank represents the number of latent factors used to describe users and items, impacting recommendations quality and model complexity.
Roles of Testing, Validation, and Training Data in Hyperparameter Tuning:
- Validation data is used to fine-tune hyperparameters, training data to build the model, and testing data to assess the model's performance.
True/False Statements:
- T: Pandas DataFrame and Spark SQL DataFrame are different; however, a Pandas DataFrame can be converted to a Spark SQL DataFrame (for example, with spark.createDataFrame).
- T: A Spark DataFrame can be saved in CSV format with headings.
Calculating Root Mean Square (RMS) Error of Validation Data:
- Combine the RDD from predictions and validation dataset using a join, then compute the RMS error using standard formula methods.
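The join-then-compute step can be sketched in plain Python. The tuple layout (user, movie, rating), the function name rmse, and the sample values are assumptions for illustration; in the lab the same join would be done on RDDs keyed by (user, movie).

```python
import math

def rmse(predictions, actuals):
    """Root mean square error over (user, movie) pairs present in both inputs."""
    # Key the predictions by (user, movie), as a key-value RDD would be.
    pred = {(u, m): r for u, m, r in predictions}
    # Inner join on the key: keep only pairs that have both a prediction and an actual rating.
    pairs = [(pred[(u, m)], r) for u, m, r in actuals if (u, m) in pred]
    if not pairs:
        raise ValueError("no (user, movie) pairs in common")
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

# Made-up (user, movie, rating) triples.
predictions = [(1, 10, 4.0), (1, 11, 3.0), (2, 10, 5.0)]
validation = [(1, 10, 5.0), (1, 11, 3.0), (3, 12, 2.0)]

validation_rmse = rmse(predictions, validation)
```

Two pairs match here, with squared errors 1.0 and 0.0, so the result is the square root of 0.5.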
Accessing Elements of Prediction RDD from ALS Model:
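- Use .map() to pull fields out of each prediction (for example, to build ((user, movie), rating) pairs), and RDD actions such as take() or collect() to retrieve them. DataFrame methods do not apply, because the predictions are an RDD rather than a DataFrame.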
Counting Total Movies in Specific Genres:
- Filter the RDD or DataFrame for movies of the desired genre and use .count() to determine the total.
Evaluating Model Based on Best Hyper-Parameter Combination:
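- In the testing phase, use the model that was trained on the training set with the hyper-parameter combination that performed best on the validation set. The testing data is used only for this final evaluation, not for choosing hyper-parameters.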
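The division of labor among the three data splits can be sketched with a deliberately trivial model in plain Python. Everything here is hypothetical: the "model" is a single constant (a mean shrunk toward 0 by a regularization value reg), the data are made up, and only the tuning workflow carries over to ALS.

```python
import math

def fit(train, reg):
    """Toy model: a single predicted rating, shrunk toward 0 as reg grows."""
    return sum(train) / (len(train) + reg)

def rmse(model, data):
    """RMSE of predicting the constant `model` for every rating in data."""
    return math.sqrt(sum((model - r) ** 2 for r in data) / len(data))

# Made-up ratings, already split three ways.
train = [4.0, 5.0, 3.0, 4.0]
validation = [4.0, 3.5, 4.5]
test = [4.0, 5.0]

results = []
for reg in [0.0, 1.0, 10.0]:
    model = fit(train, reg)                                   # training data builds the model
    results.append((rmse(model, validation), reg, model))     # validation data scores it

# Pick the hyper-parameter with the lowest validation error.
best_val_rmse, best_reg, best_model = min(results)

# Testing data is touched once, for the selected model only.
test_rmse = rmse(best_model, test)
```

The test error is computed only after selection, so it remains an unbiased estimate of how the chosen configuration performs on unseen data.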
Indicating Best Hyper-Parameter Combinations in Output File:
- The output from Lab 6 will document metrics detailing the performance of each hyper-parameter configuration tested, indicating the most effective set.