Recommender Systems – Content Filtering vs. Collaborative Filtering

Problem Context and Motivation

• 2006 Netflix Prize framed a now-ubiquitous task: predict which items a user will like based on past behaviour.
• Comparable stakes across all major e-commerce and content platforms (movies, music, news, shopping, policy analysis).
• Graph metaphor: people on one side, items on the other; line thickness = strength of preference.
• Red, missing connections = unknown future preferences to estimate.
• Perfect prediction is impossible because:

  1. We forecast future behaviour from past data.

  2. Human taste is dynamic — it drifts over time.

Matrix Representation of Preferences

• Organise data into a matrix $R$:
• Rows = users $u$, columns = items $i$ (movies).
• Cell $R_{ui}$ holds a rating on an agreed scale (e.g. $0 \to 4$, where 0 = "hate", 2 = "neutral", 4 = "love").
• Reality: $R$ is sparse (most users have rated only a few items).
• Core computational goal: infer the missing ratings ($R_{ui} = ?$).
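
One way to hold such a sparse matrix in code is a dictionary of dictionaries, where an absent key means "not rated yet". The sketch below is only illustrative: the user names, the film titles other than The Matrix, and the rating values are made-up toy data, and the helper names `to_dense` and `sparsity` are hypothetical.

```python
# Toy ratings on the 0-4 scale; a missing key means the rating is unknown.
# All names and values here are invented for illustration.
ratings = {
    "alice": {"The Matrix": 0.0, "Airplane!": 4.0},
    "bob": {"The Matrix": 4.0},
    "carol": {"Airplane!": 3.5, "Alien": 1.0},
}

# Column order of R: every item that received at least one rating.
items = sorted({item for row in ratings.values() for item in row})


def to_dense(ratings, items):
    """Return one row per user, with None marking each unknown rating."""
    return {user: [row.get(item) for item in items] for user, row in ratings.items()}


def sparsity(ratings, items):
    """Fraction of cells in R that hold no rating."""
    known = sum(len(row) for row in ratings.values())
    return 1.0 - known / (len(ratings) * len(items))


dense = to_dense(ratings, items)

# The (user, item) pairs a recommender would have to estimate.
missing = [(u, i) for u, row in ratings.items() for i in items if i not in row]
```

Storing only the known cells keeps memory proportional to the number of ratings rather than to users × items, which matters once the matrix is mostly empty.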

Content-Based (Feature) Filtering

• Idea: "Tell me what the item is like and what the user likes; I’ll compute the match".
• Steps

  1. Define explicit features (genres, actors, mood…).

  2. Encode each user as a feature vector $P_{u*}$ (e.g. Alice = $[3, 0]$: comedy 3, action 0).

  3. Encode each item as a feature vector $Q_{*i}$ (e.g. The Matrix = $[0, 4]$: comedy 0, action 4).

  4. Compute predicted preference via dot product / matrix multiplication.
    • For Alice and The Matrix: $3\times0 + 0\times4 = 0$ → "She’ll hate it".
    • Matrix view:
      • User-Feature matrix $P$ (size $U\times K$).
      • Feature-Item matrix $Q$ (size $K\times I$).
      • Predicted ratings $\hat R = P\,Q$.
    • Optional scale correction: with two features each scored 0 to 4, the raw dot product is at most $4\times4 + 4\times4 = 32$, so dividing by 8 and rounding to the nearest 0.5 keeps $\hat R\in[0,4]$.

• Limitations
  • Needs many accurate, human-supplied features → heavy onboarding friction.
  • Users often can’t articulate their tastes; some factors are subconscious.
  • Model oversimplifies nuanced preference patterns → mediocre accuracy.
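
The content-based steps above can be sketched in a few lines of plain Python. Alice's vector and the vector for The Matrix come from the notes; Bob, the second film, and the function names are assumptions added only to make the example less trivial. Rounding half up is also an assumption, since the notes only say "round to nearest 0.5".

```python
import math

# Assumed feature order: [comedy, action], each scored 0-4.
P = {
    "Alice": [3, 0],          # from the notes
    "Bob": [1, 4],            # hypothetical extra user
}
Q = {
    "The Matrix": [0, 4],     # from the notes
    "Airplane!": [4, 0],      # hypothetical extra item
}


def raw_score(user_vec, item_vec):
    """Dot product of a user's preferences with an item's features."""
    return sum(p * q for p, q in zip(user_vec, item_vec))


def predict(user_vec, item_vec, divisor=8):
    """Scale the raw score into 0-4 and round (half up) to the nearest 0.5.

    The divisor 8 suits the two-feature demo: the largest raw score is 32.
    """
    scaled = raw_score(user_vec, item_vec) / divisor
    return math.floor(scaled * 2 + 0.5) / 2


alice_matrix = predict(P["Alice"], Q["The Matrix"])
bob_matrix = predict(P["Bob"], Q["The Matrix"])
```

With more than two features the divisor would need to change, since the largest possible raw score grows with the number of features.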

Collaborative Filtering (Latent Factor Model)

• Philosophy: "People similar to you liked X; therefore you might like X".
• Reference milestone: 2009 paper by Koren, Bell & Volinsky ("Matrix Factorization Techniques for Recommender Systems").
• Key twist: learn the features directly from rating patterns, not from questionnaires.
• Procedure

  1. Start with sparse $R$.

  2. Factorise $R \approx P\,Q$ where:
    $P$ = User-LatentFeature matrix ($U\times K$).
    $Q$ = LatentFeature-Item matrix ($K\times I$).
    $K$ (10–100) $\ll \min(U,I)$.

  3. Use a machine-learning optimiser (e.g. stochastic gradient descent, alternating least squares) to minimise error:
    $\min_{P,Q} \sum_{(u,i)\,\in\,\text{known}} (R_{ui}-P_u Q_i)^2 + \lambda (\|P_u\|^2+\|Q_i\|^2)$.

  4. Resulting vectors capture latent factors (taste dimensions we cannot directly name).

  5. Predict missing entries with $\hat R = P\,Q$; fill in the whole matrix.

• Interpretation of latent factors
  • May correlate with genre, pacing, era, "cult classic" vibe, etc., but labels remain unknown → emergent structure.
• Why accuracy improves
  • Factors arise from actual co-rating behaviour, capturing subtle, non-obvious affinities.
• Connection to compression
  • Full $R$ (millions × thousands) stored as two much smaller matrices.
  • Possible because $R$ carries pattern, not random noise.
  • If ratings were random, low-rank factorisation would incur massive reconstruction error.
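
The factorisation procedure above can be sketched with plain stochastic gradient descent on the known ratings only. This is a minimal illustration, not the method of any particular paper or library: the function names, the hyperparameters (`k`, `lr`, `lam`, `epochs`) and the 4 × 4 toy data with two hidden cells are all assumptions.

```python
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user row P[u] and item column i of Q."""
    return sum(P[u][f] * Q[f][i] for f in range(len(Q)))


def factorise(known, n_users, n_items, k=2, lr=0.05, lam=0.01, epochs=500, seed=0):
    """Fit R ~ P Q on (user, item, rating) triples; unknown cells are never touched."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(n_users)]  # U x K
    Q = [[rng.uniform(0, 1) for _ in range(n_items)] for _ in range(k)]  # K x I
    samples = list(known)
    for _ in range(epochs):
        rng.shuffle(samples)
        for u, i, r in samples:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                p, q = P[u][f], Q[f][i]
                # Gradient step on squared error plus L2 penalty (factor 2 folded into lr).
                P[u][f] += lr * (err * q - lam * p)
                Q[f][i] += lr * (err * p - lam * q)
    return P, Q


def rmse(known, P, Q):
    """Root-mean-square error over the known ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in known)
    return (total / len(known)) ** 0.5


# Toy data: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
# Cells (0, 3) and (3, 0) are deliberately left out as "unknown".
known = [
    (0, 0, 4), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 4), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 2, 4), (2, 3, 4),
    (3, 1, 1), (3, 2, 4), (3, 3, 4),
]

P, Q = factorise(known, n_users=4, n_items=4)
train_error = rmse(known, P, Q)
guess_for_hidden_cell = predict(P, Q, 0, 3)
```

Nothing in the code names the two factors; whatever meaning they carry emerges from the rating pattern alone. Alternating least squares, also mentioned above, would replace the inner update loop with closed-form solves for `P` and `Q` in turn.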

Broader Applications & Synthetic Control

• Closely related math is used in policy evaluation ("synthetic control"):
  • To gauge the effect of gun control or a minimum-wage hike in City A, blend data from Cities B, C, D with similar latent characteristics → counterfactual outcome.
• Any domain where users interact with items: e-commerce, social feeds, ads, playlists, academic paper suggestions.

Ethical, Philosophical & Practical Considerations

• Burden-of-input trade-off: content filtering asks intrusive questions, collaborative filtering offloads work to the algorithm.
• Taste manipulation vs. reflection: recommender may shape future preferences, blurring prediction and influence.
• Dynamic preference drift: models must retrain to track temporal changes.
• Privacy: latent factors are learned from many individuals → risk of de-anonymisation or sensitive attribute leakage.

Key Numerical & Formula Recap

• Rating scale: $0, 0.5, 1, 1.5, \dots, 4$ (after normalisation).
• Dot-product example: $\text{score}_{user,item}=\sum_k P_{uk}Q_{ki}$.
• Normalisation step for two-feature demo: divide raw $P\,Q$ by 8, then round.
• Optimisation objective (regularised MSE) shown above.
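
For reference, a sketch of how stochastic gradient descent would act on that objective for a single known rating $R_{ui}$. The learning rate $\eta$ is an assumed symbol not used in the notes, and the constant factor 2 from differentiating the squares is folded into it; $P_u$ is treated as a row vector and $Q_i$ as a column vector.

```latex
\begin{aligned}
e_{ui} &= R_{ui} - P_u Q_i \\
P_u &\leftarrow P_u + \eta \left( e_{ui}\, Q_i^{\top} - \lambda P_u \right) \\
Q_i &\leftarrow Q_i + \eta \left( e_{ui}\, P_u^{\top} - \lambda Q_i \right)
\end{aligned}
```

Each step nudges the two vectors so that their dot product moves toward the observed rating, while the $\lambda$ terms shrink them to limit overfitting.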

Connections to Prior Knowledge

• Matrix factorisation parallels Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) — both uncover low-dimensional structure.
• Compression analogy echoes Fourier or JPEG transforms: leverage redundancy to store information efficiently.
• Links to clustering: latent factors implicitly cluster users/items in feature space.
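
The compression analogy can be made concrete by counting stored numbers. The sizes below are assumptions chosen to match the "millions × thousands" scale mentioned in the notes, with K taken from the stated 10–100 range; the function names are hypothetical.

```python
def full_cells(n_users, n_items):
    """Numbers needed to store every cell of the rating matrix R."""
    return n_users * n_items


def factor_cells(n_users, n_items, k):
    """Numbers needed to store P (U x K) plus Q (K x I)."""
    return k * (n_users + n_items)


# Assumed sizes for illustration only.
U, I, K = 1_000_000, 10_000, 50

dense_size = full_cells(U, I)         # 10,000,000,000 cells
compact_size = factor_cells(U, I, K)  # 50,500,000 cells
ratio = dense_size / compact_size     # roughly 198x smaller
```

The saving is only meaningful if the product of the two small matrices reproduces the ratings well, which is the point about pattern versus random noise.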

Summary Bullets

• Recommender problem = predict unknown entries in a user-item rating matrix.
• Content filtering relies on explicit, human-defined features → simple but limited.
• Collaborative filtering discovers latent features via matrix factorisation → higher accuracy, less user burden.
• Latent factors provide both predictive power and a compressed representation.
• Technique generalises from Netflix movies to music, shopping, policy forecasting, etc.
• Success hinges on shared patterns; purely random data defeats the approach.

Problem Context and Motivation

  • The 2006 Netflix Prize fundamentally shaped the now-widespread challenge of predicting user preferences for items based on their historical behavior. This task is crucial for all major e-commerce and content streaming platforms, spanning domains like movies, music, news, retail, and even sophisticated applications such as policy analysis.

  • This problem can be visualised using a graph metaphor:

    • Users are on one side, and items (e.g., movies, products) are on the other.

    • Connections (edges) between users and items represent known preferences (e.g., a rating given by a user to a movie).

    • The thickness of these lines can denote the strength or intensity of the preference (e.g., a higher rating).

    • Red or missing connections signify unknown future preferences that the system aims to estimate.

  • Achieving perfect prediction is inherently impossible for several reasons:

    1. We are forecasting future behavior using only past data, which is an inherently uncertain endeavor.

    2. Human taste is dynamic and fluid; it naturally drifts and evolves over time due to new experiences, cultural shifts, or personal development.

    3. Data sparsity is a significant challenge: users typically interact with and rate only a tiny fraction of the available items.

Matrix Representation of Preferences

  • User-item interaction data is conventionally organised into a matrix, denoted as $R$:

    • Rows of the matrix represent individual users ($u$).

    • Columns represent specific items ($i$), such as movies.

    • Each cell $R_{ui}$ contains a known rating provided by user $u$ for item $i$, typically on an agreed-upon scale (e.g., $0 \to 4$, where 0 signifies "hate," 2 "neutral," and 4 "love").

  • In reality, the matrix $R$ is highly sparse, meaning that the vast majority of its cells are empty (most users have rated only a small subset of the available items). This sparsity is a fundamental challenge.

  • The core computational objective of recommender systems is to infer or estimate these missing ratings ($R_{ui} = ?$) to provide personalised suggestions.

Content-Based (Feature) Filtering

  • This approach is founded on the principle: *
