Comprehensive Study Notes: Time-Constrained Reinforcement Learning for eCommerce Recommendations

Project Overview and Core Objectives

Context: Group 6 consisting of Sejin Chen, Song Young Liu, and AFMA Fujil Kabir, focuses on the paper "Time Constraint Recommendations: Reinforcement Learning Strategies for eCommerce."
Conceptual Framing: The research reframes the recommendation system problem from a simple ranking task to a budget-constrained sequential decision problem.
Operational Scenario: It simulates shopping and scroll-based interfaces where a user interacts with items under a limited time budget.
Mechanism of Interaction: * Episodes: The interaction is divided into episodes containing multiple recommendation turns. * Slates: At each turn, the agent presents a set of items known as a "slate." * User Action: The user can select one (or more, in extensions) item from the slate or ignore the entire set. * Agent Observation: The agent monitors the remaining time budget and the history of items shown or clicked.

Markov Decision Process (MDP) Formulation

State Space ( $S$ ): Comprises the remaining user time budget and the items currently placed in the slate ( $QD$ ).
Action Space ( $A$ ): The selection of an item from the provided slate.
Reward Response ( $R$ ): * Reward is $1$ if the user selects/clicks an item. * Reward is $0$ if no item is selected. * Constraint Logic: Clicking is contingent on the time budget. If the evaluation cost of a selected item exceeds the remaining budget, the user cannot click, and the reward is $0$ .
Transition Function: Defined by the environmental response to actions, specifically how the budget depletes and the slate evolves.
Objective: The system aims to optimize a discounted return objective to prioritize long-term engagement over immediate rewards.

Learning Pipeline and System Architecture

RL Approach: Model-free and value-based methodology.
Baselines: The study utilizes SARSA (on-policy) and Q-learning (off-policy) for performance comparison.
User Simulator: * Utilizes a Transformer-based deep neural network. * Outputs a "relevance score" from which the simulator determines click probability.
Policy Engine: Uses an XGBoost regressor to act as the policy (approximating the Q-function or V-function) to maximize total clicks.

Replication of Original Paper Experiments

Goal: Verify if the implementation reproduces the high-level behavior where Q-learning achieves higher play rates than SARSA.
Simulation Parameters: * Number of Users: $150$ . * Slate Size: $30$ . * User Budgets: Range from $100$ to $500$ . * Discount Factors ( $\gamma$ ): Range from $0.2$ to $1.0$ .
Hardware and Constraints: * Experiments ran for approximately $183$ minutes on a MacBook Pro. * Used a subset of the full dataset due to the original dataset's size constraints.
Primary Metrics: * Play Rate: Defined as the proportion of slates where at least one click occurred. Measures user engagement. * Effective Slate Size: Measures the number of items a user can actually evaluate within their budget before abandonment.
Replication Findings: * Play Rate: Q-learning consistently outperformed SARSA across all budget settings, matching the original paper's visual and qualitative conclusions. * Effective Slate Size: Results were more sensitive to the budget and discount factor. SARSA performed poorly under tight budgets but stabilized as the budget increased, reducing the negative impact of exploration. * Sensitivity to $\gamma$ : Performance change from $\gamma = 0.2$ to $\gamma = 0.8$ showed Q-learning had strong positive changes under tight budgets, while SARSA was more irregular.

Extension 1: Feature-Dependent Cost Models

Limitation Addition: The original paper used uniform cost sampling ( $0$ to $100$ ), making costs independent of item characteristics.
Hypothesis: If costs are structured according to item features, an RL agent can learn this structure to make better budget-aware recommendations.
Implementation Details: * Extension 1a: Costs determined by item features (category, price bucket, description length). * Extension 1b: Costs determined by features plus the position in the slate. * Position Penalty Formula: \text{cost_adjusted} = \text{cost_base} \times (1 + 0.1 \times \ln(1 + \text{position})).
State Representation for Agent: The agent receives a 5-dimensional input including position, relevance, cost, budget-to-go, and accumulated relevance.
Experimental Results: * Play Rate: No significant improvement observed. Baseline (uniform cost) and feature-dependent models showed overlapping confidence intervals. * RL Advantage: The impact of the discount factor diminished under feature-based costs, suggesting the greedy policy ( $\gamma = 0$ ) was already near-optimal.
Analysis of Failure to Improve: * Observational Limitation: The agent observes the scalar cost value but not the underlying features that generated it. From the perspective of the XGBoost agent, a uniform cost of $42.5$ and a feature-derived cost of $42.5$ are identical. * Fixed Item Pool: The agent may have memorized costs of items rather than learning to generalize.

Extension 4: Multi-Click Scenario and Reward Function Improvement

Limitation Addition: The original paper allowed only one click per slate. In real scenarios, users frequently click multiple items.
Multi-Click Implementation: * Switched from categorical modeling to Bernoulli trials for every item in a slate based on individual relevance scores. * Modified the Relevance Score Model (PRM) to use Binary Cross Entropy (BCE) loss instead of standard Cross Entropy loss.
Improved Reward Function: * In the original, all clicks were rewarded equally. * New Mapping: * Ignore: $0$ * Click: $1$ * Add to Cart: $2$ * Purchase: $3$ * The Alibaba personalized re-ranking dataset lacked "Add to Cart" and "Purchase" flags, so these were simulated via additional Bernoulli trials (e.g., $40\%$ chance of clicking an item leading to an "Add to Cart" event).
Metrics for Extension 4: * Average Clicks per user slate. * Average Total Reward collected per slate.
Findings: * Multi-Click Impact: Substantially improved play rates and engagement across all budgets. The multi-click setting allowed for a more "expressive" interaction model. * Reward Alone: Changing the reward structure without moving to a multi-click setting often hurt or failed to improve the play rate. * Combined Effect: The combination of Multi-Click and Improved Reward was the strongest setting across most metrics, specifically yielding the highest average total rewards.

Feedback and Challenges

Professor Feedback: * Addresseed unclear figures for Extension 4 by improving plot interpretability and aligning with the original paper's visual style. * Clarified the baseline comparison (comparing against single-click/no-reward-improvement models).
Student Suggestions: * Suggestions for Deep Q-Networks (DQN) were acknowledged but deprioritized in favor of real-world environment adaptation. * Plotting suggestions were adopted for clearer data representation.
Data Challenges: * Missing fields for purchase and cart data required the use of random sampling and Bernoulli distribution simulations. * Computational Intensity: Managing $640$ million evaluation steps required migrating to the NJIT Wolverine HPC and separating SARSA and Q-learning into parallel SLURM jobs.

Synthesis and Future Directions

Principal Lesson: Modelling the interaction process (User Simulator) realistically is a more powerful lever for performance than simply redesigning reward weights or using more complex value functions like DQNs.
State Space Importance: Extension 1 proved that if features driving costs are hidden from the agent, it cannot exploit the structure. Future work should expand the state space from $5$ dimensions to $21$ dimensions to include specific item features.
Real-World Application: Successful RL in e-commerce requires granular reward models (differentiating clicks from purchases) and realistic simulators that allow for early abandonment and multi-item interaction.