Comprehensive Study Notes: Time-Constrained Reinforcement Learning for eCommerce Recommendations

Project Overview and Core Objectives

  • Context: Group 6 consisting of Sejin Chen, Song Young Liu, and AFMA Fujil Kabir, focuses on the paper "Time Constraint Recommendations: Reinforcement Learning Strategies for eCommerce."

  • Conceptual Framing: The research reframes the recommendation system problem from a simple ranking task to a budget-constrained sequential decision problem.

  • Operational Scenario: It simulates shopping and scroll-based interfaces where a user interacts with items under a limited time budget.

  • Mechanism of Interaction:     * Episodes: The interaction is divided into episodes containing multiple recommendation turns.     * Slates: At each turn, the agent presents a set of items known as a "slate."     * User Action: The user can select one (or more, in extensions) item from the slate or ignore the entire set.     * Agent Observation: The agent monitors the remaining time budget and the history of items shown or clicked.

Markov Decision Process (MDP) Formulation

  • State Space (SS): Comprises the remaining user time budget and the items currently placed in the slate (QDQD).

  • Action Space (AA): The selection of an item from the provided slate.

  • Reward Response (RR):     * Reward is 11 if the user selects/clicks an item.     * Reward is 00 if no item is selected.     * Constraint Logic: Clicking is contingent on the time budget. If the evaluation cost of a selected item exceeds the remaining budget, the user cannot click, and the reward is 00.

  • Transition Function: Defined by the environmental response to actions, specifically how the budget depletes and the slate evolves.

  • Objective: The system aims to optimize a discounted return objective to prioritize long-term engagement over immediate rewards.

Learning Pipeline and System Architecture

  • RL Approach: Model-free and value-based methodology.

  • Baselines: The study utilizes SARSA (on-policy) and Q-learning (off-policy) for performance comparison.

  • User Simulator:     * Utilizes a Transformer-based deep neural network.     * Outputs a "relevance score" from which the simulator determines click probability.

  • Policy Engine: Uses an XGBoost regressor to act as the policy (approximating the Q-function or V-function) to maximize total clicks.

Replication of Original Paper Experiments

  • Goal: Verify if the implementation reproduces the high-level behavior where Q-learning achieves higher play rates than SARSA.

  • Simulation Parameters:     * Number of Users: 150150.     * Slate Size: 3030.     * User Budgets: Range from 100100 to 500500.     * Discount Factors (γ\gamma): Range from 0.20.2 to 1.01.0.

  • Hardware and Constraints:     * Experiments ran for approximately 183183 minutes on a MacBook Pro.     * Used a subset of the full dataset due to the original dataset's size constraints.

  • Primary Metrics:     * Play Rate: Defined as the proportion of slates where at least one click occurred. Measures user engagement.     * Effective Slate Size: Measures the number of items a user can actually evaluate within their budget before abandonment.

  • Replication Findings:     * Play Rate: Q-learning consistently outperformed SARSA across all budget settings, matching the original paper's visual and qualitative conclusions.     * Effective Slate Size: Results were more sensitive to the budget and discount factor. SARSA performed poorly under tight budgets but stabilized as the budget increased, reducing the negative impact of exploration.     * Sensitivity to γ\gamma: Performance change from γ=0.2\gamma = 0.2 to γ=0.8\gamma = 0.8 showed Q-learning had strong positive changes under tight budgets, while SARSA was more irregular.

Extension 1: Feature-Dependent Cost Models

  • Limitation Addition: The original paper used uniform cost sampling (00 to 100100), making costs independent of item characteristics.

  • Hypothesis: If costs are structured according to item features, an RL agent can learn this structure to make better budget-aware recommendations.

  • Implementation Details:     * Extension 1a: Costs determined by item features (category, price bucket, description length).     * Extension 1b: Costs determined by features plus the position in the slate.     * Position Penalty Formula: \text{cost_adjusted} = \text{cost_base} \times (1 + 0.1 \times \ln(1 + \text{position})).

  • State Representation for Agent: The agent receives a 5-dimensional input including position, relevance, cost, budget-to-go, and accumulated relevance.

  • Experimental Results:     * Play Rate: No significant improvement observed. Baseline (uniform cost) and feature-dependent models showed overlapping confidence intervals.     * RL Advantage: The impact of the discount factor diminished under feature-based costs, suggesting the greedy policy (γ=0\gamma = 0) was already near-optimal.

  • Analysis of Failure to Improve:     * Observational Limitation: The agent observes the scalar cost value but not the underlying features that generated it. From the perspective of the XGBoost agent, a uniform cost of 42.542.5 and a feature-derived cost of 42.542.5 are identical.     * Fixed Item Pool: The agent may have memorized costs of items rather than learning to generalize.

Extension 4: Multi-Click Scenario and Reward Function Improvement

  • Limitation Addition: The original paper allowed only one click per slate. In real scenarios, users frequently click multiple items.

  • Multi-Click Implementation:     * Switched from categorical modeling to Bernoulli trials for every item in a slate based on individual relevance scores.     * Modified the Relevance Score Model (PRM) to use Binary Cross Entropy (BCE) loss instead of standard Cross Entropy loss.

  • Improved Reward Function:     * In the original, all clicks were rewarded equally.     * New Mapping:         * Ignore: 00         * Click: 11         * Add to Cart: 22         * Purchase: 33     * The Alibaba personalized re-ranking dataset lacked "Add to Cart" and "Purchase" flags, so these were simulated via additional Bernoulli trials (e.g., 40%40\% chance of clicking an item leading to an "Add to Cart" event).

  • Metrics for Extension 4:     * Average Clicks per user slate.     * Average Total Reward collected per slate.

  • Findings:     * Multi-Click Impact: Substantially improved play rates and engagement across all budgets. The multi-click setting allowed for a more "expressive" interaction model.     * Reward Alone: Changing the reward structure without moving to a multi-click setting often hurt or failed to improve the play rate.     * Combined Effect: The combination of Multi-Click and Improved Reward was the strongest setting across most metrics, specifically yielding the highest average total rewards.

Feedback and Challenges

  • Professor Feedback:     * Addresseed unclear figures for Extension 4 by improving plot interpretability and aligning with the original paper's visual style.     * Clarified the baseline comparison (comparing against single-click/no-reward-improvement models).

  • Student Suggestions:     * Suggestions for Deep Q-Networks (DQN) were acknowledged but deprioritized in favor of real-world environment adaptation.     * Plotting suggestions were adopted for clearer data representation.

  • Data Challenges:     * Missing fields for purchase and cart data required the use of random sampling and Bernoulli distribution simulations.     * Computational Intensity: Managing 640640 million evaluation steps required migrating to the NJIT Wolverine HPC and separating SARSA and Q-learning into parallel SLURM jobs.

Synthesis and Future Directions

  • Principal Lesson: Modelling the interaction process (User Simulator) realistically is a more powerful lever for performance than simply redesigning reward weights or using more complex value functions like DQNs.

  • State Space Importance: Extension 1 proved that if features driving costs are hidden from the agent, it cannot exploit the structure. Future work should expand the state space from 55 dimensions to 2121 dimensions to include specific item features.

  • Real-World Application: Successful RL in e-commerce requires granular reward models (differentiating clicks from purchases) and realistic simulators that allow for early abandonment and multi-item interaction.