Project Overview of CoLight

In today's presentation, we focus on CoLight, a multi-agent reinforcement learning framework designed for traffic signal control. This project is a collaborative effort among Gerald, Vivek, and Jaime. Jaime will cover the project overview, feedback on our milestone, and problem formulation, followed by Vivek's discussion of the modeling results. Our study builds on the CoLight paper by Wei et al., published in 2019.

Problem Addressed

The CoLight framework addresses traffic signal control in urban environments. In most real-world deployments, traffic lights either operate independently or follow fixed timing plans, which often results in poor coordination and congestion. Because intersections are highly interconnected, with decisions at one intersection affecting others nearby, CoLight adopts a multi-agent reinforcement learning (MARL) approach in which each intersection is treated as an individual agent.

Key Components of CoLight's Approach

  1. Model Architecture: The framework combines deep Q-learning with graph attention networks (GAT). This innovative model enables each agent to dynamically assess which neighboring intersections hold the most significance, thereby enhancing decision-making compared to previous models that treated all neighbors as equally important.
  2. Overall Goal: The principal objective of CoLight is to reduce traffic congestion and improve overall traffic flow across the entire network.
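The attention idea in item 1 can be illustrated with a minimal, framework-free sketch. All names and dimensions here are hypothetical; a real GAT layer additionally uses multi-head attention and trains its parameters end to end:

```python
import numpy as np

rng = np.random.default_rng(0)

def gat_attention(h_self, h_neighbors, W, a):
    """One attention head over a single intersection's neighborhood.

    h_self:      (d,)        raw features of the focal intersection
    h_neighbors: (k, d)      raw features of its k neighbors (itself included)
    W:           (d, d_out)  shared linear projection
    a:           (2*d_out,)  attention vector
    """
    z_self = h_self @ W
    z_nbrs = h_neighbors @ W
    # GAT-style scores: LeakyReLU(a . [W h_i || W h_j]) for each neighbor j
    pairs = np.concatenate([np.tile(z_self, (len(z_nbrs), 1)), z_nbrs], axis=1)
    raw = pairs @ a
    scores = np.where(raw > 0, raw, 0.2 * raw)   # LeakyReLU(alpha=0.2)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the neighborhood
    return weights @ z_nbrs, weights             # aggregated embedding + weights

# Toy example: 4 neighbors, 3 raw features each (e.g., per-lane queue lengths)
W = rng.normal(size=(3, 4))
a = rng.normal(size=(8,))
embedding, weights = gat_attention(rng.normal(size=3), rng.normal(size=(4, 3)), W, a)
```

The softmax weights are exactly what lets an agent treat some neighbors as more significant than others, rather than averaging them uniformly.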

Feedback from Midterm Demo

During our midterm demonstration, we received vital feedback highlighting several areas for improvement:

  1. Reward Function: The existing pressure-based reward, determined by queue length, was not sufficiently justified, prompting plans to enrich it by incorporating additional weighted components.
  2. Experimental Results: There was a request for more robust experimental results, including learning curves and quantifiable metrics, which we are now addressing in this presentation.
  3. Timeline Clarity: Our initial project timeline lacked clarity, which we have since refined to indicate that we’re on track with our goals.
  4. Omitted Experiments: We clarified that some experiments, such as those involving Copilot Excel, were skipped due to complexity and time limitations.
  5. Comparative Analysis and Scalability: Suggestions were made to improve performance comparisons and address scalability issues, particularly for larger traffic networks.

Current Open Issues

As we continue our project, several open issues persist:

  1. Reward Function Refinement: Enhancements are underway to adequately justify the reward function, with future expansions intended.
  2. Training Convergence: Training has not fully converged, particularly in larger environments, necessitating additional training rounds.
  3. Scalability Concerns: As we scale to larger grid sizes, scalability remains a challenge requiring further exploration.
  4. Validation of Original Results: We are still in the process of fine-tuning to align our results with those reported in the original paper, which necessitates ongoing validation efforts.

Motivation Behind the Study

The core motivation encapsulated by the paper stems from the necessity for coordinated traffic signals. Traditional traffic management systems, based on fixed time or rule-based methods, may not adapt to the unpredictable patterns of real-world traffic.

Reinforcement Learning as a Solution

Reinforcement learning (RL) emerges as an appropriate method since it allows agents to learn optimal behaviors from real-time interactions with their environments, independent of rigid assumptions. Yet, deploying RL in this multi-agent context introduces complexity, including:

  1. Multi-Agent Influence: Decisions made by one agent impact others.
  2. Non-Stationary Environments: All agents are concurrently learning, complicating state representation.
  3. Scaling Complexity: Computational demands increase dramatically as more intersections are introduced into the model.

The paper’s key contribution lies in employing graph attention mechanisms, which enable agents to selectively focus on the most impactful neighboring intersections, aiding in the orchestration of traffic management.

Impact of the Study

Resolving traffic coordination challenges can lead to:

  1. Reduced congestion
  2. Improved travel efficiency
  3. Lower emissions
  4. Advances in intelligent urban infrastructure
  5. Contributions to the fields of multi-agent reinforcement learning and graph-based models.

Problem Formulation

In the CoLight framework, each intersection functions as an agent within an environment modeled by a traffic simulator. While the original paper used the SUMO simulator, our implementation uses CityFlow, which better accommodates larger traffic networks. At every time step, each agent observes a state that includes:

  • Number of vehicles waiting at each lane
  • Current traffic light phase

Following observation, the agent selects an action, determining the next traffic light phase. The environment then transitions into a new state, and the agent receives a reward, usually tied to queue length or traffic pressure, since minimizing these quantities directly reduces congestion.
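The observe-act-reward cycle just described can be sketched in a few lines; the helper names and the four-lane layout are illustrative, not CityFlow's API:

```python
# Hypothetical single interaction step for one intersection agent.

def observe(queues, phase):
    """State = queue length per incoming lane plus the current light phase."""
    return tuple(queues) + (phase,)

def reward_from_queues(queues):
    """Pressure-style signal: fewer waiting vehicles means a higher reward."""
    return -sum(queues)

queues = [3, 0, 5, 2]           # vehicles waiting on four incoming lanes
phase = 1                       # current traffic-light phase index
state = observe(queues, phase)  # (3, 0, 5, 2, 1)
action = 2                      # agent selects the next phase to switch to
r = reward_from_queues(queues)  # -10
```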

Challenges in Learning

The learning process is complicated due to:

  1. Agent Interconnectivity: Decisions by one intersection affect many others, necessitating cooperative learning.
  2. Delayed Rewards: The non-instantaneous nature of rewards impedes efficient learning progression.
  3. Expanding State Space: The complexity of the state space escalates swiftly with the addition of intersections.

Core Innovations of CoLight

In summary, the primary innovation of CoLight lies in its use of GAT to steer attention toward the most relevant neighboring intersections, improving decision-making at each traffic signal. The computational components of the model include:

  • A value function that estimates the expected cumulative reward of taking an action in a given state.
  • An attention equation that determines which neighboring intersections matter most, based on contextual relevance.
  • Policy optimization based on the Bellman optimality equation, with a Q-learning update that minimizes the loss function using a target network and replay buffer for stable learning.
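The Q-learning update with a target network and replay buffer can be sketched with a toy linear Q-function standing in for the GAT-based network; all names, sizes, and constants below are illustrative:

```python
import random
from collections import deque

import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.95

# Linear Q-function Q(s) = s @ W, one column per action: a toy stand-in
# for the paper's GAT-based network.
W = rng.normal(size=(4, 2)) * 0.1   # online network
W_target = W.copy()                 # target network (periodically synced copy)

# Replay buffer filled with random (state, action, reward, next_state) transitions.
replay = deque(maxlen=10_000)
for _ in range(100):
    s, s2 = rng.normal(size=4), rng.normal(size=4)
    replay.append((s, int(rng.integers(2)), float(rng.normal()), s2))

def td_update(batch, lr=0.01):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q_target(s', a')."""
    for s, a, r, s2 in batch:
        target = r + GAMMA * np.max(s2 @ W_target)  # bootstrapped TD target
        td_error = target - (s @ W)[a]
        W[:, a] += lr * td_error * s                # gradient step on the squared loss

W_before = W.copy()
td_update(random.sample(list(replay), 32))
W_target = W.copy()   # sync the target network every few updates
```

Freezing the target network between syncs and sampling minibatches from the replay buffer are the two mechanisms that keep this update stable, as noted above.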

Learning Pipeline Description

The learning pipeline consists of a sequence of processes:

  1. The environment is encapsulated as a graph representing various states.
  2. The Graph Attention Network processes data across multiple intersections using Q-network updates.
  3. Agents take actions, yielding corresponding rewards, and leveraging a replay buffer for training updates.
  4. The policy for action selection is determined by the agent action with the highest Q-value, optimizing for traffic flow across the network.
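The four pipeline stages above can be tied together in a small runnable skeleton; the traffic dynamics and the "network" here are random stand-ins, not CityFlow or the actual GAT:

```python
import numpy as np

rng = np.random.default_rng(0)

N_INTERSECTIONS, N_PHASES = 4, 2

def encode_graph(queues):
    """Stage 1: wrap raw queues into one feature vector per intersection."""
    return np.stack([[q.sum(), q.max(), q.mean()] for q in queues])

def q_network(features, W):
    """Stage 2: stand-in for the GAT, mapping features to per-phase Q-values."""
    return features @ W

W = rng.normal(size=(3, N_PHASES))
queues = rng.integers(0, 6, size=(N_INTERSECTIONS, 4)).astype(float)
replay = []

for t in range(10):
    feats = encode_graph(queues)
    actions = q_network(feats, W).argmax(axis=1)  # stage 4: greedy phase choice
    rewards = -queues.sum(axis=1)                 # stage 3: queue-based reward
    replay.append((feats, actions, rewards))      # stored for training updates
    # random stand-in for the traffic dynamics
    queues = np.maximum(queues + rng.integers(-2, 3, size=queues.shape), 0)
```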

Distinction from Traditional Approaches

CoLight diverges from traditional reinforcement learning methods by:

  • Collective Agent Decision-Making: Unlike classical models that treat agents independently, CoLight makes each agent's decisions subject to influence from neighboring intersections.
  • State Representation: While traditional reinforcement learning uses only local features, CoLight augments each agent's state with data from neighboring intersections.
  • Reward Mechanism: CoLight uses a comprehensive, network-aware reward signal rather than the simple individual rewards of classical methods.

Experimental Methodology

For our experiments, CityFlow serves as the chosen simulator; we are also considering the original SUMO simulator. The computational infrastructure is VULLWAR together with a high-performance computing setup. The experimental road networks we examined comprise grid sizes of 6x6 and 10x10, with 16x16 planned for future experiments. Relevant parameters include:

  • Queue length and waiting time as state variables
  • Learning rate established at 0.001
  • Discount factor at 0.95
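These settings can be gathered into a single configuration object; the key names are our own, and only values stated above appear here:

```python
# Experimental configuration as stated in the text (key names are illustrative).
CONFIG = {
    "simulator": "CityFlow",
    "grid_sizes": [(6, 6), (10, 10)],  # 16x16 planned for future runs
    "state_variables": ["queue_length", "waiting_time"],
    "learning_rate": 1e-3,
    "discount_factor": 0.95,
}
```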

Evaluation Metrics

Key evaluation metrics entail:

  1. Average travel time
  2. Wait time
  3. Traffic throughput
  4. Training duration, ranging from 1 to 48 hours depending on network size

Provisioned hardware consists of NVIDIA GPUs and multi-core CPUs with RAM greater than 16 GB.
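Average travel time, the first metric, is typically computed from per-vehicle enter/exit timestamps in the simulator logs; the record fields below are assumed, not CityFlow's actual schema:

```python
# Hypothetical per-vehicle records of the kind the simulator logs.
records = [
    {"vehicle": "v1", "enter": 0.0,  "exit": 85.0},
    {"vehicle": "v2", "enter": 12.0, "exit": 140.0},
    {"vehicle": "v3", "enter": 30.0, "exit": 96.0},
]

travel_times = [r["exit"] - r["enter"] for r in records]
avg_travel_time = sum(travel_times) / len(travel_times)   # 93.0 seconds
```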

Performance Comparison

In comparison to existing solutions, particularly PressLight, CoLight demonstrates superior performance by integrating GAT with multi-agent reinforcement learning, making it especially viable for large city traffic networks.

Architecture Overview

The overall architecture involves:

  1. Traffic flow files processed by the CityFlow simulator.
  2. Data aggregation through multiple processes to enhance learning.
  3. Application of GAT for Q-learning updates and resultant generation of learning curves.
  4. The environment includes transition files, vehicle arrival and exit logs, signal intersection phase changes, and road network configurations.

Experimental Goals

We set out to replicate the early training behavior reported in the original CoLight paper. Our interim focus was on:

  1. Ensuring end-to-end operational capabilities of the training pipeline.
  2. Verifying the agent's capacity to produce significant learning signals.
  3. Matching preliminary trends reported in original research.

In our testing on a 6x6 grid and subsequently a 10x10 grid, we obtained meaningful results. The agent demonstrated learning efficiency and stability across varying conditions. Specifically, the queue-based reward improved consistently from approximately -7.8 to -6.1.

Observed Output Files

The outputs observed included log files that confirm the proper functioning of the traffic environment:

  • Pickle files recording reinforcement learning transitions.
  • CSV files tracking vehicle arrivals and departures, essential for computing queue lengths.
  • Record of intersection phase changes validating the agent's control over traffic signals.

Upon validating these outputs across increasing grid complexity, we confirmed that our reproduction pipeline operates as anticipated.

Preliminary Results Analysis

Preliminary results indicated stable convergence patterns and consistent improvement metrics for both 6x6 and 10x10 grids:

  1. Output from the 10x10 grid was checked down to individual vehicle counts, confirming both the learning curves and the correctness of the logs.
  2. Learning improvements were substantial given the limited number of vehicles, suggesting that fewer vehicles per intersection correlates with better performance metrics.

Strengths and Weaknesses Identified

Strengths:

  • Stable Convergence: Smooth improvement trends aligned with initial performance claims from the original paper.
  • Modularity of System: Facilitated distinct work on environment, training, logging, and data generation.

Weaknesses:

  • Sensitivity to implementation details, including reward normalization, feature aggregation, and logging, made the framework difficult to reproduce faithfully.

Future Directions

Potential improvements are anticipated prior to the final demonstration:

  1. Enhancing Reward Design: Future exploration around normalization parameters to enhance stability during training iterations.
  2. Exploring Larger Grids: Attempting the scaling of the grid size to 16x16 for broader city networks or implementing real-world traffic data scenarios.
  3. Incorporating Explicit Communication: Augmenting the MARL framework by integrating explicit communication methodologies to better respond to dynamic traffic situations.

Conclusion

The team has successfully navigated the project phases, achieving a fully functional implementation of the CoLight framework on 10x10 grids. We established robust logging, correct sample generation, and clear evidence of consistent learning behavior in line with previous research claims. Despite the challenges identified, enhancements to our modular pipeline establish a solid foundation for the forthcoming stages of this project, focused mainly on scalability and further experimentation.

Thank you.