Playing for Data: Ground Truth from Computer Games

Abstract

  • The paper addresses challenges in creating large datasets with pixel-level labels for training computer vision models.
  • It proposes a method for generating pixel-accurate semantic label maps using images from modern video games, such as Grand Theft Auto V.
  • The technique leverages the communication between video games and graphics hardware, allowing for rapid propagation of semantic labels across images.
  • Experiments show that supplementing real training data with data produced by this method significantly improves model accuracy and reduces the amount of hand-labeled data needed.

Introduction

  • High-capacity models in computer vision require extensive training datasets.
  • Image classification datasets contain millions of labeled images, but semantic segmentation datasets are much smaller because pixel-level labeling is labor-intensive.
  • High-quality annotations take considerable human effort, leading to a “curse of dataset annotation.”
  • The paper explores the potential of commercial games for creating large-scale, pixel-accurate ground truth data for semantic segmentation tasks.

Methodology

Data Acquisition
  • Modern games like Grand Theft Auto V offer extensive, realistic environments, which makes them an attractive source of training data.
  • Internal operation of commercial games is largely inaccessible, hindering detailed semantic annotation.
  • The authors use a technique known as detouring to monitor and manipulate rendering commands between the game and graphics hardware.
  • By hashing rendering resources, the approach allows persistent signatures for objects across different images and gameplay sessions.
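The hashing idea can be illustrated with a minimal sketch. It assumes each captured resource is available as raw bytes and uses BLAKE2b from Python's standard library as a stand-in hash; the function names and the sample byte strings are hypothetical and are not taken from the paper.

```python
import hashlib


def resource_signature(resource_bytes: bytes) -> str:
    """Return a 128-bit content hash (hex) for one rendering resource."""
    # Assumption: BLAKE2b is only a stand-in; any stable content hash works.
    return hashlib.blake2b(resource_bytes, digest_size=16).hexdigest()


def mts_signature(mesh: bytes, texture: bytes, shader: bytes) -> tuple:
    """Combine the three resource hashes into one (mesh, texture, shader) key."""
    return (
        resource_signature(mesh),
        resource_signature(texture),
        resource_signature(shader),
    )


# Hypothetical resource contents captured in two separate gameplay sessions.
session_a = mts_signature(b"car_mesh_vertices", b"car_paint_pixels", b"car_shader_bytecode")
session_b = mts_signature(b"car_mesh_vertices", b"car_paint_pixels", b"car_shader_bytecode")
other = mts_signature(b"tree_mesh_vertices", b"bark_pixels", b"foliage_shader_bytecode")

print(session_a == session_b)  # identical content gives an identical signature
print(session_a == other)      # different content gives a different signature
```

Because the signature depends only on resource content, it stays the same regardless of where the resource sits in memory during a given session.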
Semantic Labeling Process
  • Pixel annotations are created by utilizing a combination of mesh, texture, and shader (MTS) identifiers.
  • Each frame is completely rendered twice:
    • First Pass: Standard color image is generated.
    • Second Pass: Encodes resource IDs for pixel annotations.
  • Pixels that share the same MTS combination are grouped into patches, which decomposes each image automatically.
  • An association rule mining approach helps in identifying and labeling semantic classes for patches, enhancing annotation speed and accuracy.
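The labeling steps above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the per-pixel ID map is assumed to be a 2D list of (mesh, texture, shader) tuples, all function names are hypothetical, and the rule miner is a deliberately simplified texture-to-class rule with support and confidence thresholds.

```python
from collections import defaultdict


def decompose_into_patches(id_map):
    """Group pixel coordinates (x, y) by their MTS tuple."""
    patches = defaultdict(list)
    for y, row in enumerate(id_map):
        for x, mts in enumerate(row):
            patches[mts].append((x, y))
    return dict(patches)


def propagate_labels(id_map, mts_to_class, unlabeled="unlabeled"):
    """Label every pixel whose MTS tuple has already been annotated."""
    return [[mts_to_class.get(mts, unlabeled) for mts in row] for row in id_map]


def mine_texture_rules(labeled_mts, min_support=2, min_confidence=0.9):
    """Toy rule mining: keep 'texture -> class' rules that are frequent and reliable."""
    counts = defaultdict(lambda: defaultdict(int))
    for (_mesh, texture, _shader), cls in labeled_mts.items():
        counts[texture][cls] += 1
    rules = {}
    for texture, per_class in counts.items():
        cls, hits = max(per_class.items(), key=lambda kv: kv[1])
        total = sum(per_class.values())
        if hits >= min_support and hits / total >= min_confidence:
            rules[texture] = cls
    return rules


# Hypothetical MTS identifiers and two tiny 3x2 frames.
MTS_A, MTS_B, MTS_C = (1, 10, 100), (2, 20, 200), (3, 30, 300)
frame_1 = [[MTS_C, MTS_C, MTS_C], [MTS_A, MTS_B, MTS_A]]
frame_2 = [[MTS_C, MTS_B, MTS_B], [MTS_A, MTS_A, MTS_A]]

patches = decompose_into_patches(frame_1)
annotated = {MTS_A: "road", MTS_B: "car"}  # labels given once, in frame 1
labels_2 = propagate_labels(frame_2, annotated)  # reused in frame 2
rules = mine_texture_rules({MTS_A: "road", (4, 10, 101): "road", MTS_B: "car"})
print(labels_2)
print(rules)
```

In this sketch a label assigned to one patch in the first frame carries over to every pixel with the same MTS tuple in the second frame, which is where the speed-up in annotation comes from.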

Results

Dataset Statistics
  • The method successfully labeled 25,000 images in 49 hours, achieving an annotation density of 98.3% with an average labeling time of 7 seconds per image.
  • By comparison, CamVid and Cityscapes report roughly 60 and more than 90 minutes of annotation per image, respectively, so this approach is close to three orders of magnitude faster per image.
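The per-image figure follows directly from the reported totals, as this short calculation shows. It uses the rounded image count given above, and the variable names are illustrative.

```python
total_hours = 49
num_images = 25_000

# Convert the total annotation time to seconds and divide by the image count.
seconds_per_image = total_hours * 3600 / num_images
images_per_hour = num_images / total_hours

print(f"{seconds_per_image:.2f} s per image")
print(f"{images_per_hour:.0f} images per hour")
```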
Evaluation of Semantic Segmentation Models
  • Using labeled data from games enhances the accuracy of models trained on real-world datasets (CamVid, KITTI).
    • Mean Intersection over Union (IoU) improved when real training data was supplemented with synthetic game data.
  • Results indicate that training with as little as one-third of the CamVid dataset, when complemented with synthetic game data, outperformed models trained on the full dataset alone.
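For reference, mean IoU can be computed from a confusion matrix as in the sketch below. The per-class formula IoU = TP / (TP + FP + FN) is the standard definition; the flattened label lists are made-up toy data, not results from the paper.

```python
def confusion_matrix(truth, pred, num_classes):
    """matrix[t][p] counts pixels with true class t that were predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def mean_iou(matrix):
    """Average IoU = TP / (TP + FP + FN) over classes that occur at all."""
    n = len(matrix)
    ious = []
    for c in range(n):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                       # true c, predicted other
        fp = sum(matrix[r][c] for r in range(n)) - tp  # predicted c, true other
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes (flattened label maps).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(confusion_matrix(truth, pred, 3))
print(score)  # per-class IoUs are 1/3, 2/3 and 1/2
```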

Discussion

  • The paper's findings suggest modern video games can be a rich source for generating diverse and extensive training datasets for computer vision tasks.
  • Future work could extend this approach to real-time video streams and to other dense prediction problems, such as depth estimation.
  • The paper's main contribution is showing that ground-truth data for training semantic segmentation systems can be generated from real-time image synthesis in a commercial game, without access to its source code or assets.