Playing for Data: Ground Truth from Computer Games

The paper addresses challenges in creating large datasets with pixel-level labels for training computer vision models.
It proposes a method for generating pixel-accurate semantic label maps using images from modern video games, such as Grand Theft Auto V.
The technique reconstructs associations between image patches from the communication between the game and the graphics hardware, which allows semantic labels to be propagated rapidly within and across images.
Experiments show that supplementing real-world images with the generated data significantly increases model accuracy and reduces the amount of hand-labeled data needed.

High-capacity models in computer vision require extensive training datasets.
Image classification datasets exist with millions of labeled images, but semantic segmentation datasets are often much smaller due to labor-intensive labeling.
High-quality annotations take considerable human effort, leading to a “curse of dataset annotation”: the more detailed the labeling, the smaller the dataset.
The paper explores the potential of commercial games for creating large-scale, pixel-accurate ground truth data for semantic segmentation tasks.

Modern games like Grand Theft Auto V offer extensive, realistic environments that are well suited to generating training data.
The internal operation of commercial games is largely inaccessible, which hinders detailed semantic annotation.
The authors use a technique known as detouring to monitor and manipulate rendering commands between the game and graphics hardware.
By hashing rendering resources, the approach derives signatures for objects that persist across different images and gameplay sessions.
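
The idea of a persistent signature can be illustrated with a minimal sketch. It assumes that a resource is available as raw bytes and that hashing those bytes, rather than using the runtime handle, is what keeps the ID stable. The function names, the sample byte strings, and the choice of BLAKE2b from Python's standard library are illustrative assumptions, not the hash function or data layout used by the authors.

```python
import hashlib
from typing import Tuple


def resource_signature(data: bytes) -> str:
    """Hash the resource content so the ID does not depend on runtime handles."""
    # Assumption: BLAKE2b with an 8-byte digest stands in for whatever
    # hash function the original system uses.
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def mts_signature(mesh: bytes, texture: bytes, shader: bytes) -> Tuple[str, str, str]:
    """Combine the three resource hashes into one (mesh, texture, shader) ID."""
    return (
        resource_signature(mesh),
        resource_signature(texture),
        resource_signature(shader),
    )


# Hypothetical data: the same resources loaded in two gameplay sessions.
# The runtime handles differ, but the content (and hence the signature) does not.
session_a = {"handle": 0x1A2B, "mesh": b"car-mesh", "texture": b"red-paint", "shader": b"metal"}
session_b = {"handle": 0x9F00, "mesh": b"car-mesh", "texture": b"red-paint", "shader": b"metal"}

sig_a = mts_signature(session_a["mesh"], session_a["texture"], session_a["shader"])
sig_b = mts_signature(session_b["mesh"], session_b["texture"], session_b["shader"])
same_object = sig_a == sig_b
```

Because the signature depends only on content, a label attached to it in one session can be looked up again in any later session.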

Pixel annotations are derived from combinations of mesh, texture, and shader (MTS) identifiers.
Each frame is completely rendered twice:
- First pass: the standard color image is generated.
- Second pass: resource IDs are encoded per pixel for annotation.
Pixels that share the same MTS combination form a patch, so each image decomposes automatically into patches.
Association rule mining helps identify and label the semantic classes of patches, which improves annotation speed and accuracy.
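
The decomposition and labeling steps can be sketched as follows. This is a simplified illustration, not the authors' annotation tool: it assumes the second rendering pass yields a 2D grid of MTS tuples, and it mines rules only of the form "full MTS tuple implies label" with hypothetical support and confidence thresholds. All names and sample data are invented for the example.

```python
from collections import defaultdict


def decompose(id_map):
    """Group pixel coordinates into patches keyed by their MTS tuple."""
    patches = defaultdict(list)
    for y, row in enumerate(id_map):
        for x, mts in enumerate(row):
            patches[mts].append((y, x))
    return dict(patches)


def mine_rules(labeled_patches, min_support=2, min_confidence=0.95):
    """Derive 'MTS -> label' rules from patches that were already annotated."""
    counts = defaultdict(lambda: defaultdict(int))
    for mts, label in labeled_patches:
        counts[mts][label] += 1
    rules = {}
    for mts, by_label in counts.items():
        total = sum(by_label.values())
        label, hits = max(by_label.items(), key=lambda kv: kv[1])
        # Keep a rule only if it was seen often enough and is nearly unambiguous.
        if hits >= min_support and hits / total >= min_confidence:
            rules[mts] = label
    return rules


def apply_rules(id_map, rules, unknown="unlabeled"):
    """Label every pixel whose MTS tuple is covered by a mined rule."""
    return [[rules.get(mts, unknown) for mts in row] for row in id_map]


# Hypothetical MTS tuples and a tiny 2x3 frame.
ROAD = ("m1", "t1", "s1")
CAR = ("m2", "t2", "s1")
SIGN = ("m3", "t3", "s2")
frame = [[ROAD, ROAD, CAR],
         [ROAD, CAR, CAR]]

# Annotations collected from earlier frames.
history = [(ROAD, "road"), (ROAD, "road"), (CAR, "car"), (CAR, "car"), (SIGN, "sign")]

patches = decompose(frame)
rules = mine_rules(history)
labels = apply_rules(frame, rules)
```

In this sketch the SIGN tuple was annotated only once, so it falls below the support threshold and would still need a manual label.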

The method successfully labeled 25,000 images in 49 hours, achieving an annotation density of 98.3% with an average labeling time of 7 seconds per image.
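
The reported figures are mutually consistent, as a quick check shows. The snippet only multiplies the numbers stated above; the variable names are chosen for illustration.

```python
images = 25_000          # number of labeled images
seconds_per_image = 7    # average labeling time per image

total_hours = images * seconds_per_image / 3600
print(f"{total_hours:.1f} hours")  # close to the reported 49 hours
```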
Compared with manually annotated datasets such as CamVid and Cityscapes, the approach labels far more images and needs far less annotation time per image.

Supplementing real-world training images with labeled game data improves accuracy on real-world datasets (CamVid, KITTI), as measured by mean Intersection over Union (IoU).
Models trained with game data and just one-third of the CamVid training set outperformed models trained on the complete CamVid training set.
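
For reference, mean IoU can be computed as in the minimal sketch below. It assumes flat lists of integer class IDs for predictions and ground truth, and it averages only over classes that occur in either list. Benchmark protocols typically accumulate counts over the whole test set and may ignore certain labels, so this is an illustration of the metric rather than the evaluation code used in the paper.

```python
def mean_iou(pred, truth, num_classes):
    """Mean Intersection over Union across classes present in pred or truth."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            intersection[p] += 1
            union[p] += 1
        else:
            # A mismatched pixel belongs to the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical six-pixel example with three classes.
pred = [0, 0, 1, 1, 2, 2]
truth = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, truth, num_classes=3)
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2, so the mean is 0.5
```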

The paper's findings suggest modern video games can be a rich source for generating diverse and extensive training datasets for computer vision tasks.
Future work could extend this approach to real-time video streams and to other dense prediction problems, such as depth estimation.
The paper's main contribution is showing that ground truth for training semantic segmentation systems can be generated effectively from a game's real-time rendering, without access to its source code or assets.