Playing for Data: Ground Truth from Computer Games
Abstract
- The paper addresses challenges in creating large datasets with pixel-level labels for training computer vision models.
- It proposes a method for generating pixel-accurate semantic label maps using images from modern video games, such as Grand Theft Auto V.
- The technique intercepts the communication between the game and the graphics hardware, which allows semantic labels to be propagated rapidly within and across images.
- Experiments show that training with data produced this way can significantly improve model accuracy while reducing the need for hand-labeled real-world data.
Introduction
- High-capacity models in computer vision require extensive training datasets.
- Image classification datasets exist with millions of labeled images, but semantic segmentation datasets are often much smaller due to labor-intensive labeling.
- High-quality annotations take considerable human effort, leading to a “curse of dataset annotation.”
- The paper explores the potential of commercial games for creating large-scale, pixel-accurate ground truth data for semantic segmentation tasks.
Methodology
Data Acquisition
- Modern games like Grand Theft Auto V offer extensive, realistic environments that are well suited to generating training data.
- Internal operation of commercial games is largely inaccessible, hindering detailed semantic annotation.
- The authors use a technique known as detouring to monitor and manipulate rendering commands between the game and graphics hardware.
- By hashing rendering resources, the approach obtains persistent signatures that identify the same objects across different images and gameplay sessions.
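The idea of content-based signatures can be illustrated with a minimal sketch. It assumes each resource is available as raw bytes; the function names, the example byte strings, and the choice of `blake2b` are assumptions made for illustration, not details taken from the paper.

```python
import hashlib


def resource_signature(data):
    """Return a content-based ID for one rendering resource (mesh, texture, or shader).

    blake2b is a stand-in hash chosen for this sketch; the source does not
    say which hash function the authors used.
    """
    return hashlib.blake2b(data, digest_size=16).hexdigest()


def mts_signature(mesh, texture, shader):
    """Combine the three per-resource IDs into one (mesh, texture, shader) tuple."""
    return (
        resource_signature(mesh),
        resource_signature(texture),
        resource_signature(shader),
    )


# Hypothetical resource contents. Identical bytes give identical signatures,
# no matter which frame or gameplay session they were captured in.
session_a = mts_signature(b"car-mesh", b"red-paint", b"metal-shader")
session_b = mts_signature(b"car-mesh", b"red-paint", b"metal-shader")
other_car = mts_signature(b"car-mesh", b"blue-paint", b"metal-shader")
```

Because the ID depends only on the resource content, a label attached to a signature in one session can be reused whenever the same content is seen again.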
Semantic Labeling Process
- Pixel annotations are created from combinations of mesh, texture, and shader (MTS) identifiers.
- Each frame is completely rendered twice:
  - First pass: the standard color image is generated.
  - Second pass: resource IDs are encoded for pixel annotations.
- Each image is automatically decomposed into patches of pixels that share the same MTS combination.
- An association rule mining approach helps in identifying and labeling semantic classes for patches, enhancing annotation speed and accuracy.
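The labeling steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the second pass yields a 2D grid of MTS tuples, and it reduces association rule mining to single-resource rules with a confidence and support threshold. All names, thresholds, and example IDs are hypothetical.

```python
from collections import Counter, defaultdict


def propagate_labels(id_map, mts_to_class, unlabeled="unlabeled"):
    """Turn a 2D grid of MTS tuples into a 2D grid of class names."""
    return [[mts_to_class.get(mts, unlabeled) for mts in row] for row in id_map]


def mine_rules(labeled, min_confidence=0.9, min_support=2):
    """Derive rules of the form (slot, resource_id) -> class.

    labeled maps (mesh, texture, shader) tuples to class names. A rule is kept
    when the resource appears in at least min_support labeled combinations and
    its most frequent class reaches min_confidence.
    """
    counts = defaultdict(Counter)
    for mts, cls in labeled.items():
        for slot, resource in enumerate(mts):
            counts[(slot, resource)][cls] += 1
    rules = {}
    for key, class_counts in counts.items():
        cls, hits = class_counts.most_common(1)[0]
        total = sum(class_counts.values())
        if total >= min_support and hits / total >= min_confidence:
            rules[key] = cls
    return rules


def suggest_label(mts, rules):
    """Suggest a class for an unlabeled MTS tuple, or None if rules are absent or disagree."""
    candidates = {rules[(slot, res)] for slot, res in enumerate(mts) if (slot, res) in rules}
    return candidates.pop() if len(candidates) == 1 else None


# Hypothetical annotations made by a human on earlier frames.
labeled = {
    ("m_car1", "t_paint", "s_metal"): "car",
    ("m_car2", "t_paint", "s_metal"): "car",
    ("m_road1", "t_asphalt", "s_diffuse"): "road",
    ("m_road2", "t_asphalt", "s_diffuse"): "road",
}
new_car = ("m_car3", "t_paint", "s_metal")  # never labeled before

# ID map of a new 2x2 frame, as the second rendering pass might provide it.
frame = [
    [("m_car1", "t_paint", "s_metal"), ("m_car1", "t_paint", "s_metal")],
    [("m_road1", "t_asphalt", "s_diffuse"), new_car],
]

label_map = propagate_labels(frame, labeled)
rules = mine_rules(labeled)
suggestion = suggest_label(new_car, rules)
```

In this toy example three of the four pixels are labeled purely by lookup, and the remaining patch receives the suggestion "car" because its texture and shader both co-occur only with that class. An annotator would still confirm or reject such suggestions.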
Results
Dataset Statistics
- The method successfully labeled 25,000 images in 49 hours, achieving an annotation density of 98.3% with an average labeling time of 7 seconds per image.
- Compared with datasets such as CamVid and Cityscapes, the approach substantially increases the scale of labeling while drastically reducing annotation time per image.
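The reported figures can be checked with a short calculation. The sketch below uses the rounded numbers from the summary (25,000 images, 49 hours); the function names and the toy label map are assumptions for illustration.

```python
def seconds_per_image(total_hours, num_images):
    """Average labeling time per image, in seconds."""
    return total_hours * 3600 / num_images


def annotation_density(label_map, unlabeled=None):
    """Fraction of pixels in a 2D label map that carry a class label."""
    total = sum(len(row) for row in label_map)
    labeled = sum(1 for row in label_map for value in row if value != unlabeled)
    return labeled / total


# 49 hours spread over 25,000 images is about 7 seconds each.
rate = seconds_per_image(49, 25_000)  # 7.056

# Toy 2x2 label map with one unlabeled pixel: density is 0.75.
density = annotation_density([["road", None], ["car", "car"]])
```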
Evaluation of Semantic Segmentation Models
- Using labeled data from games enhances the accuracy of models trained on real-world datasets (CamVid, KITTI).
- Mean Intersection over Union (IoU) improved when real training data was supplemented with synthetic game data.
- Models trained on as little as one-third of the CamVid training set, complemented with synthetic game data, outperformed models trained on the full CamVid training set alone.
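For reference, mean IoU averages the per-class ratio of intersection to union between predicted and ground-truth pixels. The following is a minimal sketch over flattened label lists with integer class IDs; the function name, the `ignore` parameter, and the example data are assumptions, and the paper's exact evaluation protocol may differ.

```python
def mean_iou(pred, truth, num_classes, ignore=None):
    """Mean Intersection over Union over classes that occur in pred or truth.

    pred and truth are equally long sequences of integer class IDs.
    Pixels whose ground-truth value equals `ignore` are skipped.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, truth):
        if t == ignore:
            continue
        if p == t:
            intersection[t] += 1  # true positive for class t
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
# Per-class IoU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2, so the mean is 0.5.
score = mean_iou(pred, truth, num_classes=3)
```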
Discussion
- The paper's findings suggest modern video games can be a rich source for generating diverse and extensive training datasets for computer vision tasks.
- Future work could extend this approach to real-time video streams and to other dense prediction problems (such as depth estimation).
- The main contribution is showing that training data can be generated effectively without access to the game's source code or assets, using real-time image synthesis to train semantic segmentation systems.