AI Data Work Typology Notes

A Typology of Artificial Intelligence Data Work

Introduces a typology for understanding human labor in AI systems, termed 'AI data work,' focusing on data preparation and model evaluation.

  • AI data work is essential for AI production.

  • Draws on fieldwork with an AI data BPO and microwork platforms.

  • Clarifies terms like 'ghost work,' 'microwork,' 'crowdwork,' and 'cloudwork'.

  • Differentiates how labor is organized in various contexts.

  • Divides AI data work institutions and the AI data pipeline conceptually.

  • Highlights how workers' practices become valuable in global AI production networks.

Keywords

Artificial intelligence, microwork, crowdwork, data work, ghost work, business process outsourcing.

Introduction

AI systems are gaining traction, requiring human labor for data processing.

  • These systems use machine learning algorithms needing large datasets, energy, and computational power (Bender et al., 2021; Crawford, 2021).

  • Human workers collect, curate, annotate, and evaluate data.

  • AI companies outsource tasks due to resource constraints and the need for expertise.

  • Third-party data services offer flexible labor.

  • Focus on AI data work to highlight the role of human labor in AI, emphasizing employment relations and working conditions (Crawford, 2021; Dauvergne, 2022; Howcroft & Bergvall-Kåreborn, 2019).

AI Data Work Expansion

AI companies' data needs are growing.

  • Estimated 80% of AI project hours involve data collection, organization, and annotation (Cognilytica Research, 2019).

  • The global data collection and labeling market is expected to reach $17.10 billion by 2030 (Grand View Research, 2022).

  • Terminology such as microwork, crowdwork, and ghost work has proliferated (Berg et al., 2018; De Stefano, 2016; Gray & Suri, 2019; Irani, 2015b).

  • In practice, there is a clear distinction between crowdsourced workers and BPO employees.

  • Distinctions are not always clear in the literature (Crawford, 2021; Miceli & Posada, 2022; Tubaro & Casilli, 2019).

This article clarifies these confusions by asking two questions: what are the different types of institutions that undertake data work, and what set of interconnected processes is required to transform data into integrated datasets capable of being used in machine learning models?

New Typology of AI Data Work

Constructs a new typology of AI data work.

  • Provides a table of AI data work institutions to distinguish between six different ideal types.

    • Helps determine the nature of AI data work institutions and disentangle inconsistencies.

  • Provides a conceptual schema for understanding data work in the AI production process as a sequential system of steps (AI data pipeline).

    • Adds to existing frameworks that have concentrated more on the functional roles of data workers in the AI production process (Tubaro et al., 2020).

    • Adopts an organisational perspective to distinguish between how labour is performed under different employment structures and in different contexts as part of the production of AI.

Research

Examines the existing literature and draws on a research project focused on AI training, primarily for computer vision algorithms and applications.

  • Fieldwork in 2023 with a BPO specializing in AI data work in Kenya and Uganda (Muldoon et al., 2023).

  • Also draws on fieldwork conducted since 2010.

    • Focused on the East African BPO sector (2010-2014) (Graham, 2015).

    • Research on remote work platforms in multiple countries (2014-2020) (Anwar & Graham, 2022; Graham et al., 2017).

    • Global research project (Fairwork) with surveys in 84 countries (2020-present) (Graham et al., 2020).

AI Data Pipeline

Analysis of the AI data pipeline draws on fieldwork at an AI data BPO to provide insight into work practices.

  • Fieldwork at three delivery centres in Nairobi, Kenya and Gulu, Uganda in April and May 2023.

    • Included workplace observations, presentations, and interviews (N = 46).

  • Gathers understanding of labour processes and management techniques.

  • Constructs a general model of the AI data preparation and evaluation process.

  • Draws on computer vision, with similar formulations across the industry.

AI Data Pipeline Definition

Define an AI data pipeline as the set of data processing activities necessary to integrate datasets into the training and testing of machine learning models.

  • Develop our own typology of the various stages of the AI data pipeline by drawing on industry sources and synthesising them into a conceptual framework that can help make sense of the role different AI data work institutions play in the overall process.

Microwork, Crowdwork and AI Data Work

AI companies need high-quality, low-cost data as algorithms become more sophisticated (Bender et al., 2021; Crawford, 2021; Dauvergne, 2022).

  • Requires human labor for categorizing, annotating, and evaluating data (Miceli et al., 2022).

  • AI data work is defined as human labor for preparing and evaluating datasets and model outputs, often outsourced.

    • Excludes software developers, machine learning engineers, and content moderators.

  • Assignments include categorizing, assembling, annotating, and correcting data (Miceli & Posada, 2022; Tubaro et al., 2020).

  • Workers’ actions impact AI systems through embedded social values and biases (Bender et al., 2021; Posada, 2021).

AI Data Work Term

The term ‘AI data work’ offers a precise definition that existing terms do not capture.

  • Follows Miceli and Posada (2022) in employing the term ‘data work’ as applying to the activity of curating, annotating and verifying data sets, but we seek to offer a more precise analysis of how this particular form of work applies to the production process of AI systems.

  • Contribute to these studies to show how AI systems have their own hidden labour that enables the more visible work of machine learning engineers training AI models.

  • Specifically examine the AI production process.

  • Overlaps with other studies of data work in different industries.

  • Builds on studies of data management outside of the AI industry (Parmiggiani et al., 2022; Pine & Bossen, 2020).

  • AI data work is a preferable term to other concepts because it offers a more precise formulation of the work involved and reduces ambiguities.

Preference of AI Data Work Over Alternative Terms

AI data work is more precise than microwork because the latter includes non-AI-related activities.

  • Microwork can include surveys, translation, and feedback (Berg et al., 2018; Irani, 2015b).

  • Microwork consists of small tasks performed on on-demand labor platforms (Irani, 2015b).

  • Microworkers are paid little and are independent contractors (Berg et al., 2018).

  • Ghost work is broader than AI data work and microwork (Gray & Suri, 2019).

  • AI data work is a much more limited and defined set of activities related to the preparation and verification of datasets in the production of AI systems.

  • Crowdwork is heterogeneous, including microwork, freelancing, and creative work (De Stefano, 2016; Howcroft & Bergvall-Kåreborn, 2019).

  • Cloudwork is both too broad and too narrow, as not all AI data work is remote (Lubke et al., 2023).

BPO Centers

AI data work in BPO centers is often left out of discussions.

  • Miceli and Posada (2022) identified crowdsourcing platforms and specialized BPO companies.

  • BPOs involve contracting a third-party provider for data-related tasks (Miceli and Posada, 2022).

  • AI data workers at BPOs can be short-term or long-term employees.

  • BPOs are more specialized and expensive than digital platforms.

Existing Studies on AI Data Work

Existing studies focus on digital labor crowdsourcing platforms (Miceli & Posada, 2022; Tubaro & Casilli, 2019; Tubaro et al., 2020).

  • This article adds to research on digital platforms by foregrounding the important role played by BPOs in the AI data industry.

  • Tubaro et al. (2020) examined microwork in AI preparation, verification, and impersonation.

  • AI data workers’ labor is embedded in planetary labor markets (Posada, 2021).

    • Much of this work is in the Global South (Jones, 2021b; Muldoon et al., 2023; Posada, 2021).

    • This article examines different data work institutions.

AI Data Work Institutions

An AI data work institution arranges for AI data work by employees or independent contractors.

  • Follows W. Richard Scott (2013: 56) in defining institutions broadly as consisting of the ‘regulative, normative, and cultural-cognitive elements that, together with associated activities and resources, provide stability and meaning to social life’.

  • Emphasize the formal/regulatory aspects controlling behavior in competitive markets.

  • Douglass North (1991: 97) emphasizes that institutions ‘define the choice set and therefore determine transaction and production costs and hence the profitability and feasibility of engaging in economic activity’.

Empirical Cases

Table 1 demonstrates how a wide variety of empirical cases could fit within certain ideal types that share key attributes (Weber, 1949).

  • Synthesizes literature into a 3-by-2 table, generating six canonical institutions.

  • Categorized institutions by employment relationships, work type, and outsourcing.

    • First criterion: crowdsourced vs. tethered workers.

    • Second criterion: AI data work vs. other functions.

    • Third criterion: outsourced vs. internal.

Type A: Generalist Platform

Generalist microwork platforms (e.g., Amazon Mechanical Turk) operate online marketplaces (Berg et al., 2018; Bergvall-Kåreborn & Howcroft, 2014; Casler et al., 2013).

  • Digital verification replaces human management (Irani, 2015a).

  • Concerns exist about precarious conditions and lack of protections (Aloisi & De Stefano, 2022; Chen et al., 2019).

  • Advantage: Access to a large, cheap workforce for categorizing and annotating data.

  • Disadvantages: lack of training, quality control issues, security concerns, and limited end-to-end services.

Type B: Generalist BPO

AI companies outsource AI data work to BPO companies.

  • Generalist BPOs offer a wide range of services.

  • IT-based BPO work grew due to reduced communication costs (Lacity et al., 2011).

  • BPOs offer services in finance, logistics, HR, healthcare, retail and banking (Mehta et al., 2006).

  • BPO work can be outsourced domestically or internationally.

Advantages of BPO Over Crowdsourced Platforms

  • Closer worker supervision for higher quality outputs.

  • Specialized end-to-end services for better data management.

  • Continuous feedback for process improvement.

  • Opportunity to engage multiple BPOs for different tasks.

Type C: AI Data Platform

Workers on AI data platforms are independent contractors.

  • AI companies post tasks to be completed individually.

  • Specialized services support AI systems (Schmidt, 2019).

  • Require annotators to complete training courses before they can access more skilled tasks, such as those involving Light Detection and Ranging (hereafter LiDAR), a method for determining ranges through laser imaging. LiDAR is a form of 3D laser scanning used for 3D moving objects.

  • Some specialize in specific industries or AI services (Tubaro & Casilli, 2019).

Type D: AI Data BPOs

Specialized BPOs cater to complex AI data work (e.g., Cloudfactory, Sama).

  • Train workforces in AI data work, such as image, video and 3D moving object annotation.

  • Specialize in industries, machine learning, and AI data tasks.

  • Develop their own platform and services for managing AI data.
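
As a rough illustration only, Types A to D can be laid out against the criteria listed under Empirical Cases (crowdsourced vs. tethered workers, AI data work vs. other functions, outsourced vs. internal). The Python sketch below is not part of the article: the class names, field names, and the `classify` helper are hypothetical, and treating all four types as outsourced is an assumption drawn from the descriptions above. The internal types are left out because their attributes are not given in these notes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Workforce(Enum):
    CROWDSOURCED = "crowdsourced"  # independent contractors reached via a platform
    TETHERED = "tethered"          # workers employed by a BPO


class Focus(Enum):
    GENERALIST = "generalist"      # AI data work offered alongside other functions
    AI_DATA = "AI data"            # specialised in AI data work


@dataclass(frozen=True)
class Institution:
    label: str
    name: str
    workforce: Workforce
    focus: Focus
    outsourced: bool  # assumption: Types A-D are all external to the AI company


# Hypothetical encoding of the four ideal types described in these notes.
TYPES = [
    Institution("A", "Generalist platform", Workforce.CROWDSOURCED, Focus.GENERALIST, True),
    Institution("B", "Generalist BPO", Workforce.TETHERED, Focus.GENERALIST, True),
    Institution("C", "AI data platform", Workforce.CROWDSOURCED, Focus.AI_DATA, True),
    Institution("D", "AI data BPO", Workforce.TETHERED, Focus.AI_DATA, True),
]


def classify(workforce: Workforce, focus: Focus, outsourced: bool = True) -> Optional[Institution]:
    """Return the matching ideal type, or None if the combination is not covered here."""
    for inst in TYPES:
        if (inst.workforce, inst.focus, inst.outsourced) == (workforce, focus, outsourced):
            return inst
    return None
```

For example, `classify(Workforce.TETHERED, Focus.AI_DATA)` returns the Type D entry, while any internal combination returns `None` because those types are not encoded in this sketch.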

Type E: Internal Data Platform

Several of the large tech companies use their own internal data work