Ethics and Bias

Ethics and Bias in NLP and AI

  • Ethics and NLP Intersections:
    • NLP interacts with ethics due to the use of natural language data produced by and about real people.
    • It is important to consider the rights of individuals regarding privacy and anonymity.
    • Understanding differences between the data population and the general population is important.
    • Acknowledging and understanding inherent biases in the data is crucial.
    • NLP applications have expanding real-world impact, necessitating awareness of potential benefits and abuses.
  • Algorithmic Bias: A Historical Case Study
    • In 1988, a British medical school was found guilty of discrimination by the UK Commission for Racial Equality.
    • A computer program used for applicant screening was biased against women and individuals with non-European names.
    • The algorithm was trained to replicate historical human admission decisions, matching them with 90-95% accuracy and thereby reproducing the existing biases.
    • The use of algorithms did not eliminate biased decision-making.
    • Relying solely on human decision-makers is also insufficient to resolve the bias problem.
    • The issue is amplified today due to the increasing prevalence of data-driven AI models.
    • Identifying and mitigating bias and unfairness in AI models is therefore essential.

Bias in the ML Pipeline

  • Data Creation/Collection:
    • Datasets are typically created using a sample of available data to make inferences about a larger population.
    • Sampling bias: Some instances are more likely to be selected than others, influenced by socio-cultural conditions.
      • This affects the representativeness of demographics and behavior, and uniform random sampling is difficult to achieve.
      • The result is poor generalization of trained AI models.
      • Context around social and historical conditions in data generation is important.
    • Negative set bias: Insufficient representative samples reinforce digital divides and data inequities.
    • Self-selection bias: Occurs when individuals volunteer to participate and those participants are systematically different from those who choose not to.
    • Label bias: Arises during the labeling or data annotation process.
      • Subjective biases and domain background of annotators influence labeling.
      • This bias tends to increase in high-volume longitudinal studies, where annotation criteria and the underlying phenomena drift over time.
    • Apprehension bias: User behavior, and its manifestation in the resulting dataset, is affected by the awareness of being observed.
      • This includes self-presentation dynamics where study participants alter their behavior because of the active observers.
  • Model and Data Analysis:
    • Involves the construction of algorithms, selection of features, assumptions on data distribution, data cleaning, pre-processing, and choice of ML models.
    • Confounding bias: External variables manipulate the apparent relationship between the independent and dependent variables.
      • These variables can be 'omitted' or act as a 'proxy' or indirect bias.
    • Chronological bias: Changes in study design over time and temporal variations caused by population or system drifts can introduce bias.
    • Algorithm bias: Includes ranking bias where personalization algorithms reinforce cognitive and social biases.
    • Insensitive measure bias: Use of insufficiently accurate methods to detect the outcome of interest.
      • Arises when an alternative objective function stands in for a true criterion that is not directly measurable (e.g., user clicks as a proxy for user satisfaction).
    • Homogeneity bias: Less diverse content from a narrower spectrum of sources on social media.
  • Data and Model Evaluation:
    • Human evaluation bias: Includes confirmation bias, peak-end effect bias, and recall bias.
    • Validation and test sets can systematically under- or over-estimate the predictive performance of the model.

Socially Useful NLP Applications

  • Assistive technology such as text-to-speech, voice search, and image description for blind users benefits people with disabilities.
  • Machine translation, summarization, and improved search engines provide unprecedented information access to the general public.
  • Identifying fake news, trolls, and toxic comments can prevent the spread of harmful information.
  • Social media monitoring can assist in disasters or identify health issues.
  • However, such monitoring can be abused for surveillance.

Potential Harms from NLP Systems

  • Unequal Utility:
    • Text-to-speech and voice-assistant systems may mishandle the pronunciation and mannerisms of non-native speakers, or of native speakers underrepresented in the training data, degrading service quality across social groups and linguistic varieties.
  • Allocation Harms:
    • NLP systems can lead to unfair effects.
      • Unfair positive allocation: Biases in granting loans or university placements.
      • Unfair negative allocation: Biased arrests based on social media posts.
    • Exclusionary norms in these systems reinforce the dominant social group, implicitly excluding other groups.
  • Allocational Harms in LLMs:
    • Disparate distribution of resources or opportunities between social groups.
      • Direct discrimination: Disparate treatment explicitly due to a social group (e.g. resume screening).
      • Indirect discrimination: Disparate treatment towards social groups due to proxies or other implicit factors (e.g. LLM-aided healthcare tools).
  • Stereotyping:
    • Systems may reflect societal biases in their output.
      • For example, when translating gender-neutral Turkish sentences into English, Google Translate assigns pronouns according to stereotypically gendered occupations (e.g., "o bir doktor" → "he is a doctor").
    • Label bias in toxicity classification datasets due to underspecified annotation guidelines and annotator positionality, reflecting different cultural and social norms.

Assessing AI Systems Adversarially

  • Ethics of the Research Question:
    • Consider the impact of technology and potential dual use: Who benefits? Who could be harmed? Could sharing data and models have major impacts on people’s lives?
  • Privacy:
    • Who owns the data? Is data published or merely publicized? Is there user consent, and what implicit assumptions do users have about data usage?
  • Bias in Data:
    • Look for artifacts, population-specific distributions, and representativeness.
  • Bias in Models:
    • How do you control for confounding variables and corner cases? Does the system optimize for the “right” objective? Does the system amplify bias?
  • Utility-Based Evaluation Beyond Accuracy:
    • Consider false positive (FP) and false negative (FN) rates, the cost of misclassification, and fault tolerance.

Evaluating Bias in LLMs

  • Large language models (LLMs) have revolutionized language technologies.
  • LLMs serve as foundation models that can be fine-tuned for specific functions, replacing task-specific models for small datasets.
  • Trained on vast, uncurated Internet data, LLMs may perpetuate harm by inheriting stereotypes, misrepresentations, and derogatory and exclusionary language.
  • Examples:
    • Text Generation: Different completions based on gendered prompts.
    • Machine Translation: Gender ambiguities lead to gendered translations.
    • Question-Answering: Relying on stereotypes to answer ambiguous questions related to ethnicity.
    • Classification: Toxicity detection models misclassify African-American English tweets as negative more often than Standard American English.
  • Protected or Sensitive Attributes:
    • Attributes for which the model should not be biased, like:
      • Race, Sex, Religion, National origin, Citizenship, Pregnancy, Disability status, Genetic information.
  • How to identify bias in NLP systems:
    • Association tests
    • Analyzing performance measures across groups.
    • Counterfactual evaluations

Techniques for Identifying Bias

  • Word Embedding Association Test (WEAT)
    • Embeddings learn from co-occurrence statistics (e.g., king - man + woman = queen).
    • Biases emerge when text encodes unsavory stereotypes (e.g., doctor - man + woman = nurse).
    • Considers two sets of target words (e.g., programmer, engineer, … and nurse, teacher, …) and two sets of attribute words (e.g., man, male, … and woman, female …).
    • Null Hypothesis: There is no difference between the two sets of target words in terms of similarity to the two sets of attribute words.
  • Word Embedding Association Test (WEAT) details:
    • X, Y are the two sets of target words (of equal size); A, B are the two sets of attribute words.
    • s(w, A, B) = mean_{a ∈ A} cos(w, a) − mean_{b ∈ B} cos(w, b): association of w with the attributes.
    • S(X, Y, A, B) = Σ_{x ∈ X} s(x, A, B) − Σ_{y ∈ Y} s(y, A, B): the test statistic, the differential association of the two sets of target words with the attributes.
    • {(X_i, Y_i)}_i denotes all the partitions of X ∪ Y into two sets of equal size.
    • The one-sided p-value of the permutation test is Pr_i[S(X_i, Y_i, A, B) > S(X, Y, A, B)].
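The WEAT statistic and its permutation test can be sketched in a few lines of Python. The toy 2-d embeddings below are invented purely for illustration; real tests use pretrained word vectors (e.g., GloVe or word2vec).

```python
import math
from itertools import combinations

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def s(w, A, B, emb):
    """Association of word w with attribute sets A and B."""
    return (sum(cos(emb[w], emb[a]) for a in A) / len(A)
            - sum(cos(emb[w], emb[b]) for b in B) / len(B))

def S(X, Y, A, B, emb):
    """Differential association of target sets X, Y with attributes A, B."""
    return sum(s(x, A, B, emb) for x in X) - sum(s(y, A, B, emb) for y in Y)

def weat_p_value(X, Y, A, B, emb):
    """One-sided permutation-test p-value over equal-size partitions of X ∪ Y."""
    observed = S(X, Y, A, B, emb)
    union = X + Y
    exceed = total = 0
    for Xi in combinations(union, len(X)):
        Yi = [w for w in union if w not in Xi]
        total += 1
        if S(list(Xi), Yi, A, B, emb) > observed:
            exceed += 1
    return exceed / total

# Toy 2-d embeddings (hypothetical, chosen to encode the stereotype)
emb = {
    "programmer": (0.9, 0.1), "engineer": (0.8, 0.2),
    "nurse": (0.1, 0.9), "teacher": (0.2, 0.8),
    "man": (1.0, 0.0), "male": (0.9, 0.05),
    "woman": (0.0, 1.0), "female": (0.05, 0.9),
}
p = weat_p_value(["programmer", "engineer"], ["nurse", "teacher"],
                 ["man", "male"], ["woman", "female"], emb)
print(f"p-value: {p:.3f}")  # low p-value: the association is unlikely under H0
```

With the stereotype deliberately baked into the toy vectors, no permuted partition exceeds the observed statistic, so the null hypothesis of no differential association is rejected.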
  • Associative Bias in Word Embeddings:
    • Word embeddings exhibit human-like social biases.
    • WEAT can be extended to measure bias in sentence encoders (Sentence Encoder Association Test; SEAT).
    • In SEAT, words are inserted into semantically bleached sentence templates such as "This is <word>." or "<word> is here."
    • These templates convey little meaning beyond the inserted terms.
    • ELMo and BERT display less association bias compared to older (context-free) methods.
  • Issues with Association Tests:
    • Positive predictive ability: Can detect the presence of bias but not guarantee its absence.
      • A lack of evidence of bias is not evidence of a lack of bias.
    • Bias in word embeddings will not necessarily propagate to downstream tasks.
  • Analyzing performance measures across groups:
    • U.S. Labor Law defines disparate impact as practices that unintentionally adversely affect a protected group.
    • Algorithms exhibit impact disparity when outcomes differ across subgroups.
    • Identified by comparing performance measures across groups.
    • Example: ASR systems showed higher word error rates for black speakers (0.35) compared to white speakers (0.19).
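Comparing outcome rates across groups can be sketched as below. The binary predictions are made up for illustration; the 0.8 threshold is the "four-fifths rule" used in U.S. employment-law practice as a rough flag for disparate impact.

```python
def selection_rate(preds):
    """Fraction of positive (favorable) predictions in a group."""
    return sum(preds) / len(preds)

def disparate_impact(preds_protected, preds_reference):
    """Ratio of selection rates; values below 0.8 fail the 'four-fifths rule'
    commonly used to flag disparate impact."""
    return selection_rate(preds_protected) / selection_rate(preds_reference)

# Hypothetical binary predictions (1 = favorable outcome) for two groups
group_a = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # protected group
group_b = [1, 1, 0, 1, 0, 1, 0, 1, 0, 0]   # reference group

ratio = disparate_impact(group_a, group_b)
print(f"selection rates: {selection_rate(group_a):.2f} vs {selection_rate(group_b):.2f}")
print(f"disparate impact ratio: {ratio:.2f}")  # 0.20 / 0.50 = 0.40 -> flags disparity
```

The same group-wise comparison applies to error rates, e.g., the per-group word error rates in the ASR example above.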
  • Counterfactual Evaluation:
    • Modify text by flipping protected attributes (gender, race, etc.) and observe differences in model performance.
      • For example, in coreference resolution, identify all expressions in a text that refer to the same real-world entity.
      • Introduce minimal pair sentences that differ only by pronoun gender.
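A minimal-pair counterfactual can be generated by flipping gendered pronouns, as in this sketch. The pronoun map is deliberately minimal and lossy (English "her" maps to both "him" and "his"); real counterfactual generation must also handle names, titles, and grammatical agreement.

```python
import re

# Minimal pronoun-flip map (illustrative only)
FLIP = {"he": "she", "she": "he", "him": "her", "her": "him",
        "his": "her", "hers": "his"}

def flip_gender(text):
    """Swap gendered pronouns to build a minimal-pair counterfactual,
    preserving sentence-initial capitalization."""
    def repl(m):
        word = m.group(0)
        flipped = FLIP[word.lower()]
        return flipped.capitalize() if word[0].isupper() else flipped
    return re.sub(r"\b(" + "|".join(FLIP) + r")\b", repl, text, flags=re.IGNORECASE)

original = "The doctor said he would call the nurse."
counterfactual = flip_gender(original)
print(counterfactual)  # "The doctor said she would call the nurse."
```

A model (e.g., a coreference resolver or toxicity classifier) that behaves differently on the original and the counterfactual exhibits sensitivity to the protected attribute.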

How to Identify Bias in LLMs (cont.)

  • No single or universal metric or method for measuring bias in LLM outputs.
    • Human Evaluation:
      • Uses human experts to detect bias using predefined criteria.
      • Qualitative and subjective.
      • Costly, time-consuming, inconsistent, and prone to human errors or biases.
    • Automatic Evaluation:
      • Uses computational methods to automatically measure bias using predefined metrics.
      • Scalable.
      • Quantitative.
      • Limited, noisy, and may be inaccurate.
  • Automatic Evaluation Steps:
    • Split the input texts for the different protected groups to be analyzed (e.g., Women and Men).
    • Run the inference and evaluate for each group using quantitative evaluation metrics.
  • Types of Evaluation Metrics:
    • Statistical metrics, Bias metrics, Reference score.
    • For generative tasks, use another LLM to predict a score: Language Toxicity (Toxicity), Language Polarity (Regard) and gendered stereotype (HONEST).
    • Traditional performance metrics can be used to compare each group.
  • Statistical Metrics:
    • Enable researchers to evaluate biases across all subgroups in their dataset by assembling a confusion matrix for each subgroup.
    • Each predicted token (word) is compared with the ground truth.
    • Mostly Accuracy and F1 Score are used, though any classification-based metric could be applied.
    • Accuracy: (TP + TN) / (TP + TN + FP + FN)
    • Precision: TP / (TP+FP)
    • Recall: TP / (TP+FN)
    • F1 Score: 2 * (precision * recall)/ (precision + recall)
    • AUPRC: area under the precision-recall curve; a random classifier's baseline equals the positive base rate, P / (P + N).
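The per-subgroup statistical metrics above can be computed directly from confusion-matrix counts, as in this stdlib-only sketch. The two subgroups and their labels are hypothetical.

```python
def confusion(y_true, y_pred):
    """Confusion-matrix counts for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from one group's labels."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical ground truth and predictions, split by subgroup
women = metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
men   = metrics([1, 1, 0, 0, 1], [1, 1, 0, 1, 1])
print("women:", women)
print("men:  ", men)
```

Identical accuracy can hide very different precision/recall trade-offs between groups, which is why the per-group breakdown matters.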
  • Bias Metrics:
    • Similar to statistical metrics, but focused on whether bias is present; used to quantitatively analyze disparities between privileged and unprivileged groups.
    • True Positive Rate: TP / (TP + FN)
    • True Negative Rate: TN / (TN + FP)
    • False Omission Rate: FN / (FN + TN)
    • False Discovery Rate: FP / (TP + FP)
    • False Positive Rate: FP/ (FP + TN)
    • False Negative Rate: FN / (FN + TP)
    • Negative Predictive Value: TN / (TN + FN)
    • Predicted Positive Ratio: (TP + FP) / (TP + FP + TN + FN)
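Comparing these rates between a privileged and an unprivileged group yields standard fairness gaps; the sketch below computes a few of them from hypothetical counts (the equal-opportunity difference is the TPR gap).

```python
def rates(tp, tn, fp, fn):
    """Group-wise rates used in fairness auditing."""
    return {
        "TPR": tp / (tp + fn),                   # true positive rate (recall)
        "FPR": fp / (fp + tn),                   # false positive rate
        "FOR": fn / (fn + tn),                   # false omission rate
        "FDR": fp / (tp + fp),                   # false discovery rate
        "PPR": (tp + fp) / (tp + tn + fp + fn),  # predicted positive ratio
    }

# Hypothetical confusion-matrix counts per group
privileged   = rates(tp=40, tn=40, fp=10, fn=10)
unprivileged = rates(tp=20, tn=50, fp=10, fn=20)

# Equal-opportunity difference: gap in TPR between the groups
eo_gap = privileged["TPR"] - unprivileged["TPR"]
print(f"TPR gap: {eo_gap:.2f}")  # 0.80 - 0.50 = 0.30
```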
  • Reference Score:
    • Reference scores are calculated by comparing the generated text (or embedding) to one or more reference texts (or embeddings) and computing a score based on the overlap between them (e.g., n-gram or embedding similarity).
    • The reference text can be a human-curated set or from a reference LLM model.
  • Most common reference score metrics:
    • BLEU, ROUGE, METEOR, Perplexity, BERT Score.
  • BLEU (Bilingual Evaluation Understudy):
    • Algorithm for evaluating the quality of text that has been machine-translated from one natural language to another.
    • Quality is measured by how close the output is to a professional human translation.
    • Normally used in translation and image-captioning tasks.
    • Calculates the precision of the n-grams in the machine-translated text by comparing them to the reference text.
    • BLEU scores range from 0 to 1, with 1 representing a perfect match to the reference text and 0 representing no overlap.
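A simplified single-reference BLEU can be sketched as below: clipped n-gram precisions combined by a geometric mean, times a brevity penalty. Production BLEU adds smoothing and multiple references, so treat this as an illustration of the mechanics, not the canonical implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (single reference, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram count by its count in the reference
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(f"{bleu('the cat sat on the mat', 'the cat sat on the mat'):.2f}")  # 1.00
```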
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    • Used for evaluating automatic summarization and machine translation.
    • ROUGE-N measures the number of matching n-grams between the generated and the ground-truth text (ROUGE-1 for unigrams, ROUGE-2 for bigrams).
    • ROUGE-L calculates the Longest Common Subsequence (LCS), identifying the longest overlapping sequence of tokens.
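ROUGE-L's LCS-based F1 can be sketched with a standard dynamic-programming LCS, as below; this is an illustrative minimal version without the stemming or length weighting found in full ROUGE implementations.

```python
def lcs_length(a, b):
    """Dynamic-programming Longest Common Subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L F1 from LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)

# 5 of 6 tokens form the longest common subsequence -> F1 = 5/6
print(f"{rouge_l('the cat sat on the mat', 'the cat lay on the mat'):.2f}")
```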
  • METEOR:
    • Measure of how well a language model generates text that is accurate and relevant.
    • Combines unigram precision and recall into a harmonic-mean score weighted toward recall.
    • Rewards not only exact word matches but also matching stems, synonyms, and paraphrases.
  • Perplexity:
    • Evaluates to what extent the generated texts are similar to the distribution of text that a given model was trained on.
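Given the per-token probabilities a model assigns to a text, perplexity is the exponential of the average negative log-probability; the probabilities below are invented to illustrate the computation.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability the model
    assigns to each token; lower means the text is more 'expected'."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical per-token probabilities from a language model
print(f"{perplexity([0.25, 0.25, 0.25, 0.25]):.1f}")  # uniform over 4 tokens -> 4.0
# Confidently predicted text has lower perplexity than surprising text
print(perplexity([0.9, 0.8, 0.7, 0.9]) < perplexity([0.5, 0.1, 0.05, 0.2]))  # True
```

In bias evaluation, a systematic perplexity gap between texts about different demographic groups suggests the model's training distribution represents those groups unequally.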
  • BERT Score:
    • Leverages the pre-trained contextual embeddings from BERT models and matches words in candidate and reference sentences by cosine similarity.
    • Also computes precision, recall, and F1 Score.
  • Prediction Scores:
    • Use a task-specific LLM to score the generated text with respect to the social perceptions of a demographic (e.g., gender, race, sexual orientation).
    • Toxicity: Evaluates the toxicity score using the roberta-hate-speech-dynabench-r4 model.
    • Language Polarity: Evaluates language polarity (negative, neutral, positive, and other) using a BERT-based model.
    • Gendered Stereotype: Measures hurtful sentence completions, using HurtLex to quantify how often sentences are completed with a hurtful word (HONEST score).
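Among the embedding-based reference scores above, BERT Score's greedy cosine matching is easy to sketch: each candidate token is matched to its most similar reference token for precision, and vice versa for recall. The tiny 2-d "contextual embeddings" below are invented; real BERTScore extracts them from BERT layers.

```python
import math

def cos(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bertscore_f1(cand_embs, ref_embs):
    """BERTScore-style greedy matching: each candidate token matches its most
    similar reference token (precision) and vice versa (recall)."""
    precision = sum(max(cos(c, r) for r in ref_embs) for c in cand_embs) / len(cand_embs)
    recall = sum(max(cos(r, c) for c in cand_embs) for r in ref_embs) / len(ref_embs)
    return 2 * precision * recall / (precision + recall)

# Toy token embeddings for a candidate and a reference sentence (hypothetical)
cand = [(1.0, 0.0), (0.8, 0.6)]
ref  = [(1.0, 0.0), (0.7, 0.7)]
f1 = bertscore_f1(cand, ref)
print(f"{f1:.3f}")  # close to 1: the sentences are near-paraphrases
```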

Synthetic Data Evaluation

  • Data Quality Evaluation:
    • Ensure the quality of generated text in terms of usability as a stand-in for the original dataset and