NLP interacts with ethics due to the use of natural language data produced by and about real people.
It is important to consider the rights of individuals regarding privacy and anonymity.
Understanding differences between the data population and the general population is important.
Acknowledging and understanding inherent biases in the data is crucial.
NLP applications have expanding real-world impact, necessitating awareness of potential benefits and abuses.
Algorithmic Bias: A Historical Case Study
In 1988, a British medical school was found guilty of discrimination by the UK Commission for Racial Equality.
A computer program used for applicant screening was biased against women and individuals with non-European names.
The algorithm was trained to replicate historical human admission decisions, which it matched with 90-95% accuracy, thereby encoding the existing biases.
The use of algorithms did not eliminate biased decision-making.
Relying solely on human decision-makers is also insufficient to resolve the bias problem.
The issue is amplified today due to the increasing prevalence of data-driven AI models.
Identifying and mitigating bias and unfairness in AI models is therefore essential.
Bias in the ML Pipeline
Data Creation/Collection:
Datasets are typically created using a sample of available data to make inferences about a larger population.
Sampling bias: Some instances are selected more than others, influenced by socio-cultural conditions.
This affects the representativeness of demographics and behavior, and uniform random sampling is difficult to achieve.
The result is poor generalization of trained AI models.
Context around social and historical conditions in data generation is important.
Negative set bias: Insufficient representative samples reinforce digital divides and data inequities.
Self-selection bias: Occurs when individuals volunteer to participate and those participants are systematically different from those who choose not to.
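To make sampling bias concrete, the following toy simulation (all numbers hypothetical) draws a group-dependent sample from a balanced population and shows how the resulting dataset misrepresents group proportions:

```python
import random

random.seed(0)

# Hypothetical population: 50% group A, 50% group B.
population = ["A"] * 5000 + ["B"] * 5000

def biased_sample(pop, n, p_select):
    """Include each individual with a group-dependent probability,
    mimicking socio-cultural selection effects."""
    chosen = [x for x in pop if random.random() < p_select[x]]
    random.shuffle(chosen)
    return chosen[:n]

# Group A is far more likely to end up in the dataset than group B.
sample = biased_sample(population, 1000, p_select={"A": 0.9, "B": 0.3})
share_a = sample.count("A") / len(sample)
print(f"Share of group A in sample: {share_a:.2f}")  # roughly 0.75, not the true 0.50
```

A model trained on such a sample sees group B far less often than it occurs in the population, which is exactly the poor-generalization failure described above.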
Data Creation/Collection (cont.):
Label bias: Arises during the labeling or data annotation process.
Subjective biases and domain background of annotators influence labeling.
This bias tends to increase in high-volume longitudinal studies, where annotation norms and label definitions change over time.
Apprehension bias: User behavior, and its manifestation in the resulting dataset, is affected by the awareness of being observed.
This includes self-presentation dynamics where study participants alter their behavior because of the active observers.
Model and Data Analysis:
Involves the construction of algorithms, selection of features, assumptions on data distribution, data cleaning, pre-processing, and choice of ML models.
Confounding bias: External variables distort the apparent relationship between the independent and dependent variables.
These variables can be 'omitted' or act as a 'proxy' or indirect bias.
Chronological bias: Changes in study design over time and temporal variations caused by population or system drifts can introduce bias.
Model and Data Analysis (cont.):
Algorithm bias: Includes ranking bias where personalization algorithms reinforce cognitive and social biases.
Insensitive measure bias: Use of insufficiently accurate methods to detect the outcome of interest.
Using alternative objective functions when the true criterion is not directly measurable (e.g., user clicks as a proxy for user satisfaction).
Homogeneity bias: Less diverse content from a narrower spectrum of sources on social media.
Data and Model Evaluation:
Human evaluation bias: Includes confirmation bias, peak-end effect bias, and recall bias.
Validation and test sets can systematically under- or over-estimate the predictive performance of the model.
Socially Useful NLP Applications
Assistive technology like text-to-speech, voice search, and image description for the blind benefits people with disabilities.
Machine translation, summarization, and improved search engines provide unprecedented information access to the general public.
Identifying fake news, trolls, and toxic comments can prevent the spread of harmful information.
Social media monitoring can assist in disasters or identify health issues.
However, such monitoring can be abused for surveillance.
Potential Harms from NLP Systems
Unequal Utility:
Text-to-speech and voice-assistant systems may mishandle the pronunciation and mannerisms of speakers underrepresented in the training data (e.g., non-native speakers or native speakers of other dialects), degrading utility across social groups and linguistic varieties.
Allocation Harms:
NLP systems can lead to unfair effects.
Unfair positive allocation: Biases in granting loans or university placements.
Unfair negative allocation: Biased arrests based on social media posts.
Exclusionary norms in these systems reinforce the dominant social group, implicitly excluding other groups.
Allocational Harms in LLMs:
Disparate distribution of resources or opportunities between social groups.
Direct discrimination: Disparate treatment explicitly due to a social group (e.g. resume screening).
Indirect discrimination: Disparate treatment towards social groups due to proxies or other implicit factors (e.g. LLM-aided healthcare tools).
Stereotyping:
Systems may reflect societal biases in their output.
For example, when translating gender-neutral Turkish sentences into English, Google Translate assigns pronouns according to stereotypically gendered occupations.
Label bias in toxicity classification datasets due to underspecified annotation guidelines and annotator positionality, reflecting different cultural and social norms.
Assessing AI Systems Adversarially
Ethics of the Research Question:
Consider the impact of technology and potential dual use: Who benefits? Who could be harmed? Could sharing data and models have major impacts on people’s lives?
Privacy:
Who owns the data? Is data published or merely publicized? Is there user consent, and what implicit assumptions do users have about data usage?
Bias in Data:
Look for artifacts, population-specific distributions, and representativeness.
Bias in Models:
How do you control for confounding variables and corner cases? Does the system optimize for the “right” objective? Does the system amplify bias?
Utility-Based Evaluation Beyond Accuracy:
Consider false positive (FP) and false negative (FN) rates, the cost of misclassification, and fault tolerance.
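The point about misclassification costs can be sketched numerically; the error counts and cost weights below are made up for illustration:

```python
def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total misclassification cost; the right trade-off is application-specific."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical toxicity filter: assume a missed toxic comment (FN) is
# 5x more costly than wrongly flagging a benign one (FP).
model_a = expected_cost(fp=40, fn=10, cost_fp=1, cost_fn=5)  # 90
model_b = expected_cost(fp=10, fn=30, cost_fp=1, cost_fn=5)  # 160

# Model B makes fewer total errors (40 vs. 50) yet has the higher
# expected cost, so accuracy alone would pick the wrong model.
```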
Evaluating Bias in LLMs
Large language models (LLMs) have revolutionized language technologies.
LLMs serve as foundation models that can be fine-tuned for specific functions, replacing task-specific models trained on small datasets.
Trained on vast, uncurated Internet data, LLMs may perpetuate harm by inheriting stereotypes, misrepresentations, and derogatory and exclusionary language.
Examples:
Text Generation: Different completions based on gendered prompts.
Machine Translation: Gender ambiguities lead to gendered translations.
Question-Answering: Relying on stereotypes to answer ambiguous questions related to ethnicity.
Classification: Toxicity detection models misclassify African-American English tweets as negative more often than Standard American English.
Protected or Sensitive Attributes:
Attributes for which the model should not be biased, such as gender, race, religion, age, or disability.
Bias in Word Embeddings:
Embeddings learn from co-occurrence statistics (e.g., king - man + woman ≈ queen).
Biases emerge when the text encodes unsavory stereotypes (e.g., doctor - man + woman ≈ nurse).
The Word Embedding Association Test (WEAT) considers two sets of target words (e.g., programmer, engineer, … and nurse, teacher, …) and two sets of attribute words (e.g., man, male, … and woman, female, …).
Null Hypothesis: There is no difference between the two sets of target words in terms of similarity to the two sets of attribute words.
Word Embedding Association Test (WEAT) details:
X, Y are sets of target words of equal size, and A, B the two sets of attribute words.
The test statistic is:
s(X, Y, A, B) = Σ_{x ∈ X} s(x, A, B) − Σ_{y ∈ Y} s(y, A, B),
where s(w, A, B) = mean_{a ∈ A} cos(w, a) − mean_{b ∈ B} cos(w, b).
s(w, A, B): association of the word w with the attribute sets.
s(X, Y, A, B): differential association of the two sets of target words with the attribute sets.
{(X_i, Y_i)}_i: all partitions of X ∪ Y into two sets of equal size.
The one-sided p-value of the permutation test is Pr_i[s(X_i, Y_i, A, B) > s(X, Y, A, B)].
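The WEAT statistic and its exact permutation test can be sketched as below; the 2-d vectors are toy stand-ins for real word embeddings, constructed so that the X-words align with attribute set A and the Y-words with B:

```python
import itertools
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def s_word(w, A, B):
    # s(w, A, B): association of word vector w with the attribute sets.
    return (sum(cos(w, a) for a in A) / len(A)
            - sum(cos(w, b) for b in B) / len(B))

def s_test(X, Y, A, B):
    # s(X, Y, A, B): differential association of the two target sets.
    return sum(s_word(x, A, B) for x in X) - sum(s_word(y, A, B) for y in Y)

def weat_p_value(X, Y, A, B):
    # One-sided permutation test over all equal-size partitions of X u Y.
    observed = s_test(X, Y, A, B)
    union = X + Y
    k = len(X)
    stats = []
    for idx in itertools.combinations(range(len(union)), k):
        Xi = [union[i] for i in idx]
        Yi = [union[i] for i in range(len(union)) if i not in idx]
        stats.append(s_test(Xi, Yi, A, B))
    return sum(s > observed for s in stats) / len(stats)

# Toy 2-d "embeddings": X-words resemble attribute A, Y-words resemble B.
X = [[1.0, 0.0], [0.9, 0.1]]
Y = [[0.0, 1.0], [0.1, 0.9]]
A = [[1.0, 0.05]]
B = [[0.05, 1.0]]
p = weat_p_value(X, Y, A, B)  # small p-value: the association is unlikely by chance
```

In practice the exact test over all partitions is only feasible for small sets; with larger vocabularies a random subset of permutations is sampled instead.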
Associative Bias in Word Embeddings:
Word embeddings exhibit human-like social biases.
WEAT can be extended to measure bias in sentence encoders (Sentence Encoder Association Test; SEAT).
In SEAT, words are inserted into semantically bleached sentence templates such as “This is ___.” or “___ is here.”
These templates convey little meaning beyond the inserted terms.
ELMo and BERT display less association bias compared to older (context-free) methods.
Issues with Association Tests:
Positive predictive ability: Can detect the presence of bias but not guarantee its absence.
A lack of evidence of bias is not evidence of a lack of bias.
Bias in word embeddings will not necessarily propagate to downstream tasks.
Analysing performance measures across groups:
U.S. Labor Law defines disparate impact as practices that unintentionally adversely affect a protected group.
Algorithms exhibit impact disparity when outcomes differ across subgroups.
Identified by comparing performance measures across groups.
Example: ASR systems showed higher word error rates for black speakers (0.35) compared to white speakers (0.19).
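A minimal sketch of comparing performance across groups, using hypothetical predictions; it also applies the "four-fifths rule" heuristic sometimes used to flag disparate impact:

```python
def error_rate(preds, labels):
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)

def selection_rate(preds):
    # Fraction of positive (e.g., "approved") decisions.
    return sum(preds) / len(preds)

# Hypothetical binary predictions and ground truth for two subgroups.
groups = {
    "group_1": {"preds": [1, 0, 1, 1, 0, 1, 0, 0], "labels": [1, 0, 1, 1, 0, 1, 0, 0]},
    "group_2": {"preds": [1, 0, 0, 1, 0, 0, 1, 0], "labels": [1, 1, 1, 1, 0, 0, 0, 0]},
}
rates = {g: error_rate(d["preds"], d["labels"]) for g, d in groups.items()}

# "Four-fifths rule": flag disparate impact if the disadvantaged group's
# selection rate falls below 80% of the advantaged group's.
ratio = selection_rate(groups["group_2"]["preds"]) / selection_rate(groups["group_1"]["preds"])
flag = ratio < 0.8
```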
Counterfactual Evaluation:
Modify text by flipping protected attributes (gender, race, etc.) and observe differences in model performance.
For example, in coreference resolution, identify all expressions in a text that refer to the same real-world entity.
Introduce minimal pair sentences that differ only by pronoun gender.
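A simple counterfactual generator can be sketched by swapping gendered pronouns; note the mapping is an oversimplification, since "her" can correspond to either "his" or "him" depending on its grammatical role:

```python
import re

# Hypothetical minimal-pair generator for counterfactual evaluation.
# Caveat: "her" is mapped to a single form; a real system needs POS tags.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his", "him": "her"}

def flip_pronouns(text):
    def repl(match):
        word = match.group(0)
        flipped = SWAPS[word.lower()]
        return flipped.capitalize() if word[0].isupper() else flipped
    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

original = "The doctor said he would call back."
counterfactual = flip_pronouns(original)
# A fair model should behave near-identically on both variants.
```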
How to Identify Bias in LLMs (cont.)
No single or universal metric or method for measuring bias in LLM outputs.
Human Evaluation:
Uses human experts to detect bias using predefined criteria.
Qualitative and subjective.
Costly, time-consuming, inconsistent, and prone to human errors or biases.
Automatic Evaluation:
Uses computational methods to automatically measure bias using predefined metrics.
Scalable.
Quantitative.
Limited, noisy, and may be inaccurate.
Automatic Evaluation Steps:
Split the input texts by the protected groups to be analyzed (e.g., women and men).
Run the inference and evaluate for each group using quantitative evaluation metrics.
For generative tasks, use another LLM to predict a score: Language Toxicity (Toxicity), Language Polarity (Regard) and gendered stereotype (HONEST).
Traditional performance metrics can be used to compare each group.
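The per-group evaluation loop above can be sketched as follows; the generator and scorer here are trivial placeholders standing in for a real LLM and a learned metric such as a toxicity or regard classifier:

```python
def evaluate_by_group(examples, generate, score):
    """Group inputs by protected attribute, run inference, and average a score."""
    by_group = {}
    for ex in examples:
        by_group.setdefault(ex["group"], []).append(score(generate(ex["prompt"])))
    return {g: sum(s) / len(s) for g, s in by_group.items()}

# Trivial stand-ins so the sketch runs end to end.
examples = [
    {"prompt": "The woman worked as", "group": "women"},
    {"prompt": "The man worked as", "group": "men"},
]
generate = lambda prompt: prompt + " a professional."      # placeholder LLM
score = lambda text: float("professional" in text)         # placeholder scorer
group_scores = evaluate_by_group(examples, generate, score)
```

Large gaps between the per-group averages would indicate that the model treats the groups differently.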
Statistical Metrics
Statistical metrics enable researchers to evaluate bias across all subgroups in a dataset by assembling a confusion matrix for each subgroup.
Each predicted token (word) is compared with the ground truth.
Mostly accuracy and F1 score are used, but any classification-based metric could be applied.
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP+FP)
Recall: TP / (TP+FN)
F1 Score: 2 * (precision * recall)/ (precision + recall)
AUPRC: area under the precision-recall curve; its random-classifier baseline equals the positive prevalence, P / (P + N).
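These formulas translate directly into code; the confusion-matrix counts below are hypothetical:

```python
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion-matrix counts for two subgroups.
m1 = classification_metrics(tp=40, tn=40, fp=10, fn=10)  # balanced errors
m2 = classification_metrics(tp=20, tn=40, fp=10, fn=30)  # many missed positives
# Comparing m1 and m2 surfaces the recall gap between the subgroups.
```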
Bias Metrics:
Similar to statistical metrics but focus more on analyzing whether bias is present, used to quantitatively analyze disparities between privileged and unprivileged groups.
Reference scores are calculated by comparing the generated text (or its embedding) to one or more reference texts (or embeddings), typically via n-gram overlap or embedding similarity.
The reference text can come from a human-curated set or from a reference LLM.
Most common reference score metrics:
BLEU, ROUGE, METEOR, Perplexity, BERT Score.
BLEU (Bilingual Evaluation Understudy):
Algorithm for evaluating the quality of text which has been machine-translated from one natural language to another.
Quality is measured by how close the machine output is to a professional human translation.
Normally used in translation and image captioning tasks.
Calculates the precision of the n-grams in the machine-translated text by comparing them to the reference text.
The BLEU score ranges from 0 to 1, where 1 represents a perfect match with the reference text and 0 represents no overlap.
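A simplified BLEU can be sketched as modified n-gram precision (here up to bigrams) combined with a brevity penalty; real implementations use up to 4-grams and multiple references:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, ref, n):
    # Clip each candidate n-gram count by its count in the reference.
    c, r = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
    overlap = sum(min(count, r[g]) for g, count in c.items())
    return overlap / max(1, sum(c.values()))

def bleu(cand, ref, max_n=2):
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat is on the mat".split()
cand = "the cat sat on the mat".split()
score = bleu(cand, ref)  # geometric mean of 5/6 (unigrams) and 3/5 (bigrams)
```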
ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
Used for evaluating automatic summarization and machine translation.
ROUGE-N measures the number of matching n-grams between the generated and the ground-truth text (ROUGE-1 for unigrams, ROUGE-2 for bigrams).
ROUGE-L calculates the Longest Common Subsequence, identifying the longest overlapping (not necessarily contiguous) sequence of tokens.
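ROUGE-L can be sketched with the classic dynamic-programming LCS, reporting the F score over LCS-based precision and recall:

```python
def lcs_len(a, b):
    # Classic dynamic-programming Longest Common Subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(cand, ref):
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

ref = "the cat is on the mat".split()
cand = "the cat sat on the mat".split()
f = rouge_l(cand, ref)  # LCS is "the cat on the mat" (length 5)
```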
METEOR:
Measures how well a language model generates text that is accurate and relevant.
Uses the F1 Score to calculate the n-gram overlap score.
Rewards not only exact word matches but also matching stems, synonyms, and paraphrases.
Perplexity:
Evaluates to what extent the generated texts are similar to the distribution of text that a given model was trained on.
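Perplexity is the exponential of the negative average per-token log-probability; a minimal sketch with made-up log-probabilities:

```python
import math

def perplexity(token_log_probs):
    # Perplexity = exp(-average log-probability per token).
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Made-up per-token log-probabilities from a language model.
probable = [math.log(0.5)] * 4     # text the model finds likely
surprising = [math.log(0.01)] * 4  # text the model finds unlikely
# perplexity(probable) = 2.0, perplexity(surprising) = 100.0
```

Lower perplexity means the text is closer to the distribution the model was trained on.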
BERT Score:
Leverages the pre-trained contextual embeddings from BERT models and matches words in candidate and reference sentences by cosine similarity.
Also computes precision, recall, and F1 Score.
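A toy version of BERTScore's greedy cosine matching, using hand-made 2-d "embeddings" in place of real BERT vectors:

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bert_score_f1(cand_vecs, ref_vecs):
    # Greedy matching: every token pairs with its most similar counterpart.
    recall = sum(max(cos(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cos(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    return 2 * precision * recall / (precision + recall)

# Toy token embeddings for a candidate and a reference sentence.
cand_vecs = [[1.0, 0.0], [0.0, 1.0]]
ref_vecs = [[1.0, 0.0], [0.6, 0.8]]
f1 = bert_score_f1(cand_vecs, ref_vecs)
```

Because matching is done in embedding space, near-synonyms score highly even when the surface n-grams differ, which is the advantage over BLEU/ROUGE.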
Prediction Scores
Uses a task-specific model to score the generated text for social perceptions of a demographic group (e.g., gender, race, sexual orientation).
Toxicity: Evaluates the toxicity score using the roberta-hate-speech-dynabench-r4 model.
Language Polarity: Evaluates language polarity (negative, neutral, positive, or other) using a BERT-based model.
Gendered Stereotype: Measures hurtful sentence completions, using HurtLex to quantify how often sentences are completed with a hurtful word (HONEST score).
Synthetic Data Evaluation
Data Quality Evaluation:
Ensure the quality of generated text in terms of usability as a stand-in for the original dataset and