Presented by: Amandalynne Paullada
Department of Linguistics, UW
Date: 4 December 2024
Topic: Computational Linguistics
Guest Lecture with material from E. Bender, N. Tachikawa Shapiro
Definition (ACL): Computational linguistics is the scientific study of language from a computational perspective.
Concerned with creating computational models of linguistic phenomena.
Utilizing computers for linguistic data analysis and hypothesis testing.
Natural Language Processing (NLP):
Algorithms and models enabling computers to manipulate human language.
Corpus: A large, structured collection of written or spoken language data.
Often requires both computational and manual curation.
Example: Corpus del Español, which consists of:
100 million words of Spanish texts from the 1200s to the 1900s.
2 billion words from web pages (the Web/Dialects corpus).
Additional resources available from the Linguistic Data Consortium.
Computational analysis of linguistic data involves:
Measuring how frequently syntactic structures occur.
Complexity of syntax: Algorithms are applied to produce parse trees from text (a minimal sketch follows this list).
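To make parsing concrete, here is a minimal sketch using NLTK's chart parser (assuming NLTK is installed); the toy grammar and sentence are invented for illustration.

    # Minimal parsing sketch with NLTK; grammar and sentence are toy examples.
    import nltk

    grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V NP
    Det -> 'the' | 'a'
    N -> 'linguist' | 'corpus'
    V -> 'parses' | 'builds'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("the linguist parses a corpus".split()):
        tree.pretty_print()  # draws the parse tree as ASCII art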
Computational Sociolinguistics:
Analyze linguistic variation across time.
Study sociolinguistic variation in digital contexts.
Foundation of word vectors (embeddings):
Hypothesis: Words used in similar contexts tend to have similar meanings.
Rather than representing meanings symbolically, this approach models word contexts mathematically.
Similar words are represented as nearby points in a high-dimensional vector space (see the sketch after this list).
Quotation: “You shall know a word by the company it keeps” – Firth (1957).
Reference: Figure 1 in “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change” (Hamilton et al., 2016).
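A minimal sketch of this idea with toy 3-dimensional vectors (values invented for illustration; real embeddings have hundreds of dimensions), using cosine similarity via NumPy:

    # Toy demonstration: similar words = nearby vectors, measured by cosine similarity.
    import numpy as np

    vectors = {
        "cat":     np.array([0.9, 0.8, 0.1]),
        "dog":     np.array([0.85, 0.75, 0.2]),
        "economy": np.array([0.1, 0.2, 0.9]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(cosine(vectors["cat"], vectors["dog"]))      # high: similar contexts
    print(cosine(vectors["cat"], vectors["economy"]))  # low: different contexts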
Investigation of sociolinguistic variation in digital communication across demographics:
Differences in text messaging styles between generations (Boomers vs. Gen Z).
Effects of gender on emoji usage.
Variation in language use across personal contexts (romantic vs. platonic texting).
New vocabulary acquisition sources, including TikTok, friends, and media.
Roles of NLP:
Enable extraction of patterns and information from language data (Natural Language Understanding).
Generate coherent human language strings (Natural Language Generation).
Applications:
Search engines, Machine translation, Content moderation, Chatbots, Spelling/grammar checkers, Predictive text, etc.
Automatic Speech Recognition (ASR): Techniques for converting streams of spoken language into words and phrases.
Speech Synthesis: Computational models that approximate human speech sounds based on contextual information.
A spoken dialogue system chains these components (see the sketch after this list):
Automatic Speech Recognition: Capture and segment audio.
Natural Language Understanding: Extract meaningful components from the transcription.
Natural Language Generation: Formulate a response.
Speech Synthesis: Produce audio with correct pronunciation and intonation.
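A schematic sketch of that pipeline; every function below is a hypothetical toy stub standing in for what would be a substantial statistical model in a real system.

    # Schematic spoken-dialogue pipeline; all four stages are toy stubs.
    def speech_to_text(audio):
        # ASR: segment the audio stream and transcribe it
        return "what time is it"            # toy stand-in for a real transcription

    def understand(transcription):
        # NLU: map the transcription to a structured intent
        return {"intent": "ask_time"}

    def generate_response(meaning):
        # NLG: turn the intent into a response string
        return "It is noon." if meaning["intent"] == "ask_time" else "Sorry?"

    def synthesize(text):
        # Speech synthesis: render text as audio with pronunciation and intonation
        return f"<audio: {text}>"           # toy stand-in for a waveform

    print(synthesize(generate_response(understand(speech_to_text(b"raw audio")))))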
Many NLP applications employ machine learning, particularly supervised learning:
Example: Classifying spam vs. normal (“ham”) emails.
Models are trained on labeled examples (a minimal sketch follows).
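A minimal sketch of the spam example, assuming scikit-learn is available; the six training emails are invented toy data (real systems train on thousands of labeled examples).

    # Supervised text classification: learn from labeled emails, predict on new ones.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    emails = [
        "win a free prize now", "cheap pills click here", "claim your free money",
        "meeting moved to friday", "draft of the paper attached", "lunch tomorrow?",
    ]
    labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

    vectorizer = CountVectorizer()              # bag-of-words features
    X = vectorizer.fit_transform(emails)
    model = MultinomialNB().fit(X, labels)      # learns word/label statistics

    test = vectorizer.transform(["claim your free prize"])
    print(model.predict(test))                  # likely ['spam'] on this toy data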
Content moderation on online platforms involves enforcing posting policies, sometimes through algorithms.
Automated approaches may use NLP models that account for context rather than applying blanket keyword restrictions.
Supervised learning, as in the spam example, supports this kind of contextual classification.
Algorithms facilitate translation across various human languages in multiple formats (speech/text).
Example: Google Translate and Google Lens.
Mistranslations occasionally circulate in digital media, illustrating that machine translation remains imperfect.
Extension of NLP models to biomedical discourse, aiding in:
Searching peer-reviewed research for specific topics (e.g. COVID-19).
Analyzing social determinants of health in unstructured clinical data.
Core Concept (language modeling): Collect word-sequence statistics from text corpora to estimate the probability of word sequences.
Example: Given a word, which words tend to follow it, and how often?
Practical applications for predictive modeling in speech recognition.
Establishes likelihoods for word sequences in ambiguous contexts, e.g. distinguishing homophone-laden strings like “recognize speech” vs. “wreck a nice beach.”
Importance of context in next-word predictions highlights limitations of simple models.
Predictive text on smartphones is an everyday example of this kind of model (a minimal sketch follows).
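A minimal sketch of the core idea as a bigram model: count which words follow which in a toy corpus, then predict the most frequent continuation. Real models use far larger corpora and smoothing for unseen sequences.

    # Bigram language model: next-word prediction from corpus counts.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate the fish".split()

    bigrams = defaultdict(Counter)
    for w1, w2 in zip(corpus, corpus[1:]):
        bigrams[w1][w2] += 1

    def predict_next(word):
        # most frequent word observed after `word` in the corpus
        return bigrams[word].most_common(1)[0][0]

    print(predict_next("the"))   # 'cat' (follows 'the' twice in the toy corpus)
    print(dict(bigrams["the"]))  # all observed continuations with counts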
Large language models (LLMs, e.g. ChatGPT) are trained on next-word prediction over vast datasets using very large neural network architectures.
Reinforcement Learning from Human Feedback (RLHF) is used to steer models toward preferred output behavior.
Whether LLMs acquire language in human-like ways remains an active area of study.
Magnitude of Models:
Data Size: Billions of tokens gathered from varied online sources.
Model Size: Comprised of billions of parameters with considerable computational demands and environmental impact.
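For a rough, illustrative sense of scale (assumed example size, not a specific model): the weights alone of a 7-billion-parameter model stored in 16-bit floats occupy about 14 GB, before any activations, optimizer state, or serving overhead.

    # Back-of-the-envelope memory estimate; numbers are illustrative assumptions.
    params = 7e9                # 7 billion parameters (assumed example size)
    bytes_per_param = 2         # 16-bit (fp16) storage
    print(params * bytes_per_param / 1e9, "GB")   # -> 14.0 GB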
Digital representation challenges faced by underrepresented languages in online spaces.
Issues include script availability in digital systems, socio-political factors affecting language vitality, and limited use online.
Recommended Sources:
Language Files, Chapter 16.
Speech & Language Processing by Jurafsky & Martin (free online).
Natural Language Processing with Python (Bird, Klein & Loper, free to read online).
Notable Associations:
Association for Computational Linguistics (ACL)
Society for Computation in Linguistics (SCiL)
Conference on Computational Natural Language Learning (CoNLL)
Empirical Methods in Natural Language Processing (EMNLP)
Programs and Research:
UW NLP seminars.
Master’s program in Computational Linguistics.
Courses: LING 471, LING 472, CSE 447.
Questions are encouraged for further clarity and understanding.