Wednesday_Week11_Ling200
Page 1: Introduction
Presented by: Amandalynne Paullada
Department of Linguistics, UW
Date: 4 December 2024
Topic: Computational Linguistics
Guest Lecture with material from E. Bender, N. Tachikawa Shapiro
Page 2: Definition of Computational Linguistics
Definition (ACL): Computational linguistics is the scientific study of language from a computational perspective.
It is concerned with building computational models of linguistic phenomena.
Page 3: Computational Approaches to Linguistics
Using computers to analyze linguistic data and test hypotheses.
Natural Language Processing (NLP):
Algorithms and models that enable computers to process and manipulate human language.
Page 4: Corpus Analysis in Linguistics
Corpus: A large collection of written or spoken language samples.
Often requires both computational and manual curation.
Example: Corpus del Español, which consists of:
100 million words of Spanish from the 1200s to the 1900s.
2 billion words from the internet.
Additional resources available from the Linguistic Data Consortium.
Page 5: Computational Syntax and Analysis
Computational analysis of linguistic data includes:
Measuring how frequently particular syntactic structures occur.
Handling syntactic complexity: parsing algorithms are applied to text to produce parse trees.
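To make the idea of producing a parse tree concrete, here is a minimal sketch of one classic parsing algorithm (CKY). The grammar rules, lexicon, and the function name cky_parse are all invented for illustration and are not from the lecture; real parsers use far larger grammars or learned models.

```python
# Toy grammar in Chomsky normal form. Rules and lexicon are assumptions
# made up for this sketch, not part of the lecture material.
BINARY_RULES = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {
    "the": "Det",
    "a": "Det",
    "dog": "N",
    "cat": "N",
    "chased": "V",
}

def cky_parse(words):
    """Return a nested-tuple parse tree rooted in S, or None if no parse exists."""
    n = len(words)
    # chart[i][j] maps a category to one tree spanning words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        if word in LEXICON:
            category = LEXICON[word]
            chart[i][i + 1][category] = (category, word)
    # Build longer spans out of pairs of adjacent shorter spans
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for left_cat, left_tree in chart[start][mid].items():
                    for right_cat, right_tree in chart[mid][end].items():
                        parent = BINARY_RULES.get((left_cat, right_cat))
                        if parent is not None and parent not in chart[start][end]:
                            chart[start][end][parent] = (parent, left_tree, right_tree)
    return chart[0][n].get("S")

tree = cky_parse("the dog chased a cat".split())
print(tree)
```

The result is a nested tuple in which each node is (category, left subtree, right subtree) and each leaf is (category, word). Counting how often a given subtree shape appears across many parsed sentences is one simple way to measure the frequency of a syntactic structure.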
Page 6: Sociolinguistic Variation
Computational Sociolinguistics:
Analyze linguistic variation across time.
Study sociolinguistic variation in digital contexts.
Page 7: Distributional Semantics
Foundation of word vectors (embeddings):
Hypothesis: Words used in similar contexts tend to have similar meanings.
Rather than representing meanings symbolically, this approach models the contexts in which words occur mathematically.
Similar words are represented as proximate points in high-dimensional vector space.
Quotation: “You shall know a word by the company it keeps” – Firth (1957).
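The distributional hypothesis can be sketched with simple co-occurrence counts. In this minimal example, each word is represented by a vector counting the words that appear near it, and cosine similarity measures how close two vectors are. The toy sentences, the window size, and the function names are assumptions for illustration; real embeddings are trained on very large corpora with more sophisticated methods.

```python
from collections import Counter
from math import sqrt

# Toy corpus invented for illustration; not from the lecture.
SENTENCES = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the student reads the book",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            context = vectors.setdefault(word, Counter())
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    context[tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = cooccurrence_vectors(SENTENCES)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_book = cosine(vectors["cat"], vectors["book"])
print(round(sim_cat_dog, 3), round(sim_cat_book, 3))
```

In this toy corpus "cat" and "dog" occur in nearly identical contexts, so their vectors are more similar to each other than "cat" is to "book".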
Page 8: Example of Distributional Semantics
See Figure 1 in “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change” (Hamilton et al. 2016).
Page 9: Effects of Digital Technology on Language
Investigating sociolinguistic variation in digital communication across demographics:
Differences in text messaging styles between generations (Boomers vs. Gen Z).
Gender-related differences in emoji usage.
Differences in language use across personal contexts (romantic vs. platonic texts).
Sources of new vocabulary, including TikTok, friends, and media.
Page 10: Natural Language Processing (NLP)
Roles of NLP:
Enable extraction of patterns and information from language data (Natural Language Understanding).
Generate coherent human language strings (Natural Language Generation).
Applications:
Search engines, machine translation, content moderation, chatbots, spelling/grammar checkers, predictive text, etc.
Page 11: Speech Processing and Synthesis
Automatic Speech Recognition: Techniques that convert a stream of spoken language into words and phrases.
Speech Synthesis: Computational models that approximate human speech sounds based on contextual information.
Page 12: Application of Speech Recognition in Siri
Automatic Speech Recognition: Capture & segment audio.
Natural Language Understanding: Extract meaningful components from transcriptions.
Natural Language Generation: Formulate responses.
Speech Synthesis: Produce audio with correct pronunciation and intonation.
Page 13: Machine Learning in NLP
Many NLP applications employ machine learning, particularly supervised learning:
Example: Classifying spam vs. normal emails.
Utilizes labeled examples for training models.
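As a rough illustration of supervised learning on the spam example, the sketch below trains a small Naive Bayes classifier on labeled emails. Naive Bayes is one common choice for this task, not necessarily the method discussed in the lecture, and the training messages, labels, and function names are invented for illustration.

```python
from collections import Counter
from math import log

# Hypothetical labeled training data, invented for illustration.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting notes for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]

def train_naive_bayes(examples):
    """Count words per label; these counts are the whole model."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in text.split():
            counts[token] += 1
            vocab.add(token)
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-probability under the model."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = log(label_counts[label] / total)  # prior for this label
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.split():
            if token in vocab:  # words never seen in training are ignored
                # add-one smoothing avoids zero probabilities
                score += log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(TRAIN)
print(classify("free prize now", model))
print(classify("team meeting monday", model))
```

The labels ("spam" and "ham", a conventional name for normal email) are what make this supervised: the model only learns which words signal spam because a person marked the training examples in advance.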
Page 14: Content Moderation Applications
Content moderation on online platforms involves enforcing posting policies, sometimes through algorithms.
Automated moderation may use NLP models that take context into account rather than applying blanket restrictions.
Such models can be trained with supervised learning to improve their handling of context.
Page 15: Machine Translation Applications
Algorithms facilitate translation across various human languages in multiple formats (speech/text).
Example: Google Translate and Google Lens.
Translation systems occasionally misinterpret content, which illustrates that machine translation remains imperfect.
Page 16: Biomedical NLP Applications
Extension of NLP models to biomedical discourse, aiding in:
Searching peer-reviewed research for specific topics (e.g. COVID-19).
Analyzing social determinants of health in unstructured clinical data.
Page 17: Language Models
Core Concept: Collect word sequence statistics from text corpora to estimate probabilities.
Example: counting how often particular words are followed by others in a corpus.
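A minimal sketch of this counting idea is a bigram model, which estimates the probability of the next word from how often it followed the previous word in the training text. The tiny text and the function names below are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Tiny training text invented for illustration.
TEXT = "i like tea . i like coffee . i drink tea . you like tea ."

def bigram_counts(tokens):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn the counts after `prev` into relative frequencies."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()} if total else {}

tokens = TEXT.split()
counts = bigram_counts(tokens)
probs_after_like = next_word_probs(counts, "like")
print(probs_after_like)
```

Here "like" is followed by "tea" twice and "coffee" once, so the model estimates P(tea | like) = 2/3 and P(coffee | like) = 1/3.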
Page 18: Language Models Continued
Practical application: language models help speech recognition systems choose among candidate transcriptions.
They assign a likelihood to each word sequence, which helps resolve ambiguous input (e.g. homophones).
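The homophone case can be sketched by scoring competing transcriptions with bigram statistics and keeping the more probable one. The corpus, the candidate sentences, and the use of add-one smoothing are assumptions for illustration; production speech recognizers combine acoustic evidence with much larger language models.

```python
from collections import Counter
from math import log

# Small corpus invented for illustration.
CORPUS = (
    "they left their keys over there . "
    "their car is parked there . "
    "the keys are in their car ."
)

tokens = CORPUS.split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
vocab_size = len(unigrams)

def log_prob(sentence):
    """Sum of add-one-smoothed bigram log-probabilities for the sentence."""
    words = sentence.split()
    total = 0.0
    for prev, nxt in zip(words, words[1:]):
        total += log((bigrams[(prev, nxt)] + 1) / (unigrams[prev] + vocab_size))
    return total

# Two transcriptions that sound the same but differ in spelling.
candidates = ["they left their keys", "they left there keys"]
best = max(candidates, key=log_prob)
print(best)
```

Because "left their" and "their keys" occur in the corpus while "left there" and "there keys" do not, the first candidate receives the higher score.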
Page 19: Contextual Predictions in Language Models
Next-word prediction depends heavily on context, which exposes the limitations of simple models that consider only the preceding word or two.
Predictive text on smartphones is an everyday example of next-word prediction in use.
Page 20: Large Language Models Overview
LLMs (e.g. ChatGPT) are trained on next-word prediction using vast datasets and large neural network architectures.
Reinforcement Learning from Human Feedback (RLHF) is used to steer models toward preferred output behaviors.
Whether LLMs acquire language in a human-like way remains an active area of study.
Page 21: Scale of Large Language Models
Magnitude of models:
Data size: Billions of tokens gathered from varied online sources.
Model size: Billions of parameters, with considerable computational demands and environmental impact.
Page 22: Challenges of Linguistic Diversity in NLP
Underrepresented languages face challenges in being represented in online spaces.
Contributing issues include script availability, sociopolitical factors affecting language vitality, and minimal usage online.
Page 23: Suggested Further Readings
Recommended Sources:
Language Files, Chapter 16.
Speech and Language Processing by Jurafsky & Martin (free online).
Natural Language Processing with Python (Bird, Klein & Loper, free to read online).
Page 24: Professional Organizations and Conferences
Notable associations and conferences:
Association for Computational Linguistics (ACL)
Society for Computation in Linguistics (SCiL)
Conference on Computational Natural Language Learning (CoNLL)
Empirical Methods in Natural Language Processing (EMNLP)
Page 25: University of Washington Opportunities
Programs and Research:
UW NLP seminars.
Master’s program in Computational Linguistics.
Courses: LING 471, LING 472, CSE 447.
Page 26: Q&A Session
Questions are welcome for further clarity and understanding.