Wednesday_Week11_Ling200

Page 1: Introduction

  • Presented by: Amandalynne Paullada

  • Department of Linguistics, UW

  • Date: 4 December 2024

  • Topic: Computational Linguistics

  • Guest Lecture with material from E. Bender, N. Tachikawa Shapiro

Page 2: Definition of Computational Linguistics

  • Definition (ACL): Computational linguistics is the scientific study of language from a computational perspective.

    • Concerned with creating computational models of linguistic phenomena.

Page 3: Computational Approaches to Linguistics

  • Utilizing computers for linguistic data analysis and hypothesis testing.

  • Natural Language Processing (NLP):

    • Algorithms and models enabling computers to manipulate human language.

Page 4: Corpus Analysis in Linguistics

  • Corpus: A large collection of written or spoken language.

    • Often requires both computational and manual curation.

    • Example: Corpus del Español - consists of:

      • 100 million words of Spanish from the 1200s to 1900s.

      • 2 billion words from the internet.

    • Additional resources available from the Linguistic Data Consortium.
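
One common first step with a corpus is counting how often each word occurs. The following is a minimal sketch, assuming a three-sentence toy list in place of a real corpus; the names `toy_corpus`, `tokenize`, and `word_frequencies` are hypothetical and not taken from the lecture.

```python
import re
from collections import Counter

# Toy stand-in for a corpus (an assumption for illustration);
# a real corpus would be read from files and is far larger.
toy_corpus = [
    "El perro come pan.",
    "La gata come pescado.",
    "El perro y la gata duermen.",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def word_frequencies(documents):
    """Count how often each word type occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

freqs = word_frequencies(toy_corpus)
print(freqs["perro"], freqs["come"])  # each of these words occurs twice
```

Real corpora usually need more careful tokenization and manual checking, which is part of the curation effort mentioned above.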

Page 5: Computational Syntax and Analysis

  • Computing linguistic data involves:

    • Analyzing the frequency of syntactic structures.

  • Complexity of Syntax: Algorithms are applied to produce parse trees from the text.
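
As an illustration of how an algorithm can produce a parse tree, here is a small sketch of CKY-style chart parsing. The grammar and lexicon are hypothetical toy examples, each word has only one category, and the lecture does not say which parsing algorithm it has in mind.

```python
# Toy grammar in Chomsky normal form (hypothetical, for illustration only).
BINARY_RULES = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
}
LEXICON = {
    "the": "Det",
    "dog": "N",
    "cat": "N",
    "chased": "V",
}

def cky_parse(words):
    """Return a parse tree as nested tuples, or None if no parse exists."""
    n = len(words)
    # chart[i][j] maps a category to a tree covering words[i:j]
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        category = LEXICON.get(word)
        if category is not None:
            chart[i][i + 1][category] = (category, word)
    # Build larger constituents from pairs of smaller adjacent ones.
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            for split in range(start + 1, end):
                for left_cat, left_tree in chart[start][split].items():
                    for right_cat, right_tree in chart[split][end].items():
                        parent = BINARY_RULES.get((left_cat, right_cat))
                        if parent is not None and parent not in chart[start][end]:
                            chart[start][end][parent] = (parent, left_tree, right_tree)
    return chart[0][n].get("S")

tree = cky_parse("the dog chased the cat".split())
print(tree)
```

Counting how often each rule appears in the trees produced for a corpus is one way to measure the frequency of syntactic structures.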

Page 6: Sociolinguistic Variation

  • Computational Sociolinguistics:

    • Analyze linguistic variation across time.

    • Study sociolinguistic variation in digital contexts.

Page 7: Distributional Semantics

  • Foundation of word vectors (embeddings):

    • Hypothesis: Words used in similar contexts tend to have similar meanings.

    • Rather than representing meanings symbolically, this approach models the contexts in which words occur mathematically.

    • Similar words are represented as proximate points in high-dimensional vector space.

    • Quotation: “You shall know a word by the company it keeps” – Firth (1957).
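
The distributional hypothesis can be sketched with simple co-occurrence counts. This is only an illustration under stated assumptions: the five sentences, the window size, and the function names are hypothetical, and modern embeddings are learned with more sophisticated methods than raw counts.

```python
import math
from collections import Counter, defaultdict

# Toy sentences (an assumption for illustration).
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
]

def cooccurrence_vectors(sentences, window=2):
    """Represent each word by counts of the words appearing near it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = cooccurrence_vectors(sentences)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])
print(cat_dog > cat_car)  # "cat" and "dog" share more contexts than "cat" and "car"
```

Because "cat" and "dog" both occur near "drinks" and "chases", their vectors end up closer together than those of "cat" and "car".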

Page 8: Example of Distributional Semantics

  • Reference to Figure 1 in “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change” (Hamilton et al., 2016).

Page 9: Effects of Digital Technology on Language

  • Investigation of sociolinguistic variations in digital communication across different demographics:

    • Differences in text messaging styles between generations (Boomers vs. Gen Z).

    • Gender-related differences in emoji usage.

    • Differences in language use across personal contexts (romantic vs. platonic texts).

    • New vocabulary acquisition sources, including TikTok, friends, and media.

Page 10: Natural Language Processing (NLP)

  • Roles of NLP:

    • Enable extraction of patterns and information from language data (Natural Language Understanding).

    • Generate coherent human language strings (Natural Language Generation).

  • Applications:

    • Search engines, Machine translation, Content moderation, Chatbots, Spelling/grammar checkers, Predictive text, etc.

Page 11: Speech Processing and Synthesis

  • Automatic Speech Recognition: Techniques to process spoken language streams into words/phrases.

  • Speech Synthesis: Develop computational models that approximate human speech sounds based on contextual information.

Page 12: Application of Speech Recognition in Siri

  1. Automatic Speech Recognition: Capture & segment audio.

  2. Natural Language Understanding: Extract meaningful components from transcriptions.

  3. Natural Language Generation: Formulate responses.

  4. Speech Synthesis: Produce audio with correct pronunciation and intonation.

Page 13: Machine Learning in NLP

  • Many NLP applications employ machine learning, particularly supervised learning:

    • Example: Classifying spam vs. normal emails.

    • Utilizes labeled examples for training models.
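
A minimal sketch of supervised learning for the spam example, assuming a Naive Bayes classifier with add-one smoothing. The lecture does not name a specific model, and the four labeled messages below are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical labeled examples; real training sets are much larger.
training_data = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting moved to friday", "normal"),
    ("lunch on friday with the team", "normal"),
]

def train(examples):
    """Collect per-label word counts, label counts, and the vocabulary."""
    word_counts = {}
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocabulary.add(word)
    return word_counts, label_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training_data)
print(classify("claim your free prize", model))
print(classify("team meeting on friday", model))
```

The labels in `training_data` are what make this supervised: the model only learns which words signal spam because a person has already marked each example.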

Page 14: Content Moderation Applications

  • Content moderation on online platforms involves enforcing posting policies, sometimes through algorithms.

    • Automated processes may utilize NLP models allowing contextual analysis instead of blanket restrictions.

    • Supervised learning can be used for better contextual understanding.

Page 15: Machine Translation Applications

  • Algorithms facilitate translation across various human languages in multiple formats (speech/text).

    • Example: Google Translate and Google Lens.

    • Translation tools occasionally misinterpret content, illustrating the imperfections of machine translation.

Page 16: Biomedical NLP Applications

  • Extension of NLP models to biomedical discourse, aiding in:

    • Searching peer-reviewed research for specific topics (e.g. COVID-19).

    • Analyzing social determinants of health in unstructured clinical data.

Page 17: Language Models

  • Core Concept: Collect word sequence statistics from text corpora to estimate probabilities.

    • Example demonstrates how particular words are followed by others based on frequency.
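
The core idea can be sketched as a bigram model: count which word follows which, then turn the counts into probabilities. The three-sentence corpus and the function names are assumptions for illustration, not the example used in the lecture.

```python
from collections import Counter, defaultdict

# Toy text corpus (hypothetical).
corpus = [
    "i like green tea",
    "i like black coffee",
    "i drink green tea",
]

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) as a relative frequency."""
    total = sum(counts[prev].values())
    if total == 0:
        return 0.0
    return counts[prev][nxt] / total

bigram_counts = train_bigram_model(corpus)
print(next_word_probability(bigram_counts, "i", "like"))     # 2 of 3 sentences
print(next_word_probability(bigram_counts, "green", "tea"))  # always "tea" here
```

Here "like" follows "i" in two of three sentences, so its estimated probability is 2/3, while "tea" is the only word ever seen after "green".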

Page 18: Language Models Continued

  • Practical applications for predictive modeling in speech recognition.

    • Establish a likelihood for specific word sequences in ambiguous contexts (e.g. homophones).
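
To show how sequence likelihoods can help with homophones, the sketch below scores two candidate transcriptions with a smoothed bigram model and keeps the more probable one. The training sentences and function names are hypothetical, and a real speech recognizer combines such scores with acoustic evidence.

```python
from collections import Counter

# Toy training text (an assumption for illustration).
training_text = [
    "they sold their house",
    "we saw their car",
    "the keys are over there",
    "there is a house nearby",
]

def train(sentences):
    """Return bigram counts, context counts, and vocabulary size."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])  # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, contexts, len(vocab)

def sequence_probability(sentence, model):
    """Multiply add-one-smoothed bigram probabilities over the sentence."""
    bigrams, contexts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, nxt)] + 1) / (contexts[prev] + vocab_size)
    return prob

def pick_transcription(candidates, model):
    """Choose the candidate word sequence the model finds most likely."""
    return max(candidates, key=lambda s: sequence_probability(s, model))

model = train(training_text)
candidates = ["they sold there house", "they sold their house"]
print(pick_transcription(candidates, model))
```

Both candidates sound the same, but the model has seen "sold their" and "their house" before, so it prefers "they sold their house".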

Page 19: Contextual Predictions in Language Models

  • Importance of context in next-word predictions highlights limitations of simple models.

  • Real-time user engagement with predictive text features on smartphones illustrates practical relevance.
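
The limitation of simple models can be illustrated by comparing a model that sees only the previous word with one that sees the previous two. This is a sketch on a hypothetical three-sentence corpus; predictive text on phones uses considerably more sophisticated models.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration).
corpus = [
    "she fed the cat",
    "he fed the cat",
    "she drove the car",
]

def train_ngram(sentences, n):
    """Map each (n-1)-word context to a Counter of following words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split()
        for i in range(n - 1, len(tokens)):
            context = tuple(tokens[i - n + 1:i])
            model[context][tokens[i]] += 1
    return model

def predict_next(model, n, prefix):
    """Return the most frequent word after the last n-1 words of prefix."""
    tokens = ["<s>"] * (n - 1) + prefix.split()
    context = tuple(tokens[len(tokens) - (n - 1):])
    followers = model.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

bigram_model = train_ngram(corpus, 2)
trigram_model = train_ngram(corpus, 3)
print(predict_next(bigram_model, 2, "she drove the"))   # sees only "the"
print(predict_next(trigram_model, 3, "she drove the"))  # sees "drove the"
```

With one word of context the model suggests "cat", the most frequent word after "the"; with two words it can use "drove" and suggests "car".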

Page 20: Large Language Models Overview

  • LLMs (e.g. ChatGPT) are trained to predict the next word, using vast datasets and very large neural network architectures.

    • Utilization of Reinforcement Learning from Human Feedback (RLHF) to steer preferred output behaviors.

  • Investigating human-like language acquisition by LLMs remains an active area of study.

Page 21: Scale of Large Language Models

  • Magnitude of Models:

    • Data Size: Billions of tokens gathered from varied online sources.

    • Model Size: Composed of billions of parameters, with considerable computational demands and environmental impact.
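
A rough sense of what "billions of parameters" means for storage can be worked out with simple arithmetic. The figures below (7 billion parameters, 2 bytes per parameter) are assumptions chosen for illustration, not numbers from the lecture or from any particular model.

```python
def parameter_memory_gb(num_parameters, bytes_per_parameter):
    """Memory needed just to store the parameters, in gigabytes (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# Hypothetical model: 7 billion parameters stored as 16-bit numbers (2 bytes each).
print(parameter_memory_gb(7_000_000_000, 2))
```

This counts storage only; training and running a model require additional memory and computation on top of it.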

Page 22: Challenges of Linguistic Diversity in NLP

  • Digital representation challenges faced by underrepresented languages in online spaces.

    • Issues linked to script availability, sociopolitical factors affecting language vitality, and minimal usage online.

Page 23: Suggested Further Readings

  • Recommended Sources:

    • Language Files, Chapter 16.

    • Speech & Language Processing by Jurafsky & Martin (free online).

    • Natural Language Processing with Python (Bird, Klein & Loper, free to read online).

Page 24: Professional Organizations and Conferences

  • Notable Associations and Conferences:

    • Association for Computational Linguistics (ACL)

    • Society for Computation in Linguistics (SCiL)

    • Conference on Computational Natural Language Learning (CoNLL)

    • Empirical Methods in Natural Language Processing (EMNLP)

Page 25: University of Washington Opportunities

  • Programs and Research:

    • UW NLP seminars.

    • Master’s program in Computational Linguistics.

    • Courses: LING 471, LING 472, CSE 447.

Page 26: Q&A Session

  • Encouragement to ask questions for further clarity and understanding.
