Computational Linguistics and Probability Theory Notes

Computational Linguistics

  • Definition: A field that combines linguistics and computer science, focusing on statistical description of natural language.
  • Key Concepts: Involves probability and information theory as foundational elements to evaluate language processing.

Random Experiments and Events

  • Random Experiment: An experiment whose outcomes cannot be precisely determined beforehand; can be repeated multiple times.
    • Example: Tossing a die.
  • Random Phenomenon: Statement about the outcome of a random experiment, determined post-experiment.
    • Examples of Outcomes: Even number, number greater than 3.
  • Elementary Phenomenon: Base outcomes that cannot be further decomposed.
    • Sample Space: The set of all elementary phenomena, denoted $\Omega$.

Probability Basics

  • Probability of Event A: Given by $P(A) = \frac{k}{N}$ where:
    • $N$: Total number of trials.
    • $k$: Number of occurrences of event A.
  • Characteristics:
    • Probability values range: $0 \le P(A) \le 1$.
    • A certain event has probability 1: $P(\Omega) = 1$.
  • Sum of Probabilities for Incompatible Events: $P(A \cup B) = P(A) + P(B)$.
  • Determining Probabilities: Can utilize combinatorial methods or statistical measurements from real data.
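
The combinatorial route mentioned above can be sketched for the die example from these notes (events "even number" and "number greater than 3"); the helper `prob` is a hypothetical name, not from the source:

```python
from fractions import Fraction

# Sample space for one toss of a fair six-sided die.
omega = [1, 2, 3, 4, 5, 6]

# Event A: "even number"; event B: "number greater than 3".
A = {2, 4, 6}
B = {4, 5, 6}

# Classical (combinatorial) probability: favorable outcomes / total outcomes.
def prob(event, sample_space):
    return Fraction(len(event), len(sample_space))

print(prob(A, omega))  # 1/2
print(prob(B, omega))  # 1/2
```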

Joint and Conditional Probability

  • Joint Probability: Denoted as $P(A \cap B) = P(A, B)$.
  • Conditional Probability: $P(A | B) = \frac{P(A \cap B)}{P(B)}$, which can be estimated from relative frequencies.
    • In terms of counts: $P(A|B) = \frac{c(A,B)/N}{c(B)/N} = \frac{c(A,B)}{c(B)}$.
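
The count-based estimate can be illustrated on a toy token sequence (the sentence and words here are invented for illustration): the $N$ in numerator and denominator cancels, so only the counts matter.

```python
from collections import Counter

# Hypothetical token sequence for estimating P(B | A), where A = the
# previous word is "today" and B = the current word is "is".
tokens = ["today", "is", "cold", "but", "today", "is", "fine"]

bigrams = Counter(zip(tokens, tokens[1:]))  # c(A, B): adjacent pair counts
unigrams = Counter(tokens)                  # c(A): single word counts

# P(B | A) = c(A, B) / c(A): the total N cancels out.
p_is_given_today = bigrams[("today", "is")] / unigrams["today"]
print(p_is_given_today)  # 1.0: "today" is always followed by "is" here
```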

Word Probability Examples

  • Examining Word Probabilities:
    • For the word "today" in Czech, labelled as event A.
    • Event B as occurrence of the word "respektive".
  • Calculating Probabilities:
    • $P(A) = \frac{c(A)}{N_w} = \frac{2}{4} = \frac{1}{2}$.
    • $P(B) = \frac{c(B)}{N_w} = \frac{1}{4}$.
  • Pair Probabilities:
    • $P(A,B) = \frac{c(A,B)}{N_{w-1,w}} = \frac{1}{4}$.
    • Conditional Probability: $P(B|A) = \frac{1/4}{1/2} = \frac{1}{2}$.
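
The arithmetic above can be checked directly; the counts ($N_w = 4$, $c(A) = 2$, $c(B) = 1$, $c(A,B) = 1$) are taken from the notes, while the underlying corpus itself is not given here:

```python
from fractions import Fraction

# Counts from the notes: a 4-token sample where "today" (event A) occurs
# twice, "respektive" (event B) once, and the pair (A, B) once.
N_w = 4
c_A, c_B, c_AB = 2, 1, 1

p_A = Fraction(c_A, N_w)    # 1/2
p_B = Fraction(c_B, N_w)    # 1/4
p_AB = Fraction(c_AB, N_w)  # 1/4

# P(B | A) = P(A, B) / P(A) = (1/4) / (1/2) = 1/2.
p_B_given_A = p_AB / p_A
print(p_A, p_B, p_AB, p_B_given_A)
```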

Random Variables

  • Definition: Variable whose value is determined by the outcome of a random experiment; it can take discrete or continuous values.
    • Discrete Random Variable: Countably many possible values.
    • Continuous Random Variable: Uncountably many possible values (e.g., any real number in an interval).
  • Probability Distribution: Function describing the probabilities of individual outcomes. The distribution function is $F(x) = P(X \le x)$.
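
A minimal sketch of the distribution function for a discrete random variable, using the fair die from earlier as the example ($X$ = the number shown):

```python
from fractions import Fraction

# Distribution function F(x) = P(X <= x) for X = value of a fair
# six-sided die: sum the probabilities of all outcomes not exceeding x.
def F(x):
    return sum(Fraction(1, 6) for k in range(1, 7) if k <= x)

print(F(0))  # 0    (no outcome is <= 0)
print(F(3))  # 1/2  (outcomes 1, 2, 3)
print(F(6))  # 1    (all outcomes)
```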

Entropy and Information

  • Concept of Entropy: Measures uncertainty in the outcome of a random variable.
    • More information leads to lower entropy.
    • Formula: For a coin toss, $H(X) = -\big(P(\text{head}) \log_2 P(\text{head}) + P(\text{tail}) \log_2 P(\text{tail})\big)$.
    • Example with a 32-sided die: for a fair die, every side has probability $\frac{1}{32}$, so $H(X) = \log_2 32 = 5$ bits (larger uncertainty).
  • Entropy Properties:
    • Entropy is non-negative: $0 \le H(X)$.
    • Uniform distribution has maximum entropy.
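
The coin and 32-sided-die examples above can be computed with a small helper (the function name `entropy` and the biased-coin distribution are illustrative choices, not from the source):

```python
from math import log2

def entropy(dist):
    # H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute nothing.
    return -sum(p * log2(p) for p in dist if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]      # hypothetical unfair coin
die32 = [1 / 32] * 32

print(entropy(fair_coin))    # 1.0 bit
print(entropy(biased_coin))  # < 1 bit: less uncertainty than a fair coin
print(entropy(die32))        # 5.0 bits
```

The comparison also illustrates the property above: among distributions over the same outcomes, the uniform one has the highest entropy.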

Perplexity

  • Definition: Measure of how difficult it is to predict outcomes of a random variable; defined as $G(X) = 2^{H(X)}$.
  • Examples of Perplexity:
    • Variable with 2 equally likely outcomes: $H(X) = 1$ bit, $G(X) = 2$.
    • Variable with 32 outcomes of varying probabilities: the more skewed the distribution, the lower the entropy and perplexity, and the easier the prediction.
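
A short sketch of $G(X) = 2^{H(X)}$, including a hypothetical skewed 32-outcome distribution to show how non-uniformity lowers perplexity:

```python
from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

def perplexity(dist):
    # G(X) = 2 ** H(X): the "effective number" of equally likely outcomes.
    return 2 ** entropy(dist)

print(perplexity([0.5, 0.5]))     # 2.0
print(perplexity([1 / 32] * 32))  # 32.0

# Hypothetical skewed 32-outcome variable: easier to predict, lower G(X).
skewed = [0.5] + [0.5 / 31] * 31
print(perplexity(skewed))         # well below 32
```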

Joint and Conditional Entropy

  • Joint Entropy: Measures total uncertainty of two random variables $X$ and $Y$.
  • Conditional Entropy: Measures remaining uncertainty of one variable given knowledge of the other, calculated as $H(Y|X)$.
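
A minimal sketch of both quantities, assuming a small hypothetical joint distribution over two binary variables (the table values are invented for illustration); conditional entropy is obtained via the chain rule $H(Y|X) = H(X,Y) - H(X)$:

```python
from math import log2

# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Joint entropy H(X, Y) over all cells of the table.
H_XY = -sum(p * log2(p) for p in joint.values() if p > 0)

# Marginal distribution of X, then H(X).
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
H_X = -sum(p * log2(p) for p in p_x.values() if p > 0)

# Chain rule: H(Y | X) = H(X, Y) - H(X).
H_Y_given_X = H_XY - H_X
print(H_XY, H_X, H_Y_given_X)
```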

General Properties of Language Entropy

  • English Example: Entropy per character differs depending on how much context is used for prediction (no context, the previous character, character pairs, etc.).
  • Entropy Decrease: Observing previous events lowers entropy.

Kullback-Leibler Divergence (Relative Entropy)

  • Measures the difference between two probability distributions: $D(p \| q)$.
  • Not symmetric: $D(p \| q) \neq D(q \| p)$.
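
The asymmetry is easy to demonstrate numerically; the distributions below are arbitrary examples and `kl` is an illustrative helper name:

```python
from math import log2

def kl(p, q):
    # D(p || q) = sum p(x) log2(p(x) / q(x));
    # assumes q(x) > 0 wherever p(x) > 0.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl(p, q))  # D(p||q)
print(kl(q, p))  # D(q||p): generally a different value
print(kl(p, p))  # 0.0: a distribution diverges from itself by nothing
```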

Cross Entropy

  • Indicates the average number of bits required to describe events drawn from one distribution using a code optimized for another: $C(p, q) = -\sum_x p(x) \log_2 q(x)$.
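
A sketch of the formula, using the same illustrative distributions as above; note that $C(p, p)$ reduces to the entropy $H(p)$, and any mismatch costs extra bits:

```python
from math import log2

def cross_entropy(p, q):
    # C(p, q) = -sum p(x) log2 q(x): average bits to code data drawn
    # from p with a code built for q.
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]  # hypothetical mismatched model

print(cross_entropy(p, p))  # equals H(p) = 1.0 bit
print(cross_entropy(p, q))  # larger: the mismatch costs extra bits
```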

Mutual Information

  • Quantifies the reduction in uncertainty of one random variable given knowledge of another: $I(X, Y) = H(X) - H(X|Y)$. If $X$ and $Y$ are independent, $I(X, Y) = 0$.
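
A sketch computing mutual information from a hypothetical joint table (values invented for illustration), using the equivalent sum form $I(X, Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}$:

```python
from math import log2

# Hypothetical joint distribution P(X, Y) over two binary variables;
# X and Y are correlated, so I(X, Y) should come out positive.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, axis):
    # Sum the joint table over the other variable.
    m = {}
    for xy, p in joint.items():
        m[xy[axis]] = m.get(xy[axis], 0.0) + p
    return m

p_x, p_y = marginal(joint, 0), marginal(joint, 1)

# I(X, Y) = sum P(x, y) log2( P(x, y) / (P(x) P(y)) ).
I = sum(p * log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items() if p > 0)
print(I)  # > 0: knowing X reduces uncertainty about Y
```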