Computational Linguistics and Probability Theory Notes

Computational Linguistics

  • Definition: A field that combines linguistics and computer science, focusing on statistical description of natural language.
  • Key Concepts: Involves probability and information theory as foundational elements to evaluate language processing.

Random Experiments and Events

  • Random Experiment: An experiment whose outcomes cannot be precisely determined beforehand; can be repeated multiple times.
    • Example: Tossing a die.
  • Random Phenomenon: Statement about the outcome of a random experiment, determined post-experiment.
    • Examples of Outcomes: Even number, number greater than 3.
  • Elementary Phenomenon: Base outcomes that cannot be further decomposed.
    • Sample Space: The set of all elementary phenomena, denoted $\Omega$.

Probability Basics

  • Probability of Event A: Given by $P(A) = \frac{k}{N}$ where:
    • $N$: Total number of trials.
    • $k$: Number of occurrences of event A.
  • Characteristics:
    • Probability values range: $0 \le P(A) \le 1$.
    • A certain event has probability 1: $P(\Omega) = 1$.
  • Sum of Probabilities for Incompatible Events: $P(A \cup B) = P(A) + P(B)$.
  • Determining Probabilities: Can utilize combinatorial methods or statistical measurements from real data.
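
The combinatorial route mentioned above can be sketched for the die example from these notes (events "even number" and "number greater than 3"); the helper `prob` is a hypothetical name, not from the source:

```python
from fractions import Fraction

# Sample space for one toss of a fair six-sided die.
omega = [1, 2, 3, 4, 5, 6]

# Event A: "even number"; event B: "number greater than 3".
A = {2, 4, 6}
B = {4, 5, 6}

# Classical (combinatorial) probability: favorable outcomes / total outcomes.
def prob(event, sample_space):
    return Fraction(len(event), len(sample_space))

print(prob(A, omega))  # 1/2
print(prob(B, omega))  # 1/2
```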

Joint and Conditional Probability

  • Joint Probability: Denoted as $P(A \cap B) = P(A, B)$.
  • Conditional Probability: $P(A | B) = \frac{P(A \cap B)}{P(B)}$, which can be estimated from relative frequencies.
    • In terms of counts: $P(A|B) = \frac{c(A,B)/N}{c(B)/N} = \frac{c(A,B)}{c(B)}$.
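
The count-based estimate can be illustrated on a toy token sequence (the sentence and words here are invented for illustration): the $N$ in numerator and denominator cancels, so only the counts matter.

```python
from collections import Counter

# Hypothetical token sequence for estimating P(B | A), where A = the
# previous word is "today" and B = the current word is "is".
tokens = ["today", "is", "cold", "but", "today", "is", "fine"]

bigrams = Counter(zip(tokens, tokens[1:]))  # c(A, B): adjacent pair counts
unigrams = Counter(tokens)                  # c(A): single word counts

# P(B | A) = c(A, B) / c(A): the total N cancels out.
p_is_given_today = bigrams[("today", "is")] / unigrams["today"]
print(p_is_given_today)  # 1.0: "today" is always followed by "is" here
```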

Word Probability Examples

  • Examining Word Probabilities:
    • For the word "today" in Czech, labelled as event A.
    • Event B as occurrence of the word "respektive".
  • Calculating Probabilities:
    • $P(A) = \frac{c(A)}{N_w} = \frac{2}{4} = \frac{1}{2}$.
    • $P(B) = \frac{c(B)}{N_w} = \frac{1}{4}$.
  • Pair Probabilities:
    • $P(A,B) = \frac{c(A,B)}{N_{w-1,w}} = \frac{1}{4}$.
    • Conditional Probability: $P(B|A) = \frac{1/4}{1/2} = \frac{1}{2}$.
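
The arithmetic above can be checked directly; the counts ($N_w = 4$, $c(A) = 2$, $c(B) = 1$, $c(A,B) = 1$) are taken from the notes, while the underlying corpus itself is not given here:

```python
from fractions import Fraction

# Counts from the notes: a 4-token sample where "today" (event A) occurs
# twice, "respektive" (event B) once, and the pair (A, B) once.
N_w = 4
c_A, c_B, c_AB = 2, 1, 1

p_A = Fraction(c_A, N_w)    # 1/2
p_B = Fraction(c_B, N_w)    # 1/4
p_AB = Fraction(c_AB, N_w)  # 1/4

# P(B | A) = P(A, B) / P(A) = (1/4) / (1/2) = 1/2.
p_B_given_A = p_AB / p_A
print(p_A, p_B, p_AB, p_B_given_A)
```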

Random Variables

  • Definition: Variable whose value is determined by the outcome of a random experiment; it can take discrete or continuous values.
    • Discrete Random Variable: Countably many possible values.
    • Continuous Random Variable: Uncountably many possible values (e.g., any real number in an interval).
  • Probability Distribution: Function describing the probabilities of individual outcomes. The distribution function is $F(x) = P(X \le x)$.
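
A minimal sketch of the distribution function for a discrete random variable, using the fair die from earlier as the example ($X$ = the number shown):

```python
from fractions import Fraction

# Distribution function F(x) = P(X <= x) for X = value of a fair
# six-sided die: sum the probabilities of all outcomes not exceeding x.
def F(x):
    return sum(Fraction(1, 6) for k in range(1, 7) if k <= x)

print(F(0))  # 0    (no outcome is <= 0)
print(F(3))  # 1/2  (outcomes 1, 2, 3)
print(F(6))  # 1    (all outcomes)
```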

Entropy and Information

  • Concept of Entropy: Measures uncertainty in the outcome of a random variable.
    • More information leads to lower entropy.
    • Formula: For a coin toss, $H(X) = -\big(P(\text{head}) \log_2 P(\text{head}) + P(\text{tail}) \log_2 P(\text{tail})\big)$.
    • Example with a 32-sided die: for a fair die, every side has probability $\frac{1}{32}$, so $H(X) = \log_2 32 = 5$ bits (larger uncertainty).
  • Entropy Properties:
    • Entropy is non-negative: $0 \le H(X)$.
    • Uniform distribution has maximum entropy.
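
The coin and 32-sided-die examples above can be computed with a small helper (the function name `entropy` and the biased-coin distribution are illustrative choices, not from the source):

```python
from math import log2

def entropy(dist):
    # H(X) = -sum p(x) log2 p(x); terms with p(x) = 0 contribute nothing.
    return -sum(p * log2(p) for p in dist if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]      # hypothetical unfair coin
die32 = [1 / 32] * 32

print(entropy(fair_coin))    # 1.0 bit
print(entropy(biased_coin))  # < 1 bit: less uncertainty than a fair coin
print(entropy(die32))        # 5.0 bits
```

The comparison also illustrates the property above: among distributions over the same outcomes, the uniform one has the highest entropy.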

Perplexity

  • Definition: Measure of how difficult it is to predict outcomes of a random variable; defined as $G(X) = 2^{H(X)}$.
  • Examples of Perplexity:
    • Variable with 2 equally likely outcomes: $H(X) = 1$ bit, $G(X) = 2$.
    • Variable with 32 outcomes of varying probabilities: the more skewed the distribution, the lower the entropy and perplexity, and the easier the prediction.
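
A short sketch of $G(X) = 2^{H(X)}$, including a hypothetical skewed 32-outcome distribution to show how non-uniformity lowers perplexity:

```python
from math import log2

def entropy(dist):
    return -sum(p * log2(p) for p in dist if p > 0)

def perplexity(dist):
    # G(X) = 2 ** H(X): the "effective number" of equally likely outcomes.
    return 2 ** entropy(dist)

print(perplexity([0.5, 0.5]))     # 2.0
print(perplexity([1 / 32] * 32))  # 32.0

# Hypothetical skewed 32-outcome variable: easier to predict, lower G(X).
skewed = [0.5] + [0.5 / 31] * 31
print(perplexity(skewed))         # well below 32
```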

Joint and Conditional Entropy

  • Joint Entropy: Measures total uncertainty of two random variables $X$ and $Y$.
  • Conditional Entropy: Measures remaining uncertainty of one variable given knowledge of the other, calculated as $H(Y|X)$.
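
A minimal sketch of both quantities, assuming a small hypothetical joint distribution over two binary variables (the table values are invented for illustration); conditional entropy is obtained via the chain rule $H(Y|X) = H(X,Y) - H(X)$:

```python
from math import log2

# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Joint entropy H(X, Y) over all cells of the table.
H_XY = -sum(p * log2(p) for p in joint.values() if p > 0)

# Marginal distribution of X, then H(X).
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
H_X = -sum(p * log2(p) for p in p_x.values() if p > 0)

# Chain rule: H(Y | X) = H(X, Y) - H(X).
H_Y_given_X = H_XY - H_X
print(H_XY, H_X, H_Y_given_X)
```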

General Properties of Language Entropy

  • English Example: Entropy per character differs depending on how much context is used for prediction (no context, the previous character, character pairs, etc.).
  • Entropy Decrease: Observing previous events lowers entropy.

Kullback-Leibler Divergence (Relative Entropy)

  • Measures the difference between two probability distributions: $D(p \| q)$.
  • Not symmetric: $D(p \| q) \neq D(q \| p)$.
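
The asymmetry is easy to demonstrate numerically; the distributions below are arbitrary examples and `kl` is an illustrative helper name:

```python
from math import log2

def kl(p, q):
    # D(p || q) = sum p(x) log2(p(x) / q(x));
    # assumes q(x) > 0 wherever p(x) > 0.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl(p, q))  # D(p||q)
print(kl(q, p))  # D(q||p): generally a different value
print(kl(p, p))  # 0.0: a distribution diverges from itself by nothing
```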

Cross Entropy

  • Indicates the average number of bits required to describe events drawn from one distribution using a code optimized for another: $C(p, q) = -\sum_x p(x) \log_2 q(x)$.
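
A sketch of the formula, using the same illustrative distributions as above; note that $C(p, p)$ reduces to the entropy $H(p)$, and any mismatch costs extra bits:

```python
from math import log2

def cross_entropy(p, q):
    # C(p, q) = -sum p(x) log2 q(x): average bits to code data drawn
    # from p with a code built for q.
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]  # hypothetical mismatched model

print(cross_entropy(p, p))  # equals H(p) = 1.0 bit
print(cross_entropy(p, q))  # larger: the mismatch costs extra bits
```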

Mutual Information

  • Quantifies the reduction in uncertainty of one random variable given knowledge of another: $I(X, Y) = H(X) - H(X|Y)$. If $X$ and $Y$ are independent, $I(X, Y) = 0$.
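
A sketch computing mutual information from a hypothetical joint table (values invented for illustration), using the equivalent sum form $I(X, Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)}$:

```python
from math import log2

# Hypothetical joint distribution P(X, Y) over two binary variables;
# X and Y are correlated, so I(X, Y) should come out positive.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, axis):
    # Sum the joint table over the other variable.
    m = {}
    for xy, p in joint.items():
        m[xy[axis]] = m.get(xy[axis], 0.0) + p
    return m

p_x, p_y = marginal(joint, 0), marginal(joint, 1)

# I(X, Y) = sum P(x, y) log2( P(x, y) / (P(x) P(y)) ).
I = sum(p * log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items() if p > 0)
print(I)  # > 0: knowing X reduces uncertainty about Y
```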