Document Analysis


204 Terms

1
New cards

structural ambiguity definition

when a grammar can assign more than one parse to a sentence

2
New cards

types of structural ambiguity

attachment ambiguity, coordination ambiguity

3
New cards

attachment ambiguity definition

a particular constituent can be attached to the parse tree at more than one place, e.g. “We saw the Eiffel Tower flying to Paris.”

4
New cards

coordination ambiguity definition

ambiguity that occurs from the use of coordinators such as ‘and’, ‘or’, e.g. “old men and women”

5
New cards

part-of-speech tagging

the process of assigning a part-of-speech to each word in a text (=part-of-speech disambiguation)

6
New cards

part-of-speech ambiguity

words are often ambiguous: they may have more than one possible part-of-speech

7
New cards

part-of-speech disambiguation

determining the correct part-of-speech for a word in a text (=part-of-speech tagging)

8
New cards

BIO notation

tagging notation for tokens/characters where B = beginning of chunk immediately following different chunk, I = inside chunk, O = outside chunk.
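A small sketch of how BIO tags map back to chunks; the sentence, the entity types (PER/ORG/LOC), and the helper function are all hypothetical:

```python
# Toy example of BIO tagging over named-entity chunks.
tokens = ["Jane", "Smith", "visited", "Google", "in", "New", "York", "."]
tags   = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC", "I-LOC", "O"]

def bio_to_chunks(tokens, tags):
    """Recover (type, text) chunks from a BIO-tagged token sequence."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # B = a new chunk begins here
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)        # I = continue the current chunk
        else:                             # O = outside any chunk
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(t, " ".join(ws)) for t, ws in chunks]

print(bio_to_chunks(tokens, tags))
# [('PER', 'Jane Smith'), ('ORG', 'Google'), ('LOC', 'New York')]
```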

9
New cards

Universal dependencies

a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages

10
New cards

multilingual word embeddings

embeddings in a word embedding space that is shared across multiple languages, such that similar words in different languages are close

11
New cards

two methods for multilingual word embedding construction

supervised (use parallel corpus to learn mapping) or unsupervised (use adversarial training to learn mapping)

12
New cards

supervised approach to multilingual word embeddings

Given word vectors x_i from language A, y_i from language B, small dictionary (x_i, y_i), learn vector mapping W s.t. y_i = W x_i, W is orthogonal

Train by minimising the loss over the supervised set: (1/n) sum_i l(W x_i, y_i)

Use Euclidean loss, or Cross-domain Similarity Local Scaling
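Under the Euclidean loss with an orthogonality constraint, the best W has a closed-form (orthogonal Procrustes) solution. A minimal NumPy sketch on synthetic, noise-free data; the dimensions, random vectors, and recovery check are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 100
X = rng.normal(size=(d, n))                        # columns: vectors x_i from language A
W_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal mapping
Y = W_true @ X                                     # dictionary pairs y_i = W x_i

# Orthogonal Procrustes: argmin_W ||W X - Y||_F over orthogonal W
# has the closed form W = U V^T, where U S V^T = SVD(Y X^T).
U, _, Vt = np.linalg.svd(Y @ X.T)
W = U @ Vt

print(np.allclose(W, W_true))  # True: exact recovery on noise-free data
```

On real embeddings the dictionary pairs are noisy, so W only approximately maps x_i onto y_i.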

13
New cards

unsupervised approach to multilingual word embeddings

Given word vectors x_i from language A, y_i from language B, learn vector mapping W s.t. y_i = W x_i, W is orthogonal

Key idea: employ discriminator D to distinguish samples from WX and Y

14
New cards

MUSE

Multilingual Unsupervised and Supervised Embeddings

A Python library for multilingual word embeddings

15
New cards

low resource language

languages which lack large monolingual or parallel corpora and/or manually crafted linguistic resources sufficient for building statistical NLP applications

16
New cards

problems for low resource languages

– SOTA NLP models require large amounts of training data and complex language-specific engineering

– Language-specific engineering is expensive, requiring linguistically trained speakers of the language

17
New cards

reasons why work on low resource languages is important

  • preservation

  • emergency response

  • educational applications

  • monitoring demographic and political processes

  • knowledge expansion

18
New cards

machine learning methods suitable for low resource NLP

  • active learning

  • transfer learning

  • multi-task learning

  • learning-to-learn

  • meta-learning

  • semi-supervised learning

  • dual learning (E → F, F → E)

  • unsupervised learning

19
New cards

types of unsupervised learning for low resource languages

  • unsupervised POS tagging

  • unsupervised syntactic parsing

20
New cards

cross-lingual transfer learning

train a model on one language and apply it to another

also includes transfer of annotations, like POS tags, via cross-lingual bridges, and transfer of features

21
New cards

XLM

cross-lingual language model

BERT type model trained on 15 languages

22
New cards

XLM training

One method of pretraining used for XLM is translation language modelling (TLM). Concatenate parallel sentence pairs from Wikipedia in different languages, and randomly mask words in both sentences for MLM.

23
New cards

intrinsic evaluation

direct quality of performing a test task, e.g. evaluate POS tagging by comparison to a gold standard/ground truth

24
New cards

extrinsic evaluation

test whether the output is useful for downstream tasks e.g. evaluate summarisation by implementing in an information retrieval system

25
New cards

technique for small datasets?

cross validation

26
New cards

purpose of splitting dataset

split into

train (for training model)

validation (for tuning hyperparameters)

test (for evaluating model)

27
New cards

accuracy

number of correct predictions / total number of instances

do NOT use for imbalanced data

28
New cards

precision

TP / (TP + FP)

29
New cards

recall

TP / (TP + FN)

30
New cards

F-measure

F_α = 1/(α/precision + (1−α)/recall), α ∈ [0, 1]
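The precision, recall, and F-measure definitions above as a single helper; the TP/FP/FN counts in the example are arbitrary, and alpha = 0.5 recovers the balanced F1:

```python
def prf(tp, fp, fn, alpha=0.5):
    """Precision, recall, and the weighted F-measure.
    alpha = 0.5 gives the balanced F1 (harmonic mean of P and R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f

p, r, f1 = prf(tp=8, fp=2, fn=4)
print(p, r, f1)  # 0.8, 0.666..., 0.727...
```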

31
New cards

AUC

area under the curve

32
New cards

ROC

receiver operating characteristics

33
New cards

AUROC

area under receiver operating characteristics

allows for a trade-off between the true positive rate and the false positive rate

plot false positive rate (FPR) on x-axis, true positive rate (TPR) on y-axis, and measure area beneath.
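One way to sketch AUROC without plotting: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counting half), which coincides with the area under the FPR/TPR curve. The scores and labels below are invented:

```python
def auroc(scores, labels):
    """AUROC via its ranking interpretation: fraction of
    (positive, negative) pairs where the positive is scored higher,
    with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0  (perfect ranking)
print(auroc([0.9, 0.6, 0.4, 0.1], [1, 0, 1, 0]))  # 0.75 (one pair misordered)
```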

34
New cards

false positive rate

FP/(FP+TN)

35
New cards

true positive rate

TP/(TP+FN)

36
New cards

meaning of AUROC=0.5

random model

37
New cards

meaning of AUROC=1

perfect classifier

38
New cards

good evaluation metric for semantic parsing

accuracy

39
New cards

UAS

unlabelled attachment score

for dependency parsing, proportion of words whose head is correctly assigned

40
New cards

LAS

labelled attachment score

for dependency parsing, proportion of words whose head is correctly assigned with the right dependency label

41
New cards

evaluation metrics for dependency parsing

precision and recall of unlabelled/labelled attachment score

42
New cards

macro-averaging precision/recall

evaluation strategy for multi-class classification

arithmetic mean of precision/recall for each class

43
New cards

micro-averaging precision/recall

evaluation strategy for multi-class classification

don’t treat classes as equal (that’s macro) – aggregate the contributions of all classes by calculating precision/recall over all instances pooled together.
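A sketch contrasting the two averages; the toy labels are invented, and only precision is shown (recall works the same way):

```python
def macro_micro_precision(y_true, y_pred):
    """Macro: unweighted mean of per-class precision.
    Micro: pool TP/FP over all classes before dividing (for
    single-label multi-class data this equals accuracy)."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_total = fp_total = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == c and t != c)
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        tp_total += tp
        fp_total += fp
    return sum(per_class) / len(classes), tp_total / (tp_total + fp_total)

y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]           # never predicts the rare class
print(macro_micro_precision(y_true, y_pred))  # (0.375, 0.75)
```

Macro punishes ignoring the rare class b, while micro is dominated by the frequent class a.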

44
New cards

evaluation metric for named entity recognition

macro-averaged precision/recall

45
New cards

evaluation metric for multi-label classification

macro-averaged precision/recall

46
New cards

evaluation metric for coreference resolution

mean average precision / mean reciprocal rank

47
New cards

average precision

for a single query q with m relevant documents

AP(q) = sum(Precision(R_k) for k in 1..m)/m

where R_k is the set of ranked retrieval results from the top document down to the kth relevant document
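A minimal sketch of AP from a ranked relevance list (it assumes at least one relevant document); the example ranking is invented:

```python
def average_precision(ranked_relevance):
    """AP for one query: the mean of precision measured at each
    relevant document's rank. ranked_relevance[i] is True when the
    document at rank i+1 is relevant. Assumes >= 1 relevant doc."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)   # precision at this relevant doc
    return sum(precisions) / len(precisions)

# relevant documents at ranks 1, 3, 5: precisions 1/1, 2/3, 3/5
print(average_precision([True, False, True, False, True]))
```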

48
New cards

mean average precision

mean of the average precision score across many queries

49
New cards

mean reciprocal rank

averaged inverse rank of the first relevant document
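A sketch, assuming we already know the rank of the first relevant document for each query:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR: average of 1/rank of the first relevant document,
    taken over a set of queries."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# first relevant document at ranks 1, 2, and 4 for three queries
print(mean_reciprocal_rank([1, 2, 4]))  # (1 + 0.5 + 0.25) / 3
```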

50
New cards

tasks that use Term Overlap Evaluation (TOE) Metrics

image caption generation, machine translation, text simplification, text summarisation, question answering, chatbots

51
New cards

popular Term Overlap Evaluation (TOE) Metrics

BLEU, ROUGE

52
New cards

BLEU

BiLingual Evaluation Understudy

n-gram precision = (number of n-grams in both texts)/(number of n-grams in generated text)

BLEU = exp (sum(log(p_n), n=1..N)/N)

May use smoothing to avoid log 0
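A minimal sketch of this BLEU variant (equally weighted n-gram precisions with eps-smoothing against log 0; the brevity penalty of full BLEU is omitted). The example sentences are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=2, eps=1e-9):
    """exp of the mean log n-gram precision, n = 1..N, against one
    reference; eps floors zero precisions to avoid log 0."""
    log_sum = 0.0
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())          # clipped matches
        p_n = max(overlap / max(sum(cand.values()), 1), eps)
        log_sum += math.log(p_n)
    return math.exp(log_sum / N)

c = "the cat sat on the mat".split()
print(bleu(c, c))  # 1.0 for a perfect match
```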

53
New cards

meaning of BLEU=1.0

perfect match

54
New cards

meaning of BLEU=0.0

perfect mismatch

55
New cards

ROUGE

Recall-Oriented Understudy for Gisting Evaluation

ROUGE-n: overlap of n-grams between system and gold standard

count(matching n-grams in generated text)/count(n-grams in reference summaries)

Should be used instead of BLEU for summarisation and simplification, as precision is trivial.

56
New cards

SCU

Summarisation Content Unit

clause-length semantic units shared by some number of reference summaries.

based on meaning, not n-grams.

57
New cards

types of baseline models

  • trivial models (always predict common class, guess randomly etc.)

  • simple models (e.g. logistic regression)

  • well-known methods for a task

  • current SOTA

58
New cards

ablation study

systematically remove aspects of a model (e.g. feature sets) to verify necessity

59
New cards

evaluation - what to compare?

  • algorithms

  • feature sets

  • baselines

  • ablation studies

  • different datasets

60
New cards

semantics

a branch of linguistics and logic concerned with meaning

61
New cards

difference between semantics and syntax

semantics is meaning, syntax is the structure of language

62
New cards

components of first order logic

constants, functions, variables, predicates, logical connectives, quantifiers

63
New cards

semantic parsing

the transformation of sentences to a meaning representation (usually in lambda calculus)

64
New cards

semantic parsing datasets

GEO, ATIS, WikiSQL

65
New cards

predicate-argument semantics

light semantic representation - represent meaning through predicates (properties characterising subjects) and arguments (which are constrained)

66
New cards

homonymy

multiple words coincidentally share an orthographic form (e.g. bank)

67
New cards

polysemy

a word has multiple different but related senses (e.g. solution)

68
New cards

homophone

same pronunciation but different spelling (e.g. wood/would)

69
New cards

homograph

same orthographic form, but different pronunciation (e.g. bass)

70
New cards

hyponym

one sense is a hyponym of another sense if it is more specific (e.g. ‘car’ is a hyponym of ‘vehicle’)

71
New cards

hypernym

one sense is a hypernym of another sense if it is more general (e.g. ‘colour’ is a hypernym of ‘red’)

72
New cards

WordNet

hand-constructed database of lexical (ontological) relations

73
New cards

distributional hypothesis

words with similar meanings tend to appear in similar contexts

74
New cards

distributional word representations

a way to represent words as vectors describing the contexts in which they appear

75
New cards

coreference resolution

solve referential ambiguity by determining which text spans refer to the same entity

76
New cards

mentions (coreference resolution)

text spans that mention an entity

77
New cards

coreferent (coreference resolution)

text spans that refer to the same entity

78
New cards

antecedent (coreference resolution)

(of a mention) coreferent mentions earlier in the text

79
New cards

strategy for coreference resolution of pronouns

  1. search for candidate antecedents (any noun phrase in preceding text)

  2. match against hard agreement constraints (e.g. ‘he’ → singular, masculine, animate, third person)

  3. select with heuristics (recency, subject > object)

80
New cards

strategy for coreference resolution of proper nouns

  • match syntactic head words of the reference with the referent

  • include a range of matching features: exact match, head match, string inclusion

  • Gazetteers of acronyms (ANU=Australian National University)

81
New cards

strategy for coreference resolution of nominals

requires world knowledge (e.g. that Apple Inc. is a firm and China is a growth market)

82
New cards

coreference resolution algorithms

  1. identify text spans mentioning entities

    1.1 get noun phrases (e.g. by constituent parsing)

    1.2 filter with simple rules (e.g. remove numbers, nested noun phrases)

  2. cluster mentions

    2.1 Mentioned-based models: supervised learning or ranking

    2.2 Entity-based models: clustering

83
New cards

syntactic constituency

groups of words behaving as single units (constituents)

84
New cards

context free grammar

formal system for modelling constituent structure in natural language

  • set of productions

  • lexicon of words (terminals)

  • set of symbols (non-terminals)

  • start symbols

85
New cards

parse tree

representation of a sequence of CFG production expansions (derivation)

86
New cards

probabilistic context-free grammar

CFG where each production has an associated probability (for resolving structural ambiguity)

estimate probabilities with maximum-likelihood estimation using a treebank: P(a → b) = count(a → b) / count(a)
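The MLE estimate can be sketched directly from production counts; the toy “treebank” below is just a hypothetical list of production occurrences, as they would be read off parse trees:

```python
from collections import Counter

# Each item is one occurrence of a production (lhs, rhs) in the treebank.
productions = [
    ("NP", ("Det", "N")), ("NP", ("Det", "N")), ("NP", ("NP", "PP")),
    ("VP", ("V", "NP")),
]

rule_counts = Counter(productions)
lhs_counts = Counter(lhs for lhs, _ in productions)

# Maximum-likelihood estimate: P(a -> b) = count(a -> b) / count(a)
pcfg = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

print(pcfg[("NP", ("Det", "N"))])  # 2/3: NP expands this way 2 of 3 times
print(pcfg[("VP", ("V", "NP"))])   # 1.0: the only VP expansion seen
```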

87
New cards

treebank

a corpus in which each sentence is annotated with a parse tree

88
New cards

CFG equivalence

two grammars are equivalent if they generate the same language (i.e., the same set of strings)

89
New cards

Chomsky Normal Form (CNF)

The right-hand side of each rule has either two non-terminals or one terminal, except S → ε (where ε is the empty string)

90
New cards

constituency parsing

Given a sentence (i.e., a sequence of terminals) and a CFG, determine whether the sentence can be generated by the grammar, and return parse tree(s)

91
New cards

CKY algorithm

algorithm for constituency parsing

based on dynamic programming (derive parse trees for constituents)
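A sketch of a CKY recogniser (membership only, no tree recovery) for a grammar already in CNF; the toy grammar and lexicon are invented:

```python
from collections import defaultdict

def cky_recognise(words, lexicon, binary_rules, start="S"):
    """CKY: table[(i, j)] holds every non-terminal that derives
    words[i:j]; dynamic programming builds longer spans from pairs
    of shorter ones."""
    n = len(words)
    table = defaultdict(set)
    for i, w in enumerate(words):                     # length-1 spans
        table[(i, i + 1)] = {A for A, t in lexicon if t == w}
    for span in range(2, n + 1):                      # longer spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # every split point
                for A, (B, C) in binary_rules:
                    if B in table[(i, k)] and C in table[(k, j)]:
                        table[(i, j)].add(A)
    return start in table[(0, n)]

# Tiny CNF grammar: S -> NP VP, NP -> Det N, VP -> V NP
binary = [("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP"))]
lex = [("Det", "the"), ("N", "dog"), ("N", "cat"), ("V", "saw")]
print(cky_recognise("the dog saw the cat".split(), lex, binary))  # True
```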

92
New cards

purpose of [CLS] token in BERT

added to the beginning of a sentence as a single representation for the entire sentence

used for sentence-level classification

93
New cards

goal of language models

to predict upcoming words

94
New cards

applications of language models

speech recognition, spelling correction, collocation error correction, machine translation, summarisation etc.

95
New cards

technique used to compute P(w_1, w_2, …, w_l)

chain rule: P(x_1, x_2, …, x_n) = P(x_1) P(x_2 | x_1) P(x_3 | x_1, x_2) … P(x_n | x_1, x_2, …, x_{n−1})

96
New cards

Markov assumption

๐‘ƒ(๐‘ค๐‘˜|๐‘ค1:๐‘˜โˆ’1) โ‰ˆ ๐‘ƒ(๐‘ค๐‘˜|๐‘ค๐‘˜โˆ’1)

97
New cards

(N-1)th-order Markov Assumption

๐‘ƒ(๐‘ค๐‘˜|๐‘ค1:๐‘˜โˆ’1) โ‰ˆ ๐‘ƒ(๐‘ค๐‘˜|๐‘ค๐‘˜โˆ’N+1:k-1)

98
New cards

method to handle small probability multiplication for language modelling

log probabilities
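Combining the chain rule, the first-order Markov assumption, and log probabilities in one sketch; the bigram probability table is invented:

```python
import math

# Hypothetical bigram probabilities P(w_k | w_{k-1}); <s> marks sentence start.
bigram_p = {
    ("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "sat"): 0.4,
}

def sentence_logprob(words, bigram_p):
    """Sum of log bigram probabilities: the chain rule under the
    Markov assumption, computed in log space so long products of
    small probabilities do not underflow."""
    logp = 0.0
    for prev, w in zip(["<s>"] + words, words):
        logp += math.log(bigram_p[(prev, w)])
    return logp

lp = sentence_logprob(["the", "cat", "sat"], bigram_p)
print(lp, math.exp(lp))  # exp recovers P = 0.5 * 0.2 * 0.4 = 0.04
```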

99
New cards

method to address language model overfitting

smoothing (prevent a language model from assigning a zero probability to an unseen event)

adjust low probabilities (such as zero probabilities) upwards and high probabilities downwards

100
New cards

interpolation smoothing

mix with lower order n-gram probabilities using a set of lambda hyperparameters
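A sketch of linear interpolation for a bigram model; the probability tables and lambda weights are invented, and in practice the lambdas are tuned on validation data:

```python
def interpolated_p(w, prev, unigram_p, bigram_p, lambdas=(0.7, 0.3)):
    """Mix the bigram estimate with the lower-order unigram estimate;
    the lambda hyperparameters sum to 1. An unseen bigram still gets
    probability mass from the unigram term."""
    l_bigram, l_unigram = lambdas
    return (l_bigram * bigram_p.get((prev, w), 0.0)
            + l_unigram * unigram_p.get(w, 0.0))

unigram = {"cat": 0.1, "sat": 0.05}
bigram = {("the", "cat"): 0.4}          # ("cat", "sat") was never observed
print(interpolated_p("cat", "the", unigram, bigram))  # 0.7*0.4 + 0.3*0.1 = 0.31
print(interpolated_p("sat", "cat", unigram, bigram))  # unseen bigram, still > 0
```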
