22) Corpus linguistics as a linguistic discipline, basic types of corpora, English and Czech corpora and their comparison

Last updated 11:37 AM on 5/18/26

**Corpus linguistics**

The empirical study of language in use, which analyses patterns of usage in large collections of real-world text and speech.

**Corpus**

A large, principled collection of naturally occurring language, stored electronically. It serves to answer two fundamental research questions: which patterns are associated with particular lexical or grammatical features, and how those patterns differ within varieties and registers. A corpus is not a dictionary.

**Limits of corpora**

The limits of what a corpus can show: it can state what is or is not present in the corpus, but it cannot provide negative evidence (what is or is not possible in the language), explain why a pattern occurs, or provide all possible language at one time.

**Four major characteristics of the corpus approach**

The defining features of corpus linguistics: it is empirical, it uses a large and principled collection of natural texts, it makes extensive use of computers, and it depends on both quantitative and qualitative analytical techniques.

**Triangulation**

A research strategy that combines multiple data sources, multiple methods, and multiple theoretical perspectives so that findings can be cross-validated.

**Corpus-based linguistics (Top-down approach)**

A deductive methodology that uses the corpus as empirical evidence to test and verify pre-existing linguistic theories and grammatical rules (e.g., supporting what we already "think" we know).

**Corpus-driven linguistics (Bottom-up approach)**

An inductive approach supported by John Sinclair in which new linguistic theories are derived directly from the data itself, without imposing pre-existing theoretical assumptions.

**Phraseology**

A central element of corpus linguistics: the study of phrases, that is, recurrent combinations of words such as collocations and fixed or semi-fixed expressions.

**Lexicogrammar**

A concept advocated by Sinclair: there is no strict dividing line between lexis and grammar, so the two are interdependent and are described together.

**Register**

A variety of language associated with a specific situation of use; different audiences and purposes require different language.

**Idiom principle**

A concept introduced by John Sinclair: speakers often rely on fixed or semi-fixed phrases (like "heavy rain") because language is pattern-based, and meaning arises largely from words in sequence rather than from single words in isolation.
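
One way to see such patterns in data is to count which words occur next to each other. The following is a minimal sketch, not a standard tool: the sample text is invented for illustration, and `bigram_counts` is a hypothetical helper name. A real study would use a corpus of millions of words and a statistical association measure rather than raw counts.

```python
from collections import Counter

# Invented mini-corpus for illustration only.
text = (
    "heavy rain fell all night and heavy rain is expected again "
    "while strong wind and heavy traffic slowed the city"
)
tokens = text.split()


def bigram_counts(tokens):
    """Count each pair of adjacent words (bigrams) in a token list."""
    return Counter(zip(tokens, tokens[1:]))


counts = bigram_counts(tokens)

# Collect the words that directly follow "heavy", with their frequencies.
after_heavy = Counter(
    {pair[1]: n for pair, n in counts.items() if pair[0] == "heavy"}
)
print(after_heavy.most_common())
```

In this toy text "heavy" is followed by "rain" more often than by any other word, which is the kind of recurring word sequence the idiom principle describes.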

**Pre-electronic corpora**

The historical stage of corpus linguistics in which examples of usage were collected manually on paper (citation slips) for use in dictionaries; the work was very time-consuming and produced small corpora.

**Brown Corpus**

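The first computerised corpus built for linguistic research, compiled at Brown University by W. Nelson Francis and Henry Kučera from American English texts published in 1961 (about one million words).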

**Noam Chomsky's critique**

An argument made by the father of generative linguistics from the late 1950s onward, at the end of the structuralist era: corpora reflect only surface performance (including speakers' mistakes), not innate competence, and they are inherently finite, whereas human language is infinite because speakers can always produce new words and sentences.

**The digital age of corpora**

A period of rapid expansion driven by the availability of personal computers, new software, and corpus-based dictionaries, leading to data-driven learning, in which teachers and students explore real examples of word usage.

**Corpus linguistics in the 21st century**

The modern state of the field, shaped by technology: using the web as the largest corpus ever, analysing big data, and using AI.

**Generalised corpora**

A basic type of corpus designed to represent a language as broadly as possible: a very large collection (like the BNC, ANC, or COCA) that includes written texts such as newspapers, magazines, fiction, and non-fiction, often alongside transcripts of speech.

**Specialised corpora**

A basic type of corpus created to answer very specific questions: it contains texts of one type and aims to represent the language of that type (e.g., MICASE, the Michigan Corpus of Academic Spoken English).

**Learner Corpus**

A corpus of written texts or spoken transcripts produced by students who are still acquiring the language (e.g., the Cambridge Learner Corpus); it highlights typical errors and allows comparisons between learners and native speakers.

**Pedagogic Corpus**

A corpus of the language used in classroom settings, such as textbooks and transcripts of classroom interaction.

**Synchronic corpora**

Corpora that attempt to represent a language or a specific text type at one particular time.

**Diachronic corpora**

Corpora that represent a language and its development over a longer period of time.

**Annotated corpora**

Corpora with added linguistic information, in which words carry descriptions or tags (for example, part-of-speech tags).

**Non-annotated corpora**

Corpora that contain language data in its original form, without any added linguistic information.
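
The practical difference between the two corpus types can be sketched in a few lines. This is an illustrative assumption, not the format of any real corpus: the sentence is invented, the tags were assigned by hand, and the tag names (`NOUN`, `VERB`, and so on) and both helper functions are hypothetical.

```python
# Non-annotated form: plain text only.
raw = "The run was long but they run daily"

# Annotated form: each word paired with a part-of-speech tag.
# Tags are assigned by hand here; the tag names are illustrative.
annotated = [
    ("The", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("but", "CONJ"), ("they", "PRON"), ("run", "VERB"), ("daily", "ADV"),
]


def count_word(tokens, word):
    """Count occurrences of a word form, ignoring case."""
    return sum(1 for t in tokens if t.lower() == word)


def count_word_with_tag(tagged, word, tag):
    """Count occurrences of a word form that carry a given tag."""
    return sum(1 for w, t in tagged if w.lower() == word and t == tag)


# Plain text can only tell us how often the form "run" appears.
print(count_word(raw.split(), "run"))

# Annotation lets us separate the noun "run" from the verb "run".
print(count_word_with_tag(annotated, "run", "NOUN"))
print(count_word_with_tag(annotated, "run", "VERB"))
```

The non-annotated text supports searches for word forms only, while the annotated version also supports searches by grammatical category.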

**Corpus construction process**

A systematic process with four stages: acquiring texts (through contracts, scanning, the internet, or transcription), processing them into archives, registering all texts, and finally tagging them to add information about the words.
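
The four stages can be sketched as a small pipeline. Everything here is an assumption made for illustration: the function names, the `Text` record, the sample texts, and the toy lexicon are all hypothetical, and real corpus projects use dedicated taggers and far richer metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    text_id: str
    source: str
    content: str
    tagged: list = field(default_factory=list)


def acquire():
    """Stage 1: stand-in for scanning, downloading, or transcribing."""
    return [("web", "Corpora are   large.\n"), ("transcript", "we use corpora")]


def process(raw):
    """Stage 2: clean the raw text (here, only whitespace is normalised)."""
    return " ".join(raw.split())


def register(items):
    """Stage 3: give every text an ID and record where it came from."""
    return [Text(f"T{i:03d}", src, txt) for i, (src, txt) in enumerate(items, 1)]


# Toy lexicon standing in for a real part-of-speech tagger.
TOY_LEXICON = {"corpora": "NOUN", "are": "VERB", "large": "ADJ",
               "we": "PRON", "use": "VERB"}


def tag(text):
    """Stage 4: attach a tag to each word; unknown words get 'UNK'."""
    text.tagged = [(w, TOY_LEXICON.get(w.lower().strip("."), "UNK"))
                   for w in text.content.split()]
    return text


acquired = acquire()
processed = [(src, process(raw)) for src, raw in acquired]
registered = register(processed)
corpus = [tag(t) for t in registered]

for t in corpus:
    print(t.text_id, t.source, t.tagged)
```

Keeping the stages as separate functions mirrors the card: each step takes the output of the previous one, so a text is only tagged after it has been cleaned and registered.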

**British National Corpus (BNC)**

A generalised corpus of about 100 million words of written and spoken British English from the late twentieth century; its spoken part includes real, unscripted conversations recorded in everyday settings like homes, offices, and pubs.

**Corpus of Contemporary American English (COCA)**

A very large American corpus of spoken and written English covering genres such as fiction, magazines, newspapers, and academic texts; because its texts are spread across the years from 1990 onward, it can also be used to study recent language change.

**Czech National Corpus (CNC)**

A Czech corpus project run by the Institute of the Czech National Corpus at Charles University. Its corpora include SYN (written Czech), ORAL (spoken Czech), DIALEKT (a dialect corpus, not to be confused with DIAKORP, the diachronic corpus), and InterCorp (a parallel corpus designed for translation studies).