Data Cleaning and Text Preprocessing

Last updated 1:19 PM on 7/2/26

26 Terms

New cards

Big Data

Data that is so large in quantity, so complex, and generated so fast that it is hard (or even impossible) to process using traditional approaches.

New cards

5V’s of big data

Volume
Velocity
Variety
Veracity
Value

New cards

volume

The most characteristic aspect of big data is its volume.

Big data is data containing lots of instances and variables.

New cards

velocity

change over time

Big data is often generated at very high speed, so fast techniques for analysing it are essential. If processing takes too long, the results may be outdated and no longer useful by the time they are available.

New cards

variety

complexity

In big data, we encounter various types of data that often need to be combined in one analysis.

New cards

veracity

Most of the sources of big data are not fully trustworthy.

It is crucial to try out new ways in which we can validate the data.

New cards

value

Processing any large dataset does not necessarily lead to valuable knowledge.

But if the proper data analytic approach is used, a lot of money can be made (e.g., cookies).

New cards

big data analytics

The process of extracting knowledge from big data.
or: the use of advanced analytic techniques for very large and diverse data sets.

New cards

six steps of the data analytics process

Problem identification
Selection of data sources
Data cleaning
Data transformation
Mining
Analysis of results

New cards

Problem identification (1)

We need to understand the problem and identify different aspects.

What is the main objective of the data analysis?
What should be predicted?
Why is that important?
How good does the prediction have to be? Do we need real-time results?

New cards

Selection of data sources (2)

After identifying the problem, we need to consider all possible sources of data and select those that we need to solve the problem.

New cards

Data cleaning (3)

The data collected may include duplicate, corrupted, irrelevant, noisy, and missing data.

Data cleaning is the process of fixing or removing these data instances.
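
A minimal sketch of the "removing" side of data cleaning, using plain Python. The records, the `age` field, and the rule that a negative age counts as corrupted are all hypothetical and only serve as an illustration.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # missing value
    {"id": 1, "age": 34},     # exact duplicate of the first record
    {"id": 3, "age": -5},     # corrupted: an age cannot be negative
]

def clean(rows):
    """Drop duplicate, missing, and implausible records, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key in seen:
            continue  # duplicate of a record we have already seen
        seen.add(key)
        if row["age"] is None or row["age"] < 0:
            continue  # missing or corrupted value
        cleaned.append(row)
    return cleaned

cleaned_records = clean(records)
print(cleaned_records)
```

In practice a missing value might be fixed (e.g., filled in) rather than dropped; which option is appropriate depends on the problem.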

New cards

Data transformation (4)

The format of the collected data may not yet be suited for feeding into machine learning (data mining) algorithms.

Data transformation refers to the process of changing the format or structure of the data.

New cards

Mining (5)

We need to choose among various data analytic methods.

This also involves choosing between various possible settings within these methods.

New cards

Analysis of results (6)

Different criteria and methods of visualisation are used to examine and interpret the results of data mining models.

New cards

Text Mining (or text analysis)

The process of extracting interesting and non-trivial information and knowledge from unstructured text.

It is an automated process that uses natural language processing to extract valuable insights from the text by converting it into information that machines can understand.

New cards

text mining challenges

Text mining often strongly relies on the context and background knowledge for defining and conveying meaning.

The human/natural language is full of ambiguous terms and phrases.

Research on natural language processing is being done continuously to provide more efficient text-processing methods that resolve these challenges.

New cards

a series of data cleaning processes on the raw text:

Removing punctuation
Removing numbers
Removing extra spaces between text
Writing all words in lower case (removing capitals)
Removing stop words
Replace synonyms with more general concepts
POS tagging
Stemming and lemmatisation
Measuring lexical complexity
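
The first few cleaning steps in this list can be sketched with the Python standard library. The function name `clean_text` and the example sentence are made up for illustration, and the order of the steps is just one reasonable choice.

```python
import re
import string

def clean_text(raw):
    """Lower-case the text, then strip punctuation, numbers, and extra spaces."""
    text = raw.lower()                                                 # remove capitals
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                    # remove numbers
    text = re.sub(r"\s+", " ", text).strip()                           # collapse extra spaces
    return text

cleaned = clean_text("Data   Cleaning is VERY important, in 2024!")
print(cleaned)
```

Note that `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would need separate handling.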

New cards

Removing stop words

Words that occur very commonly and therefore carry very little information, for example ‘the’, ‘is’, and ‘and’.

However, removing stop words can lead to the loss of the original meaning and structure of the text.
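
A small sketch of stop word removal that also shows how meaning can get lost. The stop word list here is a tiny hypothetical one; real lists are much longer and differ between tools.

```python
# Tiny illustrative stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "is", "and", "a", "of", "not"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

kept = remove_stop_words("the film is not good".split())
print(kept)
```

Because "not" is on this list, "the film is not good" is reduced to "film good", which reads as the opposite of the original sentence.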

New cards

stemming

A preprocessing procedure with the aim to reduce inflectional (occasionally derivational) forms of words to their word stem, root form, or a common base form.

For example:

love, loved, lovable, lovely, and loving → stem: ‘lov’
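
A toy suffix-stripping stemmer that reproduces the example above. This is not the Porter algorithm or any other published stemmer; the suffix list and the minimum stem length of 3 are assumptions chosen only to make the idea concrete.

```python
# Suffixes to strip, checked in this order (a hypothetical, hand-picked list).
SUFFIXES = ("able", "ing", "ely", "ed", "ly", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["love", "loved", "lovable", "lovely", "loving"]]
print(stems)
```

Real stemmers apply more careful rules, so their output for these words may differ from this sketch.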

New cards

lemmatisation

Very similar to stemming (both reduce the inflected words into their root form).

Difference: for some words, the stem may not be an actual word (e.g., lov).

In contrast, the lemma is always an actual word in the language (e.g., love).

Words such as ‘am’ or ‘is’ change into ‘be’ (e.g., I am interested → I be interest; It is interesting → It be interest).

Challenge: same root, different meanings. For example, iron vs. ironic; or animal vs. animated (both resulting in ‘anim’).

Overall, stemming is simpler and faster, as it is a step-by-step procedure (algorithm) that is performed on words, while lemmatisation is more accurate and elegant, but slower.
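
Lemmatisation can be pictured as a dictionary lookup rather than a suffix rule. The sketch below uses a tiny hand-written table (`LEMMA_TABLE` is hypothetical); real lemmatisers rely on a full vocabulary and usually on the part of speech as well.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
LEMMA_TABLE = {
    "am": "be",
    "is": "be",
    "are": "be",
    "loved": "love",
    "loving": "love",
}

def toy_lemmatise(tokens):
    """Replace each token with its lemma if known, otherwise keep it unchanged."""
    return [LEMMA_TABLE.get(t, t) for t in tokens]

lemmas = toy_lemmatise("it is loved".split())
print(lemmas)
```

Every output here is an actual word, which is the main contrast with a stem such as ‘lov’.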

New cards

Part of Speech Tagging (PoS Tagging)

The process of assigning parts of speech to each word in a text, such as noun, verb, adjective, adverb, and so on.

Helps with tasks such as sentiment analysis, translation, and summarisation.

Challenge: some words may have a different part of speech depending on the context in which they are used.

For example: in “What is the answer”, the word “answer” is a noun; in “They answer the challenging question”, the word “answer” is a verb.

New cards

bag-of-words method (or vector space model)

A text is interpreted as a bag full of words, regardless of grammar and even word order.

The number of occurrences for each word is considered as a feature for that text.

Bag-of-words only provides a vector representation for each document containing the count of word occurrences in the document.

Words are mutually independent (individual words can be considered on their own).
Word order in the text is irrelevant.

Despite its unrealistic assumptions and simplicity, this approach to text modelling proved to be highly effective and is often used in text mining.

Stemming and removing stop words help us to reduce the number of features when using bag-of-words.
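
A minimal bag-of-words sketch using only the standard library. The two example documents are made up, and splitting on spaces stands in for proper tokenisation.

```python
from collections import Counter

# Two made-up documents.
docs = ["data cleaning is important", "cleaning data is cleaning work"]

# The vocabulary is every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a vector with one count per vocabulary word, so any word order in the original text is lost.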

New cards

n-gram

A sequence of n words that come one after another in a text.

Bi-grams in sentence “data cleaning is very important”:

data cleaning
cleaning is
is very
very important

n = 1 (uni-grams) is like bag-of-words (“data”, “cleaning”, “is”, etc.)

n = 2 (bi-grams) is used most (“data cleaning”, “cleaning is”, etc.)

n = 3 (tri-grams) is also used often

etc.

(higher orders of n lead to a huge number of features)
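
A short sketch of n-gram extraction for the example sentence. The function name `ngrams` is just an illustrative choice, and words are split on spaces.

```python
def ngrams(text, n):
    """Return all sequences of n consecutive words in the text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "data cleaning is very important"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)
```

A sentence of 5 words yields 4 bi-grams and 3 tri-grams, but across a large corpus the number of distinct n-grams grows quickly with n.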

New cards

TF-IDF

A method that gives important words a higher score and common words a lower score.

TF → How many times does a word appear in this document?

IDF → How rare is this word across all documents?

If a word appears in almost every document → low IDF
If a word appears in only a few documents → high IDF

TF-IDF → TF x IDF:

appears many times in one document, and
appears in only a few documents overall.

Bag-of-words treats every word as equally important.

TF-IDF gives more weight to informative, unique words and less weight to very common words.
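
A minimal TF-IDF sketch. It assumes one common variant of the formula: TF is the raw count of the word in the document and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the word. Other variants (smoothing, normalised TF) exist, and the three documents are made up.

```python
import math
from collections import Counter

# Three made-up, already tokenised documents.
docs = [
    "the goal of the player".split(),
    "the election and the government".split(),
    "the player scored".split(),
]

def tf_idf(docs):
    """Return one {word: score} dict per document, with score = TF * log(N / df)."""
    n_docs = len(docs)
    # df: in how many documents does each word appear at least once?
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each word in this document
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

scores = tf_idf(docs)
print(scores[0])
```

In this sketch "the" appears in every document, so its IDF is log(1) = 0 and its score is 0 even though it occurs twice. "goal" appears in only one document, so it scores higher than "player", which appears in two.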

New cards

How TF-IDF is helpful (in grouping news articles by topic)

Every article contains words like "the," "is," and "and." These words don't help distinguish one topic from another.
Sports articles might frequently contain "goal," "football," or "player."
Politics articles might contain "election," "government," or "policy."