Big Data
Data that is so large in quantity, so complex, and generated so fast that it is hard (or even impossible) to process using traditional approaches.
5V’s of big data
Volume
Velocity
Variety
Veracity
Value
volume
The most characteristic aspect of big data is its volume.
Big data is data containing lots of instances and variables.
velocity
change over time
Big data is often generated at very high speed, so fast techniques for analysing it are essential. Otherwise, if the data processing takes too much time, the results may be outdated and no longer useful by the time they are available.
variety
complexity
In big data, we encounter various types of data that often need to be combined in one analysis.
veracity
Most of the sources of big data are not fully trustworthy.
It is crucial to try out new ways in which we can validate the data.
value
Processing any large dataset does not necessarily lead to valuable knowledge.
But if the proper data analytic approach is used, a lot of money can be made (e.g., cookies).
big data analytics
The process of extracting knowledge from big data.
or: the use of advanced analytic techniques for very large and diverse data sets.
six steps of the data analytics process
Problem identification
Selection of data sources
Data cleaning
Data transformation
Mining
Analysis of results
Problem identification (1)
We need to understand the problem and identify different aspects.
What is the main objective of the data analysis?
What should be predicted?
Why is that important?
How good does the prediction have to be? Do we need real-time results?
Selection of data sources (2)
After identifying the problem, we need to consider all possible sources of data and select those that we need to solve the problem.
Data cleaning (3)
The data collected may include duplicate, corrupted, irrelevant, noisy, and missing data.
Data cleaning is the process of fixing or removing these data instances.
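A minimal sketch of the removal side of data cleaning, assuming a small list of hypothetical records. The field names, the use of None for missing values, and the valid age range are illustrative assumptions, not part of the notes.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Utrecht"},
    {"id": 1, "age": 34, "city": "Utrecht"},   # duplicate
    {"id": 2, "age": None, "city": "Leiden"},  # missing value
    {"id": 3, "age": -5, "city": "Delft"},     # corrupted value
    {"id": 4, "age": 51, "city": "Breda"},
]

def clean(rows):
    """Drop duplicates, rows with missing values, and rows with an impossible age."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        if any(value is None for value in row.values()):
            continue  # missing data
        if not 0 <= row["age"] <= 120:
            continue  # corrupted data (assumed valid range)
        cleaned.append(row)
    return cleaned

cleaned_records = clean(records)
print([row["id"] for row in cleaned_records])  # [1, 4]
```

In practice, missing or corrupted values are often fixed (for example, imputed) rather than dropped; this sketch only shows removal.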
Data transformation (4)
The format of the collected data may not yet be suited for feeding into machine learning (data mining) algorithms.
Data transformation refers to the process of changing the format or structure of the data.
Mining (5)
We need to choose among various data analytic methods.
This also involves choosing between various possible settings within these methods.
Analysis of results (6)
Different criteria and methods of visualisation are used to examine and interpret the results of data mining models.
Text Mining (or text analysis)
The process of extracting interesting and non-trivial information and knowledge from unstructured text.
It is an automated process that uses natural language processing to extract valuable insights from the text by converting it into information that machines can understand.
text mining challenges
Text mining often strongly relies on the context and background knowledge for defining and conveying meaning.
The human/natural language is full of ambiguous terms and phrases.
Moreover, research on natural language processing is being done continuously to provide more efficient methods for text processing to resolve these challenges.
a series of data cleaning processes on the raw text:
Removing punctuation
Removing numbers
Removing extra spaces between text
Writing all words in lower case (removing capitals)
Removing stop words
Replacing synonyms with more general concepts
POS tagging
Stemming and lemmatisation
Measuring lexical complexity
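The first few cleaning steps in this list can be sketched in plain Python. This is only an illustration: the stop-word list is a tiny assumed sample, and the function name preprocess is hypothetical.

```python
import re
import string

STOP_WORDS = {"it", "be", "to", "the", "is"}  # tiny illustrative list, not a standard one

def preprocess(text):
    """Lower-case, strip punctuation and numbers, collapse spaces, drop stop words."""
    text = text.lower()                                               # remove capitals
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", "", text)                                   # remove numbers
    tokens = text.split()                                             # split() also collapses extra spaces
    return [token for token in tokens if token not in STOP_WORDS]     # remove stop words

tokens = preprocess("The 3 Cleaning  steps: it is EASY to apply them!")
print(tokens)  # ['cleaning', 'steps', 'easy', 'apply', 'them']
```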
Removing stop words
Words that occur very commonly, and therefore contain very little information. For example:
it
be
to
the
However, removing stop words can lead to the loss of the original meaning and structure of the text.
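A small sketch of that loss of meaning, using only the four example stop words listed above (real stop-word lists are much longer). The function name remove_stop_words is hypothetical.

```python
STOP_WORDS = {"it", "be", "to", "the"}  # the four examples from the card

def remove_stop_words(text):
    """Return the lower-cased words of text that are not stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

hamlet = remove_stop_words("To be or not to be")
print(hamlet)  # ['or', 'not']
```

The famous phrase shrinks to two words that no longer carry its meaning.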
stemming
A preprocessing procedure that aims to reduce inflectional (and occasionally derivational) forms of words to their word stem, root form, or a common base form.
For example:
love, loved, lovable, lovely, and loving → stem: ‘lov’
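A toy suffix-stripping sketch that reproduces the example above. The suffix list and the minimum stem length are assumptions chosen for this example only; real stemmers use larger, ordered rule sets and may produce different stems.

```python
SUFFIXES = ("able", "ing", "ely", "ed", "ly", "e")  # assumed toy list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word  # no rule applied

stems = [naive_stem(w) for w in ["love", "loved", "lovable", "lovely", "loving"]]
print(stems)  # ['lov', 'lov', 'lov', 'lov', 'lov']
```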
lemmatisation
Very similar to stemming (both reduce the inflected words into their root form).
Difference: for some words, the stem may not be an actual word (e.g., lov).
In contrast, the lemma is always an actual word in the language (e.g., love).
Words such as ‘am’ or ‘is’ change into ‘be’ (e.g., I am interested → I be interest; It is interesting → It be interest).
Challenge: same root, different meanings. For example, iron vs. ironic, or animal vs. animated (both resulting in ‘anim’).
Overall, stemming is simpler and faster, as it is a step-by-step procedure (algorithm) that is performed on words, while lemmatisation is more accurate and elegant, but slower.
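A minimal dictionary-lookup sketch of lemmatisation that reproduces the ‘am’/‘is’ example. The lookup table LEMMAS is a hypothetical, hand-written stand-in for the vocabulary a real lemmatiser relies on, which is part of why lemmatisation is slower than rule-based stemming.

```python
# Hypothetical hand-written lemma table; a real lemmatiser uses a full vocabulary.
LEMMAS = {
    "am": "be",
    "is": "be",
    "are": "be",
    "interested": "interest",
    "interesting": "interest",
}

def lemmatise(sentence):
    """Replace each word by its lemma if it is in the table, else keep it."""
    return " ".join(LEMMAS.get(word, word) for word in sentence.lower().split())

print(lemmatise("I am interested"))    # i be interest
print(lemmatise("It is interesting"))  # it be interest
```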
Part of Speech Tagging (PoS Tagging)
The process of assigning parts of speech to each word in a text, such as noun, verb, adjective, adverb, and so on.
Helps with tasks such as sentiment analysis, translation, and summarisation.
Challenge: some words may have a different part of speech depending on the context in which they are used.
For example: in “What is the answer?”, the word “answer” is a noun; in “They answer the challenging question”, the word “answer” is a verb.
bag-of-words method (or vector space model)
A text is interpreted as a bag full of words, regardless of grammar and even word order.
The number of occurrences for each word is considered as a feature for that text.
Bag-of-words only provides a vector representation for each document containing the count of word occurrences in the document.
Words are mutually independent (individual words can be considered on their own).
Word order in the text is irrelevant.
Despite its unrealistic assumptions and simplicity, this approach to text modelling proved to be highly effective and is often used in text mining.
Stemming and removing stop words help us to reduce the number of features when using bag-of-words.
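A minimal bag-of-words sketch, assuming two made-up example documents and simple whitespace tokenisation. Each document becomes a vector of word counts over a shared, alphabetically sorted vocabulary.

```python
from collections import Counter

docs = ["data cleaning is important", "data mining finds patterns in data"]  # made-up examples

def bag_of_words(documents):
    """Return the shared vocabulary and one count vector per document."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)  # word -> number of occurrences; missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cleaning', 'data', 'finds', 'important', 'in', 'is', 'mining', 'patterns']
print(vectors[0])  # [1, 1, 0, 1, 0, 1, 0, 0]
print(vectors[1])  # [0, 2, 1, 0, 1, 0, 1, 1]
```

Every distinct word adds one feature, which is why stemming and stop-word removal shrink the vectors.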
n-gram
A sequence of n words that come one after another in a text.
Bi-grams in the sentence “data cleaning is very important”:
data cleaning
cleaning is
is very
very important
n = 1 is like bag-of-words (“data”, “cleaning”, “is”, etc.)
n = 2 is used most (“data cleaning”, “cleaning is”, etc.)
n = 3 is also used often
etc.
(higher orders of n lead to a huge number of features)
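A short sketch of n-gram extraction with a sliding window over whitespace-separated words. The function name ngrams is hypothetical.

```python
def ngrams(text, n):
    """Return all sequences of n consecutive words in text, joined by spaces."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "data cleaning is very important"
bigrams = ngrams(sentence, 2)
print(bigrams)  # ['data cleaning', 'cleaning is', 'is very', 'very important']
```

With n = 1 the same function returns the individual words, as in bag-of-words.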
TF-IDF
A method that gives important words a higher score and common words a lower score.
TF → How many times does a word appear in this document?
IDF → How rare is this word across all documents?
If a word appears in almost every document → low IDF
If a word appears in only a few documents → high IDF
TF-IDF → TF x IDF. A word gets a high score when it:
appears many times in one document, and
appears in only a few documents overall.
Bag-of-words treats every word as equally important.
TF-IDF gives more weight to informative, unique words and less weight to very common words.
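A minimal TF-IDF sketch on three made-up documents. The notes do not fix an exact formula, so this assumes one common variant: TF is the word count divided by the document length, and IDF is log(N / number of documents containing the word).

```python
import math
from collections import Counter

docs = [  # made-up examples
    "the player scored a goal",
    "the government announced the policy",
    "the player praised the government",
]

def tf_idf(documents):
    """Return one {word: score} dict per document (assumed TF and IDF variants)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
print(scores[0]["the"])                         # 0.0, because "the" is in every document
print(scores[0]["goal"] > scores[0]["player"])  # True, because "goal" is rarer than "player"
```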
How TF-IDF is helpful (in grouping news articles by topic)
Every article contains words like "the," "is," and "and." These words don't help distinguish one topic from another.
Sports articles might frequently contain "goal," "football," or "player."
Politics articles might contain "election," "government," or "policy."