A set of vocabulary flashcards covering essential terms and techniques mentioned in the lecture notes on comparative sentiment-analysis research using Naïve Bayes and Logistic Regression.
Sentiment Analysis
The process of automatically identifying and categorising opinions in text as positive, negative or neutral.
Logistic Regression
A supervised learning algorithm that models the probability of a categorical outcome using the logistic (sigmoid) function.
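As a rough illustration of the card above, the sketch below shows how the sigmoid turns a weighted sum of features into a probability. The weights, bias and feature values are hypothetical numbers chosen for the example, not fitted parameters.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights and bias for two features (illustrative values only).
weights = [0.8, -1.2]
bias = 0.1

def predict_proba(features):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

p = predict_proba([1.0, 0.5])  # z = 0.1 + 0.8 - 0.6 = 0.3
```

A score of zero maps to a probability of exactly 0.5, which is why the decision boundary of a binary logistic model sits where the weighted sum equals zero.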
Naïve Bayes
A probabilistic classifier that applies Bayes’ theorem under the assumption that features are conditionally independent given the class.
VADER (Valence Aware Dictionary and sEntiment Reasoner)
A lexicon- and rule-based tool tailored for social-media text that assigns polarity scores and compound sentiment values.
SMOTE (Synthetic Minority Over-sampling Technique)
A resampling method that creates synthetic examples for minority classes to balance imbalanced datasets.
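The core step of SMOTE can be sketched as linear interpolation between a minority sample and one of its minority-class neighbours. The snippet below shows only that step; the points are made up, and the neighbour is assumed to have been found already by a nearest-neighbour search, which the full technique also performs.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)           # seeded for repeatability
minority_sample = [1.0, 2.0]     # hypothetical minority-class point
nearest_neighbor = [3.0, 6.0]    # assumed to be one of its k nearest minority neighbours
synthetic = interpolate(minority_sample, nearest_neighbor, rng)
```

In practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used instead of hand-written interpolation.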
TF-IDF (Term Frequency–Inverse Document Frequency)
A weighting scheme that reflects how important a word is to a document relative to a corpus.
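A small worked example may help here. It uses the plain textbook variant, tf × log(N / df), on a made-up three-document corpus; library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation, so their numbers will differ.

```python
import math

# Toy corpus of pre-tokenized documents (invented for illustration).
docs = [
    ["good", "movie", "good", "plot"],
    ["bad", "movie"],
    ["good", "acting"],
]

def tf(term, doc):
    """Relative frequency of the term within one document."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / documents containing the term).
    Assumes the term occurs in at least one document."""
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

score_good = tf_idf("good", docs[0], docs)  # tf = 0.5,  idf = log(3/2)
score_plot = tf_idf("plot", docs[0], docs)  # tf = 0.25, idf = log(3/1)
```

Although "good" appears twice in the first document and "plot" only once, "plot" receives the higher weight because it occurs in fewer documents across the corpus.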
Tokenization
The preprocessing step that splits raw text into smaller units such as words or sub-words called tokens.
Stemming
Reducing words to a common stem by stripping affixes with heuristic rules, so that word variants are unified (e.g., ‘running’ → ‘run’); the resulting stem is not always a dictionary word.
Stopword Removal
Eliminating very common words (e.g., ‘and’, ‘the’) that carry little semantic value in text analysis.
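Tokenization and stopword removal are often chained, and a minimal sketch of that chain follows. The regular expression and the five-word stopword list are simplifying assumptions for the example; real projects usually rely on a library tokenizer and a fuller stopword list.

```python
import re

STOPWORDS = {"and", "the", "is", "a", "of"}  # tiny illustrative list

def tokenize(text):
    """Lowercase the text and extract runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens):
    """Keep only tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The plot is thin and the acting is wooden.")
content_tokens = remove_stopwords(tokens)
# content_tokens == ['plot', 'thin', 'acting', 'wooden']
```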
CRISP-DM
A six-phase, industry-standard methodology for data-mining projects: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment.
Machine Learning
A branch of artificial intelligence that enables systems to learn patterns from data and improve performance over time.
Supervised Learning
A machine-learning paradigm where models are trained using labelled input–output pairs.
GridSearchCV
A scikit-learn class that exhaustively evaluates every combination in a specified hyperparameter grid using cross-validation and keeps the best-scoring one.
Precision
The proportion of true positive predictions among all positive predictions made by a model.
Recall
The proportion of actual positive instances that the model correctly identifies as positive.
F1-Score
The harmonic mean of precision and recall; balances both metrics into a single measure.
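The three metrics above can be computed directly from counts of true positives (tp), false positives (fp) and false negatives (fn). The counts in this sketch are invented for illustration, and returning 0.0 when a denominator is zero is one common convention rather than the only one.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 correct positives, 2 false alarms, 4 misses.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
# precision = 8/10, recall = 8/12, F1 = 8/11
```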
Confusion Matrix
A table showing correct and incorrect predictions broken down by each class, used to evaluate classification models.
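To make the layout concrete, the sketch below builds a confusion matrix from two label lists, with rows as true classes and columns as predicted classes. The labels and predictions are made up; the row/column orientation shown here matches the one scikit-learn uses, but other sources transpose it.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["negative", "neutral", "positive"]
y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "neutral", "neutral", "negative", "negative"]

matrix = confusion_matrix(y_true, y_pred, labels)
correct = sum(matrix[i][i] for i in range(len(labels)))  # diagonal = correct predictions
```

Entries on the diagonal are correct predictions; every off-diagonal entry counts one kind of confusion between two classes.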
WordCloud
A visual representation where word size indicates frequency, revealing prominent terms in text data.
Decision Boundary
The surface that separates different class regions in the feature space according to a classifier.
McNemar's Test
A non-parametric statistical test for paired nominal data used to compare two classifiers on the same samples.
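The test only uses the samples on which the two classifiers disagree. The sketch below computes the continuity-corrected chi-square form of the statistic; the counts b and c are hypothetical, and for small b + c an exact binomial version is usually preferred over this approximation.

```python
import math

def mcnemar(b, c):
    """b: samples only classifier A got right; c: samples only B got right.
    Returns the continuity-corrected statistic and its p-value (1 degree of freedom)."""
    if b + c == 0:
        return 0.0, 1.0  # no disagreements, so no evidence of a difference
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = mcnemar(b=25, c=10)  # statistic = 14**2 / 35 = 5.6
```

With these invented counts the p-value falls below 0.05, which would conventionally be read as a significant difference between the two classifiers.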
X API (formerly Twitter API)
An interface that allows programmatic access to X/Twitter data for retrieving, posting or analysing tweets.
Google Colab
A cloud-based Jupyter Notebook environment providing free CPU, GPU and TPU resources for Python code execution.
TruncatedSVD
A dimensionality-reduction technique that projects high-dimensional sparse data (e.g., TF-IDF) into lower dimensions.
One-vs-Rest Strategy
A multi-class classification approach that trains one binary classifier per class against all other classes.
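One way to picture the strategy: each class gets its own binary target vector, one binary model is trained per vector, and the class whose model scores highest wins. The sketch below shows only the relabelling and the final selection; the per-class scores are hypothetical stand-ins for real model outputs.

```python
def one_vs_rest_targets(labels):
    """Build one binary target list per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def predict(scores):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores, key=scores.get)

y = ["positive", "negative", "neutral", "positive"]
targets = one_vs_rest_targets(y)

# Hypothetical scores from three already-trained binary classifiers.
predicted = predict({"negative": 0.2, "neutral": 0.3, "positive": 0.7})
```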
Vectorizer
A tool (e.g., CountVectorizer, TfidfVectorizer) that converts text into numerical feature vectors for machine-learning models.
Lexicon-Based Analysis
Sentiment detection that relies on predefined dictionaries of words annotated with polarity scores.
Class Imbalance
A situation where some classes have far fewer samples than others, often degrading model performance.
Hyperparameter Tuning
The process of optimising external configuration settings (e.g., C value in Logistic Regression) to improve model performance.