Big Data
Massive amounts of data generated daily from devices, sensors, and online activity.
Data Mining
The process of finding useful patterns or knowledge from large datasets using algorithms.
Knowledge
Actionable insights or information derived from data that help make decisions.
Learning Algorithm
A method used by computers to learn patterns or relationships from data.
Data Mining Pipeline
Input Data → Data Preprocessing → Data Mining → Post Processing → Information (the final useful knowledge you can act on)
Data Subsetting
Using a portion of the full dataset for analysis.
Supervised Learning
Learning from labeled data to predict outcomes. (Data Mining Tasks)
Unsupervised Learning
Finding hidden patterns in unlabeled data. (Data Mining Tasks)
Data Object
A single item or record in a dataset (a row).
Attribute
A property or characteristic of an object (a column).
Distinctness
Whether values can be told apart (=, ≠). Attribute Properties
Order
Whether values can be ranked (<, >). Attribute Properties
Addition
Whether differences between values are meaningful (+, −). Attribute Properties
Multiplication
Whether ratios between values are meaningful (×, ÷). Attribute Properties
Nominal
Categories with names only; no order or numbers.
Examples: ZIP code, Color, ID
Ordinal
Ordered categories; ranking matters, but the gaps between values are not consistent or measurable.
Examples: Grades, {Good, Better, Best}, Rank
Interval
Differences are meaningful, but no true zero.
Examples: Dates, °C, °F
Ratio
Differences and ratios are meaningful; has a true zero.
Examples: Age, Height, Weight, Money
Distinctness applies to
Nominal, Ordinal, Interval, Ratio
Order applies to
Ordinal, Interval, Ratio
Addition applies to
Interval, Ratio
Multiplication applies to
Ratio
Mode
Most common value; all attribute types
Median
Middle value (robust to outliers); Ordinal, Interval, Ratio
Mean (and weighted mean)
Sum ÷ count; Interval, Ratio only
Range
Max − min; Interval, Ratio
Variance (s²)
Average squared distance from the mean; Interval, Ratio
Standard Deviation (s)
Square root of variance; Interval, Ratio
Median Absolute Deviation
Median of absolute differences from the median; Interval, Ratio
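The summary statistics above can be sketched in plain Python using the standard library's `statistics` module (the dataset here is made up for illustration):

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mode = statistics.mode(data)          # most common value -> 4
median = statistics.median(data)      # middle value, robust to outliers -> 4.5
mean = statistics.mean(data)          # sum / count -> 5
rng = max(data) - min(data)           # range: max - min -> 7
var = statistics.pvariance(data)      # average squared distance from mean -> 4
std = statistics.pstdev(data)         # square root of variance -> 2
# median absolute deviation: median of |x - median| -> 0.5
mad = statistics.median(abs(x - median) for x in data)

print(mode, median, mean, rng, var, std, mad)
```

Note that mean, variance, and standard deviation assume interval/ratio data, while mode works for any attribute type, matching the table above.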
Discrete Data
Finite/countable values (integers); examples: ID, counts, zip codes
Continuous Data
Infinite real values (decimals); examples: height, temperature, age
Noise
Random errors in data (e.g. sensor error, distortion)
→ Fix: visualize, remove noisy attributes, avoid overfitting
Outlier
Value far from others; possible error or anomaly
→ Fix: detect, remove if irrelevant, use median-based stats
Nominal
Mode, Entropy, χ²
Ordinal
Median, Mode, Rank tests
Interval
Mean, Std. Dev., Correlation, z-score
Ratio
Mode, median, entropy, std. dev, correlation, z-score, rank tests, Geometric/Harmonic means (Everything!!)
Data Preprocessing
Steps to prepare raw data for analysis: sampling, feature selection, dimensionality reduction, feature creation, discretization, transformation.
Discretization
Converting continuous values into categories (e.g., age → "young," "middle," "old").
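Discretization can be sketched as simple binning; the age cutoffs below are assumed for illustration, not taken from the cards:

```python
def discretize_age(age):
    """Map a continuous age to a category (hypothetical bin edges)."""
    if age < 30:
        return "young"
    elif age < 60:
        return "middle"
    return "old"

ages = [15, 42, 71]
labels = [discretize_age(a) for a in ages]
print(labels)  # ['young', 'middle', 'old']
```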
Data Bias
Systematic errors caused by unrepresentative samples or flawed data sources.
Sampling
Selecting a subset of data to analyze when the full dataset is too large or costly.
Representative Sample
A sample that accurately reflects the population's key properties.
Simple Random Sampling
Every item has equal chance; may miss rare cases.
Stratified Sampling
Sampling from each subgroup to ensure all are represented
Progressive Sampling
Start small, increase sample size until results stabilize.
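Simple random and stratified sampling can be contrasted in a short sketch; the population below (90 items of class "A", 10 of class "B") is invented to show why the rare class can be missed by random sampling but is guaranteed by stratified sampling:

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

population = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]

# Simple random sampling: every item has an equal chance;
# the rare class "B" may be missed entirely in a small sample.
simple = random.sample(population, 10)

# Stratified sampling: sample from each subgroup (stratum)
# so every class is guaranteed to be represented.
def stratified_sample(items, key, per_stratum):
    strata = {}
    for item in items:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        sample.extend(random.sample(group, per_stratum))
    return sample

strat = stratified_sample(population, key=lambda x: x[0], per_stratum=5)
print(sorted({label for label, _ in strat}))  # ['A', 'B'] -- both classes present
```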
Sample Size
Must be large enough to capture patterns in the population.
Survivorship Bias
Only analyzing entities that "survived" over time (e.g., current companies).
Lookahead Bias
Using future or modern knowledge to influence past data analysis.
Feature Selection
Choosing the most useful attributes to improve model performance and reduce dimensionality. Reduces noise, speeds up learning, prevents overfitting, improves accuracy.
Redundant Feature
Duplicates info found in other attributes (e.g., price and sales tax).
Irrelevant Feature
Adds no useful info for prediction (e.g., student ID when predicting GPA).
Curse of Dimensionality
Too many attributes → sparse data → harder to find meaningful patterns.
Dimensionality
Number of features (attributes) in a dataset.
Principal Component Analysis (PCA)
Reduces the number of features while keeping most of the important information. Finds new axes (principal components) that capture most of the variance in the data, compressing many correlated features into a few powerful ones that still describe the data well.
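The variance-capturing idea behind PCA can be shown on a tiny 2-D dataset: compute the covariance matrix of the centered data, then use the closed-form eigenvalues of a symmetric 2×2 matrix. The data points are made up, chosen to be strongly correlated:

```python
import math

# Toy 2-D dataset with strongly correlated features (invented for illustration)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Covariance matrix [[a, b], [b, c]] of the centered data
a = sum((x - mx) ** 2 for x in xs) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
c = sum((y - my) ** 2 for y in ys) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix; the larger one is the
# variance captured along the first principal component's axis.
mid = (a + c) / 2
half = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
lam1, lam2 = mid + half, mid - half

explained = lam1 / (lam1 + lam2)
print(f"first PC explains {explained:.1%} of the variance")
```

Because the two features move together, one principal component captures nearly all the variance, so the two columns compress to one with little loss.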
Precision
If you care more about avoiding false alarms (like spam detection) → maximize… Of the items you said were positive (TP + FP), how many of them really were (TP).
Recall
If you care more about not missing cases (like oil spills or cancer detection) → maximize… Of the items that really were positive (TP + FN), how many of them did you actually find (TP).
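The two definitions translate directly into formulas; the confusion-matrix counts below are hypothetical, just to exercise them:

```python
def precision(tp, fp):
    # Of the items predicted positive (TP + FP), how many really were (TP)
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the items that really were positive (TP + FN), how many were found (TP)
    return tp / (tp + fn)

# Made-up counts: 8 true positives, 2 false alarms, 4 missed cases
tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.666...
```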
Underfitting
Model is too simple → high errors on both training and testing.
(Like drawing a straight line through a wavy dataset.)
Overfitting
Model is too complex → perfect on training data but fails on test data.
(Like drawing a squiggly line that passes through every training point.)
Two major reasons for overfitting
Noise and insufficient data
Holdout method
A model evaluation technique where the dataset is split into separate training and testing sets; the model is trained on the training set and its performance is evaluated on the unseen testing set (e.g., 70/30, 60/40, or 50/50, chosen randomly).
Repeated Resampling
Repeat the holdout process several times and average results.
Stratified Sampling
Keep class proportions consistent in train/test splits (important for imbalanced data).
Bootstrap
Sampling with replacement to create multiple training sets
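A holdout split and a bootstrap sample can both be sketched with the standard `random` module (the data and the 70/30 ratio are illustrative):

```python
import random

random.seed(1)  # fixed seed so the example is reproducible
data = list(range(100))

# Holdout: shuffle, then split randomly into 70% train / 30% test
shuffled = data[:]
random.shuffle(shuffled)
cut = int(0.7 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# Bootstrap: sample WITH replacement to build a training set the same size
# as the original; some items repeat, others are left out entirely
boot = random.choices(data, k=len(data))

print(len(train), len(test), len(set(boot)))
```

Repeating the holdout split with different random shuffles and averaging the results gives the repeated-resampling estimate described above.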
Hyperparameters
Settings you choose before training (not learned from data). Examples:
Decision tree max depth
Number of clusters in K-Means
Learning rate in neural networks
Polynomial degree in regression
Control model complexity → too high = overfitting, too low = underfitting
Eager learners
(like decision trees) build a model first using all training data.
Learning = slow
Predicting new data = fast
Lazy learners
(like KNN) don't build a model until prediction time.
Learning = fast (just store the data)
Predicting = slow (must look through all stored data to find nearest neighbors).
KNN
an instance-based or example-based classifier:
It stores all training examples.
When a new example comes, it compares it to stored cases.
It predicts the class of the new example based on similar past cases.
rote learner
Memorizes data exactly. To classify a new case, it looks for an exact match.
Nearest neighbor
Looks for closest (not identical) examples using a distance metric.
Small k
may overfit (too sensitive to noise).
large k
may underfit (too smooth, ignores local patterns)
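A minimal KNN classifier makes the lazy-learner idea concrete: training just stores the examples, and all the work happens at prediction time. The 2-D points and labels below are invented for illustration:

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Predict the majority label among the k training points nearest to query."""
    # "Training" already happened: train is just the stored examples.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D training points: cluster "A" near the origin, "B" far away
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_predict(train, (2, 2), k=3))  # 'A' -- all 3 nearest neighbors are A
```

With k=1 the prediction tracks the single closest point (sensitive to noise); raising k toward the size of the training set smooths the decision until it ignores local patterns, matching the small-k/large-k cards above.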
Ensemble methods
combine multiple models (classifiers) to make predictions.
The idea: instead of relying on one model, we train several and aggregate their outputs (e.g., by majority vote or averaging).
This usually improves accuracy and reduces overfitting.
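Aggregation by majority vote is the simplest ensemble step; the model outputs below are hypothetical stand-ins for three trained classifiers:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one predicted label per model into a single ensemble prediction."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical classifiers disagree on the same example
model_outputs = ["spam", "spam", "ham"]
print(majority_vote(model_outputs))  # 'spam' -- two votes beat one
```

For numeric predictions the same idea uses averaging instead of voting.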