CSC422

Last updated 8:11 AM on 10/9/25

76 Terms

1. Big Data
Massive amounts of data generated daily from devices, sensors, and online activity.

2. Data Mining
The process of finding useful patterns or knowledge from large datasets using algorithms.

4. Knowledge
Actionable insights or information derived from data that help make decisions.

5. Learning Algorithm
A method used by computers to learn patterns or relationships from data.

6. Data Mining Pipeline
Input Data → Data Preprocessing → Data Mining → Post Processing → Information (the final useful knowledge you can act on)
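The pipeline stages can be sketched as a chain of plain functions. This is a minimal illustration; the stage functions (`preprocess`, `mine`, `postprocess`) and the toy dataset are assumptions for the example, not part of the course material.

```python
# Minimal sketch of the data mining pipeline as a chain of functions.
# Each stage name mirrors the pipeline above; the implementations are
# illustrative placeholders, not a real toolkit.
from collections import Counter

def preprocess(raw):
    """Data Preprocessing: drop records containing missing values."""
    return [row for row in raw if None not in row]

def mine(clean):
    """Data Mining: find the single most frequent attribute value."""
    counts = Counter(value for row in clean for value in row)
    return counts.most_common(1)[0]

def postprocess(pattern):
    """Post Processing: turn the raw pattern into readable information."""
    value, freq = pattern
    return f"most frequent value: {value} (seen {freq} times)"

raw_data = [("red", 3), ("blue", 3), (None, 7), ("red", 5)]
info = postprocess(mine(preprocess(raw_data)))
```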

7. Data Subsetting
Using a portion of the full dataset for analysis.

8. Supervised Learning
Learning from labeled data to predict outcomes. (Data mining task)

9. Unsupervised Learning
Finding hidden patterns in unlabeled data. (Data mining task)

10. Data Object
A single item or record in a dataset (a row).

11. Attribute
A property or characteristic of an object (a column).

12. Distinctness
Attribute property: whether values can be told apart (=, ≠).

13. Order
Attribute property: whether values can be ranked (<, >).

14. Addition
Attribute property: whether differences between values are meaningful (+, −).

15. Multiplication
Attribute property: whether ratios between values are meaningful (×, ÷).

16. Nominal
Categories with names only; no order or numeric meaning.
Examples: ZIP code, Color, ID

17. Ordinal
Ordered categories; ranking matters, but the gaps between values are not consistent or measurable.
Examples: Grades, {Good, Better, Best}, Rank

18. Interval
Differences are meaningful, but there is no true zero.
Examples: Dates, °C, °F

19. Ratio
Differences and ratios are meaningful; has a true zero.
Examples: Age, Height, Weight, Money

20. Distinctness applies to
Nominal, Ordinal, Interval, Ratio

21. Order applies to
Ordinal, Interval, Ratio

22. Addition applies to
Interval, Ratio

23. Multiplication applies to
Ratio
24. Mode
Most common value; applies to all attribute types.

25. Median
Middle value (robust to outliers); Ordinal, Interval, Ratio.

26. Mean (and weighted mean)
Sum ÷ count; Interval and Ratio only.

27. Range
Max − min; Interval, Ratio.

28. Variance (s²)
Average squared distance from the mean; Interval, Ratio.

29. Standard Deviation (s)
Square root of variance; Interval, Ratio.

30. Median Absolute Deviation
Median of absolute differences from the median; Interval, Ratio.
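All of the summary statistics above can be computed with Python's standard library; the sample values below are made up for illustration.

```python
# Summary statistics from the cards above, using the stdlib statistics module.
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]

mode = statistics.mode(values)        # most common value
median = statistics.median(values)    # middle value (robust to outliers)
mean = statistics.mean(values)        # sum / count
rng = max(values) - min(values)       # range: max - min
var = statistics.pvariance(values)    # average squared distance from the mean
std = statistics.pstdev(values)       # square root of the variance
# median absolute deviation: median of |value - median|
mad = statistics.median(abs(v - median) for v in values)
```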

31. Discrete Data
Finite or countable values (integers). Examples: IDs, counts, ZIP codes.

32. Continuous Data
Real-valued, with infinitely many possible values (decimals). Examples: height, temperature, age.

33. Noise
Random errors in data (e.g., sensor error, distortion).
➡ Fix: visualize, remove noisy attributes, avoid overfitting.

34. Outlier
Value far from the others; a possible error or anomaly.
➡ Fix: detect, remove if irrelevant, use median-based statistics.

36. Nominal
Valid statistics: Mode, Entropy, χ²

37. Ordinal
Valid statistics: Median, Mode, Rank tests

38. Interval
Valid statistics: Mean, Std. Dev., Correlation, z-score

39. Ratio
Valid statistics: Mode, Median, Entropy, Std. Dev., Correlation, z-score, Rank tests, Geometric/Harmonic means (everything!)

40. Data Preprocessing
Steps to prepare raw data for analysis: sampling, feature selection, dimensionality reduction, feature creation, discretization, transformation.

41. Discretization
Converting continuous values into categories (e.g., age → “young,” “middle,” “old”).
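Discretization can be as simple as hand-chosen cut points. A minimal sketch using the age example from the card; the bin edges (30 and 60) are illustrative assumptions.

```python
# Discretization: map a continuous attribute (age) into categories.
# The cut points 30 and 60 are arbitrary choices for this example.
def discretize_age(age):
    if age < 30:
        return "young"
    elif age < 60:
        return "middle"
    return "old"

ages = [12, 25, 34, 58, 61, 80]
labels = [discretize_age(a) for a in ages]
# labels -> ["young", "young", "middle", "middle", "old", "old"]
```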

42. Data Bias
Systematic errors caused by unrepresentative samples or flawed data sources.

43. Sampling
Selecting a subset of data to analyze when the full dataset is too large or costly.

44. Representative Sample
A sample that accurately reflects the population’s key properties.

45. Simple Random Sampling
Every item has an equal chance of selection; may miss rare cases.

46. Stratified Sampling
Sampling from each subgroup to ensure all are represented.

47. Progressive Sampling
Start small and increase the sample size until results stabilize.
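The contrast between simple random and stratified sampling can be sketched with the standard library. The toy population (a rare "spam" class inside a large "ham" class) and the `stratified_sample` helper are assumptions for the example.

```python
# Simple random vs. stratified sampling on an imbalanced toy population.
import random

population = [("spam", i) for i in range(5)] + [("ham", i) for i in range(95)]

# Simple random sampling: every item has an equal chance, so the tiny
# "spam" group may be missed entirely.
simple = random.sample(population, 10)

def stratified_sample(items, key, frac):
    """Sample a fraction from each subgroup (at least one item per group)."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    out = []
    for members in groups.values():
        k = max(1, round(len(members) * frac))
        out.extend(random.sample(members, k))
    return out

# Stratified sampling guarantees the rare class appears in the sample.
strat = stratified_sample(population, key=lambda x: x[0], frac=0.1)
```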

48. Sample Size
Must be large enough to capture the patterns in the population.

49. Survivorship Bias
Only analyzing entities that “survived” over time (e.g., currently existing companies).

50. Lookahead Bias
Using future or modern knowledge to influence analysis of past data.

51. Feature Selection
Choosing the most useful attributes to improve model performance and reduce dimensionality. Reduces noise, speeds up learning, prevents overfitting, improves accuracy.

52. Redundant Feature
Duplicates information found in other attributes (e.g., price and sales tax).

53. Irrelevant Feature
Adds no useful information for prediction (e.g., student ID when predicting GPA).

54. Curse of Dimensionality
Too many attributes → sparse data → harder to find meaningful patterns.

55. Dimensionality
The number of features (attributes) in a dataset.

56. Principal Component Analysis (PCA)
Reduces the number of features while keeping most of the important information. Finds new axes (principal components) that capture most of the variance in the data, compressing many correlated features into a few powerful ones that still describe the data well.
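A small sketch of PCA via the eigenvectors of the covariance matrix, using NumPy. The synthetic two-feature dataset (second feature roughly 2× the first) is an assumption chosen so one component captures nearly all the variance.

```python
# PCA sketch: center the data, diagonalize the covariance matrix, and
# project onto the top eigenvector (first principal component).
import numpy as np

gen = np.random.default_rng(0)
x = gen.normal(size=100)
# Two strongly correlated features -> most variance lies along one axis.
X = np.column_stack([x, 2 * x + gen.normal(scale=0.1, size=100)])

Xc = X - X.mean(axis=0)                 # center each feature
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort components by variance captured
components = eigvecs[:, order]

# Project onto the first principal component: 2 features -> 1 feature.
reduced = Xc @ components[:, :1]
explained = eigvals[order][0] / eigvals.sum()
```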

57. Precision
Of the items you predicted positive (TP + FP), how many really were positive (TP)? Maximize this when false alarms are costly (e.g., spam detection).

58. Recall
Of the items that really were positive (TP + FN), how many did you actually find (TP)? Maximize this when missing cases is costly (e.g., oil spill or cancer detection).
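Both metrics follow directly from the confusion-matrix counts. The spam-filter numbers below are made up for illustration.

```python
# Precision and recall from confusion-matrix counts.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    return precision, recall

# Illustrative example: a spam filter flags 50 emails; 40 really are spam
# (TP), 10 are false alarms (FP), and it misses 10 other spam emails (FN).
p, r = precision_recall(tp=40, fp=10, fn=10)
# p -> 0.8, r -> 0.8
```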

59. Underfitting
Model is too simple → high error on both training and testing data.
(Like drawing a straight line through a wavy dataset.)

60. Overfitting
Model is too complex → perfect on training data but fails on test data.
(Like drawing a squiggly line that passes through every training point.)

61. Noise and insufficient data
The two major causes of overfitting.

62. Holdout Method
A model evaluation technique: split the dataset into separate training and testing sets, train the model on the training set, and evaluate its performance on the unseen testing set (e.g., a 70/30, 60/40, or 50/50 split chosen randomly).
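A minimal 70/30 holdout split with the standard library; the stand-in dataset of 100 records and the fixed seed are assumptions for the example.

```python
# Holdout method: shuffle, then split 70/30 into train and test sets.
import random

data = list(range(100))  # stand-in dataset of 100 records
random.seed(42)          # fixed seed so the random split is reproducible
random.shuffle(data)

split = int(0.7 * len(data))
train, test = data[:split], data[split:]
# The model would be trained on `train` and evaluated only on `test`.
```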

63. Repeated Resampling
Repeat the holdout process several times and average the results.

64. Stratified Sampling (for evaluation)
Keep class proportions consistent in train/test splits (important for imbalanced data).

65. Bootstrap
Sampling with replacement to create multiple training sets.
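Sampling with replacement means some records repeat in the bootstrap sample while others are left out entirely (the "out-of-bag" records). A stdlib sketch on a made-up 10-record dataset:

```python
# Bootstrap: draw a training set of the same size *with replacement*.
import random

random.seed(0)
data = list(range(10))

# One bootstrap sample: each draw picks uniformly from the full dataset,
# so duplicates are expected.
boot = [random.choice(data) for _ in data]

# Records never drawn form the "out-of-bag" set, usable for evaluation.
out_of_bag = [x for x in data if x not in boot]
```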

66. Hyperparameters
Settings you choose before training (not learned from the data). They control model complexity: too high = overfitting, too low = underfitting. Examples:

  • Decision tree max depth

  • Number of clusters in K-Means

  • Learning rate in neural networks

  • Polynomial degree in regression

67. Eager Learners
(like decision trees) Build a model first using all the training data.

  • Learning = slow

  • Predicting new data = fast

68. Lazy Learners
(like KNN) Don’t build a model until prediction time.

  • Learning = fast (just store the data)

  • Predicting = slow (must search all stored data for the nearest neighbors)

69. KNN
An instance-based (example-based) classifier:

  • It stores all training examples.

  • When a new example arrives, it compares it to the stored cases.

  • It predicts the class of the new example from similar past cases.
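The three steps above fit in a few lines of stdlib Python. The tiny two-cluster training set is an assumption for the example.

```python
# Minimal KNN sketch: store the training data, then at prediction time
# vote among the k nearest stored examples (Euclidean distance).
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    dists = sorted(
        (math.dist(features, query), label) for features, label in train
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
label = knn_predict(train, (2, 2), k=3)
# label -> "A" (all three nearest neighbors belong to class A)
```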

70. Rote Learner
Memorizes the data exactly. To classify a new case, it looks for an exact match.

71. Nearest Neighbor
Looks for the closest (not identical) examples using a distance metric.

72. Small k
May overfit (too sensitive to noise).

73. Large k
May underfit (too smooth; ignores local patterns).

74. Ensemble Methods

  • Combine multiple models (classifiers) to make predictions.

  • The idea: instead of relying on one model, train several and aggregate their outputs (e.g., by majority vote or averaging).

  • This usually improves accuracy and reduces overfitting.
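Majority-vote aggregation can be sketched in a few lines. The three threshold "classifiers" below are illustrative stand-ins, not real trained models.

```python
# Ensemble by majority vote: combine several classifiers' predictions.
from collections import Counter

def majority_vote(classifiers, x):
    predictions = [clf(x) for clf in classifiers]
    return Counter(predictions).most_common(1)[0][0]

# Three toy "models" that disagree on borderline inputs.
clf1 = lambda x: "pos" if x > 3 else "neg"
clf2 = lambda x: "pos" if x > 5 else "neg"
clf3 = lambda x: "pos" if x > 4 else "neg"

vote = majority_vote([clf1, clf2, clf3], 4.5)
# 4.5 -> predictions ["pos", "neg", "pos"] -> ensemble says "pos"
```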

