Chapter 7: Machine Learning (Supervised) & Chapter 8: Unsupervised Learning

22 Terms

What is machine learning?

Machine learning is a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.

What are the two types of machine learning?

Supervised
Unsupervised

What is classification/prediction?

Classification/prediction is analogous to how humans learn from past experience.
A computer has no “experience”, so it learns from data, which represents the “past experiences” of an application domain.

What is the general flow process of supervised learning?

i) Training text, documents, images, sounds, … → feature vectors + labels →

ii) Machine learning algorithm →

iii) Predictive model

iv) New text, document, image, sound → feature vector → predictive model (iii) → expected label
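
The flow above can be sketched in a few lines. This is only an illustration under stated assumptions: the bag-of-words features, the 1-nearest-neighbour "model", and the spam/ham example data are all hypothetical choices, not something the chapter prescribes.

```python
from collections import Counter

def to_features(text):
    """Turn raw text into a bag-of-words feature vector (word -> count)."""
    return Counter(text.lower().split())

def distance(a, b):
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a[w] - b[w]) ** 2 for w in set(a) | set(b))

def train(texts, labels):
    """'Training' for 1-nearest-neighbour: store feature vectors with their labels."""
    return [(to_features(t), y) for t, y in zip(texts, labels)]

def predict(model, text):
    """Featurize new input and return the label of the closest training example."""
    x = to_features(text)
    _, label = min(model, key=lambda pair: distance(pair[0], x))
    return label

# i) training data -> feature vectors + labels (made-up examples)
train_texts = ["cheap pills buy now", "meeting agenda for monday", "buy cheap watches now"]
train_labels = ["spam", "ham", "spam"]

# ii) + iii) learning algorithm -> predictive model
model = train(train_texts, train_labels)

# iv) new input -> feature vector -> model -> expected label
prediction = predict(model, "agenda for the monday meeting")
print(prediction)
```

The new sentence shares most of its words with the "ham" training example, so that example is its nearest neighbour.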

Where can datasets be retrieved?

Public datasets
Data marketplace
Company and organization datasets
Web scraping

What are the activities carried out in preprocessing?

Data Cleaning - Handle missing values by either removing the corresponding samples or filling them in with techniques such as mean, median, or mode imputation. Address outliers with Winsorization or outlier imputation.

Data Integration - If you have multiple data sources or datasets, you may need to integrate them into a single dataset. This typically involves handling inconsistencies in attribute names, resolving conflicts in data formats, and merging the data based on common identifiers.

Data Transformation - Common transformations include scaling numerical features to a similar range (e.g., using normalization or standardization), encoding categorical variables into numerical representations (e.g., one-hot encoding, or integer codes such as front-end development → 1, other development → 2), and transforming skewed distributions.

Imbalanced Data - Techniques such as oversampling the minority class, undersampling the majority class, or using advanced algorithms like SMOTE (Synthetic Minority Over-sampling Technique) can be employed to address the imbalance.

Time-Series Data - Handling missing or irregular timestamps, resampling or interpolating the data to a regular time interval, and creating lag features or rolling windows for capturing temporal patterns.
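
A minimal sketch of three of the steps above (mean imputation, min-max normalization, one-hot encoding), using only the standard library. The helper names and the sample `ages` and `roles` values are hypothetical, chosen just for illustration.

```python
from statistics import mean

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Normalize values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Encode each category as a 0/1 vector with one slot per distinct value."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

ages = [20, None, 40, 30]                  # one missing value
ages_filled = impute_mean(ages)            # None becomes the mean of 20, 40, 30
ages_scaled = min_max_scale(ages_filled)   # smallest -> 0.0, largest -> 1.0

roles = ["front-end", "other", "front-end"]
roles_encoded = one_hot(roles)             # columns: ["front-end", "other"]
```

One-hot encoding avoids the artificial ordering that integer codes (1, 2, …) would impose on categories that have no natural order.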

What are the major types of machine learning algorithms?

Classification - Predicts categorical (nominal) labels

Regression - Predicts continuous values

What are the types of classification methods?

K Nearest Neighbour
Decision Tree
Support Vector Machine
Bayesian Classification

What are the conditions for stopping partitioning?

All samples for a given node belong to the same class
There are no remaining attributes for further partitioning - majority voting is employed for classifying the leaf
There are no samples left

What is entropy?

Entropy is the measure of randomness in a dataset
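
As a sketch, assuming the Shannon entropy commonly used for decision trees, H = -Σ p_i · log2(p_i), where p_i is the fraction of samples in class i. The function name and the sample labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

pure = entropy(["yes", "yes", "yes", "yes"])   # a single class: no randomness
mixed = entropy(["yes", "yes", "no", "no"])    # even two-class split: maximal randomness
```

A node whose samples all share one class has zero entropy, while an even split between two classes has an entropy of one bit.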

What is the aim of a decision tree?

Split the data in a way that the entropy in the data decreases so it is easier to make predictions
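
One common way to measure that decrease is information gain: the entropy before a split minus the weighted entropy of the groups after it. The sketch below assumes Shannon entropy and a tiny made-up dataset; the attribute names `outlook` and `windy` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Made-up training samples: the label depends only on "outlook".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")

# The tree would split on the attribute with the largest gain.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
```

Splitting on `outlook` yields pure groups, so it removes all the entropy, whereas splitting on `windy` leaves each group as mixed as before.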

What are the two approaches to avoid overfitting?

Prepruning: halt tree construction early. Do not split a node if this would result in the goodness measure falling below a threshold.

It is difficult to choose an appropriate threshold.

Postpruning: remove branches from a “fully grown” tree to get a sequence of progressively pruned trees.

Use a set of data different from the training data to decide which is the “best pruned tree”.

What are the main features of Naive Bayesian Classification?

Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.

Incremental: Each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
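
The incremental and probabilistic-prediction points can be illustrated with repeated applications of Bayes' rule. This is a hedged sketch, not a full Naive Bayes classifier: the two coin hypotheses and the 80% heads rate of the biased coin are assumptions made up for the example.

```python
def update(prior, likelihoods):
    """One Bayes-rule step: posterior(h) is proportional to prior(h) * P(observation | h)."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Prior knowledge: two hypotheses about a coin, initially equally likely.
belief = {"fair": 0.5, "biased": 0.5}

# P(heads | hypothesis); the biased coin is assumed to land heads 80% of the time.
p_heads = {"fair": 0.5, "biased": 0.8}

history = [belief]
for _ in range(3):                 # observe three heads in a row
    belief = update(belief, p_heads)
    history.append(belief)         # each observation shifts the probabilities

# Probabilistic prediction: weight each hypothesis's prediction by its probability.
p_next_heads = sum(belief[h] * p_heads[h] for h in belief)
```

Each observed head raises the probability of the "biased" hypothesis a little, and the final prediction blends both hypotheses instead of committing to one.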

What is a confusion matrix?

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

What are the clustering methods?

K-Means
Gaussian Mixture Model
Mean-Shift
Hierarchical Clustering

What is the stopping/convergence criterion for K-Means?

No (or minimal) re-assignment of data points to different clusters, or
no (or minimal) change of centroids, or
minimal decrease in the sum of squared error (SSE) J
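
A minimal sketch of k-means on 2-D points that checks all three criteria. It assumes the standard assign-then-update loop; the function name, the tolerance `tol`, and the sample points are illustrative choices.

```python
def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Basic k-means on 2-D points; returns (centroids, assignments, sse)."""
    assignments = None
    prev_sse = float("inf")
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, new_assignments) if a == k]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        sse = sum((p[0] - new_centroids[a][0]) ** 2 + (p[1] - new_centroids[a][1]) ** 2
                  for p, a in zip(points, new_assignments))
        # The three stopping criteria.
        no_reassignment = new_assignments == assignments
        no_centroid_change = new_centroids == list(centroids)
        small_sse_decrease = prev_sse - sse < tol
        centroids, assignments, prev_sse = new_centroids, new_assignments, sse
        if no_reassignment or no_centroid_change or small_sse_decrease:
            break
    return centroids, assignments, sse

points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, assignments, sse = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
```

On this toy data the second pass produces the same assignments as the first, so the loop stops on the "no re-assignments" criterion.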

What are the limitations of K-Means?

Very sensitive to the initial points. Remedy: do many runs of k-means, each with different initial centroids.
k must be chosen manually. Remedy: learn the optimal k for the clustering.
K-means has problems when clusters have differing sizes, densities, or non-globular shapes.
K-means has problems when the data contains outliers.

What are the advantages and disadvantages of mean shift?

Advantages

Does not assume number of clusters
Just a single parameter (the window size)
Finds variable number of modes
Robust to outliers

Disadvantages

Output depends on window size
Computationally expensive (for a given input size, it requires a relatively large number of steps to complete)
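
A simplified sketch of the idea, assuming 1-D data and a flat (uniform) window, which is only one of several possible kernels. The function name, the `window` parameter, and the sample points are hypothetical.

```python
def mean_shift_1d(points, window, max_iter=100, tol=1e-6):
    """Shift each point to the mean of its neighbours within `window` until it stops moving."""
    modes = []
    for x in points:
        for _ in range(max_iter):
            neighbours = [p for p in points if abs(p - x) <= window]
            new_x = sum(neighbours) / len(neighbours)
            if abs(new_x - x) < tol:
                break
            x = new_x
        # Merge with an already found mode if it lies within the window.
        for m in modes:
            if abs(m - x) <= window:
                break
        else:
            modes.append(x)
    return modes

points = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 20.0]   # two groups plus one far-away point
modes_small = mean_shift_1d(points, window=1.0)  # number of modes is not given in advance
modes_large = mean_shift_1d(points, window=30.0) # a very wide window merges everything
```

The number of modes is discovered, not specified, and the far-away point does not drag the other two modes toward it. Every point is compared with every other point on each shift, which is why the method becomes costly on large datasets, and changing `window` changes the result.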

What are the types of hierarchical clustering?

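Agglomerative (bottom-up) clustering: It builds the dendrogram (tree) from the bottom level, starting with each data point as its own cluster and repeatedly merging the closest clusters.

Divisive (top-down) clustering: It starts with all data points in one cluster, the root, and repeatedly splits clusters into smaller ones.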

What are the cluster evaluation measures based on internal information?

Intra-cluster cohesion (compactness)

Cohesion measures how near the data points in a cluster are to the cluster centroid.
Sum of squared error (SSE) is a commonly used measure.

Inter-cluster separation (isolation)

Separation means that different cluster centroids should be far away from one another.
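
Both measures can be computed directly from the clusters. In this sketch cohesion is the SSE to each cluster's centroid; using the smallest centroid-to-centroid distance as the separation score is an assumption made for illustration, and the two example clusterings are made up.

```python
def centroid(cluster):
    """Mean point of a cluster of 2-D points."""
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)

def sse(clusters):
    """Cohesion: sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        cx, cy = centroid(cluster)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster)
    return total

def min_centroid_distance(clusters):
    """Separation: smallest Euclidean distance between any two cluster centroids."""
    cs = [centroid(c) for c in clusters]
    return min(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
               for i, a in enumerate(cs) for b in cs[i + 1:])

# Two ways of grouping the same four points.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]   # nearby points grouped together
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]    # far-apart points grouped together

good_scores = (sse(good), min_centroid_distance(good))
bad_scores = (sse(bad), min_centroid_distance(bad))
```

The better clustering has the lower SSE (tighter clusters) and the larger centroid distance (clusters further apart).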

Confusion matrix for predicted no, predicted yes, actual no and actual yes

Predicted no - Negative

Predicted yes - Positive

Predicted is actual - True

Predicted is not actual - False

Actual no but predicted yes - False positive

Actual no and predicted no - True negative (True = correct, negative = no)

Actual yes but predicted no - False negative

Actual yes and predicted yes - True positive

What is the formula for accuracy, precision and recall?

Accuracy = (TP + TN) / total

Precision = TP / predicted yes = TP / (TP + FP)

Recall = TP / actual yes = TP / (TP + FN)
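
A small worked example of these formulas. The helper names and the eight sample labels are made up; the sketch assumes there is at least one predicted yes and one actual yes, otherwise precision or recall would divide by zero.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted yes, actual yes
        elif p == positive:
            fp += 1          # predicted yes, actual no
        elif a == positive:
            fn += 1          # predicted no, actual yes
        else:
            tn += 1          # predicted no, actual no
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),   # TP / predicted yes
        "recall": tp / (tp + fn),      # TP / actual yes
    }

actual    = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
scores = metrics(tp, fp, tn, fn)
```

Here TP = 2, FP = 1, TN = 4 and FN = 1, so accuracy is 6/8 while precision and recall are both 2/3.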