Arthur Samuel
1959 - defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed
His checkers-playing program was able to learn without being explicitly programmed, using minimax search, static evaluation, and rote learning
Machine learning
“the construction and study of algorithms that can learn from data”
“A program learns from experience E with respect to a task T and performance measure P if its performance at T, as measured by P, improves with experience E”
Build a model based on inputs and use that to make predictions, decisions, rules
Split into supervised, unsupervised, and reinforcement learning
Steps of machine learning
Data gathering
Data pre-processing
Choose a model
Train the model
Fine-tune the model
Apply the model
Supervised Learning
Assumes input data that includes known answers/responses (labels). Seeks to predict the responses for new data.
Examples: Regression problem, classification problem
Unsupervised Learning
No labeled answers are provided. Instead, the learning program must find structure or interesting patterns/associations within the given data
Examples: clustering, association
Eager Learning
Constructs a general learning model before receiving a query
Less dependent on training data during prediction
Takes more time training, less time predicting
More memory usage during training, less during prediction
Example: Decision tree, Naive Bayes, Neural network
Lazy Learning
Stores training data (or carries out only minor processing) and waits until given a new query
Relies heavily on training data during prediction
Takes less time training, more time predicting
Less memory usage during training, more during predicting
Example: k-nearest neighbor
Decision Tree
Fast learning algorithms
Tree structure, each node is labeled with a test/question, each branch with possible answers, leaf nodes with a decision/solution
Program traverses tree to reach a leaf node (with a decision)
Supervised learning, eager learning
Attribute (feature descriptor)
Internal node of decision tree
Attribute value (feature descriptor value)
Branch of decision tree
Classification (decision tree)
Leaf node of decision tree
Decision tree learning algorithm
Top-down induction
Find the smallest tree possible
Choose best attribute
Extend tree by adding a new node for this attribute and a new branch for each attribute value
Sort training examples down through this node to the current leaves
If the training examples are unambiguously classified then stop
Otherwise repeat from the top
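A minimal Python sketch of this top-down induction loop (ID3-style). It assumes examples are dicts of attribute values with class labels in a parallel list, and uses the entropy/information-gain formulas from the cards below; the names and data layout are illustrative, not from the lecture.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, labels, attributes):
    # stop: examples unambiguously classified, or no attributes left (report the majority)
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # choose the best attribute (highest information gain) and extend the tree
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    node = {"attribute": best, "branches": {}}
    for value in {ex[best] for ex in examples}:
        keep = [i for i, ex in enumerate(examples) if ex[best] == value]
        node["branches"][value] = build_tree(
            [examples[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attributes if a != best],
        )
    return node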
Occam’s Razor
The most likely hypothesis is the simplest one that is consistent with all observations
Attributed to William of Ockham (c. 1285-1349)
Best attribute
Generally, the attribute that makes the most difference to the classification of an example
Specifically, the attribute that achieves the highest information gain
Claude Shannon
Founded information theory - the mathematical theory of data communication and storage (1948)
Entropy model
Understanding entropy
A computational measure of the impurity of the elements in a set
The “uncertainty” associated with randomly picking a certain element from a set
Large probability (low uncertainty) → low entropy values
Small probability (high uncertainty) → large entropy values
Entropy formula H(X)
H(X) = −∑_{i=1}^{n} P(X=v_i) · log₂ P(X=v_i)
P(X=v_i) is the probability of value v_i
−log₂ P(X=v_i) is the uncertainty (information content) of v_i
v_i ranges over the n possible values (classes) of X
Unit - bits
Information Gain formula
Gain(S, D) = H(S) − ∑_{V∈D} (|V|/|S|) · H(V), where D is the set of subsets of S produced by splitting on an attribute
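A small worked example in Python with illustrative counts (not from the lecture): a set S of 9 positive and 5 negative examples, split by some attribute into subsets containing 6+/2− and 3+/3−.

import math

h_s = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)   # H(S) ≈ 0.940 bits
h_1 = -(6/8) * math.log2(6/8) - (2/8) * math.log2(2/8)       # H(subset 1) ≈ 0.811 bits
h_2 = -(3/6) * math.log2(3/6) - (3/6) * math.log2(3/6)       # H(subset 2) = 1.000 bit
gain = h_s - (8/14) * h_1 - (6/14) * h_2                     # Gain ≈ 0.048 bits
print(h_s, gain)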
Noise (DT problem)
Problem: There are no attributes left, but there are both positive and negative examples remaining. Attributes do not give enough information to classify the examples
Solution: 1) Leaf nodes report majority classification. 2) Report estimated classification probabilities
Overfitting (DT problem)
Problem: the algorithm uses irrelevant attributes to make spurious distinctions among examples.
Solution: 1) Estimate the likelihood that the gain observed for a particular example set size is due to chance. 2) Cross-validation: set aside part of the training set to use as a test set to see how the current model will work with unseen data
Missing Data (DT problem)
Problem: not all attributes may be known for every example…how to classify?
Solution: 1) Assign a value for the empty attribute, perhaps using the average value in all the training examples. 2) Assume that the attribute has all possible values and proceed down all branches. Weigh each procession by the relative frequency of that procession in the other examples.
Multi-Valued Data (DT problem)
Problem: attributes with large numbers of possible values can give unfairly large gain values, leading to a useless tree
Solution: Modify the gain formula to get the Gain Ratio formula
Gain Ratio formula
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = −∑_{i=1}^{c} (|S_i|/|S|) · log₂(|S_i|/|S|), where S_1,…,S_c are the subsets of S produced by the c values of attribute A
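A short sketch of these two formulas, assuming the gain value has already been computed as above and only the subset sizes |S_i| of the split are needed; the numbers in the last line are illustrative.

import math

def split_information(subset_sizes):
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    si = split_information(subset_sizes)
    return gain / si if si > 0 else 0.0

print(gain_ratio(0.048, [8, 6]))   # the gain is penalized by how finely the attribute splits S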
Continuous Values (DT problem)
Problem: some attributes can have an infinite number of values, and it is unlikely that any two examples will have the exact same value
Solution: Discretization - Split the value range up into manageable chunks. Preprocess the example data to determine which ranges are best in terms of the classifications they support
Costly Attributes (DT problem)
Problem: testing certain attributes may carry a substantial cost
Solution: Add a cost term to the attribute selection measure so that cost is considered during tree building.
e.g., Gain(S, A)² / Cost(A)
Naïve Bayes
An eager-learning probabilistic classifier based on Bayes Rule
Takes linear time, so easily scalable
Bayes Theorem
Thomas Bayes (1701-1761)
P(c|x)=P(x|c)P(c)/P(x)
The probability of target class c being true, given attribute x is true = the likelihood of x given c, times the prior probability of c, divided by the probability of x
Naïve Bayes Assumptions
Assume…
attributes are conditionally independent
attributes are weighted equally towards the outcome
Thus, P(c|X) ∝ P(x_1|c) · P(x_2|c) · … · P(x_n|c) · P(c)
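A minimal categorical Naive Bayes sketch following the product rule above, with simple add-one smoothing so an unseen attribute value does not zero out the product; the data layout and names are illustrative.

from collections import Counter, defaultdict

def train(examples, labels):
    priors = Counter(labels)                          # class counts, for P(c)
    cond = defaultdict(Counter)                       # (class, attribute) -> value counts
    for ex, c in zip(examples, labels):
        for attr, value in ex.items():
            cond[(c, attr)][value] += 1
    return priors, cond

def classify(query, priors, cond):
    total = sum(priors.values())
    best_class, best_score = None, 0.0
    for c, prior_count in priors.items():
        score = prior_count / total                   # P(c)
        for attr, value in query.items():             # multiply in each P(x_i | c)
            counts = cond[(c, attr)]
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)   # add-one smoothing
        if score > best_score:
            best_class, best_score = c, score
    return best_class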
k-Nearest Neighbor
A lazy-learning algorithm that classifies a query based on its similarity to stored training instances
Can represent feature space with one dimension for each descriptive feature
Euclidean Distance (k-NN)
Assuming two instances a and b in an m-dimensional feature space…
Euclidean(a, b) = sqrt( ∑_{i=1}^{m} (a_i − b_i)² )
“straight-line distance”
Manhattan distance (k-NN)
Assuming two instances a and b in an m-dimensional feature space…
Manhattan(a, b) = ∑_{i=1}^{m} |a_i − b_i|
“City-block distance”
Minkowski distance (k-NN)
Assuming two instances a and b in an m-dimensional feature space…
Minkowski(a, b) = [ ∑_{i=1}^{m} |a_i − b_i|^p ]^(1/p)
If p=1, then Manhattan
If p=2, then Euclidean
Can define an infinite number of distance metrics
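A tiny sketch: the Minkowski distance with Manhattan (p=1) and Euclidean (p=2) as special cases; the example points are illustrative.

def minkowski(a, b, p=2):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski((0, 0), (3, 4), p=2))   # 5.0  Euclidean, "straight-line"
print(minkowski((0, 0), (3, 4), p=1))   # 7.0  Manhattan, "city-block"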
Voronoi Tessellation
Divide a feature space into local neighborhoods of classes. Each line segment is equidistant between two points of opposite classes
Why “k” nearest neighbors?
Nearest Neighbor classification is sensitive to noise such as mis-labeled data. Instead have the “k” nearest neighbors vote to classify the query.
How to choose “k”?
A process called parameter tuning
Too small → sensitive to noise
Too large → less precise
Thus, cross-validate using training and test data
Choose k < sqrt(n), where n is the number of training examples, and choose an odd number if dealing with an even number of classes
kNN algorithm
k is number of nearest neighbors, S is the set of instances, Q is the new query
Calculate the similarity of Q to all the instances in S
Rank the instances in S by their similarity to Q
Identify the set of k nearest neighbors
Report the majority vote of this set
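A minimal sketch of these four steps in Python, using Euclidean distance as the similarity measure; the instances, labels, and query in the last line are illustrative.

import math
from collections import Counter

def knn_classify(query, instances, labels, k=3):
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # steps 1-2: compute the distance of Q to every instance in S and rank
    ranked = sorted(zip(instances, labels), key=lambda pair: dist(query, pair[0]))
    # step 3: take the k nearest neighbors; step 4: report the majority vote
    neighbors = [label for _, label in ranked[:k]]
    return Counter(neighbors).most_common(1)[0][0]

print(knn_classify((1.0, 1.0), [(0, 0), (1, 2), (5, 5), (6, 5)], ["a", "a", "b", "b"]))   # "a"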
Scaling (kNN problem)
Problem: distance measures are dominated by one of the attributes
Solution: normalize/scale the attributes so they have comparable ranges (making the feature space roughly square)
Missing values (kNN problem)
Problem: Missing feature values can lead to misclassified examples
Outliers (kNN problem)
Problem: outliers create individual spaces that are separated from the rest of their class
Solution: dilute the dependence of the algorithm on individual instances (increase k)
Clustering Algorithm
A kind of unsupervised learning algorithm that aims to find homogeneous subgroups of data points with similar characteristics, given a finite set of data points.
The end goal of the clustering task is to generate/find a single feature that indicates the cluster that an instance belongs to.
Use cases: Customer segmentation, urban planning, seismology
k-means clustering
Goal: take a dataset of instances D and divide the instances d_1,…,d_n in this dataset into k disjoint clusters. k is an input to the algorithm and each instance can only belong to one cluster. The algorithm aims to minimize:
∑_{i=1}^{n} min_{c_j ∈ {c_1,…,c_k}} Dist(d_i, c_j), where c_1,…,c_k are the centroids of the k clusters and Dist is a distance measure used to compare instances to centroids.
Computationally efficient
k-means clustering approach
Data preprocessing: convert dataset into numerical values if not already
Initialize K and centroids
Can choose cluster centroids (seeds) at random, or divide dataset into k portions and pick one from each portion
Assign clusters to datapoints using some distance metric
Calculate the distance between the datapoint and all the centroids, and assign the datapoint to the nearest centroid
Update centroids
Update cluster centroid to be the mean of all the datapoints within that cluster
Stopping criterion
Perform the assignment and update steps iteratively until some stopping criterion is met, such as…
Clusters stop changing, centroids remain stable, distance from datapoints to their centroid is minimum, fixed number of iterations have been reached
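A bare-bones k-means loop following the steps above: random seeds, Euclidean distance, stop when the assignments no longer change. The names are illustrative; in practice a library implementation (e.g. scikit-learn's KMeans) would be used.

import math, random

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)                  # seeds chosen at random
    dist = lambda p, c: math.sqrt(sum((x - y) ** 2 for x, y in zip(p, c)))
    assignment = None
    for _ in range(max_iters):
        # assign each datapoint to its nearest centroid
        new = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new == assignment:                             # stopping criterion: clusters stable
            break
        assignment = new
        # update each centroid to the mean of the datapoints in its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment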
Inertia Score (k-means clustering)
Tells how far the data points in a cluster are from their cluster's centroid. Want small values. Score ranges over [0, ∞)
Measured using Within-Cluster-Sum-of-Squares
WCSS = ∑_{all centroids C} ( ∑_{all points p in C's cluster} distance(p, C)² )
Silhouette Width (k-means clustering)
A score for how well an individual data point is clustered. Considers both inter-cluster and intra-cluster distances. Ranges [-1,1]. A good score is closer to 1.
s(i) = [b(i) − a(i)] / max(a(i), b(i))
a(i) is the average distance from point i to the other points within its own cluster
b(i) is the average distance from point i to the points in the nearest neighboring cluster
How to determine k? (k-means clustering)
plot data and see if there is a pattern
try different values of k and choose the one that gives the best performance
Elbow method
Silhouette method
Elbow method (k-means clustering)
Want small inertia and small k. Inertia decreases with increasing k.
Calculate inertia score for a range of k’s, choose the “elbow point” - the point where the change in inertia with increasing k becomes less significant.
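A short sketch of the elbow method, assuming scikit-learn is available and X is a numeric feature matrix: fit k-means for a range of k values, collect the inertia (WCSS) scores, and pick the k where the curve bends.

from sklearn.cluster import KMeans

def inertia_curve(X, k_values=range(1, 11)):
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in k_values}

# plot inertia_curve(X) against k and choose the "elbow" point by eye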
Silhouette Method (k-means clustering)
Want silhouette width close to 1 and small k.
Calculate silhouette width for a range of k's. Choose the one with widths closer to 1 and fewer outliers.
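A matching sketch for the silhouette method, again assuming scikit-learn and a numeric feature matrix X: compute the mean silhouette width for each candidate k and prefer a small k whose score is close to 1.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_curve(X, k_values=range(2, 11)):       # silhouette needs at least 2 clusters
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores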
Association Rule Mining
A rule-based machine learning method that finds relationships hidden in large datasets
Finds associations between any and all attributes
Association Rules
Strong rules that define association between attributes in a dataset.
If the antecedent is found in the data, then the consequent is also likely to be found
Antecedent → Consequent
X → Y
Market Basket Analysis
Proposed by Agrawal, Imielinski, and Swami in 1993.
Analyze transaction data and identify association rules using measures of “interestingness”
Antecedent (association rule mining)
Something which is found in the data
“if”
Consequent (association rule mining)
Something which occurs when the antecedent occurs
“then”
Itemset (association rule mining)
The set of all the items that appear in the antecedent and the consequent
Association Rule Mining Issues
Patterns can occur by chance
Solution: Support, confidence, and lift
Rule mining from large data-sets can be computationally expensive
Solution: Learn only strong rules; use the Apriori Algorithm
Support (association rule mining)
Indicates how frequently an item/itemset appears in the dataset. Low support for a rule means there is not enough information; no conclusions can be drawn.
Ex. % of customers that purchased milk+cheese
Support(X) = Frequency(X) / Total number of entries in the dataset
Support(X→Y) = Frequency(X & Y) / Total number of entries in the dataset
Confidence (association rule mining)
Indicates how often a rule is found to be true in the dataset
Ex. Of the customers that bought milk, the % that also bought cheese
Confidence(X→Y) = Support(X→Y) / Support(X)
Lift (association rule mining)
Indicates how the antecedent and consequent are related to one another, i.e., the rise in probability of the occurrence of Y when X has already occurred.
Lift(X→Y) = Support(X→Y) / [Support(X) * Support(Y)]
Lift = 1, X and Y are independent of one another
Lift < 1, negatively correlated
Lift > 1, positively correlated
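A small sketch of the three measures over a list of transactions (each a set of items); the example baskets are illustrative.

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / (support(transactions, antecedent) * support(transactions, consequent))

baskets = [{"milk", "cheese"}, {"milk", "bread"}, {"milk", "cheese", "bread"}, {"bread"}]
print(support(baskets, {"milk", "cheese"}))        # 0.5
print(confidence(baskets, {"milk"}, {"cheese"}))   # ~0.67
print(lift(baskets, {"milk"}, {"cheese"}))         # ~1.33, positively correlated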
Good rules have… (association rule mining)
Support as high as possible
Confidence close to 1
Lift higher than 1
Apriori Algorithm
For frequent itemset mining and association rule mining from large databases.
Generally: identify the frequent individual items in the dataset and extend them to larger and larger itemsets, so long as those itemsets appear sufficiently often in the database.
Anti-Monotone Property
If a set cannot pass a test, all of its supersets will fail the same test.
i.e., All subsets of a frequent itemset must be frequent. All supersets of an infrequent itemset must be infrequent.
Apriori Algorithm Pseudocode
Dataset D
Apriori(D, min_sup, min_conf)
k = 1
generate the frequent itemsets of length k (support ≥ min_sup)
repeat until no new frequent itemsets are found:
    generate length (k+1) candidates from the length-k frequent itemsets
    prune candidate itemsets containing any length-k subset that is infrequent (< min_sup)
    count the support of each remaining candidate by scanning D
    eliminate infrequent candidates, leaving only those that are frequent
    k++
generate rules from the frequent itemsets, keeping those with confidence ≥ min_conf
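A compact Python sketch of the frequent-itemset part of this pseudocode (rule generation omitted), assuming transactions are sets of items and min_sup is an absolute count; the names are illustrative.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_sup):
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    current = {c: n for c, n in count({frozenset([i]) for i in items}).items() if n >= min_sup}
    frequent = dict(current)
    k = 1
    while current:
        # join length-k frequent itemsets into length-(k+1) candidates, then prune any
        # candidate with an infrequent length-k subset (anti-monotone property)
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        candidates = {c for c in candidates if all(frozenset(s) in current for s in combinations(c, k))}
        current = {c: n for c, n in count(candidates).items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent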