Intro to AI Ch. 6 - Machine Learning

61 Terms

1

Arthur Samuel

1959 - defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed

His checkers-playing program was able to learn without being explicitly programmed. It used minimax search, static evaluation, and rote learning

2

Machine learning

“the construction and study of algorithms that can learn from data”

“A program learns from experience E if its performance on the task improves with that experience”

Builds a model from input data and uses it to make predictions, decisions, or rules

Split into supervised, unsupervised, and reinforcement learning

3

Steps of machine learning

  1. Data gathering

  2. Data pre-processing

  3. Choose a model

  4. Train the model

  5. Fine-tune the model

  6. Apply the model
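
A minimal sketch of these six steps in Python, assuming scikit-learn is installed; the iris dataset, the decision-tree model, and the grid-search parameters are illustrative choices, not part of these notes.

```python
# Sketch of the six steps of machine learning, assuming scikit-learn is available.
from sklearn.datasets import load_iris                      # 1. data gathering
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                           # 1. data gathering
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)                      # 2. data pre-processing
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

model = DecisionTreeClassifier(random_state=0)              # 3. choose a model
model.fit(X_train, y_train)                                 # 4. train the model

# 5. fine-tune the model (here: a small grid search over tree depth)
search = GridSearchCV(model, {"max_depth": [2, 3, 4, 5]}, cv=5).fit(X_train, y_train)

# 6. apply the model to unseen data
print(search.best_params_, search.score(X_test, y_test))
```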

4

Supervised Learning

Assumes a known input of data including known answers/responses. Seeks to produce predictions for responses to new data.

Examples: Regression problem, classification problem

5

Unsupervised Learning

No labeled answers are provided. Instead learning program is to find structure or interesting patterns/associations within given data

Examples: clustering, association

6

Eager Learning

Constructs a general learning model before receiving a query

Less dependent on training data during prediction

Takes more time training, less time predicting

More memory usage during training, less during prediction

Examples: decision tree, Naïve Bayes, neural network

7

Lazy Learning

Stores training data (or carries out only minor processing) and waits until given a new query

Relies heavily on training data during prediction

Takes less time training, more time predicting

Less memory usage during training, more during predicting

Example: k-nearest neighbor

8

Decision Tree

A fast learning algorithm

Tree structure: each internal node is labeled with a test/question, each branch with a possible answer, and each leaf node with a decision/solution

The program traverses the tree until it reaches a leaf node (a decision)

Supervised learning, eager learning

9

Attribute (feature descriptor)

Internal node of decision tree

10

Attribute value (feature descriptor value)

Branch of decision tree

11

Classification (decision tree)

Leaf node of decision tree

12

Decision tree learning algorithm

Top-down induction

Find the smallest tree possible

  • Choose best attribute

  • Extend tree by adding a new node for this attribute and a new branch for each attribute value

  • Sort training examples down through this node to the current leaves

  • If the training examples are unambiguously classified then stop

    • Otherwise repeat from the top
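
A compact Python sketch of this top-down induction loop (ID3-style), assuming categorical attributes; each example is a dict mapping attribute name to value, and the helper names (entropy, information_gain, build_tree) are illustrative, not from the course materials.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    # H(S) minus the weighted entropy of the subsets produced by splitting on attr
    n = len(labels)
    remainder = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, labels, attributes):
    # Stop if the examples are unambiguously classified...
    if len(set(labels)) == 1:
        return labels[0]
    # ...or no attributes remain (report the majority class).
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Choose the best attribute and add a new branch for each of its values.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        tree[best][value] = build_tree([examples[i] for i in idx],
                                       [labels[i] for i in idx],
                                       [a for a in attributes if a != best])
    return tree
```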

13

Occam’s Razor

The most likely hypothesis is the simplest one that is consistent with all the observations

Attributed to William of Ockham (c. 1285-1349)

14

Best attribute

Generally, the attribute that makes the most difference to the classification of an example

Specifically, the attribute that achieves the highest information gain

15

Claude Shannon

Founded information theory - the mathematical theory of data communication and storage (1948)

Introduced entropy as a measure of information/uncertainty

16

Understanding entropy

A computational measure of the impurity of the elements in a set

The “uncertainty” associated with randomly picking a certain element from the set

A set dominated by one class (low uncertainty) → low entropy

A set with classes spread evenly (high uncertainty) → high entropy

17

Entropy formula H(X)

H(X) = −Σ_{i=1}^{n} P(X=v_i) ⋅ log₂ P(X=v_i)

P(X=v_i) is the probability of value v_i

−log₂ P(X=v_i) is the uncertainty (surprisal) associated with v_i

v_i are the possible values (classes) of X

Units: bits
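
A tiny Python sketch of this formula, treating X as the class label of the elements in a set; the helper name is illustrative.

```python
from collections import Counter
import math

def entropy(labels):
    """H(X) = -sum_i P(X = v_i) * log2 P(X = v_i), estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["+", "+", "+", "-", "-"]))   # about 0.971 bits for a 3/2 split
print(entropy(["+", "+", "+", "+"]))        # -0.0, i.e. a pure set has zero entropy
```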

18

Information Gain formula

Gain(S, A) = H(S) − Σ_{v∈Values(A)} (|S_v| / |S|) ⋅ H(S_v)

where S_v is the subset of S for which attribute A takes the value v
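
A short sketch of the gain computation for a single attribute; the entropy helper is repeated so the snippet stands alone, and the windy/play data is invented for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = H(S) - sum over values v of A of (|S_v| / |S|) * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

windy = ["no", "no", "yes", "yes", "no", "yes"]   # attribute values for 6 examples
play  = ["+",  "+",  "-",   "-",   "+",  "+"]     # their class labels
print(information_gain(windy, play))              # about 0.459 bits
```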

19

Noise (DT problem)

Problem: There are no attributes left, but there are both positive and negative examples remaining. Attributes do not give enough information to classify the examples

Solution: 1) Leaf nodes report majority classification. 2) Report estimated classification probabilities

20

Overfitting (DT problem)

Problem: the algorithm uses irrelevant attributes to make spurious distinctions among examples.

Solution: 1) Estimate the likelihood that the gain observed for a particular example set size is due to chance. 2) Cross-validation: set aside part of the training set to use as a test set to see how the current model will work with unseen data

21

Missing Data (DT problem)

Problem: not all attributes may be known for every example…how to classify?

Solution: 1) Assign a value for the missing attribute, perhaps the average value over all the training examples. 2) Assume the attribute takes all possible values and proceed down every branch, weighting each branch by the relative frequency of that attribute value among the other examples.

22

Multi-Valued Data (DT problem)

Problem: attributes with large numbers of possible values can give unfairly large gain values, leading to a useless tree

Solution: Modify the gain formula to get the Gain Ratio formula

23

Gain Ratio formula

GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) = −Σ_{i=1}^{c} (|S_i| / |S|) ⋅ log₂(|S_i| / |S|)
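
A small sketch of the split-information penalty; the leading minus sign keeps it non-negative. The helper names and the example values are illustrative.

```python
from collections import Counter
import math

def split_information(values):
    """SplitInformation(S, A) = -sum_i (|S_i| / |S|) * log2(|S_i| / |S|)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    return gain / split_information(values)

# A many-valued attribute (8 distinct values) is penalised more than a binary one:
print(split_information(list(range(8))))         # 3.0 bits
print(split_information(["a"] * 4 + ["b"] * 4))  # 1.0 bit
```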

24

Continuous Values (DT problem)

Problem: some attributes can take an infinite number of values, and it is unlikely that any two examples will have exactly the same value

Solution: Discretization - Split the value range up into manageable chunks. Preprocess the example data to determine which ranges are best in terms of the classifications they support

25

Costly Attributes (DT problem)

Problem: testing certain attributes may carry a substantial cost

Solution: Add a cost term to the attribute selection measure so that cost is considered during tree building.

Gain²(S, A) / Cost(A)

26

Naïve Bayes

An eager-learning probabilistic classifier based on Bayes Rule

Takes linear time, so easily scalable

27

Bayes Theorem

Thomas Bayes (1701-1761)

P(c|x)=P(x|c)P(c)/P(x)

The probability of target class c being true, given attribute x is true = …

28

Naïve Bayes Assumptions

Assume…

  • attributes are conditionally independent

  • attributes are weighted equally towards the outcome

Thus, P(c|X) ∝ P(x_1|c) ⋅ P(x_2|c) ⋅ … ⋅ P(x_n|c) ⋅ P(c)
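
A minimal from-scratch sketch of a categorical Naïve Bayes classifier built on this product of per-attribute likelihoods and the class prior; the toy (outlook, windy) data is invented and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_nb(examples, labels):
    # Class priors P(c) and counts for the conditional probabilities P(x_i = v | c)
    class_counts = Counter(labels)
    cond_counts = defaultdict(Counter)        # (class, attribute index) -> Counter of values
    for ex, c in zip(examples, labels):
        for i, v in enumerate(ex):
            cond_counts[(c, i)][v] += 1
    return class_counts, cond_counts

def classify_nb(query, class_counts, cond_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n                                  # prior P(c)
        for i, v in enumerate(query):
            score *= cond_counts[(c, i)][v] / count        # likelihood P(x_i = v | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Invented toy data: (outlook, windy) -> play?
X = [("sunny", "no"), ("sunny", "yes"), ("rain", "yes"), ("rain", "no"), ("sunny", "no")]
y = ["yes", "no", "no", "yes", "yes"]
print(classify_nb(("sunny", "no"), *train_nb(X, y)))       # "yes"
```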

29

k-Nearest Neighbor

A lazy-learning algorithm that classifies a query based on its similarity to the stored training instances

Can represent feature space with one dimension for each descriptive feature

30

Euclidean Distance (k-NN)

Assuming two instances a and b in an m-dimensional feature space…

Euclidean(a, b) = sqrt( Σ_{i=1}^{m} (a_i − b_i)² )

“straight-line distance”

31

Manhattan distance (k-NN)

Assuming two instances a and b in an m-dimensional feature space…

Manhattan(a, b) = Σ_{i=1}^{m} |a_i − b_i|

“City-block distance”

32

Minkowski distance (k-NN)

Assuming two instances a and b in an m-dimensional feature space…

Minkowski(a, b) = ( Σ_{i=1}^{m} |a_i − b_i|^p )^(1/p)

If p=1, then Manhattan

If p=2, then Euclidean

Can define an infinite number of distance metrics
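
A small sketch of the three distance measures; the two example points are arbitrary.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def minkowski(a, b, p):
    """General form: p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 2))   # 5.0 7.0 5.0
```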

33

Voronoi Tessellation

Divides the feature space into local neighborhoods of classes. Each boundary segment is equidistant from two points of opposite classes

34

Why “k” nearest neighbors?

Nearest Neighbor classification is sensitive to noise such as mis-labeled data. Instead have the “k” nearest neighbors vote to classify the query.

35

How to choose “k”?

A process called parameter tuning

Too small → sensitive to noise

Too large → less precise

Thus, cross-validate using training and test data

Choose k < sqrt(n), where n is the number of training examples, and choose an odd number if dealing with an even number of classes

36

kNN algorithm

k is the number of nearest neighbors, S is the set of stored training instances, Q is the new query

  • Calculate the similarity of Q to all the instances in S

  • Rank the instances in S by their similarity to Q

  • Identify the set of k nearest neighbors

    • Report the majority vote of this set
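
A minimal sketch of these steps in Python, using Euclidean distance (math.dist); the 2-D instances and their labels are invented for illustration.

```python
import math
from collections import Counter

def knn_classify(query, instances, labels, k):
    # 1.-2. Compute the similarity of Q to every instance in S and rank by distance
    ranked = sorted(zip(instances, labels), key=lambda pair: math.dist(query, pair[0]))
    # 3. Identify the set of k nearest neighbors
    k_nearest = [label for _, label in ranked[:k]]
    # 4. Report the majority vote of this set
    return Counter(k_nearest).most_common(1)[0][0]

S = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 7), (6, 7)]   # training instances
y = ["A", "A", "A", "B", "B", "B"]                     # their labels
print(knn_classify((2, 2), S, y, k=3))                 # "A"
```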

37

Scaling (kNN problem)

Problem: distance measures are dominated by one of the attributes

Solution: normalize/scale the attributes so that they all cover roughly the same range

38

Missing values (kNN problem)

Problem: Missing feature values can lead to misclassified examples

39

Outliers (kNN problem)

Problem: outliers create individual spaces that are separated from the rest of their class

Solution: dilute the dependence of the algorithm on individual instances (increase k)

40

Clustering Algorithm

A kind of unsupervised learning algorithm that, given a finite set of data points, aims to find homogeneous subgroups of data points with similar characteristics.

The end goal of the clustering task is to generate/find a single feature that indicates the cluster that an instance belongs to.

Use cases: Customer segmentation, urban planning, seismology

41

k-means clustering

Goal: take a dataset of instances D and divide the instances d_1,…,d_n in this dataset into k disjoint clusters. “k” is an input to the algorithm and each instance can belong to only one cluster. The algorithm aims to minimize:

Σ_{i=1}^{n} min_{j=1,…,k} Dist(d_i, c_j), where c_1,…,c_k are the centroids of the k clusters and Dist is a distance measure used to compare instances to centroids.

Computationally efficient

42

k-means clustering approach

  1. Data preprocessing: convert dataset into numerical values if not already

  2. Initialize K and centroids

    • Can choose cluster centroids (seeds) at random, or divide dataset into k portions and pick one from each portion

  3. Assign clusters to datapoints using some distance metric

    • Calculate the distance between the datapoint and each centroid, and assign the datapoint to the nearest centroid

  4. Update centroids

    • Update cluster centroid to be the mean of all the datapoints within that cluster

  5. Stopping criterion

    • Perform steps 3 and 4 iteratively until some stopping criterion is met, such as…

    • Clusters stop changing, centroids remain stable, distance from datapoints to their centroid is minimum, fixed number of iterations have been reached
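
A bare-bones Python sketch that follows these five steps for numeric data; the random initialisation scheme, the toy 2-D points, and the helper names are illustrative choices.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """A minimal k-means sketch; points are tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                       # 2. initialise centroids at random
    for _ in range(max_iter):                               # 5. stop after a fixed number of iterations...
        # 3. assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        # 4. update each centroid to the mean of the points in its cluster
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:                      # 5. ...or when the centroids stop moving
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups of invented 2-D points
data = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]
centroids, clusters = kmeans(data, k=2)
print(centroids)
```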

43

Inertia Score (k-means clustering)

Tells how far the data points in a cluster are from their centroid. Want small values. Score ranges over [0, ∞).

Measured using the Within-Cluster Sum of Squares (WCSS):

WCSS = Σ_{all centroids C} Σ_{points p assigned to C} distance(p, C)²

44

Silhouette Width (k-means clustering)

A score for how well an individual data point is clustered. Considers both inter-cluster and intra-cluster distances. Ranges [-1,1]. A good score is closer to 1.

s(i) = [b(i) − a(i)] / max(a(i), b(i))

a(i) is the average distance from point i to the other points in its own cluster

b(i) is the average distance from point i to the points in the nearest other cluster

45

How to determine k? (k-means clustering)

  1. plot data and see if there is a pattern

  2. try different values of k and choose the one that gives the best performance

    1. Elbow method

    2. Silhouette method

46

Elbow method (k-means clustering)

Want small inertia and small k. Inertia decreases as k increases.

Calculate inertia score for a range of k’s, choose the “elbow point” - the point where the change in inertia with increasing k becomes less significant.

47

Silhouette Method (k-means clustering)

Want silhouette width close to 1 and small k.

Calculate the silhouette width for a range of k’s. Choose the k whose widths are closest to 1 and that produces fewer outliers.
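
A sketch of both selection methods side by side, assuming scikit-learn and NumPy are available; the synthetic three-blob data and the range of k are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated blobs of invented 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal((0, 6), 1, (50, 2))])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
# Pick the "elbow" in the inertia column, or the k whose silhouette width is closest to 1.
```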

48

Association Rule Mining

A rule-based machine learning method that finds relationships hidden in large datasets

Finds associations between any and all attributes

49

Association Rules

Strong rules that define association between attributes in a dataset.

If the antecedent is found in the data, then the consequent is also likely to be found

Antecedent → Consequent

X → Y

50

Market Basket Analysis

Proposed by Agrawal, Imielinski, and Swami in 1993.

Analyze transaction data and identify association rules using measures of “interestingness”

51

Antecedent (association rule mining)

Something which is found in the data

“if”

52

Consequent (association rule mining)

Something which occurs when the antecedent occurs

“then”

53

Itemset (association rule mining)

The set of all the items appearing in the antecedent and the consequent

54

Association Rule Mining Issues

  1. Patterns can occur by chance

    • Solution: Support, confidence, and lift

  2. Rule mining from large data-sets can be computationally expensive

    • Learn only strong rules; use the Apriori Algorithm

55

Support (association rule mining)

Indicates how frequently an item/itemset appears in the dataset. Low support for a rule means there is not enough information, and no conclusions can be drawn.

Ex. % of customers that purchased both milk and cheese

Support(A) = Frequency(A) / Total number of entries in dataset

Support of (A→D) = Frequency(A&D) / Total number of entries in dataset

56

Confidence (association rule mining)

Indicates how often a rule is found to be true in the dataset

Ex. the % of customers who bought milk that also bought cheese

Confidence(X→Y) = Support(X→Y) / Support(X)

57

Lift (association rule mining)

Indicates how the antecedent and consequent are related to one another, i.e., the rise in probability of the occurrence of Y when X has already occurred.

Lift(X→Y) = Support(X→Y) / [Support(X) * Support(Y)]

Lift = 1 → X and Y are independent of one another

Lift < 1 → X and Y are negatively correlated

Lift > 1 → X and Y are positively correlated
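
A short sketch computing all three measures from a toy transaction list (invented data); support, confidence, and lift follow the formulas in the last three cards.

```python
# Each transaction is the set of items one customer bought (invented data).
transactions = [
    {"milk", "cheese", "bread"},
    {"milk", "cheese"},
    {"milk", "bread"},
    {"cheese", "bread"},
    {"milk", "cheese", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

def lift(X, Y):
    return support(X | Y) / (support(X) * support(Y))

X, Y = {"milk"}, {"cheese"}
print(support(X | Y), confidence(X, Y), lift(X, Y))
```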

58

Good rules have… (association rule mining)

Support as high as possible

Confidence close to 1

Lift higher than 1

59

Apriori Algorithm

For frequent itemset mining and association rule mining from large databases.

Generally: identify the frequent individual items in the dataset and extend them to larger and larger itemsets, so long as those itemsets appear sufficiently often in the database.

60

Anti-Monotone Property

If a set cannot pass a test, all of its supersets will fail the same test.

i.e., All subsets of a frequent itemset must be frequent. All supersets of an infrequent itemset must be infrequent.

61

Apriori Algorithm Pseudocode

Inputs: dataset D, minimum support min_sup, minimum confidence min_conf (used later for rule generation)

Apriori(D, min_sup, min_conf):

  k = 1

  generate the frequent itemsets of length 1

  repeat until no new frequent itemsets are found:

    generate length (k+1) candidates from the length-k frequent itemsets

    prune candidates containing any length-k subset that is infrequent (support < min_sup)

    count the support of each remaining candidate by scanning D

    eliminate infrequent candidates, keeping only the frequent ones

    k = k + 1

  return all frequent itemsets found
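
A compact Python sketch of this loop for frequent-itemset mining, assuming each transaction is a set of items; rule generation with min_conf is left out, and the toy transactions are invented.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent itemsets of length 1
    items = {item for t in transactions for item in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]
    k = 1
    while frequent[-1]:
        # Generate length (k+1) candidates by joining length-k frequent itemsets
        candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
        # Prune candidates with an infrequent length-k subset (anti-monotone property)
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
        # Count support by scanning the transactions; keep only the frequent candidates
        frequent.append({c for c in candidates if support(c) >= min_sup})
        k += 1
    return [s for level in frequent for s in level]

transactions = [{"milk", "cheese"}, {"milk", "bread"},
                {"milk", "cheese", "bread"}, {"cheese", "bread"}]
print(apriori(transactions, min_sup=0.5))   # all itemsets with support >= 0.5
```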
