Arthur Samuel
1959 - defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed
His checkers-playing program was able to learn without being explicitly programmed, using minimax search, static evaluation, and rote learning
Machine learning
“the construction and study of algorithms that can learn from data”
“A program learns from experience E with respect to a task T and performance measure P if its performance at T, as measured by P, improves with experience E”
Build a model based on inputs and use that to make predictions, decisions, rules
Split into supervised, unsupervised, and reinforcement learning
Steps of machine learning
Data gathering
Data pre-processing
Choose a model
Train the model
Fine-tune the model
Apply the model
Supervised Learning
Assumes input data that includes known answers/responses (labels). Seeks to predict the responses for new data.
Examples: Regression problem, classification problem
Unsupervised Learning
No labeled answers are provided. Instead, the learning program must find structure or interesting patterns/associations within the given data
Examples: clustering, association
Eager Learning
Constructs a general learning model before receiving a query
Less dependent on training data during prediction
Takes more time training, less time predicting
More memory usage during training, less during prediction
Example: Decision tree, Naive Bayes, Neural network
Lazy Learning
Stores training data (or carries out only minor processing) and waits until given a new query
Relies heavily on training data during prediction
Takes less time training, more time predicting
Less memory usage during training, more during predicting
Example: k-nearest neighbor
Decision Tree
Fast learning algorithms
Tree structure, each node is labeled with a test/question, each branch with possible answers, leaf nodes with a decision/solution
Program traverses tree to reach a leaf node (with a decision)
Supervised learning, eager learning
Attribute (feature descriptor)
Internal node of decision tree
Attribute value (feature descriptor value)
Branch of decision tree
Classification (decision tree)
Leaf node of decision tree
Decision tree learning algorithm
Top-down induction
Find the smallest tree possible
Choose best attribute
Extend tree by adding a new node for this attribute and a new branch for each attribute value
Sort training examples down through this node to the current leaves
If the training examples are unambiguously classified then stop
Otherwise repeat from the top
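A minimal Python sketch of this top-down induction loop (ID3-style). It assumes examples are dicts of attribute values with class labels in a parallel list, and uses the entropy/information-gain formulas from the cards below; the names and data layout are illustrative, not from the lecture.

import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, labels, attributes):
    # stop: examples unambiguously classified, or no attributes left (report the majority)
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # choose the best attribute (highest information gain) and extend the tree
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    node = {"attribute": best, "branches": {}}
    for value in {ex[best] for ex in examples}:
        keep = [i for i, ex in enumerate(examples) if ex[best] == value]
        node["branches"][value] = build_tree(
            [examples[i] for i in keep],
            [labels[i] for i in keep],
            [a for a in attributes if a != best],
        )
    return node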
Occam’s Razor
The most likely hypothesis is the simplest one that is consistent with all observations
Attributed to William of Ockham (c. 1285-1349)
Best attribute
Generally, the attribute that makes the most difference to the classification of an example
Specifically, the attribute that achieves the highest information gain
Claude Shannon
Founded information theory - the mathematical theory of data communication and storage (1948)
Entropy model
Understanding entropy
A computational measure of the impurity of the elements in a set
The “uncertainty” associated with randomly picking a certain element from a set
Large probability (low uncertainty) → low entropy values
Small probability (high uncertainty) → large entropy values
Entropy formula H(X)
H(X) = −∑_{i=1}^{n} P(X=v_i) · log₂ P(X=v_i)
P(X=v_i) is the probability of value v_i
−log₂ P(X=v_i) is the uncertainty (information content) of v_i
v_i ranges over the n possible values (classes) of X
Unit - bits
Information Gain formula
Gain(S, D) = H(S) − ∑_{V∈D} (|V|/|S|) · H(V), where D is the set of subsets of S produced by splitting on an attribute
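A small worked example in Python with illustrative counts (not from the lecture): a set S of 9 positive and 5 negative examples, split by some attribute into subsets containing 6+/2− and 3+/3−.

import math

h_s = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)   # H(S) ≈ 0.940 bits
h_1 = -(6/8) * math.log2(6/8) - (2/8) * math.log2(2/8)       # H(subset 1) ≈ 0.811 bits
h_2 = -(3/6) * math.log2(3/6) - (3/6) * math.log2(3/6)       # H(subset 2) = 1.000 bit
gain = h_s - (8/14) * h_1 - (6/14) * h_2                     # Gain ≈ 0.048 bits
print(h_s, gain)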
Noise (DT problem)
Problem: There are no attributes left, but there are both positive and negative examples remaining. Attributes do not give enough information to classify the examples
Solution: 1) Leaf nodes report majority classification. 2) Report estimated classification probabilities
Overfitting (DT problem)
Problem: the algorithm uses irrelevant attributes to make spurious distinctions among examples.
Solution: 1) Estimate the likelihood that the gain observed for a particular example set size is due to chance. 2) Cross-validation: set aside part of the training set to use as a test set to see how the current model will work with unseen data
Missing Data (DT problem)
Problem: not all attributes may be known for every example…how to classify?
Solution: 1) Assign a value for the empty attribute, perhaps using the average value in all the training examples. 2) Assume that the attribute has all possible values and proceed down all branches. Weigh each procession by the relative frequency of that procession in the other examples.
Multi-Valued Data (DT problem)
Problem: attributes with large numbers of possible values can give unfairly large gain values, leading to a useless tree
Solution: Modify the gain formula to get the Gain Ratio formula
Gain Ratio formula
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
SplitInformation(S, A) = −∑_{i=1}^{c} (|S_i|/|S|) · log₂(|S_i|/|S|), where S_1,…,S_c are the subsets of S produced by the c values of attribute A
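A short sketch of these two formulas, assuming the gain value has already been computed as above and only the subset sizes |S_i| of the split are needed; the numbers in the last line are illustrative.

import math

def split_information(subset_sizes):
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s > 0)

def gain_ratio(gain, subset_sizes):
    si = split_information(subset_sizes)
    return gain / si if si > 0 else 0.0

print(gain_ratio(0.048, [8, 6]))   # the gain is penalized by how finely the attribute splits S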
Continuous Values (DT problem)
Problem: some attributes can have an infinite number of values, and it is unlikely that any two examples will have the exact same value
Solution: Discretization - Split the value range up into manageable chunks. Preprocess the example data to determine which ranges are best in terms of the classifications they support
Costly Attributes (DT problem)
Problem: testing certain attributes may carry a substantial cost
Solution: Add a cost term to the attribute selection measure so that cost is considered during tree building.
e.g., Gain(S, A)² / Cost(A)
Naïve Bayes
An eager-learning probabilistic classifier based on Bayes Rule
Takes linear time, so easily scalable
Bayes Theorem
Thomas Bayes (1701-1761)
P(c|x)=P(x|c)P(c)/P(x)
The probability of target class c being true, given attribute x is true = the likelihood of x given c, times the prior probability of c, divided by the probability of x
Naïve Bayes Assumptions
Assume…
attributes are conditionally independent
attributes are weighted equally towards the outcome
Thus, P(c|X) ∝ P(x_1|c) · P(x_2|c) · … · P(x_n|c) · P(c)
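A minimal categorical Naive Bayes sketch following the product rule above, with simple add-one smoothing so an unseen attribute value does not zero out the product; the data layout and names are illustrative.

from collections import Counter, defaultdict

def train(examples, labels):
    priors = Counter(labels)                          # class counts, for P(c)
    cond = defaultdict(Counter)                       # (class, attribute) -> value counts
    for ex, c in zip(examples, labels):
        for attr, value in ex.items():
            cond[(c, attr)][value] += 1
    return priors, cond

def classify(query, priors, cond):
    total = sum(priors.values())
    best_class, best_score = None, 0.0
    for c, prior_count in priors.items():
        score = prior_count / total                   # P(c)
        for attr, value in query.items():             # multiply in each P(x_i | c)
            counts = cond[(c, attr)]
            score *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)   # add-one smoothing
        if score > best_score:
            best_class, best_score = c, score
    return best_class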
k-Nearest Neighbor
A lazy-learning algorithm that classifies a query based on its similarity to stored training instances
Can represent feature space with one dimension for each descriptive feature
Euclidean Distance (k-NN)
Assuming two instances a and b in an m-dimensional feature space…
Euclidean(a, b) = sqrt( ∑_{i=1}^{m} (a_i − b_i)² )
“straight-line distance”
Manhattan distance (k-NN)
Assuming two instances a and b in an m-dimensional feature space…
Manhattan(a, b) = ∑_{i=1}^{m} |a_i − b_i|
“City-block distance”
Minkowski distance (k-NN)
Assuming two instances a and b in an m-dimensional feature space…
Minkowski(a, b) = [ ∑_{i=1}^{m} |a_i − b_i|^p ]^(1/p)
If p=1, then Manhattan
If p=2, then Euclidean
Can define an infinite number of distance metrics
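A tiny sketch: the Minkowski distance with Manhattan (p=1) and Euclidean (p=2) as special cases; the example points are illustrative.

def minkowski(a, b, p=2):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski((0, 0), (3, 4), p=2))   # 5.0  Euclidean, "straight-line"
print(minkowski((0, 0), (3, 4), p=1))   # 7.0  Manhattan, "city-block"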
Voronoi Tessellation
Divide a feature space into local neighborhoods of classes. Each line segment is equidistant between two points of opposite classes
Why “k” nearest neighbors?
Nearest Neighbor classification is sensitive to noise such as mis-labeled data. Instead have the “k” nearest neighbors vote to classify the query.
How to choose “k”?
A process called parameter tuning
Too small → sensitive to noise
Too large → less precise
Thus, cross-validate using training and test data
Choose k < sqrt(n), where n is the number of training examples, and choose an odd number if dealing with an even number of classes
kNN algorithm
k is number of nearest neighbors, S is the set of instances, Q is the new query
Calculate the similarity of Q to all the instances in S
Rank the instances in S by their similarity to Q
Identify the set of k nearest neighbors
Report the majority vote of this set
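A minimal sketch of these four steps in Python, using Euclidean distance as the similarity measure; the instances, labels, and query in the last line are illustrative.

import math
from collections import Counter

def knn_classify(query, instances, labels, k=3):
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # steps 1-2: compute the distance of Q to every instance in S and rank
    ranked = sorted(zip(instances, labels), key=lambda pair: dist(query, pair[0]))
    # step 3: take the k nearest neighbors; step 4: report the majority vote
    neighbors = [label for _, label in ranked[:k]]
    return Counter(neighbors).most_common(1)[0][0]

print(knn_classify((1.0, 1.0), [(0, 0), (1, 2), (5, 5), (6, 5)], ["a", "a", "b", "b"]))   # "a"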
Scaling (kNN problem)
Problem: distance measures are dominated by one of the attributes
Solution: normalize/scale the attributes so they have comparable ranges (making the feature space roughly square)
Missing values (kNN problem)
Problem: Missing feature values can lead to misclassified examples
Outliers (kNN problem)
Problem: outliers create individual spaces that are separated from the rest of their class
Solution: dilute the dependence of the algorithm on individual instances (increase k)
Clustering Algorithm
A kind of unsupervised learning algorithm that aims to find homogeneous subgroups of data points with similar characteristics, given a finite set of data points.
The end goal of the clustering task is to generate/find a single feature that indicates the cluster that an instance belongs to.
Use cases: Customer segmentation, urban planning, seismology
k-means clustering
Goal: take a dataset of instances D and divide the instances d_1,…,d_n in this dataset into k disjoint clusters. k is an input to the algorithm and each instance can only belong to one cluster. The algorithm aims to minimize:
∑_{i=1}^{n} min_{c_j ∈ {c_1,…,c_k}} Dist(d_i, c_j), where c_1,…,c_k are the centroids of the k clusters and Dist is a distance measure used to compare instances to centroids.
Computationally efficient
k-means clustering approach
Data preprocessing: convert dataset into numerical values if not already
Initialize K and centroids
Can choose cluster centroids (seeds) at random, or divide dataset into k portions and pick one from each portion
Assign clusters to datapoints using some distance metric
Calculate the distance between the datapoint and all the centroids, and assign the datapoint to the nearest centroid
Update centroids
Update cluster centroid to be the mean of all the datapoints within that cluster
Stopping criterion
Perform the assignment and update steps iteratively until some stopping criterion is met, such as…
Clusters stop changing, centroids remain stable, distance from datapoints to their centroid is minimum, fixed number of iterations have been reached
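A bare-bones k-means loop following the steps above: random seeds, Euclidean distance, stop when the assignments no longer change. The names are illustrative; in practice a library implementation (e.g. scikit-learn's KMeans) would be used.

import math, random

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)                  # seeds chosen at random
    dist = lambda p, c: math.sqrt(sum((x - y) ** 2 for x, y in zip(p, c)))
    assignment = None
    for _ in range(max_iters):
        # assign each datapoint to its nearest centroid
        new = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        if new == assignment:                             # stopping criterion: clusters stable
            break
        assignment = new
        # update each centroid to the mean of the datapoints in its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment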
Inertia Score (k-means clustering)
Tells how far the data points in a cluster are from their cluster's centroid. Want small values. Score ranges over [0, ∞)
Measured using Within-Cluster-Sum-of-Squares
WCSS = ∑_{all centroids C} ( ∑_{all points p in C's cluster} distance(p, C)² )
Silhouette Width (k-means clustering)
A score for how well an individual data point is clustered. Considers both inter-cluster and intra-cluster distances. Ranges [-1,1]. A good score is closer to 1.
s(i) = [b(i) − a(i)] / max(a(i), b(i))
a(i) is the average distance from point i to the other points within its own cluster
b(i) is the average distance from point i to the points in the nearest neighboring cluster
How to determine k? (k-means clustering)
plot data and see if there is a pattern
try different values of k and choose the one that gives the best performance
Elbow method
Silhouette method
Elbow method (k-means clustering)
Want small inertia and small k. Inertia decreases with increasing k.
Calculate inertia score for a range of k’s, choose the “elbow point” - the point where the change in inertia with increasing k becomes less significant.
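A short sketch of the elbow method, assuming scikit-learn is available and X is a numeric feature matrix: fit k-means for a range of k values, collect the inertia (WCSS) scores, and pick the k where the curve bends.

from sklearn.cluster import KMeans

def inertia_curve(X, k_values=range(1, 11)):
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in k_values}

# plot inertia_curve(X) against k and choose the "elbow" point by eye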
Silhouette Method (k-means clustering)
Want silhouette width close to 1 and small k.
Calculate silhouette width for a range of k's. Choose the one with widths closer to 1 and fewer outliers.
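A matching sketch for the silhouette method, again assuming scikit-learn and a numeric feature matrix X: compute the mean silhouette width for each candidate k and prefer a small k whose score is close to 1.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_curve(X, k_values=range(2, 11)):       # silhouette needs at least 2 clusters
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores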
Association Rule Mining
A rule-based machine learning method that finds relationships hidden in large datasets
Finds associations between any and all attributes
Association Rules
Strong rules that define association between attributes in a dataset.
If the antecedent is found in the data, then the consequent is also likely to be found
Antecedent → Consequent
X → Y
Market Basket Analysis
Proposed by Agrawal, Imielinski, and Swami in 1993.
Analyze transaction data and identify association rules using measures of “interestingness”
Antecedent (association rule mining)
Something which is found in the data
“if”
Consequent (association rule mining)
Something which occurs when the antecedent occurs
“then”
Itemset (association rule mining)
The set of all the items that appear in the antecedent and the consequent
Association Rule Mining Issues
Patterns can occur by chance
Solution: Support, confidence, and lift
Rule mining from large data-sets can be computationally expensive
Solution: Learn only strong rules; use the Apriori Algorithm
Support (association rule mining)
Indicates how frequently an item/itemset appears in the dataset. Low support for a rule means there is not enough information; no conclusions can be drawn.
Ex. % of customers that purchased milk+cheese
Support(X) = Frequency(X) / Total number of entries in the dataset
Support(X→Y) = Frequency(X & Y) / Total number of entries in the dataset
Confidence (association rule mining)
Indicates how often a rule is found to be true in the dataset
Ex. Of the customers that bought milk, the % that also bought cheese
Confidence(X→Y) = Support(X→Y) / Support(X)
Lift (association rule mining)
Indicates how the antecedent and consequent are related to one another, i.e., the rise in probability of the occurrence of Y when X has already occurred.
Lift(X→Y) = Support(X→Y) / [Support(X) * Support(Y)]
Lift = 1, X and Y are independent of one another
Lift < 1, negatively correlated
Lift > 1, positively correlated
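A small sketch of the three measures over a list of transactions (each a set of items); the example baskets are illustrative.

def support(transactions, itemset):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / (support(transactions, antecedent) * support(transactions, consequent))

baskets = [{"milk", "cheese"}, {"milk", "bread"}, {"milk", "cheese", "bread"}, {"bread"}]
print(support(baskets, {"milk", "cheese"}))        # 0.5
print(confidence(baskets, {"milk"}, {"cheese"}))   # ~0.67
print(lift(baskets, {"milk"}, {"cheese"}))         # ~1.33, positively correlated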
Good rules have… (association rule mining)
Support as high as possible
Confidence close to 1
Lift higher than 1
Apriori Algorithm
For frequent itemset mining and association rule mining from large databases.
Generally: identify the frequent individual items in the dataset and extend them to larger and larger itemsets, so long as those itemsets appear sufficiently often in the database.
Anti-Monotone Property
If a set cannot pass a test, all of its supersets will fail the same test.
i.e., All subsets of a frequent itemset must be frequent. All supersets of an infrequent itemset must be infrequent.
Apriori Algorithm Pseudocode
Dataset D
Apriori(D, min_sup, min_conf)
k = 1
generate the frequent itemsets of length k (support ≥ min_sup)
repeat until no new frequent itemsets are found:
    generate length (k+1) candidates from the length-k frequent itemsets
    prune candidate itemsets containing any length-k subset that is infrequent (< min_sup)
    count the support of each remaining candidate by scanning D
    eliminate infrequent candidates, leaving only those that are frequent
    k++
generate rules from the frequent itemsets, keeping those with confidence ≥ min_conf
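A compact Python sketch of the frequent-itemset part of this pseudocode (rule generation omitted), assuming transactions are sets of items and min_sup is an absolute count; the names are illustrative.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_sup):
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {item for t in transactions for item in t}
    current = {c: n for c, n in count({frozenset([i]) for i in items}).items() if n >= min_sup}
    frequent = dict(current)
    k = 1
    while current:
        # join length-k frequent itemsets into length-(k+1) candidates, then prune any
        # candidate with an infrequent length-k subset (anti-monotone property)
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        candidates = {c for c in candidates if all(frozenset(s) in current for s in combinations(c, k))}
        current = {c: n for c, n in count(candidates).items() if n >= min_sup}
        frequent.update(current)
        k += 1
    return frequent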