Neural Networks
These are modeled on the human brain. Just as biological neurons are networked to one another, artificial neural networks emphasize the relationships between attributes. They have proven especially effective in analyzing visual, audio, and written language data. Similar to Principal Component Analysis, they are effective at identifying which features are most relevant.
Perceptron
This algorithm attempts to classify objects using a single line, plane, or hyperplane. It begins with a set of weights matching the number of dimensions under consideration, plus one additional weight for the bias, whose input x0 is fixed at 1. For example, if we're looking at a two-dimensional field, we will have three weights: ω0, ω1, and ω2. Each weight is multiplied by the value of its corresponding dimension and the products are added together; this is ω transpose x:
ωᵀx = (ω0 * x0) + (ω1 * x1) + (ω2 * x2)
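To make this concrete, here is a minimal perceptron sketch in Python; the data, learning rate, and epoch count are invented for the example, and convergence is only guaranteed because the toy points are linearly separable.

```python
import numpy as np

def predict(w, x):
    # Prepend the constant x0 = 1 so w[0] acts as the bias,
    # then take the sign of the weighted sum w^T x
    x = np.concatenate(([1.0], x))
    return 1 if w @ x >= 0 else -1

def train(X, y, epochs=10, lr=1.0):
    w = np.zeros(X.shape[1] + 1)           # one weight per dimension, plus w0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if predict(w, xi) != yi:       # misclassified: nudge the boundary
                w += lr * yi * np.concatenate(([1.0], xi))
    return w

# Toy 2-D points, two per class (invented)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train(X, y)
print(w, [predict(w, xi) for xi in X])
```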
Bayesian Nets
The probability of x is the number of times x occurs in the observations divided by the total number of observations, N:
P(x) = Count(x)/N
Joint Probability
The probability that two variables have particular values is the count of records where they have those values divided by the number of records:
P(x,y) = Count(x, y) / N
Sum Rule
Once you have a joint probability, you can use that to determine a single probability by ‘summing out’ the other variable:
P(+x) = P(+x, +y) + P(+x, -y)
Conditional Probability
Probability x given y is the probability of x and y divided by probability of y:
P(x|y) = P(x, y) / P(y)
Bayes' Theorem
Since P(y|x) = P(x, y) / P(x), and P(x, y) = P(x|y) * P(y), it follows that:
P(y|x) = P(x|y) * P(y) / P(x)
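The definitions above all reduce to counting. A short sketch estimating them from raw data; the (x, y) records here are invented:

```python
from collections import Counter

records = [("+x", "+y"), ("+x", "+y"), ("+x", "-y"), ("-x", "+y"), ("-x", "-y")]
N = len(records)

P_xy = {pair: c / N for pair, c in Counter(records).items()}         # P(x, y)
P_x = {x: c / N for x, c in Counter(r[0] for r in records).items()}  # P(x)
P_y = {y: c / N for y, c in Counter(r[1] for r in records).items()}  # P(y)

# Sum rule: P(+x) = P(+x, +y) + P(+x, -y)
print(P_xy[("+x", "+y")] + P_xy[("+x", "-y")], "==", P_x["+x"])

# Conditional probability: P(x|y) = P(x, y) / P(y)
P_x_given_y = P_xy[("+x", "+y")] / P_y["+y"]

# Bayes' Theorem: P(y|x) = P(x|y) * P(y) / P(x)
print(P_x_given_y * P_y["+y"] / P_x["+x"])   # 0.666...
print(P_xy[("+x", "+y")] / P_x["+x"])        # same value, computed directly
```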
Independence
Two variables are independent if:
P(x,y) = P(x)P(y); Or
P(x|y) = P(x)
Conditional Independence
P(x, y|z) = P(x|z)P(y|z); Or
P(x|z,y) = P(x|z)
Unsupervised learning
Used on datasets with no predefined classes. The machine learns by observation rather than from labeled examples. The objectives can be the same as for supervised learning, and it is also useful for stand-alone applications and for preprocessing.
Examples of unsupervised learning
Customer segmentation
Patient cohorts with similar characteristics
Topics covered in documents
Geological predictions
Economics: market research
Finding nearest neighbors
Compression
Types of unsupervised learning
Partitional: an object can belong to only one class.
Hierarchical: a class can have a subclass.
Overlapping: an object can belong to more than one class.
Fuzzy Cluster: an object belongs to every class, but a weight is attached, usually between 0 and 1. This is similar to fuzzy sets in math.
Other Properties of Unsupervised Algorithms
Prototype: class is defined by a representative for that cluster, like a centroid.
Density: class is defined by membership in a tight cluster of similar objects.
Shared-Property: membership in a class is based on concepts held in common.
Graph: Connection to other objects defines class.
Most Popular Algorithms
K-Means: Distance-based partitioning
DBSCAN: Density-Based
Cobweb: model-based conceptual clustering
Expectation Maximization: Statistical modeling
Nearest Neighbor: Distance-based partitioning
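As a quick taste of two algorithms from this list, here is a sketch using scikit-learn (assumed available); the toy data and parameters such as eps and k are guesses for the example, not tuned values.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# Two invented blobs of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(4, 0.3, (25, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)   # -1 marks noise points
print(km[:5], db[:5])
```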
Linking Clusters
Single: Distance between nearest objects
Complete: Distance between furthest objects
Average: Average distance between all pairs of objects in the two clusters (the centroid variant measures the distance between cluster centroids)
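A sketch of the three strategies using SciPy's hierarchical clustering (assumed available); note that SciPy's method="average" averages all pairwise distances, while the centroid variant is method="centroid". The points are invented.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points: two tight groups plus one distant point
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # merge closest clusters first
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)
```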
Important Measures
Cohesion: Average distance between points within a cluster (similar to the Sum of Squared Errors)
Separation: Average distance to points of the nearest outside cluster
Silhouette: Cohesion and separation combined and normalized:
S = (b - a)/max(a, b)
Where b is Separation and a is Cohesion
Silhouette will be between -1 and 1
-1 is a poor clustering
1 is a good clustering
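A small check of the formula using scikit-learn's silhouette_score (assumed available); the points and labels are contrived so that one labeling is clearly good and the other clearly poor.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tight groups with hand-assigned labels (all values invented)
X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)
good = np.array([0, 0, 0, 1, 1, 1])   # labels match the true grouping
bad = np.array([0, 1, 0, 1, 0, 1])    # labels split each group

print(silhouette_score(X, good))      # close to 1: good clustering
print(silhouette_score(X, bad))       # negative: poor clustering
```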
Association
Put simply, this analysis aims to find items that frequently appear together in data. The technique can be applied to many domains including retail sales, bioinformatics, scholarly authorships, parliamentary voting, and web data mining. The patterns discovered are useful for many purposes including promotions, medical discoveries, and classifications.
Market Basket
To perform association analysis, we need data that records transactions. You can think of each record as the items purchased in a market basket.
| TID | Items Purchased |
|---|---|
| 1 | Beer, Nuts, Diaper |
| 2 | Beer, Coffee, Diaper |
| 3 | Beer, Diaper, Eggs |
| 4 | Nuts, Eggs, Milk |
| 5 | Nuts, Coffee, Diaper, Eggs |
Support
We want to discover items that are frequently purchased together in the data. We could generate a list of all possible sets and count how many times those sets occur in the recorded transactions. How many transactions contain a set divided by the total count of transactions is the support for that itemset:
Support = Count(itemset) / Number of transactions; Or
s(X) = σ(X)/N, where X is the itemset and N is the number of transactions
In the data above, the support for the itemset {Beer, Diapers} is 3/5 or .6 because Beer and Diapers are purchased together 3 times and there are 5 transactions.
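The same calculation in code, using the market-basket table above as Python sets:

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs"},
]

def support(itemset, transactions):
    # sigma(X) / N: fraction of transactions containing every item in X
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

print(support({"Beer", "Diaper"}, transactions))   # 3/5 = 0.6
```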
Association Rules
Association rules are stated Y given X, or X -> Y. X is the antecedent and Y is the consequent. We want to discover conditional relationships between subsets. For example, if someone has purchased Diapers, what is the likelihood they also purchased Beer? What is the probability of Beer given Diapers?
Confidence
To find the answer, we can find the count of {Beer, Diaper} and divide it by the count of diapers: 3 / 4 = .75. This is the confidence we have that beer will be purchased if diapers are purchased.
confidence(X -> Y) = σ(X∪Y) / σ(X); Or
If X, then Y = count(X and Y) / count(X)
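Continuing the support snippet above (reusing transactions and support), confidence is just a ratio of two supports, since the division by N cancels:

```python
def confidence(antecedent, consequent, transactions):
    # sigma(X u Y) / sigma(X), written with the support fractions
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(confidence({"Diaper"}, {"Beer"}, transactions))   # 3/4 = 0.75
```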
Apriori Analysis
The computational cost to calculate the support and confidence increases exponentially as the number of distinct items grows. In the data above, there are 64 possible subsets of the six items and 602 possible association rules. If the data represented a real store, there would likely be thousands of items and millions of transactions. To calculate support and confidence, each of those transactions would be scanned for each of the 602 rules. Even for a computer, the task would take far too much time. For this reason, we need an algorithmic approach. We only want to consider possible subsets and rules that are most likely to yield meaningful results. Strategies for narrowing the relevant itemsets are often based on the Apriori principle:
Apriori Principle: if an itemset is frequent, then all of its subsets must also be frequent.
Some measures adhere to this rule thanks to the anti-monotone property:
Anti-monotone Property: the support for every subset of a set must be greater than or equal to the support for the set itself. Support adheres to the anti-monotone property: if X is a subset of Y, then
s(X) ≥ s(Y)
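A minimal sketch of Apriori-style level-wise generation over the market-basket data, pruning any candidate with an infrequent subset; the min_support threshold of 0.4 is an arbitrary choice for the example.

```python
from itertools import combinations

transactions = [
    {"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs"},
]
N, min_support = len(transactions), 0.4

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / N

# Level 1: frequent single items
frequent = [frozenset([i]) for i in set().union(*transactions)
            if support(frozenset([i])) >= min_support]
all_frequent, k = list(frequent), 2
while frequent:
    # Candidate k-itemsets are unions of frequent (k-1)-itemsets.
    # The Apriori principle discards any candidate with an infrequent
    # subset, since s(X) >= s(Y) whenever X is a subset of Y.
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates
                if all(frozenset(s) in all_frequent
                       for s in combinations(c, k - 1))
                and support(c) >= min_support]
    all_frequent += frequent
    k += 1

for s in all_frequent:
    print(set(s), support(s))
```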
Interestingness
Using confidence and support to judge association rules is often insufficient. Some uninteresting rules will have high confidence and support. For example, peanut butter and jelly likely have high confidence and support values, but the association is already well known. Confidence can also be misleading. To demonstrate this, consider a contingency table, which counts an item contingent on whether or not another item is present. The table below shows frequencies for 100 people who drink coffee, tea, both, or neither. The first cell, 15, is the number of people who drink both tea and coffee. The cell with the value 5 shows the number of people who drink tea but not coffee (the symbol ~ means 'not'). The far-right column and the bottom row are the totals for those categories.

| | Coffee | ~Coffee | Total |
|---|---|---|---|
| Tea | 15 | 5 | 20 |
| ~Tea | 65 | 15 | 80 |
| Total | 80 | 20 | 100 |
Lift
Lift compares the support of an itemset against the support expected if the items were independent: the support of the rule divided by the product of the supports of the antecedent and the consequent. This is also called the Interest Factor.
Lift(X -> Y) = s(X∪Y) / (s(X) × s(Y))
For example, the lift for Tea -> Coffee is:
s(Tea and Coffee) / (s(Tea) × s(Coffee))
= .15 / (.2 × .8)
= .9375
A lift greater than 1 shows the items are positively related; a lift less than 1 shows they are negatively related; and a lift equal to 1 shows the items are independent. In the Coffee, Tea, and Honey example, Tea -> Coffee has a lift of .9375 and Tea -> Honey has a lift of 4.1667. Therefore, Tea -> Honey is a more interesting rule than Tea -> Coffee.
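The same arithmetic as a tiny check; the fractions come from the contingency table above, with N = 100:

```python
s_tea, s_coffee, s_tea_and_coffee = 0.20, 0.80, 0.15

lift = s_tea_and_coffee / (s_tea * s_coffee)
print(lift)   # 0.9375: just under 1, so Tea and Coffee are weakly
              # negatively related despite a confidence of 15/20 = 0.75
```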
Null invariance
A measure of the association between items A and B is null invariant if it does not change when additional market baskets that contain neither A nor B are added to the dataset.
Inversion invariance
In some datasets, the presence of an item is more meaningful than its absence. For example, we might be interested to know if people who buy batteries are also likely to buy pens, but we aren't as interested in how many people did not buy batteries but did buy pens. The presence of batteries is more meaningful than their absence. But when finding associations between people's opinions, the presence of a "yes" answer is just as meaningful as the presence of a "no" answer. The binary options are equally weighted. Measures are invariant to inversion when the result does not change when all the binary values are switched ('yes' is changed to 'no', True is changed to False, 1 is changed to 0, etc.).
Scaling invariance
If the results of a measure do not change as the distributions between classes change, then the measure is scaling invariant. For example, if a sample goes from having 100 dogs and 50 cats to 300 dogs and 50 cats, then the scaling has changed. If the measure does not change with the additional dogs, then it is scaling invariant.
Symmetry invariance
If the order of items in an itemset does not affect the measure, then the measure is symmetry invariant. For example, confidence can change if a rule is stated A->B rather than B->A, and is therefore not symmetry invariant.
Frequent Pattern Tree Representation
The apriori algorithm has its limitations. Its performance is hindered when the data include many items. When the data are wide, many candidate itemsets must be generated.
Another algorithm takes a depth-first approach by restructuring the data into an FP-Tree (Frequent Pattern Tree). Each pattern is represented by a series of nodes connected by a solid line. Dotted lines connect nodes with the same value but belonging to different patterns. If a prefix is shared by multiple transactions, a count on the node records how many transactions start the same way.
The advantage of this approach is that, once the tree is constructed, we don't need to revisit the transaction table. We can determine support from the patterns and counts alone.
This approach can compress the data if patterns frequently repeat. The amount of compression decreases, however, as the uniqueness of itemsets increases. The more a pattern is repeated, the greater the compression.
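A minimal sketch of the construction step only (no mining); sorting each transaction by global item frequency makes shared prefixes line up, which is where the compression comes from. The data reuse the market-basket table above.

```python
from collections import Counter, defaultdict

transactions = [
    ["Beer", "Nuts", "Diaper"], ["Beer", "Coffee", "Diaper"],
    ["Beer", "Diaper", "Eggs"], ["Nuts", "Eggs", "Milk"],
    ["Nuts", "Coffee", "Diaper", "Eggs"],
]

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

root = Node(None, None)
links = defaultdict(list)   # header table: item -> its nodes (the dotted lines)
freq = Counter(i for t in transactions for i in t)

for t in transactions:
    node = root
    # Sort by global frequency so shared prefixes follow the same path
    for item in sorted(t, key=lambda i: (-freq[i], i)):
        if item in node.children:
            node.children[item].count += 1   # shared prefix: bump the counter
        else:
            node.children[item] = Node(item, node)
            links[item].append(node.children[item])
        node = node.children[item]

# Support read straight from the tree, with no rescan of the transactions
print(sum(n.count for n in links["Diaper"]) / len(transactions))   # 0.8
```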
Uniqueness
Most association algorithms require that data be formatted in a matrix of zeros and ones. When pivoting a categorical field to many flag fields, the number of unique categorical values can affect the outcome and performance of your analysis. If the values are too specific, like 'Coors Light Beer', then they likely won't have enough frequency to pass a minimum support threshold. If the values are too generic, like 'Drinks', they might appear too frequently and lead to redundant or uninteresting association rules.
Taxonomies
One approach to decreasing the uniqueness of a categorical field is to use a taxonomy. This way, items can be grouped together in a concept hierarchy. For example, using a taxonomy of food products, chicken, rabbit, game, and beef can all be converted to 'meats'. 'Meats' will have a higher frequency, higher support, and will likely result in more association rules.
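In code, the roll-up is just a mapping applied before the analysis; the taxonomy below is invented for illustration.

```python
# One-level concept hierarchy: specific item -> general category
taxonomy = {
    "chicken": "meats", "rabbit": "meats", "game": "meats", "beef": "meats",
    "cola": "drinks", "juice": "drinks",
}

basket = ["chicken", "cola", "beef"]
generalized = [taxonomy.get(item, item) for item in basket]
print(generalized)   # ['meats', 'drinks', 'meats'] -> higher support for 'meats'
```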
Binary Fields
Sometimes, only the presence of an item is meaningful. For example, when analyzing grocery store purchases, only the presence of an item is considered. We don’t often want association rules like ‘not batteries -> milk’. Other times, like when considering answers to a survey, a no answer is just as meaningful as a yes answer. If that is the case, you may want to preserve both no and yes answers in your pivot.
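A sketch of the pivot itself using pandas (assumed available); the column names are illustrative. Crosstab counts item occurrences per transaction, and clipping at 1 turns the counts into presence flags.

```python
import pandas as pd

df = pd.DataFrame({
    "TID":  [1, 1, 2, 2, 3],
    "Item": ["Beer", "Diaper", "Beer", "Coffee", "Eggs"],
})

flags = pd.crosstab(df["TID"], df["Item"]).clip(upper=1)
print(flags)
# This keeps only presence (1) versus absence (0). For survey-style data
# where 'no' is as informative as 'yes', keep explicit yes/no columns instead.
```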