What is machine learning?
Machine learning is a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.
What are the two types of machine learning?
Supervised
Unsupervised
What is classification/prediction?
Classification/prediction resembles the way humans learn from past experiences.
A computer does not have “experience”, so it learns from data that represent “past experiences” of an application domain.
What is the general flow process of supervised learning?
i) Training text, documents, images, sounds, … > feature vectors + labels >
ii) Machine learning algorithm >
iii) Predictive model
iv) New text, documents, images, sounds > feature vector > predictive model (iii) > expected label
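The four steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy spam/ham texts, the word-count features, and the nearest-centroid learner are all made up for the example and are not part of the original notes.

```python
# Hypothetical toy data: (text, label) pairs standing in for training documents.
training_data = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

def extract_features(text, vocabulary):
    """Step i: turn raw text into a feature vector of word counts."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

def train(examples):
    """Step ii: a very simple learning algorithm (nearest centroid)."""
    vocabulary = sorted({w for text, _ in examples for w in text.split()})
    sums, counts = {}, {}
    for text, label in examples:
        vec = extract_features(text, vocabulary)
        acc = sums.setdefault(label, [0] * len(vocabulary))
        sums[label] = [a + v for a, v in zip(acc, vec)]
        counts[label] = counts.get(label, 0) + 1
    # Average the feature vectors of each class to get one centroid per label.
    centroids = {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}
    return vocabulary, centroids  # Step iii: the predictive model

def predict(model, text):
    """Step iv: new text > feature vector > model > expected label."""
    vocabulary, centroids = model
    vec = extract_features(text, vocabulary)

    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))

    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

model = train(training_data)
print(predict(model, "buy cheap pills"))         # spam
print(predict(model, "monday project meeting"))  # ham
```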
Where can datasets be retrieved?
Public datasets
Data marketplace
Company and organization datasets
Web scraping
What are the activities carried out in preprocessing?
Data cleaning - Handle missing values by either removing the corresponding samples or filling in the missing values with techniques such as mean, median, or mode imputation. Address outliers with Winsorization or outlier imputation.
Data Integration - If you have multiple data sources or datasets, you may need to integrate them into a single dataset. This typically involves handling inconsistencies in attribute names, resolving conflicts in data formats, and merging the data based on common identifiers.
Data Transformation - Common transformations include scaling numerical features to a similar range (e.g., using normalization or standardization), encoding categorical variables into numerical representations (e.g., one-hot encoding, or label encoding such as front-end development → 1, other development → 2), and transforming skewed distributions.
Imbalanced Data - Techniques such as oversampling the minority class, undersampling the majority class, or using advanced algorithms like SMOTE (Synthetic Minority Over-sampling Technique) can be employed to address the imbalance.
Time-Series Data - Handling missing or irregular timestamps, resampling or interpolating the data to a regular time interval, and creating lag features or rolling windows for capturing temporal patterns.
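A few of the preprocessing steps above (mean imputation, min-max normalization, one-hot encoding) can be sketched in plain Python. The function names and the sample values are hypothetical and chosen only for illustration; in practice a library such as pandas or scikit-learn would normally do this work.

```python
def impute_mean(values):
    """Data cleaning: replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(categories):
    """Data transformation: one 0/1 position per distinct category value."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

ages = [20, None, 40, 30]                 # hypothetical column with a missing value
filled = impute_mean(ages)                # [20, 30.0, 40, 30]
scaled = min_max_scale(filled)            # [0.0, 0.5, 1.0, 0.5]
roles = one_hot(["front-end", "other", "front-end"])  # [[1, 0], [0, 1], [1, 0]]
```

Unlike the 1 / 2 label encoding, one-hot encoding does not imply an order between the categories.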
What are the major types of machine learning algorithms?
Classification - Predicts categorical / nominal values
Regression - Predicts continuous values
What are the types of classification methods?
K Nearest Neighbour
Decision Tree
Support Vector Machine
Bayesian Classification
What are the conditions for stopping partitioning in a decision tree?
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning - majority voting is employed for classifying the leaf
There are no samples left
What is entropy?
Entropy is the measure of randomness in a dataset
What is the aim of decision tree?
Split the data in a way that the entropy in the data decreases so it is easier to make predictions
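As a rough sketch of the idea, entropy can be computed with the usual Shannon formula, and a split can be scored by how much it lowers the weighted entropy (information gain). The function names and the yes/no labels below are illustrative assumptions, not taken from the notes.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the size-weighted entropy of its partitions."""
    total = len(parent)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]       # maximally mixed: entropy is 1.0 bit
good_split = [["yes", "yes"], ["no", "no"]]   # both children are pure
bad_split = [["yes", "no"], ["yes", "no"]]    # children are as mixed as the parent

print(entropy(parent))                        # 1.0
print(information_gain(parent, good_split))   # 1.0
print(information_gain(parent, bad_split))    # 0.0
```

A decision tree learner that uses this measure would prefer the first split, because it removes all of the uncertainty about the class.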
What are the two approaches to avoid overfitting?
Prepruning: Halt tree construction early - do not split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold
Postpruning: Remove branches from a “fully grown” tree - get a sequence of progressively pruned trees
Use a set of data different from the training data to decide which is the “best pruned tree”
What are the characteristics of Naive Bayesian Classification?
Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
What is a confusion matrix?
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
What are the clustering methods?
K-Means
Gaussian Mixture Model
Mean-Shift
Hierarchical Clustering
What is the stopping/convergence criterion for K-Means?
no (or minimum) re-assignments of data points to different clusters
no (or minimum) change of centroids, or
minimum decrease in the sum of squared error J
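The three stopping criteria can be seen in a minimal K-Means sketch. The helper names, the tolerance `tol`, and the sample points are assumptions made for this example; the initial centroids are passed in explicitly so the run is reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, centroids, max_iter=100, tol=1e-9):
    assignments, prev_sse = None, float("inf")
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, new_assignments) if a == j]
            new_centroids.append(mean_point(members) if members else old)
        sse = sum(sq_dist(p, new_centroids[a]) for p, a in zip(points, new_assignments))

        no_reassignment = new_assignments == assignments      # criterion 1
        no_centroid_change = new_centroids == list(centroids)  # criterion 2
        small_sse_decrease = prev_sse - sse < tol              # criterion 3

        assignments, centroids, prev_sse = new_assignments, new_centroids, sse
        if no_reassignment or no_centroid_change or small_sse_decrease:
            break
    return centroids, assignments, prev_sse

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, assignments, sse = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)     # [(1.0, 1.5), (8.5, 8.0)]
print(assignments)   # [0, 0, 1, 1]
print(sse)           # 1.0
```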
What are the limitations of K-Means?
Very sensitive to the initial points. Remedy: do many runs of k-means, each with different initial centroids.
k must be chosen manually. Remedy: learn the optimal k for the clustering.
K-means has problems when clusters are of differing size, densities, non-globular shapes
K-means has problems when the data contains outliers
What are the advantages and disadvantages of mean shift?
Advantages
Does not assume number of clusters
Just a single parameter
Finds variable number of modes
Robust to outliers
Disadvantages
Output depends on window size
Computationally expensive (for a given input size, it requires a relatively large number of steps to complete)
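A simplified one-dimensional mean shift with a flat (uniform) window illustrates several of these points: the number of clusters is not given in advance, the only parameter is the window size, and the result changes with it. This is a sketch under those simplifying assumptions; the function name, the merge rule (modes closer than half the bandwidth are treated as one), and the data are hypothetical.

```python
def mean_shift_1d(points, bandwidth, max_iter=100, tol=1e-6):
    """Shift every point to the mean of its window until it stops moving."""
    modes = []
    for x in points:
        for _ in range(max_iter):
            window = [p for p in points if abs(p - x) <= bandwidth]
            if not window:
                break
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        # Keep the converged position only if it is not close to a known mode.
        if all(abs(m - x) >= bandwidth / 2 for m in modes):
            modes.append(x)
    return modes

data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 20.0]   # two groups plus one outlier

narrow = mean_shift_1d(data, bandwidth=1.0)
wide = mean_shift_1d(data, bandwidth=30.0)
print(len(narrow))   # 3 modes: near 1.0, near 5.0, and the outlier at 20.0
print(len(wide))     # 1 mode: the window covers everything
```

With the narrow window the outlier ends up as its own mode instead of dragging the other two modes toward it. Every point is compared with every other point in each iteration, which is where the computational cost comes from.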
What are the types of hierarchical clustering?
Agglomerative (bottom up) clustering: It builds the dendrogram (tree) from the bottom level.
Divisive (top down) clustering: It starts with all data points in one cluster, the root.
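The bottom-up variant can be sketched as follows: start with every point in its own cluster and repeatedly merge the two closest clusters, recording each merge. The single-linkage distance (closest pair of members), the one-dimensional data, and the function name are assumptions made for this example.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on 1-D points.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram is drawn from.
    """
    clusters = [[p] for p in points]   # bottom level: one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerative([1.0, 2.0, 6.0, 7.0, 20.0])
for left, right, dist in merges:
    print(left, right, dist)
```

The merge distances grow from 1.0 to 13.0 as the tree is built upward; cutting the dendrogram at a chosen distance yields the clusters.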
What are the evaluation measures based on internal information?
Intra-cluster cohesion (compactness)
Cohesion measures how near the data points in a cluster are to the cluster centroid.
Sum of squared error (SSE) is a commonly used measure.
Inter-cluster separation (isolation)
Separation means that different cluster centroids should be far away from one another.
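Both measures are easy to compute once the clusters are known. The sketch below uses SSE for cohesion and, as one possible separation measure, the smallest distance between any two centroids; the function names and the sample clusters are illustrative assumptions.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def sse(clusters):
    """Cohesion: sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(math.dist(p, center) ** 2 for p in cluster)
    return total

def min_centroid_separation(clusters):
    """Separation: smallest distance between any two cluster centroids."""
    centers = [centroid(c) for c in clusters]
    return min(math.dist(a, b) for a, b in combinations(centers, 2))

clusters = [
    [(1.0, 1.0), (1.0, 3.0)],   # centroid (1.0, 2.0)
    [(7.0, 2.0), (9.0, 2.0)],   # centroid (8.0, 2.0)
]
print(sse(clusters))                      # 4.0
print(min_centroid_separation(clusters))  # 7.0
```

A lower SSE means more compact clusters, and a larger centroid distance means better isolated clusters.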
Confusion matrix for predicted no, predicted yes, actual no and actual yes
Predicted no - Negative
Predicted yes - Positive
Predicted is actual - True
Predicted is not actual - False
Actual no but predicted yes - False positive
Actual no and predicted no - True negative (True = correct, negative = no)
Actual yes but predicted no - False negative
Actual yes and predicted yes - True positive
What is the formula for accuracy, precision and recall?
Accuracy = (TP + TN) / total
Precision = TP / predicted yes
Recall = TP / actual yes
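A worked example with made-up counts may help. Here "predicted yes" is TP + FP and "actual yes" is TP + FN; the function name and the four counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    predicted_yes = tp + fp
    actual_yes = tp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / predicted_yes if predicted_yes else 0.0
    recall = tp / actual_yes if actual_yes else 0.0
    return accuracy, precision, recall

# Hypothetical counts for 100 test samples.
accuracy, precision, recall = classification_metrics(tp=40, tn=45, fp=10, fn=5)
print(accuracy)           # 0.85
print(precision)          # 0.8
print(round(recall, 3))   # 0.889
```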