Flashcards generated from Data Mining Final Revision lecture notes.
What type of datasets are sales/transactional data, customer databases, and healthcare records typically stored in?
Relational databases
What is the organization level of Semistructured data?
They do not have a strict format, but have some level of organization.
What are examples of Semistructured data?
XML, JSON, emails
What is the main characteristic of Unstructured data sets?
They lack predefined semantic structure.
Give examples of Unstructured data sets.
text data, images, audio files
Give an example of an ordinal attribute.
The ranking in number of stars of the last hotel visited by a customer
What are the characteristics of data when performing data Classification?
Data set with n input attributes x1,…,xn and a target categorical attribute y
List examples of Classification data mining tasks
medical diagnosis, churn prediction, sentiment analysis, spam detection, loan underwriting
What are the characteristics of data when performing data Regression?
Data set with n input attributes x1,…,xn and a target numeric attribute y
List examples of Regression data mining tasks
Demand and price prediction, marketing return of investment, traffic flow prediction
What are the characteristics of data when performing data Clustering?
A data set with m unlabelled instances x1,…,xm. The output is a grouping of the instances into clusters.
List examples of Clustering data mining tasks
customer segmentation, recommendations, transport
What are the characteristics of data when performing data Association?
A data set with m instances and n features. The output is frequent combinations of attribute values and association rules X → Y, where X and Y are subsets of attribute values.
List examples of Association data mining tasks
market basket analysis, statistical data analysis
Which data mining tasks are unsupervised learning tasks?
Association and Clustering
What are the major issues we may have to deal with when working with real-world data sets?
Massive data, High-dimensional data, Insufficient, biased, or sparse data, Missing values, Noisy data
List a few important steps of data mining as a process.
Objective specification, Data exploration, Data cleaning and preparation, Model building, Model evaluation, Repeat
Which of the following statements about Data Mining is false? Can be used for making predictions, Can be automated, Extracts meaningful and useful patterns from data, Requires generating new data.
Requires generating new data.
Which of the following statements about tabular data sets, i.e. ones represented in the form of a two-dimensional table, is false? a. Columns correspond to attributes. b. Attributes are typically assumed to be independent. c. Instances are typically assumed to be independent. d. Rows correspond to instances.
Attributes are typically assumed to be independent.
Which of the following data mining tasks aims to divide the instances into groups, so that instances in the same group are more similar than instances in different groups? Clustering, Regression, Association, Classification.
Clustering
Which of the following statements about data mining as a process is false? Creating data sets from raw data in order to apply data mining methods can be a very time-consuming task in practice. Selecting the right attributes in the data set affects the performance of the constructed model. Deciding which model to use depends on the type of the available data. Overfitting implies the constructed model is very good.
Overfitting implies the constructed model is very good.
Which of the following knowledge representations do not require the computation of an actual model? Instance-based, Classification Rules, Decision Trees, Clusters.
Instance-based
What is the definition of Attribute?
Individual measurable characteristic of the data mining problem under consideration
What is a classification rule?
A type of a data mining model for predicting the class of an instance in the form of an “if-then” statement.
The if part is called the antecedent or precondition, and the then part the consequent or conclusion.
What is a decision tree?
It is a tree-like model of decisions for making predictions. Nodes represent decisions.
Describe one way Linear Regression can be useful in data mining.
Interpretable insights about attribute relationships. OR Efficiently computable. OR Impactful, e.g. Moore’s law, BIM index. OR Confidence intervals. OR Part of complex models, e.g. piecewise regression, model trees.
What are possible approaches to learn a linear regression model?
Gradient descent OR Closed form solution using matrix inversion.
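As a rough illustration of the two approaches, the sketch below fits y = w0 + w1·x both ways on a toy data set. It assumes a single input attribute, so the closed form reduces to scalar formulas rather than an explicit matrix inversion; the function names, learning rate, and step count are illustrative choices, not taken from the lecture notes.

```python
def fit_closed_form(xs, ys):
    """Least-squares fit of y = w0 + w1*x (1-D special case of the normal equations)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimise the mean squared error with batch gradient descent."""
    n = len(xs)
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        residuals = [w0 + w1 * x - y for x, y in zip(xs, ys)]
        grad_w0 = 2 * sum(residuals) / n
        grad_w1 = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # toy data generated from y = 1 + 2x

closed = fit_closed_form(xs, ys)
iterative = fit_gradient_descent(xs, ys)
print(closed, iterative)
```

On this noise-free data both approaches should agree on w0 ≈ 1 and w1 ≈ 2; gradient descent only gets there if the learning rate is small enough for the data at hand.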
What are the sources of overfitting?
Too many (irrelevant or redundant) features OR Too complex models may capture noise / local variations
Give examples of ways to deal with Overfitting.
Feature selection and dimensionality reduction, e.g. Linear Regression, Principal Component Analysis. OR Resampling, e.g. Cross-validation, Bootstrapping. OR Hyper-parameter tuning and model selection, e.g. Grid Search, Random Search, Bayesian Optimisation. OR Modified loss function, e.g. Regularization, Akaike Information Criterion, Bayesian Information Criterion.
What does Regularization or Shrinkage do to the model parameters?
Shrinks (regularizes) the model parameters towards zero.
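A minimal sketch of the shrinkage effect, assuming ridge (L2) regularization on a model y ≈ w·x with one attribute and no intercept. The penalty strength `lam` and the helper name `ridge_weight` are illustrative assumptions.

```python
def ridge_weight(xs, ys, lam):
    """Minimiser of sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data generated from y = 2x

# Larger penalties pull the learned weight further towards zero.
weights = [ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
print(weights)
```

With `lam = 0` the unregularized least-squares weight is recovered; each larger penalty gives a strictly smaller weight.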
What is a linear separator?
A linear function (or hyperplane) w⊤x = 0
Which of the following statements about multilayer perceptron and neural networks is false? a) The activation function type is a hyper-parameter. b) During training, weights are updated with a forward pass. c) The MSE is commonly used as evaluation metric. d) The number of hidden layers is decided during validation.
During training, weights are updated with a forward pass.
What are parameters and hyper-parameters in a multilayer perceptron neural network?
Parameters: Computed during training. Weights of the network links. Hyper-parameters: Computed during validation (model selection). Number of hidden layers, number of nodes per layer, incoming links per node (connectivity), activation function, etc.
List properties of the activation function
Nonlinear, Differentiable, Monotonic
Which of the following statements about linear regression is true? Predicts a numeric output from a nominal input. Predicts a nominal output from a numeric input. Predicts a nominal output from a nominal input. Predicts a numeric output from a numeric input.
Predicts a numeric output from a numeric input.
Which of the following statements about 10-fold cross-validation is false? Estimates the error rate of a regression model on new instances which have not been used for training. Computes 10 different models. Divides the instances into 10 disjoint groups of roughly equal size. Each fold is used exactly once for training.
Each fold is used exactly once for training.
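To see why that statement is false, the sketch below builds the folds explicitly: each fold is held out for testing exactly once and is part of the training set in the other nine rounds. The splitting scheme (round-robin assignment, no shuffling) and the function name are simplifying assumptions.

```python
def k_fold_splits(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]  # k disjoint groups of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(25, k=10))
test_counts = [sum(i in test for _, test in splits) for i in range(25)]
train_counts = [sum(i in train for train, _ in splits) for i in range(25)]
print(len(splits), set(test_counts), set(train_counts))
```

In practice the instances would usually be shuffled (and possibly stratified) before being assigned to folds.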
Given the linear regression equation y = w0 + w1x1 + w2x2, what do the terms x1 and x2 represent? Labels. Residuals. Weights. Attribute values.
Attribute values.
Which of the following statements about perceptrons and neural networks is true? The observed and true values of target variables are added to compute the score of a neural network. The activation function of a neural network can be decided when building the model. To make predictions with a given neural network, we perform backpropagation. A perceptron is a multilayered neural network
The activation function of a neural network can be decided when building the model.
Which of the following statements about correlation and causation is true? They always occur together. None of the other choices is true. They never occur together. They sometimes occur together.
They sometimes occur together.
In the nominal weather data set, what is the error rate of the classification rule “if temperature=mild, then play=yes”?
2/6
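The count can be checked with a short sketch. It assumes the card refers to the standard 14-instance nominal weather data set from the textbook literature; only the temperature and play columns are reproduced here.

```python
# (temperature, play) pairs; assumed to match the standard nominal weather data set
weather = [
    ("hot", "no"), ("hot", "no"), ("hot", "yes"), ("mild", "yes"),
    ("cool", "yes"), ("cool", "no"), ("cool", "yes"), ("mild", "no"),
    ("cool", "yes"), ("mild", "yes"), ("mild", "yes"), ("mild", "yes"),
    ("hot", "yes"), ("mild", "no"),
]

# Instances covered by the rule "if temperature=mild, then play=yes"
covered = [play for temperature, play in weather if temperature == "mild"]
errors = sum(1 for play in covered if play != "yes")
print(f"{errors}/{len(covered)}")  # prints 2/6
```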
Why is the 1R algorithm called 1R?
The algorithm learns a set of Rules. 1 because the rules are only based on 1 input attribute
What is a key disadvantage of such a model?
Overfitting
Classification rules can be appropriate for preliminary medical diagnosis because:
Given a set of symptoms, they predict a candidate disease. They may associate a set of symptoms with multiple diseases. Frequency of a symptom being associated with a disease. Interpretable to general public.
What is Entropy?
A measure of uncertainty / information of random variables.
What is a deterministic classification model?
Can be viewed as dividing the space of attribute values into regions and classifying instances based on their locations in the space
What do Probabilistic classification models do?
Yield a probability distribution over the set of possible labels for each instance.
What is FPR?
False Positive Rate: FPR = FP / (FP + TN), the fraction of actual negatives that are classified as positive.
What is TPR?
True Positive Rate: TPR = TP / (TP + FN), the fraction of actual positives that are classified as positive.
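A small sketch of how both rates can be computed from actual and predicted labels. The function name and the toy labels are illustrative assumptions.

```python
def tpr_fpr(actual, predicted, positive="yes"):
    """Return (TPR, FPR) for binary labels, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"]
tpr, fpr = tpr_fpr(actual, predicted)
print(tpr, fpr)
```

Here 3 of the 4 positives are found (TPR = 0.75) and 1 of the 6 negatives is flagged by mistake (FPR = 1/6). Plotting TPR against FPR over different thresholds gives the ROC curve.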
What is meant by Stratification?
Stratified holdout: the process of randomly sampling to create the training and test sets so that they contain proportional representations of each class.
What are Ensemble Methods?
We may combine individual models using: weighted voting for classification, weighted averaging for regression. Common ensemble methods: Bagging, Boosting.
Which of the following measures the relationship between true positives and false positives for different subsets of instances in the data set? Confusion Matrix, Cohen's Kappa, Lift, Receiver Operating Characteristic (ROC).
Receiver Operating Characteristic (ROC)
What is 1R?
An algorithm for computing classification rules, based on class frequencies of individual attributes.
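A compact sketch of 1R, assuming nominal attributes and the standard 14-instance weather data set (the rows below are reproduced from the textbook version and are an assumption about which data the cards refer to). For each attribute it maps every value to its most frequent class, counts the resulting errors, and keeps the attribute with the fewest.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [  # last column is the class (play)
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def one_r(rows, attributes):
    """Return (attribute, rules, errors) for the single attribute with fewest errors."""
    best = None
    for col, name in enumerate(attributes):
        counts = defaultdict(Counter)  # attribute value -> class frequencies
        for row in rows:
            counts[row[col]][row[-1]] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best


attribute, rules, errors = one_r(ROWS, ATTRIBUTES)
print(attribute, rules, f"{errors}/{len(ROWS)}")
```

Ties between attributes are broken here by keeping the first one encountered, which is one of several reasonable conventions.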
A node in a decision tree represents
A group of data points (or instances) that meet certain conditions.
Describe a decision-tree building process.
Divide-and-conquer method for constructing decision trees: (1) Select an attribute at the root. (2) Make one split for each possible attribute value (assume a finite set of possible values). (3) Repeat from (1) for each child. In general, divide-and-conquer recursively breaks the problem down into two or more smaller subproblems until these become simple enough, then combines the subproblem solutions to solve the original problem.
Describe Notion of the entropy and how it is useful for building decision trees.
Entropy is a measure of the disorder of a system. It quantifies the uncertainty associated with a random variable. Many low-probability events have high entropy. Few high-probability events have lower entropy. It computes the amount of information in number of bits. The higher the information acquisition, the higher the entropy. The more certain an event is, the less information it contains. If a leaf is entirely pure, then the entropy is low. If a leaf is not pure, then the entropy is high. Select splits maximizing information gain.
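The sketch below computes entropy in bits from class counts and the information gain of a candidate split. The counts are an assumption: they correspond to splitting the standard 14-instance weather data set (9 yes, 5 no) on the outlook attribute.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder


parent = [9, 5]                              # yes / no at the root
outlook_children = [[2, 3], [4, 0], [3, 2]]  # sunny, overcast, rainy

root_entropy = entropy(parent)
gain = information_gain(parent, outlook_children)
print(round(root_entropy, 3), round(gain, 3))
```

The pure child `[4, 0]` contributes zero entropy, which is why splits that produce pure leaves tend to have high information gain.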
What are the characteristics of a good set of clusters?
Low within-cluster distance and high between-cluster distance.
Advantage of using Euclidean distance metric
Insightful with useful geometric properties and interpretations.
Limitations of using Euclidean distance metrics?
Sensitive to the magnitude of attribute values. Does not effectively represent non-Euclidean spaces.
Which metrics are alternatives to Euclidean distance metrics?
Manhattan, Minkowski, Chebyshev, Cosine Similarity
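For reference, a minimal sketch of these metrics on numeric vectors. The function names are illustrative; Euclidean and Manhattan are written as the p = 2 and p = 1 cases of Minkowski.

```python
from math import sqrt


def minkowski(a, b, p):
    """Minkowski distance of order p."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def euclidean(a, b):
    return minkowski(a, b, 2)


def manhattan(a, b):
    return minkowski(a, b, 1)


def chebyshev(a, b):
    """Largest absolute difference along any single attribute."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both must be non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


a, b = (1.0, 0.0), (4.0, 4.0)
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b), cosine_similarity(a, b))
```

Note that cosine similarity is a similarity rather than a distance: larger values mean the vectors point in more similar directions.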
The Euclidean, Manhattan, Chebyshev, and Cosine distance metrics tend not to work well with
Nominal attribute values OR Non-Euclidean spaces
What does distance metric learning aim to do?
Learn a distance function for a data set to enhance the performance of similarity-based algorithms.
From a computational perspective, optimization algorithms have two main qualities
Running time and solution quality.
What are the properties of Kmeans with Euclidean distances?
The objective value decreases at each iteration.
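The monotone behaviour can be observed in a small sketch of the standard assign-then-update loop (Lloyd's algorithm), which records the within-cluster sum of squared Euclidean distances after every iteration. The toy points, initial centroids, and function names are illustrative assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Return final centroids and the objective value after each iteration."""
    history = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean(c) if c else old for c, old in zip(clusters, centroids)]
        history.append(sum(sq_dist(p, centroids[i])
                           for i, c in enumerate(clusters) for p in c))
    return centroids, history


points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids, history = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
print(history)
```

Strictly speaking the objective never increases: it drops while assignments still change and then stays constant once the algorithm has converged.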
List Kmeans Advantages
Simple to implement and interpretable. Scalable to massive data sets with many instances. Easily adaptable to new data points.
List Kmeans Disadvantages
Manually selecting the number of clusters. Sensitive to outliers (the Kmedoids variant is more robust). Does not scale well to high-dimensional data. Cannot handle non-convex sets.
List Hierarchical Clustering Advantages
More informative than unstructured flat clusters. Easy to implement
List Hierarchical Clustering Disadvantages
High running time. Not easily adaptable (cannot undo previous operations). Sensitive to noise.
DBSCAN Advantages:
Does not require a specified number of clusters. Can compute clusters with arbitrary shapes. Incorporates a notion of noise and is robust to outliers.
DBSCAN Disadvantages
Quality deteriorates if the data set contains instances with large density differences. Border points might be reachable from multiple core points.
Which of the following statements about overlapping clusters is true? Some instances may belong to more than one cluster. All instances must belong to more than one cluster. Instances belong to only one cluster. Empty clusters may exist.
Some instances may belong to more than one cluster.
What is the complete-linkage metric? A distance metric often used by k-means clustering. The largest distance between two instances in the same cluster. A distance metric often used for hierarchical clustering. The average distance between all members of two clusters.
The largest distance between two instances in the same cluster.
Which of the following clustering algorithms successively divides clusters into sub-clusters? k-means, DBSCAN, Incremental, Hierarchical.
Hierarchical
Which of the following statements for an outlier point is true? Has high probability of occurrence if we build a good probabilistic model for the data. Is close to a large number of points. Has low proximity to all other data points. Strongly belongs to some cluster.
Has low proximity to all other data points.
Fuzzy clustering is useful mainly for:
Overlapping clusters.
Statistical Modeling use cases:
Data Generation, Simulations, Classification, Regression, Clustering
Statistical Model:
A set M(θ) of candidate probabilistic models to have generated the observed data D. Typically, it is specified by a set Θ of parameter values. Each parameter θ ∈ Θ corresponds to a probabilistic model M(θ).
Bernoulli Distribution
models a random event with only two possible outcomes, like a coin flip
Maximum Likelihood Estimation (MLE)
Find the parameter value that makes the observed data most likely. Assumes that the model parameters have fixed values.
Maximum A Posteriori Estimation (MAP)
Like MLE, but it also includes prior beliefs about the parameter. It finds the parameter value that is most likely after combining data with prior knowledge. MAP = MLE + prior
Bayesian Estimation
Goes further than MAP: it gives a whole distribution of possible parameter values, not just one best guess. It updates our belief about the parameter using Bayes’ Theorem.
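The three estimators can be compared on a coin-flip (Bernoulli) example. The sketch assumes a Beta(a, b) prior on the probability of heads, which is a common conjugate choice but not something the cards specify; the data (7 heads in 10 flips) and the prior Beta(2, 2) are made up for illustration.

```python
def mle(heads, n):
    """Maximum likelihood estimate of P(heads)."""
    return heads / n


def map_estimate(heads, n, a, b):
    """Mode of the Beta(a + heads, b + n - heads) posterior (requires a, b > 1)."""
    return (heads + a - 1) / (n + a + b - 2)


def posterior_params(heads, n, a, b):
    """Bayesian estimation keeps the whole posterior, here a Beta distribution."""
    return a + heads, b + n - heads


heads, n = 7, 10
a, b = 2, 2  # assumed prior: mild belief that the coin is fair

post_a, post_b = posterior_params(heads, n, a, b)
posterior_mean = post_a / (post_a + post_b)
print(mle(heads, n), map_estimate(heads, n, a, b), (post_a, post_b), posterior_mean)
```

The MLE is 0.7, while the prior pulls the MAP estimate (8/12) and the posterior mean (9/14) towards 0.5. With a flat Beta(1, 1) prior the MAP estimate coincides with the MLE.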
Parametric models:
the model has a specific form with fixed number of parameters, independent of the training set size.
Non-parametric models:
the model incorporates fewer assumptions, i.e. is more data-driven, and typically the number of parameters grows as the sample size increases.
Which of the following statements about probabilities is false? Probability density functions model mainly discrete data. Probability density functions model mainly continuous data. Two variables are independent if the knowledge of one does not affect the probability distribution of the other. Joint probability distributions model relationships between multiple variables.
Probability density functions model mainly discrete data.
What is a prior probability? The probability of observing the data, given that a hypothesis holds. The probability of observing the data without a hypothesis. The probability that a hypothesis holds, given the data. The probability that a hypothesis holds without having observed any data.
The probability that a hypothesis holds without having observed any data.
Which of the following methods of estimation assumes a probability distribution for the parameter values? Bayesian estimation. Binomial distribution. Maximum likelihood estimation. Gaussian mixture model.
Bayesian estimation.
What is a mixture model?
A model combining simpler component distributions.
Which of the following statements about Naïve Bayes is false? It can be mainly applied to nominal attributes. It computes probabilities by calculating proportions of instances. It assumes that the attributes are conditionally independent. It uses Laplacian estimation to deal with missing values.
It uses Laplacian estimation to deal with missing values.
What are Instance-Based Learning main components?
Distance function: Calculate distance between any pair of instances. Combination function: Combine nearest neighbors to compute the output
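A minimal k-nearest-neighbours sketch showing both components. It assumes Euclidean distance as the distance function and majority voting as the combination function; the training points and labels are toy data.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training instances."""
    # Distance function: Euclidean distance between instance and query.
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Combination function: most frequent label among the neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.0), "b"), ((6.0, 7.0), "b"),
]
print(knn_predict(train, (2.0, 2.0)), knn_predict(train, (6.0, 5.0)))
```

There is no training step: all the work happens at prediction time, which is why prediction, not training, is the time-consuming part of instance-based learning.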
Advantages of KNN
Nonlinear decision boundaries. Easily adaptable to new instances. Not restricted by the data format.
Which of the following is not a weakness of instance-based learning models? Prediction is time-consuming. Lack of strong distance / combination functions. Memory-intensive. Training is time-consuming
Training is time-consuming
Which of the following allows the efficient comparison of a snippet with a song in Shazam music recognition? Constellation plot. Anchor point kD-tree. Spectrogram.
Anchor point kD-tree.
Which of the following statements about kernel density estimation is false? Applies mainly to low-dimensional data in practice. Computes a smooth density estimate. Attempts to improve histograms. The choice of the kernel function is more critical than the choice of the bandwidth.
The choice of the kernel function is more critical than the choice of the bandwidth.
Locally weighted regression models: Contain a bounded number of smaller linear regression models. Use a kernel function. Have many spikes. Are non-smooth.
Use a kernel function.
Support vector machines use a kernel function mainly to: build a binary separator. map the original data into a higher-dimensional space. enforce maximum margin separation. ensure linearity.
map the original data into a higher-dimensional space.
Bag-of-Words
Simplifying representation of a corpus, e.g. using a term-document matrix, disregarding grammar and orders of words in documents.
How to reduce the dimensionality of the term-document matrix?
Stop word removal. Non-alphabetic character removal. Case folding. Stemming: Reduce words to their stem, e.g. eating, eats, eaten to eat. Sparse representation: Inverted index, e.g. wj: {d2 : 7, d5 : 4}. Dimensionality reduction: Latent Semantic Indexing via SVD. Word embeddings.
Log-probability
The logarithm of a probability. Applying exp, i.e. raising e to the power of that number, recovers the probability.
What are Information Retrieval Components
Document representation, Query representation, Evaluation, Matching and ranking
TF-IDF Statistic
TFIDF(t_i, d_j) = log2(1 + TF(t_i, d_j)) · IDF(t_i).
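A small sketch of the statistic on a toy corpus. The card does not define TF or IDF, so the sketch assumes TF is the raw count of the term in the document and IDF(t) = log2(N / df(t)), where N is the number of documents and df(t) the number of documents containing t; other IDF variants exist.

```python
from math import log2

# Toy corpus: each document is a list of already-tokenised terms.
docs = [
    ["data", "mining", "data"],
    ["text", "mining"],
    ["data", "retrieval"],
]


def tf(term, doc):
    """Raw number of occurrences of the term in the document (assumed definition)."""
    return doc.count(term)


def idf(term, docs):
    """Assumed IDF variant: log2(N / df); the term must occur in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return log2(len(docs) / df)


def tfidf(term, doc, docs):
    return log2(1 + tf(term, doc)) * idf(term, docs)


print(tfidf("data", docs[0], docs), tfidf("text", docs[1], docs))
```

The rare term "text" scores higher in its document than the more widespread term "data" does in the first document, even though "data" occurs there twice.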
PageRank
Rank a set of objects given votes between them
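One way to sketch this is the usual power-iteration scheme, where each link counts as a vote and every object repeatedly passes its current rank on to the objects it links to. The damping factor of 0.85, the iteration count, and the tiny link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a link graph given as {node: [nodes it votes for]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node, targets in links.items():
            if targets:
                # Split this node's rank evenly among the nodes it votes for.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank


# Every target must also appear as a key of the dictionary.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
print(sorted(rank, key=rank.get, reverse=True))
```

In this graph C collects votes from A, B, and D and ends up ranked first, while D receives no votes and keeps only the baseline share.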