Data Mining Final Revision

Flashcards generated from Data Mining Final Revision lecture notes.

117 Terms

New cards

What type of datasets are sales/transactional data, customer databases, and healthcare records typically stored in?

Relational databases

New cards

What is the organization level of Semistructured data?

They do not have a strict format, but have some level of organization.

New cards

What are examples of Semistructured data?

xml, json, emails

New cards

What is the main characteristic of Unstructured data sets?

They lack predefined semantic structure.

New cards

Give examples of Unstructured data sets.

text data, images, audio files

New cards

Give an example of an ordinal attribute.

The ranking in number of stars of the last hotel visited by a customer

New cards

What are the characteristics of data when performing data Classification?

Data set with n input attributes x1,…,xn and a target categorical attribute y

New cards

List examples of Classification data mining tasks

medical diagnosis, churn prediction, sentiment analysis, spam detection, loan underwriting

New cards

What are the characteristics of data when performing data Regression?

Data set with n input attributes x1,…,xn and a target numeric attribute y

New cards

List examples of Regression data mining tasks

Demand and price prediction, marketing return of investment, traffic flow prediction

New cards

What are the characteristics of data when performing data Clustering?

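A data set with m unlabelled instances x1,…,xm. The output is a grouping of the instances into clusters, so that instances in the same cluster are more similar than instances in different clusters.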

New cards

List examples of Clustering data mining tasks

customer segmentation, recommendations, transport

New cards

What are the characteristics of data when performing data Association?

A data set with m instances and n features. The output is frequent combinations of attribute values and association rules X → Y, where X and Y are subsets of attribute values.

New cards

List examples of Association data mining tasks

market basket analysis, statistical data analysis

New cards

Which of the data mining tasks (Classification, Regression, Clustering, Association) are unsupervised learning tasks?

Association and Clustering

New cards

What are major issues that we may have to deal with when working with real-world data sets?

Massive data, High-dimensional data, Insufficient, biased, or sparse data, Missing values, Noisy data

New cards

List a few important steps of data mining as a process.

Objective specification, Data exploration, Data cleaning and preparation, Model building, Model evaluation, Repeat

New cards

Which of the following statements about Data Mining is false? Can be used for making predictions, Can be automated, Extracts meaningful and useful patterns from data, Requires generating new data.

Requires generating new data.

New cards

Which of the following statements about tabular data sets, i.e. ones represented in the form of a two-dimensional table, is false? a. Columns correspond to attributes. b. Attributes are typically assumed to be independent. c. Instances are typically assumed to be independent. d. Rows correspond to instances.

Attributes are typically assumed to be independent.

New cards

Which of the following data mining tasks aims to divide the instances into groups, so that instances in the same group are more similar than instances in different groups? Clustering, Regression, Association, Classification.

Clustering

New cards

Which of the following statements about data mining as a process is false? Creating data sets from raw data in order to apply data mining methods can be a very time-consuming task in practice. Selecting the right attributes in the data set affects the performance of the constructed model. Deciding which model to use depends on the type of the available data. Overfitting implies the constructed model is very good.

Overfitting implies the constructed model is very good.

New cards

Which of the following knowledge representations does not require the computation of an actual model? Instance-based, Classification Rules, Decision Trees, Clusters.

Instance-based

New cards

What is the definition of Attribute?

Individual measurable characteristic of the data mining problem under consideration

New cards

What is a classification rule?

A type of a data mining model for predicting the class of an instance in the form of an “if-then” statement.
The if part is called the antecedent or precondition, and the then part the consequent or conclusion.

New cards

What is a decision tree?

It is a tree-like model of decisions for making predictions. Nodes represent decisions.

New cards

Describe one way Linear Regression can be useful in data mining.

Interpretable insights about attribute relationships. OR Efficiently computable. OR Impactful, e.g. Moore’s law, the BMI index. OR Confidence intervals. OR Part of complex models, e.g. piecewise regression, model trees.

New cards

What are possible approaches to learn a linear regression model?

Gradient descent OR Closed form solution using matrix inversion.
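
A minimal sketch of both approaches for a single input attribute, assuming a squared-error loss. The closed form here is the one-attribute special case of the matrix solution; the function names, learning rate, step count and toy data are illustrative assumptions, not taken from the lecture notes.

```python
def fit_closed_form(xs, ys):
    """Least-squares fit of y = w0 + w1*x using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimise the mean squared error by batch gradient descent."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w0 + w1 * x) - y for x, y in zip(xs, ys)]
        grad_w0 = 2 * sum(errors) / n
        grad_w1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated by y = 1 + 2x
print(fit_closed_form(xs, ys))
print(fit_gradient_descent(xs, ys))
```

On this toy data both approaches should recover weights close to w0 = 1 and w1 = 2; gradient descent only approximates the closed-form answer and depends on the chosen learning rate.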

New cards

What are the sources of overfitting?

Too many (irrelevant or redundant) features OR Too complex models may capture noise / local variations

New cards

Give examples of ways to deal with Overfitting.

Feature selection and dimensionality reduction, e.g. Linear Regression, Principal Component Analysis. OR Resampling, e.g. Cross-validation, Bootstrapping. OR Hyper-parameter tuning and model selection, e.g. Grid Search, Random Search, Bayesian Optimisation. OR Modified loss function, e.g. Regularization, Akaike Information Criterion, Bayesian Information Criterion.

New cards

What does Regularization or Shrinkage do to the model parameters?

Shrinks (regularizes) the model parameters towards zero.
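
To illustrate the shrinkage effect, here is a small sketch assuming an L2 (ridge) penalty and a single attribute without an intercept. The function name `ridge_weight` and the toy data are hypothetical.

```python
def ridge_weight(xs, ys, lam):
    """Minimiser of sum((y - w*x)^2) + lam * w^2 for a single attribute."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # toy data generated by y = 2x
for lam in (0.0, 1.0, 10.0, 100.0):
    # A larger penalty lam pulls the learned weight closer to zero.
    print(lam, ridge_weight(xs, ys, lam))
```

With no penalty the weight equals the ordinary least-squares solution; as the penalty grows, the weight moves towards zero without reaching it.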

New cards

What is a linear separator?

A linear function (or hyperplane) w⊤x = 0
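
As a rough sketch of how such a separator is used for classification: an instance is assigned to one class or the other depending on the sign of w⊤x. The weights below are hypothetical, and the first input is assumed to be a constant 1 acting as the bias term.

```python
def classify(w, x):
    """Return +1 or -1 depending on the side of the hyperplane w.x = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score >= 0 else -1


w = [-1.0, 1.0, 1.0]  # hypothetical weights; w[0] multiplies the constant input
print(classify(w, [1.0, 2.0, 0.5]))  # score = 1.5
print(classify(w, [1.0, 0.2, 0.3]))  # score = -0.5
```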

New cards

Which of the following statements about multilayer perceptron and neural networks is false? a) The activation function type is a hyper-parameter. b) During training, weights are updated with a forward pass. c) The MSE is commonly used as evaluation metric. d) The number of hidden layers is decided during validation.

During training, weights are updated with a forward pass.

New cards

What are parameters and hyper-parameters in a multilayer perceptron neural network?

Parameters: computed during training; the weights of the network links. Hyper-parameters: computed during validation (model selection); number of hidden layers, number of nodes per layer, incoming links per node (connectivity), activation function, etc.

New cards

List properties of the activation function

Nonlinear, Differentiable, Monotonic

New cards

Which of the following statements about linear regression is true? Predicts a numeric output from a nominal input. Predicts a nominal output from a numeric input. Predicts a nominal output from a nominal input. Predicts a numeric output from a numeric input.

Predicts a numeric output from a numeric input.

New cards

Which of the following statements about 10-fold cross-validation is false? Estimates the error rate of a regression model on new instances which have not been used for training. Computes 10 different models. Divides the instances into 10 disjoint groups of roughly equal size. Each fold is used exactly once for training.

Each fold is used exactly once for training.
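
A minimal sketch of how the folds are used, which shows why that statement is false: each fold is used exactly once for testing and k − 1 times for training. The helper `k_fold_splits` is a hypothetical name, and no shuffling is done here, although in practice the instances would usually be shuffled first.

```python
def k_fold_splits(n_instances, k=10):
    """Yield (train_indices, test_indices) for k disjoint folds of near-equal size."""
    indices = list(range(n_instances))
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training data is every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(25, k=10))
for train, test in splits:
    print(len(train), len(test))
```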

New cards

Given the linear regression equation y = w0 + w1x1 + w2x2, what do the terms x1 and x2 represent? Labels. Residuals. Weights. Attribute values.

Attribute values.

New cards

Which of the following statements about perceptrons and neural networks is true? The observed and true values of target variables are added to compute the score of a neural network. The activation function of a neural network can be decided when building the model. To make predictions with a given neural network, we perform backpropagation. A perceptron is a multilayered neural network.

The activation function of a neural network can be decided when building the model.

New cards

Which of the following statements about correlation and causation is true? They always occur together. None of the other choices is true. They never occur together. They sometimes occur together.

They sometimes occur together.

New cards

In the nominal weather data set, what is the error rate of the classification rule “if temperature=mild, then play=yes”?

2/6
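
The count can be checked with a short sketch. The two columns below are assumed to match the standard 14-instance nominal weather data set; verify them against the copy used in the course. The helper `rule_error` is a hypothetical name.

```python
# Temperature and play columns of the standard 14-instance nominal weather data
# (assumed here; check against the copy used in the course).
temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
               "mild", "cool", "mild", "mild", "mild", "hot", "mild"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]


def rule_error(attribute, value, labels, predicted):
    """Return (errors, covered) for the rule 'if attribute=value then label=predicted'."""
    covered = [y for a, y in zip(attribute, labels) if a == value]
    errors = sum(1 for y in covered if y != predicted)
    return errors, len(covered)


errors, covered = rule_error(temperature, "mild", play, "yes")
print(f"{errors}/{covered}")  # prints 2/6
```

The rule covers the six instances with temperature=mild and misclassifies the two of them whose label is play=no.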

New cards

Why is the 1R algorithm called 1R?

R because the algorithm learns a set of Rules; 1 because the rules are based on only 1 input attribute.

New cards

What is a key disadvantage of such a model?

Overfitting

New cards

Classification rules can be appropriate for preliminary medical diagnosis because:

Given a set of symptoms, they predict a candidate disease. They may associate a set of symptoms with multiple diseases. Frequency of a symptom being associated with a disease. Interpretable to general public.

New cards

What is Entropy?

A measure of uncertainty / information of random variables.

New cards

What are deterministic classification models?

Can be viewed as dividing the space of attribute values into regions and classifying instances based on their locations in the space

New cards

What do Probabilistic classification models do?

Yield a probability distribution over the set of possible labels for each instance.

New cards

What is FPR?

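False Positive Rate: FPR = FP / (FP + TN), the proportion of actual negative instances that are incorrectly classified as positive.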

New cards

What is TPR?

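True Positive Rate: TPR = TP / (TP + FN), the proportion of actual positive instances that are correctly classified as positive.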
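
Both rates can be read off the confusion-matrix counts, using the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A small sketch follows; the function name `tpr_fpr` and the toy labels are illustrative.

```python
def tpr_fpr(y_true, y_pred, positive="yes"):
    """Compute the true positive rate and false positive rate for a binary task."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


y_true = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
y_pred = ["yes", "yes", "yes", "no", "yes", "no", "no", "no"]
print(tpr_fpr(y_true, y_pred))  # 3 of 4 positives found, 1 of 4 negatives flagged
```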

New cards

What is meant by Stratification?

Stratified holdout: the process of randomly sampling to create the training and test sets so that they contain proportional representations of each class.
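
A minimal sketch of a stratified holdout split: the sampling is done separately within each class, so both sets keep roughly the original class proportions. The function name, the 25% test fraction and the fixed seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with per-class proportional sampling."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random sampling within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = ["a"] * 8 + ["b"] * 4
train, test = stratified_holdout(labels)
print([labels[i] for i in test])
```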

New cards

What are Ensemble Methods?

Methods that combine multiple individual models, using weighted voting for classification or weighted averaging for regression. Common ensemble methods: Bagging, Boosting.

New cards

Which of the following measures the relationship between true positives and false positives for different subsets of instances in the data set? Confusion Matrix. Cohen's Kappa. Lift. Receiver Operating Characteristic (ROC).

Receiver Operating Characteristic (ROC)

New cards

What is 1R?

An algorithm for computing classification rules, based on class frequencies of individual attributes.

New cards

A node in a decision tree represents

A group of data points (or instances) that meet certain conditions.

New cards

Describe a decision-tree building process.

Divide-and-conquer method for constructing decision trees:
1. Select an attribute at the root.
2. Make one split for each possible attribute value (assume a finite set of possible values).
3. Repeat (1) for each child.
In general, divide-and-conquer recursively breaks down the problem into two or more smaller subproblems until these become simple enough, then combines the subproblem solutions to solve the original problem.

New cards

Describe Notion of the entropy and how it is useful for building decision trees.

Entropy is a measure of the disorder of a system; it quantifies the uncertainty associated with a random variable. Many low-probability events have high entropy; few high-probability events have lower entropy. It computes the amount of information in bits: the higher the information acquired, the higher the entropy, and the more certain an event is, the less information it contains. If a leaf is entirely pure, the entropy is low; if a leaf is not pure, the entropy is high. When building a decision tree, select the splits that maximize information gain, i.e. the reduction in entropy.
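
The link between entropy and split selection can be sketched as follows, using the standard definitions of Shannon entropy (in bits) and information gain. The function names and toy data are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in a list of labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(attribute, labels):
    """Entropy before the split minus the weighted entropy of the children."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [y for a, y in zip(attribute, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]   # separates the classes completely
useless = ["a", "b", "a", "b"]   # leaves each child as mixed as the parent
print(entropy(labels))
print(information_gain(perfect, labels), information_gain(useless, labels))
```

A pure set of labels has entropy 0, an evenly mixed binary set has entropy 1 bit, and the attribute whose split produces the purest children has the highest information gain.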

New cards

What are the characteristics of a good set of clusters?

Low within-cluster distance and high between-cluster distance.

New cards

Advantage of using Euclidean distance metric

Insightful with useful geometric properties and interpretations.

New cards

Limitations of using Euclidean distance metrics?

Sensitive to the magnitude of attribute values. Does not effectively represent non-Euclidean spaces.

New cards

Which metrics are alternatives to Euclidean distance metrics?

Manhattan, Minkowski, Chebyshev, Cosine Similarity
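
For reference, a small sketch of these metrics for numeric vectors, using their standard definitions. Manhattan and Euclidean distance are the Minkowski distance with p = 1 and p = 2; the function names are illustrative.

```python
from math import sqrt


def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def manhattan(a, b):
    return minkowski(a, b, 1)


def euclidean(a, b):
    return minkowski(a, b, 2)


def chebyshev(a, b):
    """Largest absolute difference over all attributes."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b), euclidean(a, b), chebyshev(a, b))
print(cosine_similarity((1.0, 2.0), (2.0, 4.0)))
```

Note that cosine similarity is a similarity rather than a distance: it ignores vector magnitude and is undefined for an all-zero vector.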

New cards

The Euclidean, Manhattan, Chebyshev, and Cosine distance metrics tend not to work well with

Nominal attribute values OR Non-Euclidean spaces

New cards

What does distance metric learning aim to do?

Learn a distance metric from the data to enhance the performance of similarity-based algorithms.

New cards

From a computational perspective, optimization algorithms have two main qualities

Running time and solution quality.

New cards

What are the properties of Kmeans with Euclidean distances?

The objective value (within-cluster sum of squared distances) does not increase at any iteration, so the algorithm converges.
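
This property can be observed with a minimal sketch of the usual assign-then-update loop (Lloyd's algorithm), recording the within-cluster sum of squared Euclidean distances after each iteration. The points, initial centroids and function names are made up for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Run k-means; return the final centroids and the objective per iteration."""
    history = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [mean(cl) if cl else c for cl, c in zip(clusters, centroids)]
        history.append(sum(sq_dist(p, c)
                           for cl, c in zip(clusters, centroids) for p in cl))
    return centroids, history


points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
centroids, history = kmeans(points, [(1.0, 1.0), (5.0, 7.0)])
print(history)
```

The recorded objective values never go up from one iteration to the next, although the final clustering is only a local optimum that depends on the initial centroids.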

New cards

List Kmeans Advantages

Simple to implement and interpretable. Scalable to massive data sets with many instances. Easily adaptable to new data points.

New cards

List Kmeans Disadvantages

Manually selecting the number of clusters. Sensitive to outliers (the K-medoids variant is more robust). Does not scale well to high-dimensional data. Cannot handle non-convex sets.

New cards

List Hierarchical Clustering Advantages

More informative than unstructured flat clusters. Easy to implement

New cards

List Hierarchical Clustering Disadvantages

High running time. Not easily adaptable (cannot undo previous operations). Sensitive to noise.

New cards

DBSCAN Advantages:

Does not require a specified number of clusters. Can compute clusters with arbitrary shapes. Incorporates a notion of noise and is robust to outliers.

New cards

DBSCAN Disadvantages

Quality deteriorates if the data set contains instances with large density differences. Border points might be reachable from multiple core points.

New cards

Which of the following statements about overlapping clusters is true? Some instances may belong to more than one cluster. All instances must belong to more than one cluster. Instances belong to only one cluster. Empty clusters may exist.

Some instances may belong to more than one cluster.

New cards

What is the complete-linkage metric? A distance metric often used by k-means clustering. The largest distance between two instances in the same cluster. A distance metric often used for hierarchical clustering. The average distance between all members of two clusters.

The largest distance between two instances in the same cluster.

New cards

Which of the following clustering algorithms successively divides clusters into sub-clusters? k-means. DBSCAN. Incremental. Hierarchical.

Hierarchical

New cards

Which of the following statements for an outlier point is true? Has high probability of occurrence if we build a good probabilistic model for the data. Is close to a large number of points. Has low proximity to all other data points. Strongly belongs to some cluster.

Has low proximity to all other data points.

New cards

Fuzzy clustering is useful mainly for:

Overlapping clusters.

New cards

Statistical Modeling use cases:

Data Generation, Simulations, Classification, Regression, Clustering

New cards

Statistical Model:

A set of candidate probabilistic models M(θ) that may have generated the observed data D. Typically, it is specified by a set Θ of parameter values: each parameter θ ∈ Θ corresponds to a probabilistic model M(θ).

New cards

Bernoulli Distribution

models a random event with only two possible outcomes, like a coin flip

New cards

Maximum Likelihood Estimation (MLE)

Find the parameter value that makes the observed data most likely. Assumes that the model parameters have fixed values.

New cards

Maximum A Posteriori Estimation (MAP)

Like MLE, but it also includes prior beliefs about the parameter. It finds the parameter value that is most likely after combining data with prior knowledge. MAP = MLE + prior
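
A small sketch contrasting the two estimates for a Bernoulli parameter (the probability of heads in a coin flip). The Beta(a, b) prior and the closed-form MAP estimate used here are assumptions chosen for illustration; the notes do not specify a prior.

```python
def bernoulli_mle(flips):
    """MLE of the heads probability: the observed fraction of heads."""
    return sum(flips) / len(flips)


def bernoulli_map(flips, a=2.0, b=2.0):
    """MAP estimate under an assumed Beta(a, b) prior with a, b > 1.

    This is the mode of the Beta(a + heads, b + tails) posterior.
    """
    heads = sum(flips)
    n = len(flips)
    return (heads + a - 1) / (n + a + b - 2)


flips = [1, 1, 1]  # three heads in three flips
print(bernoulli_mle(flips))  # 1.0: the data alone say heads is certain
print(bernoulli_map(flips))  # pulled towards 0.5 by the prior
```

With little data the prior has a visible effect; as the number of flips grows, the MAP estimate approaches the MLE.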

New cards

Bayesian Estimation

Goes further than MAP: it gives a whole distribution of possible parameter values, not just one best guess. It updates our belief about the parameter using Bayes’ Theorem.

New cards

Parametric models:

the model has a specific form with a fixed number of parameters, independent of the training set size.

New cards

Non-parametric models:

the model incorporates fewer assumptions, i.e. is more data-driven, and typically the number of parameters grows as the sample size increases.

New cards

Which of the following statements about probabilities is false? Probability density functions model mainly discrete data. Probability density functions model mainly continuous data. Two variables are independent if the knowledge of one does not affect the probability distribution of the other. Joint probability distributions model relationships between multiple variables.

Probability density functions model mainly discrete data.

New cards

What is a prior probability? The probability of observing the data, given that a hypothesis holds. The probability of observing the data without a hypothesis. The probability that a hypothesis holds, given the data. The probability that a hypothesis holds without having observed any data.

The probability that a hypothesis holds without having observed any data.

New cards

Which of the following methods of estimation assumes a probability distribution for the parameter values? Bayesian estimation. Binomial distribution. Maximum likelihood estimation. Gaussian mixture model.

Bayesian estimation.

New cards

What is a mixture model?

A model combining simpler component distributions.

New cards

Which of the following statements about Naïve Bayes is false? It can be mainly applied to nominal attributes. It computes probabilities by calculating proportions of instances. It assumes that the attributes are conditionally independent. It uses Laplacian estimation to deal with missing values.

It uses Laplacian estimation to deal with missing values.
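
Laplacian (add-one) estimation addresses zero frequencies rather than missing values: an attribute value never seen with a class would otherwise get probability 0 and wipe out the whole product. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def laplace_estimate(count, class_total, n_values):
    """Add-one estimate of P(attribute=value | class).

    count: instances of the class with this attribute value
    class_total: instances of the class
    n_values: number of distinct values the attribute can take
    """
    return (count + 1) / (class_total + n_values)


# A value never observed with the class no longer gets probability zero.
print(laplace_estimate(0, 5, 3))
print(laplace_estimate(4, 5, 3))
```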

New cards

What are Instance-Based Learning main components?

Distance function: Calculate distance between any pair of instances. Combination function: Combine nearest neighbors to compute the output
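
The two components can be seen in a minimal k-nearest-neighbours sketch. Euclidean distance as the distance function and majority voting as the combination function are assumptions chosen for illustration, as are the toy training instances.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Distance function: Euclidean distance between two instances."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Combination function: majority vote among the k nearest neighbours."""
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

There is no training step: the stored instances are the model, which is why prediction, not training, is the expensive part.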

New cards

Advantages of KNN

Nonlinear decision boundaries. Easily adaptable to new instances. Not restricted by the data format.

New cards

Which of the following is not a weakness of instance-based learning models? Prediction is time-consuming. Lack of strong distance / combination functions. Memory-intensive. Training is time-consuming.

Training is time-consuming

New cards

Which of the following allows the efficient comparison of a snippet with a song in Shazam music recognition? Constellation plot. Anchor point kD-tree. Spectrogram.

Anchor point kD-tree.

New cards

Which of the following statements about kernel density estimation is false? Applies mainly to low-dimensional data in practice. Computes a smooth density estimate. Attempts to improve histograms. The choice of the kernel function is more critical than the choice of the bandwidth.

The choice of the kernel function is more critical than the choice of the bandwidth.
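
A small one-dimensional sketch of kernel density estimation with a Gaussian kernel, useful for experimenting with the bandwidth. The sample values, bandwidths and function names are made up for illustration.

```python
from math import exp, pi, sqrt


def gaussian_kernel(u):
    """Standard normal density, used as the kernel function."""
    return exp(-0.5 * u * u) / sqrt(2 * pi)


def kde(x, sample, bandwidth):
    """Kernel density estimate at x: average of kernels centred on the sample."""
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in sample)
    return total / (len(sample) * bandwidth)


sample = [1.0, 2.0, 2.5, 4.0, 7.0]
for h in (0.2, 0.5, 2.0):
    # A small bandwidth gives a spiky estimate, a large one oversmooths.
    print(h, kde(2.0, sample, h))
```

Changing the bandwidth alters the estimate far more than swapping one smooth kernel for another typically does, which is the point behind the false statement above.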

New cards

Locally weighted regression models: Contain a bounded number of smaller linear regression models. Use a kernel function. Have many spikes. Are non-smooth.

Use a kernel function.

New cards

Support vector machines use a kernel function mainly to: build a binary separator. map the original data into a higher-dimensional space. enforce maximum margin separation. ensure linearity.

map the original data into a higher-dimensional space.

New cards

Bag-of-Words

Simplifying representation of a corpus, e.g. using a term-document matrix, disregarding grammar and the order of words in documents.
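
A minimal sketch of building a term-document matrix of raw counts. Lower-casing and whitespace tokenisation are simplifying assumptions, and the two documents are made up.

```python
def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    tokenised = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for doc in tokenised for word in doc})
    matrix = [[doc.count(term) for doc in tokenised] for term in vocabulary]
    return vocabulary, matrix


documents = ["The cat sat", "The dog sat on the cat"]
vocabulary, matrix = term_document_matrix(documents)
for term, row in zip(vocabulary, matrix):
    print(term, row)
```

Word order is lost: "the dog sat on the cat" and "the cat sat on the dog" would produce the same column.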

New cards

How to reduce the dimensionality of the term-document matrix?

Stop word removal. Non-alphabetic character removal. Case folding. Stemming: Reduce words to their stem, e.g. eating, eats, eaten to eat. Sparse representation: Inverted index, e.g. wj: {d2 : 7,d5 : 4}. Dimensionality reduction: Latent Semantic Indexing via SVD. Word embeddings:

New cards

Log-probability

The logarithm of a probability, i.e. the power to which the base (e.g. e, as in exp) must be raised to obtain the probability.

New cards

What are Information Retrieval Components

Document representation, Query representation, Evaluation, Matching and ranking

New cards

TF-IDF Statistic

TFIDF(t_i, d_j) = log2(1 + TF(t_i, d_j)) · IDF(t_i).
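
A short sketch of this statistic on a toy corpus. TF is taken to be the raw count of the term in the document, and IDF is assumed to be log2(N / df), where N is the number of documents and df the number of documents containing the term. The card does not define IDF, so that choice is an assumption.

```python
from math import log2


def tf(term, doc):
    """Raw count of the term in a tokenised document."""
    return doc.count(term)


def idf(term, docs):
    """Assumed definition: log2(N / df). Fails if the term occurs in no document."""
    df = sum(1 for doc in docs if term in doc)
    return log2(len(docs) / df)


def tfidf(term, doc, docs):
    return log2(1 + tf(term, doc)) * idf(term, docs)


docs = [["data", "mining", "data"], ["data", "text"], ["data", "science"]]
print(tfidf("data", docs[0], docs))  # term occurs in every document
print(tfidf("text", docs[1], docs))  # term occurs in only one document
```

A term that appears in every document gets weight 0 regardless of how often it occurs, while a term confined to few documents is weighted up.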

New cards

PageRank