In a supervised learning task, when dealing with a data set exhibiting a normal distribution — which is a symmetrical distribution where most observations cluster around the central peak and probabilities for values further from the mean taper off equally in both directions — removing outliers specifically from the features always results in a significant reduction of noise and reliably improves the accuracy of the data analysis models.
False
In hierarchical clustering, the algorithm builds a hierarchy of clusters either by a divisive method, which starts with all observations in a single cluster and divides them into smaller clusters, or by an agglomerative method, which starts with each observation as its own cluster and merges them into larger clusters.
True
In feature engineering, scaling and centering are techniques that modify the range of dependent variables in the data, and these methods are generally not necessary for algorithms sensitive to feature scales, such as gradient descent-based algorithms, k-means clustering, and support vector machines.
False
Variance in a machine learning model refers to the error introduced by excessive simplicity in the learning algorithm, which often results in a model that underestimates the complexity of the underlying data distribution, leading to high error rates on both training and unseen data.
False
In data modeling, feature engineering means choosing a subset of available features, and feature selection means creating or transforming features, both primarily aimed at reducing the risk of overfitting and accelerating model training.
False
You have a very large dataset and need to make real-time predictions, where computational efficiency and speed are critical. Which algorithm below is most appropriate for this scenario?
Deep Neural Network
You have a data set with a clear margin of separation between classes, but it’s not linearly separable. For this dataset with a non-linear separation between classes, which algorithm below should you choose?
SVM with a non-linear kernel
You need to classify text documents into different categories based on their content. For classifying text documents, which algorithm below would be most suitable?
Naive Bayes
For clustering a dataset with varying cluster shapes, densities, unknown number of clusters, and the presence of outliers, which algorithm below would be most suitable?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Your dataset includes a mix of numeric and categorical variables, and the relationships between features and the outcome are non-linear. Which algorithm below should you use for this dataset?
Random Forest
In the context of the K-Nearest Neighbors (K-NN) algorithm, consider the following scenario: You are working on a classification problem using a K-NN model. Your dataset is relatively large and consists of several numerical features. As part of fine-tuning the model's performance, you are considering various approaches, such as hyperparameter tuning, feature engineering, data preprocessing, and adjustments to the model's training process. Which of the following approaches is most likely to improve the model's performance?
Normalizing the feature scales to ensure that all features contribute equally to the distance calculations.
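A minimal sketch of the point above, assuming scikit-learn and its bundled wine dataset (both illustrative choices, not part of the card): the same K-NN model is fit with and without standardization so the effect of feature scaling on distance-based classification can be compared.

```python
# Sketch: compare K-NN with and without feature scaling (dataset is a stand-in).
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Without scaling, features with large ranges dominate the Euclidean distance.
unscaled = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# With scaling, every feature contributes on a comparable scale.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled.fit(X_train, y_train)

print("unscaled accuracy:", unscaled.score(X_test, y_test))
print("scaled accuracy:  ", scaled.score(X_test, y_test))
```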
Consider a scenario where you are using a Decision Tree algorithm for a classification task. The dataset has a mix of categorical and numerical features, and the target variable is binary. You are in the process of optimizing the Decision Tree to improve its accuracy and prevent overfitting.
Which of the following techniques is an effective method for preventing overfitting in a Decision Tree classifier?
Pruning the tree by setting a maximum depth or minimum samples per leaf.
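A hedged sketch of this kind of pruning, assuming scikit-learn and its breast-cancer dataset as stand-ins; the max_depth and min_samples_leaf values are illustrative, not recommendations.

```python
# Sketch: limit tree growth (pre-pruning) to reduce overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree tends to memorize the training set.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A pruned tree trades some training accuracy for better generalization.
pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                                random_state=0).fit(X_train, y_train)

print("full tree   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned tree train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```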
You want to cluster data that has noise.
DBSCAN
You know how many clusters you want to end up with.
K-means
You know the clusters have odd shapes.
DBSCAN
You want a complete clustering of the data.
K-Means
You want to minimize the sum squared error (SSE) of the clusters.
K-Means
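A minimal sketch contrasting the two choices in the matching items above, assuming scikit-learn and a synthetic two-moons dataset; the eps and min_samples values are illustrative only.

```python
# Sketch: K-Means needs k up front and assigns every point to a cluster;
# DBSCAN infers the number of clusters, handles odd shapes, and marks noise as -1.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means labels:", set(kmeans_labels))   # always exactly k clusters, complete clustering
print("DBSCAN labels:", set(dbscan_labels))    # clusters plus -1 for noise points
```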
Soft margin SVM allows some misclassifications on the training data in order to achieve a larger margin and better generalization.
True
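A small sketch of the soft margin, assuming scikit-learn; the C values are arbitrary and chosen only to contrast a soft margin (small C) with a near-hard margin (large C).

```python
# Sketch: C controls the soft margin. Small C tolerates more margin violations
# (wider margin); large C penalizes violations more heavily.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)

soft = SVC(kernel="linear", C=0.1).fit(X, y)   # more violations allowed
hard = SVC(kernel="linear", C=100).fit(X, y)   # fewer violations allowed

print("support vectors (C=0.1):", soft.n_support_.sum())
print("support vectors (C=100):", hard.n_support_.sum())
```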
Ensemble methods like boosting focus more on difficult-to-classify instances.
True
KNN performs the classification decision based solely on a majority vote of the nearest neighbors regardless of their distances.
False
K-means clustering algorithm guarantees convergence to the global optimum.
False
K-means++ is an algorithm that guarantees better clustering than K-means by optimizing the initial placement of centroids.
False
Density-based clustering methods like DBSCAN can identify clusters of any shape and are particularly good at separating high-density clusters from low-density areas.
True
DBSCAN requires the number of clusters to be specified in advance.
False
Hierarchical clustering can only use a single metric for measuring the distances between clusters throughout the entire clustering process.
False
In the Apriori algorithm, all subsets of a frequent itemset must also be frequent.
True
Association rules that have high confidence necessarily have high lift.
False
The lift value of an association rule that is less than 1 indicates that the items in the rule are negatively associated.
True
Support is a measure of how frequently the items in an association rule appear together in all transactions.
True
A Naïve Bayes classifier requires a larger amount of data to perform well compared to more complex models because it assumes that all features are independent given the class label.
False
The leverage of an association rule X → Y measures the difference between the observed frequency of X and Y appearing together and the frequency expected if X and Y were statistically independent.
True
In association rule analysis, if an itemset has a high support, any rule derived from this itemset will also have high support.
False
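A worked toy example for the association-rule measures on these cards; all transaction counts below are invented for illustration.

```python
# Sketch: support, confidence, lift, and leverage for the rule X -> Y.
n_transactions = 100
count_X = 40           # transactions containing X
count_Y = 50           # transactions containing Y
count_XY = 30          # transactions containing both X and Y

support_X = count_X / n_transactions            # 0.40
support_Y = count_Y / n_transactions            # 0.50
support_XY = count_XY / n_transactions          # 0.30  (support of the rule)

confidence = support_XY / support_X             # 0.75
lift = confidence / support_Y                   # 1.5   (> 1: positive association)
leverage = support_XY - support_X * support_Y   # 0.10  (observed minus expected)

print(support_XY, confidence, lift, leverage)
```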
You are working on a binary classification problem using a Support Vector Machine (SVM). The dataset involves features that are not linearly separable in the current feature space. Which of the following strategies would be most effective for improving the classification accuracy of the SVM?
Utilize a polynomial kernel to implicitly transform the features into a higher-dimensional space where the classes are more likely to be linearly separable.
When training a Support Vector Machine (SVM) for a classification task, the concept of a margin is crucial for understanding how the model discriminates between classes. Which of the following best describes the role of maximizing the margin and the use of slack variables in SVM?
Maximizing the margin involves creating the largest possible distance between the decision boundary and the nearest data points from each class, thereby enhancing the model’s robustness and generalization capabilities. Slack variables permit some degree of misclassification, particularly for data that is not linearly separable, by allowing flexibility in the margin constraints to achieve a broader margin.
When applying a Support Vector Machine (SVM) in a binary classification task, various factors influence the performance and applicability of the model. Suppose you are comparing SVM to other classifiers on a dataset with imbalanced classes and a moderate amount of noise. Which of the following statements most accurately reflects the considerations and adaptations you might need for effective SVM deployment?
Adjust the class weights in SVM to give more importance to the minority class, helping to offset the class imbalance during model training.
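A hedged sketch combining the two SVM adaptations above (a polynomial kernel for non-linear separation and class weighting for imbalance), assuming scikit-learn and a synthetic imbalanced dataset; all parameter values are illustrative.

```python
# Sketch: polynomial kernel plus balanced class weights on imbalanced data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],  # imbalanced classes
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = SVC(kernel="poly", degree=3, class_weight="balanced").fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```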
True Positive (TP)
The classifier correctly identifies a network intrusion; an actual intrusion attempt is detected by the system.
False Positive (FP)
The classifier incorrectly flags normal network traffic as an intrusion; benign activity is mistakenly identified as malicious.
True Negative (TN)
The classifier correctly recognizes normal network traffic; benign activity is correctly identified as non-malicious.
False Negative (FN)
The classifier fails to detect an actual intrusion; a malicious activity goes unnoticed by the system.
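A minimal sketch mapping the four outcomes above onto a confusion matrix, assuming scikit-learn and the convention that label 1 means "intrusion" and 0 means "normal traffic"; the label vectors are invented.

```python
# Sketch: recover TP/FP/TN/FN counts from predictions.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1]   # actual traffic: 1 = intrusion, 0 = normal
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]   # classifier output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```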
A significant drawback of using the Train/Validate/Test split is that a considerable part of the dataset may need to be reserved for testing.
True
The K models created during the K-fold cross-validation process should be used for making predictions on new data.
False
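A short sketch of the distinction on this card, assuming scikit-learn: the K fold models only produce a performance estimate, and a final model is refit on all of the data for actual predictions.

```python
# Sketch: cross-validation estimates performance; the deployed model is refit on all data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 fold models, 5 held-out scores
print("estimated accuracy:", scores.mean())

final_model = model.fit(X, y)                 # the model actually used for new data
```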
Decision Trees are considered “Lazy Learners” because they do not generalize training data until it is needed to classify test examples.
False
Calculating class-conditional probability with Naive Bayes involves finding the product of the probabilities of observing each feature given the class label.
True
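A toy numeric sketch of the product rule on this card; every probability below is invented, and the spam/ham labels are hypothetical.

```python
# Sketch: class-conditional probability as a product of per-feature probabilities,
# multiplied by the class prior to get an (unnormalized) posterior score.
p_features_given_spam = [0.8, 0.3, 0.6]   # P(feature_i = observed | spam)
p_features_given_ham  = [0.1, 0.4, 0.2]   # P(feature_i = observed | ham)
p_spam, p_ham = 0.4, 0.6                  # class priors

def product(ps):
    result = 1.0
    for p in ps:
        result *= p
    return result

score_spam = p_spam * product(p_features_given_spam)   # 0.4 * 0.144 = 0.0576
score_ham  = p_ham  * product(p_features_given_ham)    # 0.6 * 0.008 = 0.0048
print("predict:", "spam" if score_spam > score_ham else "ham")
```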
A high confidence value for an association rule always implies a strong and meaningful relationship between the items.
False
Feature scaling is essential for algorithms that compute distances between data points, such as K-Nearest Neighbors.
True
In classification tasks, accuracy is always the best metric to evaluate model performance.
False
In association rule mining, the measure that indicates the proportion of transactions containing one itemset relative to another is called leverage.
False
When using K-means clustering, the final result is independent of the initial centroid positions.
False
Neural networks with no hidden layers can only learn linear decision boundaries.
True
Apriori principle states that all subsets of an infrequent itemset must also be infrequent.
False
In ensemble methods like bagging, the individual models are trained on the same dataset to ensure consistency in their predictions.
False
The Lift metric in association rule mining measures how much more often the antecedent and consequent occur together than expected if they were statistically independent.
True
The elbow method in K-Means clustering involves plotting the explained variance as a function of the number of clusters and looking for a point where the rate of variance reduction sharply decreases.
True
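A minimal sketch of the elbow method, assuming scikit-learn, matplotlib, and a synthetic blob dataset; it plots inertia (within-cluster SSE), which drops as the explained variance described above rises, so the elbow appears at the same k.

```python
# Sketch: plot inertia against k and look for the point where improvement flattens.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster SSE)")
plt.show()
```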
Isolation Forest detects anomalies based on the assumption that anomalous points are easier to isolate than normal points.
True
Single-linkage hierarchical clustering tends to create more chain-like clusters (i.e., clusters that are stretched out and less compact in shape) than complete-linkage clustering.
True
Feature selection aims to reduce the number of features by creating new combinations of existing features, whereas feature extraction removes irrelevant or redundant features.
False
A frequent itemset is an itemset whose support is equal to or greater than some minimum support (minsup) threshold.
True
In hierarchical clustering, the dendrogram can be cut at different levels to obtain different numbers of clusters, allowing flexibility in choosing the number of clusters after the clustering process is complete.
True
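A small sketch of cutting one dendrogram at different levels, assuming SciPy's linkage/fcluster and a synthetic dataset; the linkage method and cluster counts are illustrative.

```python
# Sketch: build one dendrogram, then cut it at different levels to obtain
# different numbers of clusters without re-running the clustering.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
Z = linkage(X, method="complete")                    # single dendrogram

labels_2 = fcluster(Z, t=2, criterion="maxclust")    # cut into 2 clusters
labels_3 = fcluster(Z, t=3, criterion="maxclust")    # cut into 3 clusters
print(set(labels_2), set(labels_3))
```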
The K-Means algorithm always converges to the global optimum solution.
False
You are using DBSCAN for clustering data with varying densities. You notice that some clusters are not being identified correctly. Which parameter should you adjust to improve clustering in such cases?
The epsilon (ε) parameter (maximum neighborhood radius)
Which linkage method in hierarchical clustering considers the maximum distance between elements of two clusters when merging them?
Complete linkage
Which of the following is a key advantage of DBSCAN over K-Means clustering?
It can identify clusters of arbitrary shapes and detect noise.
Which assumption is fundamental to the Naïve Bayes classifier?
All features are independent given the class label.
A manufacturing company employs a network of sensors to monitor the operational status of its machinery in real time. The sensor data includes temperature, vibration, pressure, and other operational metrics collected at high frequency, resulting in thousands of features. Unexpected equipment failures can be costly, so the company aims to detect anomalies that may indicate impending malfunctions. Key requirements are:
1. Must process streaming data efficiently
2. Must scale well with high-dimensional data (thousands of features)
3. Must adapt to evolving normal patterns over time
4. Must handle unlabeled data
5. Must detect rare anomalies without prior knowledge of anomaly patterns
Which of the following algorithms is most appropriate for detecting anomalies in this scenario, and why?
Isolation Forest, because it efficiently isolates anomalies in high-dimensional data using random partitioning, and it is well-suited for streaming data.
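A hedged sketch of the chosen approach, assuming scikit-learn's IsolationForest on a synthetic stand-in for the sensor data; the feature count and contamination value are assumptions, and for true streaming use the model would be periodically refit on recent windows of data, which is not shown here.

```python
# Sketch: fit an Isolation Forest on unlabeled, high-dimensional readings and
# flag points that random partitioning isolates quickly (-1 = anomaly).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 50))     # stand-in for normal sensor features
anomalies = rng.normal(6, 1, size=(10, 50))    # rare, far-away readings
X = np.vstack([normal, anomalies])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)                      # -1 = anomaly, 1 = normal
print("flagged anomalies:", (labels == -1).sum())
```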