CAP4770 Exam 1

49 Terms

1

Classify bronze, silver, and gold medals as awarded at the Olympics:

  • binary

  • discrete

  • ordinal

  • continuous

Ordinal - categories with meaningful order/ranking

2

Predicting sales amounts of a new product based on advertising expenditure is an example of regression. T/F?

TRUE

3

“Outliers” can be desired or interesting. T/F?

TRUE

4

Binary attributes are a special case of categorical attributes. T/F?

FALSE (special case of discrete)

5

Combining two or more attributes (or objects) into a single attribute (or object) is an example of …

  • sampling

  • aggregation

  • feature extraction

  • feature selection

Aggregation

6

In sampling without replacement, the same object can be picked up more than once. T/F?

FALSE

7

In order to do simple random sampling, we split the data into several partitions and draw random samples from each partition. T/F?

FALSE (this describes stratified sampling)

8

Arrangement is a broader term that encompasses both similarity and dissimilarity. It’s like having a single word to describe how two things are connected - whether they’re alike or different. T/F?

FALSE (the umbrella term for similarity/dissimilarity is “proximity”)

9

Euclidean Density: Center-Based is one of the simplest approaches to divide the region into a number of rectangular cells of equal volume and define density as the number of points the cell contains. T/F?

FALSE (this describes grid-based density; center-based density counts the points within a given radius of each point)

10

Jaccard similarity = number of 11 matches / number of non-zero attributes

x = 0101010001

y = 0100011001

What is the Jaccard similarity?

0.6
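A quick way to check this answer, using the two binary vectors from the card (a minimal sketch; the helper variables are illustrative):

```python
# Jaccard similarity for binary vectors:
# J = M11 / (M01 + M10 + M11), i.e. 0-0 matches are ignored.
x = "0101010001"
y = "0100011001"

m11 = sum(a == "1" and b == "1" for a, b in zip(x, y))       # 1-1 matches
nonzero = sum(a == "1" or b == "1" for a, b in zip(x, y))    # positions where either is 1
print(m11 / nonzero)  # 3 / 5 = 0.6
```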

11

For a coin with probability p of heads and probability q = 1 - p of tails.

For p = 1, what is the entropy (H)?

  • 0

  • 1

  • 1.32

  • 2

0
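This follows from the binary entropy formula H = -p·log2(p) - q·log2(q): when p = 1 the outcome is certain, so H = 0. A minimal sketch (the `entropy` helper is illustrative, not from the course):

```python
import math

def entropy(p):
    """Entropy in bits of a coin with P(heads) = p, P(tails) = 1 - p."""
    q = 1 - p
    # Terms with probability 0 are dropped, using the 0*log(0) = 0 convention.
    return sum(-t * math.log2(t) for t in (p, q) if t > 0)

print(entropy(1.0))  # 0.0: a certain outcome carries no information
print(entropy(0.5))  # 1.0: a fair coin is maximally unpredictable
```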

12

While correlation is a valuable measure for assessing linear relationships, it’s important to remember that it only captures linear connections. In cases where the relationship between variables is non-linear, the correlation coefficient may not accurately represent the underlying pattern. T/F?

TRUE

13

Cosine similarity measures the cosine of the angle between two vectors and is especially useful for text-based data. T/F?

TRUE
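A minimal sketch of the formula cos(θ) = (a · b) / (‖a‖ ‖b‖); the term-frequency vectors below are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term-frequency vectors for two short documents
d1 = [3, 2, 0, 5]
d2 = [1, 0, 0, 0]
print(round(cosine_similarity(d1, d2), 3))  # 0.487
```

Because only the angle matters, cosine similarity ignores document length, which is why it is popular for comparing texts of different sizes.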

14
New cards

When you are the fourth tallest person in a group of 20, it means that you are at the 80th percentile. T/F?

TRUE
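The reasoning: 16 of the 20 people are shorter than you, and 16/20 = 80%. A sketch with hypothetical heights:

```python
def percentile_rank(values, x):
    """Percentile rank of x: share of values strictly below x, times 100."""
    return 100 * sum(v < x for v in values) / len(values)

heights = list(range(150, 170))   # 20 distinct heights: 150..169 cm (made up)
you = sorted(heights)[-4]         # fourth tallest
print(percentile_rank(heights, you))  # 80.0
```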

15

Assume we have three points A, B, and C as shown in the figure. No matter which distance metric we employ, we can assert that the distance from A to C is larger than the distance from A to B. T/F?

FALSE

16

Minkowski distance with r=1 is Manhattan distance. T/F?

TRUE
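A minimal sketch of the Minkowski family (the two points are arbitrary examples): r = 1 gives Manhattan distance and r = 2 gives Euclidean distance.

```python
def minkowski(a, b, r):
    """Minkowski distance: (sum |x - y|^r)^(1/r). r=1 Manhattan, r=2 Euclidean."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)

p, q = (0, 2), (3, 6)
print(minkowski(p, q, 1))  # 7.0  (Manhattan: |0-3| + |2-6|)
print(minkowski(p, q, 2))  # 5.0  (Euclidean: a 3-4-5 triangle)
```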

17

Entropy quantifies the amount of information needed to describe the state of a system or the unpredictability of an outcome. T/F?

TRUE

18

EDA is a short form of Exploratory Data Analysis. T/F?

TRUE

19

Density measures the degree to which data objects are close to each other in a specified area. Concept of density is typically used for clustering and anomaly detection. T/F?

TRUE

20

The curse of dimensionality means that adding more variables always improves the performance of data models. T/F?

FALSE

21

PCA is suitable for both numerical and categorical variables. T/F?

FALSE

22

PCA focuses on maximizing the variance in data. T/F?

TRUE

23

LDA is an unsupervised dimensionality reduction technique. T/F?

FALSE

24

t-SNE is primarily used for feature reduction before classification. T/F?

FALSE

25

In PCA, the principal components are orthogonal to each other. T/F?

TRUE
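This can be verified numerically: the principal components are eigenvectors of the (symmetric) covariance matrix, so they come out orthogonal. A pure-Python sketch for 2-D toy data (the points and the closed-form 2x2 eigendecomposition are for illustration only):

```python
import math

# Toy 2-D points with correlated features
pts = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
       (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]] of the centered data
sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix
tr, det = sxx + syy, sxx * syy - sxy ** 2
l1 = tr / 2 + math.sqrt(tr ** 2 / 4 - det)   # largest eigenvalue -> PC1
l2 = tr / 2 - math.sqrt(tr ** 2 / 4 - det)

def unit_eigvec(l):
    # (A - l*I) v = 0  =>  v is proportional to (sxy, l - sxx) when sxy != 0
    vx, vy = sxy, l - sxx
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

pc1, pc2 = unit_eigvec(l1), unit_eigvec(l2)
dot = pc1[0] * pc2[0] + pc1[1] * pc2[1]
print(abs(dot) < 1e-9)  # True: the two principal components are orthogonal
```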

26

LDA aims to maximize the variance between classes and minimize the variance within classes. T/F?

TRUE

27

The maximum number of principal components in a dataset with n samples and p features is min(n-1, p). T/F?

TRUE

28

PCA always leads to perfect class separability. T/F?

FALSE

29

t-SNE captures non-linear relationships in data better than PCA and LDA. T/F?

TRUE

30

Which of the following best describes the curse of dimensionality?

  • adding more variables leads to sparse data space and poor model performance

  • adding more variables improves model performance

  • adding more data points always solves data sparsity

  • none of the above

Adding more variables leads to sparse data space and poor model performance

31

Which dimensionality reduction technique is supervised?

  • t-SNE

  • LDA

  • PCA

  • All are supervised

LDA

32

Which technique focuses on maximizing the variance in the data?

  • Logistic Regression

  • t-SNE

  • LDA

  • PCA

PCA

33

Which technique is most suitable for visualizing high-dimensional non-linear data?

  • Random forest

  • LDA

  • PCA

  • t-SNE

t-SNE

34

What is the main goal of LDA?

  • Maximize variance in data

  • Minimize error in data points

  • Maximize class separability

  • Reduce number of features randomly

Maximize class separability

35

Which of the following steps is common to both PCA and LDA?

  • Reducing dimensions based on neighborhood probabilities

  • Standardizing the data

  • Using class labels for reduction

  • Calculating between-class scatter

Standardizing the data

36

t-SNE is particularly useful for:

  • Visualizing high-dimensional data

  • Linear transformations of data

  • Optimizing regression models

  • Feature selection for classification

Visualizing high-dimensional data

37

In PCA, the eigenvector with the highest eigenvalue is called:

  • The main axis

  • The principal component

  • The variance vector

  • The discriminant function

The principal component

38

Which of the following is a limitation of PCA?

  • PCA doesn’t have any limitations

  • It is supervised

  • It ignores class labels

  • It requires non-linear separability

It ignores class labels

39

Which technique focuses on maximizing the separation between classes while minimizing variance within each class?

  • LDA

  • PCA

  • t-SNE

  • Factor analysis

LDA

40

What are the categorical attributes?

Nominal: no coherent order

Ordinal: meaningful order or rank

41

What are the numerical attributes?

Interval: meaningful order and consistent intervals

Ratio: meaningful order and a TRUE ZERO

42

What is a discrete attribute?

Has a finite or countably infinite set of values

BINARY is a special case of DISCRETE

43

What is a continuous attribute?

has real numbers as attribute values

(typically represented as floating point values)

44

How do we identify outliers?

Z-score; this measures how many standard deviations a data point is away from the mean of a dataset
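A minimal sketch of z-score outlier detection on made-up heights (|z| > 3 is the textbook cutoff; a threshold of 2 is used here only because the sample is tiny):

```python
import math

def z_scores(data):
    """z = (x - mean) / std for each point (population std)."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return [(x - mean) / std for x in data]

heights = [170, 172, 168, 171, 169, 230]  # hypothetical; 230 looks suspicious
zs = z_scores(heights)
outliers = [x for x, z in zip(heights, zs) if abs(z) > 2]
print(outliers)  # [230]
```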

45

Can outliers have a positive impact too?

Yes; they can provide insight on unusual trends, opportunities, or exceptional cases in data.

46

What is clustering and why is it used?

Clustering finds groups of objects such that objects within a group are more similar to one another than to objects in other groups.

Used for understanding natural groupings in a dataset and for reducing the size of large datasets (e.g., by summarizing each cluster).

47

What is Association Rule Discovery?

Finding rules that capture dependencies or co-occurrence relationships between variables (items) in large databases.

48

Consider a dataset where the “Income” attribute is heavily skewed. What transformation technique would you apply to normalize this data, and why?

Logarithmic transformation; it compresses large values, which reduces the right (positive) skew and brings the distribution closer to normal.
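A sketch of the effect on hypothetical skewed incomes, using a simple skewness helper (mean cubed z-score); both the data and the helper are illustrative:

```python
import math

def skew(data):
    """Sample skewness: mean of the cubed z-scores (population std)."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(((x - mean) / std) ** 3 for x in data) / n

# Hypothetical right-skewed incomes: each value doubles the previous one
incomes = [20_000, 40_000, 80_000, 160_000, 320_000]
logged = [math.log(x) for x in incomes]

print(round(skew(incomes), 2))  # clearly positive: right-skewed
print(round(skew(logged), 2))   # ~0: the log transform symmetrizes the data
```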
