Data Science Exam Questions and Answers

Flashcards for reviewing Data Science concepts.

77 Terms

1

What is Data Science?

The process of extracting knowledge and insights from data using statistical methods, programming, and domain knowledge.

2

What are the key components of Data Science?

Data collection, data cleaning/preprocessing, data analysis & visualization, machine learning, interpretation & communication of results.

3

What is Structured Data?

Organized in rows/columns (e.g., databases, Excel). Easy to search and analyze.

4

What is Unstructured Data?

No fixed format (e.g., text, images, videos). Requires advanced tools to analyze.

5

What is Stemming?

Cuts words to root form (e.g., 'studies' → 'studi'). Can produce non-dictionary words.

6

What is Lemmatization?

Reduces words to base dictionary form using grammar (e.g., 'studies' → 'study'). More accurate.
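
The contrast between the two cards above can be sketched with a toy example. This is only an illustration under stated assumptions: the suffix list and the lookup table are hypothetical and far smaller than what real stemmers (such as the Porter stemmer) or dictionary-based lemmatizers use.

```python
# Toy illustration only: real stemmers and lemmatizers use much richer
# rules and dictionaries than this hypothetical sketch.
SUFFIXES = ("ing", "es", "ed", "s")  # assumed, deliberately tiny suffix list

# Assumed mini-dictionary mapping inflected forms to their lemma.
LEMMA_TABLE = {"studies": "study", "studying": "study", "better": "good"}


def toy_stem(word: str) -> str:
    """Chop the first matching suffix, with no check that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


print(toy_stem("studies"))       # studi
print(toy_lemmatize("studies"))  # study
```

The stemmer returns a non-dictionary string, while the lookup returns a real word, which is the difference the two cards describe.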

7

What are the steps in Data Preprocessing?

Handling missing values, removing duplicates, normalization/standardization, encoding categorical data, detecting and removing outliers, feature selection and transformation.
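
A few of these steps can be sketched in plain Python. This is a minimal sketch, assuming rows are tuples and missing values are represented as None; the helper names are hypothetical, and real projects would usually use a data-frame library instead.

```python
from statistics import mean, pstdev


def drop_missing(rows):
    """Keep only rows that contain no None values."""
    return [r for r in rows if None not in r]


def drop_duplicates(rows):
    """Remove exact duplicate rows, preserving first-seen order."""
    seen, unique = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            unique.append(r)
    return unique


def min_max(values):
    """Normalization: rescale values to the range [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardization: zero mean, unit variance (assumes values are not all equal)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


rows = [(1.0, "a"), (2.0, "b"), (None, "c"), (2.0, "b"), (3.0, "a")]
clean = drop_duplicates(drop_missing(rows))
numbers = [r[0] for r in clean]
print(clean)
print(min_max(numbers))
print(z_score(numbers))
```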

8

What is k-Anonymity and how does it protect privacy?

Ensures that each record is indistinguishable from at least (k–1) others based on quasi-identifiers. It reduces re-identification risk by generalizing or suppressing quasi-identifier values so that records fall into groups of at least k.
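
One way to check the k a table achieves is to count how many records share each combination of quasi-identifier values and take the smallest count. The sketch below assumes records are dictionaries; the column names and values are made up for illustration.

```python
from collections import Counter


def k_anonymity_level(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return min(groups.values())


# Hypothetical, already-generalized records (age ranges, masked ZIP codes).
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "diabetes"},
]

k = k_anonymity_level(records, ["age", "zip"])
print(k)  # 2
```

Here every (age, zip) combination appears twice, so the table is 2-anonymous with respect to those two columns.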

9

What is the Apriori Algorithm?

A classic algorithm used for frequent itemset mining and association rule learning.

10

What are the steps of the Apriori Algorithm?

Generate candidate itemsets, apply support threshold to prune infrequent ones, generate larger itemsets from previous ones (join step), generate association rules that meet confidence threshold.
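
The steps above can be sketched as a small pure-Python function. This is a simplified teaching sketch, not an optimized implementation; the transactions and thresholds are invented for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in sets) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in sets for item in t}
    frequent = {}
    k = 1
    while current:
        # Prune: keep only candidates that meet the support threshold.
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Join: merge frequent k-itemsets into (k+1)-item candidates,
        # keeping only those whose k-item subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples meeting the threshold."""
    out = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(itemset, r):
                ante = frozenset(ante)
                confidence = sup / frequent[ante]
                if confidence >= min_confidence:
                    out.append((ante, itemset - ante, confidence))
    return out


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=0.5)
strong = association_rules(frequent, min_confidence=0.9)
print(strong)  # one rule: butter -> bread with confidence 1.0
```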

11

What is the purpose of Data Visualization?

To represent data graphically, making it easier to identify patterns, trends, and outliers.

12

What is Demographic Clustering?

A clustering method using demographic features (e.g., age, gender, income), often using Hamming distance to measure similarity between categorical data.

13

What is Univariate Analysis?

Analysis of one variable at a time (e.g., mean, median, histogram).

14

What is Bivariate Analysis?

Analysis of the relationship between two variables (e.g., scatter plot, correlation).

15

What is Euclidean distance?

Straight-line distance for numeric data.

16

What is Manhattan distance?

Sum of absolute coordinate differences (grid-like, "city block" distance).

17

What is Hamming distance?

Number of mismatches (for categorical attributes).
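
The three distance measures from the cards above can be written in a few lines each. A minimal sketch, assuming both inputs are sequences of the same length; the sample points are made up.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions where the two records differ."""
    return sum(x != y for x, y in zip(a, b))


print(euclidean((0, 0), (3, 4)))                             # 5.0
print(manhattan((0, 0), (3, 4)))                             # 7
print(hamming(("red", "M", "yes"), ("red", "F", "no")))      # 2
```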

18

What is Feature Selection?

The process of selecting the most relevant variables (features) for model building.

19

Why is Feature Selection important?

Can improve model accuracy, reduce overfitting, and decrease training time.

20

What is Data Reduction?

Simplifies data while retaining essential information.

21

What is Attribute Reduction?

Removing irrelevant or redundant columns.

22

What is Instance Reduction?

Sampling or removing duplicate/irrelevant rows.

23

What is a Confusion Matrix?

A matrix showing actual vs. predicted classifications. Used to evaluate performance.

24

What is Precision?

TP / (TP + FP) → proportion of predicted positives that are actually positive.

25

What is Recall?

TP / (TP + FN) → proportion of actual positives found.

26

What is F1-Score?

Harmonic mean of precision and recall: 2 · (Precision · Recall) / (Precision + Recall).
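
Precision, recall, and F1 can all be computed from three confusion-matrix counts. The sketch below uses invented counts and assumes the denominators are non-zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f1, 3))  # 0.8 0.5 0.615
```

The F1 value sits closer to the lower of the two inputs, which is the practical effect of using a harmonic rather than an arithmetic mean.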

27

What is a Decision Tree?

A supervised model that learns if-then splitting rules from labeled data.

28

What is K-Means?

Partitions data into K clusters by assigning each point to its nearest centroid.

29

What is Cross-Validation?

A method to evaluate model performance by repeatedly splitting the dataset into training and validation folds.

30

What is k-Fold Cross-Validation?

Splits the data into k parts; each part serves once as the validation set while the remaining k-1 parts are used for training.
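
The rotation of the validation set can be sketched as an index generator. This is a minimal sketch without shuffling or stratification, which practical implementations usually add; the function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each sample index appears in exactly one validation fold, so every observation is used for validation once and for training k-1 times.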

31

What is Overfitting?

Occurs when a model performs well on training data but poorly on unseen data.

32

How can Overfitting be prevented?

Simplifying the model, regularization, pruning (in decision trees), and using cross-validation to detect it during model selection.

33

What is the purpose of PCA (Principal Component Analysis)?

Transforms correlated variables into a smaller set of uncorrelated components while preserving most of the variance.

34

What are Ethical Issues in Data Science?

Bias and fairness in models, privacy violations, data manipulation, transparency of algorithm decisions (black-box models).

35

What is Classification?

Supervised learning; assigns labels to data (e.g., spam vs. not spam).

36

What is Clustering?

Unsupervised learning; groups data based on similarity (e.g., customer segments).

37

What is the role of a Data Scientist?

Collects and cleans data, analyzes data using statistics and ML, builds predictive models, and communicates findings to help decision-making.

38

What is a Dataset?

A collection of data, typically organized in rows (instances) and columns (features/variables).

39

What is Qualitative data?

Categorical (e.g., color, gender).

40

What is Quantitative data?

Numeric (e.g., height, income).

41

What is Exploratory Data Analysis (EDA)?

Involves visually and statistically analyzing datasets to uncover patterns, trends, and anomalies before formal modeling.

42

Name three types of plots used in EDA.

Histogram, Boxplot, Scatter plot.

43

What is a Histogram?

A graphical representation showing the distribution of a numeric variable via bins.

44

What is a Scatter Plot used for?

To show the relationship or correlation between two continuous variables.

45

What does Correlation Coefficient (r) indicate?

Strength and direction of a linear relationship between two variables. Ranges from -1 (perfect negative) to +1 (perfect positive); 0 = no linear correlation.
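
Pearson's r can be computed directly from its definition. The sketch below uses made-up data and assumes neither variable is constant; the last example shows that r measures only linear association, since a perfect quadratic relationship still gives r = 0.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))          # perfectly increasing line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))          # perfectly decreasing line
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x**2, no linear trend
```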

46

What is Multicollinearity?

When two or more independent variables are highly correlated, causing problems in regression models.

47

What are Dummy Variables?

Binary variables created from categorical data (e.g., Male = 1, Female = 0).

48

What is a Decision Tree?

A flowchart-like model used for classification/regression by splitting data into branches based on conditions.

49

What is Entropy in Decision Trees?

A measure of impurity or randomness in the data. Lower entropy means more pure data.

50

What is Information Gain?

The reduction in entropy after a dataset is split on an attribute.
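
Entropy and information gain can be sketched in a few lines. This assumes Shannon entropy in bits (log base 2) and a split given as a list of child label groups; the labels are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted


parent = ["yes"] * 4 + ["no"] * 4
perfect_split = [["yes"] * 4, ["no"] * 4]
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(entropy(parent))                          # 1 bit for a 50/50 mix
print(information_gain(parent, perfect_split))  # all uncertainty removed
print(information_gain(parent, useless_split))  # nothing learned
```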

51

What is KNN (K-Nearest Neighbors)?

A non-parametric algorithm that classifies a point based on the majority label of its K closest neighbors.
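
A minimal KNN classifier needs only a distance function and a majority vote. The sketch below assumes Euclidean distance (via `math.dist`, Python 3.8+) and uses made-up training points.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; returns the majority label."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]

print(knn_predict(train, (2, 2)))  # A
print(knn_predict(train, (7, 7)))  # B
```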

52

What is K-Means Clustering?

An unsupervised algorithm that groups data into K clusters based on similarity (minimizing within-cluster variance).

53

How to choose K in K-Means?

Use the Elbow Method: plot inertia (within-cluster sum of squared distances) vs. K and choose the point where the decrease levels off.
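
The elbow can be seen by running K-Means for several values of K and comparing the inertia. The sketch below is a simplified one-dimensional version of the standard (Lloyd's) iteration with an assumed deterministic initialization; the data are invented and contain three obvious groups.

```python
def kmeans_1d(values, k, iterations=20):
    """Simple K-Means on 1-D data; returns (centroids, inertia)."""
    data = sorted(values)
    # Assumed deterministic start: k points spread evenly through the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    # Inertia: sum of squared distances to the nearest centroid.
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, inertia


data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
inertias = {k: kmeans_1d(data, k)[1] for k in range(1, 5)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On this data the inertia falls sharply up to K = 3 and barely changes afterwards, so K = 3 is the elbow.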

54

What is a Confounding Variable?

A hidden variable that affects both the independent and dependent variables, potentially distorting the result.

55

What is Sampling?

Selecting a subset of data from a population to estimate characteristics of the whole population.

56

What is a Population?

The entire group being studied.

57

What is a Sample?

A subset of the population used for analysis.

58

What is Central Limit Theorem?

It states that the sampling distribution of the sample mean approaches a normal distribution as the sample size grows, for independent draws from a population with finite variance, regardless of the shape of the population's distribution.
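
A short simulation can make the theorem concrete. The sketch below assumes a skewed exponential population with mean 1 and standard deviation 1; the sample size, number of samples, and seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(draw, sample_size, n_samples):
    """Return the means of n_samples independent samples of the given size."""
    return [
        sum(draw() for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]


# Skewed population: exponential with mean 1 and standard deviation 1.
means = sample_means(lambda: random.expovariate(1.0), sample_size=50, n_samples=2000)

print(round(mean(means), 2))    # close to the population mean, 1
print(round(pstdev(means), 2))  # close to 1 / sqrt(50), about 0.14
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the underlying population is strongly skewed.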

59

What is P-Value?

Probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. P < 0.05 is a common threshold for statistical significance.
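
As a worked example, a p-value can be computed exactly for a coin-flip experiment: it is the probability, under the null hypothesis of a fair coin, of a result at least as extreme as the one observed. The scenario (9 heads in 10 flips, one-sided test) is invented for illustration.

```python
from math import comb


def binomial_p_value(heads, flips, p_null=0.5):
    """One-sided p-value: P(at least `heads` heads in `flips` flips | null is true)."""
    return sum(
        comb(flips, k) * p_null ** k * (1 - p_null) ** (flips - k)
        for k in range(heads, flips + 1)
    )


p = binomial_p_value(heads=9, flips=10)
print(round(p, 4))  # 0.0107, i.e. 11/1024
```

Since 0.0107 is below 0.05, this result would usually be called statistically significant against the fair-coin hypothesis.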

60

What is Overfitting?

Model fits training data too well, fails on new data.

61

What is Underfitting?

Model is too simple and misses patterns.

62

What is Supervised Learning?

Learning from labeled data (e.g., spam detection, disease prediction).

63

What is Unsupervised Learning?

Finding hidden patterns in unlabeled data (e.g., clustering customers).

64

What is Semi-Supervised Learning?

A mix of labeled and unlabeled data for training.

65

What is Reinforcement Learning?

An agent learns by interacting with an environment and receiving feedback (rewards/punishments).

66

What is a Null Hypothesis?

A default assumption that there is no effect or difference (e.g., no correlation between variables).

67

What is Feature Engineering?

Creating new features or modifying existing ones to improve model performance.

68

What is One-Hot Encoding?

Representing a categorical variable as binary columns, one per category, with a 1 in the column that matches the value.
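
One-hot encoding can be sketched without any library. This minimal version assumes the categories are sorted alphabetically to fix the column order; the color values are made up.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


categories, rows = one_hot(["red", "green", "blue", "green"])
print(categories)  # ['blue', 'green', 'red']
for row in rows:
    print(row)
```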

69

What is Dimensionality Reduction?

Reducing the number of input variables while preserving important information (e.g., using PCA).

70

What are the steps in a typical Data Science workflow?

Problem definition, Data collection, Data preprocessing, EDA, Modeling, Evaluation, Deployment.

71

What are some common data quality issues?

Missing values, duplicates, inconsistent formats, outliers, noise.

72

What is Bias in the Bias-Variance Tradeoff?

Error from overly simple or incorrect model assumptions (leads to underfitting).

73

What is Variance in the Bias-Variance Tradeoff?

Error from sensitivity to small fluctuations in the training data (leads to overfitting).

74

What is the goal of the Bias-Variance Tradeoff?

Find a balance for good generalization.

75

What is ROC Curve?

A plot of True Positive Rate vs. False Positive Rate. Used to evaluate classifiers.

76

What is AUC?

Area Under the ROC Curve. Higher AUC means better classification performance (0.5 ≈ random guessing, 1.0 = perfect ranking).
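
AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by counting pairs, which is equivalent to the area under the ROC curve but quadratic in the data size; the labels and scores are invented.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75.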

77

What is a Data Pipeline?

A sequence of steps to collect, process, analyze, and store data automatically.