Flashcards for reviewing Data Science concepts.
What is Data Science?
The process of extracting knowledge and insights from data using statistical methods, programming, and domain knowledge.
What are the key components of Data Science?
Data collection, data cleaning/preprocessing, data analysis & visualization, machine learning, interpretation & communication of results.
What is Structured Data?
Organized in rows/columns (e.g., databases, Excel). Easy to search and analyze.
What is Unstructured Data?
No fixed format (e.g., text, images, videos). Requires advanced tools to analyze.
What is Stemming?
Cuts words to root form (e.g., 'studies' → 'studi'). Can produce non-dictionary words.
What is Lemmatization?
Reduces words to base dictionary form using grammar (e.g., 'studies' → 'study'). More accurate.
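The contrast between the two cards above can be sketched in Python. `toy_stem` and the `LEMMAS` table are hypothetical stand-ins for a real stemmer (e.g., Porter) and a dictionary-backed lemmatizer; the point is only that stemming strips suffixes blindly while lemmatization looks words up.

```python
def toy_stem(word):
    # Crude suffix stripping in the spirit of stemming: the result
    # need not be a dictionary word ('studies' -> 'studi').
    for suffix in ("es", "s", "ing"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# A lemmatizer consults vocabulary and grammar; here a tiny lookup
# table stands in for that knowledge.
LEMMAS = {"studies": "study", "better": "good", "running": "run"}

def toy_lemmatize(word):
    return LEMMAS.get(word, word)
```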
What are the steps in Data Preprocessing?
Handling missing values, removing duplicates, normalization/standardization, encoding categorical data, detecting and removing outliers, feature selection and transformation.
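Two of the steps above (handling missing values, normalization) can be sketched as minimal Python helpers; these are illustrative one-column versions, not a full preprocessing pipeline.

```python
from statistics import mean

def fill_missing(xs):
    # Replace None with the mean of the observed values.
    observed = [x for x in xs if x is not None]
    m = mean(observed)
    return [m if x is None else x for x in xs]

def min_max_normalize(xs):
    # Rescale numeric values to the [0, 1] range.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]
```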
What is k-Anonymity and how does it protect privacy?
Ensures that each record is indistinguishable from at least (k–1) others based on quasi-identifiers. It prevents re-identification by grouping similar data.
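A quick check of the property described above: a table is k-anonymous when every combination of quasi-identifier values occurs in at least k records. The field names below are made-up example data.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    # Group records by their quasi-identifier values and require
    # every group to contain at least k records.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())
```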
What is the Apriori Algorithm?
A classic algorithm used for frequent itemset mining and association rule learning.
What are the steps of the Apriori Algorithm?
Generate candidate itemsets, apply support threshold to prune infrequent ones, generate larger itemsets from previous ones (join step), generate association rules that meet confidence threshold.
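The steps above can be sketched as a compact frequent-itemset miner (rule generation omitted). This is a didactic version, not an optimized Apriori implementation.

```python
def apriori(transactions, min_support):
    # Return all itemsets whose support meets min_support.
    n = len(transactions)
    # 1. Candidate 1-itemsets, pruned by the support threshold.
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) / n >= min_support}
    all_freq = set(freq)
    k = 2
    while freq:
        # 2. Join step: build k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # 3. Prune candidates below the support threshold.
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) / n >= min_support}
        all_freq |= freq
        k += 1
    return all_freq
```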
What is the purpose of Data Visualization?
To represent data graphically, making it easier to identify patterns, trends, and outliers.
What is Demographic Clustering?
A clustering method using demographic features (e.g., age, gender, income), often using Hamming distance to measure similarity between categorical data.
What is Univariate Analysis?
Analysis of a single variable (mean, median, histogram).
What is Bivariate Analysis?
Relationship between two variables (scatter plot, correlation).
What is Euclidean distance?
Straight-line distance for numeric data.
What is Manhattan distance?
Sum of absolute differences along each axis (grid-like path).
What is Hamming distance?
Number of mismatches (for categorical attributes).
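The three distance cards above transcribe directly into Python:

```python
import math

def euclidean(a, b):
    # Straight-line distance between numeric vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute per-axis differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Count of positions where categorical values differ.
    return sum(x != y for x, y in zip(a, b))
```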
What is Feature Selection?
The process of selecting the most relevant variables (features) for model building.
Why is Feature Selection important?
Improves model accuracy, reduces overfitting, decreases training time.
What is Data Reduction?
Simplifies data while retaining essential information.
What is Attribute Reduction?
Removing irrelevant or redundant columns.
What is Instance Reduction?
Sampling or removing duplicate/irrelevant rows.
What is a Confusion Matrix?
A matrix showing actual vs. predicted classifications. Used to evaluate performance.
What is Precision?
TP / (TP + FP) → proportion of predicted positives that are actually positive.
What is Recall?
TP / (TP + FN) → proportion of actual positives found.
What is F1-Score?
Harmonic mean of precision and recall.
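The three metrics above computed from confusion-matrix counts; the tp/fp/fn values in the test are made-up example numbers.

```python
def precision(tp, fp):
    # Of everything predicted positive, how much was right?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all actual positives, how many were found?
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```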
What is Cross-Validation?
A method to evaluate model performance by dividing the dataset into training and test folds.
What is k-Fold Cross-Validation?
Splits the data into k parts and rotates the validation set.
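One simple way to produce the k rotating splits described above (striped folds over row indices; real libraries offer shuffled and stratified variants):

```python
def k_fold_indices(n, k):
    # Yield (train, validation) index lists; the validation
    # fold rotates through the k parts.
    idx = list(range(n))
    for i in range(k):
        val = idx[i::k]  # every k-th index, offset by i
        train = [j for j in idx if j % k != i]
        yield train, val
```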
What is Overfitting?
Occurs when a model performs well on training data but poorly on unseen data.
How can Overfitting be prevented?
Cross-validation, simplifying the model, regularization, pruning (in decision trees).
What is the purpose of PCA (Principal Component Analysis)?
Transforms correlated variables into a smaller set of uncorrelated components and preserves most of the variance.
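For 2-D data the "preserves most of the variance" claim can be checked in closed form: the eigenvalues of the 2x2 covariance matrix give each component's variance. A sketch under that assumption (perfectly correlated data puts all variance on the first component):

```python
import math

def pca_2d_variance_ratio(xs, ys):
    # Fraction of total variance captured by the first principal
    # component, via closed-form 2x2 covariance eigenvalues.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    lam1 = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
    lam2 = (a + c) - lam1
    return lam1 / (lam1 + lam2)
```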
What are Ethical Issues in Data Science?
Bias and fairness in models, privacy violations, data manipulation, transparency of algorithm decisions (black-box models).
What is Classification?
Supervised learning; assigns labels to data (e.g., spam vs. not spam).
What is Clustering?
Unsupervised learning; groups data based on similarity (e.g., customer segments).
What is the role of a Data Scientist?
Collects and cleans data, analyzes data using statistics and ML, builds predictive models, and communicates findings to help decision-making.
What is a Dataset?
A collection of data, typically organized in rows (instances) and columns (features/variables).
What is Qualitative data?
Categorical (e.g., color, gender).
What is Quantitative data?
Numeric (e.g., height, income).
What is Exploratory Data Analysis (EDA)?
Involves visually and statistically analyzing datasets to uncover patterns, trends, and anomalies before formal modeling.
Name three types of plots used in EDA.
Histogram, Boxplot, Scatter plot.
What is a Histogram?
A graphical representation showing the distribution of a numeric variable via bins.
What is a Scatter Plot used for?
To show the relationship or correlation between two continuous variables.
What does Correlation Coefficient (r) indicate?
Strength and direction of a linear relationship between two variables. Ranges from -1 (perfect negative) to +1 (perfect positive); 0 = no linear correlation.
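The coefficient described above, computed directly from its definition (covariance divided by the product of standard deviations); constant inputs are not handled in this sketch.

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```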
What is Multicollinearity?
When two or more independent variables are highly correlated, causing problems in regression models.
What are Dummy Variables?
Binary variables created from categorical data (e.g., Male = 1, Female = 0).
What is a Decision Tree?
A flowchart-like model used for classification/regression by splitting data into branches based on conditions.
What is Entropy in Decision Trees?
A measure of impurity or randomness in the data. Lower entropy means more pure data.
What is Information Gain?
The reduction in entropy after a dataset is split on an attribute.
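The two cards above in code: entropy of a label list, and the gain from splitting it into groups (a perfect split of a 50/50 set yields gain 1.0).

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # Parent entropy minus the size-weighted entropy of the child groups.
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)
```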
What is KNN (K-Nearest Neighbors)?
A non-parametric algorithm that classifies a point based on the majority label of its K closest neighbors.
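A minimal KNN classifier matching the card above: sort the labeled points by Euclidean distance to the query and take a majority vote among the K closest.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    # train: list of (point, label) pairs; classify query by the
    # majority label among its k nearest neighbors.
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```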
What is K-Means Clustering?
An unsupervised algorithm that groups data into K clusters based on similarity (minimizing within-cluster variance).
How to choose K in K-Means?
Use the Elbow Method — plot inertia (cost) vs. K and choose the elbow point.
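A sketch of both cards: Lloyd's algorithm for 2-D points, plus the inertia (cost) that the Elbow Method plots against K. This is a bare-bones version with fixed iterations and seeded initialization, not a production clusterer.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # Lloyd's algorithm for 2-D points; returns the k centroids.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def inertia(points, centroids):
    # Within-cluster sum of squared distances (the elbow-plot "cost").
    return sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                   for c in centroids) for p in points)
```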
What is a Confounding Variable?
A hidden variable that affects both the independent and dependent variables, potentially distorting the result.
What is Sampling?
Selecting a subset of data from a population to estimate characteristics of the whole population.
What is a Population?
The entire group being studied.
What is a Sample?
A subset of the population used for analysis.
What is Central Limit Theorem?
It states that the sampling distribution of the mean of a large number of independent samples will be approximately normal, regardless of the population's distribution.
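A small seeded simulation of the theorem above: the population is Uniform(0, 1), which is far from normal, yet the sample means center on the population mean and their spread shrinks as the sample size grows.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the illustration is reproducible

def sample_mean(n):
    # Mean of one sample of size n drawn from Uniform(0, 1).
    return statistics.mean(rng.random() for _ in range(n))

# 1000 sample means each, for small and larger sample sizes.
means_small = [sample_mean(5) for _ in range(1000)]
means_large = [sample_mean(50) for _ in range(1000)]
```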
What is P-Value?
The probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. P < 0.05 is a common threshold for statistical significance.
What is Underfitting?
Model is too simple and misses patterns.
What is Supervised Learning?
Learning from labeled data (e.g., spam detection, disease prediction).
What is Unsupervised Learning?
Finding hidden patterns in unlabeled data (e.g., clustering customers).
What is Semi-Supervised Learning?
A mix of labeled and unlabeled data for training.
What is Reinforcement Learning?
An agent learns by interacting with an environment and receiving feedback (rewards/punishments).
What is a Null Hypothesis?
A default assumption that there is no effect or difference (e.g., no correlation between variables).
What is Feature Engineering?
Creating new features or modifying existing ones to improve model performance.
What is One-Hot Encoding?
Representing categorical variables as binary columns.
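The encoding described above as a small helper: each category becomes one binary column, and each row gets a 1 in exactly one of them.

```python
def one_hot(values):
    # Map a categorical column to binary indicator columns.
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return rows, categories
```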
What is Dimensionality Reduction?
Reducing the number of input variables while preserving important information (e.g., using PCA).
What are the steps in a typical Data Science workflow?
Problem definition, Data collection, Data preprocessing, EDA, Modeling, Evaluation, Deployment.
What are some common data quality issues?
Missing values, duplicates, inconsistent formats, outliers, noise.
What is Bias in the Bias-Variance Tradeoff?
Error from incorrect assumptions (underfitting).
What is Variance in the Bias-Variance Tradeoff?
Error from sensitivity to small fluctuations (overfitting).
What is the goal of the Bias-Variance Tradeoff?
Find a balance for good generalization.
What is ROC Curve?
A plot of True Positive Rate vs. False Positive Rate. Used to evaluate classifiers.
What is AUC?
Area Under the ROC Curve — higher AUC means better classification performance.
What is a Data Pipeline?
A sequence of steps to collect, process, analyze, and store data automatically.