Anscombe’s quartet
A set of four datasets with nearly identical statistical summaries (mean, variance, correlation, regression line) but very different distributions when graphed.
Data science process
Iterative workflow of asking questions, then collecting, cleaning, analyzing, and communicating data.
Information
Processed or organized data that has meaning and context.
Knowledge
Information interpreted and applied to make decisions or take action.
Data Science
An interdisciplinary field that uses statistics, mathematics, programming, and domain knowledge to extract insights from data.
Knowledge Engineering
Building systems that capture and apply expert knowledge for automated reasoning.
Median
Middle value of a sorted dataset.
Mode
Most frequently occurring value.
Range
Difference between the largest and smallest values.
Variance
Average squared deviation from the mean.
Standard deviation
Square root of variance; represents spread in original units.
Interquartile range (IQR)
Difference between Q3 and Q1; robust measure of spread.
Skewness
Measure of asymmetry in a distribution.
Kurtosis
Measure of tail heaviness compared to a normal distribution.
Z-score
Standardized value calculated as (x − µ)/σ.
Expected value
Long-term average outcome of a random variable.
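The summary statistics above can be sketched with Python's standard `statistics` module; the sample data is made up purely for illustration:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]         # hypothetical sample

mean = statistics.mean(data)             # arithmetic average
median = statistics.median(data)         # middle value of the sorted data
mode = statistics.mode(data)             # most frequent value
variance = statistics.pvariance(data)    # mean squared deviation from the mean
std_dev = statistics.pstdev(data)        # square root of variance
data_range = max(data) - min(data)       # largest minus smallest value

# Z-score of a single observation: (x − µ) / σ
z = (9 - mean) / std_dev

print(mean, median, mode, variance, std_dev, data_range, z)
```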
Population
Entire group of interest in a study.
Sample
Subset drawn from a population to estimate characteristics.
Parameter
Numerical summary of a population (µ, σ).
Statistic
Numerical summary of a sample (x̄, s).
Law of large numbers
Sample mean approaches population mean as sample size increases.
Central limit theorem
Distribution of sample means approaches normal as sample size increases.
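Both theorems can be illustrated by simulation; the sample sizes and seed below are arbitrary choices:

```python
import random
import statistics

random.seed(0)                      # reproducible illustrative run

# Law of large numbers: the mean of many uniform(0, 1) draws
# approaches the population mean of 0.5.
large_sample = [random.random() for _ in range(100_000)]

# Central limit theorem: means of many size-50 samples cluster
# around 0.5 with spread roughly sigma / sqrt(50) ≈ 0.041.
sample_means = [statistics.mean(random.random() for _ in range(50))
                for _ in range(1_000)]

print(abs(statistics.mean(large_sample) - 0.5))   # small
print(statistics.stdev(sample_means))             # near 0.041
```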
Correlation
Measures how strongly two variables move together.
Correlation vs causation
Correlation indicates association; causation requires proof that one variable causes changes in another.
Pearson correlation
Measures the strength of a linear relationship between two variables, ranging from −1 to 1.
Spearman correlation
Rank-based correlation robust to outliers and nonlinear relationships.
Covariance
Measures whether two variables increase or decrease together.
Outlier
Data point far from the rest of the dataset.
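A plain-Python sketch contrasting the two correlation measures; the rank helper assumes no tied values:

```python
def pearson(xs, ys):
    # Pearson r: covariance divided by the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    # Spearman rho: Pearson correlation applied to the ranks of the data
    def ranks(vals):                 # simple ranking; assumes no ties
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]        # monotonic but nonlinear relationship

print(pearson(x, y))           # below 1: the relationship is not linear
print(spearman(x, y))          # 1: the ranks agree perfectly
```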
Probability
Likelihood of an event occurring.
Sample space
All possible outcomes of an experiment.
Conditional probability
Probability of event A occurring given event B.
Joint probability
Probability of two events occurring together.
Bayes’ theorem
Formula for updating probabilities using evidence.
Prior probability
Belief about an event before observing data.
Posterior probability
Updated belief after observing evidence.
Likelihood
Probability of observing data given parameters.
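The four terms above combine as posterior = likelihood × prior ÷ evidence; a sketch with hypothetical diagnostic-test numbers:

```python
# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
prior = 0.01            # P(disease): prior probability
sensitivity = 0.95      # P(positive | disease): likelihood
false_positive = 0.05   # P(positive | no disease)

# Total probability of a positive result (the evidence)
evidence = sensitivity * prior + false_positive * (1 - prior)

posterior = sensitivity * prior / evidence
print(posterior)        # about 0.16: still unlikely despite a positive test
```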
Normal distribution
Bell-shaped distribution defined by mean and standard deviation.
Empirical rule
68%, 95%, and 99.7% of data fall within 1, 2, and 3 standard deviations.
Binomial distribution
Distribution for number of successes in fixed trials.
Poisson distribution
Distribution for counts of events occurring in a time interval.
Exponential distribution
Distribution describing time between events.
Bernoulli distribution
Distribution for a single success/failure trial.
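A sketch of the binomial probability mass function, with Bernoulli as the n = 1 special case, using `math.comb`:

```python
import math

def binomial_pmf(k, n, p):
    # Probability of exactly k successes in n independent trials,
    # each succeeding with probability p
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Bernoulli: a single success/failure trial (n = 1)
print(binomial_pmf(1, 1, 0.3))        # 0.3

# Probability of exactly 2 heads in 4 fair coin flips
print(binomial_pmf(2, 4, 0.5))        # 6/16 = 0.375
```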
Structured data
Data organized into rows and columns.
Unstructured data
Data without predefined structure (text, images, audio).
Numeric data
Data represented with numbers.
Categorical data
Data represented as categories or labels.
Discrete variable
Variable with countable values.
Continuous variable
Variable with infinite values within a range.
Random variable
Variable whose value depends on random outcomes.
Bar graph
Used to compare categories.
Histogram
Shows distribution of numerical data.
Line graph
Shows trends over time.
Scatter plot
Shows relationship between two variables.
Box plot
Displays quartiles, spread, and outliers.
Pie chart
Shows proportions of a whole.
Heatmap
Uses color to represent data intensity or correlation.
Data visualization
Graphical representation of data to reveal patterns.
Data cleaning
Fixing errors, removing duplicates, handling missing values.
Data wrangling
Organizing raw data for analysis.
Data transformation
Converting data into a useful format.
Data quality issues
Problems such as missing values, duplicates, or noise.
Missing data mechanisms
MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random).
Imputation methods
Methods to fill missing values (mean, median, KNN, regression).
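A sketch of mean and median imputation on a hypothetical column, with missing values encoded as `None`:

```python
import statistics

values = [12.0, None, 15.0, 14.0, None, 19.0]   # hypothetical column

observed = [v for v in values if v is not None]
mean_fill = statistics.mean(observed)       # mean imputation
median_fill = statistics.median(observed)   # median imputation (robust to outliers)

# Fill each missing slot with the mean of the observed values
imputed = [v if v is not None else mean_fill for v in values]
print(imputed)
```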
Feature engineering
Creating new features from existing data.
Feature scaling
Adjusting data ranges using normalization or standardization.
Feature encoding
Converting categorical variables to numerical format.
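A sketch of min-max normalization, standardization (feature scaling), and one-hot encoding (feature encoding) on toy values:

```python
import statistics

ages = [20, 30, 40, 50]

# Min-max normalization rescales to the [0, 1] interval
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]

# Standardization rescales to mean 0 and standard deviation 1
mu, sigma = statistics.mean(ages), statistics.pstdev(ages)
standardized = [(a - mu) / sigma for a in ages]

# One-hot encoding: one binary column per category
colors = ["red", "green", "red"]
categories = sorted(set(colors))         # ["green", "red"]
one_hot = [[int(c == cat) for cat in categories] for c in colors]

print(normalized)    # [0.0, 1/3, 2/3, 1.0]
print(one_hot)       # [[0, 1], [1, 0], [0, 1]]
```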
Feature selection
Choosing relevant variables for modeling.
Dimensionality reduction
Reducing number of variables while preserving information.
Principal component analysis (PCA)
Linear dimensionality reduction technique.
t-SNE / UMAP
Nonlinear dimensionality reduction for visualization.
Autoencoders
Neural networks used for dimensionality reduction.
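A PCA sketch via singular value decomposition on centered toy data (NumPy assumed available):

```python
import numpy as np

# Toy data: variance is larger along the first feature
X = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [-2.0, 0.0],
              [0.0, -1.0]])

Xc = X - X.mean(axis=0)              # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                                # keep one principal component
projected = Xc @ Vt[:k].T            # reduced representation

# Fraction of variance explained by each component
explained = S ** 2 / np.sum(S ** 2)
print(explained)                     # first component dominates
```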
Data leakage
When training data accidentally includes information from test data.
Machine learning
Algorithms that learn patterns from data.
Supervised learning
Learning using labeled data.
Unsupervised learning
Finding patterns in unlabeled data.
Reinforcement learning
Learning by rewards and penalties.
Training dataset
Data used to train a model.
Validation dataset
Data used to tune model parameters.
Test dataset
Data used to evaluate final model performance.
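The three datasets above can be sketched as a 60/20/20 split of hypothetical indices:

```python
import random

random.seed(42)                  # arbitrary seed for a reproducible shuffle
indices = list(range(100))
random.shuffle(indices)

train = indices[:60]             # fit model parameters
val = indices[60:80]             # tune hyperparameters, decide early stopping
test = indices[80:]              # final, untouched performance estimate

print(len(train), len(val), len(test))
```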
Bias-variance tradeoff
Balance between underfitting and overfitting.
Overfitting
Model memorizes training data but performs poorly on new data.
Underfitting
Model too simple to capture patterns.
Regularization
Techniques that prevent overfitting.
Early stopping
Stopping training when validation performance stops improving.
Ensemble methods
Combine multiple models for better performance.
Bagging
Training models independently on bootstrapped datasets.
Boosting
Sequentially improving models by focusing on errors.
Stacking
Combining predictions of multiple models.
Linear regression
Predicts continuous values using a best-fit line.
Multiple linear regression
Regression using multiple predictors.
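A sketch of simple linear regression using the closed-form least-squares slope and intercept, on data that is exactly linear by construction:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]      # exactly y = 2x + 1 for illustration

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# slope = cov(x, y) / var(x); intercept = mean(y) − slope * mean(x)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):
    return slope * x + intercept

print(slope, intercept)        # 2.0 1.0
```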
Logistic regression
Classification model predicting probabilities.
K-nearest neighbors (KNN)
Classifies based on nearby training examples.
Naive Bayes
Probabilistic classifier assuming feature independence.
Decision tree
Tree structure splitting data based on conditions.
Random forest
Ensemble of decision trees using bagging.
Gradient boosting
Sequential ensemble method focusing on residual errors.
Support vector machine (SVM)
Classifier maximizing margin between classes.
K-means clustering
Groups data into k clusters.
Hierarchical clustering
Creates nested clusters represented as dendrograms.
DBSCAN
Density-based clustering algorithm.
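A minimal one-dimensional k-means sketch with k = 2 and hypothetical starting centroids: alternate between assigning points to the nearest centroid and recomputing centroids as cluster means.

```python
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centroids = [0.0, 10.0]                 # arbitrary starting centroids

for _ in range(10):                     # a few iterations suffice here
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)    # converges to the two group means, ≈ [1.0, 8.0]
```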
Topic modeling (LDA)
Extracts topics from text documents.
Accuracy
Proportion of correct predictions.
Precision
True positives ÷ predicted positives.
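A sketch computing both metrics from hypothetical labels:

```python
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

tp = sum(a == p == 1 for a, p in zip(actual, predicted))        # true positives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
correct = sum(a == p for a, p in zip(actual, predicted))

accuracy = correct / len(actual)   # correct predictions ÷ all predictions
precision = tp / (tp + fp)         # true positives ÷ predicted positives
print(accuracy, precision)
```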