Data Science

250 Terms

1

Null Hypothesis

A statistical assumption that there is no effect, no difference, or no relationship between variables in a population, serving as the default claim to be tested.

2

Shapiro-Wilk Test

A statistical test used to check if a dataset is normally distributed, with the null hypothesis stating that the data comes from a normal distribution.

3

Normal Distribution

A continuous probability distribution that is symmetric and bell-shaped, where most values cluster around the mean and probabilities decrease as you move further from it.

4

Correlation

A statistical measure that describes the strength and direction of the relationship between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
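
As a minimal sketch of how this coefficient can be computed, the Pearson formula divides the sum of paired deviations by the product of the square roots of the summed squared deviations. The function name `pearson_r` and the sample data are illustrative assumptions, not part of the original card.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each variable
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

hours = [1, 2, 3, 4, 5]          # made-up example data
score_up = [2, 4, 6, 8, 10]      # rises in step with hours
score_down = [10, 8, 6, 4, 2]    # falls in step with hours

r_pos = pearson_r(hours, score_up)    # close to +1
r_neg = pearson_r(hours, score_down)  # close to -1
```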

5

Autocorrelation

A statistical measure that shows how a time series is correlated with its own past values, indicating repeating patterns, trends, or seasonality.
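
One common way to estimate this is the sample autocorrelation at a chosen lag. The sketch below assumes that estimator; the helper name `autocorrelation` and the repeating series are made up for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given positive lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: squared deviations over the whole series
    denom = sum((v - mean) ** 2 for v in series)
    # Numerator: each value paired with the value `lag` steps earlier
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

# Made-up series that repeats every 4 steps
signal = [1, 2, 3, 2] * 6
acf_lag4 = autocorrelation(signal, 4)   # lag equal to the period: strongly positive
acf_lag2 = autocorrelation(signal, 2)   # half a period out of phase: strongly negative
```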

6

Non-Parametric Methods

Statistical techniques that do not assume a specific probability distribution for the data, making them useful for small samples, ordinal data, or when normality assumptions are not met.

7

Uniform Distribution

A probability distribution where all outcomes in a given range are equally likely, with constant probability across the interval.

8

Log-Normal Distribution Test

A statistical test used to determine if data follows a log-normal distribution, where the logarithm of the variable is normally distributed.

9

Discrete Random Variable

A random variable that can take on a countable number of distinct values, often representing outcomes like counts, categories, or integers.

10

Empirical Distribution

A probability distribution derived directly from observed data, where probabilities are assigned based on the relative frequencies of outcomes in the sample.
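
A small sketch of the idea: count each observed outcome and divide by the sample size. The function name `empirical_distribution` and the die rolls are hypothetical.

```python
from collections import Counter

def empirical_distribution(observations):
    """Map each observed outcome to its relative frequency in the sample."""
    counts = Counter(observations)
    total = len(observations)
    return {outcome: count / total for outcome, count in counts.items()}

rolls = [1, 3, 3, 6, 3, 1, 2, 6, 3, 5]   # made-up die rolls
dist = empirical_distribution(rolls)     # the value 3 appears in 4 of 10 rolls
```

Outcomes that never appear in the sample (here, a roll of 4) simply get no entry, which is one reason an empirical distribution can differ from the underlying one.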

11

Descriptive Probability

The use of probability to summarize and describe the likelihood of outcomes based on observed data, rather than predicting or inferring about a larger population.

12

Algorithm in Data Science

A step-by-step set of rules or instructions used to process data, perform analysis, or build models to identify patterns, make predictions, or solve problems.

13

Model in Data Science

A mathematical or computational representation of a real-world process or system, built from data to describe relationships, make predictions, or support decision-making.

14

Joint Distribution

A probability distribution that describes the likelihood of two or more random variables occurring together, showing how their probabilities are related.

15

Marginal Probability

The probability of a single event occurring, found by summing or integrating over all possible outcomes of the other variables in a joint distribution.

16

Conditional Probability

The probability of an event occurring given that another event has already occurred, expressed as 𝑃(𝐴∣𝐵) = 𝑃(𝐴∩𝐵) / 𝑃(𝐵), which is defined only when 𝑃(𝐵) > 0.
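
To tie joint, marginal, and conditional probability together, here is a sketch over a small joint table. The weather/traffic variables, their probabilities, and the helper names are all invented for illustration.

```python
# Hypothetical joint distribution P(weather, traffic); the values sum to 1
joint = {
    ("rain", "heavy"): 0.30,
    ("rain", "light"): 0.10,
    ("dry", "heavy"): 0.15,
    ("dry", "light"): 0.45,
}

def marginal_weather(weather):
    """P(weather): sum the joint probabilities over every traffic level."""
    return sum(p for (w, _), p in joint.items() if w == weather)

def conditional_traffic(traffic, weather):
    """P(traffic | weather) = P(traffic and weather) / P(weather)."""
    return joint[(weather, traffic)] / marginal_weather(weather)

p_rain = marginal_weather("rain")                          # 0.30 + 0.10
p_heavy_given_rain = conditional_traffic("heavy", "rain")  # 0.30 / 0.40
```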

17

Discretize

The process of converting continuous data or variables into discrete categories or intervals, often used in data preprocessing or feature engineering.
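
A minimal binning sketch, assuming fixed upper bin edges chosen by hand. The `discretize` helper, the age cut-offs, and the labels are hypothetical.

```python
def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value."""
    for upper, label in zip(edges, labels):
        if value <= upper:
            return label
    return labels[-1]   # values above the last edge fall in the final bin

# Hypothetical age bins: up to 17, 18 to 64, 65 and over
edges = [17, 64]
labels = ["minor", "adult", "senior"]
ages = [5, 17, 18, 40, 64, 65, 90]
binned = [discretize(a, edges, labels) for a in ages]
```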

18

Discrete Variable

A variable that can take only distinct, separate values (often integers or categories), with no intermediate values possible between them.

20

Central Limit Theorem

A statistical principle stating that the sampling distribution of the sample mean approaches a normal distribution as the sample size becomes large, regardless of the shape of the population's distribution, provided the observations are independent and the population has finite variance.
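
The theorem can be checked informally by simulation. The sketch below draws many samples from a skewed exponential population (an assumed example, with mean 1 and standard deviation 1) and looks at how the sample means behave; the sample size, repetition count, and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def sample_mean(n):
    """Mean of n draws from a skewed exponential population (mean 1, sd 1)."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(2000)]

center = statistics.fmean(means)   # expected to sit near the population mean, 1
spread = statistics.stdev(means)   # expected to sit near 1 / sqrt(n), about 0.14
```

A histogram of `means` would look roughly bell-shaped even though the individual draws are strongly skewed.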

21

Confidence Interval

A range of values, derived from sample data, constructed so that a specified proportion of such intervals (e.g., 95%) would contain the true population parameter if the sampling were repeated many times.
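
As a rough sketch, an interval for a population mean can be built from the sample mean plus or minus a critical value times the standard error. This version assumes a normal approximation; for a sample this small a t-distribution critical value is the more usual choice. The function name and the height data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(data, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(data)
    mean = statistics.fmean(data)
    std_err = statistics.stdev(data) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

heights = [170, 165, 180, 175, 160, 172, 168, 177, 169, 174]  # made-up sample
low, high = mean_confidence_interval(heights)
```

Raising `confidence` to 0.99 widens the interval, which reflects the trade-off between confidence and precision.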

22

Graph of a Function

A visual representation of all possible input-output pairs (x,f(x)), usually plotted on a coordinate plane, showing how the function behaves across its domain.

23

Cartesian Coordinates

A system that locates points on a plane using ordered pairs (x,y), where 𝑥 represents the horizontal position and y represents the vertical position, defined relative to perpendicular axes.

24

Euler's Number

A mathematical constant denoted by 𝑒≈2.718, which is the base of the natural logarithm and arises in continuous growth, compound interest, and many areas of calculus and probability.

25

Domain of a Function

The set of all possible input values (𝑥) for which the function is defined.

26

Range of a Function

The set of all possible output values (𝑦) that the function can produce.

27

K-Means Clustering

An unsupervised machine learning algorithm that partitions data into K clusters by assigning each point to the nearest centroid and updating centroids to minimize within-cluster distances.
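
The assign-then-update loop can be sketched in plain Python. This is a simplified illustration for 2-D points with hand-picked starting centroids and a fixed number of iterations, not a library implementation; the data and the `kmeans` helper are assumptions.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain K-Means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]  # made-up data
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

With two well-separated groups like these, the centroids settle at the mean of each group after the first pass.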

28

Clustering (Unsupervised)

A machine learning technique that groups similar data points together without using predefined labels, aiming to discover hidden patterns or structures in the data.

30

Scale the Data (KMeans)

A preprocessing step where features are normalized or standardized so all variables contribute equally to distance calculations, preventing larger-scale features from dominating clustering.
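
One way to do this is standardization (z-scores). The sketch below assumes that choice; the `standardize` helper and the income and age values are made up to show two features with very different ranges ending up on the same scale.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

income = [20_000, 40_000, 60_000, 80_000]   # large range, made-up
age = [20, 30, 40, 50]                      # small range, made-up

income_z = standardize(income)
age_z = standardize(age)
```

Before scaling, a distance between two rows would be driven almost entirely by income; after scaling, both features contribute on comparable terms.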

32

Elbow Method

A technique to determine the optimal number of clusters in K-Means by plotting the within-cluster sum of squares (inertia) against the number of clusters and choosing the point where the curve bends ("elbow"), showing diminishing returns from adding more clusters.

33

Supervised Classification Model

A machine learning model trained on labeled data to predict categorical outcomes (class labels) for new, unseen inputs.

34

Random Forest

An ensemble learning method that builds multiple decision trees and combines their predictions to improve accuracy and reduce overfitting.

35

F1-Score

A performance metric that combines precision and recall into a single value using their harmonic mean, especially useful for evaluating models on imbalanced datasets.
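
A short worked sketch of the harmonic mean, starting from confusion-matrix counts. The counts and the function name are hypothetical, and the sketch assumes at least one true positive so that no division by zero occurs.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Made-up counts: precision = 0.8, recall = 0.5
f1 = f1_score(tp=8, fp=2, fn=8)
```

The result (about 0.62) sits closer to the lower of the two values than their arithmetic mean of 0.65 would, which is why F1 penalizes a model that is strong on only one of them.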

36

Regression

A supervised learning method used to model and predict continuous numerical outcomes based on input features.

37

Imputation

The process of filling in missing data values using strategies like mean, median, or predictive methods.
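
The simplest of these strategies, mean imputation, might look like the sketch below. It assumes missing values are represented as `None`; the helper name and the ages are invented.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

ages = [25, None, 35, 30, None]   # made-up column with two missing entries
filled = impute_mean(ages)
```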

38

Type Conversion

Changing data from one type or format to another, such as string to integer or float.

39

Downstream Modeling

The stage where preprocessed data is used to train and evaluate machine learning models.

40

Exploratory Data Analysis (EDA)

The process of analyzing and visualizing data to find patterns, detect anomalies, and test assumptions before modeling.

41

MinMaxScaler

A preprocessing method that rescales features to a fixed range, usually [0, 1].
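
The underlying formula is (x - min) / (max - min), optionally stretched to another target range. The following is a plain-Python sketch of that formula, not scikit-learn's own implementation; `min_max_scale` is a hypothetical helper and it assumes the values are not all equal.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

scaled = min_max_scale([10, 20, 25, 50])   # made-up feature values
```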

42

Heatmap

A graphical representation of data where values are shown as colors, often used to visualize correlations in a matrix.

43

Radar Chart

A graphical method for displaying multivariate data, where multiple axes start from the same point and values are plotted in a spider-web shape.

44

Tokenization

Splitting text into smaller units such as words, sentences, or subwords.
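
A very simple word-level tokenizer can be sketched with a regular expression. Real NLP libraries handle many more cases (punctuation, Unicode, subwords); the pattern and the `tokenize_words` helper here are illustrative assumptions.

```python
import re

def tokenize_words(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text.lower())

tokens = tokenize_words("Don't panic: data science is fun!")
```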

45

Stemming

Reducing words to their root form by trimming suffixes or prefixes (e.g., "running" → "run").

46

Lemmatization

Reducing words to their base dictionary form using linguistic rules (e.g., "better" → "good").

47

POS Tagging

Assigning parts of speech (noun, verb, adjective, etc.) to each word in text.

48

Sentiment Analysis

Determining the emotional tone or opinion expressed in text (positive, negative, neutral).

49

Corpora

Large structured collections of text used for training, testing, and analyzing NLP models.

50

Named Entity Recognition (NER)

Identifying and classifying entities in text such as names, places, dates, and organizations.

51

Dependency Parsing

Analyzing the grammatical structure of a sentence to determine how words are related.

52

Word Vectors

Numerical representations of words in continuous space that capture meaning and semantic relationships.

53

Denominator

The bottom number in a fraction, representing the total number of equal parts the whole is divided into.

54

One-Hot Encoding

A method to convert categorical variables into binary vectors with one position per category, where the position for the category that is present is set to 1 and all other positions are set to 0.
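
A minimal sketch of the encoding, assuming the category order is simply alphabetical. The `one_hot` helper and the color values are hypothetical.

```python
def one_hot(values):
    """Return the sorted category list and one binary vector per input value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

categories, encoded = one_hot(["red", "green", "blue", "green"])
# categories is ["blue", "green", "red"], so "red" becomes [0, 0, 1]
```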

55

Absolute Value

The distance of a number from zero on the number line, always non-negative.

56

Dot Product

An operation that multiplies two vectors and returns a scalar value, often used to measure similarity.

57

Vector Magnitude

The length of a vector, calculated as the square root of the sum of squared components.

58

Norms

Functions that measure the size or length of vectors, such as L1 (Manhattan) or L2 (Euclidean) norms.
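
The dot product, vector magnitude, and the two norms named here fit in a few lines of plain Python. The helper names and the example vectors are illustrative.

```python
import math

def dot(u, v):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l1_norm(v):
    """Manhattan norm: sum of absolute components."""
    return sum(abs(a) for a in v)

def l2_norm(v):
    """Euclidean norm (vector magnitude): square root of the sum of squares."""
    return math.sqrt(dot(v, v))

v = [3, -4]
manhattan = l1_norm(v)    # |3| + |-4| = 7
euclidean = l2_norm(v)    # sqrt(9 + 16) = 5.0
product = dot([1, 2, 3], [4, 5, 6])   # 4 + 10 + 18 = 32
```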

59

Matrix

A rectangular array of numbers arranged in rows and columns.

60

Vector

An ordered list of numbers that can represent a point or direction in space.

61

Tensor

A generalization of scalars, vectors, and matrices to higher dimensions.

62

Linear Maps

Functions between vector spaces that preserve addition and scalar multiplication.

63

Linear System of Equations

A collection of linear equations involving the same set of variables, often solved simultaneously.

64

Eigenvectors

Non-zero vectors whose direction is unchanged (or exactly reversed) by a linear transformation, so that the transformation only scales them.

65

Eigenvalues

Scalars that indicate how much the corresponding eigenvector is stretched or compressed during a linear transformation.
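
The defining relation A v = λ v can be checked directly. The sketch below uses a hand-picked 2x2 matrix whose eigenpairs are known (λ = 3 with v = [1, 1], and λ = 1 with v = [1, -1]); the matrix and the `mat_vec` helper are assumptions for illustration.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

stretched = mat_vec(A, [1, 1])    # [3, 3]: same direction, scaled by 3
unchanged = mat_vec(A, [1, -1])   # [1, -1]: same direction, scaled by 1
rotated = mat_vec(A, [1, 0])      # [2, 1]: direction changes, so not an eigenvector
```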

66

Normalization

A preprocessing technique that rescales data so that values fall within a standard range, often between 0 and 1.

67

Customer Churn

When customers stop doing business with a company or service, often measured as the percentage of customers lost during a given period.

68

DBSCAN

A density-based clustering algorithm that groups together points that are closely packed and labels points in sparse regions as noise.

69

Pairplot

A Seaborn visualization that shows relationships between pairs of variables using scatterplots and histograms (or KDEs) in a grid format.

70

Cosine Similarity

A metric that measures how similar two vectors are by calculating the cosine of the angle between them, with values ranging from -1 (opposite directions) to 1 (same direction), independent of the vectors' lengths.
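
In formula terms this is the dot product divided by the product of the two magnitudes. A minimal sketch, assuming neither vector is all zeros; the function name and vectors are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

same_direction = cosine_similarity([1, 2], [2, 4])    # close to 1
perpendicular = cosine_similarity([1, 0], [0, 1])     # 0
opposite = cosine_similarity([1, 2], [-1, -2])        # close to -1
```

Note that [1, 2] and [2, 4] score as maximally similar even though their lengths differ, because only direction matters.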

71

Cosine

A trigonometric function that represents the ratio of the adjacent side to the hypotenuse in a right triangle.

72

Matrix Determinant

A scalar value that represents how a matrix scales space and indicates whether the matrix is invertible (non-zero determinant) or not (zero determinant).
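
For a 2x2 matrix [[a, b], [c, d]] the determinant is ad - bc. A small sketch with made-up matrices shows the invertible and non-invertible cases; `det2` is a hypothetical helper limited to the 2x2 case.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as two rows."""
    (a, b), (c, d) = m
    return a * d - b * c

invertible = det2([[2, 1], [1, 2]])   # 2*2 - 1*1 = 3, non-zero
singular = det2([[1, 2], [2, 4]])     # 1*4 - 2*2 = 0, rows are multiples of each other
```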

73

Orthogonal Matrices

Square matrices whose rows and columns are orthonormal vectors, meaning their transpose is equal to their inverse (𝐴^𝑇𝐴=𝐼).

75

Logistic Regression Model

A supervised learning model that predicts the probability of binary outcomes using the logistic (sigmoid) function.

76

Data Serialization

The process of converting data into a standard format (like JSON, CSV, or binary) so it can be stored, transmitted, and later reconstructed.
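
A round trip through JSON illustrates the store-then-reconstruct idea. The record below is made up; the sketch uses Python's standard `json` module.

```python
import json

record = {"id": 7, "tags": ["a", "b"], "score": 0.5}   # made-up in-memory data

text = json.dumps(record)     # serialize: Python dict -> JSON string
restored = json.loads(text)   # deserialize: JSON string -> Python dict
```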

77

Kernel Density Estimation (KDE)

A non-parametric method to estimate the probability density function of a dataset by smoothing data points with a kernel function, producing a smooth curve that represents the distribution.
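
A sketch with a Gaussian kernel: the estimate at x is the average of bell curves centered on each data point, each with width set by a bandwidth. The data, the bandwidth of 0.5, and the `gaussian_kde` helper name are assumptions chosen for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a density function built from Gaussian kernels on each data point."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, then normalize
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / norm

    return density

density = gaussian_kde([1.0, 1.2, 0.8, 5.0], bandwidth=0.5)
near_cluster = density(1.0)   # high: three observations are nearby
in_gap = density(3.0)         # low: no observations nearby
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows individual points more closely.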

78

Rolling Average (Moving Average)

A technique that calculates the average of data points over a sliding window, used to smooth short-term fluctuations and highlight longer-term trends.
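
A minimal sketch of a trailing window: each output is the mean of the current value and the `window - 1` values before it, so the output is shorter than the input. The helper name and data are illustrative.

```python
def rolling_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(values[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(values))]

smoothed = rolling_average([10, 20, 30, 40, 50], window=3)
# windows: (10, 20, 30), (20, 30, 40), (30, 40, 50)
```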

79

Feature Selection

The process of choosing the most relevant features (variables) from a dataset to improve model performance, reduce overfitting, and simplify the model.

80

Train/Test Split

A method of dividing data into two sets: one for training the model and the other for testing its performance on unseen data.
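
A plain-Python sketch of the idea: shuffle a copy of the rows, then hold out a fraction for testing. The helper `split_train_test`, the 20% test fraction, and the seed are assumptions; libraries such as scikit-learn provide their own version of this step.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a test portion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)   # seeded for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_train_test(list(range(10)), test_fraction=0.2)
```

Shuffling first matters when the rows are ordered (for example by date or by class), since an unshuffled split could leave the test set unrepresentative.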

81

Nominal Data

Categorical data without order (e.g., colors, names).

82

Ordinal Data

Categorical data with a meaningful order but unequal intervals (e.g., rankings).

83

Interval Data

Data with equal intervals but no true zero (e.g., temperature in °C).

84

Ratio Data

Data with equal intervals and a true zero, allowing ratio comparisons (e.g., weight, height).

85

Supervised Learning

Models trained on labeled data to predict outcomes.

86

Unsupervised Learning

Models that find patterns or group data without labels.

87

Classification

A supervised task that predicts discrete categories (e.g., spam/not spam).

88

Regression

A supervised task that predicts continuous values (e.g., price, age).

89

Clustering

An unsupervised task that groups similar data points without predefined labels.

91

Multilayer Perceptron (MLP)

A type of artificial neural network composed of input, hidden, and output layers, used for supervised learning.

92

Sigmoid Function (Logistic Regression)

A mathematical function that maps any real value into the range (0, 1), giving an S-shaped curve for probabilities.
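
The function is σ(z) = 1 / (1 + e^(-z)). The sketch below computes it in a form that avoids overflow for large negative inputs; the two-branch layout is an implementation choice, not something stated on the card.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # safe: z is negative, so e is below 1
    return e / (1.0 + e)

midpoint = sigmoid(0)        # 0.5: the center of the S-curve
high = sigmoid(4)            # about 0.98
low = sigmoid(-4)            # about 0.02
```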

93

Latent Processes (ML)

Hidden or unobserved variables or structures that influence observed data, often inferred using models like PCA, factor analysis, or topic modeling.

94

Confocal Images

High-resolution optical images produced by confocal microscopy, which uses focused laser light and spatial filtering to remove out-of-focus blur and improve depth accuracy.

95

Convolutional Layers

Layers in a neural network that apply filters to input data (like images) to detect features such as edges, textures, and shapes.

96

Latent Variables

Hidden or unobserved variables that influence observed data, often representing underlying patterns or factors inferred through models like PCA, factor analysis, or autoencoders.

97

Empirical Risk Minimization (ERM)

A principle in machine learning where a model is trained to minimize the average loss (error) on the training data, serving as an estimate of the true risk over the entire data distribution.
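
As a toy sketch of the principle, compute the average squared loss of a few candidate models on the training data and keep the one with the lowest value. The data, the candidate slopes, and the choice of squared loss are all assumptions made for illustration.

```python
def empirical_risk(predict, data):
    """Average squared loss of a prediction function over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

data = [(1, 2), (2, 4), (3, 6)]   # made-up training set following y = 2x

# Hypothetical model family: y = slope * x for a few candidate slopes
candidates = {slope: (lambda x, s=slope: s * x) for slope in (1.0, 2.0, 3.0)}

risks = {slope: empirical_risk(f, data) for slope, f in candidates.items()}
best_slope = min(risks, key=risks.get)   # the slope with the lowest training loss
```

In practice the search runs over a continuous parameter space with an optimizer rather than a short list, but the objective being minimized is the same average loss.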

98

Rational Numbers

Numbers that can be expressed as a fraction of two integers (a/b), where b ≠ 0.

99

Irrational Numbers

Numbers that cannot be expressed as a fraction of two integers; their decimals are non-repeating and non-terminating (e.g., √2, π).

100

Natural Numbers

The counting numbers 1, 2, 3, and so on (some conventions also include 0).