20250415 - part 2

50 Terms

Imbalanced Datasets

A dataset in which one class is significantly more prevalent than the others, which can bias a model toward the majority class.

Random Oversampling

Randomly duplicating minority-class samples (sampling with replacement) until the minority class matches the majority class in size.

Random Undersampling

Randomly discarding majority-class samples until the majority class matches the minority class in size.
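
The two resampling cards above can be sketched with plain pandas. This is a minimal illustration on a made-up toy dataset (the column names "feature" and "label" are assumptions), not the imbalanced-learn implementation.

```python
import pandas as pd

# Hypothetical toy dataset: 8 majority rows (label 0) and 2 minority rows (label 1).
df = pd.DataFrame({"feature": range(10), "label": [0] * 8 + [1] * 2})

counts = df["label"].value_counts()
majority = df[df["label"] == counts.idxmax()]
minority = df[df["label"] == counts.idxmin()]

# Random oversampling: keep every minority row and add random duplicates
# (drawn with replacement) until both classes have the same size.
extra = minority.sample(n=len(majority) - len(minority), replace=True, random_state=0)
oversampled = pd.concat([majority, minority, extra])

# Random undersampling: keep only a random subset of the majority rows
# (drawn without replacement) equal in size to the minority class.
undersampled = pd.concat([majority.sample(n=len(minority), random_state=0), minority])

print(oversampled["label"].value_counts().to_dict())
print(undersampled["label"].value_counts().to_dict())
```

Oversampling keeps all original information but repeats minority rows; undersampling throws majority rows away.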

Imbalanced-learn library

A Python library (imported as imblearn) for dealing with imbalanced datasets through resampling methods such as random oversampling and undersampling.

Importance of Balanced Datasets

Ensures that models are trained fairly, helping to avoid bias towards one class.

pd.to_numeric()

A function that converts a column to a numeric data type; with errors='coerce', non-convertible values become NaN instead of raising an error.

print(sosurveydf['RawSalary'][idx])

Prints the value in the 'RawSalary' column of sosurveydf at index idx.

Seaborn

A Python visualization library based on matplotlib that provides a high-level interface for drawing attractive statistical graphics.

Wrapper

A library or module that encapsulates the functionality of another library, simplifying its use.

Parquet file

A columnar storage file format optimized for use with big data processing frameworks.

Correlation Matrix

A table showing correlation coefficients between variables, indicating strength and direction of relationships.
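
As a small illustration, a correlation matrix can be computed with DataFrame.corr(), which uses the Pearson coefficient by default. The data below is made up for the example.

```python
import pandas as pd

# Hypothetical data: y rises exactly with x, z falls exactly as x rises.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})

# Each cell holds the correlation between the row variable and the column variable.
corr = df.corr()
print(corr)
```

Values near 1 indicate a strong positive relationship, values near -1 a strong negative one, and values near 0 little linear relationship.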

[c for c in df.columns if 'Time' in c]

List comprehension to extract column names containing the term 'Time' from a DataFrame.
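
A short sketch of this list comprehension in context. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with two time-related columns.
df = pd.DataFrame(columns=["PickupTime", "DropoffTime", "Fare", "Distance"])

# Keep only the column names that contain the substring 'Time' (case-sensitive).
time_cols = [c for c in df.columns if "Time" in c]
print(time_cols)
```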

Labels vs Columns vs Indexes

Labels are the identifiers for rows or columns; columns are the vertical fields of a DataFrame, each identified by a column label; the index holds the row labels.

Series name

The name assigned to a pandas Series object, which can be referenced in operations.

pd.read_csv()

Function to read a CSV file into a DataFrame, with options such as index_col for setting an index column.
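
A minimal sketch of reading a CSV and choosing an index column. An in-memory string stands in for a file here, and the column names are made up.

```python
import io

import pandas as pd

# Hypothetical CSV content; a file path would normally be passed instead.
csv_text = "id,name,salary\n101,Ana,50000\n102,Ben,62000\n"

# index_col tells pandas to use the 'id' column as the row labels.
df = pd.read_csv(io.StringIO(csv_text), index_col="id")
print(df)
```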

Multicollinearity

A situation in which two or more independent variables in a regression model are highly correlated.

Pairplots

Visual displays of pairwise relationships between several variables in a dataset.

SQL Query

A request for data or information from a database structured in a specific way using SQL commands.

COUNT() function

SQL function that returns the number of rows that match a specified criterion.

FROM clause

Indicates the table from which to select or delete data in an SQL query.

SELECT statement

Specifies the columns to return in a query.

LIMIT clause

Restricts the number of rows returned by a SQL query.
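
The SQL cards above (COUNT(), FROM, SELECT, LIMIT) can be tried out from Python with the standard-library sqlite3 module. The table name "trips" and its rows are hypothetical.

```python
import sqlite3

# In-memory database with a small made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, "NYC", 12.5), (2, "NYC", 8.0), (3, "Boston", 20.0), (4, "NYC", 15.0)],
)

# COUNT(*) returns the number of rows that match the WHERE criterion.
nyc_count = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE city = 'NYC'"
).fetchone()[0]

# SELECT names the columns, FROM names the table, LIMIT caps the row count.
first_two = conn.execute(
    "SELECT id, fare FROM trips ORDER BY id LIMIT 2"
).fetchall()

conn.close()
print(nyc_count, first_two)
```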

Random Over Sampling (code)

ros = RandomOverSampler(sampling_strategy='not majority') creates an oversampler that resamples every class except the majority class.

Filling Missing Values

Use df['column'] = df['column'].fillna(mean_value) to replace NaN values in a column with its calculated mean; fillna returns a new Series, so the result must be assigned back.
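
A short sketch of mean imputation. The column name "salary" and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"salary": [100.0, np.nan, 300.0]})

# mean() skips NaN by default, so this is the mean of 100 and 300.
mean_value = df["salary"].mean()

# fillna returns a new Series; assign it back to keep the change.
df["salary"] = df["salary"].fillna(mean_value)
print(df)
```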

Method Chaining

A technique to apply multiple methods sequentially in a single statement.
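
One way method chaining might look in pandas, with each step operating on the result of the previous one. The data is made up.

```python
import pandas as pd

# Hypothetical trips, one of which has no city recorded.
df = pd.DataFrame({
    "city": ["NYC", "NYC", "Boston", None],
    "fare": [10.0, 20.0, 30.0, 5.0],
})

# Wrapping the chain in parentheses allows one method per line.
result = (
    df.dropna(subset=["city"])          # drop rows with a missing city
      .groupby("city")["fare"]          # group fares by city
      .mean()                           # average fare per city
      .sort_values(ascending=False)     # highest average first
)
print(result)
```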

Taxi Duration Calculation

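df['duration'] = (df['dropofftime'] - df['pickuptime']).dt.total_seconds() / 60 computes trip duration in minutes: subtracting two datetime columns gives a timedelta, which is converted to seconds and divided by 60.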
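
A runnable sketch of the duration calculation. The timestamps are made up, and the columns are parsed with pd.to_datetime first because the subtraction only works on datetime values.

```python
import pandas as pd

# Hypothetical trips stored as strings, as they often arrive from a CSV.
df = pd.DataFrame({
    "pickuptime": ["2025-04-15 08:00:00", "2025-04-15 09:30:00"],
    "dropofftime": ["2025-04-15 08:25:00", "2025-04-15 10:15:30"],
})

# Convert the string columns to datetime64 so they can be subtracted.
df["pickuptime"] = pd.to_datetime(df["pickuptime"])
df["dropofftime"] = pd.to_datetime(df["dropofftime"])

# Timedelta -> seconds -> minutes.
df["duration"] = (df["dropofftime"] - df["pickuptime"]).dt.total_seconds() / 60
print(df["duration"].tolist())
```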

Indexing with pd.DataFrame()

Creating a DataFrame using a dictionary where keys are column names and values are lists of entries.
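
A minimal sketch of building a DataFrame from a dictionary. The optional index argument supplies row labels; all names and values here are hypothetical.

```python
import pandas as pd

# Keys become column names; each list holds that column's entries.
data = {"name": ["Ana", "Ben"], "age": [31, 27]}

# index= is optional; without it pandas uses 0, 1, 2, ...
df = pd.DataFrame(data, index=["a", "b"])
print(df)
```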

pd.isna()

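Function that checks for missing values, returning a boolean result of the same shape as its input (a single bool for a scalar, a boolean Series for a Series, a boolean DataFrame for a DataFrame).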
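
A short sketch of pd.isna() on a Series and on a scalar, using made-up values.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise check: True where the value is missing.
mask = pd.isna(s)

# Summing a boolean mask counts the missing values.
n_missing = int(mask.sum())

# A scalar input gives a single boolean back.
scalar_is_missing = pd.isna(None)

print(mask.tolist(), n_missing, scalar_is_missing)
```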

df.sample()

DataFrame method that returns a random sample of rows from the DataFrame.
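
A brief sketch of random row sampling. The column name and the random_state value are arbitrary choices for the example.

```python
import pandas as pd

df = pd.DataFrame({"trip_id": range(100)})

# n picks a fixed number of rows; random_state makes the draw repeatable.
sample_a = df.sample(n=5, random_state=1)
sample_b = df.sample(n=5, random_state=1)

# frac picks a fraction of the rows instead of a fixed count.
fraction = df.sample(frac=0.1, random_state=1)

print(len(sample_a), len(fraction))
```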

df.loc[]

Indexer that retrieves rows and columns by label; label slices include both endpoints.

df.iloc[]

Indexer that retrieves rows and columns by integer position; position slices exclude the stop value, as in ordinary Python slicing.
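
The difference between the two indexers can be seen on a small made-up DataFrame with string row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {"fare": [10, 20, 30], "tip": [1, 2, 3]},
    index=["a", "b", "c"],
)

# loc works with labels; the slice "a":"b" includes row "b".
by_label = df.loc["a":"b", "fare"]

# iloc works with positions; the slice 0:1 stops before position 1.
by_position = df.iloc[0:1, 0]

# The same cell reached both ways.
same_cell = df.loc["c", "tip"] == df.iloc[2, 1]

print(by_label.tolist(), by_position.tolist(), same_cell)
```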

Data Leakage

Occurs when information from the test dataset (or any information that would not be available at prediction time) is used during training, leading to over-optimistic estimates of model performance.
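
One common way to avoid leakage during preprocessing is to compute statistics on the training split only and reuse them for the test split. The sketch below standardizes a made-up feature with NumPy; the 80/20 split is an arbitrary choice for illustration.

```python
import numpy as np

# Hypothetical single-feature dataset.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(100, 1))

# Split BEFORE computing any statistics.
X_train, X_test = X[:80], X[80:]

# Mean and standard deviation come from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# The same training statistics are applied to both splits.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

Computing mean and std on all 100 rows would let the test rows influence the transformation applied during training.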

Feature Engineering

The process of using domain knowledge to select, transform, and create features from raw data.

Feature Selection

Choosing the most relevant features for modeling based on their relationships with the target variable.

Correlation Coefficient

A numerical measure, ranging from -1 to 1, of the strength and direction of a relationship between two variables.

Heatmap

A graphical representation of data where individual values are represented as colors.

Outlier Detection in Box Plots

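Identifying data points that fall outside the whiskers of a box plot, which by convention extend up to 1.5 times the interquartile range (IQR) beyond the first and third quartiles.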
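
The same rule a box plot applies visually can be computed directly. This sketch assumes the conventional 1.5 x IQR whisker rule and uses made-up values.

```python
import pandas as pd

# Hypothetical values with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences would be drawn as individual outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```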

Histogram

A graphical representation showing the distribution of a continuous variable.

Data Quality Inspection

The process of evaluating accuracy, completeness, and consistency of data.

Count Plot

A categorical plot depicting the counts of occurrences for each category in a dataset.

SQL processing order

The logical sequence in which the clauses of a SQL query are processed: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
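
The query below is annotated with the logical order in which each clause is evaluated, including HAVING, which filters groups after GROUP BY. It runs against a made-up in-memory SQLite table named "trips".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("NYC", 12.5), ("NYC", 8.0), ("Boston", 20.0), ("Boston", 30.0), ("Chicago", 5.0)],
)

# Clauses are WRITTEN in this order but logically EVALUATED in the numbered order.
query = """
SELECT city, COUNT(*) AS n_trips   -- 5. choose the output columns
FROM trips                         -- 1. pick the source table
WHERE fare > 6                     -- 2. filter individual rows
GROUP BY city                      -- 3. form groups
HAVING COUNT(*) >= 2               -- 4. filter whole groups
ORDER BY n_trips DESC, city        -- 6. sort the result
LIMIT 1                            -- 7. cap the number of rows
"""
result = conn.execute(query).fetchall()
conn.close()
print(result)
```

The Chicago row is removed by WHERE before grouping, and the tie between Boston and NYC is broken alphabetically by the second ORDER BY key.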

pd.to_numeric()

Function to convert its argument to a numeric type; with errors='coerce', values that cannot be parsed become NaN (the default, errors='raise', raises an error instead).
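
A small sketch of the coercion behaviour on a made-up column of salary strings.

```python
import pandas as pd

# Hypothetical raw values; two of them are not plain numbers.
raw = pd.Series(["1000", "2,500", "unknown", "42.5"])

# errors='coerce' turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```

Note that "2,500" also becomes NaN: pd.to_numeric does not strip thousands separators, so such characters need to be removed first.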

sns.heatmap() usage

A function to visualize a correlation matrix in a heatmap format with optional annotations.
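
A minimal sketch of drawing an annotated correlation heatmap, assuming seaborn and matplotlib are installed. The data, the colour map, and the non-interactive "Agg" backend are choices made for this example only.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric data.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})
corr = df.corr()

# annot=True writes each coefficient into its cell; vmin/vmax fix the colour scale.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation matrix")
plt.close("all")
```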

DataFrame creation syntax

Creating a DataFrame involves specifying column names and corresponding values in a dictionary format.

Binning Strategies

Method of converting continuous variables into categorical variables through intervals.
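
One way to bin a continuous variable in pandas is pd.cut. The bin edges and labels below are hypothetical.

```python
import pandas as pd

ages = pd.Series([5, 17, 18, 40, 70])

# Edges define the intervals (0, 17], (17, 64], (64, 120]; the right edge is included by default.
bins = [0, 17, 64, 120]
labels = ["child", "adult", "senior"]

groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())
```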

Statistical Measures in Box Plots

Visual indicators of median, quartiles, and overall data spread.

Data Distribution Visualization

Analyzing how values of a variable are spread across a range.

Feature Engineering Importance

Improves a machine learning model's performance by enhancing the quality of its input data.

Data Entry Errors

Mistakes made during data input that can compromise dataset quality.

Data Transformation Techniques

Methods used to alter the format, structure, or values of data.