Imbalanced Datasets
Datasets in which one class is significantly more prevalent than the others, which can bias a model toward the majority class.
Random Oversampling
Randomly duplicating minority class samples until their number matches the number of majority class samples.
Random Undersampling
Randomly removing majority class samples until their number matches the number of minority class samples.
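Both resampling ideas can be sketched with plain pandas. This is only an illustration on hypothetical toy data (the column names `feature` and `label` are assumptions), not how the imbalanced-learn library implements them.

```python
import pandas as pd

# Hypothetical toy data: 6 majority rows (label 0) and 2 minority rows (label 1).
df = pd.DataFrame({"feature": range(8), "label": [0, 0, 0, 0, 0, 0, 1, 1]})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Random oversampling: keep every minority row and add random duplicates
# (sampling with replacement) until the class sizes match.
extra = minority.sample(
    n=len(majority) - len(minority), replace=True, random_state=0
)
oversampled = pd.concat([majority, minority, extra])

# Random undersampling: keep a random subset of majority rows, without replacement.
undersampled = pd.concat(
    [majority.sample(n=len(minority), random_state=0), minority]
)

over_counts = oversampled["label"].value_counts().to_dict()
under_counts = undersampled["label"].value_counts().to_dict()
```

Oversampling keeps all the data but repeats minority rows; undersampling discards majority rows, which may lose information on small datasets.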
Imbalanced-learn library
A library for dealing with imbalanced datasets through methods like Random Oversampling and Undersampling.
Importance of Balanced Datasets
Ensures that models are trained fairly, helping to avoid bias towards one class.
pd.to_numeric()
A pandas function that converts a column to a numeric data type; with errors='coerce', non-convertible values become NaN.
print(sosurveydf['RawSalary'][idx])
This syntax retrieves the salary value at the specified index from the 'RawSalary' column.
Seaborn
A Python visualization library based on matplotlib that provides a high-level interface for drawing attractive statistical graphics.
Wrapper
A library or module that encapsulates the functionality of another library, simplifying its use.
Parquet file
A columnar storage file format optimized for use with big data processing frameworks.
Correlation Matrix
A table showing correlation coefficients between variables, indicating strength and direction of relationships.
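A minimal sketch of a correlation matrix, assuming pandas and a hypothetical three-column DataFrame. `DataFrame.corr()` computes Pearson correlation by default.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls as x rises.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})

# Square table of pairwise correlation coefficients.
corr = df.corr()

xy = corr.loc["x", "y"]  # close to +1: strong positive relationship
xz = corr.loc["x", "z"]  # close to -1: strong negative relationship
```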
[c for c in df.columns if 'Time' in c]
List comprehension to extract column names containing the term 'Time' from a DataFrame.
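Written out in full, the comprehension might look like the sketch below. The column names are hypothetical; note that the `in` test is case-sensitive substring matching.

```python
import pandas as pd

# Hypothetical empty DataFrame; only the column names matter here.
df = pd.DataFrame(columns=["PickupTime", "DropoffTime", "Fare", "Timezone"])

# Keep every column name that contains the substring 'Time'.
time_cols = [c for c in df.columns if "Time" in c]
```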
Labels vs Columns vs Indexes
Labels are the identifiers for rows or columns; columns are the vertical fields of a DataFrame, identified by column labels; the index holds the row labels.
Series name
The name assigned to a pandas Series object, which can be referenced in operations.
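A short sketch of where a Series name shows up, using hypothetical values. The name becomes the column label when the Series is converted to a DataFrame, and a column selected from a DataFrame carries its column label as its name.

```python
import pandas as pd

s = pd.Series([1, 2, 3], name="salary")

# The Series name becomes the column label.
frame = s.to_frame()
frame_columns = frame.columns.tolist()

# A column pulled out of a DataFrame is a Series named after that column.
selected_name = frame["salary"].name
```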
pd.read_csv()
Function to read a CSV file into a DataFrame, with options for setting an index column.
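A minimal sketch of `pd.read_csv()` with `index_col`. The CSV text is hypothetical and is read from an in-memory buffer so the example needs no file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content; a file path would normally go here instead.
csv_text = "id,name,salary\n1,Ann,50000\n2,Bob,62000\n"

# index_col makes the 'id' column the row index instead of a regular column.
df = pd.read_csv(io.StringIO(csv_text), index_col="id")
```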
Multicollinearity
A situation in which two or more independent variables in a regression model are highly correlated.
Pairplots
Visual displays of pairwise relationships between several variables in a dataset.
SQL Query
A request for data or information from a database structured in a specific way using SQL commands.
COUNT() function
SQL function that returns the number of rows that match a specified criterion.
FROM clause
Indicates the table from which to select or delete data in an SQL query.
SELECT statement
Specifies the columns to return in a query.
LIMIT clause
Restricts the number of rows returned by a SQL query.
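The SELECT, FROM, COUNT() and LIMIT cards can be tried together with Python's built-in `sqlite3` module. The `trips` table and its rows are hypothetical, and the SQL shown uses SQLite's dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [(1, "NYC", 12.5), (2, "NYC", 30.0), (3, "Boston", 8.0), (4, "NYC", 5.0)],
)

# COUNT(*) with a criterion: number of rows whose city is 'NYC'.
nyc_count = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE city = 'NYC'"
).fetchone()[0]

# SELECT names the columns, FROM names the table, LIMIT caps the rows returned.
first_two = conn.execute(
    "SELECT id, city FROM trips ORDER BY id LIMIT 2"
).fetchall()

conn.close()
```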
Random Over Sampling (code)
ros = RandomOverSampler(sampling_strategy='not majority') creates a RandomOverSampler instance that resamples every class except the majority class.
Filling Missing Values
Use df['column'] = df['column'].fillna(mean_value) to replace NaN values in a column with its calculated mean; fillna returns a new Series, so the result must be assigned back.
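A small sketch of mean imputation on hypothetical data. `Series.mean()` skips NaN by default, so the mean is computed from the observed values only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column": [10.0, np.nan, 30.0, np.nan]})

# Mean of the non-missing values: (10 + 30) / 2 = 20.0
mean_value = df["column"].mean()

# fillna returns a new Series, so assign it back to the column.
df["column"] = df["column"].fillna(mean_value)
filled = df["column"].tolist()
```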
Method Chaining
A technique to apply multiple methods sequentially in a single statement.
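One way method chaining might look in pandas, on hypothetical data. Each method returns a new DataFrame, so the next call can be attached directly; wrapping the chain in parentheses lets it span several lines.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["NYC", "NYC", "Boston", None], "fare": [12.0, 30.0, 8.0, 5.0]}
)

result = (
    df.dropna(subset=["city"])              # drop rows with a missing city
    .loc[lambda d: d["fare"] > 10]          # keep fares above 10
    .sort_values("fare", ascending=False)   # highest fare first
    .reset_index(drop=True)                 # renumber rows from 0
)

fares = result["fare"].tolist()
```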
Taxi Duration Calculation
df['duration'] = (df['dropofftime'] - df['pickuptime']).dt.total_seconds() / 60 converts timedelta to minutes.
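The duration calculation can be checked on two hypothetical trips. The timestamps are invented; the column names follow the card.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "pickuptime": pd.to_datetime(["2024-01-01 08:00:00", "2024-01-01 09:15:00"]),
        "dropofftime": pd.to_datetime(["2024-01-01 08:30:00", "2024-01-01 10:00:00"]),
    }
)

# Subtracting datetimes gives timedeltas; total_seconds() / 60 gives minutes.
df["duration"] = (df["dropofftime"] - df["pickuptime"]).dt.total_seconds() / 60
durations = df["duration"].tolist()
```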
Indexing with pd.DataFrame()
Creating a DataFrame using a dictionary where keys are column names and values are lists of entries.
pd.isna()
Function that checks for missing values, returning a boolean for a scalar input or a boolean array, Series, or DataFrame of the same shape as the input.
df.sample()
DataFrame method that returns a random sample of rows from the DataFrame.
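A brief sketch of both calls on hypothetical data. Sampling is a method called on the DataFrame itself (`df.sample()`); `random_state` is set only to make the draw repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy", "Di"], "salary": [50000.0, np.nan, 62000.0, np.nan]}
)

# Boolean mask: True where the salary is missing.
missing_mask = pd.isna(df["salary"])
n_missing = int(missing_mask.sum())

# Two randomly chosen rows.
sampled = df.sample(n=2, random_state=0)
```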
df.loc[]
Indexer that retrieves data by label for rows and columns; label slices include both endpoints.
df.iloc[]
Indexer that retrieves data by integer position; position slices exclude the end position.
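The difference between the two is easiest to see with a non-numeric index. A minimal sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "salary": [50, 60, 70]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "salary"]  # row label 'b', column label 'salary'
by_position = df.iloc[1, 1]       # second row, second column

label_slice = df.loc["a":"b"]     # label slices include the end label
position_slice = df.iloc[0:1]     # position slices exclude the end position
```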
Data Leakage
Occurs when information from the test dataset (or information that would not be available at prediction time) is used during training, leading to overly optimistic performance estimates.
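A common source of leakage is computing preprocessing statistics on the full dataset. The sketch below, on hypothetical numbers with a simple positional split, contrasts a mean taken from training rows only with one that also sees the test row.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

# Hypothetical split: first four rows for training, last row held out for testing.
train, test = values.iloc[:4], values.iloc[4:]

# Leak-free: scaling statistics come from the training rows only.
train_mean, train_std = train.mean(), train.std()
test_scaled = (test - train_mean) / train_std

# Leaky: the mean is computed over all rows, so it already "knows" the test value.
leaky_mean = values.mean()
```

Here the leak-free mean is 2.5 while the leaky mean is 22.0, showing how a single test value can shift what the training step sees.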
Feature Engineering
The process of using domain knowledge to select, transform, and create features from raw data.
Feature Selection
Choosing the most relevant features for modeling based on their relationships with the target variable.
Correlation Coefficient
A numerical measure of the strength and direction of a relationship between two variables.
Heatmap
A graphical representation of data where individual values are represented as colors.
Outlier Detection in Box Plots
Identifying data points that fall outside the whiskers of a box plot.
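Box plot whiskers are commonly drawn at 1.5 times the interquartile range (IQR) beyond the quartiles. Assuming that convention, which is a plotting default rather than something stated on the card, the same points can be flagged numerically:

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 100])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed whisker limits: 1.5 * IQR beyond the first and third quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)].tolist()
```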
Histogram
A graphical representation showing the distribution of a continuous variable.
Data Quality Inspection
The process of evaluating accuracy, completeness, and consistency of data.
Count Plot
A categorical plot depicting the counts of occurrences for each category in a dataset.
SQL processing order
The logical sequence in which the clauses of a SQL query are processed: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
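The order can be observed in a small query run through Python's built-in `sqlite3` module. The `trips` table is hypothetical. WHERE removes rows before grouping, and ORDER BY can use the `n_trips` alias because it is evaluated after SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("NYC", 12.0), ("NYC", 30.0), ("Boston", 8.0), ("Boston", 20.0), ("Chicago", 15.0)],
)

query = """
SELECT city, COUNT(*) AS n_trips   -- output columns, computed per group
FROM trips                         -- source table, read first
WHERE fare > 10                    -- row filter, applied before grouping
GROUP BY city                      -- groups the rows that survived WHERE
ORDER BY n_trips DESC, city        -- sorts the grouped result
LIMIT 2                            -- caps the sorted result last
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The Boston trip with fare 8.0 is filtered out before counting, so Boston ends up with one trip rather than two.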
pd.to_numeric()
Function that converts its argument to a numeric type; passing errors='coerce' turns unparseable values into NaN, while the default raises an error.
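A minimal sketch of the coercion behaviour on hypothetical salary strings:

```python
import pandas as pd

raw = pd.Series(["1000", "2500.5", "n/a", "42"])

# errors='coerce' replaces values that cannot be parsed with NaN.
numeric = pd.to_numeric(raw, errors="coerce")
n_missing = int(numeric.isna().sum())
```

Counting the NaN values afterwards is a quick way to see how many entries failed to convert.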
sns.heatmap() usage
A function to visualize a correlation matrix in a heatmap format with optional annotations.
DataFrame creation syntax
Creating a DataFrame involves specifying column names and corresponding values in a dictionary format.
Binning Strategies
Method of converting continuous variables into categorical variables through intervals.
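One possible binning strategy uses `pd.cut` with explicit bin edges. The edges and labels below are hypothetical. By default each interval includes its right edge, so 18 would fall in the first bin.

```python
import pandas as pd

ages = pd.Series([5, 17, 30, 64, 80])

# Hypothetical intervals: (0, 18], (18, 65], (65, 120]
bins = [0, 18, 65, 120]
labels = ["child", "adult", "senior"]

age_group = pd.cut(ages, bins=bins, labels=labels)
groups = age_group.tolist()
```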
Statistical Measures in Box Plots
Visual indicators of median, quartiles, and overall data spread.
Data Distribution Visualization
Analyzing how values of a variable are spread across a range.
Feature Engineering Importance
Improves a machine learning model's performance by enhancing the quality of the input data.
Data Entry Errors
Mistakes made during data input that can compromise dataset quality.
Data Transformation Techniques
Methods used to alter the format, structure, or values of data.