Recoding Data
The process of transforming data from one form to another.
When to recode numerical data
When converting numerical data into categories for easier analysis.
Example of recoding numerical data
Grouping incomes into ranges, such as placing 56,000 in the 50,000 - 100,000 range.
Income range for 56,000
50,000 - 100,000.
Standardizing city names
To ensure there are no misspellings and that data is consistent.
Advantage of Power BI for recoding data
Allows for easy graphical creation of conditional columns to recode data.
Example of recoding categorical data
Standardizing how country names are entered (e.g., PHL to PH).
Main concept of recoding numerical to categorical
To classify numerical data into meaningful categories for analysis.
Creating an additional column when recoding
To preserve the original data while showing the recoded version.
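A minimal sketch of recoding a numerical value into a category while keeping the original column. Only the 50,000 - 100,000 range comes from the cards above; the other bin labels, the sample rows, and the name `income_range` are assumptions made for illustration.

```python
def income_range(income):
    """Recode a numeric income into a labeled range (bin edges are illustrative)."""
    if income < 50_000:
        return "under 50,000"
    if income <= 100_000:
        return "50,000 - 100,000"
    return "over 100,000"

# Hypothetical sample rows.
rows = [
    {"name": "A", "income": 56_000},
    {"name": "B", "income": 32_000},
]

# Add a new column instead of overwriting, so the original value is preserved.
recoded = [{**row, "income_range": income_range(row["income"])} for row in rows]
```

Each row in `recoded` still carries the raw `income` next to the new `income_range` label.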
Derived variable
A new data point created from existing data through a calculation.
Difference between derived variable and recoded data
Derived variables create new data; recoded data reorganizes existing data.
Average lessons per day calculation
Divide the number of lessons viewed by the number of days between the start date and the end date.
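The calculation can be sketched in Python, where subtracting two dates plays the role of a date diff function. The function name and the sample dates are assumptions for illustration, not part of the course material.

```python
from datetime import date

def average_lessons_per_day(lessons_viewed, start_date, end_date):
    """Derived variable: lessons viewed divided by the days elapsed."""
    days = (end_date - start_date).days  # date difference in whole days
    if days <= 0:
        raise ValueError("end_date must be after start_date")
    return lessons_viewed / days

# Hypothetical learner: 30 lessons over the 10 days from Jan 1 to Jan 11.
avg = average_lessons_per_day(30, date(2024, 1, 1), date(2024, 1, 11))
```

Guarding against a zero-day span avoids a division by zero when the start and end dates are the same.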
Purpose of a derived variable
To create a new data point from existing data, often based on a calculation.
Storing the formula for derived variables
More space-efficient, as the derived values can be recomputed when needed.
When to store derived variable directly in a dataset
If speed matters and frequent recomputation of the variable is inefficient.
Advantage of using derived variables
They create new insights from existing data through calculations.
Date diff function usage in derived variable creation
To calculate the difference between two dates in a specified unit (for example, days).
Use of derived variables in online course website
To compute time metrics like lessons per day or total time spent in the course.
Goal of data imputation
To estimate missing data based on existing information.
Imputed rating calculation in movie ratings
Using the average rating of the user for other movies.
Reason for keeping original and imputed ratings separate
To ensure that the imputed values do not overwrite the original data.
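One way to keep original and imputed ratings apart is to write the estimate into its own column and flag it. This is a sketch under assumptions: the sample data, the column names `imputed_rating` and `is_imputed`, and the helper `user_averages` are all hypothetical.

```python
# Hypothetical ratings; None marks a missing rating.
ratings = [
    {"user": "ana", "movie": "M1", "rating": 4.0},
    {"user": "ana", "movie": "M2", "rating": 2.0},
    {"user": "ana", "movie": "M3", "rating": None},
    {"user": "ben", "movie": "M1", "rating": 5.0},
]

def user_averages(rows):
    """Average each user's known ratings, skipping missing ones."""
    totals = {}
    for row in rows:
        if row["rating"] is not None:
            total, count = totals.get(row["user"], (0.0, 0))
            totals[row["user"]] = (total + row["rating"], count + 1)
    return {user: total / count for user, (total, count) in totals.items()}

averages = user_averages(ratings)

# The original "rating" column is left untouched; the estimate goes in a new
# column, and a flag records which values were imputed.
imputed = [
    {
        **row,
        "imputed_rating": row["rating"] if row["rating"] is not None
        else averages.get(row["user"]),
        "is_imputed": row["rating"] is None,
    }
    for row in ratings
]
```

The `is_imputed` flag also supports transparency, since anyone reading the data can tell estimated values from original ones.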
SQL command for user's average rating
SELECT username, AVG(rating), grouped with GROUP BY username.
Important consideration for data imputation
The imputation method should not alter the data's original context.
Sophisticated method of imputing missing values
Using machine learning algorithms, such as artificial neural networks.
Consideration of movie's genre in imputation
To provide a more accurate estimate of the missing value.
Benefit of using GROUP BY command in data imputation
Helps create a lookup table with users and their average ratings.
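The lookup table can be sketched with Python's built-in `sqlite3` module. The table name `ratings`, its columns, and the sample rows are assumptions for illustration; SQL's `AVG` skips NULL values, so missing ratings do not distort the average.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (username TEXT, movie TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        ("ana", "M1", 4.0),
        ("ana", "M2", 2.0),
        ("ana", "M3", None),  # missing rating, stored as NULL
        ("ben", "M1", 5.0),
    ],
)

# One row per user: the username and the average of that user's known ratings.
lookup = dict(
    conn.execute("SELECT username, AVG(rating) FROM ratings GROUP BY username")
)
conn.close()
```

`lookup` maps each username to an average that can then be used to fill in that user's missing ratings.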
Importance of transparency in data imputation
To ensure that others understand the data is estimated, not original.
Simplest method for imputing missing numerical data
Replacing the missing value with an average of known values, such as the user's average rating for other movies.
Purpose of data reduction
To reduce the size of the data while retaining useful information.
Example of aggregation in data reduction
Summing data points for a zip code and representing it as one value.
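A small sketch of this kind of aggregation, assuming hypothetical `(zip_code, amount)` records:

```python
from collections import defaultdict

# Hypothetical individual data points: (zip_code, amount).
sales = [("10001", 120.0), ("10001", 80.0), ("94105", 50.0)]

# Sum the data points per zip code so each zip code becomes a single value.
totals = defaultdict(float)
for zip_code, amount in sales:
    totals[zip_code] += amount

aggregated = dict(totals)
```

Three input records shrink to two aggregated values, one per zip code.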
Why use aggregation to reduce data set
To simplify data for analysis without losing key information.
Main goal of sampling data in reduction
To reduce data size while maintaining accuracy in the results.
Potential issue with simple random sampling
It may introduce bias or lead to under-sampling of certain groups.
Difference between simple random sampling and stratified sampling
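Stratified sampling divides the data into subgroups and samples from each one; simple random sampling draws from the whole data set without regard to subgroups.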
Importance of breaking data into subgroups in stratified sampling
To ensure the sample accurately represents all demographic groups.
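A sketch of stratified sampling using only the standard library. The function name, the `group` field, the fixed seed, and the rule of taking at least one row per subgroup are assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every subgroup (at least one row each)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 90 rows in group A and 10 rows in group B.
population = [{"id": i, "group": "A"} for i in range(90)]
population += [{"id": i, "group": "B"} for i in range(90, 100)]

sample = stratified_sample(population, "group", 0.1)
```

A simple random sample of 10 rows from this population could easily contain no rows from group B; sampling within each subgroup guarantees that both groups appear.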
Advantage of using aggregated data
Reduces data size, making it easier to handle while remaining useful.
Situation to use stratified sampling
To ensure every subset of the population is represented in the sample.
Benefit of random sampling in data reduction
It reduces the data size while keeping it representative of the entire set.
Purpose of data masking
To protect sensitive information from being exposed.
Example of personally identifiable information
Social security number.
How to mask sensitive information while joining datasets
Using an anonymous identifier or index field.
Anonymous identifiers for replacing usernames
Row number from the list of unique usernames.
Importance of removing PII in datasets
To ensure compliance with data protection laws.
Issue with anonymizing data using unique identifiers
It might be possible to reconstruct someone's identity based on patterns.
Simple method for creating an anonymous identifier
Assigning a unique index number to each individual.
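A minimal sketch of replacing usernames with an index built from the list of unique usernames. The sample data and the function name are hypothetical, and numbering the names in sorted order is an assumption; because a sorted order is predictable, a real masking scheme might shuffle the names first.

```python
def build_anonymous_ids(usernames):
    """Map each unique username to its row number in the sorted unique list."""
    unique = sorted(set(usernames))
    return {name: index for index, name in enumerate(unique, start=1)}

# Hypothetical ratings: (username, movie, rating).
ratings = [("ana", "M1", 4.0), ("ben", "M1", 5.0), ("ana", "M2", 2.0)]

ids = build_anonymous_ids(user for user, _, _ in ratings)

# The username is replaced by the index, which still lets rows be joined.
masked = [(ids[user], movie, rating) for user, movie, rating in ratings]
```

The same user always maps to the same identifier, so the masked data can still be joined with other datasets that use the same index.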
Reason for replacing sensitive information with anonymous identifiers
To ensure privacy and protect identities.
What does it mean to transpose data
Swapping the rows and columns of a table. The related unpivot operation breaks pivoted data down into individual records.
Difference between pivoted and unpivoted data
Pivoted data is compact for reporting; unpivoted has individual records.
Primary function of the ‘unpivot’ command
To break down pivoted data into individual records.
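An unpivot can be sketched in plain Python as follows. The sample table, the function signature, and the choice to skip empty cells are assumptions for illustration.

```python
# Hypothetical pivoted table: one row per user, one column per movie.
pivoted = [
    {"user": "ana", "M1": 4.0, "M2": 2.0},
    {"user": "ben", "M1": 5.0, "M2": None},
]

def unpivot(rows, id_column, var_name="movie", value_name="rating"):
    """Turn each non-empty cell into its own record."""
    records = []
    for row in rows:
        for column, value in row.items():
            if column != id_column and value is not None:
                records.append(
                    {id_column: row[id_column], var_name: column, value_name: value}
                )
    return records

unpivoted = unpivot(pivoted, "user")
```

The compact two-row table becomes three individual records, one per user and movie with a known rating.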
Truth about pivoted data
It is compact and useful for reporting purposes.
Type of data used in pivoted format
Data that needs to be summarized or reported in a compact form.
What does appending data refer to?
Combining data from one dataset to another.
Difference between inline append and intermediate append
An inline append discards the original datasets, while an intermediate append keeps them.
Feature allowing appending data from one table to another
Append queries.
Result of an inline append operation
A new dataset is created, and all original datasets are discarded.
What does an intermediate append do?
Combines datasets into a new table while keeping the original datasets.
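As a loose analogy only, the two kinds of append resemble extending a Python list in place versus building a new list. The variable names and data below are hypothetical, and this is not how any particular tool implements the feature.

```python
# Hypothetical monthly datasets.
jan = [{"month": "Jan", "sales": 10}]
feb = [{"month": "Feb", "sales": 20}]

# Intermediate append: a new table is built and both originals stay available.
combined = jan + feb

# Inline append: the first dataset itself is modified, so its original
# form is no longer available afterwards.
q1 = [{"month": "Jan", "sales": 10}]
q1.extend(feb)
```

After these steps, `jan` and `feb` are unchanged alongside `combined`, while `q1` only exists in its combined form.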
Example of appending data
Merging two datasets into one.
Difference between appending data into a new table and copying data
Appending combines rows from multiple datasets into an entirely new dataset, whereas copying only duplicates an existing one.
What happens to original datasets during an inline append?
The original datasets are discarded, and only the combined dataset is kept.