(Recoding Data)
What does the process of recoding data involve? – transforming data from one form to another
When might you need to recode numerical data? – when you want to convert numerical data into categories for easier analysis
What is an example of recoding numerical data? – grouping incomes into ranges such as 50,000 - 100,000
Which income range would a person with an income of 56,000 fall into? – 50,000 - 100,000
When recoding categorical data, why might you want to standardize city names? – to ensure there are no misspellings and that data is consistent
What is the main advantage of using a tool like Power BI for recoding data? – it allows for easy graphical creation of conditional columns to recode data
What is an example of recoding categorical data when dealing with inconsistent data entries? – standardizing how country names are entered (e.g., PHL to PH)
What is the main concept behind recoding numerical data to categorical data? – to classify numerical data into meaningful categories for analysis
Why would you create an additional column when recoding data? – to preserve the original data while showing the recoded version
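Note: a minimal Python sketch of the recoding ideas above, not a definitive implementation. Only the 50,000 - 100,000 range and the PHL to PH mapping come from these notes; the other bin edges, the function name income_range, and the column names are assumptions made for illustration.

```python
def income_range(income):
    """Recode a numeric income into a categorical range label."""
    # Only the 50,000 - 100,000 range comes from the notes; the other
    # two labels are hypothetical.
    if income < 50_000:
        return "under 50,000"
    if income <= 100_000:
        return "50,000 - 100,000"
    return "over 100,000"


# Hypothetical lookup table for standardizing categorical entries.
COUNTRY_CODES = {"PHL": "PH"}

rows = [
    {"income": 56_000, "country": "PHL"},
    {"income": 32_000, "country": "PH"},
]

for row in rows:
    # Write to new columns so the original values are preserved.
    row["income_range"] = income_range(row["income"])
    row["country_std"] = COUNTRY_CODES.get(row["country"], row["country"])
```

Each row keeps its original income and country values next to the recoded columns, which mirrors the reason given for creating an additional column.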
(Derived Variables)
What is a derived variable? – a new data point created from existing data through a calculation
How does a derived variable differ from recoded data? – derived variables create new data; recoded data reorganizes existing data
How is the derived variable “average lessons per day” calculated? – divide lessons viewed by the number of days between the start date and the end date
Which of the following best describes the purpose of a derived variable? – to create a new data point from existing data, often based on a calculation
Why might you choose to store the formula for a derived variable instead of storing the derived values themselves? – more space-efficient, as the derived values can be recomputed when needed
When would it make sense to store the derived variable directly in a dataset, rather than just storing the formula? – if speed matters and frequent recomputation of the variable is inefficient
What is the main advantage of using derived variables in data analysis? – they create new insights from existing data through calculations
How is a date-diff function typically used in the creation of a derived variable? – to calculate the difference between two dates in a specified unit (e.g., days)
What might be a reason to use derived variables in the context of an online course website? – compute time metrics like lessons per day or total time spent in the course
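Note: the "average lessons per day" calculation can be sketched in Python as below. The function name and the sample numbers are assumptions; the date subtraction stands in for a date-diff function, and the zero-day guard is an added safeguard that the notes do not mention.

```python
from datetime import date


def average_lessons_per_day(lessons_viewed, start_date, end_date):
    """Derived variable: lessons viewed divided by the days elapsed."""
    # Subtracting two dates gives a timedelta; .days is the date difference.
    days = (end_date - start_date).days
    if days <= 0:
        # Assumption: report no value rather than divide by zero.
        return None
    return lessons_viewed / days


# Hypothetical learner: 30 lessons viewed over a 10-day span.
avg = average_lessons_per_day(30, date(2024, 1, 1), date(2024, 1, 11))
```

Because the value is recomputed from the stored inputs each time, this corresponds to storing the formula; saving avg in its own column would correspond to storing the derived value.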
(Value Imputation)
What is the main goal of data imputation? – to estimate missing data based on existing information
In the movie ratings scenario, how could the imputed rating be calculated? – by using the average rating of the user for other movies
What is the key reason for keeping the original and imputed ratings in separate columns? – to ensure that the imputed values do not overwrite the original data
What SQL command can be used to calculate the user’s average rating? – GROUP BY username and AVG(rating)
What is an important consideration when performing data imputation? – the imputation method should not alter the data’s original context
Which of the following is a more sophisticated method of imputing missing values? – using machine learning algorithms or artificial neural networks
In the context of imputation, why might it be useful to consider a movie’s genre or the date it was rated? – to provide a more accurate estimate of the missing value
What is the benefit of using the GROUP BY command in the context of data imputation? – it helps create a lookup table with users and their average ratings
Why is transparency important when performing data imputation? – to ensure that others understand the data is estimated, not original
What is one of the simplest methods of imputing missing numerical data in the movie ratings example? – by using the average rating of the user for other movies
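Note: a minimal Python sketch of the simplest imputation method above, using each user's average rating. The sample data and the column names imputed_rating and is_imputed are assumptions; the lookup table plays the role of GROUP BY username with AVG(rating).

```python
from collections import defaultdict

# Hypothetical ratings; None marks a missing rating.
ratings = [
    {"username": "ana", "movie": "A", "rating": 4},
    {"username": "ana", "movie": "B", "rating": 2},
    {"username": "ana", "movie": "C", "rating": None},
    {"username": "ben", "movie": "A", "rating": 5},
    {"username": "ben", "movie": "C", "rating": None},
]

# Build a lookup table of per-user averages (sum and count per user).
totals = defaultdict(lambda: [0, 0])
for r in ratings:
    if r["rating"] is not None:
        totals[r["username"]][0] += r["rating"]
        totals[r["username"]][1] += 1
user_avg = {user: s / n for user, (s, n) in totals.items()}

for r in ratings:
    missing = r["rating"] is None
    # Keep the original column untouched; put the estimate in a new column.
    r["imputed_rating"] = user_avg.get(r["username"]) if missing else r["rating"]
    # Flag estimated values so readers know they are not original data.
    r["is_imputed"] = missing
```

The separate imputed_rating and is_imputed columns reflect the two points made above: imputed values should not overwrite the original data, and it should be transparent which values are estimates.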
(Aggregation and Reduction)
What is the primary purpose of data reduction? – to reduce the size of the data while retaining useful information
Which of the following is an example of aggregation as a method of data reduction? – summing data points for a zip code and representing it as one value
Why might someone use aggregation to reduce a data set? – to simplify data for analysis without losing key information
What is the main goal of sampling data in the context of reduction? – to reduce data size while maintaining accuracy in the results
What is one potential issue with simple random sampling? – it may introduce bias or lead to under-sampling of certain groups
What is the key difference between simple random sampling and stratified sampling? – stratified uses subgroups while random sampling does not
When using stratified sampling, why is it important to break the data into subgroups? – to ensure the sample accurately represents all demographic groups
What is the advantage of using aggregated data, such as census data by zip code or demographic information? – it reduces the data size, making it easier to handle while remaining useful
In what situation might you use stratified sampling instead of simple random sampling? – to ensure every subset of the population is represented in the sample
What is one benefit of using random sampling in data reduction? – it reduces the data size while keeping it representative of the entire set
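Note: a rough Python sketch of two reduction methods from this section, aggregation by zip code and stratified sampling. The function names, the sample rows, and the 10% sampling fraction are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def aggregate_by_key(rows, key, value):
    """Aggregation: sum `value` per distinct `key`, one total per group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every subgroup (at least one row each)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: a large group A and a small group B.
rows = [{"zip": "10001", "group": "A", "count": 1} for _ in range(90)]
rows += [{"zip": "10002", "group": "B", "count": 1} for _ in range(10)]

totals = aggregate_by_key(rows, "zip", "count")
sample = stratified_sample(rows, "group", 0.1)
```

A simple random sample of 10 rows could easily miss group B altogether, whereas the stratified version always includes at least one row from each subgroup.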
(Data Masking)
What is the main purpose of data masking in data analysis? – to protect sensitive information from being exposed
Which of the following is an example of personally identifiable information? – social security number
How can you mask sensitive information while joining a data set based on unique identifiers? – by using an anonymous identifier or index field
What could be used as anonymous identifiers to replace user names? – row number from the list of unique usernames
Why is it important to remove personally identifiable information (PII) in data sets? – to ensure compliance with data protection laws
What is the potential issue with anonymizing data using unique identifiers when tracking people's movements? – it might be possible to reconstruct someone's identity based on patterns
Which of the following is a simple method for creating an anonymous identifier for sensitive data? – assigning a unique index number to each individual
What is the primary reason for replacing sensitive information like user names with anonymous identifiers in reports? – to ensure privacy and protect identities
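Note: one way to sketch the anonymous-identifier idea in Python is shown below. The helper name build_anonymous_ids and the sample records are assumptions. As noted above, an index alone may not prevent re-identification when behavior patterns are distinctive.

```python
def build_anonymous_ids(usernames):
    """Map each unique username to its row number in the unique list."""
    unique = sorted(set(usernames))
    return {name: index for index, name in enumerate(unique, start=1)}


# Hypothetical records containing a sensitive username field.
records = [
    {"username": "ana", "rating": 4},
    {"username": "ben", "rating": 5},
    {"username": "ana", "rating": 2},
]

ids = build_anonymous_ids(r["username"] for r in records)

# Replace the username with the index; the same person keeps the same id,
# so the masked data can still be joined or grouped by user_id.
masked = [{"user_id": ids[r["username"]], "rating": r["rating"]} for r in records]
```

The ids mapping itself links names to identifiers, so it would need to be stored separately from any shared report.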
(Transposing Data)
What does it mean to transpose data in the context of data mining? – breaking down pivoted data into individual records
What is the difference between pivoted and unpivoted data? – pivoted data is compact for reporting; unpivoted has individual records
What is the primary function of the ‘unpivot’ command in many data tools like MS Excel? – to break down pivoted data into individual records
Which of the following is true about pivoted data? – it is compact and useful for reporting purposes
What type of data is typically used in the pivoted format? – data that needs to be summarized or reported in a compact form
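Note: the unpivot operation described above can be illustrated with a small Python function. This is a sketch under assumed names (unpivot, attribute, value) and hypothetical sample data, not how any particular tool implements it.

```python
def unpivot(rows, id_column, variable_name="attribute", value_name="value"):
    """Turn each non-id column of a pivoted row into its own record."""
    records = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                records.append(
                    {id_column: row[id_column], variable_name: column, value_name: value}
                )
    return records


# Hypothetical pivoted report: one row per region, one column per quarter.
pivoted = [
    {"region": "North", "Q1": 10, "Q2": 12},
    {"region": "South", "Q1": 7, "Q2": 9},
]

long_rows = unpivot(pivoted, "region")
```

The two compact report rows become four individual records, one per region and quarter.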
(Appending Data)
What does appending data refer to? – combining data from one dataset to another
What is the difference between an inline append and an intermediate append? – an inline append discards the original datasets, while an intermediate append keeps them
In Power Query, what feature allows you to append data from one table to another? – Append Queries
What is the result of an inline append operation? – a new dataset is created, and all original datasets are discarded
What does an intermediate append do in contrast to an inline append? – it combines datasets into a new table while keeping the original datasets
Which of the following is an example of appending data? – merging two datasets into one
What is a key difference between appending data into a new table and just copying the data? – appending data into a new table creates an entirely new dataset
When performing an inline append, what happens to the original datasets? – the original datasets are discarded, and only the combined dataset is kept
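Note: a loose Python analogy for the two append styles, using lists of rows as tables. The function names and sample data are assumptions, and the analogy is only approximate: here the inline version overwrites the first table with the combined rows, while the intermediate version builds a new table and leaves both inputs as they were.

```python
def append_inline(target, other):
    """Inline-style append: the target table itself becomes the combined data."""
    target.extend(other)
    return target


def append_as_new(first, second):
    """Intermediate-style append: build a new table, keep both originals."""
    return list(first) + list(second)


# Hypothetical monthly tables with the same columns.
jan = [{"month": "Jan", "sales": 10}]
feb = [{"month": "Feb", "sales": 12}]

combined = append_as_new(jan, feb)  # jan and feb are unchanged here
append_inline(jan, feb)             # jan now holds the combined rows
```

After the last line, the January-only version of jan no longer exists on its own, which is the trade-off the inline questions above describe.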