(Recoding Data)
What does the process of recoding data involve? – transforming data from one form to another
When might you need to recode numerical data? – when you want to convert numerical data into categories for easier analysis
What is an example of recoding categorical data? – changing people's names to a consistent format
Which income range would a person with an income of 56,000 fall into? – 50,000 - 100,000
When recoding categorical data, why might you want to standardize city names? – to ensure there are no misspellings and that the data is consistent
What is the main advantage of using a tool like Power BI for recoding data? – it allows for easy graphical creation of conditional columns to recode data
What is an example of recoding categorical data when dealing with inconsistent data entries? – standardizing how country names are entered (e.g., PHL to PH)
What is the main concept behind recoding numerical data to categorical data? – to classify numerical data into meaningful categories for analysis
Why would you create an additional column when recoding data? – to preserve the original data while showing the recoded version
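Worked example (a minimal sketch, not taken from the course): the cards above describe recoding a number into a range, standardizing inconsistent category entries, and keeping the original column. Only the 50,000 - 100,000 band and the PHL to PH mapping come from the cards; the other bands, the lookup entries, and the column names are assumptions made for illustration.

```python
def recode_income(income):
    """Map a numeric income to a labelled range (numerical -> categorical)."""
    # Assumed bands: only "50,000 - 100,000" appears in the cards.
    if income < 50_000:
        return "under 50,000"
    if income <= 100_000:
        return "50,000 - 100,000"
    return "over 100,000"


# Hypothetical lookup from inconsistent country entries to one standard code.
COUNTRY_CODES = {"PHL": "PH", "Philippines": "PH", "ph": "PH"}


def recode_country(value):
    """Standardize a country entry; unknown values pass through unchanged."""
    cleaned = value.strip()
    return COUNTRY_CODES.get(cleaned, cleaned)


rows = [
    {"name": "A", "income": 56_000, "country": "PHL"},
    {"name": "B", "income": 32_000, "country": " Philippines "},
]

# Write the recoded values to new columns so the original data is preserved.
for row in rows:
    row["income_range"] = recode_income(row["income"])
    row["country_std"] = recode_country(row["country"])
```

In Power BI the same idea would typically be built as a conditional column rather than written by hand.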
(Derived Variables)
What is a derived variable? – a new data point created from existing data through a calculation
How does a derived variable differ from recoded data? – derived variables create new data; recoded data reorganizes existing data
How is the derived variable "average lessons per day" calculated? – divide lessons viewed by the number of days between the start date and the end date
Which of the following best describes the purpose of a derived variable? – to create a new data point from existing data, often based on a calculation
Why might you choose to store the formula for a derived variable instead of storing the derived values themselves? – it is more space-efficient, as the derived values can be recomputed when needed
When would it make sense to store the derived variable directly in a dataset, rather than just storing the formula? – if speed matters and frequent recomputation of the variable is inefficient
What is the main advantage of using derived variables in data analysis? – they create new insights from existing data through calculations
How is a date-difference function (e.g., DATEDIFF) typically used in the creation of a derived variable? – to calculate the difference between two dates in a specified unit (e.g., days)
What might be a reason to use derived variables in the context of an online course website? – to compute time metrics like lessons per day or total time spent in the course
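Worked example (a sketch under stated assumptions): the "average lessons per day" card can be expressed as a small function. The field names and the sample dates are hypothetical, and returning None when no full day has elapsed is an assumption, not something the course specifies.

```python
from datetime import date


def average_lessons_per_day(lessons_viewed, start, end):
    """Derived variable: lessons viewed divided by the days between start and end."""
    days = (end - start).days  # date difference expressed in days
    if days <= 0:
        return None  # assumption: undefined when no full day has elapsed
    return lessons_viewed / days


# Hypothetical course record.
record = {
    "lessons_viewed": 30,
    "start": date(2024, 1, 1),
    "end": date(2024, 1, 11),
}

# Store the derived value alongside the inputs it was computed from.
record["avg_lessons_per_day"] = average_lessons_per_day(
    record["lessons_viewed"], record["start"], record["end"]
)
```

Storing only the function (the formula) saves space; storing `avg_lessons_per_day` in the record trades space for faster reads.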
(Value Imputation)
What is the main goal of data imputation? – to estimate missing data based on existing information
In the movie ratings scenario, how could the imputed rating be calculated? – by using the average rating of the user for other movies
What is the key reason for keeping the original and imputed ratings in separate columns? – to ensure that the imputed values do not overwrite the original data
What SQL command can be used to calculate the user's average rating? – GROUP BY username and AVG(rating)
What is an important consideration when performing data imputation? – the imputation method should not alter the data's original context
Which of the following is a more sophisticated method of imputing missing values? – using machine learning algorithms or artificial neural networks
In the context of imputation, why might it be useful to consider a movie's genre or the date it was rated? – to provide a more accurate estimate of the missing value
What is the benefit of using the GROUP BY command in the context of data imputation? – it helps create a lookup table with users and their average ratings
Why is transparency important when performing data imputation? – to ensure that others understand the data is estimated, not original
What is one of the simplest methods of imputing missing numerical data in the movie ratings example? – by using the average rating of the user for other movies
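Worked example (a minimal sketch, assuming a hypothetical `ratings` table with `username`, `movie` and `rating` columns): the cards above describe building a lookup of per-user averages with GROUP BY and AVG, then keeping original and imputed ratings in separate columns. Python's built-in sqlite3 module is used here only as a convenient way to run the SQL; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (username TEXT, movie TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        ("ana", "Movie A", 4.0),
        ("ana", "Movie B", 2.0),
        ("ana", "Movie C", None),  # missing rating
        ("ben", "Movie A", 5.0),
        ("ben", "Movie C", None),  # missing rating
    ],
)

# Lookup table: one average per user. AVG skips NULL ratings.
averages = dict(
    conn.execute("SELECT username, AVG(rating) FROM ratings GROUP BY username")
)

# Keep the original and the imputed rating in separate fields, plus a flag,
# so readers can tell estimated values from observed ones.
result = []
query = "SELECT username, movie, rating FROM ratings ORDER BY username, movie"
for username, movie, rating in conn.execute(query):
    result.append(
        {
            "username": username,
            "movie": movie,
            "rating": rating,
            "imputed_rating": rating if rating is not None else averages[username],
            "is_imputed": rating is None,
        }
    )
conn.close()
```

The `is_imputed` flag is one way to provide the transparency the cards call for; a more refined estimate could also take genre or rating date into account.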
(Aggregation and Reduction)
What is the primary purpose of data reduction? – to reduce the size of the data while retaining useful information
Which of the following is an example of aggregation as a method of data reduction? – summing data points for a zip code and representing them as one value
Why might someone use aggregation to reduce a data set? – to simplify data for analysis without losing key information
What is the main goal of sampling data in the context of reduction? – to reduce data size while maintaining accuracy in the results
What is one potential issue with simple random sampling? – it may introduce bias or lead to under-sampling of certain groups
What is the key difference between simple random sampling and stratified sampling? – stratified sampling uses subgroups while simple random sampling does not
When using stratified sampling, why is it important to break the data into subgroups? – to ensure the sample accurately represents all demographic groups
What is the advantage of using aggregated data, such as census data by zip code or demographic information? – it reduces the data size, making it easier to handle while remaining useful
In what situation might you use stratified sampling instead of simple random sampling? – to ensure every subset of the population is represented in the sample
What is one benefit of using random sampling in data reduction? – it reduces the data size while keeping it representative of the entire set
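Worked example (a sketch with invented data): the cards above contrast aggregation, simple random sampling, and stratified sampling. The record layout, the 90/10 group split, and the helper names `aggregate_by` and `stratified_sample` are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical data: 90 records in group "A", 10 in group "B", two zip codes.
records = [
    {
        "zip": "10001" if i % 2 == 0 else "10002",
        "group": "A" if i < 90 else "B",
        "amount": 1,
    }
    for i in range(100)
]


def aggregate_by(rows, key, value):
    """Aggregation: collapse many rows into one summed value per key."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every subgroup (at least one row each)."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable
totals_by_zip = aggregate_by(records, "zip", "amount")
simple = rng.sample(records, 10)  # may contain no group "B" rows at all
stratified = stratified_sample(records, "group", 0.10, rng)
```

The simple random sample can under-represent the small group by chance, while the stratified sample always includes it in proportion.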
(Data Masking)
What is the main purpose of data masking in data analysis? – to protect sensitive information from being exposed
Which of the following is an example of personally identifiable information? – social security number
How can you mask sensitive information while joining a data set based on unique identifiers? – by using an anonymous identifier or index field
What could be used as anonymous identifiers to replace user names? – row number from the list of unique usernames
Why is it important to remove personally identifiable information (PII) in data sets? – to ensure compliance with data protection laws
What is the potential issue with anonymizing data using unique identifiers when tracking people's movements? – it might be possible to reconstruct someone's identity based on patterns
Which of the following is a simple method for creating an anonymous identifier for sensitive data? – assigning a unique index number to each individual
What is the primary reason for replacing sensitive information like user names with anonymous identifiers in reports? – to ensure privacy and protect identities
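Worked example (a minimal sketch, not a complete anonymization scheme): the cards above suggest replacing usernames with the row number from a list of unique usernames. The sample ratings and the helper name `build_anonymous_ids` are hypothetical.

```python
def build_anonymous_ids(usernames):
    """Give each unique username a row-number style identifier, starting at 1."""
    unique = sorted(set(usernames))
    return {name: index for index, name in enumerate(unique, start=1)}


# Hypothetical ratings keyed by username.
ratings = [
    ("ana", "Movie A", 4.0),
    ("ben", "Movie A", 5.0),
    ("ana", "Movie B", 2.0),
]

# The lookup links names to IDs, so it should be kept apart from shared reports.
id_lookup = build_anonymous_ids(name for name, _, _ in ratings)

# The masked data set can still be joined on user_id, but carries no usernames.
masked = [(id_lookup[name], movie, rating) for name, movie, rating in ratings]
```

As the card on tracking movements notes, an index alone does not guarantee anonymity: distinctive patterns in the remaining columns may still identify a person.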
(Transposing Data)
What does it mean to transpose data in the context of data mining? – breaking down pivoted data into individual records
What is the difference between pivoted and unpivoted data? – pivoted data is compact for reporting; unpivoted data has individual records
What is the primary function of the "unpivot" command in many data tools like Microsoft Excel? – to break down pivoted data into individual records
Which of the following is true about pivoted data? – it is compact and useful for reporting purposes
What type of data is typically used in the pivoted format? – data that needs to be summarized or reported in a compact form
What does appending data refer to? – combining data from one dataset with another
What is the difference between an inline append and an intermediate append? – an inline append discards the original datasets, while an intermediate append keeps them
In Power Query, what feature allows you to append data from one table to another? – Append Queries
What is the result of an inline append operation? – a new dataset is created, and all original datasets are discarded
What does an intermediate append do in contrast to an inline append? – it combines datasets into a new table while keeping the original datasets
Which of the following is an example of appending data? – merging two datasets into one
What is a key difference between appending data into a new table and just copying the data? – appending data into a new table creates an entirely new dataset
When performing an inline append, what happens to the original datasets? – the original datasets are discarded, and only the combined dataset is kept
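Worked example (a sketch with invented data): the cards above describe unpivoting a compact table into individual records and two styles of append. The `unpivot` helper, the table contents, and the mapping of list operations onto "intermediate" and "inline" appends are a loose analogy chosen for illustration, not how Power Query or Excel implement these commands.

```python
# Pivoted: one row per product, one column per month (hypothetical data).
pivoted = [
    {"product": "A", "Jan": 10, "Feb": 12},
    {"product": "B", "Jan": 7, "Feb": 9},
]


def unpivot(rows, id_column, var_name, value_name):
    """Turn each non-identifier column into its own record."""
    records = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                records.append(
                    {id_column: row[id_column], var_name: column, value_name: value}
                )
    return records


unpivoted = unpivot(pivoted, "product", "month", "sales")

table_a = [{"product": "A", "month": "Jan", "sales": 10}]
table_b = [{"product": "A", "month": "Feb", "sales": 12}]
table_c = [{"product": "B", "month": "Jan", "sales": 7}]

# Intermediate-style append: a new table is built, the inputs stay as they were.
combined = table_a + table_b

# Inline-style append: table_c itself is extended, so its earlier form is gone.
table_c.extend(table_b)
```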
[Data Manipulation: Recoding Data]