AR

[Data Manipulation: Recoding Data]

(Recoding Data)

What does the process of recoding data involve? – transforming data from one form to another

When might you need to recode numerical data? – when you want to convert numerical data into categories for easier analysis

What is an example of recoding numerical data? – grouping incomes into ranges such as 50,000 - 100,000

Which income range would a person with an income of 56,000 fall into? – 50,000 - 100,000
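
Sketch (Python): recoding a numerical income into a category. Only the 50,000 - 100,000 range comes from these notes. The other two labels, and the choice to treat 100,000 as the start of the next range, are assumptions for illustration.

```python
def recode_income(income):
    """Return the label of the income range a value falls into."""
    # Assumed boundaries: each range includes its lower bound only.
    if income < 50_000:
        return "under 50,000"
    if income < 100_000:
        return "50,000 - 100,000"
    return "100,000 and over"


print(recode_income(56_000))  # 50,000 - 100,000
```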

When recoding categorical data, why might you want to standardize city names? – to ensure there are no misspellings and that data is consistent

What is the main advantage of using a tool like Power BI for recoding data? – it allows for easy graphical creation of conditional columns to recode data

What is an example of recoding categorical data when dealing with inconsistent data entries? – standardizing how country names are entered (e.g., PHL to PH)

What is the main concept behind recoding numerical data to categorical data? – to classify numerical data into meaningful categories for analysis

Why would you create an additional column when recoding data? – to preserve the original data while showing the recoded version
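
Sketch (Python): recoding inconsistent country entries into an additional column so the original entry is preserved. The sample rows, the `country_recoded` column name, and the lookup table are hypothetical. Only the PHL to PH mapping comes from these notes.

```python
# Hypothetical lookup: three-letter entries mapped to two-letter codes.
COUNTRY_CODES = {"PHL": "PH", "USA": "US"}

rows = [
    {"name": "Ana", "country": "PHL"},
    {"name": "Ben", "country": "ph "},
    {"name": "Cat", "country": "USA"},
]

for row in rows:
    # Normalize spacing and case before looking the value up.
    raw = row["country"].strip().upper()
    # Write to a new column; the original "country" value is left untouched.
    row["country_recoded"] = COUNTRY_CODES.get(raw, raw)

print([r["country_recoded"] for r in rows])  # ['PH', 'PH', 'US']
```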

(Derived Variables)

What is a derived variable? – a new data point created from existing data through a calculation

How does a derived variable differ from recoded data? – derived variables create new data; recoded data reorganizes existing data

How is the derived variable “average lessons per day” calculated? – divide lessons viewed by the number of days between the start date and the end date
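
Sketch (Python): the "average lessons per day" calculation, with the date difference taken in days. The function name and the choice to return None when the dates are equal or reversed are assumptions, not part of the course material.

```python
from datetime import date


def average_lessons_per_day(lessons_viewed, start_date, end_date):
    """Derived variable: lessons viewed divided by days elapsed."""
    days = (end_date - start_date).days
    if days <= 0:
        # Assumption: treat same-day or reversed dates as undefined.
        return None
    return lessons_viewed / days


print(average_lessons_per_day(30, date(2024, 1, 1), date(2024, 1, 11)))  # 3.0
```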

Which of the following best describes the purpose of a derived variable? – to create a new data point from existing data, often based on a calculation

Why might you choose to store the formula for a derived variable instead of storing the derived values themselves? – more space-efficient, as the derived values can be recomputed when needed

When would it make sense to store the derived variable directly in a dataset, rather than just storing the formula? – if speed matters and frequent recomputation of the variable is inefficient

What is the main advantage of using derived variables in data analysis? – they create new insights from existing data through calculations

How is the date diff function typically used in the creation of a derived variable? – to calculate the difference between two dates in a specified unit (e.g., days)

What might be a reason to use derived variables in the context of an online course website? – compute time metrics like lessons per day or total time spent in the course

(Value Imputation)

What is the main goal of data imputation? – to estimate missing data based on existing information

In the movie ratings scenario, how could the imputed rating be calculated? – by using the average rating of the user for other movies

What is the key reason for keeping the original and imputed ratings in separate columns? – to ensure that the imputed values do not overwrite the original data
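
Sketch (Python): imputing a missing rating from the user's average for other movies while keeping the original column intact. The sample data and the column names `rating_imputed` and `is_imputed` are hypothetical. The flag column is one possible way to keep the estimate visible to others.

```python
from statistics import mean

ratings = [
    {"user": "ana", "movie": "A", "rating": 4},
    {"user": "ana", "movie": "B", "rating": 2},
    {"user": "ana", "movie": "C", "rating": None},  # missing value
    {"user": "ben", "movie": "A", "rating": 5},
]

# Collect each user's existing ratings, then average them.
by_user = {}
for r in ratings:
    if r["rating"] is not None:
        by_user.setdefault(r["user"], []).append(r["rating"])
user_avg = {user: mean(values) for user, values in by_user.items()}

for r in ratings:
    missing = r["rating"] is None
    r["is_imputed"] = missing  # marks estimated values
    # Original "rating" is never overwritten; the estimate goes in a new column.
    r["rating_imputed"] = user_avg.get(r["user"]) if missing else r["rating"]

print(ratings[2]["rating"], ratings[2]["rating_imputed"])  # None 3
```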

What SQL command can be used to calculate the user’s average rating? – GROUP BY username and AVG(rating)

What is an important consideration when performing data imputation? – the imputation method should not alter the data’s original context

Which of the following is a more sophisticated method of imputing missing values? – using machine learning algorithms or artificial neural networks

In the context of imputation, why might it be useful to consider a movie’s genre or the date it was rated? – to provide a more accurate estimate of the missing value

What is the benefit of using the GROUP BY command in the context of data imputation? – it helps create a lookup table with users and their average ratings
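
Sketch (Python with the built-in sqlite3 module): building the lookup table of users and their average ratings with GROUP BY and AVG. The table name, column names, and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (username TEXT, movie TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("ana", "A", 4), ("ana", "B", 2), ("ana", "C", None), ("ben", "A", 5)],
)

# AVG() ignores NULL ratings, so missing values do not affect the average.
lookup = dict(
    conn.execute(
        "SELECT username, AVG(rating) FROM ratings GROUP BY username"
    ).fetchall()
)
conn.close()

print(lookup["ana"])  # 3.0
```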

Why is transparency important when performing data imputation? – to ensure that others understand the data is estimated, not original

What is one of the simplest methods of imputing missing numerical data in the movie ratings example? – by using the average rating of the user for other movies

(Aggregation and Reduction)

What is the primary purpose of data reduction? – to reduce the size of the data while retaining useful information

Which of the following is an example of aggregation as a method of data reduction? – summing data points for a zip code and representing it as one value
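
Sketch (Python): aggregation as data reduction, summing individual records so that each zip code is represented by one value. The records and the `amount` field are hypothetical.

```python
from collections import defaultdict

records = [
    {"zip": "10001", "amount": 3},
    {"zip": "10001", "amount": 2},
    {"zip": "94105", "amount": 4},
]

# Sum all data points that share a zip code.
totals = defaultdict(int)
for rec in records:
    totals[rec["zip"]] += rec["amount"]
totals = dict(totals)

print(totals)  # {'10001': 5, '94105': 4}
```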

Why might someone use aggregation to reduce a data set? – to simplify data for analysis without losing key information

What is the main goal of sampling data in the context of reduction? – to reduce data size while maintaining accuracy in the results

What is one potential issue with simple random sampling? – it may introduce bias or lead to under-sampling of certain groups

What is the key difference between simple random sampling and stratified sampling? – stratified uses subgroups while random sampling does not

When using stratified sampling, why is it important to break the data into subgroups? – to ensure the sample accurately represents all demographic groups

What is the advantage of using aggregated data, such as census data by zip code or demographic information? – it reduces the data size, making it easier to handle while remaining useful

In what situation might you use stratified sampling instead of simple random sampling? – to ensure every subset of the population is represented in the sample
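
Sketch (Python): simple random sampling next to stratified sampling. The function names, the fixed seed, and the rule of taking at least one row per subgroup are assumptions for illustration. `fraction` is assumed to be between 0 and 1.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, k, seed=0):
    """Draw k rows from the whole dataset, ignoring subgroups."""
    return random.Random(seed).sample(rows, k)


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every subgroup defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # At least one row per subgroup, so small groups are not dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


rows = [{"id": i, "group": "A"} for i in range(90)]
rows += [{"id": i, "group": "B"} for i in range(90, 100)]

sample = stratified_sample(rows, "group", 0.1)
print(sum(r["group"] == "B" for r in sample))  # 1
```

With simple random sampling, the ten rows of group B could be missed entirely in a draw of ten. The stratified version always includes that subgroup.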

What is one benefit of using random sampling in data reduction? – it reduces the data size while keeping it representative of the entire set

(Data Masking)

What is the main purpose of data masking in data analysis? – to protect sensitive information from being exposed

Which of the following is an example of personally identifiable information? – social security number

How can you mask sensitive information while joining a data set based on unique identifiers? – by using an anonymous identifier or index field

What could be used as anonymous identifiers to replace user names? – row number from the list of unique usernames
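
Sketch (Python): replacing usernames with anonymous identifiers taken from the row number in the list of unique usernames. The sample usernames are hypothetical.

```python
usernames = ["ana", "ben", "ana", "cat", "ben"]

# dict.fromkeys drops duplicates and keeps first-seen order.
unique_users = list(dict.fromkeys(usernames))

# Row number (starting at 1) in the unique list becomes the anonymous ID.
anon_id = {name: i for i, name in enumerate(unique_users, start=1)}

masked = [anon_id[name] for name in usernames]
print(masked)  # [1, 2, 1, 3, 2]
```

The same ID always maps to the same person, so the masked column can still be used to join data sets.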

Why is it important to remove personally identifiable information (PII) in data sets? – to ensure compliance with data protection laws

What is the potential issue with anonymizing data using unique identifiers when tracking people's movements? – it might be possible to reconstruct someone's identity based on patterns

Which of the following is a simple method for creating an anonymous identifier for sensitive data? – assigning a unique index number to each individual

What is the primary reason for replacing sensitive information like user names with anonymous identifiers in reports? – to ensure privacy and protect identities

(Transposing Data)

What does it mean to transpose data in the context of data mining? – breaking down pivoted data into individual records

What is the difference between pivoted and unpivoted data? – pivoted data is compact for reporting; unpivoted has individual records

What is the primary function of the ‘unpivot’ command in many data tools like MS Excel? – to break down pivoted data into individual records
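
Sketch (Python): what an unpivot does, turning one compact row per student into one record per student and subject. The data, the `unpivot` helper, and the output column names are hypothetical. This is not how Excel or Power Query implement the command.

```python
pivoted = [
    {"student": "ana", "math": 90, "art": 85},
    {"student": "ben", "math": 70, "art": 95},
]


def unpivot(rows, id_col, var_name="subject", value_name="score"):
    """Break each pivoted row into one record per non-ID column."""
    records = []
    for row in rows:
        for col, value in row.items():
            if col != id_col:
                records.append(
                    {id_col: row[id_col], var_name: col, value_name: value}
                )
    return records


long_rows = unpivot(pivoted, "student")
print(len(long_rows))  # 4
```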

Which of the following is true about pivoted data? – it is compact and useful for reporting purposes

What type of data is typically used in the pivoted format? – data that needs to be summarized or reported in a compact form

(Appending Data)

What does appending data refer to? – adding the data from one dataset onto another

What is the difference between an inline append and an intermediate append? – an inline append discards the original datasets, while an intermediate append keeps them

In Power Query, what feature allows you to append data from one table to another? – append queries

What is the result of an inline append operation? – a new dataset is created, and all original datasets are discarded

What does an intermediate append do in contrast to an inline append? – it combines datasets into a new table while keeping the original datasets
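
Sketch (Python): a loose analogy for the two kinds of append using plain lists of rows. The function names and sample tables are hypothetical, and this does not reproduce Power Query's behavior. It only shows the difference between building a new table and changing an existing one.

```python
def append_as_new(*tables):
    """Intermediate-style append: return a new table, inputs unchanged."""
    combined = []
    for table in tables:
        combined.extend(dict(row) for row in table)
    return combined


def append_inline(target, other):
    """Inline-style append: the target table itself becomes the result."""
    target.extend(dict(row) for row in other)
    return target


jan_feb = [{"month": "Jan", "sales": 10}, {"month": "Feb", "sales": 12}]
mar = [{"month": "Mar", "sales": 9}]

combined = append_as_new(jan_feb, mar)
print(len(combined), len(jan_feb))  # 3 2

append_inline(jan_feb, mar)
print(len(jan_feb))  # 3
```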

Which of the following is an example of appending data? – combining the rows of two datasets into one

What is a key difference between appending data into a new table and just copying the data? – appending data into a new table creates an entirely new dataset

When performing an inline append, what happens to the original datasets? – the original datasets are discarded, and only the combined dataset is kept