(Cleansing Data)
What is the primary goal of data profiling – to identify trends and information in a dataset
Which of the following is not a key aspect of data profiling – encrypting the data for security
Which of the following is a key step in the data profiling process – identifying redundant and duplicate data
What is the purpose of consolidating duplicate or redundant data – to ensure data accuracy and efficiency
When cleansing data, which of the following is a primary concern – ensuring that missing or invalid values are handled appropriately
What is an outlier in a dataset – a data point that falls outside the average or statistically relevant range
Why is it important to cleanse data before analysis – to ensure it meets system requirements and improves accuracy
Which of the following is a benefit of data profiling tools – they help identify inconsistencies and patterns in data
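A minimal pandas sketch of the cleansing concerns above – filling missing values and flagging points outside the expected range (the dataset and column names are hypothetical):

```python
import pandas as pd

# Hypothetical test-score data with a missing value and an outlier
df = pd.DataFrame({"student": ["Ana", "Ben", "Cy", "Dee"],
                   "score": [88, None, 153, 91]})

# Handle missing values: here, fill nulls with the column median
df["score"] = df["score"].fillna(df["score"].median())

# Flag values outside the statistically relevant range (scores run 0-100)
df["in_range"] = df["score"].between(0, 100)
print(df)
```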
(Data Profiling Steps)
What is the first step in the data profiling process – identifying and documenting the source of the data and its integrity
Why is it important to identify the source of the data – to ensure the data comes from a reliable source and arrives in a consistent format
What is the purpose of the second step in data profiling – to identify the field names and data types and check their appropriateness
Which of the following is not a data type that should be identified during data profiling – algorithm
Why might a numeric value need to be converted during data profiling – to correct formatting inconsistencies (e.g., converting decimals to currency)
What is the third step in data profiling – determining which fields are used for reporting
Why is it important to identify primary, natural, or foreign keys in a dataset – to ensure that every record has a unique identifier
What is the risk if a dataset lacks a primary key – not possible to uniquely identify records, leading to data integrity issues
What is the purpose of the fifth and final step in data profiling – to validate that the total number of records and the calculations in the dataset make sense
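The profiling steps above can be walked through in pandas; a sketch assuming a hypothetical orders.csv with order_id and total columns:

```python
import pandas as pd

df = pd.read_csv("orders.csv")  # step 1: a documented, trusted source

# Step 2: inspect field names and data types for appropriateness
print(df.dtypes)

# Step 4: confirm the candidate primary key uniquely identifies records
assert df["order_id"].is_unique, "order_id cannot serve as a primary key"

# Step 5: validate the record count and sanity-check a calculation
print(len(df), "records; order totals sum to", df["total"].sum())
```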
(Data Profiling Tools)
What is the primary purpose of data profiling tools – to ensure good-quality source data for accurate analysis
Which of the following tools includes a built-in data profiling feature known as Power Query – Excel
How can Power Query help identify data errors – by visualizing data distribution and detecting outliers
Which of the following can Power Query display for a column of numerical data – minimum, maximum, average, and outliers
What is one benefit of Power Query’s histogram features – it shows the most and least common values in a column
If a test score column shows a maximum value of 153 out of 100, what does this indicate – there is an error in the source data
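Power Query's column profile (min, max, average, value distribution) can be approximated in pandas; a sketch with hypothetical test scores, including the 153-out-of-100 error described above:

```python
import pandas as pd

scores = pd.Series([72, 88, 95, 153, 67, 81])  # hypothetical test scores

# Rough equivalent of Power Query's column statistics
print(scores.describe())  # count, mean, min, max, quartiles

# A max of 153 on a 0-100 test signals an error in the source data
print("out of range:", scores[~scores.between(0, 100)].tolist())
```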
(Redundant and Duplicated Data)
What is the primary difference between redundant and duplicated data – redundant data is identical data stored in multiple places; duplicated data is repeated within the same dataset
Which of the following is an example of redundant data – a customer name/email stored in the sales, inventory, and e-commerce systems
How can duplicate data be identified in Microsoft Excel – using the Remove Duplicates feature
What is the best way to identify duplicate records in an SQL database – applying a DISTINCT query
Why is it important to identify and remove redundant data – it reduces storage costs and improves system efficiency
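A sketch of both checks in pandas, with hypothetical customer rows – duplicated() flags repeats within one dataset, and drop_duplicates() mirrors what a SELECT DISTINCT query returns in SQL:

```python
import pandas as pd

df = pd.DataFrame({"name":  ["Ana", "Ben", "Ana"],
                   "email": ["a@x.com", "b@x.com", "a@x.com"]})

# Identify rows repeated within the same dataset (duplicated data)
print(df[df.duplicated()])

# Keep one copy of each record, like SELECT DISTINCT in SQL
print(df.drop_duplicates())
```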
(Unnecessary Data)
Why is it important for a data analyst to remove unnecessary data from a dataset – to reduce the time spent analyzing irrelevant information
What determines whether certain data is necessary or unnecessary in a dataset – the specific business questions being answered
What is one negative consequence of collecting unnecessary data – it increases processing and storage costs
How can data analysts avoid working with unnecessary data when using tools like Excel, Power BI, or Tableau – by selecting only the columns and fields needed for analysis
Why do organizations often collect more data than necessary – because they don’t know what specific data they need
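One way to avoid unnecessary data in practice – loading only the required fields at read time; a pandas sketch with a hypothetical customers.csv:

```python
import pandas as pd

# Load only the fields the business question requires; skipping the
# rest cuts processing time and storage costs
df = pd.read_csv("customers.csv",
                 usecols=["customer_id", "region", "total_spend"])
```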
(Missing Values)
What does the term “null” represent in a dataset – a placeholder for missing data
Which of the following is NOT a common way null values are displayed in datasets – the word “ERROR”
What is one reason null values might appear in a dataset – the value is not applicable to the field
In a sales process database, why might the Delivery Date field be null for a product that is marked as In-Store Pickup? – the delivery date field is irrelevant for in-store pickup orders
What should a data analyst do when encountering null values in a dataset? – filter them out or replace them with meaningful values
Why might a dataset have null values for a “last access time” field in an e-learning system? – the user has never logged in to access the course
How can incomplete survey responses lead to null values? – participants skip questions that are not marked as mandatory
What is one potential solution for handling null values in a dataset? – replace with meaningful placeholders like “No color” or “No card on file”
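A sketch of both null-handling options in pandas, using the hypothetical placeholder values mentioned above:

```python
import pandas as pd

df = pd.DataFrame({"color":        [None, "Red", None],
                   "card_on_file": ["Visa", None, "Amex"]})

# Option 1: replace nulls with meaningful placeholders
df["color"] = df["color"].fillna("No color")
df["card_on_file"] = df["card_on_file"].fillna("No card on file")

# Option 2: filter out rows where a required field is still null
complete = df.dropna()
print(df, complete, sep="\n")
```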
(Invalid Data)
Why might a previously valid tax rate in a dataset become invalid? – the tax law changed, making the old rate outdated
What makes an extreme value, such as a student weighing 1,923 pounds, an example of invalid data? – it falls outside the expected range for the data
How can invisible characters cause invalid data in a dataset? – they disrupt processing by creating inconsistencies in fields
Which of the following is an example of invalid data caused by a formatting issue in a purchase order? – A purchase date entered as “2024/15/01”
What is one common method for handling invalid data caused by leading and trailing spaces? – removing the spaces to make the data valid
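A pandas sketch covering the invalid-data cases above – stripping stray spaces, coercing a badly formatted date, and flagging an out-of-range extreme (column names and ranges are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"purchase_date": ["2024/15/01", "2024/01/12"],
                   "weight_lbs":    [" 142 ", "1923"]})

# Remove leading/trailing spaces before converting the field to numbers
df["weight_lbs"] = pd.to_numeric(df["weight_lbs"].str.strip())

# Coerce badly formatted dates to NaT so they can be found and fixed
df["purchase_date"] = pd.to_datetime(df["purchase_date"],
                                     format="%Y/%m/%d", errors="coerce")

# Flag extreme values outside the expected range for the field
print(df[~df["weight_lbs"].between(50, 500)])
```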
(Meeting Specifications)
Why is it important for data to meet specifications during system migration? – to prevent errors during data import
What is the most common reason data fails to meet specifications during an import process? – wrong data type
How can you prevent type mismatch errors during data import? – convert data types to match the new system’s format
Which tool can automatically adjust data types during import? – Tableau
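A sketch of converting a field to the type the target system expects before import (the column and values are hypothetical); coercion surfaces type mismatches instead of failing the whole load:

```python
import pandas as pd

# Hypothetical legacy export where prices arrived as text
df = pd.DataFrame({"unit_price": ["19.99", "5.25", "n/a"]})

# Convert to numeric to match the new system's specification;
# errors="coerce" turns mismatches into NaN for review rather than
# aborting the import
df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
print(df.dtypes)
```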
(Data Outliers)
What is a data outlier? – a value that is far outside the normal range of other values
How can data outliers usually be identified? – by visualizing the data using graphs like a histogram
Which of the following could be a cause of an outlier? – incorrect data entry during collection
What is non-parametric data? – data that does not follow a prescribed model and is analyzed based on its own distribution
How does a parametric model differ from a non-parametric model when identifying outliers? – a parametric model compares values against a known baseline; a non-parametric model evaluates values against the data’s own distribution
Which of the following is an example of a parametric approach to identifying an outlier? – checking if a sales figure exceeds the average by more than 10%
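Both approaches can be sketched in pandas with hypothetical sales figures – a parametric check against a known baseline (the 10% threshold from the example above) and a non-parametric check that lets the data's own distribution set the fences (the 1.5-IQR rule):

```python
import pandas as pd

sales = pd.Series([102, 98, 110, 95, 240, 105])  # hypothetical figures

# Parametric: compare against a known baseline, e.g. flag values more
# than 10% above the average (the threshold is illustrative)
avg = sales.mean()
print("parametric:", sales[sales > avg * 1.10].tolist())

# Non-parametric: fences derived from the data's own distribution
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print("non-parametric:", outliers.tolist())
```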