
[Cleansing and Profiling Data]

(Cleansing Data)

What is the primary goal of data profiling – to identify trends and information in a dataset

Which of the following is not a key aspect of data profiling – encrypting the data for security

Which of the following is a key step in the data profiling process – identifying redundant and duplicate data

What is the purpose of consolidating duplicate or redundant data – to ensure data accuracy and efficiency

When cleansing data, which of the following is a primary concern – ensuring that missing or invalid values are handled appropriately

What is an outlier in a dataset – a data point that falls outside the average or statistically relevant range

Why is it important to cleanse data before analysis – to ensure it meets system requirements and improves accuracy

Which of the following is a benefit of data profiling tools – they help identify inconsistencies and patterns in data
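
The checks above (missing values, duplicates, out-of-range values) can be sketched in plain Python; the `profile` helper, the field name, and the sample rows are hypothetical:

```python
from collections import Counter

def profile(records, field):
    """Minimal profile of one numeric field: record count, missing
    values, min/max/mean, and any duplicated values."""
    values = [r[field] for r in records]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
    }

rows = [{"score": 88}, {"score": 92}, {"score": 88}, {"score": None}]
print(profile(rows, "score"))
```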

(Data Profiling Steps)

What is the first step in the data profiling process – identifying and documenting the source of the data and its integrity

Why is it important to identify the source of the data – to ensure the data is coming from a reliable source in a consistent format

What is the purpose of the second step in data profiling – to identify the field names and data types and check their appropriateness

Which of the following is not a data type that should be identified during data profiling – algorithm

Why might a numeric value need to be converted during data profiling – to correct formatting inconsistencies e.g., converting decimals to currency

What is the third step in data profiling – determining which fields are used for reporting

Why is it important to identify primary, natural or foreign keys in a dataset – to ensure that every record has a unique identifier

What is the risk if a dataset lacks a primary key – not possible to uniquely identify records, leading to data integrity issues

What is the purpose of the fifth and final step in data profiling – to validate that the total number of records and the calculations in the dataset make sense
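
Steps 2 through 5 above can be sketched as one pass over a list-of-dicts dataset (step 1, identifying the source, happens before any code runs; the `profile_dataset` helper and its arguments are hypothetical):

```python
def profile_dataset(rows, expected_types, key_field, report_fields):
    """Walk profiling steps 2-5 for a list-of-dicts dataset."""
    issues = []
    # Step 2: check field names and data types for appropriateness.
    for row in rows:
        for field, expected in expected_types.items():
            if not isinstance(row.get(field), expected):
                issues.append(f"type mismatch in {field}: {row.get(field)!r}")
    # Step 3: confirm the fields used for reporting exist in every record.
    for field in report_fields:
        if any(field not in row for row in rows):
            issues.append(f"reporting field missing: {field}")
    # Step 4: verify the key field uniquely identifies each record.
    keys = [row.get(key_field) for row in rows]
    if len(keys) != len(set(keys)):
        issues.append(f"duplicate keys in {key_field}")
    # Step 5: validate the total record count alongside any issues found.
    return {"records": len(rows), "issues": issues}

rows = [{"id": 1, "amt": 19.99}, {"id": 2, "amt": 5.00}]
print(profile_dataset(rows, {"id": int, "amt": float}, "id", ["amt"]))
```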

(Data Profiling Tools)

What is the primary purpose of data profiling tools – to ensure good-quality source data for accurate analysis

Which of the following tools includes a built-in data profiling feature known as Power Query – Excel

How can Power Query help identify data errors – by visualizing data distribution and detecting outliers

Which of the following can Power Query display for a column of numerical data – minimum, maximum, average, and outliers

What is one benefit of Power Query’s histogram features – it shows the most and least common values in a column

If a test score column shows a maximum value of 153 out of 100, what does this indicate – there is an error in the source data
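
Outside Power Query, the same range check can be scripted directly; a sketch in Python, assuming a hypothetical 0–100 test score column:

```python
def out_of_range(values, low, high):
    """Return values outside the valid range, flagging likely
    source-data errors such as a 153 in a 0-100 score column."""
    return [v for v in values if not (low <= v <= high)]

scores = [67, 88, 153, 94]
print(out_of_range(scores, 0, 100))  # [153] signals a source-data error
```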

(Redundant and Duplicated Data)

What is the primary difference between redundant and duplicated data – identical data stored in multiple places; repeated within the same dataset

Which of the following is an example of redundant data – a customer's name and email stored in both the sales/inventory system and the e-commerce system

How can duplicate data be identified in Microsoft Excel – using the Remove Duplicates feature or conditional formatting to highlight duplicate values

What is the best way to identify duplicate records in an SQL database – applying a DISTINCT query

Why is it important to identify and remove redundant data – it reduces storage costs and improves system efficiency
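
Both SQL approaches can be compared in a scratch SQLite database (the table and sample rows are hypothetical): `SELECT DISTINCT` collapses duplicates into unique records, while `GROUP BY ... HAVING COUNT(*) > 1` lists which records are duplicated and how often:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (name TEXT, email TEXT)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", "ana@example.com"),
     ("Ana", "ana@example.com"),   # duplicated record
     ("Ben", "ben@example.com")],
)
# DISTINCT returns each record once, silently collapsing duplicates.
unique = con.execute("SELECT DISTINCT name, email FROM customers").fetchall()
# GROUP BY ... HAVING surfaces which records are duplicated, and how often.
dupes = con.execute(
    "SELECT name, email, COUNT(*) FROM customers "
    "GROUP BY name, email HAVING COUNT(*) > 1"
).fetchall()
print(unique)
print(dupes)
```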

(Unnecessary Data)

Why is it important for a data analyst to remove unnecessary data from a dataset – to reduce the time spent analyzing irrelevant information

What determines whether certain data is necessary or unnecessary in a dataset – the specific business questions being answered

What is one negative consequence of collecting unnecessary data – it increases processing and storage costs

How can data analysts avoid working with unnecessary data when using tools like Excel, Power BI, or Tableau – by selecting only the columns and fields needed for analysis

Why do organizations often collect more data than necessary – because they don’t know what specific data they need
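
Selecting only the needed columns can be sketched in plain Python; the CSV fields below are hypothetical, with only the columns driven by the business question kept:

```python
import csv
import io

# Hypothetical export with more fields than the analysis needs.
raw = io.StringIO(
    "order_id,customer,email,phone,notes,amount\n"
    "1,Ana,ana@example.com,555-0100,gift wrap,19.99\n"
    "2,Ben,ben@example.com,555-0101,,5.00\n"
)
needed = ("order_id", "customer", "amount")  # driven by the business question
trimmed = [{k: row[k] for k in needed} for row in csv.DictReader(raw)]
print(trimmed)
```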

(Missing Values)

What does the term “null” represent in a dataset – a placeholder for missing data

Which of the following is NOT a common way null values are displayed in datasets – the word “ERROR”

What is one reason null values might appear in a dataset – the value is not applicable to the field

In a sales process database, why might the Delivery Date field be null for a product that is marked as In-Store Pickup? – the delivery date field is irrelevant for in-store pickup orders

What should a data analyst do when encountering null values in a dataset? – filter them out or replace them with meaningful values

Why might a dataset have null values for a “last access time” field in an e-learning system? – the user has never logged in to access the course

How can incomplete survey responses lead to null values? – participants skip questions without mandatory requirements

What is one potential solution for handling null values in a dataset? – replace with meaningful placeholders like “No color” or “No card on file”
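
The filter-or-replace approach can be sketched in Python; the `fill_nulls` helper and the sample orders are hypothetical:

```python
def fill_nulls(rows, field, placeholder):
    """Replace None in one field with a meaningful placeholder,
    leaving all other values untouched."""
    return [{**r, field: r[field] if r[field] is not None else placeholder}
            for r in rows]

orders = [{"id": 1, "card": "visa"}, {"id": 2, "card": None}]
print(fill_nulls(orders, "card", "No card on file"))
```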

(Invalid Data)

Why might a previously valid tax rate in a dataset become invalid? – the tax law changed, making the old rate outdated

What makes an extreme value, such as a student weighing 1,923 pounds, an example of invalid data? – it falls outside the expected range for the data

How can invisible characters cause invalid data in a dataset? – they disrupt processing by creating inconsistencies in fields

Which of the following is an example of invalid data caused by a formatting issue in a purchase order? – A purchase date entered as “2024/15/01”

What is one common method for handling invalid data caused by leading and trailing spaces? – removing the spaces to make the data valid
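
Both fixes can be sketched in Python: stripping spaces and invisible characters, and rejecting a malformed purchase date (the helper names and the `YYYY/MM/DD` format are hypothetical):

```python
import re
from datetime import datetime

def clean_field(value):
    """Strip leading/trailing spaces and invisible characters
    (zero-width space, non-breaking space) that invalidate data."""
    return re.sub(r"[\u200b\u00a0]", "", value).strip()

def valid_purchase_date(value):
    """True only if the value parses as a real YYYY/MM/DD date."""
    try:
        datetime.strptime(value, "%Y/%m/%d")
        return True
    except ValueError:
        return False

print(clean_field("  ACME\u200b "))        # 'ACME'
print(valid_purchase_date("2024/15/01"))   # False: month 15 does not exist
```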

(Meeting Specifications)

Why is it important for data to meet specifications during system migration? – to prevent errors during data import

What is the most common reason data fails to meet specifications during an import process? – wrong data type

How can you prevent type mismatch errors during data import? – convert data types to match the new system’s format

Which tool can automatically adjust data types during import? – Tableau
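
The convert-before-import step can be sketched in Python; the schema mapping below is a hypothetical stand-in for the new system's expected types:

```python
def coerce(row, schema):
    """Convert each field to the target system's expected type,
    collecting the type-mismatch errors that would otherwise
    abort an import."""
    converted, errors = {}, []
    for field, target in schema.items():
        try:
            converted[field] = target(row[field])
        except (ValueError, TypeError):
            errors.append(field)
    return converted, errors

schema = {"id": int, "price": float, "name": str}
row = {"id": "42", "price": "19.99", "name": "widget"}
print(coerce(row, schema))
```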

(Data Outliers)

What is a data outlier? – a value that is far outside the normal range of other values

How can data outliers usually be identified? – by visualizing the data using graphs like a histogram

Which of the following could be a cause of an outlier? – incorrect data entry during collection

What is non-parametric data? – data that does not follow a prescribed model and is analyzed based on its own distribution

How does a parametric model differ from a non-parametric model when identifying outliers? – a parametric model compares values to a known baseline; a non-parametric model evaluates the data against its own distribution

Which of the following is an example of a parametric approach to identifying an outlier? – checking if a sales figure exceeds the average by more than 10%
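
The two approaches can be contrasted in Python: a parametric check against a known baseline (here the mean plus 10%, per the sales example) versus a non-parametric check using the data's own quartiles; the sample sales figures are hypothetical:

```python
import statistics

def parametric_outliers(values, pct=0.10):
    """Parametric: flag values more than pct above a known baseline
    (here the mean), per the 10%-over-average example."""
    baseline = statistics.mean(values)
    return [v for v in values if v > baseline * (1 + pct)]

def nonparametric_outliers(values):
    """Non-parametric: use the data's own distribution (1.5x IQR
    fences) instead of a prescribed model."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

sales = [100, 102, 98, 101, 99, 250]
print(parametric_outliers(sales))
print(nonparametric_outliers(sales))
```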