[Cleansing and Profiling Data]

55 Terms

1

What is the primary goal of data profiling?

To identify trends and information in a dataset.

2

Which of the following is not a key aspect of data profiling?

Encrypting the data for security.

3

What is a key step in the data profiling process?

Identifying redundant and duplicate data.

4

What is the purpose of consolidating duplicate or redundant data?

To ensure data accuracy and efficiency.

5

When cleansing data, what is a primary concern?

Ensuring that missing or invalid values are handled appropriately.

6

What is an outlier in a dataset?

A data point that falls far outside the average or statistically expected range of the other values.

7

Why is it important to cleanse data before analysis?

To ensure it meets system requirements and to improve the accuracy of the analysis.

8

What is the first step in the data profiling process?

Identifying and documenting the source of the data and its integrity.

9

Why is it important to identify the source of the data?

To ensure the data comes from a reliable source and arrives in a consistent format.

10

What is the purpose of the second step in data profiling?

To identify the field names and data types and check their appropriateness.

11

Which of the following is not a data type that should be identified during data profiling?

Algorithm.

12

Why might a numeric value need to be converted during data profiling?

To correct formatting inconsistencies, e.g., converting plain decimals to a currency format.
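
As an illustration of this kind of conversion, here is a minimal sketch that renders raw numeric values as currency strings. The helper name `to_currency`, the dollar symbol, and the two-decimal rounding rule are assumptions made for the example, not something the card specifies.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_currency(value, symbol="$"):
    """Format a raw numeric value as a currency string with two decimals."""
    # Going through str() avoids binary floating-point surprises when rounding.
    amount = Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return f"{symbol}{amount:,.2f}"

# Hypothetical raw values as they might appear in a source column.
raw_prices = [1234.5, 19.999, 7]
formatted = [to_currency(p) for p in raw_prices]
print(formatted)
```

Only the display format changes here; the underlying values should stay numeric in the dataset so they can still be summed and averaged.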

13

What is the third step in data profiling?

Determining which fields are used for reporting.

14

Why is it important to identify primary, natural, or foreign keys in a dataset?

To ensure that every record has a unique identifier.

15

What is the risk if a dataset lacks a primary key?

Records cannot be uniquely identified, which leads to data integrity issues.
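
One way to test whether a field can serve as a primary key is to check that it is never missing and never repeated. The sketch below assumes records are held as a list of dictionaries; the function name `check_primary_key` and the sample `customers` data are hypothetical.

```python
from collections import Counter

def check_primary_key(records, key):
    """Return (missing_count, duplicated_values) for a candidate key field."""
    values = [row.get(key) for row in records]
    # A key value that is absent or empty cannot identify a record.
    missing = sum(1 for v in values if v is None or v == "")
    counts = Counter(v for v in values if v is not None and v != "")
    duplicated = sorted(v for v, n in counts.items() if n > 1)
    return missing, duplicated

# Hypothetical sample data: one repeated id and one missing id.
customers = [
    {"customer_id": 101, "name": "Ana"},
    {"customer_id": 102, "name": "Ben"},
    {"customer_id": 102, "name": "Ben B."},
    {"customer_id": None, "name": "Cy"},
]
missing, duplicated = check_primary_key(customers, "customer_id")
is_valid_key = missing == 0 and not duplicated
print(missing, duplicated, is_valid_key)
```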

16

What is the purpose of the fifth and final step in data profiling?

To validate that the total number of records and the calculations in the dataset make sense.
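
A validation pass of this kind can be sketched as a function that compares the record count and recomputed totals against expected figures. The field names (`quantity`, `unit_price`, `line_total`), the expected values, and the rounding tolerance are all assumptions for the example.

```python
def validate_dataset(rows, expected_count, expected_total, tolerance=0.005):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # Check 1: the number of records matches what the source reported.
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} records, found {len(rows)}")
    # Check 2: each stored calculation matches a fresh recomputation.
    for i, row in enumerate(rows):
        computed = row["quantity"] * row["unit_price"]
        if abs(computed - row["line_total"]) > tolerance:
            problems.append(f"row {i}: line_total {row['line_total']} != {computed:.2f}")
    # Check 3: the grand total matches the expected control total.
    grand_total = sum(row["line_total"] for row in rows)
    if abs(grand_total - expected_total) > tolerance:
        problems.append(f"grand total {grand_total:.2f} != expected {expected_total:.2f}")
    return problems

# Hypothetical order lines; the third row carries a deliberate calculation error.
orders = [
    {"quantity": 2, "unit_price": 5.00, "line_total": 10.00},
    {"quantity": 1, "unit_price": 3.50, "line_total": 3.50},
    {"quantity": 4, "unit_price": 2.25, "line_total": 9.50},
]
problems = validate_dataset(orders, expected_count=3, expected_total=22.50)
for p in problems:
    print(p)
```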

17

What is the primary purpose of data profiling tools?

To ensure good-quality source data for accurate analysis.

18

Which tool includes a built-in data profiling feature known as Power Query?

Excel.

19

How can Power Query help identify data errors?

By visualizing data distribution and detecting outliers.

20

Which of the following can Power Query display for a column of numerical data?

Minimum, maximum, average, and outliers.

21

What is one benefit of Power Query’s histogram features?

It shows the most and least common values in a column.
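
The same idea can be reproduced outside Power Query. This is a rough sketch of a value-frequency histogram in plain Python, not a description of how Power Query works internally; the `colors` column is made-up sample data.

```python
from collections import Counter

def value_distribution(values):
    """Return (value, count) pairs sorted from most to least common."""
    return Counter(values).most_common()

# Hypothetical column of categorical values.
colors = ["red", "blue", "red", "green", "red", "blue"]
ranked = value_distribution(colors)
most_common = ranked[0]
least_common = ranked[-1]

# Print a simple text histogram, one bar per distinct value.
for value, count in ranked:
    print(f"{value:<6} {'#' * count}")
```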

22

If a test score column shows a maximum value of 153 out of 100, what does this indicate?

There is an error in the source data.
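
A basic column profile makes this kind of error easy to spot. The sketch below computes minimum, maximum, and average for a score column and lists any value outside an assumed valid range of 0 to 100; the function name and sample scores are illustrative.

```python
from statistics import mean

def profile_scores(scores, low=0, high=100):
    """Summarize a numeric column and list values outside the valid range."""
    return {
        "min": min(scores),
        "max": max(scores),
        "average": mean(scores),
        # Anything outside [low, high] cannot be a legitimate score.
        "out_of_range": [s for s in scores if not low <= s <= high],
    }

# Hypothetical test scores out of 100, including one impossible value.
scores = [88, 92, 153, 75, 60]
profile = profile_scores(scores)
print(profile)
```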

23

What is the primary difference between redundant and duplicated data?

Redundant data is identical data stored in multiple places; duplicate data is data repeated within the same dataset.

24

Which of the following is an example of redundant data?

A customer's name and email stored separately in the sales system, the inventory system, the e-commerce system, etc.

25

How can duplicate data be identified in Microsoft Excel?

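Using the Duplicate Values rule under Conditional Formatting to highlight them, or the Remove Duplicates command on the Data tab to delete them.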

26

What is the best way to identify duplicate records in an SQL database?

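Applying a DISTINCT query and comparing its row count with the full table; a GROUP BY with HAVING COUNT(*) > 1 lists the duplicated records themselves.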
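
To make the SQL approach concrete, the sketch below runs two checks against an in-memory SQLite database: a DISTINCT count compared with the total row count, and a GROUP BY ... HAVING query that returns the repeated rows. The `customers` table, its columns, and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        ("Ana", "ana@example.com"),
        ("Ben", "ben@example.com"),
        ("Ana", "ana@example.com"),  # exact duplicate of the first row
    ],
)

# If DISTINCT returns fewer rows than the table holds, duplicates exist.
total_rows = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT name, email FROM customers)"
).fetchone()[0]

# GROUP BY ... HAVING shows which records are repeated and how often.
duplicates = conn.execute(
    "SELECT name, email, COUNT(*) FROM customers "
    "GROUP BY name, email HAVING COUNT(*) > 1"
).fetchall()
conn.close()

print(total_rows, distinct_rows, duplicates)
```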

27

Why is it important to identify and remove redundant data?

It reduces storage costs and improves system efficiency.

28

Why is it important for a data analyst to remove unnecessary data from a dataset?

To reduce the time spent analyzing irrelevant information.

29

What determines whether certain data is necessary or unnecessary in a dataset?

The specific business questions being answered.

30

What is one negative consequence of collecting unnecessary data?

It increases processing and storage costs.

31

How can data analysts avoid working with unnecessary data when using tools like Excel, Power BI, or Tableau?

By selecting only the columns and fields needed for analysis.
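
The same principle applies when loading data in code rather than in a BI tool. A minimal sketch, assuming a CSV source with hypothetical column names, that keeps only the fields needed for the analysis:

```python
import csv
import io

def select_columns(csv_text, wanted):
    """Read CSV text and keep only the columns listed in `wanted`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: row[col] for col in wanted} for row in reader]

# Hypothetical export with columns the analysis does not need.
raw = (
    "order_id,customer,notes,amount\n"
    "1,Ana,gift wrap,20.00\n"
    "2,Ben,,35.50\n"
)
trimmed = select_columns(raw, ["order_id", "amount"])
print(trimmed)
```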

32

Why do organizations often collect more data than necessary?

Because they don’t know what specific data they need.

33

What does the term 'null' represent in a dataset?

A placeholder for missing data.

34

Which of the following is NOT a common way null values are displayed in datasets?

The word 'ERROR'.

35

What is one reason null values might appear in a dataset?

The value is not applicable to the field.

36

In a sales process database, why might the Delivery Date field be null for a product that is marked as In-Store Pickup?

The delivery date field is irrelevant for in-store pickup orders.

37

What should a data analyst do when encountering null values in a dataset?

Filter them out or replace them with meaningful values.

38

Why might a dataset have null values for a 'last access time' field in an e-learning system?

The user has never logged in to access the course.

39

How can incomplete survey responses lead to null values?

Participants skip questions that are not marked as mandatory.

40

What is one potential solution for handling null values in a dataset?

Replace them with meaningful placeholders such as 'No color' or 'No card on file'.
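
Both ways of handling nulls, replacing them with a placeholder or filtering them out, can be sketched in a few lines. The set of markers treated as null (`None`, empty string, `"NULL"`, `"N/A"`) is an assumption for this example; real datasets vary, so the list should be confirmed during profiling.

```python
# Assumed representations of "missing" in the source data.
NULL_MARKERS = {None, "", "NULL", "N/A"}

def is_null(value):
    """True if the value is one of the assumed null markers."""
    return value in NULL_MARKERS

def fill_nulls(rows, field, placeholder):
    """Return copies of the rows with nulls in `field` replaced by `placeholder`."""
    return [
        {**row, field: placeholder if is_null(row.get(field)) else row[field]}
        for row in rows
    ]

def drop_nulls(rows, field):
    """Return only the rows where `field` holds a real value."""
    return [row for row in rows if not is_null(row.get(field))]

# Hypothetical product records with two different null representations.
products = [
    {"sku": "A1", "color": "red"},
    {"sku": "B2", "color": None},
    {"sku": "C3", "color": "N/A"},
]
filled = fill_nulls(products, "color", "No color")
kept = drop_nulls(products, "color")
print(filled)
print(kept)
```

Replacing keeps every record available for counts and totals, while filtering is simpler when the field itself is the subject of the analysis.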

41

Why might a previously valid tax rate in a dataset become invalid?

The tax law changed, making the old rate outdated.

42

What makes an extreme value, such as a student weighing 1,923 pounds, an example of invalid data?

It falls outside the expected range for the data.

43

How can invisible characters cause invalid data in a dataset?

They disrupt processing by creating inconsistencies in fields.

44

Which of the following is an example of invalid data caused by a formatting issue in a purchase order?

A purchase date entered as '2024/15/01'.
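
A date like this can be caught by parsing it against the expected format. The sketch assumes the purchase order system expects year/month/day (`YYYY/MM/DD`), which the card does not state explicitly; under that assumption, 15 is not a valid month.

```python
from datetime import datetime

def is_valid_date(text, fmt="%Y/%m/%d"):
    """True if `text` parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for impossible values (month 15) and for format mismatches.
        return False

print(is_valid_date("2024/15/01"))
print(is_valid_date("2024/01/15"))
```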

45

What is one common method for handling invalid data caused by leading and trailing spaces?

Removing the spaces to make the data valid.
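
Trimming can be combined with removal of invisible characters in one cleaning step. This is a sketch under the assumption that "invisible characters" means Unicode control and format characters (categories `Cc` and `Cf`, such as tabs, zero-width spaces, and byte-order marks); the helper name `clean_text` is hypothetical.

```python
import unicodedata

def clean_text(value):
    """Remove control/format characters, then strip surrounding whitespace."""
    visible = "".join(
        ch for ch in value if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # str.strip() also removes non-breaking spaces at either end.
    return visible.strip()

# Hypothetical dirty values: padding, a zero-width space, a tab, a BOM.
dirty_name = "  Ana\u200b \t"
dirty_sku = "\ufeffSKU-1\u00a0"

print(dirty_name == "Ana", clean_text(dirty_name) == "Ana")
print(repr(clean_text(dirty_sku)))
```

Note that this sketch deletes control characters wherever they occur, so a tab used as a separator inside a value would be dropped rather than replaced with a space.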

46

Why is it important for data to meet specifications during system migration?

To prevent errors during data import.

47

What is the most common reason data fails to meet specifications during an import process?

Wrong data type.

48

How can you prevent type mismatch errors during data import?

Convert data types to match the new system’s format.
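
One way to apply this before an import is to cast every field to the type the target system expects and collect any failures for review. The schema, field names, and date format below are assumptions made for the example, not requirements of any particular system.

```python
from datetime import datetime

# Hypothetical target schema: field name -> function that casts a raw string.
TARGET_SCHEMA = {
    "order_id": int,
    "amount": float,
    "order_date": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
}

def convert_row(row, schema):
    """Cast each field to its target type; return (converted, errors)."""
    converted, errors = {}, {}
    for field, caster in schema.items():
        try:
            converted[field] = caster(row[field])
        except (KeyError, ValueError, TypeError) as exc:
            # Record the failure instead of stopping the whole import.
            errors[field] = str(exc)
    return converted, errors

good, good_errors = convert_row(
    {"order_id": "42", "amount": "19.90", "order_date": "2024-01-15"},
    TARGET_SCHEMA,
)
bad, bad_errors = convert_row(
    {"order_id": "forty-two", "amount": "19.90", "order_date": "2024-01-15"},
    TARGET_SCHEMA,
)
print(good, good_errors)
print(bad, bad_errors)
```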

49

Which tool can automatically adjust data types during import?

Tableau.

50

What is a data outlier?

A value that is far outside the normal range of other values.

51

How can data outliers usually be identified?

By visualizing the data using graphs like a histogram.

52

Which of the following could be a cause of an outlier?

Incorrect data entry during collection.

53

What is non-parametric data?

Data that does not follow a prescribed model and is analyzed based on its own distribution.

54

How does a parametric model differ from a non-parametric model when identifying outliers?

A parametric model compares values against a known baseline; a non-parametric model evaluates values against the data's own distribution.

55

Which of the following is an example of a parametric approach to identifying an outlier?

Checking if a sales figure exceeds the average by more than 10%.
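
The two approaches can be compared side by side. The first function follows the card's example (flag values more than 10% above a baseline such as the average). The second uses the interquartile-range rule, which is a common distribution-based technique chosen here for illustration and is not named on the cards. The sales figures are made up.

```python
from statistics import mean, quantiles

def baseline_outliers(values, baseline, pct=0.10):
    """Parametric-style check: flag values more than `pct` above a baseline."""
    return [v for v in values if v > baseline * (1 + pct)]

def iqr_outliers(values, k=1.5):
    """Distribution-based check: flag values beyond k * IQR from the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical daily sales figures with one unusually large value.
sales = [100, 102, 98, 101, 99, 103, 97, 180]

above_baseline = baseline_outliers(sales, baseline=mean(sales))
beyond_iqr = iqr_outliers(sales)
print(above_baseline, beyond_iqr)
```

A flagged value is only a candidate: it may be a data entry error or a genuine extreme, so it should be checked against the source before being removed.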