[Data Manipulation: Recoding Data]

59 Terms

1

Recoding Data

The process of transforming data from one form to another.

2

When to recode numerical data

When converting numerical data into categories for easier analysis.

3

Example of recoding numerical data

Converting exact income figures into income ranges.

4

Income range for 56,000

50,000 - 100,000.
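
A minimal sketch of this kind of numerical-to-categorical recoding. Only the 50,000 - 100,000 range comes from the card; the other ranges and the names `BOUNDS`, `LABELS`, and `income_range` are assumptions made for illustration.

```python
from bisect import bisect_right

# Lower bound of each range. Only the 50,000 - 100,000 range is taken from
# the card; the neighbouring ranges are assumed for illustration.
BOUNDS = [0, 50_000, 100_000]
LABELS = ["0 - 50,000", "50,000 - 100,000", "100,000+"]


def income_range(income: float) -> str:
    """Return the label of the range that contains `income`."""
    if income < 0:
        raise ValueError("income must be non-negative")
    # bisect_right counts how many lower bounds are <= income.
    return LABELS[bisect_right(BOUNDS, income) - 1]


print(income_range(56_000))  # 50,000 - 100,000
```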

5

Standardizing city names

To ensure there are no misspellings and that data is consistent.

6

Advantage of Power BI for recoding data

Lets you create conditional columns through a graphical interface to recode data.

7

Example of recoding categorical data

Standardizing how country names are entered (e.g., PHL to PH).

8

Main concept of recoding numerical to categorical

To classify numerical data into meaningful categories for analysis.

9

Creating an additional column when recoding

To preserve the original data while showing the recoded version.
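
As a rough illustration, the recoded value can be written to a new column so the original entry stays visible next to it. The sample rows, the `COUNTRY_CODES` mapping, and the `country_recoded` column name are hypothetical.

```python
# Hypothetical mapping from inconsistent entries to one standard code.
COUNTRY_CODES = {"PHL": "PH", "Philippines": "PH", "PH": "PH"}

rows = [
    {"name": "Ana", "country": "PHL"},
    {"name": "Ben", "country": "Philippines"},
    {"name": "Cy", "country": "PH"},
]


def add_recoded_country(row: dict) -> dict:
    """Return a copy of `row` with an extra `country_recoded` column."""
    original = row["country"]
    # Unknown values pass through unchanged instead of being guessed.
    return {**row, "country_recoded": COUNTRY_CODES.get(original, original)}


recoded = [add_recoded_country(r) for r in rows]
for r in recoded:
    print(r["country"], "->", r["country_recoded"])
```

Because each row is copied, `rows` still holds the values exactly as they were entered.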

10

Derived variable

A new data point created from existing data through a calculation.

11

Difference between derived variable and recoded data

Derived variables create new data; recoded data reorganizes existing data.

12

Average lessons per day calculation

Divide lessons viewed by the number of days between the start date and the end date.
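
A small sketch of this derived variable, assuming the course is measured in whole days. The function name and the sample dates are made up for illustration.

```python
from datetime import date


def average_lessons_per_day(lessons_viewed: int, start: date, end: date) -> float:
    """Lessons viewed divided by the number of days from start to end."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end date must be after start date")
    return lessons_viewed / days


# 30 lessons over the 10 days from 1 January to 11 January.
rate = average_lessons_per_day(30, date(2024, 1, 1), date(2024, 1, 11))
print(rate)  # 3.0
```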

13

Purpose of a derived variable

To create a new data point from existing data, often based on a calculation.

14

Storing the formula for derived variables

More space-efficient, as the derived values can be recomputed when needed.

15

When to store derived variable directly in a dataset

If speed matters and frequent recomputation of the variable is inefficient.

16

Advantage of using derived variables

They create new insights from existing data through calculations.

17

Date diff function usage in derived variable creation

To calculate the difference between two dates in a specified unit (for example, days or hours).
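
The idea can be sketched with a hypothetical `date_diff` helper that takes the unit as an argument. This is not the function from any particular tool; date-diff functions in SQL dialects or DAX may count partial units differently.

```python
from datetime import datetime

# Units this hypothetical helper supports, expressed in seconds.
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}


def date_diff(start: datetime, end: datetime, unit: str = "day") -> int:
    """Whole number of `unit`s elapsed from `start` to `end`, rounded down."""
    if unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unsupported unit: {unit}")
    seconds = (end - start).total_seconds()
    return int(seconds // SECONDS_PER_UNIT[unit])


enrolled = datetime(2024, 1, 1, 8, 0)
finished = datetime(2024, 1, 3, 20, 0)
print(date_diff(enrolled, finished, "day"))   # 2
print(date_diff(enrolled, finished, "hour"))  # 60
```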

18

Use of derived variables in online course website

To compute time metrics like lessons per day or total time spent in the course.

19

Goal of data imputation

To estimate missing data based on existing information.

20

Imputed rating calculation in movie ratings

Using the average rating of the user for other movies.

21

Reason for keeping original and imputed ratings separate

To ensure that the imputed values do not overwrite the original data.
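
A minimal sketch of imputing a missing rating from the user's own average while leaving the original column untouched. The sample data and the column names `imputed_rating` and `is_imputed` are assumptions.

```python
from statistics import mean

ratings = [
    {"user": "ana", "movie": "A", "rating": 4.0},
    {"user": "ana", "movie": "B", "rating": 2.0},
    {"user": "ana", "movie": "C", "rating": None},  # missing value
    {"user": "ben", "movie": "A", "rating": 5.0},
]


def impute_with_user_mean(rows):
    """Add `imputed_rating` and `is_imputed` columns; `rating` is not changed."""
    # Collect each user's known ratings.
    known = {}
    for row in rows:
        if row["rating"] is not None:
            known.setdefault(row["user"], []).append(row["rating"])
    averages = {user: mean(values) for user, values in known.items()}

    result = []
    for row in rows:
        missing = row["rating"] is None
        # A user with no known ratings stays None; nothing is invented.
        filled = averages.get(row["user"]) if missing else row["rating"]
        result.append({**row, "imputed_rating": filled, "is_imputed": missing})
    return result


imputed = impute_with_user_mean(ratings)
print(imputed[2])
```

The `is_imputed` flag makes it clear to later readers which values are estimates.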

22

SQL command for user's average rating

Select AVG(rating) and GROUP BY username.
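
One way to try this query is with an in-memory SQLite database from Python. The `ratings` table, its columns, and the sample rows are assumptions; the result doubles as a lookup table of users and their average ratings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (username TEXT, movie TEXT, rating REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [
        ("ana", "A", 4.0),
        ("ana", "B", 2.0),
        ("ana", "C", None),  # missing rating; AVG skips NULL values
        ("ben", "A", 5.0),
    ],
)

query = """
    SELECT username, AVG(rating) AS avg_rating
    FROM ratings
    GROUP BY username
    ORDER BY username
"""
# Each result row is (username, avg_rating), so dict() builds the lookup.
user_averages = dict(conn.execute(query).fetchall())
conn.close()

print(user_averages)  # {'ana': 3.0, 'ben': 5.0}
```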

23

Important consideration for data imputation

The imputation method should not alter the data's original context.

24

Sophisticated method of imputing missing values

Using machine learning algorithms, such as artificial neural networks.

25

Consideration of movie's genre in imputation

To provide a more accurate estimate of the missing value.

26

Benefit of using GROUP BY command in data imputation

Helps create a lookup table with users and their average ratings.

27

Importance of transparency in data imputation

To ensure that others understand the data is estimated, not original.

28

Simplest method for imputing missing numerical data

Replacing the missing value with an average of known values, such as the user's average rating for other movies.

29

Purpose of data reduction

To reduce the size of the data while retaining useful information.

30

Example of aggregation in data reduction

Summing data points for a zip code and representing it as one value.
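
A short sketch of this aggregation, assuming the data points are (zip code, amount) pairs. The sample values and the name `total_by_zip` are made up.

```python
from collections import defaultdict

# Hypothetical (zip_code, amount) data points.
sales = [
    ("10001", 120.0),
    ("10001", 80.0),
    ("94105", 50.0),
    ("10001", 25.0),
    ("94105", 150.0),
]


def total_by_zip(records):
    """Collapse many data points into one summed value per zip code."""
    totals = defaultdict(float)
    for zip_code, amount in records:
        totals[zip_code] += amount
    return dict(totals)


totals = total_by_zip(sales)
print(totals)  # {'10001': 225.0, '94105': 200.0}
```

Five rows become two, at the cost of no longer seeing the individual data points.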

31

Why use aggregation to reduce data set

To simplify data for analysis without losing key information.

32

Main goal of sampling data in reduction

To reduce data size while maintaining accuracy in the results.

33

Potential issue with simple random sampling

It may introduce bias or lead to under-sampling of certain groups.

34

Difference between simple random sampling and stratified sampling

Stratified sampling draws from each subgroup separately, while simple random sampling draws from the whole dataset without regard to subgroups.

35

Importance of breaking data into subgroups in stratified sampling

To ensure the sample accurately represents all demographic groups.
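
The two sampling approaches can be compared with a sketch like the following. The population, the `group` field, the 10% fraction, and the rule of keeping at least one row per subgroup are all assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(rows, fraction, seed=0):
    """Draw from the whole dataset, ignoring subgroups."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * fraction))


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every subgroup identified by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row so a small subgroup never disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# 90 rows in group A and 10 rows in group B.
population = [{"id": i, "group": "A"} for i in range(90)]
population += [{"id": i, "group": "B"} for i in range(90, 100)]

random_counts = Counter(r["group"] for r in simple_random_sample(population, 0.1))
stratified_counts = Counter(r["group"] for r in stratified_sample(population, "group", 0.1))
print("simple random:", dict(random_counts))
print("stratified:   ", dict(stratified_counts))
```

The stratified sample always contains 9 rows from A and 1 from B, while the simple random sample can miss group B entirely depending on the draw.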

36

Advantage of using aggregated data

Reduces data size, making it easier to handle while remaining useful.

37

Situation to use stratified sampling

To ensure every subset of the population is represented in the sample.

38

Benefit of random sampling in data reduction

It reduces the data size while keeping it representative of the entire set.

39

Purpose of data masking

To protect sensitive information from being exposed.

40

Example of personally identifiable information

Social security number.

41

How to mask sensitive information while joining datasets

Using an anonymous identifier or index field.

42

Anonymous identifiers for replacing usernames

Row number from the list of unique usernames.
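
A minimal sketch of this masking step: usernames are replaced by their row number in the list of unique usernames, and the same lookup is applied to both datasets so they can still be joined. The sample datasets and function name are hypothetical.

```python
# Two hypothetical datasets that share a sensitive username field.
ratings = [("ana_lee", "A", 4.0), ("ben_cruz", "A", 5.0), ("ana_lee", "B", 2.0)]
profiles = [("ana_lee", "PH"), ("ben_cruz", "US")]


def build_id_lookup(usernames):
    """Map each unique username to its row number, starting at 1."""
    unique = dict.fromkeys(usernames)  # removes duplicates, keeps first-seen order
    return {name: row_number for row_number, name in enumerate(unique, start=1)}


lookup = build_id_lookup([u for u, *_ in ratings] + [u for u, _ in profiles])

# Apply the same lookup to both datasets so the join key still matches.
masked_ratings = [(lookup[u], movie, rating) for u, movie, rating in ratings]
masked_profiles = [(lookup[u], country) for u, country in profiles]

print(masked_ratings)   # [(1, 'A', 4.0), (2, 'A', 5.0), (1, 'B', 2.0)]
print(masked_profiles)  # [(1, 'PH'), (2, 'US')]
```

The `lookup` table itself is sensitive and would need to be stored separately or discarded.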

43

Importance of removing PII in datasets

To ensure compliance with data protection laws.

44

Issue with anonymizing data using unique identifiers

It might be possible to reconstruct someone's identity based on patterns.

45

Simple method for creating an anonymous identifier

Assigning a unique index number to each individual.

46

Reason for replacing sensitive information with anonymous identifiers

To ensure privacy and protect identities.

47

What does it mean to transpose data

Swapping rows and columns; in this context, breaking down pivoted data into individual records.

48

Difference between pivoted and unpivoted data

Pivoted data is compact for reporting; unpivoted has individual records.

49

Primary function of the ‘unpivot’ command

To break down pivoted data into individual records.
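
What an unpivot does can be sketched in a few lines. This is not the command from any specific tool; the table layout and the names `unpivot`, `month`, and `lessons` are assumptions.

```python
# Pivoted (compact) table: one row per student, one column per month.
pivoted = [
    {"student": "Ana", "Jan": 3, "Feb": 5},
    {"student": "Ben", "Jan": 2, "Feb": 4},
]


def unpivot(rows, id_column, attribute_name="month", value_name="lessons"):
    """Turn every non-id column of every row into its own record."""
    records = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            records.append(
                {id_column: row[id_column], attribute_name: column, value_name: value}
            )
    return records


unpivoted = unpivot(pivoted, "student")
for record in unpivoted:
    print(record)
```

Two pivoted rows with two month columns become four individual records.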

50

Truth about pivoted data

It is compact and useful for reporting purposes.

51

Type of data used in pivoted format

Data that needs to be summarized or reported in a compact form.

52

What does appending data refer to?

Adding the rows of one dataset to the end of another.

53

Difference between inline append and intermediate append

An inline append discards the original datasets, while an intermediate append keeps them.

54

Feature allowing appending data from one table to another

Append queries.

55

Result of an inline append operation

A new dataset is created, and all original datasets are discarded.

56

What does an intermediate append do?

Combines datasets into a new table while keeping the original datasets.
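
One way to picture the two kinds of append is with plain Python lists standing in for tables. The function names and sample rows are hypothetical, and the in-place version is only an analogy for an inline append, not how any specific tool implements it.

```python
jan = [{"user": "ana", "lessons": 3}]
feb = [{"user": "ben", "lessons": 4}]


def intermediate_append(first, second):
    """Return a new combined table; both inputs are left untouched."""
    return first + second


def inline_append(target, other):
    """Add the rows of `other` to `target` itself; its earlier state is gone."""
    target.extend(other)
    return target


combined = intermediate_append(jan, feb)
print(len(combined), len(jan))  # 2 1

working = list(jan)  # copy made only so this demo can show both variants
inline_append(working, feb)
print(len(working))  # 2
```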

57

Example of appending data

Stacking the rows of two datasets into a single dataset.

58

Difference between appending data into a new table and copying data

Appending into a new table creates an entirely new combined dataset, whereas copying only duplicates an existing one.

59

What happens to original datasets during an inline append?

The original datasets are discarded, and only the combined dataset is kept.