
Chapter 6: Transforming Data - Comprehensive Overview

Introduction: Data may be dirty for many reasons: it can be incomplete, outdated, or duplicated, or it can contain entry errors such as typos and misspellings. Understanding and correcting these problems is crucial for maintaining data integrity.

Data Structuring:

  • Aggregate Data: Summarizing data for analysis sacrifices detail, which is why the original data should also be stored in disaggregated form (see the sketch after this list).

  • Data Joining: Combining data from different sources is necessary for comprehensive analysis, especially in ETL processes.

  • Data Pivoting: Rotating data from rows to columns to facilitate analysis in certain applications.
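
A minimal pandas sketch of all three structuring operations, using hypothetical sales and vendor tables (the column names and values are illustrative, not from the chapter):

    import pandas as pd

    # Hypothetical source tables; names and values are illustrative only.
    sales = pd.DataFrame({
        "vendor_id": [1, 1, 2, 2],
        "quarter":   ["Q1", "Q2", "Q1", "Q2"],
        "amount":    [100.0, 150.0, 200.0, 50.0],
    })
    vendors = pd.DataFrame({"vendor_id": [1, 2], "name": ["Acme", "Globex"]})

    # Aggregation: summarize amounts per vendor. Detail is lost here,
    # which is why the disaggregated `sales` table should be kept.
    totals = sales.groupby("vendor_id", as_index=False)["amount"].sum()

    # Joining: combine the summary with vendor names from another source.
    joined = totals.merge(vendors, on="vendor_id", how="left")

    # Pivoting: rotate quarters from rows into columns for side-by-side analysis.
    by_quarter = sales.pivot(index="vendor_id", columns="quarter", values="amount")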

Data Standardization:

  • Parsing: Splitting a single field into multiple fields using pattern recognition; e.g., using Excel to parse "N-0504-2002" into meaningful components (illustrated in the sketch after this list).

  • Concatenation: Combining multiple fields into one; e.g., using Excel to combine fields like product category and number into a single identifier.

  • Handling Cryptic Data Values: Applying decoding schemes, such as lookup tables, to translate values that are otherwise meaningless without context.

  • Misfielded Data Values: Correcting data that is formatted correctly but placed in the wrong field.

  • Consistency: Ensuring uniform storage and formatting of data across systems; e.g., consistent date and monetary formats.
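
A short pandas sketch of parsing, concatenation, decoding, and consistent formatting, built around the chapter's "N-0504-2002" example (the lookup table and column names are assumptions):

    import pandas as pd

    df = pd.DataFrame({"product_code": ["N-0504-2002", "S-0317-2001"]})

    # Parsing: split one field into components using its hyphen-delimited pattern.
    df[["category", "number", "year"]] = df["product_code"].str.split("-", expand=True)

    # Concatenation: rebuild a single identifier from category and number.
    df["identifier"] = df["category"] + df["number"]

    # Cryptic values: decode single-letter codes via a lookup table (assumed meanings).
    df["category_name"] = df["category"].map({"N": "North", "S": "South"})

    # Consistency: store the year with a uniform numeric type across systems.
    df["year"] = pd.to_numeric(df["year"])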

Data Cleaning:

  • De-duplication: Removing duplicate records while preserving the original data; tools such as Excel can help identify candidates (see the sketch after this list).

  • Filtering: Removing unnecessary data; e.g., inactive vendors or unwanted categories.

  • Imputation: Replacing missing values with appropriate substitutes and documenting each substitution for transparency.

  • Contradiction Errors: Resolving conflicts in data descriptions; e.g., different addresses for the same entity.

  • Threshold Violations: Correcting data that falls outside predefined acceptable ranges.

  • Attribute Dependencies: Fixing errors where related data does not match; e.g., incorrect zip codes for cities.

  • Entry Errors: Addressing input mistakes from both human and system sources.
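
A compact pandas sketch of de-duplication, filtering, imputation, and a threshold check on a hypothetical vendor table (the data, the median fill rule, and the zero-balance threshold are all assumptions):

    import pandas as pd

    vendors = pd.DataFrame({
        "vendor":  ["Acme", "Acme", "Globex", "Initech", "Umbrella"],
        "status":  ["active", "active", "inactive", "active", "active"],
        "balance": [120.0, 120.0, 200.0, None, -5.0],
    })

    # De-duplication: drop exact duplicate rows; `vendors` itself is left intact.
    deduped = vendors.drop_duplicates()

    # Filtering: remove out-of-scope records, e.g. inactive vendors.
    active = deduped[deduped["status"] == "active"].copy()

    # Imputation: flag missing balances before filling them, for transparency.
    active["balance_imputed"] = active["balance"].isna()
    active["balance"] = active["balance"].fillna(active["balance"].median())

    # Threshold violation: flag balances outside the acceptable range (assumed >= 0).
    violations = active[active["balance"] < 0]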

Data Validation:

  • Visual Inspection: Manually checking data for inconsistencies or missing elements.

  • Statistical Tests: Using basic statistics to check data integrity and flag anomalies (see the sketch after this list).

  • Sample Audit: Analyzing a sample of the data set to ensure representation and accuracy.

  • Advanced Techniques: Employing complex validation methods to ensure all data is accurate and adheres to business rules.
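
A brief pandas sketch of the statistical-test, sample-audit, and rule-check ideas (the IQR outlier rule, the sample size, and the positivity rule are assumptions chosen for illustration):

    import pandas as pd

    df = pd.DataFrame({"amount": [100, 105, 98, 110, 102, 95, 9_999]})

    # Statistical test: flag outliers with the interquartile-range (IQR) rule,
    # which is robust to the very outliers being tested for.
    q1, q3 = df["amount"].quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]

    # Sample audit: draw a reproducible random sample for manual review.
    sample = df.sample(n=3, random_state=42)

    # Rule check: assert a simple business rule holds for every record.
    assert (df["amount"] > 0).all(), "negative amounts violate the business rule"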
