Unit 2: Data Preprocessing

Why Preprocess the Data?

  • Importance of Data Quality: Data quality is crucial as it must satisfy the requirements of its intended use. Factors include:

    • Accuracy: Data must represent the real-world condition accurately.

    • Completeness: Data should have all the necessary values recorded.

    • Consistency: Data must be consistent across different datasets and systems.

    • Timeliness: Data should be up-to-date and available when needed.

    • Believability: Users must trust the data.

    • Interpretability: Data must be understandable to the users.

Real-World Data Scenarios

  • Example scenario in data analysis for AllElectronics highlights the challenges faced:

    • Observed missing values in key attributes (e.g., items sold, prices).

    • Errors and unusual values reported by users.

  • This illustrates common data quality issues in large databases:

    • Inaccuracy may arise from faulty data entry instruments, human or computer errors, or disguised missing data.

    • Completeness issues can occur due to omitted data for various reasons.

    • Inconsistencies may stem from inconsistent naming conventions or data formats.

Data Quality Perspectives

  • Quality perception may differ based on user roles:

    • Marketing Analyst: May find a database sufficient if 80% of addresses are accurate for target marketing.

    • Sales Manager: May consider the same data inadequate because the remaining inaccurate or outdated addresses affect individual customer records.

  • Timeliness impact:

    • Late submissions from staff can lead to temporary gaps in data that affect accuracy during analysis.

  • Believability and Interpretability: If past errors exist, users may distrust even corrected data.

Data Cleaning

  • Importance of Data Cleaning: Real-world data is often incomplete, noisy, and inconsistent.

  • Methods for Handling Missing Values (a short sketch follows this list):

    1. Ignore the Tuple: Usually done when the class label is missing; ineffective when many attribute values are missing.

    2. Manual Filling: Time-consuming and not feasible for large datasets with many missing values.

    3. Global Constant: E.g., replace missing values with "Unknown"; the mining program may then mistake the constant for a meaningful value, introducing bias.

    4. Central Tendency: Fill missing values with the attribute mean (for symmetric distributions) or median (for skewed distributions).

    5. Class-Based Mean/Median: Use the attribute mean or median computed over records belonging to the same class.

    6. Most Probable Value: Estimate the missing value with regression, decision tree induction, or similar inference techniques.
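
A minimal pandas sketch of strategies 3 to 6 is below. The DataFrame, the branch/items_sold/price attributes, and the decision-tree imputer are illustrative assumptions, not taken from the original notes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical AllElectronics-style sales data with missing prices (NaN).
df = pd.DataFrame({
    "branch":     ["A", "A", "B", "B", "B"],
    "items_sold": [10, 12, 8, 9, 11],
    "price":      [499.0, None, 450.0, None, 470.0],
})

# 3. Global constant: flag missing values with a sentinel (can bias mining).
df["price_const"] = df["price"].fillna(-1)

# 4. Central tendency: fill with the overall mean (median if skewed).
df["price_mean"] = df["price"].fillna(df["price"].mean())

# 5. Class-based mean: fill with the mean of records in the same class (branch).
df["price_class_mean"] = df.groupby("branch")["price"].transform(
    lambda s: s.fillna(s.mean())
)

# 6. Most probable value: predict the missing price from other attributes.
known = df[df["price"].notna()]
model = DecisionTreeRegressor().fit(known[["items_sold"]], known["price"])
missing = df["price"].isna()
df.loc[missing, "price_pred"] = model.predict(df.loc[missing, ["items_sold"]])
print(df)
```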

  • Noise in Data: Random errors can distort measurements.

  • Data Smoothing Techniques:

    • Binning: Sorts values and distributes them into bins, then smooths them locally (see the sketch after this list).

      • E.g., smoothing by bin means or bin medians replaces every value in a bin with that bin's mean or median.

    • Regression: Fits data values to a function (e.g., a regression line), so values can be smoothed toward the fit.

    • Outlier Analysis: Identifies anomalous values (e.g., points falling outside clusters) that may skew results.
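
A minimal sketch of smoothing by bin means with equal-depth bins, using numpy; the price list and the bin depth of 3 are illustrative assumptions.

```python
import numpy as np

# Sorted prices; equal-depth (equal-frequency) bins of 3 values each.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
depth = 3

smoothed = prices.copy()
for start in range(0, len(prices), depth):
    b = slice(start, start + depth)
    # Smoothing by bin means: every value in the bin becomes the bin mean.
    smoothed[b] = prices[b].mean()

print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```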

Data Cleaning Process

  • Discrepancy Detection: Identify and rectify inconsistencies arising from various factors:

    • For example, poorly designed data entry forms, human error, and inconsistent data representations or codes.

  • Use of Metadata: Exploit knowledge about the data (domain, data type, acceptable value range, dependencies) to spot anomalous values.

  • Data Integration:

    • Involves merging data from various stores through careful schema integration, addressing:

      1. Entity Identification Problem: Matching equivalent entities and attributes across schemas (e.g., customer_id in one source vs. cust_number in another).

      2. Redundancy Analysis: Detecting attributes that can be derived from others, e.g., via correlation analysis (see the sketch after this list).
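
A minimal sketch of redundancy analysis for two numeric attributes using the Pearson correlation coefficient (a chi-square test would play the same role for nominal attributes). The attribute values and the 0.9 threshold are assumptions for illustration.

```python
import numpy as np

# Two numeric attributes obtained from hypothetically merged sources.
items_sold = np.array([10, 12, 8, 9, 11, 15], dtype=float)
revenue    = np.array([5000, 6100, 3900, 4500, 5600, 7400], dtype=float)

# Pearson correlation coefficient; values near +1 or -1 suggest that one
# attribute is largely derivable from the other.
r = np.corrcoef(items_sold, revenue)[0, 1]
print(f"correlation = {r:.3f}")
if abs(r) > 0.9:
    print("Highly correlated; one attribute may be removed as redundant.")
```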

Data Reduction Techniques

  • Overview: Aim to obtain a reduced representation of the dataset that is much smaller in volume yet yields nearly the same analytical results.

  • Strategies include:

    • Dimensionality Reduction: Reducing the number of attributes considered.

    • Numerosity Reduction: Replacing large data volumes with smaller representations.

    • Data Compression: Applying transformations to reduce data size without losing essential information.

Specific Techniques

  • Wavelet Transforms: Transform the data vector into wavelet coefficients; retaining only the strongest coefficients gives a compressed approximation that also suppresses noise (a sketch follows).
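
A minimal sketch of one level of the orthonormal Haar wavelet transform written directly in numpy, with a crude compression step that zeroes small detail coefficients; a full multi-level transform (e.g., via the PyWavelets library) would be used in practice. The input vector and threshold are assumptions.

```python
import numpy as np

def haar_step(x):
    """One level of the orthonormal Haar transform: averages and details."""
    pairs = x.reshape(-1, 2)
    averages = pairs.sum(axis=1) / np.sqrt(2.0)
    details = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return averages, details

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
avg, det = haar_step(data)

# Crude compression/denoising: discard small detail coefficients.
det[np.abs(det) < 1.5] = 0.0

# Invert the Haar step to reconstruct an approximation of the data.
recon = np.empty_like(data)
recon[0::2] = (avg + det) / np.sqrt(2.0)
recon[1::2] = (avg - det) / np.sqrt(2.0)
print(recon)  # approximates the original from fewer nonzero coefficients
```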

  • Principal Component Analysis (PCA):

    • Normalizes the input data, then computes k orthonormal vectors (principal components) ordered by the variance they capture; projecting the data onto the leading components reduces dimensionality (see the sketch below).
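
A minimal sketch of PCA with scikit-learn on standardized data; the synthetic 100 x 5 matrix and the choice of two components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))              # 100 tuples, 5 numeric attributes

X_std = StandardScaler().fit_transform(X)  # normalize the data first
pca = PCA(n_components=2)                  # keep 2 orthonormal components
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)       # variance captured per component
```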

  • Attribute Subset Selection: Removes irrelevant or redundant attributes (e.g., by stepwise forward selection, backward elimination, or decision tree induction), improving the efficiency of data mining (a sketch follows).
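
A minimal sketch of attribute subset selection using a simple filter method (scikit-learn's SelectKBest with an ANOVA F-score) rather than the stepwise or decision-tree procedures mentioned above; the synthetic data and k=2 are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 8 attributes, only a few of which carry class information.
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

# Keep the 2 attributes that score highest against the class label.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 2)
print(selector.get_support())  # boolean mask of retained attributes
```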

Transformation Strategies

  • Data Transformation: Involves converting data into forms suitable for mining (a normalization and discretization sketch follows this list):

    1. Smoothing: Removal of data noise through various techniques.

    2. Attribute Construction: Forming new attributes to facilitate analysis.

    3. Aggregation: Summary computations for multi-level data insights.

    4. Normalization: Scaling data to a specific range.

    5. Discretization: Converting numeric attributes into categorical labels.

    6. Concept Hierarchy Generation: Creating a structured hierarchy for data attributes to facilitate deeper insights.
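
A minimal sketch of min-max normalization and equal-width discretization with pandas; the income values and the three bin labels are assumptions.

```python
import pandas as pd

income = pd.Series([12000, 35000, 58000, 73000, 98000], name="income")

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
income_norm = (income - income.min()) / (income.max() - income.min())

# Discretization: replace numeric values with categorical (interval) labels.
income_bin = pd.cut(income, bins=3, labels=["low", "medium", "high"])

print(pd.DataFrame({"income": income,
                    "normalized": income_norm,
                    "bin": income_bin}))
```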
