Importance of Data Quality: Data are of high quality only if they satisfy the requirements of their intended use. Key factors include:
Accuracy: Data must represent the real-world condition accurately.
Completeness: Data should have all the necessary values recorded.
Consistency: Data must be consistent across different datasets and systems.
Timeliness: Data should be up-to-date and available when needed.
Believability: Users must trust the data.
Interpretability: Data must be understandable to the users.
An example scenario, analyzing sales data at AllElectronics, highlights the challenges faced:
Observed missing values in key attributes (e.g., items sold, prices).
Errors and unusual values reported by users.
This illustrates common data quality issues in large databases:
Inaccuracy may arise from faulty data entry instruments, human or computer errors, or disguised missing data.
Completeness issues can occur due to omitted data for various reasons.
Inconsistencies may stem from inconsistent naming conventions or data formats.
Quality perception may differ based on user roles:
Marketing Analyst: May find a database sufficient if 80% of addresses are accurate for target marketing.
Sales Manager: May find the same data inadequate due to inaccuracies.
Timeliness impact:
Late submissions from staff can lead to temporary gaps in data that affect accuracy during analysis.
Believability and Interpretability: If a database contained errors in the past, users may distrust even the corrected data; data that users cannot readily understand is likewise of limited value.
Importance of Data Cleaning: Real-world data tend to be incomplete, noisy, and inconsistent; cleaning routines fill in missing values, smooth out noise, and correct inconsistencies.
Methods for Handling Missing Values:
Ignore the Tuple: Used primarily when the class label is missing.
Manual Filling: Time-consuming and not practical for large datasets.
Global Constant: E.g., replace missing values with "Unknown", which can lead to biases.
Central Tendency: Fill missing values with the attribute mean (for symmetric distributions) or median (for skewed distributions).
Class-Based Mean/Median: Use attribute mean/median for the same class of records.
Most Probable Value: Estimate with regression or decision tree techniques.
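A minimal sketch of the central-tendency and class-based fills above, assuming pandas is available (the "class" and "income" column names and values are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B", "B"],
        "income": [30.0, np.nan, 55.0, np.nan, 65.0],
    })

    # Central tendency: replace missing values with the attribute median.
    df["income_global"] = df["income"].fillna(df["income"].median())

    # Class-based median: fill using the median of tuples in the same class.
    df["income_by_class"] = df["income"].fillna(
        df.groupby("class")["income"].transform("median")
    )
    print(df)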
Noise in Data: Random errors can distort measurements.
Data Smoothing Techniques:
Binning: Sorts values and distributes them into bins (buckets) for local smoothing; see the sketch after this list.
E.g., smoothing by bin means replaces each value in a bin with the bin mean; bin medians or bin boundaries can be used analogously.
Regression: Fits the data to a function; e.g., linear regression finds the best line for predicting one attribute from another, smoothing out random variation.
Outlier Analysis: Identifies anomalies that may skew results.
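A minimal sketch of the binning technique above (smoothing by bin means), assuming pandas; the nine price values are illustrative and the bins are equal-frequency, three values each:

    import pandas as pd

    prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

    # Partition the sorted values into three equal-frequency bins,
    # then replace each value with its bin mean.
    bins = pd.qcut(prices, q=3, labels=False)
    smoothed = prices.groupby(bins).transform("mean")
    print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]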
Discrepancy Detection: Identify and rectify inconsistencies arising from factors such as poorly designed data entry forms, human error, and inconsistent data representations or codes.
Use of Metadata: Exploit knowledge about the data, such as each attribute's domain, acceptable range, and dependencies, to assess trends and flag anomalous values.
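A small sketch of metadata-driven checks, assuming pandas; the declared range for price and the domain for country_code are hypothetical metadata:

    import pandas as pd

    df = pd.DataFrame({"price": [19.99, -5.00, 24.50],
                       "country_code": ["US", "USA", "DE"]})

    metadata = {"price": {"min": 0.0, "max": 10_000.0},
                "country_code": {"domain": {"US", "DE", "FR"}}}

    # Flag tuples whose values violate the declared range or domain.
    bad_price = ~df["price"].between(metadata["price"]["min"], metadata["price"]["max"])
    bad_code = ~df["country_code"].isin(metadata["country_code"]["domain"])
    print(df[bad_price | bad_code])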
Data Integration:
Involves merging data from various stores through careful schema integration, addressing:
Entity Identification Problem: Matching equivalent entities and attributes across datasets (e.g., recognizing that customer_id in one source and cust_number in another refer to the same attribute).
Redundancy Analysis: Using correlation analysis (e.g., the chi-square test for nominal attributes, the correlation coefficient for numeric attributes) to detect attributes that repeat information; a small sketch follows.
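A minimal correlation-based redundancy check, assuming pandas; the two income columns are hypothetical attributes that arrived from different sources:

    import pandas as pd

    df = pd.DataFrame({
        "annual_income":  [48_000, 60_000, 36_000, 72_000],
        "monthly_income": [4_000, 5_000, 3_000, 6_000],
    })

    # Pearson correlation; a value near +1 or -1 suggests one attribute
    # is (nearly) derivable from the other and may be redundant.
    r = df["annual_income"].corr(df["monthly_income"])
    if abs(r) > 0.9:
        print(f"r = {r:.2f}: highly correlated, candidate for removal")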
Data Reduction: Aims to create a smaller, more manageable dataset that produces the same (or nearly the same) analytical results.
Strategies include:
Dimensionality Reduction: Reducing the number of attributes considered.
Numerosity Reduction: Replacing large data volumes with smaller representations.
Data Compression: Applying transformations to reduce data size without losing essential information.
Wavelet Transforms: Project the data onto wavelet coefficients; keeping only the strongest coefficients yields compression and noise reduction while preserving essential features (see the sketch below).
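A sketch of wavelet-based reduction using the PyWavelets package (pywt), which is assumed to be installed and is not part of the original notes; weak detail coefficients are discarded and an approximation is reconstructed:

    import numpy as np
    import pywt

    signal = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

    coeffs = pywt.wavedec(signal, "haar", level=2)       # hierarchical decomposition
    coeffs = [pywt.threshold(c, value=1.5, mode="hard")  # zero out weak coefficients
              for c in coeffs]
    approx = pywt.waverec(coeffs, "haar")                # reduced-detail reconstruction
    print(approx)  # approximately [1.5, 1.5, 1.5, 1.5, 4.0, 4.0, 4.0, 4.0]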
Principal Component Analysis (PCA):
Normalizes the input data, then computes orthonormal vectors (principal components) ordered by decreasing variance; projecting the data onto the first few components reduces dimensionality (sketched below).
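A minimal NumPy sketch of PCA on a tiny hypothetical 2-D dataset: center (normalize) the data, obtain the orthonormal directions of greatest variance via SVD, and keep only the first component:

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                  [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

    X_centered = X - X.mean(axis=0)                  # normalization (centering) step
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt                                  # orthonormal principal directions
    projected = X_centered @ components[:1].T        # project onto the first component
    print(projected.ravel())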
Attribute Subset Selection: Removes irrelevant or redundant attributes so that mining runs on a smaller, more relevant attribute set; a simple filter-style sketch follows.
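A hedged sketch of one simple filter-style selection, assuming pandas; the column names, values, and the 0.6 cutoff are hypothetical, and real systems often use stepwise selection or decision tree induction instead:

    import pandas as pd

    df = pd.DataFrame({
        "age":            [25, 35, 45, 20, 35, 52],
        "income":         [30, 40, 80, 20, 60, 90],
        "cust_id_digit":  [5, 5, 2, 2, 5, 2],   # irrelevant to the target
        "buys_computer":  [0, 0, 1, 0, 1, 1],
    })

    # Rank attributes by absolute correlation with the target; keep the strongest.
    corr = df.drop(columns="buys_computer").corrwith(df["buys_computer"]).abs()
    selected = corr[corr > 0.6].index.tolist()
    print(selected)  # ['age', 'income']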
Data Transformation: Involves adjusting data into suitable formats for mining:
Smoothing: Removal of data noise through various techniques.
Attribute Construction: Forming new attributes from existing ones to aid analysis (e.g., deriving area from height and width).
Aggregation: Summarizing data (e.g., rolling daily sales up to monthly or annual totals) so it can be analyzed at multiple levels of granularity.
Normalization: Scaling attribute values into a specified range, such as [0.0, 1.0]; see the sketch after this list.
Discretization: Converting numeric attributes into interval or conceptual labels (e.g., mapping income onto low, medium, high).
Concept Hierarchy Generation: Creating a structured hierarchy for attributes (e.g., street < city < country) so data can be mined at multiple levels of abstraction.
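A minimal sketch of min-max normalization and discretization from the list above, assuming pandas; the income values and bin boundaries are hypothetical:

    import pandas as pd

    income = pd.Series([12_000, 35_000, 54_000, 73_600, 98_000])

    # Normalization: rescale to the [0.0, 1.0] range.
    normalized = (income - income.min()) / (income.max() - income.min())

    # Discretization: map the numeric values onto conceptual labels.
    labels = pd.cut(income, bins=[0, 30_000, 70_000, float("inf")],
                    labels=["low", "medium", "high"])

    print(normalized.round(3).tolist())  # [0.0, 0.267, 0.488, 0.716, 1.0]
    print(labels.tolist())               # ['low', 'medium', 'medium', 'high', 'high']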