Importance of Data Quality: Data are of high quality only if they satisfy the requirements of their intended use. Key factors include:
Accuracy: Data must represent the real-world condition accurately.
Completeness: Data should have all the necessary values recorded.
Consistency: Data must be consistent across different datasets and systems.
Timeliness: Data should be up-to-date and available when needed.
Believability: Users must trust the data.
Interpretability: Data must be understandable to the users.
An example scenario, analyzing sales data at AllElectronics, highlights the challenges faced:
Observed missing values in key attributes (e.g., items sold, prices).
Errors and unusual values reported by users.
This illustrates common data quality issues in large databases:
Inaccuracy may arise from faulty data entry instruments, human or computer errors, or disguised missing data.
Completeness issues can occur due to omitted data for various reasons.
Inconsistencies may stem from inconsistent naming conventions or data formats.
Quality perception may differ based on user roles:
Marketing Analyst: May find a database sufficient if 80% of addresses are accurate for target marketing.
Sales Manager: May find the same data inadequate due to inaccuracies.
Timeliness impact:
Late submissions from staff can lead to temporary gaps in data that affect accuracy during analysis.
Believability and Interpretability: If a database contained errors in the past, users may distrust even the corrected data; data that users cannot readily understand is likewise of limited value.
Importance of Data Cleaning: Real-world data tend to be incomplete, noisy, and inconsistent; cleaning routines fill in missing values, smooth out noise, and correct inconsistencies.
Methods for Handling Missing Values:
Ignore the Tuple: Used primarily when the class label is missing.
Manual Filling: Time-consuming and not practical for large datasets.
Global Constant: E.g., replace missing values with "Unknown", which can lead to biases.
Central Tendency: Fill missing values with the attribute mean (for symmetric distributions) or median (for skewed distributions).
Class-Based Mean/Median: Use attribute mean/median for the same class of records.
Most Probable Value: Estimate with regression or decision tree techniques.
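A minimal sketch of the central-tendency and class-based fills above, assuming pandas is available (the "class" and "income" column names and values are hypothetical):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "class":  ["A", "A", "B", "B", "B"],
        "income": [30.0, np.nan, 55.0, np.nan, 65.0],
    })

    # Central tendency: replace missing values with the attribute median.
    df["income_global"] = df["income"].fillna(df["income"].median())

    # Class-based median: fill using the median of tuples in the same class.
    df["income_by_class"] = df["income"].fillna(
        df.groupby("class")["income"].transform("median")
    )
    print(df)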
Noise in Data: Random errors can distort measurements.
Data Smoothing Techniques:
Binning: Sorts values and distributes them into bins (buckets) for local smoothing; see the sketch after this list.
E.g., smoothing by bin means replaces each value in a bin with the bin mean; bin medians or bin boundaries can be used analogously.
Regression: Fits the data to a function; e.g., linear regression finds the best line for predicting one attribute from another, smoothing out random variation.
Outlier Analysis: Identifies anomalies that may skew results.
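A minimal sketch of the binning technique above (smoothing by bin means), assuming pandas; the nine price values are illustrative and the bins are equal-frequency, three values each:

    import pandas as pd

    prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

    # Partition the sorted values into three equal-frequency bins,
    # then replace each value with its bin mean.
    bins = pd.qcut(prices, q=3, labels=False)
    smoothed = prices.groupby(bins).transform("mean")
    print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]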
Discrepancy Detection: Identify and rectify inconsistencies arising from factors such as poorly designed data entry forms, human error, and inconsistent data representations or codes.
Use of Metadata: Exploit knowledge about the data, such as each attribute's domain, acceptable range, and dependencies, to assess trends and flag anomalous values.
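A small sketch of metadata-driven checks, assuming pandas; the declared range for price and the domain for country_code are hypothetical metadata:

    import pandas as pd

    df = pd.DataFrame({"price": [19.99, -5.00, 24.50],
                       "country_code": ["US", "USA", "DE"]})

    metadata = {"price": {"min": 0.0, "max": 10_000.0},
                "country_code": {"domain": {"US", "DE", "FR"}}}

    # Flag tuples whose values violate the declared range or domain.
    bad_price = ~df["price"].between(metadata["price"]["min"], metadata["price"]["max"])
    bad_code = ~df["country_code"].isin(metadata["country_code"]["domain"])
    print(df[bad_price | bad_code])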
Data Integration:
Involves merging data from various stores through careful schema integration, addressing:
Entity Identification Problem: Matching equivalent entities and attributes across datasets (e.g., recognizing that customer_id in one source and cust_number in another refer to the same attribute).
Redundancy Analysis: Using correlation analysis (e.g., the chi-square test for nominal attributes, the correlation coefficient for numeric attributes) to detect attributes that repeat information; a small sketch follows.
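A minimal correlation-based redundancy check, assuming pandas; the two income columns are hypothetical attributes that arrived from different sources:

    import pandas as pd

    df = pd.DataFrame({
        "annual_income":  [48_000, 60_000, 36_000, 72_000],
        "monthly_income": [4_000, 5_000, 3_000, 6_000],
    })

    # Pearson correlation; a value near +1 or -1 suggests one attribute
    # is (nearly) derivable from the other and may be redundant.
    r = df["annual_income"].corr(df["monthly_income"])
    if abs(r) > 0.9:
        print(f"r = {r:.2f}: highly correlated, candidate for removal")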
Data Reduction: Aims to create a smaller, more manageable dataset that produces the same (or nearly the same) analytical results.
Strategies include:
Dimensionality Reduction: Reducing the number of attributes considered.
Numerosity Reduction: Replacing large data volumes with smaller representations.
Data Compression: Applying transformations to reduce data size without losing essential information.
Wavelet Transforms: Project the data onto wavelet coefficients; keeping only the strongest coefficients yields compression and noise reduction while preserving essential features (see the sketch below).
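A sketch of wavelet-based reduction using the PyWavelets package (pywt), which is assumed to be installed and is not part of the original notes; weak detail coefficients are discarded and an approximation is reconstructed:

    import numpy as np
    import pywt

    signal = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])

    coeffs = pywt.wavedec(signal, "haar", level=2)       # hierarchical decomposition
    coeffs = [pywt.threshold(c, value=1.5, mode="hard")  # zero out weak coefficients
              for c in coeffs]
    approx = pywt.waverec(coeffs, "haar")                # reduced-detail reconstruction
    print(approx)  # approximately [1.5, 1.5, 1.5, 1.5, 4.0, 4.0, 4.0, 4.0]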
Principal Component Analysis (PCA):
Normalizes the input data, then computes orthonormal vectors (principal components) ordered by decreasing variance; projecting the data onto the first few components reduces dimensionality (sketched below).
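A minimal NumPy sketch of PCA on a tiny hypothetical 2-D dataset: center (normalize) the data, obtain the orthonormal directions of greatest variance via SVD, and keep only the first component:

    import numpy as np

    X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
                  [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

    X_centered = X - X.mean(axis=0)                  # normalization (centering) step
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt                                  # orthonormal principal directions
    projected = X_centered @ components[:1].T        # project onto the first component
    print(projected.ravel())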
Attribute Subset Selection: Removes irrelevant or redundant attributes so that mining runs on a smaller, more relevant attribute set; a simple filter-style sketch follows.
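A hedged sketch of one simple filter-style selection, assuming pandas; the column names, values, and the 0.6 cutoff are hypothetical, and real systems often use stepwise selection or decision tree induction instead:

    import pandas as pd

    df = pd.DataFrame({
        "age":            [25, 35, 45, 20, 35, 52],
        "income":         [30, 40, 80, 20, 60, 90],
        "cust_id_digit":  [5, 5, 2, 2, 5, 2],   # irrelevant to the target
        "buys_computer":  [0, 0, 1, 0, 1, 1],
    })

    # Rank attributes by absolute correlation with the target; keep the strongest.
    corr = df.drop(columns="buys_computer").corrwith(df["buys_computer"]).abs()
    selected = corr[corr > 0.6].index.tolist()
    print(selected)  # ['age', 'income']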
Data Transformation: Involves adjusting data into suitable formats for mining:
Smoothing: Removal of data noise through various techniques.
Attribute Construction: Forming new attributes from existing ones to aid analysis (e.g., deriving area from height and width).
Aggregation: Summarizing data (e.g., rolling daily sales up to monthly or annual totals) so it can be analyzed at multiple levels of granularity.
Normalization: Scaling attribute values into a specified range, such as [0.0, 1.0]; see the sketch after this list.
Discretization: Converting numeric attributes into interval or conceptual labels (e.g., mapping income onto low, medium, high).
Concept Hierarchy Generation: Creating a structured hierarchy for attributes (e.g., street < city < country) so data can be mined at multiple levels of abstraction.
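A minimal sketch of min-max normalization and discretization from the list above, assuming pandas; the income values and bin boundaries are hypothetical:

    import pandas as pd

    income = pd.Series([12_000, 35_000, 54_000, 73_600, 98_000])

    # Normalization: rescale to the [0.0, 1.0] range.
    normalized = (income - income.min()) / (income.max() - income.min())

    # Discretization: map the numeric values onto conceptual labels.
    labels = pd.cut(income, bins=[0, 30_000, 70_000, float("inf")],
                    labels=["low", "medium", "high"])

    print(normalized.round(3).tolist())  # [0.0, 0.267, 0.488, 0.716, 1.0]
    print(labels.tolist())               # ['low', 'medium', 'medium', 'high', 'high']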