Data Analytics: Data Notes
Attributes and Objects
- Data: Collection of data objects and their attributes.
- Attribute:
- A property or characteristic of an object.
- Examples: eye color, temperature.
- Also known as variable, field, characteristic, or feature.
- Object:
- Described by a collection of attributes.
- Also known as record, point, case, sample, entity, or instance.
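A minimal sketch of these definitions, assuming a data set is stored as a list of Python dictionaries. The attribute names and values (id, eye_color, temperature_c) are made up for illustration and are not from the notes.

```python
# Each data object (record) maps attribute names to values.
# All names and values here are illustrative assumptions.
people = [
    {"id": 1, "eye_color": "brown", "temperature_c": 36.6},
    {"id": 2, "eye_color": "blue", "temperature_c": 37.1},
    {"id": 3, "eye_color": "brown", "temperature_c": 36.9},
]

# The attributes (fields/features) are the keys shared by every object.
attributes = sorted(people[0].keys())

# One attribute viewed across all objects is a single column of the data set.
eye_colors = [person["eye_color"] for person in people]

print(attributes)
print(eye_colors)
```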
Data Quality
- Poor data quality negatively affects data processing.
- Poor data quality costs companies a significant percentage of revenue (estimated 10-20%).
- Example: A loan risk detection model built on poor data can lead to incorrect loan decisions.
Data Preprocessing
- Techniques:
- Aggregation
- Sampling
- Dimensionality Reduction
- Feature subset selection
- Feature creation
- Discretization and Binarization
- Attribute Transformation
Types of Attributes
- Nominal:
- Examples: ID numbers, eye color, zip codes.
- Ordinal:
- Examples: rankings (e.g., taste of potato chips), grades, height in {tall, medium, short}.
- Interval:
- Examples: calendar dates, temperatures in Celsius or Fahrenheit.
- Ratio:
- Examples: temperature in Kelvin, length, time, counts.
Properties of Attribute Values
- Distinctness: = ≠
- Order: < >
- Addition: + -
- Multiplication: * /
- Nominal: Distinctness
- Ordinal: Distinctness & Order
- Interval: Distinctness, Order & Addition
- Ratio: All 4 properties
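The cumulative structure above (each attribute type adds one property to the previous type) can be sketched as a lookup. The names PROPERTIES, LEVEL, supported_properties, and supports are hypothetical helpers introduced only for this illustration.

```python
# The four properties from the notes, ordered from weakest to strongest.
PROPERTIES = ["distinctness", "order", "addition", "multiplication"]

# Each attribute type supports the first N properties in that list.
LEVEL = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def supported_properties(attribute_type):
    """Return the properties that are meaningful for an attribute type."""
    return PROPERTIES[:LEVEL[attribute_type]]


def supports(attribute_type, prop):
    """Check whether a property is meaningful for an attribute type."""
    return prop in supported_properties(attribute_type)


print(supported_properties("interval"))
print(supports("ordinal", "addition"))
```

For example, supports("ordinal", "addition") is False, which matches the rule that differences between ordinal values are not meaningful.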
Attribute Types and Operations
- Nominal:
- Description: Values only distinguish; operations include mode, entropy, contingency correlation, χ² test.
- Examples: zip codes, employee ID numbers, eye color, sex.
- Ordinal:
- Description: Values also order objects; operations include median, percentiles, rank correlation, run tests, sign tests.
- Examples: hardness of minerals, {good, better, best}, grades, street numbers.
- Interval:
- Description: Differences between values are meaningful; operations include mean, standard deviation, Pearson's correlation, t and F tests.
- Examples: calendar dates, temperature in Celsius or Fahrenheit.
- Ratio:
- Description: Both differences and ratios are meaningful; operations include geometric mean, harmonic mean, percent variation.
- Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current.
- Note: Categorization attributed to S. S. Stevens
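As a rough illustration of one typical operation per attribute type, the sketch below uses Python's standard statistics module. The sample values and the grade ordering (C below B below A) are assumptions chosen for the example.

```python
import statistics

eye_colors = ["brown", "blue", "brown", "green"]  # nominal
grades = ["C", "A", "B", "B", "A"]                # ordinal
temps_c = [20.0, 22.0, 24.0]                      # interval
lengths_m = [1.0, 2.0, 4.0]                       # ratio

# Nominal: only the most frequent value (mode) is meaningful.
mode_color = statistics.mode(eye_colors)

# Ordinal: the median needs an ordering but no arithmetic on the labels.
GRADE_ORDER = ["C", "B", "A"]  # assumed ordering, worst to best
ranks = [GRADE_ORDER.index(g) for g in grades]
median_grade = GRADE_ORDER[statistics.median_low(ranks)]

# Interval: differences are meaningful, so mean and standard deviation apply.
mean_temp = statistics.mean(temps_c)
stdev_temp = statistics.stdev(temps_c)

# Ratio: ratios are meaningful, so the geometric mean applies (Python 3.8+).
geo_mean_length = statistics.geometric_mean(lengths_m)

print(mode_color, median_grade, mean_temp, stdev_temp, geo_mean_length)
```

Each operation is also valid for the stronger types that follow it: a median can be computed for interval and ratio attributes, but a mean is not meaningful for nominal or ordinal ones.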
Permissible Transformations by Attribute Type
- Nominal:
- Transformation: Any permutation of values.
- Comment: Reassigning employee ID numbers doesn't change data meaning.
- Ordinal:
- Transformation: Order-preserving change of values, where newvalue = f(oldvalue) and f is a monotonically increasing function.
- Comment: Representing {good, better, best} as {1, 2, 3} or {0.5, 1.0, 1.5} is equivalent.
- Interval:
- Transformation: newvalue = a * oldvalue + b, where a and b are constants.
- Comment: Fahrenheit and Celsius scales differ in zero value and unit size.
- Ratio:
- Transformation: newvalue = a * oldvalue.
- Comment: Length can be measured in meters or feet.
- Note: Categorization attributed to S. S. Stevens
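The transformations above can be checked numerically. This is a sketch under stated assumptions: the sample values are made up, and c_to_f and m_to_ft are hypothetical helper names using the standard conversions (F = 9/5 * C + 32, and 1 foot = 0.3048 m).

```python
import math


def c_to_f(celsius):
    """Interval transformation: newvalue = a * oldvalue + b (a = 9/5, b = 32)."""
    return 9.0 / 5.0 * celsius + 32.0


def m_to_ft(meters):
    """Ratio transformation: newvalue = a * oldvalue (a = 1 / 0.3048)."""
    return meters / 0.3048


# Ordinal: two encodings related by an increasing function give the same ordering.
enc_a = {"good": 1, "better": 2, "best": 3}
enc_b = {"good": 0.5, "better": 1.0, "best": 1.5}
ratings = ["best", "good", "better", "good"]
same_order = sorted(ratings, key=enc_a.get) == sorted(ratings, key=enc_b.get)

# Interval: ratios of differences survive the change of scale...
c = [10.0, 20.0, 30.0]
f = [c_to_f(x) for x in c]
diff_ratio_c = (c[2] - c[1]) / (c[1] - c[0])
diff_ratio_f = (f[2] - f[1]) / (f[1] - f[0])
# ...but ratios of the values themselves do not (20/10 versus 68/50).
value_ratio_c = c[1] / c[0]
value_ratio_f = f[1] / f[0]

# Ratio: ratios of values survive a pure rescaling.
m = [1.0, 2.0]
ft = [m_to_ft(x) for x in m]
length_ratio_m = m[1] / m[0]
length_ratio_ft = ft[1] / ft[0]

print(same_order, diff_ratio_c, diff_ratio_f)
print(value_ratio_c, value_ratio_f, length_ratio_m, length_ratio_ft)
```

This is why "twice as hot" is not a meaningful statement in Celsius or Fahrenheit, while "twice as long" holds in both meters and feet.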
Discrete and Continuous Attributes
- Discrete:
- Has a finite or countably infinite set of values.
- Examples: Zip codes, counts, words in a document collection.
- Often represented as integer variables.
- Binary attributes are a special case.
- Continuous:
- Has real numbers as attribute values.
- Examples: Temperature, height, weight.
- Practically, real values are measured and represented with a finite number of digits.
- Typically represented as floating-point variables.
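A small sketch of the distinction, with made-up values: word counts are discrete and stored as integers, while a height is continuous but recorded with a finite number of digits. The variable names and the two-decimal resolution are assumptions for illustration.

```python
from collections import Counter

# Discrete: counts of words in a (tiny, made-up) document are integers.
document = "data object data attribute data"
word_counts = Counter(document.split())

# Binary attributes are a special case of discrete attributes.
contains_data = "data" in word_counts

# Continuous: a height is a real number, but it is measured and stored
# with finite precision (here, an assumed resolution of two decimals).
true_height_m = 1.7532918
recorded_height_m = round(true_height_m, 2)

# Floating-point variables also have finite precision, so exact
# equality checks on continuous values can be surprising.
sum_is_exact = (0.1 + 0.2 == 0.3)

print(word_counts["data"], contains_data, recorded_height_m, sum_is_exact)
```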
Data Quality Problems
- Noise and outliers
- Missing values
- Duplicate data
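Two of these problems, missing values and duplicate data, can be detected with a few lines of plain Python. This is a minimal sketch: the records are invented, None is assumed to mark a missing value, and a duplicate is assumed to mean a record identical in every attribute. Noise and outliers are not covered here.

```python
# Made-up records for illustration only.
records = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},  # missing value (assumed marker: None)
    {"name": "Ana", "age": 34},    # exact duplicate of the first record
]

# Indices of records that have at least one missing attribute value.
missing = [
    i for i, record in enumerate(records)
    if any(value is None for value in record.values())
]

# Indices of records that repeat an earlier record exactly.
seen = set()
duplicates = []
for i, record in enumerate(records):
    key = tuple(sorted(record.items()))  # hashable, order-independent form
    if key in seen:
        duplicates.append(i)
    else:
        seen.add(key)

print(missing)
print(duplicates)
```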
Noise
- Modification of original values.
- Examples: voice distortion on a poor phone,