Data Analytics: Data Notes

Attributes and Objects

  • Data: Collection of data objects and their attributes.
  • Attribute:
    • A property or characteristic of an object.
    • Examples: eye color, temperature.
    • Also known as variable, field, characteristic, or feature.
  • Object:
    • Described by a collection of attributes.
    • Also known as record, point, case, sample, entity, or instance.

Data Quality

  • Poor data quality negatively affects data processing.
  • Poor data quality costs companies a significant percentage of revenue (estimated 10-20%).
  • Example: A loan risk detection model built on poor data can lead to incorrect loan decisions.

Data Preprocessing

  • Techniques:
    • Aggregation
    • Sampling
    • Dimensionality Reduction
    • Feature subset selection
    • Feature creation
    • Discretization and Binarization
    • Attribute Transformation

Types of Attributes

  • Nominal:
    • Examples: ID numbers, eye color, zip codes.
  • Ordinal:
    • Examples: rankings (e.g., taste of potato chips), grades, height in {tall, medium, short}.
  • Interval:
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
  • Ratio:
    • Examples: temperature in Kelvin, length, time, counts.

Properties of Attribute Values

  • Distinctness: = ≠
  • Order: < >
  • Addition: + -
  • Multiplication: * /
  • Nominal: Distinctness
  • Ordinal: Distinctness & Order
  • Interval: Distinctness, Order & Addition
  • Ratio: All 4 properties
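
The nesting above (each attribute type adds one property to the previous type) can be sketched as a small lookup. This is an illustrative encoding only; the names PROPERTIES, LEVELS, supported_properties, and supports are hypothetical and not part of the notes.

```python
# The four properties, in the order the notes list them.
PROPERTIES = ("distinctness", "order", "addition", "multiplication")

# Assumption for illustration: each type supports the first N properties.
LEVELS = {"nominal": 1, "ordinal": 2, "interval": 3, "ratio": 4}


def supported_properties(attribute_type):
    """Return the properties that are meaningful for the given attribute type."""
    return PROPERTIES[:LEVELS[attribute_type]]


def supports(attribute_type, prop):
    """Check whether a property is meaningful for the given attribute type."""
    return prop in supported_properties(attribute_type)


for name in LEVELS:
    print(name, supported_properties(name))
```

For example, supports("ordinal", "addition") is False, which is why averaging ordinal codes is generally not meaningful.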

Attribute Types and Operations

  • Nominal:
    • Description: Values only distinguish; operations include mode, entropy, contingency correlation, χ² test.
    • Examples: zip codes, employee ID numbers, eye color, sex.
  • Ordinal:
    • Description: Values also order objects; operations include median, percentiles, rank correlation, run tests, sign tests.
    • Examples: hardness of minerals, {good, better, best}, grades, street numbers.
  • Interval:
    • Description: Differences between values are meaningful; operations include mean, standard deviation, Pearson's correlation, t and F tests.
    • Examples: calendar dates, temperature in Celsius or Fahrenheit.
  • Ratio:
    • Description: Both differences and ratios are meaningful; operations include geometric mean, harmonic mean, percent variation.
    • Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, current.
  • Note: This categorization of attributes is attributed to S. S. Stevens.
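
As a rough sketch of how the summary statistics above line up with attribute types, the following uses Python's standard statistics module. The sample values and the RANK encoding are made up for illustration.

```python
import statistics

# Nominal: only equality is meaningful, so the mode is a valid summary.
eye_colors = ["brown", "blue", "brown", "green"]
most_common_color = statistics.mode(eye_colors)

# Ordinal: order is meaningful, so the median is valid. Categories are
# encoded as ranks first (RANK is a hypothetical encoding).
RANK = {"good": 1, "better": 2, "best": 3}
ratings = ["good", "best", "better", "best", "good"]
codes = [RANK[r] for r in ratings]
# median_low always returns a value that actually occurs in the data.
median_code = statistics.median_low(codes)
median_rating = next(label for label, code in RANK.items() if code == median_code)

# Interval: differences are meaningful, so mean and standard deviation are valid.
temps_celsius = [10.0, 20.0, 30.0]
mean_temp = statistics.mean(temps_celsius)
stdev_temp = statistics.stdev(temps_celsius)

# Ratio: ratios are meaningful, so geometric and harmonic means are valid.
lengths_m = [1.0, 4.0, 16.0]
geo_mean = statistics.geometric_mean(lengths_m)
harm_mean = statistics.harmonic_mean(lengths_m)

print(most_common_color, median_rating, mean_temp, stdev_temp, geo_mean, harm_mean)
```

Each statistic is also valid for the types further down the list: a ratio attribute such as length has a meaningful mode, median, and mean as well as a geometric mean.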

Attribute Transformations

  • Nominal:
    • Transformation: Any permutation of values.
    • Comment: Reassigning employee ID numbers doesn't change data meaning.
  • Ordinal:
    • Transformation: Order-preserving change of values, where newvalue = f(oldvalue) and f is a monotonic function.
    • Comment: Representing {good, better, best} as {1, 2, 3} or {0.5, 1.0, 1.5} is equivalent.
  • Interval:
    • Transformation: newvalue = a * oldvalue + b, where a and b are constants.
    • Comment: Fahrenheit and Celsius scales differ in zero value and unit size.
  • Ratio:
    • Transformation: newvalue = a * oldvalue.
    • Comment: Length can be measured in meters or feet.
  • Note: This categorization of attributes is attributed to S. S. Stevens.
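
The permitted transformations can be checked numerically. The sketch below uses made-up sample values; the relabel mapping and variable names are hypothetical. It shows what each transformation preserves: equality for nominal, order for ordinal, ratios of differences for interval, and ratios of values for ratio attributes.

```python
# Nominal: a one-to-one relabeling keeps equality comparisons intact.
ids = [101, 102, 101]
relabel = {101: 7, 102: 3}  # hypothetical new ID assignment
new_ids = [relabel[i] for i in ids]
equality_kept = (ids[0] == ids[2]) == (new_ids[0] == new_ids[2])

# Ordinal: a strictly increasing function preserves the ordering.
ranks = [3, 1, 2]
halved = [0.5 * r for r in ranks]
order_before = sorted(range(len(ranks)), key=lambda i: ranks[i])
order_after = sorted(range(len(halved)), key=lambda i: halved[i])
order_kept = order_before == order_after


# Interval: newvalue = a * oldvalue + b (Celsius to Fahrenheit: a = 1.8, b = 32).
def celsius_to_fahrenheit(c):
    return 1.8 * c + 32.0


celsius = [10.0, 20.0, 30.0]
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]
# Ratios of differences survive the transformation...
diff_ratio_c = (celsius[2] - celsius[1]) / (celsius[1] - celsius[0])
diff_ratio_f = (fahrenheit[2] - fahrenheit[1]) / (fahrenheit[1] - fahrenheit[0])
# ...but ratios of the values themselves do not.
value_ratio_c = celsius[1] / celsius[0]
value_ratio_f = fahrenheit[1] / fahrenheit[0]

# Ratio: newvalue = a * oldvalue (meters to feet), which preserves value ratios.
METERS_TO_FEET = 3.28084
meters = [2.0, 6.0]
feet = [METERS_TO_FEET * m for m in meters]
length_ratio_m = meters[1] / meters[0]
length_ratio_ft = feet[1] / feet[0]

print(diff_ratio_c, diff_ratio_f, value_ratio_c, value_ratio_f)
print(length_ratio_m, length_ratio_ft)
```

This is why "20 °C is twice as hot as 10 °C" is not a meaningful statement, while "6 m is three times as long as 2 m" holds in any unit of length.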

Discrete and Continuous Attributes

  • Discrete:
    • Has a finite or countably infinite set of values.
    • Examples: Zip codes, counts, words in a document collection.
    • Often represented as integer variables.
    • Binary attributes are a special case.
  • Continuous:
    • Has real numbers as attribute values.
    • Examples: Temperature, height, weight.
    • Practically, real values are measured and represented with a finite number of digits.
    • Typically represented as floating-point variables.
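
A brief sketch of the practical point above: because floating-point variables store real values with finite precision, continuous attributes are usually compared with a tolerance rather than exact equality. The variable names are illustrative only.

```python
import math

# Discrete attribute: integer counts compare exactly.
word_count_a = 3
word_count_b = 1 + 2
counts_equal = word_count_a == word_count_b

# Continuous attribute: stored with finite precision, so exact equality can fail.
total = 0.1 + 0.2
exactly_equal = total == 0.3
approximately_equal = math.isclose(total, 0.3)

print(counts_equal, exactly_equal, approximately_equal)
```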

Data Quality Problems

  • Noise and outliers
  • Missing values
  • Duplicate data
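
Two of these problems, missing values and duplicate data, are straightforward to detect mechanically. The following is a minimal sketch in plain Python, assuming records are dictionaries and a missing value is stored as None; the sample records and the function names count_missing and find_duplicates are hypothetical.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": 51000},
    {"id": 3, "age": 29, "income": 51000},  # exact duplicate of the previous record
]


def count_missing(rows):
    """Count missing (None) values per attribute."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    return counts


def find_duplicates(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates.append(index)
        else:
            seen.add(fingerprint)
    return duplicates


missing_counts = count_missing(records)
duplicate_indices = find_duplicates(records)
print(missing_counts, duplicate_indices)
```

Exact matching only catches verbatim repeats; near-duplicates (for example, the same person with two spellings of a name) need a similarity measure instead.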

Noise

  • Modification of original values.
  • Examples: voice distortion on a poor phone,