Data

  • correlation does not equal causation

    • correlation: a pattern or relationship in which two things tend to change together

    • causation: one thing directly causes another

  • Metadata

    • data that describes other data

      • example: a digital image may include metadata that describes the size of the image, the number of colors, and the creation date
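
As a rough illustration of the correlation point above, the sketch below computes a Pearson correlation coefficient for two columns. The function name `pearson` and the sample numbers are hypothetical and chosen only for illustration. A coefficient near 1 says the two columns rise together; it does not say that one causes the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: weekly ice cream sales and sunburn cases.
ice_cream_sales = [10, 20, 30, 40, 50]
sunburn_cases = [1, 2, 3, 4, 5]

r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 2))
```

Here both columns are plausibly driven by a third factor (hot weather), which is why a strong correlation alone is not evidence of causation.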

Filtering and Cleaning Data

  • cleaning data

    • process that makes the data uniform without changing its meaning

      • needs to be done when:

        • data is incomplete

        • data is invalid

        • multiple tables are combined into one

      • What leads to “messy” data?

        • users enter the same value as different data types (“two”, 2)

        • users use different abbreviations to represent the same info

        • data may have different spellings or inconsistent capitalization

  • filtering data allows the user to look at a subset of the data
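
The cleaning and filtering ideas above can be sketched in a few lines. Everything here is an assumption made for illustration: the survey rows, the lookup tables `STATE_NAMES` and `NUMBER_WORDS`, and the choice to drop incomplete rows (another reasonable choice would be to keep them and flag them).

```python
# Hypothetical survey rows: the same information entered in inconsistent ways.
raw_rows = [
    {"state": "CA", "pets": "two"},
    {"state": "california", "pets": 2},
    {"state": " Calif. ", "pets": "3"},
    {"state": "NY", "pets": ""},  # incomplete: no pet count
]

# Assumed lookup tables mapping known variants to one uniform value.
STATE_NAMES = {"ca": "CA", "california": "CA", "calif.": "CA", "ny": "NY"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}

def clean_row(row):
    """Return a uniform copy of row, or None if it is incomplete or invalid."""
    state = STATE_NAMES.get(str(row["state"]).strip().lower())
    pets = str(row["pets"]).strip().lower()
    if pets in NUMBER_WORDS:
        count = NUMBER_WORDS[pets]
    elif pets.isdigit():
        count = int(pets)
    else:
        count = None
    if state is None or count is None:
        return None
    return {"state": state, "pets": count}

# Cleaning: make values uniform and drop rows that cannot be repaired.
cleaned = [row for row in map(clean_row, raw_rows) if row is not None]

# Filtering: look at only a subset of the cleaned data.
three_or_more = [row for row in cleaned if row["pets"] >= 3]

print(cleaned)
print(three_or_more)
```

Note that cleaning changes how each value is written ("two" becomes 2, "california" becomes "CA") but not what it means.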

Exploring Two Columns

  • Crosstab chart

    • counts how many times each combination of values from two columns appears

    • useful for:

      • finding the most/least common combinations of values in two columns

      • finding patterns across two columns

      • exploring two columns when one or both are strings

    • not useful:

      • if either column has too many values

  • Scatter

    • shows combinations of values from two columns

      • useful for:

        • seeing patterns and trends between two values

        • numeric data with lots of different values

      • not useful:

        • lots of repeated values
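
To make the crosstab idea concrete, here is a minimal sketch that counts how often each combination of two column values appears. The rows and the column meanings (`pet`, `has_yard`) are hypothetical; `collections.Counter` is from the Python standard library.

```python
from collections import Counter

# Hypothetical rows with two string columns: (pet, has_yard).
rows = [
    ("dog", "yes"),
    ("cat", "no"),
    ("dog", "yes"),
    ("dog", "no"),
    ("cat", "no"),
    ("dog", "yes"),
]

# Each distinct (pet, has_yard) pair becomes one cell of the crosstab.
crosstab = Counter(rows)

# The most common combination of values across the two columns.
most_common_combo, count = crosstab.most_common(1)[0]

# Print the crosstab as a small grid: one row per pet, one column per answer.
pets = sorted({pet for pet, _ in rows})
answers = sorted({answer for _, answer in rows})
print("pet", *answers)
for pet in pets:
    print(pet, *(crosstab[(pet, answer)] for answer in answers))
```

With only two pets and two answers the grid stays small. If either column had hundreds of distinct values, the grid would become too large to read, which is the limitation noted in the list above.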

Big, Open, and Crowdsourced Data

  • big data

    • a broad term for datasets so large or complex that traditional data processing applications are inadequate

  • citizen science

    • scientific research conducted in whole or part by distributed individuals (not always scientists) who contribute relevant data to research using their own computing devices

  • crowdsourcing

    • the practice of obtaining input or information from a large number of people via the internet

  • Open data

    • data that can be freely used, re-used and redistributed by anyone

Machine Learning

“How do machines learn?”

  • Machine Learning

    • an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed

  • data bias

    • data that does not accurately reflect the full population or phenomenon being studied
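
A small sketch of how data bias can distort a result. The population, the group sizes, and the helper `usage_rate` are all made up for illustration. The sample is biased because it is drawn from only one age group.

```python
# Hypothetical population of 100 people: (age_group, uses_app).
population = (
    [("young", True)] * 40
    + [("young", False)] * 10
    + [("older", True)] * 10
    + [("older", False)] * 40
)

def usage_rate(people):
    """Return the fraction of people who use the app."""
    return sum(1 for _, uses_app in people if uses_app) / len(people)

# Biased sample: only young people were surveyed.
biased_sample = [person for person in population if person[0] == "young"]

true_rate = usage_rate(population)       # rate across the full population
sample_rate = usage_rate(biased_sample)  # rate seen in the biased sample

print(true_rate, sample_rate)
```

A system that learns from the biased sample would see a much higher usage rate than the full population actually has.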