Data
correlation does not equal causation
correlation: similarities, patterns
causation: this thing caused that thing
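A minimal sketch of measuring correlation, assuming made-up numbers for ice cream sales and sunburn cases on the same five days (both lists and the `pearson_r` helper are hypothetical, for illustration only). A value near 1 means the two columns rise together; it says nothing about whether one causes the other, since hot weather could be driving both.

```python
from math import sqrt

# Made-up daily values, for illustration only.
ice_cream_sales = [20, 35, 50, 65, 80]
sunburn_cases = [2, 4, 5, 7, 9]

def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

r = pearson_r(ice_cream_sales, sunburn_cases)
print(round(r, 2))  # close to 1: strongly correlated, but not proof of causation
```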
Metadata
data that describes other data
digital image may include metadata that describes the size of the image, number of colors, and creation date
Filtering and Cleaning Data
cleaning data
process that makes the data uniform without changing its meaning
needs to be done when:
data is incomplete
data is invalid
multiple tables are combined into one
What leads to “messy” data?
users enter the same value as different data types (“two” vs. 2)
users use different abbreviations to represent the same info
data may have different spellings or inconsistent capitalization
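The sources of messy data listed above can be sketched in a few lines of Python. This is only an illustration: the rows, the `STATE_NAMES` and `NUMBER_WORDS` lookup tables, and both helper functions are hypothetical. Each value is made uniform without changing its meaning, and anything that cannot be interpreted becomes `None` instead of being guessed.

```python
# Made-up survey rows showing mixed types, abbreviations, and capitalization.
raw_rows = [
    {"state": "CA", "pets": "two"},
    {"state": "california", "pets": "2"},
    {"state": " Calif. ", "pets": 2},
    {"state": "NY", "pets": ""},
]

# Hypothetical lookup tables mapping each variant to one standard form.
STATE_NAMES = {"ca": "CA", "calif.": "CA", "california": "CA", "ny": "NY"}
NUMBER_WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3}

def clean_state(value):
    """Trim spaces, ignore capitalization, and map to a standard abbreviation."""
    return STATE_NAMES.get(str(value).strip().lower())

def clean_count(value):
    """Turn "two", "2", or 2 into the integer 2; return None if incomplete or invalid."""
    text = str(value).strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    if text.isdigit():
        return int(text)
    return None

cleaned = [
    {"state": clean_state(row["state"]), "pets": clean_count(row["pets"])}
    for row in raw_rows
]
```

The first three rows all come out as the same state and the same count, while the last row keeps its state and gets `None` for the missing pet count.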
filtering data allows the user to look at a subset of the data
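A short sketch of filtering, assuming a made-up table of students (the rows and the `filter_rows` helper are hypothetical). The original table is left unchanged; the filter only selects the rows that match a condition.

```python
# Made-up table, for illustration only.
students = [
    {"name": "Ada", "grade": 9, "club": "robotics"},
    {"name": "Ben", "grade": 10, "club": "chess"},
    {"name": "Cy", "grade": 9, "club": "chess"},
    {"name": "Dee", "grade": 11, "club": "robotics"},
]

def filter_rows(rows, condition):
    """Return only the rows for which condition(row) is true."""
    return [row for row in rows if condition(row)]

ninth_graders = filter_rows(students, lambda row: row["grade"] == 9)
ninth_grade_names = [row["name"] for row in ninth_graders]
```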
Exploring Two Columns
Crosstab chart
counts how many times combinations of values appear. Arrows show where that row in the data table would be counted in the chart
useful for:
finding the most/least common combinations of values in two columns
finding patterns across two columns
exploring two columns when one or both are strings
not useful:
if either column has too many values
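The counting a crosstab chart does can be sketched with the standard library. The rows and column names here are made up for illustration; each row is counted once, in the cell for its combination of values.

```python
from collections import Counter

# Made-up rows with two string columns.
rows = [
    {"pet": "dog", "home": "house"},
    {"pet": "cat", "home": "apartment"},
    {"pet": "dog", "home": "house"},
    {"pet": "cat", "home": "house"},
    {"pet": "dog", "home": "apartment"},
    {"pet": "dog", "home": "house"},
]

# Count how many times each (pet, home) combination appears.
counts = Counter((row["pet"], row["home"]) for row in rows)

# Arrange the counts as a grid: one row per pet, one column per home.
pets = sorted({row["pet"] for row in rows})
homes = sorted({row["home"] for row in rows})
table = {pet: {home: counts[(pet, home)] for home in homes} for pet in pets}

most_common_pair, most_common_count = counts.most_common(1)[0]
```

With only two pets and two homes the grid has four cells. A column with hundreds of distinct values would produce hundreds of rows or columns in the grid, which is why a crosstab stops being useful in that case.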
Scatter
show combinations of values from two columns
useful for:
seeing patterns and trends between two values
numeric data with lots of different values
not useful:
lots of repeated values
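One rough way to see why repeated values hurt a scatter chart is to count how many points would be drawn on top of another point. This is only a sketch: the `overlap_share` helper and both datasets are hypothetical.

```python
from collections import Counter

def overlap_share(points):
    """Return the fraction of points that share their exact position with another point."""
    counts = Counter(points)
    stacked = sum(n for n in counts.values() if n > 1)
    return stacked / len(points)

# Made-up numeric data with many different values: every point is visible.
heights_and_weights = [(150.2, 51.0), (162.5, 58.3), (171.1, 66.9), (180.4, 77.5)]

# Made-up ratings with few possible values: most points hide each other.
ratings = [(1, 1), (1, 1), (2, 1), (2, 1), (2, 2), (1, 1)]

spread_out = overlap_share(heights_and_weights)
stacked_up = overlap_share(ratings)
```

When most points overlap, the chart shows a handful of dots no matter how many rows there are, so a crosstab chart is usually the better choice.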
Big, Open, and Crowdsourced Data
big data
a broad term for datasets so large or complex that traditional data processing applications are inadequate
citizen science
scientific research conducted in whole or part by distributed individuals (not always scientists) who contribute relevant data to research using their own computing devices
crowdsourcing
the practice of obtaining input or information from a large number of people via the internet
Open data
data that can be freely used, re-used and redistributed by anyone
Machine Learning
“How do machines learn?”
Machine Learning
an application of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed
data bias
data that does not accurately reflect the full population or phenomenon being studied
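A minimal sketch of checking for data bias, assuming made-up age groups: compare each group's share of the collected sample with its share of the full population. The group names and every number here are hypothetical.

```python
from collections import Counter

# Made-up share of each age group in the full population.
population_share = {"under 30": 0.40, "30 to 59": 0.40, "60 and over": 0.20}

# Made-up sample of 100 responses, collected mostly from younger people.
sample = ["under 30"] * 70 + ["30 to 59"] * 25 + ["60 and over"] * 5

counts = Counter(sample)
sample_share = {group: counts[group] / len(sample) for group in population_share}

# Positive gap: the group is overrepresented. Negative gap: underrepresented.
gaps = {
    group: round(sample_share[group] - population_share[group], 2)
    for group in population_share
}
most_overrepresented = max(gaps, key=gaps.get)
```

A machine learning system trained on this sample would learn mostly from the under 30 group, so its results may not hold for the population as a whole.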