The Data Analysis Process
Step 1: Collect or Choose Data
Gather the data needed for analysis.
Step 2: Clean/Filter
Remove errors and inconsistencies from the data, then keep only the data relevant to the analysis.
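A minimal sketch of the clean/filter step, using plain Python. The survey rows, the field names, the 0 to 120 age range, and the helper names clean and filter_by_city are all made up for illustration.

```python
# Hypothetical raw survey rows; every value here is invented for illustration.
raw_rows = [
    {"name": "Ana", "age": "34", "city": "Lima"},
    {"name": "Ben", "age": "", "city": "Oslo"},      # missing age
    {"name": "Cy", "age": "-5", "city": "Lima"},     # impossible age
    {"name": "Dee", "age": "41", "city": "lima "},   # inconsistent formatting
    {"name": "Eli", "age": "29", "city": "Oslo"},
]


def clean(rows):
    """Drop rows with missing or implausible ages and normalize city names."""
    cleaned = []
    for row in rows:
        try:
            age = int(row["age"])
        except ValueError:
            continue  # age is missing or not a number
        if age < 0 or age > 120:
            continue  # outside the range assumed plausible for this sketch
        city = row["city"].strip().title()
        cleaned.append({"name": row["name"], "age": age, "city": city})
    return cleaned


def filter_by_city(rows, city):
    """Keep only the rows relevant to the question being asked."""
    return [row for row in rows if row["city"] == city]


cleaned_rows = clean(raw_rows)
lima_rows = filter_by_city(cleaned_rows, "Lima")
print([row["name"] for row in lima_rows])  # prints ['Ana', 'Dee']
```

Cleaning runs before filtering so that "lima " is normalized to "Lima" and is not lost when the filter is applied.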
Step 3: Visualize and Find Patterns
Use graphs and charts to look for patterns in the data.
Step 4: Generate New Information
Turn the observed patterns into conclusions or new insights.
Data Vs. Metadata
Data: Information collected for analysis.
Metadata: Data about data including:
Time of data collection
Type of data
Location of data collection
Method of collection
Collector of the data
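One way to picture the difference is to store the data and its metadata side by side. The sketch below assumes a small temperature dataset; the class names Metadata and Dataset and all field values are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Metadata:
    collected_at: datetime  # time of data collection
    data_type: str          # type of data
    location: str           # location of data collection
    method: str             # method of collection
    collector: str          # collector of the data


@dataclass
class Dataset:
    values: list            # the data itself
    metadata: Metadata      # data about the data


# Invented example values, for illustration only.
temperatures = Dataset(
    values=[21.5, 22.1, 19.8],
    metadata=Metadata(
        collected_at=datetime(2024, 5, 1, 9, 30),
        data_type="temperature in degrees Celsius",
        location="school rooftop",
        method="digital thermometer",
        collector="science class",
    ),
)

print(temperatures.values)           # the data
print(temperatures.metadata.method)  # one piece of metadata
```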
Types of Visualizations
Bar Charts:
Can be vertical or horizontal.
Show how often each value occurs; taller/longer bars indicate more frequent values.
Insights from Bar Charts:
Identify the most and least common values, the range of values, and whether a particular value appears at all.
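The frequency counts behind a bar chart can be sketched with collections.Counter from the Python standard library. The pet list and the text_bar_chart helper are assumptions made for illustration.

```python
from collections import Counter

# Invented survey answers.
pets = ["dog", "cat", "dog", "fish", "dog", "cat"]
counts = Counter(pets)


def text_bar_chart(counts):
    """Return one horizontal bar per value, longest bar first."""
    lines = []
    for value, count in counts.most_common():
        lines.append(f"{value:<5} {'#' * count} ({count})")
    return lines


for line in text_bar_chart(counts):
    print(line)

most_common_value = counts.most_common(1)[0][0]
least_common_value = min(counts, key=counts.get)
print(most_common_value, least_common_value)  # prints dog fish
print("bird" in counts)  # prints False: this value is not present
```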
Pie Charts:
Represent each unique value's share of the dataset as a percentage of the whole.
Insights from Pie Charts:
Identify highest/lowest percentages and compare values.
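Each slice of a pie chart is a value's count divided by the total. A minimal sketch, assuming a made-up list of commute answers and a hypothetical helper named percentages:

```python
from collections import Counter


def percentages(values):
    """Map each unique value to its share of the dataset, in percent."""
    counts = Counter(values)
    total = len(values)
    return {value: 100 * count / total for value, count in counts.items()}


# Invented answers to "How do you get to school?"
transport = ["bus", "car", "bus", "bike", "bus", "car", "bus", "walk"]
shares = percentages(transport)

for value, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{value}: {share}%")

largest_slice = max(shares, key=shares.get)
print(largest_slice)  # prints bus
```

The shares always add up to 100, which is why a pie chart only suits data that describes parts of one whole.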
Histograms:
Display the frequency of values within ranges (bins).
Read similarly to bar charts.
Insights from Histograms:
Identify most and least common ranges.
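Unlike a bar chart, a histogram first groups values into ranges and then counts each range. The sketch below assumes integer test scores and a bin width of 10; the scores and the histogram helper are illustrative, not taken from any dataset.

```python
def histogram(values, bin_width):
    """Count how many values fall in each range [start, start + bin_width)."""
    bins = {}
    for value in values:
        start = (value // bin_width) * bin_width
        bins[start] = bins.get(start, 0) + 1
    return dict(sorted(bins.items()))


# Invented test scores.
scores = [55, 62, 67, 71, 74, 78, 79, 83, 88, 95]
bins = histogram(scores, 10)

for start, count in bins.items():
    # The label assumes whole-number scores, so the bin 70 covers 70 to 79.
    print(f"{start}-{start + 9}: {'#' * count}")

most_common_range = max(bins, key=bins.get)
print(most_common_range)  # prints 70
```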
Scatterplots:
Compare two columns of data to find relationships.
Types of relationships: direct (both values increase together), inverse (one increases as the other decreases), or none.
Insights from Scatterplots:
Identify relationships and trends; make predictions.
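What the eye does with a scatterplot can be approximated numerically. This sketch uses the Pearson correlation coefficient to label a relationship and a least-squares line to make a prediction. The study-hours data is invented, the 0.5 cutoff in relationship is an arbitrary assumption, and both helpers assume the values in each column are not all identical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(spread_x * spread_y)


def relationship(r, threshold=0.5):
    """Label a coefficient; the threshold is an arbitrary choice for this sketch."""
    if r >= threshold:
        return "direct"
    if r <= -threshold:
        return "inverse"
    return "none"


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / spread_x
    return slope, mean_y - slope * mean_x


# Invented data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson(hours, scores)
slope, intercept = fit_line(hours, scores)
predicted = slope * 6 + intercept  # predicted score for 6 hours of study

print(relationship(r))
print(round(predicted, 1))
```

A prediction like this only extends the trend seen in the plot; it says nothing about why the two columns move together.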
Correlation Vs. Causation
Correlation: A pattern in which two sets of data tend to change together.
Causation: One event directly brings about another.
Important to remember: CORRELATION DOES NOT EQUAL CAUSATION.
Examples of Correlation with No Causation:
The divorce rate in Maine correlates with per capita margarine consumption, yet neither one causes the other.
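A correlation with no causation often comes from a hidden common cause. The sketch below is a hypothetical illustration with invented numbers: ice cream sales and sunburn cases are both computed from temperature, so they correlate perfectly even though neither affects the other.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(spread_x * spread_y)


# Invented daily temperatures: the hidden common cause.
temperature = [18, 22, 25, 29, 33]

# Both columns depend only on temperature, never on each other.
ice_cream_sales = [2 * t + 10 for t in temperature]
sunburn_cases = [t - 15 for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(round(r, 3))  # prints 1.0
```

A coefficient near 1 here reflects the shared dependence on temperature, not an effect of ice cream on sunburn.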
Big Data:
Collection Method: Data is gathered automatically at very large scale, for example through web scraping, and analyzed with techniques such as data mining.
Problems Solved:
Efficiency in business, disease identification in healthcare, crime prevention, supply chain management, real-time data analysis.
Open Data:
Collection Method: Freely available data with minimal restrictions; sourced from open data repositories.
Problems Solved:
Promotes public oversight, aids in tracking public health risks and environmental hazards.
Crowdsourced Data:
Collection Method: Data contributed by many members of the public and pooled to support decision-making.
Problems Solved:
Similar to those of big data, with a focus on public health and on predictions that inform climate action.
Machine Learning:
Involves algorithms that learn patterns from data and adjust their behavior as they see more of it. Used in everyday applications and in AI more broadly.
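To make "analyze data and adapt" concrete, here is a deliberately tiny sketch: a model with a single weight that is nudged after every example to shrink its error (gradient descent on squared error). The examples, the learning rate, and the epoch count are assumptions chosen for illustration; real systems use far larger models and datasets.

```python
def train(examples, learning_rate=0.01, epochs=200):
    """Learn a weight so that weight * x approximates y."""
    weight = 0.0  # the model starts out knowing nothing
    for _ in range(epochs):
        for x, y in examples:
            error = weight * x - y               # how wrong the current guess is
            weight -= learning_rate * error * x  # adapt to reduce that error
    return weight


# Invented training examples that follow the hidden rule y = 2 * x.
examples = [(1, 2), (2, 4), (3, 6), (4, 8)]
weight = train(examples)

print(round(weight, 3))       # prints 2.0
print(round(weight * 10, 1))  # prediction for an unseen input, prints 20.0
```

The rule y = 2 * x never appears in the code; the weight drifts toward 2 only because the examples point that way, which is also why skewed examples produce a skewed model.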
Limitations and Bias:
Algorithms may reflect human biases if the input data is not diverse.
Bias can occur when certain demographic data is overrepresented in training datasets.
Example of Bias:
Twitter’s image-cropping algorithm was found to favor certain demographics, likely reflecting bias in its training data.
Ways to Mitigate Bias:
Diversify training data by including underrepresented groups.
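A first step toward diversifying a training set is to measure how each group is represented. The sketch below does that and then applies random oversampling, a simple stand-in technique that duplicates examples from smaller groups; it is not the same as collecting new, genuinely diverse data. Group names, image labels, and both helper functions are hypothetical.

```python
import random
from collections import Counter


def representation(samples):
    """Return each group's share of the dataset as a fraction."""
    counts = Counter(group for group, _ in samples)
    total = len(samples)
    return {group: count / total for group, count in counts.items()}


def oversample(samples, seed=0):
    """Repeat random examples from smaller groups until all groups match the largest."""
    rng = random.Random(seed)
    by_group = {}
    for sample in samples:
        by_group.setdefault(sample[0], []).append(sample)
    target = max(len(items) for items in by_group.values())
    balanced = []
    for items in by_group.values():
        balanced.extend(items)
        balanced.extend(rng.choice(items) for _ in range(target - len(items)))
    return balanced


# Invented training set: 8 examples from one group, 2 from another.
training = [("group_a", f"image_{i}") for i in range(8)]
training += [("group_b", f"image_{i}") for i in range(8, 10)]

before = representation(training)
balanced = oversample(training)
after = representation(balanced)

print(before)  # prints {'group_a': 0.8, 'group_b': 0.2}
print(after)   # prints {'group_a': 0.5, 'group_b': 0.5}
```

Oversampling evens out the counts but only repeats what is already there, so adding new examples from underrepresented groups remains the stronger fix.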
Simulations:
Definition:
A simulation models real-world situations/events. Useful for hypothesis testing when real experimentation is impractical or risky.
Usage:
Simulations abstract away the details of complex processes and provide insights that would be difficult to obtain in real life.
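As a small illustration, the sketch below simulates rolling two dice many times to estimate how often they sum to 7, instead of rolling real dice thousands of times. The function name, the trial count, and the fixed seed are assumptions for this example; the dice stand in for any process that is too slow, costly, or risky to repeat for real.

```python
import random


def estimate_sum_probability(target, trials, seed=0):
    """Estimate the chance that two six-sided dice add up to target."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == target:
            hits += 1
    return hits / trials


estimate = estimate_sum_probability(target=7, trials=20_000)
print(f"simulated: {estimate:.3f}, exact: {1 / 6:.3f}")
```

The simulation keeps only what matters (two random numbers from 1 to 6) and ignores everything else about real dice, which is the kind of abstraction that makes a simulation tractable.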