Data Extraction Analysis
1.1 Introduction to Data Analysis
What is Data Analysis?
Data analysis is a multi-step lifecycle involving:
Inspecting: Examining data for quality and relevance.
Cleaning: Identifying and correcting errors, missing values, or corrupt records.
Transforming: Converting data into usable formats (e.g., normalization, aggregation).
Interpreting: Applying statistical and logical techniques to describe and evaluate data to extract valuable insights and make informed decisions.
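The cleaning and transforming steps above can be sketched in a few lines of Python. The sample records, the drop-missing strategy, and min-max normalization are illustrative assumptions; imputing missing values or using z-score scaling are equally valid choices.

```python
# A minimal sketch of the cleaning and transforming steps.
# The records and field names are hypothetical.
records = [
    {"product": "A", "price": 10.0},
    {"product": "B", "price": None},   # missing value to handle
    {"product": "C", "price": 30.0},
]

# Cleaning: drop records with missing prices (one common strategy;
# imputation with the mean or median is another).
clean = [r for r in records if r["price"] is not None]

# Transforming: min-max normalization of price into the [0, 1] range.
prices = [r["price"] for r in clean]
lo, hi = min(prices), max(prices)
normalized = [(p - lo) / (hi - lo) for p in prices]

print(normalized)  # [0.0, 1.0]
```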
Big Data: The 5 Vs
Definition: Big data refers to extremely large, complex datasets that traditional data processing software can't manage effectively.
Characteristics (The 5 Vs):
Volume: The massive scale of data generated (Terabytes to Petabytes).
Velocity: The high speed at which new data is generated and processed (e.g., social media feeds).
Variety: The different formats of data (Structured, Semi-structured, and Unstructured).
Veracity: The accuracy and trustworthiness of the data.
Value: The potential of the data to provide beneficial insights.
Examples of datasets:
Point of sales (POS) transactions across global retail chains.
Covid-19 epidemiological tracking datasets.
IoT sensor data from smart cities or industrial machinery.
Medical history records using Electronic Health Records (EHR).
Importance of Data Analytics
Decision Support: Data analysis reduces uncertainty and provides a factual basis for strategies in business, science, and social sciences.
Pattern Recognition: It reveals hidden correlations, seasonal trends, and behavioral associations.
Analysis Phases:
Exploratory Data Analysis (EDA): Initial data investigation; focus on discovery.
Confirmatory Data Analysis (CDA): Applying statistical rigor to confirm or reject hypotheses.
Symbiotic Relationship: Findings from EDA are often the foundation for CDA. EDA generates the "what," while CDA proves the "why" and provides the statistical significance.
1.1.1 Exploratory Data Analysis (EDA)
Expansion of EDA Tasks
Univariate Analysis: Analyzing one variable at a time (e.g., mean, median, mode).
Bivariate Analysis: Examining the relationship between two variables (e.g., scatter plots, correlation coefficients).
Multivariate Analysis: Exploring complex interactions between three or more variables.
Detection of Anomalies: Using Z-scores or the Interquartile Range (IQR) to find outliers that might skew results.
Dimensionality Reduction: Identifying which variables are truly relevant to the research goal.
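The two outlier rules named above can be sketched with the standard library alone. The data is hypothetical, and the Z-score cutoff of 2 is chosen because the sample is tiny; a cutoff of 3 is more common for large samples.

```python
import statistics

# Hypothetical sample with one obvious outlier (100).
data = [10, 12, 11, 13, 12, 11, 100]

# Z-score rule: flag points more than 2 standard deviations from the mean.
mean = statistics.mean(data)
stdev = statistics.stdev(data)
z_outliers = [x for x in data if abs(x - mean) / stdev > 2]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data
                if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(z_outliers, iqr_outliers)  # [100] [100]
```

Note that both rules agree here, but they need not: the IQR rule is more robust because the outlier itself inflates the mean and standard deviation used by the Z-score rule.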
Visual Tools for EDA
Histograms & Bar Charts: For frequency distribution.
Box Plots: For identifying outliers and understanding quartiles.
Scatter Plots: To visualize correlations.
Heatmaps: To identify patterns in large matrices.
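A histogram is just a visualization of a binned frequency table, which can be computed directly. The exam scores and 10-point bin width below are illustrative assumptions.

```python
from collections import Counter

# Hypothetical exam scores, bucketed into 10-point bins -- the same
# summary a histogram draws as bars.
scores = [55, 62, 67, 71, 74, 76, 78, 81, 83, 95]
bins = Counter((s // 10) * 10 for s in scores)

# Text rendering of the frequency distribution.
for lo in sorted(bins):
    print(f"{lo}-{lo + 9}: {'#' * bins[lo]}")
```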
1.1.2 Confirmatory Data Analysis (CDA)
Deep Dive into CDA Process
Significance Levels (α): Usually set at 0.05 (5%) or 0.01 (1%), representing the risk of rejecting a true null hypothesis.
P-Values: If the p-value < α, the result is considered statistically significant.
Confidence Intervals: A range of values likely to contain the population parameter (e.g., "We are 95% confident the average price is between $10 and $15").
Error Types:
Type I Error (False Positive): Rejecting the null hypothesis when it is actually true.
Type II Error (False Negative): Failing to reject the null hypothesis when it is actually false.
1.1.3 Hypothesis
Types of Hypotheses
Null Hypothesis (H_0): The default position that there is no relationship or effect. It is the statement being tested for rejection.
Alternative Hypothesis (H_1 or H_a): The statement that there is a significant effect or relationship; it is what researchers hope to support.
Hypothesis Examples with Nulls
Question: Do video games improve reaction time?
H_1: Regular gamers have faster reaction times than non-gamers.
H_0: There is no difference in reaction times between gamers and non-gamers.
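Testing this H_0 amounts to comparing the two group means. Below is a sketch of Welch's t statistic using the standard library; the reaction times (in milliseconds, lower = faster) are fabricated for illustration, and a full test would convert t to a p-value.

```python
import statistics

# Hypothetical reaction times in ms (lower is faster).
gamers = [220, 215, 230, 210, 225, 218]
non_gamers = [250, 245, 260, 240, 255, 248]

# Welch's t statistic for the difference in means; under H_0 it is near 0.
m1, m2 = statistics.mean(gamers), statistics.mean(non_gamers)
v1, v2 = statistics.variance(gamers), statistics.variance(non_gamers)
n1, n2 = len(gamers), len(non_gamers)
t = (m1 - m2) / ((v1 / n1 + v2 / n2) ** 0.5)

print(t)  # a large negative t suggests gamers are faster, favoring H_1
```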
1.3 Variables in a Dataset
Classification of Variables
Independent Variable: The variable that is changed or controlled in a scientific experiment to test the effects on the dependent variable.
Dependent Variable: The variable being tested and measured; it changes based on the independent variable.
Control Variables: Factors kept constant to prevent them from influencing the outcome.
1.3.1 Types of Variables & Levels of Measurement
1. Qualitative Data (Categorical)
Nominal: Names or labels with no inherent order (e.g., Blood Type: A, B, AB, O).
Ordinal: Categories with a logical sequence or rank, but the distance between values isn't mathematically defined (e.g., Social Class: Low, Middle, High).
2. Quantitative Data (Numerical)
Discrete: Countable items that cannot be split (e.g., number of cars in a parking lot, which must be a non-negative integer n ∈ {0, 1, 2, …}).
Continuous: Measurable values that can be infinitely divided (e.g., the exact temperature 22.54 °C or weight 70.45 kg). These include:
Interval Scale: Numerical data with no true zero point (e.g., Temperature in Celsius - 0 °C doesn't mean no temperature).
Ratio Scale: Numerical data with a true zero point (e.g., Weight - 0 kg means no weight).
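The nominal/ordinal distinction shows up directly in code: ordinal categories can be sorted via an explicit rank, while nominal categories support only counting. The social-class and blood-type samples below are illustrative assumptions.

```python
from collections import Counter

# Ordinal data: order is meaningful, but the gaps between ranks are not.
# Encoding the rank explicitly lets us sort without pretending they are.
social_class = ["Middle", "Low", "High", "Middle"]
rank = {"Low": 0, "Middle": 1, "High": 2}
ordered = sorted(social_class, key=rank.get)
print(ordered)  # ['Low', 'Middle', 'Middle', 'High']

# Nominal data: no order at all. Counting frequencies is valid;
# averaging blood types is meaningless.
blood = Counter(["A", "O", "O", "AB", "B", "O"])
print(blood.most_common(1))  # [('O', 3)]
```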
1.4 Importance of Correct Data Type Identification
Statistical Accuracy: You cannot calculate the "average" of nominal data like Car Type (you cannot average a Honda and a Toyota).
Algorithm Selection: Machine learning models require specific data types (e.g., most regression