Data science

Data Science Overview
Instructor: Mr. Pratik Joshi

Data and Variables

DATA: Answers to questions or measurements from an experiment, serving as the foundation of any analysis in data science. It encapsulates factual information collected from various sources to provide insight into different phenomena.

VARIABLE: A measurable characteristic that can change or vary among subjects within a dataset (e.g., height, gender). Variables are crucial for statistical analyses as they allow researchers to explore relationships and patterns.

Data Structure:

One row per subject: Each subject or individual in the dataset is represented by a single row.
One variable per column: Each column corresponds to a specific variable, making it easier to view and analyze the data systematically.
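
As a minimal sketch of this layout, the snippet below stores a small dataset with one row per subject and one variable per column. The column names and values are hypothetical and chosen only for illustration.

```python
# Hypothetical dataset: one row per subject, one variable per column.
columns = ["subject_id", "height_cm", "gender"]
rows = [
    (1, 162.5, "F"),
    (2, 178.0, "M"),
    (3, 170.2, "F"),
]

# Turn each row into a record keyed by variable name.
records = [dict(zip(columns, row)) for row in rows]

# Pull out one variable (column) across all subjects (rows).
heights = [record["height_cm"] for record in records]
print(heights)
```

Keeping each variable in its own column is what makes a step like extracting all heights a one-line operation.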

Data Types

Categorical Data

Appears as categories; often qualitative in nature.
Example: Tick-box responses on a questionnaire, such as favourite subject or level of agreement with a statement.
Nominal: Categories with no inherent order (e.g., types of fruit).
Ordinal: Categories with a meaningful order (e.g., ranking systems).

Numerical (Scale) Data

Continuous: Measurements that can assume any value within a range (e.g., height, weight).
Discrete: Counts that take distinct whole-number values (e.g., number of students).

Questionnaire Example for GCSE Maths Pupils

Questions:
- What is your favourite subject? (Nominal)
- Gender: (Binary/Nominal)
- I consider myself to be good at mathematics: (Ordinal)
- Recent GCSE Mathematics exam score: (Scale from 0% to 100%)
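
The questionnaire above could be encoded roughly as follows. This is a sketch only: the answer options in LIKERT_ORDER and the example answers are assumptions, since the notes do not list the actual response options.

```python
# Assumed answer options for the agreement question (not given in the notes).
LIKERT_ORDER = ["Strongly disagree", "Disagree", "Neutral", "Agree", "Strongly agree"]

# Hypothetical answers from one pupil, labelled by measurement level.
response = {
    "favourite_subject": "History",  # nominal: no natural order
    "gender": "Female",              # binary / nominal
    "good_at_maths": "Agree",        # ordinal: ordered categories
    "exam_score_pct": 72.0,          # scale: 0 to 100
}

# Ordinal answers can be ranked by their position in the ordered list.
rank = LIKERT_ORDER.index(response["good_at_maths"])
print(rank)
```

The ordinal answer can be ranked, but the gaps between ranks are not assumed to be equal, which is why it is not treated as a scale variable.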

Populations and Samples

Sampling: Taking a subset (sample) from a larger group (population) to make inferences about the entire population. Effective sampling is crucial for obtaining reliable results.

Key Terms:

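Parameter (Population Mean) vs Statistic (Sample Mean): A parameter describes the whole population (e.g., the population mean) and is usually unknown. A statistic is calculated from a sample (e.g., the sample mean) and is used to estimate the corresponding parameter.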
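
To illustrate the distinction, the sketch below builds a hypothetical population of exam scores, draws a simple random sample, and compares the two means. The population size, sample size, and seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: exam scores for 1,000 pupils.
population = [random.randint(0, 100) for _ in range(1000)]
parameter = statistics.mean(population)  # population mean (usually unknown in practice)

# Simple random sample of 50 pupils, drawn without replacement.
sample = random.sample(population, 50)
statistic = statistics.mean(sample)      # sample mean, used to estimate the parameter

print(f"population mean = {parameter:.2f}, sample mean = {statistic:.2f}")
```

The sample mean will generally be close to, but not equal to, the population mean; that gap is sampling error.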

Basics of Statistics

Definition: The science of collecting, presenting, analyzing, and reasonably interpreting data. Statistics is fundamental for drawing conclusions based on data.

Tasks Include: Summarization, inference, and predicting relationships between variables through analytical methods.

Understanding Data

Definition: Facts, figures, or information collected specifically for analysis purposes. Data can be numerical (quantitative) or qualitative, and it is presented in various formats (graphs, tables, etc.).

Types of Data

Qualitative Data: Examples include gender, hair color, and ethnicity (Nominal Data).
Quantitative Data: Types include counts (Discrete, e.g., number of children) and measurements (Continuous, e.g., height, weight).

Primary vs. Secondary Data

Primary Data: Collected first-hand by the researcher for the purpose at hand; considered raw data. Examples include surveys, interviews, and observational studies.
Secondary Data: Data that has already been collected, and often analyzed, by someone else for another purpose. It is typically available from external sources such as government publications or organizational records, but it may be out of date or a poor fit for the current analysis.

Discrete vs. Continuous Data

Discrete Data: Distinct, countable values with gaps between them (e.g., shoe sizes, number of participants).
Continuous Data: Values that can fall anywhere within a range, limited only by the precision of measurement (e.g., height, weight).

Data Presentation

Types:

Graphical: Includes visual representations like bar diagrams, box plots, and histograms that highlight trends, outliers, and patterns in data.
Numerical: Summarizes data using measures of central tendency and dispersion.

Measures of Central Tendency

Mean: The average value computed by summing all values and dividing by the count.
Median: The middle value obtained from an ordered dataset, providing a measure of center that is less affected by outliers.
Mode: The value that appears most frequently in the dataset.
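
The three measures can be computed with Python's standard statistics module. The scores below are hypothetical, and one outlier (95) is included to show how the mean and median react differently.

```python
import statistics

# Hypothetical scores; 95 is an outlier.
scores = [52, 55, 55, 58, 60, 61, 95]

mean = statistics.mean(scores)      # sum of values divided by the count
median = statistics.median(scores)  # middle value of the sorted data
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)
```

Here the outlier pulls the mean above the median, while the median stays at the middle score.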

Measures of Dispersion

Definition: The degree to which data values vary around an average, which helps to understand data reliability and variability.

Types of Measures of Dispersion:

Absolute Measures: Expressed in the units of the original data (e.g., range, standard deviation) or in those units squared (variance).
- Range: The difference between the maximum and minimum values in a set.
Relative Measures: Unit-free ratios that express variability relative to a reference value such as the mean (e.g., the coefficient of variation, which is the standard deviation divided by the mean).
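
As a sketch of the difference, the snippet below computes one absolute measure (the range) and one relative measure. The coefficient of variation is used here as a common example of a relative measure; the heights are hypothetical.

```python
import statistics

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical values

# Absolute measure: range, in the same unit as the data (cm).
value_range = max(heights_cm) - min(heights_cm)

# Relative measure: coefficient of variation, a unit-free ratio
# (population standard deviation divided by the mean).
cv = statistics.pstdev(heights_cm) / statistics.mean(heights_cm)

print(value_range, round(cv, 4))
```

Because the coefficient of variation has no unit, it can be used to compare the spread of variables measured on different scales.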

Variance and Standard Deviation

Variance (σ²): The average of the squared deviations from the mean, indicating how spread out the data are.
Standard Deviation (σ): The square root of the variance, expressed in the same units as the data. It indicates how far data points typically lie from the mean.
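
The calculation can be written out step by step, as in the sketch below. The data are hypothetical. The sample variance, with its n - 1 divisor, is not mentioned in the notes and is shown only for contrast with the population formula.

```python
import math

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values

mean = sum(data) / len(data)
squared_devs = [(x - mean) ** 2 for x in data]

pop_variance = sum(squared_devs) / len(data)           # sigma squared: divide by N
sample_variance = sum(squared_devs) / (len(data) - 1)  # s squared: divide by n - 1
pop_sd = math.sqrt(pop_variance)                       # back in the original units

print(pop_variance, pop_sd)
```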

Hypothesis Testing

Objective Method: A structured approach to making inferences about population parameters based on the analysis of sample data.

Framework: It involves setting up two hypotheses:
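- Null (H0): The default statement, typically that no effect or relationship exists, which is assumed true unless the sample provides strong evidence against it (e.g., the average number of TVs in homes is at least 3, written H0: μ ≥ 3).
- Alternative (H1): The competing claim that the test looks for evidence to support (e.g., the average is less than 3, written H1: μ < 3).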
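
For the TV example, one possible approach is a one-sample t test. The notes do not name a specific test, so this choice is an assumption, and the sample values are hypothetical. The sketch computes only the test statistic.

```python
import math
import statistics

# Hypothetical sample: number of TVs counted in 10 homes.
tvs = [2, 3, 1, 2, 4, 2, 3, 2, 1, 2]
mu0 = 3  # boundary value from H0: mean >= 3

n = len(tvs)
sample_mean = statistics.mean(tvs)
sample_sd = statistics.stdev(tvs)  # sample standard deviation (n - 1 divisor)

# One-sample t statistic; large negative values favour H1: mean < 3.
t_stat = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
degrees_of_freedom = n - 1

print(round(t_stat, 3), degrees_of_freedom)
```

The statistic would then be compared with a t distribution on n - 1 degrees of freedom, using a table or a statistics library, to decide whether to reject H0.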

Chi-Squared Test

A statistical method used to assess the association between two categorical variables. The test compares the observed frequencies with the frequencies expected if the variables were unrelated, and judges whether the discrepancies are larger than chance alone would explain.
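
The comparison of observed and expected frequencies can be sketched for a 2x2 table as below. The counts and variable labels are hypothetical, and no continuity correction is applied.

```python
# Hypothetical observed counts: rows = gender, columns = likes maths (yes, no).
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count if the variables are unrelated:
# row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print(chi_squared, degrees_of_freedom)
```

A larger statistic means the observed counts sit further from what independence would predict; it is judged against a chi-squared distribution with the stated degrees of freedom.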

Analyzing Relationships

Correlation Coefficient

Measures the strength and direction of the linear relationship between two continuous variables. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship.

Interpretation of the Correlation Coefficient:

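The bands apply to the absolute value of the coefficient, |r|; the sign gives only the direction.

Weak: |r| below 0.3
Moderate: |r| from 0.3 to below 0.5
Strong: |r| from 0.5 to below 0.9
Very Strong: |r| from 0.9 to 1.0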
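
A minimal sketch of the calculation and its interpretation follows. It assumes the Pearson coefficient is meant and applies the 0.3, 0.5, and 0.9 cut-offs to the absolute value of r, so that negative correlations are labelled the same way as positive ones. The data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Label the size of r using the 0.3 / 0.5 / 0.9 cut-offs on |r|."""
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size < 0.5:
        return "moderate"
    if size < 0.9:
        return "strong"
    return "very strong"

hours = [1, 2, 3, 4, 5]        # hypothetical revision hours
scores = [52, 55, 61, 60, 70]  # hypothetical exam scores
r = pearson_r(hours, scores)

print(round(r, 3), strength(r))
```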

Linear Regression

Analyzes and models relationships between two scale variables, allowing predictions based on input data.

Residuals: The difference between observed and predicted values, which helps in evaluating the model accuracy.
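
As an illustration, the sketch below fits a straight line by ordinary least squares and then computes the residuals. The notes do not name a fitting method, so least squares is an assumption, and the data are hypothetical.

```python
hours = [1, 2, 3, 4, 5]        # hypothetical input variable
scores = [52, 55, 61, 60, 70]  # hypothetical output variable

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares slope and intercept for the line: score = intercept + slope * hours.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / sum(
    (x - mean_x) ** 2 for x in hours
)
intercept = mean_y - slope * mean_x

# Residual = observed value minus predicted value.
predicted = [intercept + slope * x for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

print(slope, intercept)
```

With a least-squares fit that includes an intercept, the residuals sum to zero, so their pattern is more informative than their total.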

Model Evaluation

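Use R² to determine how well a model explains the variability in the output data. R² is the proportion of the variance in the output that the model accounts for, so for a fitted linear regression it lies between 0 and 1. Higher values indicate a better explanatory model.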
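
R² can be computed from observed and predicted values as one minus the ratio of residual variation to total variation. In the sketch below, the function name r_squared and the data are hypothetical; the predictions are taken to come from a fitted straight line.

```python
def r_squared(observed, predicted):
    """Proportion of the variation in observed values explained by the predictions."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot

observed = [52, 55, 61, 60, 70]             # hypothetical exam scores
predicted = [51.4, 55.5, 59.6, 63.7, 67.8]  # hypothetical fitted values

r2 = r_squared(observed, predicted)
print(round(r2, 3))
```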

Multiple Regression

Incorporates multiple independent variables to understand relationships while controlling for other factors, enhancing the model's predictive validity.
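
A rough sketch of a multiple regression fit, using only the standard library, is given below. It solves the least-squares normal equations by Gaussian elimination. The function names, variables, and data are hypothetical; in practice a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_multiple_regression(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from: score = 40 + 5 * hours + 2 * attendance
predictors = [(1, 5), (2, 3), (3, 8), (4, 6), (5, 9)]
scores = [40 + 5 * h + 2 * a for h, a in predictors]

coefficients = fit_multiple_regression(predictors, scores)
print([round(c, 3) for c in coefficients])
```

Each coefficient is read as the expected change in the output for a one-unit change in that variable while the other variables are held fixed.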