Data Science: Problem Formulation and Data Concepts (Vocabulary)
Key Concepts: Defining the Problem and Problem Statement
- A well-stated problem guides data collection, analysis methods, and stakeholder communication.
- Core components of a problem statement:
- Context: background to show relevance and urgency.
- Specificity: clearly define the problem in understandable terms.
- Objectives (Measurable): specify what success looks like with metrics or outcomes.
- Structuring a problem statement:
- What is the problem?
- Why is it important?
- What are the expected outcomes?
- Examples of problem statements in Data Science:
- Customer Churn Prediction: reduce churn by 15% over the next quarter.
- Sales Forecasting: predict monthly sales for the next 12 months to improve inventory management.
- Product Recommendation: increase user engagement by recommending products based on past behavior.
- A true data science problem can: categorize/group data, identify patterns, identify anomalies, show correlations, predict outcomes, or recommend actions.
- Good problem statements are: clear, concise, and measurable.
- Steps to formulate a problem:
- Understand goals & expectations: identify objectives, pain points, available data, potential benefits, risks.
- Translate goals into data analysis goals (what you will try to predict, classify, or discover).
- Frame the problem: write a clear, concise, and measurable statement (often 1–2 sentences).
- Distinguishing good vs. bad statements:
- Good: specific, actionable, and measurable.
- Bad: vague, broad, or non-actionable.
Understanding the Data
- Data basics:
- Data are the raw facts; information is data that have been organized and processed; knowledge is information applied to decisions; value comes from using knowledge to make decisions.
- Data types include numerical and non-numerical; data come from many sources and formats.
- Data can be large or small; even small data can yield insights.
- Data preparation is a key decision step before analysis.
- Data preparation is about turning raw data into information and knowledge to support decisions.
Population vs Sample; Inference
- Population: all objects of interest. Often not all can be collected due to cost or feasibility.
- Sample: a subset of the population used for analysis.
- Population parameter: a summary measure of the population (usually unknown).
- Sample statistic: a summary measure calculated from the sample.
- Statistical inference: drawing conclusions about the population from the sample; requires unbiased sampling.
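The parameter/statistic distinction can be demonstrated with the standard library. In the sketch below the population is simulated so the parameter is known for comparison; in practice it is unknown and the sample statistic is our estimate of it:

```python
import random
import statistics

# Simulated population (in practice the full population is unavailable,
# so the parameter mu would be unknown).
random.seed(42)
population = [round(random.gauss(100, 15), 2) for _ in range(10_000)]
mu = statistics.mean(population)      # population parameter

# Unbiased simple random sample of 200 observations.
sample = random.sample(population, 200)
x_bar = statistics.mean(sample)       # sample statistic, estimates mu

print(f"parameter mu ~ {mu:.2f}, statistic x_bar ~ {x_bar:.2f}")
```

With unbiased sampling, the sample mean lands close to the population mean; inference quantifies how close.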
Big Data: Characteristics and Considerations
- There is no universal definition; key perspective involves:
- Volume: huge data quantities that require scalable storage/processing (e.g., on the order of 10^6 to 10^12 records).
- Velocity: data generated/updated rapidly.
- Variety: structured, semi-structured, and unstructured data.
- Veracity: data quality and reliability.
- Value: the insight and benefit extracted from the data; requires a methodological plan for forming questions and drawing conclusions.
- Big data often requires parallel/distributed processing and advanced feature selection/dimensionality reduction.
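The parallel/distributed idea can be shown in miniature: split the data into chunks, compute a partial result per chunk (the part a cluster would run concurrently across machines), then combine the partials. A single-machine sketch:

```python
# Map-reduce in miniature: in a real distributed system each chunk
# would be processed by a separate worker or machine.
def mapper(chunk):
    return sum(chunk)  # partial result for one chunk

n = 1_000_000
chunks = [range(i, min(i + 100_000, n)) for i in range(0, n, 100_000)]
partials = [mapper(c) for c in chunks]  # in a cluster, these run in parallel
total = sum(partials)                   # the "reduce" step combines partials
```

The same split/partial/combine pattern underlies frameworks built for volume and velocity.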
- Time Series Data:
- Observations of a single subject across multiple time periods.
- Main focus: a single variable over time.
- Example: profits of an organization over T years.
- Cross-Sectional Data:
- Observations of many subjects at the same point in time.
- Main focus: multiple variables at one point in time.
- Example: maximum temperatures of multiple cities on a single day.
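The two layouts differ in what indexes the observations; a minimal sketch with hypothetical values:

```python
# Time series: one subject (a single company) across multiple periods;
# the index is time. Values are hypothetical (profit in $M).
profit_by_year = {2020: 1.2, 2021: 0.8, 2022: 1.5, 2023: 1.9}

# Cross-sectional: many subjects (cities) at one point in time;
# the index is the subject. Hypothetical max temps (C) on a single day.
max_temp_c = {"Phoenix": 42, "Denver": 29, "Boston": 25}

years_observed = len(profit_by_year)  # T periods, one variable
cities_observed = len(max_temp_c)     # one period, many subjects
```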
Structured vs Unstructured Data
- Structured Data:
- Highly organized, easy to analyze; can be stored in rows/spreadsheets/databases.
- Examples: customer phone numbers, names, ZIP codes.
- Unstructured Data:
- Not easily organized into rows; harder to collect/analyze.
- Examples: emails, video files, images, social media data.
- Transforming unstructured data into structured data:
- Data cleaning
- Extract meaning with analysis tools (e.g., NLP or LLMs such as ChatGPT/OpenAI models): summaries, sentiments, themes, keywords
- Store the results in tidy data frames for analysis
- Unstructured data sources: emails, orders, ratings, social media, chats.
- Structured data examples: files, lists, measurements; tidy data frames facilitate analysis.
Data Restructuring and Analysis Examples
- Summarization: condense large text to key points.
- Sentiment analysis: determine tone of text (positive/negative/neutral).
- Thematic analysis: identify common themes/trends.
- Keyword extraction: identify important terms for classification or indexing.
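In practice these tasks use NLP libraries or LLMs; a toy standard-library sketch (the stopword list and sentiment lexicons below are made up for illustration) shows the shape of keyword extraction and sentiment scoring:

```python
import re
from collections import Counter

review = ("The delivery was fast and the product quality is great, "
          "but the packaging was damaged and support was slow.")

# Keyword extraction (toy): most frequent non-stopword tokens.
stopwords = {"the", "was", "and", "is", "but", "a", "to"}
tokens = re.findall(r"[a-z]+", review.lower())
counts = Counter(t for t in tokens if t not in stopwords)
keywords = [word for word, _ in counts.most_common(5)]

# Sentiment (toy): net count of words from tiny hand-made lexicons.
positive = {"fast", "great"}
negative = {"damaged", "slow"}
score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

The structured outputs (keywords, a sentiment label) are exactly what gets stored in a tidy data frame downstream.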
The Data in Our Class: Tidy Data
- Tidy data: each column is a distinct variable; each row is a unique observation; each cell contains a single value.
- This structure simplifies downstream analysis and modeling.
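The tidy rules can be sketched with a plain list of dicts (hypothetical student records): one dict per row/observation, one key per column/variable, one value per cell:

```python
# Tidy data: one row per student, one column per variable, one value per cell.
# Names and values are hypothetical.
students = [
    {"student_id": 1, "gpa": 3.4, "sleep_hours": 7.0, "study_hours": 12},
    {"student_id": 2, "gpa": 2.9, "sleep_hours": 5.5, "study_hours": 6},
    {"student_id": 3, "gpa": 3.8, "sleep_hours": 8.0, "study_hours": 15},
]

# Because every row carries the same variables, column-wise analysis is trivial.
mean_gpa = sum(s["gpa"] for s in students) / len(students)
```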
Variables, Observational Units, and Observations
- Variables (features): stored in columns; describe the subject.
- Observational units (cases): stored in rows; the individual subjects in the data.
- Observations/Values: the data in the cells; the measurements for each observational unit across variables.
- Example: a tidy data set with one row per student, one column per variable (e.g., GPA, sleep hours, etc.).
Types of Variables
- Categorical (Qualitative) Variables:
- Ordinal: natural order (e.g., education level, class standing).
- Nominal: categories without a natural order (e.g., gender, eye color).
- Identifier (special case of Nominal): used for identification, not for analysis (e.g., Student ID).
- Categorical Numbers: codes for categories (e.g., 0/1 for No/Yes) but not quantitative values.
- Numerical (Quantitative) Variables:
- Discrete: counts taking countable values (e.g., number of parts damaged).
- Continuous: measurements on a continuum (e.g., height, weight, GPA).
Data Types and Measurement Scales
- Data types mapping:
- Text: categorical variables containing words (Nominal or Ordinal).
- Integer: quantitative or categorical without decimals (Discrete or Ordinal).
- Float: quantitative with decimals (Continuous).
- Boolean: two possible values (Nominal/Binary).
- Four scales of measurement (how data are measured):
- Ratio: true zero exists; differences meaningful; ratios meaningful; e.g., profits.
- Interval: differences meaningful; zero is arbitrary; ratios not meaningful; e.g., temperature in Celsius.
- Ordinal: ordered categories; differences not necessarily equal; e.g., performance levels.
- Nominal: categories with no inherent order; e.g., gender, color.
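A hypothetical survey record illustrating how Python data types map onto variable types and scales:

```python
# Hypothetical record: each Python type maps to a variable type and scale.
record = {
    "name": "Ada",      # str   -> text, categorical (nominal)
    "seat": "Front",    # str   -> text, categorical (ordinal: Front/Middle/Back)
    "siblings": 2,      # int   -> quantitative discrete
    "gpa": 3.71,        # float -> quantitative continuous
    "temp_c": 21.5,     # float -> continuous on an interval scale (0 C is arbitrary)
    "enrolled": True,   # bool  -> binary (nominal)
}

# Interval vs ratio: 20 C is not "twice as hot" as 10 C (no true zero),
# whereas a $20 profit really is twice a $10 profit (ratio scale).
```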
Putting It All Together: The Problem Example (Student GPA)
- Scenario: A university wants to understand factors influencing GPA at graduation and predict GPA from habits for advisement.
- Data needs: collect data on student habits, sleep, study time, etc., to model GPA at graduation.
- Target (Dependent) Variable: GPA upon graduation (the outcome to understand/predict).
- Also called: target, response, or dependent variable.
- Independent (Explanatory/Predictor) Variables: factors that influence GPA (e.g., sleep, study hours, exercise, seat location, etc.).
- Data dictionary: metadata detailing variable names, descriptions, data types, units, and scale.
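Separating the target from the predictors (and excluding identifiers) can be sketched over one tidy row, with hypothetical column names:

```python
# One tidy row from the hypothetical student survey.
row = {"student_id": 7, "sleep": 7.5, "study": 10.0, "exercise": 3.0, "gpa": 3.5}

target = "gpa"                # dependent / response variable
identifiers = {"student_id"}  # kept for joining records, excluded from analysis

# Independent (predictor) variables: everything that is not target or identifier.
X = {k: v for k, v in row.items() if k != target and k not in identifiers}
y = row[target]
```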
Data Collection and the Data Dictionary
- Data collection steps:
- Gather data from available private sources.
- Explore publicly available data for relevance.
- If needed, collect new data to enhance robustness.
- Data Dictionary (Metadata Table): a summary describing the dataset, including:
- Variable Name, Description
- General Type (Nominal, Ordinal, etc.) and Specific Type
- Data Types and Measurement Units
- Example structure (Student Survey at Data University):
- Student ID #: Identifier, Nominal
- Sex: Categorical (Male/Female), Nominal
- Sleep: Average Hours per Night; Quantitative Continuous
- Alcohol: Average Drinks per Week; Quantitative Continuous
- Exercise: Average Hours per Week; Quantitative Continuous
- TV: Average Hours per Week; Quantitative Continuous
- Study: Average Hours Studying per Week; Quantitative Continuous
- Seat: Front/Middle/Back; Categorical Ordinal
- GPA: GPA of the Student; Quantitative Continuous
- The data dictionary helps ensure consistent interpretation and analysis.
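A data dictionary can itself be stored as tidy data, one row per variable; a sketch of part of the survey's dictionary (entries abbreviated, content from the example above):

```python
# Data dictionary as data: one row per variable in the survey.
data_dictionary = [
    {"variable": "Student ID", "description": "Unique student identifier",
     "general": "Categorical", "specific": "Identifier (Nominal)", "units": None},
    {"variable": "Sleep", "description": "Average hours of sleep per night",
     "general": "Quantitative", "specific": "Continuous", "units": "hours/night"},
    {"variable": "Seat", "description": "Seat location in class",
     "general": "Categorical", "specific": "Ordinal", "units": None},
    {"variable": "GPA", "description": "GPA of the student",
     "general": "Quantitative", "specific": "Continuous", "units": "grade points"},
]

# Quick pre-modeling check: which variables are quantitative?
quantitative = [r["variable"] for r in data_dictionary
                if r["general"] == "Quantitative"]
```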
Quick Recall Tips
- Always frame problems as: What is the problem, Why it matters, and What measurable outcomes are expected.
- Distinguish target variable from predictors early (GPA vs. habits).
- Remember tidy data rules: one variable per column, one observation per row, one value per cell.
- Distinguish variable types to choose appropriate analysis methods (e.g., regression for quantitative variables, classification for categorical variables).
- Use data dictionaries to document data provenance and meaning before modeling.
Quick References (key takeaways)
- Problem statement structure: What is the problem? Why is it important? What outcomes are expected? (1–2 sentences when framing).
- Data science problems can involve classification, prediction, clustering, anomaly detection, or recommendations.
- Cross-sectional vs time series data differ in focus and analysis approach.
- Four scales of measurement (Nominal, Ordinal, Interval, Ratio) guide how you can analyze data.
- Distinguish target vs predictor variables clearly before modeling.
- Data cleaning and structuring are as important as modeling for reliable results.