Data Science: Problem Formulation and Data Concepts (Vocabulary)
Key Concepts: Defining the Problem and Problem Statement
- A well-stated problem guides data collection, analysis methods, and stakeholder communication.
- Core components of a problem statement:
- Context: background to show relevance and urgency.
- Specificity: clearly define the problem in understandable terms.
- Objectives (Measurable): specify what success looks like with metrics or outcomes.
- Structuring a problem statement:
- What is the problem?
- Why is it important?
- What are the expected outcomes?
- Examples of problem statements in Data Science:
- Customer Churn Prediction: reduce churn by 15% over the next quarter.
- Sales Forecasting: predict monthly sales for the next 12 months to improve inventory management.
- Product Recommendation: increase user engagement by recommending products based on past behavior.
- A true data science problem can: categorize/group data, identify patterns, identify anomalies, show correlations, predict outcomes, or recommend actions.
- Good problem statements are: clear, concise, and measurable.
- Steps to formulate a problem:
- Understand goals & expectations: identify objectives, pain points, available data, potential benefits, risks.
- Translate goals into data analysis goals (what you will try to predict, classify, or discover).
- Frame the problem: write a clear, concise, and measurable statement (often 1–2 sentences).
- Distinguishing good vs. bad statements:
- Good: specific, actionable, and measurable.
- Bad: vague, broad, or non-actionable.
Understanding the Data
- Data basics:
- Data are the raw facts; information is data that have been organized and processed; knowledge is information applied to decisions; value comes from using knowledge to make decisions.
- Data types include numerical and non-numerical; data come from many sources and formats.
- Data can be large or small; even small data can yield insights.
- Data preparation is a key decision step before analysis.
- Data preparation is about turning raw data into information and knowledge to support decisions.
Population vs Sample; Inference
- Population: all objects of interest. Often not all can be collected due to cost or feasibility.
- Sample: a subset of the population used for analysis.
- Population parameter: a summary measure of the population (usually unknown).
- Sample statistic: a summary measure calculated from the sample.
- Statistical inference: drawing conclusions about the population from the sample; requires unbiased sampling.
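The parameter/statistic distinction can be demonstrated with the standard library. In the sketch below the population is simulated so the parameter is known for comparison; in practice it is unknown and the sample statistic is our estimate of it:

```python
import random
import statistics

# Simulated population (in practice the full population is unavailable,
# so the parameter mu would be unknown).
random.seed(42)
population = [round(random.gauss(100, 15), 2) for _ in range(10_000)]
mu = statistics.mean(population)      # population parameter

# Unbiased simple random sample of 200 observations.
sample = random.sample(population, 200)
x_bar = statistics.mean(sample)       # sample statistic, estimates mu

print(f"parameter mu ~ {mu:.2f}, statistic x_bar ~ {x_bar:.2f}")
```

With unbiased sampling, the sample mean lands close to the population mean; inference quantifies how close.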
Big Data: Characteristics and Considerations
- There is no universal definition; key perspective involves:
- Volume: huge data quantities that require scalable storage/processing (e.g., on the order of 10^6 to 10^12 records).
- Velocity: data generated/updated rapidly.
- Variety: structured, semi-structured, and unstructured data.
- Veracity: data quality and reliability.
- Value: the insight and benefit extracted from the data; requires a methodological plan for forming questions and drawing conclusions.
- Big data often requires parallel/distributed processing and advanced feature selection/dimensionality reduction.
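The parallel/distributed idea can be shown in miniature: split the data into chunks, compute a partial result per chunk (the part a cluster would run concurrently across machines), then combine the partials. A single-machine sketch:

```python
# Map-reduce in miniature: in a real distributed system each chunk
# would be processed by a separate worker or machine.
def mapper(chunk):
    return sum(chunk)  # partial result for one chunk

n = 1_000_000
chunks = [range(i, min(i + 100_000, n)) for i in range(0, n, 100_000)]
partials = [mapper(c) for c in chunks]  # in a cluster, these run in parallel
total = sum(partials)                   # the "reduce" step combines partials
```

The same split/partial/combine pattern underlies frameworks built for volume and velocity.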
- Time Series Data:
- Observations of a single subject across multiple time periods.
- Main focus: a single variable over time.
- Example: profits of an organization over T years.
- Cross-Sectional Data:
- Observations of many subjects at the same point in time.
- Main focus: multiple variables at one point in time.
- Example: maximum temperatures of multiple cities on a single day.
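The two layouts differ in what indexes the observations; a minimal sketch with hypothetical values:

```python
# Time series: one subject (a single company) across multiple periods;
# the index is time. Values are hypothetical (profit in $M).
profit_by_year = {2020: 1.2, 2021: 0.8, 2022: 1.5, 2023: 1.9}

# Cross-sectional: many subjects (cities) at one point in time;
# the index is the subject. Hypothetical max temps (C) on a single day.
max_temp_c = {"Phoenix": 42, "Denver": 29, "Boston": 25}

years_observed = len(profit_by_year)  # T periods, one variable
cities_observed = len(max_temp_c)     # one period, many subjects
```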
Structured vs Unstructured Data
- Structured Data:
- Highly organized, easy to analyze; can be stored in rows/spreadsheets/databases.
- Examples: customer phone numbers, names, ZIP codes.
- Unstructured Data:
- Not easily organized into rows; harder to collect/analyze.
- Examples: emails, video files, images, social media data.
- Transforming unstructured data into structured data:
- Data cleaning
- Extract meaning with analysis tools (e.g., NLP or LLMs such as ChatGPT/OpenAI models): summaries, sentiments, themes, keywords
- Store the results in tidy data frames for analysis
- Unstructured data sources: emails, orders, ratings, social media, chats.
- Structured data examples: files, lists, measurements; tidy data frames facilitate analysis.
Data Restructuring and Analysis Examples
- Summarization: condense large text to key points.
- Sentiment analysis: determine tone of text (positive/negative/neutral).
- Thematic analysis: identify common themes/trends.
- Keyword extraction: identify important terms for classification or indexing.
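In practice these tasks use NLP libraries or LLMs; a toy standard-library sketch (the stopword list and sentiment lexicons below are made up for illustration) shows the shape of keyword extraction and sentiment scoring:

```python
import re
from collections import Counter

review = ("The delivery was fast and the product quality is great, "
          "but the packaging was damaged and support was slow.")

# Keyword extraction (toy): most frequent non-stopword tokens.
stopwords = {"the", "was", "and", "is", "but", "a", "to"}
tokens = re.findall(r"[a-z]+", review.lower())
counts = Counter(t for t in tokens if t not in stopwords)
keywords = [word for word, _ in counts.most_common(5)]

# Sentiment (toy): net count of words from tiny hand-made lexicons.
positive = {"fast", "great"}
negative = {"damaged", "slow"}
score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
sentiment = "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

The structured outputs (keywords, a sentiment label) are exactly what gets stored in a tidy data frame downstream.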
The Data in Our Class: Tidy Data
- Tidy data: each column is a distinct variable; each row is a unique observation; each cell contains a single value.
- This structure simplifies downstream analysis and modeling.
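The tidy rules can be sketched with a plain list of dicts (hypothetical student records): one dict per row/observation, one key per column/variable, one value per cell:

```python
# Tidy data: one row per student, one column per variable, one value per cell.
# Names and values are hypothetical.
students = [
    {"student_id": 1, "gpa": 3.4, "sleep_hours": 7.0, "study_hours": 12},
    {"student_id": 2, "gpa": 2.9, "sleep_hours": 5.5, "study_hours": 6},
    {"student_id": 3, "gpa": 3.8, "sleep_hours": 8.0, "study_hours": 15},
]

# Because every row carries the same variables, column-wise analysis is trivial.
mean_gpa = sum(s["gpa"] for s in students) / len(students)
```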
Variables, Observational Units, and Observations
- Variables (features): stored in columns; describe the subject.
- Observational units (cases): stored in rows; the individual subjects in the data.
- Observations/Values: the data in the cells; the measurements for each observational unit across variables.
- Example: a tidy data set with one row per student, one column per variable (e.g., GPA, sleep hours, etc.).
Types of Variables
- Categorical (Qualitative) Variables:
- Ordinal: natural order (e.g., education level, class standing).
- Nominal: categories without a natural order (e.g., gender, eye color).
- Identifier (special case of Nominal): used for identification, not for analysis (e.g., Student ID).
- Categorical Numbers: codes for categories (e.g., 0/1 for No/Yes) but not quantitative values.
- Numerical (Quantitative) Variables:
- Discrete: counts taking countable values (e.g., number of parts damaged).
- Continuous: measurements on a continuum (e.g., height, weight, GPA).
Data Types and Measurement Scales
- Data types mapping:
- Text: categorical variables containing words (Nominal or Ordinal).
- Integer: quantitative or categorical without decimals (Discrete or Ordinal).
- Float: quantitative with decimals (Continuous).
- Boolean: two possible values (Nominal/Binary).
- Four scales of measurement (how data are measured):
- Ratio: true zero exists; differences meaningful; ratios meaningful; e.g., profits.
- Interval: differences meaningful; zero is arbitrary; ratios not meaningful; e.g., temperature in Celsius.
- Ordinal: ordered categories; differences not necessarily equal; e.g., performance levels.
- Nominal: categories with no inherent order; e.g., gender, color.
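A hypothetical survey record illustrating how Python data types map onto variable types and scales:

```python
# Hypothetical record: each Python type maps to a variable type and scale.
record = {
    "name": "Ada",      # str   -> text, categorical (nominal)
    "seat": "Front",    # str   -> text, categorical (ordinal: Front/Middle/Back)
    "siblings": 2,      # int   -> quantitative discrete
    "gpa": 3.71,        # float -> quantitative continuous
    "temp_c": 21.5,     # float -> continuous on an interval scale (0 C is arbitrary)
    "enrolled": True,   # bool  -> binary (nominal)
}

# Interval vs ratio: 20 C is not "twice as hot" as 10 C (no true zero),
# whereas a $20 profit really is twice a $10 profit (ratio scale).
```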
Putting It All Together: The Problem Example (Student GPA)
- Scenario: A university wants to understand factors influencing GPA at graduation and predict GPA from habits for advisement.
- Data needs: collect data on student habits, sleep, study time, etc., to model GPA at graduation.
- Target (Dependent) Variable: GPA upon graduation (the outcome to understand/predict).
- Also called: target, response, or dependent variable.
- Independent (Explanatory/Predictor) Variables: factors that influence GPA (e.g., sleep, study hours, exercise, seat location, etc.).
- Data dictionary: metadata detailing variable names, descriptions, data types, units, and scale.
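Separating the target from the predictors (and excluding identifiers) can be sketched over one tidy row, with hypothetical column names:

```python
# One tidy row from the hypothetical student survey.
row = {"student_id": 7, "sleep": 7.5, "study": 10.0, "exercise": 3.0, "gpa": 3.5}

target = "gpa"                # dependent / response variable
identifiers = {"student_id"}  # kept for joining records, excluded from analysis

# Independent (predictor) variables: everything that is not target or identifier.
X = {k: v for k, v in row.items() if k != target and k not in identifiers}
y = row[target]
```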
Data Collection and the Data Dictionary
- Data collection steps:
- Gather data from available private sources.
- Explore publicly available data for relevance.
- If needed, collect new data to enhance robustness.
- Data Dictionary (Metadata Table): a summary describing the dataset, including:
- Variable Name, Description
- General Type (Nominal, Ordinal, etc.) and Specific Type
- Data Types and Measurement Units
- Example structure (Student Survey at Data University):
- Student ID #: Identifier, Nominal
- Sex: Categorical (Male/Female), Nominal
- Sleep: Average Hours per Night; Quantitative Continuous
- Alcohol: Average Drinks per Week; Quantitative Continuous
- Exercise: Average Hours per Week; Quantitative Continuous
- TV: Average Hours per Week; Quantitative Continuous
- Study: Average Hours Studying per Week; Quantitative Continuous
- Seat: Front/Middle/Back; Categorical Ordinal
- GPA: GPA of the Student; Quantitative Continuous
- The data dictionary helps ensure consistent interpretation and analysis.
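A data dictionary can itself be stored as tidy data, one row per variable; a sketch of part of the survey's dictionary (entries abbreviated, content from the example above):

```python
# Data dictionary as data: one row per variable in the survey.
data_dictionary = [
    {"variable": "Student ID", "description": "Unique student identifier",
     "general": "Categorical", "specific": "Identifier (Nominal)", "units": None},
    {"variable": "Sleep", "description": "Average hours of sleep per night",
     "general": "Quantitative", "specific": "Continuous", "units": "hours/night"},
    {"variable": "Seat", "description": "Seat location in class",
     "general": "Categorical", "specific": "Ordinal", "units": None},
    {"variable": "GPA", "description": "GPA of the student",
     "general": "Quantitative", "specific": "Continuous", "units": "grade points"},
]

# Quick pre-modeling check: which variables are quantitative?
quantitative = [r["variable"] for r in data_dictionary
                if r["general"] == "Quantitative"]
```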
Quick Recall Tips
- Always frame problems as: What is the problem, Why it matters, and What measurable outcomes are expected.
- Distinguish target variable from predictors early (GPA vs. habits).
- Remember tidy data rules: one variable per column, one observation per row, one value per cell.
- Distinguish variable types to choose appropriate analysis methods (e.g., regression for quantitative variables, classification for categorical variables).
- Use data dictionaries to document data provenance and meaning before modeling.
Quick References (key takeaways)
- Problem statement structure: What is the problem? Why is it important? What outcomes are expected? (1–2 sentences when framing).
- Data science problems can involve classification, prediction, clustering, anomaly detection, or recommendations.
- Cross-sectional vs time series data differ in focus and analysis approach.
- Four scales of measurement (Nominal, Ordinal, Interval, Ratio) guide how you can analyze data.
- Distinguish target vs predictor variables clearly before modeling.
- Data cleaning and structuring are as important as modeling for reliable results.