Data Science: Problem Formulation and Data Concepts (Vocabulary)

Key Concepts: Defining the Problem and Problem Statement

  • A well-stated problem guides data collection, analysis methods, and stakeholder communication.
  • Core components of a problem statement:
    • Context: background to show relevance and urgency.
    • Specificity: clearly define the problem in understandable terms.
    • Objectives (Measurable): specify what success looks like with metrics or outcomes.
  • Structuring a problem statement:
    • What is the problem?
    • Why is it important?
    • What are the expected outcomes?
  • Examples of problem statements in Data Science:
    • Customer Churn Prediction: reduce churn by 15% over the next quarter.
    • Sales Forecasting: predict monthly sales for the next 12 months to improve inventory management.
    • Product Recommendation: increase user engagement by recommending products based on past behavior.
  • A true data science problem can: categorize/group data, identify patterns, identify anomalies, show correlations, predict outcomes, or recommend actions.
  • Good problem statements are: clear, concise, and measurable.
  • Steps to formulate a problem:
    • Understand goals & expectations: identify objectives, pain points, available data, potential benefits, risks.
    • Translate goals into data analysis goals (what you will try to predict, classify, or discover).
    • Frame the problem: write a clear, concise, and measurable statement (often 1–2 sentences).
  • Distinguishing good vs. bad statements:
    • Better: specific, actionable, and measurable.
    • Worse: vague, broad, or non-actionable.

Understanding the Data

  • Data basics:
    • Data are the raw facts; information is data that have been organized and processed; knowledge is information applied to decisions; value comes from using knowledge to make decisions.
    • Data types include numerical and non-numerical; data come from many sources and formats.
    • Data can be large or small; even small data can yield insights.
  • Data preparation is a key decision step before analysis.
  • Data preparation is about turning raw data into information and knowledge to support decisions.

Population vs Sample; Inference

  • Population: all objects of interest. Often not all can be collected due to cost or feasibility.
  • Sample: a subset of the population used for analysis.
  • Population parameter: a summary measure of the population (usually unknown).
  • Sample statistic: a summary measure calculated from the sample.
  • Statistical inference: drawing conclusions about the population from the sample; requires unbiased sampling.
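The parameter-vs-statistic distinction can be illustrated with a small simulation. This is a sketch with hypothetical numbers: in practice the population (and its parameter) would be unknown, which is exactly why we estimate it from a sample.

```python
import random
import statistics

random.seed(42)

# Hypothetical population: GPAs of all 10,000 students at a university.
# In practice this full list is unavailable; it is simulated here only so
# the population parameter can be compared against the sample statistic.
population = [round(random.uniform(2.0, 4.0), 2) for _ in range(10_000)]
population_mean = statistics.mean(population)   # population parameter

# Draw an unbiased simple random sample and compute the sample statistic.
sample = random.sample(population, 200)
sample_mean = statistics.mean(sample)           # sample statistic

# With unbiased sampling, the statistic is a reasonable estimate of the parameter.
print(f"population mean: {population_mean:.3f}")
print(f"sample mean:     {sample_mean:.3f}")
```

Note that `random.sample` draws without replacement, which matches the idea of surveying 200 distinct students rather than possibly asking the same student twice.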

Big Data: Characteristics and Considerations

  • There is no universal definition; key perspective involves:
    • Volume: huge data quantities that require scalable storage/processing (e.g., on the order of 10^6 to 10^12 records).
    • Velocity: data generated/updated rapidly.
    • Variety: structured, semi-structured, and unstructured data.
    • Veracity: data quality and reliability.
    • Value: a methodological plan for forming questions and extracting insight.
  • Big data often requires parallel/distributed processing and advanced feature selection/dimensionality reduction.

Data Formats: Cross-Sectional vs Time Series

  • Time Series Data:
    • Observations of a single subject across multiple time periods.
    • Main focus: a single variable over time.
    • Example: profits of an organization over T years.
  • Cross-Sectional Data:
    • Observations of many subjects at the same point in time.
    • Main focus: multiple variables at one point in time.
    • Example: maximum temperatures of multiple cities on a single day.
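The two formats above can be contrasted side by side. A minimal sketch with hypothetical figures: one data frame has one row per time period for a single subject, the other has one row per subject at a single point in time.

```python
import pandas as pd

# Time series: one subject (one organization) observed across time periods.
time_series = pd.DataFrame({
    "year": [2020, 2021, 2022, 2023],
    "profit_musd": [1.2, 1.5, 1.1, 1.8],   # hypothetical annual profits
})

# Cross-sectional: many subjects (cities) observed on a single day.
cross_section = pd.DataFrame({
    "city": ["Boston", "Denver", "Miami", "Seattle"],
    "max_temp_f": [58, 72, 88, 61],        # hypothetical one-day readings
})

# Rows index time in the first frame, subjects in the second.
print(time_series)
print(cross_section)
```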

Structured vs Unstructured Data

  • Structured Data:
    • Highly organized, easy to analyze; can be stored in rows/spreadsheets/databases.
    • Examples: customer phone numbers, names, ZIP codes.
  • Unstructured Data:
    • Not easily organized into rows; harder to collect/analyze.
    • Examples: emails, video files, images, social media data.
  • Transforming unstructured data into structured data:
    • Data cleaning
    • Extract meaningful insights with analysis tools (LLMs, NLP)
    • Store in tidy data frames for analysis
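The clean → extract → store pipeline above can be sketched in a few lines. This is a toy illustration: `extract_sentiment` is a hypothetical keyword-matching stand-in for the LLM/NLP call a real pipeline would make, and the sample messages are invented.

```python
import re
import pandas as pd

def clean_text(raw: str) -> str:
    """Basic cleaning: strip HTML-like tags, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", text).strip().lower()

def extract_sentiment(text: str) -> str:
    """Stand-in for an LLM/NLP call; a real pipeline would query a model here."""
    if any(w in text for w in ("great", "love", "excellent")):
        return "positive"
    if any(w in text for w in ("bad", "terrible", "hate")):
        return "negative"
    return "neutral"

# Unstructured inputs (hypothetical customer emails).
raw_messages = [
    "<p>I LOVE the new dashboard!</p>",
    "Delivery was terrible   this time.",
    "Order received, thanks.",
]

# Clean -> extract meaning -> store in a tidy frame (one row per message).
tidy = pd.DataFrame({"message": [clean_text(m) for m in raw_messages]})
tidy["sentiment"] = tidy["message"].apply(extract_sentiment)
print(tidy)
```

The end state is the point: unstructured text becomes a tidy data frame with one row per message and one column per extracted variable, ready for ordinary analysis.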

Transforming Unstructured to Structured Data with LLMs

  • Unstructured data sources: emails, orders, ratings, social media, chats.
  • Steps: data cleaning → extract meaning → analyze with tools like LLMs (e.g., ChatGPT/OpenAI) → summarize, sentiments, themes, keywords → store in tidy format.
  • Structured data examples: files, lists, measurements; tidy data frames facilitate analysis.

Data Restructuring and Analysis Examples

  • Summarization: condense large text to key points.
  • Sentiment analysis: determine tone of text (positive/negative/neutral).
  • Thematic analysis: identify common themes/trends.
  • Keyword extraction: identify important terms for classification or indexing.
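Keyword extraction, the last item above, has a simple frequency-based baseline that needs no model at all. A minimal sketch with an invented review and a small stopword list:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real pipelines use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for", "on", "it"}

def top_keywords(text: str, k: int = 3) -> list[str]:
    """Rank words by frequency after dropping common stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

review = ("The battery life is great and the battery charges fast. "
          "Great screen, great battery.")
print(top_keywords(review))   # "battery" and "great" dominate the counts
```

Summarization, sentiment, and thematic analysis usually do require a model (an LLM or NLP library), but frequency counts like this are a useful sanity check on their output.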

The Data in Our Class: Tidy Data

  • Tidy data: each column is a distinct variable; each row is a unique observation; each cell contains a single value.
  • This structure simplifies downstream analysis and modeling.

Variables, Observational Units, and Observations

  • Variables (features): stored in columns; describe the subject.
  • Observational units (cases): stored in rows; the individual subjects in the data.
  • Observations/Values: the data in the cells; the measurements for each observational unit across variables.
  • Example: a tidy data set with one row per student, one column per variable (e.g., GPA, sleep hours, etc.).
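The student example above, laid out as a tidy data frame (column names and values are hypothetical):

```python
import pandas as pd

# Tidy layout: one row per student (observational unit),
# one column per variable, one value per cell.
students = pd.DataFrame({
    "student_id": ["S001", "S002", "S003"],   # identifier (not analyzed)
    "gpa": [3.4, 2.9, 3.8],                   # quantitative, continuous
    "sleep_hours": [7.0, 6.5, 8.0],           # quantitative, continuous
    "seat": ["Front", "Back", "Middle"],      # categorical, ordinal
})

# Each row is one observation; each cell holds a single value.
print(students)
```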

Types of Variables

  • Categorical (Qualitative) Variables:
    • Ordinal: natural order (e.g., education level, class standing).
    • Nominal: categories without a natural order (e.g., gender, eye color).
    • Identifier (special case of nominal): used to identify records, not analyzed as a category (e.g., Student ID).
    • Categorical Numbers: numeric codes for categories (e.g., 0/1 for No/Yes); not quantitative values.
  • Numerical (Quantitative) Variables:
    • Discrete: counts taking separate, countable values (e.g., number of parts damaged).
    • Continuous: measurements on a continuum (e.g., height, weight, GPA).
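These distinctions matter in code, not just in vocabulary. A sketch with invented data: pandas can mark a column as categorical, and only the ordinal one is given an explicit ordering.

```python
import pandas as pd

df = pd.DataFrame({
    "eye_color": ["brown", "blue", "green"],             # nominal
    "class_standing": ["Freshman", "Junior", "Senior"],  # ordinal
    "parts_damaged": [0, 2, 1],                          # discrete count
    "height_cm": [170.2, 165.5, 181.0],                  # continuous
})

# Encode the categorical columns explicitly; the ordinal one gets an order.
df["eye_color"] = df["eye_color"].astype("category")
df["class_standing"] = pd.Categorical(
    df["class_standing"],
    categories=["Freshman", "Sophomore", "Junior", "Senior"],
    ordered=True,
)

# Ordered categoricals support comparisons; nominal ones do not.
print(df.dtypes)
print((df["class_standing"] > "Freshman").tolist())
```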

Data Types and Measurement Scales

  • Data types mapping:
    • Text: categorical variables containing words (Nominal or Ordinal).
    • Integer: quantitative or categorical without decimals (Discrete or Ordinal).
    • Float: quantitative with decimals (Continuous).
    • Boolean: two possible values (Nominal/Binary).
  • Four scales of measurement (how data are measured):
    • Ratio: true zero exists; differences meaningful; ratios meaningful; e.g., profits.
    • Interval: differences meaningful; zero is arbitrary; ratios not meaningful; e.g., temperature in Celsius.
    • Ordinal: ordered categories; differences not necessarily equal; e.g., performance levels.
    • Nominal: categories with no inherent order; e.g., gender, color.
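The interval-vs-ratio distinction is easy to see numerically. Celsius has an arbitrary zero, so ratios of Celsius values are not meaningful; Kelvin has a true zero, so its ratios are.

```python
# Interval vs. ratio: going from 10 °C to 20 °C does not "double" the
# temperature in any physical sense, because 0 °C is an arbitrary zero.
c1, c2 = 10.0, 20.0
k1, k2 = c1 + 273.15, c2 + 273.15   # Kelvin has a true zero (ratio scale)

print(c2 / c1)   # 2.0 on the Celsius scale, but this ratio is not meaningful
print(k2 / k1)   # about 1.035, the physically meaningful ratio
```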

Putting It All Together: The Problem Example (Student GPA)

  • Scenario: A university wants to understand factors influencing GPA at graduation and predict GPA from habits for advisement.
  • Data needs: collect data on student habits, sleep, study time, etc., to model GPA at graduation.
  • Target (Dependent) Variable: GPA upon graduation (the outcome to understand/predict).
    • Also called: target, response, or dependent variable.
  • Independent (Explanatory/Predictor) Variables: factors that influence GPA (e.g., sleep, study hours, exercise, seat location, etc.).
  • Data dictionary: metadata detailing variable names, descriptions, data types, units, and scale.

Data Collection and the Data Dictionary

  • Data collection steps:
    • Gather data from available private sources.
    • Explore publicly available data for relevance.
    • If needed, collect new data to enhance robustness.
  • Data Dictionary (Metadata Table): a summary describing the dataset, including:
    • Variable Name, Description
    • General Type (Nominal, Ordinal, etc.) and Specific Type
    • Data Types and Measurement Units
  • Example structure (Student Survey at Data University):
    • Student ID #: Identifier, Nominal
    • Sex: Categorical (Male/Female), Nominal
    • Sleep: Average Hours per Night; Quantitative Continuous
    • Alcohol: Average Drinks per Week; Quantitative Continuous
    • Exercise: Average Hours per Week; Quantitative Continuous
    • TV: Average Hours per Week; Quantitative Continuous
    • Study: Average Hours Studying per Week; Quantitative Continuous
    • Seat: Front/Middle/Back; Categorical Ordinal
    • GPA: GPA of the Student; Quantitative Continuous
  • The data dictionary helps ensure consistent interpretation and analysis.
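A data dictionary can itself be stored as a small tidy table. A sketch covering a subset of the survey variables above (descriptions and units are paraphrased from the list):

```python
import pandas as pd

# One row per variable in the student survey; the dictionary is metadata,
# kept separate from the data it describes.
data_dictionary = pd.DataFrame([
    {"variable": "Student ID", "description": "Unique student identifier",
     "general_type": "Identifier", "specific_type": "Nominal",
     "units": None},
    {"variable": "Sleep", "description": "Average hours of sleep per night",
     "general_type": "Quantitative", "specific_type": "Continuous",
     "units": "hours/night"},
    {"variable": "Seat", "description": "Usual seat location in class",
     "general_type": "Categorical", "specific_type": "Ordinal",
     "units": "Front/Middle/Back"},
    {"variable": "GPA", "description": "GPA upon graduation (target)",
     "general_type": "Quantitative", "specific_type": "Continuous",
     "units": "0.0-4.0"},
])
print(data_dictionary)
```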

Quick Recall Tips

  • Always frame problems as: What, Why, and How (measurable outcomes).
  • Distinguish target variable from predictors early (GPA vs. habits).
  • Remember tidy data rules: one variable per column, one observation per row, one value per cell.
  • Distinguish variable types to choose appropriate analysis methods (e.g., regression for quantitative variables, classification for categorical variables).
  • Use data dictionaries to document data provenance and meaning before modeling.

Quick References (key takeaways)

  • Problem statement structure: What is the problem? Why is it important? What outcomes are expected? (1–2 sentences when framing).
  • Data science problems can involve classification, prediction, clustering, anomaly detection, or recommendations.
  • Cross-sectional vs time series data differ in focus and analysis approach.
  • Four scales of measurement (Nominal, Ordinal, Interval, Ratio) guide how you can analyze data.
  • Distinguish target vs predictor variables clearly before modeling.
  • Data cleaning and structuring are as important as modeling for reliable results.