Notes on Business Data Management & Acquisition

Course Overview

Course Title: Business Data Management & Acquisition
Instructor: Hao Ding, Assistant Professor of Business Analytics and Information Systems
Institution: Harbert College of Business, Auburn University

Choosing the Right Data

Key Considerations for Data Selection:

Data Size: The dataset must be sufficiently large to ensure that it provides meaningful insights and accounts for variability within the population. A larger sample size reduces the margin of error and increases the reliability of statistical analyses.
Data Accuracy and Completeness: The data must be precise and complete. Inaccurate data can lead to misleading results, while incomplete data can skew findings and affect decision-making. Implementing thorough validation checks and requiring data sources to adhere to high standards of accuracy is vital.
Representativeness: The dataset should accurately reflect the larger population of interest. Any biases in data selection can produce skewed results that do not generalize well.
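
The link between sample size and margin of error can be made concrete with a short calculation. The following is a minimal sketch, assuming a simple random sample and the common 95% normal-approximation formula (1.96 times the standard deviation divided by the square root of n). The function name and the standard deviation of 10 are illustrative assumptions, not part of the course material.

```python
import math


def margin_of_error(sample_sd: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a sample mean (normal approximation)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * sample_sd / math.sqrt(n)


# Hypothetical standard deviation of 10 units; only the sample size changes.
for n in (25, 100, 400):
    print(n, round(margin_of_error(10.0, n), 2))
```

Under these assumptions the margin shrinks from roughly 3.92 to 1.96 to 0.98: quadrupling the sample size halves the margin of error.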

Data Preparation and Management

Steps to Prepare Data:

Check: Understand the structure and nature of the data, including its format, data types (e.g., numerical, categorical), and any inherent biases.
Clean: Identify and rectify any errors or missing values. This process may include imputing missing values, removing duplicates, and correcting inconsistencies. Data cleaning is critical to ensure that the analysis is based on reliable information.
Prep: Organize the data for analysis by transforming it into a suitable structure that aligns with analysis goals. This may involve normalizing data, aggregating values, or creating new calculated fields.
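
The three steps above can be sketched on a tiny dataset. This is one possible illustration using only the Python standard library. The records, the column names, and the choice of median imputation are assumptions made for the example.

```python
from statistics import median

# Hypothetical raw records: one exact duplicate row and one missing value (None).
raw = [
    {"player": "A", "minutes": 30, "points": 18},
    {"player": "B", "minutes": 22, "points": None},
    {"player": "A", "minutes": 30, "points": 18},  # exact duplicate of row 1
    {"player": "C", "minutes": 15, "points": 6},
]

# Check: list the columns and the Python types that appear in each one.
columns = list(raw[0].keys())
types_seen = {c: {type(r[c]).__name__ for r in raw} for c in columns}

# Clean: drop exact duplicates (keeping the first occurrence) ...
seen, cleaned = set(), []
for row in raw:
    key = tuple(row[c] for c in columns)
    if key not in seen:
        seen.add(key)
        cleaned.append(dict(row))

# ... then impute missing points with the median of the observed values.
observed = [r["points"] for r in cleaned if r["points"] is not None]
fill_value = median(observed)
for r in cleaned:
    if r["points"] is None:
        r["points"] = fill_value

# Prep: add a calculated field for later analysis.
for r in cleaned:
    r["points_per_minute"] = r["points"] / r["minutes"]

print(types_seen)
print(cleaned)
```

Median imputation is only one option. Whether to impute, drop, or flag missing values depends on why the data is missing.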

Understanding Data Structure

Data Structure Basics:

Rows: Represent individual observations (or units). Each row corresponds to a single record in the dataset, such as one player's statistics in one game.
Columns: Represent variables or features (the characteristics measured). Each column should have a clear and concise name and a defined data type.

Types of Data Structures:

Cross-sectional Data:
- Definition: Data collected at one point in time across multiple units.
- Example: Statistics from multiple Auburn basketball players during a single game, where each player's performance metrics are captured.
- Applications: Sports statistics, financial analyses, operational inventories, and single-time-point marketing surveys.
Time-series Data:
- Definition: Data collected on a single unit across multiple time points, allowing for trends and patterns to be observed over time.
- Example: Performance data of a single basketball player over 20 games, tracking metrics such as points and assists.
- Key Concepts:
  - Trend: Long-term movement or pattern in the data, indicating overall direction.
  - Seasonality: Regular, repeated patterns observed within specific periods (e.g., quarterly sales spikes).
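
One common way to make a trend visible in a time series is to smooth it with a moving average. The sketch below assumes a trailing window of four games and uses made-up points-per-game values. The function name is hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average; returns one value per full window."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical points per game for a single player over 8 games.
points = [10, 12, 9, 14, 13, 16, 15, 18]
trend = moving_average(points, window=4)
print(trend)
```

The smoothed values rise steadily even though the game-to-game numbers move up and down, which is the long-term movement the Trend concept describes.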

Comparison between Time-series and Cross-sectional Data

Time-Series Characteristics:

Observations often depend on previous ones (sequential dependence), so their order matters. This makes time-series data suitable for forecasting and for analyzing trends over time.

Cross-sectional Characteristics:

Observations are considered independent. Each observation is treated as separate, allowing for various statistical analyses without concern for temporal effects.

Independent vs. Correlated Variables

Independent Variables:

No relationship exists between them: knowing the value of one variable tells you nothing about the other. An example is a basketball player's height and jersey number.

Correlated Variables:

Variables that move together. For instance, there is often a correlation between minutes played and points scored, as more playing time typically allows for more opportunities to score.
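
The difference between correlated and unrelated variables can be checked numerically with the Pearson correlation coefficient, which ranges from -1 to 1. This is a minimal sketch with made-up numbers chosen to mirror the two examples above. Note that a near-zero correlation in a small sample suggests, but does not prove, independence.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for five players.
minutes = [10, 18, 25, 30, 35]
points = [4, 9, 12, 17, 20]
heights = [72, 75, 78, 80, 83]  # inches
jersey = [12, 30, 5, 33, 10]

r_minutes_points = pearson_r(minutes, points)  # strongly positive
r_height_jersey = pearson_r(heights, jersey)   # close to zero
print(round(r_minutes_points, 2), round(r_height_jersey, 2))
```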

Panel Data

Definition: Data that covers multiple units measured across multiple time points, combining the cross-sectional and time-series structures. This allows a richer analysis of how units differ from one another and how each changes over time.
Example: Tracking all Auburn players over multiple games provides insights into both individual performance trends and overall team dynamics, such as how a player adapts as strategies change over the season.

Transforming Data Between Structures

From Panel to Cross-sectional: Calculate averages over time for each unit to produce a consolidated snapshot of performance.
From Panel to Time-series: Focus on a specific unit or team’s performance across time, preserving the temporal aspect for trend analysis.
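Limitations: Not every panel converts cleanly (for example, units observed in different periods are hard to compare after averaging), and the conversion discards information, so the original panel cannot be rebuilt from the cross-sectional or time-series result.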
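
The two transformations can be sketched on a small panel stored as one record per (player, game) pair. The records and field names below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical panel: multiple units (players) across multiple time points (games).
panel = [
    {"player": "A", "game": 1, "points": 10},
    {"player": "A", "game": 2, "points": 14},
    {"player": "B", "game": 1, "points": 6},
    {"player": "B", "game": 2, "points": 8},
]

# Panel -> cross-sectional: average each player's points over all games.
points_by_player = defaultdict(list)
for row in panel:
    points_by_player[row["player"]].append(row["points"])
cross_section = {player: mean(vals) for player, vals in points_by_player.items()}

# Panel -> time series (single unit): keep one player, ordered by game.
ordered = sorted(panel, key=lambda r: r["game"])
series_a = [r["points"] for r in ordered if r["player"] == "A"]

# Panel -> time series (team level): total points per game.
team_totals = defaultdict(int)
for row in ordered:
    team_totals[row["game"]] += row["points"]
team_series = [team_totals[g] for g in sorted(team_totals)]

print(cross_section, series_a, team_series)
```

Each result keeps only one dimension of the panel: the cross-section drops the time ordering, and each time series drops the per-player detail of everyone else.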

Understanding Distribution

Distribution:

Describes how data is spread across different values: how often each value occurs, which values are typical, and which are unusual (potential outliers).

Key Statistics:

Center: Includes mean (average), median (the middle value), and mode (most frequent value) to describe the central tendency of the data distribution.
Spread: Includes range (difference between the maximum and minimum values), variance (the average squared deviation from the mean), and standard deviation (the square root of the variance, expressed in the same units as the data).
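
These statistics are all available in Python's standard `statistics` module. The sketch below uses a made-up list of scores and the population versions of variance and standard deviation (`pvariance`, `pstdev`). For a sample drawn from a larger population, `variance` and `stdev` would be the usual choice.

```python
import statistics as st

# Hypothetical scores for seven observations.
scores = [8, 12, 12, 15, 18, 21, 26]

center = {
    "mean": st.mean(scores),      # arithmetic average
    "median": st.median(scores),  # middle value of the sorted data
    "mode": st.mode(scores),      # most frequent value
}

spread = {
    "range": max(scores) - min(scores),
    "variance": st.pvariance(scores),  # average squared deviation from the mean
    "std_dev": st.pstdev(scores),      # square root of the variance
}

print(center)
print(spread)
```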

Normal Distribution

Characteristics:

Recognized by its bell-shaped curve that is symmetrical around the mean. This distribution is vital for many statistical tests and analyses.
68-95-99.7 Rule:
- About 68% of the data falls within 1 standard deviation of the mean.
- About 95% falls within 2 standard deviations.
- About 99.7% falls within 3 standard deviations.
These percentages show how likely a value is to lie a given distance from the mean.
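
The rule can be checked directly from the normal cumulative distribution function. This is a short sketch using `statistics.NormalDist` from the standard library. The helper name `share_within` is an assumption made for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, f"{share_within(k):.4f}")
```

The exact shares are close to, but not exactly, 68%, 95%, and 99.7%, which is why the rule is stated as an approximation.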

Shapes of Distribution

Symmetrical Distribution:

Both sides mirror each other, indicating balanced data.

Skewed Distributions:

Right skew: A long tail on the right side, as is often seen in data such as individual points scored in basketball, where most values are low and a few are very high.
Left skew: A long tail on the left side, common in data like age at retirement.

Bimodal Distribution:

Contains two peaks, suggesting the presence of distinct subgroups within the data that may represent different behaviors or characteristics.
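
One way to tell these shapes apart numerically is the moment-based skewness coefficient: it is near zero for symmetrical data, positive for right skew, and negative for left skew. The sketch below uses made-up datasets and a hypothetical helper named `skewness`. Comparing the mean with the median gives a quicker, rougher check.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Moment-based skewness: the average cubed z-score."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical datasets illustrating each shape.
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [2, 3, 3, 4, 4, 5, 6, 9, 15, 30]          # long tail of high values
left_skewed = [40, 55, 61, 62, 63, 64, 64, 65, 66, 67]   # long tail of low values

for name, data in [("symmetric", symmetric),
                   ("right", right_skewed),
                   ("left", left_skewed)]:
    print(name, round(skewness(data), 2), mean(data), median(data))
```

In the right-skewed data the mean sits above the median because the long tail pulls it upward, and the reverse holds for the left-skewed data.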

Log-Normal Distribution

Definition: A right-skewed distribution of positive values whose logarithm follows a normal distribution. It is commonly used for economic data, particularly in income distribution analysis.
Utility: Taking the log compresses the long right tail and stabilizes variance, so methods that assume approximate normality can be applied to the transformed values.
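
The effect of the log transform can be seen on simulated data. This sketch draws values with `random.lognormvariate` and compares skewness before and after taking logs. The parameters (10.0 and 0.8), the seed, and the "incomes" label are arbitrary assumptions for illustration, not real income data.

```python
import math
import random
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness: the average cubed z-score."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated, hypothetical "incomes": log-normal with log-mean 10.0, log-sd 0.8.
incomes = [rng.lognormvariate(10.0, 0.8) for _ in range(10_000)]
log_incomes = [math.log(x) for x in incomes]

raw_skew = skewness(incomes)      # clearly positive: long right tail
log_skew = skewness(log_incomes)  # close to zero: roughly symmetrical

print(round(raw_skew, 2), round(log_skew, 2))
```

The raw values are strongly right-skewed, while their logs are approximately symmetrical, as the definition predicts.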