C207 Master Story Guide: Data-Driven Decision Making

Quantitative Analysis Foundation and Big Data Concepts

Big Data refers to datasets so massive and complex that traditional database and spreadsheet tools are insufficient for processing. These datasets typically include both structured and unstructured data. Organizations such as Amazon, Google, and Netflix utilize Big Data by collecting transactions, clicks, searches, purchases, recommendations, and reviews. A common trap is assuming Big Data is only numerical or structured; it is critically defined as being both structured and unstructured. To manage this information, a Big Data Warehouse is utilized, which is a storage environment often relying on cloud storage, third-party storage, or multiple servers. A spreadsheet like Excel is not considered a Big Data Warehouse because Big Data, by definition, would not fit within it. Data Mining is the specific process of discovering useful patterns within these large datasets, such as a retailer identifying which customers are likely to respond to a specific coupon. In this context, Big Data is often described as the mountain, while Data Mining is the act of looking for gold inside that mountain.

Organizations collect data primarily to transform it into useful information for better decision-making. A core objective in business is analyzing past purchases to encourage specific buying behavior and attract new customers. Data is classified by its format. Structured data exists in a fixed format that is easy to classify, count, or analyze, such as multiple-choice surveys, check boxes, dropdowns, or preformatted dates. Conversely, unstructured data lacks a fixed format and requires interpretation or theme analysis. Examples include medical notes, customer comments, emails, and open-ended responses. Medical notes are frequently cited as a primary example of unstructured data because they consist of free text rather than predefined choices.

Levels of Measurement and Data Quality

Data can be classified as Quantitative (numerical and measurable/countable) or Qualitative (categorical labels and descriptions). Quantitative data is further divided into Discrete data, which consists of whole-number counts like the number of children in a household, and Continuous data, which involves measurements that allow for decimals, such as weight, height, or temperature. The NOIR framework defines four levels of measurement: Nominal, Ordinal, Interval, and Ratio. Nominal involves categories with no natural sequence, such as car colors or yes/no fields. Ordinal involves categories with a meaningful rank or sequence but unequal distance between them, such as economy versus first-class seating. Interval data consists of numerical scales where zero is merely a placeholder, like temperature in Fahrenheit. Ratio data is numerical data where zero indicates a true absence of the variable, such as money, price, revenue, age, or distance. In this course, all money-related terms are classified as ratio.

Measurement quality depends on Reliability and Validity. Reliability refers to the consistency and repeatability of a measurement instrument. For example, a thermometer is unreliable if it gives wildly different readings for the same subject in a single minute. Validity determines if the data measures what it is intended to measure; for instance, a jigsaw puzzle might not be a valid measure of IQ. A tool can be reliable without being valid, meaning it consistently measures the wrong thing. Data quality also requires cleaning data to remove mistakes. Common issues include Out-of-Range Errors, which are impossible values like a car listed at $188$ MPG, and Omission Errors, which are missing values or blank fields. Sorting spreadsheet columns is a primary method for identifying these artifacts. Systematic Errors are consistent biases that push results in one direction and must be fixed, such as a scale that always adds $5$ lbs. Random Errors are caused by chance or noise and often average out over time, such as temporary static during a phone call.
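The two cleaning checks described above can be sketched in a few lines of Python. This is only an illustration: the records, the field names, and the plausible MPG range are all hypothetical values chosen for the example, not part of the course material.

```python
# Hypothetical fuel-economy records; None marks a blank field.
records = [
    {"car": "A", "mpg": 31.0},
    {"car": "B", "mpg": 188.0},  # impossible value (out-of-range error)
    {"car": "C", "mpg": None},   # blank field (omission error)
    {"car": "D", "mpg": 24.5},
]

# Assumed plausible range, chosen only for this illustration.
MPG_MIN, MPG_MAX = 5.0, 100.0


def find_errors(rows, low, high):
    """Return (omissions, out_of_range) as lists of car labels."""
    omissions = [r["car"] for r in rows if r["mpg"] is None]
    out_of_range = [
        r["car"]
        for r in rows
        if r["mpg"] is not None and not (low <= r["mpg"] <= high)
    ]
    return omissions, out_of_range


omissions, out_of_range = find_errors(records, MPG_MIN, MPG_MAX)
print("Omission errors:", omissions)
print("Out-of-range errors:", out_of_range)
```

Sorting a column achieves the same thing by eye: blanks and extreme values collect at the top or bottom of the sorted list.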

Research Design, Bias, and Association

Research designs include Observational Studies, where researchers collect info without applying a treatment, and Experimental Studies, where a treatment is applied to examine an effect on a response unit. A Cohort Study is a specific type of observational study focused on a group sharing a characteristic or timeframe. Experimental elements include the Unit (the subject), the Treatment (the change applied), and the Response (the measured outcome). Blinding is used to reduce bias: Blind studies keep participants unaware of the treatment; Double-Blind studies keep both participants and researchers unaware; and Triple-Blind studies include the analyst in the group unaware of treatment assignments. Faulty Operationalization occurs when a variable is not clearly defined or measured correctly, which directly impacts validity.

Bias manifests in several forms. Measurement Bias occurs during sample selection or data collection. Information Bias occurs after collection begins due to inaccurate or distorted records. Response Bias is caused by the influence of the interviewer or pressure felt by the respondent. Conscious Bias involves framing questions in a leading or persuasive way. Finally, the principle of Association versus Causation dictates that just because variables move together (correlation), it does not mean one causes the other. An example is ice cream sales and drowning rates both rising in summer; the relationship is an association, not a causal link.

Statistical Tools and Hypothesis Testing

The alpha level ($\alpha$), typically set at $0.05$, is the significance cutoff. The p-value is compared to this alpha to determine significance: if $p < 0.05$, the result is significant and the null hypothesis is rejected; if $p \geq 0.05$, the result is not significant and the null hypothesis is not rejected (often loosely described as "accepted," although a non-significant result does not prove the null hypothesis true). The Null Hypothesis always states there is no significant difference or relationship. Test selection depends on the data type: Chi-Square Analysis compares frequencies or counts of categorical/nominal data. A T-Test compares the means of exactly two groups. ANOVA (Analysis of Variance) compares the means of three or more groups. If a T-statistic is negative, its absolute value should be used for comparison against the T-critical value. In ANOVA outputs, scientific notation like $E-34$ (that is, $\times 10^{-34}$) indicates a very tiny p-value well below the $0.05$ threshold.
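The selection and decision rules in this paragraph can be written as a small sketch. The function names and the sample numbers are hypothetical; the code only encodes the rules of thumb stated above and does not compute any test statistic itself.

```python
def choose_test(outcome_is_categorical, n_groups):
    """Pick a test family using the selection rules described above."""
    if outcome_is_categorical:
        return "chi-square"
    if n_groups == 2:
        return "t-test"
    if n_groups >= 3:
        return "ANOVA"
    raise ValueError("need at least two groups to compare means")


def decide(p_value, alpha=0.05):
    """Compare a p-value to alpha and report the decision about the null."""
    return "reject null" if p_value < alpha else "fail to reject null"


def exceeds_critical(t_stat, t_critical):
    """Use the absolute value of the t-statistic, so the sign does not matter."""
    return abs(t_stat) > t_critical


# Hypothetical outputs, e.g. copied from a spreadsheet analysis.
print(choose_test(outcome_is_categorical=False, n_groups=3))
print(decide(3.2e-34))               # "E-34" notation: far below 0.05
print(exceeds_critical(-2.45, 2.10))
```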

Regression Analysis measures relationships and predicts outcomes. Linear Regression uses one independent variable ($X$) to predict one numeric dependent variable ($Y$). Multiple Regression uses two or more independent variables to predict one numeric dependent variable. Logistic Regression is used when the dependent variable ($Y$) is binary or nominal, such as yes/no or pass/fail. The Regression Equation is $Y = MX + B$, where $M$ is the slope coefficient and $B$ is the intercept. R-Squared ($R^2$) represents the goodness of fit, measuring how much variation in $Y$ is explained by $X$, with values closer to $1$ being stronger. Scatter plots are used to visualize these relationships; Homoscedasticity indicates a consistent spread of points (pencil-shape), while Heteroscedasticity indicates a changing spread (ice-cream-cone shape).
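To make the regression equation concrete, here is a minimal sketch that fits $Y = MX + B$ by ordinary least squares and computes $R^2$ with plain Python. The data points are invented for illustration, and in practice a spreadsheet or statistics package would produce these numbers.

```python
def fit_line(xs, ys):
    """Least-squares slope M, intercept B, and R-squared for Y = M*X + B."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    # R-squared: share of the variation in Y explained by the fitted line.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_res / ss_tot
    return m, b, r_squared


# Hypothetical observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b, r_squared = fit_line(xs, ys)
print(f"Y = {m:.2f}X + {b:.2f}, R^2 = {r_squared:.2f}")
```

For these points the slope is 0.6, the intercept is 2.2, and $R^2$ is 0.6, so the line explains about 60% of the variation in $Y$.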

Decision-Making and Optimization Tools

Decision Tree Analysis is used to choose between alternatives under uncertainty by calculating the Expected Value ($EV$), which is the weighted average of outcomes. Each alternative's expected value is the sum of every payoff multiplied by its probability ($EV = \sum (\text{Payoff} \times \text{Probability})$), and the decision rule is to choose the alternative with the highest expected value. Linear Programming is used to find the optimal solution, such as a product mix, to maximize or minimize an objective under specific constraints. Break-Even Analysis determines the point where total revenue equals total cost, identifying the volume needed before profit begins. Cross-Over Analysis compares cost-per-volume between alternatives to see which is best based on how much of a service or product is used. Cluster Analysis is a non-hypothesis tool used to group similar observations, often for market segmentation.
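A short sketch of two of these tools follows. Expected value is computed as a weighted average, summing payoff times probability across each alternative's outcomes, and the break-even volume comes from setting total revenue equal to total cost. All alternatives, payoffs, probabilities, prices, and costs below are hypothetical.

```python
def expected_value(outcomes):
    """outcomes: (probability, payoff) pairs whose probabilities sum to 1."""
    return sum(p * payoff for p, payoff in outcomes)


# Hypothetical decision with two alternatives.
alternatives = {
    "launch": [(0.6, 100_000), (0.4, -40_000)],
    "hold": [(1.0, 10_000)],
}
ev_by_option = {name: expected_value(o) for name, o in alternatives.items()}
best_option = max(ev_by_option, key=ev_by_option.get)


def break_even_units(fixed_cost, price, variable_cost):
    """Volume where price * Q equals fixed_cost + variable_cost * Q."""
    return fixed_cost / (price - variable_cost)


print(ev_by_option, "->", best_option)
print("break-even units:", break_even_units(50_000, 25, 15))
```

Here "launch" has an expected value of 44,000 against 10,000 for "hold", so the rule selects "launch"; the break-even volume is 5,000 units.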

Simulations help model potential outcomes. A What-If Simulation changes input values to observe changes in outputs. A Monte Carlo Simulation is more complex, using many random outcomes to model uncertainty and forecasting potential profits or risks under varying conditions. These tools allow managers to quantify risk and success probabilities before committing resources.
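The difference between the two simulation styles can be sketched as follows. The profit model, the prices and costs, and the assumption that demand is normally distributed (mean 6,000, standard deviation 1,500) are all hypothetical choices made for this example.

```python
import random


def profit(demand, price=25.0, unit_cost=15.0, fixed_cost=50_000.0):
    """Profit for one scenario: margin per unit times demand, minus fixed cost."""
    return (price - unit_cost) * demand - fixed_cost


# What-If: change one input by hand and observe the output.
what_if = {demand: profit(demand) for demand in (4_000, 5_000, 6_000)}


def monte_carlo(trials=10_000, seed=42):
    """Draw many random demand values and summarize the resulting profits."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = []
    for _ in range(trials):
        # Assumed demand distribution; negative draws are clipped to zero.
        demand = max(rng.gauss(6_000, 1_500), 0.0)
        results.append(profit(demand))
    mean_profit = sum(results) / trials
    prob_loss = sum(r < 0 for r in results) / trials
    return mean_profit, prob_loss


mean_profit, prob_loss = monte_carlo()
print(what_if)
print(f"mean profit: {mean_profit:,.0f}  P(loss): {prob_loss:.1%}")
```

The What-If table answers "what happens at this exact demand," while the Monte Carlo run estimates how often a loss occurs across many plausible demand levels.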

Probability and Descriptive Statistics

Probability measures the likelihood of an event: $P = \text{favorable outcomes} / \text{total opportunities}$. The Big Four principles of probability include Intersection (AND), which means both events happen and, for independent events, is found by multiplying their probabilities; Union (OR), which means either event happens and requires adding the probabilities and subtracting the overlap; Complement (NOT), calculated as $1 - P(A)$; and Conditional Probability (Bayes Theorem), which is the probability of an event given that another has already occurred, $P(A \mid B) = P(A \text{ and } B) / P(B)$. Combination is a counting technique that determines how many possible groups can be formed when order does not matter; for example, grouping $10$ volunteers into teams of $3$ results in $120$ possible groups.
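The four rules and the combination count can be checked with a few lines of Python. The two event probabilities are hypothetical, and the events are assumed to be independent so that the AND rule reduces to simple multiplication.

```python
from math import comb

# Hypothetical probabilities for two independent events A and B.
p_a, p_b = 0.5, 0.2

p_and = p_a * p_b            # Intersection (AND), valid because A and B are independent
p_or = p_a + p_b - p_and     # Union (OR): add, then subtract the overlap
p_not_a = 1 - p_a            # Complement (NOT)
p_a_given_b = p_and / p_b    # Conditional: P(A | B) = P(A and B) / P(B)

print(round(p_and, 4), round(p_or, 4), round(p_not_a, 4), round(p_a_given_b, 4))

# Combination: ways to choose 3 volunteers from 10 when order does not matter.
teams = comb(10, 3)
print(teams)  # 120
```

Because the events are independent, $P(A \mid B)$ comes out equal to $P(A)$: knowing that B occurred does not change the chance of A.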

Descriptive statistics summarize data through central tendency and spread. The Mean is the arithmetic average, which is sensitive to outliers. The Median is the middle value of sorted data and is more robust against extremes. The Mode is the most frequent value; datasets can be bimodal (two modes) or multimodal. Variance measures the average squared distance from the mean, and the Standard Deviation is the square root of variance, representing the volatility or spread. In a Normal Distribution (Bell Curve), the Empirical Rule states that about $68\%$ of data falls within $1$ standard deviation of the mean, $95\%$ within $2$, and $99.7\%$ within $3$. The Z-Score tells how many standard deviations a value is from the mean: $Z = \frac{\text{Score} - \text{Mean}}{\text{SD}}$. The Interquartile Range ($IQR$) is the distance between the third and first quartiles ($Q3 - Q1$), representing the middle $50\%$ of data. A Box Plot visually displays these quartiles along with the minimum, maximum, and outliers.
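These summaries can be reproduced with Python's standard `statistics` module. The data set is hypothetical, and the sketch assumes population formulas (`pvariance`, `pstdev`); sample formulas divide by $n - 1$ and give slightly larger values. Quartile conventions also differ between tools, so the IQR from a spreadsheet may not match the one computed here.

```python
import statistics

# Hypothetical data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)      # list, so bimodal data shows both modes
variance = statistics.pvariance(data)   # population variance (assumption)
std_dev = statistics.pstdev(data)       # square root of the variance


def z_score(score, mean, sd):
    """Number of standard deviations a score lies from the mean."""
    return (score - mean) / sd


# Quartiles using the module's default method; other tools may differ.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(mean, median, modes, variance, std_dev)
print("z-score of 9:", z_score(9, mean, std_dev))
print("IQR:", iqr)
```

With a mean of 5 and a standard deviation of 2, the value 9 sits two standard deviations above the mean.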

Quality Management and Quality Tools

Quality Management focuses on preventing defects and improving processes. The PDCA (Deming) Cycle follows four stages: Plan (investigate and define), Do (test/pilot), Check (measure/evaluate), and Act (standardize fix). SIPOC is a high-level process map looking at Suppliers, Inputs, Processes, Outputs, and Customers. Quality Assurance (QA) is proactive and focuses on preventing defects through training and process capability. Quality Control (QC) is reactive, focusing on catching and repairing defects after they occur. A defect is any product or service characteristic that fails to meet customer expectations.

There are seven basic quality tools. A Run Chart shows performance over time. A Control Chart is a run chart with Upper and Lower Control Limits (UCL/LCL); data within limits is Common Cause Variation (normal noise), while data outside limits is Special Cause Variation (unusual events). A Cause and Effect (Fishbone/Ishikawa) Diagram answers why a problem happens. A Flow Chart uses boxes and arrows to show where a process occurs or fails. Check Sheets are used to collect and tally data. Histograms show the distribution of numerical data across ranges (bins). A Pareto Chart ranks categories from highest to lowest to prioritize issues. A Scatter Diagram shows the relationship between two variables. Major programs include Lean (focused on waste reduction), Six Sigma (focused on reducing variation to $3.4$ defects per million opportunities), and Just-in-Time (JIT) (focused on inventory reduction).
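As one illustration of how these tools connect, the sketch below takes tallies of the kind a Check Sheet would collect and ranks them the way a Pareto Chart does, adding a cumulative percentage column. The defect categories and counts are hypothetical.

```python
# Hypothetical check-sheet tallies, in the order they were recorded.
tallies = {"dent": 30, "other": 10, "scratch": 45, "misalignment": 15}


def pareto_table(counts):
    """Rank categories from highest to lowest count with cumulative percent."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    table = []
    running = 0
    for category, count in ranked:
        running += count
        table.append((category, count, 100 * running / total))
    return table


table = pareto_table(tallies)
for category, count, cumulative_pct in table:
    print(f"{category:<14}{count:>4}{cumulative_pct:>8.1f}%")
```

In this example the top two categories account for 75% of all defects, which is the kind of prioritization signal a Pareto Chart is meant to surface.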

Organizational Performance and Real-World Measures

Results-Based Management (RBM) is a monitoring approach used heavily by nonprofits to ensure actual results are achieved through partnership. It follows a flow from Inputs to Activities, Outputs, Outcomes, and finally long-term Impact. Index Numbers compare change relative to a base period using the formula $(\text{Current Price} / \text{Base Price}) \times 100$. Health and population metrics include Incidence (new cases), Cumulative Incidence (new cases over a specific period), and Prevalence (total existing cases). Ratios compare one group to another, often described as being "times more likely" to experience an event.
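The index-number formula and the "times more likely" ratio can be worked through with a short sketch. The prices, case counts, and group sizes are hypothetical, and the ratio is computed here as one group's event rate divided by the other's, which is one common way to express such a comparison.

```python
def index_number(current_price, base_price):
    """(Current Price / Base Price) * 100; the base period itself scores 100."""
    return current_price / base_price * 100


def times_more_likely(events_a, size_a, events_b, size_b):
    """Ratio of group A's event rate to group B's event rate."""
    return (events_a / size_a) / (events_b / size_b)


# Hypothetical prices: 120 in the base period, 150 now.
print(index_number(150, 120))  # 125.0, i.e. 25% above the base period

# Hypothetical groups of 200 people with 30 and 10 events.
print(round(times_more_likely(30, 200, 10, 200), 2))
```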

Performance is also tracked via frameworks. A KPI (Key Performance Indicator) is a single specific metric used to measure progress toward a goal. SMART goals ensure KPIs are Specific, Measurable, Achievable, Relevant, and Time-bound. A Dashboard is a visual display tool that monitors various metrics in one place. The Balanced Scorecard views organizational strategy from four perspectives: Financial, Customer, Internal Process, and Learning and Growth. Finally, the Net Promoter Score (NPS) measures customer loyalty by asking how likely a customer is to recommend the company to others, categorizing respondents as promoters, passives, or detractors; the score is the percentage of promoters minus the percentage of detractors.
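A minimal NPS calculation is sketched below. It assumes the widely used convention, which this guide does not spell out, of a 0 to 10 survey scale on which 9 to 10 counts as a promoter, 7 to 8 as a passive, and 0 to 6 as a detractor. The survey responses are hypothetical.

```python
def net_promoter_score(scores):
    """Percent promoters minus percent detractors on a 0-10 scale (assumed cutoffs)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    # Passives (7-8) count toward the total but toward neither group.
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical survey responses.
responses = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
nps = net_promoter_score(responses)
print(nps)  # 20.0
```

Five promoters and three detractors out of ten responses give 50% minus 30%, so the score is 20.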