Lecture 1 Summary: Defining and Collecting Data

Statistics and Data

Objectives

Statistics is a way of thinking for better decisions.
The DCOVA framework guides the application of statistics.
Understand issues that arise when defining variables.
Define variables effectively.
Understand different measurement scales.
Collect data properly.
Identify different ways to collect a sample.
Understand data preparation issues.
Understand types of survey errors.

What Is Statistics?

Statistics helps make sense of the complex world and make valid business decisions.
Statistics are methods to work with data effectively:
- Summarize and visualize business data.
- Reach conclusions from business data.
- Make reliable predictions about business activities.
- Improve business processes.
Statistics are quantities calculated from data (e.g., average of a variable).
"Data" is plural; "datum" is singular.
Data are values along with their context.

The DCOVA Framework

The DCOVA framework minimizes errors in statistical applications:

Define: Define the data you want to study to solve a problem or meet an objective.
Collect: Collect the data from appropriate sources.
Organize: Organize the data collected by developing tables.
Visualize: Visualize the data by developing charts.
Analyze: Analyze the data collected to reach conclusions and present those results.
Chapters focus on these tasks sequentially.

What Are Data?

All data have a context (who and what).
Data values/observations are information collected regarding some subject.
Data can be numbers, names, etc., and tell us the "Who" and "What".
Data are often organized into a data table.
Rows represent the individual cases (the "Who") about which we record some characteristics.
In the purchase-order example, the "Who" are the purchase orders, and there are 5 cases (each row is a case).
Cases go by different names:
- Items or observations.
- Objects.
- Respondents (individuals who answer a survey).
- Subjects or participants (people in an experiment).
- Experimental units (animals, plants, websites, or other inanimate objects).
Characteristics recorded about each case are called variables (columns in a data table).
The first row of each column identifies What (variable names) has been measured.
Data tables allow us to understand each individual case.
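A minimal sketch of the rows-as-cases, columns-as-variables layout. The order IDs, customer names, and values below are invented for illustration and are not the purchase orders from the lecture.

```python
# Hypothetical purchase-order table: each dict is one case (row),
# each key is one variable (column). All values are made up.
orders = [
    {"order_id": "A001", "customer": "Kim", "quantity": 2, "price": 19.99},
    {"order_id": "A002", "customer": "Lopez", "quantity": 1, "price": 5.49},
    {"order_id": "A003", "customer": "Singh", "quantity": 4, "price": 12.00},
]

variables = list(orders[0].keys())  # the "What": variable names (column headers)
num_cases = len(orders)             # the "Who": one case per row

print("Variables:", variables)
print("Number of cases:", num_cases)
```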

Microsoft Excel and Data Analysis

Excel is the primary data analysis application in Microsoft Office.
Excel uses worksheets to store data and present analysis and results.
A worksheet has rows and columns.
Data are typically saved in a spreadsheet, where the rows represent cases and the columns represent variables.
The intersection of a specific row and column is a cell.
Cells are referred to by column letter and row number address (e.g., A1).
Excel saves files as workbooks.
A workbook is a collection of worksheets and chart sheets.
Excel files are typically saved in .xlsx or .xls format.

Variable Types: Categorical vs. Numerical

Categorical (qualitative) variables: Take categories as their values (e.g., "yes", "no", "blue", "brown", "green", "child", "adult", "senior", "male", "female", area codes). They have no units, so arithmetic on them (such as averaging area codes) is not meaningful.
Numerical (quantitative) variables: Have values (with units) that represent a counted or measured quantity.
Discrete variables: Arise from a counting process (responses are finite integers). Example: number of students in a class.
Continuous variables: Arise from a measuring process (responses are values with no limit on the number of decimal places). Examples: time spent waiting for bank teller service, prices, stock prices.

Examples of Variable Types

"Do you have a Facebook profile?" (Yes or No) - Categorical
"How many text messages have you sent in the past three days?" - Numerical (discrete)
"How long did the mobile app update take to download?" - Numerical (continuous)

Types of Variables

Categorical: Nominal (e.g., marital status, political party, eye color) and Ordinal (e.g., ratings - Good, Better, Best; Low, Med, High).
Numerical: Discrete (e.g., number of children, defects per hour) and Continuous (e.g., weight, voltage).

Measurement Scales (Categorical Variables)

Ordinal scale: Data values can be ordered; ranking is implied (e.g., employees ranked by months employed).
Nominal scale: Classifies data into distinct categories with no ranking implied (e.g., types of investment, cellular provider).

Measurement Scales (Numerical Variables)

Ratio scale: Ordered scale with a true zero point (zero represents nonexistence of the quantity) (e.g., zero salary means no salary; zero weight means no weight).
Interval scale: Ordered scale without a true zero point; zero is assigned arbitrarily (e.g., zero degrees Celsius does not mean that temperature doesn’t exist; a zero score on a standardized exam does not mean no intelligence).
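A short worked example of why ratios are not meaningful on an interval scale. It uses the standard conversion $K = C + 273.15$; the two temperatures are chosen only for illustration.

$$\frac{20\,^{\circ}\mathrm{C}}{10\,^{\circ}\mathrm{C}} = 2 \qquad \text{but} \qquad \frac{293.15\ \mathrm{K}}{283.15\ \mathrm{K}} \approx 1.035$$

The same two temperatures give different ratios depending on where zero is placed, so "twice as hot" has no meaning in Celsius. On a ratio scale such as Kelvin, weight, or salary, the zero is fixed and ratios are meaningful.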

Data Sources: Data Collection

Population: All the items or individuals about which you want to draw conclusions.
Sample: A portion of the population of items or individuals.
Examining the entire population is often impractical, so we examine a sample instead.
Parameter: A measure that summarizes a characteristic of a population (e.g., mean age of a population).
Statistic: A measure that summarizes a characteristic of sample data (e.g., mean age of a sample).
Descriptive Statistics: Values (mean, median, etc.) that summarize the sample data (Ch. 1 to Ch. 3).
Inferential Statistics: Drawing conclusions about a population parameter from a sample statistic (remainder of the course).
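The distinction between a parameter and a statistic can be sketched as follows. The population of ages is randomly generated for illustration only, and the names `population_ages` and `sample_ages` are hypothetical.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (made-up values).
rng = random.Random(42)
population_ages = [rng.randint(18, 80) for _ in range(1000)]

# Parameter: computed from every member of the population.
population_mean = statistics.mean(population_ages)

# Statistic: computed from a sample drawn from that population.
sample_ages = rng.sample(population_ages, 50)
sample_mean = statistics.mean(sample_ages)

print(f"Population mean (parameter): {population_mean:.2f}")
print(f"Sample mean (statistic):     {sample_mean:.2f}")
```

The two means will usually be close but not identical; inferential statistics deals with how far the sample statistic is likely to be from the population parameter.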

Sampling Frame

When you collect data by selecting a sample, you begin by defining the sampling frame.
The sampling frame is a listing of the items that make up the population from which the sample will be selected.
Inaccurate or biased results can occur if a frame excludes certain groups or portions of the population.
Using different frames to generate data can lead to dissimilar conclusions.

Types of Samples

Nonprobability Samples
- Convenience
- Judgment
Probability Samples
- Simple Random
- Systematic
- Stratified
- Cluster

Nonprobability Samples

Items are chosen without regard to their probability of occurrence.
Convenience sampling: Items selected based on ease, expense, or convenience (biased).
Judgment sampling: Opinions of data collector affect sample selected (biased).

Probability Samples

Items in the sample are chosen on the basis of known probabilities.
Including a variety of people in the sample helps make the sample's underlying characteristics representative of the population's.
Four types presented:
- Simple Random.
- Systematic.
- Stratified.
- Cluster.

Simple Random Sample (SRS)

Every individual/item has an equal chance of being selected.
$N$ = frame size. Number every item in the frame from 1 to $N$ (or 0 to $N-1$).
The chance of selecting any particular member of the frame on the first selection is $1/N$.
Example: Selecting 5 students from 80.
- Number students from 00 to 79 (sampling frame).
- Use random digits from a table to make the selection.
- Ignore any number greater than 79 (e.g., 85), because the highest label is 79.
- Skip repeated numbers.
- SRS consists of students with the numbers 58, 55, 14, 38, and 50.
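The selection procedure above can be sketched in Python. This is a minimal illustration, not the lecture's method: a seeded pseudorandom generator stands in for the random digit table, the function name `simple_random_sample` is hypothetical, and the numbers it draws will differ from the ones listed above.

```python
import random

def simple_random_sample(frame_size, sample_size, seed=None):
    """Draw labels 0..frame_size-1 without replacement, mimicking a
    two-digit random number table (assumes frame_size <= 100)."""
    if not 0 <= sample_size <= frame_size <= 100:
        raise ValueError("need 0 <= sample_size <= frame_size <= 100")
    rng = random.Random(seed)
    selected = []
    while len(selected) < sample_size:
        candidate = rng.randint(0, 99)   # read a two-digit number, 00 to 99
        if candidate >= frame_size:      # e.g., ignore 85 when the highest label is 79
            continue
        if candidate in selected:        # skip repeated numbers
            continue
        selected.append(candidate)
    return selected

students = simple_random_sample(frame_size=80, sample_size=5, seed=1)
print([f"{s:02d}" for s in students])
```

In practice, `random.sample(range(80), 5)` gives an equivalent simple random sample in one call; the loop is spelled out only to mirror the table-based steps.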

Systematic Sample

Decide on the sample size $n$ (equal to the number of groups).
Divide the frame of $N$ individuals into $n$ groups of $k$ individuals each, where $k = N/n$.
Randomly select one individual from the first group.
Select every $k$th individual thereafter.
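A minimal sketch of these steps, assuming a hypothetical frame of $N = 40$ numbered items and a sample size of $n = 4$, so $k = 10$. The function name `systematic_sample` is illustrative.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick one random start in the first group of k items, then every k-th item."""
    k = len(frame) // n          # group size k = N / n (rounded down if n does not divide N)
    rng = random.Random(seed)
    start = rng.randrange(k)     # random position within the first group
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 41))       # hypothetical frame: items numbered 1 to 40
sample = systematic_sample(frame, n=4, seed=0)
print(sample)
```

Only the starting point is random; once it is chosen, the rest of the sample is determined.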

Stratified Random Sample

Divide the population into two or more subgroups (strata) according to some common characteristic (e.g., gender, race, age).
A stratum contains items that share a common feature.
Use simple random sampling to select items from each stratum, with sample sizes proportional to the strata sizes, so that the fraction of each subgroup in the sample matches its fraction in the population.
Samples from strata are combined into one.
Stratify (to ensure representation) and sample randomly within each stratum.
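The procedure can be sketched as follows, assuming proportional allocation. The strata names and sizes are invented for illustration, and `stratified_sample` is a hypothetical helper.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Simple random sample within each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    total = sum(len(items) for items in strata.values())
    sample = []
    for items in strata.values():
        # Proportional allocation; rounding can shift the total by one or two
        # when the proportions do not divide evenly.
        n_stratum = round(sample_size * len(items) / total)
        sample.extend(rng.sample(items, n_stratum))
    return sample

# Hypothetical population of 100 students in two strata (60% / 40%).
strata = {
    "undergraduate": [f"U{i}" for i in range(60)],
    "graduate": [f"G{i}" for i in range(40)],
}
sample = stratified_sample(strata, sample_size=10, seed=0)
print(sample)
```

With these sizes the sample contains 6 undergraduates and 4 graduates, matching the 60/40 split in the population.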

Cluster Sample

The population is divided into several “clusters,” each representative of the population (e.g., city blocks, election districts).
A cluster contains a wide variety of items that need not share a common feature.
A simple random sample of clusters is selected.
All items in selected clusters can be used, or items can be chosen using another probability sampling technique.
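A minimal sketch of the first option (use every item in the selected clusters). The city blocks and household labels are invented, and `cluster_sample` is a hypothetical helper.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Select whole clusters at random, then keep every item in the chosen clusters."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)  # simple random sample of cluster names
    items = [item for name in chosen for item in clusters[name]]
    return chosen, items

# Hypothetical city blocks, each listing the households it contains.
clusters = {
    "block_A": ["A1", "A2", "A3"],
    "block_B": ["B1", "B2"],
    "block_C": ["C1", "C2", "C3", "C4"],
    "block_D": ["D1", "D2", "D3"],
}
chosen_blocks, households = cluster_sample(clusters, num_clusters=2, seed=0)
print(chosen_blocks, households)
```

Unlike stratified sampling, randomness enters at the level of the groups: some clusters are left out entirely.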

Comparing methods

Simple random sample (SRS) and Systematic sample (SS):
- Simple to use.
- May not be a good representation of the population’s underlying characteristics.
Stratified sample:
- Ensures representation of individuals across the entire population.
Cluster sample:
- More cost effective than SRS and SS.
- Less efficient (need larger sample to acquire the same level of precision).

Common Sampling Designs: Study Time Example

A professor wants to know how much time (in hours) students at his school spent studying for the Business Statistics exam. He divides the students taking the Business Statistics course into groups and randomly selects the first student from the first group. He then selects every tenth student thereafter. Finally, he calculates the average (mean) number of hours the selected students spent studying.

Population of interest – All students enrolled in the Business Statistics course at the school
Population parameter – Mean number of hours that all students enrolled in the Business Statistics course at the school spent studying for the exam
Sample statistic – Mean number of hours that the students in the sample spent studying for the exam
Method – Systematic sampling

Additional Resources (Non-test Material)

Business Analytics combines methods from Statistics, Information systems, and Management science to support fact-based decision making.
Where, How, and When data are collected matters (the context matters).
Observational studies and designed experiments both quantify the effect of a process change (treatment) on a variable of interest. In observational studies there is no direct control over which items receive the treatment; in designed experiments there is.
Data cleaning is necessary to address irregularities such as typographical or data entry errors, impossible or undefined values, missing values, and outliers.
Recoding variables = redefining categories.
Ethical considerations for survey methods.