Chapter 1 Notes: Data and Statistics
What is Statistics?
Statistics can refer to numerical facts (averages, medians, percentages, maximums) used to understand business and economic situations.
Statistics also refers to the art and science of collecting, analyzing, presenting, and interpreting data.
Applications in Business and Economics
Accounting: public accounting firms use statistical sampling procedures for audits.
Economics: economists use statistical information to forecast economic trends.
Finance: price-earnings ratios and dividend yields guide investment advice.
Marketing: data collected by electronic point-of-sale scanners at retail counters support marketing research.
Production: statistical quality control charts monitor production processes.
Information Systems: statistical information helps administrators assess computer network performance.
Data and Data Sets
Data: facts and figures collected, analyzed, and summarized for presentation and interpretation.
Data set: all the data collected in a particular study.
If a study has n elements and p variables, the total number of data values in the data set is \text{Total values} = n \times p.
Elements, Variables, and Observations
Elements: entities on which data are collected.
Variable: a characteristic of interest for the elements.
Observation: the set of measurements for a particular element.
A data set with n elements contains n observations.
Total data values in a complete data set: \text{Total values} = n \times p where p is the number of variables.
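As a minimal sketch of the n \times p count, the Python snippet below stores a small data set as one dictionary per element. The firm names and every value are made-up placeholders, not the figures from the slides; the company name is assumed to identify the element rather than count as a variable.

```python
# Hypothetical data set: each dict is one observation (one element's measurements).
# Names and values are illustrative placeholders only.
data_set = [
    {"company": "Firm A", "exchange": "NYSE", "annual_sales_musd": 10.0, "eps_usd": 0.50},
    {"company": "Firm B", "exchange": "NASDAQ", "annual_sales_musd": 25.0, "eps_usd": 1.20},
    {"company": "Firm C", "exchange": "NYSE", "annual_sales_musd": 7.5, "eps_usd": 0.10},
]

# The company name identifies the element; the remaining keys are the variables.
variables = [key for key in data_set[0] if key != "company"]

n = len(data_set)      # number of elements (and observations)
p = len(variables)     # number of variables
total_values = n * p   # total data values in a complete data set

print(n, p, total_values)  # 3 3 9
```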
Data, Data Sets, Elements, Variables, and Observations (Example)
Example data table includes: Company, Stock Exchange, Annual Sales ($M), Earnings per share ($).
Data set example (from slides): Dataram, EnergySouth, Keystone, LandCare, Psychemedics with corresponding stock exchange, sales, and earnings per share.
Scales of Measurement
Scales determine the amount of information and appropriate analyses.
Scales: Nominal, Ordinal, Interval, Ratio.
Nominal: data are labels or names identifying an attribute (nonnumeric label or numeric code allowed).
Ordinal: data have the properties of nominal data, and the order or rank of the data is meaningful.
Interval: data have ordinal properties with a fixed unit of measure between observations; data are numeric.
Ratio: like interval data, with a meaningful zero allowing ratio comparisons.
Nominal Scale
Data are labels or names; example: classification of students by the school in which they are enrolled (Business, Humanities, Education) or numeric codes for schools (e.g., 1=Business, 2=Humanities, 3=Education).
Ordinal Scale
Properties of nominal data plus meaningful order; example: class standing (Freshman, Sophomore, Junior, Senior) or coded equivalents (e.g., 1=Freshman, 2=Sophomore, etc.).
Interval Scale
Properties of ordinal data with a fixed unit of measure; example: SAT scores; differences are meaningful (e.g., 1985 vs 1880 differ by 105 points).
Interval data are always numeric.
Ratio Scale
All properties of interval data plus a meaningful zero value; example: price of a book, where a price of 0 means the book costs nothing and ratios are meaningful (a $200 book costs twice as much as a $100 book: 200/100 = 2).
Categorical and Quantitative Data
Data can be categorized as categorical (qualitative) or quantitative (numerical).
Analyses depend on whether data are categorical or quantitative; generally more analytical options exist for quantitative data.
Categorical Data
Labels or names identifying an attribute of each element (often qualitative).
Can be nominal or ordinal; may be numeric or nonnumeric.
Analytical options are relatively limited.
Quantitative Data
Indicate how many or how much; always numeric.
Ordinary arithmetic operations are meaningful for quantitative data.
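The distinction can be sketched in Python: categorical data are typically summarized by counting labels, while quantitative data support arithmetic such as averaging. The student observations below are hypothetical values chosen only for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical observations for four students (illustrative values only).
class_standing = ["Freshman", "Junior", "Freshman", "Senior"]  # categorical (ordinal)
book_prices = [100.0, 200.0, 150.0, 50.0]                      # quantitative (ratio)

# Categorical data: count how often each label occurs.
standing_counts = Counter(class_standing)

# Quantitative data: ordinary arithmetic such as the mean is meaningful.
average_price = mean(book_prices)

print(standing_counts["Freshman"])  # 2
print(average_price)                # 125.0
```

Averaging numeric codes for class standing (1=Freshman, 2=Sophomore, and so on) would run without error but would not produce a meaningful quantity, which is why the scale of measurement matters more than whether the stored value is a number.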
Types of Data: Cross-Sectional and Time Series
Cross-Sectional Data: collected at the same or approximately the same point in time (e.g., building permits issued in November 2013 in each Ohio county).
Time Series Data: collected over several time periods (e.g., permits in Lucas County over the last 36 months).
Graphs of time series help analysts understand past behavior, identify trends, and project future levels.
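One way to picture the difference is by the shape of the data. In the sketch below, the cross-sectional view holds many counties at one point in time, and the time series view holds one county over consecutive months. All permit counts are placeholder values, and only three months are shown rather than 36.

```python
# Hypothetical building-permit counts (placeholder values, not real data).

# Cross-sectional: many elements (counties), one point in time (November 2013).
permits_nov_2013 = {"Lucas": 12, "Franklin": 30, "Cuyahoga": 25}

# Time series: one element (Lucas County), several consecutive periods.
lucas_permits_by_month = [
    ("2013-09", 9),
    ("2013-10", 11),
    ("2013-11", 12),
]

# The same observation appears in both views: Lucas County in November 2013.
assert permits_nov_2013["Lucas"] == dict(lucas_permits_by_month)["2013-11"]
```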
Graphical Summary: Time Series Graphs and Histograms
Graphs help summarize data visually.
Example: histogram of tune-up parts costs shows distribution of costs across 50 observations.
Descriptive Statistics
The most common numerical descriptive statistic is the mean, a measure of central tendency.
Hudson Auto example: mean cost of parts for 50 tune-ups is $79, computed as
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{79 \times 50}{50} = 79.
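The mean formula translates directly into code. This is a minimal sketch; the five costs below are hypothetical values picked to average 79, not the 50 Hudson Auto observations.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

# Hypothetical parts costs in dollars for five tune-ups (not the Hudson Auto data).
costs = [62, 91, 78, 85, 79]
print(sample_mean(costs))  # 79.0
```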
Population, Sample, and Statistical Inference
Population: the set of all elements of interest in a study.
Sample: a subset of the population.
Statistical inference: using data from a sample to estimate population characteristics or test hypotheses.
Census: collecting data for the entire population.
Sample survey: collecting data for a sample.
Process of Statistical Inference
1) Population consists of all tune-ups; the population average cost is unknown.
2) A sample of 50 tune-ups is examined.
3) The sample provides a sample average of $79 per tune-up.
4) Use the sample average to estimate the population average.
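The four steps can be imitated with a small simulation. This is only a sketch under stated assumptions: the population is simulated (10,000 costs drawn uniformly between $50 and $110, a range chosen arbitrarily), so its average is known here, which is never the case in a real study.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Step 1: hypothetical population of 10,000 tune-up parts costs (simulated data).
population = [rng.uniform(50, 110) for _ in range(10_000)]

# Steps 2-3: draw a simple random sample of 50 and compute its average.
sample = rng.sample(population, 50)
sample_average = mean(sample)

# Step 4: the sample average serves as the estimate of the population average.
population_average = mean(population)  # known only because the data are simulated
print(f"estimate: {sample_average:.2f}, actual: {population_average:.2f}")
```

The estimate will usually be close to, but not exactly equal to, the population average; a different sample of 50 would give a slightly different estimate.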
Analytics
Analytics is the scientific process of transforming data into insight for better decisions.
Descriptive analytics: describes what happened in the past.
Predictive analytics: uses models from past data to predict the future or assess effects of one variable on another.
Prescriptive analytics: yields a best course of action.
Big Data and Data Mining
Big data: large, complex data sets.
The Three Vs: Volume (amount of data), Velocity (speed of data collection/processing), Variety (different data types).
Data warehousing: capturing, storing, and maintaining data.
Examples: Wal‑Mart captures 20–30 million transactions per day; Visa processes 6,800 payment transactions per second.
Data mining: methods for developing useful decision-making information from large databases; combines statistics, mathematics, and computer science; automated procedures discover relationships and predict outcomes.
Data Mining Applications
Major applications in consumer-focused industries: retail, financial, and telecommunications.
Used to identify related products for cross-sell and to target discounts based on past purchasing volumes.
Data Mining Requirements
Requires statistical methods such as multiple regression, logistic regression, and correlation.
Requires computer science technologies, including artificial intelligence and machine learning.
Substantial time and financial investment are needed.
Model Reliability and Validation
A model that fits a particular sample well may not generalize to other data.
Large data sets can be partitioned into a training set for model development and a test set for validation.
Risk of overfitting: model captures noise as if it were signal.
Careful interpretation and extensive testing are crucial.
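Partitioning can be sketched in a few lines of Python. The function name partition, the 25% test fraction, and the integer stand-in records are all assumptions made for illustration; the notes do not prescribe a particular split.

```python
import random

def partition(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (training, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in for 100 observations
training_set, test_set = partition(records)
print(len(training_set), len(test_set))  # 75 25
```

A model would be developed using only training_set and then evaluated on test_set; a model that performs much better on the training data than on the test data is a sign of overfitting.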
Ethical Guidelines for Statistical Practice
Unethical behaviors include improper sampling, inappropriate data analysis, misleading graphs, inappropriate summary statistics, and biased interpretation.
Strive to be fair, thorough, objective, and neutral in data collection, analysis, and presentation.
Consumers of statistics should be aware of possible unethical practices by others.
The American Statistical Association developed the report “Ethical Guidelines for Statistical Practice,” containing 67 guidelines organized into 8 topic areas (Professionalism, Responsibilities to Funders/Clients/Employers, Publications/Testimony, Research Subjects, Research Team Colleagues, Other Statisticians/Practitioners, Allegations of Misconduct, Employers’ Responsibilities).
Data Sources
Existing sources include internal company records, business database services (e.g., Dow Jones & Co.), government agencies, industry associations, special-interest organizations, and the Internet.
Data Available From Internal Company Records
Records: Employee records, Production records, Inventory records, Sales records, Credit records, Customer profile data (age, gender, income, household size), etc.
Data Available From Selected Government Agencies
Census Bureau (population data, households, income): www.census.gov
Federal Reserve Board (money supply, exchange rates, discount rates): www.federalreserve.gov
Office of Management & Budget (OMB) (federal revenues, expenditures, debt): www.whitehouse.gov/omb
Department of Commerce (business activity, shipments, profits by industry): www.doc.gov
Bureau of Labor Statistics (unemployment, earnings, safety): www.bls.gov
Data Sources: Observational vs Experimental Studies
Observational (nonexperimental): no attempt to control variables; example: surveys (e.g., smokers vs nonsmokers).
Experimental: identify a variable of interest, then control other variables to study their influence on the variable of interest; historical example: 1954 Public Health Service Salk polio vaccine trial with nearly two million children.
Data Acquisition Considerations
Time requirement: information gathering can be time consuming; information may become outdated.
Cost of acquisition: information can be costly to obtain.
Data errors: data gathered with little care can mislead.
Statistical Inference: Problems and Solutions (Selected Examples)
Problem 9 (Solution):
a. The data are categorical.
b. 30/71 = 0.423; 0.423 × 100 = 42.3%
Problem 12 (Solution):
a. Population: all visitors to Hawaii.
b. Since most visitors arrive by air, distribute questionnaires on incoming flights (printed on the back of the plants/animals declaration form); a large percentage of passengers complete them.
c. Questions 1 and 4 are quantitative (number of visits, number of days); Questions 2 and 3 are qualitative (reason for trip, where the visitor stays); both could be correct.
Problem 20 (Solution):
a. 43% bullish; 21% expect health care to lead in next 12 months.
b. Estimated average 12-month return for population of investment managers: \text{average return} = 11.2\%.
c. Estimated average tenure: 2.5\text{ years}.
Problem 21 (Solution):
a. Populations: women whose mothers took DES during pregnancy vs those whose mothers did not.
b. The study was a survey.
c. \frac{63}{3980} \approx 0.0158, or about 15.8 abnormalities per 1000 women.
d. If the article reports twice as many abnormalities in the DES-exposed group, a rough estimate for women whose mothers did not take DES is \frac{15.8}{2} \approx 7.9\text{ abnormalities per 1000}.
e. Disease occurrences are rare; large samples are needed to observe sufficient cases.
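The arithmetic in Problem 21, parts c and d, can be checked with a short helper. The function name rate_per_thousand is an assumption introduced here; the counts 63 and 3980 come from the solution above, and the halving step relies on the "twice as many" statement in part d.

```python
def rate_per_thousand(cases, group_size):
    """Return the number of cases per 1000 members of the group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return 1000 * cases / group_size

exposed_rate = rate_per_thousand(63, 3980)  # DES-exposed group, from part c
unexposed_rate = exposed_rate / 2           # part d: the "twice as many" assumption

print(round(exposed_rate, 1))    # 15.8
print(round(unexposed_rate, 1))  # 7.9
```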