Chapter 1 Notes: Data and Statistics

What is Statistics?

  • Statistics can refer to numerical facts (averages, medians, percentages, maximums) used to understand business and economic situations.

  • Statistics also refers to the art and science of collecting, analyzing, presenting, and interpreting data.

Applications in Business and Economics

  • Accounting: public accounting firms use statistical sampling procedures for audits.

  • Economics: economists use statistical information to forecast economic trends.

  • Finance: price-earnings ratios and dividend yields guide investment advice.

  • Marketing: data collected by electronic point-of-sale scanners at retail counters support marketing research.

  • Production: statistical quality control charts monitor production processes.

  • Information Systems: statistical information helps administrators assess computer network performance.

Data and Data Sets

  • Data: facts and figures collected, analyzed, and summarized for presentation and interpretation.

  • Data set: all the data collected in a particular study.

  • If a study has n elements and p variables, the total number of data values in the data set is \text{Total values} = n \times p.

Elements, Variables, and Observations

  • Elements: entities on which data are collected.

  • Variable: a characteristic of interest for the elements.

  • Observation: the set of measurements for a particular element.

  • A data set with n elements contains n observations.

  • Total data values in a complete data set: \text{Total values} = n \times p where p is the number of variables.

Data, Data Sets, Elements, Variables, and Observations (Example)

  • Example data table includes: Company, Stock Exchange, Annual Sales ($M), Earnings per share ($).

  • Data set example (from slides): Dataram, EnergySouth, Keystone, LandCare, Psychemedics with corresponding stock exchange, sales, and earnings per share.
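
As a rough check of the n × p count, the sketch below counts elements and variables for the example table. It assumes Company identifies each element rather than being a variable; the data values themselves are not reproduced in these notes, so only names are used.

```python
# Element names come from the example table; the data values are not listed here.
elements = ["Dataram", "EnergySouth", "Keystone", "LandCare", "Psychemedics"]

# Assumption: Company names the element, so it is not counted as a variable.
variables = ["Stock Exchange", "Annual Sales ($M)", "Earnings per share ($)"]

n = len(elements)        # number of elements (and observations)
p = len(variables)       # number of variables
total_values = n * p     # data values in a complete data set

print(n, p, total_values)
```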

Scales of Measurement

  • Scales determine the amount of information and appropriate analyses.

  • Scales: Nominal, Ordinal, Interval, Ratio.

  • Nominal: data are labels or names identifying an attribute (nonnumeric label or numeric code allowed).

  • Ordinal: data have the properties of nominal data, and the order or rank of the values is also meaningful.

  • Interval: data have ordinal properties with a fixed unit of measure between observations; data are numeric.

  • Ratio: like interval data, with a meaningful zero allowing ratio comparisons.

Nominal Scale

  • Data are labels or names; example: classification of students by the school in which they are enrolled (Business, Humanities, Education) or numeric codes for schools (e.g., 1=Business, 2=Humanities, 3=Education).

Ordinal Scale

  • Properties of nominal data plus meaningful order; example: class standing (Freshman, Sophomore, Junior, Senior) or coded equivalents (e.g., 1=Freshman, 2=Sophomore, etc.).

Interval Scale

  • Properties of ordinal data with a fixed unit of measure; example: SAT scores; differences are meaningful (e.g., 1985 vs 1880 differ by 105 points).

  • Interval data are always numeric.

Ratio Scale

  • All properties of interval data plus a meaningful zero value; example: price of a book, where a price of 0 means the book has no cost, so ratios are meaningful (a $200 book costs twice as much as a $100 book: 200/100 = 2).
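
The four scales can be contrasted by which operations make sense on each. The sketch below reuses the examples from these notes; the variable names, and the codes 3 and 4 for Junior and Senior (implied by "etc." above), are assumptions for illustration.

```python
# Nominal: codes are only labels, so only equality checks are meaningful.
school_codes = {1: "Business", 2: "Humanities", 3: "Education"}

# Ordinal: codes also carry a meaningful order (codes 3 and 4 are assumed).
class_standing = {"Freshman": 1, "Sophomore": 2, "Junior": 3, "Senior": 4}
senior_outranks_junior = class_standing["Senior"] > class_standing["Junior"]

# Interval: differences between values are meaningful.
sat_difference = 1985 - 1880

# Ratio: zero is meaningful, so ratios of values are meaningful too.
price_ratio = 200 / 100

print(senior_outranks_junior, sat_difference, price_ratio)
```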

Categorical and Quantitative Data

  • Data can be categorized as categorical (qualitative) or quantitative (numerical).

  • Analyses depend on whether data are categorical or quantitative; generally more analytical options exist for quantitative data.

Categorical Data

  • Labels or names identifying an attribute of each element (often qualitative).

  • Can be nominal or ordinal; may be numeric or nonnumeric.

  • Analytical options are relatively limited.

Quantitative Data

  • Indicate how many or how much; always numeric.

  • Ordinary arithmetic operations are meaningful for quantitative data.
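
A minimal sketch of how the two kinds of data are typically summarized: counts and proportions for categorical data, arithmetic such as the mean for quantitative data. All values below are hypothetical and invented for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, invented for illustration only.
schools = ["Business", "Education", "Business", "Humanities", "Business"]
book_prices = [100.0, 200.0, 150.0]

# Categorical data: count how often each label occurs.
counts = Counter(schools)
proportions = {school: count / len(schools) for school, count in counts.items()}

# Quantitative data: ordinary arithmetic is meaningful.
average_price = mean(book_prices)

print(counts, proportions, average_price)
```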

Types of Data: Cross-Sectional and Time Series

  • Cross-Sectional Data: collected at the same or approximately the same point in time (e.g., building permits issued in November 2013 in each Ohio county).

  • Time Series Data: collected over several time periods (e.g., permits in Lucas County over the last 36 months).

  • Graphs of time series help analysts understand past behavior, identify trends, and project future levels.
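
One way to see the distinction is to slice the same records two ways. In the sketch below every permit count is invented, and the counties other than Lucas were chosen only for illustration.

```python
# Hypothetical permit counts keyed by (county, month); every number is invented.
permits = {
    ("Lucas", "2013-09"): 40,
    ("Lucas", "2013-10"): 35,
    ("Lucas", "2013-11"): 30,
    ("Franklin", "2013-11"): 90,
    ("Cuyahoga", "2013-11"): 70,
}

# Cross-sectional: many counties at (approximately) one point in time.
cross_section = {
    county: count
    for (county, month), count in permits.items()
    if month == "2013-11"
}

# Time series: one county over several periods, in chronological order.
time_series = [
    (month, count)
    for (county, month), count in sorted(permits.items())
    if county == "Lucas"
]

print(cross_section)
print(time_series)
```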

Graphical Summary: Time Series Graphs and Histograms

  • Graphs help summarize data visually.

  • Example: histogram of tune-up parts costs shows distribution of costs across 50 observations.

Descriptive Statistics

  • The most common numerical descriptive statistic is the mean, a measure of central tendency.

  • Hudson Auto example: mean cost of parts for 50 tune-ups is $79, computed as
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{79 \times 50}{50} = 79.
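
The mean formula can be sketched as a small function. The helper name `sample_mean` and the four costs are hypothetical; the 50 actual Hudson Auto costs are not listed in these notes.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    if not values:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(values) / len(values)

# Hypothetical costs, chosen only to illustrate the calculation.
costs = [70, 80, 90, 76]
average_cost = sample_mean(costs)

print(average_cost)
```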

Population, Sample, and Statistical Inference

  • Population: the set of all elements of interest in a study.

  • Sample: a subset of the population.

  • Statistical inference: using data from a sample to estimate population characteristics or test hypotheses.

  • Census: collecting data for the entire population.

  • Sample survey: collecting data for a sample.

Process of Statistical Inference

1) Population consists of all tune-ups; the population average cost is unknown.
2) A sample of 50 tune-ups is examined.
3) The sample provides a sample average of $79 per tune-up.
4) Use the sample average to estimate the population average.
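
The four steps above can be sketched as a small simulation. The population here is simulated from a uniform distribution purely as an assumption; in a real study it would be unobserved, which is the reason for sampling.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Step 1: hypothetical population of tune-up costs (simulated, not real data).
population = [rng.uniform(50, 110) for _ in range(10_000)]

# Step 2: examine a sample of 50 tune-ups, drawn without replacement.
sample = rng.sample(population, 50)

# Step 3: compute the sample average.
sample_average = mean(sample)

# Step 4: use the sample average as the estimate of the population average.
estimate = sample_average

print(round(estimate, 2), round(mean(population), 2))
```

Because the population is simulated, the estimate can be compared with the true population average here; in practice only the sample average is available.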

Analytics

  • Analytics is the scientific process of transforming data into insight for better decisions.

  • Descriptive analytics: describes what happened in the past.

  • Predictive analytics: uses models from past data to predict the future or assess effects of one variable on another.

  • Prescriptive analytics: yields a best course of action.

Big Data and Data Mining

  • Big data: large, complex data sets.

  • The Three Vs: Volume (amount of data), Velocity (speed of data collection/processing), Variety (different data types).

  • Data warehousing: capturing, storing, and maintaining data.

  • Examples: Wal‑Mart captures 20–30 million transactions per day; Visa processes 6,800 payment transactions per second.

  • Data mining: methods for developing useful decision-making information from large databases; combines statistics, mathematics, and computer science; automated procedures discover relationships and predict outcomes.
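
To put the two transaction figures on the same scale, the per-second rate can be converted to a daily count. This is a rough sketch that assumes the quoted Visa rate holds for a full 24 hours, which the notes do not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

visa_per_second = 6_800
# Assumption: the per-second rate is sustained for the whole day.
visa_per_day = visa_per_second * SECONDS_PER_DAY

walmart_per_day_low, walmart_per_day_high = 20_000_000, 30_000_000

print(f"{visa_per_day:,}")
```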

Data Mining Applications

  • Major applications in consumer-focused industries: retail, financial, and telecommunications.

  • Used to identify related products for cross-sell and to target discounts based on past purchasing volumes.

Data Mining Requirements

  • Requires statistical methods such as multiple regression, logistic regression, and correlation.

  • Requires computer science technologies, including artificial intelligence and machine learning.

  • Substantial time and financial investment are needed.
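
Of the statistical methods listed, correlation is the simplest to sketch. The function below computes the Pearson correlation coefficient; the function name and the two data lists are hypothetical.

```python
from math import sqrt

def pearson_correlation(x, y):
    """Pearson correlation: covariation of x and y scaled by their spreads."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)

# Hypothetical data: past purchases and discount redemptions per customer.
purchases = [1, 2, 3, 4, 5]
redemptions = [2, 4, 6, 8, 10]
r = pearson_correlation(purchases, redemptions)

print(r)
```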

Model Reliability and Validation

  • A model that fits a particular sample well may not generalize to other data.

  • Large data sets can be partitioned into a training set for model development and a test set for validation.

  • Risk of overfitting: model captures noise as if it were signal.

  • Careful interpretation and extensive testing are crucial.
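
A minimal sketch of the training/test partition, assuming a simple random split. The helper name `partition_rows` and the 80/20 proportion are assumptions; the notes do not prescribe a split ratio.

```python
import random

def partition_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows, then split it into (training set, test set)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # seeded for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set: 100 row identifiers.
rows = list(range(100))
train, test = partition_rows(rows)

print(len(train), len(test))
```

The model is developed on `train` only; its performance on `test`, which it has never seen, gives a more honest indication of whether it has overfit.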

Ethical Guidelines for Statistical Practice

  • Unethical behaviors include improper sampling, inappropriate data analysis, misleading graphs, inappropriate summary statistics, and biased interpretation.

  • Strive to be fair, thorough, objective, and neutral in data collection, analysis, and presentation.

  • Consumers of statistics should be aware of possible unethical practices by others.

  • The American Statistical Association developed the report “Ethical Guidelines for Statistical Practice,” containing 67 guidelines organized into 8 topic areas (Professionalism, Responsibilities to Funders/Clients/Employers, Publications/Testimony, Research Subjects, Research Team Colleagues, Other Statisticians/Practitioners, Allegations of Misconduct, Employers’ Responsibilities).


Data Sources

  • Existing sources include internal company records, business database services (e.g., Dow Jones & Co.), government agencies, industry associations, special-interest organizations, and the Internet.

Data Available From Internal Company Records

  • Records: Employee records, Production records, Inventory records, Sales records, Credit records, Customer profile data (age, gender, income, household size), etc.

Data Available From Selected Government Agencies

  • Census Bureau (population data, households, income): www.census.gov

  • Federal Reserve Board (money supply, exchange rates, discount rates): www.federalreserve.gov

  • Office of Management & Budget (OMB) (federal revenues, expenditures, debt): www.whitehouse.gov/omb

  • Department of Commerce (business activity, shipments, profits by industry): www.doc.gov

  • Bureau of Labor Statistics (unemployment, earnings, safety): www.bls.gov

Data Sources: Observational vs Experimental Studies

  • Observational (nonexperimental): no attempt is made to control the variables of interest; a survey is a typical example, as are studies comparing smokers and nonsmokers.

  • Experimental: identify a variable of interest, then control other variables to study their influence on the variable of interest; historical example: 1954 Public Health Service Salk polio vaccine trial with nearly two million children.

Data Acquisition Considerations

  • Time requirement: information gathering can be time consuming; information may become outdated.

  • Cost of acquisition: information can be costly to obtain.

  • Data errors: data gathered with little care can mislead.

Statistical Inference: Problems and Solutions (Selected Examples)

  • Problem 9 (Solution):

    • a. The data are categorical.

    • b. 30/71 ≈ 0.423; 0.423 × 100 = 42.3%

  • Problem 12 (Solution):

    • a. Population: all visitors to Hawaii.

    • b. Since most visitors arrive by air, use questionnaires on incoming flights (on the back of the plants/animals declaration form); a large percentage of passengers complete them.

    • c. Questions 1 and 4 are quantitative (number of visits, number of days); Questions 2 and 3 are qualitative (reason for trip, where the visitor stays); both could be correct.

  • Problem 20 (Solution):

    • a. 43% bullish; 21% expect health care to lead in next 12 months.

    • b. Estimated average 12-month return for population of investment managers: \text{average return} = 11.2\%.

    • c. Estimated average tenure: 2.5\text{ years}.

  • Problem 21 (Solution):

    • a. Populations: women whose mothers took DES during pregnancy vs those whose mothers did not.

    • b. The study was a survey.

    • c. \frac{63}{3980} \approx 0.0158, i.e. about 15.8\text{ abnormalities per 1000} women whose mothers took DES.

    • d. The article reports twice as many abnormalities in the DES-exposed group, so a rough estimate for women whose mothers did not take DES is \frac{15.8}{2} = 7.9\text{ abnormalities per 1000}.

    • e. Disease occurrences are rare; large samples are needed to observe sufficient cases.
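
The arithmetic in Problems 9(b), 21(c), and 21(d) can be checked with a few lines. The variable names are hypothetical, and 21(d) is read here as estimating the rate for women whose mothers did not take DES.

```python
# Problem 9(b): a proportion expressed as a percentage.
share = 30 / 71
percent = round(share * 100, 1)

# Problem 21(c): abnormalities per 1000 women in the DES-exposed group.
rate_per_1000_exposed = round(63 / 3980 * 1000, 1)

# Problem 21(d): the exposed rate is reported as twice the unexposed rate,
# so halving it gives a rough estimate for the unexposed group.
rate_per_1000_unexposed = round(rate_per_1000_exposed / 2, 1)

print(percent, rate_per_1000_exposed, rate_per_1000_unexposed)
```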