Chapter 3 Notes: Data, Data Types, and Descriptive Statistics

Making Sense from Data: Descriptive Statistics

  • Statistics is the science and art of making decisions using data.
  • It is about analyzing data and drawing meaningful conclusions.
  • Data and statistical tools are essential for:
    • Collecting, describing, analyzing, and interpreting data for informed decision-making.
    • Recognizing variation as an integral part of data.
    • Understanding the nature and pattern of variability.
    • Measuring the reliability of estimates of population parameters obtained from sample data, so that valid inferences can be drawn.
  • Applications of statistics are prevalent in everyday life, such as surveys, marketing studies, and polls.

Current Developments in Data Analysis

  • Technological advancements have enabled the collection of massive amounts of data.
  • Businesses face increasing pressure to provide high-quality products and services.
  • Analyzing large datasets efficiently to identify hidden patterns is crucial.
  • The processing and analysis of large data sets fall under the emerging fields of big data and data mining.
  • Data mining uses statistical techniques and algorithms to extract non-trivial and potentially useful patterns.
  • Business intelligence (BI) uses techniques and processes to aid in fact-based decision-making.

Preparing Data for Analysis

  • Data analysis involves descriptive statistics, data visualization, and exploratory data analysis (EDA).
  • Data preparation is crucial, including data cleaning, transformation, and warehousing.
  • Data quality is a critical requirement for drawing meaningful conclusions and making data-driven decisions.
  • Data analysis techniques vary depending on the objectives.
  • Data Mining is a data analysis technique for knowledge discovery and predictive purposes.

Prerequisites to Data Analytics: Data Preparation

  • Essential data preparation steps include:
    1. Data cleansing
    2. Scripting
    3. Data transformation
    4. Data warehousing
  • Data cleansing is the process of detecting and correcting or removing corrupt or inaccurate records.
    • It involves identifying incomplete, incorrect, inaccurate, or irrelevant data.
    • Data wrangling transforms data from one format to another.
    • Scripting is often used in data cleaning and transformation to automate tasks.
  • Data transformation is the process of converting data from one format or structure into another.
    • It is a fundamental aspect of data integration and management tasks.
  • Data Warehousing:
    • A data warehouse (DW or DWH) is a system for storing, reporting, and analyzing huge amounts of data.
    • DWs are central repositories for integrating current and historical data from various sources.
    • Data is used for creating analytical and visual reports.
    • Cleansing, transformation, and data quality are critical before performing analyses.
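The cleansing and transformation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, the field names (id, region, sales), and the rule of dropping rather than repairing bad rows are all assumptions made for the example.

```python
# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"id": "1", "region": " East ", "sales": "1200.50"},
    {"id": "2", "region": "west", "sales": ""},           # incomplete: missing sales
    {"id": "1", "region": " East ", "sales": "1200.50"},  # duplicate of the first row
    {"id": "3", "region": "WEST", "sales": "n/a"},        # incorrect: not a number
    {"id": "4", "region": "North", "sales": "830"},
]


def clean(records):
    """Drop duplicate, incomplete, and non-numeric rows; standardize the rest."""
    seen_ids = set()
    cleaned = []
    for row in records:
        if row["id"] in seen_ids:
            continue  # cleansing: skip a record whose id was already kept
        try:
            sales = float(row["sales"])  # transformation: text -> number
        except ValueError:
            continue  # cleansing: drop rows whose sales value is missing or invalid
        seen_ids.add(row["id"])
        cleaned.append({
            "id": int(row["id"]),
            "region": row["region"].strip().title(),  # standardize formatting
            "sales": sales,
        })
    return cleaned


cleaned_records = clean(raw_records)
for row in cleaned_records:
    print(row)
```

Whether a bad record should be dropped, corrected, or flagged for review depends on the analysis; dropping is used here only because it keeps the sketch short.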

Data and Data Quality

  • Data can be viewed as information or measurements.
  • The purpose of data analysis is to make sense of data.
  • Raw data is unprocessed data.
  • Data needs to be converted into a suitable form for reporting and analytics.
  • Data quality is crucial to the reliability and success of business analytics (BA) and BI programs.
  • Analytics involves analyzing data to drive business decisions, while BI is about reporting.
  • Data quality is affected by how data is collected, entered, stored, and managed.
  • Data quality assurance (DQA) verifies the reliability and effectiveness of data.
  • Aspects of data quality include:
    • Accuracy
    • Completeness
    • Update status
    • Relevance
    • Consistency across data sources
    • Reliability
    • Appropriate presentation
    • Accessibility
  • Maintaining data quality requires periodic data scrubbing, updating, standardizing, and de-duplicating records.
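Two of the aspects listed above, completeness and duplicate records, are easy to measure directly. The sketch below assumes a small set of hypothetical customer records in which None marks a missing value; the helper names completeness and duplicate_ids are invented for this example.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing value.
records = [
    {"customer_id": 101, "email": "a@example.com", "city": "Springfield"},
    {"customer_id": 102, "email": None, "city": "Riverton"},
    {"customer_id": 103, "email": "c@example.com", "city": None},
    {"customer_id": 101, "email": "a@example.com", "city": "Springfield"},
]


def completeness(rows, field):
    """Share of rows in which the given field is present (not None)."""
    filled = sum(1 for row in rows if row[field] is not None)
    return filled / len(rows)


def duplicate_ids(rows, key):
    """Key values that appear in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


email_completeness = completeness(records, "email")
city_completeness = completeness(records, "city")
repeated = duplicate_ids(records, "customer_id")

print(f"email completeness: {email_completeness:.0%}")
print(f"city completeness:  {city_completeness:.0%}")
print(f"duplicated ids:     {repeated}")
```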

Data Analysis: Advanced Applications

  • Advanced applications include data mining for knowledge discovery and predictive purposes.
  • Statistical applications involve descriptive statistics, data visualization, inferential statistics techniques, exploratory data analysis (EDA), and statistical modeling.

Statistics Defined

  • Statistics is about making decisions from data.
  • Statistics is the science of collection, tabulation, analysis, interpretation, and presentation of data.
  • Statistics is concerned with problems involving chance variations from numerous small, independent influences.
  • Statistics deals with making inferences or predictions about a population based on sample data.

Two Main Reasons for Studying Statistics

  1. Statistics deals with variation and is sometimes called the mathematics of variation.
    • Data collected often show variation, with variables differing among observations.
    • Statistical thinking and variation reduction are major goals in data analysis and quality improvement programs like Six Sigma.
  2. Statistical methods enable drawing conclusions from limited data.
    • Allows inferences about a population using sample data.
    • Example: Estimating the average height of women in a county without measuring all of them.
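The height example can be simulated to show how a sample stands in for a population. Everything numeric here is an assumption made for illustration: the population is generated artificially (normal heights with a mean of 163 cm and a standard deviation of 7 cm), and the sample size of 400 is arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Simulated population of 100,000 heights in cm (made-up parameters).
population = [random.gauss(163.0, 7.0) for _ in range(100_000)]

# Measure only a simple random sample of 400 women instead of everyone.
sample = random.sample(population, 400)

population_mean = statistics.mean(population)  # parameter, normally unknown
sample_mean = statistics.mean(sample)          # statistic, what can be computed

print(f"population mean: {population_mean:.2f} cm")
print(f"sample mean:     {sample_mean:.2f} cm")
print(f"difference:      {abs(sample_mean - population_mean):.2f} cm")
```

In practice the population mean is not available for comparison; it is computed here only because the population is simulated.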

Statistics and Statistical Methods

  • Statistical methods are used in collecting, presenting, and analyzing data.
  • Subsequent chapters cover data collection, analysis, visual representation, and tools for interpretation.
  • Statistics deals with variation, which must be kept within limits for processes to work efficiently.
  • Analyzing and reducing variation is a major goal of quality control programs like Six Sigma.
  • Topics include understanding statistics, data analysis, variation, and tools for data analysis in business.
  • Computer software use is emphasized.
  • Two broad categories of statistics:
    • Descriptive statistics: uses graphical and numerical methods to describe and analyze data.
    • Inferential statistics: draws conclusions about a population using sample data.
  • Population: the entire set of measurements theoretically possible.
    • Also known as the universe, it is the totality of items under consideration.
    • Example: Total light bulbs manufactured by a company.
  • Sample: A portion of the population selected for analysis.
  • A population is described by its parameters, while a sample is described by its statistics.
  • Parameter: A summary measure describing a characteristic of the population.
  • Statistic: A summary measure describing a characteristic of a sample.
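The parameter/statistic distinction shows up directly in how the summary measures are computed. The sketch below uses Python's standard statistics module on a hypothetical batch of eight light bulbs treated as the whole population; the lifetimes are invented, and taking the first four bulbs as the sample is for illustration only (a real sample would be drawn at random).

```python
import statistics

# Hypothetical population: lifetimes (hours) of all 8 bulbs in a tiny batch.
population = [980, 1010, 1005, 995, 1020, 990, 1000, 1000]
sample = population[:4]  # illustration only; a real sample would be random

# Parameters: summary measures of the population.
mu = statistics.mean(population)
sigma = statistics.pstdev(population)  # population formula, divides by N

# Statistics: summary measures of the sample.
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)           # sample formula, divides by n - 1
sample_median = statistics.median(sample)

print(f"population: mean={mu}, std dev={sigma:.2f}")
print(f"sample:     mean={x_bar}, std dev={s:.2f}, median={sample_median}")
```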

Population Parameters and Sample Statistics

  • Population mean: $\mu$
  • Population variance: $\sigma^2$
  • Population standard deviation: $\sigma$
  • Population proportion: $p$
  • Sample mean: $\bar{x}$
  • Sample variance: $s^2$
  • Sample standard deviation: $s$
  • Sample median:
  • Sample proportion: $\bar{p}$
  • Statistical inference generalizes from a sample to the population and attaches a probability to the validity of that generalization.
  • Decisions based on sample results raise questions about correctness, hence the use of probability theory.
  • Probability models estimate population parameters, with the choice of distribution based on experience and statistical theory.
  • Statistical hypothesis tests are used to check whether the assumed probability distribution is consistent with the data.
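For reference, the usual textbook definitions of the mean, variance, and standard deviation are given below. They are not stated in these notes; the symbols $N$ (population size), $n$ (sample size), and $x_i$ (the $i$-th observation) are assumed here.

```latex
\begin{align*}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i, &
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2, &
\sigma &= \sqrt{\sigma^2}, \\
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, &
s &= \sqrt{s^2}.
\end{align*}
```

The sample variance divides by $n-1$ rather than $n$, which makes $s^2$ an unbiased estimator of $\sigma^2$.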

Data and Classification of Data

  • Data are related observations collected to draw conclusions or make decisions.
  • A single observation is a data point, and a collection is a data set.
  • Data can be qualitative (categorical) or quantitative (numerical).
    • Quantitative data: Numerical data, e.g., temperature, sales, length.
    • Qualitative data: Categorical data, e.g., color of car, yes/no responses.
  • Data can also be classified by how the observations relate to time:
    • Time series data are recorded over time, e.g., weekly sales, monthly demand.
    • Cross-sectional data are observed at the same point in time, e.g., stock market closing values on a specific date.
  • Statistical techniques are more suited to quantitative data.
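The type of data determines which summaries make sense. As a rough sketch with invented values: qualitative data are summarized by counting categories, while quantitative time series data support arithmetic such as averages and period-to-period changes.

```python
from collections import Counter
import statistics

# Hypothetical observations.
car_colors = ["red", "blue", "red", "white", "blue", "red"]  # qualitative
weekly_sales = [12.5, 14.0, 13.2, 15.8, 14.5]                # quantitative, time series

# Qualitative data: count how often each category occurs.
color_counts = Counter(car_colors)
most_common_color, _ = color_counts.most_common(1)[0]

# Quantitative data: arithmetic is meaningful.
mean_sales = statistics.mean(weekly_sales)

# Time series data: the order of observations matters, so changes can be tracked.
week_over_week = [round(b - a, 1) for a, b in zip(weekly_sales, weekly_sales[1:])]

print(color_counts)
print(most_common_color, mean_sales, week_over_week)
```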

Data Elements and Variables

  • Data elements are the specific items data is collected about.
  • A variable is a characteristic of interest recorded for each element (person, entity, thing, event); its value differs from one element to another.
  • Stock price is a variable because prices vary. Statistics helps study this variation.
  • A data set may contain one or more variables of interest.

Another Classification of Data

  • Data can also be classified as:
    • Discrete: result of a counting process, expressed as whole numbers (integers).
      • e.g., number of cars sold, number of houses sold, number of defective parts.
    • Continuous: can take any value within a given range.
      • Measured on a continuum or scale that can be divided infinitely.
      • e.g., measurements of length, height, diameter, temperature, stock value, sales.
  • Continuous data are preferred due to availability of more powerful statistical tools.

Data Types and Data Collection

  • Data are collected on variables of interest, which can be qualitative, quantitative, discrete, or continuous.

Describing Data Using the Levels of Measurement

  • All collected data are measured in some form, even discrete quantitative data.

Types of Measurement Scales

  • Four levels of measurements:
    1. Nominal Scale
    2. Ordinal Scale
    3. Interval Scale
    4. Ratio Scale
  • Nominal is the weakest, and ratio is the strongest.
  • Nominal and Ordinal Scales:
    • Data from qualitative variables are measured on these scales.
    • Nominal Scale: Data classified into distinct categories with no implied order.
      • Examples: Marital status (married, single), stock ownership (yes, no).
    • Ordinal Scale: Data classified into distinct categories with implied order.
      • Examples: Student grades (A, B, C, D, F), product quality (excellent, good, poor).
  • Interval and Ratio Scales:
    • Data from quantitative variables are measured on these scales.
    • Interval Scale: Ordered scale with meaningful and equal differences between measurements, but no true zero point, so ratios are not meaningful.
      • Examples: Temperature in degrees Celsius or Fahrenheit, calendar dates.
    • Ratio Scale: Meaningful differences and a true zero point, allowing sensible ratio measurements.
      • Examples: Height, weight, age, salary.
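The strength of a scale determines which operations are meaningful. The sketch below walks through the four scales with invented data; the numeric ranks assigned to the grades are an assumption needed only to express their order.

```python
import statistics

# Hypothetical data, one list per measurement scale.
marital_status = ["married", "single", "married", "single", "married"]  # nominal
grades = ["B", "A", "C", "B", "F", "B", "A"]                            # ordinal
temps_c = [10.0, 20.0]                                                  # interval
salaries = [40_000, 80_000]                                             # ratio

# Nominal: only counting is meaningful, so the mode is the natural summary.
nominal_mode = statistics.mode(marital_status)

# Ordinal: order is meaningful, so the middle category can be found by rank.
grade_rank = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4}
ranked = sorted(grades, key=grade_rank.get)
ordinal_median = ranked[len(ranked) // 2]  # middle item of an odd-length list

# Interval: differences are meaningful, but ratios change with the unit
# because zero is arbitrary.
temps_f = [c * 9 / 5 + 32 for c in temps_c]
ratio_c = temps_c[1] / temps_c[0]
ratio_f = temps_f[1] / temps_f[0]

# Ratio: a true zero makes "twice as much" meaningful in any unit.
salary_ratio = salaries[1] / salaries[0]

print(nominal_mode, ordinal_median)
print(f"temperature ratio in C: {ratio_c}, in F: {ratio_f:.2f}")
print(f"salary ratio: {salary_ratio}")
```

The two temperature ratios disagree even though they describe the same pair of readings, which is why "twice as hot" is not a valid statement on an interval scale.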

Data Collection, Presentation, and Analysis

  • This section describes how data are collected, presented, and analyzed.
  • Effective decision-making requires appropriate data.
  • Insufficient, flawed, or ambiguous data will not yield meaningful results.

How Data Are Collected: Sources of Data for Research and Analysis

  • Data can be obtained from industrial, individual, or government sources.
  • Major sources include:
    • Internet: Websites with data on employment, CPI, population, housing, manufacturing.
    • Government agencies: Data on travel, health care, economic measures, unemployment, interest rates.
    • Experimental design: Changing input variables to observe effects on output variables.
    • Telephone/mail surveys: Inexpensive but may have low response rates.
    • Processes: Manufacturing and service systems.
  • Good survey design is important: questions should be concise, unambiguous, and preferably closed-ended.

Analyzing Data Using Different Tools

  • Raw data must be processed and analyzed before they yield useful information.
  • Software is available for handling anything from small to massive amounts of data.

Data Related Terms Applied to Analytics

  • Big Data:
    • Collections of large, complex data sets that are difficult to process using conventional tools.
    • Volume, velocity, and variety are key characteristics.
    • Big data marks the frontier of a firm’s ability to store, process, and access data for effective operation and decision-making.
  • Data mining:
    • Finding meaningful patterns and insights in large sets of data using pattern recognition techniques.
    • Uses statistics, statistical modeling, machine learning algorithms, and artificial intelligence.
  • Data Warehouse (DW or DWH):
    • System for storing, reporting, and analysis of huge amounts of data.
    • Central repositories for integrating current and historical data from various sources.
    • Used for creating analytical and visual reports.
  • Structured versus Unstructured Data:
    • Structured data can be stored in relational databases and related via tables.
    • Unstructured data cannot be directly put in databases, e.g., e-mails, social media posts.
  • Data Quality:
    • Affected by how data is collected, entered, stored, and managed.
    • Efficient storage, cleansing, and transformation are critical.
    • Aspects include accuracy, completeness, update status, relevance, consistency, reliability, appropriate presentation, and accessibility.
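The difference between structured and unstructured data can be seen in how each is queried. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the orders table, its columns, and the e-mail text are all hypothetical.

```python
import sqlite3

# Structured data: rows with named, typed columns in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
    [(1, "Acme", 250.0), (2, "Birch", 120.0), (3, "Acme", 75.5)],
)

# Because the structure is known, a question becomes a one-line query.
totals = dict(
    conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
    ).fetchall()
)
conn.close()

# Unstructured data: free text has no columns, so even a simple question
# requires text processing rather than a query.
email_body = "Hi team, the Acme order arrived late again. Please follow up."
mentions_acme = "acme" in email_body.lower()

print(totals)
print(mentions_acme)
```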

Summary

  • Covered basic concepts of data, types of data, statistics, and statistical methods.
  • Explained descriptive and inferential statistics and data measurement scales.
  • Discussed data collection steps and sources.
  • Outlined data-related terms applied to analytics.
  • Understanding data is critical for analytics and using different types of models.