Chapter 3 Notes: Data, Data Types, and Descriptive Statistics
Making Sense from Data: Descriptive Statistics
- Statistics is the science and art of making decisions using data.
- It is about analyzing data and drawing meaningful conclusions.
- Data and statistical tools are essential for:
- Collecting, describing, analyzing, and interpreting data for informed decision-making.
- Recognizing variation as an integral part of data.
- Understanding the nature and pattern of variability.
- Measuring the reliability of population parameter estimates obtained from sample data, so that valid inferences can be drawn.
- Applications of statistics are prevalent in everyday life, such as surveys, marketing studies, and polls.
Current Developments in Data Analysis
- Technological advancements have enabled the collection of massive amounts of data.
- Businesses face increasing pressure to provide high-quality products and services.
- Analyzing large datasets efficiently to identify hidden patterns is crucial.
- The processing and analysis of large data sets fall under the emerging field of big data and data mining.
- Data mining uses statistical techniques and algorithms to extract non-trivial and potentially useful patterns.
- Business intelligence (BI) uses techniques and processes to aid in fact-based decision-making.
Preparing Data for Analysis
- Data analysis involves descriptive statistics, data visualization, and exploratory data analysis (EDA).
- Data preparation is crucial, including data cleaning, transformation, and warehousing.
- Data quality is a critical requirement for drawing meaningful conclusions and making data-driven decisions.
- Data analysis techniques vary depending on the objectives.
- Data mining is a data analysis technique used for knowledge discovery and prediction.
Prerequisites to Data Analytics: Data Preparation
- Essential data preparation steps include:
- Data cleansing
- Scripting
- Data transformation
- Data warehousing
- Data cleansing is the process of detecting and correcting or removing corrupt or inaccurate records.
- It involves identifying incomplete, incorrect, inaccurate, or irrelevant data.
- Data wrangling transforms data from one format to another.
- Scripting is often used in data cleaning and transformation to automate tasks.
- Data transformation is the process of converting data from one format or structure into another.
- It is a fundamental aspect of data integration and management tasks.
- Data Warehousing:
- A data warehouse (DW or DWH) is a system for storing, reporting, and analyzing huge amounts of data.
- DWs are central repositories for integrating current and historical data from various sources.
- Data is used for creating analytical and visual reports.
- Cleansing, transformation, and data quality are critical before performing analyses.
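The cleansing and transformation steps above can be illustrated with a minimal Python sketch. The records, field names, and validity rules (for example, treating a negative amount as invalid) are hypothetical and chosen only for illustration; a real pipeline would define its own rules.

```python
from datetime import datetime

# Hypothetical raw records; field names and values are made up for illustration.
raw_records = [
    {"customer": " Alice ", "order_date": "2023-01-15", "amount": "120.50"},
    {"customer": "Bob", "order_date": "2023-01-16", "amount": ""},      # incomplete
    {"customer": "Carol", "order_date": "16/01/2023", "amount": "80"},  # unexpected date format
    {"customer": "Dave", "order_date": "2023-01-17", "amount": "-15"},  # implausible value
]


def clean_record(record):
    """Return a cleaned, typed copy of the record, or None if it should be removed."""
    name = record["customer"].strip()          # cleansing: trim stray whitespace
    if not name or not record["amount"]:
        return None                            # incomplete record
    try:
        # Transformation: convert text fields into typed values.
        order_date = datetime.strptime(record["order_date"], "%Y-%m-%d").date()
        amount = float(record["amount"])
    except ValueError:
        return None                            # incorrect format
    if amount < 0:
        return None                            # assumed rule: amounts cannot be negative
    return {"customer": name, "order_date": order_date, "amount": amount}


cleaned = [r for r in (clean_record(rec) for rec in raw_records) if r is not None]
print(cleaned)
```

Here invalid records are simply removed. Correcting them instead (for example, reparsing the alternative date format) is an equally valid choice and depends on the data quality rules in force.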
Data and Data Quality
- Data can be viewed as information or measurements.
- The purpose of data analysis is to make sense of data.
- Raw data is unprocessed data.
- Data needs to be converted into a suitable form for reporting and analytics.
- Data quality is crucial to the reliability and success of business analytics (BA) and BI programs.
- Analytics involves analyzing data to drive business decisions, while BI is about reporting.
- Data quality is affected by how data is collected, entered, stored, and managed.
- Data quality assurance (DQA) verifies the reliability and effectiveness of data.
- Aspects of data quality include:
- Accuracy
- Completeness
- Update status
- Relevance
- Consistency across data sources
- Reliability
- Appropriate presentation
- Accessibility
- Maintaining data quality requires periodic data scrubbing, updating, standardizing, and de-duplicating records.
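Two of the aspects listed above, completeness and de-duplication after standardizing, can be sketched in a few lines of Python. The customer records and the choice of e-mail address as the matching key are assumptions made for this example only.

```python
# Hypothetical customer records; values are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "city": "Denver"},
    {"id": 2, "email": None, "city": "Austin"},
    {"id": 3, "email": "A@Example.com ", "city": "Denver"},  # same person as id 1
    {"id": 4, "email": "d@example.com", "city": None},
]


def completeness(rows, field):
    """Fraction of rows in which the field is filled in (not None or empty)."""
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def standardize_email(email):
    """Standardize an e-mail address: trim whitespace and lower-case it."""
    return email.strip().lower() if email else None


def deduplicate(rows, key):
    """Keep the first row for each standardized key; rows with no key are kept."""
    seen = set()
    unique = []
    for row in rows:
        k = standardize_email(row.get(key))
        if k is not None and k in seen:
            continue  # duplicate of an earlier record
        if k is not None:
            seen.add(k)
        unique.append(row)
    return unique


email_completeness = completeness(records, "email")
unique_records = deduplicate(records, "email")
print(email_completeness, [row["id"] for row in unique_records])
```

Standardizing before comparing matters: without it, records 1 and 3 would not be recognized as duplicates.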
Data Analysis: Advanced Applications
- Advanced applications include data mining for knowledge discovery and predictive purposes.
- Statistical applications involve descriptive statistics, data visualization, inferential statistics techniques, exploratory data analysis (EDA), and statistical modeling.
Statistics Defined
- Statistics is about making decisions from data.
- Statistics is the science of collection, tabulation, analysis, interpretation, and presentation of data.
- Statistics is concerned with problems involving chance variations from numerous small, independent influences.
- Statistics deals with making inferences or predictions about a population based on sample data.
Two Main Reasons for Studying Statistics
- Statistics deals with variation and is sometimes called the mathematics of variation.
- Data collected often show variation, with variables differing among observations.
- Statistical thinking and variation reduction are major goals in data analysis and quality improvement programs like Six Sigma.
- Statistical methods enable drawing conclusions from limited data.
- Allows inferences about a population using sample data.
- Example: Estimating the average height of women in a county without measuring all of them.
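The height example can be simulated as a rough sketch. The population below is synthetic (generated from an assumed normal distribution with a mean of 162 cm and a standard deviation of 7 cm), not real survey data, and the sample size of 200 is arbitrary.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic "population" of heights in cm; the distribution is an assumption.
population = [random.gauss(162.0, 7.0) for _ in range(50_000)]

# Measure only a small random sample instead of the whole population.
sample = random.sample(population, 200)

population_mean = statistics.fmean(population)  # normally unknown in practice
sample_mean = statistics.fmean(sample)          # estimate from the sample

# Standard error of the mean: a measure of how reliable the estimate is.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"standard error:  {standard_error:.2f}")
```

The sample mean typically lands within a few standard errors of the population mean, which is what makes inference from limited data workable.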
Statistics and Statistical Methods
- Used in collecting, presenting, and analyzing data.
- Subsequent chapters cover data collection, analysis, visual representation, and tools for interpretation.
- Statistics deals with variation, which must be kept within limits for processes to work efficiently.
- Analyzing and reducing variation is a major goal of quality control programs like Six Sigma.
- Topics include understanding statistics, data analysis, variation, and tools for data analysis in business.
- Computer software use is emphasized.
- Two broad categories of statistics:
- Descriptive statistics: uses graphical and numerical methods to describe and analyze data.
- Inferential statistics: draws conclusions about a population using sample data.
- Population: the entire set of measurements theoretically possible.
- Also known as the universe, it is the totality of items under consideration.
- Example: Total light bulbs manufactured by a company.
- Sample: A portion of the population selected for analysis.
- A population is described by its parameter, while a sample is described by its statistic.
- Parameter: A summary measure describing a characteristic of the population.
- Statistic: A summary measure describing a characteristic of a sample.
Population Parameters and Sample Statistics
- Population mean: μ
- Population variance: σ²
- Population standard deviation: σ
- Population proportion: p
- Sample mean: x̄
- Sample variance: s²
- Sample standard deviation: s
- Sample median:
- Sample proportion: p̄
- Statistical inference generalizes from a sample to the population and attaches a probability to the validity of that generalization.
- Decisions based on sample results raise questions about their correctness, which is why probability theory is used.
- Probability models are used to estimate population parameters; the choice of distribution is based on experience and statistical theory.
- Statistical hypothesis tests check whether an assumed probability distribution is consistent with the data.
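The distinction between parameters and statistics can be made concrete with Python's standard `statistics` module, which provides separate functions for the population and sample versions of variance and standard deviation. The bulb lifetimes below are invented for illustration, and the sample is taken as the first four values only to keep the numbers reproducible; a real sample would be drawn at random.

```python
import statistics

# Hypothetical population: lifetimes (hours) of all 10 bulbs in a small batch.
population = [980, 1010, 995, 1020, 1005, 990, 1015, 1000, 985, 1000]
sample = population[:4]  # illustration only; sample at random in practice

# Population parameters (divide by N).
mu = statistics.mean(population)           # population mean
sigma2 = statistics.pvariance(population)  # population variance
sigma = statistics.pstdev(population)      # population standard deviation

# Sample statistics (variance divides by n - 1).
x_bar = statistics.mean(sample)            # sample mean
s2 = statistics.variance(sample)           # sample variance
s = statistics.stdev(sample)               # sample standard deviation
sample_median = statistics.median(sample)  # sample median

# Proportion of bulbs lasting at least 1000 hours (threshold is an assumption).
p = sum(x >= 1000 for x in population) / len(population)  # population proportion
p_bar = sum(x >= 1000 for x in sample) / len(sample)      # sample proportion

print(mu, sigma2, sigma)
print(x_bar, s2, s, sample_median)
print(p, p_bar)
```

The sample values differ from the population values; quantifying how far off they are likely to be is the job of statistical inference.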
Data and Classification of Data
- Data are related observations collected to draw conclusions or make decisions.
- A single observation is a data point, and a collection is a data set.
- Data can be qualitative (categorical) or quantitative (numerical).
- Quantitative data: Numerical data, e.g., temperature, sales, length.
- Qualitative data: Categorical data, e.g., color of car, yes/no responses.
- Data can also be classified by time: time series data are recorded over time, e.g., weekly sales, monthly demand.
- Cross-sectional data are observed at the same point in time, e.g., stock market closing values on a specific date.
- Statistical techniques are more suited to quantitative data.
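The difference between time series and cross-sectional data can be seen by slicing one data set two ways. The store names, dates, and sales figures in this sketch are hypothetical.

```python
# Hypothetical observations: (week-ending date, store, weekly sales in units).
observations = [
    ("2024-01-07", "North", 120),
    ("2024-01-07", "South", 95),
    ("2024-01-14", "North", 130),
    ("2024-01-14", "South", 90),
    ("2024-01-21", "North", 125),
    ("2024-01-21", "South", 100),
]


def time_series(rows, store):
    """One store followed over time, ordered by date."""
    return sorted((date, sales) for date, s, sales in rows if s == store)


def cross_section(rows, date):
    """All stores observed at the same point in time."""
    return {s: sales for d, s, sales in rows if d == date}


north_series = time_series(observations, "North")
snapshot = cross_section(observations, "2024-01-14")

print(north_series)
print(snapshot)
```

The same raw observations yield a time series when one variable is tracked across dates and a cross-section when many units are compared on one date.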
Data Elements and Variables
- Data elements are the specific items (persons, entities, things, or events) about which data are collected.
- A variable is a characteristic of interest measured on each element; its value varies from one element to another.
- Stock price is a variable because prices vary. Statistics help study this variation.
- A data set may contain one or more variables of interest.
Another Classification of Data
- Data can also be classified as:
- Discrete: result of a counting process, expressed as whole numbers (integers).
- e.g., cars sold, number of houses sold, number of defective parts.
- Continuous: can take any value within a given range.
- Measured on a continuum or scale that can be divided infinitely.
- e.g., measurements of length, height, diameter, temperature, stock value, sales.
- Continuous data are often preferred because more powerful statistical tools are available for them.
Data Types and Data Collection
- Data are collected on variables of interest, which can be qualitative, quantitative, discrete, or continuous.
Describing Data Using the Levels of Measurement
- All collected data are measured in some form, even discrete quantitative data.
Types of Measurement Scales
- Four levels of measurement:
- Nominal Scale
- Ordinal Scale
- Interval Scale
- Ratio Scale
- Nominal is the weakest, and ratio is the strongest.
- Nominal and Ordinal Scales:
- Data from qualitative variables are measured on these scales.
- Nominal Scale: Data classified into distinct categories with no implied order.
- Examples: Marital status (married, single), stock ownership (yes, no).
- Ordinal Scale: Data classified into distinct categories with implied order.
- Examples: Student grades (A, B, C, D, F), product quality (excellent, good, poor).
- Interval and Ratio Scales:
- Data from quantitative variables are measured on these scales.
- Interval Scale: Ordered scale with meaningful and equal differences between measurements, but no true zero point.
- Examples: Temperature in degrees Celsius or Fahrenheit, clock or calendar time.
- Ratio Scale: Meaningful differences and a true zero point, allowing sensible ratio measurements.
- Examples: Height, weight, age, salary.
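One practical consequence of the four scales is that each stronger scale permits more kinds of summary. The mapping below (mode for nominal, median from ordinal upward, mean from interval upward, ratios only for ratio data) is a common textbook convention, sketched here with hypothetical names and data rather than taken from these notes.

```python
import statistics
from enum import IntEnum


class Scale(IntEnum):
    """Measurement scales ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def permitted_summaries(scale):
    """Summaries conventionally considered meaningful at a given scale."""
    summaries = ["mode"]
    if scale >= Scale.ORDINAL:
        summaries.append("median")
    if scale >= Scale.INTERVAL:
        summaries.append("mean")
    if scale >= Scale.RATIO:
        summaries.append("ratio")
    return summaries


# Nominal data: only the most frequent category is meaningful.
marital_status = ["married", "single", "married"]
most_common_status = statistics.mode(marital_status)

# Ordinal data: map grades to ranks, then take the (low) median rank.
grade_order = ["F", "D", "C", "B", "A"]          # worst to best
grades = ["B", "A", "C", "B", "D"]
ranks = [grade_order.index(g) for g in grades]
median_grade = grade_order[statistics.median_low(ranks)]

print(permitted_summaries(Scale.ORDINAL))
print(most_common_status, median_grade)
```

`median_low` is used so the result is always one of the observed ranks, which avoids averaging two grades when the number of observations is even.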
Data Collection, Presentation, and Analysis
- This section describes how data are collected, presented, and analyzed.
- Effective decision-making requires appropriate data.
- Insufficient, flawed, or ambiguous data will not yield meaningful results.
How Data Are Collected: Sources of Data for Research and Analysis
- Data can be obtained from industrial, individual, or government sources.
- Major sources include:
- Internet: Websites with data on employment, CPI, population, housing, manufacturing.
- Government agencies: Data on travel, health care, economic measures, unemployment, interest rates.
- Experimental design: Changing input variables to observe effects on output variables.
- Telephone/mail surveys: Inexpensive but may have low response rates.
- Processes: Manufacturing and service systems.
- Survey design is important: questions should be concise, unambiguous, and closed-ended.
- Raw data must be processed and analyzed to make sense.
- Software is available for handling data sets ranging from small to massive.
- Big Data:
- Collections of large, complex data sets that are difficult to process using conventional tools.
- Volume, velocity, and variety are key characteristics.
- Big data represents the frontier of a firm’s ability to store, process, and access data for effective operation and decision-making.
- Data mining:
- Finding meaningful patterns and insights in large sets of data using pattern recognition techniques.
- Uses statistics, statistical modeling, machine learning algorithms, and artificial intelligence.
- Data Warehouse (DW or DWH):
- System for storing, reporting, and analysis of huge amounts of data.
- Central repositories for integrating current and historical data from various sources.
- Used for creating analytical and visual reports.
- Structured versus Unstructured Data:
- Structured data can be stored in relational databases and related via tables.
- Unstructured data cannot be directly put in databases, e.g., e-mails, social media posts.
- Data Quality:
- Affected by how data is collected, entered, stored, and managed.
- Efficient storage, cleansing, and transformation are critical.
- Aspects include accuracy, completeness, update status, relevance, consistency, reliability, appropriate presentation, and accessibility.
Summary
- Covered basic concepts of data, types of data, statistics, and statistical methods.
- Explained descriptive and inferential statistics and data measurement scales.
- Discussed data collection steps and sources.
- Outlined data-related terms applied to analytics.
- Understanding data is critical for analytics and using different types of models.