Statistical Description of Data Study Guide
THE ROLE AND SCOPE OF STATISTICS
- Universal Application: Modern development in fields such as Management, Commerce, Economics, Social Sciences, and Mathematics is heavily dependent on statistics. Its use also extends to public services, defense, banking, insurance, tourism, hospitality, police, and military sectors.
- Strategic Functions: Statistics enriches specific domains by:
- Collecting data in a specific field.
- Analyzing data by applying various statistical techniques.
- Making statistical inferences about the domain.
- User Profiles:
- Governments: Use statistics for effective and pragmatic economic planning.
- Businessmen: Plan and expand business operations based on feedback data analysis.
- Political Parties: Use performance statistics to impress the general public.
- Research Scholars: Present research papers in an authoritative manner using statistical evidence.
HISTORICAL ORIGINS AND ETYMOLOGY
- Etymological Roots: There are four primary schools of thought regarding the origin of the word "statistics":
- Latin: Derived from the word "status".
- Italian: Derived from the word "statista".
- German: Derived from the word "statistik".
- French: Derived from the word "statistique".
- Historical Context: Historically, statistics was analogous to the "state," representing data collected for the welfare of the people.
- Ancient Records:
- Kautilya: In his book "Arthashastra", recorded births, deaths, and other precious data during the reign of Chandragupta in the 4th century B.C.
- Egypt: The first census was conducted by the Pharaoh between 3000 B.C. and 2000 B.C.
- Akbar's Reign: Statistical records on agriculture are found in "Ain-i-Akbari", written by Abul Fazl in the 16th century A.D.
DEFINING STATISTICS: SINGULAR AND PLURAL
- Plural Sense: Statistics refers to data, both qualitative and quantitative, collected with the intent of statistical analysis.
- Singular Sense: Statistics is defined as the scientific method employed for collecting, analyzing, and presenting data, leading to statistical inferences. It is often described as the "science of counting" or the "science of averages".
APPLICATIONS OF STATISTICAL METHODS
- Economics: Modern Economic developments are rooted in statistics. Overlapping areas include:
- Time Series Analysis.
- Index Numbers.
- Demand Analysis.
- Econometrics: A specialized branch where Economics interacts positively with statistics.
- Regression Analysis: Used for future projections of demand, sales, prices, and quantities for economic planning.
- Business Management: Decisions have moved away from intuition or trial-and-error to quantitative techniques.
- Inference: Drawing conclusions about a universe from a sample.
- Statistical Decision Theory: Analyzing complicated business strategies by evaluating the merits and demerits of various alternatives.
- Commerce and Industry: Statistical procedures are used to maximize profits in competitive environments. Data analyzed includes previous sales, raw materials, wages, salaries, and competitor products. Methods used include:
- Measures of central tendency and dispersion.
- Correlation and regression analysis.
- Time series analysis and index numbers.
- Sampling and statistical quality control.
LIMITATIONS OF THE STATISTICAL APPROACH
- Aggregates: Statistics deals with groups or aggregates; an individual unit has no significance unless it is part of the whole.
- Quantitative Focus: Statistics primarily concerns quantitative data. Qualitative data must be converted into numerical descriptions for analysis.
- Conditionality: Projections (sales, production, price) are only valid under specific sets of conditions. If conditions change, inaccuracies occur.
- Sampling Errors: The theory of statistical inference relies on random sampling. If sampling rules are not followed, conclusions drawn from unrepresentative samples will be erroneous.
DATA TYPES AND VARIABLES
- Definition of Data: Quantitative information about specific characteristics under consideration.
- Variable: A measurable quantitative characteristic. There are two types:
- Discrete Variable: Assumes a finite or countably infinite number of isolated values. Examples: number of petals in a flower, misprints in a book, or road accidents.
- Continuous Variable: Can assume any value from a given interval. Examples: height, weight, sale, and profit.
- Attribute: A qualitative characteristic. Examples: gender of a baby, nationality, or the color of a flower.
- Primary Data: Data collected for the first time by an investigator or agency. Example: Prof. Das collecting student heights directly.
- Secondary Data: Data already collected by one person/agency and subsequently used by another. Example: Professor Bhargava using the height data collected by Prof. Das.
COLLECTION OF PRIMARY DATA
- Interview Methods:
- Personal Interview: The investigator meets respondents directly. It is the most accurate interview method and is useful after natural calamities (cyclones, earthquakes, epidemics like plague), although it is slow and covers only small areas.
- Indirect Interview: Used when respondents cannot be reached directly, such as in rail accidents. Information is gathered from associated persons.
- Telephone Interview: Quick and inexpensive with wide coverage, though less consistent. It suffers from the maximum number of non-responses.
- Mailed Questionnaire: Framing well-drafted, sequenced questions sent with pre-paid stamps. Covers wide areas but has a very high rate of non-response.
- Observation Method: Collecting data via direct observation or instruments (e.g., measuring height/weight). It is the most accurate but time-consuming and covers small areas.
- Questionnaires via Enumerators: Enumerators explain questions and collect information directly for larger inquiries.
SOURCES AND SCRUTINY OF SECONDARY DATA
- Secondary Sources:
- International: WHO, ILO, IMF, World Bank.
- Government: CSO (Statistical Abstract), Ministry of Food and Agriculture (Indian Agricultural Statistics).
- Private/Quasi-Gov: ISI, ICAR, NCERT.
- Unpublished: Research institute records.
- Scrutiny of Data: Checking for accuracy and consistency is essential. Errors can arise from enumerator bias or transcription mistakes.
- Internal Consistency Check: If multiple related series (population, area, density) are provided, the relation can be verified:
$$\text{Density} = \frac{\text{Population}}{\text{Area}}$$
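The internal consistency check can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `density_is_consistent`, the 1% tolerance, and the sample records are all assumptions made for the example.

```python
def density_is_consistent(population, area, reported_density, rel_tol=0.01):
    """Return True if reported_density agrees with population / area.

    rel_tol is an assumed tolerance (1%) to allow for rounding in the source.
    """
    if area <= 0:
        raise ValueError("area must be positive")
    computed = population / area
    return abs(computed - reported_density) <= rel_tol * computed


# Hypothetical records: (region, population, area in sq. km, reported density)
records = [
    ("Region A", 1_200_000, 400.0, 3000.0),
    ("Region B", 500_000, 250.0, 2500.0),  # inconsistent: 500000 / 250 = 2000
]

# Collect the regions whose three series do not agree with each other
flagged = [name for name, pop, area, dens in records
           if not density_is_consistent(pop, area, dens)]
print(flagged)
```

A flagged record only signals that at least one of the three figures is wrong; it does not say which one.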
CLASSIFICATION AND ORGANIZATION OF DATA
- Objectives: Organize data into groups based on similarities to eliminate unnecessary detail and facilitate comparison.
- Classification Types:
- Chronological / Temporal / Time Series: Data classified by time (e.g., CA final students over 20 years).
- Geographical / Spatial: Data arranged by region or state.
- Qualitative / Ordinal: Data classified by attributes (e.g., nationality, smoking habit).
- Quantitative / Cardinal: Data classified by variables (e.g., height, weight, profit).
- Frequency vs. Non-Frequency: Qualitative and Quantitative data are frequency data. Time Series and Geographical data are non-frequency data.
MODES OF DATA PRESENTATION
- Textual Presentation: Presenting data via paragraphs. Used in official reports (e.g., Roy Enamel Factory example). It is simple but dull, monotonous, and hinders comparison.
- Tabular Presentation (Tabulation): Systematic presentation in rows and columns.
- Merits: Facilitates comparison, handles complicated data, and is essential for diagrammatic and statistical analysis.
- Parts of a Table:
- Title: Self-explanatory with a serial number.
- Caption: Upper part describing columns.
- Box-head: Entire upper part including caption, column numbers, and units.
- Stub: Left part describing rows.
- Body: Main part containing numerical figures.
- Footnotes/Source: Bottom part for clarity and origins.
DIAGRAMMATIC REPRESENTATION TECHNIQUES
- General Features: Attractive, suitable for all sections of society, and helps identify hidden trends. However, it is less accurate than tabulation.
- Line Diagram (Historiagram): Used for time series data. Plots the points $(t, y_t)$ joined by line segments.
- Logarithmic/Ratio Chart: $\log y_t$ is plotted against $t$ when there are wide fluctuations.
- Multiple Line Chart: Represents two or more related series in the same unit.
- Multiple Axis Chart: Used for variables in different units.
- Bar Diagram: Rectangles of equal width.
- Horizontal Bar: Used for qualitative or spatial data.
- Vertical Bar: Used for quantitative or time series data.
- Multiple/Grouped Bar: Used to compare related series.
- Component/Sub-divided Bar: Used for data divided into components.
- Pie Chart (Circle Diagram): Used for comparing components of a variable relative to the whole. Central angle calculation:
$$\text{Central Angle} = \frac{\text{Component Value}}{\text{Total Value}} \times 360^\circ$$
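As a quick illustration of the central angle formula, the sketch below converts component values into angles. The function name `central_angles` and the expense figures are hypothetical and chosen only for the example.

```python
def central_angles(components):
    """Map each component to its central angle in degrees."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total value must be positive")
    # Central angle = component value / total value * 360
    return {name: value * 360 / total for name, value in components.items()}


# Hypothetical monthly expenses (any common unit)
expenses = {"Food": 40, "Rent": 30, "Transport": 20, "Other": 10}
angles = central_angles(expenses)

for name, angle in angles.items():
    print(f"{name}: {angle:.1f} degrees")
```

The angles always add up to 360 degrees, which is a convenient check on the arithmetic.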
FREQUENCY DISTRIBUTIONS
- Definition: A tabular representation of statistical data, usually in ascending order, relating to a measurable characteristic.
- Class frequency: The number of observations falling in a particular class.
- Types:
- Discrete / Ungrouped: Tabulation against single values (e.g., car accidents in a year).
- Grouped: Tabulation against a group of values (e.g., heights of students).
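A discrete (ungrouped) frequency distribution can be sketched with Python's standard `collections.Counter`. The accident counts below are made-up values for illustration only.

```python
from collections import Counter

# Hypothetical data: number of accidents recorded on each of 12 days
accidents = [0, 2, 1, 0, 3, 1, 1, 0, 2, 1, 0, 1]

# Count how many times each single value occurs
frequency = Counter(accidents)

# List the values in ascending order with their frequencies
for value in sorted(frequency):
    print(value, frequency[value])
```

Each distinct value gets its own row, which is practical only when the variable takes a small number of isolated values.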
CONSTRUCTION OF FREQUENCY DISTRIBUTIONS
- Range: Find the difference between largest and smallest observations.
- Class Intervals: Determine class count using the relation:
$$\text{No. of Class Intervals} \times \text{Class Length} \approx \text{Range}$$
- Tally Marks: Stroke marks against the occurrence of values in the interval.
- Frequency: Counting tally marks for the frequency column.
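The construction steps above can be sketched as follows. This is one possible approach under stated assumptions: classes are of the exclusive type (lower limit included, upper limit excluded), the first class starts at a multiple of the class length, and the function name `grouped_frequency` and the height data are hypothetical.

```python
import math


def grouped_frequency(data, class_length):
    """Return (range, table) where table holds (lower, upper, frequency) rows.

    Assumes exclusive classes: an observation x belongs to a class
    when lower <= x < upper.
    """
    low, high = min(data), max(data)
    data_range = high - low
    # Start the first class at a multiple of class_length at or below the minimum
    start = math.floor(low / class_length) * class_length
    n_classes = math.floor((high - start) / class_length) + 1
    table = []
    for k in range(n_classes):
        lower = start + k * class_length
        upper = lower + class_length
        # Counting plays the role of tally marks
        freq = sum(1 for x in data if lower <= x < upper)
        table.append((lower, upper, freq))
    return data_range, table


# Hypothetical heights in cm
heights = [152, 158, 161, 149, 165, 170, 155, 163, 168, 159, 174, 160]
data_range, table = grouped_frequency(heights, 5)

print("Range:", data_range)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

Here the range is 25 and six classes of length 5 cover it, so the number of classes times the class length (30) is close to the range, as the relation suggests.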
KEY STATISTICAL TERMS IN FREQUENCY DISTRIBUTIONS
- Class Limit (CL): Minimum (LCL) and Maximum (UCL) values of a class.
- Class Boundary (CB): Actual class limits.
- For overlapping (exclusive) classes like 10−20, 20−30: CL coincides with CB.
- For non-overlapping (inclusive) classes like 0−9, 10−19:
$$\text{LCB} = \text{LCL} - \frac{D}{2}, \qquad \text{UCB} = \text{UCL} + \frac{D}{2}$$
where D is the difference between the LCL of the next class and the UCL of the current class.
- Mid-point (Class Mark):
$$\text{Mid-point} = \frac{\text{LCL} + \text{UCL}}{2} = \frac{\text{LCB} + \text{UCB}}{2}$$
- Width (Size):
$$\text{Width} = \text{UCB} - \text{LCB}$$
- Cumulative Frequency:
- Less Than: Total observations less than or equal to the UCB.
- More Than: Total observations more than or equal to the LCB.
- Frequency Density:
$$\text{Frequency Density} = \frac{\text{Class Frequency}}{\text{Class Length}}$$
- Relative Frequency:
$$\text{Relative Frequency} = \frac{\text{Class Frequency}}{\text{Total Frequency}}$$
- Percentage Frequency:
$$\text{Percentage Frequency} = \frac{\text{Class Frequency}}{\text{Total Frequency}} \times 100$$
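The terms in this section can be computed together for a small table. The sketch below assumes inclusive classes with equal gaps between them; the class limits and frequencies are hypothetical, and the dictionary keys are names chosen for the example.

```python
# Hypothetical inclusive classes as (LCL, UCL, frequency)
classes = [(0, 9, 4), (10, 19, 10), (20, 29, 6)]

# D = LCL of the next class - UCL of the current class (assumed the same for all classes)
d = classes[1][0] - classes[0][1]
total = sum(f for _, _, f in classes)

rows = []
cumulative = 0
for lcl, ucl, f in classes:
    lcb, ucb = lcl - d / 2, ucl + d / 2  # class boundaries
    width = ucb - lcb                    # class length
    cumulative += f                      # less-than cumulative frequency at the UCB
    rows.append({
        "lcb": lcb,
        "ucb": ucb,
        "mid": (lcl + ucl) / 2,
        "width": width,
        "less_than_cf": cumulative,
        "density": f / width,
        "relative": f / total,
        "percent": f * 100 / total,
    })

for row in rows:
    print(row)
```

For the first class this gives boundaries of -0.5 and 9.5, a mid-point of 4.5, and a width of 10. The mid-point is the same whether it is computed from the limits or from the boundaries.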
GRAPHICAL REPRESENTATION OF FREQUENCY DISTRIBUTIONS
- Histogram (Area Diagram): Adjacent rectangles with class intervals as the base and frequency/density as the height. Heights are proportional to frequency density if widths are unequal. It is used to determine the Mode.
- Frequency Polygon: A closed figure formed by plotting the points $(x_i, f_i)$, where $x_i$ is the mid-point of the $i$-th class, and joining them to two additional zero-frequency points $(x_0, 0)$ and $(x_{n+1}, 0)$.
- Ogives (Cumulative Frequency Graphs):
- Less Than Ogive: Plotted against upper class boundaries.
- More Than Ogive: Plotted against lower class boundaries.
- Median and Quartiles: The x-coordinate of the point where the two ogives intersect gives the Median ($Q_2$). The quartiles $Q_1$ and $Q_3$ can also be determined graphically.
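Reading a quartile off the less-than ogive has a simple numerical counterpart: linear interpolation along the ogive, assuming its plotted points are joined by straight segments. The sketch below is an illustration of that idea rather than the graphical method itself; the function name `quantile_from_ogive` and the class data are hypothetical.

```python
def quantile_from_ogive(boundaries, frequencies, p):
    """Interpolate the value below which a fraction p of observations lie.

    boundaries: class boundaries in ascending order (one more than frequencies)
    frequencies: class frequencies
    p: fraction between 0 and 1 (0.5 for the median)
    """
    total = sum(frequencies)
    target = p * total
    cumulative = 0
    for i, f in enumerate(frequencies):
        # Find the class in which the less-than cumulative frequency reaches the target
        if f > 0 and cumulative + f >= target:
            lower, upper = boundaries[i], boundaries[i + 1]
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    return boundaries[-1]


# Hypothetical distribution with class boundaries 0-10, 10-20, 20-30, 30-40
boundaries = [0, 10, 20, 30, 40]
frequencies = [5, 15, 20, 10]

q1 = quantile_from_ogive(boundaries, frequencies, 0.25)
median = quantile_from_ogive(boundaries, frequencies, 0.50)
q3 = quantile_from_ogive(boundaries, frequencies, 0.75)
print(q1, median, q3)
```

With 50 observations, the median corresponds to a cumulative frequency of 25, which falls in the class 20-30, giving 20 + (5 / 20) × 10 = 22.5.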
TYPES OF FREQUENCY CURVES
- Definition: A smooth curve where total area equals unity; the limiting form of a histogram.
- Bell-shaped: Frequency starts low, reaches a central maximum, and decreases (e.g., height, weight, marks, profits).
- U-shaped: Frequency is minimum in the center and maximum at the extremities (e.g., Kolkata-bound commuters in peak morning and evening hours).
- J-shaped: Starts with minimum frequency and ends at maximum (e.g., morning commuters entering peak hour). An inverted J-shape also exists.
- Mixed: A combination of multiple frequency curve types.