UNIT 2 – Collection of Data, Sampling, Census & Errors
Statistical Enquiry & Data Sources
- Statistical enquiry: systematic search employing statistical methods to collect quantitative information.
- Data can originate from two fundamental sources:
- Primary source: data generated first-hand by the investigator for the current study.
- Secondary source: pre-existing data gathered earlier for a different purpose.
Primary Data: Meaning & Collection Methods
- Primary data: original, first-hand information collected directly from its source of origin.
- Collected specifically for the investigation at hand; therefore highly specific and usually accurate, but costly.
- Three classic methods (also referred to as the "basic" ways):
- Personal Interview / Direct Personal Investigation
- Investigator meets respondents face-to-face and records answers.
- Advantages:
- High response rate.
- Allows use of all question formats (open-ended, probing, etc.).
- Disadvantages:
- Most expensive (travel, time, manpower).
- Potential interviewer influence or bias.
- Time-consuming and demands considerable effort from the investigator.
- Mailing Questionnaire (Mailed Interview)
- Printed questionnaire sent to respondents; they fill and mail back.
- Advantages:
- Least expensive of the three.
- Often the only practical method for reaching very remote areas.
- No interviewer influence; anonymity may encourage frank answers.
- Disadvantages:
- Can be used only with literate respondents.
- Long response time; lower return rates.
- Cannot observe respondent reactions.
- Telephone Interview
- Data collected via phone calls.
- Advantages:
- Relatively low cost (cheaper than face-to-face).
- Relatively high response rate compared with mail.
- Reduced interviewer influence (no physical presence).
- Disadvantages:
- Limited to populations with telephone access.
- Still cannot watch non-verbal cues.
- Potential for interviewer influence through voice tone.
Secondary Data: Meaning & Sources
- Secondary data: information already collected, processed and possibly published by someone else.
- Typically quicker and cheaper to obtain than primary data, but potentially less specific and less reliable.
- Published sources (formal):
- Government publications (census volumes, economic surveys, statistical yearbooks).
- Semi-government publications.
- Reports of committees & commissions.
- Private publications (journals, newspapers, trade association reports, research-institute monographs).
- International publications (UN, IMF, World Bank, etc.).
- Unpublished sources (informal):
- Internal records of private firms.
- Business enterprise databases.
- Working papers of scholars, researchers, NGOs, etc.
- Often restricted; may not be released to outsiders.
Drafting Questionnaires & Pilot Survey
- Good questionnaire design principles:
- Start with a brief introduction stating purpose.
- Include a reasonable number of questions; avoid respondent fatigue.
- Keep questions short, clear and unambiguous.
- Arrange logically (general → specific; easy → difficult).
- Provide clear instructions for completion.
- Allocate proper space for answers (especially open-ended).
- Ensure all questions are relevant to the investigation.
- Avoid personal or sensitive questions unless essential.
- Do not force respondents to perform calculations.
- Avoid double negatives, leading wording, or indicating preferred answers.
- Plan for cross-verification where feasible.
- Pilot survey (pre-test):
- Small-scale trial of the questionnaire with a subset of the target population.
- Detects ambiguities, timing issues, logistical problems.
- Allows refinement before full-scale rollout.
Census vs Sample Survey
- Census survey (complete enumeration): investigates every element in the population.
- Produces highly accurate, reliable data; because every unit is covered there is no sampling error, although non-sampling errors can still occur.
- Time-consuming, expensive; suitable when population size is small or results must be exhaustive.
- Sample survey: studies a subset (sample) meant to represent the whole.
- Cheaper, faster, operationally feasible.
- Accuracy depends on sampling design; introduces sampling error.
- Comparative summary:
- Enumeration: Census is complete; Sampling is partial.
- Time: Census is long; Sampling is short.
- Cost: Census is expensive; Sampling is economical.
- Reliability: Census is high; Sampling is subject to sampling error.
- Appropriate for: Census – heterogeneous populations; Sampling – homogeneous populations (though modern designs handle heterogeneity via stratification).
Sampling: Types & Procedures
- Population / Universe: entire aggregate of items under investigation.
- Sample: subset chosen to represent the population.
Random (Probability) Sampling
- Each unit has a known, non-zero, often equal probability of selection.
- Promotes representativeness by removing the investigator's choice from selection; allows estimation of sampling error.
- Major forms:
- Simple (Unrestricted) Random Sampling – basic lottery or random-number method; each unit equally likely.
- Stratified Sampling – population divided into homogeneous strata; random sample from each stratum improves precision.
- Systematic Sampling – select every k-th unit after a random start, where k is the sampling interval (population size divided by sample size).
- Multistage / Cluster Sampling – successively sample groups (e.g., states → districts → villages) to reduce cost when population is geographically spread.
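The first three forms above can be sketched in a few lines of Python. This is only an illustrative sketch: the population of 100 numbered households, the rural/urban strata, the 10% sampling fraction and the function names are all assumptions made for the example, and multistage sampling is left out for brevity.

```python
import random

def simple_random_sample(units, n, rng):
    """Lottery method: every unit is equally likely to be drawn."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Pick a random start within the first k units, then take every k-th unit."""
    k = len(units) // n          # sampling interval
    start = rng.randrange(k)     # random start between 0 and k-1
    return [units[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction at random from each stratum."""
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

rng = random.Random(42)                      # fixed seed so the run is repeatable
households = list(range(1, 101))             # hypothetical population of 100 units
strata = {                                   # hypothetical strata
    "rural": list(range(1, 71)),
    "urban": list(range(71, 101)),
}

srs = simple_random_sample(households, 10, rng)
sys_s = systematic_sample(households, 10, rng)
strat = stratified_sample(strata, 0.10, rng)

print("Simple random:", sorted(srs))
print("Systematic:   ", sys_s)
print("Stratified:   ", sorted(strat))
```

In the stratified draw, 7 units come from the rural stratum and 3 from the urban one, so each stratum is represented in proportion to its size.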
Non-Random (Non-Probability) Sampling
- Selection based on investigator judgement or convenience; probabilities unknown.
- Useful in exploratory research, but limited inference.
- Common forms:
- Judgement / Purposive Sampling – expert selects units deemed typical.
- Quota Sampling – interviewer fills quotas for certain characteristics (e.g., age, gender).
- Convenience Sampling – choose whoever is easiest to access.
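Quota sampling can be illustrated with a short sketch. The respondent list, the quota of two per gender and the function name `quota_sample` are hypothetical; the point is only that units are accepted in the order the interviewer meets them, so selection probabilities are unknown.

```python
def quota_sample(arrivals, quotas):
    """Accept respondents in arrival order until each group's quota is full."""
    remaining = dict(quotas)
    chosen = []
    for person in arrivals:
        group = person["gender"]
        if remaining.get(group, 0) > 0:
            chosen.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):   # stop once every quota is filled
            break
    return chosen

# Hypothetical respondents, listed in the order the interviewer meets them
arrivals = [
    {"name": "A", "gender": "F"},
    {"name": "B", "gender": "F"},
    {"name": "C", "gender": "M"},
    {"name": "D", "gender": "F"},
    {"name": "E", "gender": "M"},
    {"name": "F", "gender": "M"},
]

picked = quota_sample(arrivals, {"F": 2, "M": 2})
print([p["name"] for p in picked])
```

Respondent D is skipped because the female quota is already full, and F is never reached; neither had any chance of selection, which is why inference from such samples is limited.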
Errors in Data Collection
- Sampling error: difference between a statistic computed from a sample and the true population parameter (the value a census would give). In symbols, if $\hat{\theta}$ is the sample estimate and $\theta$ the true value, then $\text{Sampling Error} = \hat{\theta} - \theta$.
- Non-sampling errors (can occur in both census and sample surveys):
- Data acquisition errors – recording wrong responses, measurement mistakes, instrument defects (e.g., differing tape measures in the classroom table example).
- Non-response errors – selected units not contacted or refuse to participate; can bias results if non-respondents differ systematically.
- Sampling bias – design/systematic exclusion of some population segments, making it impossible for them to be selected (coverage error).
- Non-sampling errors are often larger and harder to quantify than sampling errors.
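The definition of sampling error as the sample estimate minus the true value can be checked with a small simulation. This is a sketch under stated assumptions: the population of 1,000 random values, the sample sizes 20 and 200, and the 500 repetitions are arbitrary choices for illustration. Non-sampling errors are not modelled here.

```python
import random
import statistics

rng = random.Random(7)                                   # fixed seed for repeatability
population = [rng.randint(10, 100) for _ in range(1000)] # hypothetical population values
theta = statistics.mean(population)                      # true population mean

def sampling_error(n):
    """Sampling error of the mean for one random sample of size n."""
    sample = rng.sample(population, n)
    return statistics.mean(sample) - theta               # estimate minus true value

# Repeat the survey many times to see how far the errors typically stray from zero
errors_small = [sampling_error(20) for _ in range(500)]
errors_large = [sampling_error(200) for _ in range(500)]
spread_small = statistics.pstdev(errors_small)
spread_large = statistics.pstdev(errors_large)

# A "sample" that covers every unit is a census, so its sampling error is zero
census_error = sampling_error(len(population))

print("Typical error, n = 20: ", round(spread_small, 2))
print("Typical error, n = 200:", round(spread_large, 2))
print("Census error:", census_error)
```

Larger samples give errors that cluster more tightly around zero, and complete enumeration removes sampling error altogether.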
Census of India & National Sample Survey Organisation (NSSO)
- Census of India
- One of the world’s largest administrative operations.
- Conducted every 10 years (decennial).
- Provides complete, continuous demographic data: size, growth, distribution, density, sex ratio, literacy, etc.
- NSSO (National Sample Survey Organisation)
- Government body founded to execute nationwide sample surveys on socio-economic topics: employment, literacy, maternity, child care, PDS utilisation, etc.
- Releases findings through detailed reports and quarterly journal "Sarvekshana".
- Data underpin policy formulation and mid-term planning.
Key Terminology & Quick Facts
- Statistical enquiry: search for quantitative facts via statistical methods.
- Data: collection of facts/measurements; fundamental tool for inference.
- Pilot survey: small-scale rehearsal to refine main survey.
- Sampling frame: list of population units from which the sample is drawn (a prerequisite for random sampling).
- Representativeness: degree to which sample mirrors population characteristics.
- Reliability vs Accuracy: reliability – consistency; accuracy – closeness to true value.
- Census years in India: 1951, 1961, … 2011; the next was due in 2021 but has been delayed.
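The reliability versus accuracy distinction listed above can be made concrete with a small numeric sketch. All readings below are hypothetical: two tape measures are imagined measuring a table whose true length is assumed to be 150 cm.

```python
import statistics

true_length = 150.0  # hypothetical true table length in cm

# Hypothetical repeated measurements from two tape measures
tape_a = [152.1, 152.0, 152.2, 151.9, 152.0]   # consistent, but reads about 2 cm long
tape_b = [148.5, 151.8, 149.6, 150.9, 149.2]   # centred on the truth, but scattered

def summarize(readings):
    bias = statistics.mean(readings) - true_length   # accuracy: closeness to true value
    spread = statistics.stdev(readings)              # reliability: consistency of readings
    return bias, spread

bias_a, spread_a = summarize(tape_a)
bias_b, spread_b = summarize(tape_b)

print(f"Tape A: bias {bias_a:+.2f} cm, spread {spread_a:.2f} cm")
print(f"Tape B: bias {bias_b:+.2f} cm, spread {spread_b:.2f} cm")
```

Tape A is reliable but not accurate (small spread, large bias); tape B is accurate on average but less reliable (negligible bias, larger spread).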