UNIT 2 – Collection of Data, Sampling, Census & Errors

Statistical Enquiry & Data Sources

  • Statistical enquiry: systematic search employing statistical methods to collect quantitative information.
  • Data can originate from two fundamental sources:
    • Primary source: data generated first-hand by the investigator for the current study.
    • Secondary source: pre-existing data gathered earlier for a different purpose.

Primary Data: Meaning & Collection Methods

  • Primary data: original, first-hand information collected directly from its source of origin.
  • Collected afresh for the specific investigation; therefore highly specific and accurate, but usually costly.
  • Three classic methods (also referred to as the "basic" ways):
    • Personal Interview / Direct Personal Investigation
    • Investigator meets respondents face-to-face and records answers.
    • Advantages:
      • High response rate.
      • Allows use of all question formats (open-ended, probing, etc.).
    • Disadvantages:
      • Most expensive (travel, time, manpower).
      • Potential interviewer influence or bias.
      • Time-consuming and demanding of the investigator's effort.
    • Mailing Questionnaire (Mailed Interview)
    • Printed questionnaire sent to respondents; they fill and mail back.
    • Advantages:
      • Least expensive of the three.
      • Only practical method to reach very remote areas.
      • No interviewer influence; anonymity may encourage frank answers.
    • Disadvantages:
      • Can be used only with literate respondents.
      • Long response time; lower return rates.
      • Cannot observe respondent reactions.
    • Telephone Interview
    • Data collected via phone calls.
    • Advantages:
      • Relatively low cost (cheaper than face-to-face).
      • Relatively high response rate compared with mail.
      • Reduced interviewer influence (no physical presence).
    • Disadvantages:
      • Limited to populations with telephone access.
      • Still cannot watch non-verbal cues.
      • Potential for interviewer influence through voice tone.

Secondary Data: Meaning & Sources

  • Secondary data: information already collected, processed and possibly published by someone else.
  • Typically quicker and cheaper to obtain, but potentially less specific to the current study and less reliable.
  • Published sources (formal):
    1. Government publications (census volumes, economic surveys, statistical yearbooks).
    2. Semi-government publications.
    3. Reports of committees & commissions.
    4. Private publications (journals, newspapers, trade association reports, research-institute monographs).
    5. International publications (UN, IMF, World Bank, etc.).
  • Unpublished sources (informal):
    • Internal records of private firms.
    • Business enterprise databases.
    • Working papers of scholars, researchers, NGOs, etc.
    • Often restricted; may not be released to outsiders.

Drafting Questionnaires & Pilot Survey

  • Good questionnaire design principles:
    • Start with a brief introduction stating purpose.
    • Include a reasonable number of questions; avoid respondent fatigue.
    • Keep questions short, clear and unambiguous.
    • Arrange logically (general → specific; easy → difficult).
    • Provide clear instructions for completion.
    • Allocate proper space for answers (especially open-ended).
    • Ensure all questions are relevant to the investigation.
    • Avoid personal or sensitive questions unless essential.
    • Do not force respondents to perform calculations.
    • Avoid double negatives, leading wording, or indicating preferred answers.
    • Plan for cross-verification where feasible.
  • Pilot survey (pre-test):
    • Small-scale trial of the questionnaire with a subset of the target population.
    • Detects ambiguities, timing issues, logistical problems.
    • Allows refinement before full-scale rollout.

Census vs Sample Survey

  • Census survey (complete enumeration): investigates every element in the population.
    • Produces highly accurate, reliable data: every unit is enumerated, so there is no sampling error (non-sampling errors can still occur).
    • Time-consuming, expensive; suitable when population size is small or results must be exhaustive.
  • Sample survey: studies a subset (sample) meant to represent the whole.
    • Cheaper, faster, operationally feasible.
    • Accuracy depends on sampling design; introduces sampling error.
  • Comparative summary:
    • Enumeration: Census – complete; Sampling – partial.
    • Time: Census – long; Sampling – short.
    • Cost: Census – expensive; Sampling – economical.
    • Reliability: Census – high; Sampling – subject to sampling error.
    • Appropriate for: Census – heterogeneous populations; Sampling – homogeneous populations (though modern designs handle heterogeneity via stratification).

Sampling: Types & Procedures

  • Population / Universe: entire aggregate of items under investigation.
  • Sample: subset chosen to represent the population.

Random (Probability) Sampling

  • Each unit has a known, non-zero, often equal probability of selection.
  • Guards against selection bias, so samples tend to be representative; allows estimation of sampling error.
  • Major forms:
    1. Simple (Unrestricted) Random Sampling – basic lottery or random-number method; each unit equally likely.
    2. Stratified Sampling – population divided into homogeneous strata; random sample from each stratum improves precision.
    3. Systematic Sampling – select every k-th unit after a random start, where k is the sampling interval (roughly population size ÷ sample size).
    4. Multistage / Cluster Sampling – successively sample groups (e.g., states → districts → villages) to reduce cost when population is geographically spread.
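
The first three forms above can be sketched in a few lines of Python. This is a minimal illustration, not a production sampler: the function names, the 100-household frame and the rural/urban split are all hypothetical, and the stratified draw assumes proportional allocation (the same fraction from every stratum), which the notes do not specify.

```python
import random
from collections import defaultdict


def simple_random_sample(units, n, rng):
    """Lottery method: every unit is equally likely to be drawn (no replacement)."""
    return rng.sample(units, n)


def systematic_sample(units, n, rng):
    """Pick a random start among the first k units, then take every k-th unit."""
    k = len(units) // n          # sampling interval
    start = rng.randrange(k)     # random start in 0 .. k-1
    return [units[start + i * k] for i in range(n)]


def stratified_sample(units, stratum_of, fraction, rng):
    """Group units into strata, then draw a simple random sample from each one."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Hypothetical sampling frame: 100 households, 60 rural and 40 urban.
frame = [(i, "rural" if i < 60 else "urban") for i in range(100)]
rng = random.Random(42)  # fixed seed so the draw is repeatable

srs = simple_random_sample(frame, 10, rng)
sys_sample = systematic_sample(frame, 10, rng)
strat = stratified_sample(frame, lambda u: u[1], 0.10, rng)
```

With this frame the stratified draw always contains 6 rural and 4 urban households, whereas the simple random draw may over- or under-represent either group by chance.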

Non-Random (Non-Probability) Sampling

  • Selection based on investigator judgement or convenience; probabilities unknown.
  • Useful in exploratory research, but limited inference.
  • Common forms:
    • Judgement / Purposive Sampling – expert selects units deemed typical.
    • Quota Sampling – interviewer fills quotas for certain characteristics (e.g., age, gender).
    • Convenience Sampling – choose whoever is easiest to access.
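
Quota sampling can be illustrated with a short Python sketch. The respondent names, the quota sizes and the `quota_sample` helper are hypothetical; the point is only that units are accepted in the order the interviewer meets them, so selection probabilities are unknown.

```python
def quota_sample(stream, group_of, quotas):
    """Accept respondents in arrival order until each group's quota is filled."""
    remaining = dict(quotas)
    chosen = []
    for person in stream:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            chosen.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # every quota is full; stop interviewing
    return chosen


# Hypothetical respondents, in the order an interviewer happens to meet them.
passers_by = [
    ("Asha", "F"), ("Ravi", "M"), ("Meena", "F"), ("Sunil", "M"),
    ("Kiran", "M"), ("Latha", "F"), ("Imran", "M"), ("Divya", "F"),
]
picked = quota_sample(passers_by, lambda p: p[1], {"F": 2, "M": 2})
print(picked)
```

Here the first two women and first two men encountered fill the quotas, and the remaining four people never have a chance of selection, which is why inference from such samples is limited.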

Errors in Data Collection

  • Sampling error: difference between a statistic computed from a sample and the true population parameter (what you would obtain under a census). In symbols, if \hat{\theta} is the sample estimate and \theta the true population value, then Sampling Error = \hat{\theta} - \theta.
  • Non-sampling errors (can occur in both census and sample surveys):
    1. Data acquisition errors – recording wrong responses, measurement mistakes, instrument defects (e.g., differing tape measures in the classroom table example).
    2. Non-response errors – selected units not contacted or refuse to participate; can bias results if non-respondents differ systematically.
    3. Sampling bias – design/systematic exclusion of some population segments, making it impossible for them to be selected (coverage error).
  • Non-sampling errors are often larger and harder to quantify than sampling errors.
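
The two kinds of error can be made concrete with a small simulation. This is a sketch under stated assumptions: the income population is randomly generated, and the rule that households above 40,000 refuse to respond is invented purely to show how non-response can bias a result.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population: monthly incomes of 1,000 households.
population = [rng.randint(5_000, 50_000) for _ in range(1_000)]
theta = mean(population)  # true parameter, i.e. what a census would report


def sampling_error(sample, true_value):
    """Sample estimate minus true value, as in the definition above."""
    return mean(sample) - true_value


# A census enumerates every unit, so its sampling error is exactly zero.
census_error = sampling_error(population, theta)

# A sample survey of 50 households generally misses theta by some amount.
sample = rng.sample(population, 50)
sample_error = sampling_error(sample, theta)

# Non-response (assumed rule): households earning above 40,000 refuse.
# Even a complete enumeration then understates the mean: a non-sampling error.
respondents = [x for x in population if x <= 40_000]
nonresponse_bias = mean(respondents) - theta

print(f"census error:      {census_error:.2f}")
print(f"sample error:      {sample_error:.2f}")
print(f"non-response bias: {nonresponse_bias:.2f}")
```

Increasing the sample size shrinks the sampling error on average, but it does nothing to the non-response bias, which is one reason non-sampling errors are treated as the more serious problem.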

Census of India & National Sample Survey Organisation (NSSO)

  • Census of India
    • One of the world’s largest administrative operations.
    • Conducted every 10 years.
    • Provides complete, continuous demographic data: size, growth, distribution, density, sex ratio, literacy, etc.
  • NSSO (National Sample Survey Organisation)
    • Government body founded to execute nationwide sample surveys on socio-economic topics: employment, literacy, maternity, child care, PDS utilisation, etc.
    • Releases findings through detailed reports and its quarterly journal "Sarvekshana".
    • Data underpin policy formulation and mid-term planning.

Key Terminology & Quick Facts

  • Statistical enquiry: search for quantitative facts via statistical methods.
  • Data: collection of facts/measurements; fundamental tool for inference.
  • Pilot survey: small-scale rehearsal to refine main survey.
  • Sampling frame: list of population units from which sample is drawn (implicit in random sampling).
  • Representativeness: degree to which sample mirrors population characteristics.
  • Reliability vs Accuracy: reliability – consistency; accuracy – closeness to true value.
  • Census years in India: 1951, 1961, … 2011; the census due in 2021 has been delayed.