Notes on Official Statistics, Data Access, and Research Ethics

Assignment overview

  • Two-question, learning-by-doing assignment is live; just under a couple of weeks to complete. Do not leave it to the last moment.
  • Question 1: short, bullet-point answers about the survey we did in class; essentially the same survey, with slightly different results; include a sensitivity analysis (how bad could things be).
  • Question 2: official statistics question; use Stats NZ website to investigate whether higher qualifications improve employment chances using recent, real data.
  • You need to learn to navigate Stats NZ resources, including finding variables that match the question, then exporting a table to Excel or viewing it on screen.
  • The emphasis is on learning-by-doing and practical data access, not long prose answers. For Question 1, copy the slide example for each faculty and perform the same calculation; assume those who didn’t attend class were in that faculty (or not in that faculty) as per the slide example.
  • If you’re stuck, use the slide example, watch the video, and repeat the steps; email the lecturer with questions; there will be (virtual) office hours during the assignment window.
  • Today’s main topics: official statistics, the integrated data infrastructure (IDI), data labs, data linkage, and research ethics.
  • The course materials emphasize practical skills for working with official statistics and responsible usage of linked data.

What we’re aiming to cover about Stats NZ and data

  • Stats NZ (the national statistics office) and its role in policy, oversight, analysis, data collection, and increasingly linkage of datasets.
  • The independence of the Government Chief Statistician: answers to the Governor-General, not to Parliament or a minister; ministers set policy and budget but cannot command data collection or the suppression of data.
  • The Five SAFES framework for accessing microdata resources (security and ethics): Safe People, Safe Projects, Safe Settings, Safe Data, Safe Output.
  • The Integrated Data Infrastructure (IDI): a linkage-based research database that combines data from multiple government sources; access is restricted to research use in a safe environment; it is not for running policy agencies or surveillance.
  • The IDI creates an identity spine (a linkage backbone) using identifiers like visa data, birth data, and tax data; other datasets are deterministically linked where possible or probabilistically linked when identifiers differ or are missing.
  • Access to IDI requires a credible statistical project with five greens (see SAFES). Projects must be for statistical purposes, for public good, conducted by a credible team, and data must be available for use in NZ (or researchers must be based in NZ).
  • The IDI environment includes a lockdown setting where researchers can work with data subsets; outputs must be confidentialized and checked before release.
  • Typical tools and skills for working with IDI include R, SQL, and Python; SAS/SPSS/Stata may appear in the environment as proprietary software.
  • There are other important data ecosystems: Aotearoa Data Explorer (a user-friendly tabulated data tool), and data link databases (as in Singapore’s model) that enable longitudinal linkage for microdata research.
  • Australia, the US, Canada, and the EU have similar national statistics systems; international comparability is a key feature of official statistics (tier-one statistics).

Question 1: survey analysis and sensitivity

  • The first question uses a survey example from class; you will perform the same calculation for each faculty derived from the survey data.
  • The sensitivity analysis asks how bad the situation could be (i.e., worst-case scenarios) and what it implies for the results.
  • You should provide brief, bullet-point style answers, not long paragraphs.
  • The exercise reinforces the idea that official statistics provide standardized, comparable measures across time and countries, even though there might be non-obvious biases or missing data.
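
  One way to read "how bad could things be" is as a pair of worst-case bounds: recompute the faculty proportion once assuming every non-respondent was in the faculty, and once assuming none were. The sketch below does that for a single faculty. The function name and all numbers are made up for illustration; they are not the class survey results, and the slide example remains the reference for the exact calculation expected.

```python
def worst_case_bounds(n_in_faculty, n_responded, n_enrolled):
    """Return (lower, observed, upper) proportions for one faculty.

    lower: every non-respondent is assumed NOT to be in the faculty.
    upper: every non-respondent is assumed to be in the faculty.
    """
    if n_responded <= 0 or not (0 <= n_in_faculty <= n_responded <= n_enrolled):
        raise ValueError("need 0 <= n_in_faculty <= n_responded <= n_enrolled, n_responded > 0")
    n_missing = n_enrolled - n_responded          # students who did not answer
    observed = n_in_faculty / n_responded         # what the survey alone suggests
    lower = n_in_faculty / n_enrolled
    upper = (n_in_faculty + n_missing) / n_enrolled
    return lower, observed, upper


# Hypothetical counts, for illustration only.
lower, observed, upper = worst_case_bounds(n_in_faculty=30, n_responded=60, n_enrolled=100)
print(f"observed {observed:.2f}, worst-case range {lower:.2f} to {upper:.2f}")
```

  The wider the gap between the two bounds, the less the observed figure can be trusted on its own; the gap shrinks as the response rate rises.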

Question 2: official statistics and using Stats NZ data

  • The second question asks you to investigate whether higher qualifications improve employment prospects using recent, real data from Stats NZ.
  • You will need to locate variables in the Labour Force Survey (LFS) within Stats NZ’s Info Share tool and select the variables that match the question.
  • Steps to access data on Stats NZ site:
    • Go to Tools -> Info Share.
    • Look for topics related to work, income, and spending; select Labour Force Survey (LFS).
    • Identify the variables needed for your question and add them to the table on screen; you can export to Excel.
    • If you get stuck, use the Help button at the top.
  • The assignment emphasizes learning to discover data resources, rather than memorizing static datasets.
  • There are not many marks for Question 2; it’s primarily about learning-by-doing and using real official statistics.
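
  Once a table has been exported, the comparison across qualification levels comes down to two standard rates: the employment rate (employed divided by the working-age population) and the unemployment rate (unemployed divided by the labour force). The sketch below computes both from a small CSV. The column names and the counts are invented stand-ins for an exported table, not Stats NZ figures, so the real export will need its own column mapping.

```python
import csv
import io

# Made-up counts (thousands) standing in for an exported table; NOT Stats NZ data.
EXPORTED = """qualification,employed,unemployed,not_in_labour_force
No qualification,200,20,180
School qualification,400,30,170
Bachelor's degree or higher,600,20,130
"""


def rates_by_qualification(csv_text):
    """Return employment and unemployment rates for each qualification row."""
    rates = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        employed = float(row["employed"])
        unemployed = float(row["unemployed"])
        not_in_lf = float(row["not_in_labour_force"])
        labour_force = employed + unemployed
        working_age = labour_force + not_in_lf
        rates[row["qualification"]] = {
            "employment_rate": employed / working_age,
            "unemployment_rate": unemployed / labour_force,
        }
    return rates


rates = rates_by_qualification(EXPORTED)
for qualification, r in rates.items():
    print(f"{qualification}: employed {r['employment_rate']:.1%}, "
          f"unemployed {r['unemployment_rate']:.1%}")
```

  A difference in rates between qualification groups is descriptive only; age, region, and other factors differ between the groups too.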

Official statistics: what they are and why they matter

  • Official statistics are data and statistics produced by government departments, surveys, administrative records, registrations, maps, etc., with legal definitions and standardized methods.
  • They serve government, local authorities, and businesses for policy decisions; they also provide baseline data for comparing samples to populations and assessing sampling biases.
  • They are robust and internationally comparable, following documented methods and principles; this comparability supports international treaties and finance arrangements.
  • In NZ, Stats NZ leads the system; data from health, education, and other sectors are linked through IDI for research, under strict governance and ethics rules.
  • The production of official statistics is not about “objective truth” but about standardized measures that enable reproducibility and comparability across time and borders.
  • Tier-one statistics represent the most important categories (economic, population, environment, etc.) with long-term continuity and international comparability.
  • A key tension: official statistics are often more about process and standardization than raw “truth,” but their rigor makes them essential for policy and governance.

Data collection methods: sample surveys vs administrative data

  • Sample surveys
    • Focused on representing the population at a point in time (snapshot).
    • Structured observation with a well-defined sampling frame and relatively high response rates (often 80-90%, and historically around 90%).
    • Expensive and carefully designed to minimize nonresponse bias; target response rates vary by survey type.
    • Examples include household surveys and the Household Labour Force Survey (HLFS).
  • Administrative data (admin data)
    • Generated as a by-product of government services and transactions; continuous, transactional data.
    • Not a snapshot; ongoing, updated as people interact with services (e.g., entering or leaving a service system, paying a bill, crossing borders).
    • Not a population-wide census of the general public; rather a “census of those engaged with the service”.
    • Comes with issues like missing data and definitional mismatches because variables reflect administrative, not research, purposes.
    • Useful for secondary data analysis, but researchers must be mindful of how concepts map to statistical definitions.
  • Key contrast: surveys provide constructs defined for research; admin data reflects operational records and may require harmonization to align with research questions.
  • COVID example: measuring vaccination rates using administrative data can miss groups not engaging with health services, leading to biased or understated population coverage if a census population is not considered.
  • Admin data can yield rich longitudinal perspectives, but missing data and linkage bias are ongoing concerns; synthetic data and confidentialized unit record files (CURFs) are used in the US to protect privacy in microdata releases.
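
  The vaccination example is, at bottom, a question about denominators. A minimal arithmetic sketch follows, assuming the same vaccinated count is divided by two different populations: people who appear in service records, and a census-based estimate of everyone. All three numbers are invented. The size and direction of the gap in practice depend on who is missing from the admin records and whether they were vaccinated.

```python
def coverage(vaccinated, denominator):
    """Share of the denominator population recorded as vaccinated."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return vaccinated / denominator


# Hypothetical figures, for illustration only.
vaccinated = 800               # people recorded as vaccinated
admin_denominator = 900        # people who appear in health-service records
population_denominator = 1000  # census-based estimate of the whole population

admin_rate = coverage(vaccinated, admin_denominator)
population_rate = coverage(vaccinated, population_denominator)
not_in_admin = population_denominator - admin_denominator

print(f"admin-based rate:      {admin_rate:.1%}")
print(f"population-based rate: {population_rate:.1%}")
print(f"people absent from admin records: {not_in_admin}")
```

  In this made-up case the 100 people absent from service records change the reported rate by almost nine percentage points, which is why the choice of denominator needs to be stated alongside any coverage figure.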

The Integrated Data Infrastructure (IDI) and data access

  • IDI is a linked microdata environment that combines data from health, education, tax, migration, and other sources for research.
  • It creates an identity spine that links individuals across datasets; the spine uses identifiers (deterministic links when possible; probabilistic links when exact matches are incomplete).
  • Access to IDI is strictly for research in safe environments and requires meeting the Five SAFES criteria (the “five greens”):
    • Safe People
    • Safe Projects
    • Safe Settings
    • Safe Data
    • Safe Output
  • IDI operations are privacy-protective: identifiers are removed, data is de-identified in a controlled environment, and researchers access data within a locked-down system.
  • Data in IDI are linked deterministically where possible; when deterministic linking is not possible, probabilistic linking is used, which requires careful handling of linkage bias.
  • You must be based in NZ to access IDI; collaboration with credible teams is required; the project must have a clear statistical purpose and public benefit.
  • The IDI supports a wide range of data, with thousands of variables (e.g., up to about 54,000 variables in the IDI as of the latest update mentioned).
  • There are practical tools to work with IDI data, including search apps that map datasets to years and variables and show data dictionaries; data access is via a lockdown environment and outputs must be reviewed for confidentiality before release.
  • Skills recommended for IDI work include: R, SQL, and Python; other statistical packages (SAS, SPSS, Stata) may be used as needed.
  • The IDI is part of a broader ecosystem that includes the Virtual Health Information Network and data-linking efforts; it supports secondary reuse of official statistics data for research and policy analysis.
  • The five greens are what you must achieve to access IDI data for a project: research must be statistical, for public good, conducted by a credible team, data must be accessible in NZ, and there must be a data-use agreement.
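
  To make the idea of an identity spine concrete, here is a toy sketch: records from several sources are given one shared spine id per person, and the identifying fields are then dropped so that only the spine id and the analysis variables remain. It is not how Stats NZ builds the IDI spine. The matching key (normalized name plus date of birth), the field names, and the records are all assumptions chosen to keep the example small.

```python
from itertools import count


def person_key(record):
    """Hypothetical matching key: normalized name plus date of birth."""
    return (record["name"].strip().lower(), record["dob"])


def build_spine(*sources):
    """Assign one spine id per distinct person key seen across all sources."""
    next_id = count(1)
    spine = {}
    for records in sources:
        for record in records:
            key = person_key(record)
            if key not in spine:
                spine[key] = f"S{next(next_id):04d}"
    return spine


def deidentify(records, spine):
    """Replace identifying fields with the spine id, keeping other variables."""
    cleaned = []
    for record in records:
        payload = {k: v for k, v in record.items() if k not in ("name", "dob")}
        cleaned.append({"spine_id": spine[person_key(record)], **payload})
    return cleaned


# Invented records, for illustration only.
tax = [{"name": "Ana Smith", "dob": "1990-01-01", "income": 50000}]
births = [{"name": "ana smith ", "dob": "1990-01-01", "birth_region": "Auckland"}]
visas = [{"name": "Li Wei", "dob": "1985-05-05", "visa_type": "work"}]

spine = build_spine(tax, births, visas)
tax_clean = deidentify(tax, spine)
births_clean = deidentify(births, spine)
print(tax_clean, births_clean)
```

  Because both cleaned tables carry the same spine id for the same person, they can be joined for analysis without names or birth dates ever appearing in the research copy.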

Five SAFES framework (in detail)

  • Safe People: researchers must be trusted individuals with appropriate expertise and roles.
  • Safe Projects: research must have a legitimate purpose with explicit statistical aims and ethical considerations.
  • Safe Settings: data access occurs in controlled, secure environments with appropriate oversight.
  • Safe Data: data are protected, de-identified, and access is restricted to approved projects and personnel.
  • Safe Output: results are checked to prevent disclosure of individuals; outputs are aggregated or suitably anonymized before release.
  • The SAFES framework is widely adopted globally as a standard for microdata access and is foundational for working with the IDI and similar linked data systems.
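
  Safe Output checks often involve suppressing small cells before a table is released. A minimal sketch of that one step follows, assuming a plain dictionary of counts. The threshold of 6 and the "S" marker are illustrative parameters, not a statement of the rules applied to any particular data lab; real output checking typically involves further steps such as rounding and checks across related tables.

```python
def suppress_small_cells(table, threshold=6, marker="S"):
    """Replace any count below `threshold` with `marker`.

    `threshold` and `marker` are assumed values for illustration.
    """
    return {
        cell: (n if n >= threshold else marker)
        for cell, n in table.items()
    }


# Invented counts, for illustration only.
raw_counts = {"Region A": 120, "Region B": 4, "Region C": 6}
published = suppress_small_cells(raw_counts)
print(published)
```

  Suppressing one cell is not always enough: if row or column totals are also published, a suppressed value can sometimes be recovered by subtraction, so totals need checking as well.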

Data linkage concepts: deterministic vs probabilistic linking

  • Deterministic linking: exact matches on identifiers (e.g., exact birth date, name, national ID) to join records across datasets.
  • Probabilistic linking: uses statistical models to link records when exact matches aren’t available or identifiers differ (e.g., misspellings, missing data); yields linking probabilities and may require bias assessment.
  • The IDI uses both methods to construct a coherent identity spine and to maximize linkage while preserving privacy.
  • Linkage bias is a risk when certain groups are more likely to be linked successfully than others; researchers must account for this in analysis and reporting.
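
  The contrast between the two linking methods can be sketched in a few lines. Deterministic linking below requires every field to agree exactly. The probabilistic step uses a simple agreement-weight score in the style of the Fellegi-Sunter model: each field adds log2(m/u) when it agrees and log2((1-m)/(1-u)) when it disagrees, and the total is compared with a threshold. The m and u values, the threshold, the field names, and the records are all assumptions for illustration, not parameters used in the IDI.

```python
import math

# Assumed per-field probabilities (illustrative values, not estimates):
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "dob": (0.98, 0.003),
}


def deterministic_match(a, b):
    """Exact agreement on every field, with no missing values."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in FIELD_PARAMS)


def match_weight(a, b):
    """Sum of log2 agreement/disagreement weights over the compared fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if a.get(field) is None or b.get(field) is None:
            continue  # a missing value contributes no evidence either way
        if a[field] == b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def link(a, b, threshold=8.0):
    """Classify a record pair; `threshold` is an assumed cut-off."""
    if deterministic_match(a, b):
        return "deterministic"
    if match_weight(a, b) >= threshold:
        return "probabilistic"
    return "no link"


# Invented records, for illustration only.
rec_a = {"surname": "Jones", "first_name": "Katherine", "dob": "1980-02-03"}
rec_b = {"surname": "Jones", "first_name": "Kathrine", "dob": "1980-02-03"}
rec_c = {"surname": "Brown", "first_name": "Tom", "dob": "1975-07-07"}

for other in (dict(rec_a), rec_b, rec_c):
    print(link(rec_a, other), round(match_weight(rec_a, other), 2))
```

  Every disagreeing field lowers the weight, so people whose names are more often misspelled or recorded inconsistently are less likely to clear the threshold. That is one route by which linkage bias enters an analysis.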

Tools and platforms for Stats NZ data access

  • Stats NZ website: primary portal for official statistics; use the main search bar and navigate via the top bar to access data and publications.
  • Info Share: a data discovery tool under Stats NZ Tools that helps you identify topics (e.g., work, income, spending) and the related datasets (e.g., the Labour Force Survey) and select variables.
  • Aotearoa Data Explorer: a user-friendly tool for building tabulated data sets; NZ version of a common data-explorer interface; helpful for generating tables and basic analyses.
  • IDI search app (and data dictionaries): a search app that shows which datasets and years contain specific variables; helps you plan analyses before applying for access.
  • IDI process: data from multiple agencies are processed and linked by Stats NZ; identifiers are removed; a lockdown environment is used for analysis; outputs must be confidentialized and checked.
  • Language and skill requirements: proficiency in R, SQL, and Python is highly recommended for doing analysis in the IDI environment; familiarity with SAS/SPSS/Stata may also be useful depending on dataset and code.

Ethics in research

  • Ethics matters globally and across official statistics; there are international standards (Nuremberg Code; IEEE Global Initiative on Ethics of AI; etc.) and national/regional institutional standards.
  • Why ethics approval matters
    • It’s the ethically correct thing to do and often a requirement from employers, universities, and funders.
    • Without ethics approval, researchers may face consequences, including loss of funding and potential legal exposure; some funding bodies require ethics approval before funding.
    • If research involves health data or personal data, ethics approval is typically required due to privacy and disclosure risks.
  • The role of ethics in linked-data and secondary-use research
    • Linked data and secondary data reuse require careful ethical review; many institutions now require formal ethics review for such projects.
  • Māori responsiveness and equity in research
    • There is a focus on ensuring research reporting and sampling designs support equity for Māori and other indigenous populations; reporting must be capable of producing valid estimates for these groups.
  • Examples and cautionary tales shared
    • A historical case of unethical cervical cancer research at the university, which led to unnecessary surgeries and fatalities; whistleblowers played a role in stopping unethical practices.
    • A separate case (Middlemore Hospital) involving a proposed trial on new mothers without proper consent, which was stopped quickly.
  • Practical implications for statisticians
    • Minimizing disclosure risk and protecting participant privacy through statistical techniques and careful reporting.
    • Ensuring study designs are scientifically sound and ethically defensible.
  • The ethics landscape in NZ
    • NZ has a structured ethics framework with multiple levels of review (institutional, national, etc.); the review process can slow projects but provides essential safety oversight.
  • Summary takeaway
    • Ethics are integral to responsible data analysis, particularly when dealing with sensitive health data, linked data, or research that could reveal information about identifiable individuals or groups.

NZ official statistics governance and independence

  • Stats NZ is the national statistics office; it runs the official statistics system and is responsible for policy oversight, data collection, analysis, and increasingly data linkage.
  • The Government Chief Statistician (GCS) has a high degree of independence and reports to the Governor-General rather than directly to Parliament or the Minister.
  • There have been historical shifts in Stats NZ’s role, including reduction in some areas (e.g., internal research/training activities) and a stronger emphasis on data gathering and linking.
  • The relationship with other agencies (e.g., Ministry of Health) involves collaboration and policy alignment but remains distinct in terms of data governance and ethics.
  • Practical tip for students: when searching Stats NZ, use the top search bar and check the first few pages of results; the site is large and frequently updated, especially around mid-year changes.

Health data, mortality, morbidity, and data sources (overview)

  • Mortality data: death certification is a foundational health statistic; historically tied to pandemics and property (inheritance) concerns; essential for health policy and epidemiology.
  • Ill-health and morbidity data: hospital discharges and mandatory reporting for certain conditions (e.g., infectious diseases, lead poisoning) contribute to official health statistics.
  • Health data sources are multi-layered:
    • Admin data from health services (e.g., hospital records, private sector data when Government pays for services).
    • Population health indicators often rely on sample surveys (e.g., HLFS, health surveys) to capture information not present in admin data.
  • The challenge in health statistics: admin data reflects service use, not true population health, so surveys are needed to capture health status in the population beyond service use.
  • The relationship between levels of care (primary, secondary, tertiary) and data sources reflects how data is captured and aggregated for official statistics.

Practical reminders for exam preparation

  • Be ready to explain the difference between official statistics and other data sources, and why standardization and international comparability matter.
  • Be able to describe the two main data collection approaches and their trade-offs (surveys vs admin data).
  • Be comfortable with the concept of the IDI, identity spine, linking methods (deterministic vs probabilistic), and the five greens of SAFES.
  • Know where to find data in Stats NZ (Info Share, the Labour Force Survey, Aotearoa Data Explorer) and how to export or view tables.
  • Understand the ethical and governance framework around official statistics and linked data, including the historical context and the importance of Māori responsiveness.
  • For practical questions, remember to keep answers concise (Question 1) but accurate, and demonstrate ability to locate and interpret official statistics data for Question 2.

Quick glossary and key terms

  • Integrated Data Infrastructure (IDI): linked microdata repository for research; includes an identity spine; accessed under the Five SAFES rules.
  • Five SAFES: Safe People, Safe Projects, Safe Settings, Safe Data, Safe Output.
  • Safe Output: outputs are checked to prevent disclosure risk; often requires aggregation or suppression of small cells.
  • Deterministic linking: exact matches across datasets.
  • Probabilistic linking: probabilistic matches when exact identifiers are missing or differ.
  • Identity spine: the linking backbone that associates records to individuals across datasets.
  • Aotearoa Data Explorer: user-friendly tool to build tabulated data.
  • Labour Force Survey (LFS): key official statistics survey for work, income, and employment measures.
  • Tier-one statistics: the most important official statistics, with long-run continuity and international comparability.
  • Health statistics vs ill-health statistics: mortality and morbidity data; some health data come from admin records, others from surveys.

Note on approaching the material

  • The lecture emphasizes practical data literacy: how to navigate official data portals, identify relevant variables, and conduct analyses that align with official statistics standards.
  • Remember the ethical, legal, and governance boundaries around data access and analysis when using official statistics and IDI data.
  • The overarching goal is to develop skills that allow you to access real data, understand its limitations, and produce informative, policy-relevant insights while protecting privacy and honoring ethical considerations.