Lecture 5 Notes: Primary vs Secondary Research, Sampling, and Secondary Data Sources

Opening inspiration: "Supposing is good, but finding out is better." – emphasizes humility in research: many people assume they know the information, but the goal is to actually find out through data collection and analysis.
Core focus of today: secondary vs primary research, why you should do secondary research first, and how to evaluate data quality.
Real-world framing: example job description in marketing research analysis reveals implied expectations (tight timelines, multiple quick-turn deliverables) and the importance of understanding sampling when interpreting survey results.

Primary vs Secondary Research

Primary research: data you collect yourself for a specific question or project.
Secondary research: data collected by someone else that is repurposed for your use. Can be internal or external.
Purpose of secondary data: to build context, identify gaps, test feasibility, and guide the design of primary data collection.
Key idea: always consider method and limitations of data sources before acting on them.

Sampling concepts: size, randomness, and representativeness

National survey example: Simmons uses a 25,000-person sample; for many questions, a 2,000-respondent sample could suffice, depending on the questions and desired precision.
Why larger samples? Larger samples reduce sampling error, which matters most when the effects being measured are small or when you need to analyze subgroups. The benefit only holds if the sample is drawn randomly.
Randomness and representativeness:
- A completely random sample should mirror the population. If the population is 50% female, the sample should be ~50% female; if 20% are Greek students, the sample should reflect ~20% Greek students.
- Sampling from a limited universe (e.g., only students in a class, or people in a residence hall) can be random within that small universe but not representative of the broader population (e.g., all U.S. adults).
- Pseudo-random samples from a small or biased frame (e.g., sampling only people at a quad or a lobby) can yield misleading conclusions about the broader group.
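A minimal sketch of the points above, using a made-up population (the 50% female and 20% Greek shares come from the example; the population size, sample size, and the "Greek-only" frame are assumptions for illustration). It draws one random sample from the whole population and one random sample from a limited universe, then compares the shares.

```python
import random

def share(sample, key):
    """Fraction of people in `sample` for whom `key` is True."""
    return sum(person[key] for person in sample) / len(sample)

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: about 50% female and about 20% Greek, assigned independently.
population = [
    {"female": rng.random() < 0.50, "greek": rng.random() < 0.20}
    for _ in range(100_000)
]

# Simple random sample drawn from the whole population.
random_sample = rng.sample(population, 2_000)

# Limited universe: only Greek students can be reached (an assumed, deliberately narrow frame).
greek_only_frame = [p for p in population if p["greek"]]
frame_sample = rng.sample(greek_only_frame, 2_000)  # random, but only within the frame

for label, group in [("population", population),
                     ("random sample", random_sample),
                     ("Greek-only frame sample", frame_sample)]:
    print(f"{label:>24}: female={share(group, 'female'):.2f}  greek={share(group, 'greek'):.2f}")
```

The frame sample can look fine on one trait (share female) while being badly off on another (share Greek), which is why the sampling frame matters as much as the random draw.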
Randomness vs size quadrant analogy:
- Imagine a 2x2 grid with axes: randomness (complete randomness vs no randomness) and size (large vs small).
- Top-right quadrant: large and completely random—best chance for statistically significant findings (though not a guarantee).
- Top-left: large but not random—may be large but biased.
- Bottom-right: small but random—limited power and precision.
- Bottom-left: small and not random—lowest reliability.
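The four quadrants can be explored with a rough simulation. Everything here is hypothetical: the 40% true share, the sample sizes (2,000 vs 30), and the form of the bias (opinion holders are three times as likely to be reached) are assumptions chosen only to make the contrast visible.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population of 20,000 people; about 40% hold some opinion.
population = [rng.random() < 0.40 for _ in range(20_000)]
true_share = sum(population) / len(population)

# Assumed bias for illustration: opinion holders are 3x as likely to be reached.
bias_weights = [3 if holds else 1 for holds in population]

def random_estimate(n):
    """Share estimated from a simple random sample of size n."""
    return sum(rng.sample(population, n)) / n

def biased_estimate(n):
    """Share estimated from a non-random draw (with replacement) that over-reaches opinion holders."""
    return sum(rng.choices(population, weights=bias_weights, k=n)) / n

def mean_abs_error(estimator, n, trials=50):
    """Average distance between the estimate and the true share over repeated surveys."""
    return sum(abs(estimator(n) - true_share) for _ in range(trials)) / trials

errors = {
    "large_random": mean_abs_error(random_estimate, 2_000),
    "large_biased": mean_abs_error(biased_estimate, 2_000),
    "small_random": mean_abs_error(random_estimate, 30),
    "small_biased": mean_abs_error(biased_estimate, 30),
}

for quadrant, err in errors.items():
    print(f"{quadrant:>12}: mean absolute error = {err:.3f}")
```

Under these assumptions, increasing the size of the biased sample does not shrink its error, because the bias is built into who gets reached. Size only helps once the draw is random.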
Statistical significance:
- Affected by both sample size and randomness. Larger, random samples have the best chance to yield statistically significant and generalizable results.
- Significance does not guarantee truth about the entire population, but it improves confidence in inferences.
Practical illustration: hypothetical national drinking habits survey
- If you surveyed the entire adult U.S. population (~250,000,000 people) to quantify drinking habits, the cost would be enormous.
- Example figure: a national sample through Qualtrics could cost around $5 per completion; at 250,000,000 responses that is roughly $1.25 billion, which is impractical and illustrates why researchers use smaller, carefully designed samples.
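The cost comparison is simple arithmetic. The sketch below uses the $5-per-completion figure and the sample sizes mentioned in these notes (a full census of ~250,000,000 adults, a 25,000-person sample, and a 2,000-person sample); the flat per-completion price is an assumption, since real pricing varies.

```python
COST_PER_COMPLETION = 5        # dollars, example figure from the notes
ADULT_POPULATION = 250_000_000  # approximate figure from the notes

def survey_cost(completions, cost_per_completion=COST_PER_COMPLETION):
    """Total cost in dollars, assuming a flat price per completed response."""
    return completions * cost_per_completion

scenarios = {
    "full census": ADULT_POPULATION,
    "25,000-person sample": 25_000,
    "2,000-person sample": 2_000,
}

for label, n in scenarios.items():
    print(f"{label:>22}: {n:>11,} completions -> ${survey_cost(n):,}")
```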
Key questions to assess data quality:
- What is the sample size?
- What geography is covered (national, state, city)?
- How was the survey conducted (online, phone, in person)?
- Was the sampling method truly random? If not, what biases might exist?
- Who funded or conducted the study (vested interests) and could that influence results?
- What exactly was asked (survey questions can shape answers)?
Practical takeaway: small, non-random samples can look like random samples but are not generalizable. Conversely, large random samples are most informative for generalizing to a population, but only if the methodology is sound.
Example case: a chart can show something different from what the underlying data represents (Apple pie ranks at the top in one chart, while Apple Crumb pie appears in another; one dataset excludes Alaska because its sample was too small). This highlights the importance of understanding what the data truly represents and where its limitations lie.

Understanding Secondary Data and its Methodology

Definition of secondary data: data collected by someone else for a purpose other than your current project, but reused for your analysis.
Sources of secondary data can be internal (within your organization) or external (third-party providers, public sources).
Why methodology matters:
- Reputable sources disclose how data was collected, adjusted, and weighted to be representative and random where possible.
- Look for information on sample size, sampling frame, response rates, weighting methods, margins of error, and confidence intervals.
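To make "weighting" concrete, here is a minimal sketch of one common approach, post-stratification on a single variable. It is an illustration only and not the method of any particular source; the group counts, population shares, and "yes" rates are invented numbers.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight for each group = population share / sample share."""
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
    }

# Hypothetical survey: women are over-represented relative to a 50/50 population.
sample_counts = {"female": 700, "male": 300}
population_shares = {"female": 0.50, "male": 0.50}
yes_rate = {"female": 0.60, "male": 0.40}  # invented share answering "yes" in each group

weights = poststratification_weights(sample_counts, population_shares)

total_n = sum(sample_counts.values())
unweighted = sum(sample_counts[g] * yes_rate[g] for g in sample_counts) / total_n

weighted_n = sum(sample_counts[g] * weights[g] for g in sample_counts)
weighted = sum(sample_counts[g] * weights[g] * yes_rate[g] for g in sample_counts) / weighted_n

print(f"unweighted 'yes' share: {unweighted:.2f}")
print(f"weighted 'yes' share:   {weighted:.2f}")
```

Weighting corrects the mix of groups in the sample, but it cannot fix a sample whose members differ from the population in ways the weights do not capture.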
The role of secondary data beyond numbers:
- Not all data are numerical; qualitative observations, trends, and reports can also be considered secondary data.
- Example: TikTok trend observations during 2020 (hashtags like #renegade, #onlineclasses, #learningthedog, #celebratedoctors) represent secondary data in a broad sense (observations published or compiled by others).
- When using non-numeric secondary data, scrutinize sources for reliability, context, and potential biases.

Library and Institutional Secondary Data Resources

University library secondary data resources are largely online and centralized for convenience.
APR Research Portal (aprresearchportal.com) provides consolidated access to about 37 links/resources related to advertising and PR, data sources, and market intelligence.
Notable groups and links within the APR portal:
- Advertising and PR information resources:
  - Winmo: agency information, client lists, offices, HR contacts, departments; useful for job hunting.
  - Adweek: public relations strategy and tactics.
  - Ad Spender: ad spend data for brands and categories.
  - Note: This section is most relevant to advertising/PR professionals; other fields may skip details.
- News information resources:
  - Internet TV News, ProQuest Newspaper, Nexis Uni, America’s News, Ethnic News Watch: search newspapers and news sources.
- Market data and analytics: Mintel, MRI-Simmons, Simply Analytics, eMarketer, Tapestry, US Census, Statista.
- Public opinion and polling: IColl, Roper polls (used for attitudes on religion, politics, and other topics).
- Nonprofits and philanthropy: Foundation Center stats/guides for nonprofit data and organization search.
Practical tips:
- You can access many of these resources electronically; you don’t always need to visit the physical library.
- If a source seems limited or biased, cross-check with other sources to validate findings.
- The Internet remains a valuable tool for initial exploration, followed by deeper dives through the library portal as needed.
Real-world use: combine multiple secondary sources to triangulate a topic (e.g., pie preferences, consumer trends, or attitudes toward brands) before designing a primary study.

Pie Case Study: What Secondary Data Can Reveal—and What It Cannot

Demonstration of pie popularity across sources:
- A dataset (e.g., Ethnic News Watch) shows Apple as a frequent top choice in reports about pies.
- A separate source (EatThisNotThat) ranks pies by weight-loss potential, not by popularity; it may list Peanut Butter Pie as the worst for weight loss, and a careless reader could mistake that ranking for a popularity ranking.
Key insights:
- Different datasets answer different questions (popularity vs weight-loss impact).
- Inconsistent reporting or labeling (e.g., Apple Pie not appearing in one chart but appearing in another) signals data quality issues or misalignment of taxonomy across sources.
- Local context or sample limitations (e.g., Alaska sample too small) can undermine confidence in results for certain geographies.
Lessons for researchers:
- Always verify what question the data is answering and whether the sampling frame and geography match your research needs.
- Be cautious of sensational headlines or chart designs that oversimplify, mislead, or omit crucial caveats.
- Acknowledge limitations of secondary data and avoid overgeneralizing beyond the data’s scope.

Practical Takeaways and Key Messages

Four main takeaways from today:
- Understand the data type: primary vs secondary, and the role of methodology in evaluating data quality.
- Assess the sampling design: sample size, randomness, geography, and framing; these determine the data’s representativeness and reliability.
- Evaluate sources critically: question the data’s origin, purpose, and potential biases; examine margins of error and confidence intervals when available.
- Recognize the limitations of secondary data: it may not answer why something happens, and sometimes primary data is needed to uncover underlying causes.
Reading and assignments: engage with readings and Blackboard materials; focus on survey concepts and evaluating data sources.
Practical exercise: a quick field test – ask three people about a question (e.g., which destination wins) to illustrate whether a tiny sample can reflect broader public opinion.

Quick Practice: Three-Person Survey Exercise

Task: Ask three people you encounter which destination they prefer (e.g., the beach or the mountains) and why.
Critical question: Do you think results from just three people reflect the view of the overall public?
- Use this to discuss limitations of small samples and the importance of broader sampling for generalizable conclusions.
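One way to put a number on the problem: suppose, purely as an assumption, that 60% of the public prefers one destination. The sketch below computes exactly how often a random sample would still show the other destination winning. The 60% figure and the comparison size of 301 are hypothetical.

```python
from math import comb

def prob_majority_misses(n, p):
    """Probability that the option preferred by share p of the population
    fails to win a majority in a random sample of odd size n (binomial model)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(0, n // 2 + 1)  # k supporters is not a majority
    )

p_public = 0.60  # hypothetical true share preferring the more popular destination

for n in (3, 301):
    print(f"n={n:>3}: chance the sample picks the wrong winner = {prob_majority_misses(n, p_public):.4f}")
```

With three respondents the sample points to the wrong winner about 35% of the time under this assumption, even though the draw is random. With a few hundred respondents that chance becomes very small.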

Formative Connections: Why This Matters in Research Practice

Ethical and philosophical considerations:
- Respect for data integrity: avoid misrepresenting data or cherry-picking results to fit a narrative.
- Skepticism in the face of vested interests: sources with a stake in outcomes require scrutiny of methodology and potential bias.
- Humility in research: always be prepared to discover information that contradicts your initial assumptions.
Foundational principles linked to today’s topics:
- Representativeness and generalizability: the alignment between sample and population.
- Sampling theory: the relationship between sample size, randomness, bias, and the reliability of inferences.
- The role of secondary data: a starting point for understanding a topic, guiding design, and identifying gaps for targeted primary data collection.

Recap: Key Concepts and Formulas to Remember

Primary vs Secondary data definitions and purposes.
Representativeness and randomness: what makes a sample reflect the population?
Statistical significance vs practical significance.
Margin of error for a proportion (large population):
- $MOE = z_{\alpha/2} \sqrt{\frac{p(1-p)}{n}}$
- Where: $n$ = sample size, $p$ = estimated proportion, $z_{\alpha/2}$ = z-score for the desired confidence level.
- If $p$ is unknown, use the conservative estimate $p = 0.5$, which gives the largest possible $MOE$ for a given $n$.
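A short worked example of the formula, applied to the sample sizes mentioned in these notes (25,000, 2,000, and the three-person exercise). It assumes a simple random sample from a large population, the conservative $p = 0.5$, and $z = 1.96$ for 95% confidence. For $n = 3$ the normal approximation behind the formula is not reliable, so that figure is only a rough indication of how little three responses can tell you.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Margin of error for a proportion: z * sqrt(p * (1 - p) / n)."""
    return z * sqrt(p * (1 - p) / n)

for n in (25_000, 2_000, 3):
    moe = margin_of_error(n)
    print(f"n={n:>6,}: margin of error = +/- {moe * 100:.1f} percentage points")
```

At these settings a 2,000-person sample gives roughly plus or minus 2.2 points and a 25,000-person sample roughly plus or minus 0.6 points, so the extra 23,000 completions mainly buy precision for subgroups and small differences.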
Real-world constraints: cost, feasibility, and trade-offs in designing surveys (e.g., $5 per completion, large-scale nationwide sampling costs).
Critical evaluation questions for data sources:
- What is the sampling frame and geography?
- How was randomness achieved and was there any nonresponse bias?
- What is the sample size and margin of error? Is it reported?
- Who funded the study, and could there be vested interests?
- What is the exact question wording and potential measurement bias?
Practical nuance: secondary data can be misinterpreted if the question it answers differs from your research question; always align data source with your research objective.

Next Steps and Preparedness

Review the APR Research Portal and explore at least three sources relevant to a potential project.
Practice evaluating a secondary data source by identifying its population, sampling method, and stated limitations.
Prepare a short reflection on how you would design a primary study to answer a why-question that secondary data cannot, using insights from today’s lecture.