Lecture 1 Notes: Data, Excel, and Statistical Concepts

Two-Dimensional Data Frameworks and Excel

  • Data are commonly presented in two-dimensional frameworks with variables across columns and observations down rows (as in Excel workbooks).

  • Visual intuition: think of a workbook as a file cabinet where each object has a specific address; you access data by its location in the folder/file structure.

File Access, Addresses, and Copying Data

  • To use a dataset in other tools (R, Python, SAS, SQL), you must provide its address/path so the program can load it.

  • A “full address” (file path) can be obtained with the environment's copy-path command (e.g., right-click → “Copy as path” in Windows File Explorer); this path is what tells other programs where to locate the data.

  • A concrete example: a lecture data workbook used in this course is referenced on lecture slides and a backup page.

  • The speaker used a street-address metaphor to illustrate how data are located within a filesystem (e.g., 707 East 30 First Street … potato peeler).
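The address metaphor above maps directly onto how a program locates a file. A minimal sketch in Python, using a hypothetical path (the folder names are illustrative, not from the course):

```python
from pathlib import Path

# A hypothetical full address (path) to a data file; the path is
# what tells a program like R or Python where the data live
data_path = Path.home() / "Documents" / "April" / "lecture_data.csv"
print(data_path)

# Checking the address before loading avoids "file not found" errors
print(data_path.exists())
```

The same path string would be handed to `read.csv()` in R or `pandas.read_csv()` in Python to actually load the data.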

Data Storage, Cloud vs Local, and Default Locations

  • Office 365 can be accessed via virtual labs; students are encouraged to install it on their own machines.

  • File structures: examples include OneDrive, Berkeley Research Group, and personal drives; these often exist in a cloud storage setup by default.

  • Windows/macOS often set cloud storage as the default location for saving files (e.g., OneDrive).

  • Pros of cloud storage: automatic backups and remote access; cons: files may be unavailable without an Internet connection.

  • Practical advice: create a local folder structure on your machine (e.g., Documents/April/Homeworks, Documents/April/TermProject) to know where to find materials.

  • In Excel, you can change the default save location to your local documents folder via File → Options → Save (to override the system cloud default).

  • For downloads, browsers typically save to a Downloads folder by default; you can set the browser to prompt for a save location to avoid extra steps.

  • Security note: when downloading class materials, you may see security warnings; ensure files are virus-scanned before opening.

  • The Mac file structure closely mirrors the PC structure; the old DOS-style file-navigation concepts persist underneath modern graphical (point-and-click) interfaces.

ZIP Files and Data Extraction

  • ZIP files compress folders into a single archive; you can open the ZIP and see the files inside.

  • Best practice: extract (copy out) the files from a ZIP folder into a new folder before using them in programs, since some applications can’t read data directly from ZIPs.
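The extract-before-use practice above can be sketched in Python with the standard `zipfile` module. The file and folder names here are hypothetical; the block builds its own small archive so it is self-contained:

```python
import zipfile
from pathlib import Path

# Create a small CSV so the example is self-contained (hypothetical data)
Path("example.csv").write_text("state,value\nTX,5.5\nAR,2.5\n")

# Bundle the file into an archive, then extract it back out --
# the safe pattern before handing data to other programs
with zipfile.ZipFile("lecture_data.zip", "w") as zf:
    zf.write("example.csv")

with zipfile.ZipFile("lecture_data.zip") as zf:
    print(zf.namelist())            # ['example.csv']
    zf.extractall("lecture_data")   # extract into a new folder first
```

`extractall` copies every file out of the archive, after which any program can read them from the new folder by ordinary path.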

Data Series Codes and Geographic Identifiers

  • LAUS code: a series identifier from the Bureau of Labor Statistics' Local Area Unemployment Statistics program that uniquely identifies each data series the program publishes.

  • FIPS codes: geographic identifiers assigned by the U.S. Census to uniquely identify areas (state and county codes).

    • Example: Alabama's state FIPS code is 01; Mobile County, Alabama has county code 097 (combined GEOID 01097). State codes are two digits and county codes are three.

  • These codes are categorical, not purely numeric in meaning; the numbers function as labels for categories.
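Because the codes are labels rather than quantities, they are best stored as zero-padded strings. A minimal sketch, using Alabama (state 01) and Mobile County (county 097) as the example:

```python
# FIPS codes are labels, not quantities: store them as zero-padded strings.
state_fips = 1            # Alabama's state code, often read in as an integer
county_fips = 97          # Mobile County's three-digit county code

# Zero-pad to the standard widths: 2 digits for state, 3 for county
geoid = f"{state_fips:02d}{county_fips:03d}"
print(geoid)              # "01097"

# Treat codes as categories: compare them, never add or average them
assert geoid != "1097"    # dropping the leading zero changes the label
```

Letting a spreadsheet or parser coerce these codes to numbers strips the leading zeros and silently changes the identifier.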

Quick Excel Navigation and the Fill/Drag Method

  • Excel shortcut basics discussed include Ctrl+C (copy) and Ctrl+V (paste); Mac equivalents include Command+C, Command+V.

  • Ctrl+Arrow keys jump to the edge of the current data region: Ctrl+Right Arrow to the last contiguous filled cell to the right, Ctrl+Left Arrow to the left (and Ctrl+Down/Up within a column); these navigation shortcuts make moving around large sheets fast.

  • The lecture demonstrated a data block with ~3,200 rows; selecting large ranges efficiently matters for calculations.

  • The fill handle (the small green square in a selected cell's bottom-right corner) can fill a formula down automatically: double-click it and Excel fills to the bottom of the adjacent data, which is faster than dragging by hand.

  • Caution: blank rows interrupt the fill. The double-click fill stops at a gap in the adjacent column, and a formula propagated across gaps can yield incomplete or misleading results; check for blanks first and press Ctrl+Z to undo if needed.
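The blank-row caution carries over to any tool. A minimal sketch in plain Python, where `None` stands in for a blank worksheet cell (the values are hypothetical):

```python
from statistics import mean

# A column with a gap; None models a blank Excel cell
values = [10.0, 12.0, None, 11.0, 13.0]

# Check for gaps before computing, mirroring the caution above
blanks = [i for i, v in enumerate(values) if v is None]
print("blank rows at:", blanks)     # blank rows at: [2]

# Compute only on the non-blank entries
clean = [v for v in values if v is not None]
print(mean(clean))                  # 11.5
```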

Filtering Data and Subsetting

  • Data → Filter creates drop-downs next to each variable for quick subsetting.

  • Filtering in Excel hides rows rather than removing them; worksheet functions such as AVERAGE and COUNT applied to the original range still include the hidden rows.

  • As a result, a count or average computed over the full range can silently disagree with what the filtered view shows; SUBTOTAL (e.g., SUBTOTAL(1, range) for the average) ignores rows hidden by a filter, but otherwise the behavior is misleading unless you work on the subset directly.

  • Best practice: after filtering, copy the visible data and paste into a new worksheet to perform calculations on the subset only.

  • Filters offer a variety of options: less than/greater than/between, contains, and text-specific filters; this is useful for exploratory data analysis and subsetting.
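The copy-the-subset-first practice above has a direct analogue in code: compute on the filtered rows only. A minimal sketch with hypothetical toy data:

```python
from statistics import mean

# Toy rows standing in for a worksheet (hypothetical values)
rows = [
    {"state": "TX", "rate": 4.0},
    {"state": "TX", "rate": 5.0},
    {"state": "AR", "rate": 3.0},
]

# Average the subset only -- the analogue of copying the visible
# (filtered) rows to a new worksheet before computing
texas = [r["rate"] for r in rows if r["state"] == "TX"]
print(mean(texas))   # 4.5, not the all-rows average of 4.0
```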

Random Variables, Populations, and Samples

  • Random variables take on different values with certain probabilities; there is inherent variation in data.

  • Population: the set of all possible outcomes for a variable (e.g., all currently registered Texas Aggies).

  • Sample: a subset of observations taken from the population (the group you study).

  • Random sample: a subset where each member of the population has an equal chance of being selected; every possible sample of a given size is equally likely.

  • The speaker’s student example is used humorously to illustrate how sampling can be non-random in real life (e.g., a group in front of a building may not represent the population).

  • Why sampling matters: studying an entire population can be expensive; good random samples can provide strong inferences about population characteristics (size, location, associations).

  • Historical cautionary tale: in the 1948 presidential election (Dewey vs. Truman), pre-election polls based on non-random quota sampling were so far off that they produced the misleading "Dewey Defeats Truman" headline.
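The random-sample definition above (equal selection chance, no replacement) can be illustrated with Python's standard library; the population of student IDs is hypothetical:

```python
import random

random.seed(1)                      # fixed seed so the example is reproducible

# A toy population of 100 student IDs (hypothetical)
population = list(range(1, 101))

# random.sample draws without replacement: each member has an
# equal chance of selection, matching the definition above
sample = random.sample(population, k=10)
print(sorted(sample))
```

Contrast this with grabbing "a group in front of a building": that is a convenience sample, and nothing guarantees equal selection probabilities.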

Data Types and Measurement Scales

  • Categorical variables: observations placed into different buckets (e.g., gender, major) without inherent order.

  • Ordinal data: categories with a meaningful order but not necessarily equal intervals (e.g., strongly agree to strongly disagree in course evaluations).

  • Interval data: ranking and differences are meaningful, but there is no natural zero point (e.g., Fahrenheit/Celsius temperatures). Zero is arbitrary in these scales.

  • Absolute/ratio scales: have a natural zero, allowing meaningful ratios (e.g., GDP, unemployment counts); most economic variables fall into this category.

  • Absolute interpretation example: if Texas GDP growth is 5.5% and Arkansas is 2.5%, Texas grew more than twice as fast as Arkansas; this is a ratio-scale inference.

  • Temperature example: 0 K is an absolute zero; Fahrenheit/Celsius have arbitrary zeros, limiting certain comparisons across scales.

  • Practical implication: report comparable data on the same scale (e.g., percentages) when comparing groups with different bases, to make fair comparisons.
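The scale distinctions above determine which comparisons are meaningful. A short sketch using the GDP-growth figures from the example and a temperature pair chosen for illustration:

```python
# Ratio-scale data (natural zero) support "times as much" claims
texas, arkansas = 5.5, 2.5          # growth rates from the example above
print(texas / arkansas)             # 2.2 -- "more than twice as fast"

# Interval-scale data (arbitrary zero) do not: 20 C is not
# "twice as hot" as 10 C; the same pair in Fahrenheit gives a
# different ratio, so the ratio itself is not meaningful
c_low, c_high = 10.0, 20.0
f_low, f_high = c_low * 9 / 5 + 32, c_high * 9 / 5 + 32
print(c_high / c_low, f_high / f_low)   # 2.0 1.36
```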

Visualizing Data and Storytelling

  • Visual presentation often communicates information more effectively than raw numbers.

  • Percentages can be more informative than counts when category bases differ (e.g., percentage of males vs females who earned a grade B).

  • Pie charts can be useful for market share in certain contexts, though they are sometimes criticized; they can be effective when the goal is to show composition.

  • Always consider the audience and the base used when choosing visualization types.

Cross-Sectional vs Time Series Data

  • Cross-sectional data: observations at a single point in time with no natural ordering; fits well with random sampling concepts.

  • Time-series data: observations ordered in time; naturally stacked by quarter, month, year, etc.

  • Time-series characteristics: natural ordering enables growth rates and trend analysis but introduces challenges like autocorrelation (current values correlated with past values).

  • Common example visuals: time-series graphs showing monthly or quarterly data (e.g., crude oil prices over 2001–2022).

  • Important caveats in time-series visualization:

    • Different units across series can obscure relationships (e.g., dollars per barrel vs dollars per gallon for crude oil and gasoline).

    • Line-of-best-fit (regression line) can illustrate the general relationship but does not capture all variability.

    • Heteroskedasticity: the variance of residuals around the line of best fit is not constant across the data range; in this example the spread grows as the level of the independent variable rises.

    • In the crude oil vs gasoline example, as crude oil prices rise, gasoline prices tend to rise too, but with non-constant variance (more noise at higher price levels).

  • Practical takeaway: always visualize data to detect issues (patterns, outliers, nonlinearity) before formal modeling.

  • Case illustration: a constructed set of four x-y examples can produce the same line of best fit yet have very different underlying patterns (e.g., quadratic, linear with a few outliers, a single outlier, or all observations at one value with a single outlier). This underscores that a single summary (line of best fit or correlation) may mask important data structure.
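The constructed four-dataset example described above matches the classic Anscombe's quartet. A minimal check (using the published values of sets I and II, where set II is the quadratic pattern) shows that two very differently shaped datasets share nearly identical means and regression slopes:

```python
from statistics import mean

# Anscombe's quartet, sets I and II (published values)
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def ols_slope(xs, ys):
    """Least-squares slope: cov(x, y) / var(x)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

# Nearly identical summaries, very different shapes (y2 is quadratic)
print(round(mean(y1), 2), round(mean(y2), 2))          # 7.5 7.5
print(round(ols_slope(x, y1), 3), round(ols_slope(x, y2), 3))   # both ~0.5
```

Plotting the two datasets immediately reveals the difference the summaries hide, which is the section's takeaway: visualize before modeling.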

Interpreting Data and Guidelines for Analysis

  • Do not rely solely on summary statistics; always visualize data to understand structure, outliers, and potential issues.

  • When using subgroups or subsets, consider whether the chosen method treats the subset correctly (e.g., do not average across all data when you only want the subset you filtered).

  • Always verify that your data are representative and that sampling methods do not introduce bias.

Class Logistics and Upcoming Schedule

  • Recitation Friday; no class on Monday (Labor Day weekend).

  • The class will resume next Wednesday.

  • The lecture referenced a specific class website with updated links and security features; students were reminded to ensure access via HTTPS and to be aware of potential security warnings when downloading materials.

Miscellaneous Practical Notes from the Lecture

  • GDP, personal income, international trade, and debt data are commonly reported in thousands, millions, etc.; always check the scaling and mentally add the appropriate zeros to interpret the figures correctly.

  • When reading macroeconomic data expressed in thousands, a figure like 12345 actually represents 12,345,000 if you’re thinking in raw dollars.

  • The context emphasizes clear communication of data scale to avoid misinterpretation.

  • The lecture stresses that data storage, file organization, and version control are practical foundations for reproducible data analysis (e.g., organized folders by month/assignment, consistent naming conventions).

  • The overarching goal is to enable students to tell a data-informed story effectively by combining data organization, appropriate measurement scales, proper sampling concepts, and clear visualization.
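The scaling note above is easy to operationalize: multiply the reported figure by its unit before interpreting or comparing it. A one-step sketch:

```python
# A figure reported "in thousands of dollars" (from the scaling note above)
reported = 12345                 # thousands
raw = reported * 1_000           # convert to raw dollars
print(f"${raw:,}")               # $12,345,000
```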

  • A population parameter is often unknown; a sample statistic is computed from the data. For example, the sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
