SAS Studio: EDA & Univariate Analysis Quick Reference

Exploratory Data Analysis (EDA)

Raw data lack context. Exploratory Data Analysis (EDA) converts raw data into information with context, telling the story of your research question. EDA involves organizing and summarizing data, identifying important features and patterns, spotting striking deviations, and interpreting findings within the context of the problem. It starts with univariate analysis (examining one variable at a time), which means summarizing and examining the distribution of each variable of interest. The distribution answers two questions: what values the variable takes, and how often each value occurs. For a categorical variable, this means listing the categories with their counts and percentages. The percentage for category $i$ is $p_i = \frac{n_i}{N} \times 100$, where $n_i$ is the count in category $i$ and $N$ is the total number of observations. For example, in a body image study, the distribution might report $71.3\%$ in the “about right” category, $9.2\%$ as underweight, and $19.6\%$ as overweight.
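
As a worked illustration of the percentage formula, assume a purely hypothetical sample of $N = 400$ respondents, of whom $n_i = 92$ fall in category $i$. These counts are invented for the arithmetic and do not come from the body image study.

$$p_i = \frac{n_i}{N} \times 100 = \frac{92}{400} \times 100 = 23.0$$

Under these assumed counts, category $i$ would account for $23.0\%$ of the observations.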

Getting Started with SAS Studio

To work with large datasets, you’ll use SAS Studio via SAS OnDemand for Academics. Registration steps are on the SAS page; after replying to the registration email, enroll in the course. Bookmark the SAS login page for easy access, and keep two browser windows side by side, with SAS Studio in one and the course page in the other. Give your program a descriptive name, save it frequently, and run it to execute the code. When you log in, you’ll see the SAS Dashboard and the SAS Studio interface, which comprises Code, Log, and Results tabs.

SAS Syntax Essentials

Every SAS statement ends with a semicolon, and SAS is not case sensitive. A program usually starts with a LIBNAME statement, which tells SAS where to find the data and assigns a library name (e.g., MyData). The path to the data goes in quotes, and options such as ACCESS=READONLY prevent the data from being modified. After LIBNAME comes a DATA step, which creates a temporary dataset, with a SET statement naming the data file to read (e.g., NESARC_PDS). The DATA step ends where the next step begins, typically a PROC step such as PROC SORT, which sorts the data by a unique identifier (e.g., IDNUM).
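
The statements described above might fit together as in the following minimal sketch. The library path is a placeholder (the real path comes from your course), and the data set name new is an illustrative choice rather than a required name.

```sas
/* Assign a library name to the folder that holds the data.
   The path below is a hypothetical placeholder. */
LIBNAME mydata "/path/to/course/data" ACCESS=READONLY;

/* DATA step: create a temporary data set called new (assumed name)
   from the NESARC_PDS file stored in the mydata library. */
DATA new;
    SET mydata.NESARC_PDS;
RUN;

/* Sort the temporary data set by its unique identifier. */
PROC SORT DATA=new;
    BY IDNUM;
RUN;
```

Because the library is opened read-only, any changes happen in the temporary copy and the original file stays untouched.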

Working with PROC STEPS and FREQ

PROC steps perform analyses and produce output; PROC FREQ generates frequency distributions. The basic form is: PROC FREQ; TABLES var1 var2; RUN; The TABLES statement lists the variables you want to examine. After running, you’ll see one frequency table per variable, with counts, percentages, and cumulative frequencies and percentages. You can add labels to variables so results are easier to interpret, for example, giving a variable a more readable description via a LABEL statement. Save the program before running it, then check the Log and Results tabs to verify that it executed and produced the expected output.
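
A minimal, self-contained sketch of PROC FREQ follows. The data set demo, its variable names, and its four rows are invented so that the step can run without the course data.

```sas
/* Toy data, invented for illustration only. */
DATA demo;
    INPUT idnum smoking12m tobacco_dep;
    DATALINES;
1 1 1
2 1 0
3 0 0
4 1 1
;
RUN;

/* One frequency table is produced per variable listed in TABLES. */
PROC FREQ DATA=demo;
    TABLES smoking12m tobacco_dep;
RUN;
```

Naming the data set with DATA=demo is optional. Without it, PROC FREQ uses the most recently created data set, which is why the short form shown in the text also works.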

Labeling and Interpreting Results

SAS supports labeling variables to improve readability in output. Use a LABEL statement such as: LABEL tobacco_dep = "Tobacco Dependence (past 12 months)"; This makes frequency tables easier to read. For binary variables such as nicotine dependence, a value of 1 often represents "Yes" and 0 represents "No". To interpret the results, consult the codebook for what each value means, then summarize the percentage in each category (e.g., the proportion of smokers with nicotine dependence).
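
As a small sketch of where the LABEL statement can go, the example below attaches the label inside a DATA step and then tabulates the variable. The data set demo and its rows are invented for illustration.

```sas
/* Toy data with a label attached to tobacco_dep. */
DATA demo;
    INPUT idnum tobacco_dep;
    LABEL tobacco_dep = "Tobacco Dependence (past 12 months)";
    DATALINES;
1 1
2 0
3 0
4 1
;
RUN;

/* The label is shown with the frequency table for tobacco_dep. */
PROC FREQ DATA=demo;
    TABLES tobacco_dep;
RUN;
```

The label changes only how the variable is described in output; the variable name used in code stays tobacco_dep.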

Subsetting Observations (Logic Statements)

You can restrict analyses to a subset of observations by placing logic statements in the DATA step, before PROC SORT. The common comparison operators are = (equal), ~= (not equal), < (less than), > (greater than), <= (less than or equal to), and >= (greater than or equal to). For example, to analyze only individuals who smoked in the past 12 months and are age 25 or younger, add IF smoking12m = 1; IF age <= 25; before the end of the DATA step. You can also include age in the TABLES statement to check the age distribution of the subset. When you run the program with these statements, the results cover only the 1706 individuals aged 25 or younger who reported smoking in the past twelve months.
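
The subsetting logic can be tried on a small scale with the sketch below. The data sets demo and young_smokers and all five rows are invented for illustration; only the two IF statements come from the text.

```sas
/* Toy data, invented for illustration only. */
DATA demo;
    INPUT idnum smoking12m age tobacco_dep;
    DATALINES;
1 1 22 1
2 1 31 0
3 0 24 0
4 1 25 1
5 1 19 0
;
RUN;

/* Keep only rows that satisfy both conditions. */
DATA young_smokers;
    SET demo;
    IF smoking12m = 1;   /* smoked in the past 12 months */
    IF age <= 25;        /* age 25 or younger */
RUN;

PROC SORT DATA=young_smokers;
    BY idnum;
RUN;

/* Including age in TABLES shows the age distribution of the subset. */
PROC FREQ DATA=young_smokers;
    TABLES tobacco_dep age;
RUN;
```

With this toy data, three of the five rows (idnum 1, 4, and 5) pass both conditions. The two IF statements could also be combined into one, as in IF smoking12m = 1 AND age <= 25;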

Comments and Documentation

To remember why you wrote certain lines, use comments. SAS comments are written as /* ... */. Comments do not affect program execution and help you track decisions (e.g., "subset to past twelve-month smokers aged 18 to 25"). It’s good practice to add notes throughout your code and to save regularly.

Practical Workflow and Examples

A typical NESARC-like workflow includes: assigning the data library with LIBNAME, creating a DATA step to read and prepare the data, using a SET statement to specify the dataset, and applying a PROC step (e.g., PROC SORT) to organize the data by a unique identifier (e.g., IDNUM). The RUN statement executes the preceding statements. After a run, the log shows how many observations were read; for example, the NESARC dataset might yield 43093 observations. The Results tab then shows the frequency distributions for your variables, with both their internal names and their labels. For a sample variable like tobaccodependencepast_12m, the output displays the counts, percentages, and cumulative frequencies for each value, with 1 indicating Yes and 0 indicating No.
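
Put together, the workflow might look like the sketch below. The path is a hypothetical placeholder, and the variable names tobacco_dep, smoking12m, and age are the illustrative names used in this reference; the actual names in your dataset come from its codebook.

```sas
/* Hypothetical placeholder path; use the one supplied by your course. */
LIBNAME mydata "/path/to/course/data" ACCESS=READONLY;

DATA new;
    SET mydata.NESARC_PDS;

    /* Readable label for the output tables. */
    LABEL tobacco_dep = "Tobacco Dependence (past 12 months)";

    /* Subset to past twelve-month smokers aged 18 to 25. */
    IF smoking12m = 1;
    IF age <= 25;
RUN;

/* Organize the data by the unique identifier. */
PROC SORT DATA=new;
    BY IDNUM;
RUN;

/* Frequency distributions for the variables of interest. */
PROC FREQ DATA=new;
    TABLES tobacco_dep age;
RUN;
```

After running, compare the observation counts reported in the log with what you expect before reading the tables in Results.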

Quick Reference Points

Distributions help answer: what values and how often for each variable; use $p_i = \frac{n_i}{N} \times 100$ to compute percentages.
Use PROC FREQ and TABLES to generate distributions.
Use LABEL to improve output readability; end statements with semicolons.
Subset data with logical conditions in the DATA step (before PROC steps).
Document with /* ... */ comments and save frequently.
Check the Log after running for Errors, Warnings, or Notes, and refer to Results for the produced tables.

Key Identifiers and Common Values

Unique identifiers commonly appear as IDNUM (NESARC) or similar (e.g., ID, aid, country).
In outputs, binary responses often map to 1/0 (Yes/No).
Sample sizes shown in the log are important sanity checks (e.g., 43093 observations).