Organizing and Visualizing Variables - Study Notes

Organizing Categorical Data

Categorical data are organized primarily using tables and visual displays to summarize frequencies and patterns.
The DCOVA workflow (Define, Collect, Organize, Visualize, Analyze) is used throughout to guide the process of organizing and visualizing data.
Key concepts:
- Summary Table: tallies the frequencies or percentages of items in categories to show differences between categories.
- Contingency Table (two categorical variables): cross-tabulates responses to study patterns between two categorical variables; rows correspond to one variable and columns to the other.
Example: Summary table of devices used to watch movies/TV shows
- Television Set: 49% | Tablet: 9% | Smartphone: 10% | Laptop/Desktop: 32%
- Source: Sharma, Wall Street Journal, 2016.
Example: Contingency table for invoices by size and presence of errors (two categorical variables: Size and Errors)
- Table layout (frequencies):
- Small Amount: No Errors 170 | Errors 20 | Total 190
- Medium Amount: No Errors 100 | Errors 40 | Total 140
- Large Amount: No Errors 65 | Errors 5 | Total 70
- Totals: No Errors 335 | Errors 65 | Total 400
Percentages (three ways to view the contingency data):
- Based on overall total (denominator = 400):
- Small Amount: No Errors = \frac{170}{400}=0.425=42.50\%, Errors = \frac{20}{400}=0.05=5.00\%, Total = \frac{190}{400}=0.475=47.50\%
- Medium Amount: No Errors = \frac{100}{400}=0.25=25.00\%, Errors = \frac{40}{400}=0.10=10.00\%, Total = \frac{140}{400}=0.35=35.00\%
- Large Amount: No Errors = \frac{65}{400}=0.1625=16.25\%, Errors = \frac{5}{400}=0.0125=1.25\%, Total = \frac{70}{400}=0.175=17.50\%
- Based on row totals (denominator = row total):
- For Small Amount (190 total): No Errors = \frac{170}{190}=0.8947=89.47\%, Errors = \frac{20}{190}=0.1053=10.53\%
- For Medium Amount (140 total): No Errors = \frac{100}{140}=0.7143=71.43\%, Errors = \frac{40}{140}=0.2857=28.57\%
- For Large Amount (70 total): No Errors = \frac{65}{70}=0.9286=92.86\%, Errors = \frac{5}{70}=0.0714=7.14\%
- Based on column totals (denominator = the No Errors or the Errors column total):
- No Errors column total = 335; No Errors by size: Small = \frac{170}{335}=0.5075=50.75\%, Medium = \frac{100}{335}=0.2985=29.85\%, Large = \frac{65}{335}=0.1940=19.40\%
- Errors column total = 65; Errors by size: Small = \frac{20}{65}=0.3077=30.77\%, Medium = \frac{40}{65}=0.6154=61.54\%, Large = \frac{5}{65}=0.0769=7.69\%
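The three percentage views above can be reproduced with a short Python sketch. The counts come from the invoice table in these notes; the variable and function names (`counts`, `pct`, and so on) are illustrative choices, not part of the source material.

```python
# Invoice counts from the contingency table: size -> error status -> frequency.
counts = {
    "Small": {"No Errors": 170, "Errors": 20},
    "Medium": {"No Errors": 100, "Errors": 40},
    "Large": {"No Errors": 65, "Errors": 5},
}

# Marginal totals: one per row, one per column, and the grand total.
row_totals = {size: sum(row.values()) for size, row in counts.items()}
col_totals = {}
for row in counts.values():
    for status, n in row.items():
        col_totals[status] = col_totals.get(status, 0) + n
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Same cell counts, three different denominators.
overall_pct = {s: {c: pct(n, grand_total) for c, n in row.items()}
               for s, row in counts.items()}
row_pct = {s: {c: pct(n, row_totals[s]) for c, n in row.items()}
           for s, row in counts.items()}
col_pct = {s: {c: pct(n, col_totals[c]) for c, n in row.items()}
           for s, row in counts.items()}

for size in counts:
    print(size, overall_pct[size], row_pct[size], col_pct[size])
```

Only the denominator changes between the three dictionaries, which is why the same cell (for example Medium/Errors) appears as 10.00%, 28.57%, or 61.54% depending on the view.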
Practical implications:
- Percent of a row, a column, or the overall total can tell different stories; choose the interpretation that best supports the decision context.
Summary points:
- Tables enable precise organization and quick tabulation of categorical data.
- Contingency tables reveal relationships between two categorical variables and support multiple percentage viewpoints (overall, row, column).

Visualizing Categorical Data

Graphical displays help assess patterns quickly:
- For one categorical variable: Bar Chart, Pie or Doughnut Chart, Pareto Chart; each visualizes the counts or percentages of a summary table.
- For two categorical variables: Side-by-Side Bar Chart or Doughnut Chart; each visualizes a contingency table.
- The Pareto Chart additionally ranks the categories, showing the most important ones first along with their cumulative share.
Pareto chart characteristics:
- Vertical bar chart with categories in descending order of frequency.
- Includes a cumulative polygon to emphasize the "vital few" vs the "trivial many".
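As a rough sketch of the numbers behind a Pareto chart, the following Python orders categories by descending frequency and accumulates their percentage share. It reuses the device percentages from the summary-table example earlier in these notes purely for illustration; `pareto_table` is a hypothetical helper name.

```python
# Device shares from the summary-table example (percentages of viewers).
shares = {
    "Television Set": 49,
    "Tablet": 9,
    "Smartphone": 10,
    "Laptop/Desktop": 32,
}


def pareto_table(freqs):
    """Return (category, frequency, cumulative %) rows in descending order."""
    total = sum(freqs.values())
    ordered = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, freq in ordered:
        running += freq
        rows.append((category, freq, round(100 * running / total, 2)))
    return rows


pareto_rows = pareto_table(shares)
for category, freq, cum_pct in pareto_rows:
    print(f"{category:15s} {freq:3d} {cum_pct:6.2f}%")
```

The bars of the chart would use the second column and the cumulative polygon the third; here the first two categories already account for 81% of the total.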
A side-by-side bar chart uses a contingency table to compare categories, for example No Errors vs Errors across invoice sizes (Small, Medium, Large).
A doughnut chart is a circular chart, like a pie chart but with a central hole; it can be used to display categorical data from a contingency table.
Practical notes:
- Bar charts and pie/doughnut charts should have clear labels, consistent scales, and emphasis on actual percentages or frequencies.
- Pareto charts help prioritize issues by highlighting the most significant categories first.
Example details (two-variable view of the invoice data):
- In separate visuals, show the breakdown of invoices by Size and by Errors.
- When the two variables are combined, state clearly which category combinations dominate (e.g., medium-size invoices have the highest error rate, 28.57%).
Common pitfalls to avoid in categorical visuals:
- Using too many categories, which obscures patterns.
- Inconsistent or misleading scales, chartjunk, or starting axes away from zero unnecessarily.

Organizing Numerical Data

Numerical data can be organized into ordered arrays and various distribution formats:
- Ordered Array: data in rank order from smallest to largest; shows range and potential outliers.
- Frequency Distribution: data are grouped into class intervals; key decisions include the number of classes, class width, and class boundaries to avoid overlap.
- The number of classes typically ranges from 5 to 15.
Important formulas:
- Class width: w = \frac{\text{range}}{k} where range = max − min and k is the desired number of classes.
- Range: \text{range} = \max(x) - \min(x).
- Relative frequency distribution: RF_i = \frac{f_i}{N} where f_i is the frequency of class i and N is the total number of observations.
- Cumulative distribution: CF_i = \sum_{j=1}^{i} f_j
- Cumulative percentage (ogive context): \text{Cum\%}_i = \frac{CF_i}{N} \times 100\%
Example: Insulation temperature data across 20 winter days (illustrative data set of 20 values):
- Raw data (sorted): 10, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
- Range = 58 − 10 = 48; choose k = 5 classes; width w = \left\lceil\frac{48}{5}\right\rceil = \lceil 9.6 \rceil = 10 (rounded up so the classes cover the full range)
- Class intervals (examples): 10\le x < 20, 20\le x < 30, 30\le x < 40, 40\le x < 50, 50\le x < 60
- Class midpoints: 15, 25, 35, 45, 55
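The class-width arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes the first class starts at the smallest observation, which happens to be a round number here; in general the starting boundary is a judgment call.

```python
import math

# The 20 sorted temperature values from the example.
temps = [10, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

k = 5                                   # desired number of classes
data_range = max(temps) - min(temps)    # 58 - 10 = 48
width = math.ceil(data_range / k)       # 9.6 rounded up to 10

# Assumption: the first class begins at the minimum value.
lower = min(temps)
classes = [(lower + i * width, lower + (i + 1) * width) for i in range(k)]
midpoints = [(lo + hi) / 2 for lo, hi in classes]

print(width)      # class width
print(classes)    # (lower, upper) pairs, read as lower <= x < upper
print(midpoints)  # class midpoints
```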
Frequency distribution example (20 data):
- Class intervals and frequencies:
- 10 but less than 20: 3
- 20 but less than 30: 6
- 30 but less than 40: 5
- 40 but less than 50: 4
- 50 but less than 60: 2
- Total = 20
Relative and cumulative frequency example:
- Relative frequency: for each class, e.g., 3/20 = 0.15, 6/20 = 0.30, 5/20 = 0.25, 4/20 = 0.20, 2/20 = 0.10
- Cumulative frequency: 3, 9, 14, 18, 20
- Cumulative percentage: 15%, 45%, 70%, 90%, 100%
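The frequency, relative frequency, and cumulative columns above can be reproduced as follows. This is a sketch using the half-open class intervals (lower ≤ x < upper) from the example; the list names are illustrative.

```python
from itertools import accumulate

# The 20 sorted temperature values and the five class intervals.
temps = [10, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]

n = len(temps)

# Frequency f_i: how many values fall in lower <= x < upper.
freqs = [sum(lo <= x < hi for x in temps) for lo, hi in classes]

# Relative frequency RF_i = f_i / N.
rel_freqs = [f / n for f in freqs]

# Cumulative frequency CF_i and cumulative percentage.
cum_freqs = list(accumulate(freqs))
cum_pcts = [100 * cf / n for cf in cum_freqs]

for (lo, hi), f, rf, cf, cp in zip(classes, freqs, rel_freqs, cum_freqs, cum_pcts):
    print(f"{lo} but less than {hi}: f={f} RF={rf:.2f} CF={cf} Cum%={cp:.0f}%")
```

Because the intervals are half-open, a value sitting exactly on a boundary (such as 30) is counted once, in the class whose lower limit it equals.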
Practical notes:
- A frequency distribution condenses raw data and supports quick visual interpretation of where data are concentrated.
- The choice of class boundaries can affect the appearance of the distribution, especially with small data sets; larger data sets reduce this effect.
- When comparing two or more groups with different sample sizes, use relative frequency or percentages for fair comparison.

Visualizing Numerical Data

Graphical displays for numerical data include:
- Ordered array, Stem-and-Leaf display, Histogram, Polygon (frequency polygon), Ogive (cumulative percentage polygon).
- Frequency distributions and cumulative distributions are often shown in histograms and ogives.
Stem-and-Leaf Display:
- A simple way to visualize distribution by separating leading digits (stems) from trailing digits (leaves).
- Example structure (notional):
- Stem 1: Leaves 67788899
- Stem 2: Leaves 0012257
- Stem 3: Leaves 28
- Stem 4: Leaves 2
Example data in stem-and-leaf form can be used to show age distributions for day vs night students.
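A stem-and-leaf display is simple to build by hand or in code. The sketch below assumes non-negative two-digit integers, uses the tens digit as the stem and the units digit as the leaf, and is applied to the temperature data listed earlier in these notes; `stem_and_leaf` is a hypothetical helper, and stems with no observations are simply omitted.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of sorted leaves (units digits)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}


temps = [10, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

display = stem_and_leaf(temps)
for stem, leaves in display.items():
    print(f"Stem {stem}: Leaves {leaves}")
```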
Histogram:
- Vertical bar chart of a frequency distribution with no gaps between bars.
- Class boundaries (or midpoints) on horizontal axis; vertical axis shows frequency, relative frequency, or percentage.
Frequency distribution visuals:
- Example: a 5-class histogram of the 20 temperature observations, with frequencies 3, 6, 5, 4, 2 and relative frequencies 0.15, 0.30, 0.25, 0.20, 0.10.
Ogive (cumulative percentage polygon):
- Plot cumulative percentages against class boundaries or midpoints; useful for comparing groups when multiple distributions are present.
Frequency polygon and percentage polygon:
- Connect class midpoints with line segments to form a polygon; useful for comparing distributions across groups.
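The plotting coordinates for a percentage polygon and an ogive can be derived directly from a frequency distribution. The sketch below uses the five-class temperature distribution from these notes and makes one assumption that conventions differ on: the ogive is drawn at class boundaries, starting at 0% at the lowest boundary, so each point gives the percentage of values below that boundary.

```python
# Class intervals and frequencies from the temperature example.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [3, 6, 5, 4, 2]
n = sum(freqs)

# Percentage polygon: (class midpoint, percentage in class).
polygon_points = [((lo + hi) / 2, 100 * f / n)
                  for (lo, hi), f in zip(classes, freqs)]

# Ogive (assumed convention): (class boundary, % of values below it).
ogive_points = [(classes[0][0], 0.0)]
running = 0
for (lo, hi), f in zip(classes, freqs):
    running += f
    ogive_points.append((hi, 100 * running / n))

print(polygon_points)
print(ogive_points)
```

Connecting each list of points with line segments gives the two polygons; overlaying the points for a second group on the same axes is what makes these displays useful for comparison.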

Visualizing Two Numerical Variables

Scatter plots:
- Used for paired observations from two numerical variables; one variable on the x-axis and the other on the y-axis.
- Purpose: examine possible relationships or correlations between the two variables.
- Example: Volume per day vs Cost per day plotted as paired observations.
Time-series plots:
- Used to study patterns in a numeric variable over time.
- Axes: time on the horizontal axis; numeric value on the vertical axis.
- Example: Yearly data for the number of franchises from 2011 to 2019.

Multidimensional Data and Drill-Down

Multidimensional contingency tables (three or more categorical variables):
- Tallies each combination of the variables to discover patterns that simpler tables cannot reveal.
- Practical guideline: keep tables to no more than three or four variables.
- In practice, extend contingency tables to more rows/columns or replace frequencies with numeric summaries when needed.
Pivot tables in Excel enable interactive displays of multidimensional contingency data.
Drill-down:
- Clicking a summary cell reveals the underlying row-level data that comprise that summary.
- Simple form of data discovery.
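The tally-then-drill-down idea can be sketched without a spreadsheet. The fund records below are hypothetical, invented only to illustrate the mechanics, and `drill_down` is an illustrative helper rather than an Excel feature.

```python
from collections import Counter

# Hypothetical records, invented for illustration only.
funds = [
    {"name": "Fund A", "fund_type": "Growth", "market_cap": "Large", "stars": 5},
    {"name": "Fund B", "fund_type": "Growth", "market_cap": "Large", "stars": 5},
    {"name": "Fund C", "fund_type": "Value", "market_cap": "Large", "stars": 5},
    {"name": "Fund D", "fund_type": "Value", "market_cap": "Small", "stars": 3},
    {"name": "Fund E", "fund_type": "Growth", "market_cap": "Small", "stars": 4},
    {"name": "Fund F", "fund_type": "Value", "market_cap": "Large", "stars": 4},
]

# Multidimensional contingency table: tally each combination of three variables.
tally = Counter((f["fund_type"], f["market_cap"], f["stars"]) for f in funds)


def drill_down(records, **criteria):
    """Return the underlying rows behind one summary cell."""
    return [r for r in records
            if all(r[key] == value for key, value in criteria.items())]


cell_rows = drill_down(funds, fund_type="Growth", market_cap="Large", stars=5)
print(tally[("Growth", "Large", 5)])
print([r["name"] for r in cell_rows])
```

The count in a summary cell always equals the number of rows returned by drilling into that cell, which is a quick consistency check on any pivot-style summary.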
Displays to visualize a mix of many variables:
- Colored scatter plots, bubble charts (size encodes a third variable), pivot charts, treemaps, sparklines.
Excel PivotCharts and Tableau visualizations can display specific categories drawn from multidimensional summaries such as PivotTables.

Filtering, Querying, and Interactive Exploration

Filtering and querying data help prepare tabular or visual summaries:
- Data Filtering: selects rows based on criteria for specific variable values.
- Data Querying: similar to filtering, but the result may include only some of the columns from the matching rows.
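The difference between filtering and querying can be shown in plain Python. The records and the helper names `filter_rows` and `query_rows` are hypothetical, made up for this sketch; the values are not real fund data.

```python
# Hypothetical records, invented for illustration only.
funds = [
    {"name": "Fund A", "market_cap": "Large", "stars": 5, "expense_ratio": 0.85},
    {"name": "Fund B", "market_cap": "Large", "stars": 3, "expense_ratio": 1.10},
    {"name": "Fund C", "market_cap": "Small", "stars": 5, "expense_ratio": 1.40},
    {"name": "Fund D", "market_cap": "Large", "stars": 5, "expense_ratio": 0.60},
]


def filter_rows(records, predicate):
    """Filtering: keep whole rows that satisfy the criteria."""
    return [r for r in records if predicate(r)]


def query_rows(records, predicate, columns):
    """Querying: keep matching rows but only the requested columns."""
    return [{c: r[c] for c in columns} for r in records if predicate(r)]


def is_large_five_star(record):
    return record["market_cap"] == "Large" and record["stars"] == 5


filtered = filter_rows(funds, is_large_five_star)
queried = query_rows(funds, is_large_five_star, ["name", "expense_ratio"])

print(filtered)  # full rows for Fund A and Fund D
print(queried)   # only name and expense_ratio for the same rows
```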
Excel features for filtering/querying include: slicers, filter options, and pivot table interactions.
Slicers (Excel): a panel of clickable buttons that filter data displayed in a PivotTable; each button represents a unique value of a variable.
Practical use: asking and answering questions like which attributes correspond to the lowest expense ratio or which expense ratios are associated with large market-cap value funds with a star rating of five.

Pitfalls, Ethical and Practical Considerations in Visualization

Common pitfalls when organizing/visualizing data:
- Underestimating perceptual limits of audiences.
- Creating summaries that obscure data or mislead (false impressions).
- Chartjunk and overly complex visuals.
An example of obscuring data: data overload hides key patterns in the underlying data.
False impressions can arise from:
- Selective summarization (showing only parts of the data).
- Improperly scaled axes (e.g., axes not starting at zero or using broken axes).
- Pie charts that are hard to interpret or misrepresent proportions.
Example illustrating selective summarization: two different year-to-year summaries can tell different stories about the same company data; the remedy is to present complementary views or the full data.
Graphical errors to watch for:
- No relative basis when comparing groups (e.g., raw counts instead of percentages), or inconsistent baselines across charts.
- Compressing or exploding axes, misaligned scales, or starting points that mislead.
- Chartjunk: extraneous images or decorations that do not convey information.
Best practices for constructing visualizations:
- Use the simplest visualization that communicates the message.
- Include clear titles and label all axes; provide axis scales where relevant.
- Start vertical axis at zero with a constant scale if comparing values.
- Avoid 3D, exploded views, and chartjunk; use consistent coloring when comparing charts.
- Prefer standard chart types; avoid obscure ones (e.g., radar, 3D cone/pyramid) for general interpretation.

Chapter Summary

This chapter covers:
- How to organize and visualize categorical variables.
- How to organize and visualize numerical variables.
- How to summarize a mix of variables (including multidimensional tabulations).
- How to avoid common errors and misrepresentations in data visualization.
Key takeaway: choose appropriate visualization methods for the data type and context, be mindful of how the presentation could influence interpretation, and use checks (e.g., multiple views) to ensure accurate representation.