Organizing and Visualizing Variables - Study Notes
Organizing Categorical Data
- Categorical data are organized primarily using tables and visual displays to summarize frequencies and patterns.
- The DCOVA workflow (Define, Collect, Organize, Visualize, Analyze) is used throughout to guide the process of organizing and visualizing data.
- Key concepts:
- Summary Table: tallies the frequencies or percentages of items in categories to show differences between categories.
- Contingency Table (two categorical variables): cross-tabulates responses to study patterns between two categorical variables; rows correspond to one variable and columns to the other.
- Example: Summary table of devices used to watch movies/TV shows
- Television Set: 49% | Tablet: 9% | Smartphone: 10% | Laptop/Desktop: 32%
- Source: Sharma, Wall Street Journal, 2016.
- Example: Contingency table for invoices by size and presence of errors (two categorical variables: Size and Errors)
- Table layout (frequencies):
- Small Amount: No Errors 170 | Errors 20 | Total 190
- Medium Amount: No Errors 100 | Errors 40 | Total 140
- Large Amount: No Errors 65 | Errors 5 | Total 70
- Totals: No Errors 335 | Errors 65 | Total 400
- Percentages (three ways to view the contingency data):
- Based on overall total (denominator = 400):
- Small Amount: No Errors = \frac{170}{400}=0.425=42.50\%, Errors = \frac{20}{400}=0.05=5.00\%, Total = \frac{190}{400}=0.475=47.50\%
- Medium Amount: No Errors = \frac{100}{400}=0.25=25.00\%, Errors = \frac{40}{400}=0.10=10.00\%, Total = \frac{140}{400}=0.35=35.00\%
- Large Amount: No Errors = \frac{65}{400}=0.1625=16.25\%, Errors = \frac{5}{400}=0.0125=1.25\%, Total = \frac{70}{400}=0.175=17.50\%
- Based on row totals (denominator = row total):
- For Small Amount (190 total): No Errors = \frac{170}{190}=0.8947=89.47\%, Errors = \frac{20}{190}=0.1053=10.53\%
- For Medium Amount (140 total): No Errors = \frac{100}{140}=0.7143=71.43\%, Errors = \frac{40}{140}=0.2857=28.57\%
- For Large Amount (70 total): No Errors = \frac{65}{70}=0.9286=92.86\%, Errors = \frac{5}{70}=0.0714=7.14\%
- Based on column totals (denominator = total of the No Errors column or of the Errors column):
- No Errors column total = 335; No Errors by size: Small = \frac{170}{335}=0.5075=50.75\%, Medium = \frac{100}{335}=0.2985=29.85\%, Large = \frac{65}{335}=0.1940=19.40\%
- Errors column total = 65; Errors by size: Small = \frac{20}{65}=0.3077=30.77\%, Medium = \frac{40}{65}=0.6154=61.54\%, Large = \frac{5}{65}=0.0769=7.69\%
- Practical implications:
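- Percentages of a row, a column, or the overall total can tell different stories; choose the basis that matches the question being asked (e.g., row percentages show that 28.57% of medium invoices contain errors, while column percentages show that 61.54% of all invoices with errors are medium).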
- Summary points:
- Tables enable precise organization and quick tabulation of categorical data.
- Contingency tables reveal relationships between two categorical variables and support multiple percentage viewpoints (overall, row, column).
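The three percentage views of the invoice contingency table can be reproduced with a short script. This is a minimal sketch, not a prescribed method: the function name `percentages` and the `basis` argument are assumptions made for illustration, and only the invoice counts come from the notes.

```python
# Invoice counts from the contingency table in these notes:
# rows are invoice sizes, columns are error status.
counts = {
    "Small": {"No Errors": 170, "Errors": 20},
    "Medium": {"No Errors": 100, "Errors": 40},
    "Large": {"No Errors": 65, "Errors": 5},
}


def percentages(table, basis):
    """Return cell percentages based on the 'overall', 'row', or 'column' total."""
    columns = list(next(iter(table.values())))
    row_totals = {r: sum(cells.values()) for r, cells in table.items()}
    col_totals = {c: sum(table[r][c] for r in table) for c in columns}
    grand_total = sum(row_totals.values())

    result = {}
    for r, cells in table.items():
        result[r] = {}
        for c, n in cells.items():
            if basis == "overall":
                denom = grand_total
            elif basis == "row":
                denom = row_totals[r]
            elif basis == "column":
                denom = col_totals[c]
            else:
                raise ValueError(f"unknown basis: {basis}")
            # Percentages are rounded to two decimals, as in the notes.
            result[r][c] = round(100 * n / denom, 2)
    return result


for basis in ("overall", "row", "column"):
    print(basis, percentages(counts, basis))
```

Switching `basis` changes only the denominator, which is why the same cell (Medium, Errors) reads as 10.00%, 28.57%, or 61.54% depending on the view.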
Visualizing Categorical Data
- Graphical displays help assess patterns quickly:
- For one categorical variable: Bar Chart, Pie/Doughnut Chart, Pareto Chart, Summary Table.
- For two categorical variables: Side-by-Side Bar Chart (a contingency view).
- Notes on these displays: the Doughnut (Donut) Chart is a variant of the pie chart; the Pareto Chart shows the most important categories and their cumulative share.
- Pareto chart characteristics:
- Vertical bar chart with categories in descending order of frequency.
- Includes a cumulative polygon to emphasize the "vital few" vs the "trivial many".
- Side-by-Side Bar Chart uses a contingency table to compare No Errors vs Errors across invoice sizes (Small, Medium, Large).
- Doughnut chart is a circular chart like a pie chart but with a central hole; used to display categorical data from a contingency table.
- Practical notes:
- Bar charts and pie/doughnut charts should have clear labels, consistent scales, and emphasis on actual percentages or frequencies.
- Pareto charts help prioritize issues by highlighting the most significant categories first.
- Example details (ATM/two-variable view):
- In separate visuals, show the breakdown of invoices by size and by presence of errors.
- When the two variables are combined, state clearly which category combinations stand out (e.g., medium-size invoices have the highest error rate, 28.57%).
- Common pitfalls to avoid in categorical visuals:
- Using too many categories, which obscures patterns.
- Inconsistent or misleading scales, chartjunk, or starting axes away from zero unnecessarily.
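The ordering and cumulative share behind a Pareto chart can be sketched without any plotting library. The example below reuses the device percentages from the summary-table example earlier in these notes; the helper name `pareto_table` is an assumption made for illustration, and the plotting step itself is left out.

```python
# Device shares (percent) from the summary-table example in these notes.
shares = {
    "Television Set": 49,
    "Tablet": 9,
    "Smartphone": 10,
    "Laptop/Desktop": 32,
}


def pareto_table(values):
    """Sort categories by descending value and attach cumulative percentages."""
    total = sum(values.values())
    ordered = sorted(values.items(), key=lambda item: item[1], reverse=True)
    rows = []
    running = 0
    for category, value in ordered:
        running += value
        rows.append((category, value, round(100 * running / total, 2)))
    return rows


for category, value, cumulative in pareto_table(shares):
    print(f"{category:<15} {value:>3}  cumulative {cumulative:>6.2f}%")
```

The bars of the chart would follow the sorted order, and the cumulative column gives the points of the cumulative polygon: the first two categories already account for 81% of responses.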
Organizing Numerical Data
- Numerical data can be organized into ordered arrays and various distribution formats:
- Ordered Array: data in rank order from smallest to largest; shows range and potential outliers.
- Frequency Distribution: data are grouped into class intervals; key decisions include the number of classes, class width, and class boundaries to avoid overlap.
- The number of classes typically ranges from 5 to 15.
- Important formulas:
- Class width: w = \frac{\text{range}}{k} where range = max − min and k is the desired number of classes.
- Range: \text{range} = \max(x) - \min(x).
- Relative frequency distribution: RF_i = \frac{f_i}{N} where f_i is the frequency of class i and N is the total number of observations.
- Cumulative distribution: CF_i = \sum_{j=1}^{i} f_j
- Cumulative percentage (ogive context): \text{Cum\%}_i = \frac{CF_i}{N} \times 100\%
- Example: Insulation temperature data across 20 winter days (illustrative data set of 20 values):
- Raw data (sorted): 10, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
- Range = 58 − 10 = 48; choose k = 5 classes; width w = \frac{48}{5} = 9.6, rounded up to a convenient w = 10
- Class intervals (examples): 10\le x < 20, 20\le x < 30, 30\le x < 40, 40\le x < 50, 50\le x < 60
- Class midpoints: 15, 25, 35, 45, 55
- Frequency distribution example (20 data):
- Class intervals and frequencies:
- 10 but less than 20: 3
- 20 but less than 30: 6
- 30 but less than 40: 5
- 40 but less than 50: 4
- 50 but less than 60: 2
- Total = 20
- Relative and cumulative frequency example:
- Relative frequency: for each class, e.g., 3/20 = 0.15, 6/20 = 0.30, 5/20 = 0.25, 4/20 = 0.20, 2/20 = 0.10
- Cumulative frequency: 3, 9, 14, 18, 20
- Cumulative percentage: 15%, 45%, 70%, 90%, 100%
- Practical notes:
- A frequency distribution condenses raw data and supports quick visual interpretation of where data are concentrated.
- The choice of class boundaries can affect the appearance of the distribution, especially with small data sets; larger data sets reduce this effect.
- When comparing two or more groups with different sample sizes, use relative frequency or percentages for fair comparison.
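The frequency, relative frequency, and cumulative figures in the 20-value example can be checked with a short script. This is a sketch under stated assumptions: the function name `frequency_distribution` and its parameters are hypothetical, classes are taken as "lower bound but less than upper bound", and the data are the 20 values listed in these notes.

```python
# The 20 sorted values from the example in these notes.
temps = [10, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]


def frequency_distribution(data, start, width, k):
    """Tally data into k classes of the form [lower, lower + width)."""
    freqs = [0] * k
    for x in data:
        index = int((x - start) // width)
        if not 0 <= index < k:
            raise ValueError(f"{x} falls outside the class intervals")
        freqs[index] += 1

    n = len(data)
    rows = []
    cumulative = 0
    for i, f in enumerate(freqs):
        cumulative += f
        lower = start + i * width
        rows.append({
            "class": (lower, lower + width),
            "frequency": f,                         # f_i
            "relative": f / n,                      # RF_i = f_i / N
            "cumulative": cumulative,               # CF_i
            "cumulative_pct": 100 * cumulative / n  # Cum%_i
        })
    return rows


for row in frequency_distribution(temps, start=10, width=10, k=5):
    print(row)
```

Changing `start` or `width` is a quick way to see how the choice of class boundaries alters the apparent shape of a small data set.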
Visualizing Numerical Data
- Graphical displays for numerical data include:
- Ordered array, Stem-and-Leaf display, Histogram, Polygon (frequency polygon), Ogive (cumulative percentage polygon).
- Frequency distributions and cumulative distributions are often shown in histograms and ogives.
- Stem-and-Leaf Display:
- A simple way to visualize distribution by separating leading digits (stems) from trailing digits (leaves).
- Example structure (notional):
- Stem 1: Leaves 67788899
- Stem 2: Leaves 0012257
- Stem 3: Leaves 28
- Stem 4: Leaves 2
- Example data in stem-and-leaf form can be used to show age distributions for day vs night students.
- Histogram:
- Vertical bar chart of a frequency distribution with no gaps between bars.
- Class boundaries (or midpoints) on horizontal axis; vertical axis shows frequency, relative frequency, or percentage.
- Frequency distribution visuals:
- Example of a 5-class histogram with frequencies and relative frequencies provided, and a total of 20 observations.
- Ogive (cumulative percentage polygon):
- Plot cumulative percentages against class boundaries (each point shows the percentage of values below that boundary); useful for comparing groups when multiple distributions are present.
- Frequency polygon and percentage polygon:
- Connect class midpoints with line segments to form a polygon; useful for comparing distributions across groups.
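As a sketch of the stem-and-leaf display described in this section, the script below splits each value into a tens digit (stem) and a units digit (leaf). It assumes non-negative two-digit integers and reuses the 20 values from the frequency distribution example in these notes; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# The 20 values from the frequency distribution example in these notes.
temps = [10, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]


def stem_and_leaf(data):
    """Return display lines with tens digits as stems and units digits as leaves."""
    stems = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        stems[stem].append(str(leaf))
    # Include every stem in the range, even one with no leaves,
    # so gaps in the data stay visible.
    return [
        f"{stem} | {''.join(stems.get(stem, []))}"
        for stem in range(min(stems), max(stems) + 1)
    ]


for line in stem_and_leaf(temps):
    print(line)
```

The length of each row of leaves matches the class frequencies 3, 6, 5, 4, 2, so the display doubles as a sideways histogram that keeps the individual values.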
Visualizing Two Numerical Variables
- Scatter plots:
- Used for paired observations from two numerical variables; one variable on the x-axis and the other on the y-axis.
- Purpose: examine possible relationships or correlations between the two variables.
- Example: Volume per day vs Cost per day plotted as paired observations.
- Time-series plots:
- Used to study patterns in a numeric variable over time.
- Axes: time on the horizontal axis; numeric value on the vertical axis.
- Example: Yearly data for the number of franchises from 2011 to 2019.
Multidimensional Data and Drill-Down
- Multidimensional contingency tables (three or more categorical variables):
- Tallies each combination of the variables to discover patterns that simpler tables cannot reveal.
- Practical guideline: keep tables to no more than three or four variables.
- In practice, extend contingency tables to more rows/columns or replace frequencies with numeric summaries when needed.
- Pivot tables in Excel enable interactive displays of multidimensional contingency data.
- Drill-down:
- Clicking a summary cell reveals the underlying row-level data that comprise that summary.
- Simple form of data discovery.
- Displays to visualize a mix of many variables:
- Colored scatter plots, bubble charts (size encodes a third variable), pivot charts, treemaps, sparklines.
- Excel PivotCharts and Tableau visualizations can display selected categories from a multidimensional summary such as a PivotTable.
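A multidimensional tally and a drill-down can be sketched in a few lines. Everything in this example is an assumption made for illustration: the fund records are invented, and `cross_tab` and `drill_down` are hypothetical helper names, not Excel or Tableau features.

```python
from collections import Counter

# Hypothetical fund records, invented purely for illustration.
funds = [
    {"name": "Fund A", "type": "Value", "cap": "Large", "stars": 5},
    {"name": "Fund B", "type": "Value", "cap": "Large", "stars": 3},
    {"name": "Fund C", "type": "Growth", "cap": "Small", "stars": 5},
    {"name": "Fund D", "type": "Value", "cap": "Large", "stars": 5},
    {"name": "Fund E", "type": "Growth", "cap": "Large", "stars": 4},
]


def cross_tab(records, *variables):
    """Count records for every observed combination of the given variables."""
    return Counter(tuple(r[v] for v in variables) for r in records)


def drill_down(records, **criteria):
    """Return the underlying records that make up one summary cell."""
    return [r for r in records if all(r[k] == v for k, v in criteria.items())]


table = cross_tab(funds, "type", "cap", "stars")
print(table[("Value", "Large", 5)])
print([r["name"] for r in drill_down(funds, type="Value", cap="Large", stars=5)])
```

The count in a summary cell and the number of rows returned by the matching drill-down always agree, which is the relationship a PivotTable exposes when a cell is clicked.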
Filtering, Querying, and Interactive Exploration
- Filtering and querying data help prepare tabular or visual summaries:
- Data Filtering: selects rows based on criteria for specific variable values.
- Data Querying: similar to filtering, but the result may include only some of the columns from the matching rows.
- Excel features for filtering/querying include: slicers, filter options, and pivot table interactions.
- Slicers (Excel): a panel of clickable buttons that filter data displayed in a PivotTable; each button represents a unique value of a variable.
- Practical use: asking and answering questions like which attributes correspond to the lowest expense ratio or which expense ratios are associated with large market-cap value funds with a star rating of five.
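The difference between filtering and querying can be illustrated with plain Python. This is a sketch only: the fund records and expense ratios are invented for illustration, and `filter_rows` and `query` are hypothetical helper names rather than Excel features.

```python
# Hypothetical fund records, invented purely for illustration.
funds = [
    {"name": "Fund A", "type": "Value", "cap": "Large", "stars": 5, "expense_ratio": 0.85},
    {"name": "Fund B", "type": "Value", "cap": "Large", "stars": 3, "expense_ratio": 1.10},
    {"name": "Fund C", "type": "Growth", "cap": "Small", "stars": 5, "expense_ratio": 1.40},
    {"name": "Fund D", "type": "Value", "cap": "Large", "stars": 5, "expense_ratio": 0.60},
]


def is_five_star_large_value(fund):
    """Criteria: large market-cap value fund with a star rating of five."""
    return fund["type"] == "Value" and fund["cap"] == "Large" and fund["stars"] == 5


def filter_rows(records, predicate):
    """Filtering: keep whole rows that satisfy the criteria."""
    return [r for r in records if predicate(r)]


def query(records, predicate, columns):
    """Querying: keep matching rows but return only the requested columns."""
    return [{c: r[c] for c in columns} for r in records if predicate(r)]


matches = filter_rows(funds, is_five_star_large_value)
ratios = query(funds, is_five_star_large_value, ["name", "expense_ratio"])
cheapest = min(funds, key=lambda f: f["expense_ratio"])

print(matches)
print(ratios)
print(cheapest["name"])
```

A slicer plays the role of the predicate here: each clicked button adds one condition on a variable's value.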
Pitfalls, Ethical and Practical Considerations in Visualization
- Common pitfalls when organizing/visualizing data:
- Underestimating perceptual limits of audiences.
- Creating summaries that obscure data or mislead (false impressions).
- Chartjunk and overly complex visuals.
- An example of obscuring data: data overload hides key patterns in the underlying data.
- False impressions can arise from:
- Selective summarization (showing only parts of the data).
- Improperly scaled axes (e.g., axes not starting at zero or using broken axes).
- Pie charts that are hard to interpret or misrepresent proportions.
- Example illustrating selective summarization: two different year-to-year summaries can tell different stories about the same company data; the remedy is to present complementary views or the full data.
- Graphical errors to watch for:
- No relative basis or inconsistent baselining across charts.
- Compressing or exploding axes, misaligned scales, or starting points that mislead.
- Chartjunk: extraneous images or decorations that do not convey information.
- Best practices for constructing visualizations:
- Use the simplest visualization that communicates the message.
- Include clear titles and label all axes; provide axis scales where relevant.
- Start the vertical axis at zero and use a constant scale when comparing values.
- Avoid 3D, exploded views, and chartjunk; use consistent coloring when comparing charts.
- Prefer standard chart types; avoid obscure ones (e.g., radar, 3D cone/pyramid) for general interpretation.
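The effect of a vertical axis that does not start at zero can be quantified with simple arithmetic. The sketch below compares the ratio of two drawn bar heights with the ratio of the underlying values; the function name `apparent_ratio` and the values 50 and 55 are hypothetical, chosen only to illustrate the point.

```python
def apparent_ratio(a, b, axis_start=0.0):
    """Ratio of drawn bar heights (b to a) when the axis starts at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("axis_start must be below both values")
    return (b - axis_start) / (a - axis_start)


# Hypothetical values: 55 is 10% larger than 50.
print(apparent_ratio(50, 55))                 # axis from zero: bars differ by 10%
print(apparent_ratio(50, 55, axis_start=45))  # axis from 45: second bar looks twice as tall
```

With the axis starting at 45, a 10% difference is drawn as a 100% difference in bar height, which is the kind of false impression the zero-baseline guideline is meant to prevent.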
Chapter Summary
- This chapter covers:
- How to organize and visualize categorical variables.
- How to organize and visualize numerical variables.
- How to summarize a mix of variables (including multidimensional tabulations).
- How to avoid common errors and misrepresentations in data visualization.
- Key takeaway: choose appropriate visualization methods for the data type and context, be mindful of how the presentation could influence interpretation, and use checks (e.g., multiple views) to ensure accurate representation.