Organizing and Visualizing Variables - Study Notes

Organizing Categorical Data

  • Categorical data are organized primarily using tables and visual displays to summarize frequencies and patterns.
  • The DCOVA workflow (Define, Collect, Organize, Visualize, Analyze) is used throughout to guide the process of organizing and visualizing data.
  • Key concepts:
    • Summary Table: tallies the frequencies or percentages of items in categories to show differences between categories.
    • Contingency Table (two categorical variables): cross-tabulates responses to study patterns between two categorical variables; rows correspond to one variable and columns to the other.
  • Example: Summary table of devices used to watch movies/TV shows
    • Television Set: 49% | Tablet: 9% | Smartphone: 10% | Laptop/Desktop: 32%
    • Source: Sharma, Wall Street Journal, 2016.
  • Example: Contingency table for invoices by size and presence of errors (two categorical variables: Size and Errors)
    • Table layout (frequencies):
    • Small Amount: No Errors 170 | Errors 20 | Total 190
    • Medium Amount: No Errors 100 | Errors 40 | Total 140
    • Large Amount: No Errors 65 | Errors 5 | Total 70
    • Totals: No Errors 335 | Errors 65 | Total 400
  • Percentages (three ways to view the contingency data):
    • Based on overall total (denominator = 400):
    • Small Amount: No Errors = \frac{170}{400}=0.425=42.50\%, Errors = \frac{20}{400}=0.05=5.00\%, Total = \frac{190}{400}=0.475=47.50\%
    • Medium Amount: No Errors = \frac{100}{400}=0.25=25.00\%, Errors = \frac{40}{400}=0.10=10.00\%, Total = \frac{140}{400}=0.35=35.00\%
    • Large Amount: No Errors = \frac{65}{400}=0.1625=16.25\%, Errors = \frac{5}{400}=0.0125=1.25\%, Total = \frac{70}{400}=0.175=17.50\%
    • Based on row totals (denominator = row total):
    • For Small Amount (190 total): No Errors = \frac{170}{190}=0.8947=89.47\%, Errors = \frac{20}{190}=0.1053=10.53\%
    • For Medium Amount (140 total): No Errors = \frac{100}{140}=0.7143=71.43\%, Errors = \frac{40}{140}=0.2857=28.57\%
    • For Large Amount (70 total): No Errors = \frac{65}{70}=0.9286=92.86\%, Errors = \frac{5}{70}=0.0714=7.14\%
    • Based on column totals (denominator = the No Errors column total or the Errors column total):
    • No Errors column total = 335; No Errors by size: Small = \frac{170}{335}=0.5075=50.75\%, Medium = \frac{100}{335}=0.2985=29.85\%, Large = \frac{65}{335}=0.1940=19.40\%
    • Errors column total = 65; Errors by size: Small = \frac{20}{65}=0.3077=30.77\%, Medium = \frac{40}{65}=0.6154=61.54\%, Large = \frac{5}{65}=0.0769=7.69\%
  • Practical implications:
    • Percent of a row, a column, or the overall total can tell different stories; choose the interpretation that best supports the decision context.
  • Summary points:
    • Tables enable precise organization and quick tabulation of categorical data.
    • Contingency tables reveal relationships between two categorical variables and support multiple percentage viewpoints (overall, row, column).
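
The three percentage views of the invoice table can be reproduced with a short Python sketch. The dictionary layout and the function name `percentages` are illustrative choices, not part of the original notes; only the counts come from the table above.

```python
# Invoice counts from the contingency table: size -> column -> frequency
counts = {
    "Small": {"No Errors": 170, "Errors": 20},
    "Medium": {"No Errors": 100, "Errors": 40},
    "Large": {"No Errors": 65, "Errors": 5},
}

def percentages(table, basis):
    """Return cell percentages using the overall, row, or column total as denominator."""
    grand_total = sum(sum(row.values()) for row in table.values())
    col_totals = {}
    for row in table.values():
        for col, n in row.items():
            col_totals[col] = col_totals.get(col, 0) + n

    result = {}
    for size, row in table.items():
        row_total = sum(row.values())
        result[size] = {}
        for col, n in row.items():
            if basis == "overall":
                denom = grand_total
            elif basis == "row":
                denom = row_total
            elif basis == "column":
                denom = col_totals[col]
            else:
                raise ValueError("basis must be 'overall', 'row', or 'column'")
            result[size][col] = round(100 * n / denom, 2)
    return result

for basis in ("overall", "row", "column"):
    print(basis, percentages(counts, basis))
```

Reading the "row" view answers "what share of medium invoices contain errors?", while the "column" view answers "what share of all erroneous invoices are medium-sized?".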

Visualizing Categorical Data

  • Graphical displays help assess patterns quickly:
    • For one categorical variable: Bar Chart, Pie/Doughnut Chart, Pareto Chart, Summary Table.
    • For two categorical variables: Side-by-Side Bar Chart (a contingency view).
    • A doughnut chart is a pie chart with a hollow center; a Pareto chart shows the most important categories and their cumulative share.
  • Pareto chart characteristics:
    • Vertical bar chart with categories in descending order of frequency.
    • Includes a cumulative polygon to emphasize the "vital few" vs the "trivial many".
  • Side-by-Side Bar Chart uses a contingency table to compare No Errors vs Errors across invoice sizes (Small, Medium, Large).
  • Doughnut chart is a circular chart like a pie chart but with a central hole; used to display categorical data from a contingency table.
  • Practical notes:
    • Bar charts and pie/doughnut charts should have clear labels, consistent scales, and emphasis on actual percentages or frequencies.
    • Pareto charts help prioritize issues by highlighting the most significant categories first.
  • Example details (ATM/two-variable view):
    • In separate visuals, show the breakdown of Invoices by Size and Errors.
    • When the two variables are combined, state clearly which category combinations dominate (e.g., medium-size invoices have the highest error rate, 28.57%).
  • Common pitfalls to avoid in categorical visuals:
    • Using too many categories, which obscures patterns.
    • Inconsistent or misleading scales, chartjunk, or starting axes away from zero unnecessarily.
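
The ordering and cumulative share behind a Pareto chart can be computed before any plotting. The sketch below applies that idea to the device percentages from the summary table earlier in these notes; the function name `pareto_table` is an illustrative choice.

```python
def pareto_table(percent_by_category):
    """Sort categories in descending order and attach the running cumulative percentage."""
    ordered = sorted(percent_by_category.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    running = 0
    for category, pct in ordered:
        running += pct
        rows.append((category, pct, running))
    return rows

# Percentages from the devices summary table
devices = {"Television Set": 49, "Tablet": 9, "Smartphone": 10, "Laptop/Desktop": 32}

for category, pct, cumulative in pareto_table(devices):
    print(f"{category:<15} {pct:>3}% {cumulative:>4}%")
```

The bars of the chart would use the second column and the cumulative polygon the third; here the first two categories already account for 81% of responses.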

Organizing Numerical Data

  • Numerical data can be organized into ordered arrays and various distribution formats:
    • Ordered Array: data in rank order from smallest to largest; shows range and potential outliers.
    • Frequency Distribution: data are grouped into class intervals; key decisions include the number of classes, class width, and class boundaries to avoid overlap.
    • The number of classes typically ranges from 5 to 15.
  • Important formulas:
    • Class width: w = \frac{\text{range}}{k} where range = max − min and k is the desired number of classes.
    • Range: \text{range} = \max(x) - \min(x).
    • Relative frequency distribution: RF_i = \frac{f_i}{N} where f_i is the frequency of class i and N is the total number of observations.
    • Cumulative distribution: CF_i = \sum_{j=1}^{i} f_j
    • Cumulative percentage (ogive context): \text{Cum\%}_i = \frac{CF_i}{N} \times 100\%
  • Example: Insulation temperature data across 20 winter days (illustrative data set of 20 values):
    • Raw data (sorted): 10, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
    • Range = 58 − 10 = 48; choose 5 classes; width w = \left\lceil\frac{48}{5}\right\rceil = \lceil 9.6 \rceil = 10 (rounded up so that the classes cover the full range)
    • Class intervals (examples): 10\le x < 20, 20\le x < 30, 30\le x < 40, 40\le x < 50, 50\le x < 60
    • Class midpoints: 15, 25, 35, 45, 55
  • Frequency distribution example (20 data):
    • Class intervals and frequencies:
    • 10 but less than 20: 3
    • 20 but less than 30: 6
    • 30 but less than 40: 5
    • 40 but less than 50: 4
    • 50 but less than 60: 2
    • Total = 20
  • Relative and cumulative frequency example:
    • Relative frequency: for each class, e.g., 3/20 = 0.15, 6/20 = 0.30, 5/20 = 0.25, 4/20 = 0.20, 2/20 = 0.10
    • Cumulative frequency: 3, 9, 14, 18, 20
    • Cumulative percentage: 15%, 45%, 70%, 90%, 100%
  • Practical notes:
    • A frequency distribution condenses raw data and supports quick visual interpretation of where data are concentrated.
    • The choice of class boundaries can affect the appearance of the distribution, especially with small data sets; larger data sets reduce this effect.
    • When comparing two or more groups with different sample sizes, use relative frequency or percentages for fair comparison.
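
The frequency, relative frequency, and cumulative columns for the 20 temperature values can be rebuilt with a small Python sketch. The function name `frequency_distribution` and the `start` parameter (the lower boundary of the first class) are illustrative assumptions; rounding the class width up with `math.ceil` mirrors the worked example above.

```python
import math

# The 20 sorted values from the example above
temps = [10, 13, 17, 21, 24, 24, 26, 27, 27, 30,
         32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

def frequency_distribution(data, k, start):
    """Group data into k equal-width classes [lower, upper) beginning at `start`."""
    width = math.ceil((max(data) - min(data)) / k)  # range / k, rounded up
    n = len(data)
    rows = []
    cumulative = 0
    for i in range(k):
        lower = start + i * width
        upper = lower + width
        freq = sum(1 for x in data if lower <= x < upper)
        cumulative += freq
        rows.append({
            "class": (lower, upper),
            "midpoint": (lower + upper) / 2,
            "freq": freq,
            "rel_freq": freq / n,
            "cum_freq": cumulative,
            "cum_pct": 100 * cumulative / n,
        })
    return rows

for row in frequency_distribution(temps, k=5, start=10):
    print(row)
```

Changing `k` or `start` is a quick way to see how the choice of class boundaries alters the shape of the distribution for a small data set.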

Visualizing Numerical Data

  • Graphical displays for numerical data include:
    • Ordered array, Stem-and-Leaf display, Histogram, Polygon (frequency polygon), Ogive (cumulative percentage polygon).
    • Frequency distributions and cumulative distributions are often shown in histograms and ogives.
  • Stem-and-Leaf Display:
    • A simple way to visualize distribution by separating leading digits (stems) from trailing digits (leaves).
    • Example structure (notional):
    • Stem 1: Leaves 67788899
    • Stem 2: Leaves 0012257
    • Stem 3: Leaves 28
    • Stem 4: Leaves 2
  • Example data in stem-and-leaf form can be used to show age distributions for day vs night students.
  • Histogram:
    • Vertical bar chart of a frequency distribution with no gaps between bars.
    • Class boundaries (or midpoints) on horizontal axis; vertical axis shows frequency, relative frequency, or percentage.
  • Frequency distribution visuals:
    • Example of a 5-class histogram with frequencies and relative frequencies provided, and a total of 20 observations.
  • Ogive (cumulative percentage polygon):
    • Plot cumulative percentages against class boundaries or midpoints; useful for comparing groups when multiple distributions are present.
  • Frequency polygon and percentage polygon:
    • Connect class midpoints with line segments to form a polygon; useful for comparing distributions across groups.
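
A stem-and-leaf display is simple to build in code. The sketch below assumes two-digit values with the tens digit as the stem and the units digit as the leaf; the `ages` list is read back from the notional display in this section under that assumption, and the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of sorted leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(str(v % 10))
    return {stem: "".join(leaves) for stem, leaves in sorted(stems.items())}

# Values implied by the notional display (assumes stem = tens, leaf = units)
ages = [16, 17, 17, 18, 18, 18, 19, 19,
        20, 20, 21, 22, 22, 25, 27,
        32, 38, 42]

for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {leaves}")
```

This version omits stems that have no leaves; a fuller implementation would print empty stems so that gaps in the data remain visible.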

Visualizing Two Numerical Variables

  • Scatter plots:
    • Used for paired observations from two numerical variables; one variable on the x-axis and the other on the y-axis.
    • Purpose: examine possible relationships or correlations between the two variables.
    • Example: Volume per day vs Cost per day plotted as paired observations.
  • Time-series plots:
    • Used to study patterns in a numeric variable over time.
    • Axes: time on the horizontal axis; numeric value on the vertical axis.
    • Example: Yearly data for the number of franchises from 2011 to 2019.

Multidimensional Data and Drill-Down

  • Multidimensional contingency tables (three or more categorical variables):
    • Tallies each combination of the variables to discover patterns that simpler tables cannot reveal.
    • Practical guideline: keep tables to no more than three or four variables.
    • In practice, extend contingency tables to more rows/columns or replace frequencies with numeric summaries when needed.
  • Pivot tables in Excel enable interactive displays of multidimensional contingency data.
  • Drill-down:
    • Clicking a summary cell reveals the underlying row-level data that comprise that summary.
    • Simple form of data discovery.
  • Displays to visualize a mix of many variables:
    • Colored scatter plots, bubble charts (size encodes a third variable), pivot charts, treemaps, sparklines.
  • Excel PivotCharts and Tableau visualizations can display selected categories drawn from a PivotTable.
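
The tallying and drill-down ideas in this section can be sketched without a spreadsheet. The records below are hypothetical and exist only for illustration; the field names and the function name `cross_tabulate` are likewise assumptions, not anything defined in the notes.

```python
from collections import defaultdict

# Hypothetical records for illustration only; not data from the notes.
funds = [
    {"name": "Fund A", "type": "Growth", "cap": "Large", "stars": 5},
    {"name": "Fund B", "type": "Growth", "cap": "Large", "stars": 3},
    {"name": "Fund C", "type": "Value", "cap": "Large", "stars": 5},
    {"name": "Fund D", "type": "Value", "cap": "Small", "stars": 4},
    {"name": "Fund E", "type": "Growth", "cap": "Large", "stars": 5},
]

def cross_tabulate(records, keys):
    """Group records by the tuple of their values for `keys`."""
    cells = defaultdict(list)
    for rec in records:
        cells[tuple(rec[k] for k in keys)].append(rec)
    return dict(cells)

# Multidimensional contingency table over three categorical variables
cells = cross_tabulate(funds, ["type", "cap", "stars"])
tallies = {combo: len(rows) for combo, rows in cells.items()}
print(tallies)

# Drill-down: reveal the rows behind one summary cell
drill = [rec["name"] for rec in cells[("Growth", "Large", 5)]]
print(drill)
```

Keeping the underlying rows in each cell, rather than only the count, is what makes the drill-down step possible.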

Filtering, Querying, and Interactive Exploration

  • Filtering and querying data help prepare tabular or visual summaries:
    • Data Filtering: selects rows based on criteria for specific variable values.
    • Data Querying: similar but may not select all columns from matching rows.
  • Excel features for filtering/querying include: slicers, filter options, and pivot table interactions.
  • Slicers (Excel): a panel of clickable buttons that filter data displayed in a PivotTable; each button represents a unique value of a variable.
  • Practical use: asking and answering questions like which attributes correspond to the lowest expense ratio or which expense ratios are associated with large market-cap value funds with a star rating of five.
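
The difference between filtering and querying can be shown in a few lines. The fund records, field names, and helper names (`filter_rows`, `query`, `is_large_value_five_star`) below are hypothetical and exist only to illustrate the distinction.

```python
# Hypothetical records for illustration only; not data from the notes.
funds = [
    {"name": "Fund A", "cap": "Large", "type": "Value", "stars": 5, "expense_ratio": 0.85},
    {"name": "Fund B", "cap": "Large", "type": "Value", "stars": 3, "expense_ratio": 1.10},
    {"name": "Fund C", "cap": "Small", "type": "Growth", "stars": 5, "expense_ratio": 1.40},
    {"name": "Fund D", "cap": "Large", "type": "Value", "stars": 5, "expense_ratio": 0.62},
]

def filter_rows(records, predicate):
    """Filtering: keep every column of each row that meets the criteria."""
    return [rec for rec in records if predicate(rec)]

def query(records, predicate, columns):
    """Querying: keep only the requested columns of the matching rows."""
    return [{c: rec[c] for c in columns} for rec in records if predicate(rec)]

def is_large_value_five_star(rec):
    return rec["cap"] == "Large" and rec["type"] == "Value" and rec["stars"] == 5

# Which expense ratios belong to large-cap value funds rated five stars?
matches = query(funds, is_large_value_five_star, ["name", "expense_ratio"])
print(matches)

# Which fund has the lowest expense ratio, and what are its attributes?
lowest = min(funds, key=lambda rec: rec["expense_ratio"])
print(lowest)
```

A slicer plays the role of the predicate here: each button click swaps in a different criterion without changing the summary logic.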

Pitfalls, Ethical and Practical Considerations in Visualization

  • Common pitfalls when organizing/visualizing data:
    • Underestimating perceptual limits of audiences.
    • Creating summaries that obscure data or mislead (false impressions).
    • Chartjunk and overly complex visuals.
  • An example of obscuring data: data overload hides key patterns in the underlying data.
  • False impressions can arise from:
    • Selective summarization (showing only parts of the data).
    • Improperly scaled axes (e.g., axes not starting at zero or using broken axes).
    • Pie charts that are hard to interpret or misrepresent proportions.
  • Example illustrating selective summarization: two different year-to-year summaries can tell different stories about the same company data; the remedy is to present complementary views or the full data.
  • Graphical errors to watch for:
    • No relative basis or inconsistent baselining across charts.
    • Compressing or exploding axes, misaligned scales, or starting points that mislead.
    • Chartjunk: extraneous images or decorations that do not convey information.
  • Best practices for constructing visualizations:
    • Use the simplest visualization that communicates the message.
    • Include clear titles and label all axes; provide axis scales where relevant.
    • Start vertical axis at zero with a constant scale if comparing values.
    • Avoid 3D, exploded views, and chartjunk; use consistent coloring when comparing charts.
    • Prefer standard chart types; avoid obscure ones (e.g., radar, 3D cone/pyramid) for general interpretation.

Chapter Summary

  • This chapter covers:
    • How to organize and visualize categorical variables.
    • How to organize and visualize numerical variables.
    • How to summarize a mix of variables (including multidimensional tabulations).
    • How to avoid common errors and misrepresentations in data visualization.
  • Key takeaway: choose appropriate visualization methods for the data type and context, be mindful of how the presentation could influence interpretation, and use checks (e.g., multiple views) to ensure accurate representation.