Data Transformation: Pivoting Data (Long to Wide, Wide to Long)

Data Transformation: Pivoting Data

Tidy Data Principles
  • Definition: A tidy data set is structured where:

    • Every variable is its own column.

    • Every observation is its own row.

    • Every cell contains a single piece of information (a single observation).

  • Why Tidy Data?:

    • Makes data more accessible and useful for analysis, especially with tools like Tidyverse and ggplot2.

    • Untidy data examples:

      • Year as Columns: A data set with countries in the first column and subsequent columns representing years (e.g., 2000, 2001, 2002), with values for each country-year combination. This is not tidy because year is a variable but is spread across multiple columns.

      • Multiple Observations per Cell: A data set where one cell contains multiple pieces of information, such as a list of hashtags in a single tweet observation. If a tweet is the unit of observation, and a cell contains multiple hashtags, it's untidy.

  • Unit of Observation: It is crucial to identify the core entity being analyzed.

    • Often, humans erroneously default to nouns (e.g., "countries") when the actual unit of observation is a measure over time (e.g., "observations of a country" or "country-year").

    • For example, measuring the United States, Uganda, or Thailand over time means the unit of observation is the specific country at a specific point in time.

Pivoting from Long to Wide Data (pivot_wider())
  • Purpose: To transform data from a 'long' format (where variables are stacked in rows) to a 'wide' format (where variables become new columns).

  • Scenario: Useful for certain visualizations, tables, or when specific variables need to be compared side-by-side as distinct columns.

  • pivot_wider() Arguments:

    • id_cols: The variable(s) that should remain as ID columns (identifiers) in the new wide format.

    • names_from: The column whose values will become the new column names.

    • values_from: The column whose values will populate the cells in the new columns.

    • names_prefix: A text string to add before the new column names for clarity.

  • Example: Billboard Hot 100100 Chart Data

    • Data Set: Top 100100 weekly Billboard chart songs from 19621962 to 20172017, with over 300,000300,000 observations and 1010 variables.

    • Initial Analysis Steps:

      • Loading tidyverse.

      • Sorting data (e.g., arrange(week_ID)).

      • Summarizing a variable: Finding the song on the chart for the longest duration using group_by(song, performer), summarise(weeks_on_chart = n()), arrange(desc(weeks_on_chart)), and slice_head(n = 10).

        • Result: "Radioactive" by Imagine Dragons was on the chart for 8787 weeks.

    • New Function: unite():

      • Purpose: Creates a new variable by combining existing text variables.

      • Syntax: unite(new_var_name, var1, var2, ..., sep = "-").

      • Example: hot_100 %>% unite(entry, performer, song, sep = "-") combines performer and song into a new entry variable, separated by a hyphen. This can create a single identifier for the unit of observation (song + performer).

    • Filtering for Top Songs: filter(week_position <= 3) to select only the top 33 songs each week.

    • Applying pivot_wider() to Billboard Data:

      • Objective: To have columns for song_entry_1, song_entry_2, song_entry_3 (representing the top 33 songs each week) and the week as the id_cols.

      • Code: pivot_wider(id_cols = week_ID, names_from = week_position, values_from = entry, names_prefix = "song_entry_").

      • Outcome: The data is reshaped so each row represents a week, and columns song_entry_1, song_entry_2, song_entry_3 contain the performer-song string for the top 33 positions, respectively.

    • Troubleshooting Tip: Execute each step of data manipulation individually and check the output to identify errors systematically.

    • Fun Example: Finding the Top Song on a Birthday:

      • Conceptual Steps:

        1. Convert the week_ID column to date objects.

        2. Create a date object for the birthday.

        3. Find all unique week_ID dates.

        4. Calculate the absolute difference in days between the birthday and each unique week date: abs(weeksbirthday)abs(weeks - birthday).

        5. Determine the minimum of these absolute differences.

        6. Use the which() command to find the index (position) of the week that has this minimum difference.

        7. Filter the original data set by this closest week_ID.

        8. Arrange by week_position and select song, performer, week_position.

        • Example Result: For November 27,199327, 1993, Meat Loaf's "I'd Do Anything for Love, but I Won't Do That" was the top song.

Exploratory Data Analysis with ggplot2 (Wide Data Example)
  • Visualizing "Radioactive" Performance:

    • Converting week_ID to a date format using mutate(date = as.Date(week_ID)).

    • ggplot() setup:

      • aes(x = date, y = week_position).

      • geom_line(color = "green", linewidth = 1) to show trend.

      • geom_point(shape = 'R', size = 3) for specific points.

      • scale_y_reverse() since lower rank numbers (e.g., 11) are 'better'.

      • scale_x_date(date_breaks = "2 months", date_labels = "%b %Y") for x-axis formatting.

      • theme_bw() and informative labs().

    • Adding Annotations:

      • To highlight an event (e.g., SNL appearance for Imagine Dragons in February 20142014).

      • geom_vline(xintercept = as.Date("2014-02-01"), color = "blue").

      • annotate("text", x = as.Date("2014-02-01"), y = 50, label = "SNL Appearance").

      • Implication: Being on SNL can cause a short-term spike in chart position and offers research questions about the average impact of such appearances.

Pivoting from Wide to Long Data (pivot_longer())
  • Purpose: To transform data from a 'wide' format (where related variables are in separate columns) to a 'long' format (where those variables are stacked into a single column, with another column identifying their original category).

  • Scenario: Essential for making data tidy when multiple columns represent different measures of the same variable (e.g., "Civil Liberties Score" and "Political Rights Score" are both measures of "Freedom"). Also crucial for ggplot2 which requires tidy data.

  • pivot_longer() Arguments:

    • cols: The columns to pivot from wide to long (e.g., CL, PR).

    • names_to: The new column name that will hold the original column names (e.g., "category").

    • values_to: The new column name that will hold the values from the pivoted columns (e.g., "score").

  • Example: Freedom House Data

    • Data Set: Freedom House measures of political liberties (PR) and civil rights (CL) worldwide, along with country, year, status (Not Free, Partially Free, Free), and region.

    • Initial Data Structure: country, year, CL (civil liberties score), PR (political rights score).

    • Problem: CL and PR are two separate columns, but conceptually, they are both "scores" of "freedom." For ggplot2 to easily compare them, they need to be in a single score column, with a category column to differentiate between CL and PR.

    • Applying pivot_longer():

      • Code: pivot_longer(cols = c(CL, PR), names_to = "category", values_to = "score").

      • Outcome: The CL and PR columns are collapsed. A new category column contains "CL" or "PR", and a new score column contains the corresponding numerical values. Now, each row represents a country-year-category observation.

    • Visualizing Iraq's Freedom Scores (Long Data Example):

      • ggplot() setup:

        • aes(x = year, y = score, color = category, linetype = category).

        • geom_line() or geom_step().

        • scale_y_reverse(breaks = 1:7) (or similar, adjusting for scale where 11 is most free and 77 is least free).

        • Informative labs().

        • geom_step() vs. geom_line(): geom_step() creates a stair-step plot, useful for showing distinct jumps between categories, while geom_line() implies a more continuous transition.

      • Observation: Iraq's freedom scores were very low in the late 90s/early 2000s, with a recent improvement, but still on the lower end.

    • Comparing Countries (e.g., Norway vs. USA):

      • Filter the long data for specific countries (e.g., Norway_USA_long <- filter(freedom_long, country %in% c("Norway", "United States"))).

      • Use facet_wrap(~country) in ggplot2 to compare trends side-by-side.

      • Using both color and linetype in aes() helps differentiate overlapping trends (e.g., CL vs. PR for each country).

    • Aggregating and Visualizing Worldwide Trends:

      • Group by year, then summarise() mean(CL) and mean(PR).

      • Pivot these aggregated average scores long.

      • Plot global average CL and PR over time using geom_step().

      • scale_y_continuous(limits = c(3, 4)) can be used to zoom into a specific range of the y-axis to observe subtle trends more clearly.

        • Observation: More variability observed in civil liberties (CLCL) compared to political rights (PRPR) worldwide. An uptick in political rights was noted during late 90s/early 2000s democratization, but less pronounced than civil liberties.

    • Analyzing Regional Trends:

      • Group by year and region.

      • Summarize average CL and PR scores for each region-year.

      • Pivot these regional averages long.

      • Plot using ggplot2 with color = region and linetype = category.

        • Observation: Europe, Americas, and Oceania show stable, high freedom levels. Africa and Asia exhibit a "wave of democratization" in the late 90s/early 2000s (expanding liberties/rights), followed by an ebbing in the mid-2000s.

  • Power of ggplot2 with Tidy Data:

    • Can visualize multiple variables simultaneously (e.g., score, time, region, right type) using different aesthetic components (x, y, color, linetype, facet_wrap).

    • Complex visualizations can tell clear and informative stories.

General Advice and Best Practices
  • Practice Incrementally: Build code step-by-step, checking outputs at each stage to ease troubleshooting.

  • Substantive Understanding First: Before writing code, articulate in words what you are trying to achieve with the data. "What do I have? Where do I want to get to?" Translate the substantive goal into R syntax.

  • Pivoting Practice: pivot_longer() will be used much more frequently in day-to-day work than pivot_wider(). Practice both, and try pivoting data long then back to wide, and vice-versa, to understand the reversibility and structure changes.

  • ggplot2 Aesthetics: Explore alpha (transparency), fill (inner color of bars/columns, distinct from color for perimeter). There's very little you cannot accomplish with ggplot2 aesthetics.

  • Learning Curve for Pivoting: Mastering pivoting takes time and practice. Expect initial confusion, but with repeated application, it becomes second nature.

  • Reading Different File Types:

    • .csv files are native to readr (part of tidyverse).

    • readxl package for Excel files (.xlsx or .xls), e.g., readxl::read_xlsx("filepath.xlsx", sheet = "Sheet1").

    • haven package for statistical software files (e.g., SPSS, SAS, Stata), e.g., haven::read_spss("filepath.sav").

    • jsonlite package for JSON files, e.g., jsonlite::fromJSON("filepath.json").

    • Specialized libraries exist for virtually any file format (e.g., map for shapefiles).

  • GitHub: Useful for project management and replication, though personal workflows may vary. Python tools often integrate well with GitHub.