Data Transformation: Pivoting Data (Long to Wide, Wide to Long)
Data Transformation: Pivoting Data
Tidy Data Principles
Definition: A tidy data set is structured where:
Every variable is its own column.
Every observation is its own row.
Every cell contains a single piece of information (a single observation).
Why Tidy Data?:
Makes data more accessible and useful for analysis, especially with tools like
Tidyverseandggplot2.Untidy data examples:
Year as Columns: A data set with countries in the first column and subsequent columns representing years (e.g., 2000, 2001, 2002), with values for each country-year combination. This is not tidy because
yearis a variable but is spread across multiple columns.Multiple Observations per Cell: A data set where one cell contains multiple pieces of information, such as a list of hashtags in a single tweet observation. If a tweet is the unit of observation, and a cell contains multiple hashtags, it's untidy.
Unit of Observation: It is crucial to identify the core entity being analyzed.
Often, humans erroneously default to nouns (e.g., "countries") when the actual unit of observation is a measure over time (e.g., "observations of a country" or "country-year").
For example, measuring the United States, Uganda, or Thailand over time means the unit of observation is the specific country at a specific point in time.
Pivoting from Long to Wide Data (pivot_wider())
Purpose: To transform data from a 'long' format (where variables are stacked in rows) to a 'wide' format (where variables become new columns).
Scenario: Useful for certain visualizations, tables, or when specific variables need to be compared side-by-side as distinct columns.
pivot_wider()Arguments:id_cols: The variable(s) that should remain as ID columns (identifiers) in the new wide format.names_from: The column whose values will become the new column names.values_from: The column whose values will populate the cells in the new columns.names_prefix: A text string to add before the new column names for clarity.
Example: Billboard Hot Chart Data
Data Set: Top weekly Billboard chart songs from to , with over observations and variables.
Initial Analysis Steps:
Loading
tidyverse.Sorting data (e.g.,
arrange(week_ID)).Summarizing a variable: Finding the song on the chart for the longest duration using
group_by(song, performer),summarise(weeks_on_chart = n()),arrange(desc(weeks_on_chart)), andslice_head(n = 10).Result: "Radioactive" by Imagine Dragons was on the chart for weeks.
New Function:
unite():Purpose: Creates a new variable by combining existing text variables.
Syntax:
unite(new_var_name, var1, var2, ..., sep = "-").Example:
hot_100 %>% unite(entry, performer, song, sep = "-")combinesperformerandsonginto a newentryvariable, separated by a hyphen. This can create a single identifier for the unit of observation (song + performer).
Filtering for Top Songs:
filter(week_position <= 3)to select only the top songs each week.Applying
pivot_wider()to Billboard Data:Objective: To have columns for
song_entry_1,song_entry_2,song_entry_3(representing the top songs each week) and theweekas theid_cols.Code:
pivot_wider(id_cols = week_ID, names_from = week_position, values_from = entry, names_prefix = "song_entry_").Outcome: The data is reshaped so each row represents a
week, and columnssong_entry_1,song_entry_2,song_entry_3contain the performer-song string for the top positions, respectively.
Troubleshooting Tip: Execute each step of data manipulation individually and check the output to identify errors systematically.
Fun Example: Finding the Top Song on a Birthday:
Conceptual Steps:
Convert the
week_IDcolumn to date objects.Create a date object for the birthday.
Find all unique
week_IDdates.Calculate the absolute difference in days between the birthday and each unique week date: .
Determine the minimum of these absolute differences.
Use the
which()command to find the index (position) of the week that has this minimum difference.Filter the original data set by this closest
week_ID.Arrange by
week_positionand selectsong,performer,week_position.
Example Result: For November , Meat Loaf's "I'd Do Anything for Love, but I Won't Do That" was the top song.
Exploratory Data Analysis with ggplot2 (Wide Data Example)
Visualizing "Radioactive" Performance:
Converting
week_IDto a date format usingmutate(date = as.Date(week_ID)).ggplot()setup:aes(x = date, y = week_position).geom_line(color = "green", linewidth = 1)to show trend.geom_point(shape = 'R', size = 3)for specific points.scale_y_reverse()since lower rank numbers (e.g., ) are 'better'.scale_x_date(date_breaks = "2 months", date_labels = "%b %Y")for x-axis formatting.theme_bw()and informativelabs().
Adding Annotations:
To highlight an event (e.g., SNL appearance for Imagine Dragons in February ).
geom_vline(xintercept = as.Date("2014-02-01"), color = "blue").annotate("text", x = as.Date("2014-02-01"), y = 50, label = "SNL Appearance").Implication: Being on SNL can cause a short-term spike in chart position and offers research questions about the average impact of such appearances.
Pivoting from Wide to Long Data (pivot_longer())
Purpose: To transform data from a 'wide' format (where related variables are in separate columns) to a 'long' format (where those variables are stacked into a single column, with another column identifying their original category).
Scenario: Essential for making data tidy when multiple columns represent different measures of the same variable (e.g., "Civil Liberties Score" and "Political Rights Score" are both measures of "Freedom"). Also crucial for
ggplot2which requires tidy data.pivot_longer()Arguments:cols: The columns to pivot from wide to long (e.g.,CL,PR).names_to: The new column name that will hold the original column names (e.g., "category").values_to: The new column name that will hold the values from the pivoted columns (e.g., "score").
Example: Freedom House Data
Data Set: Freedom House measures of political liberties (
PR) and civil rights (CL) worldwide, along withcountry,year,status(Not Free, Partially Free, Free), andregion.Initial Data Structure:
country,year,CL(civil liberties score),PR(political rights score).Problem:
CLandPRare two separate columns, but conceptually, they are both "scores" of "freedom." Forggplot2to easily compare them, they need to be in a singlescorecolumn, with acategorycolumn to differentiate betweenCLandPR.Applying
pivot_longer():Code:
pivot_longer(cols = c(CL, PR), names_to = "category", values_to = "score").Outcome: The
CLandPRcolumns are collapsed. A newcategorycolumn contains "CL" or "PR", and a newscorecolumn contains the corresponding numerical values. Now, each row represents acountry-year-categoryobservation.
Visualizing Iraq's Freedom Scores (Long Data Example):
ggplot()setup:aes(x = year, y = score, color = category, linetype = category).geom_line()orgeom_step().scale_y_reverse(breaks = 1:7)(or similar, adjusting for scale where is most free and is least free).Informative
labs().geom_step()vs.geom_line():geom_step()creates a stair-step plot, useful for showing distinct jumps between categories, whilegeom_line()implies a more continuous transition.
Observation: Iraq's freedom scores were very low in the late 90s/early 2000s, with a recent improvement, but still on the lower end.
Comparing Countries (e.g., Norway vs. USA):
Filter the long data for specific countries (e.g.,
Norway_USA_long <- filter(freedom_long, country %in% c("Norway", "United States"))).Use
facet_wrap(~country)inggplot2to compare trends side-by-side.Using both
colorandlinetypeinaes()helps differentiate overlapping trends (e.g.,CLvs.PRfor each country).
Aggregating and Visualizing Worldwide Trends:
Group by
year, thensummarise()mean(CL)andmean(PR).Pivot these aggregated average scores long.
Plot global average
CLandPRover time usinggeom_step().scale_y_continuous(limits = c(3, 4))can be used to zoom into a specific range of the y-axis to observe subtle trends more clearly.Observation: More variability observed in civil liberties () compared to political rights () worldwide. An uptick in political rights was noted during late 90s/early 2000s democratization, but less pronounced than civil liberties.
Analyzing Regional Trends:
Group by
yearandregion.Summarize average
CLandPRscores for each region-year.Pivot these regional averages long.
Plot using
ggplot2withcolor = regionandlinetype = category.Observation: Europe, Americas, and Oceania show stable, high freedom levels. Africa and Asia exhibit a "wave of democratization" in the late 90s/early 2000s (expanding liberties/rights), followed by an ebbing in the mid-2000s.
Power of
ggplot2with Tidy Data:Can visualize multiple variables simultaneously (e.g., score, time, region, right type) using different aesthetic components (
x,y,color,linetype,facet_wrap).Complex visualizations can tell clear and informative stories.
General Advice and Best Practices
Practice Incrementally: Build code step-by-step, checking outputs at each stage to ease troubleshooting.
Substantive Understanding First: Before writing code, articulate in words what you are trying to achieve with the data. "What do I have? Where do I want to get to?" Translate the substantive goal into R syntax.
Pivoting Practice:
pivot_longer()will be used much more frequently in day-to-day work thanpivot_wider(). Practice both, and try pivoting data long then back to wide, and vice-versa, to understand the reversibility and structure changes.ggplot2Aesthetics: Explorealpha(transparency),fill(inner color of bars/columns, distinct fromcolorfor perimeter). There's very little you cannot accomplish withggplot2aesthetics.Learning Curve for Pivoting: Mastering pivoting takes time and practice. Expect initial confusion, but with repeated application, it becomes second nature.
Reading Different File Types:
.csvfiles are native toreadr(part oftidyverse).readxlpackage for Excel files (.xlsxor.xls), e.g.,readxl::read_xlsx("filepath.xlsx", sheet = "Sheet1").havenpackage for statistical software files (e.g., SPSS, SAS, Stata), e.g.,haven::read_spss("filepath.sav").jsonlitepackage for JSON files, e.g.,jsonlite::fromJSON("filepath.json").Specialized libraries exist for virtually any file format (e.g.,
mapfor shapefiles).
GitHub: Useful for project management and replication, though personal workflows may vary. Python tools often integrate well with GitHub.