ECO376/511 Business and Economic Forecasting using Time Series Analysis - Week 1 Notes
Time Series Analysis in R
- Definition: Time series analysis is a statistical technique for analyzing data points recorded at regular intervals over time.
- Key Aspect: Time is a critical component, differentiating it from general data mining and machine learning.
- Considerations: Time series methods account for internal structures like autocorrelation, trends, and seasonal variations.
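The internal structures named above can be seen in a small simulated example. This is a minimal base R sketch; the series and the names `sim_ts`, `trend`, `seasonal`, and `noise` are made up for illustration and are not course data.

```r
set.seed(1)
n <- 120                                    # 10 years of monthly observations
trend    <- 0.5 * (1:n)                     # steady upward trend
seasonal <- 10 * sin(2 * pi * (1:n) / 12)   # cycle that repeats every 12 months
noise    <- rnorm(n, sd = 2)                # random variation
sim_ts   <- ts(trend + seasonal + noise, start = c(2010, 1), frequency = 12)

# Autocorrelation: correlation of the series with lagged copies of itself
acf_out <- acf(sim_ts, lag.max = 24, plot = FALSE)
lag1 <- acf_out$acf[2]   # element 1 is lag 0, element 2 is lag 1
print(lag1)

# Classical decomposition into trend, seasonal, and random components
parts <- decompose(sim_ts)
plot(parts)
```

A lag-1 autocorrelation close to 1 is typical for a trending series like this one.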
Why Use R for Time Series Analysis?
- Power: R is a widely used language for data analysis and statistical computing, developed in the early 90s.
- Evolution: Its tooling has evolved from basic text editors to interactive environments like RStudio and Jupyter Notebooks.
- Graphics: R has powerful graphics capabilities, both in base R and through packages such as ggplot2.
Advantages of Using R for Time Series Analysis
- Open Source: R is open-source, flexible, and customizable, allowing users to write, use, and modify code.
- Statistical Capabilities: It offers advanced statistical capabilities and models.
- Community and Resources: R has a large community, providing ample learning resources and support.
- Integration: It integrates well with languages like C++, Java, and Python.
- Object-Oriented: Everything in R is treated as an object.
- Parallel Computing: R supports parallel computing, utilizing multiple processors for tasks.
How to Get R and RStudio
- Installation: Install R first, followed by RStudio.
- R Installation:
- Download the R installer from CRAN (The Comprehensive R Archive Network).
- Ensure it is version 4.2.0 or later.
- Link: https://cran.r-project.org/
- RStudio Installation:
- Download RStudio from https://rstudio.com/products/rstudio/download/ after installing R.
Library in R
- Definition: A library in R is a directory where packages are stored.
- Package: A package includes R functions, data, and compiled code.
- Usage: Libraries extend the capabilities of R programs, making data analysis easier.
- Example: Using the ggplot2 library for data visualization:
- Install the package:
install.packages("ggplot2")
- Load the package:
library(ggplot2)
- Required Libraries for Class:
dplyr, ggplot2, plotly, zoo, xts, stats, gtrendsR, quantmod, lubridate, gapminder
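One way to check which of these packages still need installing is sketched below. The helper name `missing_packages` is hypothetical, and the install line is left commented out because it needs an internet connection.

```r
# Packages listed in the notes for this class
required <- c("dplyr", "ggplot2", "plotly", "zoo", "xts", "stats",
              "gtrendsR", "quantmod", "lubridate", "gapminder")

# Hypothetical helper: return the packages that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
print(to_install)

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```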
R Data Structure
- Supported Structures: R supports vectors, matrices, lists, factors, and data frames.
- Data Types: These structures can hold numeric, character, logical, and complex data types.
Vectors
- Definition: Simplest data structure in R with homogeneous elements.
- Creation: Created using the c() function
v0 <- rep(0, 5)
v1 <- 1:5
v1 <- seq(1, 5, by = 1)
v1 <- c(1L, 2L, 3L, 4L, 5L)
v2 <- c(1, 2.1, 3.5, 4, 50)
v3 <- c("a", "b", "c", "d", "e")
v4 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
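Because a vector holds elements of a single type, R silently converts mixed inputs to a common type. A short sketch of that behavior (the variable names are illustrative only):

```r
# Mixing numbers, strings, and logicals: everything becomes character
mixed_chr <- c(1, "a", TRUE)
print(mixed_chr)
print(class(mixed_chr))

# Mixing numbers and logicals: TRUE/FALSE become 1/0
mixed_num <- c(1.5, TRUE, FALSE)
print(mixed_num)
print(class(mixed_num))
```

Checking `class()` after building a vector is a quick way to catch an unintended conversion.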
Matrices
- Definition: Two-dimensional arrays storing data of a single type.
- Creation: Created using the matrix() function
m1 <- matrix(1:9, nrow = 3, ncol = 3)
Lists
- Definition: Can contain elements of different types, including vectors, matrices, and other lists.
list1 <- list(v1,v2,v3,v4)
Data Frames
- Definition: Used to store tabular data with different columns holding different data types.
- Creation and Usage:
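df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 32, 37), Married = c(TRUE, FALSE, TRUE))
# Read a CSV file into a data frame (this overwrites df; the path is specific to the course server)
df <- read.csv("/var/www/html/jlee141/econdata/eco520/chicago_cca.csv")
# Build a data frame from the vectors v1 to v4 created in the Vectors section
dat1 <- data.frame(v1, v2, v3, v4)
dat1$squared <- dat1$v2^2   # assumed definition: the notes select 'squared' below without creating it
dat1$cat1 <- ifelse(dat1$v2 > 20, 1, 0)
dat1$cat2 <- ifelse(dat1$v4 == TRUE, 1, 0)
dat2 <- dat1[1:3, ]                 # first three rows
dat3 <- dat1[, 1:3]                 # first three columns
dat4 <- dat1[which(v4 == TRUE), ]   # rows where v4 is TRUE
dat5 <- subset(dat1, v4 == TRUE)    # same filter using subset()
dat6 <- subset(dat1, v4 == TRUE, select = c(v1, v2, squared, cat1:cat2))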
Time Series Data Types
- Definition: Sequences of values collected over time intervals.
- Handling in R: Specific data structures for efficient handling.
- ts function: Creates time-series objects from numeric vectors, specifying start time, end time, and frequency.
- Frequency examples: 1 (annual), 12 (monthly), 4 (quarterly), 52 (weekly).
- Examples:
- Annual:
my_ts <- ts(data = c(1:24), start = 2001, end = 2024, frequency = 1)
- Quarterly:
my_ts <- ts(data = c(1:24), start = 2001, frequency = 4)
- Monthly:
my_ts <- ts(data = c(1:24), start = 2001, frequency = 12)
- Weekly:
my_ts <- ts(data = c(1:24), start = 2001, frequency = 52)
- Entered:
data <- c(10, 11, 5, 10, 15, 12, 30, 18, 25, 5, 20, 15)
my_ts <- ts(data, start = c(2001, 1), end = c(2003, 4), frequency = 4)
- Printing Time Series Data: Using the ts function with specified frequency and start time. Wrapping the call in print(..., calendar = TRUE) prints the series in a calendar-style layout.
ts(1:10, frequency = 4, start = c(1959, 2))
print(ts(1:10, frequency = 7, start = c(12, 2)), calendar = TRUE)
print(ts(1:48, frequency = 24, start = c(110, 6)), calendar = TRUE)
- zoo: For ordered indexed series.
- Requires: library(zoo)
- Example:
dates <- as.Date("2020-01-01") + 0:23   # 24 daily dates, one per observation
zoo_ts <- zoo(1:24, dates)
- xts: Extends zoo for financial time series.
- Requires: library(xts)
- Example:
xts_ts <- xts(1:24, order.by = dates)
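Once a ts object exists, base R offers helper functions to inspect its time attributes and extract a sub-period. The sketch below reuses the quarterly numbers from the "Entered" example; the names `q_ts` and `sub_ts` are illustrative.

```r
q_ts <- ts(c(10, 11, 5, 10, 15, 12, 30, 18, 25, 5, 20, 15),
           start = c(2001, 1), frequency = 4)

print(frequency(q_ts))   # observations per year
print(start(q_ts))       # first period as c(year, quarter)
print(end(q_ts))         # last period as c(year, quarter)
print(time(q_ts))        # time stamps as decimal years
print(cycle(q_ts))       # position within the year (quarter number)

# Extract the four quarters of 2002
sub_ts <- window(q_ts, start = c(2002, 1), end = c(2002, 4))
print(sub_ts)
```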
Time Series Data Conversion
- Convert among data frame, ts, and xts objects. Requires library(lubridate) and library(xts).
ID <- 1:12
NAME <- LETTERS[1:12]
NUM <- rnorm(12)
date <- seq(as.Date("2020-01-01"), length.out = 12, by = "months")
year <- year(date)
month <- month(date)
- Convert to data frame
df <- data.frame(ID,date,year,month,NAME,NUM)
- Data frame to ts format
df_ts <- ts(df,start=c(2020,1), frequency = 12)
- ts to xts format
ts_xts <- as.xts(df_ts)
plot(ts_xts)
plot(ts_xts$NUM)
- data frame to xts
df_xts <- xts(df,order.by = date)
- xts to ts
xts_ts <- ts(df_xts,start=c(2020,1),frequency = 12)
- xts to dataframe
xts_df <- as.data.frame(df_xts)
xts_df$date <- as.Date(xts_df$date)
xts_df$year <- year(xts_df$date)
xts_df$NUM <- as.numeric(xts_df$NUM)
Time Series Data Frequency Changes
- XTS data
- Requires:
library(quantmod)
- Daily Data Load
getSymbols("AAPL", src = "yahoo", from = "2020-01-01", to = Sys.Date())
plot(Cl(AAPL)) # Closing prices
- Weekly Data Conversion (indexAt options include "firstof", "lastof", "startof", and "endof")
- Last day of the week
AAPL_weekly <- to.weekly(AAPL, indexAt = "endof", OHLC = FALSE)
plot(Cl(AAPL_weekly)) # Closing prices
- Monthly Data Conversion
- First day of the month
AAPL_monthly <- to.monthly(AAPL, indexAt = "firstof", OHLC = FALSE)
plot(Cl(AAPL_monthly)) # Closing prices
- Quarterly Data Conversion
AAPL_quarterly <- to.quarterly(AAPL, indexAt = "firstof", OHLC = FALSE)
plot(Cl(AAPL_quarterly)) # Closing prices
- Yearly Data Conversion
AAPL_yearly <- to.yearly(AAPL, indexAt = "firstof", OHLC = FALSE)
plot(Cl(AAPL_yearly)) # Closing prices
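For a regular ts object, a similar change of frequency can be sketched in base R with aggregate(). This is an analogue of the quantmod conversions rather than the same functions, and the monthly series below is made up for illustration.

```r
# Two years of made-up monthly data
monthly <- ts(1:24, start = c(2001, 1), frequency = 12)

# Monthly to quarterly: total and average within each quarter
quarterly_sum  <- aggregate(monthly, nfrequency = 4, FUN = sum)
quarterly_mean <- aggregate(monthly, nfrequency = 4, FUN = mean)

# Monthly to quarterly, keeping the last observation of each quarter
quarterly_last <- aggregate(monthly, nfrequency = 4,
                            FUN = function(x) x[length(x)])

# Monthly to yearly totals
yearly_sum <- aggregate(monthly, nfrequency = 1, FUN = sum)

print(quarterly_sum)
print(quarterly_last)
print(yearly_sum)
```

Whether to sum, average, or take the last value depends on the variable: flows such as sales are usually summed, while prices or index levels are usually averaged or sampled at the period end.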
Lag, Forward, Differencing, and Percentage Change Operations
- Lag Operation:
- Shifts data points to the next periods, delaying them.
- Useful for comparing current values against past values.
- Requires library(dplyr). Base R's stats::lag() behaves differently: it shifts the time index of a ts object rather than the values of a plain vector.
vector <- rnorm(12)
lag_vector <- dplyr::lag(vector, n = 1)
- Forward (Lead) Operation:
- Shifts data points to previous periods, so each position holds the next period's value.
lead_vector <- dplyr::lead(vector, n = 1)
- Differencing:
- Often used to remove a trend and help make a non-stationary time series stationary.
- Subtracts the previous observation from the current observation.
diff_vector <- vector - dplyr::lag(vector, n = 1)
- Percentage Change:
- Measures the relative change between two numbers as a percentage of the first number.
perc_change <- ((vector - dplyr::lag(vector)) / dplyr::lag(vector)) * 100
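The four operations can be traced by hand on a short vector. The sketch below uses base R only, building the lag and lead with head() and tail() instead of dplyr, and the numbers are made up so the results are easy to verify.

```r
x <- c(100, 110, 121, 108.9)

lag1  <- c(NA, head(x, -1))         # previous value: NA 100 110 121
lead1 <- c(tail(x, -1), NA)         # next value: 110 121 108.9 NA
diff1 <- x - lag1                   # change from the previous value
pct   <- (x - lag1) / lag1 * 100    # percentage change from the previous value

print(data.frame(x, lag1, lead1, diff1, pct))

# diff() gives the same differences without the leading NA
print(diff(x))
```

The first lagged value is NA because there is no observation before the first one, so differences and percentage changes always lose one observation.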
Sources of Time Series Data
- Google Trends
- FRED
- Yahoo Finance
- M Forecasting competitions
- Web APIs (e.g., Quandl)
- Time series data repositories (UCI Machine Learning Repository; UEA and UCR Time Series Classification Repository)
- Data sets in R packages (e.g., tscompdata)
- Sensor data from smart cities, medical devices
- Signals for radio, music, medical devices, speech, radars
- Countries’ organizations (Eurostat, OECD, NOAA)
Time Series Data Example: Google Trends
- Google Trends data offers insights into search interest trends over time.
- Requires: the gtrendsR package, which provides an R interface to Google Trends.
- Install:
install.packages("gtrendsR")
- Load:
library(gtrendsR)
- Fetch trends data for specific keywords.
- Single Keyword Example
trends <- gtrends(c("Data Science"))
plot(trends)
- Multiple Keywords Example
trends <- gtrends(c("Data Science", "Machine Learning"))
plot(trends)
- Geographical and Time Specifications
trends <- gtrends(c("Data Science"), geo = "US", time = "2016-01-01 2021-12-31")
plot(trends)
Fetching Data from FRED
- You can use the getSymbols function.
- Example: Unemployment rate. Requires library(quantmod)
getSymbols("UNRATE", src = "FRED")                       # loads UNRATE as an xts object
getSymbols("UNRATE", src = "FRED", return.class = 'ts')  # reloads UNRATE as a ts object
plot(UNRATE)
- Load Multiple Series with plots
getSymbols(c("CHXRNSA","NYXRNSA","LVXRNSA"), src = "FRED")
- Combine Multiple Series in a xts data
Home_Index <- cbind(CHXRNSA,NYXRNSA,LVXRNSA)
- Subset, Lag, Monthly, and Annual Changes for xts data
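Home_Covid <- window(Home_Index, start = "2019-01-01", end = "2024-03-01")
Lag_Home <- stats::lag(Home_Covid)   # stats:: avoids masking by dplyr::lag when dplyr is loaded
Home_Rate_m <- diff(Home_Covid, lag = 1, differences = 1)    # month-over-month change
plot(Home_Rate_m)
Home_Rate_y <- diff(Home_Covid, lag = 12, differences = 1)   # year-over-year change
plot(Home_Rate_y)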
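The monthly and annual changes can be illustrated without downloading data. The sketch below uses a made-up monthly index that grows 0.5% per month (not Case-Shiller data) and a base R ts object instead of xts.

```r
# Made-up monthly index: 36 months, growing 0.5% per month
idx <- ts(100 * 1.005^(0:35), start = c(2019, 1), frequency = 12)

m_change <- diff(idx, lag = 1)    # change from the previous month
y_change <- diff(idx, lag = 12)   # change from the same month one year earlier

# Year-over-year percentage change: stats::lag(idx, -12) moves each value
# one year later, and ts arithmetic aligns the two series on common dates
yoy_pct <- 100 * (idx / stats::lag(idx, -12) - 1)

print(head(m_change))
print(head(y_change))
print(yoy_pct)
```

With lag = 12 the first twelve months have no year-earlier value, so the resulting series starts in January 2020.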
Fetching Data from Yahoo Finance
- Use quantmod to fetch historical daily market data.
- Fetching Apple Inc. Data requires library(quantmod)
getSymbols("AAPL", src = "yahoo", from = "2020-01-01", to = "2024-12-31")
apple_data <- AAPL
chartSeries(apple_data, type = "line", theme = chartTheme("white"))
chartSeries(apple_data, type = "line", TA = NULL, theme = chartTheme("white"))
- Conversion to different frequency
VoWeek <- apply.weekly(Vo(AAPL), sum)         # sum from Monday to Friday
VoMonth <- apply.monthly(Vo(AAPL), sum)       # sum to month
VoQuarter <- apply.quarterly(Vo(AAPL), sum)   # sum to quarter
VoYear <- apply.yearly(Vo(AAPL), sum)         # sum to year
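The idea behind these period sums can be sketched in base R by grouping daily values on a period label. The daily volumes below are made up (a constant 10 per day) so the totals are easy to check; this is an illustration, not the quantmod implementation.

```r
# Made-up daily data for the first quarter of 2024
dates  <- seq(as.Date("2024-01-01"), as.Date("2024-03-31"), by = "day")
volume <- rep(10, length(dates))

# Label each day with its month and quarter, then sum within each label
month_key   <- format(dates, "%Y-%m")
quarter_key <- paste(format(dates, "%Y"), quarters(dates))

monthly_total   <- tapply(volume, month_key, sum)
quarterly_total <- tapply(volume, quarter_key, sum)

print(monthly_total)
print(quarterly_total)
```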
Time Series Data Cleaning
- Like all data, time series data must be inspected and cleaned
- Is there a time stamp?
- Is the time stamp in order?
- Are the observations recorded monthly, annually, daily, hourly, weekly..?
- Are the data recorded at equally spaced intervals?
- Are there breaks in the time stamp?
- Are there missing values?
- Are there extraneous numerical codes that are not really data but codes representing something else?
- Are the measurements intertwined with summaries of the data, or some other data processing that is not really an observation?
- Are the numbers formatted with commas in them?
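Several of these checks can be scripted. The sketch below runs on a small made-up data frame; the column names and the use of -999 as a "not available" code are assumptions for illustration only.

```r
# Made-up raw data: out-of-order dates, a missing month,
# numbers with commas, and -999 used as a code for "not available"
raw <- data.frame(
  date  = c("2023-01-01", "2023-02-01", "2023-04-01", "2023-03-01", "2023-06-01"),
  value = c("1,200", "1,350", "-999", "1,410", "1,500"),
  stringsAsFactors = FALSE
)
raw$date <- as.Date(raw$date)

# Is the time stamp in order? If not, sort by date
in_order <- all(diff(raw$date) > 0)
raw <- raw[order(raw$date), ]

# Remove commas and convert to numeric
raw$value <- as.numeric(gsub(",", "", raw$value))

# Replace the assumed sentinel code with NA
raw$value[raw$value == -999] <- NA

# Are there breaks in the time stamp? Compare with a complete monthly sequence
expected      <- seq(min(raw$date), max(raw$date), by = "month")
missing_dates <- expected[!expected %in% raw$date]

print(in_order)
print(missing_dates)
print(raw)
```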
Time Series with Missing Values
- Missing values are relatively easy to deal with.
- It’s best if you can find out why the values are missing in the first place, but if you can’t, there are various statistical methods available for imputing them.
- With missing values in time series datasets, you usually have the date column fully populated, and the value field is set to NA.
- Example requires: library(lubridate) and library(zoo)
df <- data.frame(
  date = seq(ymd("20230101"), ymd("20231231"), by = "months"),
  value = c(145, 212, NA, 265, 299, 345, NA, NA, 278, 256, 202, 176)
)
- Mean value imputation - the missing values will be replaced with a simple average
mean_value <- mean(df$value, na.rm = TRUE)
df$v_mean <- ifelse(is.na(df$value), mean_value, df$value)
- Forward fill - the missing value at the point T is filled with a non-missing value at T-1
df$v_ffill <- na.locf(df$value, na.rm = FALSE)
- Backward fill - the missing value at the point T is filled with a non-missing value at T+1
df$v_bfill <- na.locf(df$value, fromLast = TRUE, na.rm = FALSE)
- Linear interpolation - the missing value at the point T is filled along a straight line between the nearest non-missing values before and after it (for a single gap, this is the average of the values at T-1 and T+1)
df$v_interpolated <- na.approx(df$value, na.rm = FALSE)
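To see what linear interpolation does with a single gap versus two consecutive gaps, the same values can be interpolated with base R's approx(). This is a sketch using the example values above; `pos`, `ok`, and `interp` are illustrative names.

```r
value <- c(145, 212, NA, 265, 299, 345, NA, NA, 278, 256, 202, 176)
pos   <- seq_along(value)   # position of each observation
ok    <- !is.na(value)      # which observations are present

# Interpolate linearly between the observed points at every position
interp <- approx(x = pos[ok], y = value[ok], xout = pos)$y

print(interp[3])     # single gap: midpoint of 212 and 265
print(interp[7:8])   # double gap: evenly spaced steps from 345 down to 278
```

For the double gap the filled values are one third and two thirds of the way from 345 to 278, not a simple average of the two neighbors.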
Time Series Visualization
Time Series Graphs
- plot, plot.ts
- Creating plots in R is a fundamental aspect of data analysis, allowing for visual exploration of data sets and the communication of data insights.
- The base plotting system in R is simple and direct, making it easy to create basic graphs using the "plot" or "plot.ts" command:
- Requires: library(quantmod)
- Simple Plots
getSymbols("UNRATE", src = "FRED", return.class = 'ts')
plot(UNRATE)
- Use plot.ts to create a time series plot with a custom color
plot.ts(UNRATE, main = "Monthly Unemployment Rate", ylab = "Unemployment Rate", col = "blue")
- Add text at a specific location with a custom color
text(x = c(2012,2020), y = c(6,10), labels = c("Bottom", "Covid19"), col = c("red", "red"))
- Multiple graphs on xts by combining multiple xts data
Home_Index <- cbind(CHXRNSA, NYXRNSA, LVXRNSA)
plot(Home_Index)
plot(Home_Index$CHXRNSA)
plot(CHXRNSA, col = "red", main = "Case-Shiller Home Price Index \n Chicago, New York, Las Vegas")
lines(NYXRNSA, col = "blue")
lines(LVXRNSA, col = "green")
Time Series Graphs: chartSeries
- Example: Graphs for Apple Inc. Data requires library(quantmod)
- Some subsets below reach back to 2007, so they assume AAPL (and IBM for the last example) were loaded from 2007 onward, for example with getSymbols(c("AAPL", "IBM"), src = "yahoo", from = "2007-01-01").
- Subset
chartSeries(AAPL, subset = '2020-05::2024-01', theme = chartTheme('white'))
chartSeries(AAPL, type = "line", subset = '2023', theme = chartTheme('white'))
- Barchart
chartSeries(AAPL, type = "bar", subset = '2024-01', theme = chartTheme('white'))
- Candle Sticks
chartSeries(AAPL, type = "candlesticks", subset = '2007-01', theme = chartTheme('white'))
- Add Line; Simple Moving Average (SMA), Exponential MA (EMA)
chartSeries(AAPL, subset = '2019-05::2023-01', theme = chartTheme('white'))
addSMA(n = 5, on = 1, col = "red")
addMACD(fast = 12, slow = 26, signal = 9, type = "EMA")
addRSI(n = 30, maType = "EMA")
- Multiple Series in a Graph
chartSeries(Cl(AAPL), subset = '2007-05::2023-01', theme = chartTheme('white'))
addTA(Cl(IBM), on = 1, col = "blue", lty = "dashed")
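What the simple and exponential moving averages compute can be sketched in base R on a made-up price vector. The smoothing weight alpha = 2 / (n + 1) and the choice to seed the EMA with the mean of the first n prices are assumptions of this sketch; library implementations may differ in how they initialize.

```r
price <- c(10, 11, 12, 13, 14, 15, 16)   # made-up prices
n <- 3                                   # window length

# Simple moving average: mean of the current and previous n - 1 prices
# (stats::filter is called explicitly because dplyr also defines filter)
sma <- as.numeric(stats::filter(price, rep(1 / n, n), sides = 1))

# Exponential moving average: weighted blend of the new price and the old EMA
alpha <- 2 / (n + 1)
ema <- rep(NA_real_, length(price))
ema[n] <- mean(price[1:n])               # assumed starting value
for (i in (n + 1):length(price)) {
  ema[i] <- alpha * price[i] + (1 - alpha) * ema[i - 1]
}

print(data.frame(price, sma, ema))
```

The first n - 1 positions are NA for both averages because a full window of prices is not yet available.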
Dynamic and Interactive Graphs
- Time Series Dynamic Graph: dygraph requires library(dygraphs)
- Standard dynamic graph
- The function dygraph() displays time series data interactively. Move your mouse over the diagram to see the values.
- Shading
graph <- dygraph(Cl(AAPL), main = "AAPL")
dyShading(graph, from = "2020-03-20", to = "2022-12-11", color = "#FFE6E6")
- Event line
graph <- dygraph(OHLC(AAPL), main = "AAPL")
graph <- dyEvent(graph, "2007-6-29", "iphone", labelLoc = "bottom")
graph <- dyEvent(graph, "2010-5-6", "Flash Crash", labelLoc = "bottom")
graph <- dyEvent(graph, "2014-6-6", "Split", labelLoc = "bottom")
dyEvent(graph, "2011-10-5", "Jobs", labelLoc = "bottom")
- Candle Chart
AAPL_C <- tail(AAPL, n = 30)
graph <- dygraph(OHLC(AAPL_C))
dyCandlestick(graph)
Dynamic and Interactive Graphs: Life Expectancy vs. GDP Per Capita by Year and Country
- Time Series Interactive Graph Example:
- Requires: library(plotly), library(dplyr), library(gapminder)
data <- gapminder %>%
  select(country, year, lifeExp, gdpPercap, pop) %>%
  mutate(scaled_pop = pop / 5000000)   # Scale population (divide by 5 million) for use as marker size
- Interactive graph
figure <- plot_ly(
  data,
  x = ~gdpPercap,
  y = ~lifeExp,
  text = ~country,
  mode = 'markers',
  color = ~country,
  frame = ~year,
  ids = ~country,
  marker = list(
    size = ~scaled_pop,
    opacity = 0.7,
    line = list(color = 'rgba(0, 0, 0, 0.5)', width = 0.5)
  ),
  type = 'scatter',
  hoverinfo = 'text+x+y'
) %>%
  layout(
    title = 'Life Expectancy vs. GDP Per Capita by Year and Country',
    xaxis = list(type = 'log', title = 'GDP per Capita'),
    yaxis = list(title = 'Life Expectancy'),
    margin = list(l = 60, r = 50, b = 65, t = 90)
  )
figure
Graph using ggplot2
- Introduction to ggplot2
- ggplot2 is a powerful R package for creating graphics. It implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can create advanced plots in a consistent manner. This presentation will focus on its application to time series data.
- Simple Line Graph
- A simple line plot is the most common graph for time series data. We plot time on the x-axis and the variable of interest on the y-axis. The following code will produce a line plot with a minimalistic style.
- Make sure to load ggplot2 library
- Sample Data
ts_data <- data.frame(
Date = seq(as.Date("2000/1/1"), by = "month", length.out = 100),
Value = cumsum(runif(100, min = -10, max = 10))
)
- ggplot example
ggplot(ts_data, aes(x = Date, y = Value)) +
geom_line() +
theme_minimal()
Graph using ggplot2: Time Series with Multiple Groups
- Sometimes we need to compare multiple time series on the same plot. This can be done by mapping a categorical variable to color or linetype.
ts_data$Group <- rep(c("A", "B"), each = 50)
ggplot(ts_data, aes(x = Date, y = Value, color = Group)) +
geom_line() +
theme_minimal()
- This code adds an additional categorical variable 'Group' to differentiate lines.
Graph using ggplot2: Adding Points to Time Series
- To highlight individual data points, you can add points to your line plot. This is useful to mark outliers or specific events.
ggplot(ts_data, aes(x = Date, y = Value)) +
geom_line() +
geom_point(aes(color = Value > 0)) +
theme_minimal()
- Points are colored differently if the 'Value' is greater than zero.
Graph using ggplot2: Time Series with Facets
- Faceting creates a matrix of panels by one or more grouping variables. It allows us to compare several time series graphs side by side.
ggplot(ts_data, aes(x = Date, y = Value)) +
geom_line() +
facet_wrap(~ Group) + # Facets by group
theme_minimal()
- Each panel represents a different group’s time series.
Graph using ggplot2: Customizing Time Series Plots
- ggplot2 is highly customizable. You can adjust almost every element of a plot to suit your needs.
ggplot(ts_data, aes(x = Date, y = Value, color = Group)) +
geom_line() +
labs(title = "Customized Time Series Plot", x = "Time", y = "Value") +
scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
scale_color_manual(values = c("blue", "red")) +
theme_minimal()
- The labels, scales, and colors are customized.
Graph using ggplot2: Spaghetti Time Series Plots
- A spaghetti plot draws one line per year against the month, so seasonal patterns can be compared across years.
- Requires library(quantmod), library(dplyr), and library(ggplot2)
- Convert to data frame
getSymbols(c("CHXRNSA"), src = "FRED")
df <- data.frame(Date = index(CHXRNSA), Price = coredata(CHXRNSA))   # the value column keeps the name CHXRNSA, which the plot code uses
- Extract year and month
df$Year <- as.numeric(format(df$Date, "%Y"))
df$Month <- as.numeric(format(df$Date, "%m"))
- Create the spaghetti plot
df %>%
filter(Year > 2018) %>%
ggplot(aes(x = Month, y = CHXRNSA, group = Year, color = as.factor(Year))) +
geom_line() +
geom_point() + # Add this line to include dots
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(title = "Spaghetti Plots of Monthly Chicago Home Prices by Year", x = "Month", y = "Price Index") +
scale_color_discrete(name = "Year")
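The same year-by-month layout can be sketched in base R without downloading data. The series below is simulated (not the Chicago index), and reshaping it into a matrix with one column per year is one possible approach among several.

```r
set.seed(42)
# Simulated monthly index covering 2019 to 2022
idx <- ts(100 + cumsum(rnorm(48)), start = c(2019, 1), frequency = 12)

# Reshape: 12 rows (months) by 4 columns (years); matrix() fills column by column
by_year <- matrix(idx, nrow = 12)
colnames(by_year) <- 2019:2022

# One line per year, plotted against the month
matplot(1:12, by_year, type = "b", pch = 16, lty = 1, col = 1:4,
        xlab = "Month", ylab = "Index",
        main = "Spaghetti Plot of a Simulated Monthly Index by Year")
legend("topleft", legend = colnames(by_year), col = 1:4, lty = 1, pch = 16)
```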