ECO376/511 Business and Economic Forecasting using Time Series Analysis - Week 1 Notes

Time Series Analysis in R

  • Definition: Time series analysis is a statistical technique for analyzing data points recorded at regular intervals over time.
  • Key Aspect: Time is a critical component, differentiating it from general data mining and machine learning.
  • Considerations: Time series methods account for internal structures like autocorrelation, trends, and seasonal variations.
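
As a minimal sketch of these internal structures, the base R code below simulates a monthly series made of a linear trend, a 12-month seasonal pattern, and random noise, then inspects its autocorrelation. The data and the names trend, seasonal, and y are invented for illustration and are not course data.

```r
# Simulated monthly series: trend + seasonal pattern + noise (illustrative values)
set.seed(1)
n <- 120                                    # 10 years of monthly observations
trend <- 0.5 * (1:n)                        # steady upward trend
seasonal <- 10 * sin(2 * pi * (1:n) / 12)   # pattern repeating every 12 months
y <- ts(trend + seasonal + rnorm(n), start = c(2010, 1), frequency = 12)

# Autocorrelation at lags 0 to 24 months; plot = FALSE returns the numbers only
ac <- acf(y, lag.max = 24, plot = FALSE)
r <- as.numeric(ac$acf)
r[2]    # lag-1 autocorrelation (high here because of the trend)
r[13]   # lag-12 autocorrelation (observations one year apart)

# decompose() splits the series into trend, seasonal, and remainder components
parts <- decompose(y)
plot(parts)
```

A series with no time structure would show autocorrelations near zero at every lag above 0, which is why ordinary cross-sectional methods can mislead on data like y.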

Why Use R for Time Series Analysis?

  • Power: R is a widely used language for data analysis and statistical computing, developed in the early 90s.
  • Evolution: The tooling around R has evolved from basic text editors to interactive environments such as RStudio and Jupyter Notebooks.
  • Graphics: R has powerful, highly customizable graphics capabilities.

Advantages of Using R for Time Series Analysis

  • Open Source: R is open-source, flexible, and customizable, allowing users to write, use, and modify code.
  • Statistical Capabilities: It offers advanced statistical capabilities and models.
  • Community and Resources: R has a large community, providing ample learning resources and support.
  • Integration: It integrates well with languages like C++, Java, and Python.
  • Object-Oriented: Everything in R is treated as an object.
  • Parallel Computing: R supports parallel computing, utilizing multiple processors for tasks.

How to Get R and RStudio

  • Installation: Install R first, followed by RStudio.
  • R Installation:
    • Download the R installer from CRAN (The Comprehensive R Archive Network).
    • Ensure it is version 4.2.0 or later.
    • Link: https://cran.r-project.org/
  • RStudio Installation:
    • Download RStudio from https://rstudio.com/products/rstudio/download/ after installing R.

Library in R

  • Definition: A library in R is a directory where packages are stored.
  • Package: A package includes R functions, data, and compiled code.
  • Usage: Libraries extend the capabilities of R programs, making data analysis easier.
  • Example: Using the ggplot2 library for data visualization:
    • Install the package: install.packages("ggplot2")
    • Load the package: library(ggplot2)
  • Required Libraries for Class:
    • dplyr
    • ggplot2
    • plotly
    • zoo
    • xts
    • stats
    • gtrendsR
    • quantmod
    • lubridate
    • gapminder

R Data Structure

  • Supported Structures: R supports vectors, matrices, lists, factors, and data frames.
  • Data Types: These structures can hold numeric, character, logical, and complex data types.

Vectors

  • Definition: Simplest data structure in R with homogeneous elements.
  • Creation: Typically created using the c() function; rep(), seq(), and the : operator also return vectors
    • v0 <- rep(0,5)
    • v1 <- 1:5
    • v1 <- seq(1,5,by=1)
    • v1 <- c(1L,2L,3L,4L,5L)
    • v2 <- c(1, 2.1, 3.5, 4, 50)
    • v3 <- c("a", "b", "c","d","e")
    • v4 <- c(TRUE, FALSE, TRUE, TRUE,FALSE)
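
Because a vector holds a single type, mixing types inside c() silently coerces the elements. The short sketch below illustrates this, along with basic indexing; the names mixed_num, mixed_chr, and v are illustrative only.

```r
# Mixing types in c() coerces everything to the most flexible type present
mixed_num <- c(1L, 2.5, TRUE)    # integer and logical become double
mixed_chr <- c(1, "a", TRUE)     # everything becomes character
class(mixed_num)   # "numeric"
class(mixed_chr)   # "character"

# Indexing is 1-based; negative indices drop elements
v <- c(10, 20, 30, 40, 50)
v[1]         # first element
v[2:4]       # elements 2 through 4
v[-1]        # everything except the first element
v[v > 25]    # logical indexing keeps elements that pass the test
```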

Matrices

  • Definition: Two-dimensional arrays storing data of a single type.
  • Creation: Created using the matrix() function
    • m1 <- matrix(1:9, nrow=3, ncol=3)

Lists

  • Definition: Can contain elements of different types, including vectors, matrices, and other lists.
    • list1 <- list(v1,v2,v3,v4)

Data Frames

  • Definition: Used to store tabular data with different columns holding different data types.
  • Creation and Usage:
    • df <- data.frame(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 32, 37), Married = c(TRUE, FALSE, TRUE))
    • df<- read.csv("/var/www/html/jlee141/econdata/eco520/chicago_cca.csv")
    • dat1 <- data.frame(v1,v2,v3,v4)
    • dat1$cat1 <- ifelse(dat1$v2 > 20,1,0)
    • dat1$cat2 <- ifelse(dat1$v4 == TRUE,1,0)
    • dat2 <- dat1[1:3,]
    • dat3 <- dat1[,1:3]
    • dat4 <- dat1[which(v4==TRUE),]
    • dat5 <- subset(dat1,v4==TRUE)
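    • dat1$squared <- dat1$v2^2 # assumed definition; the notes use a squared column without creating it
    • dat6 <- subset(dat1,v4==TRUE, select=c(v1,v2,squared,cat1:cat2))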

Time Series Data Types

  • Definition: Sequences of values collected over time intervals.
  • Handling in R: R provides dedicated time-series classes (ts, zoo, and xts) for efficient handling.
    • ts function: Creates time-series objects from numeric vectors, specifying start time, end time, and frequency.
      • Frequency examples: 1 (annual), 12 (monthly), 4 (quarterly), 52 (weekly).
      • Examples:
        • Annual: my_ts <- ts(data = c(1:24), start = 2001, end=2024, frequency = 1)
        • Quarterly: my_ts <- ts(data = c(1:24), start = 2001, frequency = 4)
        • Monthly: my_ts <- ts(data = c(1:24), start = 2001, frequency = 12)
        • Weekly: my_ts <- ts(data = c(1:24), start = 2001, frequency = 52)
        • Entered: data <- c(10,11,5,10,15,12,30,18,25,5,20,15) then my_ts <- ts(data, start=c(2001,1), end=c(2003,4), frequency=4)
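  • Printing Time Series Data: print() lays out a ts object according to its frequency and start time; calendar = TRUE forces a calendar-style table.
    • ts(1:10, frequency = 4, start = c(1959, 2)) # starts in the 2nd quarter of 1959
    • print(ts(1:10, frequency = 7, start = c(12, 2)), calendar = TRUE) # daily data arranged by week
    • print(ts(1:48, frequency = 24, start = c(110, 6)), calendar = TRUE) # hourly data arranged by day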
  • zoo: For ordered indexed series.
    • Requires: library(zoo)
    • Example:
      dates <- as.Date("2020-01-01") + 0:23 # 24 dates, one for each of the 24 values
      zoo_ts <- zoo(1:24, dates)
  • xts: Extends zoo for financial time series.
    • Requires: library(xts)
    • Example: xts_ts <- xts(1:24, order.by=dates)
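
Once a ts object exists, a few base R helpers report its time attributes and extract sub-periods. The sketch below reuses the 12 quarterly values entered above; ts_2002 is an illustrative name.

```r
# Quarterly series built from the 12 values entered above
data <- c(10, 11, 5, 10, 15, 12, 30, 18, 25, 5, 20, 15)
my_ts <- ts(data, start = c(2001, 1), frequency = 4)

start(my_ts)       # first period as c(year, quarter)
end(my_ts)         # last period as c(year, quarter)
frequency(my_ts)   # number of observations per year
time(my_ts)        # time index in fractional years
cycle(my_ts)       # position of each observation within its year (1 to 4)

# window() extracts a sub-period; here the four quarters of 2002
ts_2002 <- window(my_ts, start = c(2002, 1), end = c(2002, 4))
ts_2002
```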

Time Series Data Conversion

  • Setup: build sample vectors to convert among data frame, ts, and xts formats. Requires:
    • library(lubridate) # year() and month()
    • library(xts) # xts() and as.xts()
    • ID <- 1:12
    • NAME <- LETTERS[1:12]
    • NUM <- rnorm(12)
    • date <- seq(as.Date("2020-01-01"),length=12, by="months")
    • year <- year(date)
    • month <- month(date)
  • Convert to data frame
    • df <- data.frame(ID,date,year,month,NAME,NUM)
  • Data frame to ts format
    • df_ts <- ts(df,start=c(2020,1), frequency = 12)
  • ts to xts format
    • ts_xts <- as.xts(df_ts)
    • plot(ts_xts)
    • plot(ts_xts$NUM)
  • data frame to xts (columns of mixed type are coerced to a single character matrix)
    • df_xts <- xts(df,order.by = date)
  • xts to ts
    • xts_ts <- ts(df_xts,start=c(2020,1),frequency = 12)
  • xts to data frame
    • xts_df <- as.data.frame(df_xts)
    • xts_df$date <- as.Date(xts_df$date)
    • xts_df$year <- year(xts_df$date)
    • xts_df$NUM <- as.numeric(xts_df$NUM)
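
For a purely numeric series, a round trip between ts and a data frame needs only base R. The sketch below is a simplified illustration with made-up values; the names m_ts, m_df, and back are hypothetical.

```r
# A short monthly ts object with illustrative values
m_ts <- ts(c(5, 7, 6, 9, 8, 10), start = c(2020, 1), frequency = 12)

# ts to data frame: rebuild the dates explicitly, since a ts stores only start and frequency
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = length(m_ts))
m_df <- data.frame(date = dates, value = as.numeric(m_ts))

# data frame back to ts: pass only the numeric column and restate start and frequency
back <- ts(m_df$value, start = c(2020, 1), frequency = 12)
```

Keeping the numeric column separate from the date column avoids the character coercion that occurs when a mixed-type data frame is converted as a whole.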

Time Series Data Frequency Changes

  • XTS data
    • Requires: library(quantmod)
  • Daily Data Load
    • getSymbols("AAPL", src = "yahoo", from = "2020-01-01", to = Sys.Date())
    • plot(Cl(AAPL)) # Closing prices
  • Weekly Data Conversion (indexAt can be "firstof", "lastof", "startof", or "endof")
    • Last day of the week
      • AAPL_weekly <- to.weekly(AAPL, indexAt = "endof", OHLC = FALSE)
      • plot(Cl(AAPL_weekly)) # Closing prices
  • Monthly Data Conversion
    • First day of the month
      • AAPL_monthly <- to.monthly(AAPL, indexAt = "firstof", OHLC = FALSE)
      • plot(Cl(AAPL_monthly)) # Closing prices
  • Quarterly Data Conversion
    • AAPL_quarterly <- to.quarterly(AAPL, indexAt = "firstof", OHLC = FALSE)
    • plot(Cl(AAPL_quarterly)) # Closing prices
  • Yearly Data Conversion
    • AAPL_yearly <- to.yearly(AAPL, indexAt = "firstof", OHLC = FALSE)
    • plot(Cl(AAPL_yearly)) # Closing prices
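
The conversions above need a live Yahoo Finance download. As a network-free sketch of the same idea, base R's aggregate() can lower the frequency of a ts object by averaging or summing within each period. The series monthly and its values are made up for illustration.

```r
# 24 months of illustrative data, 2020-01 to 2021-12
monthly <- ts(1:24, start = c(2020, 1), frequency = 12)

# Quarterly averages: nfrequency = 4 groups every 3 consecutive months
quarterly_mean <- aggregate(monthly, nfrequency = 4, FUN = mean)

# Annual totals: nfrequency = 1 groups every 12 consecutive months
annual_sum <- aggregate(monthly, nfrequency = 1, FUN = sum)

quarterly_mean
annual_sum
```

Whether to average, sum, or take the last value depends on the variable: prices are usually sampled or averaged, while flows such as trading volume are usually summed.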

Lag, Forward, Differencing, and Percentage Change Operations

  • Lag Operation:
    • Shifts data points to the next periods, delaying them, so each position holds the previous period's value.
    • Useful for comparing current values against past values.
    • For plain vectors, use lag() from dplyr; stats::lag() takes k rather than n and only shifts the time index of a ts object.
      • library(dplyr)
      • vector <- rnorm(12)
      • lag_vector <- dplyr::lag(vector, n = 1)
  • Forward (Lead) Operation:
    • Shifts data points to previous periods, so each position holds the next period's value.
      • lead_vector <- dplyr::lead(vector, n = 1)
  • Differencing:
    • Often used to remove a trend and help make a non-stationary time series stationary.
    • Subtracts the previous observation from the current observation.
      • diff_vector <- vector - dplyr::lag(vector, n = 1)
  • Percentage Change:
    • Measures the relative change between two numbers as a percentage of the first number.
      • perc_change <- ((vector - dplyr::lag(vector)) / dplyr::lag(vector)) * 100
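
The four operations can be checked by hand on a tiny vector. The sketch below uses only base R (no dplyr) and invented values; the names x, lag1, lead1, diff1, and pct are illustrative.

```r
x <- c(100, 110, 121, 108.9)     # illustrative values

lag1  <- c(NA, head(x, -1))      # previous period's value; the first has no past
lead1 <- c(tail(x, -1), NA)      # next period's value; the last has no future
diff1 <- x - lag1                # first difference, same as c(NA, diff(x))
pct   <- (x - lag1) / lag1 * 100 # percentage change from the previous period

round(pct, 2)   # NA 10 10 -10
```

Each shifted operation loses one observation at the edge of the sample, which is why the results start or end with NA.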

Sources of Time Series Data

  • Google Trends
  • FRED
  • Yahoo Finance
  • M Forecasting competitions
  • Web APIs (e.g., Quandl)
  • Time series data repositories (UCI Machine Learning Repository; UEA and UCR Time Series Classification Repository)
  • Data sets in R packages (e.g., tscompdata)
  • Sensor data from smart cities, medical devices
  • Signals for radio, music, medical devices, speech, radars
  • Countries’ organizations (Eurostat, OECD, NOAA)

Time Series Data Example: Google Trends

  • Google Trends data offers insights into search interest trends over time.
    • Requires: the gtrendsR package, which provides an R interface to Google Trends.
  • Install:
    • install.packages("gtrendsR")
  • Load:
    • library(gtrendsR)
  • Fetch trends data for specific keywords.
    • Single Keyword Example
      • trends <- gtrends(c("Data Science"))
      • plot(trends)
    • Multiple Keywords Example
      • trends <- gtrends(c("Data Science", "Machine Learning"))
      • plot(trends)
    • Geographical and Time Specifications
      • trends <- gtrends(c("Data Science"), geo = "US", time = "2016-01-01 2021-12-31")
      • plot(trends)

Fetching Data from FRED

  • You can use the getSymbols function.
    • Example: Unemployment Rate (requires library(quantmod))
      • getSymbols("UNRATE", src = "FRED") # loads UNRATE as an xts object
      • getSymbols("UNRATE", src = "FRED", return.class = "ts") # alternative: load UNRATE as a ts object
      • plot(UNRATE)
    • Load Multiple Series with plots
      • getSymbols(c("CHXRNSA","NYXRNSA","LVXRNSA"), src = "FRED")
    • Combine Multiple Series into one xts object
      • Home_Index <- cbind(CHXRNSA,NYXRNSA,LVXRNSA)
    • Subset, Lag, Monthly, and Annual Changes for xts data
      • Home_Covid <- window(Home_Index, start="2019-01-01", end="2024-03-01")
      • Lag_Home <- stats::lag(Home_Covid, k = 1) # stats::lag dispatches to the xts method; a bare lag() is masked when dplyr is loaded
      • Home_Rate_m <- diff(Home_Covid, lag=1, differences = 1)
      • plot(Home_Rate_m)
      • Home_Rate_y <- diff(Home_Covid, lag=12, differences = 1)
      • plot(Home_Rate_y)
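
The lag argument of diff() is what separates a month-over-month change from a year-over-year change. The base R sketch below shows this on a made-up monthly index instead of the FRED series, so it runs without a download; idx, mom, yoy, and yoy_pct are illustrative names.

```r
# Illustrative monthly index rising by 1 point each month from 100
idx <- ts(100 + 0:35, start = c(2019, 1), frequency = 12)

mom <- diff(idx, lag = 1)     # change versus the previous month
yoy <- diff(idx, lag = 12)    # change versus the same month one year earlier

# Year-over-year percentage change: divide by the value 12 months earlier.
# stats::lag(idx, -12) moves each value 12 months later, so the series align by date.
yoy_pct <- 100 * yoy / stats::lag(idx, -12)

start(yoy)     # the first 12 months are lost because they have no year-earlier value
head(yoy_pct)
```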

Fetching Data from Yahoo Finance

  • Use quantmod to fetch historical daily market data.
    • Fetching Apple Inc. Data requires library(quantmod)
      • getSymbols("AAPL", src = "yahoo", from = "2020-01-01", to = "2024-12-31")
      • apple_data <- AAPL
      • chartSeries(apple_data, type = "line", theme = chartTheme("white"))
      • chartSeries(apple_data, type = "line", TA=NULL, theme = chartTheme("white"))
    • Conversion to different frequency
      • VoWeek <- apply.weekly(Vo(AAPL),sum) # sum from Monday to Friday
      • VoMonth <- apply.monthly(Vo(AAPL),sum) # sum to month
      • VoQuarter <- apply.quarterly(Vo(AAPL),sum) # sum to quarter
      • VoYear <- apply.yearly(Vo(AAPL),sum) # sum to year

Time Series Data Cleaning

  • Like all data, time series data must be inspected and cleaned
  • Is there a time stamp?
  • Is the time stamp in order?
  • Are the observations recorded monthly, annually, daily, hourly, weekly..?
  • Are the data recorded at equally spaced intervals?
  • Are there breaks in the time stamp?
  • Are there missing values?
  • Are there extraneous numerical codes that are not really data but codes representing something else?
  • Are the measurements intertwined with summaries of the data, or some other data processing that is not really an observation?
  • Are the numbers formatted with commas in them?
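
Several of these checks can be scripted. The base R sketch below runs them on a small invented data set; the column names, the values, and the use of -999 as a placeholder code are all assumptions made for illustration.

```r
# Illustrative raw data with common problems (all values are made up)
raw <- data.frame(
  date  = c("2023-01-01", "2023-02-01", "2023-04-01", "2023-03-01", "2023-06-01"),
  sales = c("1,200", "1,350", "-999", "1,410", "1,500"),
  stringsAsFactors = FALSE
)

raw$date <- as.Date(raw$date)
out_of_order <- any(diff(raw$date) < 0)   # TRUE here: time stamps are not in order
raw <- raw[order(raw$date), ]             # sort by time stamp

raw$sales <- as.numeric(gsub(",", "", raw$sales))  # strip thousands separators
raw$sales[raw$sales == -999] <- NA                 # assumed placeholder code for "missing"

# Compare against the full monthly calendar to find breaks in the time stamp
full <- seq(min(raw$date), max(raw$date), by = "month")
missing_dates <- full[!(full %in% raw$date)]       # months with no row at all
missing_dates
```

Note the difference between a missing value (a row exists but the value is NA) and a missing time stamp (no row exists); the two need different repairs.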

Time Series with Missing Values

  • Missing values are relatively easy to deal with.
  • It’s best if you can find out why the values are missing in the first place, but if you can’t, there are various statistical methods available for imputing them.
  • With missing values in time series datasets, you usually have the date column fully populated, and the value field is set to NA.
    • Example requires:
      • library(lubridate)
        df <- data.frame(
          date = seq(ymd("20230101"), ymd("20231231"), by = "months"),
          value = c(145, 212, NA, 265, 299, 345, NA, NA, 278, 256, 202, 176)
        )
      • library(zoo)
      • Mean value imputation- the missing values will be replaced with a simple average
        • mean_value <- mean(df$value, na.rm = TRUE)
        • df$v_mean <- ifelse(is.na(df$value), mean_value, df$value)
      • Forward fill - the missing value at the point T is filled with a non-missing value at T-1
        • df$v_ffill <- na.locf(df$value, na.rm = FALSE)
      • Backward fill - the missing value at the point T is filled with a non-missing value at T+1
        • df$v_bfill <- na.locf(df$value, fromLast = TRUE, na.rm = FALSE)
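      • Linear interpolation - the missing value at the point T is filled along a straight line between the nearest non-missing values on either side; for a single gap this is the average of the values at T-1 and T+1
        • df$v_interpolated <- na.approx(df$value, na.rm = FALSE) # na.rm = FALSE keeps the result the same length as the column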
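
To see what linear interpolation does when two values in a row are missing, the sketch below fills the same 12 values with base R's approx(), which performs the same kind of straight-line interpolation. The names pos, known, and filled are illustrative.

```r
value <- c(145, 212, NA, 265, 299, 345, NA, NA, 278, 256, 202, 176)

pos   <- seq_along(value)   # time position 1 to 12
known <- !is.na(value)      # TRUE where a value was observed

# Interpolate at every position using only the observed points
filled <- approx(x = pos[known], y = value[known], xout = pos)$y

filled[3]                  # single gap: midpoint of 212 and 265, i.e. 238.5
round(filled[7:8], 2)      # double gap: evenly spaced steps from 345 down to 278
```

For the two-month gap, the filled values are not the plain average of the neighbors; they sit one third and two thirds of the way between 345 and 278.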

Time Series Visualization

Time Series Graphs

  • plot, plot.ts
  • Creating plots in R is a fundamental aspect of data analysis, allowing for visual exploration of data sets and the communication of data insights.
  • The base plotting system in R is simple and direct, making it easy to create basic graphs using the plot() or plot.ts() function:
    • requires library(quantmod)
    • Simple Plots
      • getSymbols("UNRATE", src = "FRED", return.class = "ts")
      • plot(UNRATE)
    • Use plot.ts to create a time series plot with a custom color
      • plot.ts(UNRATE, main = "Monthly Unemployment Rate", ylab = "Unemployment Rate", col = "blue")
    • Add text at a specific location with a custom color
      • text(x = c(2012,2020), y = c(6,10), labels = c("Bottom", "Covid19"), col = c("red", "red"))
    • Multiple series on one graph by combining xts objects
      • Home_Index <- cbind(CHXRNSA,NYXRNSA,LVXRNSA)
      • plot(Home_Index)
      • plot(Home_Index$CHXRNSA)
      • plot(CHXRNSA,col="red", main="Case-Shiller Home Price Index \n Chicago, New York, Las Vegas")
      • lines(NYXRNSA,col="blue")
      • lines(LVXRNSA,col="green")

Time Series Graphs: chartSeries

  • Example: Graphs for Apple Inc. Data (requires library(quantmod))
    • Load data (the subsets below reach back to 2007, and the last example also needs IBM)
      • getSymbols(c("AAPL", "IBM"), src = "yahoo", from = "2007-01-01")
    • Subset
      • chartSeries(AAPL, subset='2020-05::2024-01', theme=chartTheme('white'))
      • chartSeries(AAPL, type="line", subset='2023', theme=chartTheme('white'))
    • Bar chart
      • chartSeries(AAPL, type="bar", subset='2024-01', theme=chartTheme('white'))
    • Candlesticks
      • chartSeries(AAPL, type="candlesticks", subset='2007-01', theme=chartTheme('white'))
    • Add indicators: Simple Moving Average (SMA), MACD, and RSI (the last two smoothed with an Exponential MA, EMA)
      • chartSeries(AAPL, subset='2019-05::2023-01', theme=chartTheme('white'))
      • addSMA(n=5,on=1,col = "red")
      • addMACD(fast=12,slow=26,signal=9,type="EMA")
      • addRSI(n=30,maType="EMA")
    • Multiple Series in a Graph
      • chartSeries(Cl(AAPL), subset='2007-05::2023-01', theme=chartTheme('white'))
      • addTA(Cl(IBM), on=1,col="blue",lty="dashed")
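
To make clear what a simple moving average is, the sketch below computes a 5-period SMA by hand with base R's stats::filter(). The prices are invented, and this is not claimed to be how addSMA() is implemented internally; it only illustrates the definition (the mean of the latest n observations).

```r
prices <- c(10, 12, 11, 13, 15, 14, 16)   # illustrative closing prices
n <- 5                                    # window length, as in addSMA(n = 5)

# Equal weights of 1/n over the current and previous n - 1 values (sides = 1 looks backward only)
sma <- stats::filter(prices, rep(1 / n, n), sides = 1)

as.numeric(sma)   # the first n - 1 entries are NA because the window is incomplete
```

The fifth entry is the mean of the first five prices, and each later entry drops the oldest price and adds the newest one.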

Dynamic and Interactive Graphs

  • Time Series Dynamic Graph: dygraph() requires library(dygraphs)
  • Standard dynamic graph
    • The dygraph() function displays time series data interactively. Move your mouse over the chart to see the values.
      • dygraph(OHLC(AAPL))
    • Shading
      • graph<- dygraph(Cl(AAPL), main = "AAPL")
      • dyShading(graph, from="2020-03-20", to="2022-12-11", color="#FFE6E6")
    • Event line
      • graph <- dygraph(OHLC(AAPL), main = "AAPL")
      • graph <- dyEvent(graph,"2007-6-29", "iphone", labelLoc = "bottom")
      • graph <- dyEvent(graph,"2010-5-6", "Flash Crash", labelLoc = "bottom")
      • graph <- dyEvent(graph,"2014-6-6", "Split", labelLoc = "bottom")
      • dyEvent(graph,"2011-10-5", "Jobs", labelLoc = "bottom")
    • Candle Chart
      • AAPL_C <- tail(AAPL, n=30)
      • graph <- dygraph(OHLC(AAPL_C))
      • dyCandlestick(graph)

Dynamic and Interactive Graphs: Life Expectancy vs. GDP Per Capita by Year and Country

  • Time Series Interactive Graph Example:
    • requires library(plotly)
    • requires library(dplyr)
    • requires library(gapminder)
      data <- gapminder %>%
        select(country, year, lifeExp, gdpPercap, pop) %>%
        mutate(scaled_pop = pop / 5000000)  # scale population down for use as marker size
  • Interactive graph
    figure <- plot_ly(
      data,
      x = ~gdpPercap, y = ~lifeExp, text = ~country,
      mode = 'markers', color = ~country,
      frame = ~year, ids = ~country,
      marker = list(
        size = ~scaled_pop, opacity = 0.7,
        line = list(color = 'rgba(0, 0, 0, 0.5)', width = 0.5)
      ),
      type = 'scatter', hoverinfo = 'text+x+y'
    ) %>%
      layout(
        title = 'Life Expectancy vs. GDP Per Capita by Year and Country',
        xaxis = list(type = 'log', title = 'GDP per Capita'),
        yaxis = list(title = 'Life Expectancy'),
        margin = list(l = 60, r = 50, b = 65, t = 90)
      )
    figure

Graph using ggplot2

  • Introduction to ggplot2
    • ggplot2 is a powerful R package for creating graphics. It implements the grammar of graphics, a coherent system for describing and building graphs. With ggplot2, you can create advanced plots in a consistent manner. This presentation will focus on its application to time series data.
  • Simple Line Graph
    • A simple line plot is the most common graph for time series data. We plot time on the x-axis and the variable of interest on the y-axis. The following code will produce a line plot with a minimalistic style.
    • Make sure to load the ggplot2 library: library(ggplot2)
    • Sample Data
      ts_data <- data.frame(
        Date = seq(as.Date("2000/1/1"), by = "month", length.out = 100),
        Value = cumsum(runif(100, min = -10, max = 10))
      )
    • ggplot example
      ggplot(ts_data, aes(x = Date, y = Value)) +
        geom_line() +
        theme_minimal()

Graph using ggplot2: Time Series with Multiple Groups

  • Sometimes we need to compare multiple time series on the same plot. This can be done by mapping a categorical variable to color or linetype.
    ts_data$Group <- rep(c("A", "B"), each = 50)
    ggplot(ts_data, aes(x = Date, y = Value, color = Group)) +
      geom_line() +
      theme_minimal()
  • This code adds a categorical variable, Group, to differentiate the lines.

Graph using ggplot2: Adding Points to Time Series

  • To highlight individual data points, you can add points to your line plot. This is useful to mark outliers or specific events.
    ggplot(ts_data, aes(x = Date, y = Value)) +
      geom_line() +
      geom_point(aes(color = Value > 0)) +
      theme_minimal()
  • Points are colored differently depending on whether Value is greater than zero.

Graph using ggplot2: Time Series with Facets

  • Faceting creates a matrix of panels by one or more grouping variables. It allows us to compare several time series graphs side by side.
    ggplot(ts_data, aes(x = Date, y = Value)) +
      geom_line() +
      facet_wrap(~ Group) +  # Facets by group
      theme_minimal()
  • Each panel represents a different group’s time series.

Graph using ggplot2: Customizing Time Series Plots

  • ggplot2 is highly customizable. You can adjust almost every element of a plot to suit your needs.
    ggplot(ts_data, aes(x = Date, y = Value, color = Group)) +
      geom_line() +
      labs(title = "Customized Time Series Plot", x = "Time", y = "Value") +
      scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
      scale_color_manual(values = c("blue", "red")) +
      theme_minimal()
  • The labels, scales, and colors are customized.

Graph using ggplot2: Spaghetti Time Series Plots

  • A spaghetti plot overlays one line per year on a shared month axis, which makes seasonal patterns easy to compare across years.
  • Spaghetti Plot
    • Convert to data frame
      • getSymbols(c("CHXRNSA"), src = "FRED")
      • df <- data.frame(Date = index(CHXRNSA), Price = coredata(CHXRNSA))
    • Extract year and month
      • df$Year <- as.numeric(format(df$Date, "%Y"))
      • df$Month <- as.numeric(format(df$Date, "%m"))
    • Create the spaghetti plot
      df %>%
        filter(Year > 2018) %>%
        ggplot(aes(x = Month, y = CHXRNSA, group = Year, color = as.factor(Year))) +
        geom_line() +
        geom_point() +  # Add this line to include dots
        theme_minimal() +
        theme(plot.title = element_text(hjust = 0.5)) +
        labs(title = "Spaghetti Plots of Monthly Chicago Home Prices by Year", x = "Month", y = "Price Index") +
        scale_color_discrete(name = "Year")