#Imports
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.cross_decomposition import CCA
# Relative path: keep the notebook in the same folder as the .xlsx file
df = pd.read_excel('Econ and CS Project Data.xlsx', sheet_name=0).T
header = df.iloc[0]
header.name = 'Feature'
df = df[1:]
df.columns = header
df
econ_keys = ['Real PCE for Goods', 'Real PCE for Services', 'Real GDP','Personal income', 'Disposable personal income','Annual Mean Unemployment Rate']
travel_keys = ['Air carrier domestic all services vehicle-miles','Highway vehicle-miles', 'Transite vehicle-miles', 'Rail train-miles','Air-travel arrivals in the USA', 'Air-travel departures from the USA']
Both the truncated correlation matrix and the pair plot below compare our datasets, with economic data on the vertical axis and travel data on the horizontal axis. The correlation matrix shows the correlation between any two features at the position where their row and column intersect. An intuition for that correlation can be gleaned by looking at the same position in the pair plot. Knowing these individual correlations will help us later when explaining the results of our canonical correlation analysis.
df.drop(['Units']).astype(float).corr().head(6)[travel_keys]
#Correlation matrix shows correlation of each feature row by column
sns.set(style="darkgrid")
sns.set_palette("Dark2")
sns.pairplot(df.drop(['Units']),height=3, x_vars = travel_keys, y_vars = econ_keys)
plt.show()
#Define input and output datasets
econ = df.drop(['Units']).astype(float)[econ_keys]
travel = df.drop(['Units']).astype(float)[travel_keys]
cca = CCA(n_components=5)
cca.fit(travel, econ)
U = pd.DataFrame(cca.x_rotations_)
U.rename(index={0:"Air v-mi",1:"Highway v-mi",2:"Transit v-mi",3:"Rail t-mi",4:"Air Arrivals",5:"Air Departures"}, inplace=True)
print("Travel Canonical Components:")
print(U)
V = pd.DataFrame(cca.y_rotations_)
V.rename(index={0:"RPCE Goods",1:"RPCE Services",2:"RGDP",3:"Income",4:"Disp. Income",5:"Unemployment"}, inplace=True)
print("\nEconomic Canonical Components:")
print(V)
print('\n')
travel_c, econ_c = cca.fit_transform(travel, econ)
# Correlation between each pair of canonical variates (column i of each projection)
ordinals = ["First", "Second", "Third", "Fourth", "Fifth"]
for i, name in enumerate(ordinals):
    print(f"{name} Canonical Correlation =", np.corrcoef(travel_c[:, i], econ_c[:, i])[0, 1])
The goal of this canonical correlation analysis (CCA) was to determine how our economic measures and our travel data relate by projecting them, as input and output data, onto single dimensions with maximal correlation. The particular data components were chosen to give insight into a wide array of macroeconomic health indicators and different types of travel. Our selection was also shaped by the lack of historical precedent since 1975 for a pandemic as devastating as Covid-19. With major institutions halted and data collection a low priority, available and reliable private and public travel data is limited. However, as travel data is a strong indicator of overall economic health, we decided to focus on the impacts of the travel ban on economic health through CCA of our chosen data components.
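As a rough illustration of what "maximal correlation" means here, the sketch below computes canonical correlations directly with NumPy on synthetic data. It is not the scikit-learn `CCA` estimator used in this notebook, which fits its components iteratively and may give somewhat different numbers. The function name `canonical_correlations`, the toy arrays, and the latent factor `shared` are all assumptions made for this example, and the approach assumes each feature set has full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Return the canonical correlations between X and Y, largest first."""
    # Center each feature so correlations are measured about the mean.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # The singular values of Qx^T Qy are the cosines of the principal
    # angles between the two subspaces, i.e. the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

rng = np.random.default_rng(0)
n = 500
shared = rng.normal(size=n)  # hypothetical latent factor driving both sets
X = np.column_stack([shared + 0.1 * rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([rng.normal(size=n), shared + 0.1 * rng.normal(size=n)])

rho = canonical_correlations(X, Y)
print(rho)
```

In this toy setup only one direction is shared between the two sets, so the first canonical correlation should be high and the second should be close to zero.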
While CCA is often used for dimensionality reduction as a means of enhancing the training of machine learning models, valuable insights can also be gained by examining the canonical correlations and their components. The first three travel canonical components are very strongly correlated with their respective economic canonical components, so we can say with relative confidence that the travel data is indeed correlated with the economic data, although we may be overfitting our model to achieve such high canonical correlations. The real personal consumption expenditures (RPCE) for goods and services and real GDP are highly correlated with changes in travel data, as expected of statistics that are understood to be economic health indicators. Then, looking at the makeup of the individual canonical components in each canonical correlation, we can gain insights into relationships between input and output features in context. For example, in the first canonical correlation, the travel canonical component has a high weight on Highway v-mi compared to the weights on the other travel features. Likewise, the first canonical correlation's economic canonical component is dominated by the weight on RPCE Services. Together, these facts could imply a strong relationship between these two features, given their surrounding context. Indeed, if we look back at our pair plot, we see that those two features do correlate very strongly.
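The overfitting caveat can be made concrete with a small simulation. The sketch below uses only NumPy and purely synthetic noise, not the project data, to compare the first canonical correlation of two independent six-feature sets at a small and a large sample size. The helper `first_canonical_correlation` and both sample sizes are assumptions chosen for illustration.

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """Largest canonical correlation between X and Y (QR/SVD approach)."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return float(np.linalg.svd(Qx.T @ Qy, compute_uv=False)[0])

rng = np.random.default_rng(42)
results = {}
for n in (15, 2000):  # hypothetical sample sizes
    # Two independent noise matrices: no true relationship exists.
    X = rng.normal(size=(n, 6))
    Y = rng.normal(size=(n, 6))
    results[n] = first_canonical_correlation(X, Y)
    print(f"n = {n:4d}: first canonical correlation of pure noise = {results[n]:.2f}")
```

When the number of observations is not much larger than the number of features, even unrelated data can produce a large first canonical correlation, so high values alone are weak evidence of a real relationship. Comparing against a noise baseline like this one, or evaluating on held-out rows, is one way to gauge how much of the correlation is an artifact of sample size.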
We had expected a strong correlation between travel data and the macroeconomic factors Real GDP, Personal Income, RPCE Goods, and RPCE Services, as it would make sense for the travel industry to grow in conjunction with these factors. The real GDP numbers show a steady trend of economic growth in the US, and the travel and tourism industry has also grown steadily in the decades since 1975, with a few exceptions such as after 9/11. An increase in personal income and consumption of goods and services also correlates with increasing travel, as more personal income provides more means for leisure activities such as travel. Businesses can also support more travel expenses when they are making higher revenue and profits. Higher consumption of goods and services also indicates a higher capacity for travel, whether cross-country or international.
We were surprised by the results for disposable personal income and unemployment. Unemployment has an extremely low correlation coefficient (R-value); this reflects a weak explanatory connection between unemployment and travel data, although it does not necessarily mean our model made poor predictions. Even during periods of low disposable income and high unemployment, travel continued to grow steadily. This could indicate that travel is a necessity even when consumers have low disposable income or are experiencing unemployment. This makes logical sense for vehicle and train mileage, as those are necessary to reach essential places such as work and other businesses. The growth in US population since 1975 and the growth in the travel industry due to globalization and business expansion can also help account for this, because there is a greater need for, and established infrastructure to support, public transportation despite high unemployment and low disposable income. A breakdown of the travel data into leisure and business travel would provide further insight and a more interesting analysis.
Follow this link for phase 2 of this project: https://observablehq.com/@limyifan1/international-flight-map
This canonical correlation analysis is based on code by Professor Tom Fletcher, distributed to his Foundations of Data Analysis class at UVA. The aforementioned code can be found here: https://tomfletcher.github.io/FoDA/examples/CCA.ipynb