Please copy and paste this embed script to where you want to embed

FORECASTING AIRLINE DELAYS

On any given day, more than 87,000 flights take place in the United States alone. About one-third of these flights are commercial flights, operated by companies like United, American Airlines, and JetBlue. While about 80% of commercial flights take-off and land as scheduled, the other 20% suffer from delays due to various reasons. A certain number of delays are unavoidable, due to unexpected events, but some delays could hopefully be avoided if the factors causing delays were better understood and addressed.

In this problem, we'll use a dataset of 9,381 flights that occurred in June through August of 2014 between the three busiest US airports -- Atlanta (ATL), Los Angeles (LAX), and Chicago (ORD) -- to predict flight delays. The dataset AirlineDelay.csv includes the following 23 variables:

1. Flight = the origin-destination pair (LAX-ORD, ATL-LAX, etc.) 2. Carrier = the carrier operating the flight (American Airlines, Delta Air Lines, etc.) 3. Month = the month of the flight (June, July, or August) 4. DayOfWeek = the day of the week of the flight (Monday, Tuesday, etc.) 5. NumPrevFlights = the number of previous flights taken by this aircraft in the same day 6. PrevFlightGap = the amount of time between when this flight's aircraft is scheduled to arrive at the airport and when it's scheduled to depart for this flight 7. HistoricallyLate = the proportion of time this flight has been late historically 8. InsufficientHistory = whether or not we have enough data to determine the historical record of the flight (equal to 1 if we don't have at least 3 records, equal to 0 if we do) 9. OriginInVolume = the amount of incoming traffic volume at the origin airport, normalized by the typical volume during the flight's time and day of the week 10.OriginOutVolume = the amount of outgoing traffic volume at the origin airport, normalized by the typical volume during the flight's time and day of the week 11.DestInVolume = the amount of incoming traffic volume at the destination airport, normalized by the typical volume during the flight's time and day of the week 12.DestOutVolume = the amount of outgoing traffic volume at the destination airport, normalized by the typical volume during the flight's time and day of the week 13.OriginPrecip = the amount of rain at the origin over the course of the day, in tenths of millimeters 14.OriginAvgWind = average daily wind speed at the origin, in miles per hour

15.OriginWindGust = fastest wind speed during the day at the origin, in miles per hour 16.OriginFog = whether or not there was fog at some point during the day at the origin (1 if there was, 0 if there wasn't) 17.OriginThunder = whether or not there was thunder at some point during the day at the origin (1 if there was, 0 if there wasn't) 18.DestPrecip = the amount of rain at the destination over the course of the day, in tenths of millimeters 19.DestAvgWind = average daily wind speed at the destination, in miles per hour 20.DestWindGust = fastest wind speed during the day at the destination, in miles per hour 21.DestFog = whether or not there was fog at some point during the day at the destination (1 if there was, 0 if there wasn't) 22.DestThunder = whether or not there was thunder at some point during the day at the destination (1 if there was, 0 if there wasn't) 23.TotalDelay = the amount of time the aircraft was delayed, in minutes (this is our dependent variable)

PROBLEM 1 - LOADING THE DATA Load the dataset AirlineDelay.csv into R and call it "Airlines". Randomly split it into a training set (70% of the data) and testing set (30% of the data) by running the following lines in your R console:

set.seed(15071) spl = sample(nrow(Airlines), 0.7*nrow(Airlines)) AirlinesTrain = Airlines[spl,] AirlinesTest = Airlines[-spl,]

How many observations are in the training set AirlinesTrain? (2 marks) Answer: 6566 How many observations are in the testing set AirlinesTest? (2 marks) Answer: 2815 PROBLEM 2 - A LINEAR REGRESSION MODEL Build a linear regression model to predict "TotalDelay" using all of the other variables as independent variables. Use the training set to build the model. What is the model's R-squared? (Please report the "Multiple R-squared" value in the output.) (2 marks) Answer: 0.09475 PROBLEM 3 - CHECKING FOR SIGNIFICANCE In your linear regression model, which of the independent variables are significant at the p=0.05 level (at least one star)? For factor variables, consider the variable significant if at least one level is significant. Write all that apply. (2 marks)

PROBLEM 4 - CORRELATIONS What is the correlation between NumPrevFlights and PrevFlightGap in the training set? (2 marks) Answer: -0.652053189 What is the correlation between OriginAvgWind and OriginWindGust in the training set? (2 marks) Answer: 0.509953488 Hint: Identify/find out a function that will calculate correlation between variables

PROBLEM 5 - IMPORTANCE OF CORRELATIONS Why is it important to check for correlations between independent variables? Select one of the following: (2 marks) 1. Having highly correlated independent variables in a regression model can affect the interpretation of the coefficients.

2. Having highly correlated independent variables in a regression model can affect the quality of the resulting predictions. PROBLEM 6 - COEFFICIENTS In the model with all of the available independent variables, what is the coefficient for HistoricallyLate? (2 marks) Answer: 47.913638 PROBLEM 7 - UNDERSTANDING THE COEFFICIENTS The coefficient for NumPrevFlights is 1.56. What is the interpretation of this coefficient? Choose one of the following: (2 marks)

For an increase of 1 in the predicted total delay, the number of previous flights increases by approximately 1.56. For an increase of 1 in the number of previous flights, the prediction of the total delay increases by approximately 1.56. If the number of previous flights increases by 1, then the total delay will definitely increase by approximately 1.56; the number of previous flights should be minimized if airlines want to decrease the amount of delay.

PROBLEM 8 - UNDERSTANDING THE MODEL Let us try to understand our model. In the linear regression model, given two flights that are otherwise identical, what is the absolute difference in predicted total delay given that one flight is on Thursday and the other is on Sunday? (2 marks) Answer: 6.989857 In the linear regression model, given two flights that are otherwise identical, what is the absolute difference in predicted total delay given that one flight is on Saturday and the other is on Sunday? (2 marks) Answer: 0.911413 PROBLEM 9 - PREDICTIONS ON THE TEST SET Make predictions on the test set using your linear regression model. What is the Sum of Squared Errors (SSE) on the test set? Hint: Use “predict” function for prediction. You can then take your data in excel by write.csv function and compute SSE. You can very well do this R as well. However, the choice is left to you. (2 marks) Answer: 4744764

PROBLEM 11 - A CLASSIFICATION PROBLEM Let's turn this problem into a multi-class classification problem by creating a new dependent variable. Our new dependent variable will take three different values: "No Delay", "Minor Delay", and "Major Delay". Create this variable, called

"DelayClass", in your dataset Airlines by running the following line in your R console: IMPORTANT: BEFORE YOU DO THIS STEP SAVE YOUR ORIGINAL DATA FRAME. YOU MAY USE FOLLOWING CODE: AirlinesOriginal=Airlines Now you may proceedAirlines$DelayClass = factor(ifelse(Airlines$TotalDelay == 0, "No Delay", ifelse(Airlines$TotalDelay >= 30, "Major Delay", "Minor Delay"))) How many flights in the dataset Airlines had no delay? (2 marks) Answer: 4688 How many flights in the dataset Airlines had a minor delay? (2 marks) Answer: 3096 How many flights in the dataset Airlines had a major delay? (2 marks) Answer: 1597 Now, remove the original dependent variable "TotalDelay" from your dataset with the command: Airlines$TotalDelay = NULL PROBLEM 12 - A CART MODEL Build a CART model to predict "DelayClass" using all of the other variables as independent variables and the training set to build the model. Remember that to predict a multi-class dependent variable, you can use the rpart function in the same way as for a binary classification problem. Just use the default parameter settings. How many split are in the resulting tree? (2 marks) Answer: 2 The CART model you just built splits first on which variable? (2 marks) Answer: Historically Late PROBLEM 13- LOGISTIC REGRESSION Use following command in your R console to create dichotomous dependant variable: AirlinesOriginal$DelayClass = factor(ifelse(AirlinesOriginal$TotalDelay == 0, "No Delay", “Delay")) What is baseline accuracy? (2 marks) Answer: 50% What is the model accuracy? (2 marks) Answer: 65% PROBLEM 14- RANDFOREST

Run RandomForest model and identify most significant variable using varImpPlot. (5 Marks) Answer: Historically Late

View more...
On any given day, more than 87,000 flights take place in the United States alone. About one-third of these flights are commercial flights, operated by companies like United, American Airlines, and JetBlue. While about 80% of commercial flights take-off and land as scheduled, the other 20% suffer from delays due to various reasons. A certain number of delays are unavoidable, due to unexpected events, but some delays could hopefully be avoided if the factors causing delays were better understood and addressed.

In this problem, we'll use a dataset of 9,381 flights that occurred in June through August of 2014 between the three busiest US airports -- Atlanta (ATL), Los Angeles (LAX), and Chicago (ORD) -- to predict flight delays. The dataset AirlineDelay.csv includes the following 23 variables:

1. Flight = the origin-destination pair (LAX-ORD, ATL-LAX, etc.) 2. Carrier = the carrier operating the flight (American Airlines, Delta Air Lines, etc.) 3. Month = the month of the flight (June, July, or August) 4. DayOfWeek = the day of the week of the flight (Monday, Tuesday, etc.) 5. NumPrevFlights = the number of previous flights taken by this aircraft in the same day 6. PrevFlightGap = the amount of time between when this flight's aircraft is scheduled to arrive at the airport and when it's scheduled to depart for this flight 7. HistoricallyLate = the proportion of time this flight has been late historically 8. InsufficientHistory = whether or not we have enough data to determine the historical record of the flight (equal to 1 if we don't have at least 3 records, equal to 0 if we do) 9. OriginInVolume = the amount of incoming traffic volume at the origin airport, normalized by the typical volume during the flight's time and day of the week 10.OriginOutVolume = the amount of outgoing traffic volume at the origin airport, normalized by the typical volume during the flight's time and day of the week 11.DestInVolume = the amount of incoming traffic volume at the destination airport, normalized by the typical volume during the flight's time and day of the week 12.DestOutVolume = the amount of outgoing traffic volume at the destination airport, normalized by the typical volume during the flight's time and day of the week 13.OriginPrecip = the amount of rain at the origin over the course of the day, in tenths of millimeters 14.OriginAvgWind = average daily wind speed at the origin, in miles per hour

15.OriginWindGust = fastest wind speed during the day at the origin, in miles per hour 16.OriginFog = whether or not there was fog at some point during the day at the origin (1 if there was, 0 if there wasn't) 17.OriginThunder = whether or not there was thunder at some point during the day at the origin (1 if there was, 0 if there wasn't) 18.DestPrecip = the amount of rain at the destination over the course of the day, in tenths of millimeters 19.DestAvgWind = average daily wind speed at the destination, in miles per hour 20.DestWindGust = fastest wind speed during the day at the destination, in miles per hour 21.DestFog = whether or not there was fog at some point during the day at the destination (1 if there was, 0 if there wasn't) 22.DestThunder = whether or not there was thunder at some point during the day at the destination (1 if there was, 0 if there wasn't) 23.TotalDelay = the amount of time the aircraft was delayed, in minutes (this is our dependent variable)

PROBLEM 1 - LOADING THE DATA Load the dataset AirlineDelay.csv into R and call it "Airlines". Randomly split it into a training set (70% of the data) and testing set (30% of the data) by running the following lines in your R console:

set.seed(15071) spl = sample(nrow(Airlines), 0.7*nrow(Airlines)) AirlinesTrain = Airlines[spl,] AirlinesTest = Airlines[-spl,]

How many observations are in the training set AirlinesTrain? (2 marks) Answer: 6566 How many observations are in the testing set AirlinesTest? (2 marks) Answer: 2815 PROBLEM 2 - A LINEAR REGRESSION MODEL Build a linear regression model to predict "TotalDelay" using all of the other variables as independent variables. Use the training set to build the model. What is the model's R-squared? (Please report the "Multiple R-squared" value in the output.) (2 marks) Answer: 0.09475 PROBLEM 3 - CHECKING FOR SIGNIFICANCE In your linear regression model, which of the independent variables are significant at the p=0.05 level (at least one star)? For factor variables, consider the variable significant if at least one level is significant. Write all that apply. (2 marks)

PROBLEM 4 - CORRELATIONS What is the correlation between NumPrevFlights and PrevFlightGap in the training set? (2 marks) Answer: -0.652053189 What is the correlation between OriginAvgWind and OriginWindGust in the training set? (2 marks) Answer: 0.509953488 Hint: Identify/find out a function that will calculate correlation between variables

PROBLEM 5 - IMPORTANCE OF CORRELATIONS Why is it important to check for correlations between independent variables? Select one of the following: (2 marks) 1. Having highly correlated independent variables in a regression model can affect the interpretation of the coefficients.

2. Having highly correlated independent variables in a regression model can affect the quality of the resulting predictions. PROBLEM 6 - COEFFICIENTS In the model with all of the available independent variables, what is the coefficient for HistoricallyLate? (2 marks) Answer: 47.913638 PROBLEM 7 - UNDERSTANDING THE COEFFICIENTS The coefficient for NumPrevFlights is 1.56. What is the interpretation of this coefficient? Choose one of the following: (2 marks)

For an increase of 1 in the predicted total delay, the number of previous flights increases by approximately 1.56. For an increase of 1 in the number of previous flights, the prediction of the total delay increases by approximately 1.56. If the number of previous flights increases by 1, then the total delay will definitely increase by approximately 1.56; the number of previous flights should be minimized if airlines want to decrease the amount of delay.

PROBLEM 8 - UNDERSTANDING THE MODEL Let us try to understand our model. In the linear regression model, given two flights that are otherwise identical, what is the absolute difference in predicted total delay given that one flight is on Thursday and the other is on Sunday? (2 marks) Answer: 6.989857 In the linear regression model, given two flights that are otherwise identical, what is the absolute difference in predicted total delay given that one flight is on Saturday and the other is on Sunday? (2 marks) Answer: 0.911413 PROBLEM 9 - PREDICTIONS ON THE TEST SET Make predictions on the test set using your linear regression model. What is the Sum of Squared Errors (SSE) on the test set? Hint: Use “predict” function for prediction. You can then take your data in excel by write.csv function and compute SSE. You can very well do this R as well. However, the choice is left to you. (2 marks) Answer: 4744764

PROBLEM 11 - A CLASSIFICATION PROBLEM Let's turn this problem into a multi-class classification problem by creating a new dependent variable. Our new dependent variable will take three different values: "No Delay", "Minor Delay", and "Major Delay". Create this variable, called

"DelayClass", in your dataset Airlines by running the following line in your R console: IMPORTANT: BEFORE YOU DO THIS STEP SAVE YOUR ORIGINAL DATA FRAME. YOU MAY USE FOLLOWING CODE: AirlinesOriginal=Airlines Now you may proceedAirlines$DelayClass = factor(ifelse(Airlines$TotalDelay == 0, "No Delay", ifelse(Airlines$TotalDelay >= 30, "Major Delay", "Minor Delay"))) How many flights in the dataset Airlines had no delay? (2 marks) Answer: 4688 How many flights in the dataset Airlines had a minor delay? (2 marks) Answer: 3096 How many flights in the dataset Airlines had a major delay? (2 marks) Answer: 1597 Now, remove the original dependent variable "TotalDelay" from your dataset with the command: Airlines$TotalDelay = NULL PROBLEM 12 - A CART MODEL Build a CART model to predict "DelayClass" using all of the other variables as independent variables and the training set to build the model. Remember that to predict a multi-class dependent variable, you can use the rpart function in the same way as for a binary classification problem. Just use the default parameter settings. How many split are in the resulting tree? (2 marks) Answer: 2 The CART model you just built splits first on which variable? (2 marks) Answer: Historically Late PROBLEM 13- LOGISTIC REGRESSION Use following command in your R console to create dichotomous dependant variable: AirlinesOriginal$DelayClass = factor(ifelse(AirlinesOriginal$TotalDelay == 0, "No Delay", “Delay")) What is baseline accuracy? (2 marks) Answer: 50% What is the model accuracy? (2 marks) Answer: 65% PROBLEM 14- RANDFOREST

Run RandomForest model and identify most significant variable using varImpPlot. (5 Marks) Answer: Historically Late

Thank you for interesting in our services. We are a non-profit group that run this website to share documents. We need your help to maintenance this website.

To keep our site running, we need your help to cover our server cost (about $400/m), a small donation will help us a lot.