Ranae Dietzel and Andee Kaplan
xaringan
Maybe check it out.
You have presentations coming up.
Tuesday, December 13th, 9:45, this room.
ggplot(sim3, aes(x1, y)) +
geom_point(aes(colour = x2))
There are two possible models you could fit to this data:
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
When you add variables with +
, the model will estimate each effect independent of all the others. It’s possible to fit the so-called interaction by using *
. For example, y ~ x1 * x2
is translated to y = a_0 + a_1 * a1 + a_2 * a2 + a_12 * a1 * a2
. Note that whenever you use *
, both the interaction and the individual components are included in the model.
To visualise these models we need two new tricks:
data_grid()
both variables. It finds all the unique values of x1
and x2
and then generates all combinations.gather_predictions()
which adds each prediction as a row. The complement of gather_predictions()
is spread_predictions()
which adds each prediction to a new column.Together this gives us:
grid <- sim3 %>%
data_grid(x1, x2) %>%
gather_predictions(mod1, mod2)
grid
## # A tibble: 80 × 4
## model x1 x2 pred
## <chr> <int> <fctr> <dbl>
## 1 mod1 1 a 1.674928
## 2 mod1 1 b 4.562739
## 3 mod1 1 c 6.480664
## 4 mod1 1 d 4.034515
## 5 mod1 2 a 1.478190
## 6 mod1 2 b 4.366001
## 7 mod1 2 c 6.283926
## 8 mod1 2 d 3.837777
## 9 mod1 3 a 1.281453
## 10 mod1 3 b 4.169263
## # ... with 70 more rows
We can visualise the results for both models on one plot using facetting:
ggplot(sim3, aes(x1, y, colour = x2)) +
geom_point() +
geom_line(data = grid, aes(y = pred)) +
facet_wrap(~ model)
Which model is better for this data? We can take look at the residuals.
sim3 <- sim3 %>%
gather_residuals(mod1, mod2)
ggplot(sim3, aes(x1, resid, colour = x2)) +
geom_point() +
facet_grid(model ~ x2)
ggplot(diamonds, aes(cut, price)) + geom_boxplot()
ggplot(diamonds, aes(color, price)) + geom_boxplot()
ggplot(diamonds, aes(clarity, price)) + geom_boxplot()
ggplot(diamonds, aes(carat, price)) +
geom_point()
diamonds2 <- diamonds %>%
filter(carat <= 2.5) %>%
mutate(lprice = log2(price), lcarat = log2(carat))
mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)
grid <- diamonds2 %>%
data_grid(carat = seq_range(carat, 20)) %>%
mutate(lcarat = log2(carat)) %>%
add_predictions(mod_diamond, "lprice") %>%
mutate(price = 2 ^ lprice)
## # A tibble: 20 × 4
## carat lcarat lprice price
## <dbl> <dbl> <dbl> <dbl>
## 1 0.2000000 -2.32192809 8.289840 312.9613
## 2 0.3210526 -1.63911827 9.437897 693.5697
## 3 0.4421053 -1.17753819 10.213984 1187.7244
## 4 0.5631579 -0.82838862 10.801034 1784.1663
## 5 0.6842105 -0.54748780 11.273333 2475.2060
## 6 0.8052632 -0.31246777 11.668489 3255.1059
## 7 0.9263158 -0.11042399 12.008199 4119.3452
## 8 1.0473684 0.06676901 12.306126 5064.2274
## 9 1.1684211 0.22456026 12.571432 6086.6476
## 10 1.2894737 0.36678233 12.810560 7183.9430
## 11 1.4105263 0.49623358 13.028216 8353.7936
## 12 1.5315789 0.61501973 13.227939 9594.1507
## 13 1.6526316 0.72476514 13.412462 10903.1859
## 14 1.7736842 0.82674917 13.583935 12279.2526
## 15 1.8947368 0.92199749 13.744083 13720.8562
## 16 2.0157895 1.01134497 13.894309 15226.6312
## 17 2.1368421 1.09548031 14.035772 16795.3226
## 18 2.2578947 1.17497823 14.169437 18425.7712
## 19 2.3789474 1.25032335 14.296120 20116.9016
## 20 2.5000000 1.32192809 14.416515 21867.7116
ggplot(diamonds2, aes(carat, price)) +
geom_point() +
geom_line(data = grid, colour = "red", size = 1)
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond, "lresid")
ggplot(diamonds2, aes(lcarat, lresid)) +
geom_point()
ggplot(diamonds2, aes(cut, lresid)) + geom_boxplot()
ggplot(diamonds2, aes(color, lresid)) + geom_boxplot()
ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot()
If we wanted to, we could continue to build up our model, moving the effects we’ve observed into the model to make them explicit. For example, we could include color
, cut
, and clarity
into the model so that we also make explicit the effect of these three categorical variables:
mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)
grid <- diamonds2 %>%
data_grid(cut, .model = mod_diamond2) %>%
add_predictions(mod_diamond2)
grid
## # A tibble: 5 × 5
## cut lcarat color clarity pred
## <ord> <dbl> <chr> <chr> <dbl>
## 1 Fair -0.5145732 G SI1 10.98985
## 2 Good -0.5145732 G SI1 11.10479
## 3 Very Good -0.5145732 G SI1 11.15824
## 4 Premium -0.5145732 G SI1 11.19055
## 5 Ideal -0.5145732 G SI1 11.22187
ggplot(grid, aes(cut, pred)) +
geom_point()
daily <- flights %>%
mutate(date = make_date(year, month, day)) %>%
group_by(date) %>%
summarise(n = n())
daily
## # A tibble: 365 × 2
## date n
## <date> <int>
## 1 2013-01-01 842
## 2 2013-01-02 943
## 3 2013-01-03 914
## 4 2013-01-04 915
## 5 2013-01-05 720
## 6 2013-01-06 832
## 7 2013-01-07 933
## 8 2013-01-08 899
## 9 2013-01-09 902
## 10 2013-01-10 932
## # ... with 355 more rows
ggplot(daily, aes(date, n)) +
geom_line()
Understanding the long-term trend is challenging because there’s a very strong day-of-week effect that dominates the subtler patterns. Let’s start by looking at the distribution of flight numbers by day-of-week:
One way to remove this strong pattern is to use a model. First, we fit the model, and display its predictions overlaid on the original data:
mod <- lm(n ~ wday, data = daily)
grid <- daily %>%
data_grid(wday) %>%
add_predictions(mod, "n")
ggplot(daily, aes(wday, n)) +
geom_boxplot() +
geom_point(data = grid, colour = "red", size = 4)
Next we compute and visualise the residuals:
daily <- daily %>%
add_residuals(mod)
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
geom_line()
Our model seems to fail starting in June: you can still see a strong regular pattern that our model hasn’t captured. Drawing a plot with one line for each day of the week makes the cause easier to see:
ggplot(daily, aes(date, resid, colour = wday)) +
geom_ref_line(h = 0) +
geom_line()
There are some days with far fewer flights than expected:
daily %>%
filter(resid < -100)
## # A tibble: 11 × 4
## date n wday resid
## <date> <int> <ord> <dbl>
## 1 2013-01-01 842 Tues -109.3585
## 2 2013-01-20 786 Sun -105.4808
## 3 2013-05-26 729 Sun -162.4808
## 4 2013-07-04 737 Thurs -228.7500
## 5 2013-07-05 822 Fri -145.4615
## 6 2013-09-01 718 Sun -173.4808
## 7 2013-11-28 634 Thurs -331.7500
## 8 2013-11-29 661 Fri -306.4615
## 9 2013-12-24 761 Tues -190.3585
## 10 2013-12-25 719 Wed -243.6923
## 11 2013-12-31 776 Tues -175.3585
There seems to be some smoother long term trend over the course of a year. We can highlight that trend with geom_smooth()
:
daily %>%
ggplot(aes(date, resid)) +
geom_ref_line(h = 0) +
geom_line(colour = "grey50") +
geom_smooth(se = FALSE, span = 0.20)
Let’s first tackle our failure to accurately predict the number of flights on Saturday. A good place to start is to go back to the raw numbers, focussing on Saturdays:
Lets create a “term” variable that roughly captures the three school terms, and check our work with a plot:
term <- function(date) {
cut(date,
breaks = ymd(20130101, 20130605, 20130825, 20140101),
labels = c("spring", "summer", "fall")
)
}
daily <- daily %>%
mutate(term = term(date))
daily %>%
filter(wday == "Sat") %>%
ggplot(aes(date, n, colour = term)) +
geom_point(alpha = 1/3) +
geom_line() +
scale_x_date(NULL, date_breaks = "1 month", date_labels = "%b")
It’s useful to see how this new variable affects the other days of the week:
daily %>%
ggplot(aes(wday, n, colour = term)) +
geom_boxplot()
It looks like there is significant variation across the terms, so fitting a separate day of week effect for each term is reasonable.
mod1 <- lm(n ~ wday, data = daily)
mod2 <- lm(n ~ wday * term, data = daily)
This improves our model, but not as much as we might hope:
daily %>%
gather_residuals(without_term = mod1, with_term = mod2) %>%
ggplot(aes(date, resid, colour = model)) +
geom_line(alpha = 0.75)
We can see the problem by overlaying the predictions from the model on to the raw data:
Our model is finding the mean effect, but we have a lot of big outliers, so mean tends to be far away from the typical value. We can alleviate this problem by using a model that is robust to the effect of outliers: MASS::rlm()
. This greatly reduces the impact of the outliers on our estimates, and gives a model that does a good job of removing the day of week pattern:
mod3 <- MASS::rlm(n ~ wday * term, data = daily)
daily %>%
add_residuals(mod3, "resid") %>%
ggplot(aes(date, resid)) +
geom_hline(yintercept = 0, size = 2, colour = "white") +
geom_line()