Discussion on fixed effects

Note

In this file, I study the behavior of the fixed effects, what variation is left after I introduce them, plot the residuals…

Note that I estimate everything with OLS, much faster than PPML, but the gist remains.

My specification ultimately incorporates

  1. AGEB (neighborhood) FE,
  2. year-by-month FE,
  3. day-of-week FE
  4. day-of-year FE,

and additionally controls for precipitation, windspeed and humidity.

The baseline model is :

reports_dv ~ tmean | prec_quintile + rh_quintile + wsp_quintile + ageb + year^month + day_of_week + day_of_year

1 Raw data - Relating Temperatures and DV cases

From the page that documents raw associations between temperature and DV, we know that higher temp is associated with more DV.

Simple averages at the monthly level

The unit of analysis in the paper is day per neighborhood. Before looking at data at this level, I first look at month level for the whole city – summing all cases of DV per month and averaging temperatures.

Warmer months record higher level of DV cases, as shown in the graphs below.

plot_monthly_averages <- function(data) {
  
  data_monthly <- data %>% 
    group_by(month = lubridate::month(date, label =T), year) %>% 
    summarise(sum_dv = sum(reports_dv),
              sum_calls = sum(calls),
              temp = mean(tmean),
              .groups = "drop")
  
  reports <- data_monthly %>% 
    filter(!is.na(sum_dv)) %>% 
    summarise(sum_dv = mean(sum_dv), .by = "month") %>%
    ggplot(aes(month, sum_dv)) +
    geom_line(stat = "identity", group = 1) +
    labs(y = "Reports for DV", x = element_blank(),
         title = "Reports for DV per month")
  
  calls <- data_monthly %>% 
    filter(!is.na(sum_calls)) %>% 
    summarise(sum_calls = mean(sum_calls), .by = "month") %>%
    ggplot(aes(month, sum_calls)) +
    geom_line(stat = "identity", group = 1) +
    labs(y = "Calls for DV", x= element_blank(),
         title = "Calls for DV per month")
  
  temp <- data_monthly %>% 
    summarise(mean_temp = mean(temp, na.rm = T), .by = "month") %>% 
    ggplot(aes(month, mean_temp)) +
    geom_bar(stat = "identity", fill = "purple") +
    labs(y = "Temperature", x = element_blank(), title = "Average temperature per month") +
    geom_text(aes(label = round(mean_temp,1)),
              vjust=2, colour = "white") 
  
  temp + reports + calls + plot_layout(ncol = 1) +
    plot_annotation(title = paste("Monthly averages for",
                                  deparse(substitute(data))))
  
}

plot_monthly_averages(data)
plot_monthly_averages(data_pre)
plot_monthly_averages(data_pandemic)
plot_monthly_averages(data_post)
plot_monthly_averages(data_pre_post)

The relationship is clearer in a scatterplot: warmer month record higher DV crime on average.

plot_monthly_scatter <- function(data) {
  
  data_monthly <- data %>% 
    group_by(month, year) %>% 
    summarise(sum_dv = sum(reports_dv),
              sum_calls = sum(calls),
              temp = mean(tmean),
              .groups = "drop")
  
  tmp <- data_monthly %>% 
    group_by(month) %>% 
    summarise(mean_temp = mean(temp, na.rm = T),
              mean_dv = mean(sum_dv, na.rm = T),
              mean_calls = mean(sum_calls, na.rm = T))
  
  correlation_reports <- cor.test(tmp$mean_temp, tmp$mean_dv, method = "pearson")
  
  scatter_reports <- tmp %>% 
    ggplot(aes(mean_temp, mean_dv)) +
    geom_point(stat = "identity") +
    geom_line(stat = "smooth", method = "lm", se = F, size = 1.5, alpha = .3, color = "blue") + 
    geom_text_repel(aes(label=month), colour = "black") +
    #axis ticks on x every 1
    scale_x_continuous(breaks = seq(0, 30, 1)) +
    labs(y = "Reports for DV", x = "Temperature")
  #       caption = paste("Pearson Correlation Coefficient: r =",
  #                       round(correlation_reports$estimate, 2)))
  
  correlation_calls <- cor.test(tmp$mean_temp, tmp$mean_calls, method = "pearson")
  
  scatter_calls <- tmp %>% 
    ggplot(aes(mean_temp, mean_calls)) +
    geom_point(stat = "identity") +
    geom_line(stat = "smooth", method = "lm", se = F, size = 1.5, alpha = .3, color = "blue") + 
    geom_text_repel(aes(label=month), colour = "black") +
    #axis ticks on x every 1
    scale_x_continuous(breaks = seq(0, 30, 1)) +
    labs(y = "Calls for DV", x = "Temperature")
  #caption = paste("Pearson Correlation Coefficient: r =",
  #               round(correlation_calls$estimate, 2)))
  
  
  scatter_reports + scatter_calls +
    plot_annotation(title = "Monthly averages for DV cases and temperature",
                    #subtitle = paste0("Data used is: ", deparse(substitute(data))))
    )
  
}

plot_monthly_scatter(data)
plot_monthly_scatter(data_pre)
plot_monthly_scatter(data_pandemic)
plot_monthly_scatter(data_post)
plot_monthly_scatter(data_pre_post)

I now plot the same graph, but separately for each year of the sample, together with a linear fit. Two facts are clear from the graphs:

  1. The number of DV cases increases over time (vertical shift from year to year)
  2. Temperature is much higher on average in 2019, as compared to other years (horizontal shift for 2019).
data %>% 
  group_by(month_nr = month(date), month, year) %>% 
  summarise(mean_temp = mean(tmean),
            sum_dv = sum(reports_dv)) %>% 
  mutate(year = as_factor(year)) %>% 
  ggplot(aes(mean_temp, sum_dv)) +
  geom_point(stat = "identity",
             aes(color = year)) +
  scale_x_continuous(breaks = seq(0, 30, 1)) +
  geom_line(stat = "smooth", aes(color = year),
            method = "lm", size = 1, alpha = .3) +
  labs(y = "Monthly DV reports", x = "Average temperature")

So, if I naively relate temperatures and DV cases, the positive relationship might be explained by the fact that 2019 was a warmer year, and also the one with the most DV cases. This is one of the reasons I will include year fixed effects in my analysis. The year FE will shift the intercept.

Simple averages at day x neighborhood level

Now, I get to the unit of analysis of the paper: day per neighborhood. I first to plot the average number of DV cases per day per neighborhood, for days that fall into a certain 1°C temperature bin.

This plot shows that, for example, that observations (neighborhood days) that fall in the 11-12°C bin record on average 0.02 reports for domestic violence. The plot shows that warmer day record warmer DV cases, but the extreme bins are very noisy.

data %>%
  filter(!is.na(tmean)) %>% 
  group_by(bins_tmean_one = cut_width(tmean, width = 1, center = .5, closed = "left")) %>%
  summarise(mean_dv = mean(reports_dv),
            count = n(),
            sd = sd(reports_dv),
            se = sd / sqrt(count)) %>%
  filter(mean_dv > 0) %>% #to compute the CI, I need at least 1 crime
  mutate(prop_in_pc = 100*count/sum(count)) %>% 
  mutate(lower_ci = mean_dv - qt(0.975, count-1) * se,  
         upper_ci = mean_dv + qt(0.975, count-1) * se) %>% 
  ggplot(aes(bins_tmean_one, mean_dv, color = prop_in_pc)) +
  geom_point(stat = "identity", size = 2.5) +
  geom_pointrange(aes(ymin = lower_ci, ymax = upper_ci)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 1, angle = 45)) +
  scale_colour_viridis_c(option = "D", direction = -1,
                         limits = c(0, 1),
                         na.value = "black",
                         breaks = c(0, .5, 1),
                         labels = c("0%", ".5%", ">1%")) +
  labs(title = "Average number of DV reports per day per neighborhood, by temperature bin",
       x = "Temperature bin (°C)", y = "DV reports", color = "Support of temperature bin (% of obs)") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(colour = guide_colourbar(title.position="top",
                                  barwidth=13))

As is clear from the graph above, the support of each temperature bin is very unequal. The color of the points represents the proportion of day/neighborhood that fall in that temperature bin i.e the support of the distribution in my data. The darker the color, the higher the proportion of days that fall in that temperature bin. The black points all have a support higher than 1% (up to 18% of observations for the 18-19 degrees bin). However, less than 1% of observations fall in all the degrees bin below [11,12) or above [22,23), as indicated by lighter colors, resulting in large confidence intervals. Very few day/neighborhood observations fall in the most extreme bins, and those observations sum 0 DV reports (so, also no CI).

There are two ways to go around this. First, I plot the same graph, by lumping the most extreme bins together, so that each bin contains at least 1pc of the observations.

data %>%
  filter(!is.na(tmean)) %>% 
  group_by(bins_temp_one) %>%
  summarise(mean_dv = mean(reports_dv),
            count = n(),
            sd = sd(reports_dv),
            se = sd / sqrt(count)) %>%
  #filter(mean_dv > 0) %>% #to compute the CI, I need at least 1 crime
  mutate(prop_in_pc = 100*count/sum(count)) %>% 
  mutate(lower_ci = mean_dv - qt(0.975, count-1) * se,  
         upper_ci = mean_dv + qt(0.975, count-1) * se) %>% 
  ggplot(aes(bins_temp_one, mean_dv, color = prop_in_pc)) +
  geom_point(stat = "identity", size = 3) +
  geom_linerange(aes(ymin = lower_ci, ymax = upper_ci)) +
  scale_x_discrete(guide = guide_axis(n.dodge = 1, angle = 45)) +
  scale_colour_viridis_c(option = "G", direction = -1,
                         limits = c(0, 5),
                         na.value = "black",
                         breaks = c(0, 1, 2, 3, 4, 5),
                         labels = c("0%", "1%", "2%", "3%", "4%", ">5%")) +
  labs(title = "Avg. count of DV reports per day per neighborhood, by temp. bin",
       x = element_blank(), y = "DV reports", color = "Support of temperature bin (% of obs)") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  guides(colour = guide_colourbar(title.position = "top",
                                  barwidth = 13))

Another way is to construct temperature bins that contain the same number of observations. Here, I construct 20 bins of equal size based on temperature. The dots represent the average temperature and the average number of DV reports per day per neighborhood for each bin. The dashed line shows the predicted relationship based on an OLS estimation for the underlying data.

data %>% 
  mutate(bins_temps_equal_size = cut_number(temp, 20)) %>% 
  # I can do the same with many more bins, for example 200:
  # mutate(bins_temps_equal_size = cut_number(temp, 20)) %>% 
  summarise(mean_dv = mean(reports_dv),
            mean_temp = mean(temp), .by = "bins_temps_equal_size") %>% 
  ggplot(aes(mean_temp, mean_dv)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, linetype = "dashed", color = "red3", size = .5) +
  scale_x_continuous(breaks = seq(7, 24, 1)) +
  labs(title = "Temperatures and daily DV reports at neighborhood level",
       subtitle = "20 bins of equal n (based on temperature)",
       x = "Average temperature",
       y = "DV reports")

2 Adding the different fixed effects incrementally

Neighborhood fixed effects

There is substantial spatial variation in domestic violence crime across neighborhoods of Mexico City, which may result from many factors. To account for this spatial variation in DV crime, I include neighborhood fixed effects. These fixed effects ensure that the model is estimated on deviations from neighborhood averages rather than on cross-sectional differences in temperature. Intuitively, the implication of this modelling choice is that the estimates represent a weighted average of within-neighborhood comparisons, i.e., the difference in domestic violence in a particular neighborhood of Mexico City, on a warmer day versus a colder day, for example.

reg_no_fe <-
  feols(reports_dv ~ temp + prec,
        data_pre, cluster = "col_id")

reg_col_fe <-
  feols(reports_dv ~ temp + prec | col_id,
        data_pre, cluster = "col_id", 
        combine.quick = FALSE)

I first run the simplest OLS regression of reports for DV on temperature at the neighborhood-day level, controlling only for precipitation. The relationship is positive, and a one degree increase on one day is associated with ….. more DV cases per neighborhood on average.

However, on aggregate, this is overestimating the effect of temperatures. If I include neighborhood fixed effects, the relationship is much weaker, with a coefficient estimate of …..

#extract the fe from the model
col_fe <- 
  fixef(reg_col_fe)$col_id %>%
  enframe(name = "col_id", value = "fixef") %>%
  mutate(col_id = as.numeric(col_id)) %>% 
  mutate(fixef_maxxed = case_when(fixef > .2 ~ .2,
                                  TRUE ~ fixef))

#join to the spatial dataset
colonias <-
  read_sf(here("output", "colonias.gpkg"), quiet = T) %>% 
  left_join(col_fe)
ggplot(colonias) +
  annotation_scale(location = "br", style = "ticks") +
  geom_sf(aes(fill = fixef_maxxed),color = gray(.7), alpha = .9) +
  #theme_void() +
  scale_fill_gradientn(colors = viridis(n = 10),
                       breaks = c(0, .1, .2),
                       labels = c("0", "0.1", ">0.2")) +
  theme(legend.position=c(.9,.875), legend.title = element_blank()) +
  labs(title = "Neighborhood fixed effect estimates",
       subtitle = "From a simple model: dv ~ temp | neighborhood")

Year Month fixed effects

My identification strategy also addresses the seasonality of both domestic violence and temperature. I include year-by-month fixed effects in the model. Intuitively, this choice of fixed effects implies that the model coefficients represent an average of the differences in domestic violence on hot days versus cold days within a particular neighborhood and month of a certain year.

I can plot the fitted values from these models over the temperature support of the data.

# simple estimation using no FE

reg_fe_year <-
  feols(reports_dv ~ temp | year,
        data_pre, cluster = "col_id")

reg_fe_year_month <-
  feols(reports_dv ~ temp | year^month,
        data_pre, cluster = "col_id",
        combine.quick = FALSE)
# as_tibble(expand.grid(temp = seq(7,23))) %>% 
#   mutate(pred = predict(reg_no_fe, newdata = .)) %>% 
#   ggplot(aes(x = temp, y = pred)) +
#   geom_line() +
#   labs(title = "Predicted number of daily DV reports from a simple model without FE")

rbind(
  as_tibble(expand.grid(temp = seq(7,23),
                        year = 2016:2019,
                        month = rep("All",12))) %>% 
    mutate(pred = predict(reg_fe_year, newdata = .)),
  
  as_tibble(expand.grid(temp = seq(7,23),
                        year = 2016:2019,
                        month = unique(data$month))) %>% 
    mutate(pred = predict(reg_fe_year_month, newdata = .))
) %>% 
  mutate(all = ifelse(month == "All", "Year FEs", "Year x Month FEs"),
         year = factor(year)) %>%
  ggplot(aes(x = temp, y = pred, group = paste(year,month), color = year)) +
  geom_line() +
  facet_wrap(~all) +
  labs(title = "Predicted DV reports from models with (year) and (year month) FEs") +
  theme(legend.position = "bottom")

These are the estimates for the year FEs, by which the line is shifted.

enframe(fixef(reg_fe_year)$year,
        name = "year",
        value = "FE") %>%
  pivot_wider(values_from = FE, names_from = year) %>% 
  gt() %>% 
  fmt_number(decimals = 3)

# enframe(fixef(reg_fe_year_month)$`year^month`, "year_month", "FE") %>% 
#   separate(year_month, sep = "_", into = c("year", "month")) %>% 
#   arrange(year) %>% 
#   pivot_wider(values_from = FE, names_from = year) %>% 
#   gt() %>% 
#   data_color(columns = 2:8,
#              direction = "row",
#              palette = "Purples",
#              order="desc")

I can also plot the residuals from the model with only neighborhood fixed-effects, and then with year-month fixed-effects. The apparent patterns suggest that we fail to account for year and monthly trend in the simple model. Ideally, residuals should be randomly distributed without patterns when plotted over time.

data_pre %>% 
  mutate(resid = resid(reg_col_fe), .after = "reports_dv") %>% 
  filter(#year == 2018,
         #month %in% c("Mar", "Apr")
         ) %>% 
  group_by(date, day_of_week, month) %>%
  summarise(resid = mean(resid)) %>%
  ggplot(aes(x = date, y = resid, color = month)) +
  geom_point() +
  labs(title = "Residuals from a model with only neighborhood FE",
       subtitle = "We clearly fail to account for year and monthly trend")

data_pre %>% 
  mutate(resid = resid(reg_fe_year_month), .after = "reports_dv") %>% 
  filter(#year == 2018,
         #month %in% c("Mar", "Apr")
         ) %>% 
  group_by(date, day_of_week, month) %>%
  summarise(resid = mean(resid)) %>%
  ggplot(aes(x = date, y = resid, color = month)) +
  geom_point() +
  labs(title = "Residuals from a model with neighborhood + year-month FE",
       subtitle = "We do account for year and monthly trend, still not random around 0")

Day of week

Below, I plot the residuals from a model with and without day of week fixed effects. This makes it very clear that we fail to account for the trend in crime over the week if we do not include those FE.

reg_fe_dow <- 
  feols(reports_dv ~ temp + prec | col_id + year^month + day_of_week,
        cluster = ~ col_id, data = data_pre)
data_pre %>% 
  mutate(resid = resid(reg_fe_year_month), .after = "reports_dv") %>% 
  group_by(date, day_of_week, month) %>%
  summarise(resid = mean(resid)) %>%
  ggplot(aes(x = date, y = resid, color = day_of_week %in% c("Sat", "Sun"))) +
  geom_point(size = 1) +
  geom_hline(yintercept = 0) +
  scale_color_manual(values = c("skyblue", "orange"), name = "Weekend:") +
  labs(title = "Residuals from a model with neighborhood + year-month FE",
       subtitle = "We do account for year and monthly trend, still not random around 0") +
  theme(legend.position = "bottom")

data_pre %>% 
  mutate(resid = resid(reg_fe_dow), .after = "reports_dv") %>% 
  group_by(date, day_of_week, month) %>%
  summarise(resid = mean(resid)) %>%
  ggplot(aes(x = date, y = resid, color = day_of_week %in% c("Sat", "Sun"))) +
  geom_point(size = 1) +
  geom_hline(yintercept = 0)+
  scale_color_manual(values = c("skyblue", "orange"), name = "Weekend:") +
  labs(title = "Residuals from a model with neighborhood + year-month FE + day-of-week + calendar days",
       subtitle = "We do account for many time trends, looks random around 0") +
  theme(legend.position = "bottom")

I estimate the model by further adding day of week fixed effect. I show the estimates for the day of week fixed effects, with Monday as the reference category. DV crime is higher on the weekend than during the week.

#restimate with day of week as control, to be able to set monday as ref category
reg_fe_dow <- 
  feols(reports_dv ~ temp + prec + i(day_of_week, ref= "Mon") | col_id + year^month,
        cluster = ~ col_id, data = data_pre)

dow_fe <- iplot(reg_fe_dow, only.params = T)$prms %>% 
  as_tibble() %>% 
  mutate(order = if_else(x == 1, 8, x)) %>% 
  arrange(order) 

ggplot(dow_fe, aes(x = reorder(estimate_names, order), y = estimate)) +
  geom_col() +
  labs(title = "Estimates for day of week fixed effect",
       subtitle = "Monday is the reference category") +
  geom_hline(yintercept = 0, color = "red") +
  labs(x = "Day of week", y = "Estimate") 

rm(dow_fe)

Day of year and holidays FE

I also want to include day-of-year fixed effects to absorb patterns of domestic violence that are related to calendar effects, such as pay-day effects or holidays, or potential misreporting of the DV data on the first of each month for example.

I include a day of year variable in my estimations. I show below a table with the estimates for the day of year fixed effects. First of each month appears very clearly: probably measurement error. Other dates appear: for example, September 16 – which is the highest estimates of all dates – corresponds to Independence Day.

reg_day_of_year <- data_pre %>%
  mutate(month_day = mday(date),
         month = month(date)) %>%
  mutate(day_of_year = paste(month, month_day)) %>% 
  feols(reports_dv ~ temp + prec | col_id + year^month + day_of_week + day_of_year,
        cluster = ~ col_id)

day_fe <- enframe(fixef(reg_day_of_year)$day_of_year, "day_of_year", "FE") %>%
  separate(day_of_year, sep = " ", into = c("month", "day")) %>%
  mutate(month = as.numeric(month),
         day = as.numeric(day)) %>% 
  arrange(month, day) %>% 
  left_join(data %>% distinct(month) %>% 
              rename(month_name = month) %>% 
              mutate(month = as.numeric(month_name)))

day_fe %>%
  pivot_wider(values_from = FE, names_from = day) %>%
  select(-month) %>% 
  gt(rowname_col = "month_name") %>%
  fmt_number(decimals = 3) %>% 
  data_color(columns = 2:32,
             direction = "row",
             palette = c("red", 'white', "green"))

rm(day_fe)

I can instead include a dummy for public holiday days. I do so below to compare. I could also include a dummy for school holidays (find the data). But I would fail to control for the first of the month, so I eventually decide to control for calendar days.

data %>% distinct(date, holiday) %>% filter(holiday & year(date) == 2016) %>% 
  gt()

Interact time and neighborhood FEs?

Fixed effects for year, calendar month, and day of the week preclude bias from day-specific heterogeneity and flexibly control for long-term trends and seasonality common to all neighborhoods. To exclude further possible biases, I can interact the time fixed effects for year, calendar month, and day of the week with the neighborhood dummies. The interactions allow for neighborhood-specific differences in patterns over months and over days of the week.

reg_fe_non_interacted <- 
  feols(reports_dv ~ temp + prec | col_id + year^month + day_of_week,
        cluster = ~ col_id, data = data_pre)

reg_fe_full_interacted <- 
  feols(reports_dv ~ temp + prec | col_id + year^month + day_of_week + year^month^col_id + day_of_week^col_id,
        cluster = ~ col_id, data = data_pre)

etable(reg_fe_non_interacted, reg_fe_full_interacted, powerBelow = -10)
rm(list = ls(pattern = "reg_"))

However, I am not overfitting the model? From the regression table, it is hard to see anything.

I can plot the residualized data from an OLS estimation of DV reports on temp and all those FE, at the neighborhood x day level. This suggests that the relationship between temperature and DV reports is still positive, even after controlling for all these factors. Can I interpret anything from those two plots? Anyways, I think this is too mant FE, and I should not include the full interaction.

#follow wooldridge p337
reports_dv_detrend <- 
  feols(reports_dv ~ 1 | col_id + year^month + day_of_week + day_of_year,
        cluster = ~ col_id, data = data_pre)

temp_detrend <- 
  feols(temp ~  1 | col_id + year^month + day_of_week + day_of_year,
        cluster = ~ col_id, data = data_pre)

prec_detrend <- 
  feols(prec ~  1 | col_id + year^month + day_of_week + day_of_year,
        cluster = ~ col_id, data = data_pre)

data_pre <- data_pre %>% 
  mutate(resid_temp = resid(temp_detrend),
         resid_dv = resid(reports_dv_detrend),
         resid_prec = resid(prec_detrend))

#by regressing the residuals, I obtain the same coefficients as with the original model
# etable(
#   feols(reports_dv ~ temp + prec | col_id + year^month + day_of_week + day_of_year,
#         cluster = ~ col_id, data = data),
#   feols(resid_dv ~ resid_temp + resid_prec,
#         cluster = ~ col_id, data = data)
# )


data_pre %>% 
  mutate(bins_temps_equal_size = cut_number(resid_temp, 20)) %>% 
  summarise(mean_resid_dv = mean(resid_dv),
            mean_resid_temp = mean(resid_temp), .by = "bins_temps_equal_size") %>% 
  ggplot(aes(mean_resid_temp, mean_resid_dv)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, linetype = "dashed", color = "red3", size = .5) +
  labs(title = "Residuals of temp. and daily DV reports at neighborhood level",
       subtitle = "20 bins of equal n (based on temperature)",
       caption = "From a model with no interactions between time and neighborhood \n i.e. col_id + year^month + day_of_week + day_of_year",
       x = "Temperature residuals",
       y = "DV reports residuals")
#follow wooldridge p337
reports_dv_detrend <- 
  feols(reports_dv ~ 1 | col_id + year^month + day_of_week + day_of_year + year^month^col_id + day_of_week^col_id,
        cluster = ~ col_id, data = data_pre)

temp_detrend <- 
  feols(temp ~  1 | col_id + year^month + day_of_week + day_of_year + year^month^col_id + day_of_week^col_id,
        cluster = ~ col_id, data = data_pre)

prec_detrend <- 
  feols(prec ~  1 | col_id + year^month + day_of_week + day_of_year + year^month^col_id + day_of_week^col_id,
        cluster = ~ col_id, data = data_pre)

data_pre <- data_pre %>% 
  mutate(resid_temp = resid(temp_detrend),
         resid_dv = resid(reports_dv_detrend),
         resid_prec = resid(prec_detrend))

#by regressing the residuals, I obtain the same coefficients as with the original model
# etable(
#   feols(reports_dv ~ temp + prec | col_id + year^month + day_of_week + day_of_year,
#         cluster = ~ col_id, data = data),
#   feols(resid_dv ~ resid_temp + resid_prec,
#         cluster = ~ col_id, data = data)
# )


data_pre %>% 
  mutate(bins_temps_equal_size = cut_number(resid_temp, 20)) %>% 
  summarise(mean_resid_dv = mean(resid_dv),
            mean_resid_temp = mean(resid_temp), .by = "bins_temps_equal_size") %>% 
  ggplot(aes(mean_resid_temp, mean_resid_dv)) +
  geom_point() +
  geom_smooth(method = "lm", se = F, linetype = "dashed", color = "red3", size = .5) +
  labs(title = "Residuals of temp. and daily DV reports at neighborhood level",
       subtitle = "20 bins of equal n (based on temperature)",
       caption = "From a model with full interactions between time and neighborhood \n i.e. col_id + year^month + day_of_week + day_of_year + year^month^col_id + day_of_week^col_id",
       x = "Temperature residuals",
       y = "DV reports residuals")
rm(list = ls(pattern = "detrend"))

3 OLS results

Finally, I show the results of the different regressions in Table 1, in which I add the FE incrementally. I use data from the period before the pandemic.

setFixest_estimation(reset = TRUE)
setFixest_etable(reset = TRUE)

my_fun_effsize = function(x) round(coef(x)["temp"]/mean(model.matrix(x, type = "lhs"))*100, 2) %>% paste0(., "%")

extralines_register("tempeffsize", my_fun_effsize, "__Effect size (temp)")
Table 1: Simple OLS results

OLS results without any FE in column (1) are overestimating the effect of temperature. When I control only the seasonality in (2), the estimate is even larger. However, as soon as I control for neighborhood in (3), the estimate for temperature gets to its stable level. Controls for seasonality and neighborhood (4) seem to be enough to stabilize the coefficient estimate on temperature. I add day of the week in (5) and day of the year in (6), but the coefficient on temperature does not change much. Finally, I control for a holidays dummy in (7) instead of the day of year FE, but the coefficient on temperature remains stable.

My preferred specification is that of column (6). The identifying variation in that specification using fixed effects can be characterized as within-neighborhood, within-year, within-month, and within-day of the week.

Adding all those FE maximizes the control for confounding variation and focuses on the most granular level of time-variant changes in temperature within neighborhoods. However, I could be left with too little variation for identification, and the estimation could be sensitive to measurement error in temperature or crime counts… Signal to noise ratio…

Specifying temperatures flexibly

I can estimate the same specifications as in Table 1, but with a flexible specification for temperature. I use bins of 1 degree, to be able to detect non-linearities in the effect of temperature on crime, but also, and more importantly at this point, to detect if the inclusion of all those fixed effects leads to a loss in precision (the more bins, the more precision I would need).

ols_flex <- 
  feols(reports_dv ~ i(bins_temp_one, ref = "[16") + prec | 
          sw0(year^month,
              col_id, 
              col_id + year^month,
              col_id + year^month + day_of_week,
              col_id + year^month + day_of_week + day_of_year),
        cluster = ~ col_id,
        data = data_pre)

ids <- coeftable(ols_flex) %>% select(id, fixef) %>% unique()

estimates <- iplot(ols_flex, only.params = T)$prms %>% 
  left_join(ids, by = "id") %>% as_tibble() %>% 
  mutate(fixef = if_else(fixef == "1", "No fixed effects", fixef),
         fixef = as_factor(fixef))

rm(ols_flex, ids)

I first plot the estimates for specifications with corresponding to columns 1 to 3. As in Table 1, it is clear that controlling for neighborhood is crucial and is enough to capture estimates reasonably close in magnitude to the final ones.

estimates %>% 
  filter(id %in% 1:3) %>%
  ggplot(aes(x = estimate_names, group = 1)) +
  geom_point(aes(y = estimate), size = 1) +
  geom_errorbar(aes(ymin = ci_low, ymax = ci_high),
                width = 0.5, color = "black") +
  geom_hline(yintercept = 0, alpha = .5) +
  labs(x = element_blank(), y = element_blank()) +
  scale_y_continuous(breaks = scales::breaks_width(.005)) +
  scale_x_discrete(guide = guide_axis(angle = 45)) + 
  facet_wrap(~fixef) 

Then, I plot col 4-6. The coefficients are surprisingly stable. The inclusion of neighborhood and seasonality seems to be enough to capture the effect of temperature on crime. Adding day of the week and day of the year does not change the estimates much.

estimates %>% 
  filter(id %in% 4:6) %>%
    ggplot(aes(x = estimate_names, group = 1)) +
  geom_point(aes(y = estimate), size = 1) +
    geom_errorbar(aes(ymin = ci_low, ymax = ci_high),
                  width = 0.5, color = "black") +
    geom_hline(yintercept = 0, alpha = .5) +
    labs(x = element_blank(), y = element_blank()) +
    scale_y_continuous(breaks = scales::breaks_width(.005)) +
    scale_x_discrete(guide = guide_axis(angle = 45)) + 
  facet_wrap(~fixef)

rm(estimates)

Now that we have a good overview of the fixed effects, let us move to the modelling. As of now, I estimate simple OLS, but this is clearly not optimal given the nature of the data.