Presenting results from an arbitrary number of models

9 minute read

The combination of tidyr::nest() and purrr:map() can be used to easily fit the same model to different subsets of a single dataframe. There are many tutorials available to help guide you through this process. There are substantially fewer (none I’ve been able to find) that show you how to use these two functions to fit the same model to different features from your dataframe.

While the former involves splitting your data into different subsets by row, the latter involves cycling through different columns. I recently confronted a problem where I had to run many models, including just one predictor at a time from large pool of candidate predictors, while also including a standard set of control variables in each.1 Given the (apparent) absence of tutorials on fitting the same model to different features from a dataframe using these functions, I decided to write up the solution I reached in the hope it might be helpful to someone else.2 Start by loading the following packages:

library(tidyverse)
library(broom)
library(modelsummary)
library(kableExtra)
library(nationalparkcolors)

We’ll start with a recap of the subsetting approach, then build on it to cycle through features instead of subsets of the data. This code is similar to the official tidyverse tutorial above, but pipes the output directly to a ggplot() call to visualize the results.

mtcars %>% 
  nest(data = -cyl) %>% # split data by cylinders
  mutate(mod = map(data, ~lm(mpg ~ disp + wt + am + gear, data = .x)),
         out = map(mod, ~tidy(.x, conf.int = T))) %>% # tidy model to get coefs
  unnest(out) %>% # unnest to access coefs
  mutate(sig = sign(conf.low) == sign(conf.high), # p <= .05
         cyl = as.factor(cyl)) %>% # factor for nicer plotting
  filter(term == 'disp') %>% 
  ggplot(aes(x = cyl, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Cylinders', y = "Coefficient estimate") + 
  theme_bw() +
  theme(legend.position = 'bottom')

Multiple predictors

The first thing we have to do is create a custom fuction because we now need to be able to specify different predictors in different runs of the model. The code below is very similar to the code above, except that we’re defining the formula in lm() via the formula() function, which parses a character object that we’ve assembled via str_c(). The net effect of this is to fit a model where the pred argmument to func_var() is the first predictor. This lets us use an external function to supply different values to pred. Then we use broom::tidy() to create a tidy dataframe of point estimates and measures of uncertainty from the model and store them in a variable called out. Finally, mutate(pred = pred) creates a variable named pred in the output dataframe that records what the predictor used to fit the model was. We could retrieve this from the mod list-column, but this is approach is simpler both to extract the predictor programtically and to visually inspect the data. We use then purr::map_dfr() to generate a dataframe where each row corresponds to a model with with a different predictor.

func_var <- function(pred, dataset) {
  
  dataset %>% 
    nest(data = everything()) %>% 
    mutate(mod = map(data, ~lm(formula(str_c('mpg ~ ' , pred, # substitute pred
                                             ' + wt + am + gear')),
                               data = .x)),
           out = map(mod, ~tidy(.x, conf.int = T))) %>% 
    mutate(pred = pred) %>% 
    return()
  
}

## predictors of interest
preds <- c('disp', 'hp', 'drat')

## fit models with different predictors
mods_var <- map_dfr(preds, function(x) func_var(x, mtcars))

## inspect
mods_var
## # A tibble: 3 × 4
##   data               mod    out              pred 
##   <list>             <list> <list>           <chr>
## 1 <tibble [32 × 11]> <lm>   <tibble [5 × 7]> disp 
## 2 <tibble [32 × 11]> <lm>   <tibble [5 × 7]> hp   
## 3 <tibble [32 × 11]> <lm>   <tibble [5 × 7]> drat

Plots

You can see our original dataframe that we condensed down into data with nest(), the model object in mod, the tidied model output in out, and finally the predictor used to fit the model in pred. Using unnest(), we can unnest the out object and get a dataframe we can use to plot the main coefficient estimate from each of our three models.

mods_var %>% 
  unnest(out) %>% 
  mutate(sig = sign(conf.low) == sign(conf.high)) %>% 
  filter(term %in% preds) %>% 
  ggplot(aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Predictor', y = "Coefficient estimate") + 
  theme_bw() +
  theme(legend.position = 'bottom')

Tables

Things get slightly more complicated when we want to represent our results textually instead of visually. We can use the excellent modelsummary::modelsummary() function to create our table, but we need to supply a list of model objects, rather than the unnested dataframe we created above to plot the results. We can use the split() function to turn our dataframe into a list, and by using split(seq(nrow(.))), we’ll create one list item for each row in our dataframe.

Since each list item will be a one row dataframe, we can use lapply() to cycle through the list. The mod object in each one row dataframe is itself a list-column, so we need to index it with [[1]] to properly access the model object itself.3 The last step is a call to unname(), which will drop the automatically generated list item names of 1, 2, and 3, allowing modelsummary() to use the default names for each model column in the output.

tab_coef_map = c('disp' = 'Displacement', # format coefficient labels
                 'hp' = 'Horsepower',
                 'drat' = 'Drive ratio',
                 'wt' = 'Weight (1000 lbs)',
                 'am' = 'Manual',
                 'gear' = 'Gears',
                 '(Intercept)' = '(Intercept)')

mods_var %>% 
  split(seq(nrow(.))) %>% # list where each object is a one row dataframe
  lapply(function(x) x$mod[[1]]) %>% # extract model from data dataframe
  unname() %>% # remove names for default names in table
  modelsummary(coef_map = tab_coef_map, stars = c('*' = .05))

Bonus

Now, let’s combine both approaches. We’re going to be splitting our dataframe into three sub-datasets by number of cylinders while also fitting the same model three times with 'disp', 'hp', and 'drat' as predictors. The only changes to func_var() are to omit cyl from the nesting, and to recode it as a factor to treat it as discrete axis labels.

func_var_obs <- function(pred, dataset) {
  
  dataset %>% 
    nest(data = -cyl) %>% 
    mutate(mod = map(data, ~lm(formula(str_c('mpg ~ ' , pred,
                                             ' + wt + am + gear')),
                               data = .x)),
           out = map(mod, ~tidy(.x, conf.int = T)),
           cyl = as.factor(cyl),
           pred = pred) %>% 
    select(-data) %>% 
    return()
  
}

preds <- c('disp', 'hp', 'drat')

mods_var_obs <- map_dfr(preds, function(x) func_var_obs(x, mtcars))

Plotting involves a call to facet_wrap(), but is otherwise similar.

mods_var_obs %>% 
  unnest(out) %>% 
  mutate(sig = sign(conf.low) == sign(conf.high)) %>% 
  filter(term %in% preds) %>% 
  ggplot(aes(x = cyl, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  facet_wrap(~pred) +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Predictor', y = "Coefficient estimate") + 
  theme_bw() +
  theme(legend.position = 'bottom')

Creating tables is more complex. Here we have to cycle through each predictor with a call to map(), filter the output to only contain results from models using that predictor, then split the dataframe by cylinders instead of into separate rows. Note the use of unname(preds_name[x]) to retrieve full english predictor names to create more useful table titles. We’ll also be using tab_coef_map from above to get more informative row labels in our tables. Running the code below generates the following tables:

## named vector for full english predictor names
preds_name <- c('displacement', 'horsepower', 'drive ratio')
names(preds_name) <- preds

map(preds, function(x) mods_var_obs %>% 
      filter(pred == x) %>% # subset to models using predictor x
      select(mod, cyl) %>% # drop tidied model
      split(.$cyl) %>% # split by number of cylinders in engine
      lapply(function(y) y$mod[[1]]) %>% # only one item in each list
      modelsummary(title = str_c('Predictor: ',
                                 unname(preds_name[x]), # formatted name
                   coef_map = tab_coef_map,
                   stars = c('*' = .05),
                   escape = F) %>% 
      add_header_above(c(' ' = 1, 'Cylinders' = 3))) %>% 
  walk(print) # invisibly return input to avoid [[1]] in output



We’ve got one table for each predictor we considered, and each one is split into three models for cars with four, six, and eight cylinder engines. This is a bit overkill for this example, but it’s all you have to do to scale this framework up to hundreds of potential predictors is put more items in preds.

  1. Yes, I know this is a perfect situation to use LASSO. Sometimes people (reviewers) want certain models run, and you just have to run them. 

  2. There’s a very real chance that someone else is me in six months. 

  3. Things get a lot more complicated if your split() call produces a list of dataframes that aren’t one row each, so make sure that’s what you’re getting before you proceed.