Rob Williams

Presenting results from an arbitrary number of models

2023-03-04T00:00:00-06:00

The combination of tidyr::nest() and purrr::map() can be used to easily fit the same model to different subsets of a single dataframe. There are many tutorials available to help guide you through this process. There are substantially fewer (none I’ve been able to find) that show you how to use these two functions to fit the same model to different features from your dataframe.

While the former involves splitting your data into different subsets by row, the latter involves cycling through different columns. I recently confronted a problem where I had to run many models, including just one predictor at a time from a large pool of candidate predictors, while also including a standard set of control variables in each.¹ Given the (apparent) absence of tutorials on fitting the same model to different features from a dataframe using these functions, I decided to write up the solution I reached in the hope it might be helpful to someone else.² Start by loading the following packages:

library(tidyverse)
library(broom)
library(modelsummary)
library(kableExtra)
library(nationalparkcolors)

We’ll start with a recap of the subsetting approach, then build on it to cycle through features instead of subsets of the data. This code is similar to the official tidyverse tutorial above, but pipes the output directly to a ggplot() call to visualize the results.

mtcars %>% 
  nest(data = -cyl) %>% # split data by cylinders
  mutate(mod = map(data, ~lm(mpg ~ disp + wt + am + gear, data = .x)),
         out = map(mod, ~tidy(.x, conf.int = T))) %>% # tidy model to get coefs
  unnest(out) %>% # unnest to access coefs
  mutate(sig = sign(conf.low) == sign(conf.high), # 95% CI excludes zero (p < .05)
         cyl = as.factor(cyl)) %>% # factor for nicer plotting
  filter(term == 'disp') %>% 
  ggplot(aes(x = cyl, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Cylinders', y = "Coefficient estimate") + 
  theme_bw() +
  theme(legend.position = 'bottom')
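
If the sig flag looks like a shortcut, here is a minimal sketch of why it works, using only base R rather than the tidy() output above. It assumes a 95% confidence level and fits the model on the full mtcars data instead of the per-cylinder subsets; the object names (mod_all, sig_ci, sig_p) are hypothetical and only there for illustration. A confidence interval whose bounds share a sign excludes zero, which for lm() corresponds to a two-sided p-value below .05.

```r
## fit the model once on the full data (illustration only)
mod_all <- lm(mpg ~ disp + wt + am + gear, data = mtcars)

## flag coefficients whose 95% confidence interval excludes zero
ci <- confint(mod_all, level = 0.95)
sig_ci <- sign(ci[, 1]) == sign(ci[, 2])

## flag coefficients with a two-sided p-value below .05
sig_p <- summary(mod_all)$coefficients[, 'Pr(>|t|)'] < 0.05

## compare the two flags coefficient by coefficient
all(sig_ci == sig_p)
```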

Multiple predictors

The first thing we have to do is create a custom function because we now need to be able to specify different predictors in different runs of the model. The code below is very similar to the code above, except that we’re defining the formula in lm() via the formula() function, which parses a character object that we’ve assembled via str_c(). The net effect of this is to fit a model where the pred argument to func_var() is the first predictor. This lets us use an external function to supply different values to pred. Then we use broom::tidy() to create a tidy dataframe of point estimates and measures of uncertainty from the model and store them in a variable called out. Finally, mutate(pred = pred) creates a variable named pred in the output dataframe that records which predictor was used to fit the model. We could retrieve this from the mod list-column, but this approach makes it simpler both to extract the predictor programmatically and to visually inspect the data. We then use purrr::map_dfr() to generate a dataframe where each row corresponds to a model with a different predictor.

func_var <- function(pred, dataset) {
  
  dataset %>% 
    nest(data = everything()) %>% 
    mutate(mod = map(data, ~lm(formula(str_c('mpg ~ ' , pred, # substitute pred
                                             ' + wt + am + gear')),
                               data = .x)),
           out = map(mod, ~tidy(.x, conf.int = T))) %>% 
    mutate(pred = pred) %>% 
    return()
  
}

## predictors of interest
preds <- c('disp', 'hp', 'drat')

## fit models with different predictors
mods_var <- map_dfr(preds, function(x) func_var(x, mtcars))

## inspect
mods_var

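## # A tibble: 3 × 4
##   data               mod    out              pred 
##   <list>             <list> <list>           <chr>
## 1 <tibble [32 × 11]> <lm>   <tibble [5 × 7]> disp 
## 2 <tibble [32 × 11]> <lm>   <tibble [5 × 7]> hp   
## 3 <tibble [32 × 11]> <lm>   <tibble [5 × 7]> drat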
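
To see the formula-building step from func_var() in isolation, here is a minimal sketch using only base R. The helper name build_formula() and its controls argument are hypothetical (they are not part of the code above). It swaps in stats::reformulate(), a base R function that assembles a formula from a character vector of terms, as an alternative to pasting a string together and parsing it with formula().

```r
## hypothetical helper, not part of the original code: build the model
## formula from one predictor name plus a fixed set of controls
build_formula <- function(pred, controls = c('wt', 'am', 'gear')) {
  reformulate(termlabels = c(pred, controls), response = 'mpg')
}

## one formula per predictor of interest
forms <- lapply(c('disp', 'hp', 'drat'), build_formula)
forms[[1]]

## the result can be passed straight to lm()
mod_disp <- lm(build_formula('disp'), data = mtcars)
names(coef(mod_disp))
```

Either approach should produce the same model; the string version is easier to read at a glance, while reformulate() avoids getting the spacing and plus signs right by hand.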

Plots

You can see our original dataframe that we condensed down into data with nest(), the model object in mod, the tidied model output in out, and finally the predictor used to fit the model in pred. Using unnest(), we can unnest the out object and get a dataframe we can use to plot the main coefficient estimate from each of our three models.

mods_var %>% 
  unnest(out) %>% 
  mutate(sig = sign(conf.low) == sign(conf.high)) %>% 
  filter(term %in% preds) %>% 
  ggplot(aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Predictor', y = "Coefficient estimate") + 
  theme_bw() +
  theme(legend.position = 'bottom')

Tables

Things get slightly more complicated when we want to represent our results textually instead of visually. We can use the excellent modelsummary::modelsummary() function to create our table, but we need to supply a list of model objects, rather than the unnested dataframe we created above to plot the results. We can use the split() function to turn our dataframe into a list, and by using split(seq(nrow(.))), we’ll create one list item for each row in our dataframe.

Since each list item will be a one row dataframe, we can use lapply() to cycle through the list. The mod object in each one row dataframe is itself a list-column, so we need to index it with [[1]] to properly access the model object itself.³ The last step is a call to unname(), which will drop the automatically generated list item names of 1, 2, and 3, allowing modelsummary() to use the default names for each model column in the output.

tab_coef_map = c('disp' = 'Displacement', # format coefficient labels
                 'hp' = 'Horsepower',
                 'drat' = 'Drive ratio',
                 'wt' = 'Weight (1000 lbs)',
                 'am' = 'Manual',
                 'gear' = 'Gears',
                 '(Intercept)' = '(Intercept)')

mods_var %>% 
  split(seq(nrow(.))) %>% # list where each object is a one row dataframe
  lapply(function(x) x$mod[[1]]) %>% # extract model from data dataframe
  unname() %>% # remove names for default names in table
  modelsummary(coef_map = tab_coef_map, stars = c('*' = .05))
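
To make the split() step a little less mysterious, here is a minimal sketch on a toy stand-in for mods_var. The toy tibble and its simplified formula (one predictor plus wt) are assumptions for illustration only. It shows the names that split() generates and what unname() removes. The last line is a possible shortcut, not part of the workflow above: pull() returns the list-column of models directly.

```r
library(tidyverse)

## toy stand-in for mods_var: one row per predictor, models in a list-column
toy <- tibble(pred = c('disp', 'hp', 'drat')) %>% 
  mutate(mod = map(pred, ~lm(formula(str_c('mpg ~ ', .x, ' + wt')),
                             data = mtcars)))

## split() by row number gives a named list of one row dataframes
rows <- toy %>% split(seq(nrow(.)))
names(rows) # "1" "2" "3"

## [[1]] pulls the model out of each one row list-column
mods <- rows %>% 
  lapply(function(x) x$mod[[1]]) %>% 
  unname() # drop the "1", "2", "3" names

## possible shortcut (assumption): take the list-column as is
mods_direct <- pull(toy, mod)
```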

Bonus

Now, let’s combine both approaches. We’re going to be splitting our dataframe into three sub-datasets by number of cylinders while also fitting the same model three times with 'disp', 'hp', and 'drat' as predictors. The only changes to func_var() are to omit cyl from the nesting, and to recode it as a factor to treat it as discrete axis labels.

func_var_obs <- function(pred, dataset) {
  
  dataset %>% 
    nest(data = -cyl) %>% 
    mutate(mod = map(data, ~lm(formula(str_c('mpg ~ ' , pred,
                                             ' + wt + am + gear')),
                               data = .x)),
           out = map(mod, ~tidy(.x, conf.int = T)),
           cyl = as.factor(cyl),
           pred = pred) %>% 
    select(-data) %>% 
    return()
  
}

preds <- c('disp', 'hp', 'drat')

mods_var_obs <- map_dfr(preds, function(x) func_var_obs(x, mtcars))

Plotting involves a call to facet_wrap(), but is otherwise similar.

mods_var_obs %>% 
  unnest(out) %>% 
  mutate(sig = sign(conf.low) == sign(conf.high)) %>% 
  filter(term %in% preds) %>% 
  ggplot(aes(x = cyl, y = estimate, ymin = conf.low, ymax = conf.high,
             color = sig)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, lty = 2, color = 'grey60') +
  facet_wrap(~pred) +
  scale_color_manual(name = 'Statistical significance',
                     labels = str_to_title,
                     values = park_palette('Saguaro')) +
  labs(x = 'Cylinders', y = "Coefficient estimate") + 
  theme_bw() +
  theme(legend.position = 'bottom')

Creating tables is more complex. Here we have to cycle through each predictor with a call to map(), filter the output to only contain results from models using that predictor, then split the dataframe by cylinders instead of into separate rows. Note the use of unname(preds_name[x]) to retrieve full English predictor names to create more useful table titles. We’ll also be using tab_coef_map from above to get more informative row labels in our tables. Running the code below generates the following tables:

## named vector for full English predictor names
preds_name <- c('displacement', 'horsepower', 'drive ratio')
names(preds_name) <- preds

map(preds, function(x) mods_var_obs %>% 
      filter(pred == x) %>% # subset to models using predictor x
      select(mod, cyl) %>% # drop tidied model
      split(.$cyl) %>% # split by number of cylinders in engine
      lapply(function(y) y$mod[[1]]) %>% # only one item in each list
      modelsummary(title = str_c('Predictor: ',
                                 unname(preds_name[x])), # formatted name
                   coef_map = tab_coef_map,
                   stars = c('*' = .05),
                   escape = F) %>% 
      add_header_above(c(' ' = 1, 'Cylinders' = 3))) %>% 
  walk(print) # invisibly return input to avoid [[1]] in output

We’ve got one table for each predictor we considered, and each one is split into three models for cars with four, six, and eight cylinder engines. This is a bit overkill for this example, but all you have to do to scale this framework up to hundreds of potential predictors is put more items in preds.
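
As a rough sketch of what scaling up might look like, the predictor vector can be built from the column names instead of typed by hand. This assumes every column of mtcars other than the outcome and the controls is a candidate predictor, an assumption made purely for illustration. The names preds_all, fits, and est are hypothetical, and the loop uses base R's paste() and lapply() in place of func_var().

```r
## outcome and controls used in every model in this post
outcome  <- 'mpg'
controls <- c('wt', 'am', 'gear')

## assumption: every remaining column is a candidate predictor
preds_all <- setdiff(names(mtcars), c(outcome, controls))
preds_all

## fit one model per candidate predictor, always with the same controls
fits <- lapply(preds_all, function(p) {
  lm(formula(paste('mpg ~', p, '+ wt + am + gear')), data = mtcars)
})
names(fits) <- preds_all

## coefficient on the candidate predictor (first term after the intercept)
est <- vapply(fits, function(m) coef(m)[[2]], numeric(1))
est
```

Note that cyl ends up in preds_all here, so it should probably be dropped before passing the vector to func_var_obs(), which nests by cyl.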

Yes, I know this is a perfect situation to use LASSO. Sometimes people (reviewers) want certain models run, and you just have to run them. ↩
There’s a very real chance that someone else is me in six months. ↩
Things get a lot more complicated if your split() call produces a list of dataframes that aren’t one row each, so make sure that’s what you’re getting before you proceed. ↩

There is as Yet Insufficient Data for a Meaningful Answer

2022-07-05T00:00:00-05:00

Since taking a job as a data scientist three months ago, I’ve spoken with multiple political science PhD students who are interested in potentially making the same transition. This post synthesizes what I’ve said in those conversations with what I’ve learned in my first three months on the job, and I hope it will be helpful to anyone in the same position I was six months ago. As I mentioned in my previous post, I’m drawing inferences from an n of one, so take anything I say with a hefty grain of salt.¹ While I’m structuring this post largely as pieces of advice, keep in mind that these were things that worked for me, and may not generalize.²

Differences from the academic job market

Some important differences between the academic and nonacademic job markets that are useful to consider at the start:

Timelines are faster than faculty searches, but they are far less consistent. One process took almost three months, while another took less than three weeks.
Not a single employer asked for letters of recommendation. One contacted references.
Who you talk to varies greatly. For some positions my first contact was an HR phone screen, for others it was a 30 minute initial interview with the hiring manager.
Performance tasks, otherwise known as coding assignments (or, more accurately, unpaid work), are common. These are just a fact of life for data science jobs. They varied from straightforward problem sets to research design memos, but not every job I interviewed for required them.
There will probably be a technical interview. As these were not software engineering jobs, most of the ones I encountered tried to assess whether you know the basics of analyzing data in your language of choice and to get some insight into your problem solving approaches.
Job talks are much less common, but not unheard of. Only two of the positions I interviewed for required a technical presentation, and unlike in academia, there is absolutely zero stigma against presenting coauthored work.
Based on some very informal reckoning, automated HR rejection emails seem to be about as common for nonacademic jobs as academic ones.³ When they do come, these emails are much faster than in academia: days or weeks instead of months.
Get ready for a new world of terminology and titles. In the same way that the assistant $\rightarrow$ associate $\rightarrow$ full professor progression baffles many outside of academia, I felt very lost upon encountering ads for senior, principal, and lead data scientists, and especially so when I applied to one for a data science technical adviser.
Similarly, get ready to navigate the variety of different jobs that can fall under the umbrella of data scientist. Does a job list SQL, Tableau, and Excel as the most important technical skills? That’s probably more of a data analyst position. TensorFlow, Dask, and C++? That’s likely more of a machine learning engineer job. If you’re anything like me, you want to aim for the middle ground between these two.

The nonacademic résumé

Probably the biggest transition when starting to apply for data science jobs was the shift from an academic CV to a nonacademic résumé. A CV lists functionally every major accomplishment you’ve achieved in your time in the field, while a résumé is highly targeted for a specific position. When applying to academic jobs, I wrote a (semi) customized cover letter for every job, and then included the relevant version of my CV (conflict, methods, or teaching). Each of these CVs contained the same information, just in a different order. In contrast, I significantly edited the skills section of most résumés I sent out based on the job listing. The WashU career center has a fantastic handout on differences between the two documents and how to adapt a CV into a résumé that I drew on heavily in this process.

In my opinion, the conventional wisdom that a résumé can only ever be one page is an overcorrection from the never-ending academic CV. The résumé I used to apply for jobs was two pages: the first included work experience, education, and a list of technical skills, while the second was project-oriented, and covered two publications, a couple of blog posts, a Shiny dashboard, and teaching materials for the grad stats lab I taught. You definitely want to include links here, not just to the final product, but also the code behind it where relevant (replication materials for publications, git repos for smaller projects). This is an excellent opportunity to showcase work that uses data science skills to show something interesting, but wouldn’t be considered novel enough for publication in an academic journal. Here are some other points that may be helpful when writing a résumé:

No one is likely to care that you wrote an undergraduate thesis or received a masters in passing (I did both, neither is on my résumé). An important exception to the latter point applies if you will be leaving your program without finishing your PhD; definitely list an in-passing masters in this case. Similarly, if you received a masters in a separate (more technical) program during your PhD, e.g., statistics or data science, be sure to list it as well.
Social sciences can be a bit out of left field for data science hiring managers, so my résumé did include a “Concentrations: quantitative methodology and international relations” sub-bullet under my PhD in my education section.
Paid research assistant jobs you had in grad school absolutely count as work experience and should be listed separately from your research and teaching if relevant to the types of jobs you’re applying for. I listed my jobs ensuring the reproducibility of quantitative results for academic journals and supporting users of university high performance computing resources as there’s a very short line between both of those job descriptions and many common data science tasks.
If a job ad lists a skill and you have that skill, put it on your résumé, even if it’s not one of your strongest skills. Your résumé will almost certainly be fed through an applicant tracking system, and the more matches the system finds, the higher the chance your résumé will end up in front of human eyes.
I would take this a step further and do this in your cover letter as well. Does a job ad list a “solid understanding of relevant theories in machine learning, statistics, and probability theory” in the requirements? Then you’d better be prepared to talk about how you apply machine learning, statistics, and probability in your work. Does this feel a little like undergraduates trying to avoid plagiarism detection software by changing a few words here and there? Yes, but it’s how hiring happens these days.

Things to do

Below is a list of non-résumé-related things I did to prepare for and during my nonacademic job search that I found helpful:

As someone who (hubristically) deleted theirs the second year of grad school, it pains me to say that the most important thing you can do here is get yourself a LinkedIn. Get it looking as professional as your academic website. The first thing is to set the headline directly below your name to the type of job you’re looking for. Want to be a research manager? List yourself as one and then talk about all the research assistants you coordinated. You’ll have to do some reframing and shortening, but you can largely transfer over content from your academic website. I added publications and blog posts to the publications and projects section at the bottom of my profile, and I also added them as media items under my postdoc and PhD experiences where appropriate. Add a link or two with high quality preview images to the featured section at the top of your page.
If you’re applying for jobs now and you’ve taught a quantitative methods course at any point, get ready to talk about this. Every single interview asked me about a time where I had to explain a technical concept or project to a nontechnical audience, and teaching quantitative methods is nothing but that, multiple times a week, for an entire semester. Teaching statistics and programming is hard, so you’ll also have lots of anecdotes ready when the interviewer asks a followup question about a time where you had to change your approach midway through a project. If you haven’t taught quantitative methods yet and you’re not already applying for jobs, do so if at all possible.
Use your resources. I was fortunate enough to do my postdoc at an institution with an excellent career center that had multiple staff members with experience helping PhD students and postdocs get nonacademic jobs. However, even if your career center is less prepared to help you get a nonacademic job, lots of career centers have publicly available online resources that can be very helpful.
Use your networks. I talked with many people who work in data science and do not have degrees in computer science or statistics. This included two people from my undergraduate institution (one PhD in psychology, one in physics), multiple political science PhDs I met through Twitter and LinkedIn, and people who did data science masters and nonacademic data science bootcamps. Their experience and advice were invaluable for me in my job search process.
Research salaries in the field you’re applying to. You can get a broad sense of this through sites like Glassdoor, but ask the people I mentioned above about their starting salaries as well. They likely came from a similar background to you, and this information can be very useful when negotiating salary. You don’t want to undersell yourself when an interviewer asks you your salary range.

Software skills

Social science PhD programs are good at teaching research design, formal modeling, and statistical methodology. They spend far less time on what I’ll call more supporting technical skills. Here are some suggestions in this domain based on my observations so far:

Don’t try to learn everything there is to know about a cloud computing architecture. There are too many, and every company’s implementation is subtly different. At my job, we use AWS, GCP, and Azure for various tasks, so learning one inside and out won’t give you a huge advantage when applying. If you can generate SSH keys and copy them to a remote host, you’re most of the way there.
Learn some SQL, but don’t worry about learning how to administer a database. If you can write queries that join multiple tables together and summarize by multiple groups, you’re probably good. If you know the standard libraries for connecting to and querying a SQL database in Python and/or R, that’s great. Again, depending on the individual database solution your job uses, you may have to use a very specific package to access it from your data science language. Mode has a free tutorial with an interactive interface that lets you write and run SQL queries in your browser that I found very helpful.
Get some experience with shell scripting. I was first exposed to shell scripts because you had to write one to submit jobs on our university cluster in grad school. Data science often involves many moving parts, and being able to use some shell scripting to glue them all together can be incredibly useful. Software Carpentry has a pretty solid introductory lesson.
I use git daily. While I rarely used git to manage collaboration with coauthors in academia, I used it to version control all of my solo-authored projects, and that provided a solid-enough background for my current level of usage.
Automation is another important skill in the data science toolbox. Sometimes you’ll have fancy GUI-based tools to set things up to run automatically, but other times it’s faster and simpler to use a cron job. I taught myself the basics of cron to keep the stats in my post visualizing police militarization via the transfer of surplus armored vehicles to police departments automatically updated.

So far this post has mainly been oriented around a list of discrete things you can do to (potentially) improve your odds of securing a data science job as someone with a social science PhD. This last section reflects a perspective I developed throughout my job search process as I participated in more and more interviews, and, I hope, will serve as a source of motivation for anyone pursuing a similar career transition.

The vast majority of quantitative social science PhDs (myself very much included) are never going to be machine learning engineers who run neural networks all day long. Instead, we’re going to be working with those engineers, running our own analyses (which might include some deep learning models, but plenty of other types of models as well), and also working with less-technical stakeholders.

Based on conversations with other data scientists and my experiences as a data scientist thus far, a large part of a data scientist’s job is communicating the value of the work you and your more-technical team members have done to people with less technical training. Even if they have a strong background in statistics or research design more generally, they’re still likely to be less familiar with your specific area of expertise. Communicating effectively in this situation requires distilling large amounts of information, drawing conclusions based on data, and then summarizing what you did, why you did it, and what you learned from doing it. To me, that sounds exactly like what social science PhD programs train their students to do.⁴

Three months is also far too short a time to reach a definitive conclusion on this topic. ↩
Talking to other data scientists with similar backgrounds, which I discuss below, was useful because it gave me information and context that I was able to draw on when negotiating salary. However, an extensive body of research finds that women are penalized for negotiating where men are rewarded for it. This is just one reminder of the fact that something that I found helpful may be less useful for you. ↩
This was a pleasant surprise for me, as I still have vivid memories of sending résumé after résumé out into the void as a fresh poli sci BA in 2012 and almost never hearing back. ↩
To make this even more concrete: being able to communicate effectively with software engineers means that they help make your models more efficient with less work from you; being able to communicate with stakeholders means that you are more likely to get recognition for the work you did. ↩

So it goes

2022-03-31T00:00:00-05:00

When I was applying to graduate school and asking for letters of recommendation from my undergrad professors, one of them told me to give academia three years, and that if I hadn’t found a permanent position by then, to find another career. It’s been three years, and next week I start a new job as a data scientist. I read a fair bit of quit lit in my first couple years of grad school and always told myself that if I went that same route, I would never pen any of my own…

Two things have changed since then. One: an already precarious academic job market that never recovered from the global financial crisis has imploded even further. Two: opportunities for people with the set of skills you pick up in a quantitative social science PhD program have exploded. Quit lit is often deeply personal and centered around the path one took to deciding to leave academia; see this piece for links to several prominent examples.¹ This is not that kind of quit lit, because that’s not where my communication skills are strongest. Instead, I’m writing this post to illustrate the contrast between my academic and nonacademic job search processes in the hopes that it may be a useful data point for current grad students, postdocs, adjuncts, and maybe even some early-career faculty.² When reading this post, bear in mind that I am presenting data from an n of one, and my experiences may not generalize outside of quantitative social science, or even very far within it.

I had an enormous amount of support in this process from both my institutions and my networks; in no way could I have gotten a data science job as easily on my own. I talk more about the help I received in this post.

Let’s get straight to the numbers. Out of 142 jobs I applied to, I received two job offers. That’s a 98.6% rejection rate.³ Visualizing this (with apologies to Andrew Heiss) looks like so.

Five jobs expressed interest in me beyond my initial application, which translates to a 3.5% response rate. The ‘Nothing’ category encompasses both jobs that sent me an automated HR rejection email (often several months after their chosen candidate had accepted the offer) as well as ones that never got back to me. Many searches for faculty positions will conduct Zoom/Skype/Teams interviews with their long short list of candidates before inviting the short list to an on-campus visit, colloquially termed a flyout, but some may skip straight to the on-campus visit. Some postdoc positions conduct virtual interviews, while others simply make an offer to their preferred candidate. I used a rough ranking of potential outcomes as Offer > Flyout ≥ Interview > Rejection in constructing this plot, with each dot representing the final outcome for that application.

I applied to a wide range of permanent (tenure-track and teaching-track) faculty positions, as well as a number of temporary (postdoc, visiting assistant professor, lecturer) positions. Splitting my applications along this dimension shows that I had noticeably more success in my applications for temporary positions (10.3% response rate) than permanent ones (1.8% response rate).

Since my non-nothing outcomes are so few, I can easily list them in more detail:

Two postdoc offers
A postdoc I interviewed for and declined in favor of another postdoc offer
A teaching-track flyout I declined in favor of a data science offer
A tenure-track interview I declined in favor of the same offer

If we break down the jobs I applied for by academic subfield, some unsurprising patterns emerge. Data science jobs include those listed as computational social science, jobs listed for a substantive subfield and methods are coded under the substantive area, and international relations, conflict, peace studies, security studies, and international political economy are all represented in the IR category.

The majority of jobs I applied to (92) were advertised as international relations. While much of my research sits at the intersection of international relations and comparative politics, very few of the jobs I applied to do. I didn’t track how frequent these jobs are, so it could just be a case of few jobs to apply to. Data science (24) handily outnumbers the more traditional subfield of methods (14), reflecting increasing interest in the former by the discipline.

The map below geographically visualizes the jobs I applied to. Each circle represents one institution, with the size of the circle denoting how many positions I applied for. I applied to five positions at UCSD, the most of any institution.

I focused primarily on the Eastern US and California. I applied to jobs in 31 states and the District of Columbia, meaning there were 19 states I did not apply to any jobs in. Looking at my applications over time helps tell the story of my academic job search process.

The 2018-19 academic job market season was my final year of grad school. I wanted to be done, so I applied to a wide variety of jobs. The postdoc I received an offer from was actually the last position I applied for in this cycle. I was a little more selective in the 2019-20 job market season because I had an excellent postdoc, with a high chance of a second year of funding. I started a new postdoc in 2020 and knew that I had a second year of funding guaranteed. COVID-19 absolutely devastated the job market that cycle as well. With a second year of funding secure and precious few institutions hiring, I decided to spend my time focusing on improving my CV and applied to a total of four jobs that cycle, all tenure-track. The market somewhat recovered in the 2021-22 cycle, but there were still far fewer jobs than in my first two cycles. I applied to 19 jobs this cycle, all of them permanent. There were some great postdocs this cycle, but three years as a postdoc had been enough for me.

Two jobs did show interest in me this last cycle, but it was too little, too late. I had an offer for a data science job when I received an on-campus interview for one job, and had already accepted the offer when I received a Zoom interview for another. Given the typical pace of an academic search, it was possible that even if I were successful in getting an offer for either of these positions, it wouldn’t be for another month or two. My postdoc ended in June, and an offer in hand doing interesting research was an easy sell compared to that uncertainty.

Across all 142 applications, I ended up submitting 399 letters of recommendation to search committees. I was very fortunate that UNC has a department administrator handle letters for grad students as a sort of discount (read: free) Interfolio Dossier service. They generously provide this service to graduates of the department until they secure their first permanent job, even after they have left. I spent so long on the academic job market that I had no less than three different people help me with this process. I am incredibly grateful for their efforts and want to highlight the support they gave me.

I haven’t done as good of a job tracking my applications to nonacademic jobs because the process is much less structured and standardized. Some applications require a cover letter, so I can count up all the cover letters saved in my job search folder: 25. You can also apply for many jobs with just a résumé. Let’s say I applied to 10 of those, which makes 35 applications total.

I started the interview process with seven of these employers. Acknowledging some uncertainty in the denominator, that’s a 20% response rate, nearly six times higher than my academic response rate of 3.5%. I completed the interview process with four of these employers, receiving one rejection and three offers (I withdrew from the other three interview processes after accepting one of those offers). A 75% interview success rate is pretty incredible compared to my experience on the academic job market. That’s an overall success rate of 8.6%, which is more than six times higher than my overall success rate for academic jobs.

Or is it a 50% success rate? I actually interviewed for two different positions with two of these employers, so you could also slice the data less favorably and say I received offers for three out of six positions I interviewed for. That’s still an overall success rate of 8.1%, which is pretty damn good in my eyes. I also want to highlight some of the experiences I had on the nonacademic job market that I can’t imagine ever happening on the academic one:

Recruiters reached out to me to ask me to apply to positions
One employer alerted me to another position they were hiring for and connected me with the hiring manager for it
Another informed me that I was actually overqualified for the position I applied for and considered me for a more senior position instead
I had my first job offer almost exactly three months after I started my nonacademic job search in earnest
I received three job offers in three days that week

Others have written extensively about why you shouldn’t view a nonacademic job as a backup option or a failure, but sometimes it’s just nice to know that people want to pay you. If you’re striking out on the academic job market, there are plenty of other options out there. So it goes.

People have criticized the term quit lit for focusing on the individual and ignoring the systemic forces that contribute to many people’s decision to leave academia. I am very persuaded by this argument, but no one has yet coined a similarly catchy and succinct alternative. ↩
I’m using the term nonacademic instead of industry, which is usually presented as the alternative to academic jobs for people with a PhD, because I applied for jobs in both the private and public sectors. ↩
I considered Kilgore Trout’s intended epitaph from Breakfast of Champions as a title for this post, but decided it was both too obscure and too bleak: he tried. ↩

Regular expressions for replication

2021-07-01T00:00:00-05:00

As part of the publication process for my recent article on how states preempt separatist conflict, I needed to submit replication materials to the journal. I took my graduate quantitative methods sequence with the late Tom Carsey, so I’ve long been a proponent of replicability efforts in social science. I also had an hourly job in grad school replicating quantitative results for multiple political science journals, so I’m very familiar with best practices for replication. Unfortunately, in the four years since I wrote the first line of code for this project, somewhere in between defending my dissertation and starting a new job (ok, fine, almost immediately after writing that first line of code), I got a little lazy.

Sometimes it’s faster (easier) to just write code that works for you, on your system, without any consideration for some poor researcher who may try to replicate your results in the future.¹ This tendency was especially bad for this project because at various points in time I was writing code to run on my personal laptop and two different high performance computing clusters. This is a recipe for code that doesn’t travel well and will almost certainly fail to replicate.

There were a lot of changes I made to my code to ensure my results replicate, but the most tedious (and time-consuming, by far) was cleaning up my file paths. Due to the computationally intensive GIS work and Bayesian statistics involved in the project, I ran lots of code on a cluster, and then pulled the results back to my laptop to summarize and create figures. This unsurprisingly resulted in a huge mess when looking at the project as a whole, rather than any individual script. Luckily, R and RStudio made things (relatively) painless to fix.

File paths

Anytime you load a dataset into R, you need to specify the path to that file. The same’s true when you save R output to a file. This article started as a chapter of my dissertation, so all of the code originally lived in the Dissertation folder on my laptop. However, as I started adapting it to an article length manuscript, I created a new Conflict Preemption folder in my Projects folder. By the time the article was accepted, I had two main folders I needed to combine:

/Users/Rob/Dropbox/UNC/Dissertation/Onset
/Users/Rob/Dropbox/WashU/Projects/Conflict Preemption

Both of these folders live in my Dropbox, but that’s about where the similarities end. I wrote most of the code for running models while still at UNC, so when I added new scripts to run models to respond to reviewer comments, I still stuck them in the UNC folder. That also means that all of the output of these models ended up in the UNC folder when it got transferred from the cluster. However, when I needed to do something simpler like create a time series plot of the number of separatist groups in existence, I wrote that code in the WashU folder. I also had a script in the WashU folder to load all of the results and generate plots from them. Because this script and the data it needed to load were in completely different directories, this is what I had to do to load the data to create one of the main figures:

load('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Figure Data/marg_eff_pop_df_cy.RData')

Not particularly likely to work on anyone else’s computer. To fix this, I needed to move all of the data to the Conflict Preemption folder, which was easy, and then rewrite all of the code that referenced file locations, which was less easy.

Here

As a first step, I needed to chop off /Users/Rob/Dropbox/UNC/Dissertation/Onset/ from the start of every file path. All the files for the article, including both the R scripts and the various data files, now live in /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption, but all of the file paths in the scripts still start with /Users/Rob/Dropbox/UNC/Dissertation/Onset, because that’s where all the files were before. You can do this just using the standard find and replace functionality built into RStudio. However, there’s no guarantee that someone in the future will correctly set R’s working directory before running the code. I used the here R package to ensure that R can always find everything it needs for my code. All you have to do is wrap file paths in the here() function in the package, and they’ll be automatically completed with the full file path, letting R find your files.²

You need to use the relative path to each file, so for a file with an absolute path of /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption/Figure Data/marg_eff_pop_df_cy.RData, the relative path (relative to the project folder of /Users/Rob/Dropbox/WashU/Projects/Conflict Preemption) would be Figure Data/marg_eff_pop_df_cy.RData. The final bit of R code looks like this:

load(here('Figure Data/marg_eff_pop_df_cy.RData'))
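To make the mechanics a little more concrete, here is a rough base R sketch of the idea behind here(): walk up from a starting directory until you find an .Rproj file, then build every path from that folder. The find_project_root() helper and the temporary project layout are made up for illustration, and the real package checks for more than just .Rproj files, so treat this as an approximation rather than how here() is actually implemented.

```r
# Hypothetical helper (not part of the here package): walk up from `start`
# until a directory containing an .Rproj file is found
find_project_root <- function(start = getwd()) {
  dir <- normalizePath(start, winslash = '/')
  repeat {
    if (length(list.files(dir, pattern = '\\.Rproj$')) > 0) return(dir)
    parent <- dirname(dir)
    if (parent == dir) stop('no .Rproj file found above ', start)
    dir <- parent
  }
}

# throwaway project layout in a temporary directory (names are illustrative)
proj <- file.path(tempdir(), 'Conflict Preemption')
dir.create(file.path(proj, 'Figure Data'), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(proj, 'Conflict Preemption.Rproj')))

# even when starting from a subdirectory, the project root is found
root <- find_project_root(file.path(proj, 'Figure Data'))

# relative path + project root = full path, wherever the project lives
full_path <- file.path(root, 'Figure Data', 'marg_eff_pop_df_cy.RData')
full_path
```

The point is that the only thing written into the script is the relative path; the part of the path that differs between computers gets worked out at run time.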

The addition of that here() in between load() and the file path means that things are no longer as simple as finding and replacing the start of the file path.

Regular expressions

Luckily, I was able to take advantage of RStudio’s built-in support for regular expressions to save myself from having to manually change each line of code that either loaded or saved a file. Regular expressions are a powerful way to search through and manipulate text. You can activate them in RStudio’s find and replace dialog by checking the Regex box:

Once you’ve done that, certain characters in your search will no longer be interpreted literally. The most important difference is probably ., which is a stand-in for any character.³ This is similar to how * is a wildcard in the Unix shell, e.g., you can use ls *.R to list all R script files in a folder. The main regular expression feature I used is the capturing group, which allows you to identify and extract a subset of a line of text. You designate a capturing group by surrounding the desired text with parentheses. To fix all of the code loading RData files from the Figure Data folder, my regular expression looked like this:

'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(Figure Data/.*\.RData)'

It starts with /Users/Rob/Dropbox/UNC/Dissertation/Onset/, which is the part I want to get rid of. Next, (Figure Data/.*\.RData) tells the regular expression to look for Figure Data/, then any character (.) repeated an unlimited number of times (*), followed by .RData. Because . is a special character in regular expressions, we have to escape it with a backslash (\) to match a literal period. This will match any file name ending in .RData in the Figure Data folder. If we left out the leading /Users/Rob/Dropbox/UNC/Dissertation/Onset/, we’d still end up with the capturing group we want, but since /Users/Rob/Dropbox/UNC/Dissertation/Onset/ wouldn’t be part of the search string, it wouldn’t end up getting replaced. This is the same reason we need to include the opening and closing quotation marks; if we didn’t, we’d end up with a here() command inside quotation marks, which R would just treat as a string and not a command.

At this point I had the core of the line that I wanted to keep, but now I needed to extract it and place it inside of a call to here(). You accomplish this goal using a backreference to the capturing group. To reference the first capturing group, you use either \1 or $1 depending on which version of regular expressions you are using. This is often very difficult to figure out, and is one of the most annoying things about regular expressions. You’ll often just have to experiment and find out which one to use through trial and error. Luckily RStudio accepts either version!

To replace the absolute path with a relative one wrapped in a here() call, this is what I typed into the Replace field in the find and replace dialog:

here('$1')

and it resulted in this:

here('Figure Data/marg_eff_pop_df_cy.RData')

Thanks to the power of capture groups, you can just hit the replace all button and instantly transform every file path into a much more portable and replication-friendly one.
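If you’d rather check the pattern before unleashing replace all on a whole project, the same substitution can be tried out in R itself. This is only a sketch using base R’s sub(); the old_line string is a stand-in for one line of a script. Note that inside an R string every backslash has to be doubled, and that sub() wants the backreference written as \\1 rather than $1.

```r
# one line of an old script, stored as a string
old_line <- "load('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Figure Data/marg_eff_pop_df_cy.RData')"

# same pattern as in the find field, with backslashes doubled for R strings
pattern <- "'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(Figure Data/.*\\.RData)'"

# \\1 is the backreference to the first capturing group
new_line <- sub(pattern, "here('\\1')", old_line)
new_line
```

Running this on a handful of representative lines is a cheap way to confirm that the capturing group grabs exactly the relative path and nothing else.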

A little bit faster now

If you’re feeling really confident that you moved every file correctly, you can replace all file paths with the following regular expression:

'/Users/Rob/Dropbox/UNC/Dissertation/Onset/(.*\..*)'

This will get any files with file extensions (the \. matches a literal period, and the .* after it matches whatever follows; strictly speaking .* also matches nothing at all, so use .+ there if you want to require at least one character after the period), as well as any preceding subdirectories (the initial .*) and stick them all into the resulting here() call. As an example, this will successfully turn this:

groups <- readRDS('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Input Data/groups_nightlights.RDS')

into this:

groups <- readRDS(here::here('Input Data/groups_nightlights.RDS'))
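Here is a small sketch of how the catch-all pattern behaves on a few made-up lines, again using base R rather than the RStudio dialog. The loose pattern is the one from above; the strict variant, which swaps the final .* for .+, is my own addition for comparison. The notes. file and the setwd() line are invented test cases, and the here::here('\\1') replacement string is an assumption chosen to match the output shown above.

```r
prefix <- "'/Users/Rob/Dropbox/UNC/Dissertation/Onset/"
loose  <- paste0(prefix, "(.*\\..*)'")  # pattern from the post
strict <- paste0(prefix, "(.*\\..+)'")  # requires something after the period

script_lines <- c(
  "groups <- readRDS('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Input Data/groups_nightlights.RDS')",
  "x <- readLines('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Input Data/notes.')",
  "setwd('/Users/Rob/Dropbox/UNC/Dissertation/Onset/Input Data')"
)

# which lines would each pattern touch?
loose_hits  <- grepl(loose, script_lines)
strict_hits <- grepl(strict, script_lines)

# apply the replacement to the first line
fixed <- sub(loose, "here::here('\\1')", script_lines[1])
fixed
```

Both patterns leave the setwd() line alone because there is no period after the prefix, which is exactly the kind of line you’d want to find and fix by hand anyway.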

I’m using ‘replication’ here to mean that the code used to generate quantitative results from a dataset should produce those same results when run by another researcher, not in the sense that independent researchers following the published protocol can collect the data themselves and arrive at the same conclusion. I use the term ‘reproducible’ to describe that second property. Annoyingly, different fields use opposing definitions of these two terms. ↩
Specifically, here() will key into the .Rproj file included in my replication materials and use that to properly locate everything else. ↩
Except for newlines, carriage returns, and other end of line special characters. ↩

Faceted maps in R

2021-05-19T00:00:00-05:00

I recently needed to create a choropleth of a few different countries for a project on targeting of UN peacekeepers by non-state armed actors I’m working on. A choropleth is a type of thematic map where data are aggregated up from smaller areas (or discrete points) to larger ones and then visualized using different colors to represent different numeric values.

See this simple example, which displays the area of each county in North Carolina, from the sf package documentation.¹ First, we need to load sf and then get the built-in nc dataset:

library(sf)
nc <- st_read(system.file('shape/nc.shp', package = 'sf'))
plot(nc[1])

Since I needed to generate choropleths for multiple countries, I decided to use ggplot2’s powerful faceting functionality. Unfortunately, as I discuss below, ggplot2 and sf don’t work together perfectly in ways that become more apparent (and problematic) the more complex your plots get. I moved away from faceting, and just glued together a bunch of separate plots, but then I had to figure out how to end up with a shared legend for five separate plots. Read on to see how I solved both of these issues.

The data

I already loaded sf to make the plot of North Carolina above, so now let’s load the remaining packages we’ll use:

library(tidyverse) # data manipulation and plotting
library(tmap)      # spatial plots
library(cowplot)   # combine plots
library(RWmisc)    # clean plot theme

I’m working with cleaned and subsetted versions of ACLED and GADM, which I’ve uploaded to my website as PKO.Rdata if you want to download them and run this code yourself. The acled object contains a list of attacks on peacekeepers in active Chapter VII UN peacekeeping missions in sub-Saharan Africa, while the adm object contains all of the second-order administrative districts (ADM2) in the five countries with active missions.

## load data
load(url('https://jayrobwilliams.com/data/PKO.Rdata'))

## inspect
head(acled)
head(adm)

## Simple feature collection with 6 features and 30 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -3.6102 ymin: 0.4966 xmax: 29.4654 ymax: 19.4695
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 31
##   data_id   iso event_id_cnty event_id_no_cnty event_date  year time_precision
##                                           
## 1 6713346   140 CEN47283                 47283 2019-12-27  2019              1
## 2 6689432   180 DRC16211                 16211 2019-12-08  2019              1
## 3 7578005   180 DRC16182                 16182 2019-12-04  2019              1
## 4 7191069   466 MLI3253                   3253 2019-10-21  2019              1
## 5 6759702   466 MLI3225                   3225 2019-10-06  2019              1
## 6 6023339   466 MLI3224                   3224 2019-10-06  2019              1
## # … with 24 more variables: event_type , sub_event_type ,
## #   actor1 , assoc_actor_1 , inter1 , actor2 ,
## #   assoc_actor_2 , inter2 , interaction , region ,
## #   country , admin1 , admin2 , admin3 , location ,
## #   geo_precision , source , source_scale , notes ,
## #   fatalities , timestamp , iso3 , month ,
## #   geometry 

## Simple feature collection with 6 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 18.54607 ymin: 4.221635 xmax: 22.395 ymax: 9.774724
## Geodetic CRS:  WGS 84
## # A tibble: 6 x 20
##   GID_0 NAME_0   GID_1  NAME_1 NL_NAME_1 GID_2 NAME_2 VARNAME_2 NL_NAME_2 TYPE_2
##                               
## 1 CAF   Central… CAF.1… Bamin…       CAF.… Bamin…             Sous-…
## 2 CAF   Central… CAF.1… Bamin…       CAF.… Ndélé              Sous-…
## 3 CAF   Central… CAF.2… Bangui       CAF.… Bangui             Sous-…
## 4 CAF   Central… CAF.3… Basse…       CAF.… Alind…             Sous-…
## 5 CAF   Central… CAF.3… Basse…       CAF.… Kembé              Sous-…
## 6 CAF   Central… CAF.3… Basse…       CAF.… Minga…             Sous-…
## # … with 10 more variables: ENGTYPE_2 , CC_2 , HASC_2 ,
## #   ID_0 , ISO , ID_1 , ID_2 , CCN_2 , CCA_2 ,
## #   geometry 

First attempt: `ggplot2`

The first thing we need to do is associate each individual attack with the ADM2 it occurred in. We can do this with the st_join() function. This function executes a left join by default, so by using adm for the x argument and acled for the y argument, we end up with one row for every ADM2 with no attacks in it, and n rows for each ADM2 with attacks in it, where n equals the number of attacks in that ADM2. We can then use group_by() and summarize() to create a count of attacks for each ADM2 by summing the number of non-NA observations of event_id_cnty, the main ID field in ACLED. Finally, I log this count variable (using log1p() to account for the ADM2s without any attacks because ln(0) is undefined) to make the resulting plot more informative due to outliers in Northern Mali and the Eastern DRC. Putting it all together:

st_join(adm, acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  ggplot(aes(fill = attacks)) +
  geom_sf(lwd = NA) +                 # no borders
  scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
  theme_rw() +                        # clean plot
  theme(axis.text = element_blank(),  # no lat/long values
        axis.ticks = element_blank()) # no lat/long ticks

That’s a lot of wasted white space, and it can make certain countries harder to see. Let’s split it out using facet_wrap(). We simply add a facet_wrap() call to our ggplot2 code, and tell it to split by our country name variable, NAME_0:

adm %>% 
  st_join(acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  ggplot(aes(fill = attacks)) +
  geom_sf(lwd = NA) +
  scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
  facet_wrap(~ NAME_0) +
  theme_rw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())

We’ve got facets, but everything is still clearly on the same scale. Let’s set scales = 'free' in our call to facet_wrap() to try and fix that.

st_join(adm, acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  ggplot(aes(fill = attacks)) +
  geom_sf(lwd = NA) +
  scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
  facet_wrap(~ NAME_0, scales = 'free') +
  theme_rw() +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank())

## Error: coord_sf doesn't support free scales

And we get an error. It turns out that the ggplot2 codebase assumes that it can manipulate axes independently of one another. This is very much not the case with geographic data where a meter vertically needs to equal a meter horizontally, so coord_sf() locks the axes in much the same manner as coord_fixed().² To try and get around the limitations from ggplot2’s non-spatial origins, I turned to a package written from the ground up for plotting spatial data.

Second attempt: `tmap`

My googling led me to this Stack Overflow answer extolling the virtue of the tmap package.³ tmap is a package for drawing thematic maps from sf objects using a syntax very similar to ggplot2. We can reuse the same data wrangling code as before and pipe it into our plotting function, which this time is tm_shape(). We then add a call to tm_polygons() to get our colored features and tm_facets() to split them apart. Note that unlike ggplot2, we need to quote the names of variables in tmap functions:

st_join(adm, acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  tm_shape() +
  tm_polygons('attacks', title = 'PKO targeting\nevents (logged)') +
  tm_facets('NAME_0')

Much better so far! However, notice that tmap defaults to assuming that our attacks variable is discrete. We’ll need to tell it that it’s continuous. And what if we moved that legend down to the bottom right to get rid of the wasted space currently there?

adm %>% 
  st_join(acled) %>% 
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  tm_shape() +
  tm_polygons(col = 'attacks',
              title = 'PKO targeting\nevents (logged)',
              style = 'cont') +                  # continuous variable
  tm_facets('NAME_0') +
  tm_layout(legend.outside.position =  "bottom", # legend outside below
            legend.position = c(.8, 1.1))        # manually position legend

This is…fine. You’ll notice that there’s a lot of white space at the bottom of the plot, which I still haven’t figured out how to eliminate, and I personally prefer the color palette options available in ggplot2. Finally, there’s not much control over the legend compared to what you get with ggplot2, so let’s head back there and try to come at this problem from a different direction.

Third attempt: `cowplot`

While we’re still using ggplot2 to make individual plots, we need some way to combine them into a final plot. We can rely on the plot_grid() function in the cowplot library for that.⁴ We need to create five subplots, which we could do manually, but let’s do it programmatically because at some point you may need to do this for 27 different countries. The best way to store our five subplots is in a list, because lists in R can contain any type of R objects as their elements.⁵ I’m going to use the map() function from the purrr package to accomplish this, but you could also use lapply(). map() takes a list or vector as its first argument, .x, and a function as its second, .f. To see how map() works, look at the following example:

map(1:3, sample)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1 2
## 
## [[3]]
## [1] 2 3 1

map() returns a list of length 3 because our input .x was a vector of length three, and it applies the function .f to each element of .x. I’m going to use an anonymous function to filter adm to only contain ADM2s from one country at a time, then create our subplots separately like we did together above:

pko_countries <- c('Central African Republic', 'Democratic Republic of the Congo',
                   'Mali', 'South Sudan', 'Sudan')

## create maps in separate plots (each gets its own scale for now)
maps <- map(.x = pko_countries, 
            .f = function(x) adm %>% 
              filter(NAME_0 == x) %>% 
              st_join(acled) %>% 
              group_by(NAME_0, NAME_1, NAME_2) %>% 
              summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
              ggplot(aes(fill = attacks)) +
              geom_sf(lwd = NA) +
              scale_fill_continuous(name = 'PKO targeting\nevents (logged)') +
              theme_rw() +
              theme(axis.text = element_blank(),
                    axis.ticks = element_blank()))
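For anyone who prefers to stay in base R, here is a minimal sketch of the same pattern with lapply() and an anonymous function. It uses the built-in mtcars data rather than the peacekeeping data, and the cyl_values and subsets names are just for illustration: each pass through the function filters the data frame down to one group and returns the result as one element of a list.

```r
cyl_values <- c(4, 6, 8)

# one filtered data frame per value, collected in a list
subsets <- lapply(cyl_values, function(x) mtcars[mtcars$cyl == x, ])
names(subsets) <- paste0('cyl_', cyl_values)

# number of rows in each element of the list
subset_sizes <- sapply(subsets, nrow)
subset_sizes
```

Swapping the filtering step for a full plotting pipeline, as in the map() call above, works the same way: whatever the anonymous function returns becomes one element of the output list.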

We can either supply each individual subplot to plot_grid() separately, or we can use the plotlist argument to pass a list of plots; good thing we saved them in a list:

## use cowplot to combine the subplots
plot_grid(plotlist = maps, labels = LETTERS[1:5], label_size = 10, nrow = 2)

I tried using the name of each country as the subplot label, but because label positioning is relative to the width of labels it was impossible to get them all nicely left-aligned. As a result, I had to settle on using letters to label the subplots and then identifying them in the figure caption in text. As you’ll see later, there’s no perfect way of accomplishing this and you’ll have to make a trade-off somewhere.

Setting aside that compromise, there’s still one issue with this plot that we can fix. We’re measuring the same thing (attacks on UN peacekeeping personnel) in all five choropleths, so there’s no need for five separate scales.

Shared legend

The cowplot documentation demonstrates how to use the get_legend() function to extract the legend from one of the subplots and then add it as another element to plot_grid(), placing it in the bottom right like we sort of managed to do with tmap. However, we need to add theme(legend.position = 'none') to the ggplot call for each subplot, otherwise we’ll just end up with six legends. That means we need to apply it to each element of our list of maps, which means it’s another job that map() is perfect for! We’ll use map() to take each subplot in maps and remove the legend from it, then use get_legend() to add a legend in the bottom right.

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
                           .f = function(x) x + theme(legend.position = 'none'))),
          get_legend(maps[[1]]),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

This doesn’t look right! We told plot_grid() to start with our maps, so why is the legend the first thing in the plot? If you look closely at the documentation for plot_grid(), you’ll see that the ... argument comes before the plotlist argument in the function definition. Even when we specify plotlist first, the function will add plotlist after ....⁶ To fix this, all we need to do is concatenate the results of get_legend() with the results of our call to map(). Note that we need to first transform the former to a list with list(), otherwise each element of it will be concatenated separately rather than as a grob object:

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps[[1]]))),
          labels = LETTERS[1:5],
          label_size = 10,
          nrow = 2)
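Both quirks are easier to see with plain lists than with plots. In this toy sketch, combine() is a made-up function that copies only the argument ordering described above, and legend is a list-like stand-in rather than a real grob, so none of this is cowplot code.

```r
# toy function with the same ordering as described for plot_grid():
# anything passed through ... ends up ahead of plotlist
combine <- function(..., plotlist = NULL) c(list(...), plotlist)

plots  <- list('map A', 'map B')
# list-like stand-in for the legend (not a real grob)
legend <- list(name = 'legend', children = list('title', 'bar'))

# the legend jumps to the front even though plotlist is named first
reordered <- combine(plotlist = plots, legend)

# without list(), c() splices the stand-in in piece by piece
n_spliced <- length(c(plots, legend))
# wrapped in list(), it stays together as a single element
n_wrapped <- length(c(plots, list(legend)))

c(n_spliced, n_wrapped)
```

That second behavior is why the call above wraps get_legend() in list() before concatenating it onto the end of the maps.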

So far so good. But if we try using a different map in our call to get_legend(), things get weird:

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps[[4]]))),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

Each subplot has its own unique legend that’s automatically generated from the values of attacks it contains. This is even worse than it might seem at first glance, because it means that the various subplots are in no way comparable to one another!

Accurate shared legend

To avoid misrepresenting the data, we need to ensure that each subplot has the same legend. The easiest way to do this is to manually set the legend for each subplot in our call to scale_fill_continuous(). Even though we’re manually setting the bounds of the legend, that doesn’t mean we have to hard code them. We can use a simpler version of our code to join attacks to ADM2s and then calculate the highest number of attacks across all countries in the data. Then we take advantage of the fact that scale_fill_continuous() can pass additional parameters to continuous_scale() via the ... argument. The continuous_scale() function is a low-level function used throughout ggplot2 to construct continuous scales, and it has a limits argument that sets the bounds of the scale. All we have to do is pass the minimum and maximum (logged) numbers of attacks in the data and we’re in business:

st_join(adm, acled) %>% 
  st_drop_geometry() %>%   # we don't need a map at the end; drop geometry to speed up
  group_by(NAME_0, NAME_1, NAME_2) %>% 
  summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
  pull(attacks) %>%        # extract attacks variable
  range() -> attacks_range # get min and max

## create maps in separate plots, force common scale between them
maps_shared <- map(.x = pko_countries, 
                   .f = function(x) adm %>% 
                     filter(NAME_0 == x) %>% 
                     st_join(acled) %>% 
                     group_by(NAME_0, NAME_1, NAME_2) %>% 
                     summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
                     ggplot(aes(fill = attacks)) +
                     geom_sf(lwd = NA) +
                     scale_fill_continuous(limits = attacks_range,
                                           name = 'PKO targeting\nevents (logged)') +
                     theme_rw() +
                     theme(axis.text = element_blank(),
                           axis.ticks = element_blank()))

Now all that’s left is to use plot_grid() to put it all together:

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps_shared,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps_shared[[1]]))),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

And unlike before, the legend is identical regardless of which subplot we use with get_legend():

## use COWplot to combine and add single legend
plot_grid(plotlist = c(map(.x = maps_shared,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps_shared[[4]]))),
          labels = LETTERS[1:5], label_size = 10, nrow = 2)

This approach is still useful even if you’re not working with spatial data. plot_grid() is powerful because it lets you make asymmetric arrangements like this example from the cowplot documentation:

p1 <- ggplot(mtcars, aes(disp, mpg)) + 
  geom_point()
p2 <- ggplot(mtcars, aes(qsec, mpg)) +
  geom_point()

plot_grid(p1, p2, labels = c('A', 'B'), rel_widths = c(1, 2))

If the units you’re faceting by contain substantially different observations, you might end up in a situation where the automatically generated legends are different from one another. Manually creating the scale of the legend and ensuring it’s the same for all plots would solve this problem here, too.
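As a quick non-spatial illustration, here is a sketch using mtcars again, treating mpg as the variable mapped to color and cyl as the unit we’re splitting plots by. These are stand-ins I picked for convenience; the logic is just the range() step from earlier.

```r
# range of the plotted variable within each group: these differ, so the
# automatically generated legends would differ too
per_group <- lapply(split(mtcars$mpg, mtcars$cyl), range)

# one range computed over all of the data, to share across every plot
shared <- range(mtcars$mpg)

per_group
shared
```

You would then hand shared to the limits argument of each plot’s scale, just like attacks_range above.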

Bonus: still to solve

Don’t let anyone convince you they know everything. I still haven’t managed to get my ideal (conditional on regular faceting with facet_wrap() being out of the question) solution to this working. I tried to create five subplots and just add a facet label to each, with each one being a facet of one panel. Straightforward enough, right?

maps_facet <- map(.x = pko_countries, 
                  .f = function(x) adm %>% 
                    filter(NAME_0 == x) %>% 
                    st_join(acled) %>% 
                    group_by(NAME_0, NAME_1, NAME_2) %>% 
                    summarize(attacks = log1p(sum(!is.na(event_id_cnty)))) %>% 
                    ggplot(aes(fill = attacks)) +
                    geom_sf(lwd = NA) +
                    scale_fill_continuous(limits = attacks_range,
                                          name = 'PKO targeting\nevents (logged)') +
                    facet_wrap(~NAME_0) +
                    theme_rw() +
                    theme(axis.text = element_blank(),
                          axis.ticks = element_blank()))

plot_grid(plotlist = c(map(.x = maps_facet,
                           .f = function(x) x + theme(legend.position = 'none')),
                       list(get_legend(maps_facet[[1]]))),
          nrow = 2)

Not so much, and no amount of tinkering with the align and axis arguments to plot_grid() has yielded any improvement. The specific paper this plot is for doesn’t have any other plots with facets, so I’m content to go with my inelegant solution of lettered labels and a key to them in the figure caption. If that weren’t the case, I might still be fiddling with this and getting deeper and deeper into the source code for plot_grid().

If you’re wondering why the largest county area is in the ballpark of 0.25, it’s because the data are in square degrees, an old non-SI unit of measurement that’s defined in terms of how much the field of view from a given point is obstructed by an object. GIS is so easy these days, folks. ↩
The more I learn about how ggplot2 and sf work under the hood, the more amazed I am that geom_sf() Just Works in 80% of cases, let alone works at all. ↩
The answer also listed the geom_spatial() function from the ggspatial package as an alternative option, but I couldn’t get it to work. The answer is three and a half years old, which means it’s very possible something changed in either sf or ggspatial that broke this solution. So it goes. ↩
It’s much more powerful and easily customizable than gridExtra::grid.arrange(). ↩
They can also contain heterogeneous elements which will come in handy later. ↩
If you check out the actual source code of plot_grid(), line 9 shows you that the function is indeed putting ... ahead of plotlist: plots <- c(list(...), plotlist). ↩

Finding Backcountry Campsites with CalTopo, OpenStreetMap, and R

2021-01-04T00:00:00-06:00

Like many people, I’ve been spending more time outdoors during this pandemic. While this means daily walks in my neighborhood, it also means getting out into the wilderness and sleeping in a tent when I can. Although outdoor recreation is one of the safer ways to entertain yourself these days, it’s not without its own concerns. The difficulty of safely getting to trailheads means that while I’m backpacking more than usual, it’s still not as often as I’d like.

That means I’m spending a decent chunk of time thinking about and planning future trips. At some point in the process of doing this, I realized that I could use the GIS skills from my day job to help make planning future trips more efficient. In this post I walk through how you can use GIS tools in R to help with some of the route planning for a multiday backpacking trip. Specifically, how you can use open source spatial data on geography and transportation infrastructure to identify potential campsites along a hiking trail.

This was largely an exercise in seeing how I could apply GIS skills I’ve learned in the study of political violence to small-scale GPS navigation. I haven’t had the opportunity to hit the trail and test out any of the assumptions I use in this process yet, so you should view this post as more of a (loose) method than concrete suggestions. For a short and simple point-to-point hike with only one route, there’s really no need to engage in this level of GIS analysis. I’ve kept things simple to make them easier to follow, but this approach could actually be useful and save some time when planning a longer trip with many potential routes.

Backcountry camping

At some point in the future, I want to hike the Uwharrie Trail in Uwharrie National Forest in central North Carolina, near where I went to grad school. As I think about this (probably far off) trip, I’ve been using CalTopo to plan my route.

If you spend any amount of time in the outdoors, you should know about CalTopo. CalTopo is a website that lets you plan routes (hiking, skiing, rafting, etc.) on top of super high resolution topographic maps. You can then turn your smartphone into a full-featured GPS and use it to follow those routes (CalTopo offers a mobile app, as does Gaia GPS, both for about $20 a year). While the Uwharrie Trail is a pretty straightforward hike, I’ve been using this as an excuse to try and apply my GIS skills in a new context.

CalTopo is great, but it’s very point and click. I like doing things programmatically when I can, so that means it’s time to grab some of the open source data that CalTopo uses so we can play around with it in R. The base map in CalTopo is called MapBuilder Topo, and uses OpenStreetMap data as its starting point, so let’s start there.

Disclaimer

This guide is intended to show how to identify potential backcountry campsites on public land where dispersed camping is permitted. If you are backpacking in an area with designated, maintained backcountry campsites, you should use them. Dispersed camping is typically permitted in less-traveled areas where the impact of campers is better minimized by diffusing it rather than concentrating it into a handful of designated sites.¹

Always check regulations for any land you plan to camp on to see if there are specific requirements for site selection or areas where camping is prohibited. Picking an actual campsite requires identifying areas where your safety will be maximized and the long-term impact of your stay will be minimized. See this guide for the basics and this series for a slightly more hardcore set of principles to follow. And remember, never go into the wilderness without telling someone where you’re going and when you should be back.

Getting the data

OpenStreetMap (OSM) is an open source map of the entire globe; think of it as a hybrid of Google Maps and Wikipedia. OSM is designed so that anyone can easily add to or edit it. Setting aside the normative value of this perspective, this is helpful for us because it means that OSM is transparent. We can use the excellent osmdata R package to query OSM via the Overpass API, and we can use OSM itself via the OSM website to learn the various parameters we’ll use to query OSM.

Trails

The getting started vignette covers much of the basics of using osmdata. The key functions are osmdata::opq(), which builds a query to the Overpass API, and osmdata::add_osm_feature(), which requests specific features. OSM classifies features using key-value pairs, and we can use the OSM website to figure out just which pairs we need. Navigate to an area of interest, right-click on the feature of interest, and then select “query features.”

Next, select the desired feature in the dialog on the left of the screen. In this case, select the “Relation” rather than the “Path” because the path will only include one segment of the trail while the relation will include its entire length.

We can see here that the Uwharrie Trail relation has route=hiking, so that’s the key-value pair we’ll have to specify in our query.

Make sure to use the bbox argument to osmdata::opq(), otherwise you’ll request every hiking trail in the world! You can manually specify the four edges of a bounding box to search in, or you can use the osmdata::getbb() function to get it automatically using the Nominatim geocoder.

library(tidyverse)
library(sf)
library(osmdata)

## get hiking routes in Uwharrie National Forest
unf_trails <- opq(bbox = getbb('uwharrie national forest usa')) %>% 
  add_osm_feature(key = 'route', value = 'hiking') %>% 
  osmdata_sf()

Notice that we use the osmdata::osmdata_sf() function to convert the resulting object for use with the sf R package. Let’s inspect the resulting object of class osmdata_sf.

## inspect
unf_trails

## Object of class 'osmdata' with:
##                  $bbox : 35.3951403,-80.0236608,35.4351403,-79.9836608
##         $overpass_call : The call submitted to the overpass API
##                  $meta : metadata including timestamp and version numbers
##            $osm_points : 'sf' Simple Features Collection with 3341 points
##             $osm_lines : 'sf' Simple Features Collection with 26 linestrings
##          $osm_polygons : 'sf' Simple Features Collection with 0 polygons
##        $osm_multilines : 'sf' Simple Features Collection with 1 multilinestrings
##     $osm_multipolygons : NULL

We can see that the unf_trails object includes points, lines, polygons, multilines, and multipolygons. We want to use the lines since that will include any short trail segments that aren’t part of a larger trail. We can easily plot the trails using this object.

## plot
plot(unf_trails$osm_lines$geometry, col = 'coral4')

Don’t get lost

Let’s do some quick sanity checks. First, Wikipedia tells us the trail should be about 20 miles. We can use the sf::st_length() function to measure the length of each trail segment, and the sf::st_union() function to combine all segments. We’ll get our answer in meters, which, as a metric-deprived American, won’t be all that helpful to me. To get around this, we can use the units::set_units() function to convert from meters to miles.

## measure total trail length
st_union(unf_trails$osm_lines$geometry) %>% # combine all segments.
  st_length() %>% # measure length
  units::set_units(mi) # convert to miles

## 28.26457 [mi]
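To make the conversion step less magical, here is a minimal sketch of what units::set_units() is doing on its own. The length below is a made-up number for illustration, not something pulled from OSM.

```r
library(units)

# an illustrative length in meters (hypothetical value, not from OSM)
trail_m <- set_units(45487, m)

# converting to miles just rescales the value and swaps the unit label
trail_mi <- set_units(trail_m, mi)
trail_mi

# the conversion is the same as dividing by 1609.344 meters per mile
as.numeric(trail_mi) - 45487 / 1609.344
```

The nice part is that sf::st_length() already returns a units object, so there is no need to remember the conversion factor.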

While that’s initially concerning, a closer reading of the Wikipedia article for the trail reveals that it was originally 40 miles long, so OSM likely includes some of the northern section of the trail beyond what’s officially recognized today.

We should also plot the bounding box that osmdata::getbb() ends up generating to ensure we’re not missing any part of the trail. We can do this with the OpenStreetMap R package (https://cran.r-project.org/package=OpenStreetMap). Here we unfortunately need to manually specify the bounding box as two vectors holding the latitude and longitude coordinates of the upper-left and lower-right corners of the box. OpenStreetMap::openmap() uses (latitude, longitude) pairs, not (longitude, latitude) pairs as is more common in GIS, i.e., (y, x) not (x, y), so be sure to include them in that order. (I spent 20 minutes not understanding why I couldn’t get this to work before I finally read the documentation. Don’t be like me, folks.) OpenStreetMap::openproj() also takes a projection argument, so I use sf::st_crs(4326)$proj4string to generate one automatically, ensuring I don’t introduce a typo somewhere by accident.

library(OpenStreetMap)

## get bounding box
unf_bb <- getbb('uwharrie national forest usa')

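## get OSM tiles; openmap() wants the upper-left corner, then the lower-right
unf_tile <- openmap(c(unf_bb[2,2],  # lat (max, top edge)
                      unf_bb[1,1]), # long (min, left edge)
                    c(unf_bb[2,1],  # lat (min, bottom edge)
                      unf_bb[1,2]), # long (max, right edge)
                    type = 'osm', mergeTiles = T)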

## project map tiles and plot (OSM comes in Mercator...)
plot(openproj(unf_tile, projection = st_crs(4326)$proj4string))

## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

Uh oh. We can see that we’re only getting a small portion of the total trail and that it trails (heh) off the map on three sides. That’s not great, so let’s fix it. We can start by looking up Uwharrie National Forest itself on the OSM website. This gives us the boundaries of the official forest land in orange.

We can see from the dialog on the left that the forest’s OSM ID is 2918413, so we can use the osmdata::opq_osm_id() function to get the polygons for the forest’s boundaries. Let’s grab the forest boundaries and plot them, along with the bounding box they imply and the bounding box that resulted from osmdata::getbb() (in red) for comparison.

## get Uwharrie National Forest Boundaries
unf <- opq_osm_id(type = 'relation', id = 2918413) %>% 
  osmdata_sf()

## plot Uwharrie National Forest polygons
plot(unf$osm_multipolygons$geometry, col = 'lightgreen', border = NA, bty = 'n')

## construct line for original bounding box
plot(st_multilinestring(list(matrix(c(unf_bb[1, 1], unf_bb[2, 1],
                                      unf_bb[1, 1], unf_bb[2, 2],
                                      unf_bb[1, 2], unf_bb[2, 2],
                                      unf_bb[1, 2], unf_bb[2, 1],
                                      unf_bb[1, 1], unf_bb[2, 1]),
                                      ncol = 2, byrow = T))),
     add = T, col = 'red')

## plot bounding box for Uwharrie National Forest polygons
plot(st_as_sfc(st_bbox(unf$osm_multipolygons)), add = T)

## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

Wow, we were missing a lot before. Let’s use the bounding box for the entire forest as our new bounding box. First, we plot OSM using this new bounding box. st_bbox() yields a vector of four numbers, rather than the matrix that osmdata::getbb() produces, so we need to work around this and specify the top-left and bottom-right corners of our new, bigger bounding box.

## get OSM tile for Uwharrie National Forest polygons
unf_full_tile <- openmap(c(st_bbox(unf$osm_multipolygons)[4],  # lat
                           st_bbox(unf$osm_multipolygons)[1]), # long
                         c(st_bbox(unf$osm_multipolygons)[2],  # lat
                           st_bbox(unf$osm_multipolygons)[3]), # long
                         type = 'osm', mergeTiles = T)

## project and plot OSM tile
plot(openproj(unf_full_tile, projection = st_crs(4326)$proj4string))

## plot trails
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

That’s much better! We’re getting a lot of area beyond the trail, but it’s easy to filter that out later so it’s better to grab too much than too little.
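Since the corner ordering tripped me up, here is a small sketch of a helper that does the bookkeeping. bbox_corners() is a hypothetical function I made up for illustration; it assumes a named vector laid out like the output of sf::st_bbox() (xmin, ymin, xmax, ymax), and the numbers are rounded toy values.

```r
# hypothetical helper: reorder an st_bbox()-style named vector into the two
# (latitude, longitude) corners that openmap() expects
bbox_corners <- function(bb) {
  list(upper_left  = unname(c(bb['ymax'], bb['xmin'])),  # top edge, left edge
       lower_right = unname(c(bb['ymin'], bb['xmax'])))  # bottom edge, right edge
}

# toy bounding box using the same names st_bbox() uses
bb <- c(xmin = -80.17, ymin = 35.22, xmax = -79.73, ymax = 35.64)

corners <- bbox_corners(bb)
corners$upper_left   # latitude first, then longitude
corners$lower_right
```

The two elements could then be passed as the first two arguments to openmap() in place of the four separate st_bbox() calls.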

The whole trail

Now we can go back and grab all hiking trails in Uwharrie National Forest using our new bounding box. osmdata::opq() expects a bounding box in a certain format, so let’s inspect it to see what we’re working with and what we need to reshape the output of sf::st_bbox(unf$osm_multipolygons) into:

## bbox format osmdata::opq() expects
unf_bb

##         min       max
## x -80.02366 -79.98366
## y  35.39514  35.43514

## rearrange sf::st_bbox() output
matrix(st_bbox(unf$osm_multipolygons), ncol = 2,
       dimnames = list(c('x', 'y'), c('min', 'max')))

##         min       max
## x -80.17085 -79.73170
## y  35.21987  35.63684

Note that I’m specifying row and column names when creating the new bounding box. Without them, osmdata::opq() will fail! We can now plug this new bounding box object into osmdata::opq() and get all hiking routes in the forest.

## get hiking trails in all of Uwharrie National Forest
unf_trails_full <- opq(bbox = matrix(st_bbox(unf$osm_multipolygons), ncol = 2,
                                     dimnames = list(c('x', 'y'), c('min', 'max')))) %>% 
  add_osm_feature(key = 'route', value = 'hiking') %>% 
  osmdata_sf()

## plot
plot(unf_trails_full$osm_lines$geometry, col = 'coral4')

Now we’re getting a bunch of trails across the Pee Dee River in Morrow Mountain State Park. Again it’s easy to drop these extra trails later, so for the moment, more complete is better than less complete. These data come from OpenStreetMap, so they also include lots of usable data. Let’s take a look at the fields included in our lines:

## inspect
glimpse(unf_trails_full$osm_lines)

## Rows: 106
## Columns: 37
## $ osm_id             "32024414", "216945232", "216945234", "216945241", …
## $ name               "Uwharrie Trail", "Mountain Loop Trail", "Mountain …
## $ alt_name           "Uwharrie National Recreation Trail", NA, NA, NA, N…
## $ bicycle            "no", "no", "no", "no", "no", "no", "no", "no", "no…
## $ bridge             NA, NA, "yes", "yes", NA, NA, "boardwalk", NA, NA, …
## $ construction       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ dog                NA, "leashed", "leashed", "leashed", "leashed", NA,…
## $ foot               "designated", "designated", "designated", "designat…
## $ footway            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ highway            "path", "path", "path", "path", "path", "path", "pa…
## $ horse              NA, "no", "no", "no", "no", "no", NA, NA, NA, NA, "…
## $ lanes              NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ layer              NA, NA, "1", "1", NA, NA, "1", NA, NA, NA, NA, NA, …
## $ motor_vehicle      NA, "no", "no", "no", "no", "no", NA, NA, NA, NA, "…
## $ name_1             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ oneway             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rcn_ref            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ sac_scale          NA, "mountain_hiking", "mountain_hiking", "mountain…
## $ service            NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "parkin…
## $ smoothness         NA, "bad", "good", "good", "bad", NA, NA, NA, NA, N…
## $ source             NA, NA, NA, NA, NA, "GPS_2009", "GPS_2009", "GPS_20…
## $ surface            "dirt", "ground", "wood", "wood", "ground", "ground…
## $ symbol             NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "wh…
## $ tiger.cfcc         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.county       NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_base    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_base_1  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.name_type    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.reviewed     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_left     NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_left_1   NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_right    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tiger.zip_right_1  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ tracktype          NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ trail_visibility   NA, "excellent", "excellent", "excellent", "excelle…
## $ wheelchair         NA, "no", "no", "no", "no", NA, NA, NA, NA, NA, "no…
## $ geometry           LINESTRING (-80.0435 35.310..., LINESTRI…

We can use the “name” field to subset the data. If you were considering some parallel or spur trails, you could use sf::st_filter() in combination with sf::st_is_within_distance() to instead just grab trails near your primary trail.

## extract OSM lines and filter
ut <- unf_trails_full$osm_lines %>% filter(name == 'Uwharrie Trail')
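For the spur-trail case, a minimal sketch of the sf::st_filter() plus sf::st_is_within_distance() combination might look like the following. The geometries and names are toy data I invented in a meter-based CRS, and the 100 m cutoff is an arbitrary assumption.

```r
library(sf)

# toy main trail in a meter-based CRS (EPSG:32119); not real OSM data
main_trail <- st_sfc(st_linestring(rbind(c(0, 0), c(1000, 0))), crs = 32119)

# two toy side trails: one starts 50 m from the main trail, one is 5 km away
others <- st_sf(
  name = c('near spur', 'far loop'),
  geometry = st_sfc(
    st_linestring(rbind(c(500, 50), c(500, 300))),
    st_linestring(rbind(c(0, 5000), c(1000, 5000))),
    crs = 32119
  )
)

# keep only the trails that come within 100 m of the main trail
nearby <- st_filter(others, main_trail,
                    .predicate = st_is_within_distance, dist = 100)
nearby$name
```

Only the spur survives the filter, because the other trail never gets within 100 m of the main line.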

Now we’ve gotten the Uwharrie Trail twice: once using a smaller bounding box and once using a larger one. We can plot them both and see if there were any segments the initial query missed.

## plot
plot(ut$geometry, col = 'red')
plot(unf_trails$osm_lines$geometry, add = T, col = 'coral4')

Luckily the initial query still picked up every segment, but that won’t always be the case if you start with an inaccurate initial bounding box. If the entire Uwharrie Trail wasn’t collected into a relation, we might have missed large chunks of it on either end. Now we can use the bounding box for the Uwharrie Trail to capture any other features we care about nearby.

Water

The first other feature we need is water. On any multi-day trip, being able to refill your water is essential. The OSM wiki page on waterways shows us that the values we need to grab relevant water sources are river and stream. Although not well-documented, you can supply multiple values to the value argument of osmdata::add_osm_feature() using c(). This will let us quickly and easily grab both rivers and streams in the area.²

## create bbox for just the Uwharrie Trail; no need for all water in the whole National Forest
ut_bb <- matrix(st_bbox(ut), ncol = 2, dimnames = list(c('x', 'y'), c('min', 'max')))

## get rivers and streams and extract OSM lines
ut_water <- opq(bbox = ut_bb) %>% 
  add_osm_feature(key = 'waterway', value = c('river', 'stream')) %>% 
  osmdata_sf()

Our next step will be to drop any water sources more than a kilometer from the trail. This will simplify our analysis later and also minimize our environmental impact. To conduct GIS operations in meters, we need to project our data from latitude and longitude-based WGS84 to a meter-based coordinate reference system (CRS). The CRS database epsg.io shows that NAD83 / North Carolina (EPSG:32119) is the projection for data in North Carolina, so we use sf::st_transform() along with sf::st_crs() to project our trail and water source objects. This lets us calculate distances in meters (or feet) rather than decimal degrees. We’ll use this to limit the water features to those that fall within 1km of the trail. This way we’re not limiting ourselves to only water features that directly intersect the trail, but we’re also not retaining a bunch of features that are farther off-trail than I like to hike for water.

## project trail
ut <- st_transform(ut, st_crs(32119))

## project water sources
ut_water <- ut_water$osm_lines %>% 
  st_transform(st_crs(32119)) %>% 
  st_filter(ut, .predicate = st_is_within_distance, dist = 1000)

## plot
plot(ut_water$geometry, col = 'lightblue')
plot(ut$geometry, add = T, col = 'coral4')
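To see what the projection step buys us, here is a small sketch with two made-up points that sit 0.01 degrees of latitude apart. The coordinates are arbitrary values I picked inside North Carolina, not features from the trail data.

```r
library(sf)

# two toy points 0.01 degrees of latitude apart, in WGS84 (EPSG:4326)
pts <- st_sfc(st_point(c(-79.95, 35.40)),
              st_point(c(-79.95, 35.41)),
              crs = 4326)
st_is_longlat(pts)     # coordinates are still in degrees

# project to NAD83 / North Carolina (EPSG:32119)
pts_nc <- st_transform(pts, st_crs(32119))
st_is_longlat(pts_nc)  # coordinates are now planar, in meters

# distance between the projected points, on the order of 1,100 meters
d <- st_distance(pts_nc[1], pts_nc[2])
d
```

Once everything is in the same meter-based CRS, arguments like dist = 1000 can be read directly as meters.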

Roads

If we want to be near water, we want to be far from roads. OpenStreetMap has lots of different categories of roads, so we’ll want to capture all the major ones, as well as service roads and “tracks”, which is how OpenStreetMap refers to forest roads.³ OSM identifies roads with the key “highway,” and inspecting the OSM wiki page on roads shows us the various values we’ll need to grab all relevant roads.

## get roads, project, and limit to w/in 1000 m of trail
ut_roads <- opq(bbox = ut_bb) %>% 
  add_osm_feature(key = 'highway',
                  value = c('primary', 'secondary', 'tertiary', 'residential',
                            'unclassified', 'track', 'service')) %>% 
  osmdata_sf() %>% 
  magrittr::extract2('osm_lines') %>% 
  st_transform(st_crs(32119)) %>% 
  st_filter(ut, .predicate = st_is_within_distance, dist = 1000)

## plot
plot(ut_roads$geometry, col = 'black')
plot(ut$geometry, add = T, col = 'coral4')

Note the use of magrittr::extract2() to extract the osm_lines object from the osmdata_sf object returned by osmdata::osmdata_sf(). This is how you can access a list element in a pipeline, and is equivalent to $osm_lines.
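If the equivalence isn’t obvious, this quick sketch shows it with a plain list standing in for the osmdata_sf object. The list contents are placeholders I made up.

```r
library(magrittr)

# a plain list standing in for the object osmdata_sf() returns
res <- list(osm_points = 1:3, osm_lines = c('a', 'b'))

# pulling out an element mid-pipeline...
from_pipe <- res %>% extract2('osm_lines')

# ...gives exactly the same thing as using $
identical(from_pipe, res$osm_lines)
```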

Campsites

To locate potential campsites we need to identify our priorities and use them to define a set of rules for selecting potential sites. For this exercise, I’m using the following:

I’d like to be within 750 feet of a water source. Some (more hardcore) backpackers prefer to be farther away from water sources to minimize the chance of encountering animals. Since Uwharrie National Forest isn’t an area with heightened bear activity, I’m willing to trade the chance of a raccoon sniffing around my bear canister for a shorter walk to refill my water.
The US Forest Service requires that you camp at least 200 feet away from any water source. This is good practice everywhere, but it’s required in National Forests, so we want to make sure any potential campsites are at least 200 feet from any water features.
The Uwharrie Trail is a fairly heavily-trafficked trail, so I’d like to avoid going more than 1/4 mile off-trail to find a campsite. This will minimize the disturbance to the surrounding area.⁴ All of the semi-official campsites on the Uwharrie Trail are a good ways off the trail itself, so staying near the trail will contain my impact on a large scale, but minimize it locally.
If you’re not in a designated campsite, you should be at least 200 feet away from any trail. Again, this seeks to minimize your impact on the area by spreading out campsites over time.
If I’m making the effort to carry my shelter, sleep system, and food on my back, you better believe I don’t want to be hearing any cars at night. To try and minimize the chances of this happening, I want to be at least 1,000 feet from any roads. The lower section of the trail skirts particularly close to a residential neighborhood, so this is an important consideration.
I’m going to drop any potential campsites smaller than 0.1 km^2. Choosing where to actually pitch your tent within a potential site area requires many considerations like drainage, wind exposure, and avoiding dead trees overhead. This means that we want to have ample space in which to find the ideal tent spot, so dropping small potential sites reduces the possibility of arriving at a spot and finding that there’s no good place for your tent.

With all of those factors in mind, we can now define our potential campsites and then narrow them down. I start by buffering the rivers and streams by 750 feet with sf::st_buffer(), which gives us every area within 750 feet of a water source. Then I move down my list of conditions, buffering the relevant feature and using sf::st_intersection() when I want to ensure I stay within a given distance of that feature and sf::st_difference() when I want to stay a given distance away from that feature.

Since NAD83 uses meters as the unit of measurement, we need to convert these distances in feet into meters. Again, the units package makes this easy with the units::set_units() function.

## buffer water 750 ft
campsites <- st_union(ut_water) %>% 
  st_buffer(dist = units::set_units(750, ft))

## buffer water 200 ft and subtract
campsites <- st_union(ut_water) %>% 
  st_buffer(dist = units::set_units(200, ft)) %>% 
  st_difference(x = campsites)

## buffer trail 1/4 mile and intersect
campsites <- st_union(ut) %>% 
  st_buffer(dist = units::set_units(.25, mi)) %>% 
  st_intersection(x = campsites)

## buffer trail 200 ft and subtract
campsites <- st_union(ut) %>% 
  st_buffer(dist = units::set_units(200, ft)) %>% 
  st_difference(x = campsites)

## buffer roads to 1000 ft and subtract
campsites <- st_union(ut_roads) %>% 
  st_buffer(dist = units::set_units(1000, ft)) %>% 
  st_difference(x = campsites)

## cast multipolygon to polygons and convert to sf
campsites <- campsites %>% 
  st_cast('POLYGON') %>% 
  st_sf() %>% 
  mutate(id = 1:n()) %>% # create ID variable
  filter(st_area(.) > units::set_units(.1, km^2)) # filter to > .1 sq km
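The st_difference(x = campsites) calls lean on a pipe trick that is easy to miss: because x is named explicitly, the piped object ends up as the second argument, y. Here is a minimal sketch of the first two steps using a toy straight-line “stream” I made up in a meter-based CRS.

```r
library(sf)
library(magrittr)

# toy "stream" in a meter-based CRS (EPSG:32119); not real OSM data
stream <- st_sfc(st_linestring(rbind(c(0, 0), c(1000, 0))), crs = 32119)

# everything within 750 ft of the stream
near <- st_buffer(stream, dist = units::set_units(750, ft))

# the piped 200 ft buffer fills `y`, so this computes near minus the 200 ft buffer
ok <- stream %>% 
  st_buffer(dist = units::set_units(200, ft)) %>% 
  st_difference(x = near)

# the band that remains is smaller than the full 750 ft buffer
st_area(ok) < st_area(near)
```

Writing it this way keeps each step reading as “buffer a feature, then apply it to the running campsites object.”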

The animation below shows each step in the process in order:

Elevation

So far we haven’t really done anything that you couldn’t do on CalTopo, albeit in a less programmatic way. Let’s change that by bringing in some elevation data. Elevation is important when hiking because it determines how many climbs your lungs will have to endure and how many descents your knees will. CalTopo has great built-in tools for generating elevation profiles and more detailed terrain statistics that can tell you what to expect along a given route. However, you can only calculate them for lines or polygons you’ve manually drawn.

While we could import the potential campsite polygons we’ve just generated into CalTopo and then calculate the terrain statistics, this has two major drawbacks. First, you have to point and click through generating the report for each polygon because there’s no way to batch process. Second, and more importantly, this would use a lot of processing power and computing time on CalTopo’s servers. If, unlike me, you have a paid subscription, you might feel less bad about this, but I’m trying not to take advantage of such an awesome service that CalTopo currently provides for free.

We can use R’s capabilities to handle raster data to solve both of these problems! The elevatr package lets you easily download elevation data in the form of a digital elevation model. These models combine multiple measurements from satellites to produce a single image of the earth where the brightness of each pixel represents the elevation of a given area. elevatr allows you to easily access elevation data compiled from a number of different data sources. The main function is elevatr::get_elev_raster(), which takes an sf object as its first argument and z, a zoom level from 1 to 14. We can also specify the clip = 'bbox' argument to crop the resulting raster to just the bounding box of our potential campsites, and not the entire tile they fall in.

library(raster)
library(elevatr)

## get elevation raster and clip to bbox
elev <- get_elev_raster(campsites, z = 13, clip = 'bbox')

## plot to inspect
plot(elev, col = grey(1:100/100))
plot(ut$geometry, add = T, col = 'coral4')

Since we can see that the highest point in the area is only about 300 meters above sea level, we don’t need to worry about absolute elevation when picking potential sites. Instead, we want to know how level these areas are; no one wants to wake up smushed against the downhill wall of their tent. We can use the raster::terrain() function to calculate the slope in each pixel.

## calculate slope
camp_slope <- terrain(elev, opt = 'slope', unit = 'degrees')

## plot slope
plot(camp_slope)
plot(ut$geometry, add = T, col = 'coral4')

All that’s left to do is aggregate slope measures to each polygon, and then calculate some sort of summary statistic to tell us how steep each potential site is overall. I’m going to use the median of each area’s slope rather than its average to avoid giving undue influence to outliers (if a .5 km² area is largely flat with a cliff at one edge, then it’s likely still a good candidate for a campsite). Let’s filter out all areas with a median slope of more than 10°.

## calculate median slope for each polygon and filter
campsites <- campsites %>% 
  mutate(med_slope = (raster::extract(camp_slope, ., fun = median, na.rm = T))) %>%
  filter(med_slope < 10) 
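To see why the median is the friendlier choice here, consider this toy example. The slope values are invented: a mostly flat area with a few very steep pixels along one edge.

```r
# made-up slopes in degrees: 16 nearly flat pixels and 4 cliff pixels
slopes <- c(rep(3, 16), 60, 65, 70, 75)

mean(slopes)    # pulled well above the 10 degree cutoff by the cliff
median(slopes)  # still reflects the flat ground most of the area offers
```

With the mean this area would be filtered out, while the median keeps it in the running.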

With that done, we can now plot our potential campsite locations and all the features used to define them:

## plot campsites and all features
plot(ut$geometry, type = 'n')
plot(campsites$geometry, add = T, col = 'lightgreen', border = NA)
plot(ut_water$geometry, add = T, col = 'lightblue')
plot(ut_roads$geometry, add = T, col = 'black')
plot(ut$geometry, add = T, col = 'coral4')

This is a pretty picture, but it’s not very useful. To make it so that we can actually navigate to any of these spots, we need to get them onto a topographic map.

Plan it

To make our map usable, all we have to do is export the potential campsite polygons from R so that we can import them into CalTopo. CalTopo supports a number of file formats for importing, but the one we want to use is GeoJSON. We can use the geojsonio package to easily convert our polygons from sf objects to GeoJSON format and then save them to disk to import into CalTopo.⁵

There are two (potentially) tricky things we need to do. First, make sure we reproject our NAD83 data back to decimal degree-based WGS84 so that CalTopo can properly reference them. Second, we want to take advantage of R’s capabilities to efficiently wrangle data and create a name field for our polygons so they’ll be easy to identify and reference once they’re in CalTopo. To do this, we need to create a “title” field in our sf object before we convert it to GeoJSON.⁶

library(geojsonio)

## create site number field; transmute b/c all fields other than label are lost on import
campsites %>% 
  transmute(title = str_c('Potential Site ', row_number())) %>% 
  st_transform(st_crs(4326)) %>% # project to WGS84
  geojson_json() %>% 
  geojson_write(file = 'campsites.json')

## export Uwharrie Trail to save the trouble of tracing it
ut %>% 
  st_transform(st_crs(4326)) %>% # project to WGS84
  geojson_json() %>% 
  geojson_write(file = 'trail.json')

At this point all that’s left to do is click the “Import” button in CalTopo and select your newly created .json file. You can check out the potential campsites live on CalTopo below:

Some closing thoughts after viewing the potential sites in context on CalTopo:

Potential Site 2 looks promising. It’s slightly downhill from the trail, and has some relatively flat ground. However, the water source is an intermittent stream (denoted by the three dots in the blue line), so depending on time of year there may not actually be easy access to water here.
Potential Site 4 is located near both a perennial and an intermittent stream, so the odds of finding a usable water source are higher. Across the trail to the west you can see an area that meets all of our site selection criteria except gentle slopes due to the steep rise to the 795 foot peak nearby.
Potential Site 7 demonstrates the limitations of this approach because there are two forest roads near it on the Forest Service map that aren’t included in OpenStreetMap. Google Maps shows that there’s a private RV campground here, so best to avoid it. Doubly so because it’s largely outside of Uwharrie National Forest’s boundaries (the green lines). This is why it’s important to check more than just the terrain before you go!

See here for a discussion of different types of campsites and contexts in which they are usually found. ↩
If we didn’t do this, we’d have to use c() to combine multiple osmdata_sf objects and then extract the osm_lines object from the combined osmdata_sf object. ↩
The US Forest Service maintains GIS data on forest roads on National Forest land, but the API to access them is…less than user friendly so I’m ignoring them for this illustration. ↩
In very sparsely-traveled areas, it can be better to seek out campsites far from the trail to avoid camping in areas where others have recently stayed. This can help prevent the emergence of ‘social’ campsites that are not officially recognized or maintained but are frequently used. It will also reduce the chance that you’ll encounter any local wildlife that have learned that such spots can be a source of easy meals. ↩
Want to find potential campsites for a trail that’s not in OpenStreetMap? CalTopo supports exports as well as imports, so you can trace the route in CalTopo, export it, then load it in R with sf::st_read() and then carry out the steps above! ↩
CalTopo refers to an object’s name field as its “Label” in the interface, but this isn’t what it’s called under the hood. I had to export a line I created and inspect the resulting .json file to find out that it’s referred to as a “title” instead. ↩

R Markdown, Jekyll, and Footnotes

2020-10-26T00:00:00-05:00

Update: 05/19/2021 John MacFarlane helpfully pointed out that this is all incredibly unnecessary because pandoc makes it easy to add support for footnotes to GitHub-Flavored Markdown. The documentation notes that you can add extensions to output formats they don’t normally support. Since standard markdown natively supports footnotes when used as an output format, I didn’t even think to look into manually enabling them for GitHub-Flavored Markdown.

If you’re running pandoc from the command line all you need to do is add -t gfm+footnotes to your pandoc command. If you’re working with .Rmd files like me, all you need to do is add +footnotes to the end of the variant: gfm line in your YAML header. As a side benefit, you can drop the --wrap=preserve flag and end up with .md files that aren’t hundreds of columns wide. I’m leaving the original post up below in case anyone who has an even weirder use case than me might find it helpful, or if any of my students ever stumble across this page and don’t believe that I’m still constantly learning, too.
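If you’d rather set the variant from R than from the YAML header, a minimal sketch might look like the following. It assumes the rmarkdown package is installed and that your Pandoc version accepts the gfm+footnotes variant; my-post.Rmd is a placeholder file name, not a file from this post.

```r
library(rmarkdown)

## build a Markdown output format with footnotes enabled for GFM
fmt <- md_document(variant = 'gfm+footnotes', preserve_yaml = TRUE)

## 'my-post.Rmd' is a hypothetical file; only render if it exists
if (file.exists('my-post.Rmd')) {
  render('my-post.Rmd', output_format = fmt)
}
```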

I use jekyll to create my website. Jekyll converts Markdown files into the HTML that your browser renders into the pages you see. As others and I have written before, it’s pretty easy to use R Markdown to generate pages with R code and output all together. One thing has consistently eluded me, however: footnotes.

Every time I try to include footnotes in my .Rmd file, they end up mangled and not actually footnotes in the final HTML page. My solution thus far has been to just avoid footnotes and lean heavily on parenthetical asides when I’m using R Markdown to generate a page. My recent post on using SQL style filtering to preprocess large spatial datasets before loading them into memory needed a whopping six footnotes, so I finally had to sit down and figure it out.

What’s happening

The ‘standard’ method for adding footnotes in R Markdown is actually a bit of a cheat compared to the method in the official Markdown specification. R Markdown lets you use a LaTeX-esque syntax for defining footnotes:

Here is some body text.^[This footnote will appear at the bottom of the page.]

However, Jekyll uses the official Markdown specification for footnotes, so this won’t work. Instead, we need to define them with the official syntax:

Here is some body text.[^1]

[^1]: This footnote will appear at the bottom of the page.

However, when R Markdown converts your file from standard Markdown to GitHub-Flavored Markdown, something strange happens and the output in your .md file will look like this:

Here is some body text.\[1\]

1. This footnote will appear at the bottom of the page.

When Jekyll converts the Markdown file to HTML, you end up with a sad lonely unclickable [1] where your footnote should go. The content of the footnote does appear at the bottom of the page, but it lacks the footnote formatting so it just looks like regular text and there’s no link to click and return to the footnote’s place in the text.
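If you want to check whether a knitted .md file has been hit by this, here is a rough sketch in base R. The lines are typed in by hand from the example output above rather than read from a real file, and the regular expression only looks for numeric markers like \[1\], so treat it as a starting point.

```r
## example lines from a knitted .md file (copied from the output above)
md_lines <- c('Here is some body text.\\[1\\]',
              '',
              '1. This footnote will appear at the bottom of the page.')

## regex for a backslash-escaped footnote marker such as \[1\]
escaped_marker <- '\\\\\\[[0-9]+\\\\\\]'

## flag the lines that contain an escaped marker
mangled <- grepl(escaped_marker, md_lines)
mangled
```

Any TRUE values mark lines where a footnote reference was turned into plain text.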

Why it’s happening

Understanding what’s happening here (and thus how to fix it) requires a slightly detailed explanation of what exactly happens when you hit that Knit button in RStudio. First, the knitr package runs all of the code in your .Rmd file and creates a .md file. Next, pandoc takes the .md file and converts it to whatever output format you want.¹

Image courtesy of RStudio

Pandoc is the source of our problems here. The square brackets that set off a footnote are metacharacters in Markdown, since they’re used to construct links (among other things, like citations with pandoc-citeproc). When Pandoc sees them in the process of converting from standard Markdown to GitHub-Flavored Markdown, it (logically) decides that they’re important content and escapes them with a backslash so they survive as literal text in the GitHub-Flavored Markdown. Unfortunately for us, we want our square brackets to be treated as special characters and not turned into text. This is a known issue with Pandoc (see this issue on GitHub) so it will eventually get fixed, but in the meantime I’ve come up with a workaround.

How to fix it

Pandoc allows you to tag both code chunks and inline code with a special raw attribute which will ensure they’re passed on to the output format unmodified. To do this, just enclose any text with backticks (`) and then put {=markdown} immediately after the closing backtick. This will ensure that Pandoc doesn’t alter the ‘code’ in the backticks at all. It’s debatable whether the [^1] used to define a footnote is really code, but for our purposes treating it like code will ensure that our footnotes work in the final output:

Here is some body text.`[^1]`{=markdown}

`[^1]:`{=markdown} This footnote will appear at the bottom of the page.

There’s one more tweak we have to make to get this to work. If any of your footnotes are longer than 72 characters,² then Pandoc will split them up and divide them into multiple lines in the output .md file. Since footnotes need to be all on the same line, this will break them and you’ll have a bunch of sentence fragments at the end of your page right above the equally fragmented footnotes. To fix this, we need to use the --wrap argument to Pandoc in our YAML header. Below is the YAML header for the .Rmd file that generates the .md file that Jekyll uses to generate the HTML your browser uses to render this page.

---
title: Footnotes in `.Rmd` files
output:
  md_document:
    variant: gfm
    preserve_yaml: TRUE
    pandoc_args: 
      - "--wrap=preserve"
knit: (function(inputFile, encoding) {
  rmarkdown::render(inputFile, encoding = encoding, output_dir = "../_posts") })
date: 2020-10-26
permalink: /posts/2020/10/jeykll-footnotes
excerpt_separator: 
toc: true
tags:
  - jekyll
  - rmarkdown
---

By specifying --wrap=preserve, we tell Pandoc to respect the line breaks present in the .Rmd file when generating the .md file.³ Accordingly, our footnotes will be intact and functional in the final web page.
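To see why wrapping is a problem, here is a small illustration using base R’s strwrap() as a stand-in for Pandoc’s line wrapping. This is only an analogy (Pandoc’s actual wrapping logic is its own), and the footnote text is made up for the example.

```r
## a footnote definition longer than 72 characters
footnote <- '[^1]: This footnote is deliberately long so that it runs well past the 72 character default width.'
nchar(footnote)

## wrapping at 72 columns splits the definition across lines
wrapped <- strwrap(footnote, width = 72)
wrapped
length(wrapped)
```

Once the definition is split like this, only the first line still starts with the footnote label, which is the breakage described above.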

Proof

And now, to prove to you that this post really did start out as a .Rmd file, here’s some R code and a plot. Everyone’s seen mtcars a million times, and it turns out that iris was originally published in the Annals of Eugenics, so I went digging for a new built-in dataset.⁴ I landed on the Loblolly pines dataset, which records the height of 14 different loblolly pine trees.⁵

library(ggplot2)
ggplot(Loblolly, aes(x = age, y = height, group = Seed)) +
  geom_line(alpha = .5) +
  labs(x = 'Age (years)', y = 'Height (feet)') +
  theme_bw()

It looks like all of the trees in the sample followed a pretty similar growth trajectory! Finally, to really really prove this page started out as a .Rmd file, here’s the sessionInfo():

sessionInfo()

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.2
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.11      knitr_1.30           magrittr_1.5        
##  [4] tidyselect_1.1.0     munsell_0.5.0        colorspace_1.4-1    
##  [7] here_0.1             R6_2.4.1             rlang_0.4.8         
## [10] dplyr_1.0.2          stringr_1.4.0        tools_4.0.2         
## [13] grid_4.0.2           gtable_0.3.0         xfun_0.18           
## [16] withr_2.3.0          htmltools_0.5.0.9001 ellipsis_0.3.1      
## [19] yaml_2.2.1           rprojroot_1.3-2      digest_0.6.25       
## [22] tibble_3.0.4         lifecycle_0.2.0      crayon_1.3.4        
## [25] purrr_0.3.4          vctrs_0.3.4          glue_1.4.2          
## [28] evaluate_0.14        rmarkdown_2.3        stringi_1.5.3       
## [31] compiler_4.0.2       pillar_1.4.6         generics_0.0.2      
## [34] scales_1.1.1         backports_1.1.10     pkgconfig_2.0.3

Pandoc is incredibly powerful, but it’s also incredibly opaque and difficult to learn. You can create incredibly fancy PDF and HTML documents in R Markdown without ever having to know anything about Pandoc. ↩
The default output width defined by the --columns argument to Pandoc. ↩
You can also use --wrap=none, which will put every paragraph in a single gigantic line of text. ↩
If you’re willing to install additional packages, Allison Horst’s palmerpenguins package is fantastic and fills much the same educational niche as iris. See here for even more alternatives. ↩
Fun fact, loblolly pine seeds were carried aboard Apollo 14 and subsequently planted throughout the US. ↩

Working with Large Spatial Data in R

2020-09-25T00:00:00-05:00

In my research I frequently work with large datasets. Sometimes that means datasets that cover the entire globe, and other times it means working with lots of micro-level event data. Usually, my computer is powerful enough to load and manipulate all of the data in R without issue. When my computer’s fallen short of the task at hand, my solution has often been to throw it at a high performance computing cluster. However, I finally ran into a situation where the data proved too large even for that approach.

As a result, I finally had to teach myself how to break large spatial datasets into more manageable chunks. In the process I learned a little SQL and a lot about the underlying software libraries that power the r-spatial ecosystem of R packages. In this post, I walk through the workflow I developed for this task and explain the logic behind each step.

On disk

The general idea is to work with data ‘on disk’ instead of ‘in memory’. Normally, when you load a dataset into R, your computer reads it from whatever storage media it uses (hard drive or solid state drive) into memory (RAM). Memory is considerably faster to read from and write to than storage, which is what lets you complete simple operations in R in the blink of an eye. Most consumer computers have much more storage than RAM (my 2015 MacBook Pro has 256 GB of storage and 8 GB of memory) so it’s very possible to end up with a dataset larger than your computer’s memory. In fact, it doesn’t have to be anywhere near the size of your computer’s memory to bump into this limit because every other application you have running uses up memory as well.

To deal with this issue, you can extract just the parts of a dataset you need to work with at any given time; this subset will be loaded into memory, and the rest remains on disk and invisible to R¹. There are a couple of R packages that exist for dealing with this issue, such as bigmemory for basic R data types like numerics or disk.frame for dplyr-compatible operations, but neither supports spatial data.

I’m going to use the cshapes dataset to illustrate and explain this workflow². You can download and extract it from within R:

## download cshapes dataset
download.file('http://downloads.weidmann.ws/cshapes/Shapefiles/cshapes_0.6.zip',
              'cshapes.zip')

## extract cshapes dataset
unzip('cshapes.zip')

## check that dataset extracted correctly
list.files(path = '.', pattern = 'cshapes')

## [1] "cshapes_shapefile_documentation.txt" "cshapes.dbf"                        
## [3] "cshapes.prj"                         "cshapes.shp"                        
## [5] "cshapes.shx"                         "cshapes.zip"

Then use the sf package to load the data and check them out:

## load packages
library(tidyverse)
library(sf)

## read in cshapes
cshapes <- st_read('cshapes.shp')

The cshapes dataset is specifically designed to be easy to load and manipulate on a conventional laptop computer. To do this, it sacrifices a significant degree of detail in the polygons that represent each individual state. For many analyses, this is fine and won’t affect the results. However, sometimes you need to measure the length of borders between states, and the coastline paradox dictates that you use the most high resolution spatial data possible. In that case, the data might be too large for your computer to hold in memory. If that’s the case, then it’s time to start thinking about leaving the data on disk and only loading what you really need at any given point.

SQL

Luckily the sf package supports SQL queries to filter the data on disk and only read in a subset of the total data. SQL is a language for interacting with relational databases, and is incredibly fast compared to loading data into R and then filtering it. SQL has many variants, referred to as dialects, and the sf package uses one called OGR SQL dialect to interact with spatial datasets. The basic structure of a SQL call is SELECT col FROM "table" WHERE cond.

SELECT tells the database what columns (fields in SQL parlance) we want
FROM tells the database what table (databases can have many tables) to select those columns from
WHERE tells that database we only want rows where some condition is true
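As a small self-contained illustration of that structure, here is a sketch that runs a query against the nc shapefile that ships with the sf package (not the cshapes data used in this post). The field names NAME, FIPS, and FIPSNO come from that example file.

```r
library(sf)

## path to the North Carolina example shapefile bundled with sf
nc_path <- system.file('shape/nc.shp', package = 'sf')

## SELECT two fields FROM the "nc" layer WHERE a condition holds
ashe <- st_read(nc_path,
                query = 'SELECT NAME, FIPS FROM "nc" WHERE FIPSNO = 37009',
                quiet = TRUE)

## roughly: nc %>% filter(FIPSNO == 37009) %>% select(NAME, FIPS)
ashe
```

Only the matching row is read into R, along with its geometry.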

If you use the tidyverse a lot, this may seem familiar to you because it’s pretty similar to dplyr syntax, except dplyr already knows which data frame you want to work with. If we want to only load one polygon at a time into R, then we need to know the field (or combination of fields) that uniquely identifies a polygon. To demonstrate, let’s load just the polygon for Morocco that begins in 1976 when it annexed the Northern part of Western Sahara. Let’s cheat by looking at the data I’ve loaded into R:

## filter to Morocco beginning in 1976
cshapes %>% filter(CNTRY_NAME == 'Morocco', GWSYEAR == 1976)

## Simple feature collection with 1 feature and 24 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -15.22687 ymin: 23.11465 xmax: -1.011809 ymax: 35.91916
## geographic CRS: WGS 84
##   CNTRY_NAME     AREA CAPNAME CAPLONG CAPLAT FEATUREID COWCODE COWSYEAR COWSMONTH COWSDAY COWEYEAR
## 1    Morocco 576351.8   Rabat   -6.83  34.02       220     600     1976         4       1     1979
##   COWEMONTH COWEDAY GWCODE GWSYEAR GWSMONTH GWSDAY GWEYEAR GWEMONTH GWEDAY ISONAME ISO1NUM ISO1AL2
## 1         8       4    600    1976        4      1    1979        8      4 Morocco     504      MA
##   ISO1AL3                       geometry
## 1     MAR MULTIPOLYGON (((-4.420418 3...

The cshapes dataset records when states change territorial boundaries or capital locations, so the combination of a state name or identifier and a start or end date uniquely identifies all rows in the data. Since this polygon begins on April 1, 1976 and the Gleditsch and Ward code for Morocco is 600, plugging it all into the query argument to st_read() gets us:

## read in morocco polygon
morocco <- st_read('cshapes.shp',
                   query = 'SELECT * FROM "cshapes" WHERE GWCODE = 600 AND GWSYEAR = 1976 AND GWSMONTH = 4 AND GWSDAY = 1')

## verify country name
morocco$CNTRY_NAME

## [1] "Morocco"

Awesome! We were able to read in just one polygon from the cshapes dataset. Note that * means all columns. As I mentioned above, this is cheating, since we had to read the whole dataset into R with a standard st_read() call to learn the names and values of the variables we then filtered on.

Sneaking a peek

When this isn’t an option, we can sneak a peek at the data by loading just one observation into R. This requires significantly less memory than loading an entire dataset, and can give us the information we need to filter the full dataset and read in one observation at a time. Most SQL implementations don’t have row numbers, so it’s hard to just grab a single row of the data for this purpose. However, the OGR SQL dialect documentation notes that it implements a special field called FID that is a feature ID, i.e., a row number. We can take advantage of FID to select a single polygon from the data using the query argument to st_read() again (shapefile FIDs start at 0, so FID = 1 is technically the second row, but any one row will do for our purposes):

## read in first row of the data
cshapes_row <- st_read('cshapes.shp', query = 'SELECT * FROM "cshapes" WHERE FID = 1')

## inspect
cshapes_row

## Simple feature collection with 1 feature and 24 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -58.0714 ymin: 1.836245 xmax: -53.98612 ymax: 6.001809
## geographic CRS: WGS 84
##   CNTRY_NAME     AREA    CAPNAME CAPLONG   CAPLAT FEATUREID COWCODE COWSYEAR COWSMONTH COWSDAY
## 1   Suriname 145952.3 Paramaribo   -55.2 5.833333         1     115     1975        11      25
##   COWEYEAR COWEMONTH COWEDAY GWCODE GWSYEAR GWSMONTH GWSDAY GWEYEAR GWEMONTH GWEDAY  ISONAME
## 1     2016         6      30    115    1975       11     25    2016        6     30 Suriname
##   ISO1NUM ISO1AL2 ISO1AL3                 _ogr_geometry_
## 1     740      SR     SUR POLYGON ((-55.12796 5.82217...

Even if we knew that the data had an ID column and start and end dates, we wouldn’t know the precise formatting (capitalization, underscores or dashes) of column names, or whether start and end dates are stored as one column or sets of three like they are here.
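If you’re unsure how FID is numbered for a given format, one way to check is on a small file where loading everything is cheap. The sketch below uses the nc shapefile bundled with sf (not cshapes), and relies on my understanding that GDAL’s shapefile driver numbers features from 0.

```r
library(sf)

## path to the North Carolina example shapefile bundled with sf
nc_path <- system.file('shape/nc.shp', package = 'sf')

## peek at the feature with FID 0
peek <- st_read(nc_path, query = 'SELECT * FROM "nc" WHERE FID = 0',
                quiet = TRUE)

## read everything (fine here because nc is tiny) and compare
full <- st_read(nc_path, quiet = TRUE)
as.character(peek$NAME) == as.character(full$NAME[1])
```

Other formats may number their features differently, so it’s worth repeating a check like this before relying on a specific FID.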

Making a list

We still need more information if we want to iterate through the polygons in the data and load them one at a time. We know what columns uniquely identify the rows, but we don’t know all the values they take on. Without that, we’re stuck. What (usually) makes spatial data big is not the tabular data themselves, but the spatial features they’re attached to. This is particularly the case with polygons, which can be incredibly large in size for complex features. So, the goal here is to get the data we care about (ID column and start date) and ditch everything else, loading only the bare minimum into memory.

To do this, we’ll use the ogr2ogr() function in the gdalUtils package³. ogr2ogr() converts between different spatial data formats. It also offers two features that we’re going to use to cut down the data to the bare minimum. The select argument is a SQL selection, so we’re going to create a comma separated list of our key columns. The nlt argument specifies what type of geometry to create in the output. Conveniently it accepts NONE as a value, which will yield a plain table of data with none of the memory-hogging geometries:

## load package
library(gdalUtils)

## convert to nonspatial geometry
ogr2ogr(src_datasource_name = 'cshapes.shp', dst_datasource_name = 'cshapes_no_geom',
        select = 'GWCODE,GWSYEAR,GWSMONTH,GWSDAY', nlt = 'NONE')

This will create a shapefile in the new directory cshapes_no_geom called cshapes. The usual .shp and .shx components of a shapefile are missing, but the .dbf part is there, and that’s the one we care about. Load it up with st_read() and we’ll have what we need:

## load non-geometry table
cshapes_id <- st_read('cshapes_no_geom/cshapes.dbf')

## inspect
head(cshapes_id)

##   GWCODE GWSYEAR GWSMONTH GWSDAY
## 1    110    1966        5     26
## 2    115    1975       11     25
## 3     52    1962        8     31
## 4    101    1946        1      1
## 5    990    1962        1      1
## 6    972    1970        6      4
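If gdalUtils isn’t available to you, a possible alternative is the gdal_utils() function in sf, whose 'vectortranslate' utility corresponds to ogr2ogr. This is a sketch under a few assumptions: it uses the nc shapefile bundled with sf rather than cshapes, and it writes the attribute table to a temporary CSV instead of a .dbf.

```r
library(sf)

## path to the North Carolina example shapefile bundled with sf
nc_path <- system.file('shape/nc.shp', package = 'sf')

## temporary output file for the geometry-free table
ids_path <- tempfile(fileext = '.csv')

## keep only the NAME and FIPS fields and drop the geometry
gdal_utils(util = 'vectortranslate', source = nc_path,
           destination = ids_path,
           options = c('-f', 'CSV', '-select', 'NAME,FIPS', '-nlt', 'NONE'))

## read the plain table back in with base R
nc_ids <- read.csv(ids_path)
head(nc_ids)
```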

Now you can load polygons one at a time and perform whatever geometric operations you need to. To illustrate, I’ll load the first four polygons in the dataset, calculate their area, and then plot them.

## set up four panel plot
par(mfrow = c(1, 4), mar = c(6.1, 4.1, 4.1, 4.1))

## read in each polygon and plot 
for (i in 1:4) {
  
  ## build SQL query
  query_str <- str_c('SELECT * FROM "cshapes" WHERE GWCODE = ', cshapes_id$GWCODE[i],
                     ' AND GWSYEAR = ', cshapes_id$GWSYEAR[i],
                     ' AND GWSMONTH = ', cshapes_id$GWSMONTH[i],
                     ' AND GWSDAY = ', cshapes_id$GWSDAY[i])
  
  ## read in data
  pol <- st_read('cshapes.shp', query = query_str)
  
  ## plot data
  pol %>%
    st_geometry() %>% 
    plot(main = pol$CNTRY_NAME,
         sub = str_c(round(units::set_units(st_area(pol), 'km^2'), digits = 0),
                      ' km^2'))
  
}
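The query-building step in the loop can also be pulled out into a small helper, which makes it easy to check the SQL before handing it to st_read(). build_query() is a hypothetical name I’m using for illustration; it isn’t part of sf or any other package.

```r
## build an OGR SQL query for one cshapes polygon
## (hypothetical helper; layer and field names follow the cshapes example)
build_query <- function(layer, gwcode, year, month, day) {
  sprintf(paste0('SELECT * FROM "%s" WHERE GWCODE = %d AND GWSYEAR = %d',
                 ' AND GWSMONTH = %d AND GWSDAY = %d'),
          layer, as.integer(gwcode), as.integer(year),
          as.integer(month), as.integer(day))
}

## query for the Morocco polygon that begins on April 1, 1976
morocco_query <- build_query('cshapes', 600, 1976, 4, 1)
morocco_query
```

Because sprintf() is vectorized, passing whole columns of IDs and dates returns one query string per row.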

Won’t you be my neighbor?

Sometimes (oftentimes in spatial analysis) we need not just a polygon, but also its neighbors. That means loading just one polygon is insufficient. If your data are already in R, this is easy with the st_filter() function, but it’s much trickier if you’re trying to filter data before loading them into R⁴. Luckily, st_read() has you covered! The wkt_filter argument accepts a well-known text string that can be used to filter the data before loading them into R⁵. Well-known text is a standard string representation of geometry, and is actually how the sf package prints geometry in R:

st_point(c(1, 2))
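To make the string form explicit, here is a short sketch converting a point and a simple polygon to well-known text with st_as_text(). The triangle is a made-up geometry for illustration.

```r
library(sf)

## a point as well-known text
pt_wkt <- st_as_text(st_point(c(1, 2)))
pt_wkt

## a made-up triangle; the ring must close on its first vertex
tri <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))
tri_wkt <- st_as_text(tri)
tri_wkt
```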

We want to use the wkt_filter argument to only load polygons that intersect with our Morocco polygon into R. To do that, we need to convert our polygon to a well-known text string with the st_as_text() function, then pass it to st_read(). However, st_as_text() only accepts sfc and sfg objects, not sf objects:

## create well known text object to filter cshapes on disk
morocco_wkt <- st_as_text(morocco)

## Error in UseMethod("st_as_text"): no applicable method for 'st_as_text' applied to an object of class "c('sf', 'data.frame')"

To get around this, we need to drop the data on morocco and extract just the geometry of the polygon with st_geometry():

## create well known text object to filter cshapes on disk
morocco_wkt <- morocco %>% 
  st_geometry() %>% # convert to sfc
  st_as_text() # convert to well known text

## plot morocco and neighbors
st_read('cshapes.shp', wkt_filter = morocco_wkt) %>%
  st_geometry() %>%
  plot(main = morocco$CNTRY_NAME)

## add morocco polygon on top
morocco %>% 
  st_geometry() %>%
  plot(add = T, col = rgb(0, 1, 0, .5))

Notice that there are multiple polygon boundaries within the green area of our green Morocco polygon. That’s because there are 4 Morocco polygons in the data starting in 1956, 1958, 1976, and 1979. Be sure to filter the dataset, either as part of the SQL query or in a dplyr::filter() so that you only get polygons that existed contemporaneously with your polygon of interest.
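One way to sketch the contemporaneity check in base R is to turn the three-column dates into Date objects and test whether two date ranges overlap. make_date() and overlaps() are hypothetical helpers, and the two neighbor rows are invented for illustration; only the Morocco dates come from the output shown earlier.

```r
## combine year, month, and day columns into a Date (hypothetical helper)
make_date <- function(y, m, d) {
  as.Date(sprintf('%d-%02d-%02d', as.integer(y), as.integer(m), as.integer(d)))
}

## TRUE when two date ranges share at least one day (hypothetical helper)
overlaps <- function(start_a, end_a, start_b, end_b) {
  start_a <= end_b & end_a >= start_b
}

## Morocco polygon of interest: April 1, 1976 to August 4, 1979
morocco_start <- make_date(1976, 4, 1)
morocco_end <- make_date(1979, 8, 4)

## invented neighbor polygons with start and end dates
neighbors <- data.frame(name = c('A', 'B'),
                        start = make_date(c(1960, 1980), c(1, 1), c(1, 1)),
                        end = make_date(c(2016, 2016), c(6, 6), c(30, 30)))

## keep only polygons that existed at the same time as Morocco's
neighbors$keep <- overlaps(neighbors$start, neighbors$end,
                           morocco_start, morocco_end)
neighbors
```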

Wrapping up

So far, we’ve covered:

How to extract the first polygon for a spatial dataset and learn the names of identifier columns
How to strip the geometry from a spatial dataset and extract just a table of these columns
How to use these columns to iterate through the polygons in the dataset and import them one at a time, or along with their neighbors

You can technically skip the first two steps and just move the .shp and .shx files out of the directory before loading the .dbf file with st_read(), but that kind of feels like cheating to me⁶ and it only works with shapefiles. If you have another type of spatial dataset, read on.

This time for real

In my research, I often need to work with spatial data that’s measured at or aggregated up to different administrative divisions (ADMs). GADM helpfully provides a global dataset of ADMs. Although you can download ADMs for specific countries, I work with data in enough different countries that I finally decided to just download the entire dataset. While the cshapes example above just illustrated how to implement a pipeline for working with spatial data on disk, you may actually need to use one with these data depending on your machine’s hardware.

This master dataset comes as a GeoPackage. Most importantly for us, that means we can’t just delete a few component files to load the non-spatial table from the dataset; we have to convert it from a spatial dataset to a non-spatial one with ogr2ogr(). The GeoPackage contains ADMs from level 0 (countries) all the way down to level 5. Each level is stored as a separate layer in the .gpkg, and we can get a list of available layers with the st_layers() function:

## get layers
st_layers('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg')

## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields
## 1     level0 Multi Polygon      256      2
## 2     level1 Multi Polygon     3610     10
## 3     level2 Multi Polygon    45962     13
## 4     level3 Multi Polygon   147427     16
## 5     level4 Multi Polygon   138053     14
## 6     level5 Multi Polygon    51427     15

We want to work with the third-order administrative divisions (cities, towns, and other municipalities in the US context), so we need the level3 layer. Where we just used the name of the dataset in our SQL call before, this time we’ll use level3. Now we just follow the same workflow as with the cshapes dataset above:

## get first observation
level3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
                  query = 'SELECT * FROM "level3" WHERE FID = 1', layer = 'level3')

## inspect
level3

## Simple feature collection with 1 feature and 16 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13.08792 ymin: -8.010127 xmax: 13.59943 ymax: -7.708598
## geographic CRS: WGS 84
##   GID_0 NAME_0   GID_1 NAME_1 NL_NAME_1     GID_2 NAME_2 NL_NAME_2       GID_3 NAME_3 VARNAME_3
## 1   AGO Angola AGO.1_1  Bengo       AGO.1.1_1 Ambriz       AGO.1.1.1_1 Ambriz      
##   NL_NAME_3  TYPE_3 ENGTYPE_3 CC_3 HASC_3                           geom
## 1       Commune   Commune     MULTIPOLYGON (((13.12764 -7...

This time we have a single column that uniquely identifies observations, GID_3, so we only have to extract one column from the dataset. We use the ogr2ogr() function as before, but we have to specify the layer = 'level3' argument since the GeoPackage has more than one layer and we want to work with a specific one. Since GID_3 is our identifier column, that’s what we select from the dataset:

## convert to nonspatial geometry
ogr2ogr(src_datasource_name = '/Users/Rob/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
        dst_datasource_name = 'gadm34_levels_no_geom',
        layer = 'level3',
        select = 'GID_3',
        nlt = 'NONE')

## load non-geometry table
gadm_ids <- st_read('gadm34_levels_no_geom/level3.dbf')

## inspect
head(gadm_ids)

##         GID_3
## 1 AGO.1.1.1_1
## 2 AGO.1.1.2_1
## 3 AGO.1.1.3_1
## 4 AGO.1.2.1_1
## 5 AGO.1.2.2_1
## 6 AGO.1.2.3_1

And we can again read the polygons into R one at a time and perform whatever spatial operations we need. Since our identifying column is a string this time, we need to enclose it in quotes in our SQL call. SQL is very picky about quotation mark types, so while we needed to surround our layer name with double quotes, we need to surround our identifier variable with single quotes. I’m already using single quotes to define the character string for the SQL call, so I need to escape the single quotes around the identifier. You can do this with a single backslash (\). Thus, you can include single quotes in a single-quoted string like this: 'this is a string \'this is another part of a string\''. Other than that wrinkle, things are pretty much the same as with cshapes:

## for reproducibility
set.seed(27599)

## set up four panel plot
par(mfrow = c(1, 4), mar = c(2.1, 4.1, 4.1, 4.1))

## read in each polygon and plot
for (i in sample(1:nrow(gadm_ids), 4, replace = F)) { # mix it up
  
  ## build SQL query
  query_str <- str_c('SELECT * FROM "level3" WHERE GID_3 = \'',
                     gadm_ids$GID_3[i], '\'')
  
  ## read in polygon for ADM3 i
  adm3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
                  query = query_str, layer = 'level3')
  
  ## plot polygon and label with full name
  print(plot(adm3$geom,
             main = adm3 %>%
               select(starts_with('NAME_')) %>% # get all name variables
               st_drop_geometry() %>% # drop geometry
               rev() %>% # reverse order of names to 3, 2, 1, 0
               str_c(collapse = ', '), # collapse w/ commas
             cex.main = .6))
  
}
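If the escaped quotes are hard to read, it can help to build the query outside the loop and print it. The sketch below uses only base R and one ID from the table above; it also shows an equivalent version that uses a double-quoted R string and escapes the double quotes instead.

```r
## one identifier from the GADM table above
id <- 'AGO.1.1.1_1'

## escaped single quotes inside a single-quoted R string
q_escaped <- paste0('SELECT * FROM "level3" WHERE GID_3 = \'', id, '\'')

## same query via sprintf(), escaping the double quotes instead
q_sprintf <- sprintf("SELECT * FROM \"level3\" WHERE GID_3 = '%s'", id)

## both approaches produce the same text
identical(q_escaped, q_sprintf)

## cat() shows the string as SQL will see it, without R's escapes
cat(q_escaped, '\n')
```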

Spatially filtering the GADM dataset is just as easy as with cshapes. To illustrate, I’m going to pull out a random polygon and use it to filter the data. However, these are third-order administrative divisions, and so it’s possible that even capturing all adjacent polygons won’t cover a very large area. To deal with this concern, we can buffer the polygon with the st_buffer() function before we convert it to well-known text:

## import single polygon
adm3 <- st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
                  query = str_c('SELECT * FROM "level3" WHERE FID = 63130'))

## create well known text object to filter GADM on disk
adm3_wkt <- adm3 %>% 
  st_geometry() %>% # convert to sfc
  st_buffer(.025) %>% # buffer .025 decimal degrees
  st_as_text() # convert to well known text

## plot Dakkoun and neighbors w/in .025 decimal degrees
st_read('~/Dropbox/Datasets/GADM/gadm34_levels_gpkg/gadm34_levels.gpkg',
        layer = 'level3', wkt_filter = adm3_wkt) %>%
  st_geometry() %>%
  plot(main = adm3 %>%
               select(starts_with('NAME_')) %>%
               st_drop_geometry() %>%
               rev() %>%
               str_c(collapse = ', '))

## plot Dakkoun and highlight
adm3 %>%
  st_geometry() %>%
  plot(add = T, col = 'green')

## plot buffered polygon used to filter GADM on disk
adm3 %>% 
  st_geometry() %>% 
  st_buffer(.025) %>% 
  st_cast('LINESTRING') %>%
  plot(add = T, col = 'blue')

The green polygon above is Dakkoun, the 63,130th polygon in the dataset. The blue line is the extent of the .025 decimal degree buffer applied to it before filtering the dataset. This workflow can speed things up when working with these data, considering the st_layers() output above lists 147,427 third-order administrative division polygons in the level3 layer.
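Here is a stripped-down sketch of the buffer-then-convert step using a made-up unit square with no coordinate reference system, so the buffer distance is simply in the units of the coordinates. One caveat worth checking on your own setup: as far as I know, recent versions of sf with s2 enabled interpret buffer distances for longitude/latitude data in meters rather than decimal degrees.

```r
library(sf)

## made-up unit square with no CRS attached
sq <- st_sfc(st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1),
                                   c(0, 1), c(0, 0)))))

## buffer by .025 coordinate units
sq_buf <- st_buffer(sq, .025)

## the bounding box grows by the buffer distance on each side
st_bbox(sq)
st_bbox(sq_buf)

## well-known text string suitable for the wkt_filter argument
sq_wkt <- st_as_text(sq_buf)
```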

Making data manageable

The query and wkt_filter arguments to st_read() can help you work with large spatial datasets that are either too big to load into memory, or too slow to work with once loaded. While this is less of a concern with low resolution datasets created by social scientists, it can be incredibly useful if you ever have to work with super high resolution data created by remote sensing technologies or actual cartographers and geographers.

This is the approach that the raster package uses. R only stores information on the extent and resolution of a raster in memory; the actual values in each cell of a raster are only loaded into memory when accessed by R using a function like extract(). ↩
Although I’m using cshapes as an example throughout this post so you can easily follow along and run the code yourself, it’s a small enough dataset that no modern machine should have trouble loading it. At the end of this post, I apply the same approach to a much larger dataset where you’d actually benefit from it. ↩
This function is just a wrapper around the GDAL utility ogr2ogr. You could also do this with ogr2ogr directly in the shell, but it’s much uglier: ogr2ogr -f "ESRI SHAPEFILE" cshapes_no_geom.shp cshapes.shp cshapes -nlt NONE -select GWCODE,GWSYEAR,GWSMONTH,GWSDAY. ↩
st_filter() accepts various spatial predicates beyond the default of st_intersects(). This filtering on disk gives much less fine-grained control. If you need more precision, you can load more nearby polygons by buffering the polygon before filtering the input like here and then using st_filter() with your spatial predicate of choice. ↩
I spent over an hour trying to figure out how to tell the query parameter to use PostGIS or SpatiaLite dialects instead of the OGR SQL dialect so I could execute a spatial filter before finding the wkt_filter argument to st_read(). Always read the documentation carefully. ↩
Having to move or delete files also risks losing them; the ogr2ogr() approach is safer in this regard. ↩

Jekyll and HTML Widgets

2020-09-19T00:00:00-05:00

I’m currently compiling a list of university-affiliated programs designed to help prepare students for graduate study in political science and assist them in the process of applying to graduate school (a labyrinthine and opaque process in many regards). Since travel costs can be a deciding factor for some students when deciding whether to apply to these programs, I thought it would be nice to also put them on a map.

While just plotting them on a map is easy, since it will be on a web page, I figured why not also embed links to each program in the map as well. In theory this is easy thanks to R packages like leaflet, which leverages the (unsurprisingly named) leaflet JavaScript library for interactive webmaps. However, because I use Jekyll instead of Hugo for my site, I can’t just use the blogdown R package and have everything magically work.

Steven Miller’s tutorial on integrating R Markdown and Jekyll is the starting point for my own use of R Markdown and Jekyll, so check that out first for a quick primer on how to use R Markdown to render .Rmd files into the .md files that Jekyll uses to render your website. This approach works fantastically well for static images, and requires just a little tweaking to make interactive widgets like leaflet maps work.

Leaflet

We’ll use three packages to create our map. The tidyverse is pretty well-documented at this point; I use it to write efficient and readable code. tidygeocoder is a geocoder that can use a variety of geocoding services and works well with data frames and tibbles. Finally, leaflet is what we’ll use to create our actual map widget.

library(tidyverse)
library(tidygeocoder)
library(leaflet)

First, we need to load our data. This is a CSV file of program information that I’ve compiled myself.

## read in data
predoc <- read_csv('predoc.csv')

## inspect the data
predoc

## # A tibble: 9 x 4
##   Institution          Name                     Location      URL                                   
##   <chr>                <chr>                    <chr>         <chr>
## 1 University of South… POIR Predoctoral Summer… Los Angeles,… https://dornsife.usc.edu/poir/predoct…
## 2 Duke University      Ralph Bunche Summer Ins… Durham, NC, … https://www.apsanet.org/rbsi          
## 3 UC San Diego         START                    La Jolla, CA… https://grad.ucsd.edu/diversity/progr…
## 4 MIT                  MSRP                     Cambridge, M… https://oge.mit.edu/graddiversity/msr…
## 5 UC Irvine            SURF                     Irvine, CA, … https://grad.uci.edu/about-us/diversi…
## 6 University of Washi… NSF REU: Spatial Models… Tacoma, WA, … https://www.tacoma.uw.edu/smed/nsf-re…
## 7 University of North… NSF REU: Civil Conflict… Denton, TX, … https://untconflictmgmtreu.wordpress.…
## 8 Princeton University Emerging Scholars in Po… Princeton, N… https://politics.princeton.edu/gradua…
## 9 Harvard University   PS-Prep                  Cambridge, M… https://projects.iq.harvard.edu/ps-pr…

Before we can plot anything, we need to get latitude and longitude coordinates from our place names. We’ll use the geocode() function, where the first argument is a data frame containing a column with the location information we want to use. The second argument is address, which tells the geocoder to use the information stored in the Location column of our data frame, and then method = 'osm' dispatches it to the Open Street Map geocoder, Nominatim.

Next, we’ll use mutate() to create a new variable to hold the popup text a user will see when they click on a point. I want to provide the university name, the program’s name, and then a link to the program’s information page. I use the str_c() function to combine the Institution and Name columns, and then I use another call to str_c() to format the URL. This second call looks like str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'), where URL is the name of the URL field. It combines the standard start of an HTML anchor tag (<a href=") with the URL and the link text. Note the target="_PARENT" attribute in the anchor tag. This is necessary to make any links a user clicks open normally, instead of within the frame used to embed it into the page (more on that later).
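To see what the popup string ends up looking like without geocoding anything, here is a minimal sketch using base R’s paste() and paste0() as stand-ins for str_c(). The toy data frame and its example.org URLs are made up for illustration, and the sketch assumes the popup is built from an HTML anchor tag with <br> line breaks between fields.

```r
# Toy stand-in for the predoc data frame; these rows are illustrative only
toy <- data.frame(
  Institution = c('Example University', 'Sample College'),
  Name        = c('Summer Institute', 'Research Program'),
  URL         = c('https://example.org/a', 'https://example.org/b'),
  stringsAsFactors = FALSE
)

# Build the anchor tag first: opening tag, URL, target attribute, link text
link <- paste0('<a href="', toy$URL, '" target="_PARENT">Program Info</a>')

# Join institution, program name, and link with HTML line breaks
toy$lab <- paste(toy$Institution, toy$Name, link, sep = '<br>')

# Inspect the first popup string
toy$lab[1]
```

Printing one element like this is a quick way to check the quoting before handing the column to leaflet.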

Once we’ve prepped our popup text, we just pass the data frame to leaflet(), add a background map (I’ve used a styled map, but you can also get the default map with addTiles()), and then the markers themselves. The one tricky part of addMarkers() is that it expects its arguments as one-sided formulas, not just variable names like tidyverse functions. geocode() has created lat and long columns, so pass those through as well as our label column, and we’re good to go.
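If the formula interface is unfamiliar, this small sketch may help. It skips the geocoding step entirely and uses a toy data frame with rough, hand-typed coordinates (they are placeholders, not output from geocode()). The first map uses one-sided formulas that leaflet evaluates against the data frame; the second passes plain vectors and should produce the same markers.

```r
library(leaflet)

# Toy data with rough placeholder coordinates and labels
pts <- data.frame(
  lat  = c(34.02, 36.00),
  long = c(-118.29, -78.94),
  lab  = c('Point A', 'Point B')
)

# Formula interface: ~ long means "the long column of the data passed to leaflet()"
m_formula <- leaflet(pts) %>%
  addTiles() %>%
  addMarkers(lng = ~ long, lat = ~ lat, popup = ~ lab)

# Vector interface: no data frame attached, so columns are passed explicitly
m_vector <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng = pts$long, lat = pts$lat, popup = pts$lab)
```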

Map it

Putting all the above code together in a pipeline looks like this:

## prep and plot
predoc %>% 
  geocode(address = Location, method = 'osm') %>% ## geocode locations
  mutate(lab = str_c(Institution, Name,
                     str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'),
                     sep = '<br>')) %>% # paste fields into popup text
  leaflet() %>% # create leaflet map widget
  addProviderTiles(providers$CartoDB.Positron) %>% # add muted palette basemap
  addMarkers(lng = ~ long, lat = ~ lat, popup = ~ lab) # add markers with popup text

Unfortunately this code produces an error that stops R Markdown dead in its tracks; like, the-error = T-knitr-chunk-option-won’t-even-save-you dead in its tracks. What gives? R Markdown is supposed to be able to render interactive widgets no problem. The issue is that R Markdown can render those widgets for HTML output, but since we’re creating a GitHub Flavored Markdown document that Jekyll then turns into HTML, R Markdown chokes. It can’t embed an HTML widget into a plain text markdown document. Luckily there is a way around this, but it involves an extra step and dealing with some file paths.

R Markdown, HTML widgets, and Jekyll

To make things work, we have to manually save the HTML from our widget, and then embed it into our resulting markdown document. Then, when Jekyll renders the markdown to HTML, it will be visible in the final HTML files that comprise your website. This involves telling R where to save the HTML, then referencing it using raw HTML code in our markdown document. We’re going to do this with the htmlwidgets R package.

## load htmlwidgets to save map widget
library(htmlwidgets)

## prep and plot
predoc %>% 
  geocode(address = Location, method = 'osm') %>% ## geocode locations
  mutate(lab = str_c(Institution, Name,
                     str_c('<a href="', URL, '" target="_PARENT">Program Info</a>'),
                     sep = '<br>')) %>% # paste fields into popup text
  leaflet() %>% # create leaflet map widget
  addProviderTiles(providers$CartoDB.Positron) %>% # add muted palette basemap
  addMarkers(lng = ~ long, lat = ~ lat, popup = ~ lab) %>% # add markers with popup text
  saveWidget(here::here('/files/html/posts', 'predoc_map.html')) # save map widget

The code is identical to that above, with the addition of the final line that saves the map widget as an HTML file called predoc_map.html in /files/html/posts using the saveWidget() function. You’ll notice I use the here() function from the here R package to supply the file argument to saveWidget(). here is great because it very intelligently finds the top level of whatever project you’re working on and then constructs file paths from there. It has a number of ways to determine where a project ‘starts’, but for us it works because our website is a git repo and contains a .git directory.
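One practical wrinkle: I would not count on saveWidget() creating the output directory for you, so it is safest to make sure it exists first. The sketch below does that with base R. The site_root variable is a hypothetical stand-in (a temporary directory) so the snippet runs anywhere; in the real site it would be the path returned by here::here().

```r
# Hypothetical stand-in for the top level of the website repo;
# in the actual project this would be here::here()
site_root <- tempdir()

# Directory the widget is saved into, mirroring /files/html/posts
out_dir <- file.path(site_root, 'files', 'html', 'posts')

# Create the directory (and any missing parents) if it is not there yet
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

# Full path that would be passed to saveWidget()
out_file <- file.path(out_dir, 'predoc_map.html')
```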

Frame it

All that’s left to do is embed the map widget in the page using an iframe. iframes allow you to embed an HTML page inside of another HTML page. Since saveWidget() saved our map widget as an HTML file that’s nothing but our map, we can then embed it into our page using an iframe. Jekyll allows raw HTML in markdown files which it ignores and passes through untouched into the final HTML files it produces. Here’s the code I used for the map in this post.

<iframe src="/files/html/posts/predoc_map.html" height="600px" width="100%" style="border:none;"></iframe>

The main argument is src="...", which tells the iframe what content it will contain. Notice that this is the same file path I just specified above in saveWidget(). As long as that directory exists in your website repo, everything will work smoothly. There are three important arguments in addition to the content of the iframe itself:

- height is how tall you want the iframe to be; here I’ve specified it in pixels, but you can also use inches, centimeters, or percentages as you’ll see below
- width is how wide you want the iframe to be; I’ve used a percentage here because the AcademicPages template is responsive and will resize itself on smaller screens
- style is where I tell the iframe not to include a border so it blends seamlessly with the rest of the page

The finished product

Here’s what the final map looks like. If you didn’t know the extra effort it took, it would blend seamlessly into the page. Theoretically this should work for any HTML widget, like those produced by the plotly R package. If you haven’t checked plotly out, you really should. It can turn ggplot2 plots into interactive widgets with a single line of code!

Extracting UN Peacekeeping Data from PDF Files

2020-08-28T00:00:00-05:00

Some coauthors and I recently published a piece in the Monkey Cage on the recent military coup in Mali and the overthrow of president Ibrahim Boubacar Keïta. We examine what the ouster of Keïta means for the future of MINUSMA, the United Nations peacekeeping mission in Mali. One of my contributions that didn’t make the final cut was this plot of casualties to date among UN peacekeepers in the so-called big 5 peacekeeping missions.

These missions are distinguished from other current UN peacekeeping missions by high levels of violence (both overall and against UN personnel) and expansive mandates that go beyond ‘traditional’ goals of stabilizing post-conflict peace. The conflict management aims of these operations necessarily expose peacekeepers to high levels of risk. If we want to try to understand what the future of MINUSMA might look like dealing with a new government in Mali, it’s important to place MINUSMA in context among the remainder of the big 5 missions. To help do so, I turned to the source for data on peacekeeping missions, the UN.

Nonstandard formats

When we wrote the piece, the Peacekeeping open data portal page on fatalities only had a link to this PDF report instead of the usual CSV file (the CSV file is back, so you don’t technically have to go through all of these steps to recreate this figure). Here’s what the first page of that PDF looks like:

Since we were working on a short deadline, I needed to get these data out of that PDF. The most direct option is to just copy and paste the data into an Excel sheet. However, these data run to 148 pages, so all that copying and pasting would be tiring and risks introducing errors when your attention eventually slips and you forget to include page 127.

Getting the data

Enter the tabulizer R package. This package is just a (much) friendlier wrapper to the Tabula Java library, which is designed to extract tables from PDF documents. To do so, just plug in the file name of the local PDF you want or URL for a remote one:

library(tabulizer)

## data PDF URL
dat <- 'https://peacekeeping.un.org/sites/default/files/fatalities_june_2020.pdf'

## get tables from PDF
pko_fatalities <- extract_tables(dat, method = 'stream')

The extract_tables() function has two different methods for extracting data: lattice for more structured, spreadsheet-like PDFs and stream for messier files. While the PDF looks pretty structured to me, method = 'lattice' returned one-variable-per-line gibberish, so I use stream instead. Specifying method = 'stream' explicitly also speeds up the process by not forcing tabulizer to determine which algorithm to use on each page.

Note that you may end up getting several warnings, such as the ones I received:

## WARNING: An illegal reflective access operation has occurred
## WARNING: Illegal reflective access by RJavaTools to method java.util.ArrayList$Itr.hasNext()
## WARNING: Please consider reporting this to the maintainers of RJavaTools
## WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
## WARNING: All illegal access operations will be denied in a future release

Everything still worked out fine for me, but you may run into problems in the future based on the warning about future releases.

Cleaning the data

We end up with a list that is 148 elements long, one per page. Each element is a matrix, reflecting the structured nature of the data. Normally, we could just combine this list of matrices into a single object with do.call(rbind, pko_fatalities):

do.call(rbind, pko_fatalities)

## Error in (function (..., deparse.level = 1) : number of columns of matrices must match (see arg 2)

But if we do this, we get an error! Let’s take a look and see what’s going wrong. We can use lapply() in combination with dim() to do so:

head(lapply(pko_fatalities, dim))

## [[1]]
## [1] 54  9
## 
## [[2]]
## [1] 54  7
## 
## [[3]]
## [1] 54  7
## 
## [[4]]
## [1] 54  7
## 
## [[5]]
## [1] 54  7
## 
## [[6]]
## [1] 54  7

The first matrix has an extra two columns, causing our attempt to rbind() them all together to fail.
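Since head() only shows the first six pages, it’s worth confirming that no other page is out of step. Here is a minimal base R sketch of that check. The pages list is a toy stand-in for the output of extract_tables(), and treating the most common column count as the ‘expected’ one is my own assumption.

```r
# Toy stand-in for the list returned by extract_tables(): one matrix per page
pages <- list(matrix('', nrow = 3, ncol = 9),
              matrix('', nrow = 3, ncol = 7),
              matrix('', nrow = 3, ncol = 7))

# Number of columns on every page
n_cols <- vapply(pages, ncol, integer(1))

# How many pages have each column count
table(n_cols)

# Assume the most common column count is the expected one
expected <- as.integer(names(which.max(table(n_cols))))

# Indices of pages that deviate from it
odd_pages <- which(n_cols != expected)
odd_pages
```

With the real data you would swap pages for pko_fatalities and inspect whichever pages come back.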

head(pko_fatalities[[1]])

##      [,1]                   [,2]                            [,3] [,4]              
## [1,] "Casualty_ID"          "Incident_Date Mission_Acronym" ""   "Type_of_Casualty"
## [2,] "BINUH‐2019‐12‐00001"  "30/11/2019 BINUH"              ""   "Fatality"        
## [3,] "BONUCA‐2004‐06‐04251" "01/06/2004 BONUCA"             ""   "Fatality"        
## [4,] "IPTF‐1997‐01‐02515"   "31/01/1997 IPTF"               ""   "Fatality"        
## [5,] "IPTF‐1997‐09‐02720"   "17/09/1997 IPTF"               ""   "Fatality"        
## [6,] "IPTF‐1997‐09‐02721"   "17/09/1997 IPTF"               ""   "Fatality"        
##      [,5]                       [,6]                [,7] [,8]                     
## [1,] "Casualty_Nationality"     "M49_Code ISOCode3" ""   "Casualty_Personnel_Type"
## [2,] "Haiti"                    "332 HTI"           ""   "Other"                  
## [3,] "Benin"                    "204 BEN"           ""   "Military"               
## [4,] "Germany"                  "276 DEU"           ""   "Police"                 
## [5,] "United States of America" "840 USA"           ""   "Police"                 
## [6,] "United States of America" "840 USA"           ""   "Police"                 
##      [,9]              
## [1,] "Type_Of_Incident"
## [2,] "Malicious Act"   
## [3,] "Illness"         
## [4,] "Accident"        
## [5,] "Accident"        
## [6,] "Accident"

head(pko_fatalities[[2]])

##      [,1]                    [,2]                 [,3]       [,4]       [,5]     
## [1,] "MINUSCA‐2015‐10‐09459" "06/10/2015 MINUSCA" "Fatality" "Burundi"  "108 BDI"
## [2,] "MINUSCA‐2015‐10‐09468" "13/10/2015 MINUSCA" "Fatality" "Burundi"  "108 BDI"
## [3,] "MINUSCA‐2015‐11‐09509" "10/11/2015 MINUSCA" "Fatality" "Cameroon" "120 CMR"
## [4,] "MINUSCA‐2015‐11‐09510" "22/11/2015 MINUSCA" "Fatality" "Rwanda"   "646 RWA"
## [5,] "MINUSCA‐2015‐11‐09511" "30/11/2015 MINUSCA" "Fatality" "Cameroon" "120 CMR"
## [6,] "MINUSCA‐2015‐12‐09542" "06/12/2015 MINUSCA" "Fatality" "Congo"    "178 COG"
##      [,6]                     [,7]              
## [1,] "Military"               "Malicious Act"   
## [2,] "Military"               "Accident"        
## [3,] "Military"               "Malicious Act"   
## [4,] "Military"               "To Be Determined"
## [5,] "International Civilian" "Illness"         
## [6,] "Military"               "Illness"

We can see that the first page has two blank columns, accounting for the 9 columns compared to the 7 columns for all other pages. Closer inspection of the header on the first page and the columns on both the first and second pages reveals that there actually should be 9 columns in the data.

The Incident_Date and Mission_Acronym columns are combined into one, as are the M49_Code and ISOCode3 columns. We’ll fix the data in those two columns in a bit, but first we have to get rid of the empty columns in the first page before we can merge the data from all the pages. We could just tell R to drop those columns manually with pko_fatalities[[1]][, -c(3, 7)], but this isn’t a very scalable solution if we have lots of columns with this issue.

To do this programmatically, we need a way to identify empty columns. If this was a list of data frames, we could use colnames() to identify the empty columns. However, extract_tables() has given us a matrix with the column names in the first row. Instead, we’ll just get the first row of the matrix. Since we’re accessing a matrix that is the first element in a list, we want to use pko_fatalities[[1]][1,] to index pko_fatalities. Next, we’ll use the grepl() function to identify the empty columns. We want to search for the regular expression ^$, which means the start of a line immediately followed by the end of a line, i.e., an empty string. Finally, we negate it with a ! to return only non-empty column names:

## drop two false empty columns on first page
pko_fatalities[[1]] <- pko_fatalities[[1]][, !grepl('^$', pko_fatalities[[1]][1,])]
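The header-based approach only works for pages that actually have a header row. If blank columns could show up on later pages too, one alternative is to drop any column that is empty in every row and apply that to the whole list. This is a sketch on toy matrices, with a hypothetical helper called drop_blank_cols(); it assumes the spurious columns are blank all the way down, which the head() output above suggests but doesn’t prove.

```r
# Hypothetical helper: keep only columns with at least one non-empty cell
drop_blank_cols <- function(m) {
  m[, colSums(m != '') > 0, drop = FALSE]
}

# Toy pages: the first has a header row and a blank column, the second has neither
page1 <- rbind(c('Casualty_ID', '', 'Type_of_Casualty'),
               c('BINUH-2019-12-00001', '', 'Fatality'))
page2 <- rbind(c('MINUSCA-2015-10-09459', 'Fatality'))

# Clean every page, then stack them
cleaned  <- lapply(list(page1, page2), drop_blank_cols)
combined <- do.call(rbind, cleaned)

dim(combined)
```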

With that out of the way, we can now combine all the pages into one giant matrix. After that, I convert the matrix into a data frame, set the first row as the column names, and then drop the first row.

## rbind pages
pko_fatalities <- do.call(rbind, pko_fatalities)

## set first row as column names and drop
pko_fatalities <- data.frame(pko_fatalities)
colnames(pko_fatalities) <- (pko_fatalities[1, ])
pko_fatalities <- pko_fatalities[-1, ]

Now that we’re working with a data frame, we can finally tackle those two sets of mashed up columns. To do this, we’ll use the separate() function in the tidyr package, which I load via the tidyverse package. Separate is magically straightforward. It takes a column name (which I have to enclose in backticks thanks to the space), a character vector of names for the resulting columns, and a regular expression to split on. I use \\s, which matches any whitespace characters. I also filter out any duplicate header rows that may have crept in (there’s one on page 74, at the very least).

library(tidyverse)
library(lubridate) # provides dmy(), used in the last step below

## separate columns tabulizer incorrectly merged
pko_fatalities <- pko_fatalities %>% 
  filter(Casualty_ID != 'Casualty_ID') %>% # drop any repeated header(s)
  separate(`Incident_Date Mission_Acronym`, c('Incident_Date', 'Mission_Acronym'),
           sep = '\\s', convert = T, extra = 'merge')  %>% 
  separate(`M49_Code ISOCode3`, c('M49_Code', 'ISOCode3'),
           sep = '\\s', convert = T) %>% 
  mutate(Incident_Date = dmy(Incident_Date)) # convert date to date object

You’ll notice I also supply two other arguments here: convert and extra. The former automatically converts the data type of the resulting columns, which is useful because it turns M49_Code into an integer. It leaves Incident_Date as a character string, which is why the final mutate() call converts it to a Date object with dmy(). The latter tells separate() what to do if it detects more matches of the splitting expression than you’ve supplied column names. There are 18 observations where the mission acronym is listed as “UN Secretariat”. That means that separate() will detect a second whitespace character in these 18 rows. If you don’t explicitly set extra, you’ll get a warning telling you what happened with those extra characters. By setting extra = 'merge', you’re telling separate() to effectively ignore any space after the first one and keep everything to the right of the first space as part of the output. Thus, our "UN Secretariat" observations are preserved instead of being chopped off to just "UN".
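Here’s a small sketch of the difference extra makes, using a two-row toy data frame (the second row’s date is made up for illustration). With extra = 'merge' everything after the first space is kept; with extra = 'drop' the surplus piece is silently discarded.

```r
library(tidyr)

# Toy data: one ordinary row and one with a space inside the mission name
toy <- data.frame(
  x = c('30/11/2019 BINUH', '05/03/2010 UN Secretariat'),
  stringsAsFactors = FALSE
)

# Keep everything after the first whitespace in the second column
merged <- separate(toy, x, c('Incident_Date', 'Mission_Acronym'),
                   sep = '\\s', extra = 'merge')

# Discard anything beyond the second piece
dropped <- separate(toy, x, c('Incident_Date', 'Mission_Acronym'),
                    sep = '\\s', extra = 'drop')

merged$Mission_Acronym
dropped$Mission_Acronym
```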

Creating the plot

Now that we’ve got the data imported and cleaned up, we can recreate the plot from the Monkey Cage piece. However, first we need to bring in some outside information and calculate some simple statistics.

Preparing the data

Before we can plot the data, we need to bring in some mission-level information, namely what country each mission operates in. We can get this easily from the Peacekeeping open data portal master dataset. Once I load the data into R I select just the mission acronym and country of operation. I then edit the strings for CAR and DRC to add newlines between words with \n to make them fit better into the plot.

## get active PKO data and clean up country names
read_csv('https://data.humdata.org/dataset/819dce10-ac8a-4960-8756-856a9f72d820/resource/7f738eb4-6f77-4b5c-905a-ed6d45cc5515/download/coredata_activepkomissions.csv') %>% 
  select(Mission_Acronym, Country = ACLED_Country) %>% 
  mutate(Country = case_when(Country == 'Central African Republic' ~
                               'Central\nAfrican\nRepublic',
                             Country == 'Democratic Republic of Congo' ~
                               'Democratic\nRepublic\nof the Congo',
                             TRUE ~ Country)) -> pko_data

We’re looking to see how dangerous peacekeeping missions are for peacekeepers, so we want to only look at fatalities that are the result of deliberate acts. The data contain 6 different types of incident, so let’s check them out:

table(pko_fatalities$Type_Of_Incident)

## 
##         Accident          Illness    Malicious Act   Self‐Inflicted To Be Determined 
##             2712             2582             2096              268              244 
##          Unknown 
##               50

Malicious acts are the third most common type of incident, so it’s important for us to subset the data to ensure we’re only counting the types of attacks we’re interested in. Since we’re looking at fatalities in the big 5 missions, we also need to subset the data to just these missions. We’re going to use the summarize() function in conjunction with group_by() to calculate several summary statistics for each mission. We’ll also use the time_length() and interval() functions from the lubridate package, so load that as well.

library(lubridate)

## list of PKOs to include
pkos <- c('MINUSMA', 'UNAMID', 'MINUSCA', 'MONUSCO', 'UNMISS')

## aggregate mission level data
pko_fatalities %>% 
  filter(Type_Of_Incident == 'Malicious Act',
         Mission_Acronym %in% pkos) %>% 
  group_by(Mission_Acronym) %>% 
  summarize(casualties = n(),
            casualties_mil = sum(Casualty_Personnel_Type == 'Military'),
            casualties_pol = sum(Casualty_Personnel_Type == 'Police'),
            casualties_obs = sum(Casualty_Personnel_Type == 'Military Observer'),
            casualties_civ = sum(Casualty_Personnel_Type == 'International Civilian'),
            casualties_oth = sum(Casualty_Personnel_Type == 'Other'),
            casualties_loc = sum(Casualty_Personnel_Type == 'Local'),
            duration = time_length(interval(min(Incident_Date),
                                            max(Incident_Date)),
                                   unit = 'year')) %>% 
  mutate(MINUSMA = case_when(Mission_Acronym == 'MINUSMA' ~ 'MINUSMA',
                             TRUE                         ~ '')) %>% 
  left_join(pko_data, by = 'Mission_Acronym') %>% 
  mutate(Country = factor(Country,
                          levels = Country[order(casualties,
                                                 decreasing = T)])) -> data_agg

- casualties = n() counts the total number of fatalities in each mission because each row is one fatality
- casualties_mil = sum(Casualty_Personnel_Type == 'Military') counts how many of those casualties were UN troops
- the other casualties_... lines do the same for different categories of UN personnel
- the code to the right of duration calculates how long each mission has lasted by:
  - finding the first and last date of a fatality in each mission
  - creating an interval object from those dates
  - calculating the length of that period in years
- the mutate() call after summarize() creates a variable marking whether or not an observation belongs to MINUSMA
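The duration calculation is easier to follow in isolation. This sketch runs the same three steps on two made-up dates (they are placeholders, not values from the fatalities data).

```r
library(lubridate)

# Placeholder dates standing in for the first and last fatality in a mission
first_fatality <- dmy('01/07/2013')
last_fatality  <- dmy('01/07/2020')

# An interval object spanning the two dates
span <- interval(first_fatality, last_fatality)

# Length of that interval in years
duration <- time_length(span, unit = 'year')
duration
```

Because the span is an interval rather than a raw difference in days, lubridate can account for years of different lengths when converting to years.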

Finally, we merge on the country information contained in pko_data and convert Country to a factor with levels that are decreasing in fatalities. This last step is necessary to have a nice ordered plot.
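The factor trick is worth seeing on its own, since it’s what controls the order of the bars. A minimal base R sketch with made-up countries and counts:

```r
# Toy aggregate data; the names and counts are invented for illustration
agg <- data.frame(
  Country    = c('A', 'B', 'C'),
  casualties = c(30, 50, 10),
  stringsAsFactors = FALSE
)

# Order the factor levels from most to fewest casualties
agg$Country <- factor(agg$Country,
                      levels = agg$Country[order(agg$casualties,
                                                 decreasing = TRUE)])

# ggplot2 draws discrete axes in level order, so B would come first
levels(agg$Country)
```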

Plot it

With that taken care of, we can create the plot using ggplot. I’m using the label argument to place mission acronyms inside the bars with geom_text(), and a second call to geom_text() with the casualties variable to place fatality numbers above the bars. The nudge_y argument in each call to geom_text() ensures that they’re vertically spaced out, making them readable instead of overlapping.

ggplot(data_agg, aes(x = Country, y = casualties, label = Mission_Acronym)) +
  geom_bar(stat = 'identity', fill = '#5b92e5') +
  geom_text(color = 'white', nudge_y = -10) +
  geom_text(aes(x = Country, y = casualties, label = casualties),
            data = data_agg, inherit.aes = F,
            nudge_y = 10) +
  labs(x = '', y = 'UN Fatalities',
       title = 'UN fatalities in big 5 peacekeeping operations') +
  theme_bw()

Plot it (again)

We can also create some other plots to visualize how dangerous each mission is to peacekeeping personnel. While total fatalities are an important piece of information, the rate of fatalities can tell us more about the intensity of the danger in a given conflict.

data_agg %>% 
  ggplot(aes(x = duration, y = casualties, label = MINUSMA)) +
  geom_point(size = 2.5, color = '#5b92e5') +
  geom_text(nudge_x = 1) +
  expand_limits(x = 0, y = 0) +
  labs(x = 'Mission duration (years)', y = 'Fatalities (total)',
       title = 'UN fatalities in big 5 peacekeeping operations') +
  theme_bw()

We can see from this plot that not only does MINUSMA have the most peacekeeper fatalities out of any mission, it reached that point in a comparatively short amount of time. To really drive this point home, we can draw on the fantastic gganimate package. We’re going to animate cumulative fatality totals over time, so we need a yearly version of our mission-level data frame from above. The code below is pretty similar except we’re grouping by both Mission_Acronym and a variable called Year that we’re generating with the year() function in lubridate (it extracts the year from a Date object).

pko_fatalities %>% 
  filter(Type_Of_Incident == 'Malicious Act',
         Mission_Acronym %in% pkos) %>% 
  group_by(Mission_Acronym, Year = year(Incident_Date)) %>% 
  summarize(casualties = n(),
            casualties_mil = sum(Casualty_Personnel_Type == 'Military'),
            casualties_pol = sum(Casualty_Personnel_Type == 'Police'),
            casualties_obs = sum(Casualty_Personnel_Type == 'Military Observer'),
            casualties_civ = sum(Casualty_Personnel_Type == 'International Civilian'),
            casualties_oth = sum(Casualty_Personnel_Type == 'Other'),
            casualties_loc = sum(Casualty_Personnel_Type == 'Local')) %>% 
  mutate(MINUSMA = case_when(Mission_Acronym == 'MINUSMA' ~ 'MINUSMA',
                             TRUE                         ~ ''),
         Mission_Year = Year - min(Year) + 1) %>% 
  left_join(pko_data, by = 'Mission_Acronym') %>% 
  mutate(Country = factor(Country, levels = levels(data_agg$Country))) -> data_yr

Once we’ve done that, we need to make a couple tweaks to our data to ensure that our plot animates correctly. I use the new across() function (which is likely going to eventually replace mutate_at, mutate_if, and similar functions) to select all columns that start with “casualties”. Then, I supply the cumsum() function to the .fns argument, and use the .names argument to append “_cml” to the end of each resulting variable’s name. This argument uses glue syntax, which allows you to embed R code in strings by enclosing it in curly braces. The complete() function uses the full_seq() function to fill in any missing years in each mission, i.e., a year in the middle of a mission without any fatalities due to malicious acts. Finally, the fill() function fills in any rows we just added that are missing fatality data due to an absence of fatalities that year.
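To make those steps concrete, here is a sketch on a tiny made-up data frame with gaps in the mission years. The missions, years, and counts are invented, and I group explicitly by Mission_Acronym so that the cumulative sums, the completed years, and the fill all happen within each mission.

```r
library(dplyr)
library(tidyr)

# Toy yearly data: mission A skips year 3, mission B skips year 2
toy <- tibble(
  Mission_Acronym = c('A', 'A', 'A', 'B', 'B'),
  Mission_Year    = c(1, 2, 4, 1, 3),
  casualties      = c(2, 1, 3, 5, 1)
)

toy_cml <- toy %>%
  group_by(Mission_Acronym) %>%
  arrange(Mission_Year, .by_group = TRUE) %>%
  # running total for every column starting with 'casualties'
  mutate(across(starts_with('casualties'), cumsum, .names = '{.col}_cml')) %>%
  # add rows for any mission years missing from the sequence
  complete(Mission_Year = full_seq(Mission_Year, 1)) %>%
  # carry the last running total forward into the new rows
  fill(casualties_cml, .direction = 'down') %>%
  ungroup() %>%
  arrange(Mission_Acronym, Mission_Year)

toy_cml
```

The rows added by complete() have NA in the yearly casualties column, but the cumulative column carries the previous total forward, which is what the animation needs.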

Now we’re ready to animate our plot! We construct the ggplot object like before, but this time we add the transition_manual() function to the end of the plot specification. This function tells gganimate what the ‘steps’ in our animation are. Since we’ve got individual years, we’re using the manual version of transition_ instead of the many fancier versions included in the package.

If you check out the documentation for transition_manual(), you’ll notice that there are a handful of special label variables you can use when constructing your plot. These will update as the plot cycles through its frames, allowing you to convey information about the flow of time. I’ve used the current_frame variable, again with glue syntax, to make the title of the plot display the current mission year as the frames advance.

library(gganimate)

data_yr %>% 
  arrange(Mission_Year) %>% 
  mutate(across(starts_with('casualties'), .fns = cumsum, .names = '{.col}_cml')) %>%
  complete(Mission_Year = full_seq(Mission_Year, 1)) %>%
  fill(Year:casualties_loc_cml, .direction = 'down') %>%
  filter(Mission_Year <= 6) %>% # youngest mission is UNMISS
  ggplot(aes(x = Country, y = casualties_cml, label = casualties_cml)) +
  geom_bar(stat = 'identity', fill = '#5b92e5') +
  geom_text(nudge_y = 10) +
  labs(x = '', y = 'UN Fatalities',
       title = 'UN fatalities in big 5 peacekeeping operations: mission year {current_frame}') +
  theme_bw() +
  transition_manual(Mission_Year)

While the scatter plot above illustrates that UN personnel working for MINUSMA have suffered the most violence in the shortest time out of any big 5 mission, this animation makes it abundantly clear, especially since MONUSCO and UNMISS both experience years without a single UN fatality from a deliberate attack. Visualizations like these are a great way to showcase your work, especially if you’re dealing with dynamic data. While you still can’t easily include them in a journal article, they’re fantastic tools for conference presentations or
