Counting Words in a Snap

January 05, 2020

14 pt periods. 1.05” margins. 2.1 spaced lines. Times Newer Roman. I’ve seen them all, and I’m tired of trying to catch them. So I’ve stopped assigning papers in terms of page length and switched to word counts. Unfortunately, counting words is more time-intensive than counting pages. To save myself a little time and a lot of boredom, I wrote a short shell script that gives me the word count for every PDF file in a directory. My method is easiest to implement for PDFs, so I require students to submit their assignments as PDFs. As an added bonus, our learning management system lets me set assignments to accept only files with a .pdf extension, so I don’t have to deal with students submitting Word docs instead.

Setup

While the ps2ascii utility that we’ll use in a minute comes standard with Mac OS and most Linux distributions, we want the most recent version to make sure our script can handle any special characters it encounters. It comes packaged with Ghostscript, so install the latest version with Homebrew:

brew install ghostscript

Once you’ve done that, you’ve got everything you need.
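Before moving on, it can be worth confirming that the shell can actually find ps2ascii. The following is a minimal sketch; the helper name have_tool is made up for illustration and is not part of Ghostscript or Homebrew.

```bash
#!/bin/bash
# have_tool is a hypothetical helper: it succeeds if the named command is on the PATH.
have_tool() {
    command -v "$1" > /dev/null 2>&1
}

if have_tool ps2ascii; then
    echo "ps2ascii found at: $(command -v ps2ascii)"
else
    echo "ps2ascii not found; check the Ghostscript install"
fi
```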

The code

Next, we need to write a bit of code. Create a text file and put the following in it:

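#!/bin/bash
# Find every PDF at or below the current directory; -print0 ends each name with a null byte.
find . -type f -iname "*.pdf" -print0 | while IFS= read -r -d $'\0' LINE
do
    # Convert the PDF to text, count its words, and save the count next to the original.
    ps2ascii "$LINE" | wc -w > "${LINE/%.pdf/-wc.txt}"
done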

Save it with whatever name you want (I used PDFcount), and give it either a .sh file extension or no extension.

The script first finds all .pdf files in the current directory and any subdirectories (-iname makes the match case-insensitive) and then prints them out with the -print0 flag to separate file names with null terminators instead of newlines.¹ We then pipe this list of file names to read, which splits its input at each null terminator because of -d $'\0' and stores one file name at a time in LINE. Setting IFS, the internal field separator, to an empty string keeps read from trimming leading or trailing whitespace from a name, and -r tells read to treat backslashes literally instead of as escape characters.
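To see the find/read pattern in isolation, here is a small sketch that runs the same loop over a throwaway directory of awkwardly named files. The directory, the file names, and the found counter are all invented for the demonstration. It uses -d '' in place of -d $'\0', which behaves the same way in Bash, and it feeds the loop with process substitution instead of a pipe so that the counter is still set once the loop finishes.

```bash
#!/bin/bash
# Build a throwaway directory with awkward file names to exercise the loop.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.pdf" "$demo_dir/two words.pdf" "$demo_dir/SHOUTING.PDF"
touch "$demo_dir/  leading spaces.pdf" "$demo_dir/"$'line\nbreak.pdf'
touch "$demo_dir/notes.txt"            # not a PDF, so find should skip it
mkdir "$demo_dir/sub"
touch "$demo_dir/sub/nested.pdf"       # find descends into subdirectories

# Same find/read pattern as the script, but it only prints and counts names.
found=0
while IFS= read -r -d '' LINE
do
    printf 'found: [%s]\n' "$LINE"
    found=$((found + 1))
done < <(find "$demo_dir" -type f -iname "*.pdf" -print0)

echo "PDFs found: $found"
rm -r "$demo_dir"
```

It should report six PDFs, including the one in the subdirectory, the one with the uppercase extension, and the one with a newline in its name.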

The loop then executes once for each value of LINE (each file name). It first converts the .pdf file to plain text using ps2ascii and then pipes the resulting text to wc with the -w flag to count the total number of words. Finally, it writes the output of wc (the number of words in the document) to a file with >. It uses Bash parameter substitution to create a file with the same name as the input file in $LINE but with -wc.txt instead of .pdf on the end of it.
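A quick way to see what ends up in each output file is to try wc -w on some sample text. This sketch uses a temporary file with made-up contents in place of real ps2ascii output. Some versions of wc pad the number with leading spaces, so the last step strips whitespace in case a bare number is needed.

```bash
#!/bin/bash
# Stand-in for the text that ps2ascii would produce from a PDF.
sample=$(mktemp)
printf 'The quick brown fox\njumps over the lazy dog\n' > "$sample"

# Given a file name, wc prints the count followed by the name.
wc -w "$sample"

# Reading from a pipe, as in the script, wc prints only the count.
cat "$sample" | wc -w

# Strip any padding if a bare number is needed.
piped_count=$(cat "$sample" | wc -w | tr -d '[:space:]')
echo "bare count: $piped_count"
rm "$sample"
```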

Note the % in the parameter substitution. The syntax for parameter substitution is ${variable/pattern/replacement}. By default, Bash substitutes the first occurrence of the pattern and then stops. If one of my students had submitted a file called FinalPaper.pdf.pdf, then the script would output a file called FinalPaper-wc.txt.pdf, which wouldn’t have a plain text extension. The % at the start of the pattern means that it will only match at the end of the variable’s value, meaning the actual file extension.
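The difference is easy to check in a terminal. The sketch below uses made-up file names and prints the result of each form. It also shows a caveat: the pattern is case-sensitive by default, so a name ending in .PDF (which find's -iname does match) comes through unchanged. As far as I can tell, that means the script as written would redirect its output onto that PDF itself. The last line shows one possible alternative, not the method used in this post, that strips whatever the final extension is before appending -wc.txt.

```bash
#!/bin/bash
LINE="./FinalPaper.pdf.pdf"

first_match="${LINE/.pdf/-wc.txt}"    # replaces the first .pdf it finds
suffix_match="${LINE/%.pdf/-wc.txt}"  # replaces .pdf only at the very end

echo "$first_match"    # ./FinalPaper-wc.txt.pdf
echo "$suffix_match"   # ./FinalPaper.pdf-wc.txt

# Caveat: an uppercase extension does not match the lowercase pattern.
UPPER="./Essay.PDF"
unchanged="${UPPER/%.pdf/-wc.txt}"
echo "$unchanged"      # ./Essay.PDF

# Possible alternative: drop the last extension, whatever its case, then append.
safer="${UPPER%.*}-wc.txt"
echo "$safer"          # ./Essay-wc.txt
```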

Running it

To run this script, all you need to do is make it executable and add it to your PATH, which I walk through in detail in a previous post. Then, just open a terminal window, navigate to the directory with the files you want word counts for, and type:

PDFcount

The script will create one text file for each .pdf file it finds in the directory while leaving the originals untouched.
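Opening dozens of one-line text files gets old quickly, so one possible follow-up is to gather the counts into a single listing. This is only a sketch: the demo directory and the two count files are stand-ins invented here so the example runs on its own. In real use the find command would point at the directory of submissions instead.

```bash
#!/bin/bash
# Stand-in data: in real use the -wc.txt files would come from PDFcount.
demo_dir=$(mktemp -d)
echo "1204" > "$demo_dir/alice-wc.txt"
echo " 987" > "$demo_dir/bob smith-wc.txt"

# Print "count<TAB>file name" for every count file, smallest count first.
summary=$(
    find "$demo_dir" -type f -name "*-wc.txt" -print0 |
    while IFS= read -r -d '' COUNT_FILE
    do
        # Strip padding from the count and the directory from the file name.
        printf '%s\t%s\n' "$(tr -d '[:space:]' < "$COUNT_FILE")" "${COUNT_FILE##*/}"
    done | sort -n
)
echo "$summary"
rm -r "$demo_dir"
```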

¹ This lets our script handle file names with spaces, or really any weird characters, since the null byte is the only character that can never appear in a file path on Unix-likes.

Rob Williams
