1 Introduction

As a beginning graduate student in the social sciences, what sort of software should you use to do your work?¹ More importantly, what principles should guide your choices? I offer some general considerations and specific answers. The short version is: you should use tools that give you more control over the process of data analysis and writing. I recommend you write prose and code using a good text editor; analyze quantitative data with R and RStudio, or use Stata; minimize error by storing your work in a simple format (plain text is best), and make a habit of documenting what you’ve done. For data analysis, consider using a format like RMarkdown and tools like Knitr to make your work more easily reproducible for your future self. Use Pandoc to turn your plain-text documents into PDF, HTML, or Word files to share with others. Keep your projects in a version control system. Back everything up regularly. Make your computer work for you by automating as many of these steps as you can.

To help you get started, I provide a drop-in set of useful defaults to get started with Emacs (a powerful, free text-editor). I share some templates and style files that can get you quickly from plain text to various output formats. But I emphasize that this is one viable choice amongst many. I discuss several alternatives because no humane person should recommend Emacs without presenting some other options as well.

To begin, I discuss why you should care about having better control over your work materials. Rather than dive straight in to a list of tools or a recapitulation of their manuals, I want to encourage you to begin thinking about this issue in a way that will lead you to some solution that works well for you. Perhaps that will mean using the tools described here, but perhaps not.

1.1 Motivation

You can do productive, maintainable and reproducible work with all kinds of different software set-ups. This is the main reason I don’t go around encouraging everyone to convert to the applications I use. (My rule is that I don’t try to persuade anyone to switch if I can’t commit to offering them technical support during and after their move.) So this discussion is not geared toward convincing you there is One True Way to organize things. I do think, however, that if you’re in the early phase of your career as a graduate student in, say, Sociology, or Economics, or Political Science, you should give some thought to how you’re going to organize and manage your work.² This is so for two reasons. First, the transition to graduate school is a good time to make changes. Early on, there’s less inertia and cost associated with switching things around than there will be later. Second, in the social sciences, text and data management skills are usually not taught to students explicitly. This means that you may end up adopting the practices of your advisor or mentor, continue to use what you are taught in your methods classes, or just copy whatever your peers are doing. Following these paths may lead you to an arrangement that you will be happy with. But maybe not. It’s worth looking at the options.

Two remarks at the outset. First, because this discussion is aimed at beginning students, some readers may find much with which they are already familiar. Second, although in what follows I advocate you take a look at several applications in particular, it’s not really about the gadgets or utilities. The Zen of Organization is not to be found in Fancy Software. Nor shall the true path of Getting Things Done be revealed to you through the purchase of a nice Moleskine Notebook. Instead, it lies within—unfortunately.

1.2 Two Revolutions in Computing

When talking to undergraduates or graduate students on this topic, and when teaching classes that use these tools, I increasingly run into the problem that it’s hard to get started without backing up a bit first in order to talk about how the computer they are using works. I think the reason for this is the rise of the flat-screen, touch-based model of computing, most obviously on phones and then very secondarily on things like Apple’s iPad or Microsoft’s Surface tablet. Now, most people who need to write long documents (like papers or dissertations) or work in an involved way with data do not use a tablet as their primary device. But it does seem clear that some kind of touch-screen interaction is the future of computing for most people. Indeed, once you consider phones properly you realize it’s the present of computing for most people.

While it is not strictly impossible, it remains very difficult to do your academic, social-science work on a device of this sort. This is likely to be the case for some time. The tools we have are not designed for them. I think there is an underappreciated tension here. Two ongoing computing revolutions are tending to pull in opposite directions. On one side, the mobile, cloud-centered, touch-screen, phone-or-tablet model has brought powerful computing to more people than ever before. This revolution is the one everyone is talking about, because it is happening on a huge scale and is where all the money is. It puts single-purpose applications in the foreground. It hides the workings of the operating system from the user, and it goes out of its way to simplify or completely hide the structure of the file system where items are stored and moved around.

On the other side, open-source tools for plain-text coding, data analysis, and writing are also better and more accessible than they have ever been. This has happened on a smaller scale than the first revolution, of course. But still, these tools really have revolutionized the availability and practice of data analysis and scientific computing generally. They continue to do so, too, as people work to make them better at everything from slurping up data on the web to presenting results there. These tools mostly work by joining separate, specialized widgets into a reproducible workflow. They are “bitty” or granular because the process of data analysis is that way, too. These tools do much less to hide the operating system layer—instead they often directly mesh with it—and they often presuppose a working knowledge of the file system underpinning the organization of the things the researcher is using, from data files to code to figures and final papers.

The tension is that, increasingly, people who enter the world of social science excited to work with data will also tend to have little or no prior experience with text-based, command-line, file-system-dependent tools. In many cases, they will not have much experience making effective use of a multi-tasking windowing environment, either, in the sense of knowing how to make applications work together in the service of a single goal.³ To be clear, this is not something to blame users for. Neither is it some misguided nostalgia on my part for the command line. Rather, it is an aspect of how computer use is changing at a very large scale. The coding and data analysis tools we have are powerful and for the most part meant to allow research products to be opened up and inspected. But the way they work runs against the grain of everyday, end-use computing, which hides implementation details and focuses on single-purpose tasks. The net result for the social sciences in the short to medium term is that we will have a suite of powerful and very useful tools, developed in the open, supported by helpful communities, and mostly available for free. But it will get harder to teach people how to use them when they are just starting out, and perhaps even to convince people to try them in the first place.

1.3 What’s the Problem?

The problem is that doing scholarly work is intrinsically a mess. There’s the annoying business of getting ideas and writing them down, of course, but also everything before, during, and around it: data analysis and all that comes with it, and the tedious but unavoidable machinery of scholarly papers—especially citations and references. There is a lot of keep track of, a lot to get right, and a lot to draw together at the time of writing. Academic papers are by no means the only form of writing subject to constraints of this sort. Consider this sensible discussion by Dr. Drang, a (pseudonymous) consulting engineer:

[T]he type of writing I typically do … is loaded with facts. I am constantly referring to photographs, drawings, experimental test results, calculations, reports written by others, textbooks, journal articles, and so on. These are not distractions; they are essential to the writing process.

And it’s not just reference material. Quite often I need to make my own graphs and drawings to include in a report. Because the text and the graphics are all part of a coherent whole, I need to go back and forth between the two; the words inform the pictures and the pictures inform the words. This is not the Platonic ideal of a clean writing environment—a cup of coffee on an empty desk in a white room—that you see in videos for distraction-free editors.

Some of the popularity of these editors is part of the backlash against multitasking, but people are confusing themselves with their computers. When I’m writing a report, that is my single task, and I bring to bear whatever tools are necessary to complete it. That my computer is multitasking by running many programs simultaneously isn’t a source of confusion or distraction, it’s the natural and efficient way for me to get my one task done.

A lot of academic writing is just like this. It can be tricky to manage. It’s even worse when you have collaborators and other contributors. So, what to do?

1.4 The Office Model and the Engineering Model

Let me make a crude distinction. There are “Office Type” solutions to this problem, and there are “Engineering Type” solutions. Don’t get hung up on the distinction or the labels. Office solutions tend towards a cluster of tools where something like Microsoft Word is at the center of your work. A Word file or set of files is the most “real” thing in your project. Changes to your work are tracked inside that file or files. Citation and reference managers plug into those files. The outputs of data analyses—tables, figures—get cut and pasted in as well, or are kept alongside them. The master document may be passed around from person to person to be edited and updated.⁴ The final output is exported from it, perhaps to PDF or to HTML, but maybe most often the final output just is the .docx file, cleaned up and with the track changes feature turned off.

In the Engineering model, meanwhile, plain text files are at the center of your work. The most “real” thing in your project will either be those files or, more likely, the version control repository that stores the project. Changes are tracked outside of files, again using a version control system. Data analysis is managed in code that produces outputs in (ideally) a known and reproducible manner. Citation and reference management will likely also be done in plain text, as with a BibTeX .bib file. Final outputs are assembled from the plain text and turned to .tex, .html, or .pdf using some kind of typesetting or conversion tool. Very often, because of some unavoidable facts about the world, the final output of this kind of solution is also a .docx file.

This distinction is meant to capture a tendency in organization, not a rigid divide—still less a sort of personality. Obviously it is possible organize things on the Office model. (Indeed, it’s the dominant way.) Applications like Scrivener, meanwhile, combine elements of the two approaches. Scrivener embraces the “bittyness” of large writing projects in an effective way, and can spit out clean copy in a variety of formats. Scrivener is built for people writing lengthy fiction (or qualitative non-fiction) rather than anything with quantitative data analysis, so I have never used it extensively. Microsoft Word, meanwhile, still rules large swathes of the Humanities and the Social Sciences, and the production process of many publishers. So even if you prefer plain text for other reasons—especially in connection with project management and data analysis—the routine need or obligation to provide a Word document to someone is one of the main reasons to want to be able to easily convert things. HTML is a great lingua franca.

I mostly focus on the Engineering model. But many people use the Office model, and you may end up working with (or for) some of them. In those cases, it is generally easier for you to use their software than vice versa, if only because you are likely have a copy of Word on your computer already. In these circumstances you might also collaborate using Google Docs or some other service that allows for simultaneously editing the master copy of a document. This may not be ideal, but it is better than not collaborating. There is little to be gained from plain-text dogmatism in a .docx world.

☰ Menu

The Plain Person’s Guide to Plain Text Social Science

Kieran Healy

1 Introduction

1.1 Motivation

1.2 Two Revolutions in Computing

1.3 What’s the Problem?

1.4 The Office Model and the Engineering Model