Installation/setup/whatever is always harder and much more poorly documented than mere usage — Jenny’s Law
Running your data analysis in R, writing documents in Markdown or
RMarkdown, doing both via Emacs or some other text editor, processing
them with pandoc, tracking things with Git and using (behind the
scenes) various Unix tools and LaTeX … this all sounds rather
complicated. It has four main advantages. First, these formats, tools,
and applications are all free. You can try them out without much in
the way of monetary expense. (Your time may be a different matter, but
although you don’t believe me, you have more of that now than you will
later.) Second, they are all open-source projects and are all
available for Mac OS X, Linux and Windows. Portability is important,
as is the long-term viability of the platform you choose to work with.
If you change your computing system, your work can move with you
easily. Third, they allow you to do your work in a portable,
documented, and reproducible way. And fourth, the applications are
closely integrated. Everything (including version control) can work
inside Emacs. All of them can work directly with or take advantage of
the others.
None of these tools is perfect. They can do very useful and important things for you, but they are not magic. There are bad habits associated with working in plain text, just as there are bad habits associated with writing everything in Word. These tools are powerful, but they can be tedious to learn. However, you don’t have to start out using all of them at once, or learn everything about them right away. The only thing you really, really need to start doing immediately is keeping good backups. There are a number of ways to try these tools out in whole or in part. You could try writing something in Markdown first, using any text editor. You could begin using R with RStudio. Revision control is more beneficial when implemented at the beginning of projects, and best of all when committing changes to a project becomes a habit of work. But it can be added at any time.
You are not condemned to use these applications forever, either.
RMarkdown and (especially) Markdown documents can be converted into
many other formats. Your text files are editable in any other text
editor and on any other computer. Statistical code is by nature much
less portable, but the openness of R means that it is not likely to
become obsolete or inaccessible any time soon. In everyday use, you
may find that documents start life as plain-text, Markdown-formatted
notes jotted down on your phone or computer; then they become longer
.md files that grow references and figures and so on; and eventually
migrate to Word or Google Docs or something similar if you acquire a
collaborator or the time comes to submit a manuscript to a journal.
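As a rough sketch of the conversion step just described, the following Python snippet builds a pandoc command that turns a Markdown file into a Word document. The file name `notes.md` and the helper names `pandoc_command` and `convert` are hypothetical, chosen only for illustration, and the conversion itself runs only if pandoc is installed and on your path.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source, target_ext="docx"):
    """Build the argument list for converting `source` with pandoc.

    pandoc infers the output format from the output file's extension.
    """
    src = Path(source)
    out = src.with_suffix("." + target_ext)
    return ["pandoc", str(src), "-o", str(out)]


def convert(source, target_ext="docx"):
    """Run the conversion; return the output path, or None if pandoc is missing."""
    if shutil.which("pandoc") is None:
        return None
    cmd = pandoc_command(source, target_ext)
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])


if __name__ == "__main__":
    # "notes.md" is a placeholder file name, not a file from this guide.
    print(" ".join(pandoc_command("notes.md")))
```

Keeping the command construction separate from the call that runs it makes it easy to check what would be executed before anything touches your files.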
A disadvantage of these particular applications is that you might be in a minority with respect to other people in your field. This is less and less true in the case of R, and more recently with tools like Git as well. Writing papers in RMarkdown or Markdown is less common. Most people use Microsoft Word to write papers, and if you’re collaborating with people (people you can’t boss around, I mean) this can be an issue. It is usually easier to use applications like Word than to convert people to a plain-text workflow. If you do end up working in Word, at least try to implement some of the principles discussed here when it comes to tracking changes to documents and managing the code and output of your data analysis.
There are many other applications you might put at the center of your workflow, depending on need, personal preference, willingness to pay some money, or desire to work on a specific platform. For text editing, especially, there is a plethora of choices. On the Mac, quality editors include BBEdit (beloved of many web developers, but with relatively little support for R beyond syntax highlighting), and TextMate 2 (shown in Figure 7.2). On Linux, the standard alternative to Emacs is vi or Vim, but there are many others. For Windows there are TextPad, WinEdt, UltraEdit, and Notepad++ amongst many others. Most of these applications have strong support for LaTeX and some also have good support for statistics programming.
Sublime Text 3 is a cross-platform text editor under active development, and with an increasingly large user base. Sublime Text is fast, fluid, and contains a powerful plugin system based on the Python programming language. Uniquely amongst alternatives to Emacs and ESS, Sublime Text includes a well-developed REPL that allows R to be run easily inside the editor.9 Sublime Text costs $70.
For a different approach to working with R—and as repeatedly noted
throughout this guide—you should consider
RStudio. A screenshot is shown in
Figure 7.3. Although it appears quite late in this discussion, it
might well be your first choice. I use it when teaching. It is not a
text editor but rather an “IDE”, an integrated development
environment. Your code and figures, together with an R console,
documentation, and other output are all displayed in different panes
and tabs of RStudio’s application window. Data and script files are
managed via various windows and menus. RStudio is available for Mac OS
X, Windows, and Linux. It integrates nicely with R’s help files. It
understands knitr and Git. As discussed above, it has full support
for RMarkdown and generates HTML, PDF, and other formats for you very
easily. It is the easiest way by far to get into using R, and provides
a straightforward way to manage many of the tools already discussed
here.
For statistical analysis in the social sciences, the main alternative
to R is Stata. Stata is not free, but like R
it is versatile, powerful, extensible and available for all the main
computing platforms. It has a large body of user-contributed software.
In recent versions its graphics capabilities have improved a great
deal, as has its editor. ESS can highlight Stata .do files in the
same way as it can for R. Other editors can also be made to work
with Stata. More recently, Python has been coming
into frequent use in the social sciences. Python is a general-purpose
computing language that is relatively straightforward to learn. It is
often used for manipulating, cutting, and cleaning data prior to
analysis in applications like R or Stata. But it is also increasingly
a scientific computing platform in its own right.
SciPy is a useful place to begin learning
about Python’s capabilities on this front. Like R and RMarkdown, it
has good support for literate programming through tools like
IPython Notebooks. Naturally,
Emacs has good support for working with Python.
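To give a flavor of the kind of cleaning Python is often used for before data reaches R or Stata, here is a minimal sketch using only the standard library. The column names and the toy rows are invented for illustration, and the rules applied (trim whitespace, lowercase the headers, drop rows with a missing value) are just one plausible choice, not a recommendation for any particular dataset.

```python
import csv
import io

# Toy input, invented for illustration: messy headers, stray spaces,
# and one row with a missing income value.
RAW = """ID , Income ,Region
1, 52000 , north
2, , south
3, 61000,  east
"""


def clean_rows(text):
    """Return a list of dicts with tidy keys and values, skipping incomplete rows."""
    reader = csv.reader(io.StringIO(text))
    header = [name.strip().lower() for name in next(reader)]
    cleaned = []
    for row in reader:
        values = [value.strip() for value in row]
        if len(values) != len(header) or "" in values:
            continue  # drop rows with missing or extra fields
        cleaned.append(dict(zip(header, values)))
    return cleaned


def to_csv(rows):
    """Serialize cleaned rows back to CSV text that R or Stata can read."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Calling `to_csv(clean_rows(RAW))` yields a small, regular CSV string that could be written to disk and read into whichever statistics package you prefer.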
Amongst social scientists, revision control is perhaps the least widely used of the tools I have discussed. But I am convinced that it is the most important one over the long term. While tools like Git take a little getting used to both conceptually and in practice, the services they provide are extremely useful. It is already quite easy to use version control in conjunction with most of the text editors discussed above. There are also full-featured clients like Tower that allow you to administer Git without having to use the command line. Taking a longer view, version control is likely to become more widely available through intermediary services or even as part of the basic functionality of operating systems.
It would be nice if all you needed to do your work was a box of software tricks and shortcuts. But of course it’s a bit more complicated than that. In order to get to the point where you can write a paper, you need to be organized enough to have read the right literature, maybe collected some data, and most importantly asked an interesting question in the first place. No amount of software is going to solve those problems for you. Too much concern with the details of your setup hinders your work. Indeed, and I speak from experience here, this concern is itself a kind of self-imposed distraction that placates work-related anxiety in the short term while storing up more of it for later.10 On the hardware side, there’s the absurd productivity counterpart to the hedonic treadmill, where for some reason it’s hard to get through the to-do list even though the cafe you’re at contains more computing power than was available to the Pentagon in 1965. On the software side, the besetting vice of software is the tendency to waste a lot of your time installing, updating, and generally obsessing about it.11 Even more generally, efficient workflow habits are themselves just a means to the end of completing the projects you are really interested in, of making the things you want to make, of finding the answers to the questions that brought you to graduate school. The process of idea generation and project management can be run well, too, and perhaps even the business of choosing what the projects should be in the first place. But this is not the place, and I am not the person, to be giving advice about that.
All of which is just to reiterate two things. First, I am not advocating these tools on the grounds that they will make you more “productive”. Rather, they may help you stay in control of—and able to reproduce—your own prior work. That is an important difference. If you care about getting the right answer in your data analysis, or at least being able to repeatedly get the same probably wrong answer, then tools that enhance this sort of control should appeal to you. Second, even with that caveat it is still the principles of workflow management that are important. The software is just a means to an end. One of the smartest, most productive people I’ve ever known spent half of his career writing on a typewriter and the other half on an ancient IBM Displaywriter. His backup solution for having hopelessly outdated hardware was to keep a spare Displaywriter in a nearby closet, in case the first one broke. It never did.