Sunday, April 08, 2012

In Defense of R

This post has been planned for several weeks now. After I saw this presentation I was in the mood to finish it. Although I question the choice of John Cook given the other speakers at the conference (to my knowledge, he's not involved with the core R development), I'm not responding to anything he said, as everything here was outlined weeks ago. I had never heard of John Cook before this presentation. His description of R as "a strange, deeply flawed language" does capture the views of the critics.

Some of the criticisms thrown at R include: the syntax is inconsistent; the language is poorly designed - by statisticians, heaven forbid, rather than computer scientists; too much stuff was added to the language through time rather than being part of an initial grand design; and it is slow, slow, slow.

The last criticism is correct. Loops are slow relative to Fortran or C++, for sure, but that reflects the interactive nature of the language. The same is true for Python, Ruby, and a whole lot of other languages. A JIT compiler is on the way, I've read. There already is a JIT compiler, but I'm talking about one that gives 10x or 20x or better speedups for some tasks relative to the current version of R. Since I don't know a lot about it I'll let you use Google to find information if you're interested.
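
As a rough illustration of where the slowness lives, here is a minimal sketch comparing an explicit loop with the vectorized equivalent. The function names (`loop_ss`, `vec_ss`) are made up for this example. `compiler::cmpfun` is the byte-code compiler that already ships with R; how much it helps depends on the task and the R version, so no timings are claimed here.

```r
# Sum of squares computed two ways; both give the same answer.
x <- rnorm(1e5)

# Explicit loop: every iteration goes through the interpreter.
loop_ss <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value^2
  }
  total
}

# Vectorized: the looping happens inside R's compiled internals.
vec_ss <- function(v) sum(v^2)

# Byte-compile the loop version with the compiler package that ships with R.
loop_ss_compiled <- compiler::cmpfun(loop_ss)

all.equal(loop_ss(x), vec_ss(x))
all.equal(loop_ss_compiled(x), vec_ss(x))
```

Wrapping each call in `system.time()` will show the difference on your own machine.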

The other criticisms I've listed are not accurate descriptions of what the critics dislike. What they're really saying is "R is flawed because I tried to do X and the behavior was not what I expected" and "R is flawed because it does a lot of things I don't need and don't understand."

Not all criticisms can be put into one of those two categories. R's approach to OOP (S3 or S4) leaves something to be desired. The language offers no support for tail calls of the kind you get in a language like Clojure or Scala. Those are perfectly reasonable statements of areas in which R is lacking.
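
To make the tail call point concrete, here is a small sketch with made-up function names (`sum_to`, `sum_to_loop`). The recursive version is written in tail-call form, but R still adds a stack frame for every call, so it fails for large inputs. The second version is the manual conversion to a while loop.

```r
# Recursive version: written as a tail call, but R does not eliminate it,
# so a large enough n will exhaust the call stack.
sum_to <- function(n, acc = 0) {
  if (n == 0) acc else sum_to(n - 1, acc + n)
}

# Hand-converted version: same logic as a while loop, constant stack depth.
sum_to_loop <- function(n) {
  acc <- 0
  while (n > 0) {
    acc <- acc + n
    n <- n - 1
  }
  acc
}

sum_to(100)        # small inputs are fine
sum_to_loop(1e6)   # large inputs need the loop form
```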

The source of most criticisms of the design of R is that the language does so much. Nothing else that I've tried comes close. There are a ton of packages available. You can see the R source code. You can get detailed information on the internals of R. The language supports functional, object oriented, and FORTRAN 66 programming styles. It's easy to interface with C++ and Fortran code when you need speed. The metaprogramming capabilities are astonishing.
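
A small taste of the metaprogramming, using only base R functions (`quote`, `eval`, `substitute`, `deparse`). The helper `show_code` is a made-up name for illustration.

```r
# Capture an expression without evaluating it.
expr <- quote(a + b * 2)

# Evaluate it later, supplying the variables from a list.
eval(expr, list(a = 1, b = 3))

# Code is data: an expression can be inspected and modified.
expr[[1]]                   # the function being called: `+`
expr[[1]] <- as.name("-")   # rewrite a + b * 2 into a - b * 2
eval(expr, list(a = 1, b = 3))

# substitute() lets a function see the code of its argument.
show_code <- function(x) deparse(substitute(x))
show_code(sqrt(y) + 1)
```

Note that `y` never has to exist for the last line to work, because the argument is never evaluated.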

And not to be lost is that this is all functionality of a statistical programming language. Statistical analysis is not something bolted on top of a language designed by computer scientists, or web developers, or someone who decided to improve the state of imperative programming. The core of the language is functions for analyzing data. You can load a dataset, do some manipulations, and estimate a probit model without loading any external packages. There are an enormous number of tradeoffs involved with writing a language that does everything I've described here (and I've only scratched the surface). I rarely see the critics acknowledging that - yet it is precisely the many things that R does that make it so much better than the alternatives.
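
Here is a sketch of that workflow using nothing beyond a default R session. The dataset (`mtcars`) and the choice of variables are just for illustration; `d` and `wt_tonnes` are names made up for this example.

```r
# mtcars ships with R (the datasets package is attached by default).
# wt is recorded in units of 1000 lbs; convert it to metric tonnes.
d <- transform(mtcars, wt_tonnes = wt * 0.4536)

# Probit model: probability of a manual transmission (am = 1) given weight.
fit <- glm(am ~ wt_tonnes, family = binomial(link = "probit"), data = d)
summary(fit)
```

No `library()` call appears anywhere: `transform`, `glm`, and `binomial` are all available at startup.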

Lists are a good example of something that seems "weird" to the C or Matlab programmer. They are very important in R. Someone who has programmed in Lisp already knows all about R lists, but most of the critics do not have that background, so they think lists are a flaw rather than a feature. Making matters worse is that the term "list" is used in different ways, so even if they've seen a list, it probably wasn't important to the language and wasn't used the same way.

In R you can do:

x <- list(x=rnorm(100),z="Blackbeard the Pirate")
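
A short sketch of what you can then do with that list, using only base R accessors:

```r
x <- list(x = rnorm(100), z = "Blackbeard the Pirate")

length(x)    # two elements, of different types and lengths
names(x)     # the element names, "x" and "z"
mean(x$x)    # the numeric vector, pulled out by name
x[["z"]]     # the string, pulled out by name
str(x)       # compact summary of the whole structure
```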

Matlab programmers think R is an inferior program because they're used to jamming everything into a matrix or array form. They see someone using a list and can't understand why you'd use such a complicated, verbose solution. The Matlab solution is easier. You create the vector x and then write a comment so that you can keep track in your head that x corresponds to Blackbeard the Pirate. It's much less typing.

The Matlab programmer then sees data frame objects everywhere when using R. In Matlab, you store data in a matrix, so to get your data out you take a slice from a matrix. In R, a data frame is a list with elements that have equal length. Not understanding what R lists are, and not willing to spend time learning what a data frame is, the Matlab programmer gives up and says "R has a lot of useful functionality, but it's strange." And they're right. What they don't realize is that "strange" is a reflection of their limited knowledge, not a reflection of the poor design of the R language. It's not so easy to keep track of the index of the column holding each variable when you've got 600 variables.
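
A minimal sketch of the difference, with a made-up three-row dataset (the column names and values are hypothetical):

```r
df <- data.frame(income = c(52, 48, 61), age = c(34, 29, 45))

is.list(df)          # TRUE: a data frame is a list of columns
sapply(df, length)   # every column has the same length

# Refer to variables by name instead of remembering column numbers.
df$income
df[["age"]]

# The matrix form is still there when you want it.
m <- as.matrix(df)
m[, 2]               # the Matlab-style way: you have to know age is column 2
```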

That's a simplistic example, but one I've seen many times. Different == wrong. Going a little deeper, a lot of newbies coming from other languages don't realize that vectors hold more than just numbers and characters. A vector can hold NA values for missing data. A vector can have one element (a scalar). It can have zero elements. You can pass NULL objects when building a vector, and they are simply dropped. For instance, you can do this

c(1,NULL,2)

and you'll get back a vector with two elements. Or you can do this

c(1,NA,2)

and you'll get back a vector with three elements. I realize that this behavior confuses newbies, but given just how helpful it is when you need it (which is all the time for me), I view it as a feature of the language, not a flaw in the language's design.
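
The following sketch checks those claims and shows why NA is useful in practice. Everything here is base R; `y` is just an illustrative name.

```r
length(c(1, NULL, 2))   # 2: NULL contributes nothing
length(c(1, NA, 2))     # 3: NA holds a place for the missing value

y <- c(1, NA, 2)
is.na(y)                # FALSE TRUE FALSE
mean(y)                 # NA: missingness propagates by default
mean(y, na.rm = TRUE)   # 1.5: drop missing values explicitly

length(numeric(0))      # 0: an empty vector is still a vector
length(5)               # 1: a "scalar" is a vector of length one
```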

A third example is

cbind(1:6,1:3)

The output is

      [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    1
[5,]    5    2
[6,]    6    3

Sometimes recycling of elements gives a warning, sometimes it doesn't. If you try

cbind(1:6,1:4)

you get a warning message:

Warning message:
In cbind(1:6, 1:4) :
  number of rows of result is not a multiple of vector length (arg 2)

Recycling means R won't identify some of your bugs. When you need recycling, though, it is very nice to have.
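
A few everyday cases where recycling does the work for you, followed by a mismatch that triggers the warning. This is a sketch using base R only; `result` is an illustrative name.

```r
x <- 1:6

x * 2                # the length-one vector 2 is recycled six times
x + c(10, 20)        # 10 and 20 alternate; 6 is a multiple of 2, so no warning
paste0("var", 1:3)   # "var" is recycled to give "var1" "var2" "var3"

# A length mismatch that is not an exact multiple does produce a warning.
# Here the warning is caught and its message returned instead.
result <- tryCatch(x + 1:4, warning = function(w) conditionMessage(w))
result
```

The second line is the kind of case where a bug can hide: if you meant to add a vector of length six and passed one of length two, R says nothing.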



These are three examples of language design decisions, not language design flaws. Someone coming from another language may not anticipate that R works the way it does. That argument doesn't mean much to me. Everyone would laugh at me if I said the design of Java is deeply flawed because I've been using R for years and the OOP approach in R doesn't work in Java. I think it's no less silly to criticize the design of R because it doesn't do things the way you expect. Maybe there needs to be a better way to communicate to newbies what the language is doing. Maybe R should have a newbie mode with lots of warnings. Maybe it should come with better documentation. But to criticize the language design on that basis is absurd.

The reason R is heavily used is not just because of the existing packages. It's also not because statisticians don't know anything about software development. It's because R is a programming language for statistical analysis. You don't get that with SAS, SPSS, or Stata. What you get is a way to call prewritten functions and something that sort of resembles a programming language. You feel shortchanged when you use them. I've put much effort into fitting my problem into the limitations of those programs. In a few cases I even had to change what I was doing due to their limitations. Python, Java, C++, Ruby, or whatever happens to be the latest fashion are general programming languages. You can write yourself a library if you want, but that's far from what R offers.

Julia is the current flavor of the month. I looked at it, and when I can actually do something useful with it, maybe I'll give it a try. It's easy for a language to appear clean and elegant if it doesn't do anything useful. Use it for a year as a replacement for everything you do in R, then get back to me about how clean and elegant it is. How does it handle missing data? How does package management work? How's the documentation system? Clojure is my favorite language, but I tried to use it as a replacement for R, and I didn't get anywhere. And not just because of the existing collection of R packages. I was doing so much with Rserve that I finally realized I was better off going back to R.

R should get some credit for being the language that offers the right set of tradeoffs for many of us. The Bjarne Stroustrup quote, “There are only two kinds of languages: the ones people complain about and the ones nobody uses” comes to mind when reading criticisms of R.

4 comments:

  1. """ leaves something to be desired. The language offers no support for tail calls that you get in a language like Clojure or Scala"""

    Tail calls or tail call optimization? Almost any language offers tail calls, and actually Clojure and Scala do not offer tail call optimization, because the JVM doesn't support it... so I can't understand this argument. By the way, I don't use R but I use Clojure, and this post appeared on Planet Clojure (I don't know why).

  2. I'd hope that Julia takes off and lives up to its promise of becoming an efficient computational language. There are already a few players in the field (Octave, R, Numpy). It is going to be hard to displace MATLAB though.

  3. @anonymous

    I don't visit Planet Clojure. You'd have to ask that site why they linked to my post. I blog about whatever is on my mind. Sometimes Clojure is on my mind, but most of the time it's something else.

    "Support for tail calls" refers to loop/recur and trampoline in Clojure. Scala offers something similar though I'm not that familiar with the specifics. With R you're on your own. You have to convert recursive functions to while loops manually.

  4. @Nick

    In my opinion, Julia has at least two big advantages over Matlab. The first is speed. The second is support for clusters. The company behind Matlab is in business to make money and it shows.
