What’s there to like about R?

Update 10/11/2011: There’s a good discussion on Reddit
Update 10/12/2011: Note manipulate package and highlight data.table package

The R statistical computing platform is a rising star that’s been gaining popularity and attention, but it gets no respect in the hood. It’s telling that a popular guide to R is called The R Inferno, and that advocacy pieces are titled “Why R Doesn’t Suck.” Even the creator of R had this to say about the language in a damning article suggesting starting over with R:

I [Ross Ihaka] have been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too.

So why do people still use R? Would we lose anything if we just migrated to (say) Python, which many consider to be a major contender/alternative to R?

In this post I’m going to highlight a few things that are nice about R—not just in the platform itself, but in the whole ecosystem. These are things that you won’t necessarily find in alternate universes like Python’s.

The Packages

The obligatory first item: packages are the dominant reason people stay anchored to R. There’s ggplot2, caret, data.table (one of R’s best-kept secrets), zoo, the mountains of modeling packages, and all the data munging/plumbing infrastructure. Packages have been written for all manner of problems, frequently for the R universe exclusively. The standard library is also solid.

R Studio

R Studio is just a whole bucket of awesome. IPython is OK for interactive exploration…as long as you’re just banging out new lines and editing/re-running those lines. As soon as you have to tweak a function somewhere or introduce a block of code, things get hairier. After editing your source file, you have a few unappealing options: re-run your whole script (first commenting/uncommenting various source lines); reload your module and all dependent imports, if you want to preserve your data; use %edit, which sometimes works but is limited and glitchy; or fudge around with joblib and the like. Because of Python’s modules and scoping, you can’t get away with something as simple as editing and re-evaluating some lines in Emacs.

But R Studio makes interactive coding easy-peasy. You simply tweak and (re-)run exactly the line(s) you want. This style of IDE is one of the things I missed the most from MATLAB. Even for tasks perfectly suitable for REPLs, it’s nice to be editing a concrete file, since this interactive style is how many of my scripts get fully written out.

It’s tricky to design an IDE for dynamic languages; they often end up pretty limited when it comes to things like completion suggestions. But I like the way R Studio handles dynamism: you get things like completions based on the current execution environment, and in general you get to inspect/play around with the environment, not unlike in a debugger. This is perfect for that interactive style of development.

Then there’s the fact that R Studio is available as a web app, which I love. Besides being able to resume my session from anywhere with a browser, I can work with graphics without hassle. I don’t want to be tunneling X over ssh or dumping PNGs or whathaveyou to see data visualizations. Add on top of all this the manipulate package that comes with R Studio, which gives you basic interactive plotting that builds on top of any plotting system including ggplot2. (I was excited to see that recent work on IPython has introduced a web interface too, but ggplot2 is light-years ahead of matplotlib.)

The Language

Wait! What’s this doing here? Isn’t the language R’s biggest liability?

Well, the core language actually turns out to be mostly pretty simple if you ignore the scoping magic. The syntax resembles JavaScript’s, but is simpler and more uniform.

There’s no special syntax for defining named functions. There’s no required return keyword. There’s no distinguishing among statements, blocks, or expressions—everything’s an expression. Operators, including = and [], are functions.
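A quick sketch of those points in base R (the names square, parity, and the rest are made up for illustration):

```r
# A "named function" is just an anonymous function assigned to a name.
# No return keyword is needed: the value of the last expression is the result.
square <- function(n) n * n

# if/else is an expression, so its value can be assigned directly.
parity <- if (square(3) %% 2 == 0) "even" else "odd"

# Operators are ordinary functions that can be called by name.
total <- `+`(2, 3)        # same as 2 + 3
x <- c(10, 20, 30)
second <- `[`(x, 2)       # same as x[2]

# Even assignment is a function call.
`=`(y, 42)                # same as y = 42
```

Backticks are how R lets you refer to an operator by name, which is what makes calls like the last three possible.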

Like in MATLAB, values behave as immutable: they are passed by value, with copy-on-write behind the scenes. R also has open-world polymorphism that doesn’t introduce new syntax; it’s at the same time more flexible and more TOOWTDI (“there’s only one way to do it”) than Python and other similar OOP languages. The function argument semantics are also more powerful/useful than those found in many other languages.
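To make the value semantics and the polymorphism concrete, here’s a rough sketch. It assumes the polymorphism in question is the S3 generic-function system (one of several object systems in R), and bump_first and describe are hypothetical names:

```r
# Hypothetical helper: tries to change its argument "in place".
bump_first <- function(v) {
  v[1] <- 99              # only the function's local copy is modified
  v
}
original <- c(1, 2, 3)
changed <- bump_first(original)   # original is untouched by the call

# An S3 generic (assumed here to be the polymorphism meant above).
# Anyone can add a method for a new class later just by defining
# another ordinary function named describe.<class>.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "something"
describe.data.frame <- function(x, ...) paste("a table with", nrow(x), "rows")
```

Calling describe(42) falls through to the default method, while describe(data.frame(a = 1:3)) dispatches on the class of its argument. No class declarations or special syntax are involved, only functions and a naming convention.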

If it weren’t for the scoping rules and its system of data structures, R would be a simpler language than plenty of other similar scripting languages such as Python and JavaScript. Yet despite the simplicity, and thanks to the fact that everything is vectorized, the language is expressive enough to accomplish an impressive number of things in one line.
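As a small illustration of the vectorized one-liner style (the data here is invented):

```r
heights_cm <- c(150, 165, 180, 172)

# Standardize every element at once; no explicit loop.
z <- (heights_cm - mean(heights_cm)) / sd(heights_cm)

# Count the values above the mean in a single expression.
n_tall <- sum(heights_cm > mean(heights_cm))
```

Arithmetic and comparisons apply elementwise, and the logical vector from the comparison is summed directly because TRUE counts as 1.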

The Data Structures

Here’s another one that’s equally an asset and a liability. Since we’re just focusing on the ups: the data frame and the factor stand out in particular. R’s data types are by themselves straightforward to implement (though tedious to implement well and optimize and deal with missing values and joins and pivoting and whatnot), but the fact is that they have served as the foundation for a lot of R code. Such an established foundation simply does not exist yet in other environments like Python. Projects like Wes McKinney’s pandas add these crucial data structures, are making steady progress (including the recent addition of factors), and will probably be “standardized on” in the not-too-distant future, but until then there’s still a lot of work to do, multiple such projects in competition, and relatively little built on top of them.
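For anyone who hasn’t met these two structures, a minimal sketch with made-up data:

```r
# A data frame holds columns of different types; `group` is a factor
# (a categorical variable) and `score` contains a missing value.
trials <- data.frame(
  group = factor(c("control", "treated", "treated", "control")),
  score = c(4.0, 6.5, NA, 5.0)
)

# The factor knows its set of categories.
group_levels <- levels(trials$group)

# Mean score per group, skipping the missing value.
means <- tapply(trials$score, trials$group, mean, na.rm = TRUE)
```

Missing values (NA) and categorical levels are part of the base data types, which is a large part of why so much code can build on them.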

Serializability

Nearly everything in R is serializable. This goes beyond Python pickle, which has plenty of limitations. Code, even closures, can be treated as first-class values: you can serialize it and send it around, something that is rarely seen outside of the Lisp family. Your execution environment, the session, is something you’ll save and restore all the time, which is a huge boon for interactive development and exploration. Sure, a restored file/socket won’t be of any use, but everything else just works.
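Here’s a minimal sketch of serializing a closure with base R’s serialize() and unserialize(); make_counter is a hypothetical example function:

```r
# A closure: the returned function remembers `count` from the
# environment in which it was created.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1
    count
  }
}
counter <- make_counter()
first <- counter()                       # advances the captured count to 1

# Round-trip the closure, captured state included, through raw bytes.
# The same bytes could be written to a file or sent over a connection.
bytes <- serialize(counter, connection = NULL)
restored <- unserialize(bytes)
after_restore <- restored()              # continues from the captured state
```

The restored function carries its own copy of the captured environment, so it picks up counting where the original left off without affecting it. For a whole session, base R’s save.image() and load() play the analogous role.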

CRAN

Yes, other languages have analogues, like PyPI and npm. But CRAN is the only one where, in all my time using R software, I’ve never once had to go outside the system.

I run into this problem all the time in other ecosystems. Just earlier I had to repackage Google Protocol Buffers for Python to actually have a working setup.py. But I’ve run into this problem in a whole ton of projects, a sample of which includes Pyevolve, PyStemmer, unicodecsv, progressbar, re2. And it’s not just Python. I’ve run into problems with gems, CPAN, Cabal, all sorts of other places. And it’s not just broken package installers. Sometimes it’s problems/limitations with the package manager (just earlier I had to separately install numpy before installing scipy). Sometimes it’s the occasional packages that don’t even publish to these repositories at all, forcing you to step outside the system (particularly poor coverage in the Java/Maven world, but Maven arrived late).

I don’t know if there’s something about the way R package authoring/publishing works that makes distribution particularly robust and straightforward, or what. But shit just works.

Embedded R and Rserve

Not much to say here other than the fact that there’s some good interop in the form of RPy2 for Python and REngine for Java. Using complementary tools lets you work around R shortcomings and opens up many more opportunities to use R. Although you can certainly choose to write everything in R, there’s a healthy widespread awareness of R’s weaknesses (and strengths) that so far seems to be doing a good job of drawing boundaries. This attitude and this degree of accessibility are just another thing I like about R.


So, there’s quite a bit to learn from R. All that said, it’s important to understand the opening quote. There are many fundamental problems with R, stemming not just from the platform’s intrinsic properties but simply from the fact that it exists at all. Some of my happier dreams involve burning R to the ground.

But that’s a future post.

  • Brendan Miller

    What is %edit in python? I googled that, but google isn’t very friendly to the % character.

  • http://yz.mit.edu/ Yang Zhang

    It’s actually a special command in ipython, not python. You can find some documentation at http://ipython.org/ipython-doc/rel-0.10.2/html/interactive/tutorial.html#source-code-handling-tips.

  • jcborras

    Actually, there is a return() primitive in R. It’s not a reserved word in the language, but it does provide a facility for non-local returns.

  • jcborras

    CRAN is a blessing and a curse. Yes, it has code for every type of statistical analysis out there, no matter how exotic. But at the same time its overall user experience is far from consistent, which can be annoying when developing systems with it. Packages also need some time to mature.

    One of the good things about R is that it has become a de facto standard for describing scientific computations, at least in certain fields. The Maple/Mathematica/Macsyma systems are in a different league, and MATLAB has a very strong position in DSP circles, but overall circulation and usage of R code is surprisingly high in academia.

    The people hanging out in the official R mailing lists make a fantastic community too….

  • http://yz.mit.edu/ Yang Zhang

    Indeed, all good points. If I were to start ranting on the problems in CRAN I’d end up with a full other post, and inconsistency & incompatibility are near the top of that list.

  • Aatruji

    I would add http://stackoverflow.com/questions/tagged/r to the list of things to like about R. Nice post.

  • http://www.vacationrentalnet.com/ Dolcen Vacation

    R is the necessary evil, I guess.

  • Yin Zhu

    I think the power of R, the language, comes from Scheme. Scheme is very small and elegant, and this makes metaprogramming, serialization, etc. simple in R. R and Python are both dynamic, but I wouldn’t expect Python to have something like ggplot2, by which I mean the DSL interface.

  • tanyaM

    I’ve read data.table described as ‘game-changer’, ‘best kept secret’…and I like what I’ve managed to figure out to do with it so far. I’m not an advanced useR yet nor a programmer so I’m finding the official documentation a little cryptic and I haven’t found many tutorials yet (not that I’m complaining—I am always truly grateful for the generosity of the gurus). But (and I hope I am not touching any sore spots) is it noticeably absent from the Rstudio user community and some other major forums? Are there any vested interests (commercial applications or intellectual competitiveness) at play? I hope the new R data wrangling tools (e.g. plyr, reshape2, data.table, (Hmisc, sqldf ?)) will continue to develop and play nice.
