
R vs. Python?

David Feldman Data Analyst at Scribd

June 7th, 2015

Both R and Python are popular languages for data analysis. From what I understand, Python is a great general-purpose language, and R's functionality is developed specifically with statisticians in mind. I've heard people argue both sides, but I wonder which is better for daily use?


Benjamin Olding Co-founder, Board Member at Jana

June 7th, 2015

I did a PhD in statistics. Everyone used R. I didn't know R (I was not a stats undergrad), and it seemed magical: everyone was using it to solve everything. So, I invested time learning it.

I was pretty disappointed.  It really seemed like the result of a small community only knowing a single scripting language.  You can do pretty much anything with pretty much any language.  Why would you want to though?  This isn't a case of best tool - it's just the only script tool for that community (or was at the time - I think it's changing, mercifully).

If you already know R, can accomplish the task with R, and don't know python, I can't see a reason for you not to just use R to solve your problem.

If you already know python, then check out pandas and numpy/scipy.  When I was in grad school, these tools didn't exist, and as a result, I would have told you then that it made more sense to use the packages already in R than code the specialized routines you needed in another language.  Even so, R is just awful at manipulating data; I'd usually manipulate the data into the form I wanted outside R, then use read.table to read it in and pass it through the least amount of R code I needed to get the analysis done.  I was hardly alone: in fact, many of my fellow grad students just wrote everything in C++ for their dissertation, using R just as a way to easily bang out graphs when needed.  

Now that these python-based tools and libraries exist, however, I see no reason for a python programmer to not turn to them first, regardless of what you may hear about R.

If you do not know either R or python, please just learn python with pandas; this is the future. There is nothing inherent to the R language that makes it superior - it just has a lot of packages already written for it. However, that advantage decreases every day as more people contribute to pandas and numpy. I love stats - but the ideas behind statistical analysis aren't "owned" by a programming language. Python didn't really exist when S (the precursor to R) was created. S+ and then R had real advantages over other script-based languages for a long time. It's just no longer the case.
Python can realistically be used for 20 other things, unlike R, and the reality of analysis is usually that more than 50% of the work is getting the data into a usable form.  R just fails at this.  As a result, I used a lot of awk and sed; but python will get things done too.  I only turned to awk and sed because R was so terrible at manipulating real-world raw data.  R does a fine job at analysis once you have things in table form, but it doesn't do a better job at it than python if the routine exists in both languages (and, unless you're doing something pretty obscure at this point, it likely does).
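
As a rough illustration of the "getting the data into a usable form" step described above, here is a minimal Python sketch that uses only the standard library. The sample input, the field names, and the `parse_lines` and `to_csv` helpers are all made up for illustration; real raw data will need its own pattern.

```python
import csv
import io
import re

# Hypothetical raw input: inconsistent spacing, a comment line, a malformed row.
RAW = """\
# user   date        amount
alice   2015-06-01   12.50
bob     2015-06-02   n/a
  carol 2015-06-03   7
"""

# One record: a name, an ISO date, and a numeric amount, separated by whitespace.
LINE_RE = re.compile(r"^\s*(\w+)\s+(\d{4}-\d{2}-\d{2})\s+(\d+(?:\.\d+)?)\s*$")


def parse_lines(text):
    """Return (rows, rejected): clean records and lines that did not match."""
    rows, rejected = [], []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = LINE_RE.match(line)
        if match:
            user, date, amount = match.groups()
            rows.append({"user": user, "date": date, "amount": float(amount)})
        else:
            rejected.append(line)  # keep bad lines for inspection
    return rows, rejected


def to_csv(rows):
    """Serialize the clean rows as CSV text that R or pandas can read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["user", "date", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows, rejected = parse_lines(RAW)
print(to_csv(rows))
```

Keeping the rejected lines rather than silently dropping them makes it easier to see how much of the raw file the pattern failed to cover.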

I really don't see a trade-off on this one.  Unless you already know R for some reason, I believe the answer to your question is python, full stop.

Dan Oblinger Founder at AnalyticsFire

June 7th, 2015

I second Benjamin's opinion. Scripting in a general-purpose language that has libraries like pandas is nearly always a better experience than working in a special-purpose language that was extended after the fact to be general purpose.

Just one example to illustrate the point. In R, certain operations on a data frame will return a lower-dimensional object, and sometimes they will not. I think the rules originated when the operators were specialized statistical steps. Since then R has been extended to handle all the things general-purpose languages do, but not in the simplest, cleanest way. In Python the core language was designed clean first, then the pandas DataFrame was added, and it does not 'pollute' other operations (like textual manipulation of data in a file).

Hasan noted that Python graphing is primitive compared to R. I do agree on this point.
I generally write a small python function that dumps the R statements into a file in /tmp and then invokes R on that file. (Once this is done, that graphing tool is available directly within python.)
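
The workflow just described might look roughly like the sketch below. It is not the poster's actual code: the `build_r_script` and `plot_with_r` names, the CSV input, and the choice of a scatter plot are assumptions made for illustration. It relies on the `Rscript` command that ships with R and on the base R functions `read.csv`, `png`, `plot`, and `dev.off`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_r_script(csv_path, x, y, png_path):
    """Return R source that reads a CSV and writes a scatter plot to a PNG.

    Hypothetical helper: paths and column names are pasted in verbatim,
    so they must not contain double quotes or backslashes.
    """
    return "\n".join([
        f'data <- read.csv("{csv_path}")',
        f'png("{png_path}")',
        f'plot(data${x}, data${y}, xlab="{x}", ylab="{y}")',
        "dev.off()",
    ]) + "\n"


def plot_with_r(csv_path, x, y, png_path):
    """Write the R script to a temp file and run it with Rscript.

    Returns False without plotting when Rscript is not on PATH.
    """
    script = build_r_script(csv_path, x, y, png_path)
    with tempfile.NamedTemporaryFile("w", suffix=".R", delete=False) as handle:
        handle.write(script)
        script_path = Path(handle.name)
    try:
        if shutil.which("Rscript") is None:
            return False
        subprocess.run(["Rscript", str(script_path)], check=True)
        return True
    finally:
        script_path.unlink()  # clean up the temporary script
```

A call such as `plot_with_r("measurements.csv", "height", "weight", "/tmp/plot.png")` would then produce the image from within Python, assuming R is installed and the file has those two columns.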

Hasan also noted other statistical functions that R has that python does not.  Certainly true, but if you listed the algs in scipy and scikit-learn I am positive there would be many not found in R.

My only disclaimer: I am not a hard-core stats guy. I am doing ML, and lots of data preprocessing.
So I cannot assess the completeness of the Python environment from the perspective of a stats guy.

Ana Echeverri Visual Analytics, Predictive Analytics, Enterprise Software

June 7th, 2015

I would say it depends on what you are trying to do. I use both R and python+scikit-learn. If I am just doing statistical modeling or data mining I prefer to use R. If however I need the analysis to be part of a web app I prefer to use Python. But the bottom line is I can probably achieve the same results from the analysis perspective using either one. Ana

R is a statistical tool

Python is a programming language

This is a huge and fundamental difference between the two, which makes any further comparison redundant. I would say that daily use of Python would be by a developer and daily use of R would be by a statistician, plain and simple.

Hasan Diwan contract Data Scientist to several startups

June 7th, 2015

As with most such questions, it depends. Python was designed by a computer scientist; R by statisticians. The personalities of the designers of each shine through in their use.

Shobhit Verma

June 7th, 2015

I got degrees in Statistics as well as Computer Science. I love and use R for exploration and once I have played with the data and figured out what model would generalize best, I use python to create a production version algorithm that scales.
If you do not want to learn python you may be able to go very far using Revolution Analytics support. However, I just prefer rewriting in python as it allows me to be more in control of the various optimizations at scale.

Bojan Tunguz Chief Data Scientist at Tunguz Consulting LLC

June 7th, 2015

Another consideration might be performance. In my experience Python is much faster than R, which can be a serious issue for large data sets. 


June 7th, 2015

It depends on what you mean by "daily use". Here are a couple of scenarios:

1. If you are building a general web platform with user engagement use-cases beyond data and statistical dashboards, then Python is going to be more useful, as it has full-stack web frameworks that help with web development and offers a more productive ecosystem than R for web dev.

2. If your daily chores and product require a lot of data analysis and predictive modeling based on large sets of data, my bias is that R is the better fit and makes it easier to reach your goals.

Hasan Diwan contract Data Scientist to several startups

June 7th, 2015

Dr Olding, The gamlss package for R has no equivalent in python. And the plotting tools are primitive. There's no python equivalent for RGM[1]. -- H

Peter Johnston Businesses are composed of pixels, bytes & atoms. All 3 change constantly. I make that change +ve.

June 8th, 2015

One point made above is that Python's plotting tools are primitive. There are simple cost-effective add-ons which solve this - try, for example.

But it raises the question - what is the use to which the data will be put?
Many of us have an old mindset that the use of the tool is to create graphs from which we can see an insight or trend. That's part of the OWM business and scientific methodology - that Old White Men are the ultimate decision making tool.

In the modern world, we don't create single endpoints, we create systems. Systems which not only take the data and graph it but look at what's missing in the data itself - how it can be improved, where the logistical problems are and how the statistical quality can be improved. These systems are designed to learn and self-improve.

They are also designed to skip the OWM phase altogether and take direct action from the data to improve the system. This cuts out the "this data supports my project, so I'll promote it, this data doesn't so I'll suppress it" which underpins most scientific research and many business decisions.

Here you have to think back to the provenance. R is the beloved tool of the academic community - it has OWM baked into its methodologies, uses and outputs. Python is much more about simply building a program which is agnostic about the inputs and outputs. R has a "something to read" output, Python has a "something to do" output.

So - are you crunching data for OWM to visualise and pontificate over? Or to make something happen?