
R vs. Python?

David Feldman Data Analyst at Scribd

June 7th, 2015

Both R and Python are popular languages used to perform data analysis tasks. From what I understand, Python is a great general-purpose language, and R's functionality is developed specifically with statisticians in mind. I've heard people argue both sides, but I wonder which is better for daily use?


Benjamin Olding Co-founder, Board Member at Jana

June 7th, 2015

I did a phd in statistics.  Everyone used R.  I didn't know R (I was not a stats undergrad), and it seemed magical: everyone was using it to solve everything.  So, I invested time learning it.

I was pretty disappointed.  It really seemed like the result of a small community only knowing a single scripting language.  You can do pretty much anything with pretty much any language.  Why would you want to though?  This isn't a case of best tool - it's just the only script tool for that community (or was at the time - I think it's changing, mercifully).

If you already know R, can accomplish the task with R, and don't know Python, I can't see a reason for you not to just use R to solve your problem.

If you already know python, then check out pandas and numpy/scipy.  When I was in grad school, these tools didn't exist, and as a result, I would have told you then that it made more sense to use the packages already in R than code the specialized routines you needed in another language.  Even so, R is just awful at manipulating data; I'd usually manipulate the data into the form I wanted outside R, then use read.table to read it in and pass it through the least amount of R code I needed to get the analysis done.  I was hardly alone: in fact, many of my fellow grad students just wrote everything in C++ for their dissertation, using R just as a way to easily bang out graphs when needed.  

Now that these python-based tools and libraries exist, however, I see no reason for a python programmer to not turn to them first, regardless of what you may hear about R.

If you do not know either R or python, please just learn python with pandas; this is the future.  There is nothing inherent to the R language that makes it superior - it just has a lot of packages already written for it.  However, that advantage decreases every day as more people contribute to pandas and numpy.  I love stats - but the ideas behind statistical analysis aren't "owned" by a programming language.  Python didn't really exist when S was created (the precursor to R).  S+ and then R had real advantages over other script-based languages for a long time.  It's just no longer the case.
 
Python can realistically be used for 20 other things, unlike R, and the reality of analysis is usually that more than 50% of the work is getting the data into a usable form.  R just fails at this.  As a result, I used a lot of awk and sed; but python will get things done too.  I only turned to awk and sed because R was so terrible at manipulating real-world raw data.  R does a fine job at analysis once you have things in table form, but it doesn't do a better job at it than python if the routine exists in both languages (and, unless you're doing something pretty obscure at this point, it likely does).
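As a rough illustration of the "getting the data into a usable form" step, here is a minimal Python sketch using only the standard library. The input format, the column names, and the `parse_lines` and `to_csv` helpers are all hypothetical, made up for this example rather than taken from any real dataset; pandas would typically take over once the data is in table form.

```python
import csv
import io
import re
import statistics

# Hypothetical messy input: mixed separators, stray whitespace,
# a comment line, and one row with a non-numeric value.
RAW = """\
# user, country, session_seconds
alice ; US ; 120
bob,us,  95
carol | DE | n/a
dave;de;210
"""

# Split on ';', ',' or '|' and swallow any whitespace around the separator.
SPLIT = re.compile(r"\s*[;,|]\s*")


def parse_lines(text):
    """Yield (user, country, seconds) tuples, skipping comments and bad rows."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = SPLIT.split(line)
        if len(fields) != 3:
            continue
        user, country, seconds = fields
        try:
            value = float(seconds)
        except ValueError:
            continue  # e.g. "n/a": drop the row instead of guessing a value
        yield user, country.upper(), value


def to_csv(rows):
    """Render cleaned rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["user", "country", "session_seconds"])
    writer.writerows(rows)
    return buf.getvalue()


rows = list(parse_lines(RAW))
table = to_csv(rows)

# A tiny "analysis" step once the data is tabular: mean session time per country.
mean_by_country = {
    country: statistics.mean(s for _, c, s in rows if c == country)
    for country in sorted({c for _, c, _ in rows})
}
print(table)
print(mean_by_country)
```

The cleanup (separator handling, case normalization, dropping unparseable rows) is ordinary string processing, which is the part of the job the post says takes most of the time.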

I really don't see a trade-off on this one.  Unless you already know R for some reason, I believe the answer to your question is python, full stop.

Dan Oblinger Founder at AnalyticsFire

June 7th, 2015

I second Benjamin's opinion.  Scripting in a general-purpose language that has libraries like pandas is nearly always a better experience than working in a special-purpose language that was extended after the fact into a general-purpose one.

Just one example to illustrate the point.  In R, certain operations on a data frame will sometimes return a lower-dimensional object, and sometimes not.  I think the rules originated when the operators were specialized statistical steps.  R has since been extended to handle all the things general-purpose languages do, but not in the simplest, cleanest way.  In Python the core language was designed cleanly first, and the pandas DataFrame was added afterwards, so it does not 'pollute' other operations (like textual manipulation of data in a file).

Hasan noted that Python graphing is primitive compared to R.  I do agree on this point.
I generally write up a small Python function that dumps the R statements into a file in /tmp and then invokes R on that file.  (Once this is done, that graphing tool is available directly within Python.)
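A minimal sketch of that kind of wrapper might look like the following. It is not the poster's actual code: the helper names `build_r_plot_script` and `run_r_script`, the file paths, and the column names are all hypothetical, and it assumes the `Rscript` command-line tool is installed and on the PATH.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_r_plot_script(csv_path, png_path, x_col, y_col):
    """Return R source that reads a CSV and writes a scatter plot to a PNG.

    Paths and column names are pasted in as-is, so this sketch assumes they
    contain no quotes or backslashes.
    """
    return "\n".join([
        f'data <- read.csv("{csv_path}")',
        f'png("{png_path}")',
        f'plot(data${x_col}, data${y_col}, xlab="{x_col}", ylab="{y_col}")',
        "dev.off()",
    ]) + "\n"


def run_r_script(script, rscript="Rscript"):
    """Write the script to a temporary file and run R on it.

    Returns False without doing anything if the R executable is not found.
    """
    if shutil.which(rscript) is None:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".R", delete=False) as handle:
        handle.write(script)
        script_path = Path(handle.name)
    try:
        subprocess.run([rscript, str(script_path)], check=True)
    finally:
        script_path.unlink()  # clean up the temporary script either way
    return True


# Hypothetical paths and column names, for illustration only.
script = build_r_plot_script("/tmp/points.csv", "/tmp/points.png", "x", "y")
print(script)
```

Calling `run_r_script(script)` would then produce the PNG, provided R is installed and the CSV exists. Keeping script generation separate from execution makes the generated R code easy to inspect when a plot goes wrong.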

Hasan also noted other statistical functions that R has that python does not.  Certainly true, but if you listed the algs in scipy and scikit-learn I am positive there would be many not found in R.

My only disclaimer: I am not a hard-core stats guy.  I am doing ML, and lots of data preprocessing.
So I cannot assess the completeness of the Python environment from the perspective of a stats guy.
--dan

Ana Echeverri Visual Analytics, Predictive Analytics, Enterprise Software

June 7th, 2015

I would say it depends on what you are trying to do. I use both R and python+scikit-learn. If I am just doing statistical modeling or data mining I prefer to use R. If however I need the analysis to be part of a web app I prefer to use Python. But the bottom line is I can probably achieve the same results from the analysis perspective using either one. Ana

Hasan Diwan contract Data Scientist to several startups

June 7th, 2015

As with most such questions, it depends. Python was designed by a computer scientist; R by statisticians. The personalities of the designers of each shine through in their use.

Shobhit Verma

June 7th, 2015

I got degrees in Statistics as well as Computer Science. I love and use R for exploration and once I have played with the data and figured out what model would generalize best, I use python to create a production version algorithm that scales.
If you do not want to learn python you may be able to go very far using Revolution Analytics support. However, I just prefer rewriting in python as it allows me to be more in control of the various optimizations at scale.

Bojan Tunguz Chief Data Scientist at Tunguz Consulting LLC

June 7th, 2015

Another consideration might be performance. In my experience Python is much faster than R, which can be a serious issue for large data sets. 

Anonymous

June 7th, 2015

It depends on what you mean by "daily use". Here are a couple of scenarios:

1. If you are building a generalized web platform that has more user-engagement use cases beyond data and statistical dashboards, then Python is going to be more useful: it has full-stack web frameworks that can assist with web development, and it provides a more productive ecosystem than R for web dev.

2. If your daily chores and product require a lot of data analysis and predictive modeling based on large sets of data, my bias is that R is the better fit and makes it easier to attain your goals.

Peter Johnston Businesses are composed of pixels, bytes & atoms. All 3 change constantly. I make that change +ve.

June 8th, 2015

One point made above is that Python's plotting tools are primitive. There are simple cost-effective add-ons which solve this - try Plot.ly, for example.

But it raises the question - what is the use to which the data will be put?
Many of us have an old mindset that the use of the tool is to create graphs from which we can see an insight or trend. That's part of the OWM business and scientific methodology - that Old White Men are the ultimate decision making tool.

In the modern world, we don't create single endpoints, we create systems. Systems which not only take the data and graph it but look at what's missing in the data itself - how it can be improved, where the logistical problems are and how the statistical quality can be improved. These systems are designed to learn and self-improve.

They are also designed to skip the OWM phase altogether and take direct action from the data to improve the system. This cuts out the "this data supports my project, so I'll promote it, this data doesn't so I'll suppress it" which underpins most scientific research and many business decisions.

Here you have to think back to the provenance. R is the beloved tool of the academic community - it has OWM baked into its methodologies, uses and outputs. Python is much more about simply building a program which is agnostic about the inputs and outputs. R has a "something to read" output, Python has a "something to do" output.

So - are you crunching data for OWM to visualise and pontificate over? Or to make something happen?

Jared Hardy Founding Director at Data Roads Foundation

June 8th, 2015

The biggest problem with R, in computer science terms, is that it is a domain-specific language (DSL) by design (for statistics only) instead of a general-purpose language like Python (which can do anything computers can do). You can almost always build domain-specific functionality within any general-purpose language by implementing it all in libraries, which has already been done for you with many statistics libraries in Python, so you don't have to completely change syntax for every application or interface you write. As a past full-time developer, I can tell you that syntax mode-switching is a huge time sink, which is why most programmers and product teams prefer to stick to one language at a time. DSLs are most often incapable of performing any tasks outside their designated domain, which limits their usability for other related tasks (like web interaction). General-purpose languages like Python also tend to have "bridge libraries" to other languages, like rpy2, so you don't really ever have to use any DSL (read: R) for anything but the few tasks it's good at. See rpy.sourceforge.net.

In summation: there's no guarantee that any domain specific language like R is always going to be the best tool for the job it was designed for. In contrast, general purpose languages like Python are not limited by domain specific syntax or assumptions, and they can be constantly upgraded in capabilities and performance via new libraries without any significant change in syntax. This makes Python the obvious long-term choice between the two.

If you compare Python to other general-purpose languages instead, the question becomes more about the available (and plausible near-future) development environment and interpreter/compiler infrastructure, all of which require widespread core-language developer support. These are other areas where Python is a clear winner.

Micah Stevens Software/Hardware Engineer

June 7th, 2015

I'd suggest Python for general-purpose stuff. It has a much larger ecosystem, more libraries, and can be used fruitfully in almost any manner. You might find that most problems are already solved; for 75% of your tasks it's more a matter of integration than programming.

Don't underestimate the importance of community support, and of having a large community. Python's community is likely several orders of magnitude larger. This means you have more people to ask for help, you have more and better tools to work with the language, and it's better understood.

I've also seen people use Python in modeling and simulations, so I know it's capable in that realm if it's a requirement, although I wouldn't be surprised if it's not as good as R for what R was designed for.