Python · Scalability

Can a Python + Elastic stack handle big data (2 billion+ records)?

Jofin Joseph Lead data diagnostics at Vibe

December 2nd, 2015

We estimate our product will aggregate around 2 billion data points over the next year. Each data point will be a combination of 15 to 20 information items (text) about entities. We use Python/Django for the application and APIs, and Elasticsearch as the data store.

Will this stack scale to 2 billion records and the processing that goes with them? If not, what should I be considering? Would appreciate your suggestions.

Joe Emison Chief Information Officer at Xceligent

December 3rd, 2015

You shouldn't listen to any advice here so far, because you haven't given us enough details about what you're doing. You have described what you want to store, but you haven't talked about:
  • how you need to search/retrieve data (are you searching with filters? and/or map? or direct retrieval of data? are you retrieving aggregate data or records directly? are you needing to connect your data to other data sources automatically/via API?)
  • how fast you need to retrieve data (sub-second? monthly reports?)
  • your key business risks (are you low on money? need to launch quickly? worried about adoption?)
Again, don't take architecture advice from anyone until you've actually laid out the full range of your needs and risks. Also, engineers are notorious for overengineering solutions for startups whose biggest risk is that no one will use the thing at all. It's much better to launch quickly, win actual customers, and then refactor when they run into technical limitations, than to never have any customers because the engineers decided to build the thing to last ten years and you ran out of money before reaching product-market fit.

Federico Marani Technical Architect

December 3rd, 2015

There is nothing inherently limiting in the tools you have chosen, but at that size the backend architecture becomes really important.
I wouldn't use Elasticsearch as a primary datastore; that isn't its purpose. Use something like PostgreSQL (or investigate Cassandra).
You can use Elasticsearch as a cache for the "aggregated view" of your data (after joining together all 15-20 data points), but if you are not doing full-text search over it, you may as well keep using PostgreSQL.
A configuration that may work is PostgreSQL with two sets of tables: one with the original data and one with a structure optimized for the most common operations your product performs (a sketch follows below).
I have seen this sort of setup work well for processing 20 million records every day, but we spent a fair amount of time optimizing it, and the database rows were quite independent.
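A rough Django-flavored sketch of that two-table idea (model and field names are hypothetical, just to show the shape):

```python
# Hypothetical Django models for the two-table pattern: raw rows as
# they arrive, plus a denormalized table shaped for the common reads.
from django.db import models


class RawDataPoint(models.Model):
    """Original data: one row per information item per entity."""
    entity_id = models.BigIntegerField(db_index=True)
    key = models.CharField(max_length=100)
    value = models.TextField()
    received_at = models.DateTimeField(auto_now_add=True)


class EntityView(models.Model):
    """Optimized view: the 15-20 items joined into a single row,
    rebuilt whenever the underlying raw data changes."""
    entity_id = models.BigIntegerField(primary_key=True)
    payload = models.TextField()  # e.g. a JSON blob of all items
    updated_at = models.DateTimeField(auto_now=True)
```

Whether you rebuild the optimized table in batch or on every write depends on how fresh the product needs that view to be.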

Kias Hanifa Chief Technology Officer at Fonicom Limited, Malta

December 2nd, 2015

My suggestion would be a primary data store on a partitioned MySQL cluster or MongoDB, Elasticsearch in a distributed/cluster model for search and analytics, and Python/Django for the application.
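One common shape for that split, sketched with the official elasticsearch Python client (the index name and row fields are made up; pre-8.x client conventions assumed):

```python
# Hypothetical sync step: rows live in the primary store (MySQL,
# MongoDB, ...) and get bulk-indexed into Elasticsearch for search.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster


def index_entities(rows):
    """rows: an iterable of dicts pulled from the primary datastore."""
    actions = (
        {
            "_index": "entities",      # hypothetical index name
            "_id": row["entity_id"],   # hypothetical primary key
            "_source": row,
        }
        for row in rows
    )
    helpers.bulk(es, actions)  # batches the requests for throughput
```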


Radamantis Torres

December 2nd, 2015

Hello,
This depends on how you architect your backend. Python/Django can handle that amount of data, but you will need a load balancer and multiple servers processing whatever needs to be processed. As for Elasticsearch, it also depends on the "size" of the data: you mentioned 2 billion records, but are we talking about terabytes? Petabytes?
I'd suggest using Elasticsearch as a caching mechanism with another powerful database behind it, like PostgreSQL. If you're talking about serious petabytes, you want to check out Hadoop.
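A sketch of what "Elasticsearch as a caching mechanism" could look like on the read path (all names, the table layout, and the DSN are hypothetical; `body=` follows the pre-8.x client):

```python
# Hypothetical read path: try Elasticsearch first, fall back to the
# primary PostgreSQL store and re-index on a miss.
from elasticsearch import Elasticsearch, NotFoundError
import psycopg2

es = Elasticsearch(["http://localhost:9200"])
pg = psycopg2.connect("dbname=entities")  # assumed connection string


def get_entity(entity_id):
    try:
        return es.get(index="entities", id=entity_id)["_source"]
    except NotFoundError:
        with pg.cursor() as cur:
            cur.execute(
                "SELECT payload FROM entity_view WHERE entity_id = %s",
                (entity_id,),
            )
            row = cur.fetchone()
        if row is None:
            return None
        doc = {"entity_id": entity_id, "payload": row[0]}
        es.index(index="entities", id=entity_id, body=doc)  # warm the cache
        return doc
```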

Marc Milgrom Business Manager at Bloomberg, LP

December 3rd, 2015

To reiterate the above, Elastic is NOT a data store. It's an index and search engine that runs on top of another database in the stack. 
I would personally recommend PostgreSQL as your persistent data store, unless the data records are very large text documents with no structure, in which case Mongo or Hadoop are better suited.

Federico Marani Technical Architect

December 4th, 2015

To me, searching and retrieving are very different use cases: retrieving is loading exactly one entity by a (usually numeric) ID, while searching is loading a selection of entities that match PARTS of a search string. I am sure Elasticsearch is optimized for both, but retrieval is more "predictable" in terms of performance (and probably quicker). You did not mention aggregation in your last post, but that again would be a completely different use case.

You may be able to take some of the load off the search/retrieval operations by putting a cache like Varnish in front of Elasticsearch; that is something I have seen done with Solr. But really it is about making sure Elasticsearch uses caches correctly, both internal and external.
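The same idea can be approximated inside the app with Django's cache framework instead of an HTTP cache (the key scheme and 60-second timeout are arbitrary assumptions):

```python
# Hypothetical per-query result cache sitting in front of Elasticsearch.
import hashlib

from django.core.cache import cache


def cached_search(es, query_body, timeout=60):
    # Derive a stable cache key from the query itself.
    key = "search:" + hashlib.sha1(repr(query_body).encode()).hexdigest()
    results = cache.get(key)
    if results is None:
        results = es.search(index="entities", body=query_body)
        cache.set(key, results, timeout)
    return results
```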

I would be more concerned about that update frequency: Elasticsearch will be constantly busy updating its indexes. Not to mention how expensive it is going to be to maintain 6 TB of storage on a cloud server...
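If the write rate does become a problem, one knob worth knowing about is the index refresh interval, which controls how quickly new writes become visible to search (the 30s value and index name below are just examples):

```python
# Trade search freshness for indexing throughput by refreshing the
# index less often than the default (1s).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])
es.indices.put_settings(
    index="entities",  # hypothetical index name
    body={"index": {"refresh_interval": "30s"}},
)
```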

Please do not take anything I said as "the right advice", because so much depends on the product, the business, and the expertise you have in house. If what you are building is an MVP, you may discover that most of this is beside the point.

Wedge Martin CTO at Vivo Technology Inc

December 4th, 2015

I'd save yourself the money on paying someone to run your search cluster for you. It takes all of about 30 minutes to set up a functioning Elasticsearch cluster, and about that much again to get some decent monitoring going. We ran all of the commercial side of eBay off Lucene (the search engine under the hood of Elasticsearch) for a long time and it held up very well: far more than 10 updates per second, and it wasn't that big a cluster, all things considered. For the writes, my team built some large social mobile networks that handled ~800 queries (on geospatial indexes) and updates per second, with the back end being a 3x3 MongoDB cluster of ~$200/mo EC2 nodes. You don't necessarily need to go NoSQL, though, if you write your sharding logic in code so that you can parallelize your I/O across several back-end master/slave sets.
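A minimal sketch of that do-it-yourself sharding idea (the shard count and connection strings are invented for illustration):

```python
# Hypothetical application-level sharding: hash the entity ID to pick
# one of several back-end master/slave sets, so I/O runs in parallel.
import hashlib

SHARD_DSNS = [
    "dbname=entities_0 host=db0",  # made-up DSNs, one per shard
    "dbname=entities_1 host=db1",
    "dbname=entities_2 host=db2",
]


def shard_for(entity_id):
    """Map an entity ID to a stable shard index."""
    digest = hashlib.md5(str(entity_id).encode()).hexdigest()
    return int(digest, 16) % len(SHARD_DSNS)


def dsn_for(entity_id):
    """Connection string for the shard owning this entity."""
    return SHARD_DSNS[shard_for(entity_id)]
```

The catch, as noted above, is that resharding later (adding nodes) is development work you own, which is exactly the trade-off against a NoSQL store that shards for you.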

Jofin Joseph Lead data diagnostics at Vibe

December 4th, 2015

Thanks a lot for the suggestions, and sorry for not being comprehensive enough in my question. Our data records are all text, with an average size of 3 KB per record. Hence we estimate the maximum size to be around 6,000 GB (2 billion × 3 KB ≈ 6 TB).

Our use case involves searching for and retrieving individual records from this dataset at a frequency of up to 50 per second. There will also be update operations on the records happening at a rate of up to 10 per second.

We are contemplating all the options in front of us to reach a decision. Would appreciate more thoughts.

Slavomir Jasinski Technical Director at Real Estate Industry

December 3rd, 2015

An interesting talk on using Python and then moving towards Go (golang):

https://www.youtube.com/watch?v=JOx9enktnUM

So if you are looking for raw efficiency, consider something "better" than Python.

Wedge Martin CTO at Vivo Technology Inc

December 3rd, 2015

Lots of good answers here, so I'm late to the party. My short answer: you can scale just about any platform to handle X traffic load, but the real difference will be in how much development time you have to put into sharding your data, and how much the platform will cost to run. Django and MySQL can get you there reasonably well, and Elasticsearch is a great solution for handling search queries while acting as a bit of a buffer in front of your database. My personal preference would be MongoDB, as I've got a lot of experience with it, but you want someone in-house who really knows the platform well. Other good decisions to make will be around how the app is architected: the back end should be all API endpoints feeding the front-end clients (mobile or otherwise), with no views generated server-side. Layer caching on top of everything, and make sure that you overwrite entries in the cache when data is updated instead of letting them expire, to avoid stampedes.
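A rough sketch of that overwrite-on-update pattern with Django's cache (the key naming, timeout, and `db` API are assumptions):

```python
# Hypothetical write-through cache: update the cache entry whenever
# the record changes, rather than deleting it or letting it expire,
# so readers never stampede the database for a missing key.
from django.core.cache import cache

CACHE_TIMEOUT = 24 * 3600  # long TTL; freshness comes from overwrites


def save_entity(db, entity):
    db.save(entity)  # persist to the primary store (hypothetical API)
    cache.set("entity:%s" % entity["id"], entity, CACHE_TIMEOUT)


def load_entity(db, entity_id):
    entity = cache.get("entity:%s" % entity_id)
    if entity is None:  # cold start or eviction: fall back once
        entity = db.load(entity_id)  # hypothetical API
        cache.set("entity:%s" % entity_id, entity, CACHE_TIMEOUT)
    return entity
```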