
What are some good free web scrapers & techniques? If you've used one, please share your story.

Marlina Kinnersley

August 3rd, 2013

Looking to learn more about web scraping and hear from those who've used scrapers. If you've used one (or several), please share your purpose, the scraper's link or name, and your results! (RoR-oriented ones would be most helpful!) Cheers!

Saïd PhD Tech Entrepreneur, Search Expert, Digital Marketer

January 12th, 2017

I used Nutch. It's a very powerful open-source solution: http://nutch.apache.org/


You can find tons of Python solutions; several of them are pretty good, but I'm not going to advertise any of them here.


You can also find working scrapers on GitHub. Sometimes you can even find scrapers adapted to specific websites.


Cheers,

Nextal ATS (https://www.nextal.com)

Michael Brill Technology startup exec focused on AI-driven products

August 3rd, 2013

Hi Marlina.

I've been scraping up a storm lately - learning a fair bit about copyright law and proxy meshes in the process. My general assessment is that if you have development skills, you're probably better off using your favorite language rather than a purpose-built tool. I looked at OutWit and Mozenda... and while they would both have done what I needed, each is yet another technology to learn and support.

No experience with RoR scraping libraries, but I'm sure if you google around, one or two will float to the top. I looked at Scrapy for Python and it seemed quite good. Since we're working in Node.js, I just used two modules (request and cheerio), and it's ridiculous how simple it is. Since many sites have anti-scraping technology installed, you may end up having to use a VPN that cycles through IPs. I just used Hotspot Shield ($30/year) and that's worked on everything so far.
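
If you'd rather stay in Ruby, the rough equivalent of that request + cheerio combo is open-uri + nokogiri. A minimal sketch - the URL and selector here are just placeholders:

    require 'open-uri'
    require 'nokogiri'

    # Fetch the page and parse it into a searchable document
    doc = Nokogiri::HTML(URI.open('https://example.com/listings'))

    # Pull out whatever you're after with CSS selectors
    doc.css('h2.title a').each do |link|
      puts "#{link.text.strip} -> #{link['href']}"
    end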

Make sure you know about IP issues and litigation risks. 

...Michael

Harshit Rastogi

August 3rd, 2013

I have used BeautifulSoup in Python; that's a good way to begin. But I've realized I prefer using XPath to get the data, since it has no real learning curve.

I've also used Ruby and Nokogiri.
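
Nokogiri supports XPath out of the box, so the same query style carries over. A tiny sketch (the document and expression here are made up):

    require 'nokogiri'

    doc = Nokogiri::HTML(<<~HTML)
      <ul id="prices">
        <li data-sku="a1">$10</li>
        <li data-sku="b2">$12</li>
      </ul>
    HTML

    # XPath queries work just like they do in lxml on the Python side
    doc.xpath('//ul[@id="prices"]/li').each do |li|
      puts "#{li['data-sku']}: #{li.text}"
    end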

Scott Fairbairn

August 3rd, 2013

Hi Marlina,

We've used BeautifulSoup, a Python library, successfully in the past. As Michael mentioned, when it comes to scraping data (hopefully legally), you run into all kinds of corner cases that off-the-shelf products might not handle gracefully.

Writing the code directly seems to work best, at least for us.

-Scott

Toddy Mladenov CTO and Co-Founder at Agitare Technologies Inc.

August 5th, 2013

I used http://scrapy.org/ (Python) for downloading Yahoo! Finance information, and it required only 40 lines of code. Very easy to learn and use.

David Hunter Machine Learning Research, University of Oxford

August 3rd, 2013

Been looking at this very recently and I agree with Michael that it's better to write your own scraper if you can.

I'm a Python addict and have so far found splinter (http://splinter.cobrateam.info/) to be perfect for pretty much any application, including scraping 'ajaxy' data rendered and hidden with JavaScript.

David

Anonymous

August 3rd, 2013

I am interested in this topic because we may go down the web scraping path in the future. It's on my radar, but I'm by no means an expert. In the past I worked at a company where a group other than mine did web scraping, and I know they worked very hard dealing with badly formed HTML, JavaScript issues, etc. across different websites.

So unless you have significant development resources and only need to scrape a handful of sites, you may be better off seeing if there are any companies or open-source projects that have already learnt the lessons and fought the battles of web scraping. That way you can concentrate on your core competencies.

By the way, a company that adopted this strategy fairly successfully is Mint.com: they used Yodlee to do the web scraping for them. That was circa 2006; both companies seem to have moved on since then.

You might want to look at this discussion http://www.quora.com/Web-Scraping/What-are-some-good-free-web-scrapers-scraping-techniques

Finally, in my opinion, the hard part of web scraping is handling the idiosyncrasies of various websites, so if there is a great non-RoR tool, it may not be that hard to integrate it with an RoR back end to handle the results of the scraping.

Best of luck,
Manu  

Jesal Gadhia

August 3rd, 2013

Take a look at these two Ruby web scraping libraries:

Pismo: https://github.com/peterc/pismo
Mechanize: https://github.com/sparklemotion/mechanize

I've played around with both in the past with decent results. I've also used https://github.com/sparklemotion/nokogiri, which is more bare-metal but gives you more flexibility. (Pismo and Mechanize both use Nokogiri on the back end.)
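
For a flavor of Mechanize, here's a minimal session - the site and form details are hypothetical:

    require 'mechanize'

    agent = Mechanize.new
    page  = agent.get('https://example.com/search')

    # Mechanize handles cookies, redirects, and forms for you;
    # the parsed page is a Nokogiri document underneath
    form = page.form_with(action: '/search')
    form['q'] = 'web scraping'
    results = agent.submit(form)

    results.search('.result a').each { |a| puts a['href'] }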

Jerome Dangu CTO & Co-Founder at ClarityAd

August 4th, 2013

PhantomJS is a headless browser that's especially useful for sites that rely heavily on JavaScript: you get access to the rendered DOM, as opposed to the raw HTML source.
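
From Ruby, the poltergeist gem drives PhantomJS behind Capybara's API. A rough sketch, with a placeholder URL and selector:

    require 'capybara'
    require 'capybara/poltergeist'

    # Register PhantomJS as a Capybara driver
    Capybara.register_driver :poltergeist do |app|
      Capybara::Poltergeist::Driver.new(app, js_errors: false)
    end

    session = Capybara::Session.new(:poltergeist)
    session.visit('https://example.com/app')

    # This is the rendered DOM, after the JavaScript has run
    puts session.find('#results').text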

Jonathan Vanasco

August 3rd, 2013

Ruby on Rails is a web-app development framework -- basically the exact thing you don't want to scrape with.

Generally speaking, you want the scrapers to either be their own daemon/service or be dispatched tasks from a messaging queue. Requesting web pages is a blocking operation, and scraping often creates additional tasks (i.e., you derive another page to scrape), so those are things to consider as well. Scraping is often best implemented in an event-driven, asynchronous framework.

If you can start from scratch, I'd probably do everything in Erlang ( or Node.js ).

If you want to stay in Ruby, you should look at Redis + Resque (https://github.com/resque/resque).
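
A Resque worker for this pattern is just a class with a queue name and a perform method. A sketch, with made-up class and queue names:

    require 'resque'
    require 'open-uri'
    require 'nokogiri'

    class ScrapePage
      @queue = :scrape

      def self.perform(url)
        doc = Nokogiri::HTML(URI.open(url))

        # ... extract what you need from doc ...

        # Scraping derives more work: push follow-up pages
        # back onto the queue instead of fetching them inline
        doc.css('a.next').each do |link|
          Resque.enqueue(ScrapePage, link['href'])
        end
      end
    end

    # Seed the queue:
    # Resque.enqueue(ScrapePage, 'https://example.com/page/1')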

In Python, you can do some decent scraping with Redis + Celery for task management. You can also do everything in Twisted Python; I've done both with great results. I'm admittedly biased toward Python, but it has the BeautifulSoup library for parsing and navigating HTML documents, and that makes pulling data out of the scraped pages way, way easier.

If you can avoid scraping, I'd suggest doing so. There are companies like Embedly (http://embed.ly/) that offer an API giving you most of the data you'd get from scraping, with a lot less work.