What are some good free web scrapers & techniques? If you've used one, please share your story.

Marlina Kinnersley

August 3rd, 2013

Looking to learn more about web scraping and to hear from those who've used these tools. If you've used one (or several), please share your purpose, the scraper's name or link, and your results! (RoR-oriented ones would be most helpful!) Cheers!

Saïd PhD Tech Entrepreneur, Search Expert, Digital Marketer

January 12th, 2017

I used Nutch. It's a very powerful open-source solution: http://nutch.apache.org/


You can find tons of Python solutions. I'm not going to advertise any of them here. Several of them are pretty good.


You can also find working scrapers on GitHub. Sometimes you can even find scrapers adapted to specific websites.


Cheers,

Nextal ATS (https://www.nextal.com)

Michael Brill Technology startup exec focused on AI-driven products

August 3rd, 2013

Hi Marlina.

I've been scraping up a storm lately - learning a fair bit about copyright law and proxy meshes in the process. My general assessment is that if you have development skills, you're probably better off using your favorite language rather than a purpose-built tool. I looked at OutWit and Mozenda... and while they would both have done what I needed, each is yet another technology to learn and support.

No experience with RoR scraping libraries, but I'm sure if you Google around, one or two will float to the top. I looked at Scrapy for Python and it looked quite good. Since we're working in Node.js, I just used two modules (request and cheerio), and it's ridiculous how simple it is. Since many sites have anti-scraping technology installed, you may end up having to use a VPN that cycles through IPs. I just used Hotspot Shield ($30/year) and that's worked on everything so far.

Make sure you know about IP issues and litigation risks. 

...Michael

Anonymous

March 1st, 2017

We've had good success leveraging Beautiful Soup (written in Python) in this regard. The Linux "wget" command also has the ability to spider, scrape, and copy a website.

Harshit Rastogi

August 3rd, 2013

I have used Beautiful Soup in Python; that's a good place to begin. But I've realized that I prefer using XPath to get the data, since it doesn't have a learning curve for me.

I used Ruby and Nokogiri.

Scott Fairbairn

August 3rd, 2013

Hi Marlina,

We've used BeautifulSoup, which is a Python library, successfully in the past. As Michael mentioned, when it comes to scraping data (hopefully legally), you run into all kinds of corner cases that off-the-shelf products might not handle gracefully.

Writing the code directly seems to work best, at least for us.

-Scott

Toddy Mladenov CTO and Co-Founder at Agitare Technologies Inc.

August 5th, 2013

I used http://scrapy.org/ (Python) for downloading Yahoo! Finance information, and it required only 40 lines of code. Very easy to learn and use.

David Hunter Machine Learning Research, University of Oxford

August 3rd, 2013

Been looking at this very recently and I agree with Michael that it's better to write your own scraper if you can.

I'm a Python addict and have so far found Splinter (http://splinter.cobrateam.info/) to be perfect for pretty much any application, including scraping 'ajaxy' data rendered and hidden with JavaScript.

David

Anonymous

August 3rd, 2013

I am interested in this topic because we may go down the web scraping path in the future. It is on my radar, but I am by no means an expert. In the past I worked at a company where a different group than mine did web scraping; I know they worked very hard dealing with badly formed HTML, JavaScript issues, etc., on different websites.

So unless you have significant development resources and you need to scrape just a handful of sites, you may be better off seeing whether any companies or open-source projects have already learnt the lessons and fought the battles of web scraping. That way you can concentrate on your core competencies.

By the way, a company that adopted this strategy fairly successfully is Mint.com; they used Yodlee to do the web scraping for them. That was circa 2006 - both companies seem to have moved on since then.

You might want to look at this discussion http://www.quora.com/Web-Scraping/What-are-some-good-free-web-scrapers-scraping-techniques

Finally, in my opinion, the hard job in web scraping is handling the idiosyncrasies of various websites, so if there is a great non-RoR tool, it may not be that hard to integrate it with an RoR back end to handle the results of scraping.

Best of luck,
Manu  

Jesal Gadhia

August 3rd, 2013

Take a look at these two Ruby web scraping libraries, Pismo and Mechanize.

I've played around with both in the past with decent results. I've also used https://github.com/sparklemotion/nokogiri, which is more bare metal but gives you more flexibility. (Pismo and Mechanize use Nokogiri on the back end.)

Jerome Dangu CTO & Co-Founder at ClarityAd

August 4th, 2013

PhantomJS is a headless browser that is especially useful for sites that rely heavily on JavaScript.
You can access the rendered DOM, as opposed to the raw HTML source.