
What are the most effective ways to scrape content from websites?

Richard Pridham Investor, President & CEO at Retina Labs

October 31st, 2015

If you want to aggregate content from websites that do not offer API integration, what are the most effective methods? Let's say you want to pull consumer product review comments from Best Buy or Amazon. Can this be done? I recall from years past just how messy screen scraping was. Is this still the only option? Has the technology to do this improved in recent years? What if the content you want to aggregate requires a search in order to be found? For instance, let's say you want reviews pertaining to a specific product. If you were on Amazon, you'd need to search for the product first. How does screen scraping get around this? Can you run into legal issues and pushback if you attempt to collect content without consent?

In situations where a web site harbors product review comments and does not offer an API, what's the best approach to get them to share their data? I'm assuming that if they could monetize this content somehow they'd be receptive to the idea.

Andrew Lockley

October 31st, 2015

Import.io

Anton Yakovlev Founder of four successful businesses on two continents who can help you do the same

October 31st, 2015

Richard, to my knowledge, NLP (natural language processing) tools have improved greatly. We built a tool that scraped app descriptions from an Android app store and worked out which permissions each app needed to run on an Android smartphone. It worked with 80-90% accuracy.

Following the link Andrew provided, I quickly ran import.io over the e-commerce site my company runs in Russia. It correctly parsed 4 of 20 items on a page, which is 20% accuracy. Therefore, if you need a good scraper and the web site has no API, I believe you should build a modern NLP-based parser to get your data.

As for the legal issues: it is certainly legal to download content from any public site on the web and analyse it; Google, at least, does this all the time. What is not legal in many cases is publishing the downloaded content without the owner's consent. The purpose of the scraper should therefore be clearly defined. And yes, Amazon has an API that can give you all the information you need from them.

Hope that helps
   

Jackson Powell UI/UX Designer & Front End Developer

October 31st, 2015

Richard,

I'm getting 90%+ efficiency scraping with Kimono. I love these guys. Their tool is browser based and super user-friendly. Also very powerful in setting up your scrape.

There is a learning curve. They do respond to email questions. There is a free plan. 

I'm scraping millions of data points in days. @Kimono gives you lots of options to hook the scraped data into your database.

I haven't tried import.io so I can't comment on the differences. Though I don't need to since Kimono solves my problems.

Near Privman Googler, Startup Advisor

October 31st, 2015

The technology has improved greatly since the days when it was commonly called "screen scraping". Now more commonly called web scraping or data scraping, it is mostly done with automated headless browsers, which can send back cookies, execute JavaScript, etc., making it much easier from a technical point of view. If the site you are scraping doesn't defend itself against scraping (but simply hasn't bothered to expose the data through an API), you should be able to use a cloud scraping service like those offered by AWS and Google, for example.
The problem usually lies in the fact that you are probably doing something very much against the website owner's interests, and probably violating their terms of use, exposing yourself to lawsuits, etc.
In the examples you give (Best Buy, Amazon), the content is central to their value proposition to customers and provides a significant competitive advantage (people prefer to shop on Amazon because they trust the reviews there, for some reason). They would have to be convinced that they will gain directly from your use of the content, and that you will safeguard this content against their competitors as diligently as they do themselves (which would be tough for a startup to convince them of).
If you do not have their permission, you will probably find that they spend vast resources to foil scraping attempts, e.g. by blocking or serving fake responses to requests they are able to identify as scraping attempts.
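
One low-cost courtesy check before scraping is the site's robots.txt. The sketch below uses Python's standard urllib.robotparser on an inline, made-up robots.txt; the example.com URLs, the "review-bot" user agent, and the rules themselves are assumptions for illustration. Note that robots.txt is separate from a site's terms of use, so passing this check does not settle the legal question raised above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from
# the site's /robots.txt path (example.com below is a placeholder).
ROBOTS_TXT = """\
User-agent: *
Disallow: /reviews/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, agent: str = "review-bot") -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "https://example.com/reviews/123"))   # disallowed path
print(may_fetch(parser, "https://example.com/products/123"))  # allowed path
```

With these made-up rules the review path is refused and the product path is allowed. A site that actively defends itself may still block requests that robots.txt permits.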

Anonymous

January 6th, 2017

Check out the website at http://www.3idatascraping.com – the company provides high-quality website content scraping services at affordable prices. Get data and images from any website within your budget.

Rob Mitchell Senior Java Software Engineer at Direct Commerce

October 31st, 2015

Richard, I've worked a bit with some of what it sounds like you're trying to do and I can tell you from an engineering perspective, it is not trivial. Unless a company's website intends you to scrape or otherwise get their content, they will do many things to protect it including hiding it behind mostly dynamic HTML. 

What @NearPrivman talks about is quite true. 

If I were you, I would readjust the value proposition of what you're trying to do. Possibly reach out and establish business/partner relationships to make what you want a reality.

Moh'd Jebrini CTO at Mashvisor

October 31st, 2015

Hi, Richard 

Taking everything that has been said into consideration, there are still some engineering tools that could help you achieve your goal. There is no single solution, but you usually need to integrate a few technologies with each other to make sure you hit the target.

Import.io is quite a nice service, if you like it. Otherwise,
I suggest you check out these tools with your engineers: Portia (https://github.com/scrapinghub/portia) and Scrapy (http://scrapy.org).
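
Scrapy and Portia are third-party tools, so as a dependency-free illustration of the kind of extraction they automate, here is a minimal sketch using only Python's standard library. The search URL, the `q` parameter, and the `review-text` class name are hypothetical; a real site will use its own markup, and if it renders reviews with JavaScript this approach would not see them.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_search_url(base: str, query: str) -> str:
    """A site search is often just a GET request with query parameters."""
    return f"{base}?{urlencode({'q': query})}"


class ReviewExtractor(HTMLParser):
    """Collect the text of every <p class="review-text"> element."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._buffer = None  # None means "not currently inside a review"

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "review-text" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.reviews.append("".join(self._buffer).strip())
            self._buffer = None


def extract_reviews(html: str) -> list:
    """Return the review texts found in an HTML string."""
    parser = ReviewExtractor()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Made-up markup standing in for a downloaded search-results page.
SAMPLE_HTML = """
<div class="review"><p class="review-text">Great sound, weak battery.</p></div>
<div class="review"><p class="review-text">Stopped working after <b>two</b> weeks.</p></div>
<p class="footer">Not a review.</p>
"""

print(build_search_url("https://example.com/search", "wireless headphones"))
print(extract_reviews(SAMPLE_HTML))
```

The point of the split is that the search step and the extraction step are independent: the first only decides which page to request, the second only reads markup, so either can be swapped for a site-specific version.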




Richard Pridham Investor, President & CEO at Retina Labs

October 31st, 2015

So web scraping has improved but it's still messy and challenging from an engineering point of view. The idea I have revolves around product review aggregation and analysis from various sites that contain such info, social media, blogs, etc... Some may have APIs, others won't. The intended audience for the collected and analyzed data would be product manufacturers.

Chris Pointon Internet Entrepreneur and Technologist

October 31st, 2015

Check out the tools at http://scrapinghub.com - as others have said, it's a complex business retrieving structured data from modern dynamic websites, especially if they don't want you to. Many have terms of use specifically barring scraping or data aggregation.

Richard Pridham Investor, President & CEO at Retina Labs

October 31st, 2015

I'm not a programmer so I'm not sure what this means:

http://docs.aws.amazon.com/AWSECommerceService/latest/DG/EX_RetrievingCustomerReviews.html

Does this mean that Amazon allows product review retrieval?