Web Scraping - Introduction to methodology

Andrew Lampert Intent to improve and enhance my performance.

Last updated on February 22nd, 2017

I am looking for resources on how to learn to develop my own web scraping processes. I have little technical training, so I would like to begin with the background needed to develop a web scraping bot: programming methods, software and hardware requirements, and so on. What technical training would be beneficial? Thank you for your help.


Updates:

  • I am familiar with SQL, but I am willing to study Java, HTML, or whatever other language best practices require.
  • If this is an outdated process, what methods have replaced it?



James Abel Software/Hardware Engineer

Last updated on February 22nd, 2017

While I agree with the other post that web scraping is a little old school, if you absolutely have to do web scraping I'd suggest viewing this to get a feel for what's involved: https://youtu.be/3xQTJi2tqgk

YMalik Founder of ScanBuffer (website malware monitoring)

Last updated on February 24th, 2017

It depends on what you wish to achieve with web scraping. For example, are you trying to discover website metadata (e.g. WHOIS info), or do you want to generate content from other sites by parsing their webpages? The first is easily doable, but for webpage parsing you should pick a language that provides good tools for parsing HTML elements and that gracefully handles malformed HTML tags.
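To give a feel for the webpage-parsing case, here is a minimal sketch in Python using only the standard library's `html.parser`, which keeps going when it meets unclosed tags. The sample page, the class name `TitleAndLinkParser`, and the choice to collect the title and links are all assumptions made for illustration; a real project would likely use a fuller parsing library.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and every <a href=...> target."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title>...</title> is kept.
        if self._in_title:
            self.title += data


# Hypothetical sample page with sloppy markup: <p>, <b> and the
# second <a> are never closed.
SAMPLE_HTML = """
<html><head><title>Example Shop</title></head>
<body>
<p>Welcome <b>to the shop
<a href="/products">Products</a>
<a href="/contact">Contact
</body></html>
"""

parser = TitleAndLinkParser()
parser.feed(SAMPLE_HTML)
parser.close()

print(parser.title)
print(parser.links)
```

The parser is event driven: it calls the `handle_*` methods as it meets tags and text, so missing closing tags do not stop it from reporting the links it has already seen.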


If you wish to discover website statistics, you may be better off buying them online from sites like Majestic or BuiltWith. You may also find it useful to check the Google Search API to find relevant sites.


In our product, ScanBuffer, we parse webpages to find malicious content. That demands parsing all HTML elements, including JavaScript, which requires syntax parsing as well. I hope you are not trying to do something that complex.


Best of luck!!

Steve Karmeinsky CoFounder City Meets Tech / Lean Capital Ltd / Placeholder Ltd

February 22nd, 2017

You could use http://import.io

Randall Carr Staff Engineer at Sony Interactive Entertainment

February 22nd, 2017

Andrew, why do you want to do this? Scraping is a semi-old way of extracting information from websites and other web-accessible servers. While it was a reliable way to duplicate other people's data, it's rather frowned upon these days. Not to mention that it's a constantly changing environment: you'll always be updating the bot to keep up with changes to the sites. Additionally, it can be illegal to extract this information without permission. Given all that, it's very simple to create bots that wander through websites by following links to pages (if they exist). However, many websites are not really that scrapeable, since they're generated on the fly by server frameworks, scripting, and other dynamic methods.
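As a rough illustration of a bot that wanders through a site by following links, here is a minimal Python sketch. Everything named here is an assumption for the example: the `crawl` function, the `FAKE_SITE` dictionary standing in for real pages, and the `example.test` URLs. It does not touch the network; in real use the `fetch` argument would be a function that downloads a page, and you would want to check the site's terms and robots.txt first.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first walk over same-host links; returns URLs in visit order."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or not retrievable
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical in-memory "website" used instead of real HTTP requests.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="http://other.test/">out</a>',
    "http://example.test/b": '<a href="/">home</a> <a href="/missing">gone</a>',
}

pages = crawl("http://example.test/", FAKE_SITE.get)
print(pages)
```

The `seen` set keeps the bot from looping forever on pages that link back to each other, and the host check keeps it from drifting onto other sites. Note that this only sees links present in the raw HTML, so pages built on the fly by scripts would still need a different approach.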

Philippe Lachaise Co-founder at WHOZIC.COM

Last updated on February 22nd, 2017

Can you code? If so, what programming languages do you know?