Artificial intelligence · Machine learning

Has anyone worked on using machine learning to pull data from different document types? How long did the project take?

Herbert Roy George Enterprise and Technology expert with enormous digital experience

May 29th, 2019

Bojan Mijušković Technical, versatile Product/UX freak

May 30th, 2019

Have a look at Amazon: they recently released Amazon Textract, a service for extracting data from scanned documents that hides the "ML" part from you entirely. Provided your confidentiality requirements allow it, use a hosted service for text extraction and document processing.

As Venkat Rangamani's comment said: you shouldn't be building this.
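
For anyone curious what "using a service" looks like in practice, here is a minimal sketch in Python. It assumes the boto3 SDK is installed and AWS credentials are already configured; the function names lines_from_textract and extract_lines are made up for illustration, and only the synchronous detect_document_text call on a single image is shown.

```python
from __future__ import annotations

from typing import Any


def lines_from_textract(response: dict[str, Any]) -> list[str]:
    """Collect the text of every LINE block from a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def extract_lines(path: str) -> list[str]:
    """Send one image file to Textract and return its recognized lines.

    Assumption: credentials and region come from the usual AWS configuration.
    """
    import boto3  # imported lazily so the parser above works without AWS set up

    client = boto3.client("textract")
    with open(path, "rb") as handle:
        response = client.detect_document_text(Document={"Bytes": handle.read()})
    return lines_from_textract(response)
```

Keeping the response parsing separate from the network call means the parsing can be checked against a saved response without touching the service.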

Max Technology, cyber-security & privacy compliance, risk management. SaaS/AppSec, FinTech.

June 6th, 2019

I believe the question is not about scanned documents but about parsing data regardless of file format. I am familiar with one such project, though I did not work on it. If you need to import a specific type of data, for example typical reporting information for a given industry, it is possible to classify/profile what your data looks like. You can then automatically import it from arbitrarily formatted documents, e.g. a spreadsheet with an undefined layout, or from scanned documents. That particular project took a couple of years to get off the ground. An MVP is possible within 6 months with the right team.
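
To make the profiling idea concrete, here is a rough sketch in Python. It is not how the project above worked (I have no details on that); it only shows one simple way to recognize known fields in a table whose column order is unknown. The field names, regular expressions, and the 0.8 threshold are all made-up assumptions.

```python
import re

# Hypothetical profile: field names and patterns are illustrative only.
PROFILE = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "amount": re.compile(r"-?\d+\.\d{2}"),
    "account": re.compile(r"[A-Z]{2}\d{6}"),
}


def match_columns(rows, profile=PROFILE, threshold=0.8):
    """Map each profile field to the column index whose cells best fit it.

    Assumes every row has the same number of cells; a field is left out
    of the result when no column reaches the threshold.
    """
    columns = list(zip(*rows))  # transpose rows into columns
    mapping = {}
    for field, pattern in profile.items():
        best_index, best_score = None, 0.0
        for index, cells in enumerate(columns):
            filled = [cell.strip() for cell in cells if cell.strip()]
            if not filled:
                continue  # an empty column cannot match anything
            # Share of non-empty cells that fully match the field's pattern.
            score = sum(bool(pattern.fullmatch(c)) for c in filled) / len(filled)
            if score > best_score:
                best_index, best_score = index, score
        if best_score >= threshold:
            mapping[field] = best_index
    return mapping
```

A real system would need far richer profiles (and probably a trained classifier instead of regexes), which is a large part of why such projects take as long as they do.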

Aleksandr Sidorov Product manager

June 6th, 2019

Hello, Herbert!


Here is Tesseract, an open-source OCR engine long developed at Google: https://github.com/tesseract-ocr/tesseract


From the documentation page:

Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".


Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
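
As an illustration of the TSV output, here is a small Python sketch that runs the tesseract command-line tool and keeps the recognized words with their positions. It assumes tesseract is installed and on the PATH. The column names (left, top, width, height, conf, text) are the ones I understand the TSV output to use, so check them against your version. The function names and the confidence cutoff are my own.

```python
import csv
import io
import subprocess


def run_tesseract_tsv(image_path, lang="eng"):
    """Run the tesseract CLI on one image and return its TSV output."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang, "tsv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def words_from_tsv(tsv_text, min_conf=0.0):
    """Keep rows that carry text and meet the confidence cutoff."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # page, block and line rows have no text of their own
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        words.append(
            {
                "text": text,
                "conf": conf,
                "left": int(row["left"]),
                "top": int(row["top"]),
                "width": int(row["width"]),
                "height": int(row["height"]),
            }
        )
    return words
```

Typical use would be words_from_tsv(run_tesseract_tsv("scan.png"), min_conf=60), where scan.png is a placeholder file name.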


Seems like this is what you are looking for.


Developers can run it as a microservice (in a Dockerized version, for example), then customize it and integrate it with the customer's application.

Richard Strickland Interested in building businesses around chat bots and IOT

June 11th, 2019

Not sure whether by 'machine learning' you mean OCR, straight AI, or a combination, but of course it depends on how many doc types you need to accommodate, how many people are on your team, and the skill level of your team (have they done this before?). Say your typical project outline looks like this: 3 custom extractors, one each for the PDF, Word and HTML doc types; testing and training your AI algorithms; building a database to store and manage your results; a front-end interface or portal so that people can actually access and use the data; and some reporting.
For a project of this size with, say, 3 developers (1 AI, 2 full-stack) and a designer, you are looking at 3-6 months minimum for an enterprise-class MVP.
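
To give a feel for the "one extractor per doc type" part of that outline, here is a bare-bones Python sketch of the dispatch layer. Only the HTML extractor is filled in, using the standard library; the PDF and Word entries are placeholders, and every name in it (EXTRACTORS, extract, extract_html) is invented for illustration.

```python
from __future__ import annotations

from html.parser import HTMLParser
from pathlib import Path
from typing import Callable


class _TextCollector(HTMLParser):
    """Gather visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_html(raw: bytes) -> str:
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.parts)


def _not_built_yet(raw: bytes) -> str:
    # Placeholder: a real PDF or Word extractor would be plugged in here.
    raise NotImplementedError("plug in a PDF or Word extractor here")


# One entry per supported document type, keyed by file extension.
EXTRACTORS: dict[str, Callable[[bytes], str]] = {
    ".html": extract_html,
    ".htm": extract_html,
    ".pdf": _not_built_yet,
    ".docx": _not_built_yet,
}


def extract(filename: str, raw: bytes) -> str:
    """Pick an extractor by file extension and return the plain text."""
    suffix = Path(filename).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported document type: {suffix!r}") from None
    return extractor(raw)
```

The dispatch itself is the easy part; most of the 3-6 months goes into what sits behind each entry and into the training, storage and front end around it.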

Venkat Rangamani

May 29th, 2019

You may benefit from a zonal OCR application. ABBYY is the expensive market leader, but other options include Nuance OmniPage and Docparser. There are plenty of applications out there to scrape data, including ones that do it as a service on a per-document fee basis. You shouldn't be building this.
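
In case "zonal OCR" is unfamiliar: the idea is that you define named rectangles on a page template and read only the text that falls inside each one. A toy Python sketch of that step follows. It is not how ABBYY or the other products work internally; it assumes an OCR engine has already produced words with pixel bounding boxes, and the Zone class and read_zones function are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A named rectangle on the page, in the same pixel units as the words."""

    name: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom


def read_zones(words, zones):
    """Assign each word to the first zone containing its center point.

    words: iterable of dicts with text, left, top, width and height keys.
    Returns {zone name: text}, with words ordered top-to-bottom, then
    left-to-right (a naive reading order that assumes straight baselines).
    """
    hits = {zone.name: [] for zone in zones}
    for word in words:
        center_x = word["left"] + word["width"] / 2
        center_y = word["top"] + word["height"] / 2
        for zone in zones:
            if zone.contains(center_x, center_y):
                hits[zone.name].append(word)
                break
    return {
        name: " ".join(
            w["text"] for w in sorted(found, key=lambda w: (w["top"], w["left"]))
        )
        for name, found in hits.items()
    }
```

This only works when every document follows the same layout, which is the main limit of the zonal approach.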