Have a look at Amazon: they recently released Amazon Textract, a service for extracting data from scanned documents that abstracts the "ML" part completely away from you. Provided your confidentiality requirements allow it, use a service for text extraction and document processing.
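As a rough sketch of what that looks like in practice: Textract's DetectDocumentText API returns a response full of Blocks, and flattening the LINE blocks gives you plain text. The helper below parses that documented response shape; the actual API call (which assumes configured AWS credentials and boto3) is kept separate, and the sample response is illustrative, not real output.

```python
def textract_lines(response):
    """Collect the text of LINE blocks from a Textract response dict."""
    return [b["Text"] for b in response.get("Blocks", [])
            if b.get("BlockType") == "LINE"]

def extract_text(path):
    # Requires boto3 and AWS credentials; only needed for the real call.
    import boto3
    client = boto3.client("textract")
    with open(path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return "\n".join(textract_lines(response))

# Illustrative response shape, matching the documented Blocks structure:
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Invoice #123"},
    {"BlockType": "LINE", "Text": "Total: $45.00"},
]}
print(textract_lines(sample))  # ['Invoice #123', 'Total: $45.00']
```

The point being: the "ML" side is entirely on Amazon's end; your code is just parsing a JSON response.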
As the above comment said: you shouldn't be building this.
I believe the question is not about scanned documents; it is about parsing data regardless of file format. I am familiar with one such project, though I did not work on it. If you need to import a specific type of data, for example typical reporting information for a given industry, it is possible to classify/profile what your data looks like. You can then automatically import it from arbitrarily formatted documents, e.g. a spreadsheet with an undefined layout, or from scanned documents. That particular project took a couple of years to get off the ground; an MVP would be possible within 6 months with the right team.
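To make the "profile how your data looks" idea concrete, here is a toy sketch: guess which row of an arbitrarily laid-out spreadsheet is the header by finding the first mostly-non-numeric row sitting above a mostly-numeric row. Real systems use far richer signals (type inference, value distributions, known industry vocabularies); the heuristic and sample data here are purely illustrative.

```python
def looks_numeric(cell):
    """Very rough numeric test, tolerating $ signs and thousands separators."""
    try:
        float(str(cell).replace(",", "").replace("$", ""))
        return True
    except ValueError:
        return False

def guess_header_row(rows):
    """Return the index of the first text-heavy row above a numeric-heavy row."""
    for i in range(len(rows) - 1):
        texty = sum(not looks_numeric(c) for c in rows[i])
        numeric_below = sum(looks_numeric(c) for c in rows[i + 1])
        if texty > len(rows[i]) / 2 and numeric_below > len(rows[i + 1]) / 2:
            return i
    return 0  # fall back to the first row

sheet = [
    ["Quarterly report", "", ""],     # title/junk row
    ["Region", "Units", "Revenue"],   # actual header
    ["North", "120", "$4,500"],
    ["South", "95", "$3,100"],
]
print(guess_header_row(sheet))  # 1
```

Heuristics like this are why such projects take time: each new document family breaks an assumption somewhere.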
Here is the library behind text recognition for Google Translate: https://github.com/tesseract-ocr/tesseract
From the documentation page:
Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".
Tesseract supports various output formats: plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV. The master branch also has experimental support for ALTO (XML) output.
Seems like this is what you are looking for. Developers could run it as a microservice (in a Docker container, for example), then customize/integrate it with the customer application.
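For the output formats mentioned above, the Tesseract CLI takes config names (hocr, pdf, tsv) after the image and output base; plain text is the default. A small wrapper (hypothetical helper names, assuming the tesseract binary is on PATH) might look like:

```python
import subprocess

def tesseract_cmd(image, out_base, fmt="txt", lang="eng"):
    """Build the tesseract invocation for one of its output formats."""
    configs = {"txt": [], "hocr": ["hocr"], "pdf": ["pdf"], "tsv": ["tsv"]}
    if fmt not in configs:
        raise ValueError(f"unsupported format: {fmt}")
    return ["tesseract", image, out_base, "-l", lang] + configs[fmt]

def run_ocr(image, out_base, fmt="txt"):
    # Requires the tesseract binary installed and on PATH.
    subprocess.run(tesseract_cmd(image, out_base, fmt), check=True)

print(tesseract_cmd("scan.png", "out", "hocr"))
# ['tesseract', 'scan.png', 'out', '-l', 'eng', 'hocr']
```

Wrapped in a Dockerfile that installs tesseract plus the language packs you need, that is essentially the whole microservice.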
Not sure if by 'machine learning' you mean OCR, full-blown AI, or a combination, but of course it depends on how many doc types you need to accommodate, how many people are on your team, and their skill level (have they done this before??). Say your typical project outline looks like this: three custom extractors, one each for the PDF, Word, and HTML doc types; testing and training your AI algorithms; building a database to store/manage your results; a front-end interface or portal so that people can actually access and use the data; and some reporting.
For a project of this size, with say three developers (1 AI, 2 full-stack) and a designer, you are looking at 3-6 months minimum for an enterprise-class MVP.
You may benefit from a zonal OCR application. ABBYY is the expensive market leader; other options include Nuance OmniPage and Docparser. There are plenty of applications out there that scrape data, including ones that do it as a service on a per-document fee basis. You shouldn't be building this.
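For context on what "zonal" means here: these tools OCR the whole page, then keep only the words whose bounding boxes fall inside predefined zones of a known form (e.g. the invoice-number region). The core idea, stripped of any particular product, is just a box filter over OCR word coordinates; the data below is illustrative.

```python
def words_in_zone(words, zone):
    """Join the text of words whose box center lies inside the zone.

    zone and each word's box are (left, top, width, height) in pixels.
    """
    zl, zt, zw, zh = zone
    picked = []
    for w in words:
        l, t, ww, wh = w["box"]
        cx, cy = l + ww / 2, t + wh / 2
        if zl <= cx <= zl + zw and zt <= cy <= zt + zh:
            picked.append(w["text"])
    return " ".join(picked)

# Word boxes as any OCR engine (e.g. Tesseract's TSV output) would report them:
words = [
    {"text": "Invoice", "box": (40, 30, 60, 12)},
    {"text": "No.",     "box": (105, 30, 25, 12)},
    {"text": "8841",    "box": (135, 30, 35, 12)},
    {"text": "Total",   "box": (40, 400, 40, 12)},
]
print(words_in_zone(words, (30, 20, 160, 30)))  # Invoice No. 8841
```

The commercial products earn their price with layout templates, deskewing, and validation workflows on top of this, which is a large part of why building it yourself rarely pays off.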