I am investing in a text mining project and am trying to figure out how long a machine learning project takes. I have been in software project delivery for 20+ years, so I know how vague this question is. All I need is a ballpark idea of how long it will take and what the team size should be to develop a prototype. The project will involve the usual components: developing the core procedures to perform text mining, compiling a database, training the AI system, etc. What are the core phases of the project I should have in mind?
As others have already mentioned, it's much harder to estimate effort for ML projects than for software projects. The Google ML team, which has more than 4,000 ML projects in production, has reported that 70-80% of project time is typically spent on data engineering, with the remaining 20-30% going to model training and deployment. The quality of your data will be the biggest factor in determining how long development takes.
Also note that new literature and techniques come out every week that expand what can be done with ML. What is impossible or gives poor results today may be easy in 2-3 years. If you are building an "AI-first" company, it's best to build a working prototype of the technology using contrived example data and off-the-shelf algorithms with minor tuning. Show the proof of concept to your investors and raise enough capital to build a team and run for 2-3 years. That will give you the best chance of success: in that time your team will have enough runway to build out the core technology, accumulate data, and get ahead of market demand.
Advances in the literature will only extend your lead, since you can be first to integrate them. A lot of the time these "AI-first" companies are simply in an arms race to collect data: the technology either doesn't exist today or isn't good enough yet to do anything with that data, but when it catches up, the companies that own the data own the market.
Google researchers captured this in a paper titled "Machine Learning: The High Interest Credit Card of Technical Debt".
At a high level, a project would consist of these fundamental steps:
1) A data science person or team determines the features, or characteristics, of the problem that will be used in a model.
2) A neural network architecture is chosen that suits the problem. There are a few basic ones, with many variations of each:
a) Traditional (feedforward) neural net - Good for regression and classification problems, like predicting real estate prices in a subdivision; convolutional variants handle tasks like detecting objects in an image
b) Long Short-Term Memory (LSTM) - A recurrent architecture good for sequential data like time series, speech recognition, and natural language processing (NLP)
c) Reinforcement learning - Good for trying a bunch of different actions until something works, then exploiting what works. It's a different paradigm from (a) and (b), which are supervised learning approaches: a person validates the models' accuracy against labeled data.
3) Data scientists would work with software engineers to implement the model.
4) Training data representing real-world usage is gathered to train the model.
5) Validate, test, add more training data, repeat until you get acceptable results from the model.
6) Run the model in parallel with a working system and compare the results. Continue this step and tweak the model until it consistently meets or exceeds the accuracy of the working system.
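To make steps 3-5 concrete, here is a minimal sketch in Python: a toy bag-of-words Naive Bayes classifier trained on a handful of labeled examples and checked against a held-out validation set. The dataset, labels, and choice of classifier are all illustrative assumptions, not a prescription.

```python
# Sketch of steps 3-5: implement a model, train it on labeled examples,
# then validate on held-out data. The tiny dataset and bag-of-words
# Naive Bayes classifier here are illustrative assumptions.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label). Returns per-label word counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, label_counts

def predict(model, text):
    word_counts, label_counts = model
    total = sum(label_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihood with add-one smoothing
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data (step 4).
train_set = [
    ("invoice total amount due", "invoice"),
    ("amount due by end of month", "invoice"),
    ("claim filed for water damage", "claim"),
    ("damage claim under review", "claim"),
]
model = train(train_set)

# Held-out validation (step 5): measure accuracy, add data, repeat.
validation_set = [("total amount due soon", "invoice"),
                  ("water damage claim form", "claim")]
correct = sum(predict(model, t) == y for t, y in validation_set)
accuracy = correct / len(validation_set)
print(f"validation accuracy: {accuracy:.2f}")
```

In practice step 5 is a loop: when validation accuracy plateaus below your target, you go back and gather or annotate more data rather than only tuning the model.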
At my last job, before I left the workforce, I worked with a team for over 2 years developing multiple models in a horse race where the best model won out. The winning model would read documents via OCR, determine the document type, and extract the data of interest to prefill screens so data entry operators wouldn't have to type so much.
The first release was going into production when I left the company, and along the way I authored 3 patent applications and coauthored another.
Hope that helps.
We have been developing a platform to mine data out of insurance contracts. We use a combination of techniques involving ML classifiers, regular expression searches, and human operator QA and feedback. We have learned 3 key lessons so far:
1) Everything is harder and takes longer than you anticipate. While that's true of most software projects of any significant scope, I was surprised at how much nuance and tuning text extraction involves. Make sure you account for a significant investment of time and resources in non-technical tasks, such as annotating training data sets and testing, that are not immediately obvious.
2) AI/ML will definitely not be the whole solution. In our experience, polytomous (multi-class) classifiers produce more useful, actionable outcomes for text extraction than binary classifiers. Further, classifiers tend to perform more robustly when there is a fair amount of variation in the content being processed (e.g., NLP applications, sentiment analysis). Be prepared to recognize the limits of the automated system and figure out what the "last mile" of the solution needs to be.
3) You'll need humans. We have learned the quality statistics we need to pay attention to with our automated search components, and we are always working to incrementally improve these. In the meantime, be prepared to build a QA workflow, and make sure you implement a feedback loop into the automated search platform. You will need those trained human operators to validate or correct the output from the automated system, and it would be smart to use the same process to feedback into and improve that automation.
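As a rough illustration of the hybrid setup described in these lessons, here is a minimal Python sketch that combines a regular-expression extractor with a toy keyword classifier and routes low-confidence output to a human review queue. The policy-number pattern, keyword lists, and confidence threshold are all hypothetical.

```python
# Hybrid extraction sketch: regex for rigid patterns, a toy keyword
# classifier for fuzzier text, and a human-review flag for anything
# below a confidence threshold. All patterns and keywords are
# illustrative assumptions, not a real insurance schema.
import re

POLICY_RE = re.compile(r"\bPOL-\d{6}\b")  # hypothetical policy-number format

# Hypothetical keyword sets for a toy clause-type classifier.
KEYWORDS = {
    "exclusion": {"excluded", "exclusion", "except"},
    "coverage": {"covered", "coverage", "insured"},
}

def extract(text, threshold=0.5):
    """Return (result, needs_human_review)."""
    policy_ids = POLICY_RE.findall(text)
    words = set(text.lower().split())
    scores = {label: len(words & kws) / len(kws)
              for label, kws in KEYWORDS.items()}
    label, confidence = max(scores.items(), key=lambda kv: kv[1])
    result = {"policy_ids": policy_ids,
              "clause_type": label,
              "confidence": confidence}
    # Low-confidence output goes to the human QA queue; validated
    # corrections can be fed back in as new keywords or training data.
    return result, confidence < threshold

result, needs_review = extract("POL-123456: water damage is excluded")
print(result["policy_ids"], result["clause_type"], needs_review)
```

The important part is the last line of `extract`: the same threshold that triggers human review also defines where your feedback loop collects corrections to improve the automated components over time.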
Feel free to give me a shout on LinkedIn. I'm happy to share more about what we are implementing and hear a little about your project. Good luck!
As someone who runs an analytics and AI company, I agree with the others here: there's no way to tell how long a project might take with your limited information. I've written a book on the subject and wouldn't be able to give a client an honest assessment with what you've provided. What I can offer is this: the length of your project is going to be conditional on how clean your data is and how well you can wrangle it. Text mining can be much quicker than things like image recognition, so you've got that going for you, but that's not saying a lot. The thing I would point out is that you must think long and hard about the tools you use. If you go the TensorFlow route, you won't easily be able to convert your work to Caffe2 later, and vice versa. Each framework has its own model formats, APIs, and idioms; the work doesn't cross-pollinate, so switching mid-project usually means redoing much of it from scratch.
It's impossible to say without more information. Machine learning is fundamentally research, and research can involve false starts, iterating over different models and hypotheses, and reaching certain plateaus of accuracy that you may or may not be comfortable with.
Unless your dataset is very large, training time itself is unlikely to be a significant fraction of the overall project time. Most of it will be spent preprocessing and modeling the data.
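To illustrate the kind of preprocessing that eats most of that time, here is a small Python sketch of a text-normalization step. The specific cleaning rules and stopword list are illustrative assumptions; a real corpus will need far more of them, discovered iteratively.

```python
# Sketch of the preprocessing work that tends to dominate project time:
# normalizing raw text before any model sees it. The cleaning rules and
# stopword list below are toy assumptions, not a complete pipeline.
import re
import unicodedata

STOPWORDS = {"the", "a", "an", "of", "to", "and", "is"}  # toy list

def preprocess(raw):
    text = unicodedata.normalize("NFKC", raw)   # unify unicode forms
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation
    return [t for t in text.split() if t not in STOPWORDS]

print(preprocess("The TOTAL is $1,200 -- see https://example.com"))
```

Each rule here is a decision someone had to make after looking at real data, and that investigation, not the model training, is where the hours go.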