By James Haughey, Data Science Consultant, Altius
There are countless reasons why data science projects fail. Luckily, there are also plenty of solutions to mitigate those risks. Here at Altius, we’ve adopted many tools from traditional software development domains that ensure our data science projects are successful.
The primary goal of any data scientist is to implement a good model – something that will meet the brief and make an impact. That could mean anything from an e-commerce business improving customer spend to a health technology company detecting cancer earlier. It’s all very well building an amazing predictive model, but it’s no good if the business can’t use it because it can’t actually be integrated into the systems where it is required.
How do you achieve this? Our approach is to outline what is required to make sure your models are built consistently. This ensures the production team understands what they’re getting, and it minimises the time and effort it takes to put a model into production. The guidelines should describe what is needed to make models as transferable as possible. If the team adopts a standard approach to each project, this will also make it easier when new members come on board. They will quickly understand how things work and where to find answers to their questions. It also helps data scientists to transition seamlessly between projects.
So, what are the key questions to ask to ensure a successful data science project?
Is your code properly documented?
Models can and do degrade. They can go stale over time. That’s why their performance should be monitored and action should be taken when the model falls below an acceptable level. This may involve retraining on fresh data, tweaking it slightly or completely refitting the model. And it’s not unusual for a different data scientist to revisit the project at this point. If you haven’t documented your code properly, it makes it hard, and sometimes even impossible, for someone else to use, adapt and extend it. There are several tools available to automatically generate code documentation. Use them!
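As a rough illustration of both points, here is a minimal Python sketch of a documented monitoring check. The function name `needs_retraining`, its parameters and the idea of averaging recent scores are assumptions made for this example, not a prescribed method. The structured docstring is the kind of text that documentation generators such as pydoc or Sphinx can pick up.

```python
from statistics import fmean


def needs_retraining(recent_scores, threshold, min_observations=3):
    """Decide whether a model's recent performance warrants action.

    Hypothetical helper for illustration only.

    Args:
        recent_scores: Evaluation scores from the most recent monitoring
            runs, where a higher score means a better model.
        threshold: The lowest acceptable mean score.
        min_observations: Minimum number of scores needed before a
            decision is made, to avoid reacting to a single bad run.

    Returns:
        True if the mean of recent_scores is below threshold.

    Raises:
        ValueError: If fewer than min_observations scores are supplied.
    """
    if len(recent_scores) < min_observations:
        raise ValueError(
            f"need at least {min_observations} scores, got {len(recent_scores)}"
        )
    # Compare the average recent score with the agreed acceptable level.
    return fmean(recent_scores) < threshold
```

A data scientist picking up the project later can read the docstring to see what the check does, what counts as "acceptable" and why a minimum number of observations is required, without reverse-engineering the code.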
One often overlooked aspect of documentation involves having guidelines in place for documenting the process and decisions made during a project. Data scientists are constantly evaluating and making decisions and these need to be documented so that someone else, who comes into the project, can understand and learn why certain decisions were made. For example, you might choose not to use a particular data set because it’s deemed unethical or because the data is too expensive. If this is documented properly, then it will prevent duplication of effort and save a great deal of time and resource.
Are you testing correctly?
Data science models can break. Your model can fall over if it encounters exotic new data in production. Bugs can also be introduced through future changes to the codebase. The way to guard against a model failing is by adopting good testing practices. In the world of software development, testing is well-defined and highly structured. From unit tests to integration tests and everything in between, there are a host of resources and tools available.
In data science, testing is a little more ambiguous. It’s less clear what you should be testing – the code; the data; your assumptions; everything!? You need guidelines detailing those elements that you need to be certain of. You don’t need to unit test everything or have 100% code coverage, but you do need to know if the standard errors are too large, or if the data you’re fitting is approximately normal. Codifying these kinds of requirements as tests makes it easy to keep them in check and hand them over to others who don’t know (or need to know) the intricacies of your model.
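To make this concrete, here is a minimal sketch of how such requirements might be codified in Python using only the standard library. The function names and the limits passed in are hypothetical, and skewness is used here only as a crude stand-in for "approximately normal". A formal test such as Shapiro-Wilk may suit your data better.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def skewness(values):
    """Population skewness (third standardised moment); 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def check_assumptions(values, max_standard_error, max_abs_skew):
    """Return a list of human-readable failures; an empty list means all checks passed.

    The limits are assumptions that each project would set for itself.
    """
    failures = []
    se = standard_error(values)
    if se > max_standard_error:
        failures.append(f"standard error {se:.3f} exceeds {max_standard_error}")
    skew = skewness(values)
    if abs(skew) > max_abs_skew:
        failures.append(f"absolute skewness {abs(skew):.3f} exceeds {max_abs_skew}")
    return failures
```

Run as part of a test suite, a check like this reports which assumption broke in plain language, so a colleague can act on the failure without knowing how the model works internally.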
How good is your version control?
Choosing which version control system (VCS) to use will depend on your team’s chosen approach to deployment, working collaboratively and rolling back changes. Explicitly outlining what these approaches are for your project will make it straightforward for everyone to contribute. For the majority of data science projects, this tool will be Git. The next key step is taking the time to master your chosen VCS. This is an investment in time that will pay dividends in the long run.
There are of course plenty of other reasons that data science projects fail, and plenty of other tools and solutions that a data science team should put in place to achieve success.
Is your enterprise looking to take advantage of data science? Don’t hesitate to get in touch with us to find out how we can help.