ML System Design

[Originally posted at:]


This blog is part of a series on designing and building AI systems. It aims to cover how AI, software engineering and product design interact in the real world. As an “AI Solution Architect / AI Engineer”, I often sit between the worlds of software engineering and analytics/data science. As part of my job in this middle ground, I explain the complexities and challenges of one side to the other, to help deliver functional technologies, systems and products. There are several unique aspects to systems that are supposedly “AI-driven”, but first, to set the scene, let’s visit each at a high level and get some real words behind the jargon.

What do we mean by “AI”?

AI means many things to many people, but here we will very loosely take it to mean the use of complex data to predict novel outcomes using machine-learning algorithms. That is, I admit, pretty generic. However, there are several key points to walk through here.

First, what do we mean by “using complex data”? This is the idea that by joining and using larger datasets from multiple sources, new insights can be drawn from sophisticated data analysis. A complex dataset can be created by joining previously unused data, extracting information from sources previously considered unreadable, or adding new sources tangential to the original dataset. The analysis of this dataset can be visual, statistical, or based on unsupervised machine-learning algorithms. The aim of this analysis is to arrive at a concrete hypothesis with which to train a machine-learning algorithm.
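To make “joining” concrete, here is a minimal sketch in plain Python. The function name `left_join`, the field names and the sample rows are all made up for illustration; a real project would more likely use a dataframe library or SQL.

```python
def left_join(primary, secondary, key):
    """Attach fields from `secondary` to each row of `primary` that shares `key`.

    Rows in `primary` without a match are kept unchanged. If both sources
    define the same field, the value from `primary` wins.
    """
    lookup = {row[key]: row for row in secondary}
    joined = []
    for row in primary:
        extra = lookup.get(row[key], {})
        joined.append({**extra, **row})
    return joined


# Hypothetical sources: shopping data plus a tangential customer profile source.
orders = [
    {"customer_id": 1, "basket": 20.0},
    {"customer_id": 2, "basket": 5.0},
]
profiles = [{"customer_id": 1, "region": "north"}]

dataset = left_join(orders, profiles, key="customer_id")
```

The joined rows carry fields from both sources, which is the kind of widened dataset the analysis step would then explore.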

Second, the use of machine-learning algorithms to predict outcomes is simply that: each machine-learning algorithm is just a tool that enables complex correlations to be inferred from the data. The resulting predictions are in some cases better than a human’s (predicting your next chess move), in some cases worse (dog vs. gun, anyone?), and in some cases roughly on par (facial recognition?).

Combining these two gives us the potential to process information better than manual analysis, or to predict outcomes that would not be guessed at by human analysts, and as such, it becomes “Artificial Intelligence”.

What is an AI system?

An AI system is then a software application that uses AI, i.e. processes complex data and uses machine-learning algorithms, to present information to users. Building software applications themselves is (mostly) well understood, with many standard patterns for different application formats. However, at the time of writing (Autumn 2020), there are multiple ways “to build AI”, with various immature technologies. Our AI developer must gather data, store data, build the machine-learning algorithm, save the machine-learning algorithm, and then execute the machine-learning algorithm, with no standards or common systems in place.

The immaturity of these technologies makes choosing a technology stack within an enterprise very difficult, leading to inconsistency and incompatibility between approaches. This inconsistency forces significant re-engineering when promoting AI systems from prototype to production, increasing the cost and time it takes to go from a prototype data science idea to running an ML pipeline live for application users.

This has the knock-on effect of restricting the use of ML to a select set of use cases: either one-off investigations and analysis (where reproducibility is not a criterion) or extremely basic production implementations (Google’s MLOps Level 0), which, through their lack of maturity, often underperform and lead to suboptimal business outcomes. This is where an AI Architect is often the right person to propose, design, build and operate ML systems. She can suggest potentially valuable use cases, offer opinions on the right technologies, advise analysts and data scientists on how to build algorithms, and then oversee delivery teams as they implement the results in a working system.

The design of AI Systems

Designing an AI system, then, is about several key things:

  • Users, and how they interact with the system when algorithm-driven outputs are presented to them.
  • The system, and how it deals with algorithm-created results.
  • The data, and how we ingest complex data sources.
  • The algorithms, and how they are created, managed, and executed.

Considering users, and how they respond to a system, is a key part of any product design, but it is crucial when you present information that they may not “guess” or “expect”, as their response is likely to determine the usability and success of your product. For example, in recommendation systems such as Netflix, YouTube or Spotify, presenting well-matched media increases user satisfaction and engagement, whilst presenting jarring or irrelevant recommendations may negatively impact adoption. Additionally, presenting novel recommendations at the “right” time, or to the “right” users, can vastly increase the usability of the product.

The algorithmically created predictions presented to the user need to be stored and processed by the system. As these results are not pre-determined, the range of potential results needs to be considered: this may be all possible numerical values, all media titles in the database, or an effectively infinite range of possible sentence and language replies, as in the case of Amazon’s Alexa. These results must be stored, monitored (often for risk and compliance) and then presented to the user in a format compatible with the application. Recording metrics about what is presented, and when, is extremely useful for managing the system and for auditing and explaining the algorithm’s behaviour.
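As a rough illustration of storing and monitoring algorithm outputs, the sketch below records each prediction with enough context to audit it later and rejects results that fall outside an expected range. The names `PredictionRecord` and `PredictionLog`, and the in-memory list standing in for a real store, are assumptions made for this example only.

```python
import json
import time
from dataclasses import dataclass, asdict, field
from typing import Any


@dataclass
class PredictionRecord:
    """One algorithm output, kept with the context needed to explain it."""
    model_version: str
    user_id: str
    inputs: dict
    output: Any
    timestamp: float = field(default_factory=time.time)


class PredictionLog:
    """In-memory stand-in for a real audit store (database, log stream, ...)."""

    def __init__(self, allowed_outputs=None):
        self.records = []
        self.rejected = 0
        # Optional set describing the range of results the system expects.
        self.allowed_outputs = allowed_outputs

    def record(self, rec):
        """Store the record, or count it as rejected if it is out of range."""
        if self.allowed_outputs is not None and rec.output not in self.allowed_outputs:
            self.rejected += 1
            return False
        self.records.append(rec)
        return True

    def to_json_lines(self):
        """Serialise stored records, one JSON object per line, for auditing."""
        return "\n".join(json.dumps(asdict(r)) for r in self.records)


# Hypothetical usage: only titles present in the catalogue may be shown.
log = PredictionLog(allowed_outputs={"Film A", "Film B"})
log.record(PredictionRecord("v1", "user-1", {"last_watched": "Film C"}, "Film A"))
log.record(PredictionRecord("v1", "user-2", {}, "Not A Title"))
```

Keeping the model version and inputs alongside each output is what later makes it possible to answer “what was presented, when, and why”.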

The data that flows through the system is critical to providing the right result to the user. This data can be loosely separated into three sorts, each useful for its own stage of the process. To explore, innovate and create the algorithms, a dataset of historic (and potentially simulated future) data is required to learn from. This can be user logs, shopping data, market or financial data, social network posts, internal enterprise documentation around processes and results, or any of the other data sources that exist today.

Once the algorithms have been trained, they should be benchmarked against test data to ensure quality. This test data may contain specific real or simulated situations where the algorithmic results are of particular importance. Lastly, the live production data is fed through the system for inference by the algorithm. In more advanced cases, live feedback and new or revised data are used to automatically retrain the algorithm without human intervention. To keep the algorithms as relevant as possible, the data in the live system should be as similar to, and as accurate as, the data used for development.
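The two checks described above, benchmarking on test data and comparing live data with development data, can be sketched in a few lines. This is only one simple way to do it: the accuracy threshold and the mean-shift measure are illustrative choices, and the function names are invented for this example.

```python
from statistics import mean, pstdev


def accuracy(predict, examples):
    """Fraction of (features, label) pairs that `predict` gets right."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


def passes_benchmark(predict, test_examples, min_accuracy=0.9):
    """Gate a trained algorithm on held-out test data before it goes live."""
    return accuracy(predict, test_examples) >= min_accuracy


def mean_shift(development_values, live_values):
    """How far the live mean has moved, in development standard deviations.

    A crude similarity check for one numeric feature; larger values suggest
    the live data no longer resembles the data used for development.
    """
    spread = pstdev(development_values)
    gap = abs(mean(live_values) - mean(development_values))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


# Hypothetical classifier and test set, for illustration only.
def is_positive(x):
    return x > 0


test_set = [(1, True), (-1, False), (2, True), (-2, True)]
ready = passes_benchmark(is_positive, test_set, min_accuracy=0.9)
shift = mean_shift([0, 2], [3, 3])
```

In practice the test set would include the specific real or simulated situations that matter most, and the shift check would run per feature on a schedule.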

At the heart of the system, machine-learning algorithms need to be created, maintained and executed. There are several ways for our data scientist to create these algorithms, ranging from personal computers, to clusters, to automated cloud-native algorithm training systems. The environment can vary immensely, and a balance should be found between standardisation and retaining flexibility for innovation. Once she has created an algorithm, this piece of software must be saved, versioned, deployed and archived using a model management system. As part of the process, promising models should be continuously tested and monitored, to understand when they should be used live. The current live algorithms will be executed within a data pipeline, as part of a compute processing sequence, and run either once per event, in an event-based manner, or in batch, on a time-based load of data. Each of these patterns requires different software frameworks, data pipelines, and management and operations, adding complexity to the whole process.
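The lifecycle above (save, version, deploy, archive, then execute per event or in batch) can be sketched with a toy registry. Everything here is an assumption for illustration: the `ModelRegistry` class, the stage names, and the “model”, which is just a dictionary of linear weights stored as JSON rather than a real trained artefact.

```python
import json


def predict(params, features):
    """Toy linear model: weighted sum of the features plus a bias."""
    return sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]


class ModelRegistry:
    """Minimal stand-in for a model management system."""

    def __init__(self):
        self._blobs = {}    # version -> serialised model
        self._stages = {}   # version -> "candidate" | "live" | "archived"
        self._live = None

    def save(self, params):
        """Store a new model and return its version number."""
        version = len(self._blobs) + 1
        self._blobs[version] = json.dumps(params)
        self._stages[version] = "candidate"
        return version

    def promote(self, version):
        """Deploy `version` as live and archive the previous live model."""
        if self._live is not None:
            self._stages[self._live] = "archived"
        self._stages[version] = "live"
        self._live = version

    def stage_of(self, version):
        return self._stages[version]

    def load_live(self):
        if self._live is None:
            raise LookupError("no live model has been promoted yet")
        return json.loads(self._blobs[self._live])


def handle_event(registry, features):
    """Event-based execution: score a single input as it arrives."""
    return predict(registry.load_live(), features)


def run_batch(registry, rows):
    """Batch execution: load the live model once, then score a whole load."""
    params = registry.load_live()
    return [predict(params, row) for row in rows]


registry = ModelRegistry()
v1 = registry.save({"weights": [1.0, 2.0], "bias": 0.5})
registry.promote(v1)
single = handle_event(registry, [1, 1])
```

Even in this toy form, the two execution paths differ in when the model is loaded and how inputs arrive, which hints at why each pattern tends to need its own framework and operations.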

An effective application must often manage all of the above throughout its lifecycle, and designing an architecture to manage this is key to operating a system containing AI. By carefully considering each of these components, we can construct reusable systems that enhance the human experience and improve people’s lives, as automation and machine-driven actions are used to improve the operational efficiency of the world.


I’ve given an overview of some of the key aspects of AI system design, and in the following blog posts I’ll dive deeper into each aspect, and why careful design and build are key to success. Of course, the real question is why “AI” system design is worth considering, and whether it is any more special than Database/Network/Muppet/Whale or any other noun-based system design. My proposal would be that they are all relevant, and that the requirement to deal with arbitrary machine-generated predictions makes AI systems their own special case.