The operating system on your laptop is running tens or hundreds of processes concurrently. It gives each process just the right amount of resources it needs (RAM, CPU, IO). It isolates each process in its own virtual address space, locks it down to a set of predefined permissions, allows processes to communicate with one another, and allows you, the user, to safely monitor and control them. The operating system abstracts away the hardware layer (writing to a flash drive is the same as writing to a hard drive), and it doesn't care what programming language or technology stack you used to write those apps: it just runs them, smoothly and consistently.
As machine learning penetrates the enterprise, companies will soon find themselves productionizing more and more models, and at a faster clip. Deployment efficiency, resource scaling, monitoring, and auditing will become harder and more expensive to sustain over time. Data scientists from different corners of the company will each have their own preferred technology stack (R, Python, Julia, TensorFlow, Caffe, deeplearning4j, H2O.ai, etc.), and data center strategies will shift from a single cloud to hybrid. Running, scaling, and monitoring heterogeneous models in a cloud-agnostic way is a responsibility analogous to that of an operating system, and that is what we want to talk about.
At Algorithmia, we run 8,000+ algorithms (each with multiple versions, which takes the figure to 40,000+ unique REST API endpoints). Any endpoint can be called anywhere from once a day to a burst of 1,000+ times a second, with no devops work required. Those algorithms are written in any of the 14 programming languages we support today, can be CPU- or GPU-based, run on any cloud, can read from and write to any data source (S3, Dropbox, etc.), and operate with a latency of ~15ms on standard hardware.
We see algorithmia.com as an OS for AI and this post will share some of our thinking.
Training vs. Inference
Machine and deep learning are made up of two distinct phases: training and inference. The former is about building the model, and the latter is about running it in production.
Training a model is an iterative process that is very framework dependent. Some machine learning engineers will use TensorFlow on GPUs, others will use scikit-learn on CPUs, and every training environment is a snowflake. This is analogous to building an app, where the developer has a carefully assembled toolchain and set of libraries and is constantly compiling and testing their code. Training requires a long compute cycle (hours to days), usually runs at a fixed load (meaning you don't need to scale from one machine to X machines in response to a trigger), and is a stateful process in which the data scientist will ideally save training progress repeatedly, such as neural network checkpoints.
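To make that statefulness concrete, here is a minimal sketch of a training loop that periodically saves checkpoints. The `train_step` and `save_checkpoint` helpers are illustrative placeholders, not any particular framework's API:

```python
# Minimal sketch of why training is stateful: progress is saved as checkpoints
# so a long-running job can be resumed. train_step and save_checkpoint are
# illustrative placeholders, not a specific framework's API.
import pickle

def train_step(state: dict) -> dict:
    state["epoch"] += 1
    state["loss"] = 1.0 / state["epoch"]      # stands in for real optimization
    return state

def save_checkpoint(state: dict, path: str) -> None:
    with open(path, "wb") as f:
        pickle.dump(state, f)                 # e.g. neural network weights

state = {"epoch": 0, "loss": float("inf")}
for _ in range(100):                          # hours to days in real training
    state = train_step(state)
    if state["epoch"] % 10 == 0:
        save_checkpoint(state, "model.ckpt")  # resume from here if interrupted
```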
Inference, on the other hand, is about running that model at scale for many users. When numerous models run concurrently, each written in a different framework and language, the problem becomes analogous to an operating system: it is responsible for scheduling jobs, sharing resources, and monitoring those jobs. A "job" here is a single inference transaction and, unlike training, it requires a short burst of compute (similar to an SQL query), an elastic load (machines need to be added or removed in proportion to inference demand), and is stateless: the result of one transaction does not affect the result of the next.
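By contrast with the training loop above, a single inference transaction can be sketched as a stateless function: the model is loaded once per worker, each request is a short compute burst, and nothing carries over between calls. The `load_model` helper and the payload shape below are assumptions for illustration, not a specific serving framework's API:

```python
# Sketch of a stateless inference "job". load_model and the payload shape are
# illustrative assumptions, not a particular serving framework's API.
import json

MODEL = None  # loaded lazily so each worker pays the load cost only once

def load_model():
    # Placeholder for deserializing real model weights (e.g. from a checkpoint).
    return lambda features: {"label": "hotdog", "confidence": 0.87}

def handle(request_body: str) -> str:
    """One inference transaction: parse JSON in, run the model, return JSON out."""
    global MODEL
    if MODEL is None:
        MODEL = load_model()
    features = json.loads(request_body)
    result = MODEL(features)        # short CPU/GPU burst
    return json.dumps(result)       # nothing persisted between transactions

print(handle('{"pixels": [0, 255, 34]}'))   # {"label": "hotdog", "confidence": 0.87}
```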
We will be focusing on the Inference side of the equation.
Serverless FTW
We’ll be using serverless computing for our AI operating system, so let’s take a moment to explain why serverless architecture makes sense for artificial intelligence inference.
As we explained in the previous section, machine learning inference requires a short compute burst. A server that serves a model as a REST API therefore spends most of its time idle. When it receives a request, say to classify an image, it bursts CPU/GPU utilization for a short period, returns the result, then goes back to being idle. This is similar to a database server that sits idle until it receives an SQL query.
Because of this pattern, artificial intelligence inference is a perfect fit for serverless computing. Serverless architecture has an obvious scaling advantage and an economic one. For example, let's say you've built an app called "SeeFood" that classifies images as Hotdog or Not Hotdog. Your app goes viral and is now topping the charts. Here's what your daily app usage might look like.
In this fictional scenario, if we were to use a traditional (fixed-scale) architecture, we'd be paying for 40 machines per day. This is the area highlighted in green below. With standard cloud pricing, that might cost us around $648 × 40 = $25,920 per month.
If instead we used an auto-scaling architecture (adding or removing machines every hour), we'd be paying for 19 machines per day on average. This is the area highlighted in green below. That's a total of $648 × 19 = $12,312 per month.
And finally, if we use a serverless architecture, we'd (theoretically) pay for exactly what we use and not for any idle time. This is all the blue area in the chart below. The cost in this fictional scenario is trickier to calculate: it comes down to an average of 21 calls per second, or the equivalent of 6 machines. That's $648 × 6 = $3,888 per month.
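Putting the three scenarios side by side, the arithmetic is simply machines × monthly machine cost, using the same assumed rate of $648 per machine per month as above:

```python
# Cost comparison for the fictional "SeeFood" scenario, assuming one standard
# cloud machine costs about $648 per month.
COST_PER_MACHINE = 648

fixed_scale  = COST_PER_MACHINE * 40   # provisioned for peak load all day
auto_scaling = COST_PER_MACHINE * 19   # average of 19 machines per hour
serverless   = COST_PER_MACHINE * 6    # ~21 calls/sec, roughly 6 machines' worth

print(fixed_scale, auto_scaling, serverless)   # 25920 12312 3888
```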
In this fictional scenario, our cloud computing cost went from ~$26k to ~$4k per month. That's a great reason to use serverless computing, in addition to other advantages such as simpler development (functions are encapsulated as atomic services), lower latency (when combined with edge computing), and rolling deployment capabilities.
Building a serverless architecture from scratch is not a particularly difficult feat, especially with recent devops advancements. Projects like Kubernetes and Docker will greatly simplify the distribution, scaling, and healing requirements for a function-as-a-service architecture.
At this stage, we have table stakes for a serverless architecture. Building this and stopping there is the equivalent of using AWS Lambda. However for our system to be more valuable to AI workflows, we need to accommodate additional requirements such as GPU memory management, composability, cloud abstraction, instrumentation, and versioning to name a few.
Anything we build on top of our function-as-a-service platform is what really defines our operating system, and that’s what we’ll talk about next.
Kernels and Shells
Similar to how an operating system is made up of Kernel, Shell, and Services, our AI operating system will consist of those components as well.
The Shell is the part that the user interacts with, such as the website or API endpoint. Services are pluggable components of our operating system, such as authentication and reporting. The last layer, the Kernel, is what really defines our operating system. This is the layer we will focus on for the rest of the post.
Our kernel is made up of three major components: elastic scaling, runtime abstraction, and cloud abstraction. Let’s explore each of those components in detail.
Kernel #1 Intelligent, Elastic Scale
If we know that Algorithm #A will always call Algorithm #B, how can we use that knowledge to our advantage? That’s the intelligent part of scaling up on demand.
Composability is a very common pattern in machine learning and data science. A data pipeline will typically consist of one or more pre-processing stages, a processing stage, and a post-processing stage; the pipeline is the composition of the functions that make up those stages. Composability also shows up in ensembles, where the data scientist runs the data through several different models and then synthesizes a final score.
Here's an example: let's say our "SeeFood" app takes a photo of a fruit or vegetable and returns its name. Instead of building a single classifier for all fruits and vegetables, it's common to break it down into three classifiers, like so.
In this case, we know that the model on top ("Fruit or Veggie Classifier") will always call either "Fruit Classifier" or "Veggie Classifier". How do we use this to our advantage? One way is to instrument all the resources, keeping track of the CPU, memory, and IO each model consumes. Our orchestrator can then use this information when stacking those jobs, in a way that reduces network overhead or increases server utilization (fitting more models on a single server).
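As a plain-Python sketch of that composition (the three classifier functions are stubs standing in for independently deployed models, each of which would really be its own API endpoint):

```python
# Sketch of the SeeFood composition. The three classifiers are stubs standing
# in for independently deployed models.
def fruit_or_veggie(image) -> str:
    return "fruit"            # top-level model: "fruit" or "veggie"

def fruit_classifier(image) -> str:
    return "banana"           # fine-grained fruit model

def veggie_classifier(image) -> str:
    return "carrot"           # fine-grained vegetable model

def classify_food(image) -> str:
    # The parent always calls exactly one of its two children -- a call pattern
    # the orchestrator can learn and exploit, e.g. by co-locating parent and
    # children on the same machine to cut network hops.
    kind = fruit_or_veggie(image)
    return fruit_classifier(image) if kind == "fruit" else veggie_classifier(image)

print(classify_food(image=None))   # "banana" with these stubs
```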
Kernel #2 Runtime Abstraction
The second component in our kernel is runtime abstraction. In machine learning and data science workflows, it's common to build a classifier with one stack (say, R and TensorFlow on GPUs) and have a pre-processing step or an adjacent model running on a different stack (maybe Python and scikit-learn on CPUs).
Our operating system must abstract away those details and enable seamless interoperability between functions and models. For this, we use REST APIs and exchange data using JSON or an equivalent universal data representation. If the data is too large, we pass around a URI to the data source instead of the actual JSON blob.
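As a rough sketch of that convention, a Python caller only sees a REST endpoint and JSON; it neither knows nor cares whether the model behind it runs R on GPUs. The endpoint URL, payload fields, and S3 path below are made up for illustration:

```python
# Sketch of cross-runtime interoperability over REST + JSON. The endpoint URL
# and payload fields are illustrative assumptions.
import json
import urllib.request

ENDPOINT = "https://example.com/v1/algo/demo/FruitClassifier/1.0"

def call_model(payload: dict) -> dict:
    """JSON in, JSON out -- regardless of the model's language or hardware."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# Small inputs travel inline as JSON; large ones travel as a URI to the data.
small_request = {"pixels": [[0, 255], [255, 0]]}
large_request = {"image_uri": "s3://my-bucket/images/lunch.jpg"}
```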
Kernel #3 Cloud Abstraction
These days we do not expect programmers to be devops experts. Even more so, we should not expect data scientists to understand all the intricate details of different cloud vendors. Dealing with different data sources is a great example of this kind of abstraction.
Imagine a scenario where a data scientist creates a model that classifies images. Consumers of that model might be three different personas: (a) backend production engineers using S3, (b) fellow data scientists using Hadoop, and (c) BI users in a different org using Dropbox. Asking the author of that model to build a data connector for each of those sources (and maintain connectors for new data sources in the future) is a distraction from their job. Instead, our operating system can offer a Data Adapter API that reads from and writes to different data sources.
In the first sketch below, not having storage abstraction requires us to write a connector for each data source (in this case S3) and hard-code it in our model. In the second, we use the Data Adapter API, which takes a URI to a data source and automatically injects the right data connector. Those URIs can point to S3, Azure Blob, HDFS, Dropbox, or anything else.
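These are minimal sketches rather than the platform's actual interfaces: boto3 stands in for a hard-coded S3 connector, `read_uri` is a hypothetical stand-in for a Data Adapter API, and the bucket names, paths, and URI schemes are made up.

```python
# Without storage abstraction: the model hard-codes one data source (S3 via
# boto3). Bucket and key names are made up for illustration.
import boto3

def load_image(bucket: str, key: str) -> bytes:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return obj["Body"].read()

# load_image("my-bucket", "images/lunch.jpg")   # S3 only; Dropbox or HDFS need new code
```

```python
# With a Data Adapter API: the model only sees a URI, and the right connector
# is injected based on the URI scheme. read_uri is a hypothetical stand-in for
# such an API, not a real library.
def read_uri(uri: str) -> bytes:
    scheme = uri.split("://", 1)[0]
    connectors = {
        "s3":      lambda u: b"...bytes from S3...",       # would wrap boto3
        "hdfs":    lambda u: b"...bytes from HDFS...",     # would wrap an HDFS client
        "dropbox": lambda u: b"...bytes from Dropbox...",  # would wrap the Dropbox SDK
    }
    return connectors[scheme](uri)

def load_image(uri: str) -> bytes:
    return read_uri(uri)   # same model code no matter where the data lives

# load_image("s3://my-bucket/images/lunch.jpg")
# load_image("dropbox://shared/images/lunch.jpg")
```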
Summary – and a thought about Discoverability
Our operating system so far is auto-scaling, composable, self-optimizing, and cloud-agnostic. Monitoring your hundreds or thousands of models is also a critical requirement. There is, however, room for one more thing: Discoverability.
Discoverability is the ability to find models and functions created by:
- First-party users of the operating system (your colleagues)
- Third-party suppliers (academia, open source, independent developers)
Operating systems are not the product; they are the platform. If you examine how operating systems evolved over time, we went from punched cards, to machine language, to assemblers, and so on, slowly climbing the ladder of abstraction. Abstraction is about treating things as modular components, encouraging reuse, and making advanced workflows more accessible. That's why the latest wave of operating systems (the likes of iOS and Android) came with built-in app stores, to encourage this view of an OS.