Write your own ML backend
Use the Label Studio ML backend to integrate Label Studio with machine learning models. The Label Studio ML backend is an SDK that you can use to wrap your machine learning model code and turn it into a web server. The machine learning server uses uWSGI and supervisord, and handles background training jobs with RQ.
There are several use cases for the ML backend:
- Pre-annotate data with a model
- Use active learning to select the most relevant data for labeling
- Interactive (AI-assisted) labeling
- Model fine-tuning based on recently annotated data
Follow the steps below to wrap custom machine learning model code with the Label Studio ML SDK, or see example ML backend tutorials to integrate with popular machine learning frameworks such as PyTorch, GPT2, and others.
The ML backend repository also includes several predefined examples that you can use and modify:
cd label_studio_ml/examples/<SOME-MODEL>
docker-compose up
For more information about ML backend integration, see Integrate Label Studio into your machine learning pipeline.
Clone the Label Studio Machine Learning Backend git repository:
git clone https://github.com/HumanSignal/label-studio-ml-backend.git
Set up the environment:
cd label-studio-ml-backend/
pip install -U -e .
Create a new backend directory:
label-studio-ml create my_ml_backend
This creates the following file structure:
my_ml_backend/
├── Dockerfile
├── docker-compose.yml
├── model.py
├── _wsgi.py
├── README.md
└── requirements.txt
Dockerfile and docker-compose.yml are used to run the ML backend with Docker.
model.py is the main file where you can implement your own training and inference logic.
_wsgi.py is a helper file that is used to run the ML backend with Docker (you don’t need to modify this).
README.md has instructions on how to run the ML backend.
requirements.txt is a file with Python dependencies.
Run the ML backend server:
The ML backend server is available at http://localhost:9090. You can use this URL when connecting the ML backend to Label Studio.
Start Label Studio:
Label Studio starts at
Before you can begin using your custom ML backend, you will need to implement inference logic. This allows you to get predictions from your model on-the-fly while annotating.
You can modify an existing predict() method in the example ML backend scripts to make it work for your specific use case, or write your own code to override the predict() method.
You can also include and customize prediction scores that you can use for an active learning loop.
To run without Docker (for example, for debugging purposes), you can use the following commands:
pip install -r my_ml_backend/requirements.txt
label-studio-ml start my_ml_backend
To modify the port, use the -p option:
label-studio-ml start my_ml_backend -p 9091
In your model directory, locate the model.py file. The model.py file contains a class declaration inherited from LabelStudioMLBase. This class provides wrappers for the API methods that Label Studio uses to communicate with the ML backend. You can override these methods to implement your own logic:
def predict(self, tasks, context, **kwargs):
    """Make predictions for the tasks."""
    return predictions
The predict() method makes predictions for tasks. It takes and returns the following:
tasks: Label Studio tasks in JSON format.
context: Label Studio context in JSON format. This is used with an interactive labeling scenario.
predictions: Predictions array in JSON format.
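As a rough illustration of these input and output shapes, the sketch below implements predict() on a plain stand-in class so that it runs without the SDK installed; in a real backend the class would inherit from LabelStudioMLBase. The control names sentiment and text, the task field text, the keyword rule, the scores, and the version string are all hypothetical, and the result layout assumes a project that uses a Choices classification control.

```python
class KeywordSentimentModel:
    """Stand-in for a LabelStudioMLBase subclass (assumption: not the real SDK class)."""

    MODEL_VERSION = "keyword-demo-0"  # hypothetical version label

    def predict(self, tasks, context=None, **kwargs):
        """Return one prediction per task in Label Studio predictions format."""
        predictions = []
        for task in tasks:
            # Assumes each task stores its text under data["text"].
            text = task["data"].get("text", "")
            is_positive = "good" in text.lower()  # toy rule instead of a real model
            predictions.append({
                "model_version": self.MODEL_VERSION,
                "score": 0.9 if is_positive else 0.6,  # made-up confidence values
                "result": [{
                    "from_name": "sentiment",  # hypothetical Choices control name
                    "to_name": "text",         # hypothetical Text object name
                    "type": "choices",
                    "value": {"choices": ["Positive" if is_positive else "Negative"]},
                }],
            })
        return predictions
```

The from_name and to_name values must match the names used in your own labeling config, otherwise Label Studio cannot attach the result to a control.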
Once you implement the predict() method, you can see predictions from the connected ML backend in Label Studio.
For another example of the predict() method, see model.py.
If you want to support interactive pre-annotations in your machine learning backend, write an inference call using the predict() method. For an example that does this for text labeling projects, see this code example for substring matching.
Complete the following steps:
- Define an inference call with the predict() method as outlined above. The predict() method takes task data and context data:
The tasks parameter contains details about the task being pre-annotated. See Label Studio tasks in JSON format.
The context parameter contains details about annotation actions performed in Label Studio, such as a highlighted text string, sent in Label Studio annotation results format.
context has the following properties:
annotation_id: The annotation ID.
draft_id: The draft annotation ID.
user_id: The user ID.
result: The annotation result, which also includes an is_positive: true flag that the user can change, for example by pressing the Alt key while using keypoints to interact with the image in the UI.
- With the task and context data, construct a prediction using the data received from Label Studio.
- Return a result in the Label Studio predictions format, which varies depending on the type of labeling being performed.
Refer to the code example linked above for more details about how this might be performed for a NER labeling project.
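The three steps above can be sketched as a single function. This is a minimal substring-matching illustration, not the linked example itself: it assumes a text project whose task data has a text field and whose context result carries start, end, text, and labels in each region's value. The function name predict_from_selection is hypothetical.

```python
def predict_from_selection(task, context):
    """Propose regions for every other occurrence of the text the user just labeled."""
    text = task["data"]["text"]  # assumes a "text" field in the task data
    results = []
    for region in context.get("result", []):
        value = region.get("value", {})
        selected = value.get("text")
        if not selected:
            continue  # nothing to match for this region
        start = text.find(selected)
        while start != -1:
            end = start + len(selected)
            # Skip the span that the annotator already labeled.
            if start != value.get("start"):
                results.append({
                    "from_name": region["from_name"],
                    "to_name": region["to_name"],
                    "type": "labels",
                    "value": {
                        "start": start,
                        "end": end,
                        "text": selected,
                        "labels": value.get("labels", []),
                    },
                })
            start = text.find(selected, end)
    return {"result": results}
```

Copying from_name and to_name from the incoming region keeps the sketch independent of any particular labeling config.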
For more information about enabling pre-annotations, see Get interactive pre-annotations.
You can also implement the fit() method to train your model. The fit() method is typically used to train the model on the labeled data, although it can be used for any operation that requires data persistence (for example, storing labeled data in a database, saving model weights, or keeping a history of LLM prompts).
By default, the fit() method is called at any data action in Label Studio, like creating a new task or updating annotations. You can modify this behavior in the project settings under Webhooks.
To implement the fit() method, override it in your model.py file:
def fit(self, event, data, **kwargs):
    """Train the model on the labeled data."""
    old_model = self.get('old_model')
    # write your logic to update the model
    self.set('new_model', new_model)
event: The event type can be
data: The payload received from the event. See the Webhook event reference.
Additionally, there are two helper methods that you can use to store and retrieve data from the ML backend:
self.set(key, value) - Store data in the ML backend.
self.get(key) - Retrieve data from the ML backend.
Both methods can also be used elsewhere in the ML backend code, for example, in the predict() method to get the new model weights.
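To show how fit() and predict() can share state through these helpers, here is a minimal sketch built on a stand-in class. It mimics only the set/get pair with an in-memory dictionary, so it is not the SDK's storage implementation, and values are kept as strings as a conservative assumption. The class name, the keys, and the event names used to call it are illustrative.

```python
class StandInBackend:
    """Mimics only the set/get helpers (assumption: a plain in-memory dict)."""

    def __init__(self):
        self._cache = {}

    def set(self, key, value):
        self._cache[key] = value

    def get(self, key):
        return self._cache.get(key)

    def fit(self, event, data, **kwargs):
        """Toy 'training': count the events seen and bump a version string."""
        seen = int(self.get("events_seen") or 0) + 1
        self.set("events_seen", str(seen))
        self.set("model_version", f"demo-v{seen}")

    def predict(self, tasks, context=None, **kwargs):
        """Read the state written by fit() and attach it to each prediction."""
        version = self.get("model_version") or "demo-v0"
        return [{"model_version": version, "result": []} for _ in tasks]
```

In a real backend you would store something more useful than a counter, such as a path to saved weights, and load it inside predict().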
Other methods and parameters available within the LabelStudioMLBase class include:
self.label_config - returns the Label Studio labeling config as an XML string.
self.parsed_label_config - returns the Label Studio labeling config as JSON.
self.model_version - returns the current model version.
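Because the labeling config is available as an XML string, you can read control names from it instead of hard-coding them. The sketch below uses only the Python standard library and a made-up sample config; the helper name describe_choices is hypothetical, and it handles only the first Choices control it finds.

```python
import xml.etree.ElementTree as ET

# Made-up labeling config used only for this illustration.
SAMPLE_CONFIG = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
  </Choices>
</View>
"""


def describe_choices(label_config_xml):
    """Return the names and labels of the first Choices control in the config."""
    root = ET.fromstring(label_config_xml)
    control = root.find(".//Choices")
    if control is None:
        raise ValueError("no Choices tag found in the labeling config")
    return {
        "from_name": control.get("name"),
        "to_name": control.get("toName"),
        "labels": [choice.get("value") for choice in control.findall("Choice")],
    }
```

Inside a backend class you would pass self.label_config to such a helper and reuse the returned from_name and to_name when building prediction results.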
Starting in version 1.4.1 of Label Studio, when you add an ML backend to your project, Label Studio creates a webhook to your ML backend to send an event every time an annotation is created or updated.
By default, the payload of the webhook event does not contain the annotation itself. You can either modify the webhook event sent by Label Studio to send the full payload, or retrieve the annotation in one of three ways: through the Label Studio API with the get annotation by its ID endpoint, through the SDK with the get task by ID method, or from the target storage that you set up to store annotations.
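For the API route, a minimal standard-library sketch might look like the following. The /api/annotations/<id>/ path and the Token authorization scheme are assumptions about your Label Studio version, so check them against the API reference for your installation. The function names, base URL, and token are placeholders.

```python
import json
import urllib.request


def build_annotation_request(base_url, annotation_id, api_token):
    """Build (but do not send) a GET request for a single annotation."""
    # Assumed endpoint path; verify it against your Label Studio API reference.
    url = f"{base_url.rstrip('/')}/api/annotations/{annotation_id}/"
    # Assumed auth scheme; newer installations may expect a different header value.
    return urllib.request.Request(url, headers={"Authorization": f"Token {api_token}"})


def fetch_annotation(base_url, annotation_id, api_token, timeout=10):
    """Send the request and decode the JSON body of the response."""
    request = build_annotation_request(base_url, annotation_id, api_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

You could call fetch_annotation() from fit() with the annotation ID taken from the webhook payload, then use the returned result for training.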
See the annotation webhook event reference for more details about the webhook event.