Deployments and Inference

1. Introduction

With the Stochastic Inference Engine you can deploy any model from your favorite deep learning framework (PyTorch, TensorFlow, ONNX, TensorRT, ...). Templates are also available for the most common models and tasks.

Models can be deployed in four ways:

  • Through the graphical UI.
  • With a Stochastic template, using the CLI.
  • With a custom template, using the CLI.
  • With a custom Docker image, using the CLI.

This guide is divided into four sections, one per approach. Review the common requirements first.

2. Common requirements

2.1. Sign in

Before you start using the CLI, you need to log in with your username and password.

stochasticx login --username "my_email@email.com" --password "my_password"
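
If you do not have the CLI installed yet, it can typically be installed from PyPI; this assumes the package is published under the same name as the command:

pip install stochasticx  # assumes the PyPI package shares the CLI's name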

3. Graphical UI

3.1. Upload a model to the Stochasticx platform

In this guide we are going to download the bert-base-uncased model from the HuggingFace Hub and upload it to the Stochasticx platform.

Download the model from the HF Hub:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Save the model in your file system:

tokenizer.save_pretrained("./bert")
model.save_pretrained("./bert")
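
As a quick sanity check before uploading, you can list the saved files. The exact file names vary with your transformers version (for example, the weights may be pytorch_model.bin or model.safetensors), but you should at least see a config file and the tokenizer files:

import os

# Inspect the directory produced by save_pretrained above
print(sorted(os.listdir("./bert")))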

Now you can upload this model to the Stochastic platform. Go to the menu on the left and click on Models. Once there, you should see an Add model button in the top left corner.

Once you have clicked on Add model, a new window will be shown. Enter a model name, choose the model type (select HuggingFace) and select the folder containing the BERT model that you downloaded from the HuggingFace Hub.

3.2. Launch the deployment

To launch a deployment using the UI, do the following steps:

  1. Go to the Deploy option in the left menu.
  2. Create a deployment. You will have to specify the model and the task type: select the BERT model and sequence classification, respectively. For the instance type, select the standard option for normal workloads and the performance option for low-latency requirements.

3.3. Inference

Your deployment might take around 10 minutes to be ready. While it is being created, you should see a deploying status.

Once your deployment has a running status, click on it to get the following information:

  • Endpoint (URL) and API key: all your deployments are protected by an API key.
  • An example of input and output request. Select the programming language that best suits your needs.
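
With that endpoint and API key you can already query the model from code. Here is a minimal Python example, analogous to the ones in the CLI sections below; the URL and API key are placeholders for the values shown on the deployment details page:

import requests

# Replace the URL and apiKey with the values from your deployment details
response = requests.post(
    url="http://infer.stochastic.ai:8000/<deployment_id>/predict",
    headers={"apiKey": "<your_api_key>"},
    json={"text": "your first deployment"}
)
response.raise_for_status()
print(response.json())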

3.4. Delete a deployment

Go to the deployments page and click on the stop button.

4. Stochastic template

4.1. Upload a model to the Stochasticx platform

In this guide we are going to download the bert-base-uncased model from the HuggingFace Hub and upload it to the Stochasticx platform.

Download the model from the HF Hub:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Save the model in your file system:

tokenizer.save_pretrained("./bert")
model.save_pretrained("./bert")

Now you can upload this model to the Stochastic platform:

stochasticx models add \
  --name "bert-deployment-guide" \
  --dir_path "./bert" \
  --type "hf"
info
  • name: lets you identify the model later.
  • dir_path: directory path where your model is located.
  • type: model type; in this case, a Hugging Face model (hf).
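
To verify that the upload succeeded, you can list your models. The subcommand below assumes the models command group supports ls in the same way deployments does:

stochasticx models ls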

4.2. Launch the deployment

Stochastic templates allow you to deploy a model without writing a single line of code. At the moment, we have templates for the following model types:

  • HuggingFace (hf)
  • ONNX (onnx)
  • Large Language Models (llm)

For every model type we have several subtypes:

  • HuggingFace (hf): sequence classification (sequence_classification), question answering (question_answering), token classification (token_classification), summarization (summarization) and translation (translation)
  • ONNX (onnx): sequence classification (sequence_classification), question answering (question_answering) and token classification (token_classification)
  • Large Language Models (llm): gpt-j, flan-t5 and stable-diffusion

To launch a deployment with the BERT model for a sequence classification task you should execute the following command:

stochasticx deployments launch \
  --name "bert_deployment_guide" \
  --instance_type "g4dn.xlarge" \
  --model_name "bert-deployment-guide" \
  --model_type "hf" \
  --sub_type "sequence_classification"
Output
[+] Creating deployment...
[+] Deployed

After creating the deployment you will have to wait around 10 minutes before you can start running inferences. You can list your deployments with the following command; once the status of your deployment becomes running, you will be able to run inferences.

stochasticx deployments ls
Output

[+] Collecting all deployments

┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Name                  ┃ Status    ┃ Instance    ┃ Model              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ bert_deployment_guide │ deploying │ g4dn.xlarge │ bert_deployment_...│
└───────────────────────┴───────────┴─────────────┴────────────────────┘

[+] Client URL or API key will be available once the deployment is running. Execute the following command to get them: stochasticx inference inspect --deployment_name deployment_name

4.3. Inference

All the models that you deploy on our platform are protected by an API key. To get the model endpoint and the API key, run the following command:

stochasticx inference inspect --deployment_name bert_deployment_guide

If you get the output [+] Your deployment is still in deploying status. Wait some minutes, it means that your model is still deploying.

After some minutes, you should get an output similar to this:

Output

[+] Use these data to start the inference:
URL: http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict
API key: WNkn0X52r2fblv18SZj3mrMxstkKFeyZ

Once you have the endpoint and the API key, running an inference is as simple as:

import requests

response = requests.post(
url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict",
headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
json={"text": "your first deployment"}
)
response.raise_for_status()
output_data = response.json()

print(output_data)
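
Note that bert-base-uncased has not been fine-tuned for sequence classification, so its classification head is randomly initialized and the predicted labels will be the generic LABEL_0/LABEL_1. Use a fine-tuned checkpoint if you want meaningful predictions.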

4.4. Delete the deployment

Once you have finished with your deployment, you can delete it by executing the following command:

stochasticx deployments delete --name bert_deployment_guide

5. Custom template

5.1. Upload a model to the Stochasticx platform

In this guide we are going to download the bert-base-uncased model from the HuggingFace Hub and upload it to the Stochasticx platform.

Download the model from the HF Hub and save it in your file system:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

tokenizer.save_pretrained("./bert")
model.save_pretrained("./bert")

Now you are ready to create your custom inference file. For that, create a new inference.py file in the model directory (in this case, ./bert). This inference file should contain at least one class called ModelInference with two methods:

  • __init__(self, root_dir_path): method to load and initialize the model. It is executed only once. root_dir_path is the directory where your model is located; in this case, it will be /root/.
  • run(self, api_input): method that receives the API input. If the API received a JSON input, the api_input variable is a Python dictionary containing the JSON data. If the API received one or more files, api_input is a Python dictionary where the keys are the file names and the values are the paths to the saved files. IMPORTANT: the output of this run method must be JSON serializable.

Here you can find an example of inference.py file:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


class ModelInference:
    def __init__(self, root_dir_path):
        # Load the tokenizer and model from the uploaded model directory
        self.tokenizer = AutoTokenizer.from_pretrained(root_dir_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(root_dir_path)
        self.model.eval()

        if torch.cuda.is_available():
            self.model = self.model.cuda()

    def run(self, api_input):
        inputs = api_input["text"]

        # Accept both a single string and a list of strings
        if isinstance(inputs, str):
            inputs = [inputs]

        # Optional tokenization parameters, with sensible defaults
        padding = api_input.get("padding", True)
        truncation = api_input.get("truncation", True)
        max_length = int(api_input.get("max_length", 128))

        tokenized_inputs = self.tokenizer(
            inputs,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            return_tensors="pt"
        )

        if torch.cuda.is_available():
            tokenized_inputs = {k: v.cuda() for k, v in tokenized_inputs.items()}

        with torch.no_grad():
            model_outputs = self.model(**tokenized_inputs)

        # Map each predicted class index to its human-readable label
        labels = []
        output_classes = torch.argmax(model_outputs.logits, dim=-1)

        if torch.cuda.is_available():
            output_classes = output_classes.cpu()

        output_classes = output_classes.numpy()

        for output_class in output_classes:
            labels.append(self.model.config.id2label[int(output_class)])

        # The returned value must be JSON serializable
        return {"label": labels}

Now you can upload this model to the Stochastic platform:

stochasticx models add \
  --name "bert-deployment-guide" \
  --dir_path "./bert" \
  --type "hf"
info
  • name: lets you identify the model later.
  • dir_path: directory path where your model is located.
  • type: model type; in this case, a Hugging Face model (hf).

5.2. Launch the deployment

To launch a deployment with the uploaded model and a custom template, run the following command:

stochasticx deployments launch \
  --name "bert_deployment_guide" \
  --instance_type "g4dn.xlarge" \
  --model_name "bert-deployment-guide"
Output
[+] Creating deployment...
[+] Deployed

After creating the deployment you will have to wait around 10 minutes before you can start running inferences. You can list your deployments with the following command; once the status of your deployment becomes running, you will be able to run inferences.

stochasticx deployments ls
Output

[+] Collecting all deployments

┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Name                  ┃ Status    ┃ Instance    ┃ Model              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ bert_deployment_guide │ deploying │ g4dn.xlarge │ bert_deployment_...│
└───────────────────────┴───────────┴─────────────┴────────────────────┘

[+] Client URL or API key will be available once the deployment is running. Execute the following command to get them: stochasticx inference inspect --deployment_name deployment_name

5.3. Inference

All the models that you deploy on our platform are protected by an API key. To get the model endpoint and the API key, run the following command:

stochasticx inference inspect --deployment_name bert_deployment_guide

If you get the output [+] Your deployment is still in deploying status. Wait some minutes, it means that your model is still deploying.

After some minutes, you should get an output similar to this:

Output

[+] Use these data to start the inference:
URL: http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict
API key: WNkn0X52r2fblv18SZj3mrMxstkKFeyZ

Once you have the endpoint and the API key, running an inference is as simple as:

import requests

response = requests.post(
url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict",
headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
json={"text": "your first deployment"}
)
response.raise_for_status()
output_data = response.json()

print(output_data)
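
Because the run method defined above also reads optional padding, truncation and max_length keys, you can tune tokenization per request. For example:

import requests

response = requests.post(
    url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict",
    headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
    # Optional keys handled by the custom run() method
    json={"text": ["first input", "second input"], "max_length": 64}
)
response.raise_for_status()
print(response.json())  # {"label": [...]}, as returned by run()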

5.4. Delete the deployment

Once you have finished with your deployment, you can delete it by executing the following command:

stochasticx deployments delete --name bert_deployment_guide

6. Custom Docker image

danger

Not generally available yet. To request early access, please contact us here

6.1. Launch the deployment

To launch a deployment with a custom Docker image, run the following command. The model should already be included in the Docker image.

stochasticx deployments launch \
  --name "bert_deployment_guide" \
  --instance_type "g4dn.xlarge" \
  --docker_image "your_docker_image" \
  --docker_registry_username "your_docker_username" \
  --docker_registry_password "your_docker_password" \
  --health_endpoint "/health" \
  --init_endpoint "/init" \
  --predict_endpoint "/predict" \
  --timeout 100
Output
[+] Creating deployment...
[+] Deployed

The purpose of each endpoint is as follows (a sketch of a container implementing all three appears after the list):

  • Health endpoint: we monitor your Docker container to know if everything is working fine. It should return a 2XX HTTP code; if it returns any other code, the container will be recreated. When the container is started for the first time or recreated, we hit the /health endpoint until it returns a 2XX code or times out.
  • Init endpoint: once the container is running and the health check has passed, this endpoint is called. Use it for initializing functions, models, etc.
  • Predict endpoint: the endpoint that listens for incoming requests.
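
As an illustration, here is a minimal sketch of such a container using FastAPI. The HTTP methods, the load_model and run_model helpers, and the request shape are assumptions made for the sketch, not requirements of the platform:

from fastapi import FastAPI, Request

app = FastAPI()
model = None


@app.get("/health")
def health():
    # Any 2XX response tells the platform the container is healthy
    return {"status": "ok"}


@app.post("/init")
def init():
    # Called after the health check passes; load your model here
    global model
    model = load_model()  # hypothetical loader for your model
    return {"initialized": True}


@app.post("/predict")
async def predict(request: Request):
    payload = await request.json()
    # run_model is a hypothetical helper; the response must be JSON serializable
    return {"output": run_model(model, payload["data"])}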

After creating the deployment you will have to wait around 10 minutes before you can start running inferences. You can list your deployments with the following command; once the status of your deployment becomes running, you will be able to run inferences.

stochasticx deployments ls

6.2. Inference

All the models that you deploy on our platform are protected by an API key. To get the predict endpoint and the API key, run the following command:

stochasticx inference inspect --deployment_name bert_deployment_guide

If you get the output [+] Your deployment is still in deploying status. Wait some minutes, it means that your model is still deploying.

After some minutes, you should get an output similar to this:

Output

[+] Use these data to start the inference:
URL: http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict_endpoint
API key: WNkn0X52r2fblv18SZj3mrMxstkKFeyZ

Once you have the endpoint and the API key, running an inference is as simple as:

import requests

response = requests.post(
url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict_endpoint",
headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
json={"data": "your first deployment"}
)
response.raise_for_status()
output_data = response.json()

print(output_data)

6.3. Delete the deployment

Once you have finished with your deployment, you can delete it by executing the following command:

stochasticx deployments delete --name bert_deployment_guide