Deployments and Inference
Recommended tutorials before starting to work with deployments.
1. Introduction
With the Stochastic Inference Engine you can deploy any model from your favorite Deep Learning framework (PyTorch, TensorFlow, ONNX, TensorRT, ...). There are also templates available for the most common models and tasks.
Models can be deployed:
- Using the graphical UI.
- With a Stochastic template.
- Attaching your own inference.py file and requirements.txt.
- With a custom Docker image.
This guide is divided into four sections, one per approach. Review the common requirements first.
2. Common requirements
2.1. Sign in
Before you start using the Python SDK or the CLI, you need to log in with your username and password.
- Python
- CLI
- Platform
from stochasticx.auth.auth import Stochastic
username = "your_email@email.com"
password = "your_password"
Stochastic().login(username, password)
stochasticx login --username "my_email@email.com" --password "my_password"
You can log in on the Stochastic platform website: https://app.stochastic.ai/login
3. Graphical UI
3.1. Upload a model to the Stochasticx platform
In this guide we are going to download the bert-base-uncased model from the HuggingFace Hub and upload it to the Stochasticx platform.
Download the model from the HF Hub:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
Save the model in your file system:
tokenizer.save_pretrained("./bert")
model.save_pretrained("./bert")
Now you can upload this model to the Stochastic platform. Go to the menu on the left and click on Models. Once there, you should see an Add model button in the top left corner.
Once you have clicked on Add model, a new window will be shown. Enter a model name, select the model type (HuggingFace), and select the folder containing the BERT model that you downloaded from the HuggingFace Hub.
3.2. Launch the deployment
To launch a deployment using the UI, do the following steps:
- Go to the Deploy option in the left menu.
- Create a deployment. For that, you will have to specify the model and the task type: select the BERT model and sequence classification, respectively. For the instance type, select the standard option for normal workloads and the performance option for low-latency requirements.
3.3. Inference
Your deployment may take around 10 minutes to be ready. In the meantime, you should see a deploying status.
Once your deployment has a running status, click on it to get the following information:
- Endpoint (URL) and API key: all your deployments are protected by an API key.
- An example of input and output request. Select the programming language that best suits your needs.
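For reference, calling the endpoint from Python follows the same pattern used later in this guide. Below is a minimal sketch; the URL and API key are placeholders that you must replace with the values shown on the deployment page:
import requests
# Placeholders: copy the real endpoint URL and API key from the deployment page
response = requests.post(
    url="<your_endpoint_url>",
    headers={"apiKey": "<your_api_key>"},
    json={"text": "your first deployment"}
)
response.raise_for_status()
print(response.json())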
3.4. Delete a deployment
Go to the deployments page and click on the stop button.
4. Stochastic template
4.1. Upload a model to the Stochasticx platform
In this guide we are going to download the bert-base-uncased model from the HuggingFace Hub and upload it to the Stochasticx platform.
Download the model from the HF Hub:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
Save the model in your file system:
tokenizer.save_pretrained("./bert")
model.save_pretrained("./bert")
Now you can upload this model to the Stochastic platform:
- Python
- CLI
from stochasticx.models.models import Model, ModelType
model_to_upload = Model(
    name="bert-deployment-guide",
    dir_path="./bert",
    type=ModelType.HUGGINGFACE
)
print(model_to_upload)
model_to_upload.upload()
print(model_to_upload)
stochasticx models add \
--name "bert-deployment-guide" \
--dir_path "./bert" \
--type "hf"
- name: will allow you to identify the model later.
- dir_path: directory path where your model is located.
- type: model type. In this case, a Hugging Face model.
4.2. Launch the deployment
Stochastic templates allow you to deploy a model without writing a single line of code. At the moment, we have templates for the following model types:
- HuggingFace (hf)
- ONNX (onnx)
- Large Language Models (llm)
For every model type we have several subtypes:
- HuggingFace (hf): sequence classification (sequence_classification), question answering (question_answering), token classification (token_classification), summarization (summarization) and translation (translation)
- ONNX (onnx): sequence classification (sequence_classification), question answering (question_answering) and token classification (token_classification)
- Large Language Models (llm): gpt-j, flan-t5 and stable-diffusion
To launch a deployment with the BERT model for a sequence classification task you should execute the following command:
stochasticx deployments launch \
--name "bert_deployment_guide" \
--instance_type "g4dn.xlarge" \
--model_name "bert-deployment-guide" \
--model_type "hf" \
--sub_type "sequence_classification"
Output
[+] Creating deployment...
[+] Deployed
After creating the deployment, you will have to wait around 10 minutes before you can start running inferences. You can list your deployments with the following command. Once the status of your deployment becomes running, you will be able to run inferences.
stochasticx deployments ls
Output
[+] Collecting all deployments
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Name ┃ Status ┃ Instance ┃ Model ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ bert_deployment_guide │ deploying │ g4dn.xlarge │ bert_deployment_...│
└───────────────────────────────────────────┴───────────┴─────────────┴────────────────────┘
[+] Client URL or API key will be available once the deployment is running. Execute the following command to get them: stochasticx inference inspect --deployment_name deployment_name
4.3. Inference
All the models that you deploy on our platform are protected by an API key. To get the model endpoint and the API key, run the following command:
stochasticx inference inspect --deployment_name bert_deployment_guide
If you get the output [+] Your deployment is still in deploying status. Wait some minutes, it means that your model is still deploying.
After some minutes, you should get an output similar to this:
Output
[+] Use these data to start the inference:
URL: http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict
API key: WNkn0X52r2fblv18SZj3mrMxstkKFeyZ
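If you prefer to wait for the deployment programmatically, a rough sketch is to shell out to the inspect command shown above and key off the still in deploying status message quoted above (adjust the check if the CLI wording differs):
import subprocess
import time
# Poll the documented inspect command until the deployment stops reporting "deploying"
while True:
    result = subprocess.run(
        ["stochasticx", "inference", "inspect", "--deployment_name", "bert_deployment_guide"],
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())
    if "still in deploying status" not in result.stdout:
        break
    time.sleep(60)  # check again in one minute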
Once you have the endpoint and the API key, you can run an inference as simply as:
import requests
response = requests.post(
    url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict",
    headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
    json={"text": "your first deployment"}
)
response.raise_for_status()
output_data = response.json()
print(output_data)
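If you need to classify several texts in one request, the custom-template example in section 5 accepts either a string or a list of strings under the text key; assuming the built-in template behaves the same way (please verify against the example request shown for your deployment), a batched call would look like this:
import requests
# Assumption: the endpoint accepts a list of texts under the "text" key
response = requests.post(
    url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict",
    headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
    json={"text": ["your first deployment", "a second sentence to classify"]}
)
response.raise_for_status()
print(response.json())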
4.4. Delete the deployment
Once you have finished with your deployment, you can delete it by executing the following command:
stochasticx deployments delete --name bert_deployment_guide
5. Custom template
5.1. Upload a model to the Stochasticx platform
In this guide we are going to download the bert-base-uncased model from the HuggingFace Hub and upload it to the Stochasticx platform.
Download the model from the HF Hub and save it in your file system:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./bert")
model.save_pretrained("./bert")
Now you are ready to create your custom inference file. For that, create a new inference.py file in the model directory (in this case ./bert). This inference file should contain at least one class called ModelInference with 2 methods:
- __init__(self, root_dir_path): method to load and initialize the model. It will be executed only once. The root_dir_path argument is the directory where your model is located. In this case, it will be /root/
- run(self, api_input): method that receives the API input. If the API received a JSON input, the api_input variable will be a Python dictionary containing the JSON data. If the API received one or several files, the api_input variable will be a Python dictionary where each key is a file name and each value is the path to the saved file. IMPORTANT: the output of this run method should be JSON serializable.
Here you can find an example inference.py file:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch


class ModelInference:
    def __init__(self, root_dir_path):
        # Load the tokenizer and model from the deployment directory (executed only once)
        self.tokenizer = AutoTokenizer.from_pretrained(root_dir_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(root_dir_path)
        self.model.eval()
        if torch.cuda.is_available():
            self.model = self.model.cuda()

    def run(self, api_input):
        # api_input is a Python dictionary built from the request JSON
        inputs = api_input["text"]
        if isinstance(inputs, str):
            inputs = [inputs]

        padding = api_input.get("padding", True)
        truncation = api_input.get("truncation", True)
        max_length = int(api_input.get("max_length", 128))

        tokenized_inputs = self.tokenizer(
            inputs,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            return_tensors="pt"
        )

        if torch.cuda.is_available():
            tokenized_inputs = {k: v.cuda() for k, v in tokenized_inputs.items()}

        with torch.no_grad():
            model_outputs = self.model(**tokenized_inputs)

        # Map the predicted class indices to their label names
        labels = []
        output_classes = torch.argmax(model_outputs.logits, dim=-1)
        if torch.cuda.is_available():
            output_classes = output_classes.cpu()
        output_classes = output_classes.numpy()

        for output_class in output_classes:
            output_class = int(output_class)
            labels.append(
                self.model.config.id2label[output_class]
            )

        # The returned value must be JSON serializable
        return {"label": labels}
Now you can upload this model to the Stochastic platform:
- Python
- CLI
from stochasticx.models.models import Model, ModelType
model_to_upload = Model(
    name="bert-deployment-guide",
    dir_path="./bert",
    type=ModelType.HUGGINGFACE
)
print(model_to_upload)
model_to_upload.upload()
print(model_to_upload)
stochasticx models add \
--name "bert-deployment-guide" \
--dir_path "./bert" \
--type "hf"
- name: will allow you to identify the model later.
- dir_path: directory path where your model is located.
- type: model type. In this case, a Hugging Face model.
5.2. Launch the deployment
To launch a deployment with the uploaded model and a custom template, run the following command:
stochasticx deployments launch \
--name "bert_deployment_guide" \
--instance_type "g4dn.xlarge" \
--model_name "bert-deployment-guide"
Output
[+] Creating deployment...
[+] Deployed
After creating the deployment, you will have to wait around 10 minutes before you can start running inferences. You can list your deployments with the following command. Once the status of your deployment becomes running, you will be able to run inferences.
stochasticx deployments ls
Output
[+] Collecting all deployments
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Name ┃ Status ┃ Instance ┃ Model ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ bert_deployment_guide │ deploying │ g4dn.xlarge │ bert_deployment_...│
└───────────────────────────────────────────┴───────────┴─────────────┴────────────────────┘
[+] Client URL or API key will be available once the deployment is running. Execute the following command to get them: stochasticx inference inspect --deployment_name deployment_name
5.3. Inference
All the models that you deploy on our platform are protected by an API key. To get the model endpoint and the API key, run the following command:
stochasticx inference inspect --deployment_name bert_deployment_guide
If you get the output [+] Your deployment is still in deploying status. Wait some minutes, it means that your model is still deploying.
After some minutes, you should get an output similar to this:
Output
[+] Use these data to start the inference:
URL: http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict
API key: WNkn0X52r2fblv18SZj3mrMxstkKFeyZ
Once you have the endpoint and the API key, you can run an inference as simply as:
import requests
response = requests.post(
    url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict",
    headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
    json={"text": "your first deployment"}
)
response.raise_for_status()
output_data = response.json()
print(output_data)
5.4. Delete the deployment
Once you have finished with your deployment, you can delete it by executing the following command:
stochasticx deployments delete --name bert_deployment_guide
6. Custom Docker image
Not yet in general availability. To request early access, please contact us here
6.1. Launch the deployment
To launch a deployment with a custom Docker image, run the following command. The model should already be included in the Docker image.
stochasticx deployments launch \
--name "bert_deployment_guide" \
--instance_type "g4dn.xlarge" \
--docker_image "your_docker_image" \
--docker_registry_username "your_docker_username" \
--docker_registry_password "your_docker_password" \
--health_endpoint "/health" \
--init_endpoint "/init" \
--predict_endpoint "/predict" \
--timeout 100
Output
[+] Creating deployment...
[+] Deployed
The purpose of each endpoint is:
- Health endpoint: we will monitor you Docker container to know if everything is working fine. It should return a 2XX HTTP code. In case of returning another HTTP code, it will be recreated. When the container is started the first time or recreated, we will be hitting the
/health
endpoint until the endpoint returns a 2XX code or times out. - Init endpoint: once the container is running and the health check has passed, this endpoint will be called. Use it for initiliazing functions, models, etc.
- Predict endpoint: endpoint that will be listening for incoming requests.
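As an illustration of this contract, here is a minimal sketch of a container server. FastAPI is an assumption (any web framework works), as are the HTTP methods (GET for /health, POST for /init and /predict) and the /model/bert path inside the image, so verify them against your own setup. You would typically start the app with uvicorn and expose the corresponding port in your Dockerfile.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

app = FastAPI()
state = {}

class PredictRequest(BaseModel):
    data: str

@app.get("/health")
def health():
    # Any 2XX response tells the platform the container is healthy
    return {"status": "ok"}

@app.post("/init")
def init():
    # Called once the health check passes; load the model baked into the image
    state["tokenizer"] = AutoTokenizer.from_pretrained("/model/bert")
    state["model"] = AutoModelForSequenceClassification.from_pretrained("/model/bert")
    state["model"].eval()
    return {"status": "initialized"}

@app.post("/predict")
def predict(request: PredictRequest):
    # Tokenize the input, run the model and map the top class to its label
    inputs = state["tokenizer"](request.data, return_tensors="pt")
    with torch.no_grad():
        logits = state["model"](**inputs).logits
    label_id = int(torch.argmax(logits, dim=-1))
    return {"label": state["model"].config.id2label[label_id]}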
After creating the deployment, you will have to wait around 10 minutes before you can start running inferences. You can list your deployments with the following command. Once the status of your deployment becomes running, you will be able to run inferences.
stochasticx deployments ls
6.2. Inference
All the models that you deploy on our platform are protected by an API key. To get the predict endpoint and the API key, run the following command:
stochasticx inference inspect --deployment_name bert_deployment_guide
If you get the output [+] Your deployment is still in deploying status. Wait some minutes, it means that your model is still deploying.
After some minutes, you should get an output similar to this:
Output
[+] Use these data to start the inference:
URL: http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict_endpoint
API key: WNkn0X52r2fblv18SZj3mrMxstkKFeyZ
Once you have the endpoint and the API key, you can run an inference as simply as:
import requests
response = requests.post(
    url="http://infer.stochastic.ai:8000/63b82c47c99c7ef77a3a5a0a/predict_endpoint",
    headers={"apiKey": "WNkn0X52r2fblv18SZj3mrMxstkKFeyZ"},
    json={"data": "your first deployment"}
)
response.raise_for_status()
output_data = response.json()
print(output_data)
6.3. Delete the deployment
Once you have finished with your deployment, you can delete it by executing the following command:
stochasticx deployments delete --name bert_deployment_guide