
Devstral 2 (Mistral)

1. Model Introduction

Devstral 2 is an agentic LLM family for software engineering tasks. It is designed for agentic workflows such as tool use, codebase exploration, and multi-file edits, and achieves strong performance on SWE-bench.

The Devstral 2 Instruct checkpoints are instruction-tuned FP8 models, making them a good fit for chat, tool-using agents, and instruction-following SWE workloads.

Key Features:

  • Agentic coding: Optimized for tool-driven coding and software engineering agents
  • Improved performance: A step up compared to earlier Devstral models
  • Better generalization: More robust across diverse prompts and coding environments
  • Long context: Up to a 256K context window

Use Cases: AI code assistants, agentic coding, and software engineering tasks that require deep codebase understanding and tool integration.

For enterprises requiring specialized capabilities (increased context, domain-specific knowledge, etc.), please reach out to Mistral.

Models:

  • Devstral Small 2 (24B): mistralai/Devstral-Small-2-24B-Instruct-2512
  • Devstral 2 (123B): mistralai/Devstral-2-123B-Instruct-2512

2. SGLang Installation

SGLang offers multiple installation methods; choose the one that best suits your hardware platform and requirements.

Please refer to the official SGLang installation guide for installation instructions.
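On most NVIDIA setups a pip-based install is the common path. The commands below follow the official guide, but verify the exact extras and versions there for your platform:

pip install --upgrade pip
pip install "sglang[all]"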

Transformers version requirement

Devstral 2 requires a recent transformers release. Verify that transformers >= 5.0.0rc0 is installed:

python -c "import transformers; print(transformers.__version__)"

If your version is lower, upgrade:

pip install -U --pre "transformers>=5.0.0rc0"

3. Model Deployment

3.1 Basic configuration

Use the command below as a starting point; it launches Devstral Small 2 (24B). For Devstral 2 (123B), swap in the 123B model name and increase --tp (see the tool-calling launch example in section 4.2).

note

Set the TP size to the minimum required for the model you deploy.

Run this Command:
python -m sglang.launch_server \
  --model mistralai/Devstral-Small-2-24B-Instruct-2512

3.2 Configuration tips

  • Context length vs memory: Devstral 2 advertises a long context window; if you are memory-constrained, start by lowering --context-length (for example 32768) and increase it once things are stable (see the example command after this list).
  • FP8 checkpoints: Both Devstral Small 2 and Devstral 2 are published as FP8 weights. If you hit kernel / dtype issues, try a newer SGLang build and recent CUDA drivers.
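
For example, a memory-constrained single-GPU launch of the 24B model might look like the following; the flag values are illustrative and should be tuned for your hardware (--context-length and --mem-fraction-static are standard SGLang server flags):

python -m sglang.launch_server \
  --model mistralai/Devstral-Small-2-24B-Instruct-2512 \
  --context-length 32768 \
  --mem-fraction-static 0.85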

4. Model Invocation

4.1 Basic Usage (OpenAI-Compatible API)

SGLang exposes an OpenAI-compatible endpoint. Example:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)

resp = client.chat.completions.create(
    model="mistralai/Devstral-Small-2-24B-Instruct-2512",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that retries a request with exponential backoff."},
    ],
    temperature=0.2,
    max_tokens=512,
)

print(resp.choices[0].message.content)

Output Example:

  Here's a Python function that implements exponential backoff for retrying a request. This function uses the `requests` library to make HTTP requests and includes error handling for common HTTP and connection errors.

```python
import time
import requests
from requests.exceptions import RequestException

def retry_with_exponential_backoff(
    url,
    max_retries=3,
    initial_delay=1,
    backoff_factor=2,
    method="GET",
    **kwargs
):
    """
    Retry a request with exponential backoff.

    Parameters:
    - url: The URL to request.
    - max_retries: Maximum number of retry attempts (default: 3).
    - initial_delay: Initial delay in seconds (default: 1).
    - backoff_factor: Multiplier for the delay between retries (default: 2).
    - method: HTTP method to use (default: "GET").
    - **kwargs: Additional arguments to pass to the request function (e.g., headers, data, etc.).

    Returns:
    - Response object if the request succeeds.
    - Raises an exception if all retries fail.
    """
    retry_count = 0
    delay = initial_delay

    while retry_count < max_retries:
        try:
            response = requests.request(method, url, **kwargs)
            # Check if the response status code indicates success
            if response.status_code < 400:
                return response
            else:
                raise RequestException(f"HTTP {response.status_code}: {response.text}")

        except RequestException as e:
            if retry_count == max_retries - 1:
                raise Exception(f"All retries failed. Last error: {e}")

            print(f"Attempt {retry_count + 1} failed. Retrying in {delay} seconds...")
            time.sleep(delay)
            ...
```
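
The same OpenAI-compatible endpoint can also be called directly over HTTP. A minimal curl sketch against the chat completions route (assuming the server from section 3.1 is listening on the default port 30000):

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Devstral-Small-2-24B-Instruct-2512",
    "messages": [{"role": "user", "content": "Write a one-line docstring for a retry helper."}],
    "max_tokens": 128
  }'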

4.2 Tool calling (optional)

Devstral 2 supports tool calling. Enable the tool call parser when launching the server:

python -m sglang.launch_server \
--model mistralai/Devstral-2-123B-Instruct-2512 \
--tp 2 \
--tool-call-parser mistral

Python Example (with Thinking Process):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="mistralai/Devstral-2-123B-Instruct-2512",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Accumulate tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================\n", flush=True)
                thinking_started = False

            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {
                        'name': None,
                        'arguments': ''
                    }

                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

# Print accumulated tool calls
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"🔧 Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")

print()

Output Example:

🔧 Tool Call: get_weather
Arguments: {"location": "Beijing"}
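
In a real agent loop, the next step is to run the tool yourself and send its result back so the model can produce a final answer. A minimal non-streaming sketch, reusing client and tools from the example above (the call id and tool result here are hard-coded placeholders; use the id and arguments returned by the server):

import json

tool_call_id = "call_0"                    # placeholder: use the id from the model's tool call
tool_call_args = {"location": "Beijing"}   # placeholder: parse the streamed arguments instead
tool_result = {"location": "Beijing", "temperature": 21, "unit": "celsius"}  # your tool's output

followup = client.chat.completions.create(
    model="mistralai/Devstral-2-123B-Instruct-2512",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"},
        {
            "role": "assistant",
            "tool_calls": [{
                "id": tool_call_id,
                "type": "function",
                "function": {"name": "get_weather", "arguments": json.dumps(tool_call_args)},
            }],
        },
        {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(tool_result)},
    ],
    tools=tools,
)

print(followup.choices[0].message.content)

This follows the standard OpenAI tool-message format; if the server rejects the assistant or tool messages, check the SGLang tool-calling docs for the exact schema it expects.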

AMD GPU Support

1. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

1.1 Basic Usage

For basic API usage and request examples, please refer to the Model Invocation section above.

1.2 Advanced Usage

python3 -m sglang.launch_server \
--model-path mistralai/Devstral-2-123B-Instruct-2512 \
--tp 8 \
--trust-remote-code \
--port 8888

2. Benchmark

2.1 Benchmark Commands

Scenario 1: Chat (1K/1K) - Most Important

  • Model Deployment
python3 -m sglang.launch_server \
--model-path mistralai/Devstral-2-123B-Instruct-2512 \
--tp 8 \
--trust-remote-code \
--port 8888
  • Low Concurrency (Latency-Optimized)
python3 -m sglang.bench_serving \
--backend sglang \
--model mistralai/Devstral-2-123B-Instruct-2512 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf \
--port 8888
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 94.30
Total input tokens: 6101
Total input text tokens: 6101
Total input vision tokens: 0
Total generated tokens: 4220
Total generated tokens (retokenized): 4206
Request throughput (req/s): 0.11
Input token throughput (tok/s): 64.70
Output token throughput (tok/s): 44.75
Peak output token throughput (tok/s): 82.00
Peak concurrent requests: 2
Total token throughput (tok/s): 109.44
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 9427.59
Median E2E Latency (ms): 5637.23
---------------Time to First Token----------------
Mean TTFT (ms): 4253.85
Median TTFT (ms): 116.95
P99 TTFT (ms): 37764.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 12.28
Median TPOT (ms): 12.29
P99 TPOT (ms): 12.30
---------------Inter-Token Latency----------------
Mean ITL (ms): 12.29
Median ITL (ms): 12.29
P95 ITL (ms): 12.38
P99 ITL (ms): 12.42
Max ITL (ms): 12.90
==================================================
  • Medium Concurrency (Balanced)
python -m sglang.bench_serving \
--backend sglang \
--model mistralai/Devstral-2-123B-Instruct-2512 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf \
--port 8888
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 52.11
Total input tokens: 39668
Total input text tokens: 39668
Total input vision tokens: 0
Total generated tokens: 40805
Total generated tokens (retokenized): 40761
Request throughput (req/s): 1.54
Input token throughput (tok/s): 761.31
Output token throughput (tok/s): 783.13
Peak output token throughput (tok/s): 1120.00
Peak concurrent requests: 20
Total token throughput (tok/s): 1544.44
Concurrency: 13.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 8856.19
Median E2E Latency (ms): 9314.71
---------------Time to First Token----------------
Mean TTFT (ms): 398.80
Median TTFT (ms): 127.81
P99 TTFT (ms): 1500.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 17.32
Median TPOT (ms): 16.90
P99 TPOT (ms): 32.78
---------------Inter-Token Latency----------------
Mean ITL (ms): 16.61
Median ITL (ms): 14.26
P95 ITL (ms): 15.07
P99 ITL (ms): 114.46
Max ITL (ms): 1224.45
==================================================
  • High Concurrency (Throughput-Optimized)
python -m sglang.bench_serving \
--backend sglang \
--model mistralai/Devstral-2-123B-Instruct-2512 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100 \
--request-rate inf \
--port 8888
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 116.08
Total input tokens: 249831
Total input text tokens: 249831
Total input vision tokens: 0
Total generated tokens: 252662
Total generated tokens (retokenized): 252523
Request throughput (req/s): 4.31
Input token throughput (tok/s): 2152.21
Output token throughput (tok/s): 2176.60
Peak output token throughput (tok/s): 3600.00
Peak concurrent requests: 107
Total token throughput (tok/s): 4328.81
Concurrency: 92.42
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 21456.71
Median E2E Latency (ms): 20126.82
---------------Time to First Token----------------
Mean TTFT (ms): 291.60
Median TTFT (ms): 199.24
P99 TTFT (ms): 866.02
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 42.42
Median TPOT (ms): 45.18
P99 TPOT (ms): 53.32
---------------Inter-Token Latency----------------
Mean ITL (ms): 41.97
Median ITL (ms): 27.59
P95 ITL (ms): 130.43
P99 ITL (ms): 137.87
Max ITL (ms): 616.73
==================================================

2.2 Understanding the Results

Key Metrics:

  • Request Throughput (req/s): Number of requests processed per second
  • Output Token Throughput (tok/s): Total tokens generated per second
  • Mean TTFT (ms): Time to First Token - measures responsiveness
  • Mean TPOT (ms): Time Per Output Token - measures generation speed
  • Mean ITL (ms): Inter-Token Latency - measures streaming consistency (a short sanity-check sketch relating these metrics follows this list)
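
As a rough sanity check, these metrics fit together as E2E ≈ TTFT + TPOT * (output tokens - 1). A small sketch using the low-concurrency numbers reported above:

# Values copied from the low-concurrency (max-concurrency 1) run above
mean_ttft_ms = 4253.85
mean_tpot_ms = 12.28
avg_output_tokens = 4220 / 10          # total generated tokens / successful requests

approx_e2e_ms = mean_ttft_ms + mean_tpot_ms * (avg_output_tokens - 1)
print(f"Approximate mean E2E latency: {approx_e2e_ms:.0f} ms")   # ~9424 ms vs. reported 9427.59 ms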

Why These Configurations Matter:

  • 1K/1K (Chat): Represents the most common conversational AI workload. This is the highest priority scenario for most deployments.
  • 1K/8K (Reasoning): Tests long-form generation capabilities crucial for complex reasoning, code generation, and detailed explanations.
  • 8K/1K (Summarization): Evaluates performance with large context inputs, essential for RAG systems, document Q&A, and summarization tasks (example bench_serving commands for the 1K/8K and 8K/1K scenarios follow this list).
  • Variable Concurrency: Captures the Pareto frontier - the optimal trade-off between throughput and latency at different load levels. Low concurrency shows best-case latency, high concurrency shows maximum throughput.
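
To cover the 1K/8K and 8K/1K scenarios mentioned above, the same bench_serving invocation can be reused with the input/output lengths swapped; the commands below are illustrative and only change the length flags from the 1K/1K medium-concurrency run:

python3 -m sglang.bench_serving \
  --backend sglang \
  --model mistralai/Devstral-2-123B-Instruct-2512 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 8000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf \
  --port 8888

python3 -m sglang.bench_serving \
  --backend sglang \
  --model mistralai/Devstral-2-123B-Instruct-2512 \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 80 \
  --max-concurrency 16 \
  --request-rate inf \
  --port 8888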

Interpreting Results:

  • Compare your results against baseline numbers for your hardware
  • Higher throughput at same latency = better performance
  • Lower TTFT = more responsive user experience
  • Lower TPOT = faster generation speed

2.3 Accuracy Benchmark

The following benchmarks document model accuracy on standard datasets:

2.3.1 GSM8K Benchmark

  • Benchmark Command
python3 benchmark/gsm8k/bench_sglang.py \
--num-shots 8 \
--num-questions 1316 \
--parallel 1316 \
--port 8888

Test Results:

Accuracy: 0.922
Invalid: 0.000
Latency: 35.800 s
Output throughput: 4507.697 token/s