
Kimi-K2.5

1. Model Introduction

Kimi-K2.5 is an open-source, native multimodal agentic model from Moonshot AI, built through continual pretraining on approximately 15 trillion mixed visual and text tokens on top of Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, and supports both instant and thinking modes.

Key Features:

  • Native Multimodality: Pre-trained on vision–language tokens, K2.5 excels in visual knowledge, cross-modal reasoning, and agentic tool use grounded in visual inputs.
  • Coding with Vision: K2.5 generates code from visual specifications (UI designs, video workflows) and autonomously orchestrates tools for visual data processing.
  • Agent Swarm: K2.5 transitions from single-agent scaling to a self-directed, coordinated swarm-like execution scheme. It decomposes complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents.
  • Speculative Decoding: EAGLE-based speculative decoding support for lower latency.

Available Models:

For details, see official documentation and deployment guidance.

2. SGLang Installation

Refer to the official SGLang installation guide.

3. Model Deployment

3.1 Basic Configuration

Deployment commands vary by hardware platform, quantization, reasoning parser, tool call parser, DP attention, and speculative decoding settings. The command below is a typical baseline for an 8-GPU node; the sections that follow describe how to adjust each option.

Run this command:
sglang serve \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 8 \
  --trust-remote-code \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --host 0.0.0.0 \
  --port 30000

3.2 Configuration Tips

  • Memory: Requires GPUs with ≥140GB each. Supported platforms: H200 (8×, TP=8), B300 (8×, TP=8), MI300X/MI325X (4×, TP=4), MI350X/MI355X (4×, TP=4). Use --context-length 128000 to conserve memory.
  • AMD GPU TP Constraint: On AMD GPUs, TP must be ≤ 4 (not 8). Kimi-K2.5 has 64 attention heads; the AITER MLA kernel requires heads_per_gpu % 16 == 0. With TP=4, each GPU gets 16 heads (valid). With TP=8, each GPU gets 8 heads (invalid).
  • AMD Docker Image: Use lmsysorg/sglang:v0.5.9-rocm700-mi35x for MI350X/MI355X and lmsysorg/sglang:v0.5.9-rocm700-mi30x for MI300X/MI325X. The ROCm 7.2 images (rocm720) have an AITER compatibility issue.
  • DP Attention: Enable with --dp <N> --enable-dp-attention for production throughput. A common choice is to set --dp equal to --tp, but this is not required.
  • Reasoning Parser: Add --reasoning-parser kimi_k2 to separate thinking and content in model outputs.
  • Tool Call Parser: Add --tool-call-parser kimi_k2 for structured tool calls.
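Combining these tips, a production-style launch on an 8-GPU NVIDIA node might look like the following sketch (setting --dp equal to --tp is one common choice, not a requirement; --context-length 128000 is optional and only needed to conserve memory):

```shell
sglang serve \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 8 \
  --dp 8 \
  --enable-dp-attention \
  --context-length 128000 \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```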

4. Model Invocation

4.1 Basic Usage

See Basic API Usage.

4.2 Advanced Usage

4.2.1 Multimodal (Vision + Text) Input

Kimi-K2.5 supports native multimodal input with images:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                    }
                },
                {
                    "type": "text",
                    "text": "What is in this image? Describe it in detail."
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Output Example:

This image shows a **receipt from Auntie Anne's** (a pretzel franchise restaurant).

## Key Details:

**Item Purchased:**
- **CINNAMON SUGAR** - 1 unit × 17,000 = **17,000**

**Payment Summary:**
- **SUB TOTAL:** 17,000
- **GRAND TOTAL:** 17,000
- **CASH IDR:** 20,000 (Indonesian Rupiah)
- **CHANGE DUE:** 3,000

## Context:
The receipt indicates a transaction in **Indonesian Rupiah (IDR)**. A customer purchased one Cinnamon Sugar pretzel for 17,000 IDR, paid with a 20,000 IDR note, and received 3,000 IDR in change.

The top of the receipt shows the Auntie Anne's logo (a heart-shaped pretzel with a halo), and some text appears blurred for privacy, likely obscuring the store location, date, and transaction number. The receipt is printed on white thermal paper.
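If your images are local files rather than URLs, OpenAI-compatible APIs also accept base64 data URLs in the image_url field. A small sketch (the helper name is our own, not part of any SDK):

```python
import base64

def image_bytes_to_data_url(data: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a data: URL usable in the image_url field."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Usage in a chat message (replaces the https:// URL in the example above):
# with open("receipt.png", "rb") as f:
#     url = image_bytes_to_data_url(f.read())
# {"type": "image_url", "image_url": {"url": url}}
```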

4.2.2 Reasoning Output

Kimi-K2.5 supports both thinking mode (default) and instant mode.

Thinking Mode (default) — reasoning content is automatically separated:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."}
    ]
)

print("====== Reasoning Content (Thinking Mode) ======")
print(response.choices[0].message.reasoning_content)
print("====== Response (Thinking Mode) ======")
print(response.choices[0].message.content)

Instant Mode (thinking off) — disable thinking for faster responses:

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "user", "content": "Which one is bigger, 9.11 or 9.9? Think carefully."}
    ],
    extra_body={"chat_template_kwargs": {"thinking": False}}
)

print("====== Response (Instant Mode) ======")
print(response.choices[0].message.content)

Output Example:

====== Reasoning Content (Thinking Mode) ======
The user is asking which number is bigger: 9.11 or 9.9.

At first glance, someone might think 9.11 is bigger because 11 > 9, but that's incorrect because we're dealing with decimal numbers, not whole numbers.

Let me compare them properly:
- 9.9 = 9.90
- 9.11

When comparing decimals, we look at each place value from left to right:
- Units place: 9 = 9 (tie)
- Tenths place: 9 vs 1

Since 9 > 1, we have 9.9 > 9.11.

Alternatively, we can think of it as:
- 9.9 = 9 + 9/10 = 9 + 0.9 = 9.90
- 9.11 = 9 + 11/100 = 9 + 0.11

Since 0.90 > 0.11, then 9.9 > 9.11.

So the answer is clearly 9.9 is bigger.

The "think carefully" hint suggests the user is trying to catch the common error where people compare 11 and 9 as whole numbers rather than understanding decimal place value (tenths vs hundredths).

I should explain this clearly to avoid confusion.
====== Response (Thinking Mode) ======
**9.9 is bigger.**

Here's why this can be tricky: Many people instinctively compare 11 and 9 and think "11 is bigger than 9," but that's comparing the wrong place values.

When comparing decimals, align them by place value:
- 9.9 = 9.**90**
- 9.11 = 9.**11**

After the decimal point:
- The first digit (tenths place): **9** vs **1**
- Since 9 > 1, we stop there. **9.9 is larger.**

Think of it as money:
- $9.90 (nine dollars and ninety cents)
- $9.11 (nine dollars and eleven cents)

$9.90 is clearly more than $9.11.
====== Response (Instant Mode) ======
Let me think through this carefully.

**9.9 is bigger than 9.11**

Here's why: When comparing decimals, we need to align them by their decimal places:

- 9.9 = 9.90
- 9.11 = 9.11

Now comparing:
- The whole number parts are equal (9 = 9)
- Comparing tenths: **9 > 1**

So 9.90 > 9.11

A common mistake is thinking 11 hundredths is larger than 9 tenths, but 9 tenths = 90 hundredths, which is clearly larger than 11 hundredths.

4.2.3 Tool Calling

Kimi-K2.5 supports tool calling capabilities for agentic tasks:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    stream=True
)

# Process streaming response, accumulating tool-call fragments by index
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {'name': None, 'arguments': ''}
                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        if delta.content:
            print(delta.content, end="", flush=True)

for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"Tool Call: {tool_call['name']}")
    print(f"  Arguments: {tool_call['arguments']}")

Output Example:

Tool Call: get_weather
Arguments: {"location": "Beijing"}

Handling Tool Call Results:

# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": "The weather in Beijing is 22°C and sunny."
    }
]

final_response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=messages
)

print(final_response.choices[0].message.content)

Output Example:

The weather in Beijing is **22°C and sunny**. ☀️

It's a nice day there with comfortable temperatures and clear skies!
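The hard-coded tool result above stands in for real tool execution. In practice you would map each requested tool name to a local function and feed its return value back as the tool message. A minimal sketch (the registry and dispatch_tool_call helper are illustrative, not SGLang APIs):

```python
import json

def get_weather(location: str, unit: str = "celsius") -> str:
    """Stand-in for a real weather lookup; returns a fixed string."""
    return f"The weather in {location} is 22°C and sunny."

# Local registry mapping tool names (as declared in `tools`) to callables
TOOLS = {"get_weather": get_weather}

def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Execute the tool the model requested and return its result as text."""
    args = json.loads(arguments_json or "{}")
    return TOOLS[name](**args)

# In a full loop: for each tc in the assistant message's tool_calls, run
# result = dispatch_tool_call(tc.function.name, tc.function.arguments),
# append {"role": "tool", "tool_call_id": tc.id, "content": result},
# then call client.chat.completions.create(...) again as shown above.
```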

4.2.4 Multimodal + Tool Calling (Agentic Vision)

Combine vision understanding with tool calling for advanced agentic tasks:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_product",
            "description": "Search for a product by name or description",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The product name or description to search for"
                    }
                },
                "required": ["query"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
                    }
                },
                {
                    "type": "text",
                    "text": "Can you identify this product and search for similar items?"
                }
            ]
        }
    ],
    tools=tools
)

msg = response.choices[0].message

# Print reasoning process
if msg.reasoning_content:
    print("=== Reasoning ===")
    print(msg.reasoning_content)

# Print response content
if msg.content:
    print("=== Content ===")
    print(msg.content)

# Print tool calls
if msg.tool_calls:
    print("=== Tool Calls ===")
    for tc in msg.tool_calls:
        print(f"  Function: {tc.function.name}")
        print(f"  Arguments: {tc.function.arguments}")

Output Example:

=== Reasoning ===
The user is asking me to identify a product from a receipt and search for similar items.
Looking at the receipt, I can see:

1. The store is "Auntie Anne's" - which is a popular pretzel chain
2. The product purchased is "CINNAMON SUGAR"
3. Price is 17,000 (likely Indonesian Rupiah based on "CASH IDR")
4. Quantity is 1

So the product is a Cinnamon Sugar pretzel from Auntie Anne's.
Now I need to search for this product or similar items using the search_product function.
=== Content ===
I can see from the receipt that the product is a **Cinnamon Sugar** item from **Auntie Anne's** (the famous pretzel chain). This appears to be a Cinnamon Sugar Pretzel purchased for 17,000 IDR (Indonesian Rupiah).

Let me search for this product and similar items:
=== Tool Calls ===
Function: search_product
Arguments: {"query": "Auntie Anne's Cinnamon Sugar Pretzel"}

4.2.5 Speculative Decoding

Nvidia

Deploy Kimi-K2.5 with the following command (H200/B200, all features enabled):

SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 8 \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --speculative-algorithm=EAGLE3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000

Deploy Kimi-K2.5-NVFP4 with the following command (B200, all features enabled):

SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path nvidia/Kimi-K2.5-NVFP4 \
  --tp 8 \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --kv-cache-dtype fp8_e4m3 \
  --speculative-algorithm=EAGLE3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000

5. Benchmark

5.1 Accuracy Benchmark

5.1.1 MMMU Benchmark

You can evaluate the model's accuracy using the MMMU benchmark, which tests multimodal understanding and reasoning across various subjects:

  • Benchmark Command:
python3 benchmark/mmmu/bench_sglang.py \
--response-answer-regex "(?i)(?:answer|ans)[:\s]*(?:\*\*)?[\(\[]?([A-Za-z])[\)\]]?(?:\*\*)?" \
--port 30000 \
--concurrency 64
  • Result:
Benchmark time: 2785.4322692090645
answers saved to: ./answer_sglang.json
Evaluating...
answers saved to: ./answer_sglang.json
{'Accounting': {'acc': 0.667, 'num': 30},
'Agriculture': {'acc': 0.567, 'num': 30},
'Architecture_and_Engineering': {'acc': 0.733, 'num': 30},
'Art': {'acc': 0.833, 'num': 30},
'Art_Theory': {'acc': 0.8, 'num': 30},
'Basic_Medical_Science': {'acc': 0.833, 'num': 30},
'Biology': {'acc': 0.6, 'num': 30},
'Chemistry': {'acc': 0.633, 'num': 30},
'Clinical_Medicine': {'acc': 0.733, 'num': 30},
'Computer_Science': {'acc': 0.667, 'num': 30},
'Design': {'acc': 0.7, 'num': 30},
'Diagnostics_and_Laboratory_Medicine': {'acc': 0.5, 'num': 30},
'Economics': {'acc': 0.867, 'num': 30},
'Electronics': {'acc': 0.3, 'num': 30},
'Energy_and_Power': {'acc': 0.767, 'num': 30},
'Finance': {'acc': 0.833, 'num': 30},
'Geography': {'acc': 0.667, 'num': 30},
'History': {'acc': 0.767, 'num': 30},
'Literature': {'acc': 0.767, 'num': 30},
'Manage': {'acc': 0.733, 'num': 30},
'Marketing': {'acc': 0.833, 'num': 30},
'Materials': {'acc': 0.567, 'num': 30},
'Math': {'acc': 0.633, 'num': 30},
'Mechanical_Engineering': {'acc': 0.567, 'num': 30},
'Music': {'acc': 0.5, 'num': 30},
'Overall': {'acc': 0.698, 'num': 900},
'Overall-Art and Design': {'acc': 0.708, 'num': 120},
'Overall-Business': {'acc': 0.787, 'num': 150},
'Overall-Health and Medicine': {'acc': 0.74, 'num': 150},
'Overall-Humanities and Social Science': {'acc': 0.75, 'num': 120},
'Overall-Science': {'acc': 0.66, 'num': 150},
'Overall-Tech and Engineering': {'acc': 0.595, 'num': 210},
'Pharmacy': {'acc': 0.767, 'num': 30},
'Physics': {'acc': 0.767, 'num': 30},
'Psychology': {'acc': 0.667, 'num': 30},
'Public_Health': {'acc': 0.867, 'num': 30},
'Sociology': {'acc': 0.8, 'num': 30}}
eval out saved to ./val_sglang.json
Overall accuracy: 0.698
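The --response-answer-regex passed to the benchmark extracts the letter choice from free-form model answers, tolerating markdown bold and bracketed options. A quick behavioral check (the extract_choice helper is illustrative, not part of the benchmark script):

```python
import re

# The answer-extraction pattern passed via --response-answer-regex above
ANSWER_RE = re.compile(
    r"(?i)(?:answer|ans)[:\s]*(?:\*\*)?[\(\[]?([A-Za-z])[\)\]]?(?:\*\*)?"
)

def extract_choice(response):
    """Return the first letter choice found in a model response, or None."""
    m = ANSWER_RE.search(response)
    return m.group(1) if m else None
```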

5.2 Speed Benchmark

Test Environment:

  • Hardware: NVIDIA H200 GPU (8x)
  • Model: Kimi-K2.5
  • Tensor Parallelism: 8
  • SGLang Version: 0.5.6.post2

We use SGLang's built-in benchmarking tool with the random dataset for standardized performance evaluation.

5.2.1 Latency Benchmark

  • Model Deployment:
sglang serve \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
  • Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
  • Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 39.77
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 4220
Total generated tokens (retokenized): 4221
Request throughput (req/s): 0.25
Input token throughput (tok/s): 153.40
Output token throughput (tok/s): 106.10
Peak output token throughput (tok/s): 156.00
Peak concurrent requests: 2
Total token throughput (tok/s): 259.50
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3972.87
Median E2E Latency (ms): 4044.55
P90 E2E Latency (ms): 7046.30
P99 E2E Latency (ms): 7441.13
---------------Time to First Token----------------
Mean TTFT (ms): 176.89
Median TTFT (ms): 154.24
P99 TTFT (ms): 285.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 9.22
Median TPOT (ms): 9.32
P99 TPOT (ms): 12.72
---------------Inter-Token Latency----------------
Mean ITL (ms): 9.02
Median ITL (ms): 8.80
P95 ITL (ms): 13.23
P99 ITL (ms): 14.17
Max ITL (ms): 29.38
==================================================
  • Medium Concurrency (Balanced)
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 158.05
Total input tokens: 39668
Total input text tokens: 39668
Total generated tokens: 40805
Total generated tokens (retokenized): 40775
Request throughput (req/s): 0.51
Input token throughput (tok/s): 250.99
Output token throughput (tok/s): 258.18
Peak output token throughput (tok/s): 1103.00
Peak concurrent requests: 19
Total token throughput (tok/s): 509.17
Concurrency: 14.09
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 27837.05
Median E2E Latency (ms): 23508.00
P90 E2E Latency (ms): 57126.31
P99 E2E Latency (ms): 66044.35
---------------Time to First Token----------------
Mean TTFT (ms): 374.30
Median TTFT (ms): 375.51
P99 TTFT (ms): 695.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 53.25
Median TPOT (ms): 57.93
P99 TPOT (ms): 85.45
---------------Inter-Token Latency----------------
Mean ITL (ms): 53.95
Median ITL (ms): 53.97
P95 ITL (ms): 84.74
P99 ITL (ms): 244.84
Max ITL (ms): 655.61
==================================================
  • High Concurrency (Throughput-Optimized)
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100 \
--request-rate inf
  • Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 996.64
Total input tokens: 249831
Total input text tokens: 249831
Total generated tokens: 252662
Total generated tokens (retokenized): 252588
Request throughput (req/s): 0.50
Input token throughput (tok/s): 250.67
Output token throughput (tok/s): 253.51
Peak output token throughput (tok/s): 1199.00
Peak concurrent requests: 104
Total token throughput (tok/s): 504.18
Concurrency: 92.70
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 184773.75
Median E2E Latency (ms): 174183.65
P90 E2E Latency (ms): 343625.28
P99 E2E Latency (ms): 404284.53
---------------Time to First Token----------------
Mean TTFT (ms): 1289.59
Median TTFT (ms): 1313.35
P99 TTFT (ms): 2346.78
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 364.70
Median TPOT (ms): 403.32
P99 TPOT (ms): 452.34
---------------Inter-Token Latency----------------
Mean ITL (ms): 363.82
Median ITL (ms): 316.21
P95 ITL (ms): 745.91
P99 ITL (ms): 1345.88
Max ITL (ms): 3118.59
==================================================

Scenario 2: Reasoning (1K/8K)

  • Low Concurrency
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 680.26
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 44462
Total generated tokens (retokenized): 44455
Request throughput (req/s): 0.01
Input token throughput (tok/s): 8.97
Output token throughput (tok/s): 65.36
Peak output token throughput (tok/s): 151.00
Peak concurrent requests: 2
Total token throughput (tok/s): 74.33
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 68019.29
Median E2E Latency (ms): 70568.85
P90 E2E Latency (ms): 113237.40
P99 E2E Latency (ms): 121682.34
---------------Time to First Token----------------
Mean TTFT (ms): 206.17
Median TTFT (ms): 177.28
P99 TTFT (ms): 445.37
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 14.36
Median TPOT (ms): 15.89
P99 TPOT (ms): 16.43
---------------Inter-Token Latency----------------
Mean ITL (ms): 15.26
Median ITL (ms): 15.85
P95 ITL (ms): 17.50
P99 ITL (ms): 23.21
Max ITL (ms): 45.22
==================================================
  • Medium Concurrency
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 2475.98
Total input tokens: 39668
Total input text tokens: 39668
Total generated tokens: 318306
Total generated tokens (retokenized): 318166
Request throughput (req/s): 0.03
Input token throughput (tok/s): 16.02
Output token throughput (tok/s): 128.56
Peak output token throughput (tok/s): 847.00
Peak concurrent requests: 18
Total token throughput (tok/s): 144.58
Concurrency: 14.62
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 452592.46
Median E2E Latency (ms): 486002.05
P90 E2E Latency (ms): 833197.57
P99 E2E Latency (ms): 957399.48
---------------Time to First Token----------------
Mean TTFT (ms): 359.38
Median TTFT (ms): 350.78
P99 TTFT (ms): 500.36
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 111.18
Median TPOT (ms): 122.76
P99 TPOT (ms): 145.90
---------------Inter-Token Latency----------------
Mean ITL (ms): 113.69
Median ITL (ms): 122.81
P95 ITL (ms): 147.87
P99 ITL (ms): 151.03
Max ITL (ms): 272.05
==================================================
  • High Concurrency
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 8000 \
--num-prompts 320 \
--max-concurrency 64 \
--request-rate inf
Waiting for completion...

Scenario 3: Summarization (8K/1K)

  • Low Concurrency
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 120.73
Total input tokens: 41941
Total input text tokens: 41941
Total generated tokens: 4220
Total generated tokens (retokenized): 4220
Request throughput (req/s): 0.08
Input token throughput (tok/s): 347.41
Output token throughput (tok/s): 34.96
Peak output token throughput (tok/s): 73.00
Peak concurrent requests: 2
Total token throughput (tok/s): 382.36
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 12068.56
Median E2E Latency (ms): 10211.36
P90 E2E Latency (ms): 23203.32
P99 E2E Latency (ms): 30677.66
---------------Time to First Token----------------
Mean TTFT (ms): 1625.64
Median TTFT (ms): 1526.63
P99 TTFT (ms): 3743.51
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 24.95
Median TPOT (ms): 23.95
P99 TPOT (ms): 35.40
---------------Inter-Token Latency----------------
Mean ITL (ms): 24.80
Median ITL (ms): 21.73
P95 ITL (ms): 59.56
P99 ITL (ms): 61.10
Max ITL (ms): 62.70
==================================================
  • Medium Concurrency
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 389.96
Total input tokens: 300020
Total input text tokens: 300020
Total generated tokens: 41669
Total generated tokens (retokenized): 41670
Request throughput (req/s): 0.21
Input token throughput (tok/s): 769.36
Output token throughput (tok/s): 106.86
Peak output token throughput (tok/s): 304.00
Peak concurrent requests: 19
Total token throughput (tok/s): 876.22
Concurrency: 14.95
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 72870.97
Median E2E Latency (ms): 70495.88
P90 E2E Latency (ms): 121820.46
P99 E2E Latency (ms): 148933.09
---------------Time to First Token----------------
Mean TTFT (ms): 2460.45
Median TTFT (ms): 1976.29
P99 TTFT (ms): 7305.53
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 140.57
Median TPOT (ms): 142.31
P99 TPOT (ms): 273.40
---------------Inter-Token Latency----------------
Mean ITL (ms): 135.44
Median ITL (ms): 95.96
P95 ITL (ms): 152.93
P99 ITL (ms): 1488.37
Max ITL (ms): 6540.24
==================================================
  • High Concurrency
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 320 \
--max-concurrency 64 \
--request-rate inf
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 64
Successful requests: 320
Benchmark duration (s): 1279.50
Total input tokens: 1273893
Total input text tokens: 1273893
Total generated tokens: 170000
Total generated tokens (retokenized): 169981
Request throughput (req/s): 0.25
Input token throughput (tok/s): 995.62
Output token throughput (tok/s): 132.86
Peak output token throughput (tok/s): 703.00
Peak concurrent requests: 67
Total token throughput (tok/s): 1128.49
Concurrency: 60.12
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 240385.63
Median E2E Latency (ms): 236266.30
P90 E2E Latency (ms): 429882.12
P99 E2E Latency (ms): 515158.36
---------------Time to First Token----------------
Mean TTFT (ms): 2710.44
Median TTFT (ms): 2345.63
P99 TTFT (ms): 7144.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 443.84
Median TPOT (ms): 493.29
P99 TPOT (ms): 606.19
---------------Inter-Token Latency----------------
Mean ITL (ms): 448.23
Median ITL (ms): 296.17
P95 ITL (ms): 1869.15
P99 ITL (ms): 2708.95
Max ITL (ms): 7778.47
==================================================

5.2.2 Speculative Decoding Benchmark

  • Model Deployment:
SGLANG_ENABLE_SPEC_V2=1 sglang serve \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 8 \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --speculative-algorithm=EAGLE3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
  • Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
  • Results:
Pending update...
  • Medium Concurrency (Balanced)
python -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
Pending update...
  • High Concurrency (Throughput-Optimized)
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100 \
--request-rate inf
Pending update...

5.3 Speed Benchmark (AMD MI350X)

Test Environment:

  • Hardware: AMD Instinct MI350X GPU (4x)
  • Model: Kimi-K2.5 (BF16)
  • Tensor Parallelism: 4
  • SGLang Version: 0.5.9
  • Docker Image: lmsysorg/sglang:v0.5.9-rocm700-mi35x
  • ROCm: 7.0

We use SGLang's built-in benchmarking tool with the random dataset for standardized performance evaluation.

:::info AMD GPU TP Constraint
Kimi-K2.5 requires TP ≤ 4 on AMD GPUs. The model has 64 attention heads, and the AITER MLA kernel requires heads_per_gpu % 16 == 0. With TP=4, each GPU gets 16 heads (valid); with TP=8, each GPU gets 8 heads (invalid).
:::
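The constraint is simple arithmetic and can be sanity-checked directly (names below are illustrative):

```python
TOTAL_ATTENTION_HEADS = 64    # Kimi-K2.5
AITER_MLA_HEAD_MULTIPLE = 16  # heads per GPU must be a multiple of 16

def tp_is_valid(tp: int) -> bool:
    """Check whether a tensor-parallel degree satisfies the AITER MLA kernel."""
    heads_per_gpu = TOTAL_ATTENTION_HEADS // tp
    return heads_per_gpu % AITER_MLA_HEAD_MULTIPLE == 0

# TP=4 gives 16 heads per GPU (valid); TP=8 gives 8 heads per GPU (invalid)
for tp in (4, 8):
    print(f"TP={tp}: {TOTAL_ATTENTION_HEADS // tp} heads/GPU, valid={tp_is_valid(tp)}")
```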

5.3.1 Latency Benchmark

  • Model Deployment:
SGLANG_USE_AITER=1 SGLANG_ROCM_FUSED_DECODE_MLA=0 \
sglang serve \
  --model-path moonshotai/Kimi-K2.5 \
  --tp 4 \
  --mem-fraction-static 0.8 \
  --trust-remote-code \
  --reasoning-parser kimi_k2 \
  --host 0.0.0.0 \
  --port 30000
  • Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1 \
--request-rate inf
  • Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 155.81
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 4220
Total generated tokens (retokenized): 4222
Request throughput (req/s): 0.06
Input token throughput (tok/s): 39.16
Output token throughput (tok/s): 27.09
Peak output token throughput (tok/s): 29.00
Peak concurrent requests: 2
Total token throughput (tok/s): 66.24
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 15576.22
Median E2E Latency (ms): 12539.80
P90 E2E Latency (ms): 28150.56
P99 E2E Latency (ms): 34873.51
---------------Time to First Token----------------
Mean TTFT (ms): 563.50
Median TTFT (ms): 594.92
P99 TTFT (ms): 830.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 35.61
Median TPOT (ms): 35.66
P99 TPOT (ms): 35.77
---------------Inter-Token Latency----------------
Mean ITL (ms): 35.66
Median ITL (ms): 35.69
P95 ITL (ms): 35.96
P99 ITL (ms): 36.13
Max ITL (ms): 36.92
==================================================
  • Medium Concurrency (Balanced)
python3 -m sglang.bench_serving \
--backend sglang \
--model moonshotai/Kimi-K2.5 \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16 \
--request-rate inf
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 526.66
Total input tokens: 39668
Total input text tokens: 39668
Total generated tokens: 40805
Total generated tokens (retokenized): 40798
Request throughput (req/s): 0.15
Input token throughput (tok/s): 75.32
Output token throughput (tok/s): 77.48
Peak output token throughput (tok/s): 96.00
Peak concurrent requests: 18
Total token throughput (tok/s): 152.80
Concurrency: 14.59
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 96023.27
Median E2E Latency (ms): 93940.20
P90 E2E Latency (ms): 159449.54
P99 E2E Latency (ms): 194706.61
---------------Time to First Token----------------
Mean TTFT (ms): 989.08
Median TTFT (ms): 886.42
P99 TTFT (ms): 1543.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 191.04
Median TPOT (ms): 195.20
P99 TPOT (ms): 238.84
---------------Inter-Token Latency----------------
Mean ITL (ms): 186.68
Median ITL (ms): 183.82
P95 ITL (ms): 189.90
P99 ITL (ms): 673.64
Max ITL (ms): 1633.20
==================================================