GLM-5

1. Model Introduction

GLM-5 is the most powerful language model in the GLM series developed by Zhipu AI, targeting complex systems engineering and long-horizon agentic tasks. Scaling from GLM-4.5's 355B parameters (32B active) to 744B parameters (40B active), GLM-5 integrates DeepSeek Sparse Attention (DSA) to substantially reduce deployment cost while preserving long-context capacity.

With advances in both pre-training (28.5T tokens) and post-training via slime (a novel asynchronous RL infrastructure), GLM-5 delivers significant improvements over GLM-4.7 and achieves best-in-class performance among open-source models on reasoning, coding, and agentic tasks.

Key Features:

  • Systems Engineering & Agentic Tasks: Purpose-built for complex systems engineering and long-horizon agentic tasks
  • State-of-the-Art Performance: Best-in-class among open-source models on reasoning (HLE, AIME, GPQA), coding (SWE-bench, Terminal-Bench), and agentic tasks (BrowseComp, Vending Bench 2)
  • DeepSeek Sparse Attention (DSA): Reduces deployment cost while preserving long-context capacity
  • Multiple Quantizations: BF16 and FP8 variants for different performance/memory trade-offs
  • Speculative Decoding: EAGLE-based speculative decoding support for lower latency

Available Models:

License: MIT

2. SGLang Installation

SGLang offers multiple installation methods; choose the one that best fits your hardware platform and requirements.

GLM-5 requires a specific SGLang Docker image, or a build from source:

# For Hopper GPUs (H100/H200)
docker pull lmsysorg/sglang:glm5-hopper

# For Blackwell GPUs (B200)
docker pull lmsysorg/sglang:glm5-blackwell
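
After pulling the image, the server can be started inside the container. The following is a minimal sketch, not a verified recipe: the cache mount path, port mapping, and IPC setting are assumptions to adjust for your environment.

# Sketch: run the Hopper image and launch the server (paths and ports are examples)
docker run --gpus all \
  --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:glm5-hopper \
  python -m sglang.launch_server \
    --model zai-org/GLM-5-FP8 \
    --tp 8 \
    --host 0.0.0.0 \
    --port 30000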

For other installation methods, please refer to the official SGLang installation guide.

Blackwell (B200) Source Build

If you build SGLang from source on Blackwell GPUs, you need to manually compile sgl-kernel due to existing kernel issues (Hopper GPUs are unaffected). See sglang#18595 for details.

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration

Interactive Command Generator: Use the configuration selector (Hardware Platform, Quantization, Reasoning Parser, Tool Call Parser, DP Attention, Speculative Decoding) to generate the deployment command that matches your setup. A representative command (FP8, TP=8, speculative decoding enabled) is shown below.

Run this command:
python -m sglang.launch_server \
  --model zai-org/GLM-5-FP8 \
  --tp 8 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85

3.2 Configuration Tips

  • Speculative decoding (MTP) can significantly reduce latency for interactive use cases.
  • DP Attention: Enables data parallel attention for higher throughput under high concurrency. Note that DP attention trades off low-concurrency latency for high-concurrency throughput — disable it if your workload is latency-sensitive with few concurrent requests.
  • The --mem-fraction-static flag is recommended for optimal memory utilization; adjust it based on your hardware and workload.
  • The BF16 model always requires twice as many GPUs as FP8 (a multi-node launch sketch follows the table):

Hardware  FP8    BF16
H100      tp=16  tp=32
H200      tp=8   tp=16
B200      tp=8   tp=16
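
Configurations that exceed a single node (for example BF16 with tp=16 on H200) can be served with SGLang's multi-node tensor parallelism. The two-node sketch below is illustrative only: the BF16 repository name (zai-org/GLM-5), the master address, and the multi-node flags (--dist-init-addr, --nnodes, --node-rank) are assumptions to verify against your SGLang version.

# Node 0 (master) -- illustrative sketch; model name and address are assumptions
python -m sglang.launch_server \
  --model zai-org/GLM-5 \
  --tp 16 \
  --dist-init-addr 192.168.1.1:5000 \
  --nnodes 2 \
  --node-rank 0

# Node 1
python -m sglang.launch_server \
  --model zai-org/GLM-5 \
  --tp 16 \
  --dist-init-addr 192.168.1.1:5000 \
  --nnodes 2 \
  --node-rank 1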

4. Model Invocation

Deploy GLM-5 with the following command (FP8 on H200, all features enabled):

python -m sglang.launch_server \
  --model zai-org/GLM-5-FP8 \
  --tp 8 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000

4.1 Basic Usage

For basic API usage, see the minimal request example below; for additional request examples, refer to the SGLang OpenAI-compatible API documentation.
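
The sketch below sends a single non-streaming chat completion to the server started above. It assumes the OpenAI Python SDK and the default port 30000; the reasoning_content field is only populated when the server is launched with the reasoning parser.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Simple non-streaming chat completion
response = client.chat.completions.create(
    model="zai-org/GLM-5-FP8",
    messages=[
        {"role": "user", "content": "Summarize what GLM-5 is in one sentence."}
    ],
    max_tokens=512
)

message = response.choices[0].message

# With the reasoning parser enabled, the thinking process is returned separately
if getattr(message, "reasoning_content", None):
    print("Thinking:", message.reasoning_content)
print("Answer:", message.content)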

4.2 Advanced Usage

4.2.1 Reasoning Parser

GLM-5 uses Thinking mode by default. Enable the reasoning parser during deployment to separate the thinking and content sections; the thinking process is then returned via reasoning_content in the streaming response.

To disable thinking and use Instruct mode, pass chat_template_kwargs at request time:

  • Thinking mode (default): The model performs step-by-step reasoning before answering. No extra parameters needed.
  • Instruct mode ({"enable_thinking": false}): The model responds directly without a thinking process.

Example 1: Thinking Mode (Default)

Thinking mode is enabled by default. The model will reason step-by-step before answering, and the thinking process is returned via reasoning_content:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Thinking mode is enabled by default, no extra parameters needed
response = client.chat.completions.create(
    model="zai-org/GLM-5-FP8",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

Output Example:

=============== Thinking =================
The user wants me to solve a math problem: "What is 15% of 240?".

Step 1: Understand the problem. I need to calculate a percentage of a number.
Formula: Percentage × Number = Result.

Step 2: Convert the percentage to a decimal or fraction.
15% = 15/100 or 0.15.

Step 3: Perform the multiplication.
Method A: Decimal multiplication.
0.15 × 240.
Break it down:
10% of 240 = 24.
5% is half of 10%, so 12.
15% = 10% + 5% = 24 + 12 = 36.

Method B: Fraction multiplication.
15/100 × 240.
Simplify 240/100 = 2.4.
15 × 2.4.
10 × 2.4 = 24.
5 × 2.4 = 12.
24 + 12 = 36.

Method C: Direct multiplication.
240 × 0.15.
240 × 0.10 = 24.
240 × 0.05 = 12.
24 + 12 = 36.

Step 4: Final Verification.
Is 36 reasonable?
10% is 24. 20% is 48.
15% is halfway between 10% and 20%.
Halfway between 24 and 48 is 36.
The result is correct.

Step 5: Structure the final response. I will present the calculation clearly, perhaps showing the fractional or decimal method, or the mental math shortcut (10% + 5%).
=============== Content =================
Here is the step-by-step solution:

**Step 1: Convert the percentage to a decimal.**
To convert 15% to a decimal, divide by 100.
$$15\% = \frac{15}{100} = 0.15$$

**Step 2: Multiply the decimal by the number.**
Now, multiply 0.15 by 240.
$$0.15 \times 240$$

**Step 3: Perform the calculation.**
You can break this down to make it easier:
$$0.15 = 0.10 + 0.05$$

* First, find 10% of 240:
$$0.10 \times 240 = 24$$
* Next, find 5% (which is half of 10%):
$$\frac{24}{2} = 12$$
* Add the two results together:
$$24 + 12 = 36$$

**Answer:**
15% of 240 is **36**.

Example 2: Instruct Mode (Thinking Off)

To disable thinking and get a direct response, pass {"enable_thinking": false} via chat_template_kwargs:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Disable thinking mode via chat_template_kwargs
response = client.chat.completions.create(
    model="zai-org/GLM-5-FP8",
    messages=[
        {"role": "user", "content": "What is 15% of 240?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
    max_tokens=2048,
    stream=True
)

# In Instruct mode, the model responds directly without reasoning_content
for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)

print()

Output Example:

To find **15% of 240**, follow these steps:

### Step 1: Convert the Percentage to a Decimal
First, convert the percentage to a decimal by dividing by 100.

\[
15\% = \frac{15}{100} = 0.15
\]

### Step 2: Multiply by the Number
Next, multiply the decimal by the number you want to find the percentage of.

\[
0.15 \times 240
\]

### Step 3: Perform the Multiplication
Calculate the multiplication:

\[
0.15 \times 240 = 36
\]

### Final Answer
\[
\boxed{36}
\]

4.2.2 Tool Calling

GLM-5 supports tool calling capabilities. Enable the tool call parser during deployment. Thinking mode is on by default; to disable it for tool calling requests, pass extra_body={"chat_template_kwargs": {"enable_thinking": False}}.

Python Example (with Thinking Process):

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="zai-org/GLM-5-FP8",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================", flush=True)
                thinking_started = False

            for tool_call in delta.tool_calls:
                if tool_call.function:
                    print(f"Tool Call: {tool_call.function.name}")
                    print(f" Arguments: {tool_call.function.arguments}")

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

print()

Output Example:

=============== Thinking =================
The user is asking for the weather in Beijing. I have access to a get_weather function that can provide current weather information. Let me check what parameters are required:

- location: required, should be "Beijing"
- unit: optional (not in required array), can be "celsius" or "fahrenheit"

Since the user didn't specify a unit preference and it's optional, I should not ask about it or make up a value. I'll just call the function with the required location parameter.I'll get the current weather in Beijing for you.
=============== Content =================
Tool Call: get_weather
Arguments:
Tool Call: None
Arguments: {
Tool Call: None
Arguments: "location": "Be
Tool Call: None
Arguments: ijing"
Tool Call: None
Arguments: }
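
As the interleaved output above shows, the function name arrives once and the JSON arguments are streamed in fragments. A client therefore usually accumulates the argument chunks per tool-call index before parsing them. The following is a minimal sketch of that accumulation, assuming a fresh streaming response created exactly as in the example above:

import json

# Accumulate streamed tool-call fragments into complete calls,
# keyed by the tool call index carried in each delta.
tool_calls = {}

for chunk in response:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if not getattr(delta, "tool_calls", None):
        continue
    for tc in delta.tool_calls:
        entry = tool_calls.setdefault(tc.index, {"name": None, "arguments": ""})
        if tc.function:
            if tc.function.name:
                entry["name"] = tc.function.name
            if tc.function.arguments:
                entry["arguments"] += tc.function.arguments

# Parse the completed argument strings as JSON
for index, call in tool_calls.items():
    print(call["name"], json.loads(call["arguments"]))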

5. Benchmark

5.1 Speed Benchmark

Test Environment:

  • Hardware: H200 (8x)
  • Model: GLM-5-FP8
  • Tensor Parallelism: 8
  • SGLang Version: commit 947927bdb

5.1.1 Latency Benchmark

python3 -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-5-FP8 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 10 \
  --max-concurrency 1 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 35.78
Total input tokens: 6101
Total input text tokens: 6101
Total generated tokens: 4220
Total generated tokens (retokenized): 4213
Request throughput (req/s): 0.28
Input token throughput (tok/s): 170.54
Output token throughput (tok/s): 117.96
Peak output token throughput (tok/s): 148.00
Peak concurrent requests: 2
Total token throughput (tok/s): 288.50
Concurrency: 1.00
Accept length: 3.48
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3576.31
Median E2E Latency (ms): 2935.97
P90 E2E Latency (ms): 5908.97
P99 E2E Latency (ms): 8588.08
---------------Time to First Token----------------
Mean TTFT (ms): 290.88
Median TTFT (ms): 282.34
P99 TTFT (ms): 332.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 7.54
Median TPOT (ms): 6.97
P99 TPOT (ms): 9.04
---------------Inter-Token Latency----------------
Mean ITL (ms): 7.80
Median ITL (ms): 6.81
P95 ITL (ms): 13.51
P99 ITL (ms): 26.99
Max ITL (ms): 29.50
==================================================

5.1.2 Throughput Benchmark

python3 -m sglang.bench_serving \
  --backend sglang \
  --model zai-org/GLM-5-FP8 \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1000 \
  --max-concurrency 100 \
  --request-rate inf

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 411.74
Total input tokens: 502493
Total input text tokens: 502493
Total generated tokens: 500251
Total generated tokens (retokenized): 499614
Request throughput (req/s): 2.43
Input token throughput (tok/s): 1220.41
Output token throughput (tok/s): 1214.97
Peak output token throughput (tok/s): 2648.00
Peak concurrent requests: 105
Total token throughput (tok/s): 2435.38
Concurrency: 96.30
Accept length: 3.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 39648.76
Median E2E Latency (ms): 39058.12
P90 E2E Latency (ms): 57009.82
P99 E2E Latency (ms): 68880.33
---------------Time to First Token----------------
Mean TTFT (ms): 20613.80
Median TTFT (ms): 21429.21
P99 TTFT (ms): 29543.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.73
Median TPOT (ms): 36.52
P99 TPOT (ms): 67.09
---------------Inter-Token Latency----------------
Mean ITL (ms): 38.13
Median ITL (ms): 16.57
P95 ITL (ms): 86.01
P99 ITL (ms): 164.88
Max ITL (ms): 1307.02
==================================================

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

  • Benchmark Command
python3 benchmark/gsm8k/bench_sglang.py --port 30000
  • Test Result
Accuracy: 0.955
Invalid: 0.000
Latency: 32.470 s
Output throughput: 642.044 token/s

5.2.2 MMLU Benchmark

  • Benchmark Command
python3 benchmark/mmlu/bench_sglang.py --port 30000
  • Test Result
subject: abstract_algebra, #q:100, acc: 0.860
subject: anatomy, #q:135, acc: 0.874
subject: astronomy, #q:152, acc: 0.941
subject: business_ethics, #q:100, acc: 0.880
subject: clinical_knowledge, #q:265, acc: 0.932
subject: college_biology, #q:144, acc: 0.972
subject: college_chemistry, #q:100, acc: 0.640
subject: college_computer_science, #q:100, acc: 0.900
subject: college_mathematics, #q:100, acc: 0.810
subject: college_medicine, #q:173, acc: 0.873
subject: college_physics, #q:102, acc: 0.912
subject: computer_security, #q:100, acc: 0.880
subject: conceptual_physics, #q:235, acc: 0.928
subject: econometrics, #q:114, acc: 0.807
subject: electrical_engineering, #q:145, acc: 0.897
subject: elementary_mathematics, #q:378, acc: 0.937
subject: formal_logic, #q:126, acc: 0.778
subject: global_facts, #q:100, acc: 0.710
subject: high_school_biology, #q:310, acc: 0.961
subject: high_school_chemistry, #q:203, acc: 0.847
subject: high_school_computer_science, #q:100, acc: 0.960
subject: high_school_european_history, #q:165, acc: 0.891
subject: high_school_geography, #q:198, acc: 0.960
subject: high_school_government_and_politics, #q:193, acc: 0.984
subject: high_school_macroeconomics, #q:390, acc: 0.923
subject: high_school_mathematics, #q:270, acc: 0.696
subject: high_school_microeconomics, #q:238, acc: 0.962
subject: high_school_physics, #q:151, acc: 0.821
subject: high_school_psychology, #q:545, acc: 0.956
subject: high_school_statistics, #q:216, acc: 0.889
subject: high_school_us_history, #q:204, acc: 0.941
subject: high_school_world_history, #q:237, acc: 0.945
subject: human_aging, #q:223, acc: 0.857
subject: human_sexuality, #q:131, acc: 0.908
subject: international_law, #q:121, acc: 0.934
subject: jurisprudence, #q:108, acc: 0.907
subject: logical_fallacies, #q:163, acc: 0.933
subject: machine_learning, #q:112, acc: 0.830
subject: management, #q:103, acc: 0.942
subject: marketing, #q:234, acc: 0.940
subject: medical_genetics, #q:100, acc: 0.990
subject: miscellaneous, #q:783, acc: 0.959
subject: moral_disputes, #q:346, acc: 0.873
subject: moral_scenarios, #q:895, acc: 0.837
subject: nutrition, #q:306, acc: 0.922
subject: philosophy, #q:311, acc: 0.897
subject: prehistory, #q:324, acc: 0.929
subject: professional_accounting, #q:282, acc: 0.844
subject: professional_law, #q:1534, acc: 0.714
subject: professional_medicine, #q:272, acc: 0.941
subject: professional_psychology, #q:612, acc: 0.913
subject: public_relations, #q:110, acc: 0.791
subject: security_studies, #q:245, acc: 0.878
subject: sociology, #q:201, acc: 0.940
subject: us_foreign_policy, #q:100, acc: 0.920
subject: virology, #q:166, acc: 0.596
subject: world_religions, #q:171, acc: 0.936
Total latency: 165.275
Average accuracy: 0.877