Step-3.5
1. Model Introduction
Step-3.5-Flash is StepFun's production-grade reasoning engine, built to decouple elite intelligence from heavy compute: it cuts attention cost to deliver low-latency, cost-effective long-context inference, and is purpose-built for autonomous agents in real-world workflows. The model is available in multiple quantization formats optimized for different hardware platforms.
This generation delivers comprehensive upgrades across the board:
- Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 3:1 ratio with an aggressive 128-token window. This hybrid approach keeps performance consistent across long documents and large codebases while significantly reducing the computational overhead typical of standard long-context models (see the illustrative sketch after this list).
- Sparse Mixture-of-Experts: Only 11B of 196B total parameters are active per token.
- Multi-Layer Multi-Token Prediction (MTP): Equipped with 3-way Multi-Token Prediction (MTP-3), enabling complex, multi-step reasoning chains with immediate responsiveness.
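As a rough illustration of the 3:1 interleave, the sketch below assigns an attention type to each layer. The exact pattern, ordering, and layer count are illustrative assumptions, not the published Step-3.5-Flash layout:
# Illustrative sketch of a 3:1 SWA:GA interleave (assumed layout,
# not the published Step-3.5-Flash configuration).
SWA_WINDOW = 128  # sliding-window size in tokens, per the description above

def attention_type(layer_idx: int) -> str:
    # Three SWA layers followed by one GA layer, repeating.
    return "GA (global)" if layer_idx % 4 == 3 else f"SWA (window={SWA_WINDOW})"

for layer in range(8):
    print(f"layer {layer:2d}: {attention_type(layer)}")
Under such a layout, only every fourth layer pays the full quadratic attention cost; the remaining layers attend over a fixed 128-token window, so their cost grows linearly with sequence length.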
2. SGLang Installation
Step-3.5-Flash is currently available in SGLang via a Docker image.
Docker
# Pull the docker image
docker pull lmsysorg/sglang:dev-pr-18084
# Launch the container
docker run -it --gpus all \
--shm-size=32g \
--ipc=host \
--network=host \
lmsysorg/sglang:dev-pr-18084 bash
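Once inside the container, a quick sanity check confirms that the GPUs are visible and SGLang is importable (assuming a standard CUDA setup):
# Verify GPU visibility and the installed SGLang version
nvidia-smi
python3 -c "import sglang; print(sglang.__version__)"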
3. Model Deployment
This section provides deployment configurations optimized for different hardware platforms and use cases.
3.1 Basic Configuration
The Step-3.5-Flash series comes in a single size. Recommended starting configurations vary depending on hardware.
A typical starting command for a 4-GPU node:
sglang serve \
    --model-path stepfun-ai/Step-3.5-Flash \
    --tp 4
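Once the server is up, you can confirm it is serving the model. The check below assumes SGLang's default port of 30000; adjust if you pass --port:
# List the models exposed by the OpenAI-compatible endpoint
curl http://localhost:30000/v1/models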
4. Model Invocation
4.1 Basic Usage
For basic API usage, the server exposes an OpenAI-compatible API, so standard OpenAI client libraries and request formats work unchanged.
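A minimal non-streaming request, assuming the server from Section 3.1 is running locally on the default port:
from openai import OpenAI

# Connect to the local SGLang server (default port 30000)
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)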
4.2 Advanced Usage
4.2.1 Reasoning Parser
Step-3.5-Flash runs in reasoning mode only. Enable the reasoning parser at deployment time to separate the thinking and content sections:
sglang serve \
--model-path stepfun-ai/Step-3.5-Flash \
--tp 4 \
--ep 4 \
--reasoning-parser step3p5
Python Example:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
model="stepfun-ai/Step-3.5-Flash",
messages=[
{"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
],
temperature=0.7,
max_tokens=2048,
stream=True
)
# Process the stream
has_thinking = False
has_answer = False
thinking_started = False
for chunk in response:
if chunk.choices and len(chunk.choices) > 0:
delta = chunk.choices[0].delta
# Print thinking process
if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
if not thinking_started:
print("=============== Thinking =================", flush=True)
thinking_started = True
has_thinking = True
print(delta.reasoning_content, end="", flush=True)
# Print answer content
if delta.content:
# Close thinking section and add content header
if has_thinking and not has_answer:
print("\n=============== Content =================", flush=True)
has_answer = True
print(delta.content, end="", flush=True)
print()
Output Example:
=============== Thinking =================
We are asked: "What is 15% of 240?" We need to solve step by step.
Step 1: Understand that "15% of 240" means we need to calculate 15 percent of 240. In mathematical terms, it is (15/100) * 240.
Step 2: Simplify the calculation. We can compute 15% of 240 by first finding 10% of 240 and then 5% of 240, and adding them. Alternatively, we can multiply directly.
Method 1:
10% of 240 = 240 * 0.10 = 24.
5% is half of 10%, so 5% of 240 = 24 / 2 = 12.
Then 15% = 10% + 5% = 24 + 12 = 36.
Method 2: Direct multiplication: 15% = 15/100 = 0.15, so 0.15 * 240 = 36.
We can also compute fractionally: (15/100)*240 = (15*240)/100. 15*240 = 3600, divided by 100 gives 36.
Thus, the answer is 36.
We'll present the solution step by step.
=============== Content =================
To find 15% of 240, follow these steps:
1. **Convert the percentage to a decimal**:
\( 15\% = \frac{15}{100} = 0.15 \)
2. **Multiply by the number**:
\( 0.15 \times 240 = 36 \)
Alternatively, break it down:
- \( 10\% \text{ of } 240 = 240 \times 0.10 = 24 \)
- \( 5\% \text{ of } 240 = \frac{24}{2} = 12 \) (since 5% is half of 10%)
- \( 15\% = 10\% + 5\% = 24 + 12 = 36 \)
**Answer:** 36
4.2.2 Tool Calling
Step-3.5-Flash supports tool calling. Enable the tool call parser during deployment:
Python Example:
Start sglang server:
sglang serve \
--model-path stepfun-ai/Step-3.5-Flash \
--tp 4 \
--ep 4 \
--reasoning-parser step3p5 \
--tool-call-parser step3p5
from openai import OpenAI
import json
client = OpenAI(
base_url="http://localhost:30000/v1",
api_key="EMPTY"
)
# 1. define tools
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string", "description": "The city name"},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}
},
"required": ["location"]
}
}
}
]
# 2. Tool implementation
def get_weather(location, unit="celsius"):
return f"The weather in {location} is 22°{unit[0].upper()} and sunny."
# 3. send first request
print("--- Sending first request ---")
response = client.chat.completions.create(
model="stepfun-ai/Step-3.5-Flash",
messages=[
{"role": "user", "content": "What's the weather in Beijing?"}
],
tools=tools,
temperature=1.0,
stream=False
)
message = response.choices[0].message
# 4. Handle Reasoning Content
reasoning = getattr(message, 'reasoning_content', None)
if reasoning:
print("=============== Thinking =================")
print(reasoning)
print("==========================================")
# 5. Handle Tool Calls
if message.tool_calls:
print("\n🔧 Tool Calls detected:")
history_messages = [
{"role": "user", "content": "What's the weather in Beijing?"},
message
]
for tool_call in message.tool_calls:
print(f" Tool: {tool_call.function.name}")
print(f" Args: {tool_call.function.arguments}")
args = json.loads(tool_call.function.arguments)
tool_result = get_weather(args.get("location"), args.get("unit", "celsius"))
history_messages.append({
"role": "tool",
"tool_call_id": tool_call.id,
"content": tool_result
})
print("\n--- Sending tool results ---")
final_response = client.chat.completions.create(
model="stepfun-ai/Step-3.5-Flash",
messages=history_messages,
temperature=1.0,
stream=False
)
print("=============== Final Content =================")
print(final_response.choices[0].message.content)
else:
if message.content:
print("=============== Content =================")
print(message.content)
Output Example:
--- Sending first request ---
=============== Thinking =================
The user is asking for the weather in Beijing. I should use the get_weather function with location="Beijing". The unit parameter is optional and the user didn't specify a preference, so I'll leave it out (the default should be fine).
==========================================
🔧 Tool Calls detected:
Tool: get_weather
Args: {"location": "Beijing"}
--- Sending tool results ---
=============== Final Content =================
The weather in Beijing is 22°C and sunny.
Note:
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: NVIDIA H200 GPU (4x)
- Model: Step-3.5-Flash
- Tensor Parallelism: 4
- Expert Parallelism: 4
- SGLang version: 0.5.8
We use SGLang's built-in benchmarking tool (sglang.bench_serving) for performance evaluation. Its random dataset mode draws text from the ShareGPT_Vicuna_unfiltered dataset, which contains real conversation data and therefore better reflects performance in actual use scenarios.
5.1.1 Standard Scenario Benchmark
- Model Deployment Command:
sglang serve \
--model-path stepfun-ai/Step-3.5-Flash \
--tp 4 \
--ep 4
5.1.1.1 Low Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model stepfun-ai/Step-3.5-Flash \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 35.30
Total input tokens: 6091
Total input text tokens: 6091
Total generated tokens: 4220
Total generated tokens (retokenized): 4212
Request throughput (req/s): 0.28
Input token throughput (tok/s): 172.57
Output token throughput (tok/s): 119.56
Peak output token throughput (tok/s): 124.00
Peak concurrent requests: 2
Total token throughput (tok/s): 292.14
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3527.94
Median E2E Latency (ms): 2884.72
P90 E2E Latency (ms): 6350.38
P99 E2E Latency (ms): 7858.53
---------------Time to First Token----------------
Mean TTFT (ms): 107.53
Median TTFT (ms): 80.93
P99 TTFT (ms): 269.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 8.12
Median TPOT (ms): 8.13
P99 TPOT (ms): 8.14
---------------Inter-Token Latency----------------
Mean ITL (ms): 8.12
Median ITL (ms): 8.11
P95 ITL (ms): 8.61
P99 ITL (ms): 8.91
Max ITL (ms): 20.77
==================================================
5.1.1.2 Medium Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model stepfun-ai/Step-3.5-Flash \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 54.06
Total input tokens: 39588
Total input text tokens: 39588
Total generated tokens: 40805
Total generated tokens (retokenized): 40479
Request throughput (req/s): 1.48
Input token throughput (tok/s): 732.33
Output token throughput (tok/s): 754.84
Peak output token throughput (tok/s): 928.00
Peak concurrent requests: 21
Total token throughput (tok/s): 1487.17
Concurrency: 14.06
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 9501.23
Median E2E Latency (ms): 10010.71
P90 E2E Latency (ms): 15655.09
P99 E2E Latency (ms): 18803.63
---------------Time to First Token----------------
Mean TTFT (ms): 198.34
Median TTFT (ms): 89.50
P99 TTFT (ms): 984.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 18.97
Median TPOT (ms): 18.80
P99 TPOT (ms): 35.67
---------------Inter-Token Latency----------------
Mean ITL (ms): 18.27
Median ITL (ms): 17.48
P95 ITL (ms): 18.44
P99 ITL (ms): 62.47
Max ITL (ms): 460.85
==================================================
5.1.1.3 High Concurrency
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model stepfun-ai/Step-3.5-Flash \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 125.88
Total input tokens: 249331
Total input text tokens: 249331
Total generated tokens: 252662
Total generated tokens (retokenized): 251323
Request throughput (req/s): 3.97
Input token throughput (tok/s): 1980.77
Output token throughput (tok/s): 2007.23
Peak output token throughput (tok/s): 2500.00
Peak concurrent requests: 109
Total token throughput (tok/s): 3987.99
Concurrency: 92.25
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 23223.31
Median E2E Latency (ms): 22631.90
P90 E2E Latency (ms): 42269.38
P99 E2E Latency (ms): 47637.53
---------------Time to First Token----------------
Mean TTFT (ms): 372.13
Median TTFT (ms): 127.26
P99 TTFT (ms): 1880.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 46.06
Median TPOT (ms): 47.61
P99 TPOT (ms): 51.34
---------------Inter-Token Latency----------------
Mean ITL (ms): 45.31
Median ITL (ms): 39.86
P95 ITL (ms): 72.49
P99 ITL (ms): 117.05
Max ITL (ms): 1359.81
==================================================
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Benchmark Command:
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
- Test Results:
- Step-3.5-Flash
Accuracy: 0.885
Invalid: 0.005
Latency: 9.986 s
Output throughput: 1972.911 token/s