Step-3.5

1. Model Introduction

Step-3.5-Flash is StepFun's production-grade reasoning model, built to deliver strong intelligence without heavy compute. It cuts attention cost to enable low-latency, cost-effective long-context inference, and is purpose-built for autonomous agents in real-world workflows. The model is available in multiple quantization formats optimized for different hardware platforms.

This generation delivers comprehensive upgrades across the board:

  • Hybrid Attention Architecture: Interleaves Sliding Window Attention (SWA) and Global Attention (GA) at a 3:1 ratio with an aggressive 128-token window. This hybrid approach maintains consistent performance on massive datasets and long codebases while significantly reducing the computational overhead of standard long-context models (see the illustrative sketch after this list).
  • Sparse Mixture-of-Experts: Only 11B of the 196B total parameters are active per token.
  • Multi-Layer Multi-Token Prediction (MTP): Equipped with 3-way Multi-Token Prediction (MTP-3), enabling complex, multi-step reasoning chains with immediate responsiveness.
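
To make the hybrid attention layer pattern concrete, here is a small illustrative sketch. It is not StepFun's implementation: it assumes the 3:1 ratio means three SWA layers for every GA layer and that SWA uses a causal 128-token window; the layer count and helper names are invented for the example.

import numpy as np

NUM_LAYERS = 8      # hypothetical layer count, for illustration only
SWA_WINDOW = 128    # sliding-window size stated above
SWA_PER_GA = 3      # assumed interpretation of the 3:1 SWA-to-GA ratio

def layer_kind(layer_idx: int) -> str:
    """Every (SWA_PER_GA + 1)-th layer is global attention; the rest use SWA."""
    return "global" if (layer_idx + 1) % (SWA_PER_GA + 1) == 0 else "sliding"

def attention_mask(seq_len: int, kind: str) -> np.ndarray:
    """mask[i, j] is True when query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if kind == "global":
        return causal
    # Sliding window: only the most recent SWA_WINDOW tokens are visible.
    return causal & (i - j < SWA_WINDOW)

for layer in range(NUM_LAYERS):
    kind = layer_kind(layer)
    mask = attention_mask(512, kind)
    # A global layer sees ~n^2/2 positions; an SWA layer only ~n * window.
    print(f"layer {layer}: {kind:8s} visible key positions = {int(mask.sum())}")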

2. SGLang Installation

Step-3.5-Flash is currently available in SGLang via a Docker image.

Docker

# Pull the docker image
docker pull lmsysorg/sglang:dev-pr-18084

# Launch the container
docker run -it --gpus all \
--shm-size=32g \
--ipc=host \
--network=host \
lmsysorg/sglang:dev-pr-18084 bash

3. Model Deployment

This section provides deployment configurations optimized for different hardware platforms and use cases.

3.1 Basic Configuration

The Step-3.5-Flash series comes in a single size. Recommended starting configurations vary depending on your hardware.

Choose the deployment command according to your hardware platform, quantization method, reasoning parser, tool call parser, and speculative decoding settings. A baseline command for a 4-GPU tensor-parallel deployment:
sglang serve \
  --model-path stepfun-ai/Step-3.5-Flash \
  --tp 4

4. Model Invocation

4.1 Basic Usage

Step-3.5-Flash is served through SGLang's OpenAI-compatible API; for full API usage and request details, refer to the SGLang documentation.
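
As an orientation, here is a minimal non-streaming request sketch. It assumes the baseline deployment from Section 3.1 is running on SGLang's default port 30000; adjust base_url if you started the server with a different --port.

from openai import OpenAI

# Assumes the server from Section 3.1 is listening on SGLang's default port 30000.
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",
    messages=[
        {"role": "user", "content": "Briefly introduce yourself."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)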

4.2 Advanced Usage

4.2.1 Reasoning Parser

Step-3.5-Flash only supports reasoning mode. Enable the reasoning parser during deployment to separate the thinking and content sections:

sglang serve \
--model-path stepfun-ai/Step-3.5-Flash \
--tp 4 \
--ep 4 \
--reasoning-parser step3p5
Python Example:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",  # SGLang's default port; adjust if you pass --port
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.7,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()

Output Example:

=============== Thinking =================
We are asked: "What is 15% of 240?" We need to solve step by step.

Step 1: Understand that "15% of 240" means we need to calculate 15 percent of 240. In mathematical terms, it is (15/100) * 240.

Step 2: Simplify the calculation. We can compute 15% of 240 by first finding 10% of 240 and then 5% of 240, and adding them. Alternatively, we can multiply directly.

Method 1:
10% of 240 = 240 * 0.10 = 24.
5% is half of 10%, so 5% of 240 = 24 / 2 = 12.
Then 15% = 10% + 5% = 24 + 12 = 36.

Method 2: Direct multiplication: 15% = 15/100 = 0.15, so 0.15 * 240 = 36.

We can also compute fractionally: (15/100)*240 = (15*240)/100. 15*240 = 3600, divided by 100 gives 36.

Thus, the answer is 36.

We'll present the solution step by step.

=============== Content =================

To find 15% of 240, follow these steps:

1. **Convert the percentage to a decimal**:
\( 15\% = \frac{15}{100} = 0.15 \)

2. **Multiply by the number**:
\( 0.15 \times 240 = 36 \)

Alternatively, break it down:
- \( 10\% \text{ of } 240 = 240 \times 0.10 = 24 \)
- \( 5\% \text{ of } 240 = \frac{24}{2} = 12 \) (since 5% is half of 10%)
- \( 15\% = 10\% + 5\% = 24 + 12 = 36 \)

**Answer:** 36

4.2.2 Tool Calling

Step-3.5-Flash supports tool calling. Enable the tool call parser during deployment:

Python Example:

Start the sglang server:

sglang serve \
--model-path stepfun-ai/Step-3.5-Flash \
--tp 4 \
--ep 4 \
--reasoning-parser step3p5 \
--tool-call-parser step3p5

Client code:
from openai import OpenAI
import json

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# 1. Define tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "The city name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"], "description": "Temperature unit"}
                },
                "required": ["location"]
            }
        }
    }
]

# 2. Tool implementation
def get_weather(location, unit="celsius"):
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# 3. Send first request
print("--- Sending first request ---")
response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=1.0,
    stream=False
)

message = response.choices[0].message

# 4. Handle reasoning content
reasoning = getattr(message, 'reasoning_content', None)
if reasoning:
    print("=============== Thinking =================")
    print(reasoning)
    print("==========================================")

# 5. Handle tool calls
if message.tool_calls:
    print("\n🔧 Tool Calls detected:")
    history_messages = [
        {"role": "user", "content": "What's the weather in Beijing?"},
        message
    ]

    for tool_call in message.tool_calls:
        print(f"  Tool: {tool_call.function.name}")
        print(f"  Args: {tool_call.function.arguments}")

        args = json.loads(tool_call.function.arguments)
        tool_result = get_weather(args.get("location"), args.get("unit", "celsius"))

        history_messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": tool_result
        })

    print("\n--- Sending tool results ---")
    final_response = client.chat.completions.create(
        model="stepfun-ai/Step-3.5-Flash",
        messages=history_messages,
        temperature=1.0,
        stream=False
    )

    print("=============== Final Content =================")
    print(final_response.choices[0].message.content)
else:
    if message.content:
        print("=============== Content =================")
        print(message.content)

Output Example:

--- Sending first request ---
=============== Thinking =================
The user is asking for the weather in Beijing. I should use the get_weather function with location="Beijing". The unit parameter is optional and the user didn't specify a preference, so I'll leave it out (the default should be fine).

==========================================

🔧 Tool Calls detected:
Tool: get_weather
Args: {"location": "Beijing"}

--- Sending tool results ---
=============== Final Content =================
The weather in Beijing is 22°C and sunny.

Note:

  • The reasoning parser shows how the model decides to use a tool
  • Tool calls are clearly marked with the function name and arguments
  • You can then execute the function and send the result back to continue the conversation

5. Benchmark

5.1 Speed Benchmark

Test Environment:

  • Hardware: NVIDIA H200 GPU (4x)
  • Model: Step-3.5-Flash
  • Tensor Parallelism: 4
  • Expert Parallelism: 4
  • sglang version: 0.5.8

We use SGLang's built-in benchmarking tool, sglang.bench_serving, to conduct the performance evaluation. The commands below generate synthetic requests with the random dataset, using roughly 1000 input tokens and 1000 output tokens per request, at three concurrency levels.

5.1.1 Standard Scenario Benchmark

  • Model Deployment Command:
sglang serve \
--model-path stepfun-ai/Step-3.5-Flash \
--tp 4 \
--ep 4
5.1.1.1 Low Concurrency
  • Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model stepfun-ai/Step-3.5-Flash \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 10 \
--max-concurrency 1
  • Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 35.30
Total input tokens: 6091
Total input text tokens: 6091
Total generated tokens: 4220
Total generated tokens (retokenized): 4212
Request throughput (req/s): 0.28
Input token throughput (tok/s): 172.57
Output token throughput (tok/s): 119.56
Peak output token throughput (tok/s): 124.00
Peak concurrent requests: 2
Total token throughput (tok/s): 292.14
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 3527.94
Median E2E Latency (ms): 2884.72
P90 E2E Latency (ms): 6350.38
P99 E2E Latency (ms): 7858.53
---------------Time to First Token----------------
Mean TTFT (ms): 107.53
Median TTFT (ms): 80.93
P99 TTFT (ms): 269.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 8.12
Median TPOT (ms): 8.13
P99 TPOT (ms): 8.14
---------------Inter-Token Latency----------------
Mean ITL (ms): 8.12
Median ITL (ms): 8.11
P95 ITL (ms): 8.61
P99 ITL (ms): 8.91
Max ITL (ms): 20.77
==================================================
5.1.1.2 Medium Concurrency
  • Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model stepfun-ai/Step-3.5-Flash \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 80 \
--max-concurrency 16
  • Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 16
Successful requests: 80
Benchmark duration (s): 54.06
Total input tokens: 39588
Total input text tokens: 39588
Total generated tokens: 40805
Total generated tokens (retokenized): 40479
Request throughput (req/s): 1.48
Input token throughput (tok/s): 732.33
Output token throughput (tok/s): 754.84
Peak output token throughput (tok/s): 928.00
Peak concurrent requests: 21
Total token throughput (tok/s): 1487.17
Concurrency: 14.06
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 9501.23
Median E2E Latency (ms): 10010.71
P90 E2E Latency (ms): 15655.09
P99 E2E Latency (ms): 18803.63
---------------Time to First Token----------------
Mean TTFT (ms): 198.34
Median TTFT (ms): 89.50
P99 TTFT (ms): 984.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 18.97
Median TPOT (ms): 18.80
P99 TPOT (ms): 35.67
---------------Inter-Token Latency----------------
Mean ITL (ms): 18.27
Median ITL (ms): 17.48
P95 ITL (ms): 18.44
P99 ITL (ms): 62.47
Max ITL (ms): 460.85
==================================================
5.1.1.3 High Concurrency
  • Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--model stepfun-ai/Step-3.5-Flash \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--num-prompts 500 \
--max-concurrency 100
  • Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 500
Benchmark duration (s): 125.88
Total input tokens: 249331
Total input text tokens: 249331
Total generated tokens: 252662
Total generated tokens (retokenized): 251323
Request throughput (req/s): 3.97
Input token throughput (tok/s): 1980.77
Output token throughput (tok/s): 2007.23
Peak output token throughput (tok/s): 2500.00
Peak concurrent requests: 109
Total token throughput (tok/s): 3987.99
Concurrency: 92.25
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 23223.31
Median E2E Latency (ms): 22631.90
P90 E2E Latency (ms): 42269.38
P99 E2E Latency (ms): 47637.53
---------------Time to First Token----------------
Mean TTFT (ms): 372.13
Median TTFT (ms): 127.26
P99 TTFT (ms): 1880.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 46.06
Median TPOT (ms): 47.61
P99 TPOT (ms): 51.34
---------------Inter-Token Latency----------------
Mean ITL (ms): 45.31
Median ITL (ms): 39.86
P95 ITL (ms): 72.49
P99 ITL (ms): 117.05
Max ITL (ms): 1359.81
==================================================

5.2 Accuracy Benchmark

5.2.1 GSM8K Benchmark

  • Benchmark Command:
python3 -m sglang.test.few_shot_gsm8k --num-questions 200
  • Results:

    • Step-3.5-Flash
      Accuracy: 0.885
      Invalid: 0.005
      Latency: 9.986 s
      Output throughput: 1972.911 token/s