Kimi-K2
1. Model Introduction
Kimi-K2 is a state-of-the-art mixture-of-experts (MoE) language model from Moonshot AI with 32B activated parameters and 1T total parameters.
Model Variants:
- Kimi-K2-Instruct: Post-trained model optimized for general-purpose chat and agentic tasks. Compatible with vLLM, SGLang, KTransformers, and TensorRT-LLM.
- Kimi-K2-Thinking: Advanced thinking model with step-by-step reasoning and tool calling. Native INT4 quantization with 256k context window. Ideal for complex reasoning and multi-step tool use.
For details, see the official documentation and technical report.
2. SGLang Installation
Refer to the official SGLang installation guide.
3. Model Deployment
This section provides a progressive guide, from quick deployment to performance optimization, for users at different experience levels.
3.1 Basic Configuration
Start from the minimal command below and adjust it for your hardware platform, model variant, deployment strategy, and required capabilities (see 3.2):
python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2-Instruct \
--tp 8 \
--trust-remote-code
3.2 Configuration Tips
- Memory: Requires 8 GPUs with ≥140GB memory each (e.g., H200/B200). Use --context-length 128000 to conserve memory.
- Expert Parallelism (EP): Use --ep for better MoE throughput. See the EP docs.
- Data Parallelism (DP): Enable with --dp 4 --enable-dp-attention for production throughput.
- KV Cache: Use --kv-cache-dtype fp8_e4m3 to roughly halve KV cache memory usage (CUDA 11.8+).
- Reasoning Parser: Add --reasoning-parser kimi_k2 for Kimi-K2-Thinking to separate thinking from content.
- Tool Call Parser: Add --tool-call-parser kimi_k2 for structured tool calls.
4. Model Invocation
4.1 Basic Usage
See Basic API Usage.
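As a quick check, a minimal non-streaming request against SGLang's OpenAI-compatible endpoint looks like the sketch below; it assumes the server was launched with --port 8000, as in the deployment examples later in this guide:
from openai import OpenAI

# Connect to the local SGLang server (OpenAI-compatible API)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Instruct",
    messages=[{"role": "user", "content": "Introduce Kimi-K2 in one sentence."}],
    temperature=0.6,
    max_tokens=256,
)

print(response.choices[0].message.content)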
4.2 Advanced Usage
4.2.1 Reasoning Parser
Enable reasoning parser for Kimi-K2-Thinking:
python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2-Thinking \
--reasoning-parser kimi_k2 \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
Example:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Enable streaming to see the thinking process in real-time
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[
        {"role": "user", "content": "Solve this problem step by step: What is 15% of 240?"}
    ],
    temperature=0.6,
    max_tokens=2048,
    stream=True
)

# Process the stream
has_thinking = False
has_answer = False
thinking_started = False

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Print answer content
        if delta.content:
            # Close thinking section and add content header
            if has_thinking and not has_answer:
                print("\n=============== Content =================", flush=True)
                has_answer = True
            print(delta.content, end="", flush=True)

print()
Output Example:
=============== Thinking =================
The user asks: "What is 15% of 240?" This is a straightforward percentage calculation problem. I need to solve it step by step.
Step 1: Understand what "percent" means.
- "Percent" means "per hundred". So 15% means 15 per 100, or 15/100, or 0.15.
Step 2: Convert the percentage to a decimal.
- 15% = 15 / 100 = 0.15
Step 3: Multiply the decimal by the number.
- 0.15 * 240
Step 4: Perform the multiplication.
- 0.15 * 240 = (15/100) * 240
- = 15 * 240 / 100
- = 3600 / 100
- = 36
Alternatively, I can calculate it directly:
- 0.15 * 240
- 15 * 240 = 3600
- 3600 / 100 = 36
Or, break it down:
- 10% of 240 = 24
- 5% of 240 = half of 10% = 12
- 15% of 240 = 10% + 5% = 24 + 12 = 36
I should present the solution clearly with steps. The most standard method is converting to decimal and multiplying.
Let me structure the answer:
1. Convert the percentage to a decimal.
2. Multiply the decimal by the number.
3. Show the calculation.
4. State the final answer.
This is simple and easy to follow.
=============== Content =================
Here is the step-by-step solution:
**Step 1: Convert the percentage to a decimal**
15% means 15 per 100, which is 15 ÷ 100 = **0.15**
**Step 2: Multiply the decimal by the number**
0.15 × 240
**Step 3: Calculate the result**
0.15 × 240 = **36**
**Answer:** 15% of 240 is **36**.
Note: The reasoning parser captures the model's step-by-step thinking process, allowing you to see how the model arrives at its conclusions.
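The separated reasoning is also available without streaming. The sketch below assumes the parsed thinking is exposed as a reasoning_content field on the returned message, mirroring the streaming deltas above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Non-streaming request: the reasoning parser splits thinking from the final answer
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[{"role": "user", "content": "What is 15% of 240?"}],
    temperature=0.6,
    max_tokens=2048,
)

message = response.choices[0].message
# reasoning_content holds the thinking; content holds the final answer
print("Thinking:", getattr(message, "reasoning_content", None))
print("Answer:", message.content)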
4.2.2 Tool Calling
Both Kimi-K2-Instruct and Kimi-K2-Thinking support tool calling. Enable the tool call parser during deployment:
Deployment Command:
python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2-Instruct \
--tool-call-parser kimi_k2 \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
Python Example (with Thinking Process, assuming the server runs Kimi-K2-Thinking with both --reasoning-parser kimi_k2 and --tool-call-parser kimi_k2):
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request with streaming to see thinking process
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=[
        {"role": "user", "content": "What's the weather in Beijing?"}
    ],
    tools=tools,
    temperature=0.7,
    stream=True
)

# Process streaming response
thinking_started = False
has_thinking = False
tool_calls_accumulator = {}

for chunk in response:
    if chunk.choices and len(chunk.choices) > 0:
        delta = chunk.choices[0].delta

        # Print thinking process
        if hasattr(delta, 'reasoning_content') and delta.reasoning_content:
            if not thinking_started:
                print("=============== Thinking =================", flush=True)
                thinking_started = True
                has_thinking = True
            print(delta.reasoning_content, end="", flush=True)

        # Accumulate tool calls
        if hasattr(delta, 'tool_calls') and delta.tool_calls:
            # Close thinking section if needed
            if has_thinking and thinking_started:
                print("\n=============== Content =================\n", flush=True)
                thinking_started = False
            for tool_call in delta.tool_calls:
                index = tool_call.index
                if index not in tool_calls_accumulator:
                    tool_calls_accumulator[index] = {
                        'name': None,
                        'arguments': ''
                    }
                if tool_call.function:
                    if tool_call.function.name:
                        tool_calls_accumulator[index]['name'] = tool_call.function.name
                    if tool_call.function.arguments:
                        tool_calls_accumulator[index]['arguments'] += tool_call.function.arguments

        # Print content
        if delta.content:
            print(delta.content, end="", flush=True)

# Print accumulated tool calls
for index, tool_call in sorted(tool_calls_accumulator.items()):
    print(f"🔧 Tool Call: {tool_call['name']}")
    print(f"   Arguments: {tool_call['arguments']}")
    print()
Output Example:
=============== Thinking =================
The user is asking about the weather in Beijing. I need to use the get_weather function to retrieve this information. Beijing is a major city in China, so I should be able to get weather data for it. The location parameter is required, but the unit parameter is optional. Since the user didn't specify a temperature unit, I can just provide the location and let the function use its default. I'll check the weather in Beijing for you.
=============== Content =================
🔧 Tool Call: get_weather
Arguments: {"location":"Beijing"}
Note:
- The reasoning parser shows how the model decides to use a tool
- Tool calls are clearly marked with the function name and arguments
- You can then execute the function and send the result back to continue the conversation
Handling Tool Call Results:
# After getting the tool call, execute the function
def get_weather(location, unit="celsius"):
    # Your actual weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Send tool result back to the model
messages = [
    {"role": "user", "content": "What's the weather in Beijing?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Beijing", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Beijing", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="moonshotai/Kimi-K2-Thinking",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The weather in Beijing is currently 22°C and sunny."
5. Benchmark
5.1 Speed Benchmark
Test Environment:
- Hardware: 8× NVIDIA B200 GPUs
- Model: Kimi-K2-Instruct
- SGLang version: 0.5.6.post1
We use SGLang's built-in benchmarking tool (sglang.bench_serving) to evaluate performance on the ShareGPT_Vicuna_unfiltered dataset. This dataset contains real conversation data and therefore better reflects performance in real-world usage.
5.1.1 Latency-Sensitive Benchmark
- Model Deployment Command:
python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2-Instruct \
--tp 8 \
--dp 4 \
--enable-dp-attention \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 8000 \
--model moonshotai/Kimi-K2-Instruct \
--num-prompts 10 \
--max-concurrency 1
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 1
Successful requests: 10
Benchmark duration (s): 44.93
Total input tokens: 1951
Total input text tokens: 1951
Total input vision tokens: 0
Total generated tokens: 2755
Total generated tokens (retokenized): 2748
Request throughput (req/s): 0.22
Input token throughput (tok/s): 43.42
Output token throughput (tok/s): 61.32
Peak output token throughput (tok/s): 64.00
Peak concurrent requests: 3
Total token throughput (tok/s): 104.74
Concurrency: 1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 4489.56
Median E2E Latency (ms): 4994.53
---------------Time to First Token----------------
Mean TTFT (ms): 141.22
Median TTFT (ms): 158.28
P99 TTFT (ms): 166.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 18.40
Median TPOT (ms): 15.63
P99 TPOT (ms): 39.88
---------------Inter-Token Latency----------------
Mean ITL (ms): 15.78
Median ITL (ms): 15.76
P95 ITL (ms): 16.36
P99 ITL (ms): 16.59
Max ITL (ms): 19.94
==================================================
5.1.2 Throughput-Sensitive Benchmark
- Model Deployment Command:
python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2-Instruct \
--tp 8 \
--dp 4 \
--ep 4 \
--enable-dp-attention \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
- Benchmark Command:
python3 -m sglang.bench_serving \
--backend sglang \
--host 127.0.0.1 \
--port 8000 \
--model moonshotai/Kimi-K2-Instruct \
--num-prompts 1000 \
--max-concurrency 100
- Test Results:
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: 100
Successful requests: 1000
Benchmark duration (s): 174.11
Total input tokens: 296642
Total input text tokens: 296642
Total input vision tokens: 0
Total generated tokens: 193831
Total generated tokens (retokenized): 168687
Request throughput (req/s): 5.74
Input token throughput (tok/s): 1703.73
Output token throughput (tok/s): 1113.25
Peak output token throughput (tok/s): 2383.00
Peak concurrent requests: 112
Total token throughput (tok/s): 2816.97
Concurrency: 89.60
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 15601.09
Median E2E Latency (ms): 10780.52
---------------Time to First Token----------------
Mean TTFT (ms): 457.42
Median TTFT (ms): 221.62
P99 TTFT (ms): 2475.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 97.23
Median TPOT (ms): 85.61
P99 TPOT (ms): 435.95
---------------Inter-Token Latency----------------
Mean ITL (ms): 78.61
Median ITL (ms): 43.66
P95 ITL (ms): 169.53
P99 ITL (ms): 260.91
Max ITL (ms): 1703.21
==================================================
5.2 Accuracy Benchmark
5.2.1 GSM8K Benchmark
- Server Command
python3 -m sglang.launch_server \
--model-path moonshotai/Kimi-K2-Instruct \
--tp 8 \
--dp 4 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
- Benchmark Command
python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --port 8000
- Result:
Accuracy: 0.960
Invalid: 0.000
Latency: 15.956 s
Output throughput: 1231.699 token/s