Llama 3.3 70B

1. Model Introduction

Llama-3.3-70B-Instruct is Meta's latest 70 billion parameter instruction-tuned language model, featuring improved performance and efficiency over Llama 3.1. With a 128K token context window and enhanced capabilities across reasoning, coding, and multilingual tasks, Llama 3.3 delivers state-of-the-art results while maintaining accessibility for production deployment.

Key Features:

  • Enhanced Performance: Improved instruction following, reasoning, and task completion over Llama 3.1
  • Tool Calling: Native support for function calling and tool use scenarios
  • Multilingual Support: Optimized for 8 languages (English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai)
  • Extended Context: 128K token context window for processing long documents and complex tasks
  • Efficient Deployment: The 70B parameter size allows deployment on a single AMD MI300X GPU

License: Llama 3.3 is licensed under the Llama 3.3 Community License. See LICENSE for details.

For more details, please refer to the official Llama models repository.

2. SGLang Installation

Please refer to the official SGLang installation guide for installation instructions.

3. Model Deployment

This section provides deployment configurations optimized for AMD GPUs (MI300X, MI325X, MI355X).

3.1 Interactive Configuration

Interactive Command Generator: Use the configuration selector below (hardware platform, quantization, tool calling) to generate the appropriate deployment command for your AMD GPU setup.

Run this Command:
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tp 1 \
  --tool-call-parser llama3 \
  --host 0.0.0.0 \
  --port 30000
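
Once the server is running, you can verify that it responds before sending real traffic. The check below is a minimal sketch assuming the default host and port; it queries the OpenAI-compatible /v1/models endpoint exposed by the server:

curl http://localhost:30000/v1/models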

3.2 Configuration Tips

AMD GPU Deployment:

  • All AMD GPUs (MI300X, MI325X, MI355X) support TP=1 for both BF16 and FP8 variants
  • FP8 Model Variant: Use AMD's optimized amd/Llama-3.3-70B-Instruct-FP8-KV
  • Tool Calling: Enable with --tool-call-parser llama3 for function calling support
  • Higher Throughput: TP=2 or TP=4 can optionally be used for increased throughput (see the example below)
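
For example, an FP8 deployment with tensor parallelism across two GPUs might look like the following. This is a sketch that combines the options listed above; adjust --tp to the number of GPUs available:

python -m sglang.launch_server \
  --model-path amd/Llama-3.3-70B-Instruct-FP8-KV \
  --tp 2 \
  --tool-call-parser llama3 \
  --host 0.0.0.0 \
  --port 30000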

4. Model Invocation

4.1 Basic Usage

For basic API usage and request examples, please refer to the SGLang OpenAI-compatible API documentation. A minimal chat completion example follows.
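
The sketch below assumes a server launched as in Section 3 and reachable at the default address; the "EMPTY" API key is a placeholder accepted by local deployments:

from openai import OpenAI

# Point the OpenAI client at the local SGLang server (OpenAI-compatible API)
client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Give a one-sentence summary of tensor parallelism."}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)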

4.2 Advanced Usage

4.2.1 Tool Calling

Llama 3.3 70B Instruct supports native tool calling. Enable the tool parser during deployment:

python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tool-call-parser llama3 \
  --tp 1 \
  --host 0.0.0.0 \
  --port 30000

Python Example:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    temperature=0.7
)

# Check for tool calls
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")

Handling Tool Call Results:

# After executing the function, send the result back
def get_weather(location, unit="celsius"):
    # Your weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Build conversation with tool result
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Tokyo", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Tokyo", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    temperature=0.7
)

print(final_response.choices[0].message.content)
# Output: "The current weather in Tokyo is 22°C and sunny. A perfect day!"

4.2.2 Long Context Processing

Leverage the 128K context window for processing long documents:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Example with long document
long_document = "..." * 10000  # Your long document here

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this document:\n\n{long_document}"}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
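
To stay within the 128K window, it can help to check the prompt length before sending. The sketch below uses the model's Hugging Face tokenizer on the long_document from the previous example (an assumption; it requires the transformers package and access to the gated meta-llama repository):

from transformers import AutoTokenizer

# Load the Llama 3.3 tokenizer to count tokens client-side
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")

prompt = f"Summarize this document:\n\n{long_document}"
num_tokens = len(tokenizer.encode(prompt))
print(f"Prompt length: {num_tokens} tokens (context window: 131072)")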

5. Benchmarking

Use the SGLang benchmarking suite to test model performance with different workload patterns:

5.1 Basic Benchmark Command

python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 1000 \
  --random-input 1024 \
  --random-output 1024 \
  --max-concurrency 16

5.2 Adjusting Benchmark Parameters

Input/Output Length: Adjust --random-input and --random-output to test different workload patterns:

  • Short conversations: --random-input 1024 --random-output 1024
  • Long outputs: --random-input 1024 --random-output 8192
  • Long inputs: --random-input 8192 --random-output 1024

Concurrency Levels: Adjust --max-concurrency to test different load scenarios (a combined example follows this list):

  • Low concurrency (latency-focused): --max-concurrency 1 --num-prompts 100
  • Medium concurrency (balanced): --max-concurrency 16 --num-prompts 1000
  • High concurrency (throughput-focused): --max-concurrency 100 --num-prompts 2000
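
For example, a throughput-focused run with long outputs combines the flags above (a sketch; all values can be tuned to match your workload):

python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 2000 \
  --random-input 1024 \
  --random-output 8192 \
  --max-concurrency 100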

📚 Additional Resources