Fireworks Training · Now in Preview
Full-parameter training, custom loss functions, and frontier RL. All on the same infrastructure already serving production for Cursor, Vercel, and Genspark.
Your model is your product. Your data is your moat.
Three Entry Points, One Platform
Same platform, same deployment targets throughout. Drop down when you need more control.
Describe your task. Deploy your model.
An autonomous agent that handles the entire training pipeline. Describe your goal, upload your data. The Agent handles data prep, model selection, hyperparameter sweep, evals, and deployment. Currently LoRA-only.
You pick the method. We run everything else.
Choose SFT, DPO, or RFT. We handle GPU provisioning, distributed training, checkpointing, and scaling. Full-parameter training available for the deepest behavioral changes. One-click deployment to production.
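To make the DPO option concrete: DPO trains on preference pairs by comparing policy and reference log-probs. The sketch below is the standard DPO objective in plain Python for a single pair; it is illustrative only and is not Fireworks API code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probs.

    pi_*  : log-probs of the chosen/rejected responses under the policy
    ref_* : log-probs of the same responses under the frozen reference model
    beta  : strength of the KL-style penalty toward the reference
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model's preference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))
```

With zero margin the loss sits at log 2; it falls as the policy learns to prefer the chosen response more strongly than the reference does.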
Your objective. Our compute. No constraints.
Bring your own training loop. The platform has no opinion on your objective. Write custom loss functions, run RL at scale across regions, or chain SFT into RFT with full optimizer state preserved. LoRA and full-parameter supported.
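As a sketch of the kind of objective "bring your own training loop" allows, here is a minimal REINFORCE-style policy-gradient surrogate loss in plain Python. It is purely illustrative: a real loop would operate on tensors inside your framework of choice, with the platform handling execution underneath.

```python
def reinforce_loss(logprobs: list[float], rewards: list[float],
                   baseline: float = 0.0) -> float:
    """REINFORCE surrogate loss: -mean((reward - baseline) * log-prob).

    logprobs : log-probs of the sampled actions/tokens under the policy
    rewards  : scalar reward for each sample (e.g. from a reward function)
    baseline : variance-reduction baseline subtracted from rewards
    """
    n = len(logprobs)
    return -sum((r - baseline) * lp for lp, r in zip(logprobs, rewards)) / n
```

Minimizing this surrogate pushes probability mass toward samples whose reward exceeds the baseline, which is the core of RFT-style reward-function training.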
| | Training Agent | Managed Training | Training API |
|---|---|---|---|
| You bring | Data + description | Formatted data + method | Your training loop + loss functions |
| We handle | Everything | GPU provisioning, distributed training, checkpointing | GPU execution, model parallelism, weight syncing, preemption recovery |
| LoRA / Full-param | LoRA only | LoRA + full-parameter | LoRA + full-parameter |
| Custom loss | No | No | Yes, any objective |
| Frontier RL | No | RFT with reward functions | Full control: cross-region, custom rollout |
| Pricing | Per job, upfront | Per token / GPU-hr | Per GPU-hr |
Proven in Production
Full-Parameter Training
Most training services cap out at LoRA. LoRA is the right starting point: fast, cost-effective, and well-suited for rapid iteration. But LoRA and full-parameter training learn in meaningfully different ways. LoRA learns less and forgets less. Full-parameter produces behavioral changes that adapter-based methods can't reach.
Fireworks Training supports full-parameter training across the model catalog, from Qwen3 8B on a single node up to Kimi K2.5 at 1 trillion parameters. LoRA and full-parameter run on the same platform. You don't have to choose your ceiling when you start.
One Infrastructure
Fireworks runs production inference across DeepSeek, Kimi, Qwen, and GPT OSS at scale. That experience is built into the training platform. The numerical edge cases that surface in frontier MoE models aren't hypothetical to us. We've debugged them in production.
A trained checkpoint becomes a live endpoint in seconds. No format conversion, no serving stack migration. Training and inference share the same kernels, the same hardware, so model behavior in training is model behavior in production.
We publish the k3 KL divergence between training and inference checkpoints for every model in our catalog; values below 0.01 indicate production-grade numerical parity. If your training and serving stacks disagree numerically, your evals are measuring the gap between them, not model quality.
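For reference, k3 is a standard low-variance KL estimator computed from matched per-token log-probs. The sketch below shows the estimator itself; it is illustrative and not a description of Fireworks' exact measurement pipeline.

```python
import math


def k3_kl(logp_train: list[float], logp_infer: list[float]) -> float:
    """k3 estimator of KL divergence from matched per-token log-probs.

    For each token, with log_r = logp_infer - logp_train and r = exp(log_r),
    the k3 sample is (r - 1) - log_r, which is always >= 0 and averages to
    an estimate of the KL between the two distributions.
    """
    samples = []
    for lt, li in zip(logp_train, logp_infer):
        log_r = li - lt
        samples.append((math.exp(log_r) - 1.0) - log_r)
    return sum(samples) / len(samples)
```

Identical stacks yield exactly 0; any numerical drift between training and serving pushes the estimate above 0, which is what the 0.01 threshold guards against.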
Deploy at Scale
Training a great model is half the story. Fireworks Multi-LoRA serving runs hundreds of fine-tuned adapters per GPU, sharing a single base model, deployed in one click with zero extra infrastructure cost.
Iterate across experiments or serve different model variants per customer, without paying for a dedicated GPU per adapter. Better model, same GPU budget.
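A back-of-envelope calculation shows why adapters are cheap to co-host. The shapes below (layer count, adapted modules, hidden size, rank, fp16 storage) are illustrative assumptions, not a specific model's configuration:

```python
def adapter_mb(n_layers: int, n_target_modules: int, d_model: int,
               rank: int, bytes_per_param: int = 2) -> float:
    """Approximate size in MB of one LoRA adapter.

    Each adapted module stores two low-rank factors of rank * d_model
    parameters each; bytes_per_param=2 assumes fp16 storage.
    """
    params = n_layers * n_target_modules * 2 * rank * d_model
    return params * bytes_per_param / 1e6


# Illustrative: 32 layers, 4 adapted modules per layer, d_model=4096, rank 16.
size = adapter_mb(32, 4, 4096, 16)  # ~33.6 MB per adapter
```

At tens of megabytes per adapter against a multi-gigabyte base model, hundreds of adapters fit alongside a single shared copy of the base weights on one GPU.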
Why Fireworks
| Alternative | Examples | The limitation | Fireworks advantage |
|---|---|---|---|
| Closed models | OpenAI, Anthropic | No weight ownership. High cost at scale. No retraining on your own data. | Open models you fully own. Retrain from production data. Compound over time. |
| LoRA-only training APIs | Open research APIs | No full-parameter support. No production inference colocation. Training and serving are separate stacks. | Full-parameter up to 1T parameters. Training and inference on the same infrastructure. Numerical parity verified. |
| Cloud-native | AWS Bedrock, GCP Vertex | Training and inference are separate silos. Limited open model support. | Model-agnostic across the full frontier catalog: DeepSeek, Kimi, Qwen, GPT OSS, and more. |
| Self-managed | PyTorch distributed, DIY clusters | Months of infra work before your first run. Silent numerical bugs. Ongoing ops burden. | Production-grade infrastructure on day one. Numerical parity guaranteed. Automated checkpointing and recovery. |
Get Started
Fireworks Training is in preview. Models under 16B parameters train at no cost.
New to fine-tuning? Training Agent is the fastest way to get a model into production.