Fireworks Training · Now in Preview

Train and deploy frontier models
on one platform.

Full-parameter training, custom loss functions, and frontier RL. All on the same infrastructure already serving production traffic for Cursor, Vercel, and Genspark.

Your model is your product. Your data is your moat.

New: Full-parameter training at frontier scale · Custom loss functions · Multi-LoRA serving · Training Agent

Three Entry Points, One Platform

Start where you are. Go as far as you need.

Same platform, same deployment targets throughout. Drop down when you need more control.

For Product Teams & App Builders

Training Agent

Describe your task. Deploy your model.

An autonomous agent that runs the entire training pipeline. Describe your goal and upload your data; the Agent handles data prep, model selection, hyperparameter sweeps, evals, and deployment. Currently LoRA-only.

SFT · Classification · DPO
  • Automated data cleaning (the most common failure point in fine-tuning)
  • Evals generated and run automatically
  • Model live on production inference the moment training completes
Per-job pricing, confirmed upfront. Models under 16B train at no cost.
For ML Engineers

Managed Training

You pick the method. We run everything else.

Choose SFT, DPO, or RFT. We handle GPU provisioning, distributed training, checkpointing, and scaling. Full-parameter training available for the deepest behavioral changes. One-click deployment to production.

SFT · DPO · RFT · Full-param
  • RFT: Define a reward function instead of writing thousands of demonstrations. Outperforms SFT on complex agentic tasks.
  • Full-parameter: Behavioral changes LoRA can't produce, at any scale up to 1T parameters.
  • Multi-LoRA: Serve hundreds of adapters on a shared base, with no extra infrastructure cost per experiment.
  • Chain stages: SFT → DPO → RFT, warm-starting from any checkpoint.
Per token / per GPU-hour depending on job type.
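To make the method choice concrete, here is what DPO actually optimizes. This is a minimal numpy sketch of the standard DPO objective for one preference pair, not the Fireworks implementation; the function name and inputs are illustrative.

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained (pi_*) and a frozen reference model (ref_*).
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log sigmoid(beta * margin): near zero once the policy ranks the
    # chosen response well above the rejected one.
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# A policy that already prefers the chosen response incurs low loss.
print(dpo_loss(pi_chosen=-10.0, pi_rejected=-40.0,
               ref_chosen=-20.0, ref_rejected=-20.0))  # ≈ 0.0486
```

Chaining SFT → DPO means the SFT checkpoint typically serves as both the starting policy and the frozen reference here.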
For Production ML Teams

Training API

Your objective. Our compute. No constraints.

Bring your own training loop. The platform has no opinion on your objective. Write custom loss functions, run RL at scale across regions, or chain SFT into RFT with full optimizer state preserved. LoRA and full-parameter supported.

Custom loss · Full-param · Frontier RL
  • Custom loss: Write any objective, from GRPO, DRO, and DAPO to your own. No rigid recipes.
  • Frontier RL: Elastic rollout inference across regions with weight sync. No co-located hardware required.
  • Full-parameter up to Kimi K2.5 (1T parameters) on 64 B200s
  • Per-GPU-hour pricing, predictable cost for rollout-heavy workflows
Per GPU-hour on provisioned clusters.
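As a sense of what "write any objective" means in practice, here is the core computation behind a GRPO-style objective: group-relative advantages feeding an advantage-weighted surrogate loss. This is an illustrative numpy sketch of the published technique, not the Fireworks Training SDK; all names are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages, as in GRPO-style objectives.

    `rewards` holds scalar rewards for a group of rollouts sampled from
    the same prompt; each rollout's advantage is its reward normalized
    by the group mean and standard deviation (no value network needed).
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logprobs, rewards):
    """Policy-gradient surrogate: advantage-weighted log-likelihood."""
    adv = grpo_advantages(rewards)
    return -(adv * np.asarray(logprobs, dtype=float)).mean()

print(grpo_advantages([1, 0, 1, 0]))  # ≈ [ 1. -1.  1. -1.]
```

On the Training API, a loop like this runs against rollouts produced by elastic inference, with the platform handling weight sync between the trainer and the rollout fleet.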
|                   | Training Agent     | Managed Training                                      | Training API                                                        |
|-------------------|--------------------|-------------------------------------------------------|---------------------------------------------------------------------|
| You bring         | Data + description | Formatted data + method                               | Your training loop + loss functions                                 |
| We handle         | Everything         | GPU provisioning, distributed training, checkpointing | GPU execution, model parallelism, weight syncing, preemption recovery |
| LoRA / Full-param | LoRA only          | LoRA + full-parameter                                 | LoRA + full-parameter                                               |
| Custom loss       | No                 | No                                                    | Yes, any objective                                                  |
| Frontier RL       | No                 | RFT with reward functions                             | Full control: cross-region, custom rollout                          |
| Pricing           | Per job, upfront   | Per token / GPU-hr                                    | Per GPU-hr                                                          |

Proven in Production

The fastest-moving AI companies train on Fireworks.

See more customer stories →

40X latency improvement · 93% error-free rate
"Using a fine-tuned reinforcement learning model with Fireworks, we perform substantially better than SOTA. In our evaluation, Sonnet 3.5 compiled at 62%, and we got our error-free generation rate well into the 90s."
Malte Ubl, CTO at Vercel
Read case study →
33% more tool calls · 50% lower cost
"Fireworks enabled us to own our AI journey, and unlock better quality in just four weeks."
Kay Zhu, CTO at Genspark
Read case study →
Production RL training across 3–4 clusters
"Our RL inference scales elastically and globally because of it. When we have low prod traffic we scale up RL, when we have high prod traffic, we scale down RL."
Federico Cassano, Research at Cursor
Read case study →
Research infrastructure for autonomous AI
"The Fireworks Training SDK lets us focus on our research instead of wrestling with infrastructure. The platform is fast, well-optimized, and just works."
Kyle Montgomery & Sijun Tan, Core Contributors at rLLM
View on GitHub →

Full-Parameter Training

From a single node to 1T parameters at frontier scale.

Most training services cap out at LoRA. LoRA is the right starting point: fast, cost-effective, and well-suited for rapid iteration. But LoRA and full-parameter training learn in meaningfully different ways. LoRA learns less and forgets less. Full-parameter produces behavioral changes that adapter-based methods can't reach.

Fireworks Training supports full-parameter training across the model catalog, from Qwen3 8B on a single node up to Kimi K2.5 at 1 trillion parameters. LoRA and full-parameter run on the same platform. You don't have to choose your ceiling when you start.

Read the infrastructure writeup →

One Infrastructure

What you train is exactly what you serve.

Fireworks runs production inference across DeepSeek, Kimi, Qwen, and GPT OSS at scale. That experience is built into the training platform. The numerical edge cases that surface in frontier MoE models aren't hypothetical to us. We've debugged them in production.

A trained checkpoint becomes a live endpoint in seconds. No format conversion, no serving stack migration. Training and inference share the same kernels, the same hardware, so model behavior in training is model behavior in production.

We publish the k3 KL divergence between training and inference checkpoints for every model in our catalog; values below 0.01 indicate production-grade numerical parity. If your training and serving stacks disagree numerically, your evals are measuring the gap between them, not model quality.

Full k3 table and methodology →
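For readers unfamiliar with the metric: k3 is a standard low-variance, unbiased estimator of KL divergence computed from per-token log-probabilities. The sketch below shows the estimator itself under the assumption that you score the same sampled tokens under both stacks; it is illustrative, not the Fireworks methodology.

```python
import numpy as np

def k3_kl(logp_sampled, logp_other):
    """k3 estimator of KL divergence from per-token log-probs.

    `logp_sampled`: log-probs of sampled tokens under the stack that
    generated them; `logp_other`: log-probs of the same tokens under
    the other stack. Per token, k3 = (r - 1) - log r with
    r = exp(logp_other - logp_sampled); it is unbiased and always
    non-negative, so small values mean the stacks genuinely agree.
    """
    log_r = np.asarray(logp_other) - np.asarray(logp_sampled)
    return np.mean(np.expm1(log_r) - log_r)

# Identical stacks give exactly zero divergence.
print(k3_kl([-1.2, -0.3, -2.0], [-1.2, -0.3, -2.0]))  # 0.0
```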

Deploy at Scale

Serve hundreds of fine-tuned models on a single GPU.

Training a great model is half the story. Fireworks Multi-LoRA serving runs hundreds of fine-tuned adapters per GPU, sharing a single base model, deployed in one click with zero extra infrastructure cost.

Iterate across experiments or serve different model variants per customer, without paying for a dedicated GPU per adapter. Better model, same GPU budget.
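In practice, switching adapters on a shared base is just a different `model` field in an OpenAI-compatible request; everything else in the payload is unchanged. A minimal sketch, with placeholder account and adapter names that are not verified Fireworks identifiers:

```python
# Illustrative only: the account and adapter names below are
# placeholders, not real Fireworks model identifiers.
BASE_URL = "https://api.fireworks.ai/inference/v1/chat/completions"

def adapter_request(adapter_id, prompt):
    """Build an OpenAI-compatible request targeting one LoRA adapter.

    With Multi-LoRA serving, per-customer or per-experiment variants
    differ only in the `model` field; the base model and GPU are shared.
    """
    return {
        "model": adapter_id,
        "messages": [{"role": "user", "content": prompt}],
    }

req_a = adapter_request("accounts/acme/models/support-bot-v3", "Hi")
req_b = adapter_request("accounts/acme/models/billing-bot-v1", "Hi")
```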

100s of adapters per GPU · 1-click deploy · $0 extra infrastructure cost

Why Fireworks

Training and production inference, by design.

| Alternative             | Examples                        | The limitation                                                                                        | Fireworks advantage                                                                                      |
|-------------------------|---------------------------------|-------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| Closed models           | OpenAI, Anthropic               | No weight ownership. High cost at scale. No retraining on your own data.                              | Open models you fully own. Retrain from production data. Compound over time.                             |
| LoRA-only training APIs | Open research APIs              | No full-parameter support. No production inference colocation. Training and serving are separate stacks. | Full-parameter up to 1T parameters. Training and inference on the same infrastructure. Numerical parity verified. |
| Cloud-native            | AWS Bedrock, GCP Vertex         | Training and inference are separate silos. Limited open model support.                                | Model-agnostic across the full frontier catalog: DeepSeek, Kimi, Qwen, GPT OSS, and more.                |
| Self-managed            | PyTorch distributed, DIY clusters | Months of infra work before your first run. Silent numerical bugs. Ongoing ops burden.               | Production-grade infrastructure on day one. Numerical parity guaranteed. Automated checkpointing and recovery. |

FAQ

Common questions.

What does "preview" mean? Is this production-ready?
Preview means the platform is live and serving real production workloads today. Cursor, Vercel, and Genspark are all running on Fireworks Training in production. It also means pricing and some features are still stabilizing before GA. Enterprise SLAs and GA timelines are available on request. Talk to our team if you have specific requirements.
Can I download my trained model weights?
Your trained weights are yours. For open base models, you can download any checkpoint via the API at any time. For hosted closed models (currently Qwen), weights stay on Fireworks infrastructure: accessible to you for inference and further training, isolated from other customers, never shared. Download will follow as those models open. In both cases, Fireworks does not retain ownership of models trained on your data.
Is my training data used to train Fireworks models?
No. Your data is used solely to fine-tune your models. We do not use your training data to train Fireworks' own models or any shared models.
What's the difference between Training Agent, Managed Training, and Training API?
Training Agent is fully automated: describe your goal, upload data, get a deployed model. No ML knowledge required. Currently LoRA-only.

Managed Training gives you control over the training method (SFT, DPO, or RFT) while we handle all infrastructure. Supports full-parameter training.

Training API gives you full algorithmic control: bring your own training loop, write custom loss functions, run frontier RL. For advanced ML teams and researchers. See the comparison table above for a full breakdown.
How long does a training run take?
It depends on model size, dataset size, and training method. A small LoRA job on Qwen3 8B with a few thousand examples typically completes in under an hour. Larger full-parameter runs on frontier models take longer. See the cost estimator in our docs for estimates by scenario.
What does it cost?
Models under 16B parameters train at no cost. Training Agent uses per-job pricing confirmed upfront. Managed Training is priced per token or per GPU-hour depending on the job. Training API is priced per GPU-hour on provisioned clusters. See the pricing page for full details.

Get Started

Your model is your product.
Your data is your moat.

Fireworks Training is in preview. Models under 16B train at no cost.

New to fine-tuning? Training Agent is the fastest way to get a model into production.