Fireworks Training · Now in Preview
Full-parameter training, custom loss functions, and frontier RL. All on the same infrastructure already serving production for Cursor, Vercel, and Genspark.
Your model is your product. Your data is your moat.
Three Entry Points, One Platform
Same platform, same deployment targets throughout. Drop down when you need more control.
Describe your task. Deploy your model.
An autonomous agent that handles the entire training pipeline. Describe your goal, upload your data. The Agent handles data prep, model selection, hyperparameter sweep, evals, and deployment. Currently LoRA-only.
You pick the method. We run everything else.
Choose SFT, DPO, or RFT. We handle GPU provisioning, distributed training, checkpointing, and scaling. Full-parameter training available for the deepest behavioral changes. One-click deployment to production.
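To make the DPO option concrete: DPO trains on preference pairs by comparing policy and reference log-probs. The sketch below is the standard DPO objective in plain Python for a single pair; it is illustrative only and is not Fireworks API code.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, from sequence log-probs.

    pi_*  : log-probs of the chosen/rejected responses under the policy
    ref_* : log-probs of the same responses under the frozen reference model
    beta  : strength of the KL-style penalty toward the reference
    """
    # How much more the policy prefers "chosen" over "rejected",
    # relative to the reference model's preference.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))
```

With zero margin the loss sits at log 2; it falls as the policy learns to prefer the chosen response more strongly than the reference does.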
Your objective. Our compute. No constraints.
Bring your own training loop. The platform has no opinion on your objective. Write custom loss functions, run RL at scale across regions, or chain SFT into RFT with full optimizer state preserved. LoRA and full-parameter supported.
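As a sketch of the kind of objective "bring your own training loop" allows, here is a minimal REINFORCE-style policy-gradient surrogate loss in plain Python. It is purely illustrative: a real loop would operate on tensors inside your framework of choice, with the platform handling execution underneath.

```python
def reinforce_loss(logprobs: list[float], rewards: list[float],
                   baseline: float = 0.0) -> float:
    """REINFORCE surrogate loss: -mean((reward - baseline) * log-prob).

    logprobs : log-probs of the sampled actions/tokens under the policy
    rewards  : scalar reward for each sample (e.g. from a reward function)
    baseline : variance-reduction baseline subtracted from rewards
    """
    n = len(logprobs)
    return -sum((r - baseline) * lp for lp, r in zip(logprobs, rewards)) / n
```

Minimizing this surrogate pushes probability mass toward samples whose reward exceeds the baseline, which is the core of RFT-style reward-function training.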
| | Training Agent | Managed Training | Training API |
|---|---|---|---|
| You bring | Data + description | Formatted data + method | Your training loop + loss functions |
| We handle | Everything | GPU provisioning, distributed training, checkpointing | GPU execution, model parallelism, weight syncing, preemption recovery |
| LoRA / Full-param | LoRA only | LoRA + full-parameter | LoRA + full-parameter |
| Custom loss | No | No | Yes, any objective |
| Frontier RL | No | RFT with reward functions | Full control: cross-region, custom rollout |
| Pricing | Per job, upfront | Per token / GPU-hr | Per GPU-hr |
Proven in Production
Full-Parameter Training
Most training services cap out at LoRA. LoRA is the right starting point: fast, cost-effective, and well-suited for rapid iteration. But LoRA and full-parameter training learn in meaningfully different ways. LoRA learns less and forgets less. Full-parameter produces behavioral changes that adapter-based methods can't reach.
Fireworks Training supports full-parameter training across the model catalog, from Qwen3 8B on a single node up to Kimi K2.5 at 1 trillion parameters. LoRA and full-parameter run on the same platform. You don't have to choose your ceiling when you start.
One Infrastructure
Fireworks runs production inference across DeepSeek, Kimi, Qwen, and GPT OSS at scale. That experience is built into the training platform. The numerical edge cases that surface in frontier MoE models aren't hypothetical to us. We've debugged them in production.
A trained checkpoint becomes a live endpoint in seconds. No format conversion, no serving stack migration. Training and inference share the same kernels, the same hardware, so model behavior in training is model behavior in production.
We publish the k3 KL divergence between training and inference checkpoints for every model in our catalog; values below 0.01 indicate production-grade numerical parity. If your training and serving stacks disagree numerically, your evals are measuring the gap between them, not model quality.
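For reference, k3 is a standard low-variance KL estimator computed from matched per-token log-probs. The sketch below shows the estimator itself; it is illustrative and not a description of Fireworks' exact measurement pipeline.

```python
import math


def k3_kl(logp_train: list[float], logp_infer: list[float]) -> float:
    """k3 estimator of KL divergence from matched per-token log-probs.

    For each token, with log_r = logp_infer - logp_train and r = exp(log_r),
    the k3 sample is (r - 1) - log_r, which is always >= 0 and averages to
    an estimate of the KL between the two distributions.
    """
    samples = []
    for lt, li in zip(logp_train, logp_infer):
        log_r = li - lt
        samples.append((math.exp(log_r) - 1.0) - log_r)
    return sum(samples) / len(samples)
```

Identical stacks yield exactly 0; any numerical drift between training and serving pushes the estimate above 0, which is what the 0.01 threshold guards against.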
Deploy at Scale
Training a great model is half the story. Fireworks Multi-LoRA serving runs hundreds of fine-tuned adapters per GPU, sharing a single base model, deployed in one click with zero extra infrastructure cost.
Iterate across experiments or serve different model variants per customer, without paying for a dedicated GPU per adapter. Better model, same GPU budget.
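A back-of-envelope calculation shows why adapters are cheap to co-host. The shapes below (layer count, adapted modules, hidden size, rank, fp16 storage) are illustrative assumptions, not a specific model's configuration:

```python
def adapter_mb(n_layers: int, n_target_modules: int, d_model: int,
               rank: int, bytes_per_param: int = 2) -> float:
    """Approximate size in MB of one LoRA adapter.

    Each adapted module stores two low-rank factors of rank * d_model
    parameters each; bytes_per_param=2 assumes fp16 storage.
    """
    params = n_layers * n_target_modules * 2 * rank * d_model
    return params * bytes_per_param / 1e6


# Illustrative: 32 layers, 4 adapted modules per layer, d_model=4096, rank 16.
size = adapter_mb(32, 4, 4096, 16)  # ~33.6 MB per adapter
```

At tens of megabytes per adapter against a multi-gigabyte base model, hundreds of adapters fit alongside a single shared copy of the base weights on one GPU.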
Why Fireworks
| Alternative | Examples | The limitation | Fireworks advantage |
|---|---|---|---|
| Closed models | OpenAI, Anthropic | No weight ownership. High cost at scale. No retraining on your own data. | Open models you fully own. Retrain from production data. Compound over time. |
| LoRA-only training APIs | Open research APIs | No full-parameter support. No production inference colocation. Training and serving are separate stacks. | Full-parameter up to 1T parameters. Training and inference on the same infrastructure. Numerical parity verified. |
| Cloud-native | AWS Bedrock, GCP Vertex | Training and inference are separate silos. Limited open model support. | Model-agnostic across the full frontier catalog: DeepSeek, Kimi, Qwen, GPT OSS, and more. |
| Self-managed | PyTorch distributed, DIY clusters | Months of infra work before your first run. Silent numerical bugs. Ongoing ops burden. | Production-grade infrastructure on day one. Numerical parity guaranteed. Automated checkpointing and recovery. |
Get Started
Fireworks Training is in preview. Models under 16B parameters train at no cost.
New to fine-tuning? Training Agent is the fastest way to get a model into production.