Order Intelligence: LLM Classification for 40k+ B2B Orders/Day
Automated categorization of incoming B2B orders with 94% accuracy, reducing 11h of manual work to 8 minutes.
The Problem
Cipher AI processed 40,000+ B2B orders daily — arriving via email, EDI and unstructured PDFs. Three full-time operations staff spent a combined 11 hours per day sorting each order into one of 47 internal classification buckets.
Error rate: 4–6%. A misclassified order caused on average €800 in downstream cost — routing errors, delayed deliveries, complaints. Scaled up, that was five-figure monthly cost purely because the manual system broke under complexity.
The CEO wanted ML-based classification. Early quotes from the consulting world: six months of discovery, twelve months for a training pipeline, an external data-science team. Too slow, too expensive, too uncertain.
Discovery: Why classical ML was wrong here
In week one we analyzed ~500 historical orders per bucket. Findings:
- Classification patterns were often directly visible in the text — no high-end feature engineering required
- Many edge cases had too few examples for clean supervised-learning models
- Categories changed every few weeks — a trained model would go stale quickly
Decision: Retrieval-Augmented Classification instead of classical ML pipeline. No fine-tuning, no labeled-data campaign, no model-redeployment theater.
Architecture
Email / EDI / PDF input
↓
FastAPI queue (async)
↓
Embedding (pgvector in Postgres)
↓
Top-5 similar orders → few-shot context
↓
Claude Sonnet (structured JSON output)
↓
Confidence router:
• > 85% → auto-assign
• 70–85% → human review queue
• < 70% → flag + retrain signal
↓
Next.js ops dashboard + metrics
Everything runs in a single Postgres instance — no separate vector DB, no extra service layer. pgvector as an extension is enough for retrieval at this scale.
"This system does what the consultants promised for ten times the price. No external team, no six-month discovery. In seven weeks." — Head of Operations, Cipher AI
Build: seven weeks, three sprints
Weeks 1–2 — Discovery & MVP Pipeline We embedded the first 500 orders and loaded them into pgvector. Manual tests: how well does nearest-neighbor find the right category? Result: 87% accuracy with pure retrieval, no LLM. That proved the approach worked.
Weeks 3–5 — Production Pipeline & Dashboard FastAPI queue with async worker pool. Claude Sonnet as classifier with structured JSON output (tool use, no text parsing). Next.js ops dashboard with live metrics, confidence distribution, override history. Operations staff could click along from day 20.
Weeks 6–7 — Confidence Routing & Go-Live The confidence router was the critical piece — not every classification could run automatic. The last two weeks: threshold tuning on real data, human-in-the-loop UI for the review bucket, monitoring with drift alerts.
Result after 3 months in production
- 94% accuracy — comparable to the manual benchmark (94–96%)
- 98% reduction in manual work — 11 h/day → 8 min
- 40k orders/day stable — no queue backlogs
- ROI after 4 months — the three operations staff moved to higher-value work
What we learned
Retrieval-first beats fine-tuning for domain-specific classification tasks almost every time. Less infrastructure, faster iteration, every new category works without retraining. Fine-tuning would have made sense only if latency were a hard constraint — not the case here (batch processing, seconds tolerance).
Confidence routing matters more than the model itself. A 94%-accurate model without routing would be unusable — 6% error at 40k/day = 2,400 wrong decisions per day. With routing: only the uncertain 10% go to humans, the other 90% run automatically.
The ops dashboard was not a nice-to-have. Team adoption hinged completely on being able to trace AI decisions. Without the dashboard they wouldn't have trusted the system.
Other projects