
How Much Does AI Development Cost? An Honest Breakdown

March 15, 2026 · 14 min read

Why AI Projects Vary So Wildly in Cost

"How much does AI development cost?" sounds like a simple question. The honest answer: it depends on whether you're calling an API or training your own model. That's the difference between wiring a light switch and building a power plant.

We've been shipping AI features into production applications since 2024 — using Next.js, TypeScript, Python/FastAPI, and PostgreSQL. This article skips the marketing fluff and gives you real numbers, real code, and real architecture decisions.

The Current Pricing Landscape: What LLM APIs Actually Cost

Before we talk about project costs, you need to understand the raw material costs. Here are the current API prices for the models that matter (as of Q1 2026):

| Model | Input / 1M Tokens | Output / 1M Tokens | Context Window | Sweet Spot |
|---|---|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 | 200k | Classification, routing, simple tasks |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200k | Coding, complex analysis, RAG |
| Claude 3.5 Opus | $15.00 | $75.00 | 200k | Research, multi-step reasoning |
| GPT-4o | $2.50 | $10.00 | 128k | General purpose, multimodal |
| GPT-4o-mini | $0.15 | $0.60 | 128k | High-volume, simple tasks |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M | Very long documents, video |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M | Budget tasks, high throughput |

Prices keep dropping. Twelve months ago, GPT-4-Turbo was still at $10/$30 per 1M tokens. That said, for production workloads the costs add up fast.

A Concrete Example: Support Ticket Classification

Let's take a real-world use case: automatically classifying 10,000 support tickets per month (category, priority, sentiment).

Assumptions:

  • Average ticket: ~200 tokens input
  • System prompt + few-shot examples: ~300 tokens
  • Output (classification as JSON): ~50 tokens
  • Total per ticket: 500 input + 50 output tokens

Monthly API costs at 10,000 tickets:

| Model | Input Cost | Output Cost | Total/Month |
|---|---|---|---|
| Claude 3.5 Haiku | 5M × $0.80/1M = $4.00 | 0.5M × $4.00/1M = $2.00 | $6.00 |
| GPT-4o-mini | 5M × $0.15/1M = $0.75 | 0.5M × $0.60/1M = $0.30 | $1.05 |
| Claude 3.5 Sonnet | 5M × $3.00/1M = $15.00 | 0.5M × $15.00/1M = $7.50 | $22.50 |
| GPT-4o | 5M × $2.50/1M = $12.50 | 0.5M × $10.00/1M = $5.00 | $17.50 |

Key takeaway: for classification, you don't need Sonnet or GPT-4o. Haiku or GPT-4o-mini deliver >95% accuracy on simple tasks — at a fraction of the cost. Model selection is the single most important cost decision.
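That decision can live in code instead of tribal knowledge. A minimal model router plus cost helper might look like this (a sketch: the routing rule and the short model names are illustrative, prices taken from the table above):

```typescript
type Task = "classification" | "extraction" | "analysis" | "generation";

// Illustrative per-1M-token prices from the table above: [input, output]
const MODEL_PRICES: Record<string, [number, number]> = {
  "claude-3-5-haiku": [0.8, 4.0],
  "claude-3-5-sonnet": [3.0, 15.0],
  "gpt-4o-mini": [0.15, 0.6],
};

// Route simple tasks to the cheapest capable model
function pickModel(task: Task): string {
  switch (task) {
    case "classification":
    case "extraction":
      return "gpt-4o-mini"; // cheapest option that clears the accuracy bar
    default:
      return "claude-3-5-sonnet"; // complex reasoning justifies the premium
  }
}

// Cost of a single request in USD, given token counts
function requestCost(model: string, inTok: number, outTok: number): number {
  const [inPrice, outPrice] = MODEL_PRICES[model];
  return (inTok / 1_000_000) * inPrice + (outTok / 1_000_000) * outPrice;
}
```

The point is that the routing rule is reviewable and testable; changing it later is a one-line diff, not a prompt archaeology session.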

Architecture: The Three Tiers

Every AI project falls into one of three tiers. Here's the architecture as we actually build it:

┌─────────────────────────────────────────────────────────────────────┐
│                    AI PROJECT ARCHITECTURE TIERS                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  TIER 1: API Integration (€10–30k)                                 │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐              │
│  │ Next.js  │───▶│ FastAPI      │───▶│ LLM API      │              │
│  │ Frontend │◀───│ Service      │◀───│ (Claude/GPT) │              │
│  └──────────┘    │ + Cache      │    └──────────────┘              │
│                  │ + Rate Limit │                                   │
│                  └──────────────┘                                   │
│                                                                     │
│  TIER 2: RAG / Product Features (€30–80k)                         │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐              │
│  │ Next.js  │───▶│ FastAPI      │───▶│ LLM API      │              │
│  │ Frontend │◀───│ Orchestrator │◀───│              │              │
│  └──────────┘    │              │    └──────────────┘              │
│                  │  ┌────────┐  │    ┌──────────────┐              │
│                  │  │Embedder│──│───▶│ pgvector /   │              │
│                  │  └────────┘  │◀───│ Vector DB    │              │
│                  │  ┌────────┐  │    └──────────────┘              │
│                  │  │Eval    │  │    ┌──────────────┐              │
│                  │  │Pipeline│──│───▶│ PostgreSQL   │              │
│                  │  └────────┘  │    │ (Logging)    │              │
│                  └──────────────┘    └──────────────┘              │
│                                                                     │
│  TIER 3: Custom Model (€80–250k+)                                 │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐              │
│  │ Next.js  │───▶│ FastAPI      │───▶│ Custom Model │              │
│  │ Frontend │◀───│ Inference    │◀───│ (Fine-tuned) │              │
│  └──────────┘    │ Server       │    └──────┬───────┘              │
│                  └──────────────┘           │                      │
│                  ┌──────────────┐    ┌──────▼───────┐              │
│                  │ Training     │───▶│ GPU Cluster  │              │
│                  │ Pipeline     │    │ (A100/H100)  │              │
│                  │ + MLOps      │    └──────────────┘              │
│                  └──────────────┘                                   │
└─────────────────────────────────────────────────────────────────────┘

Tier 1: API Integration (€10,000 – €30,000)

The fastest path. You call an existing model via API, wrap it in a clean abstraction layer, and ship a finished feature to your users. Sounds trivial — the real work is in the details.

What Actually Gets Built

A typical Tier 1 project with us includes:

  • API abstraction layer with retry logic, timeout handling, model fallback
  • Prompt management — versioned prompts, ready for A/B testing
  • Token tracking and cost monitoring per user/feature
  • Response validation — LLMs don't always return valid JSON
  • Rate limiting and queuing for fair use
  • Frontend integration with streaming (SSE) for better UX
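For the streaming bullet, the framework-independent core is encoding model text deltas as SSE frames. The sketch below is illustrative; the `delta` payload shape is an assumption on our part, not a fixed protocol:

```typescript
// Encode one text delta as a Server-Sent Events frame
// (format per the SSE spec: "data: <payload>\n\n").
function toSSE(delta: string): string {
  return `data: ${JSON.stringify({ delta })}\n\n`;
}

// Turn any async iterable of text deltas (e.g. an LLM stream)
// into SSE frames, with a terminating [DONE] marker.
async function* streamToSSE(
  deltas: AsyncIterable<string>
): AsyncGenerator<string> {
  for await (const d of deltas) yield toSSE(d);
  yield "data: [DONE]\n\n";
}
```

Piping this generator into a `ReadableStream` response body is all a Next.js route handler needs to stream tokens to the browser.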

Here's what a real API call with token tracking looks like in TypeScript:

import Anthropic from "@anthropic-ai/sdk";

interface ClassificationResult {
  category: string;
  priority: "low" | "medium" | "high" | "critical";
  sentiment: "positive" | "neutral" | "negative";
  confidence: number;
}

interface LLMUsage {
  inputTokens: number;
  outputTokens: number;
  costUSD: number;
  model: string;
  latencyMs: number;
}

// Per-token pricing for cost tracking
const PRICING: Record<string, { input: number; output: number }> = {
  "claude-3-5-haiku-20241022": { input: 0.8 / 1_000_000, output: 4.0 / 1_000_000 },
  "claude-3-5-sonnet-20241022": { input: 3.0 / 1_000_000, output: 15.0 / 1_000_000 },
};

const client = new Anthropic();

export async function classifyTicket(
  ticketText: string,
  model = "claude-3-5-haiku-20241022"
): Promise<{ result: ClassificationResult; usage: LLMUsage }> {
  const start = performance.now();

  const response = await client.messages.create({
    model,
    max_tokens: 150,
    messages: [
      {
        role: "user",
        content: `Classify this support ticket as JSON.
Fields: category (billing|technical|shipping|account|other), priority (low|medium|high|critical), sentiment (positive|neutral|negative), confidence (0-1).

Ticket: "${ticketText}"

Respond ONLY with valid JSON, no Markdown.`,
      },
    ],
  });

  const latencyMs = Math.round(performance.now() - start);
  const text = response.content[0].type === "text" ? response.content[0].text : "";
  const pricing = PRICING[model] ?? { input: 0, output: 0 };

  const usage: LLMUsage = {
    inputTokens: response.usage.input_tokens,
    outputTokens: response.usage.output_tokens,
    costUSD:
      response.usage.input_tokens * pricing.input +
      response.usage.output_tokens * pricing.output,
    model,
    latencyMs,
  };

  // Robust JSON parsing — LLMs sometimes wrap output in Markdown code fences
  const jsonStr = text.replace(/```(?:json)?\n?/g, "").trim();
  const result: ClassificationResult = JSON.parse(jsonStr);

  return { result, usage };
}
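The retry logic and model fallback from the bullet list above can be sketched framework-free. The wrapper below is an illustration, not part of any SDK; `callLLM` stands in for whatever function actually hits the API (e.g. the `classifyTicket` above):

```typescript
// Try each model in order; retry transient failures with exponential
// backoff before falling back to the next model in the list.
async function withFallback<T>(
  models: string[],
  callLLM: (model: string) => Promise<T>,
  maxRetries = 2
): Promise<T> {
  let lastError: unknown;
  for (const model of models) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await callLLM(model);
      } catch (err) {
        lastError = err;
        if (attempt < maxRetries) {
          // Backoff before the next retry: 500ms, 1s, 2s, ...
          await new Promise((r) => setTimeout(r, 500 * 2 ** attempt));
        }
      }
    }
    // Retries exhausted for this model — fall through to the next one
  }
  throw lastError;
}
```

In practice you put the cheap model first and fall back to a stronger one only when it errors out or times out, so the happy path stays on the cheapest tier.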

And the Backend Side with FastAPI

On the Python side, we build a service layer with caching so that repeated or identical requests don't get billed twice:

import hashlib
import json
import time

from anthropic import Anthropic
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from redis import Redis

app = FastAPI()
client = Anthropic()
redis = Redis(host="localhost", port=6379, db=0)

CACHE_TTL = 3600 * 24  # 24h — classifications don't change

class TicketRequest(BaseModel):
    text: str
    model: str = "claude-3-5-haiku-20241022"

class ClassificationResponse(BaseModel):
    category: str
    priority: str
    sentiment: str
    confidence: float
    cached: bool = False
    cost_usd: float = 0.0
    latency_ms: int = 0

@app.post("/api/classify", response_model=ClassificationResponse)
async def classify_ticket(req: TicketRequest):
    # Cache key from normalized text
    cache_key = f"classify:{hashlib.sha256(req.text.strip().lower().encode()).hexdigest()}"

    # Check cache
    cached = redis.get(cache_key)
    if cached:
        data = json.loads(cached)
        data["cached"] = True
        return ClassificationResponse(**data)

    # LLM call with timing
    start = time.monotonic()
    try:
        response = client.messages.create(
            model=req.model,
            max_tokens=150,
            messages=[{
                "role": "user",
                "content": f'Classify as JSON (category, priority, sentiment, confidence): "{req.text}"'
            }],
        )
    except Exception as e:
        raise HTTPException(status_code=502, detail=f"LLM API error: {str(e)}")

    latency_ms = int((time.monotonic() - start) * 1000)

    # Calculate cost
    pricing = {"claude-3-5-haiku-20241022": (0.8, 4.0)}
    inp_price, out_price = pricing.get(req.model, (3.0, 15.0))
    cost = (
        response.usage.input_tokens * inp_price / 1_000_000
        + response.usage.output_tokens * out_price / 1_000_000
    )

    # Parse response
    text = response.content[0].text.strip().strip("`").strip()
    if text.startswith("json"):
        text = text[4:].strip()
    result = json.loads(text)

    data = {**result, "cached": False, "cost_usd": round(cost, 6), "latency_ms": latency_ms}

    # Write to cache
    redis.setex(cache_key, CACHE_TTL, json.dumps(data))

    return ClassificationResponse(**data)

Why Redis caching? In practice, we see 15–30% duplicate rates in support systems (same error messages, copy-pasted tickets). The cache saves a solid 20% on API costs for 10,000 tickets/month — and responds in under 5ms instead of 500ms.
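That saving is easy to sanity-check: only cache misses are billed. A two-line helper makes the arithmetic explicit (per-request cost taken from the Haiku ticket example above):

```typescript
// Effective monthly API cost given a cache hit rate.
// Only cache misses trigger a billed LLM call.
function costWithCache(
  requestsPerMonth: number,
  costPerRequestUSD: number,
  cacheHitRate: number // 0..1, e.g. 0.2 for a 20% duplicate rate
): number {
  const billedRequests = requestsPerMonth * (1 - cacheHitRate);
  return billedRequests * costPerRequestUSD;
}
```

At 10,000 tickets and $0.0006 per ticket (Haiku, 500 in / 50 out tokens), a 20% hit rate takes the bill from $6.00 to $4.80 a month; the dollar amounts are small here, but the percentage holds at any volume.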

Timeline and Cost Breakdown

| Item | Effort | Share |
|---|---|---|
| Architecture & API abstraction | 2–3 days | 15% |
| Prompt engineering & testing | 3–5 days | 25% |
| Backend service (FastAPI/Node) | 3–5 days | 25% |
| Frontend integration | 2–3 days | 15% |
| Testing, monitoring, deployment | 2–3 days | 20% |

Total timeline: 2–4 weeks. The biggest line item isn't the code — it's the prompt engineering and evaluation.

Tier 2: AI-Powered Product Features (€30,000 – €80,000)

This is where AI becomes the core of the product. Typical projects: semantic search across company documents, RAG-based knowledge assistants, or intelligent workflows that orchestrate multiple LLM calls.

What Sets Tier 2 Apart from Tier 1

  • Data pipeline: Documents need to be chunked, embedded, and indexed in a vector database
  • Evaluation framework: You need measurable quality metrics (precision, recall, hallucination rate)
  • Orchestration: Multiple LLM calls in sequence or parallel — with routing logic
  • pgvector / Vector DB: Semantic search over your own data
  • Iterative prompt tuning: 3–5 rounds of iteration until quality is where it needs to be
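The data pipeline usually starts with a chunker. A minimal character-based version with overlap looks like this (a baseline sketch only; production pipelines split on token counts and respect sentence boundaries, and the default sizes here are assumptions):

```typescript
// Fixed-size chunker with overlap. Overlap keeps context that would
// otherwise be cut in half at a chunk boundary retrievable from
// either neighboring chunk.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap >= chunkSize) throw new Error("overlap must be < chunkSize");
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Each chunk then gets embedded and written to pgvector; chunk size and overlap are exactly the kind of parameters the evaluation framework exists to tune.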

Prompt Engineering: Where the Real Money Gets Saved

The biggest lever for cost reduction isn't the model — it's the prompt. Here's a real example of how better prompt engineering cut token costs by over 60%:

## BEFORE: Naive prompt (averaging ~820 tokens input)

You are a helpful assistant that classifies support tickets for an
e-commerce company. The company sells electronics and home appliances.
You should classify each ticket into a category and determine the priority.
The possible categories are: billing, technical support, shipping, account
management, other. The possible priorities are: low, medium, high, critical.
Please carefully analyze the following ticket and provide your assessment.
Also briefly explain why you chose this classification. Respond in a
structured format.

Here is the ticket:
[TICKET_TEXT]

Please classify the ticket and explain your decision.

---

## AFTER: Optimized prompt (averaging ~280 tokens input)

Classify as JSON. No explanation.
{"category":"billing|technical|shipping|account|other","priority":"low|medium|high|critical","sentiment":"positive|neutral|negative","confidence":0.0-1.0}

Ticket: [TICKET_TEXT]

Result: 820 → 280 input tokens = 66% reduction. At 10,000 tickets/month with Claude Haiku, that saves $4.32/month. Sounds like nothing? At 500,000 tickets/month it's $216/month — and with Sonnet, $810/month. For a RAG system with long context windows, the savings explode: a prompt that uses 4,000 instead of 12,000 tokens per request saves $2,400/month at 100,000 requests/month with Sonnet.

The rules for token-efficient prompts:

  1. No pleasantries — "You are a helpful assistant" costs tokens and doesn't change the output
  2. Specify output format in the prompt — fewer tokens wasted on explanations nobody needs
  3. Few-shot over descriptions — one example beats 200 tokens of explanation
  4. No redundant instructions — "Classify AND explain why" doubles your output tokens
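Applying rule 2, the compact prompt is worth generating from a single schema constant so prompt versioning and A/B tests stay manageable (a sketch; the helper name and schema string are illustrative):

```typescript
// One source of truth for the output schema — change it here,
// and every prompt version picks it up.
const SCHEMA =
  '{"category":"billing|technical|shipping|account|other",' +
  '"priority":"low|medium|high|critical",' +
  '"sentiment":"positive|neutral|negative","confidence":0.0-1.0}';

// Build the compact classification prompt for one ticket.
function buildClassifyPrompt(ticket: string): string {
  return `Classify as JSON. No explanation.\n${SCHEMA}\n\nTicket: ${ticket}`;
}
```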

Tier 2 Timeline

6–12 weeks. The biggest time sink isn't the code — it's the data pipeline and evaluation. You need a robust feedback loop: measure quality → adjust prompts → measure again → repeat.

Tier 3: Custom Models & Training (€80,000 – €250,000+)

The top end: fine-tuning your own models or training from scratch. This is the right approach if — and only if — at least two of these conditions apply:

  • Generic models can't hit the accuracy bar: Your domain is so specialized that even with perfect prompting you only reach 85% accuracy, but you need 98%
  • Proprietary data = competitive advantage: You have 500,000 labeled data points that no competitor has
  • Regulatory requirements force self-hosting: GDPR, financial industry compliance, or sector-specific regulations prohibit cloud APIs
  • Latency/cost at scale: At 10M+ requests/month, a small fine-tuned model on your own GPU becomes cheaper than API calls
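The last bullet is easy to quantify with a back-of-the-envelope break-even (all figures here are illustrative assumptions, not hosting quotes):

```typescript
// Requests/month at which a flat self-hosted GPU budget beats
// per-request API pricing. Rounded — this is an estimate anyway.
function breakEvenRequests(
  gpuMonthlyUSD: number, // e.g. rented A100 capacity plus ops overhead
  apiCostPerRequestUSD: number
): number {
  return Math.round(gpuMonthlyUSD / apiCostPerRequestUSD);
}
```

Example: against a hypothetical $3,000/month GPU budget and $0.0006 per request (Haiku-class), self-hosting only wins beyond 5 million requests/month — and that ignores the engineering cost of getting a fine-tuned model to quality parity in the first place.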

Where the Budget Goes

| Item | Cost | Why |
|---|---|---|
| Data preparation & labeling | €15,000–40,000 | Often the biggest line item. Garbage in, garbage out |
| Training infrastructure (GPU) | €5,000–30,000 | A100/H100 cluster, depends on model size |
| ML engineering | €30,000–80,000 | Fine-tuning, hyperparameters, evaluation |
| MLOps & deployment | €15,000–40,000 | Inference server, monitoring, auto-scaling |
| Iterations & optimization | €10,000–30,000 | At least 2–3 training rounds |

Timeline: 3–6 months. And honestly — we talk 80% of our clients out of this. Not because we can't do it, but because Tier 1 or 2 almost always gets the job done — and goes live in a tenth of the time.

What Drives the Price Up or Down

Regardless of tier, several factors significantly influence the final price:

Data quality — The single most important factor. Clean, structured data with clear labels? The project comes in 30% cheaper. Messy CSVs, undocumented PDFs, three different date formats? Add 20–40% for data engineering.

Legacy integration — Connecting to a modern REST API is an afternoon. Connecting to a 2008-era SAP system over SOAP with custom auth? That's 2–3 extra weeks.

Security requirements — GDPR-compliant data processing, on-premise hosting, audit trails, encryption at rest and in transit. Each one is doable, but each one costs engineering time.

Iteration budget — AI projects aren't like traditional software projects. On day one, you don't know whether the prompt will deliver 92% or 97% accuracy in production. Budget for 2–3 iteration rounds after the first release.

Token Cost Calculator: A Quick Reference

Here's a simple calculator you can use for your own back-of-the-envelope estimates:

interface CostEstimate {
  monthlyTokensInput: number;
  monthlyTokensOutput: number;
  monthlyCostUSD: number;
  monthlyCostEUR: number;
  costPerRequest: number;
}

function estimateMonthlyCost(params: {
  requestsPerMonth: number;
  avgInputTokens: number;
  avgOutputTokens: number;
  model: "haiku" | "sonnet" | "gpt4o" | "gpt4o-mini" | "gemini-pro" | "gemini-flash";
  eurUsdRate?: number;
}): CostEstimate {
  const pricing: Record<string, [number, number]> = {
    "haiku":        [0.80, 4.00],   // [input, output] per 1M tokens
    "sonnet":       [3.00, 15.00],
    "gpt4o":        [2.50, 10.00],
    "gpt4o-mini":   [0.15, 0.60],
    "gemini-pro":   [1.25, 5.00],
    "gemini-flash": [0.075, 0.30],
  };

  const [inputPrice, outputPrice] = pricing[params.model];
  const rate = params.eurUsdRate ?? 0.92;

  const totalInput = params.requestsPerMonth * params.avgInputTokens;
  const totalOutput = params.requestsPerMonth * params.avgOutputTokens;

  const costUSD =
    (totalInput / 1_000_000) * inputPrice +
    (totalOutput / 1_000_000) * outputPrice;

  return {
    monthlyTokensInput: totalInput,
    monthlyTokensOutput: totalOutput,
    monthlyCostUSD: Math.round(costUSD * 100) / 100,
    monthlyCostEUR: Math.round(costUSD * rate * 100) / 100,
    costPerRequest: Math.round((costUSD / params.requestsPerMonth) * 1_000_000) / 1_000_000,
  };
}

// Example: 10,000 support tickets/month with Haiku
const estimate = estimateMonthlyCost({
  requestsPerMonth: 10_000,
  avgInputTokens: 500,
  avgOutputTokens: 50,
  model: "haiku",
});
// → { monthlyCostUSD: 6, monthlyCostEUR: 5.52, costPerRequest: 0.0006 }

Our Advice: Start with the Problem, Not the Technology

The most expensive mistake we see from clients: "We want to use AI" as a starting point. That's like saying "We want to use a database" — it tells you nothing about the actual problem.

The right starting point is always a specific question:

  • "We're burning 3 FTEs on support ticket routing — can that be automated?"
  • "Our customers can't find anything in our docs — can we build an intelligent search?"
  • "We have 50,000 contracts and due diligence takes 2 weeks — can we speed that up?"

From the specific question, the right approach, the right model, and the right architecture follow naturally — and with them, a realistic budget.

In 80% of cases, the answer is Tier 1 or 2. And that's a good thing. A well-built Tier 1 project that goes live in 3 weeks with 95% accuracy beats a Tier 3 project that's still not finished after 6 months.

Next Step

At SecretStack, we start every AI project with a free Discovery Call. Thirty minutes where we understand your problem, identify the right approach, and give you an honest assessment — including an architecture sketch, model recommendation, and cost range.

No pitch decks, no sales speak. Just engineering expertise and a straight answer to the question: what will it cost and how long will it take?
