28K+ Prompts in the corpus referenced in local project evidence.
4 Prompt modalities: text, image, video, and system prompts.
L1 Naive RAG was the strongest documented retrieval strategy.
5+ Completed research studies covering model, RAG, format, and judge behavior.
Mar 24, 2026 What We Learned From Analyzing 28,000 Production AI System Prompts
Hashnode article covering the 28K prompt corpus, production anti-patterns, and why prompt hygiene became the core product thesis.
Read on Hashnode →
Mar 22, 2026 AI Format Wars: Does the Shape of Your Prompt Matter?
Hashnode article covering the 1,080-eval format study and the measurable impact of prompt structure across models and domains.
Read on Hashnode →
The app showcase is now available as a GitHub release asset. Because it is about
121 MB and marketing-heavy, this portfolio links to it deliberately instead of
autoplaying or treating it as primary benchmark evidence.
Cost analysis
Job-level Azure ML cost allocation that maps run IDs to billed spend. Current evidence itemizes EUR 178.12 in VM meters, but Azure Cost Management does not directly attribute every euro to a specific ML job.
Operations
Public-safe screenshots or summaries for monitoring, alerts, rollback, incident response, and restore paths.
Security model
Screenshots or redacted config evidence for Supabase Auth, Stripe webhook verification, secrets, and deployment identity.
Deployment pipeline
Visual evidence for Azure Container Apps, ACR, Google Cloud Run history, GitHub Actions runs, image scanning, and rollback.
Architecture
Exportable SVG is now included; optional PNG, Excalidraw, or Draw.io source can be added if you want a diagram asset that is easier to reuse in decks.
Results
A compact public table that maps each study to sample size, model set, evaluation method, and final takeaway.
Executive summary
PromptTriage is a RAG-powered prompt engineering platform built to turn rough ideas into production-ready prompts across text, image, video, and system-prompt workflows. The current build combines a Next.js interface, Supabase Auth, Stripe subscription flows, FastAPI retrieval services, Pinecone vector search, modality-specific metaprompts, and Azure Container Apps deployment. The strongest evidence is not only the app surface, but the documented research process behind retrieval strategy, prompt format, model selection, evaluation bias, and cost tradeoffs.
Problem
Most prompt tools behave like thin wrappers over a model API. The project needed to show whether structured prompt generation, retrieval context, and modality-specific templates could produce better, more reliable prompts than generic chat UX. The user problem is practical: builders need reusable prompts that include constraints, output format, assumptions, and evaluation criteria without manually writing every section from scratch.
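To make the target output concrete, here is a minimal sketch of the kind of structured, reusable prompt object the product aims to produce. The field names are illustrative assumptions, not the actual PromptTriage schema.

```typescript
// Illustrative shape of a "production-ready prompt"; field names are assumptions.
interface GeneratedPrompt {
  objective: string;            // what the prompt should accomplish
  constraints: string[];        // hard requirements the model must respect
  outputFormat: string;         // e.g. "JSON with keys summary, risks, next_steps"
  assumptions: string[];        // stated so reviewers can challenge them
  evaluationCriteria: string[]; // how a human or judge model scores the result
}

const example: GeneratedPrompt = {
  objective: "Summarize a support ticket for an on-call engineer",
  constraints: ["under 120 words", "never invent ticket fields"],
  outputFormat: "JSON with keys summary, severity, suggested_owner",
  assumptions: ["ticket text is in English"],
  evaluationCriteria: ["every system named in the ticket appears in the summary"],
};
```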
Context and constraints
The project was developed as a solo engineering effort with limited budget, evolving model availability, provider API instability, GPU quota constraints, and a need to publish evidence without exposing secrets or private datasets. Most benchmark and training work ran on Azure ML compute clusters, while the app moved through Azure Container Apps and Google Cloud Run deployment phases. Research notes show cost-sensitive experimentation across fine-tuned Qwen variants, RAG strategies, and LLM-as-judge workflows.
Requirements
Functional requirements included prompt analysis, dynamic clarification questions, modality routing, retrieval from a prompt corpus, final prompt synthesis, one-click iteration, paid-plan capability checks, and optional current-documentation lookup through MCP tooling. Non-functional requirements included secure authentication, environment-based secrets, responsive UI, reproducible prompt versions, structured logging, and a containerized path from GitHub to Azure runtime. Research requirements included Unsloth-based QLoRA experiments, Azure ML benchmark jobs, and repeatable evaluation outputs.
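As a rough illustration of the modality-routing and capability-check requirements, the sketch below maps each modality to a metaprompt and a pair of feature flags. Identifiers and flag values are placeholders, not the real configuration.

```typescript
// Sketch of modality routing; metaprompt IDs and flags are placeholders.
type Modality = "text" | "image" | "video" | "system";

interface ModalityConfig {
  metapromptId: string;        // which metaprompt template drives synthesis
  requiresPaidPlan: boolean;   // enforced server-side before generation
  supportsDocsLookup: boolean; // whether MCP documentation lookup is offered
}

const routing: Record<Modality, ModalityConfig> = {
  text:   { metapromptId: "meta_text_v1",   requiresPaidPlan: false, supportsDocsLookup: true },
  image:  { metapromptId: "meta_image_v1",  requiresPaidPlan: true,  supportsDocsLookup: false },
  video:  { metapromptId: "meta_video_v1",  requiresPaidPlan: true,  supportsDocsLookup: false },
  system: { metapromptId: "meta_system_v1", requiresPaidPlan: true,  supportsDocsLookup: true },
};
```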
Architecture
The architecture separates the user-facing workflow from retrieval and prompt-engineering services. The UI collects intent and modality, Next.js API routes enforce Supabase session and subscription context, the backend retrieves relevant examples from Pinecone, and modality-specific metaprompts synthesize the final result. Azure Container Apps hosts the containerized frontend and backend services through images built and pushed by GitHub Actions.
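A minimal sketch of that orchestration boundary as a Next.js App Router API route, assuming illustrative route paths, environment variable names, a "subscriptions" table, and a /synthesize endpoint on the FastAPI backend:

```typescript
// app/api/generate/route.ts — hedged sketch of the orchestration layer.
// Route path, env var names, table name, and backend endpoint are assumptions.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function POST(req: Request) {
  // 1. Resolve the Supabase session from the bearer token.
  const token = req.headers.get("authorization")?.replace("Bearer ", "") ?? "";
  const { data, error } = await supabase.auth.getUser(token);
  if (error || !data.user) {
    return Response.json({ error: "unauthenticated" }, { status: 401 });
  }

  // 2. Enforce the subscription gate server-side, never in the browser.
  const { data: plan } = await supabase
    .from("subscriptions")
    .select("tier")
    .eq("user_id", data.user.id)
    .single();
  const { intent, modality } = await req.json();
  if (modality !== "text" && plan?.tier !== "pro") {
    return Response.json({ error: "upgrade required" }, { status: 402 });
  }

  // 3. Delegate retrieval and metaprompt synthesis to the FastAPI backend.
  const backend = await fetch(`${process.env.BACKEND_URL}/synthesize`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ intent, modality, userId: data.user.id }),
  });
  return Response.json(await backend.json(), { status: backend.status });
}
```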
Security model
The current security model centers on Supabase Auth, server-side subscription checks, Stripe webhook verification, environment-based secrets, Azure OIDC deployment, and no publication of private provider keys or vector indexes. The risk/control table below is intentionally evidence-oriented so reviewers can see which controls are implemented and which supporting artifacts still need screenshots.
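As an illustration of the webhook-verification control, the sketch below shows the standard Stripe signature check in a Next.js route handler. Environment variable names and the handled event type are assumptions, not the exact production configuration.

```typescript
// app/api/stripe/webhook/route.ts — minimal sketch of webhook signature checking.
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function POST(req: Request) {
  const payload = await req.text(); // raw body is required for signature verification
  const signature = req.headers.get("stripe-signature") ?? "";

  let event: Stripe.Event;
  try {
    // Rejects any request whose signature does not match the endpoint secret.
    event = stripe.webhooks.constructEvent(
      payload,
      signature,
      process.env.STRIPE_WEBHOOK_SECRET!
    );
  } catch {
    return Response.json({ error: "invalid signature" }, { status: 400 });
  }

  if (event.type === "customer.subscription.updated") {
    // Persist the new subscription tier so API routes can gate paid features.
  }
  return Response.json({ received: true });
}
```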
Deployment pipeline
The local workflows document a frontend and backend Docker deployment path to Azure Container Apps through Azure Container Registry. Historically, PromptTriage also ran on Google Cloud Run. The Azure app runtime and ACR were later accidentally removed during CLI cleanup, which is now part of the operational lessons learned rather than hidden from the case study.
Operations
Operational evidence includes API logging, health routes, error feedback UX, dataset ingestion scripts, research progress notes, and failure tracking around model/provider behavior. The next operational step is to attach Azure Container Apps logs, alert thresholds, incident response notes, rollback proof, and cost-monitoring screenshots to make the operations section audit-ready.
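A small sketch of what a health route with structured logging can look like, assuming an illustrative endpoint path and a single backend dependency probe:

```typescript
// app/api/health/route.ts — illustrative health route; the real path and the set
// of dependency checks are assumptions.
export async function GET() {
  const startedAt = Date.now();
  const checks: Record<string, boolean> = {};

  // Probe the retrieval backend; other dependencies could be added the same way.
  try {
    const res = await fetch(`${process.env.BACKEND_URL}/health`, { cache: "no-store" });
    checks.backend = res.ok;
  } catch {
    checks.backend = false;
  }

  const healthy = Object.values(checks).every(Boolean);
  // Structured log line so Container Apps log queries can filter on fields.
  console.log(JSON.stringify({ route: "/api/health", healthy, ms: Date.now() - startedAt }));
  return Response.json({ healthy, checks }, { status: healthy ? 200 : 503 });
}
```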
Cost analysis
The Azure cost screenshot shows EUR 178.12 of credits spent for the PromptTriage research period, grouped under Virtual Machines in East US 2 for the qwentrain workspace. Azure ML job metadata shows the longest completed MoE benchmark ran for 38.19 hours on the gpu-a100 compute target, while Study A ran for 5.35 hours. The main cost lesson is that benchmark design, retry behavior, and inference implementation can matter more than raw model size.
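As a back-of-envelope illustration only, dividing the billed credits by the two itemized job durations gives a blended rate of roughly EUR 4 per GPU-hour. The snippet below makes the assumption explicit that those two jobs dominate the VM meter, which Azure Cost Management does not actually confirm.

```typescript
// Back-of-envelope only: assumes the two itemized Azure ML jobs account for most of
// the EUR 178.12 VM meter, which the cost data does not strictly attribute per job.
const totalCreditsEur = 178.12;
const jobHours = { moeBenchmark: 38.19, studyA: 5.35 };
const totalHours = jobHours.moeBenchmark + jobHours.studyA;  // 43.54 h
const blendedRate = totalCreditsEur / totalHours;            // ≈ 4.09 EUR per GPU-hour
console.log(`~${blendedRate.toFixed(2)} EUR/h across ${totalHours.toFixed(2)} tracked hours`);
```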
Results
Documented results include a 28K+ prompt evidence base, Pinecone retrieval, multiple completed studies, qwen3_14b selected as a production candidate in the research notes, Unsloth/QLoRA model experiments, and a finding that naive RAG performed better than more complex RAG variants for this pipeline. Some downstream benchmark results were nuanced, so the case study should preserve both positive findings and null results instead of overstating uniform gains.
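For context on what "naive RAG" means here, the sketch below shows the simple embed-and-top-k path that outperformed the more agentic variants in the documented study. The index name, metadata key, and embedding model are assumptions, and the production retrieval service is FastAPI rather than TypeScript; this is only the shape of the approach.

```typescript
// Sketch of the naive RAG path: embed the intent, take top-k neighbors from
// Pinecone, and hand them to the metaprompt with no re-ranking or corrective loop.
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

export async function retrieveExamples(intent: string, topK = 5): Promise<string[]> {
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small", // embedding model is an assumption
    input: intent,
  });
  const result = await pinecone.index("prompt-corpus").query({
    vector: embedding.data[0].embedding,
    topK,
    includeMetadata: true,
  });
  // Return stored prompt text; the "prompt" metadata key is illustrative.
  return result.matches.flatMap((m) =>
    m.metadata?.prompt ? [String(m.metadata.prompt)] : []
  );
}
```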
Tradeoffs
The main tradeoff was simplicity versus orchestration complexity. Naive RAG was easier to operate and outperformed more agentic retrieval variants in the documented study, while complex corrective or judge-based RAG added noise. The project also trades broad provider coverage for a more controlled, research-backed prompt workflow.
Failure modes and lessons learned
The research logs are valuable because they document failures and controlled interruptions: judge bias, provider/model availability issues, expensive slow MoE inference, failed Azure ML runs, planned stop/resume benchmark jobs, overfitting on small datasets, prompt strategies that did not move downstream scores, and accidental cloud-resource teardown. These failures make the project stronger as engineering evidence because they show measurement discipline rather than only demo polish.
What I would improve next
Next improvements should include a formal runbook, security model page, threat model, Azure cost report, architecture diagram export stored in the repo, API rate limiting, clearer public benchmark reports, and a more complete deployment checklist. Those additions align directly with the documentation standard from the career execution pack.
Repository and demo links
The public demo has historically been hosted at prompttriage.kaelux.dev and the repository is available on GitHub. The strongest supporting evidence is the README, backend research notes, Azure ML logs, benchmark output JSON, deployment workflows, security policy, release charts, and the marketing-oriented app showcase video.
Interview explanation
I built PromptTriage because prompt quality is usually treated as subjective, but production AI systems need measurable prompt structure, retrieval context, and evaluation discipline. The architecture uses a Next.js interface, Supabase Auth, Stripe billing, API orchestration, FastAPI retrieval, Pinecone vectors, Azure Container Apps, and modality-specific metaprompts. The most important lesson was that more complex RAG is not automatically better: in this project, simple retrieval produced stronger documented results than agentic retrieval traces.
Resume bullets
Designed a RAG-powered prompt engineering platform with modality-specific optimization, Pinecone retrieval, and documented evaluation studies.
Built and evaluated prompt-generation workflows across model selection, retrieval strategy, judge bias, and format sensitivity.
Documented cost, failure modes, and tradeoffs to make the project inspectable by engineering reviewers.