28K+ Prompts in the corpus referenced in local project evidence.
4 Prompt modalities: text, image, video, and system prompts.
L1 Naive RAG was the strongest documented retrieval strategy.
5+ Completed research studies covering model, RAG, format, and judge behavior.
Mar 24, 2026 What We Learned From Analyzing 28,000 Production AI System Prompts
Hashnode article covering the 28K prompt corpus, production anti-patterns, and why prompt hygiene became the core product thesis.
Read on Hashnode →
Mar 22, 2026 AI Format Wars: Does the Shape of Your Prompt Matter?
Hashnode article covering the 1,080-eval format study and the measurable impact of prompt structure across models and domains.
Read on Hashnode →
The app showcase is now available as a GitHub release asset. Because it is about
121 MB and marketing-heavy, this portfolio links to it deliberately instead of
autoplaying or treating it as primary benchmark evidence.
Cost analysis
Job-level Azure ML cost allocation that maps run IDs to billed spend. Current evidence itemizes EUR 178.12 in VM meters, but Azure Cost Management does not directly attribute every euro to a specific ML job.
Operations
Public-safe screenshots or summaries for monitoring, alerts, rollback, incident response, and restore paths.
Security model
Screenshots or redacted config evidence for Supabase Auth, Stripe webhook verification, secrets, and deployment identity.
Deployment pipeline
Visual evidence for Azure Container Apps, ACR, Google Cloud Run history, GitHub Actions runs, image scanning, and rollback.
Architecture
Exportable SVG is now included; optional PNG, Excalidraw, or Draw.io source can be added if you want a diagram asset that is easier to reuse in decks.
Results
A compact public table that maps each study to sample size, model set, evaluation method, and final takeaway.
Executive summary
PromptTriage is a RAG-powered prompt engineering platform built to turn rough ideas into production-ready prompts across text, image, video, and system-prompt workflows. The current build combines a Next.js interface, Supabase Auth, Stripe subscription flows, FastAPI retrieval services, Pinecone vector search, modality-specific metaprompts, and Azure Container Apps deployment. The strongest evidence is not only the app surface, but the documented research process behind retrieval strategy, prompt format, model selection, evaluation bias, and cost tradeoffs.
Problem
Most prompt tools behave like thin wrappers over a model API. The project needed to show whether structured prompt generation, retrieval context, and modality-specific templates could produce better, more reliable prompts than generic chat UX. The user problem is practical: builders need reusable prompts that include constraints, output format, assumptions, and evaluation criteria without manually writing every section from scratch.
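To make the target output concrete, here is a minimal sketch of the kind of structured, reusable prompt object the product aims to produce. The field names are illustrative assumptions, not the actual PromptTriage schema.

```typescript
// Illustrative shape of a "production-ready prompt"; field names are assumptions.
interface GeneratedPrompt {
  objective: string;            // what the prompt should accomplish
  constraints: string[];        // hard requirements the model must respect
  outputFormat: string;         // e.g. "JSON with keys summary, risks, next_steps"
  assumptions: string[];        // stated so reviewers can challenge them
  evaluationCriteria: string[]; // how a human or judge model scores the result
}

const example: GeneratedPrompt = {
  objective: "Summarize a support ticket for an on-call engineer",
  constraints: ["under 120 words", "never invent ticket fields"],
  outputFormat: "JSON with keys summary, severity, suggested_owner",
  assumptions: ["ticket text is in English"],
  evaluationCriteria: ["every system named in the ticket appears in the summary"],
};
```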
Context and constraints
The project was developed as a solo engineering effort with limited budget, evolving model availability, provider API instability, GPU quota constraints, and a need to publish evidence without exposing secrets or private datasets. Most benchmark and training work ran on Azure ML compute clusters, while the app moved through Azure Container Apps and Google Cloud Run deployment phases. Research notes show cost-sensitive experimentation across fine-tuned Qwen variants, RAG strategies, and LLM-as-judge workflows.
Requirements
Functional requirements included prompt analysis, dynamic clarification questions, modality routing, retrieval from a prompt corpus, final prompt synthesis, one-click iteration, paid-plan capability checks, and optional current-documentation lookup through MCP tooling. Non-functional requirements included secure authentication, environment-based secrets, responsive UI, reproducible prompt versions, structured logging, and a containerized path from GitHub to Azure runtime. Research requirements included Unsloth-based QLoRA experiments, Azure ML benchmark jobs, and repeatable evaluation outputs.
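As a rough illustration of the modality-routing and capability-check requirements, the sketch below maps each modality to a metaprompt and a pair of feature flags. Identifiers and flag values are placeholders, not the real configuration.

```typescript
// Sketch of modality routing; metaprompt IDs and flags are placeholders.
type Modality = "text" | "image" | "video" | "system";

interface ModalityConfig {
  metapromptId: string;        // which metaprompt template drives synthesis
  requiresPaidPlan: boolean;   // enforced server-side before generation
  supportsDocsLookup: boolean; // whether MCP documentation lookup is offered
}

const routing: Record<Modality, ModalityConfig> = {
  text:   { metapromptId: "meta_text_v1",   requiresPaidPlan: false, supportsDocsLookup: true },
  image:  { metapromptId: "meta_image_v1",  requiresPaidPlan: true,  supportsDocsLookup: false },
  video:  { metapromptId: "meta_video_v1",  requiresPaidPlan: true,  supportsDocsLookup: false },
  system: { metapromptId: "meta_system_v1", requiresPaidPlan: true,  supportsDocsLookup: true },
};
```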
Architecture
The architecture separates the user-facing workflow from retrieval and prompt-engineering services. The UI collects intent and modality, Next.js API routes enforce Supabase session and subscription context, the backend retrieves relevant examples from Pinecone, and modality-specific metaprompts synthesize the final result. Azure Container Apps hosts the containerized frontend and backend services through images built and pushed by GitHub Actions.
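A minimal sketch of that orchestration boundary as a Next.js App Router API route, assuming illustrative route paths, environment variable names, a "subscriptions" table, and a /synthesize endpoint on the FastAPI backend:

```typescript
// app/api/generate/route.ts — hedged sketch of the orchestration layer.
// Route path, env var names, table name, and backend endpoint are assumptions.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function POST(req: Request) {
  // 1. Resolve the Supabase session from the bearer token.
  const token = req.headers.get("authorization")?.replace("Bearer ", "") ?? "";
  const { data, error } = await supabase.auth.getUser(token);
  if (error || !data.user) {
    return Response.json({ error: "unauthenticated" }, { status: 401 });
  }

  // 2. Enforce the subscription gate server-side, never in the browser.
  const { data: plan } = await supabase
    .from("subscriptions")
    .select("tier")
    .eq("user_id", data.user.id)
    .single();
  const { intent, modality } = await req.json();
  if (modality !== "text" && plan?.tier !== "pro") {
    return Response.json({ error: "upgrade required" }, { status: 402 });
  }

  // 3. Delegate retrieval and metaprompt synthesis to the FastAPI backend.
  const backend = await fetch(`${process.env.BACKEND_URL}/synthesize`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ intent, modality, userId: data.user.id }),
  });
  return Response.json(await backend.json(), { status: backend.status });
}
```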
Security model
The current security model centers on Supabase Auth, server-side subscription checks, Stripe webhook verification, environment-based secrets, Azure OIDC deployment, and no publication of private provider keys or vector indexes. The risk/control table below is intentionally evidence-oriented so reviewers can see which controls are implemented and which supporting artifacts still need screenshots.
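As an illustration of the webhook-verification control, the sketch below shows the standard Stripe signature check in a Next.js route handler. Environment variable names and the handled event type are assumptions, not the exact production configuration.

```typescript
// app/api/stripe/webhook/route.ts — minimal sketch of webhook signature checking.
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

export async function POST(req: Request) {
  const payload = await req.text(); // raw body is required for signature verification
  const signature = req.headers.get("stripe-signature") ?? "";

  let event: Stripe.Event;
  try {
    // Rejects any request whose signature does not match the endpoint secret.
    event = stripe.webhooks.constructEvent(
      payload,
      signature,
      process.env.STRIPE_WEBHOOK_SECRET!
    );
  } catch {
    return Response.json({ error: "invalid signature" }, { status: 400 });
  }

  if (event.type === "customer.subscription.updated") {
    // Persist the new subscription tier so API routes can gate paid features.
  }
  return Response.json({ received: true });
}
```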
Deployment pipeline
The local workflows document a frontend and backend Docker deployment path to Azure Container Apps through Azure Container Registry. Historically, PromptTriage also ran on Google Cloud Run. The Azure app runtime and ACR were later accidentally removed during CLI cleanup, which is now part of the operational lessons learned rather than hidden from the case study.
Operations
Operational evidence includes API logging, health routes, error feedback UX, dataset ingestion scripts, research progress notes, and failure tracking around model/provider behavior. The next operational step is to attach Azure Container Apps logs, alert thresholds, incident response notes, rollback proof, and cost-monitoring screenshots to make the operations section audit-ready.
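A small sketch of what a health route with structured logging can look like, assuming an illustrative endpoint path and a single backend dependency probe:

```typescript
// app/api/health/route.ts — illustrative health route; the real path and the set
// of dependency checks are assumptions.
export async function GET() {
  const startedAt = Date.now();
  const checks: Record<string, boolean> = {};

  // Probe the retrieval backend; other dependencies could be added the same way.
  try {
    const res = await fetch(`${process.env.BACKEND_URL}/health`, { cache: "no-store" });
    checks.backend = res.ok;
  } catch {
    checks.backend = false;
  }

  const healthy = Object.values(checks).every(Boolean);
  // Structured log line so Container Apps log queries can filter on fields.
  console.log(JSON.stringify({ route: "/api/health", healthy, ms: Date.now() - startedAt }));
  return Response.json({ healthy, checks }, { status: healthy ? 200 : 503 });
}
```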
Cost analysis
The Azure cost screenshot shows EUR 178.12 of credits spent for the PromptTriage research period, grouped under Virtual Machines in East US 2 for the qwentrain workspace. Azure ML job metadata shows the longest completed MoE benchmark ran for 38.19 hours on the gpu-a100 compute target, while Study A ran for 5.35 hours. The main cost lesson is that benchmark design, retry behavior, and inference implementation can matter more than raw model size.
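As a back-of-envelope illustration only, dividing the billed credits by the two itemized job durations gives a blended rate of roughly EUR 4 per GPU-hour. The snippet below makes the assumption explicit that those two jobs dominate the VM meter, which Azure Cost Management does not actually confirm.

```typescript
// Back-of-envelope only: assumes the two itemized Azure ML jobs account for most of
// the EUR 178.12 VM meter, which the cost data does not strictly attribute per job.
const totalCreditsEur = 178.12;
const jobHours = { moeBenchmark: 38.19, studyA: 5.35 };
const totalHours = jobHours.moeBenchmark + jobHours.studyA;  // 43.54 h
const blendedRate = totalCreditsEur / totalHours;            // ≈ 4.09 EUR per GPU-hour
console.log(`~${blendedRate.toFixed(2)} EUR/h across ${totalHours.toFixed(2)} tracked hours`);
```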
Results
Documented results include a 28K+ prompt evidence base, Pinecone retrieval, multiple completed studies, qwen3_14b selected as a production candidate in the research notes, Unsloth/QLoRA model experiments, and a finding that naive RAG performed better than more complex RAG variants for this pipeline. Some downstream benchmark results were nuanced, so the case study should preserve both positive findings and null results instead of overstating uniform gains.
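For context on what "naive RAG" means here, the sketch below shows the simple embed-and-top-k path that outperformed the more agentic variants in the documented study. The index name, metadata key, and embedding model are assumptions, and the production retrieval service is FastAPI rather than TypeScript; this is only the shape of the approach.

```typescript
// Sketch of the naive RAG path: embed the intent, take top-k neighbors from
// Pinecone, and hand them to the metaprompt with no re-ranking or corrective loop.
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI();
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });

export async function retrieveExamples(intent: string, topK = 5): Promise<string[]> {
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small", // embedding model is an assumption
    input: intent,
  });
  const result = await pinecone.index("prompt-corpus").query({
    vector: embedding.data[0].embedding,
    topK,
    includeMetadata: true,
  });
  // Return stored prompt text; the "prompt" metadata key is illustrative.
  return result.matches.flatMap((m) =>
    m.metadata?.prompt ? [String(m.metadata.prompt)] : []
  );
}
```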
Tradeoffs
The main tradeoff was simplicity versus orchestration complexity. Naive RAG was easier to operate and outperformed more agentic retrieval variants in the documented study, while complex corrective or judge-based RAG added noise. The project also trades broad provider coverage for a more controlled, research-backed prompt workflow.
Failure modes and lessons learned
The research logs are valuable because they document failures and controlled interruptions: judge bias, provider/model availability issues, expensive slow MoE inference, failed Azure ML runs, planned stop/resume benchmark jobs, overfitting on small datasets, prompt strategies that did not move downstream scores, and accidental cloud-resource teardown. These failures make the project stronger as engineering evidence because they show measurement discipline rather than only demo polish.
What I would improve next
Next improvements should include a formal runbook, security model page, threat model, Azure cost report, architecture diagram export stored in the repo, API rate limiting, clearer public benchmark reports, and a more complete deployment checklist. Those additions align directly with the documentation standard from the career execution pack.
Repository and demo links
The public demo has historically been hosted at prompttriage.kaelux.dev and the repository is available on GitHub. The strongest supporting evidence is the README, backend research notes, Azure ML logs, benchmark output JSON, deployment workflows, security policy, release charts, and the marketing-oriented app showcase video.
Interview explanation
I built PromptTriage because prompt quality is usually treated as subjective, but production AI systems need measurable prompt structure, retrieval context, and evaluation discipline. The architecture uses a Next.js interface, Supabase Auth, Stripe billing, API orchestration, FastAPI retrieval, Pinecone vectors, Azure Container Apps, and modality-specific metaprompts. The most important lesson was that more complex RAG is not automatically better: in this project, simple retrieval produced stronger documented results than agentic retrieval traces.
Resume bullets
Designed a RAG-powered prompt engineering platform with modality-specific optimization, Pinecone retrieval, and documented evaluation studies.
Built and evaluated prompt-generation workflows across model selection, retrieval strategy, judge bias, and format sensitivity.
Documented cost, failure modes, and tradeoffs to make the project inspectable by engineering reviewers.