PROJECTS · BUILD LOG · 2025–2026

Projects

Things I've built at the intersection of SRE and AI infrastructure. Each one runs in production (or close to it), and each one taught me something I couldn't have learned from reading docs.

→ runcast-intelligence/

LIVE

RunCast Intelligence

Semantic search · 4,151 episodes · 9 podcasts

I transcribed thousands of hours of running podcasts with OpenAI Whisper, split the transcripts into chunks, embedded each chunk with an embedding model, and stored the vectors in Supabase pgvector. Now you can ask "how do elites taper for a marathon?" and get a real answer, with exact timestamps so you can jump straight to the source.
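A minimal sketch of the chunking step, assuming each transcript arrives as a list of segments with `start`, `end`, and `text` keys (the shape of Whisper's segment output). The `Chunk` type, `chunk_segments`, `format_timestamp`, and the 1,000-character budget are illustrative names and values, not taken from the project source.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    start: float  # seconds into the episode where the chunk begins
    end: float    # seconds into the episode where the chunk ends
    text: str


def chunk_segments(segments, max_chars=1000):
    """Group consecutive transcript segments into chunks of at most ~max_chars.

    Each chunk keeps the start time of its first segment, which is what
    makes a jump-to-source link possible later. max_chars is an assumed
    budget, not a value from the project.
    """
    chunks, texts, size = [], [], 0
    start, end = None, 0.0
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue
        # Close the current chunk if this segment would push it over budget.
        if texts and size + len(text) + 1 > max_chars:
            chunks.append(Chunk(start, end, " ".join(texts)))
            texts, size, start = [], 0, None
        if start is None:
            start = seg["start"]
        texts.append(text)
        size += len(text) + 1
        end = seg["end"]
    if texts:
        chunks.append(Chunk(start, end, " ".join(texts)))
    return chunks


def format_timestamp(seconds):
    """Render seconds as H:MM:SS for display next to an answer."""
    total = int(seconds)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"
```

Splitting on segment boundaries rather than fixed character offsets means every chunk starts at a real timestamp, at the cost of slightly uneven chunk sizes.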

The search pipeline embeds your query, runs cosine similarity against 589 transcript chunks across 9 podcasts, and passes the top results to Claude, which writes the answer from those passages (the RAG step). The whole thing runs on a FastAPI backend deployed on Railway, with a Next.js frontend on Vercel.
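The ranking step can be sketched in plain Python. This is an in-memory illustration only: with pgvector the same ordering would normally be done in SQL (its `<=>` operator computes cosine distance), and whether the project uses that operator is an assumption. The function names and the `(chunk_id, embedding)` tuple shape are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=5):
    """Return the k best (score, chunk_id) pairs, highest similarity first.

    chunks is an iterable of (chunk_id, embedding) tuples; this shape is
    assumed for the example.
    """
    scored = [(cosine_similarity(query_vec, emb), cid) for cid, emb in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The chunks returned here are what would be placed in the prompt, along with their timestamps, before the model writes the answer.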

Python FastAPI pgvector Whisper RAG Next.js Railway

LIVE DEMO ↗ SOURCE

→ winesnap/

LIVE

WineSnap

AI wine picker · mobile PWA · 3-stage pipeline

You're standing in front of a wine shelf. You don't know where to start. Take a photo — WineSnap reads every label using a vision model, searches the web for live prices and critic scores for each wine, then tells you which bottle is the best value for your money (or your dinner, or your occasion).

The pipeline runs three stages in sequence: Gemini Vision extracts the wine names from the photo, Tavily looks up the price and critic score for each wine (the lookups run in parallel), and an LLM calculates the value ratio (score ÷ price) and writes the recommendation. Deployed as a mobile-first PWA: add it to your home screen and use it like a native app.
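A rough sketch of stages two and three, under stated assumptions. The `lookup` coroutine is a hypothetical stand-in for the per-wine web search (no real Tavily call is shown), and the score ÷ price ranking is computed in plain Python here for illustration, whereas the pipeline described above has an LLM do that step. `WineQuote`, `gather_quotes`, and `rank_by_value` are invented names.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class WineQuote:
    name: str
    price: float  # price in a single currency
    score: float  # critic score, e.g. on a 100-point scale


async def gather_quotes(names, lookup):
    """Run one lookup per wine concurrently and drop the ones that failed.

    lookup is a hypothetical async callable: name -> WineQuote.
    return_exceptions=True keeps one failed search from sinking the batch.
    """
    results = await asyncio.gather(
        *(lookup(name) for name in names), return_exceptions=True
    )
    return [r for r in results if isinstance(r, WineQuote)]


def rank_by_value(quotes):
    """Sort by score / price, best value first; skip non-positive prices."""
    priced = [q for q in quotes if q.price > 0]
    return sorted(priced, key=lambda q: q.score / q.price, reverse=True)
```

Running the lookups concurrently keeps total latency close to the slowest single search rather than the sum of all of them, which matters when a shelf photo contains a dozen labels.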

Python FastAPI Gemini Vision Tavily Next.js PWA OpenRouter

LIVE DEMO ↗ SOURCE

→ inferenceforge/

IN PROGRESS

InferenceForge

LLM inference · Kubernetes · GPU-native

A production-grade LLM serving platform designed for GPU Kubernetes clusters. One Helm flag separates a local CPU deployment (Ollama + TinyLlama, zero cost) from a GPU deployment (vLLM + Mistral-7B on NVIDIA T4s on EKS). The architecture is the same either way.

The gateway sits in front of the model backend and handles everything you'd need in production: an OpenAI-compatible API, a request queue (the deployment scales on queue depth through a custom HPA metric, not just CPU), rate limiting, streaming responses, and Prometheus metrics for latency histograms, token throughput, and active requests. Terraform provisions EKS with Karpenter so GPU nodes scale to zero when idle.
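To make the queue-depth scaling concrete, here is a small sketch of the replica calculation the Kubernetes Horizontal Pod Autoscaler documents: desired = ceil(current replicas × current metric ÷ target metric), skipped when the ratio is within a tolerance of 1. The target of 10 queued requests per pod, the replica bounds, and the function name are assumptions for illustration, not values from InferenceForge.

```python
import math


def desired_replicas(current_replicas, queue_depth, target_per_pod,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA decision for a queue-depth metric.

    queue_depth is the total number of queued requests; target_per_pod is
    the queue depth each pod should carry on average. All defaults here
    are assumed for the example.
    """
    if current_replicas < 1 or target_per_pod <= 0:
        raise ValueError("need at least one replica and a positive target")
    current_per_pod = queue_depth / current_replicas
    ratio = current_per_pod / target_per_pod
    # Within tolerance of the target: leave the deployment alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

Scaling on queue depth reacts to waiting requests directly, which suits GPU-bound inference where CPU utilization can stay low while requests pile up.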

Kubernetes Helm Terraform FastAPI Prometheus vLLM NVIDIA T4

SOURCE