Building systems that scale and agents that reason

I design highly scalable applications, develop agentic microservices, and work on distributed systems — across both computation and storage. Projects, live demos, and daily technical writing live here.


What I work on

Scalable Systems

Designing applications that grow gracefully — from architecture decisions to operational patterns that hold under load.

Agentic Applications

Building autonomous software that reasons, plans, and acts — from single agents to coordinated multi-agent workflows.

Agentic Microservices

Decomposing intelligence into independently deployable services with clear boundaries, contracts, and observability.

Distributed Computation

Parallel and fault-tolerant processing across nodes — task orchestration, stream processing, and workload scheduling.

Distributed Storage

Data systems built for consistency, partition tolerance, and horizontal scale — from replication to sharding strategies.

Projects

Agentic AI · LangGraph · Browser Automation · LLM

UI-Navigator Agent

Autonomous browser agent that accepts a target platform, operational goal, and task intent, then plans and executes multi-step UI workflows end to end. Designed to translate natural-language objectives into reliable, platform-specific interaction sequences without manual navigation.

NLP · spaCy · Gemini · FastAPI · Azure

ResumeSnap

Career intelligence platform with a companion browser extension. spaCy NLP resolves contextual semantics from resume content, the Gemini REST API generates stack-aligned project outlines, and Azure Whisper handles voice-to-text ingestion for hands-free input.

Next.js · Stripe · Real-time · E-commerce

Gandom Bakery Platform

Production e-commerce system for a local bakery with real-time admin notifications, bidirectional inventory sync between back office and storefront, Stripe payment processing, and automated inventory ingestion workflows.

Recent posts

rdd
Apache Spark · 15 min

Rethinking Fault Tolerance and Data Locality in Distributed Systems: From WAL to RDDs

Distributed computing faces a constant engineering dilemma: how do you prevent data loss when a server crashes without destroying your processing speed? Traditionally, systems relied on fine-grained replication or heavy Write-Ahead Logging (WAL), shipping log records across the network to backup machines before an update could be committed. While safe, this disk- and network-heavy approach created massive bottlenecks for big data analytics.

Apache Spark flipped this paradigm by introducing Resilient Distributed Datasets (RDDs). By trading fine-grained, record-level edits for bulk, coarse-grained transformations, Spark avoids replicating intermediate data altogether. Instead, it records a lightweight "recipe" of your data pipeline called a lineage graph. If a node dies, Spark reads that blueprint and recomputes only the lost partitions, in memory.

But true performance goes beyond memory access; it requires mastering data locality. By overriding the default partitioning and explicitly applying hash partitioning on the aggregation key (like the vendor category in the NYC TLC dataset), engineers can place every record for a given key in the same partition. The payoff? Downstream aggregations on that key turn from expensive, network-heavy wide-dependency shuffles into fast narrow-dependency operations that run locally within each partition.
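
Both mechanisms can be illustrated with a toy, single-process sketch. It is not Spark's API: `ToyRDD`, `map_values`, `partition_by_key`, `sum_by_key_locally`, and the sample (vendor_id, fare_in_cents) pairs are all hypothetical. Each derived dataset keeps only a function that rebuilds a partition from its parent (its lineage), so a lost in-memory partition is recomputed rather than restored from a replica. After one hash-partitioning step, each key lives in exactly one partition, so the per-key sum runs locally. This is roughly what calling Spark's `partitionBy` and then `reduceByKey` with the same partitioner achieves.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Toy, single-process model of RDD lineage and hash partitioning.
# Hypothetical names throughout; this is NOT Spark's API.
Pair = Tuple[int, int]  # (vendor_id, fare_in_cents), hypothetical schema


class ToyRDD:
    def __init__(self, num_partitions: int, compute: Callable[[int], List[Pair]]):
        self.num_partitions = num_partitions
        self._compute = compute                   # lineage: how to rebuild partition i
        self._cache: Dict[int, List[Pair]] = {}   # in-memory partitions (can be lost)
        self.recomputed: List[int] = []           # log of partitions (re)built

    @staticmethod
    def from_storage(partitions: List[List[Pair]]) -> "ToyRDD":
        # Source data sits in stable storage, so it can always be re-read.
        return ToyRDD(len(partitions), lambda i: list(partitions[i]))

    def partition(self, i: int) -> List[Pair]:
        if i not in self._cache:
            self.recomputed.append(i)
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def map_values(self, f: Callable[[int], int]) -> "ToyRDD":
        # Narrow dependency: output partition i reads only parent partition i.
        return ToyRDD(self.num_partitions,
                      lambda i: [(k, f(v)) for k, v in self.partition(i)])

    def partition_by_key(self, n: int) -> "ToyRDD":
        # Wide dependency (the one shuffle): output partition j pulls every
        # record whose hash(key) % n == j from ALL parent partitions.
        def compute(j: int) -> List[Pair]:
            return [(k, v) for p in range(self.num_partitions)
                    for k, v in self.partition(p) if hash(k) % n == j]
        return ToyRDD(n, compute)

    def lose(self, i: int) -> None:
        # Simulate a node failure wiping one in-memory partition.
        self._cache.pop(i, None)


def sum_by_key_locally(rdd: ToyRDD) -> Dict[int, int]:
    # Each key lives in exactly one partition after hash partitioning, so
    # per-partition sums merge without moving records between partitions.
    totals: Dict[int, int] = {}
    for i in range(rdd.num_partitions):
        local: Counter = Counter()
        for k, v in rdd.partition(i):
            local[k] += v
        for k, v in local.items():
            assert k not in totals, "key was split across partitions"
            totals[k] = v
    return totals


trips = ToyRDD.from_storage([[(1, 1000), (2, 750)], [(1, 400), (2, 300), (1, 100)]])
fares = trips.map_values(lambda cents: cents + 50)   # hypothetical surcharge
by_vendor = fares.partition_by_key(2)
totals = sum_by_key_locally(by_vendor)

# Lose partitions at two levels, then rebuild only what the lineage requires.
fares.lose(1)
by_vendor.lose(0)
recovered = sum_by_key_locally(by_vendor)
print(totals, recovered, fares.recomputed, trips.recomputed)
```

After the simulated failure, only `fares` partition 1 and `by_vendor` partition 0 are rebuilt; `trips` is not re-read because its partitions are still cached, and the recovered totals match the originals.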

LLM · FineTuning
#AIInfrastructure · 20 min

Demystifying LLM Fine-Tuning: How LoRA and QLoRA Save Your Hardware (and Your Budget)

High-performance large language models (LLMs) are incredibly powerful, but fine-tuning them on private corporate data can be astronomically expensive. This technical report breaks down how LoRA (Low-Rank Adaptation) trains small low-rank update matrices instead of the full weight matrices, and how QLoRA adds 4-bit quantization of the frozen base model, to drastically reduce GPU memory and training costs. The result: you can build custom AI agents without breaking your hardware budget.
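
The core arithmetic behind LoRA fits in a few lines of plain Python. The pretrained weight W0 stays frozen, and the update is factored into two small matrices, B (d_out × r) and A (r × d_in), scaled by alpha / r. B starts at zero, so training begins from the base model's exact behavior. This is a toy illustration, not the PEFT library; `LoRALinear`, the helpers, and the example sizes (a 4096 × 4096 projection, rank 8, a 7B-parameter model) are assumptions. The memory helper counts weight storage only, which is where QLoRA's 4-bit base model saves space; it ignores quantization constants, activations, and optimizer state.

```python
from typing import List

# Toy LoRA layer in plain Python (not the PEFT library); names are hypothetical.
Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


class LoRALinear:
    """y = W0 x + (alpha / r) * B (A x); only A and B would be trained."""

    def __init__(self, w0: Matrix, a: Matrix, alpha: float):
        r = len(a)                          # rank = number of rows in A
        self.w0 = w0                        # frozen pretrained weights, d_out x d_in
        self.a = a                          # trainable down-projection, r x d_in
        self.b = [[0.0] * r for _ in w0]    # trainable up-projection, d_out x r, zero-init
        self.scale = alpha / r

    def forward(self, x: Vector) -> Vector:
        base = matvec(self.w0, x)
        # Apply A then B; the full d_out x d_in update is never materialized.
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]


def trainable_params(d_out: int, d_in: int, r: int) -> int:
    # B has d_out * r entries, A has r * d_in entries.
    return r * (d_out + d_in)


def weight_gb(num_params: float, bits_per_param: int) -> float:
    # Weight storage only: ignores quantization constants, activations, optimizer state.
    return num_params * bits_per_param / 8 / 1e9


layer = LoRALinear(w0=[[1.0, 2.0], [3.0, 4.0]], a=[[0.5, -1.0]], alpha=2.0)
print(layer.forward([1.0, 1.0]))        # B is zero, so this equals W0 x

full = 4096 * 4096                       # hypothetical 4096 x 4096 projection
lora = trainable_params(4096, 4096, r=8)
print(full, lora, f"{100 * lora / full:.2f}%")
print(weight_gb(7e9, 16), weight_gb(7e9, 4))   # hypothetical 7B model: fp16 vs 4-bit
```

For the assumed 4096 × 4096 layer at rank 8, LoRA trains 65,536 parameters instead of 16,777,216 (about 0.39%). Storing 7B frozen weights at 4 bits instead of 16 cuts weight memory from 14 GB to 3.5 GB.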

#LLMs (Large Language Models) #FineTuning #MachineLearning #ArtificialIntelligence #GenerativeAI

DistributiveSystemDesign
#AIInfrastructure · 10 min

Tyche: Optimizing Serverless Machine Learning via Proactive Pre-Loading

Can we completely eliminate machine learning "cold starts" in serverless clusters? When ML models are packaged into serverless functions, the standard container pre-warming used by cloud providers isn't enough. Why? Traditional apps are lightweight, but ML workflows carry massive dependencies (like PyTorch) and heavy model files (like BERT). A staggering 70% of a serverless ML cold start is spent just loading these artifacts from disk into memory.

In my latest technical report, I break down "Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters" (published in IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, February 2026). The paper introduces Tyche, an architecture that opportunistically pre-loads ML artifacts into already-warmed containers and GPUs before a request even lands. Here is how the underlying math handles erratic traffic spikes without wasting CPU cycles on retraining:

⏱️ The 7.4-Second Math Adaptation

Instead of relying on rigid, historical 24-hour traffic averages that fail during sudden surges, Tyche monitors a short sliding window of recent requests (e.g., W = 5) to estimate the request arrival rate λ. It then plugs this live rate into a Poisson model of arrivals with two probability thresholds:

Load threshold, P_load = 6%: the moment the probability of an incoming request reaches 6%, Tyche acts. At a steady pace of 0.5 requests/min, this triggers a proactive pre-load after about 7.4 seconds of idle time, so the model is loaded and waiting before the user arrives.

Offload threshold, P_offload = 94%: if a traffic lull sets in and the probability that the prediction was wrong reaches 94% (after about 5.6 minutes), Tyche immediately flushes the model to keep cluster memory lean.

⚡ The Real Engineering Win

When a sudden burst of traffic hits, the sliding window recalculates immediately. If λ jumps from 0.5 to 0.55 requests/min:

1. Zero Retraining Overhead: no GPU/CPU cycles are spent adjusting ML weights.
2. Instant Math Recalculation: the target pre-load window tightens from about 7.4 seconds to about 6.75 seconds.

The system winds up aggressively during surges and relaxes during lulls, yielding up to a 93% reduction in loading latency.

#Serverless #MachineLearning #SystemArchitecture #CloudComputing #AWSLambda #DistributedSystems #IEEE #TechCommunity
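
The thresholds quoted in this post are consistent with a simple Poisson model in which the probability of at least one request within an idle time t is 1 - e^(-λt). For example, -ln(1 - 0.06) / 0.5 per minute gives about 0.124 min, or roughly 7.4 s. The sketch below assumes that model. The function names, the timestamp-based rate estimator, and the sample arrival times are hypothetical and not taken from the Tyche paper.

```python
import math
from typing import Sequence

# Sketch of the threshold arithmetic, ASSUMING arrivals follow a Poisson process,
# so P(at least one request within idle time t) = 1 - exp(-lam * t).
# Function names and the rate estimator are hypothetical, not Tyche's code.


def arrival_rate(arrival_times_min: Sequence[float], window: int = 5) -> float:
    """Estimate lam (requests/min) from the last `window` arrival timestamps."""
    recent = sorted(arrival_times_min)[-window:]
    if len(recent) < 2 or recent[-1] == recent[0]:
        raise ValueError("need at least two distinct arrival times")
    # (len(recent) - 1) inter-arrival gaps spread over the window's time span.
    return (len(recent) - 1) / (recent[-1] - recent[0])


def idle_time_for(prob: float, lam: float) -> float:
    """Idle time in minutes at which 1 - exp(-lam * t) reaches `prob`."""
    return -math.log(1.0 - prob) / lam


P_LOAD, P_OFFLOAD = 0.06, 0.94

lam = arrival_rate([0.0, 2.0, 4.0, 6.0, 8.0])      # 0.5 requests/min
preload_s = idle_time_for(P_LOAD, lam) * 60        # about 7.4 s
offload_min = idle_time_for(P_OFFLOAD, lam)        # about 5.6 min
surge_s = idle_time_for(P_LOAD, 0.55) * 60         # about 6.75 s after lam rises to 0.55/min
print(f"pre-load {preload_s:.2f} s, offload {offload_min:.2f} min, surge {surge_s:.2f} s")
```

Because only λ changes during a surge, adapting the timers is a single logarithm and division per window update, which is why no model weights need retraining.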

Connect

Interested in distributed systems, agentic architecture, or collaborating on a project? Find me on GitHub, LinkedIn, or NotebookLM — or book a quick intro call.