Building systems that scale and agents that reason
I design highly scalable applications, develop agentic microservices, and work on distributed systems — across both computation and storage. Projects, live demos, and daily technical writing live here.
Autonomous browser agent that accepts a target platform, operational goal, and task intent, then plans and executes multi-step UI workflows end to end. Designed to translate natural-language objectives into reliable, platform-specific interaction sequences without manual navigation.
Career intelligence platform with a companion browser extension. SpaCy NLP resolves contextual semantics from resume content, the Gemini REST API generates stack-aligned project outlines, and Azure Whisper handles voice-to-text ingestion for hands-free input.
Production e-commerce system for a local bakery with real-time admin notifications, bidirectional inventory sync between back office and storefront, Stripe payment processing, and automated inventory ingestion workflows.
🚀 Rethinking Fault Tolerance and Data Locality in Distributed Systems: From WAL to RDDs
Distributed computing faces a constant engineering dilemma: how do you prevent data loss when a server crashes without completely destroying your processing speed?
Traditionally, systems relied on heavy Write-Ahead Logging (WAL): persisting log records to disk, and often shipping them across the network to replicas, before a change could take effect. While safe, this disk- and network-heavy approach created massive bottlenecks for big data analytics.
Apache Spark flipped this paradigm by introducing Resilient Distributed Datasets (RDDs). By trading fine-grained updates for bulk, coarse-grained transformations, Spark avoids replicating the data itself for fault tolerance. Instead, it logs a lightweight "recipe" of your data pipeline called a Lineage Graph. If a node dies, Spark reads that blueprint and recomputes only the lost partitions, in memory where possible.
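To make the lineage idea concrete, here is a minimal pure-Python sketch. It is not Spark's API: `ToyRDD`, `load_source`, and the sample pipeline are hypothetical names invented for illustration. Each dataset stores only the function that rebuilds a partition from its parent, so a lost partition can be recomputed without touching the others.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

source_reads: List[int] = []  # records which source partitions get (re)loaded


def load_source(index: int) -> List[int]:
    """Hypothetical durable input: partition i holds four consecutive integers."""
    source_reads.append(index)
    return list(range(index * 4, index * 4 + 4))


@dataclass(frozen=True)
class ToyRDD:
    """Toy stand-in for an RDD: it keeps the recipe for a partition, not the data."""
    num_partitions: int
    compute: Callable[[int], List[int]]  # partition index -> records

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Coarse-grained transformation: child partition i needs only parent partition i.
        return ToyRDD(self.num_partitions, lambda i: [fn(x) for x in self.compute(i)])

    def filter(self, keep: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(self.num_partitions, lambda i: [x for x in self.compute(i) if keep(x)])


# Build the lineage, then materialize every partition once (the "in-memory" copy).
pipeline = ToyRDD(3, load_source).map(lambda x: x * 10).filter(lambda x: x % 20 == 0)
cache: Dict[int, List[int]] = {i: pipeline.compute(i) for i in range(pipeline.num_partitions)}

# Simulate a node failure that loses partition 1, then replay the lineage for it alone.
source_reads.clear()
del cache[1]
cache[1] = pipeline.compute(1)

print(cache[1])       # [40, 60]
print(source_reads)   # [1]
```

Only source partition 1 is read again during recovery, which is the property the lineage graph is meant to provide. Real Spark adds scheduling, caching policy, and optional checkpointing on top of this idea.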
But true performance goes beyond memory access; it requires mastering Data Locality. By overriding the default partitioning and explicitly enforcing Hash Partitioning on the aggregation key (like vendor categories in the NYC TLC dataset), engineers can co-locate all records that share a key in the same partition. The payoff? Downstream aggregations on that key can turn from expensive, network-choking Wide Dependency Shuffles into localized, fast Narrow Dependency operations that run entirely within each partition's local RAM.
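The effect of hash partitioning can be sketched without a cluster. The example below is a toy model in plain Python, not Spark code: the record values, `stable_hash`, and the helper names are assumptions made for illustration. Once every record is placed by the hash of its key, each key lives in exactly one partition, so a per-key aggregation needs no data from any other partition.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, float]  # (key, fare) pairs; the values are made up


def stable_hash(key: str) -> int:
    # crc32 is deterministic across runs, unlike Python's salted built-in hash().
    return zlib.crc32(key.encode("utf-8"))


def hash_partition(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Place every record in the partition chosen by the hash of its key."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[stable_hash(key) % num_partitions].append((key, value))
    return partitions


def aggregate_locally(partition: List[Record]) -> Dict[str, float]:
    """Sum values per key using only the records inside one partition."""
    totals: Dict[str, float] = defaultdict(float)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


records: List[Record] = [
    ("vendor_a", 12.5), ("vendor_b", 7.0), ("vendor_a", 3.5),
    ("vendor_c", 20.0), ("vendor_b", 1.0),
]
partitions = hash_partition(records, num_partitions=4)
partials = [aggregate_locally(p) for p in partitions]

# No key appears in two partitions, so the partial results are already final.
keys_seen = [key for partial in partials for key in partial]
totals = {key: value for partial in partials for key, value in partial.items()}
print(totals)
```

In Spark itself, the analogous step is calling `partitionBy` on a pair RDD before the keyed aggregation, so that the aggregation can reuse the existing partitioning.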
High-performance Large Language Models (LLMs) are incredibly powerful, but fine-tuning them on private corporate data can be astronomically expensive. This technical report breaks down how LoRA (Low-Rank Adaptation) and QLoRA use clever linear algebra and bit-precision compression to drastically reduce GPU memory and training costs—allowing you to build custom AI agents without breaking your hardware budget.
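A quick back-of-the-envelope calculation shows where the savings come from. LoRA freezes a weight matrix of shape d_out x d_in and trains two small factors of rank r instead, while QLoRA additionally stores the frozen base weights in 4-bit precision. The layer size (4096 x 4096) and rank (8) below are illustrative assumptions, and the byte counts ignore quantization metadata and optimizer state.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Storage for the weights alone at a given precision."""
    return num_params * bits_per_param // 8


d_out = d_in = 4096   # assumed layer size
rank = 8              # assumed LoRA rank

full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(full)                     # 16777216
print(lora)                     # 65536
print(full // lora)             # 256
print(weight_bytes(full, 16))   # 33554432 bytes at 16-bit
print(weight_bytes(full, 4))    # 8388608 bytes at 4-bit
```

For this single layer, the low-rank factors hold 256 times fewer trainable parameters than the full matrix, and 4-bit storage cuts the frozen weights to a quarter of their 16-bit size.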
#LLMs (Large Language Models) #FineTuning #MachineLearning #ArtificialIntelligence #GenerativeAI
Can we completely eliminate Machine Learning "Cold Starts" in Serverless Clusters?
When packaging ML models into serverless functions, the standard "container pre-warming" used by cloud providers isn't enough.
Why? Because traditional apps are lightweight, but ML workflows carry massive dependencies (like PyTorch) and heavy model files (like BERT). A staggering 70% of a serverless ML cold start is spent just loading these libraries from disk into memory.
In my latest technical report, I break down "Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters" (published in IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, February 2026). The paper introduces Tyche, an architecture that solves this by opportunistically pre-loading ML artifacts into already-warmed containers and GPUs before a request even lands.
Here is how the underlying math dynamically handles erratic traffic spikes without wasting heavy CPU retraining cycles:
⏱️ The 7.4-Second Math Adaptation
Instead of relying on rigid, historical 24-hour traffic averages that fail during sudden surges, Tyche monitors a tight sliding window of recent requests (e.g., W=5) to calculate the request arrival rate (lambda).
It then plugs this live rate into a Poisson distribution formula using two optimal probability thresholds:
Load Threshold P_load = 6%: The moment the probability of an incoming request hits 6%, Tyche acts. For a standard traffic pace of 0.5 requests/min, the math triggers a proactive pre-load timer at about 7.4 seconds of idle time. The model is booted and waiting before the user arrives.
Offload Threshold P_offload = 94%: If a traffic lull happens and the probability that a prediction was wrong hits 94% (around 5.6 minutes), Tyche immediately flushes the model to keep the cluster memory lean.
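Both timings can be reproduced with a few lines of arithmetic. The sketch below assumes one particular reading of the Poisson step: with arrival rate lambda, the probability of at least one request within t minutes is 1 - exp(-lambda * t), and each threshold is the idle time at which that probability reaches the target. This formula and the function name are my assumptions; the paper's exact formulation may differ, but this reading matches the figures quoted above.

```python
import math


def idle_time_for_probability(rate_per_min: float, probability: float) -> float:
    """Minutes of idle time until P(at least one arrival) reaches `probability`.

    Assumes Poisson arrivals: P(N(t) >= 1) = 1 - exp(-rate * t),
    solved for t = -ln(1 - probability) / rate.
    """
    return -math.log(1.0 - probability) / rate_per_min


rate = 0.5        # requests per minute, as in the example above
p_load = 0.06     # load threshold
p_offload = 0.94  # offload threshold

preload_after_s = idle_time_for_probability(rate, p_load) * 60.0
offload_after_min = idle_time_for_probability(rate, p_offload)

print(f"pre-load after {preload_after_s:.1f} s")      # pre-load after 7.4 s
print(f"offload after {offload_after_min:.1f} min")   # offload after 5.6 min
```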
⚡ The Real Engineering Win
When a sudden burst of traffic hits, the sliding window instantly recalculates. If lambda jumps from 0.5 to 0.55:
1. Zero Retraining Overhead: No heavy GPU/CPU cycles are wasted adjusting complex ML weights.
2. Instant Math Recalculation: The target pre-load window automatically tightens from 7.4 seconds down to ~6.7 seconds.
The entire system winds up aggressively during surges and relaxes during lulls—yielding up to a 93% reduction in loading latency.
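The adaptation loop can be sketched as a small sliding-window estimator feeding the pre-load timer. This is a simplified illustration, not Tyche's implementation: `SlidingRateEstimator`, the (n - 1) / span rate estimate, and the sample timestamps are all assumptions, and the timer formula is the same Poisson reading, 1 - exp(-lambda * t), used to reproduce the 7.4-second figure.

```python
import math
from collections import deque
from typing import Deque, Optional

P_LOAD = 0.06  # load threshold from the text


class SlidingRateEstimator:
    """Estimates the arrival rate from the last `window` request timestamps."""

    def __init__(self, window: int = 5) -> None:
        self.arrivals: Deque[float] = deque(maxlen=window)  # timestamps in minutes

    def record(self, t_min: float) -> None:
        self.arrivals.append(t_min)  # the oldest timestamp drops out automatically

    def rate(self) -> Optional[float]:
        # n timestamps span n - 1 inter-arrival gaps.
        if len(self.arrivals) < 2:
            return None
        span = self.arrivals[-1] - self.arrivals[0]
        return (len(self.arrivals) - 1) / span if span > 0 else None


def preload_delay_seconds(rate_per_min: float, p_load: float = P_LOAD) -> float:
    """Idle seconds until P(at least one arrival) reaches p_load under Poisson arrivals."""
    return -math.log(1.0 - p_load) / rate_per_min * 60.0


estimator = SlidingRateEstimator(window=5)
for t in (0.0, 2.0, 4.0, 6.0, 8.0):      # one request every two minutes
    estimator.record(t)
steady_rate = estimator.rate()            # 4 gaps over 8 minutes

for t in (9.0, 10.0, 11.0, 12.0):        # burst: one request per minute
    estimator.record(t)
burst_rate = estimator.rate()             # window is now 8..12, 4 gaps over 4 minutes

print(steady_rate, round(preload_delay_seconds(steady_rate), 2))
print(burst_rate, round(preload_delay_seconds(burst_rate), 2))
print(round(preload_delay_seconds(0.55), 2))
```

No model is retrained at any point: a new timestamp changes the rate estimate, and the timer follows from a single logarithm. For lambda = 0.55 this sketch gives about 6.75 seconds, in line with the roughly 6.7 seconds quoted above.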
#Serverless #MachineLearning #SystemArchitecture #CloudComputing #AWSLambda #DistributedSystems #IEEE #TechCommunity
Connect
Interested in distributed systems, agentic architecture, or collaborating on a project? Find me on GitHub, LinkedIn, or NotebookLM — or book a quick intro call.