I’m driven by a passion for building practical, intelligent systems that bridge machine learning with real-world usability. My projects reflect a focus on efficiency, scalability, and creativity, ranging from cloud-deployed LLMs to voice agents and control systems. I’m always eager to learn, explore emerging tools, and collaborate on solving meaningful challenges in AI!
LLM Systems & AI Engineering
Pre-Training & Foundation Models (GitHub) (Loss Curve during Training)
GPT-2 (124M) Pretraining from Scratch Distributed PyTorch
Designed and implemented a decoder-only GPT-2 (124M) transformer model from scratch in PyTorch and trained it on 0.8B tokens using fully distributed multi-GPU DDP with gradient accumulation and mixed precision.
Tools & Technologies:-
PyTorch, PyTorch Distributed Data Parallel (DDP), torch.distributed, Custom Transformer Implementation, Causal Self-Attention, FlashAttention, AdamW, Cosine Learning Rate Scheduler, Gradient Accumulation, Mixed Precision, Selective Weight Decay Parameter Grouping, Multi-GPU Token Sharding, HellaSwag Benchmarking.
Results:-
Achieved 0.294 HellaSwag accuracy, matching OpenAI’s GPT-2 (124M) baseline performance within a single epoch of training.
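The cosine learning-rate schedule with linear warmup used in this run can be sketched in plain Python; the hyperparameter values below (`max_lr`, `warmup_steps`, `max_steps`) are illustrative placeholders, not the actual training configuration:

```python
import math

def cosine_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=715, max_steps=19073):
    """Linear warmup followed by cosine decay (hypothetical hyperparameters)."""
    if step < warmup_steps:
        # ramp linearly from ~0 up to max_lr over the warmup window
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        # past the decay horizon, hold at the floor learning rate
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # decays 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

In the training loop this value is assigned to each optimizer parameter group per step, alongside gradient accumulation and mixed precision.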
Fine-Tuning & Alignment with DPO
8B Model Fine-Tuning for Low-Resource Translation (QLoRA + DDP) with DPO
Fine-tuned an 8B-parameter LLM using QLoRA with multi-GPU DDP training to improve translation quality for low-resource and morphologically complex languages, followed by Direct Preference Optimization (DPO) for alignment stabilization.
Tools & Technologies:-
PyTorch, Hugging Face Transformers, PEFT (QLoRA), TRL, Multi-GPU DDP, 4-bit NF4 Quantization, Gradient Checkpointing, Direct Preference Optimization, LLM-as-a-Judge Evaluation, Vertex AI, Docker, GKE
Results:-
Achieved a 10x BLEU and 36% BLEURT improvement over the baseline.
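The DPO objective applied after QLoRA fine-tuning can be illustrated on scalar sequence log-probabilities. This is a minimal sketch of the per-pair loss (the same quantity TRL's DPO trainer averages over a batch); `beta=0.1` is a common default, not necessarily the value used here:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Inputs are sequence log-probabilities under the trainable policy (pi_*)
    and the frozen reference model (ref_*); beta scales the implicit KL penalty.
    """
    # implicit reward margin: how much more the policy prefers the chosen
    # response than the reference model does
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(logits)): pushes the margin to be positive
    return math.log(1.0 + math.exp(-logits))
```

When the policy equals the reference, the margin is zero and the loss sits at log 2; it falls as the policy learns to prefer the chosen translation.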
LLM Inference Systems & GPU Optimization (GitHub) (Project Slides)
Dual-GPU LLM Inference Engine
Engineered a disaggregated LLM inference system separating prefill and decode across dual NVIDIA L4 GPUs.
Tools & Technologies:-
PyTorch, CUDA, vLLM, Layer-wise KV Streaming, Prefill–Decode Separation, Multi-GPU Architecture, GPU Memory Management, Async Batching, Compute–Transfer Overlap, torch.cuda Streams, Latency Profiling, Throughput Benchmarking
Results:-
Achieved 60–96% compute/transfer overlap, a 20x TTFT improvement, and a 4x throughput gain over the single-GPU baseline.
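The prefill–decode separation with layer-wise KV streaming can be sketched with asyncio standing in for CUDA streams: the prefill side pushes KV blocks layer by layer, and the decode side starts consuming them before prefill has finished, which is where the compute–transfer overlap comes from. All names here are illustrative; the real system moves tensors between two L4 GPUs with torch.cuda streams:

```python
import asyncio

async def prefill(prompt, n_layers=4):
    # "prefill GPU": compute the KV cache layer by layer and stream each
    # layer out as soon as it is ready, rather than transferring all at once
    queue = asyncio.Queue()
    async def run():
        for layer in range(n_layers):
            await asyncio.sleep(0)                      # stands in for attention compute
            await queue.put((layer, f"kv[{layer}]({prompt})"))
        await queue.put(None)                           # end-of-cache sentinel
    asyncio.create_task(run())
    return queue

async def decode(queue, steps=3):
    # "decode GPU": begins receiving KV blocks while prefill is still running
    kv = []
    while (item := await queue.get()) is not None:
        kv.append(item)
    # with the full cache assembled, run the autoregressive decode loop
    return [f"tok{i}" for i in range(steps)], len(kv)

async def main():
    q = await prefill("hello")
    return await decode(q)
```

Running `asyncio.run(main())` yields the decoded tokens plus the count of streamed KV blocks.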
Agents, RAG & Applied AI Systems
Low-Latency Dense Retrieval RAG System (GitHub) (Architecture)
Designed and built a full-stack Two-Stage Retrieval RAG system over a 12,000+ document corpus using LoRA-fine-tuned Sentence Transformers and FLAN-T5 for answer synthesis.
Tools & Technologies:-
Python, PyTorch, Transformers, Sentence Transformers, FAISS, LoRA, Two-Stage Retrieval, Semantic Chunking, Docker.
Results:-
Improved top-3 retrieval accuracy from 81% to 92.4% and achieved a 1.1x retrieval-latency speedup.
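The two-stage retrieval pattern reduces to: a cheap dense pass narrows the corpus, then a stronger scorer reranks only the survivors. A minimal sketch with toy vectors; in the deployed system stage 1 is FAISS over LoRA-fine-tuned Sentence-Transformer embeddings, and `rerank_fn` is a placeholder for the second-stage scorer:

```python
import math

def cosine(a, b):
    # cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def two_stage_retrieve(query_vec, doc_vecs, rerank_fn, k1=3, k2=1):
    # stage 1: cheap dense similarity narrows the corpus to k1 candidates
    candidates = sorted(range(len(doc_vecs)),
                        key=lambda i: cosine(query_vec, doc_vecs[i]),
                        reverse=True)[:k1]
    # stage 2: the stronger (slower) scorer reranks only those candidates
    return sorted(candidates, key=rerank_fn, reverse=True)[:k2]
```

The latency win comes from running the expensive scorer over k1 candidates instead of the full 12,000+ document corpus.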
Voice-Driven AI Appointment Agent (GitHub) (DockerHub)
Built a voice-enabled AI appointment booking agent using Whisper for transcription, LangGraph for stateful orchestration, and Groq's LLM for dialogue slot-filling.
Tools & Technologies:-
Whisper, LangChain, LangGraph, Groq API, AWS SES, Docker, Streamlit, Python
Results:-
86% booking success across 15+ real-world test scenarios.
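The stateful slot-filling that LangGraph orchestrates can be sketched as a simple policy over the conversation state: ask for the first missing slot, and confirm once all slots are filled. The slot names are hypothetical, and the LLM extraction step is omitted:

```python
REQUIRED_SLOTS = ("name", "date", "time")  # illustrative slot schema

def next_action(state):
    """Given the slots filled so far, decide the agent's next dialogue move."""
    for slot in REQUIRED_SLOTS:
        if not state.get(slot):
            # a missing slot drives the next question to the caller
            return f"ask_{slot}"
    # every slot filled: route to the booking/confirmation node
    return "confirm_booking"
```

In the real graph each `ask_*` action is a node whose reply is transcribed by Whisper and parsed into slot values by the Groq-hosted LLM before the state loops back through this decision.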
News Summarization (GitHub)
Fine-tuned T5-small on CNN/DailyMail using QLoRA and 8-bit quantization for abstractive summarization, deployed via AWS Lambda + API Gateway with Streamlit frontend.
Tools & Technologies:-
T5-small, QLoRA, Hugging Face, SageMaker, AWS Lambda, API Gateway, 8-bit quantization, Streamlit, Python.
Results:-
Improved ROUGE-L score by 20% (to 42.7).
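The serving path can be sketched as a minimal AWS Lambda proxy handler behind API Gateway; `summarize` here is a stand-in for the call to the fine-tuned T5-small endpoint, not the actual inference code:

```python
import json

def summarize(text, max_words=30):
    # placeholder: the real handler invokes the deployed T5-small model
    return " ".join(text.split()[:max_words])

def lambda_handler(event, context=None):
    """Hypothetical handler: API Gateway proxies {"text": ...} to the model."""
    body = json.loads(event.get("body") or "{}")
    text = body.get("text", "")
    if not text:
        return {"statusCode": 400,
                "body": json.dumps({"error": "missing 'text'"})}
    return {"statusCode": 200,
            "body": json.dumps({"summary": summarize(text)})}
```

The Streamlit frontend then POSTs article text to the API Gateway URL and renders the returned summary.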
Cold Email Generator with ChromaDB (GitHub)
Built a context-aware cold email generator using ChromaDB for semantic search and GPT/Claude APIs for personalized text generation.
Tools & Technologies:-
ChromaDB, Sentence Transformers, GPT/Claude APIs, CSV, Python, Streamlit, Jupyter.
Results:-
Focused on embedding-driven semantic matching for high-relevance content generation.
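The embedding-driven matching step can be sketched in plain Python. In the deployed system the portfolio embeddings live in a ChromaDB collection and the match is a collection query; the names and the template below are illustrative stand-ins for the GPT/Claude generation call:

```python
import math

def cosine(a, b):
    # cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pick_portfolio_item(job_vec, portfolio):
    # portfolio: list of (description, embedding) pairs; in the real system
    # this is a ChromaDB collection query over Sentence-Transformer embeddings
    return max(portfolio, key=lambda item: cosine(job_vec, item[1]))[0]

def draft_email(company, role, portfolio_item):
    # template stand-in for the LLM generation step
    return (f"Hi {company} team, I noticed your {role} opening. "
            f"Relevant work: {portfolio_item}.")
```

Grounding the email in the most similar portfolio item is what keeps the generated text specific to the job posting rather than generic.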