1
1
India
1 year of experience
AI/ML engineer and undergraduate at IIT (BHU) focused on agentic AI systems, LLM reasoning architectures, and hybrid retrieval pipelines. I build production oriented AI workflows using LangGraph, FastAPI, vector databases, and PEFT/QLoRA fine-tuning. Currently working on systems involving autonomous agents, reflection based reasoning, adaptive retrieval, and reliability aware AI evaluation. My recent projects include a Startup Intelligence & Hybrid Knowledge Reasoning System and graph orchestrated LLM agents with tool integration and conversational memory. Interested in building scalable AI systems that combine reasoning, retrieval, evaluation, and real world deployment.

As AI agents become increasingly integrated into real-world applications, understanding retrieval reliability and preprocessing sensitivity has become a major challenge in Retrieval-Augmented Generation (RAG) systems. Most traditional evaluations focus only on architecture performance while ignoring how preprocessing decisions can significantly affect retrieval robustness and benchmark outcomes. In this project, we built an interactive observability and benchmarking platform for evaluating robustness in Agentic RAG systems. The platform compares Single-Agent and Multi-Agent RAG architectures across SQuAD and HotpotQA benchmarks using Exact Match (EM) and F1 evaluation metrics. Through systematic experiments, we discovered a key insight: preprocessing strategies such as chunking can completely flip benchmark winners. Without chunking, the Single-Agent system slightly outperformed the Multi-Agent system on SQuAD. However, after introducing chunking, the Multi-Agent architecture became significantly more robust under noisy retrieval conditions. To make these behaviors observable, we developed an interactive Streamlit dashboard featuring benchmark comparison analytics, retrieval trace visualization, chunking impact analysis, and failure inspection. One of the core components of the platform is the Retrieval Trace Viewer, which allows users to inspect how Multi-Agent systems rewrite queries, retrieve semantically richer chunks, and improve answer generation step-by-step. We also analyzed common RAG failure modes such as vocabulary mismatch, retrieval pollution, and chunk fragmentation. Our findings demonstrate that retrieval robustness depends not only on architecture design but also heavily on preprocessing strategy and retrieval quality. Technologies used include LangChain, LangGraph, FAISS, HuggingFace Embeddings, Groq LLMs, Streamlit, Plotly, and Python.
19 May 2026