Document Ingestion and Semantic Query System Using Retrieval-Augmented Generation (RAG)

Overview

This application implements a Retrieval-Augmented Generation (RAG) question-answering system using Streamlit for the user interface, ChromaDB for vector storage, and Ollama for response generation. Users upload PDF documents, which are processed into text chunks, stored as vector embeddings, and retrieved at query time to ground the AI-generated answers.


System Components

1. File Processing and Text Chunking

Function: process_document(uploaded_file: UploadedFile) -> list[Document]

  • Takes a user-uploaded PDF file and processes it into smaller text chunks.
  • Uses PyMuPDFLoader to extract text from PDFs.
  • Splits extracted text into overlapping segments using RecursiveCharacterTextSplitter.
  • Returns a list of Document objects containing text chunks and metadata.

Key Steps:

  1. Save uploaded file to a temporary file.
  2. Load content using PyMuPDFLoader.
  3. Split text using RecursiveCharacterTextSplitter.
  4. Delete the temporary file.
  5. Return the list of Document objects.
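
A minimal sketch of this flow, assuming the LangChain community PyMuPDF loader; the chunk size, overlap, and separators shown here are illustrative, not the app's exact values:

```python
import os
import tempfile

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from streamlit.runtime.uploaded_file_manager import UploadedFile


def process_document(uploaded_file: UploadedFile) -> list[Document]:
    # Write the uploaded bytes to a temporary file so PyMuPDFLoader can read a path.
    with tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False) as temp_file:
        temp_file.write(uploaded_file.read())
        temp_path = temp_file.name

    try:
        docs = PyMuPDFLoader(temp_path).load()
    finally:
        os.unlink(temp_path)  # always remove the temporary file

    # Split the extracted text into overlapping chunks for embedding.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return text_splitter.split_documents(docs)
```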

2. Vector Storage and Retrieval (ChromaDB)

Creating a ChromaDB Collection

Function: get_vector_collection() -> chromadb.Collection

  • Initializes ChromaDB with a persistent vector store.
  • Uses OllamaEmbeddingFunction to generate vector embeddings.
  • Retrieves or creates a collection for storing document embeddings.
  • Uses cosine similarity for querying documents.

Key Steps:

  1. Define OllamaEmbeddingFunction for embedding generation.
  2. Initialize ChromaDB PersistentClient.
  3. Retrieve or create a ChromaDB collection for storing vectors.
  4. Return the collection object.
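
A sketch of the collection setup; the embedding model, endpoint URL, storage path, and collection name are assumptions, and the OllamaEmbeddingFunction import path can differ between ChromaDB versions:

```python
import chromadb
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction


def get_vector_collection() -> chromadb.Collection:
    # Embeddings come from a local Ollama server; the model name is an assumption.
    ollama_ef = OllamaEmbeddingFunction(
        url="http://localhost:11434/api/embeddings",
        model_name="nomic-embed-text:latest",
    )

    # Persist vectors to disk so the collection survives app restarts.
    chroma_client = chromadb.PersistentClient(path="./demo-rag-chroma")
    return chroma_client.get_or_create_collection(
        name="rag_app",
        embedding_function=ollama_ef,
        metadata={"hnsw:space": "cosine"},  # use cosine similarity for queries
    )
```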

Adding Documents to Vector Store

Function: add_to_vector_collection(all_splits: list[Document], file_name: str)

  • Takes a list of document chunks and stores them in ChromaDB.
  • Each chunk is stored with a unique ID derived from the file name.
  • A success message is displayed via Streamlit.

Key Steps:

  1. Retrieve ChromaDB collection using get_vector_collection().
  2. Convert document chunks into a list of text embeddings, metadata, and unique IDs.
  3. Use upsert() to store document embeddings.
  4. Display success message.
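
A sketch that reuses get_vector_collection() from above; forming each ID from the normalized file name plus the chunk index is an assumption:

```python
import streamlit as st
from langchain_core.documents import Document


def add_to_vector_collection(all_splits: list[Document], file_name: str):
    collection = get_vector_collection()
    documents, metadatas, ids = [], [], []

    # Build parallel lists of chunk text, metadata, and IDs for ChromaDB.
    for idx, split in enumerate(all_splits):
        documents.append(split.page_content)
        metadatas.append(split.metadata)
        ids.append(f"{file_name}_{idx}")

    # upsert() inserts new chunks and overwrites any with the same ID.
    collection.upsert(documents=documents, metadatas=metadatas, ids=ids)
    st.success("Data added to the vector store!")
```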

Querying the Vector Collection

Function: query_collection(prompt: str, n_results: int = 10) -> dict

  • Queries ChromaDB with a user-provided search query.
  • Returns the n_results most relevant documents ranked by similarity.

Key Steps:

  1. Retrieve ChromaDB collection.
  2. Perform query using collection.query().
  3. Return retrieved documents and metadata.
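
A sketch of the query helper, again reusing get_vector_collection() from above:

```python
def query_collection(prompt: str, n_results: int = 10) -> dict:
    collection = get_vector_collection()
    # ChromaDB embeds the query text with the collection's embedding function
    # and returns the n_results nearest chunks plus metadata and distances.
    return collection.query(query_texts=[prompt], n_results=n_results)
```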

3. Language Model Interaction (Ollama API)

Generating Responses using the AI Model

Function: call_llm(context: str, prompt: str)

  • Calls Ollama's language model to generate a context-aware response.
  • Uses a system prompt to guide the model's behavior.
  • Streams the AI-generated response in chunks.

Key Steps:

  1. Send system prompt and user query to Ollama.
  2. Retrieve and yield streamed responses.
  3. Display results in Streamlit.
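
A sketch of the streaming call, assuming the ollama Python client; the model name and system prompt wording are placeholders rather than the app's exact values:

```python
import ollama

# Placeholder system prompt; the app's actual prompt may be more detailed.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the question using only the provided "
    "context. If the context is insufficient, say that you do not know."
)


def call_llm(context: str, prompt: str):
    # Stream the response so Streamlit can render tokens as they arrive.
    response = ollama.chat(
        model="llama3.2:3b",  # assumed model; any locally pulled Ollama model works
        stream=True,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {prompt}"},
        ],
    )
    for chunk in response:
        if not chunk["done"]:
            yield chunk["message"]["content"]
```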

4. Cross-Encoder Based Re-Ranking

Function: re_rank_cross_encoders(documents: list[str]) -> tuple[str, list[int]]

  • Uses CrossEncoder (MS MARCO MiniLM model) to re-rank retrieved documents.
  • Selects the top 3 most relevant documents.
  • Returns concatenated relevant text and document indices.

Key Steps:

  1. Load MS MARCO MiniLM CrossEncoder model.
  2. Rank documents using cross-encoder re-ranking.
  3. Extract the top-ranked documents.
  4. Return concatenated text and indices.
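
A sketch using the sentence-transformers CrossEncoder; the exact MS MARCO MiniLM checkpoint name is an assumption, and the user query is passed in explicitly here:

```python
from sentence_transformers import CrossEncoder


def re_rank_cross_encoders(prompt: str, documents: list[str]) -> tuple[str, list[int]]:
    relevant_text = ""
    relevant_text_ids = []

    # Score every (query, document) pair and keep the three best matches.
    encoder_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder_model.rank(prompt, documents, top_k=3)

    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_text_ids.append(rank["corpus_id"])

    return relevant_text, relevant_text_ids
```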

User Interface (Streamlit)

1. Document Uploading and Processing

  • Sidebar allows PDF file upload.
  • User clicks Process to extract text and store embeddings.
  • File name is normalized before processing.
  • Extracted text chunks are stored in ChromaDB.
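
A sketch of the sidebar flow; the widget labels and the name-normalization rule are assumptions:

```python
import streamlit as st

with st.sidebar:
    st.header("RAG Question Answer")
    uploaded_file = st.file_uploader(
        "Upload a PDF file", type=["pdf"], accept_multiple_files=False
    )
    process = st.button("Process")

    if uploaded_file and process:
        # Normalize the file name so it is safe to use as a ChromaDB ID prefix.
        normalized_name = uploaded_file.name.translate(
            str.maketrans({"-": "_", ".": "_", " ": "_"})
        )
        all_splits = process_document(uploaded_file)
        add_to_vector_collection(all_splits, normalized_name)
```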

2. Question Answering System

  • Main interface displays a text area for users to enter questions.
  • Clicking Ask triggers the retrieval and response generation process:
    1. Query ChromaDB to retrieve relevant documents.
    2. Re-rank documents using cross-encoder.
    3. Pass relevant text and question to the LLM.
    4. Stream and display the AI-generated response.
    5. Provide options to view retrieved documents and rankings.
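
A sketch of the Ask flow that wires the helpers above together; the expander labels are illustrative:

```python
st.header("Ask a question about your documents")
prompt = st.text_area("Enter your question:")
ask = st.button("Ask")

if ask and prompt:
    # Retrieve candidates, re-rank them, then stream the grounded answer.
    results = query_collection(prompt)
    context_docs = results.get("documents", [[]])[0]
    relevant_text, relevant_text_ids = re_rank_cross_encoders(prompt, context_docs)
    st.write_stream(call_llm(context=relevant_text, prompt=prompt))

    with st.expander("See retrieved documents"):
        st.write(results)
    with st.expander("See most relevant document ids"):
        st.write(relevant_text_ids)
        st.write(relevant_text)
```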

Technologies Used

  • Streamlit → framework for building the interactive web interface.
  • PyMuPDF → PDF text extraction.
  • ChromaDB → Vector database for semantic search.
  • Ollama → LLM API for generating responses.
  • LangChain → Document processing utilities.
  • Sentence Transformers (CrossEncoder) → Document re-ranking.

Error Handling & Edge Cases

  • File I/O Errors: Proper handling of temporary file read/write issues.
  • ChromaDB Errors: Maintains database consistency and handles query failures.
  • Ollama API Failures: Detects and handles API unavailability or timeouts.
  • Empty Document Handling: Ensures that no empty files are processed.
  • Invalid Queries: Provides feedback for low-relevance queries.

Conclusion

This application provides an interactive, RAG-based Q&A system that combines retrieval, cross-encoder re-ranking, and generation to deliver relevant, context-grounded responses. The architecture keeps document processing, vector storage, and answer generation cleanly separated, building on widely used models and embeddings.