# **Document Ingestion and Semantic Query System Using Retrieval-Augmented Generation (RAG)**
## **Overview**

This application implements a **Retrieval-Augmented Generation (RAG) based Question Answering System** using Streamlit for the user interface, ChromaDB for vector storage, and Ollama for generating responses. The system allows users to upload **PDF documents**, process them into **text chunks**, store them as **vector embeddings**, and retrieve relevant information to generate AI-powered responses.

---
## **System Components**
### **1. File Processing and Text Chunking**

**Function:** `process_document(uploaded_file: UploadedFile) -> list[Document]`

- Takes a user-uploaded **PDF file** and processes it into **smaller text chunks**.
- Uses **PyMuPDFLoader** to extract text from PDFs.
- Splits the extracted text into **overlapping segments** using **RecursiveCharacterTextSplitter**.
- Returns a list of **Document objects** containing text chunks and metadata.

**Key Steps:**

1. Save the uploaded file to a **temporary file**.
2. Load the content using **PyMuPDFLoader**.
3. Split the text using **RecursiveCharacterTextSplitter**.
4. Delete the temporary file.
5. Return the **list of Document objects** (see the sketch below).
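
A minimal sketch of this flow, assuming LangChain's `PyMuPDFLoader` and `RecursiveCharacterTextSplitter`; the chunk size, overlap, and separators below are illustrative rather than the project's actual settings:

```python
import os
import tempfile

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from streamlit.runtime.uploaded_file_manager import UploadedFile


def process_document(uploaded_file: UploadedFile) -> list[Document]:
    # PyMuPDFLoader expects a file path, so persist the upload to a
    # temporary file first.
    temp_file = tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False)
    temp_file.write(uploaded_file.read())
    temp_file.close()

    try:
        docs = PyMuPDFLoader(temp_file.name).load()
    finally:
        os.unlink(temp_file.name)  # delete the temporary file in all cases

    # Overlapping chunks preserve context across chunk boundaries.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,  # illustrative values; tune per corpus
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return splitter.split_documents(docs)
```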
---
### **2. Vector Storage and Retrieval (ChromaDB)**

#### **Creating a ChromaDB Collection**

**Function:** `get_vector_collection() -> chromadb.Collection`

- Initializes **ChromaDB** with a **persistent vector store**.
- Uses **OllamaEmbeddingFunction** to generate vector embeddings.
- Retrieves or creates a collection for storing **document embeddings**.
- Uses **cosine similarity** for querying documents.

**Key Steps:**

1. Define an **OllamaEmbeddingFunction** for embedding generation.
2. Initialize a **ChromaDB PersistentClient**.
3. Retrieve or create a **ChromaDB collection** for storing vectors.
4. Return the **collection object** (a sketch follows below).
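
A sketch of the collection setup, assuming Ollama serves embeddings at its default local endpoint; the embedding model name, storage path, and collection name are illustrative assumptions:

```python
import chromadb
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction


def get_vector_collection() -> chromadb.Collection:
    # Embeddings come from a locally running Ollama instance.
    ollama_ef = OllamaEmbeddingFunction(
        url="http://localhost:11434/api/embeddings",  # default Ollama endpoint
        model_name="nomic-embed-text",  # assumed embedding model
    )

    # PersistentClient keeps the vector store on disk across restarts.
    client = chromadb.PersistentClient(path="./demo-rag-chroma")
    return client.get_or_create_collection(
        name="rag_app",
        embedding_function=ollama_ef,
        metadata={"hnsw:space": "cosine"},  # cosine similarity for queries
    )
```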
#### **Adding Documents to Vector Store**

**Function:** `add_to_vector_collection(all_splits: list[Document], file_name: str)`

- Takes a list of document chunks and stores them in **ChromaDB**.
- Each chunk is stored under a **unique ID** derived from the file name.
- A success message is displayed via **Streamlit**.

**Key Steps:**

1. Retrieve the ChromaDB collection using `get_vector_collection()`.
2. Convert the document chunks into parallel lists of **texts, metadata, and unique IDs**.
3. Use `upsert()` to store the document embeddings.
4. Display a success message (see the sketch below).
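
A sketch of the upsert step, reusing `get_vector_collection()` from the previous sketch. Deriving IDs from the file name plus a chunk index (as assumed here) means re-processing the same file overwrites chunks instead of duplicating them:

```python
import streamlit as st
from langchain_core.documents import Document


def add_to_vector_collection(all_splits: list[Document], file_name: str):
    collection = get_vector_collection()

    documents, metadatas, ids = [], [], []
    for idx, split in enumerate(all_splits):
        documents.append(split.page_content)
        metadatas.append(split.metadata)
        ids.append(f"{file_name}_{idx}")  # unique ID per chunk

    # upsert() inserts new IDs and overwrites existing ones, so the same
    # file can be re-processed without creating duplicates.
    collection.upsert(documents=documents, metadatas=metadatas, ids=ids)
    st.success("Data added to the vector store!")
```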
#### **Querying the Vector Collection**

**Function:** `query_collection(prompt: str, n_results: int = 10) -> dict`

- Queries **ChromaDB** with a user-provided search query.
- Returns the **top n most relevant documents** based on similarity.

**Key Steps:**

1. Retrieve the ChromaDB collection.
2. Perform the query using `collection.query()`.
3. Return the **retrieved documents and metadata** (sketch below).
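
A sketch of the query step, again reusing `get_vector_collection()`. ChromaDB embeds the query text with the collection's embedding function and returns a dict of parallel lists (`documents`, `metadatas`, `distances`, `ids`):

```python
def query_collection(prompt: str, n_results: int = 10) -> dict:
    # ChromaDB embeds the query text with the collection's embedding
    # function and returns the n_results most similar chunks.
    collection = get_vector_collection()
    return collection.query(query_texts=[prompt], n_results=n_results)
```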
---
### **3. Language Model Interaction (Ollama API)**

#### **Generating Responses Using the AI Model**

**Function:** `call_llm(context: str, prompt: str)`

- Calls **Ollama**'s language model to generate a **context-aware response**.
- Uses a **system prompt** to guide the model's behavior.
- Streams the AI-generated response in **chunks**.

**Key Steps:**

1. Send the **system prompt** and user query to **Ollama**.
2. Retrieve and yield the streamed response chunks.
3. Display the results in **Streamlit** (see the sketch below).
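
A sketch using the `ollama` Python client with streaming enabled; the model name and system-prompt wording are assumptions, not the project's actual values. Because the function yields chunks, Streamlit can render it directly with `st.write_stream(call_llm(...))`:

```python
import ollama

# Illustrative wording; the actual system prompt is defined in the app.
SYSTEM_PROMPT = (
    "You are an AI assistant. Answer the user's question using only the "
    "provided context. If the context is insufficient, say so."
)


def call_llm(context: str, prompt: str):
    # stream=True makes ollama.chat return an iterator of partial chunks.
    response = ollama.chat(
        model="llama3.2",  # assumed model; any locally pulled model works
        stream=True,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": f"Context: {context}\n\nQuestion: {prompt}",
            },
        ],
    )
    for chunk in response:
        if chunk["done"] is False:
            yield chunk["message"]["content"]
```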
---
### **4. Cross-Encoder Based Re-Ranking**

**Function:** `re_rank_cross_encoders(documents: list[str]) -> tuple[str, list[int]]`

- Uses a **CrossEncoder (MS MARCO MiniLM model)** to **re-rank retrieved documents**.
- Selects the **top 3 most relevant documents**.
- Returns the **concatenated relevant text** and **document indices**.

**Key Steps:**

1. Load the **MS MARCO MiniLM CrossEncoder model**.
2. Rank documents with the **cross-encoder**.
3. Extract the **top-ranked documents**.
4. Return the **concatenated text** and **indices** (a sketch follows below).
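
A sketch of the re-ranking step. The exact MiniLM variant is an assumption, and the query is passed in explicitly here for self-containment (the documented signature takes only the documents, so the app presumably reads the prompt from an enclosing scope):

```python
from sentence_transformers import CrossEncoder


def re_rank_cross_encoders(
    prompt: str, documents: list[str]
) -> tuple[str, list[int]]:
    relevant_text = ""
    relevant_text_ids = []

    # A cross-encoder scores each (query, document) pair jointly, which is
    # slower but more precise than the bi-encoder retrieval step alone.
    encoder_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder_model.rank(prompt, documents, top_k=3)
    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_text_ids.append(rank["corpus_id"])

    return relevant_text, relevant_text_ids
```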
---
## **User Interface (Streamlit)**

### **1. Document Uploading and Processing**

- The sidebar allows **PDF file upload**.
- The user clicks **Process** to extract text and store embeddings.
- The file name is **normalized** before processing.
- Extracted **text chunks** are stored in **ChromaDB**.

### **2. Question Answering System**

- The main interface displays a **text area** for users to enter questions.
- Clicking **Ask** triggers the retrieval and response generation process (a combined UI sketch follows this list):
  1. **Query ChromaDB** to retrieve relevant documents.
  2. **Re-rank documents** using the **cross-encoder**.
  3. **Pass the relevant text** and **question** to the **LLM**.
  4. Stream and display the AI-generated response.
  5. Provide options to view the **retrieved documents and rankings**.
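
A condensed sketch of how the two panels might wire the functions above together; the widget labels, the normalization rule, and the use of `st.write_stream` (Streamlit ≥ 1.31) are illustrative assumptions:

```python
import streamlit as st

if __name__ == "__main__":
    # Sidebar: document upload and processing.
    with st.sidebar:
        uploaded_file = st.file_uploader("Upload a PDF", type=["pdf"])
        if st.button("Process") and uploaded_file:
            # Normalize the file name so it is safe to embed in chunk IDs.
            normalized = uploaded_file.name.translate(
                str.maketrans({"-": "_", ".": "_", " ": "_"})
            )
            all_splits = process_document(uploaded_file)
            add_to_vector_collection(all_splits, normalized)

    # Main panel: question answering.
    st.header("RAG Question Answering")
    prompt = st.text_area("Ask a question about your documents:")
    if st.button("Ask") and prompt:
        results = query_collection(prompt)
        retrieved = results.get("documents", [[]])[0]
        relevant_text, relevant_ids = re_rank_cross_encoders(prompt, retrieved)
        st.write_stream(call_llm(context=relevant_text, prompt=prompt))

        with st.expander("See retrieved documents"):
            st.write(results)
        with st.expander("See most relevant document ids"):
            st.write(relevant_ids)
```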
---
## **Technologies Used**

- **Streamlit** → Interactive web UI framework.
- **PyMuPDF** → PDF text extraction.
- **ChromaDB** → Vector database for semantic search.
- **Ollama** → LLM API for generating responses.
- **LangChain** → Document loading and text-splitting utilities.
- **Sentence Transformers (CrossEncoder)** → Document re-ranking.
---
## **Error Handling & Edge Cases**

- **File I/O Errors**: Handles **temporary file read/write issues** gracefully.
- **ChromaDB Errors**: Manages **database consistency issues and query failures**.
- **Ollama API Failures**: Detects and **handles API unavailability or timeouts**.
- **Empty Document Handling**: Ensures that **no empty files** are processed.
- **Invalid Queries**: Provides **feedback for low-relevance queries** (an illustrative guard appears below).
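
The document does not specify the exact handling, but around document processing the guards might look roughly like this (illustrative sketch; `process_document` is the function from Section 1):

```python
import streamlit as st


def safe_process(uploaded_file):
    # Empty-document guard: skip zero-byte uploads.
    if uploaded_file.size == 0:
        st.error("The uploaded file is empty.")
        return None
    try:
        return process_document(uploaded_file)
    except OSError as exc:  # temporary-file read/write issues
        st.error(f"File I/O error: {exc}")
    except Exception as exc:  # loader or parser failures
        st.error(f"Could not process the document: {exc}")
    return None
```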
---
## **Conclusion**

This application provides a **RAG-based interactive Q&A system**, combining **retrieval, re-ranking, and generation** to produce **relevant, context-aware responses**. The architecture ties together efficient document processing, persistent vector storage, and LLM answer generation grounded in the retrieved context.