Document Ingestion and Semantic Query System Using Retrieval-Augmented Generation (RAG)

Overview

This application implements a Retrieval-Augmented Generation (RAG) question-answering system using Streamlit for the user interface, ChromaDB for vector storage, and Ollama for response generation. Users upload PDF documents, which are processed into text chunks, stored as vector embeddings, and retrieved at query time to ground the AI-generated answers.


System Components

1. File Processing and Text Chunking

Function: process_document(uploaded_file: UploadedFile) -> list[Document]

  • Takes a user-uploaded PDF file and processes it into smaller text chunks.
  • Uses PyMuPDFLoader to extract text from PDFs.
  • Splits extracted text into overlapping segments using RecursiveCharacterTextSplitter.
  • Returns a list of Document objects containing text chunks and metadata.

Key Steps:

  1. Save uploaded file to a temporary file.
  2. Load content using PyMuPDFLoader.
  3. Split text using RecursiveCharacterTextSplitter.
  4. Delete the temporary file.
  5. Return the list of Document objects.
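
A minimal sketch of this flow, assuming the LangChain community PyMuPDF loader; the chunk size, overlap, and separators shown here are illustrative, not the app's exact values:

```python
import os
import tempfile

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from streamlit.runtime.uploaded_file_manager import UploadedFile


def process_document(uploaded_file: UploadedFile) -> list[Document]:
    # Write the uploaded bytes to a temporary file so PyMuPDFLoader can read a path.
    with tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False) as temp_file:
        temp_file.write(uploaded_file.read())
        temp_path = temp_file.name

    try:
        docs = PyMuPDFLoader(temp_path).load()
    finally:
        os.unlink(temp_path)  # always remove the temporary file

    # Split the extracted text into overlapping chunks for embedding.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=100,
        separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    )
    return text_splitter.split_documents(docs)
```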

2. Vector Storage and Retrieval (ChromaDB)

Creating a ChromaDB Collection

Function: get_vector_collection() -> chromadb.Collection

  • Initializes ChromaDB with a persistent vector store.
  • Uses OllamaEmbeddingFunction to generate vector embeddings.
  • Retrieves or creates a collection for storing document embeddings.
  • Uses cosine similarity for querying documents.

Key Steps:

  1. Define OllamaEmbeddingFunction for embedding generation.
  2. Initialize ChromaDB PersistentClient.
  3. Retrieve or create a ChromaDB collection for storing vectors.
  4. Return the collection object.
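
A sketch of the collection setup; the embedding model, endpoint URL, storage path, and collection name are assumptions, and the OllamaEmbeddingFunction import path can differ between ChromaDB versions:

```python
import chromadb
from chromadb.utils.embedding_functions import OllamaEmbeddingFunction


def get_vector_collection() -> chromadb.Collection:
    # Embeddings come from a local Ollama server; the model name is an assumption.
    ollama_ef = OllamaEmbeddingFunction(
        url="http://localhost:11434/api/embeddings",
        model_name="nomic-embed-text:latest",
    )

    # Persist vectors to disk so the collection survives app restarts.
    chroma_client = chromadb.PersistentClient(path="./demo-rag-chroma")
    return chroma_client.get_or_create_collection(
        name="rag_app",
        embedding_function=ollama_ef,
        metadata={"hnsw:space": "cosine"},  # use cosine similarity for queries
    )
```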

Adding Documents to Vector Store

Function: add_to_vector_collection(all_splits: list[Document], file_name: str)

  • Takes a list of document chunks and stores them in ChromaDB.
  • Each chunk is stored with a unique ID derived from the file name.
  • A success message is displayed via Streamlit.

Key Steps:

  1. Retrieve ChromaDB collection using get_vector_collection().
  2. Convert document chunks into a list of text embeddings, metadata, and unique IDs.
  3. Use upsert() to store document embeddings.
  4. Display success message.
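
A sketch that reuses get_vector_collection() from above; forming each ID from the normalized file name plus the chunk index is an assumption:

```python
import streamlit as st
from langchain_core.documents import Document


def add_to_vector_collection(all_splits: list[Document], file_name: str):
    collection = get_vector_collection()
    documents, metadatas, ids = [], [], []

    # Build parallel lists of chunk text, metadata, and IDs for ChromaDB.
    for idx, split in enumerate(all_splits):
        documents.append(split.page_content)
        metadatas.append(split.metadata)
        ids.append(f"{file_name}_{idx}")

    # upsert() inserts new chunks and overwrites any with the same ID.
    collection.upsert(documents=documents, metadatas=metadatas, ids=ids)
    st.success("Data added to the vector store!")
```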

Querying the Vector Collection

Function: query_collection(prompt: str, n_results: int = 10) -> dict

  • Queries ChromaDB with a user-provided search query.
  • Returns the n_results most relevant documents ranked by similarity.

Key Steps:

  1. Retrieve ChromaDB collection.
  2. Perform query using collection.query().
  3. Return retrieved documents and metadata.
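
A sketch of the query helper, again reusing get_vector_collection() from above:

```python
def query_collection(prompt: str, n_results: int = 10) -> dict:
    collection = get_vector_collection()
    # ChromaDB embeds the query text with the collection's embedding function
    # and returns the n_results nearest chunks plus metadata and distances.
    return collection.query(query_texts=[prompt], n_results=n_results)
```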

3. Language Model Interaction (Ollama API)

Generating Responses using the AI Model

Function: call_llm(context: str, prompt: str)

  • Calls Ollama's language model to generate a context-aware response.
  • Uses a system prompt to guide the model's behavior.
  • Streams the AI-generated response in chunks.

Key Steps:

  1. Send system prompt and user query to Ollama.
  2. Retrieve and yield streamed responses.
  3. Display results in Streamlit.
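
A sketch of the streaming call, assuming the ollama Python client; the model name and system prompt wording are placeholders rather than the app's exact values:

```python
import ollama

# Placeholder system prompt; the app's actual prompt may be more detailed.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer the question using only the provided "
    "context. If the context is insufficient, say that you do not know."
)


def call_llm(context: str, prompt: str):
    # Stream the response so Streamlit can render tokens as they arrive.
    response = ollama.chat(
        model="llama3.2:3b",  # assumed model; any locally pulled Ollama model works
        stream=True,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {prompt}"},
        ],
    )
    for chunk in response:
        if not chunk["done"]:
            yield chunk["message"]["content"]
```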

4. Cross-Encoder Based Re-Ranking

Function: re_rank_cross_encoders(documents: list[str]) -> tuple[str, list[int]]

  • Uses CrossEncoder (MS MARCO MiniLM model) to re-rank retrieved documents.
  • Selects the top 3 most relevant documents.
  • Returns concatenated relevant text and document indices.

Key Steps:

  1. Load MS MARCO MiniLM CrossEncoder model.
  2. Rank documents using cross-encoder re-ranking.
  3. Extract the top-ranked documents.
  4. Return concatenated text and indices.
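
A sketch using the sentence-transformers CrossEncoder; the exact MS MARCO MiniLM checkpoint name is an assumption, and the user query is passed in explicitly here:

```python
from sentence_transformers import CrossEncoder


def re_rank_cross_encoders(prompt: str, documents: list[str]) -> tuple[str, list[int]]:
    relevant_text = ""
    relevant_text_ids = []

    # Score every (query, document) pair and keep the three best matches.
    encoder_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    ranks = encoder_model.rank(prompt, documents, top_k=3)

    for rank in ranks:
        relevant_text += documents[rank["corpus_id"]]
        relevant_text_ids.append(rank["corpus_id"])

    return relevant_text, relevant_text_ids
```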

User Interface (Streamlit)

1. Document Uploading and Processing

  • Sidebar allows PDF file upload.
  • User clicks Process to extract text and store embeddings.
  • File name is normalized before processing.
  • Extracted text chunks are stored in ChromaDB.
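
A sketch of the sidebar flow; the widget labels and the name-normalization rule are assumptions:

```python
import streamlit as st

with st.sidebar:
    st.header("RAG Question Answer")
    uploaded_file = st.file_uploader(
        "Upload a PDF file", type=["pdf"], accept_multiple_files=False
    )
    process = st.button("Process")

    if uploaded_file and process:
        # Normalize the file name so it is safe to use as a ChromaDB ID prefix.
        normalized_name = uploaded_file.name.translate(
            str.maketrans({"-": "_", ".": "_", " ": "_"})
        )
        all_splits = process_document(uploaded_file)
        add_to_vector_collection(all_splits, normalized_name)
```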

2. Question Answering System

  • Main interface displays a text area for users to enter questions.
  • Clicking Ask triggers the retrieval and response generation process:
    1. Query ChromaDB to retrieve relevant documents.
    2. Re-rank documents using cross-encoder.
    3. Pass relevant text and question to the LLM.
    4. Stream and display the AI-generated response.
    5. Provide options to view retrieved documents and rankings.
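
A sketch of the Ask flow that wires the helpers above together; the expander labels are illustrative:

```python
st.header("Ask a question about your documents")
prompt = st.text_area("Enter your question:")
ask = st.button("Ask")

if ask and prompt:
    # Retrieve candidates, re-rank them, then stream the grounded answer.
    results = query_collection(prompt)
    context_docs = results.get("documents", [[]])[0]
    relevant_text, relevant_text_ids = re_rank_cross_encoders(prompt, context_docs)
    st.write_stream(call_llm(context=relevant_text, prompt=prompt))

    with st.expander("See retrieved documents"):
        st.write(results)
    with st.expander("See most relevant document ids"):
        st.write(relevant_text_ids)
        st.write(relevant_text)
```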

Technologies Used

  • Streamlit → framework for building the interactive web interface.
  • PyMuPDF → PDF text extraction.
  • ChromaDB → Vector database for semantic search.
  • Ollama → LLM API for generating responses.
  • LangChain → Document processing utilities.
  • Sentence Transformers (CrossEncoder) → Document re-ranking.

Error Handling & Edge Cases

  • File I/O Errors: Proper handling of temporary file read/write issues.
  • ChromaDB Errors: Maintains database consistency and handles query failures.
  • Ollama API Failures: Detects and handles API unavailability or timeouts.
  • Empty Document Handling: Ensures that no empty files are processed.
  • Invalid Queries: Provides feedback for low-relevance queries.

Conclusion

This application provides an interactive, RAG-based Q&A system that combines retrieval, cross-encoder re-ranking, and generation to deliver relevant, context-grounded responses. The architecture keeps document processing, vector storage, and answer generation cleanly separated, building on widely used models and embeddings.