Commit · 7c71fe1
Parent(s): 43ad92f
pushing last changes - dockerignore ...

Files changed:
- .gitignore (+1, -2)
- README.md (+162, -0)
- backend/rag_system.py (+1, -1)
.gitignore CHANGED

@@ -54,5 +54,4 @@ Thumbs.db
 # Documents (tracked via Hugging Face Xet)
 
 QUICKSTART.md
-GITHUB_SETUP.md
-README.md
+GITHUB_SETUP.md
README.md ADDED

@@ -0,0 +1,162 @@
---
title: Saudi Law AI Assistant
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# Law Document RAG Chat Application

A web application that lets users ask questions about indexed legal documents using Retrieval-Augmented Generation (RAG).

## Features

- 🤖 **RAG-powered Q&A**: Ask questions about your legal documents and get answers extracted directly from the context
- 📚 **Document Indexing**: Automatically index PDF, TXT, DOCX, and DOC files from a folder
- 🎨 **Modern React Frontend**: Clean, responsive chat interface
- ⚡ **FastAPI Backend**: High-performance API built with LangChain and FAISS
- 🔍 **Exact Context Extraction**: Answers are extracted directly from documents, not generated
- 🔀 **Hybrid Search**: Combines BM25 (keyword-based) and semantic search for improved retrieval accuracy
- 🤗 **Qwen Model Support**: Uses the Qwen/Qwen3-32B model via the HuggingFace router for high-quality Arabic language understanding
- 🚀 **Hugging Face Spaces Ready**: Configured for easy deployment

## Tech Stack

- **Frontend**: React 18
- **Backend**: FastAPI
- **RAG**: LangChain + FAISS with hybrid search (BM25 + semantic) via EnsembleRetriever
- **Vector Database**: FAISS
- **Embeddings**: Qwen/Qwen3-Embedding-8B (HuggingFace) or OpenAI embeddings (configurable)
- **LLM**: Qwen/Qwen3-32B via the HuggingFace router (default) or the OpenAI API (configurable)
- **Python**: 3.10 or 3.11 (required for faiss-cpu compatibility)

## Project Structure

```
KSAlaw-document-agent/
├── backend/
│   ├── main.py                 # FastAPI application
│   ├── rag_system.py           # RAG implementation
│   ├── document_processor.py   # Document processing logic
│   ├── embeddings.py           # OpenAI embeddings wrapper
│   └── chat_history.py         # Chat history management
├── frontend/
│   ├── src/
│   │   ├── App.js              # Main React component
│   │   ├── App.css             # Styles
│   │   ├── index.js            # React entry point
│   │   └── index.css           # Global styles
│   ├── build/                  # Built React app (for deployment)
│   ├── public/
│   │   └── index.html          # HTML template
│   └── package.json            # Node dependencies
├── documents/                  # Place your PDF documents here
├── vectorstore/                # FAISS vectorstore (auto-generated)
├── app.py                      # Hugging Face Spaces entry point
├── Dockerfile                  # Docker configuration
├── pyproject.toml              # Python dependencies (uv)
├── uv.lock                     # Locked dependencies
├── processed_documents.json    # Processed document summaries
└── README.md                   # This file
```

## Quick Start

**Local Development:**
1. Install dependencies: `uv sync` and `cd frontend && npm install`
2. Create a `.env` file in the project root with the required environment variables:
   - `HF_TOKEN`: Your HuggingFace API token (required for the Qwen model)
   - `OPENAI_API_KEY`: Your OpenAI API key (required for document processing)
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search (BM25 + semantic)
3. Add documents to the `documents/` folder
4. Run the backend: `uv run python backend/main.py`
5. Run the frontend: `cd frontend && npm start`
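For step 2, a minimal `.env` sketch (the token values are placeholders, not real credentials):

```
# .env — lives in the project root, not in backend/
HF_TOKEN=hf_xxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxx
# Optional: enable hybrid retrieval (BM25 + semantic)
USE_HYBRID_SEARCH=true
```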

**Deployment to Hugging Face Spaces:**
1. Build the frontend: `cd frontend && npm run build`
2. Set up Xet storage (recommended) or prepare to upload PDFs via the UI
3. Push to Hugging Face: `git push hf main`
4. Set the required environment variables in the Space secrets:
   - `HF_TOKEN`: Your HuggingFace API token
   - `OPENAI_API_KEY`: Your OpenAI API key
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search

## API Endpoints

- `GET /api/` - Health check
- `GET /api/health` - Health status
- `POST /api/index` - Index documents from a folder
  ```json
  {
    "folder_path": "documents"
  }
  ```
- `POST /api/ask` - Ask a question
  ```json
  {
    "question": "What is the law about X?",
    "use_history": true,
    "context_mode": "chunks",
    "model_provider": "qwen"
  }
  ```
  - `question` (required): The question to ask
  - `use_history` (optional): Whether to use chat history (default: `true`)
  - `context_mode` (optional): `"full"` (entire document) or `"chunks"` (top semantic chunks; default)
  - `model_provider` (optional): `"qwen"` (default) or `"openai"`

**Note**: The default `model_provider` is `"qwen"`, which uses Qwen/Qwen3-32B via the HuggingFace router. When `context_mode="chunks"` is used with hybrid search enabled, the system combines BM25 and semantic search for improved retrieval accuracy.
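As a sketch of how a client might call `/api/ask`, here is a small stdlib-only helper. The base URL assumes the backend is running locally on port 8000; the helper names (`build_ask_payload`, `ask`) are illustrative, not part of the project, and the response schema is not specified here, so the call just returns the parsed JSON:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000/api"  # assumed local backend address

def build_ask_payload(question, use_history=True,
                      context_mode="chunks", model_provider="qwen"):
    """Build the JSON body for POST /api/ask using the documented defaults."""
    return {
        "question": question,
        "use_history": use_history,
        "context_mode": context_mode,
        "model_provider": model_provider,
    }

def ask(question, **kwargs):
    """POST a question to the running backend and return the parsed response."""
    body = json.dumps(build_ask_payload(question, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{API_BASE}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `ask("What is the law about X?", context_mode="full")` would query the full-document context mode against a running backend.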

## Environment Variables

### Required Variables

- `HF_TOKEN`: Your HuggingFace API token (required for the Qwen model and HuggingFace embeddings)
- `OPENAI_API_KEY`: Your OpenAI API key (required for document processing; also used if OpenAI embeddings or the OpenAI LLM are selected)

### Optional Configuration

- `QWEN_MODEL`: Qwen model to use (default: `Qwen/Qwen3-32B:nscale`)
- `EMBEDDINGS_PROVIDER`: Embeddings provider - `"openai"` or `"hf"`/`"huggingface"` (default: `"openai"`)
- `HF_EMBEDDING_MODEL`: HuggingFace embedding model (default: `Qwen/Qwen3-Embedding-8B`)
- `OPENAI_LLM_MODEL`: OpenAI LLM model (default: `gpt-4o-mini`)
- `OPENAI_EMBEDDING_MODEL`: OpenAI embedding model (default: `text-embedding-ada-002`)
- `USE_HYBRID_SEARCH`: Enable hybrid search combining BM25 and semantic search (default: `"false"`; set to `"true"` to enable)
- `HYBRID_BM25_WEIGHT`: Weight of the BM25 component in hybrid search (default: `0.5`)
- `HYBRID_SEMANTIC_WEIGHT`: Weight of the semantic component in hybrid search (default: `0.5`)
- `CHAT_HISTORY_TURNS`: Number of conversation turns to keep in history (default: `10`)
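To illustrate how the two weight variables can combine ranked results, here is a minimal weighted Reciprocal Rank Fusion sketch. This is illustrative only: LangChain's `EnsembleRetriever` fuses retriever outputs with weighted RRF, but this code is not the project's actual implementation, and the document IDs are made up:

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked doc-id lists: score(d) = sum_i weights[i] / (k + rank_i(d)).

    `rankings` holds ranked document-id lists (best first), e.g. one from
    BM25 and one from the semantic retriever; `weights` would carry the
    HYBRID_BM25_WEIGHT and HYBRID_SEMANTIC_WEIGHT values.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]      # keyword matches, best first
semantic_ranking = ["doc_b", "doc_d", "doc_a"]  # embedding matches, best first
fused = weighted_rrf([bm25_ranking, semantic_ranking], [0.5, 0.5])
# doc_b ranks first: it scores well in both lists
```

A document that appears near the top of both rankings outranks one that is top-ranked in only a single list, which is why hybrid search can beat either retriever alone.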

## Notes

- The system extracts exact text from documents rather than generating free-form responses
- Supported document formats: PDF, TXT, DOCX, DOC
- The vectorstore is saved locally and persists between sessions
- Documents are automatically processed on startup (no manual indexing needed)
- **Default Model**: The system uses Qwen/Qwen3-32B via the HuggingFace router by default for better Arabic language understanding
- **Hybrid Search**: When enabled (`USE_HYBRID_SEARCH=true`), BM25 keyword search is combined with semantic search for improved retrieval accuracy
- On Hugging Face Spaces, the frontend automatically uses `/api` as the API URL
- This project uses `uv` for Python package management; dependencies are defined in `pyproject.toml` and `uv.lock`
- The `.env` file should be in the project root (not in the `backend/` folder)
- PDFs can be stored using Hugging Face Xet storage or uploaded via the Space UI

## Troubleshooting

### Common Issues

- **HF_TOKEN error**: Make sure `HF_TOKEN` is set in your `.env` file (local) or Space secrets (deployment) when using the Qwen model
- **OpenAI API key error**: Make sure `OPENAI_API_KEY` is set in your `.env` file (local) or Space secrets (deployment) for document processing
- **No documents found**: Ensure documents are in the `documents/` folder with supported extensions (PDF, TXT, DOCX, DOC)
- **Frontend can't connect**: Check that the backend is running on port 8000
- **Build fails on Spaces**: Ensure `frontend/build/` exists (run `npm run build`), check the Dockerfile, and verify the dependencies in `pyproject.toml`
- **RAG system not initialized**: Check the Space logs and ensure `processed_documents.json` exists and is not excluded by `.dockerignore`
- **Hybrid search not working**: Ensure `rank-bm25` is installed (`uv sync` should handle this) and `USE_HYBRID_SEARCH=true` is set

## License

MIT
backend/rag_system.py CHANGED

@@ -823,7 +823,7 @@ Respond with ONLY one of: "law-new", "law-followup", or provide an answer if it'
         # Chunk mode: build or load per-document chunk vectorstore and retrieve top-k chunks
         chunk_vs, chunk_docs = self._get_or_build_chunk_vectorstore(matched_filename, full_text)
         total_chunks = len(chunk_docs)
-        print(f"Document '{matched_filename}' has {total_chunks} chunks")
+        # print(f"Document '{matched_filename}' has {total_chunks} chunks")
         # Use the current question as the chunk search query
         top_k = 4
         try: