AldawsariNLP committed
Commit 7c71fe1 · 1 Parent(s): 43ad92f

pushing last changes - dockerignore ...4

Files changed (3):
  1. .gitignore +1 -2
  2. README.md +162 -0
  3. backend/rag_system.py +1 -1

.gitignore CHANGED
```diff
@@ -54,5 +54,4 @@ Thumbs.db
 # Documents (tracked via Hugging Face Xet)
 
 QUICKSTART.md
-GITHUB_SETUP.md
-README.md
+GITHUB_SETUP.md
```

README.md ADDED
@@ -0,0 +1,162 @@
---
title: Saudi Law AI Assistant
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# Law Document RAG Chat Application

A web application that lets users ask questions about indexed legal documents using Retrieval-Augmented Generation (RAG).

## Features

- 🤖 **RAG-powered Q&A**: Ask questions about your legal documents and get answers extracted directly from the context
- 📚 **Document Indexing**: Automatically index PDF, TXT, DOCX, and DOC files from a folder
- 🎨 **Modern React Frontend**: Clean, responsive chat interface
- ⚡ **FastAPI Backend**: High-performance API built with LangChain and FAISS
- 🔍 **Exact Context Extraction**: Answers are extracted verbatim from documents, not generated
- 🔀 **Hybrid Search**: Combines BM25 (keyword-based) and semantic search for improved retrieval accuracy
- 🤗 **Qwen Model Support**: Uses the Qwen/Qwen3-32B model via the HuggingFace router for high-quality Arabic language understanding
- 🚀 **Hugging Face Spaces Ready**: Configured for easy deployment

## Tech Stack

- **Frontend**: React 18
- **Backend**: FastAPI
- **RAG**: LangChain + FAISS with hybrid search (BM25 + semantic)
- **Vector Database**: FAISS
- **Embeddings**: Qwen/Qwen3-Embedding-8B (HuggingFace) or OpenAI embeddings (configurable)
- **LLM**: Qwen/Qwen3-32B via the HuggingFace router (default) or the OpenAI API (configurable)
- **Hybrid Search**: BM25 + semantic search combined with an EnsembleRetriever (see the sketch below)
- **Python**: 3.10 or 3.11 (required for faiss-cpu compatibility)

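As context for the retrieval stack above, this is roughly how a hybrid retriever is assembled with LangChain's `EnsembleRetriever`; a minimal sketch under assumed imports, weights, and `k` values, not this repo's actual code:

```python
# Minimal sketch of hybrid (BM25 + semantic) retrieval with LangChain.
# Weights and k values are illustrative, not this project's configuration.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

def build_hybrid_retriever(docs, embeddings, bm25_weight=0.5, semantic_weight=0.5):
    # Keyword-based retriever over the raw document chunks
    bm25 = BM25Retriever.from_documents(docs)
    bm25.k = 4

    # Semantic retriever backed by a FAISS vectorstore
    semantic = FAISS.from_documents(docs, embeddings).as_retriever(
        search_kwargs={"k": 4}
    )

    # Fuse both ranked result lists using the configured weights
    return EnsembleRetriever(
        retrievers=[bm25, semantic],
        weights=[bm25_weight, semantic_weight],
    )
```
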
## Project Structure

```
KSAlaw-document-agent/
├── backend/
│   ├── main.py                   # FastAPI application
│   ├── rag_system.py             # RAG implementation
│   ├── document_processor.py     # Document processing logic
│   ├── embeddings.py             # OpenAI embeddings wrapper
│   └── chat_history.py           # Chat history management
├── frontend/
│   ├── src/
│   │   ├── App.js                # Main React component
│   │   ├── App.css               # Styles
│   │   ├── index.js              # React entry point
│   │   └── index.css             # Global styles
│   ├── build/                    # Built React app (for deployment)
│   ├── public/
│   │   └── index.html            # HTML template
│   └── package.json              # Node dependencies
├── documents/                    # Place your PDF documents here
├── vectorstore/                  # FAISS vectorstore (auto-generated)
├── app.py                        # Hugging Face Spaces entry point
├── Dockerfile                    # Docker configuration
├── pyproject.toml                # Python dependencies (uv)
├── uv.lock                       # Locked dependencies
├── processed_documents.json      # Processed document summaries
└── README.md                     # This file
```

## Quick Start

**Local Development:**
1. Install dependencies: `uv sync` and `cd frontend && npm install`
2. Create a `.env` file in the project root with the required environment variables (see the example below):
   - `HF_TOKEN`: Your HuggingFace API token (required for the Qwen model)
   - `OPENAI_API_KEY`: Your OpenAI API key (required for document processing)
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search (BM25 + semantic)
3. Add documents to the `documents/` folder
4. Run the backend: `uv run python backend/main.py`
5. Run the frontend: `cd frontend && npm start`

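For reference, a minimal `.env` for local development might look like this (the token values are placeholders, not real credentials):

```
HF_TOKEN=hf_your_token_here
OPENAI_API_KEY=sk-your_key_here
# Optional: enable hybrid search (BM25 + semantic)
USE_HYBRID_SEARCH=true
```
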
**Deployment to Hugging Face Spaces:**
1. Build the frontend: `cd frontend && npm run build`
2. Set up Xet storage (recommended) or prepare to upload PDFs via the UI
3. Push to Hugging Face: `git push hf main`
4. Set the required environment variables in the Space secrets:
   - `HF_TOKEN`: Your HuggingFace API token
   - `OPENAI_API_KEY`: Your OpenAI API key
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search

## API Endpoints

- `GET /api/` - Health check
- `GET /api/health` - Health status
- `POST /api/index` - Index documents from a folder
  ```json
  {
    "folder_path": "documents"
  }
  ```
- `POST /api/ask` - Ask a question
  ```json
  {
    "question": "What is the law about X?",
    "use_history": true,
    "context_mode": "chunks",
    "model_provider": "qwen"
  }
  ```
  - `question` (required): The question to ask
  - `use_history` (optional): Whether to use chat history (default: `true`)
  - `context_mode` (optional): `"full"` (entire document) or `"chunks"` (top semantic chunks, the default)
  - `model_provider` (optional): `"qwen"` (default) or `"openai"`

**Note**: The default `model_provider` is `"qwen"`, which uses Qwen/Qwen3-32B via the HuggingFace router. When `context_mode="chunks"` is used with hybrid search enabled, the system combines BM25 and semantic search for improved retrieval accuracy.

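As a usage sketch, here is one way to call the question endpoint with Python's `requests` library, assuming the backend is running locally on port 8000 (the question text is only an example):

```python
import requests

# Ask a question against a locally running backend (default port 8000 assumed).
resp = requests.post(
    "http://localhost:8000/api/ask",
    json={
        "question": "What is the law about X?",
        "use_history": True,
        "context_mode": "chunks",
        "model_provider": "qwen",
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```
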
## Environment Variables

### Required Variables

- `HF_TOKEN`: Your HuggingFace API token (required for the Qwen model and HuggingFace embeddings)
- `OPENAI_API_KEY`: Your OpenAI API key (required for document processing; optional for embeddings/LLM)

### Optional Configuration

- `QWEN_MODEL`: Qwen model to use (default: `Qwen/Qwen3-32B:nscale`)
- `EMBEDDINGS_PROVIDER`: Embeddings provider, `"openai"` or `"hf"`/`"huggingface"` (default: `"openai"`)
- `HF_EMBEDDING_MODEL`: HuggingFace embedding model (default: `Qwen/Qwen3-Embedding-8B`)
- `OPENAI_LLM_MODEL`: OpenAI LLM model (default: `gpt-4o-mini`)
- `OPENAI_EMBEDDING_MODEL`: OpenAI embedding model (default: `text-embedding-ada-002`)
- `USE_HYBRID_SEARCH`: Enable hybrid search combining BM25 and semantic search (default: `"false"`; set to `"true"` to enable)
- `HYBRID_BM25_WEIGHT`: Weight of the BM25 component in hybrid search (default: `0.5`)
- `HYBRID_SEMANTIC_WEIGHT`: Weight of the semantic component in hybrid search (default: `0.5`)
- `CHAT_HISTORY_TURNS`: Number of conversation turns kept in history (default: `10`)

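For illustration, settings like these are conventionally read with `os.getenv` using the defaults documented above; this is a sketch, not the repository's actual loading code:

```python
import os

# Sketch: read the hybrid-search settings with their documented defaults.
use_hybrid = os.getenv("USE_HYBRID_SEARCH", "false").lower() == "true"
bm25_weight = float(os.getenv("HYBRID_BM25_WEIGHT", "0.5"))
semantic_weight = float(os.getenv("HYBRID_SEMANTIC_WEIGHT", "0.5"))
history_turns = int(os.getenv("CHAT_HISTORY_TURNS", "10"))
```
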
## Notes

- The system extracts exact text from documents rather than generating free-form responses
- Supported document formats: PDF, TXT, DOCX, DOC
- The vectorstore is saved locally and persists between sessions
- Documents are automatically processed on startup (no manual indexing needed)
- **Default Model**: Qwen/Qwen3-32B via the HuggingFace router, chosen for better Arabic language understanding
- **Hybrid Search**: When enabled (`USE_HYBRID_SEARCH=true`), BM25 keyword search is combined with semantic search for improved retrieval accuracy
- On Hugging Face Spaces, the frontend automatically uses `/api` as the API URL
- This project uses `uv` for Python package management; dependencies are defined in `pyproject.toml` and `uv.lock`
- The `.env` file should live in the project root (not in the `backend/` folder)
- PDFs can be stored via Hugging Face Xet storage or uploaded through the Space UI

## Troubleshooting

### Common Issues

- **HF_TOKEN error**: Make sure `HF_TOKEN` is set in your `.env` file (local) or Space secrets (deployment) when using the Qwen model
- **OpenAI API key error**: Make sure `OPENAI_API_KEY` is set in your `.env` file (local) or Space secrets (deployment) for document processing
- **No documents found**: Ensure documents are in the `documents/` folder with supported extensions (PDF, TXT, DOCX, DOC)
- **Frontend can't connect**: Check that the backend is running on port 8000 (see the health probe below)
- **Build fails on Spaces**: Ensure `frontend/build/` exists (run `npm run build`), check the Dockerfile, and verify dependencies in `pyproject.toml`
- **RAG system not initialized**: Check the Space logs and ensure `processed_documents.json` exists and is not excluded by `.dockerignore`
- **Hybrid search not working**: Ensure `rank-bm25` is installed (`uv sync` should handle this) and `USE_HYBRID_SEARCH=true` is set

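When debugging connectivity, a quick probe of the health endpoint confirms the backend is reachable (again assuming the default local port 8000):

```python
import requests

# Prints the health payload if the backend is up.
print(requests.get("http://localhost:8000/api/health", timeout=10).json())
```
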
## License

MIT

backend/rag_system.py CHANGED
```diff
@@ -823,7 +823,7 @@ Respond with ONLY one of: "law-new", "law-followup", or provide an answer if it'
 # Chunk mode: build or load per-document chunk vectorstore and retrieve top-k chunks
 chunk_vs, chunk_docs = self._get_or_build_chunk_vectorstore(matched_filename, full_text)
 total_chunks = len(chunk_docs)
-print(f"Document '{matched_filename}' has {total_chunks} chunks")
+# print(f"Document '{matched_filename}' has {total_chunks} chunks")
 # Use the current question as the chunk search query
 top_k = 4
 try:
```