AldawsariNLP committed
Commit 7c71fe1 · 1 Parent(s): 43ad92f

pushing last changes - dockerignore ...4

Files changed (3):
  1. .gitignore +1 -2
  2. README.md +162 -0
  3. backend/rag_system.py +1 -1

.gitignore CHANGED
```diff
@@ -54,5 +54,4 @@ Thumbs.db
 # Documents (tracked via Hugging Face Xet)
 
 QUICKSTART.md
-GITHUB_SETUP.md
-README.md
+GITHUB_SETUP.md
```

README.md ADDED
@@ -0,0 +1,162 @@
---
title: Saudi Law AI Assistant
emoji: ⚖️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
---

# Law Document RAG Chat Application

A web application that lets users ask questions about indexed legal documents using Retrieval-Augmented Generation (RAG).

## Features

- 🤖 **RAG-powered Q&A**: Ask questions about your legal documents and get answers extracted directly from the context
- 📚 **Document Indexing**: Automatically index PDF, TXT, DOCX, and DOC files from a folder
- 🎨 **Modern React Frontend**: Clean, responsive chat interface
- ⚡ **FastAPI Backend**: High-performance API built with LangChain and FAISS
- 🔍 **Exact Context Extraction**: Answers are extracted verbatim from documents, not generated
- 🔀 **Hybrid Search**: Combines BM25 (keyword-based) and semantic search for improved retrieval accuracy
- 🤗 **Qwen Model Support**: Uses the Qwen/Qwen3-32B model via the HuggingFace router for high-quality Arabic language understanding
- 🚀 **Hugging Face Spaces Ready**: Configured for easy deployment

## Tech Stack

- **Frontend**: React 18
- **Backend**: FastAPI
- **RAG**: LangChain + FAISS with hybrid search (BM25 + semantic)
- **Vector Database**: FAISS
- **Embeddings**: Qwen/Qwen3-Embedding-8B (HuggingFace) or OpenAI embeddings (configurable)
- **LLM**: Qwen/Qwen3-32B via the HuggingFace router (default) or the OpenAI API (configurable)
- **Hybrid Search**: BM25 + semantic search combined with an EnsembleRetriever (see the sketch below)
- **Python**: 3.10 or 3.11 (required for faiss-cpu compatibility)

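As context for the retrieval stack above, this is roughly how a hybrid retriever is assembled with LangChain's `EnsembleRetriever`; a minimal sketch under assumed imports, weights, and `k` values, not this repo's actual code:

```python
# Minimal sketch of hybrid (BM25 + semantic) retrieval with LangChain.
# Weights and k values are illustrative, not this project's configuration.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

def build_hybrid_retriever(docs, embeddings, bm25_weight=0.5, semantic_weight=0.5):
    # Keyword-based retriever over the raw document chunks
    bm25 = BM25Retriever.from_documents(docs)
    bm25.k = 4

    # Semantic retriever backed by a FAISS vectorstore
    semantic = FAISS.from_documents(docs, embeddings).as_retriever(
        search_kwargs={"k": 4}
    )

    # Fuse both ranked result lists using the configured weights
    return EnsembleRetriever(
        retrievers=[bm25, semantic],
        weights=[bm25_weight, semantic_weight],
    )
```
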
## Project Structure

```
KSAlaw-document-agent/
├── backend/
│   ├── main.py                   # FastAPI application
│   ├── rag_system.py             # RAG implementation
│   ├── document_processor.py     # Document processing logic
│   ├── embeddings.py             # OpenAI embeddings wrapper
│   └── chat_history.py           # Chat history management
├── frontend/
│   ├── src/
│   │   ├── App.js                # Main React component
│   │   ├── App.css               # Styles
│   │   ├── index.js              # React entry point
│   │   └── index.css             # Global styles
│   ├── build/                    # Built React app (for deployment)
│   ├── public/
│   │   └── index.html            # HTML template
│   └── package.json              # Node dependencies
├── documents/                    # Place your PDF documents here
├── vectorstore/                  # FAISS vectorstore (auto-generated)
├── app.py                        # Hugging Face Spaces entry point
├── Dockerfile                    # Docker configuration
├── pyproject.toml                # Python dependencies (uv)
├── uv.lock                       # Locked dependencies
├── processed_documents.json      # Processed document summaries
└── README.md                     # This file
```

## Quick Start

**Local Development:**
1. Install dependencies: `uv sync` and `cd frontend && npm install`
2. Create a `.env` file in the project root with the required environment variables (see the example below):
   - `HF_TOKEN`: Your HuggingFace API token (required for the Qwen model)
   - `OPENAI_API_KEY`: Your OpenAI API key (required for document processing)
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search (BM25 + semantic)
3. Add documents to the `documents/` folder
4. Run the backend: `uv run python backend/main.py`
5. Run the frontend: `cd frontend && npm start`

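For reference, a minimal `.env` for local development might look like this (the token values are placeholders, not real credentials):

```
HF_TOKEN=hf_your_token_here
OPENAI_API_KEY=sk-your_key_here
# Optional: enable hybrid search (BM25 + semantic)
USE_HYBRID_SEARCH=true
```
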
**Deployment to Hugging Face Spaces:**
1. Build the frontend: `cd frontend && npm run build`
2. Set up Xet storage (recommended) or prepare to upload PDFs via the UI
3. Push to Hugging Face: `git push hf main`
4. Set the required environment variables in the Space secrets:
   - `HF_TOKEN`: Your HuggingFace API token
   - `OPENAI_API_KEY`: Your OpenAI API key
   - Optionally set `USE_HYBRID_SEARCH=true` to enable hybrid search

## API Endpoints

- `GET /api/` - Health check
- `GET /api/health` - Health status
- `POST /api/index` - Index documents from a folder
  ```json
  {
    "folder_path": "documents"
  }
  ```
- `POST /api/ask` - Ask a question
  ```json
  {
    "question": "What is the law about X?",
    "use_history": true,
    "context_mode": "chunks",
    "model_provider": "qwen"
  }
  ```
  - `question` (required): The question to ask
  - `use_history` (optional): Whether to use chat history (default: `true`)
  - `context_mode` (optional): `"full"` (entire document) or `"chunks"` (top semantic chunks, the default)
  - `model_provider` (optional): `"qwen"` (default) or `"openai"`

**Note**: The default `model_provider` is `"qwen"`, which uses Qwen/Qwen3-32B via the HuggingFace router. When `context_mode="chunks"` is used with hybrid search enabled, the system combines BM25 and semantic search for improved retrieval accuracy.

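As a usage sketch, here is one way to call the question endpoint with Python's `requests` library, assuming the backend is running locally on port 8000 (the question text is only an example):

```python
import requests

# Ask a question against a locally running backend (default port 8000 assumed).
resp = requests.post(
    "http://localhost:8000/api/ask",
    json={
        "question": "What is the law about X?",
        "use_history": True,
        "context_mode": "chunks",
        "model_provider": "qwen",
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```
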
## Environment Variables

### Required Variables

- `HF_TOKEN`: Your HuggingFace API token (required for the Qwen model and HuggingFace embeddings)
- `OPENAI_API_KEY`: Your OpenAI API key (required for document processing; optional for embeddings/LLM)

### Optional Configuration

- `QWEN_MODEL`: Qwen model to use (default: `Qwen/Qwen3-32B:nscale`)
- `EMBEDDINGS_PROVIDER`: Embeddings provider, `"openai"` or `"hf"`/`"huggingface"` (default: `"openai"`)
- `HF_EMBEDDING_MODEL`: HuggingFace embedding model (default: `Qwen/Qwen3-Embedding-8B`)
- `OPENAI_LLM_MODEL`: OpenAI LLM model (default: `gpt-4o-mini`)
- `OPENAI_EMBEDDING_MODEL`: OpenAI embedding model (default: `text-embedding-ada-002`)
- `USE_HYBRID_SEARCH`: Enable hybrid search combining BM25 and semantic search (default: `"false"`; set to `"true"` to enable)
- `HYBRID_BM25_WEIGHT`: Weight of the BM25 component in hybrid search (default: `0.5`)
- `HYBRID_SEMANTIC_WEIGHT`: Weight of the semantic component in hybrid search (default: `0.5`)
- `CHAT_HISTORY_TURNS`: Number of conversation turns kept in history (default: `10`)

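For illustration, settings like these are conventionally read with `os.getenv` using the defaults documented above; this is a sketch, not the repository's actual loading code:

```python
import os

# Sketch: read the hybrid-search settings with their documented defaults.
use_hybrid = os.getenv("USE_HYBRID_SEARCH", "false").lower() == "true"
bm25_weight = float(os.getenv("HYBRID_BM25_WEIGHT", "0.5"))
semantic_weight = float(os.getenv("HYBRID_SEMANTIC_WEIGHT", "0.5"))
history_turns = int(os.getenv("CHAT_HISTORY_TURNS", "10"))
```
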
## Notes

- The system extracts exact text from documents rather than generating free-form responses
- Supported document formats: PDF, TXT, DOCX, DOC
- The vectorstore is saved locally and persists between sessions
- Documents are automatically processed on startup (no manual indexing needed)
- **Default Model**: Qwen/Qwen3-32B via the HuggingFace router, chosen for better Arabic language understanding
- **Hybrid Search**: When enabled (`USE_HYBRID_SEARCH=true`), BM25 keyword search is combined with semantic search for improved retrieval accuracy
- On Hugging Face Spaces, the frontend automatically uses `/api` as the API URL
- This project uses `uv` for Python package management; dependencies are defined in `pyproject.toml` and `uv.lock`
- The `.env` file should live in the project root (not in the `backend/` folder)
- PDFs can be stored via Hugging Face Xet storage or uploaded through the Space UI

## Troubleshooting

### Common Issues

- **HF_TOKEN error**: Make sure `HF_TOKEN` is set in your `.env` file (local) or Space secrets (deployment) when using the Qwen model
- **OpenAI API key error**: Make sure `OPENAI_API_KEY` is set in your `.env` file (local) or Space secrets (deployment) for document processing
- **No documents found**: Ensure documents are in the `documents/` folder with supported extensions (PDF, TXT, DOCX, DOC)
- **Frontend can't connect**: Check that the backend is running on port 8000 (see the health probe below)
- **Build fails on Spaces**: Ensure `frontend/build/` exists (run `npm run build`), check the Dockerfile, and verify dependencies in `pyproject.toml`
- **RAG system not initialized**: Check the Space logs and ensure `processed_documents.json` exists and is not excluded by `.dockerignore`
- **Hybrid search not working**: Ensure `rank-bm25` is installed (`uv sync` should handle this) and `USE_HYBRID_SEARCH=true` is set

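When debugging connectivity, a quick probe of the health endpoint confirms the backend is reachable (again assuming the default local port 8000):

```python
import requests

# Prints the health payload if the backend is up.
print(requests.get("http://localhost:8000/api/health", timeout=10).json())
```
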
## License

MIT

backend/rag_system.py CHANGED
```diff
@@ -823,7 +823,7 @@ Respond with ONLY one of: "law-new", "law-followup", or provide an answer if it'
 # Chunk mode: build or load per-document chunk vectorstore and retrieve top-k chunks
 chunk_vs, chunk_docs = self._get_or_build_chunk_vectorstore(matched_filename, full_text)
 total_chunks = len(chunk_docs)
-print(f"Document '{matched_filename}' has {total_chunks} chunks")
+# print(f"Document '{matched_filename}' has {total_chunks} chunks")
 # Use the current question as the chunk search query
 top_k = 4
 try:
```