Initial Commit of Local files
21
.env
Normal file
@@ -0,0 +1,21 @@
# Paperless
PAPERLESS_URL=http://your-paperless-ip:8000
PAPERLESS_TOKEN=your_api_token_here

# Postgres
POSTGRES_USER=raguser
POSTGRES_PASSWORD=ragpass
POSTGRES_DB=ragdb
POSTGRES_HOST=postgres
POSTGRES_PORT=5432

# Chroma
CHROMA_HOST=chromadb
CHROMA_PORT=8000

# Ollama
OLLAMA_BASE_URL=http://ollama:11434

# Telegram
TELEGRAM_BOT_TOKEN=your_telegram_bot_token
ALLOWED_TELEGRAM_USERS=123456789,987654321
18
Dockerfile
Normal file
@@ -0,0 +1,18 @@
FROM python:3.11-slim

# System dependencies for unstructured (PDFs, tables, OCR)
RUN apt-get update && apt-get install -y \
    poppler-utils \
    tesseract-ocr \
    tesseract-ocr-deu \
    libmagic-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/

ENV PYTHONPATH=/app
125
README.md
Normal file
@@ -0,0 +1,125 @@
# DocuMind-G4

This project implements a local Retrieval-Augmented Generation (RAG) system that synchronizes documents from **Paperless-ngx**, processes them, and makes them searchable via a **Streamlit** web UI and a **Telegram bot**.

## Features

* **Delta sync with Paperless-ngx**: Synchronizes only new or changed documents via the API and automatically removes deleted documents from the vector store.
* **Hi-res PDF parsing**: Uses `unstructured` with OCR (Tesseract) to correctly capture tables and complex layouts in PDFs.
* **Parent-child retrieval**: Splits documents intelligently (small chunks for the vector search in ChromaDB, large chunks/whole documents in a separate docstore for the LLM context) to minimize loss of context; see the condensed sketch after this list.
* **Local LLM**: Uses `granite4:tiny-h` via **Ollama** for maximum data privacy.
* **Multi-interface**: Provides a web UI (Streamlit) with metadata filtering (e.g. by document ID) and an access-restricted Telegram bot.
* **Fully Dockerized**: All components (PostgreSQL, ChromaDB, Ollama, scheduler, Streamlit, Telegram) run isolated in containers.
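
A condensed sketch of what `src/rag_core.py` wires up for this (chunk sizes, collection name, and embedding model are taken from that file; the paths are the local fallbacks used there):

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import LocalFileStore, create_kv_docstore
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small child chunks are embedded into ChromaDB for precise vector search;
# the larger parent chunks live in a key-value docstore and become the LLM context.
vectorstore = Chroma(
    collection_name="rag_collection",
    embedding_function=HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-small"),
    persist_directory="./chroma_data",
)
docstore = create_kv_docstore(LocalFileStore("./docstore_data"))

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200),
)

# retriever.add_documents(docs) embeds the child chunks and stores the parents;
# a query then returns the matching parent documents as context for the LLM.
```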

---

## Prerequisites

* **Docker** and **Docker Compose** must be installed.
* A running **Paperless-ngx** instance.
* A **Paperless API token** (can be created in the Paperless admin area; see the quick check after this list).
* A **Telegram bot token** (created via the BotFather in Telegram) and your Telegram user ID.
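
To check the URL and the token before starting the stack, you can query the Paperless API directly (replace the placeholders with your values):

```bash
curl -H "Authorization: Token <YOUR_PAPERLESS_TOKEN>" "http://<YOUR-PAPERLESS-IP>:8000/api/documents/"
```

If this returns a JSON document listing instead of an authentication error, URL and token are correct.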

---

## Setup & Installation

**1. Prepare the repository**
Make sure that all files (`docker-compose.yml`, `Dockerfile`, `requirements.txt`, `init_ollama.sh`, and the `src/` folder) are in place in the same directory, as shown below.
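
The expected layout is exactly the set of files contained in this commit:

```
.
├── .env
├── docker-compose.yml
├── Dockerfile
├── init_ollama.sh
├── README.md
├── requirements.txt
└── src/
    ├── app.py
    ├── database.py
    ├── ingest_job.py
    ├── rag_core.py
    ├── run_scheduler.py
    └── telegram_bot.py
```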

**2. Make the script executable (Linux/macOS)**
The Ollama start script needs execute permissions:
```bash
chmod +x init_ollama.sh
```

**3. Configure the environment variables**
Create a `.env` file in the project root and fill it with your data:

```
# Paperless
PAPERLESS_URL=http://<YOUR-PAPERLESS-IP>:8000
PAPERLESS_TOKEN=<YOUR_PAPERLESS_TOKEN>

# Postgres (the defaults can be kept)
POSTGRES_USER=raguser
POSTGRES_PASSWORD=ragpass
POSTGRES_DB=ragdb
POSTGRES_HOST=postgres
POSTGRES_PORT=5432

# Chroma & Ollama (internal Docker routing)
CHROMA_HOST=chromadb
CHROMA_PORT=8000
OLLAMA_BASE_URL=http://ollama:11434

# Telegram
TELEGRAM_BOT_TOKEN=<YOUR_TELEGRAM_TOKEN>
ALLOWED_TELEGRAM_USERS=<YOUR_USER_ID>
```

## Start & Run

Start the entire system in the background:
```bash
docker-compose up -d --build
```

**What happens on first start?**

* The databases (Postgres, Chroma) are initialized.

* The ollama container starts, waits briefly, and automatically pulls the granite4:tiny-h model.

* The scheduler waits until 01:00 at night for the initial ingest (see Troubleshooting for a manual start).

* streamlit and the telegram-bot come online; you can verify this as shown below.
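
To check that everything is actually running, list the services (the names correspond to the services defined in `docker-compose.yml`):

```bash
docker-compose ps
# Expected services: postgres, chromadb, ollama, streamlit, telegram-bot, scheduler
```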
**Zugriff:**
|
||||||
|
|
||||||
|
* Streamlit Web-UI: Öffne http://localhost:8501 in deinem Browser.
|
||||||
|
|
||||||
|
* Telegram Bot: Suche deinen Bot in Telegram und sende /start.
|
||||||
|
|
||||||
|
---------------------------------------------------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
## Nützliche Docker-Befehle (Troubleshooting & Logs)
|
||||||
|
|
||||||
|
Da das System aus vielen Microservices besteht, ist es wichtig zu wissen, wie man die Logs der einzelnen Container ausliest.
|
||||||
|
|
||||||
|
**Logs des nächtlichen Schedulers ansehen:**
|
||||||
|
Hier siehst du, ob neue Dokumente aus Paperless geladen wurden oder ob Fehler beim PDF-Parsing (OCR) auftraten.
|
||||||
|
```bash
|
||||||
|
docker-compose logs -f scheduler
|
||||||
|
```
|
||||||
|
|
||||||
|
**Logs des Telegram-Bots ansehen:**
|
||||||
|
Hilfreich, wenn der Bot nicht antwortet oder User-IDs abgewiesen werden.
|
||||||
|
```bash
|
||||||
|
docker-compose logs -f telegram-bot
|
||||||
|
```
|
||||||
|
|
||||||
|
**Ollama Status prüfen:**
|
||||||
|
Sieh nach, ob das Modell erfolgreich heruntergeladen wurde.
|
||||||
|
```bash
|
||||||
|
docker-compose logs -f ollama
|
||||||
|
```
|
||||||
|
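
You can also ask Ollama directly which models are present inside the container (the container name is only an example; look it up with `docker ps`):

```bash
docker exec -it <ollama_container_name> ollama list
```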

**Trigger a manual ingest (sync) immediately:**
If you do not want to wait until 01:00 at night, you can run the sync job manually inside the running scheduler container:
```bash
docker exec -it <scheduler_container_name> python src/ingest_job.py
```
(You can find the exact container name with docker ps, for example as shown below.)
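
For example, assuming the default Compose container naming (which contains the service name):

```bash
docker ps --filter "name=scheduler" --format "{{.Names}}"
```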

**Stop the system completely and keep the data:**
```bash
docker-compose down
```

**Stop the system and delete ALL data (vectors, DB, models):**
Warning: this deletes all volumes irreversibly.
```bash
docker-compose down -v
```
68
docker-compose.yml
Normal file
@@ -0,0 +1,68 @@
version: '3.8'

services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"

  chromadb:
    image: chromadb/chroma:latest
    volumes:
      - chroma_data:/chroma/chroma
    ports:
      - "8000:8000"

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama_data:/root/.ollama
      # Mount the local start script into the container
      - ./init_ollama.sh:/init_ollama.sh
    # Use the script as the entrypoint and make sure it runs with bash
    entrypoint: ["/usr/bin/env", "bash", "/init_ollama.sh"]
    ports:
      - "11434:11434"

  # Container for the Streamlit UI
  streamlit:
    build: .
    command: streamlit run src/app.py --server.port=8501 --server.address=0.0.0.0
    ports:
      - "8501:8501"
    env_file: .env
    depends_on:
      - postgres
      - chromadb
      - ollama

  # Container for the Telegram bot
  telegram-bot:
    build: .
    command: python src/telegram_bot.py
    env_file: .env
    depends_on:
      - postgres
      - chromadb
      - ollama

  # Container for the ingest scheduler
  scheduler:
    build: .
    command: python src/run_scheduler.py
    env_file: .env
    depends_on:
      - postgres
      - chromadb
      - ollama

volumes:
  postgres_data:
  chroma_data:
  ollama_data:
21
init_ollama.sh
Executable file
@@ -0,0 +1,21 @@
#!/bin/bash

# 1. Start the Ollama server in the background
/bin/ollama serve &
OLLAMA_PID=$!

# 2. Wait until the API is reachable
echo "Waiting for the Ollama server to come up..."
while ! curl -s http://localhost:11434/api/tags > /dev/null; do
    sleep 2
done

echo "Ollama is reachable! Checking/pulling the model 'granite4:tiny-h'..."

# 3. Pull the model (if it is not already present)
ollama pull granite4:tiny-h

echo "Model is ready!"

# 4. Keep the container alive by waiting on the Ollama process in the foreground
wait $OLLAMA_PID
12
requirements.txt
Normal file
@@ -0,0 +1,12 @@
langchain
langchain-community
langchain-huggingface
langchain-postgres
langchain-chroma
psycopg2-binary
unstructured[pdf]
pdf2image
python-telegram-bot
streamlit
schedule
requests
67
src/app.py
Normal file
@@ -0,0 +1,67 @@
import streamlit as st
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from src.rag_core import get_retriever, get_llm

st.set_page_config(page_title="RAG Chat", layout="wide")
st.title("Paperless-NGX RAG Assistant")

retriever, vectorstore, _ = get_retriever()
llm = get_llm()

# Prompt template
system_prompt = (
    "You are a helpful assistant. Use the following context to answer the question. "
    "If you do not know the answer, simply say that you do not know.\n\n"
    "{context}"
)
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

question_answer_chain = create_stuff_documents_chain(llm, prompt)

# Optional filter UI
st.sidebar.header("Filter")
filter_id = st.sidebar.text_input("Search only within document ID (optional):")

if "messages" not in st.session_state:
    st.session_state.messages = []

for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt_input := st.chat_input("Ask a question about your documents..."):
    st.session_state.messages.append({"role": "user", "content": prompt_input})
    with st.chat_message("user"):
        st.markdown(prompt_input)

    with st.chat_message("assistant"):
        # Dynamic retriever with metadata filter
        search_kwargs = {"k": 3}
        if filter_id:
            search_kwargs["filter"] = {"paperless_id": int(filter_id)}

        # Temporarily override search_kwargs
        retriever.search_kwargs = search_kwargs

        rag_chain = create_retrieval_chain(retriever, question_answer_chain)

        with st.spinner("Thinking..."):
            response = rag_chain.invoke({"input": prompt_input})
            answer = response["answer"]
            sources = response.get("context", [])

        st.markdown(answer)

        if sources:
            st.write("---")
            st.write("**Sources:**")
            unique_sources = {doc.metadata.get("source") for doc in sources if doc.metadata.get("source")}
            for s in unique_sources:
                st.write(f"- {s}")

    st.session_state.messages.append({"role": "assistant", "content": answer})
54
src/database.py
Normal file
@@ -0,0 +1,54 @@
import os
import psycopg2
from psycopg2.extras import RealDictCursor

def get_db_connection():
    return psycopg2.connect(
        host=os.getenv("POSTGRES_HOST", "localhost"),
        port=os.getenv("POSTGRES_PORT", "5432"),
        database=os.getenv("POSTGRES_DB", "ragdb"),
        user=os.getenv("POSTGRES_USER", "raguser"),
        password=os.getenv("POSTGRES_PASSWORD", "ragpass")
    )

def init_db():
    conn = get_db_connection()
    cur = conn.cursor()
    # Table for the delta sync.
    # modified_at is stored as the raw ISO string from Paperless so it can be
    # compared verbatim against the API value on the next sync run.
    cur.execute('''
        CREATE TABLE IF NOT EXISTS sync_state (
            paperless_id INTEGER PRIMARY KEY,
            modified_at TEXT,
            checksum TEXT
        )
    ''')
    conn.commit()
    cur.close()
    conn.close()

def get_sync_state():
    conn = get_db_connection()
    cur = conn.cursor(cursor_factory=RealDictCursor)
    cur.execute("SELECT paperless_id, modified_at FROM sync_state")
    rows = cur.fetchall()
    conn.close()
    return {row['paperless_id']: row['modified_at'] for row in rows}

def update_sync_state(paperless_id, modified_at):
    conn = get_db_connection()
    cur = conn.cursor()
    cur.execute('''
        INSERT INTO sync_state (paperless_id, modified_at)
        VALUES (%s, %s)
        ON CONFLICT (paperless_id) DO UPDATE
        SET modified_at = EXCLUDED.modified_at
    ''', (paperless_id, modified_at))
    conn.commit()
    conn.close()

def delete_sync_state(paperless_id):
    conn = get_db_connection()
    cur = conn.cursor()
    cur.execute("DELETE FROM sync_state WHERE paperless_id = %s", (paperless_id,))
    conn.commit()
    conn.close()
93
src/ingest_job.py
Normal file
@@ -0,0 +1,93 @@
import os
import tempfile
import requests
from langchain_community.document_loaders import UnstructuredPDFLoader
from src.database import init_db, get_sync_state, update_sync_state, delete_sync_state
from src.rag_core import get_retriever

PAPERLESS_URL = os.getenv("PAPERLESS_URL")
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")
HEADERS = {"Authorization": f"Token {PAPERLESS_TOKEN}"}

def fetch_paperless_documents():
    url = f"{PAPERLESS_URL}/api/documents/"
    documents = []
    while url:
        resp = requests.get(url, headers=HEADERS).json()
        documents.extend(resp['results'])
        url = resp.get('next')
    return documents

def download_pdf(doc_id):
    url = f"{PAPERLESS_URL}/api/documents/{doc_id}/download/"
    resp = requests.get(url, headers=HEADERS)
    if resp.status_code == 200:
        fd, path = tempfile.mkstemp(suffix=".pdf")
        with os.fdopen(fd, 'wb') as f:
            f.write(resp.content)
        return path
    return None

def process_and_ingest(doc_id, pdf_path, metadata):
    retriever, _, _ = get_retriever()
    # Hi-res mode so that tables are extracted correctly
    loader = UnstructuredPDFLoader(pdf_path, strategy="hi_res")
    docs = loader.load()

    for d in docs:
        d.metadata.update(metadata)

    # The ParentDocumentRetriever automatically splits into parent/child chunks and stores them
    retriever.add_documents(docs, ids=[str(doc_id)])

def remove_deleted_documents(deleted_ids):
    retriever, vectorstore, store = get_retriever()
    for doc_id in deleted_ids:
        # Delete the child chunks from Chroma via their metadata:
        # everything carrying this paperless_id is removed
        vectorstore.delete(where={"paperless_id": doc_id})
        # Delete the parent document from the docstore
        store.mdelete([str(doc_id)])
        # Remove the sync state from the DB
        delete_sync_state(doc_id)
        print(f"Deleted: {doc_id}")

def run_sync():
    print("Starting Paperless delta sync...")
    init_db()
    current_state = get_sync_state()
    paperless_docs = fetch_paperless_documents()

    active_paperless_ids = set()

    for p_doc in paperless_docs:
        doc_id = p_doc['id']
        modified_at_str = p_doc['modified']  # ISO format
        active_paperless_ids.add(doc_id)

        # Delta check: compare against the ISO string stored in sync_state
        if doc_id not in current_state or str(current_state[doc_id]) != modified_at_str:
            print(f"Processing new/changed document: {doc_id}")
            pdf_path = download_pdf(doc_id)
            if pdf_path:
                metadata = {
                    "paperless_id": doc_id,
                    "title": p_doc.get("title", ""),
                    "source": f"{PAPERLESS_URL}/documents/{doc_id}/details"
                }
                process_and_ingest(doc_id, pdf_path, metadata)
                os.remove(pdf_path)
                update_sync_state(doc_id, modified_at_str)

    # Deletion logic: whatever is in the DB but no longer in Paperless
    db_ids = set(current_state.keys())
    deleted_ids = db_ids - active_paperless_ids
    if deleted_ids:
        print(f"Removing deleted documents: {deleted_ids}")
        remove_deleted_documents(deleted_ids)

    print("Sync finished.")

if __name__ == "__main__":
    run_sync()
45
src/rag_core.py
Normal file
@@ -0,0 +1,45 @@
import os
import chromadb
from langchain_community.llms import Ollama
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import LocalFileStore, create_kv_docstore
from langchain_text_splitters import RecursiveCharacterTextSplitter

def get_llm():
    return Ollama(
        base_url=os.getenv("OLLAMA_BASE_URL", "http://localhost:11434"),
        model="granite4:tiny-h"
    )

def get_embeddings():
    return HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-small")  # Good for German/English

def get_retriever():
    # If Chroma runs as a server (as in docker-compose), connect via HTTP;
    # otherwise fall back to a local on-disk collection.
    if os.getenv("CHROMA_HOST"):
        client = chromadb.HttpClient(
            host=os.getenv("CHROMA_HOST"),
            port=int(os.getenv("CHROMA_PORT", "8000")),
        )
        chroma_client = Chroma(
            client=client,
            collection_name="rag_collection",
            embedding_function=get_embeddings(),
        )
    else:
        chroma_client = Chroma(
            collection_name="rag_collection",
            embedding_function=get_embeddings(),
            persist_directory="./chroma_data",
        )

    # Store for the original (large) parent documents.
    # For simplicity this uses a plain file store inside the container
    # (for a store shared across containers, PostgresByteStore from langchain-postgres could be used).
    fs = LocalFileStore("./docstore_data")
    store = create_kv_docstore(fs)

    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=50)

    retriever = ParentDocumentRetriever(
        vectorstore=chroma_client,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    return retriever, chroma_client, store
20
src/run_scheduler.py
Normal file
@@ -0,0 +1,20 @@
import time
import schedule
from src.ingest_job import run_sync

def job():
    try:
        run_sync()
    except Exception as e:
        print(f"Sync failed: {e}")

# Run once per night at 01:00
schedule.every().day.at("01:00").do(job)

if __name__ == "__main__":
    print("Scheduler started. Waiting for the next run...")
    # Optional: run once at startup
    # job()
    while True:
        schedule.run_pending()
        time.sleep(60)
55
src/telegram_bot.py
Normal file
@@ -0,0 +1,55 @@
import os
from telegram import Update
from telegram.ext import Application, CommandHandler, MessageHandler, filters, ContextTypes
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from src.rag_core import get_retriever, get_llm

TOKEN = os.getenv("TELEGRAM_BOT_TOKEN")
ALLOWED_USERS = [int(u) for u in os.getenv("ALLOWED_TELEGRAM_USERS", "").split(",") if u]

retriever, _, _ = get_retriever()
llm = get_llm()
prompt = ChatPromptTemplate.from_messages([
    ("system", "Use the following context for your answer:\n\n{context}"),
    ("human", "{input}"),
])
chain = create_retrieval_chain(retriever, create_stuff_documents_chain(llm, prompt))

def check_auth(update: Update) -> bool:
    return update.effective_user.id in ALLOWED_USERS

async def start(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if not check_auth(update):
        await update.message.reply_text("Access denied.")
        return
    await update.message.reply_text("Hello! I am your Paperless RAG bot. Ask me something.")

async def handle_message(update: Update, context: ContextTypes.DEFAULT_TYPE):
    if not check_auth(update):
        return

    question = update.message.text
    # "Thinking" indicator
    await context.bot.send_chat_action(chat_id=update.effective_chat.id, action='typing')

    try:
        response = chain.invoke({"input": question})
        answer = response["answer"]
        sources = list({doc.metadata.get("source", "Unknown") for doc in response.get("context", [])})

        reply = f"{answer}\n\n**Sources:**\n" + "\n".join(f"- {s}" for s in sources)
        await update.message.reply_text(reply, parse_mode='Markdown')
    except Exception as e:
        await update.message.reply_text(f"There was an error: {str(e)}")

def main():
    app = Application.builder().token(TOKEN).build()
    app.add_handler(CommandHandler("start", start))
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_message))
    print("Telegram bot started...")
    app.run_polling()

if __name__ == "__main__":
    main()