Local Document Indexing Pipeline

Problem

Documents spread across folders become hard to search, audit, and reuse. I wanted a local indexing pipeline that scans files, extracts text and metadata, stores structured entries, and makes retrieval faster without sending private notes anywhere.

Why I built it

I built this project mainly to practice data ingestion and indexing on unstructured files. It behaves like a small private data platform: sources are scanned, transformed into searchable records, and served through a simple API.

The offline workflow also keeps the project grounded in a real use case: focused study, local privacy, and quick movement from query to useful context.

How it works

Scan configured folders and detect supported document files.
Extract readable text, file paths, timestamps, and source metadata.
Load structured records into SQLite for quick lookup.
Rank matching documents and return useful context snippets.
Keep ingestion, storage, and search local to the machine.
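
The steps above can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the project's actual code: the supported extensions, the `documents` table layout, and the names `scan`, `extract`, `index_folder`, and `search` are all hypothetical, and ranking is a plain occurrence count rather than a tuned scoring function.

```python
import sqlite3
from pathlib import Path

SUPPORTED = {".txt", ".md"}  # assumption: plain-text formats only


def scan(root):
    """Yield supported files found anywhere under root."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


def extract(path):
    """Read text plus basic file metadata into one record."""
    stat = path.stat()
    return {
        "path": str(path),
        "modified": stat.st_mtime,
        "size": stat.st_size,
        "text": path.read_text(encoding="utf-8", errors="replace"),
    }


def open_index(db_path=":memory:"):
    """Open (or create) the SQLite index. The schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "path TEXT PRIMARY KEY, modified REAL, size INTEGER, text TEXT)"
    )
    return conn


def index_folder(conn, root):
    """Scan, extract, and load every supported file; return the count."""
    count = 0
    for path in scan(root):
        conn.execute(
            "INSERT OR REPLACE INTO documents "
            "VALUES (:path, :modified, :size, :text)",
            extract(path),
        )
        count += 1
    conn.commit()
    return count


def search(conn, term, limit=5, width=40):
    """Return (hits, path, snippet) tuples, most occurrences first."""
    needle = term.lower()
    rows = conn.execute(
        "SELECT path, text FROM documents WHERE instr(lower(text), ?) > 0",
        (needle,),
    )
    results = []
    for path, text in rows:
        lowered = text.lower()
        pos = lowered.find(needle)
        if pos < 0:
            continue
        start = max(0, pos - width)
        snippet = text[start:pos + len(needle) + width].replace("\n", " ")
        results.append((lowered.count(needle), path, snippet))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results[:limit]
```

Calling `index_folder(conn, "some/folder")` followed by `search(conn, "sqlite")` returns the best-matching paths with a short snippet around the first hit. Where SQLite's FTS5 extension is available, a full-text table would likely be a better fit than substring matching.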

Key decisions

Kept the system local-first for privacy and reproducibility.
Stored metadata separately from extracted content for cleaner querying.
Marked incremental indexing as the first production-style improvement to build next.
Focused on getting from a search result to the relevant passage as quickly as possible.
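
One way to keep metadata separate from extracted content is a two-table layout joined by a file id. The sketch below is an assumption about what such a schema could look like; the table names `files` and `contents`, their columns, and the helper functions are hypothetical.

```python
import sqlite3

# Hypothetical schema: small metadata rows in one table, bulky text in another.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE,
    file_type TEXT NOT NULL,
    modified REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS contents (
    file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE,
    text TEXT NOT NULL
);
"""


def create_store(db_path=":memory:"):
    """Open a database and make sure both tables exist."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, path, file_type, modified, text):
    """Insert the metadata row first, then the content row that points to it."""
    cur = conn.execute(
        "INSERT INTO files (path, file_type, modified) VALUES (?, ?, ?)",
        (path, file_type, modified),
    )
    conn.execute(
        "INSERT INTO contents (file_id, text) VALUES (?, ?)",
        (cur.lastrowid, text),
    )
    conn.commit()
    return cur.lastrowid


def list_paths_by_type(conn, file_type):
    """Metadata-only query: never reads the extracted text."""
    rows = conn.execute(
        "SELECT path FROM files WHERE file_type = ? ORDER BY path",
        (file_type,),
    )
    return [row[0] for row in rows]


def read_text(conn, path):
    """Fetch content only when a specific document is opened."""
    row = conn.execute(
        "SELECT c.text FROM contents c "
        "JOIN files f ON f.id = c.file_id WHERE f.path = ?",
        (path,),
    ).fetchone()
    return row[0] if row else None
```

With this split, listing or filtering documents touches only the narrow `files` table, and the larger text column is read only for the document a user actually opens.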

What I learned

This project taught me how ingestion choices affect the final search experience. File parsing, metadata consistency, storage design, indexing, and API responses all shape whether a retrieval workflow feels reliable.

What I'd improve next

Add incremental re-indexing when files change.
Add data quality checks for failed parses and empty extracts.
Introduce filtering by folder, file type, source, and tags.
Track indexing history and failures in operational tables.
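
Several of these improvements could share one mechanism. The sketch below assumes change detection by SHA-256 content digest and a hypothetical `index_events` table for history. It skips unchanged files, flags empty extracts, and records failed parses. The table names, status labels, and the `reindex` function are illustrative, not part of the existing project.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_db(db_path=":memory:"):
    """Create the document table and a hypothetical operational log table."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            path TEXT PRIMARY KEY, digest TEXT NOT NULL, text TEXT NOT NULL);
        CREATE TABLE IF NOT EXISTS index_events (
            id INTEGER PRIMARY KEY, path TEXT NOT NULL,
            status TEXT NOT NULL, detail TEXT,
            logged_at TEXT DEFAULT CURRENT_TIMESTAMP);
    """)
    return conn


def reindex(conn, paths):
    """Re-index only files whose content digest changed; count each outcome."""
    counts = {"indexed": 0, "skipped": 0, "empty": 0, "failed": 0}
    for path in map(Path, paths):
        detail = None
        try:
            raw = path.read_bytes()
            digest = hashlib.sha256(raw).hexdigest()
            row = conn.execute(
                "SELECT digest FROM documents WHERE path = ?", (str(path),)
            ).fetchone()
            if row and row[0] == digest:
                counts["skipped"] += 1  # unchanged since the last run
                continue
            text = raw.decode("utf-8")
            # Quality check: an extract with no visible text is flagged.
            status = "indexed" if text.strip() else "empty"
            conn.execute(
                "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
                (str(path), digest, text),
            )
        except (OSError, UnicodeDecodeError) as exc:
            status, detail = "failed", str(exc)
        counts[status] += 1
        conn.execute(
            "INSERT INTO index_events (path, status, detail) VALUES (?, ?, ?)",
            (str(path), status, detail),
        )
    conn.commit()
    return counts
```

Running `reindex` twice over the same unchanged files reports them as skipped on the second pass, and querying `index_events` for the `failed` or `empty` status gives a simple audit trail. Deleted files are not handled here; removing their rows would need a separate pass.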