Problem
Documents spread across folders become hard to search, audit, and reuse. I wanted a local indexing pipeline that scans files, extracts text and metadata, stores structured entries, and speeds up retrieval without sending private notes off the machine.
Why I built it
I built this project mainly to practice data ingestion and indexing on unstructured files. It behaves like a small private data platform: sources are scanned, transformed into searchable records, and served through a simple API.
The offline workflow also keeps the project grounded in a real use case: focused study, local privacy, and a short path from a query to useful context.
How it works
- Scan configured folders and detect supported document files.
- Extract readable text, file paths, timestamps, and source metadata.
- Load structured records into SQLite for quick lookup.
- Rank matching documents and return useful context snippets.
- Keep ingestion, storage, and search local to the machine.
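The steps above can be sketched end to end in a few functions. This is a minimal illustration, not the project's actual code: the `documents` table, the `SUPPORTED` extensions, the function names, and the occurrence-count ranking are all assumptions chosen to keep the example short.

```python
import sqlite3
from pathlib import Path

# Assumed file types for this sketch; the real pipeline may support others.
SUPPORTED = {".txt", ".md"}


def scan(root: Path):
    """Yield supported files under root in a stable order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


def build_index(conn: sqlite3.Connection, root: Path) -> int:
    """Extract text and basic metadata, then load one row per file."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "path TEXT PRIMARY KEY, modified REAL, body TEXT)"
    )
    count = 0
    for path in scan(root):
        body = path.read_text(encoding="utf-8", errors="replace")
        conn.execute(
            "INSERT OR REPLACE INTO documents VALUES (?, ?, ?)",
            (str(path), path.stat().st_mtime, body),
        )
        count += 1
    conn.commit()
    return count


def search(conn: sqlite3.Connection, term: str, limit: int = 5, width: int = 40):
    """Rank documents by occurrence count; return (path, hits, snippet)."""
    needle = term.lower()
    if not needle:
        return []
    results = []
    for path, body in conn.execute("SELECT path, body FROM documents"):
        lowered = body.lower()
        hits = lowered.count(needle)
        if hits == 0:
            continue
        # Snippet offsets assume lowercasing keeps string length (true for ASCII).
        pos = lowered.find(needle)
        start = max(0, pos - width)
        snippet = body[start:pos + len(term) + width].replace("\n", " ")
        results.append((path, hits, snippet))
    # Most hits first; ties broken by path so the order is deterministic.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:limit]
```

Calling `build_index(conn, Path("notes"))` followed by `search(conn, "sqlite")` would return the best-matching files with a short context window around the first hit. Scanning every row in Python is fine for a small collection; a larger one would likely need a proper full-text index.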
Key decisions
- Kept the system local-first for privacy and reproducibility.
- Stored metadata separately from extracted content for cleaner querying.
- Prioritized incremental indexing as the next production-style step.
- Focused on making the step from a search result to reading the source document immediate.
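One way to picture the metadata/content split is a two-table layout, where filters and listings touch only a small metadata table and the extracted text is joined in on demand. The schema below is a sketch under assumptions: the table names `files` and `contents`, the column names, and the helper functions are hypothetical, not taken from the project.

```python
import sqlite3

# Hypothetical schema: metadata in `files`, extracted text in `contents`.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id        INTEGER PRIMARY KEY,
    path      TEXT NOT NULL UNIQUE,
    file_type TEXT NOT NULL,
    modified  REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS contents (
    file_id INTEGER PRIMARY KEY REFERENCES files(id) ON DELETE CASCADE,
    body    TEXT NOT NULL
);
"""


def open_store(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, path, file_type, modified, body) -> int:
    """Insert the metadata row first, then the text keyed by its id."""
    cur = conn.execute(
        "INSERT INTO files (path, file_type, modified) VALUES (?, ?, ?)",
        (path, file_type, modified),
    )
    conn.execute(
        "INSERT INTO contents (file_id, body) VALUES (?, ?)",
        (cur.lastrowid, body),
    )
    conn.commit()
    return cur.lastrowid


def list_by_type(conn, file_type):
    """Metadata-only query: the contents table is never read."""
    rows = conn.execute(
        "SELECT path FROM files WHERE file_type = ? ORDER BY path",
        (file_type,),
    )
    return [row[0] for row in rows]


def read_body(conn, path):
    """Fetch the extracted text only when a result is actually opened."""
    row = conn.execute(
        "SELECT c.body FROM contents c "
        "JOIN files f ON f.id = c.file_id WHERE f.path = ?",
        (path,),
    ).fetchone()
    return row[0] if row else None
```

With this layout, a query such as `list_by_type(conn, "md")` stays cheap regardless of how large the document bodies are, and deleting a `files` row removes its text through the cascade.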
What I learned
This project taught me how ingestion choices affect the final search experience. File parsing, metadata consistency, storage design, indexing, and API responses all shape whether a retrieval workflow feels reliable.
What I'd improve next
- Add incremental re-indexing when files change.
- Add data quality checks for failed parses and empty extracts.
- Introduce filtering by folder, file type, source, and tags.
- Track indexing history and failures in operational tables.