Local Document Indexing Pipeline

Local ingestion, metadata, and ranked retrieval. Built like a small private data platform.

Problem

Documents spread across folders become hard to search, audit, and reuse. I wanted a local indexing pipeline that scans files, extracts text and metadata, stores structured entries, and speeds up retrieval without sending private notes off the machine.

Why I built it

I built this project mainly to practice data ingestion and indexing on unstructured files. It behaves like a small private data platform: sources are scanned, transformed into searchable records, and served through a simple API.

The offline workflow also keeps the project grounded in a real use case: focused study, local privacy, and quick movement from query to useful context.

How it works

  • Scan configured folders and detect supported document files.
  • Extract readable text, file paths, timestamps, and source metadata.
  • Load structured records into SQLite for quick lookup.
  • Rank matching documents and return useful context snippets.
  • Keep ingestion, storage, and search local to the machine.
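
The steps above can be sketched in a few functions. This is a minimal illustration rather than the project's actual code: the `docs` table, the `SUPPORTED` extension set, and the function names are assumptions, text extraction is reduced to reading UTF-8 files, and ranking is a plain count of query-term matches.

```python
import re
import sqlite3
from pathlib import Path

SUPPORTED = {".txt", ".md"}  # assumed set of supported extensions


def scan(root: Path):
    """Yield supported files under root in a stable order."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path


def index_folder(conn: sqlite3.Connection, root: Path) -> int:
    """Extract text and metadata per file and upsert one row per path."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs ("
        "path TEXT PRIMARY KEY, mtime REAL, size INTEGER, body TEXT)"
    )
    count = 0
    for path in scan(root):
        stat = path.stat()
        # Undecodable bytes are replaced instead of aborting the run.
        body = path.read_text(encoding="utf-8", errors="replace")
        conn.execute(
            "INSERT OR REPLACE INTO docs VALUES (?, ?, ?, ?)",
            (str(path), stat.st_mtime, stat.st_size, body),
        )
        count += 1
    conn.commit()
    return count


def search(conn: sqlite3.Connection, query: str, limit: int = 5, width: int = 40):
    """Return (path, score, snippet) tuples, highest score first.

    The score is the number of non-overlapping, case-insensitive matches
    of any query term in the document body.
    """
    terms = query.split()
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    results = []
    for path, body in conn.execute("SELECT path, body FROM docs"):
        matches = list(pattern.finditer(body))
        if not matches:
            continue
        first = matches[0].start()
        start = max(0, first - width)
        snippet = body[start:first + width].replace("\n", " ")
        results.append((path, len(matches), snippet))
    # Sort by descending score, then by path so ties are deterministic.
    results.sort(key=lambda r: (-r[1], r[0]))
    return results[:limit]
```

Calling `index_folder(conn, Path("notes"))` followed by `search(conn, "sqlite")` returns the best-matching paths with a short snippet around the first hit. The scan in `search` reads every body, which is fine for a small personal corpus; a larger one would likely want a real full-text index.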

Key decisions

  • Kept the system local-first for privacy and reproducibility.
  • Stored metadata separately from extracted content for cleaner querying.
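  • Planned incremental indexing as the next production-style step, so unchanged files are not re-parsed on every run.
  • Focused on a short path from search result to reading: each hit returns a context snippet along with its file path.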
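
To illustrate the metadata/content split, here is one possible SQLite layout. The table names (`documents`, `contents`), the columns, and the helper functions are hypothetical; the point is only that metadata queries never have to read the large text column.

```python
import sqlite3

# Assumed schema: one small metadata row and one large content row per file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id        INTEGER PRIMARY KEY,
    path      TEXT NOT NULL UNIQUE,
    file_type TEXT NOT NULL,
    mtime     REAL NOT NULL
);
CREATE TABLE IF NOT EXISTS contents (
    doc_id INTEGER PRIMARY KEY REFERENCES documents(id) ON DELETE CASCADE,
    body   TEXT NOT NULL
);
"""


def open_index(db_path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    # SQLite enforces foreign keys only when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, path: str, file_type: str, mtime: float, body: str) -> int:
    """Insert the metadata row first, then the content row that points at it."""
    cur = conn.execute(
        "INSERT INTO documents (path, file_type, mtime) VALUES (?, ?, ?)",
        (path, file_type, mtime),
    )
    doc_id = cur.lastrowid
    conn.execute("INSERT INTO contents (doc_id, body) VALUES (?, ?)", (doc_id, body))
    conn.commit()
    return doc_id


def list_by_type(conn, file_type: str) -> list:
    """Metadata-only query: the contents table is not touched."""
    rows = conn.execute(
        "SELECT path FROM documents WHERE file_type = ? ORDER BY path",
        (file_type,),
    )
    return [row[0] for row in rows]
```

With this layout, deleting a row from `documents` also removes its extracted text through the cascade, so the two tables cannot drift apart.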

What I learned

This project taught me how ingestion choices affect the final search experience. File parsing, metadata consistency, storage design, indexing, and API responses all shape whether a retrieval workflow feels reliable.

What I'd improve next

  • Add incremental re-indexing when files change.
  • Add data quality checks for failed parses and empty extracts.
  • Introduce filtering by folder, file type, source, and tags.
  • Track indexing history and failures in operational tables.