Problem
Book catalogs contain useful metadata, but raw scraped data is messy: missing descriptions, inconsistently formatted author names, duplicate titles, and inconsistent category labels. I wanted a pipeline that turns the raw catalog into clean tables for search, recommendations, and reporting.
Why I built it
This project shows data engineering fundamentals through a familiar data product. The useful work is not only the final recommendations; it is the ingestion, cleaning, schema design, feature preparation, and reproducible handoff to the app layer.
How it works
- Scrape titles, authors, descriptions, categories, and ratings.
- Normalize text fields and remove unusable or duplicate records.
- Create clean staging tables before feature generation.
- Build feature tables for TF-IDF retrieval and catalog browsing.
- Expose the prepared data through a lightweight product interface.
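The normalize-and-deduplicate step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `normalize_text` and `clean_catalog`, the field names, the rule that a row needs both a title and a description, and the (title, author) duplicate key are all assumptions made for the example.

```python
import re
import unicodedata


def normalize_text(value):
    """Return a trimmed, whitespace-collapsed string, or None if empty."""
    if value is None:
        return None
    text = unicodedata.normalize("NFKC", str(value))
    text = re.sub(r"\s+", " ", text).strip()
    return text or None


def clean_catalog(raw_records):
    """Normalize fields, drop unusable rows, keep the first of each duplicate."""
    seen = set()
    cleaned = []
    for record in raw_records:
        title = normalize_text(record.get("title"))
        author = normalize_text(record.get("author"))
        description = normalize_text(record.get("description"))

        # Assumption: a row without a title or description is unusable downstream.
        if title is None or description is None:
            continue

        # Assumption: duplicates share a case-insensitive (title, author) pair.
        key = (title.casefold(), (author or "").casefold())
        if key in seen:
            continue
        seen.add(key)

        cleaned.append(
            {
                "title": title,
                "author": author,
                "description": description,
                "category": normalize_text(record.get("category")),
            }
        )
    return cleaned
```

Keeping the first occurrence of each duplicate is one possible policy; a real pipeline might instead prefer the record with the longest description or the most recent scrape.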
Key decisions
- Separated raw, cleaned, and feature-ready datasets.
- Kept transformations readable so pipeline behavior is easy to audit.
- Used TF-IDF as a simple downstream consumer of the prepared data.
- Designed the project as a data pipeline first and a recommender second.
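To show how small the downstream consumer can be once the data is clean, here is a rough TF-IDF retrieval sketch using only the standard library. It assumes the cleaned descriptions arrive as a list of strings; the helper names (`tokenize`, `build_tfidf`, `cosine`, `most_similar`) and the smoothed IDF formula are choices made for this example, not a record of the project's implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf(documents):
    """Return (vectors, idf): one L2-normalized {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so terms that appear in every document keep a nonzero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors, idf


def cosine(a, b):
    """Dot product of two L2-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


def most_similar(index, vectors, top_k=3):
    """Return indices of the top_k documents most similar to vectors[index]."""
    scores = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    return [i for _, i in sorted(scores, reverse=True)[:top_k]]
```

Because the vectors are normalized up front, similarity reduces to a sparse dot product, which keeps the retrieval step easy to audit alongside the rest of the pipeline.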
What I learned
This project helped me understand that data products are only as reliable as the upstream work behind them. Clean schema choices, validation, and reproducible transformations matter as much as the recommendation logic sitting at the end of the pipeline.
What I'd improve next
- Add dbt-style tests for uniqueness, nulls, and accepted categories.
- Schedule refreshes and track pipeline runtime in a simple log table.
- Store catalog snapshots so changes can be compared over time.
- Add data documentation for each cleaned table and feature field.
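As a rough idea of what the dbt-style tests in the first item could check, the sketch below expresses uniqueness, null, and accepted-value checks as plain Python functions over a list of row dicts. The function names, the `book_id` and `category` columns, and the choice to ignore nulls in the uniqueness and accepted-value checks are assumptions for illustration; in dbt itself these would be declared in a YAML schema file rather than written by hand.

```python
def check_unique(rows, column):
    """Return the non-null values of `column` that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row.get(column)
        if value is None:
            continue  # nulls are reported by check_not_null instead
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


def check_not_null(rows, column):
    """Return the indices of rows where `column` is missing or empty."""
    return [i for i, row in enumerate(rows) if row.get(column) in (None, "")]


def check_accepted_values(rows, column, accepted):
    """Return the non-null values of `column` that are not in `accepted`."""
    unexpected = {
        row[column]
        for row in rows
        if row.get(column) is not None and row[column] not in accepted
    }
    return sorted(unexpected)
```

Each check returns the offending values or row indices instead of raising, so a scheduled run could write the results to the same log table that tracks runtime.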