Skip to main content

LLM book recommender system

·1161 words·6 mins
Reading Self Hosting Llm Data Science

I now have several different apps to organize my self-hosted media: Plex for movies and TV, Audiobookshelf for audiobooks, and Calibre-web-automated for ebooks.

🎥 Discovery on Plex: really good
#

The UX on these apps is uniformly pretty good, with a few being great. Plex, for example, is a joy: the metadata is super well-organized, and so you can easily traverse your movie collection to find, e.g., all the Colin Firth movies you do and do not own (narrator: she owns enough). Similarly, it enriches your existing (probably paltry) metadata with, for example, IMDb and Rotten Tomato ratings, category, country of origin, etc. This is all super helpful for discovery which - when you have a big media library - is key. You can use either deterministic rules (“sort all my unwatched Italian-language movies by audience rating”). And it includes, on every movie page, some “more like this” suggestions - probably powered, under the hood, by a reasonable recommender system (“recsys”).

🎧 Discovery on Audiobookshelf: also quite good
#

Audiobookshelf is probably next in terms of UX and discoverability: like Plex, it does a great job of organizing your media library’s metadata. It does not have a built-in machine learning-powered recommender system, alas, but it does have very good metadata matching. Each audiobook has a ton of rich tags and categories and so you can, at least, manually look things up by categorical vibe.

📚 Discovery on Calibre-web-automated: not so good
#

Which brings us to Calibre. Oh, Calibre. Salty, salt-o-the-earth Calibre. Calibre is dinosaur software, it’s been around forever and has a huge community. It has lots of quirks, both in its community, its development, and its user interface/UX. But it is, as they say, very powerful. Several attempts have been made to essentially re-skin Calibre into a slightly more modern-looking web interface: Calibre-web being the main one I know about. A fork of calibre-web is Calibre-Web-Automated which… well, to be honest, I’ve read the README.md tons of times and I still don’t see why I’m using it versus just calibre-web. But no matter. It has 3.5k stars on GitHub. They can’t all be wrong?!

Anyway, one thing that Calibre(-web)(-automated) all lack is any sense of discovery. This is mostly because, weirdly (for such well-used and well-loved software!), the way they handle book metadata is kinda garbage? At least, when you add an ebook to your library, the flow is not to immediately give you a best-match metadata chunk (as in Plex or Audiobookshelf). There’s tons of UX love this app (these apps) need (e.g. I can’t click on multiple books unless I first switch to “Book list” mode? whyyyy?). But the big one that I need (NEED) is decent discovery.

Using an LLM to do discovery on my own Calibre library
#

I just finished The AI Con (5 stars, v good, highly recommend), and so I do hesitate to touch LLMs these days - especially due to the water usage. But, well, okay, this one’s pretty cool. Basically, in plain English:

  1. Get my Calibre metadata.db, the sqlite3 database that stores all my book data (title, author, read/not read, description, tags…), from my Calibre container.
  2. Add above .db to my self-hosted Marimo container.
  3. Launch a cool, new notebook called librarian.py, hehehehe.
  4. Connect that notebook to the Anthropic API.
  5. pd.read_sql("SELECT * FROM books", con=conn) from your metadata.db and do some light cleaning (well, hopefully). I realized early on that I was just blowing money by feeding way too much textual data about my ebook library. For now, I just send the title-author pairs of all unread books to Claude.
  6. Ask Claude to be my librarian!

Here’s my very exciting prompt and Anthropic API call:

def get_book_recs(mood: str, books_df: pd.DataFrame) -> str:
    if not mood:
        print("No mood!")
        return pd.DataFrame()

    # create a context string from your books
    book_context = books_df.to_dict(orient="records")

    prompt = f"""
    You are a world-class librarian. You prioritize books from under-represented
    groups (women, POC).

    Here are the books in my library (id, title, author):
    {book_context}

    I'm in the mood for: {mood}

    Please recommend books from this list that most match my mood. If you can't
    find any in my library, please point me to the most relevant books that I 
    can look for externally.

    Please use Markdown formatting. Format every recommended book
    with <a href="http://my_super_cool_calibre_ip/book/id>title</a>.
    """

    # hit your llm api here
    message = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )


    # parse response and filter df
    return message.content

TEST_PROMPT = "something solarpunk and light, <200 pages"

# Your API call here
recommendations = get_book_recs(mood_input.value, df)

Results
#

So how does this perform? Honestly, pretty good!

I’m basically pulling from Anthropic’s training data for any scraped data on my title-authors. This works great for relatively well-known (or, at least, well-represented-in-the-corpus) stuff. It struggles, of course, on anything very new or anything obscure. So the recommendations are always going to skew relatively well-known. But that’s okay! This has been a great discovery tool. I have about ~1k ebooks in my library - a mix of technical books (so many Humble Bundles, mamma mia), sci-fi, literary fiction, politics, history, economics. Sometimes I’ll be in the mood for a specific theme or type of book. Being able to “speak” to my library in natural language is great! And fun!

Example run
#

Here’s an example run I did just now.

screenshot of a marimo notebook

Can you even see that? Sorry it’s so small. My request was, “what’s some good fiction to read about working in the tech industry?”

The full response is pretty good:

From Your Library:

Close to the Machine

by Ellen Ullman This is a classic memoir/fiction hybrid about a software engineer in San Francisco during the first dot-com boom. Ullman brilliantly captures the isolation, obsession, and strange intimacy of coding work. It’s considered one of the best books ever written about what it actually feels like to work as a programmer.

Uncanny Valley

by Anna Wiener A memoir about working at various Silicon Valley startups during the 2010s tech boom. Wiener chronicles her transition from the literary world to tech with sharp, insightful observations about startup culture, gender dynamics, and the industry’s impact on society.

Several People Are Typing

by Calvin Kasulke A clever, experimental novel told entirely through Slack messages about a worker whose consciousness gets stuck in Slack. It’s a funny, surprisingly moving meditation on remote work culture and digital existence.

Moonbound

. by Robin Sloan While more fantastical, this book features tech industry elements and Sloan (who also wrote Mr. Penumbra’s 24-Hour Bookstore) has a background in tech and writes beautifully about digital culture.

Other Books to Look For Externally:

  • Microserfs by Douglas Coupland - A novel about Microsoft employees in the 90s
  • The Circle by Dave Eggers - A dystopian take on a Google/Facebook-like company
  • The Every by Dave Eggers - Sequel to The Circle
  • Severance by Ling Ma - Combines office work dystopia with a pandemic narrative
  • Qualityland by Marc-Uwe Kling - Satirical sci-fi about tech monopolies

Related

Hosting my own notebook server
·226 words·2 mins
Cool Tools Self Hosting
Using marimo as my notebook server.
Homelab overview
·1011 words·5 mins
Self Hosting
Taking stock of my homelab
Imprisoning your users - an experience with Google Takeout
·1095 words·6 mins
Self Hosting
Just my experience trying to get 300G of photos out of Google Takeout.