David Shatsky
No. 01

An archive built so the digital deluge actually sticks.

Next.js · TypeScript · Firebase · Gemini API · Tailwind · RAG · Vercel · TipTap · Vitest

The problem I noticed

Knowledge workers consume articles, books, and podcasts constantly, but retain almost none of it. The crux is the absence of any reinforcement loop. Content gets saved once and forgotten, so the knowledge never compounds.

What I tried first

I tried the obvious moves. I bookmarked things religiously, which left me with 400 saved links I never opened. I saved the content I found interesting into separate NotebookLM projects, which I found useful for.....[why was it useful]. But that was not a sufficient way to organize information when I was reading about dozens of unrelated topics across months.

None of it significantly closed the loop between reading something, storing it, and then actually knowing it.

The idea that changed my thinking

I came across Andrej Karpathy's writing about LLMs as compression engines. He argued that LLMs distill the statistical structure of human knowledge into something you can query in natural language. You can probe the compressed representation directly.

Then I read a Medium article that critiqued LLM wikis applied to organizations, because a single error or misinterpretation by the LLM could corrupt the entire knowledge base over time. The article argued for RAG instead.

That reframe is what unlocked the project. The question became what it would look like to apply the same principle to a single person's reading life. A RAG of what things mean to you, annotated with your synthesis, and made queryable from there.

Writing a curator note, even a single sentence, is lossy compression with high signal preservation. You are not archiving the web. You are archiving what the web means to you, while the web itself remains the main source of truth.

What I built

A full-stack web app where I save URLs, PDFs, and YouTube videos, write notes on what I took from them, tag them semantically, and index the whole corpus as vector embeddings through the Gemini API. The result is a personal RAG system.
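The retrieval half of that system can be sketched in a few lines. This is a minimal illustration, not the app's actual code: it assumes each saved item already carries an embedding vector produced at save time, and the names `ArchiveItem`, `cosineSimilarity`, and `topK` are hypothetical. The tiny three-dimensional vectors stand in for real embeddings.

```typescript
// Hypothetical shape of an indexed item; field names are assumptions.
interface ArchiveItem {
  id: string;
  note: string;        // curator note stored alongside the source
  embedding: number[]; // vector produced at save time by an embedding model
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector matches nothing
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k items whose embeddings are closest to the query embedding.
function topK(queryEmbedding: number[], items: ArchiveItem[], k: number): ArchiveItem[] {
  return items
    .map((item) => ({ item, score: cosineSimilarity(queryEmbedding, item.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.item);
}

// Toy corpus with made-up vectors.
const demoItems: ArchiveItem[] = [
  { id: "a", note: "Compression as understanding", embedding: [1, 0, 0] },
  { id: "b", note: "Sourdough hydration ratios", embedding: [0, 1, 0] },
  { id: "c", note: "Why RAG beats an LLM wiki", embedding: [0.9, 0.1, 0] },
];

const nearest = topK([1, 0, 0], demoItems, 2);
console.log(nearest.map((item) => item.id)); // [ 'a', 'c' ]
```

In a full pipeline, the retrieved notes and source excerpts would then be passed to the generation model as context, which is where the curator notes start shaping the answer.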

When I query the archive, the response reflects my curatorial taste because I chose the sources, my synthesis because my notes are part of the corpus, and my ontology because my tags reflect how I organize knowledge. That is qualitatively different from asking Google or ChatGPT.

Stack: Next.js 14 with the App Router, TypeScript, Firebase Firestore, Gemini for embeddings and generation, Tailwind and shadcn/ui, deployed on Vercel.

Who it's for

Primary: Curious professionals who consume widely — they're hiring Living Archive to turn their reading habit into retained knowledge and stay accountable to actually learning what they've saved.

Secondary: Technical visitors, potential employers, or collaborators — they're encountering the archive to read the corpus as a signal of intellectual range, curatorial taste, and engineering depth.

Key metrics

  • Weekly learning sessions — number of quiz, recall, or chat sessions completed per week; tracked in Firestore
  • Recall rate — % of quiz answers correct on first attempt, trending over time; tracked per post in Firestore
  • Active archive ratio — % of saved posts interacted with at least once via quiz or chat vs. total saved; computed from Firestore
  • Reinforcement depth — number of posts revisited 2+ times via quiz or recall; tracked in Firestore
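
Three of these metrics can be derived from a flat list of interaction events. The sketch below is one possible reading of the definitions above, not the production queries: the `InteractionEvent` shape and `computeMetrics` are hypothetical, "revisited 2+ times" is interpreted as two or more quiz or recall events on the same post, and weekly session counts are left out because they would need timestamps.

```typescript
type InteractionKind = "quiz" | "recall" | "chat";

// Hypothetical event record; the real Firestore schema is not shown in this write-up.
interface InteractionEvent {
  postId: string;
  kind: InteractionKind;
  firstAttemptCorrect?: boolean; // only set for graded quiz answers
}

interface ArchiveMetrics {
  recallRate: number;         // share of graded quiz answers correct on first attempt
  activeArchiveRatio: number; // share of saved posts touched via quiz or chat
  reinforcementDepth: number; // posts with 2+ quiz or recall events
}

function computeMetrics(savedPostIds: string[], events: InteractionEvent[]): ArchiveMetrics {
  const saved = new Set(savedPostIds);
  // Ignore events that point at posts no longer in the archive.
  const relevant = events.filter((e) => saved.has(e.postId));

  const graded = relevant.filter(
    (e) => e.kind === "quiz" && e.firstAttemptCorrect !== undefined
  );
  const correct = graded.filter((e) => e.firstAttemptCorrect === true).length;
  const recallRate = graded.length === 0 ? 0 : correct / graded.length;

  const active = new Set(
    relevant.filter((e) => e.kind === "quiz" || e.kind === "chat").map((e) => e.postId)
  );
  const activeArchiveRatio = saved.size === 0 ? 0 : active.size / saved.size;

  const reviewCounts = new Map<string, number>();
  for (const e of relevant) {
    if (e.kind === "quiz" || e.kind === "recall") {
      reviewCounts.set(e.postId, (reviewCounts.get(e.postId) ?? 0) + 1);
    }
  }
  const reinforcementDepth = Array.from(reviewCounts.values()).filter((n) => n >= 2).length;

  return { recallRate, activeArchiveRatio, reinforcementDepth };
}

// Made-up data: four saved posts, five interactions.
const metrics = computeMetrics(
  ["p1", "p2", "p3", "p4"],
  [
    { postId: "p1", kind: "quiz", firstAttemptCorrect: true },
    { postId: "p1", kind: "quiz", firstAttemptCorrect: false },
    { postId: "p2", kind: "chat" },
    { postId: "p3", kind: "recall" },
    { postId: "p3", kind: "recall" },
  ]
);
console.log(metrics); // { recallRate: 0.5, activeArchiveRatio: 0.5, reinforcementDepth: 2 }
```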

Tracks

Content ingestion & corpus

Getting content in cleanly: URL scraping, article and podcast parsing, metadata enrichment, and the admin editor.

Why it serves the approach: Every piece must be clean and structured before it can become a learning object.

Learning loop

Quizzes, recall sessions, spaced repetition scheduling, grading, and the RAG chat interface.

Why it serves the approach: This is where retention actually happens — the core reinforcement mechanism.

Knowledge synthesis

Semantic search, cross-post linking, related content surfacing, and citations across the full archive.

Why it serves the approach: Turns isolated saved posts into a compounding, queryable knowledge base.

Marketing

One-liner: A personal knowledge corpus — curated by taste, powered by a scrape → summarize → embed → query pipeline, built to compound into retained knowledge over time.

Key message: What you're seeing isn't "500 annotated essays." It's taste as curation: links chosen, typed (article vs. video vs. research), tagged, and published as a corpus. It's a working knowledge pipeline, end to end. And notes are the north star: the product is designed for synthesis, and it is honest that the corpus is still being enriched.


What I learned building it

Curation is harder than retrieval. Getting the RAG pipeline working took a weekend. Getting myself to write a substantive curator note instead of just dropping a URL is the real product problem. The feature that makes the system valuable is also the one that takes the most user effort.

I added a Learn tool with quiz and recall modes that pull questions from my own corpus, and that was the point where the system stopped feeling like a more sophisticated Pocket and started feeling like something with a thesis. Passive retrieval is reference, active recall is training.

Roughly 67% of saved items in every PKM tool are never revisited. A beautiful corpus that nobody reviews, including the person who built it, is just a prettier version of the original problem.

Features shipped

  • URL, PDF, and YouTube ingestion with AI enrichment, where Gemini auto-generates summaries and tags on save
  • Rich-text curator notes through a TipTap editor
  • Semantic RAG chat for querying the corpus in natural language with cited answers
  • Adaptive quiz layer with struggle-weighted question selection and a four-stage review path: recognition, surface recall, deep recall, and connection-finding
  • Socratic tutor mode that asks you a question first, before answering yours
  • Deliberate-error recall variant, where 1 in 5 sessions presents a plausible but wrong summary you have to identify and correct
  • Daily digest email with a pre-built session queue delivered each morning
  • Semantic drift detector that runs weekly, flags topic clusters in the archive you have stopped engaging with, and surfaces them in your next session
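
To make "struggle-weighted question selection" concrete, here is a minimal sketch of weighted random sampling. It shows one plausible approach rather than the shipped algorithm: the `PostStats` shape, the smoothed miss-rate weight, and the function names are all assumptions. The random source is injectable so the behavior can be checked deterministically.

```typescript
// Hypothetical per-post quiz history.
interface PostStats {
  postId: string;
  attempts: number;
  misses: number;
}

// Assumed weighting: a smoothed miss rate, so posts with no history get a
// middling weight of 0.5 instead of never being selected.
function struggleWeight(stats: PostStats): number {
  return (stats.misses + 1) / (stats.attempts + 2);
}

// Pick one post with probability proportional to its struggle weight.
function pickWeighted(posts: PostStats[], random: () => number = Math.random): PostStats {
  if (posts.length === 0) {
    throw new Error("No posts to choose from");
  }
  const weights = posts.map(struggleWeight);
  const total = weights.reduce((sum, w) => sum + w, 0);
  let threshold = random() * total;
  for (let i = 0; i < posts.length; i++) {
    threshold -= weights[i];
    if (threshold < 0) return posts[i];
  }
  return posts[posts.length - 1]; // guards against floating-point leftovers
}

const easyPost: PostStats = { postId: "easy", attempts: 8, misses: 0 }; // weight 0.1
const hardPost: PostStats = { postId: "hard", attempts: 8, misses: 8 }; // weight 0.9

// With these weights, roughly nine draws in ten should land on the hard post.
const lowDraw = pickWeighted([easyPost, hardPost], () => 0.05);
const midDraw = pickWeighted([easyPost, hardPost], () => 0.5);
console.log(lowDraw.postId, midDraw.postId); // easy hard
```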

What I'm working on now

The retention habit loop: a unified session surface, an interaction event log that tracks what you have studied rather than just saved, and adaptive scheduling that weights review toward the posts you struggled with. The goal is to make the daily practice of returning feel automatic rather than effortful.
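
As a rough illustration of where the scheduling could go, the sketch below uses a simple doubling rule: a correct answer doubles the review interval, and a miss resets it to one day. This is an assumed stand-in, not the scheduler being built and not a named algorithm such as SM-2. `ReviewState` and `scheduleNext` are hypothetical names.

```typescript
// Hypothetical per-post scheduling state.
interface ReviewState {
  intervalDays: number;
  dueAt: Date;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Doubling rule (an assumption): grow the gap after a correct answer,
// fall back to a one-day gap after a miss.
function scheduleNext(
  previous: ReviewState | undefined,
  correct: boolean,
  reviewedAt: Date
): ReviewState {
  const lastInterval = previous?.intervalDays ?? 0;
  const intervalDays = correct ? Math.max(1, lastInterval * 2) : 1;
  return {
    intervalDays,
    dueAt: new Date(reviewedAt.getTime() + intervalDays * MS_PER_DAY),
  };
}

// Walk one post through three reviews: correct, correct, missed.
const firstReview = scheduleNext(undefined, true, new Date("2024-01-01T00:00:00Z"));
const secondReview = scheduleNext(firstReview, true, firstReview.dueAt);
const thirdReview = scheduleNext(secondReview, false, secondReview.dueAt);

console.log(firstReview.intervalDays, secondReview.intervalDays, thirdReview.intervalDays); // 1 2 1
console.log(thirdReview.dueAt.toISOString()); // 2024-01-05T00:00:00.000Z
```

Because a miss resets the interval, the posts that cause the most trouble come back soonest, which is the weighting toward struggle described above.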

The longer arc is this. A year from now, you should be able to ask my archive a question and get a better answer than Google for that specific question, because the answer is filtered through my own synthesis, taste, and history of thinking about the problem.