Davids Archive

The problem

Knowledge workers consume articles, books, and podcasts constantly, but retain almost none of it. The crux is the absence of any reinforcement loop: content gets saved once and forgotten, so the knowledge never compounds.

What I tried first

I tried the obvious moves. I bookmarked things religiously, which left me with 400 saved links I never opened. I tried saving everything I found interesting into separate NotebookLM projects, which was useful for going deep on a single, self-contained topic, but it broke down as a way to organize my reading once that reading spanned dozens of unrelated topics across months.

None of it meaningfully closed the loop between reading something, storing it, and actually knowing it.

The reframe that unlocked it

I came across Andrej Karpathy's writing about LLMs as compression engines: the idea that an LLM distills the statistical structure of human knowledge into something you can query in natural language, probing the compressed representation directly.

Around the same time, I read a Medium piece critiquing LLM "wikis" for organizations. Its concern was that a single error or misinterpretation by the model could corrupt the entire knowledge base over time. It argued for RAG instead: keep the sources as the source of truth and retrieve against them.

That reframe is what unlocked the project. The question became: what would it look like to apply the same principle to a single person's reading life? A RAG system over what things mean to you, annotated with your own synthesis and queryable from there.

Writing a curator note, even a single sentence, is lossy compression with high signal preservation. You're not archiving the web; you're archiving what the web means to you, with the web itself still standing as the source of truth.

What I built

A full-stack web app where I save URLs, PDFs, and YouTube videos, write notes on what I took from them, tag them semantically, and index the whole corpus as vector embeddings through the Gemini API. The result is a personal RAG system. Today it holds 62 saved sources chunked into a 1,021-chunk corpus, with 159 review questions generated at ingest to seed the recall flows.
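As a rough illustration of the chunk-and-retrieve idea, here is a minimal in-memory sketch. It is not the app's actual pipeline: the real system uses Gemini for embeddings and Firestore vector search for retrieval, whereas this stand-in takes embeddings as plain number arrays and ranks them by cosine similarity. The names Chunk, chunkWords, cosineSimilarity, and topK, the word-window chunking strategy, and the size and overlap parameters are all assumptions made for the example.

```typescript
// Hypothetical shape of an indexed chunk; field names are assumptions.
interface Chunk {
  sourceId: string;
  text: string;
  embedding: number[];
}

// Split text into overlapping windows of `size` words.
// Consecutive windows share `overlap` words so context is not cut mid-thought.
function chunkWords(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("require size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
    // Stop once a window has reached the end of the text.
    if (start + size >= words.length) break;
  }
  return chunks;
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Return the k chunks most similar to the query embedding, best first.
function topK(query: number[], corpus: Chunk[], k: number): Chunk[] {
  return corpus
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.chunk);
}
```

In a setup like this, the retrieved chunks (including chunks that come from curator notes) would be passed to the generation model as context, with sourceId kept alongside each chunk so answers can cite the original source.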

When I query the archive, the response reflects my curatorial taste (I chose the sources), my synthesis (my notes are part of the corpus), and my ontology (my tags reflect how I organize knowledge). That's qualitatively different from asking Google or ChatGPT.

Stack: Next.js 14 (App Router), TypeScript, Firestore vector search, Gemini for embeddings and generation, Tailwind and shadcn/ui, deployed on Vercel.

Who it's for

Primary: curious professionals who consume widely. They're hiring Davids Archive to turn a reading habit into retained knowledge, and to stay accountable to actually learning what they've saved.

Secondary: technical visitors, potential employers, collaborators. They're reading the corpus as a signal of intellectual range, curatorial taste, and engineering depth.

How I measure it

Weekly learning sessions: quiz, recall, or chat sessions completed per week.
Recall rate: % of quiz answers correct on first attempt, trending over time, tracked per post.
Active archive ratio: % of saved posts interacted with at least once (quiz or chat) vs. total saved.
Reinforcement depth: number of posts revisited two or more times via quiz or recall.
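To make the definitions above concrete, here is a small sketch of how three of these metrics could be computed from a list of interaction events. The InteractionEvent shape, its field names, and the computeMetrics function are assumptions for illustration, not the app's actual schema. Weekly learning sessions is left out because it would also need timestamps.

```typescript
type SessionKind = "quiz" | "recall" | "chat";

// Hypothetical event record; field names are assumptions.
interface InteractionEvent {
  postId: string;
  kind: SessionKind;
  // Set only on quiz events: was the first attempt correct?
  firstAttemptCorrect?: boolean;
}

interface Metrics {
  recallRate: number; // fraction of quiz answers correct on first attempt
  activeArchiveRatio: number; // fraction of saved posts touched via quiz or chat
  reinforcementDepth: number; // posts revisited 2+ times via quiz or recall
}

function computeMetrics(events: InteractionEvent[], totalSavedPosts: number): Metrics {
  // Recall rate: correct first attempts over all graded quiz answers.
  const graded = events.filter(
    (e) => e.kind === "quiz" && e.firstAttemptCorrect !== undefined
  );
  const correct = graded.filter((e) => e.firstAttemptCorrect === true).length;
  const recallRate = graded.length === 0 ? 0 : correct / graded.length;

  // Active archive ratio: distinct posts with at least one quiz or chat event.
  const activePosts = new Set<string>();
  for (const e of events) {
    if (e.kind === "quiz" || e.kind === "chat") activePosts.add(e.postId);
  }
  const activeArchiveRatio =
    totalSavedPosts === 0 ? 0 : activePosts.size / totalSavedPosts;

  // Reinforcement depth: posts with two or more quiz or recall events.
  const reviewCounts = new Map<string, number>();
  for (const e of events) {
    if (e.kind === "quiz" || e.kind === "recall") {
      reviewCounts.set(e.postId, (reviewCounts.get(e.postId) ?? 0) + 1);
    }
  }
  let reinforcementDepth = 0;
  for (const count of reviewCounts.values()) {
    if (count >= 2) reinforcementDepth++;
  }

  return { recallRate, activeArchiveRatio, reinforcementDepth };
}
```

Recall rate is computed over the whole log here; tracking it per post, as described above, would mean grouping the graded events by postId first.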

The three tracks

Content ingestion & corpus. Getting content in cleanly: URL scraping, article and podcast parsing, metadata enrichment, and the admin editor. Every piece has to be clean and structured before it can become a learning object.

Learning loop. Quizzes, recall sessions, spaced-repetition scheduling, grading, and the RAG chat interface. This is where retention actually happens, the core reinforcement mechanism.

Knowledge synthesis. Semantic search, cross-post linking, related-content surfacing, and citations across the full archive. This turns isolated saved posts into a compounding, queryable knowledge base.

What I learned building it

Curation is harder than retrieval. Getting the RAG pipeline working took a weekend; getting myself to write a substantive curator note instead of just dropping a URL is the real product problem. The feature that makes the system valuable is also the one that takes the most user effort.

Adding a Learn tool (quiz and recall modes that pull questions from my own corpus) was the point where the system stopped feeling like a more sophisticated Pocket and started feeling like something with a thesis. Passive retrieval is reference; active recall is training.

Roughly 67% of saved items in PKM tools are never revisited. A beautiful corpus nobody reviews, including the person who built it, is just a prettier version of the original problem.

Features shipped

URL, PDF, and YouTube ingestion with AI enrichment: Gemini auto-generates summaries and tags on save.
Rich-text curator notes via a TipTap editor.
Semantic RAG chat for querying the corpus in natural language, with cited answers.
Adaptive quiz layer with struggle-weighted question selection and a four-stage review path: recognition, surface recall, deep recall, and connection-finding.
Socratic tutor mode that asks you a question before answering yours.
Deliberate-error recall variant: 1 in 5 sessions presents a plausible but wrong summary you have to catch and correct.
Daily digest email with a pre-built session queue delivered each morning.
Semantic drift detector that runs weekly, flags topic clusters you've stopped engaging with, and surfaces them in your next session.
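One way struggle-weighted question selection could work is weighted random sampling, where each question's chance of being picked grows with how often it has been missed. The sketch below is an assumption about the approach, not the shipped implementation: QuestionStats, struggleWeight, and pickQuestion are hypothetical names, and the smoothed miss rate is just one reasonable choice of weight. The random source is injectable so the behavior can be checked deterministically.

```typescript
// Hypothetical per-question history; field names are assumptions.
interface QuestionStats {
  questionId: string;
  attempts: number;
  misses: number;
}

// Smoothed miss rate: (misses + 1) / (attempts + 2).
// Unseen questions get 0.5, so they are neither ignored nor dominant.
function struggleWeight(q: QuestionStats): number {
  return (q.misses + 1) / (q.attempts + 2);
}

// Pick one question with probability proportional to its struggle weight.
// `random` must return a number in [0, 1); it defaults to Math.random.
function pickQuestion(
  pool: QuestionStats[],
  random: () => number = Math.random
): QuestionStats {
  if (pool.length === 0) throw new Error("empty question pool");
  const weights = pool.map(struggleWeight);
  const total = weights.reduce((sum, w) => sum + w, 0);
  let threshold = random() * total;
  for (let i = 0; i < pool.length; i++) {
    threshold -= weights[i];
    if (threshold < 0) return pool[i];
  }
  // Guard against floating-point rounding leaving a tiny remainder.
  return pool[pool.length - 1];
}
```

Because every weight stays above zero, questions that have been answered correctly many times still come up occasionally, which keeps older material from dropping out of rotation entirely.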

What I'm working on now

The retention habit loop: a unified session surface, an interaction event log that tracks what you've studied rather than just saved, and adaptive scheduling that weights review toward the posts you struggled with. The goal is to make the daily practice of returning feel automatic rather than effortful.
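A minimal sketch of what adaptive scheduling might look like, assuming a simple interval scheme: a correct review doubles the gap before the next one, a miss resets it to one day, and the daily queue puts the shortest intervals (the posts most recently struggled with) first. This is an illustrative assumption, not the planned design. ReviewState, scheduleNext, buildQueue, and the 60-day cap are all hypothetical.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const MAX_INTERVAL_DAYS = 60; // assumed cap so nothing disappears for too long

// Hypothetical per-post scheduling state; field names are assumptions.
interface ReviewState {
  postId: string;
  intervalDays: number;
  dueOn: Date;
}

// Update a post's schedule after a review.
// Correct: double the interval (up to the cap). Missed: reset to one day.
function scheduleNext(state: ReviewState, correct: boolean, reviewedOn: Date): ReviewState {
  const intervalDays = correct
    ? Math.min(state.intervalDays * 2, MAX_INTERVAL_DAYS)
    : 1;
  const dueOn = new Date(reviewedOn.getTime() + intervalDays * MS_PER_DAY);
  return { ...state, intervalDays, dueOn };
}

// Build today's session queue: only posts that are due, with the shortest
// intervals first, then the most overdue.
function buildQueue(states: ReviewState[], today: Date, limit: number): ReviewState[] {
  return states
    .filter((s) => s.dueOn.getTime() <= today.getTime())
    .sort(
      (a, b) =>
        a.intervalDays - b.intervalDays ||
        a.dueOn.getTime() - b.dueOn.getTime()
    )
    .slice(0, limit);
}
```

A queue built this way could also feed the daily digest, since it is just a sorted list of due posts computed from the event history.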

The longer arc: a year from now, you should be able to ask my archive a question and get a better answer than Google for that specific question, because the answer is filtered through my own synthesis, taste, and history of thinking about the problem.