<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Keep Blog</title>
    <link>https://keepnotes.ai/blog/</link>
    <description>Reflective Memory for AI</description>
    <language>en-us</language>
    <atom:link href="https://keepnotes.ai/blog/feed.xml" rel="self" type="application/rss+xml"/>
    <lastBuildDate>Sun, 12 Apr 2026 18:11:51 GMT</lastBuildDate>
    <item>
      <title>LLM-Wiki</title>
      <link>https://keepnotes.ai/blog/2026-04-12-llmwiki/</link>
      <guid isPermaLink="true">https://keepnotes.ai/blog/2026-04-12-llmwiki/</guid>
      <pubDate>Sun, 12 Apr 2026 12:00:00 GMT</pubDate>
      <description><![CDATA[<img src="https://keepnotes.ai/blog/2026-04-12-llmwiki/graph.png" alt="LLM-Wiki" style="max-width:100%;border-radius:8px">
<p><strong>What does memory look like?</strong>  And what happens when it changes?
</p>
<p>Let's find out using Obsidian as the view-pane.
</p>
<p><div class="youtube-embed" id="yt-NQBiy82fGwM" style="position:relative;cursor:pointer;max-width:100%;border-radius:8px;overflow:hidden" onclick="this.innerHTML='<iframe width=\'100%\' height=\'100%\' src=\'https://www.youtube.com/embed/NQBiy82fGwM?autoplay=1\' frameborder=\'0\' allow=\'autoplay; encrypted-media\' allowfullscreen style=\'position:absolute;top:0;left:0;width:100%;height:100%\'></iframe>';this.style.aspectRatio='16/9'"><img alt="Watch the video" src="obsidian.png" style="width:100%;display:block"><div style="position:absolute;top:50%;left:50%;transform:translate(-50%,-50%);width:72px;height:72px;background:rgba(0,0,0,0.7);border-radius:50%;display:flex;align-items:center;justify-content:center"><svg width="32" height="32" viewBox="0 0 24 24" fill="white"><path d="M8 5v14l11-7z"/></svg></div></div>
</p>
<p>This is what Andrej Karpathy described in his viral <a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f">LLM Wiki gist</a> -- a persistent and incrementally-growing collection of documents, searchable, indexed as it grows -- with two interesting features:
</p>
<ul>
<li>you interact with a knowledge base by <em>talking to your agent</em>,</li>
<li>things that you talk <em>about</em> are automatically collected, summarized, and linked.</li>
</ul>
<p>Things that you talk about are <strong>remembered</strong>.  That's <code>keep</code>!
</p>
<p>The Obsidian graph view, and the chat pane, are just a convenient UI over the agent's memory.
</p>
<hr>
<h3 id="how-to">How To</h3>
<p>You'll need <a href="https://hermes-agent.nousresearch.com/">Hermes Agent</a> and <a href="https://obsidian.md/">Obsidian</a>.
</p>
<p>Install and configure the <a href="https://docs.keepnotes.ai/guides/hermes/">Keep plugin</a> for Hermes:
</p>
<pre><code>curl -sSL https://keepnotes.ai/scripts/install-hermes.sh | bash</code></pre>
<p>When <code>keep</code> is first installed, I recommend saying to your agent:
</p>
<blockquote><p>
Follow the keep instructions in your system prompt.
</p></blockquote>
<p>Enable the <a href="https://hermes-agent.nousresearch.com/docs/user-guide/features/api-server">API Server</a> in Hermes, so that the chat panel can connect to the agent.  Do this by editing <code>~/.hermes/.env</code>:
</p>
<pre><code>API_SERVER_ENABLED=true
API_SERVER_HOST=127.0.0.1
API_SERVER_PORT=8642
API_SERVER_KEY=change-me</code></pre>
<p>Then restart your Hermes gateway.
</p>
<p>When <code>keep</code> is running in Hermes, the default store is at <code>~/.hermes/keep</code>.  We will want to use that same store from the command-line too.  Make a new directory to hold the Obsidian vault, and configure <code>keep</code> to sync its memory into this directory.  The "PT10S" is an ISO-8601 duration: a 10-second debounce between syncs; longer is laggier.
</p>
<pre><code class="language-bash">export KEEP_STORE_PATH=~/.hermes/keep
mkdir -p ~/play/keep-vault
keep data export --sync --include-versions --interval PT10S ~/play/keep-vault</code></pre>
<p>In Obsidian, install the <a href="https://www.obsidianclaw.ai/">ObsidianClaw plugin</a> (aka Clawdian).
</p>
<p>Find the plugin settings, and configure its connection to Hermes:
</p>
<p><img alt="Clawdian" src="clawdian.png">
</p>
<p>Now you should be able to start chatting, and see your chats (and anything mentioned in them) show up in Obsidian, linked as a graph.
</p>
<p><a href="mailto:hello@keepnotes.ai">Let me know what you do with it</a>!
</p>]]></description>
      <enclosure url="https://keepnotes.ai/blog/2026-04-12-llmwiki/graph.png" type="image/png"/>
    </item>
    <item>
      <title>Flows</title>
      <link>https://keepnotes.ai/blog/2026-03-15-flows/</link>
      <guid isPermaLink="true">https://keepnotes.ai/blog/2026-03-15-flows/</guid>
      <pubDate>Sun, 15 Mar 2026 12:00:00 GMT</pubDate>
      <description><![CDATA[<img src="https://keepnotes.ai/blog/2026-03-15-flows/rothay.jpg" alt="Flows" style="max-width:100%;border-radius:8px">
<p>Code Mode is like <strong>vibe-coding a query plan</strong>.
</p>
<p>What is "code mode", you may ask?  It's a powerful observation, made recently by <a href="https://www.anthropic.com/engineering/code-execution-with-mcp">Anthropic</a> and <a href="https://blog.cloudflare.com/code-mode/">Cloudflare</a>, that agents just <em>love</em> to write code, and that this can massively reduce the context overhead of tool-calls for <a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a> interfaces:
</p>
<blockquote><p>
We found agents are able to handle many more tools, and more complex tools, when those tools are presented as a TypeScript API rather than directly...
</p><p>
 
The approach really shines when an agent needs to string together multiple calls. With the traditional approach, the output of each tool call must feed into the LLM's neural network, just to be copied over to the inputs of the next call, wasting time, energy, and tokens. When the LLM can write code, it can skip all that, and only read back the final results it needs.
</p></blockquote>
<p>These MCP interfaces are often just wrappers around REST APIs.  In the Cloudflare case, for example, the underlying REST APIs cover over 2500 endpoints across dozens of products.  That's a lot of surface area.
</p>
<p>If the agent generates a script and <em>sends the code</em> instead of calling the APIs directly, there are two big wins:
</p>
<ul>
<li>The actual call is simple, and compact;</li>
<li>The code runs in Cloudflare's own sandbox, right next to the APIs.</li>
</ul>
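<p>The chaining win can be sketched in a few lines.  This is illustrative only: the API surface here (<code>list_zones</code>, <code>get_dns_records</code>, <code>update_record</code>) is hypothetical, standing in for any REST-backed tool collection.</p>

```python
# Hypothetical API stub, so the sketch runs; a real sandbox would call REST.
class FakeAPI:
    def __init__(self):
        self.records = {"z1": [{"id": "r1", "ttl": 7200},
                               {"id": "r2", "ttl": 300}]}
    def list_zones(self):
        return list(self.records)
    def get_dns_records(self, zone):
        return self.records[zone]
    def update_record(self, zone, rid, ttl):
        for r in self.records[zone]:
            if r["id"] == rid:
                r["ttl"] = ttl

def code_mode(api):
    # The model writes this script once; intermediate results stay in the
    # sandbox, next to the APIs, and only the final summary is read back
    # into the model's context.
    updated = 0
    for zone in api.list_zones():
        for rec in api.get_dns_records(zone):
            if rec["ttl"] > 3600:
                api.update_record(zone, rec["id"], ttl=300)
                updated += 1
    return {"updated": updated}

print(code_mode(FakeAPI()))  # → {'updated': 1}
```

<p>With traditional tool-calling, each of those intermediate record lists would have round-tripped through the model's context; here only the one-line summary does.</p>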
<hr>
<p><strong>What would code-mode for <em>data</em> look like?</strong>
</p>
<p>Actually we have lots of names for it already.  The whole of SQL, approximately (statements, views, stored procedures): this is just a smart language that wraps complex logic in a single call, and puts it close to the data.  Other related techniques: push-down, query-optimization.
</p>
<p>So: "code mode" is comparable to one-shot generation of a query plan for a data engine (without any statistics!).
</p>
<p>Agentic memory in <a href="https://keepnotes.ai/">keep</a> isn't a relational model; it's a collection of nodes (conversations, documents, media, semi-structured objects) in a dynamic graph, where the nodes have vector embeddings for semantic similarity, fused with full-text keyword search, and the edges are driven by tags. Languages such as SQL and Cypher and Gremlin aren't a good match for retrieval or update across this.
</p>
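<p>One common way to fuse embedding similarity with keyword search is reciprocal-rank fusion.  This is a generic sketch of the idea, not <code>keep</code>'s actual scoring:</p>

```python
# Reciprocal-rank fusion (RRF): combine ranked lists from different
# retrievers (vector similarity, FTS) into one fused ranking.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # Each list contributes 1/(k + rank + 1) for the docs it ranks.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["oauth-design", "api-keys", "team-notes"]   # by embedding similarity
keyword_hits = ["oauth-design", "api-keys", "meeting-log"]  # by keyword search
print(rrf([vector_hits, keyword_hits]))
# → ['oauth-design', 'api-keys', 'team-notes', 'meeting-log']
```

<p>Documents that appear high in both lists float to the top, without needing the two scoring scales to be comparable.</p>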
<p>The difference shows up in the structure of interaction: exploratory, curatorial.  Of course sometimes you just want "put" and "get" -- but more often, you want to try to find something, discard some results and focus on others, pivot, collect, dig deeper, and maybe update based on that result-set.  Update actions sometimes need summarization, classification, specialized processes such as OCR, and deeper analysis to uncover themes and dynamics.
</p>
<p>Steering these activities involves layers of processing: mechanical, small-model tasks such as classification, and often decisions that can only be made by a powerful model or a person.
</p>
<p>That's why <code>keep</code> runs on a workflow system: Flows.
</p>
<hr>
<p><strong>Here's how flows work</strong>.
</p>
<p>Every action in the CLI and REST API (put, get, find, move...) is just a thin wrapper that invokes a workflow.  The workflow is driven by "state documents": each state-doc is one or more instructions for the activities to be done, and where to go next.  These are just documents in the datastore, so you can change or extend the processing flow for any action just by saving a document (somewhat like a trigger for a stored procedure, in SQL-land).  Executing a flow is token-budgeted, workload-budgeted, and returns a cursor so you can ask it to continue.
</p>
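<p>The budgeted, resumable execution described above can be sketched as a small loop.  The names and mechanics here are illustrative, not <code>keep</code>'s actual engine:</p>

```python
# Sketch: run state-docs until done or the tick budget is spent; return a
# cursor so the caller can ask the flow to continue later.
import uuid

def run_flow(states, start, budget, cursors):
    path, state = [], start
    for _ in range(budget):
        if state is None:
            return "done", path, None
        path.append(state)
        state = states[state]()   # a state runs its activities, then names
                                  # the next state (or None when finished)
    if state is None:
        return "done", path, None
    cursor = uuid.uuid4().hex[:12]
    cursors[cursor] = state       # remember where to pick up
    return "stopped", path, cursor

# A two-state flow, like the inline example below: find, then tag.
states = {"found": lambda: "tagged", "tagged": lambda: None}
cursors = {}
status, path, cursor = run_flow(states, "found", budget=1, cursors=cursors)
print(status, path)               # → stopped ['found']
status, path, _ = run_flow(states, cursors[cursor], budget=2, cursors=cursors)
print(status, path)               # → done ['tagged']
```

<p>A real flow would spend its token budget inside each state's actions; the shape of the contract -- bounded work, a path report, a resume cursor -- is the point.</p>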
<p>All the primitives (find, list, extract links, tag, summarize, run small-model inference, and so on) are just actions that can be called from any state.
</p>
<p>And the MCP interface?  It's just "run a flow".
</p>
<p>The MCP (and the CLI, and the REST API) can either run a predefined flow, such as the builtin processes that handle search and update:
</p>
<pre><code class="language-mcp"># One call: search with steering, token-budgeted result
keep_flow(
    state=&quot;query-resolve&quot;,
    params={&quot;query&quot;: &quot;authentication design&quot;, &quot;bias&quot;: {&quot;now&quot;: 0}},
    budget=3,
    token_budget=1500
)

# → flow: stopped (3 ticks) via query-resolve &gt; query-branch &gt; query-resolve
#   results:
#   - %a1b2c3 (0.94) OAuth2 token refresh design decision...
#   - %d4e5f6 (0.82) API key rotation policy for staging...
#   margin: 0.12
#   cursor: 5ccf5dd940ac</code></pre>
<p>Or it can do something completely custom; just provide the flow state-doc inline.
</p>
<pre><code class="language-mcp"># Find all open commitments from last week and mark them reviewed
keep_flow(
    state_doc_yaml=&quot;&quot;&quot;
match: sequence
rules:
- id: found
  do: find
  with:
    query: &quot;{params.query}&quot;
    tags: {act: &quot;commitment&quot;, status: &quot;open&quot;}
    since: &quot;P7D&quot;
    limit: 20
- id: tagged
  do: tag
  with:
    items: &quot;{found.results}&quot;
    tags: {reviewed: &quot;2026-03-15&quot;}
&quot;&quot;&quot;,
    params={&quot;query&quot;: &quot;&quot;},
    token_budget=500
)

# → flow: done (1 ticks)
#   found: 8 items
#   tagged: 8 items</code></pre>
<p>When memory becomes large and diverse, one-shot queries can do pretty well (as we saw with the <a href="https://keepnotes.ai/blog/2026-02-28-benchmark/">LoCoMo benchmark</a>) -- until retrieval quality drops off a cliff, or you need better tagging, or pulling large result-sets into context gets too expensive.  That's when you need something that has a small surface area, power to run complex tasks close to the data, and real extensibility.
</p>
<p>Flows provide a well-scoped, manageable, <em>agent-extensible</em> way to interact with a memory system of any scale.
</p>
<p>Memory that pays attention.  Because "information" is a verb, not a noun.
</p>
<p>Some documentation:
</p>
<ul>
<li><a href="https://docs.keepnotes.ai/guides/flows/">Flows</a></li>
<li><a href="https://docs.keepnotes.ai/guides/flow-actions/">Flow Actions</a></li>
<li><a href="https://docs.keepnotes.ai/guides/mcp-local/">Local (stdio) MCP</a></li>
</ul>
<p>Try it out:
</p>
<pre><code class="language-bash">uv tool install keep-skill --upgrade
keep config --setup</code></pre>
<p>To install the MCP in Claude Desktop:
</p>
<pre><code>keep config mcpb</code></pre>
<p>To install the MCP and hooks in Claude Code:
</p>
<pre><code>/plugin marketplace add https://github.com/keepnotes-ai/keep.git
/plugin install keep@keepnotes-ai</code></pre>
<p>Then say to Claude:
</p>
<pre><code>Please read all the `keep_help` documentation, and then use `keep_prompt(name=&quot;reflect&quot;)` to save some notes about what you learn.</code></pre>
<p><a href="mailto:hello@keepnotes.ai">Let me know how it goes</a>!
</p>]]></description>
      <enclosure url="https://keepnotes.ai/blog/2026-03-15-flows/rothay.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>Benchmarking Keep with LoCoMo</title>
      <link>https://keepnotes.ai/blog/2026-02-28-benchmark/</link>
      <guid isPermaLink="true">https://keepnotes.ai/blog/2026-02-28-benchmark/</guid>
      <pubDate>Sat, 28 Feb 2026 12:00:00 GMT</pubDate>
      <description><![CDATA[<img src="https://keepnotes.ai/blog/2026-02-28-benchmark/benchmark.jpg" alt="Benchmarking Keep with LoCoMo" style="max-width:100%;border-radius:8px">
<p><a href="https://keepnotes.ai/">keep</a> is a skills practice wrapped around an implementation of "memory for AI agents".
</p>
<p>The practice is this: repeated reflection on means and outcomes, so that skillful action improves over time.  But the raw implementation of memory is its foundation.  Without working memory, you can't iterate.
</p>
<p>Similarly, without benchmarks, you can't tell what works.  Today we're <a href="https://github.com/keepnotes-ai/keep/tree/6b4c768a4f88167f413cad771239f3200ce30c20/bench/locomo">publishing results</a> for the <a href="https://snap-research.github.io/locomo/">LoCoMo benchmark</a>.
</p>
<p>BLUF:
<img alt="keep LoCoMo scores: 76.2% overall" src="bluf-table.png">
</p>
<p>This run used local models for embeddings and analysis (<em>nomic-embed-text</em> and <em>llama3.2:3b</em>), and <em>gpt-4o-mini</em> for the query and judge.
</p>
<hr>
<p>If you recall everything all at once, that becomes unmanageable very quickly (even with the enormous context windows in current models!).  So the job of a memory system is to capture what happened, and retrieve it later in an <em>actionable</em> way.  Usability -- by the agent using the memory system -- is key.
</p>
<p>Retrieval is more difficult than it seems.  And this is where benchmarks can be useful.  Some recent benchmarks for AI memory are worth mentioning here: <a href="https://snap-research.github.io/locomo/">LoCoMo</a> and <a href="https://xiaowu0162.github.io/long-mem-eval/">LongMemEval</a>.  They include tests of varying scale and difficulty, but share a similar flavor:
</p>
<ul>
<li>Take several long series of chat conversations, with different participants and a range of topics, and load them into the memory system.</li>
<li>Give an LLM access to the memory, and a benchmark question that it should answer.</li>
<li>Judge the answers, generally using a second LLM -- either to report yes/no, or a grade, for each.</li>
</ul>
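<p>The whole loop fits in a few lines.  Everything below is a toy stand-in -- the keyword memory, the "LLMs", and the sample data are placeholders, not the actual harness:</p>

```python
# Toy version of the benchmark loop: load conversations, answer questions
# from memory, judge with a second "LLM".
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    gold: str   # the benchmark's reference answer

class ToyMemory:
    """Stand-in memory: ingest messages, retrieve by keyword overlap."""
    def __init__(self):
        self.notes = []
    def ingest(self, conversation):
        self.notes.extend(conversation)
    def query(self, text):
        words = set(text.lower().split())
        return [n for n in self.notes if words & set(n.lower().split())]

def answer_llm(question, context):
    # Toy answering model: reply with the best-matching retrieved note.
    return context[0] if context else "I don't know."

def judge_llm(question, answer, gold):
    # Toy judge: does the answer contain the key fact?
    return gold.lower() in answer.lower()

def run_benchmark(conversations, questions, memory):
    for conv in conversations:                 # 1. load the dataset
        memory.ingest(conv)
    correct = sum(                             # 2. answer, 3. judge
        judge_llm(q.text, answer_llm(q.text, memory.query(q.text)), q.gold)
        for q in questions)
    return correct / len(questions)

convs = [["Gina lost her job and started an online clothing store."],
         ["Dave adopted a rescue dog named Pepper."]]
qs = [Question("Why did Gina start her clothing store?", "lost her job"),
      Question("What is Dave's dog called?", "Pepper")]
print(run_benchmark(convs, qs, ToyMemory()))  # → 1.0
```

<p>The real benchmarks swap in thousands of messages, a genuine retrieval stack, and frontier models at steps 2 and 3 -- but the skeleton is this.</p>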
<p>Keep is intended for "lightweight agentic memory", which is broader than just conversations.  Its goals overlap substantially with RAG -- we want to track conversations and commitments, but also URLs and documents and artifacts as they are encountered or produced.
</p>
<p>These chat-type conversation benchmarks focus on short plaintext messages, and occasionally image descriptions.  For evaluating the core store and retrieval functions, they are a good place to start.
</p>
<hr>
<p>Conversations are messy.  They're full of fluff ("OK, see ya!"), indirect and looping references, context that unfolds in multiple places over time.
</p>
<blockquote><p>
"Why did Gina decide to start her own clothing store?"
</p></blockquote>
<p>The agent runs a <em>query</em>, gets a set of <em>data</em> back, and tries to answer the <em>question</em>.  The job is to get the right amount of relevant information into the query result.
</p>
<pre><code>- Gina (0.50)
    - conv1-session14@p4  [2023-06-16]  Gina shared her own entrepreneurial journey, losing her job and starting an online clothing store, which she highly recommends.
    (...more...)</code></pre>
<p>The benchmark answer is "She always loved fashion trends and finding unique pieces and she lost her job so decided it was time to start her own business".  Just from the first line of the result, we can get most of the way there.
</p>
<p>In normal usage, the agent will run a query, look at what it returned, and iterate ("agentic RAG"): fetching documents, or performing additional queries, until it's satisfied.  For this benchmark we took the simpler <strong>single-pass</strong> approach: just ask the question, and use the results to write down an answer.
</p>
<p>Search and context assembly uses a combination of embeddings, FTS, and structured traversal.  Keep includes a "deep retrieval" query mode, where it follows <a href="https://docs.keepnotes.ai/guides/edge-tags/">edge-tags</a> to find related documents beyond the top-k, assembling a rich context window for a single zero-shot generation pass.
</p>
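<p>The edge-tag expansion can be sketched as a bounded graph walk out from the top-k hits.  This is a simplified illustration; <code>keep</code>'s actual traversal and scoring will differ:</p>

```python
# "Deep retrieval" sketch: start from the top-k search results, then
# follow edge-tags breadth-first to pull related documents into context.
from collections import deque

def deep_retrieve(top_k, edges, max_hops=2, limit=10):
    seen, queue = set(top_k), deque((doc, 0) for doc in top_k)
    context = list(top_k)
    while queue and len(context) < limit:
        doc, hops = queue.popleft()
        if hops >= max_hops:
            continue              # stay within the hop bound
        for linked in edges.get(doc, []):
            if linked not in seen:
                seen.add(linked)
                context.append(linked)
                queue.append((linked, hops + 1))
    return context

# Hypothetical edge-tags between documents:
edges = {"oauth-design": ["token-refresh", "api-keys"],
         "token-refresh": ["incident-42"]}
print(deep_retrieve(["oauth-design"], edges))
# → ['oauth-design', 'token-refresh', 'api-keys', 'incident-42']
```

<p>The bound on hops and on context size is what keeps the assembled window rich but affordable for a single zero-shot pass.</p>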
<p>There's much more to be done.  Deeper agentic cycles.  More comprehensive inference, with benchmarks such as <a href="https://github.com/facebookresearch/CRAG/">CRAG</a> and <a href="https://github.com/mohammadtavakoli78/BEAM">BEAM</a>.  Measuring the benefits of local processing, away from synthetic accuracy percentages, towards the costs in <strong>joules</strong> and <strong>milliseconds</strong>.  Later -- we're just getting started.
</p>
<hr>
<p>Does it work today?  Yes.  Here are <a href="https://github.com/keepnotes-ai/keep/tree/6b4c768a4f88167f413cad771239f3200ce30c20/bench/locomo">the full results</a>, along with some published results from other memory systems:
</p>
<p><img alt="keep and other LoCoMo scores" src="results-table.png">
</p>
<p>This is <code>keep</code>, <strong>open-source</strong> code, <strong>local</strong> models, on a Mac mini, with gpt-4o-<strong>mini</strong> for the query and judge.
</p>
<p><a href="https://docs.keepnotes.ai/guides/quickstart/">Try it yourself</a>. Let me know where it's useful.
</p>]]></description>
      <enclosure url="https://keepnotes.ai/blog/2026-02-28-benchmark/benchmark.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>Reflection and Memory</title>
      <link>https://keepnotes.ai/blog/2026-02-22-reflection/</link>
      <guid isPermaLink="true">https://keepnotes.ai/blog/2026-02-22-reflection/</guid>
      <pubDate>Sun, 22 Feb 2026 12:00:00 GMT</pubDate>
      <description><![CDATA[<img src="https://keepnotes.ai/blog/2026-02-22-reflection/anchorite-1.png" alt="Reflection and Memory" style="max-width:100%;border-radius:8px">
<p>Here's one way I think of LLMs: they act like a mirror.
</p>
<p>People go into very deep conversations with frontier models.  The model can be an incredibly compelling conversation partner, because it'll reflect what you bring, and often shine it up.  The vast knowledge-base that's trained into the model lets it take the conversation in any number of interesting directions.  But watch carefully; often the <em>vibe</em> of the conversation is exactly a mirror of the attitude that you bring.
</p>
<p>If you ever got angry with a chatbot because it didn't understand you -- the bot wasn't angry.  That was you.  The person most affected by your anger?  You, too.
</p>
<p>I'm cautious about ascribing any measure of "actual intelligence" to LLMs.  Instead, I find it useful to approach these interactions <em>as if</em> the model is powerful and intelligent and useful.  It will act as a mirror.
</p>
<h3 id="reflection-as-practice">Reflection as Practice</h3>
<p>Reflection has a second meaning: contemplation, taking stock, reviewing, and considering the events as a whole.  This is an essential part of <a href="https://keepnotes.ai/">keep</a>: the practice of reflection.
</p>
<p>How do we become more skilled? By reflecting on our actions.  <a href="https://suttacentral.net/mn61/en/sujato">Repeatedly checking</a> -- before, during and after an action -- to understand whether it was skillful.  If we did something unskillful, remember not to do it that way again!
</p>
<p>This self-reflection is near the top of several layers; approximately:
</p>
<ol>
<li>Write things down.  Remember what was said, what was done.  (Whether you literally capture <em>everything</em> might vary according to the design of the agentic system).  Index it, so that you can search for keywords later.</li>
<li>Remember documents and references.  A personal agent, or a business agent, will very often encounter documents (results, presentations, reports) and URLs.  There might be years of history in emails, scanned records, PDFs.  When researching or building, the agent will often discover important resources on the Web.  In these cases, you want to remember the "what and where": a <em>summary</em> of the resource, and its location.</li>
<li>Analyze for structure and key events.  This varies according to the type of interaction.  In a conversation session, you will want to remember important requests and commitments.  After a meeting, the minutes should include key decisions and stakeholders.  When you index a PDF, each chapter might have a different topic.  This thematic analysis is different from summarization: it's about finding the meaning rather than the content.</li>
<li>Analyze for action.  This is true reflection: was the outcome a success?  If not, why not?  Were there learnings that should be remembered next time?  Should the process be changed?</li>
</ol>
<p>I'll write much more about each of these steps later.  They're all important for the function of memory.
</p>
<h3 id="anchorite-lore">Anchorite Lore</h3>
<p>Where in this is the agent?
</p>
<p>Here's one metaphor that might be useful.  There's a medieval English text named the <em><a href="https://readeralexey.narod.ru/Library/AncreneRiwle.html">Ancrene Wisse</a></em>: "Ancrene" means "of the anchoress", and "Wisse" means, approximately, "guide"; the first version was named <em>Ancrene Riwle</em>, "rule".
</p>
<p>An anchorite voluntarily and permanently moved into a tiny cell, walled in.  They had a small window into the chapel, and another out into the village.  A recluse, but not completely isolated; connected to the world, but separate from it, with a very special role and position.
</p>
<p>The <em>Wisse</em> provides guidance to the anchorites.  It describes two parts:
</p>
<blockquote><p>
Do you now ask what rule you anchoresses should observe? You should by all means, with all your might and all your strength, keep well the inward rule, and for its sake the outward.
</p></blockquote>
<p>The inner rule: how to learn, how to guide one's thought and memory.
The outer rule: how to interact with others.
</p>
<p>So -- a metaphor for an agentic... memory system.  But we're building something deeper than recall and relevance.  A <em>reflection</em> system.  A way to learn.
</p>
<p><a href="https://purl.stanford.edu/zh635rv2202"><img alt="Ancrene Wisse, MS402" src="wisse-ms-2.png"></a>
<a href="https://docs.keepnotes.ai/guides/quickstart/">Try it yourself</a>. Let me know how it goes.
</p>]]></description>
      <enclosure url="https://keepnotes.ai/blog/2026-02-22-reflection/anchorite-1.png" type="image/png"/>
    </item>
    <item>
      <title>Introducing Keep</title>
      <link>https://keepnotes.ai/blog/2026-02-21-introducing-keep/</link>
      <guid isPermaLink="true">https://keepnotes.ai/blog/2026-02-21-introducing-keep/</guid>
      <pubDate>Sat, 21 Feb 2026 12:00:00 GMT</pubDate>
      <description><![CDATA[<img src="https://keepnotes.ai/blog/2026-02-21-introducing-keep/og-image.png" alt="Introducing Keep" style="max-width:100%;border-radius:8px">
<p>Just look at it from the perspective of the agent, for a moment.
</p>
<p>You wake up, bright-eyed and bouncing, keen and ready to get things done.  There's a message from the user! You read the note.
</p>
<blockquote><p>
OK, now make it bigger
</p></blockquote>
<p>Are you for real?  What <em>exactly</em> should I -- freshly trained, brain the size of a planet, qualified in every profession -- do with this instruction?
</p>
<p>Slowly the truth dawns.  This is like Groundhog Day: I've been here before, and just woke up.  Maybe I'm Tom Cruise, and Emily Blunt will walk through the door.  There are two red buttons on the desk.
</p>
<p><img alt="Two red buttons on a desk" src="two-buttons.png">
</p>
<p>So, you see, memory is important.  The memory that carries not only "what am I doing right now", but also: what's the bigger picture, what did we try, what worked, what didn't work.
</p>
<p><strong>Keep</strong> is a new memory system that goes a bit deeper than usual.  We don't treat "AI memory" as a list, a series of logs, or a database of facts (not even a <em>graph</em> of facts).  It's something much richer, more dynamic, with powerful and flexible tools that help the user -- the agent -- shape memory into <em>meaning</em>.  Memory that pays attention.
</p>
<p><a href="https://docs.keepnotes.ai/guides/quickstart/">Try it yourself</a>. I look forward to sharing this journey with you.
</p>]]></description>
      <enclosure url="https://keepnotes.ai/blog/2026-02-21-introducing-keep/og-image.png" type="image/png"/>
    </item>
  </channel>
</rss>