This is a list of papers and blog posts that I've read or want to read, related to AI safety and ML (plus a few broader SWE-related entries). I'm publishing this in an effort to learn in public. Last updated 2026-06-22.
The star ratings are entirely for me and are mostly based on how relevant the work is to my interests and how easily I was able to digest useful insights from the work. The rating is roughly how likely I am to refer back to this piece in the future or recommend it to others (directly vs a summary). A low rating does not mean the work or results are "bad". The commentary is dictated off-the-cuff and all opinions are weakly held.
Tons of great advice in here. I've returned to it several times.
It's the primary framework with which I think about my research progress.
Spiritual successor to "Auditing language models for hidden objectives", but with a greater diversity of model organisms and with agents instead of humans doing the auditing. It found that scaffolded black-box methods were the most effective audit methods among the ones tested.
Dario does a good job of laying out the ways in which governments will need to interact with and respond to AI advancements. This is written for a smart lay audience and doesn't go into enough detail in most areas (for example, it doesn't discuss the possibility of coordinating a pause at all).
Enjoyed this paper. Found the writing detailed and easy to digest. Both the construction of the model organism and the analysis of the blue team process (plus the commentary on the relative usefulness of different tools) are valuable contributions. I think this is one of the first papers to observe that merely describing (rather than demonstrating) AI behaviors in pretraining/midtraining corpora is sufficient to cause AIs to exhibit those behaviors (this was later confirmed in AuditBench and Teaching Claude Why).
Lots of good ideas, good detail.
Helps build a lot of good intuition about LoRA.
Lots of good stuff in here.
This paper says a lot and does a good job of establishing a framework for thinking about monitorability. The numerical results are sometimes noisy, which makes them hard to interpret.
Petri seems like a useful tool. No super groundbreaking ideas here, but I like the diversity of misaligned behaviors and audit scenarios the authors described. This wasn't the main point (the authors describe these as a starting point), but it was nonetheless one of my biggest takeaways.
Reasonably solid. It feels like, for better or for worse, a lot of the ideas are hacks, so I didn't take away a lot of internalizable knowledge from this.
Good for getting an understanding of how RL envs work in practice.
A simple but good idea and the math is fun.
Lots of good detail, but the main point is that GPU kernels are often not batch-size invariant, and it takes a while to get there.
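A minimal sketch of why reduction order matters at all, assuming the underlying cause is that floating-point addition is not associative (my framing, not code from the post). The helper names `fold` and `sum_with_split` are hypothetical; the split index stands in for however a kernel happens to partition a reduction, which can change with batch size.

```python
def fold(values):
    """Plain left-to-right float accumulation (no compensated summation)."""
    total = 0.0
    for v in values:
        total = total + v
    return total


def sum_with_split(values, k):
    """Reduce values[:k] and values[k:] separately, then combine the partials.

    The split index k is a stand-in for a kernel's reduction strategy.
    """
    return fold(values[:k]) + fold(values[k:])


values = [0.1, 0.2, 0.3]
a = sum_with_split(values, 1)  # 0.1 + (0.2 + 0.3)
b = sum_with_split(values, 2)  # (0.1 + 0.2) + 0.3
print(a, b, a == b)
```

The two results are mathematically the same sum but differ in the last bit, so anything that changes how a reduction is partitioned can change the output.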
Good ideas, although I didn't have a lot of takeaways.
This is specifically for the blog post, not the paper. It's a nice result that clearly demonstrates that CoT pressure degrades monitorability without reducing bad behavior much.
Makes some good points. Buck has a lot of perspective on this.
A pretty cool set of results. Feels like there's more to be said here.
(I didn't finish the whole thing.) Significant and nuanced contribution to the field of AI alignment, with lots of good detail on how Anthropic wants Claude to behave. As the authors note, it's intended more for Claude itself than for humans to read, and it is quite lengthy. Emphasizes Claude's autonomous moral judgement instead of blind deferral to human oversight.
It took me a while to digest the algorithm and why it's faster. Feels like this could work better as a blog post with some visualizations.
LoRA is a very simple concept. You could just ask an LLM to explain it to you in like 30 seconds, but the paper is not bad.
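For reference, the 30-second version can be sketched in a few lines of plain Python. This is a toy illustration under my own assumptions, not the paper's code: the frozen weight `W` gets a low-rank update `B @ A` scaled by `alpha / r`, and `B` starts at zero so the adapter is initially a no-op. All names (`matvec`, `lora_forward`, `param_counts`) and the dimensions are hypothetical.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank path through rank r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_in, d_out, r):
    """Trainable parameters: full fine-tuning vs. a rank-r adapter."""
    return d_in * d_out, r * (d_in + d_out)


d_in, d_out, r, alpha = 4, 3, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: adapter starts as a no-op
x = [1.0, 2.0, 3.0, 4.0]

y = lora_forward(W, A, B, x, alpha, r)
print(y == matvec(W, x))            # adapter contributes nothing before training
print(param_counts(1024, 1024, 8))  # (full, low-rank) trainable parameter counts
```

The parameter count is the whole pitch: the adapter trains `r * (d_in + d_out)` numbers instead of `d_in * d_out`.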
Famously, the authors of this paper don't know why the technique works, so it's not very insightful, but everyone uses this and the research is good.
Everyone uses this, but I felt that learning this from the paper wasn't super helpful. It's easier to just ask an LLM to explain it to you.
I didn't "get" it from reading the paper, but learned the idea a lot faster just by asking an LLM to explain it to me.
It felt very fluffy.
I abandoned this because the writing style made it hard to read. The LessWrong summary is a much better articulation of this idea.