This is a list of papers and blog posts related to AI safety and ML that I've read or want to read (plus a few broader SWE-related entries). I'm publishing this in an effort to learn in public. Last updated 2026-05-17.
The star ratings are entirely for me and are mostly based on how relevant the work is to my interests and how easily I was able to digest useful insights from the work. A low rating does not mean the work or results are "bad". The commentary is dictated off-the-cuff and all opinions are weakly held.
Tons of great advice in here. I've returned to it several times.
It's the primary framework with which I think about my research progress.
Lots of good ideas, good detail.
Helps build a lot of good intuition about LoRA.
Lots of good stuff in here.
This paper covers a lot of ground and does a good job of establishing a framework for thinking about monitorability. The numerical results are sometimes noisy, which makes them hard to interpret.
Reasonably solid. It feels like, for better or for worse, a lot of the ideas are hacks, so I didn't take away a lot of internalizable knowledge from this.
Good for getting an understanding of how RL envs work in practice.
A simple but good idea and the math is fun.
Lots of good detail, but the main point is that GPU kernels are often not batch-size invariant, and it takes a while to get there.
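To make the batch-size point concrete for myself, here is a minimal sketch in plain Python. It is not code from the post and not a real GPU kernel. It assumes that a kernel may split a reduction into chunks differently depending on batch size, and `chunked_sum` is a hypothetical stand-in for that behavior. Since floating-point addition is not associative, the same inputs can sum to different results under different splits.

```python
def chunked_sum(values, chunk_size):
    """Sum `values` by reducing fixed-size chunks first, then the partial sums.

    Hypothetical stand-in for a kernel whose reduction split depends on
    batch size. Explicit loops keep the addition order exactly as written.
    """
    partials = []
    for start in range(0, len(values), chunk_size):
        acc = 0.0
        for v in values[start:start + chunk_size]:
            acc += v
        partials.append(acc)

    total = 0.0
    for p in partials:
        total += p
    return total


values = [1e16, 1.0, -1e16, 1.0]

# One chunk: ((1e16 + 1.0) - 1e16) + 1.0
one_chunk = chunked_sum(values, 4)

# Two chunks: (1e16 + 1.0) + (-1e16 + 1.0)
two_chunks = chunked_sum(values, 2)

print(one_chunk, two_chunks)  # 1.0 0.0
```

The inputs are identical in both calls; only the grouping of the additions changes, which is enough to change the answer.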
Good ideas, although I didn't have many takeaways.
This is specifically for the blog post, not the paper. It's a nice result that clearly demonstrates that CoT pressure degrades monitorability without reducing bad behavior much.
Makes some good points. Buck has a lot of perspective on this.
A pretty cool set of results. Feels like there's more to be said here.
It took me a while to digest the algorithm and why it's faster. Feels like this could work better as a blog post with some visualizations.
LoRA is a very simple concept. You could just ask an LLM to explain it to you in like 30 seconds, but the paper is not bad.
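For reference, the core idea fits in a few lines. This is a minimal sketch in plain Python, not the paper's implementation: the frozen weight `W` is left untouched and a low-rank update `(alpha / rank) * B @ A` is added on top, with `B` initialized to zero so training starts from the base model. The dimensions, the helper names (`matmul`, `lora_weight`), and the toy values are all assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy sizes, chosen only for illustration.
d_out, d_in, rank, alpha = 6, 8, 2, 4.0
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable, rank x d_in
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, d_out x rank, zero init

full_params = d_out * d_in           # parameters in W
lora_params = rank * (d_in + d_out)  # parameters in A and B

# With B at zero, the adapted weight equals the base weight.
print(lora_weight(w, a, b, alpha, rank) == w)  # True
print(full_params, lora_params)                # 48 28
```

At these toy sizes the savings look small, but the trainable count grows with `rank * (d_in + d_out)` instead of `d_in * d_out`, so the gap widens quickly for large layers.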
Famously, the authors of this paper don't know why the technique works, so it's not very insightful, but everyone uses this and the research is good.
Everyone uses this, but I felt that learning this from the paper wasn't super helpful. It's easier to just ask an LLM to explain it to you.
I didn't "get" it from reading the paper, but I learned the idea a lot faster just by asking an LLM to explain it to me.
It felt very fluffy.
I abandoned this because the writing style made it hard to read. The LessWrong summary is a much better articulation of this idea.