Kerrick's AI safety + ML reading list

This is a list of papers and blog posts related to AI safety and ML that I've read or want to read (plus a few broader SWE-related entries). I'm publishing this in an effort to learn in public. Last updated 2026-07-18.

The star ratings are entirely for me and are mostly based on how relevant the work is to my interests and how easily I was able to digest useful insights from it. The rating is roughly how likely I am to refer back to the piece in the future or recommend it to others (the piece itself, as opposed to a summary of it). A low rating does not mean the work or results are "bad". The commentary is dictated off-the-cuff and all opinions are weakly held.

Read

Tips for Empirical Alignment Research ★★★★⯪
Tons of great advice in here. I've returned to it several times.
Research as a Stochastic Decision Process ★★★★⯪
It's the primary framework with which I think about my research progress.
How to Scale Your Model ★★★★☆
Fantastic resource on scaling training runs (especially pretraining on TPUs). In-depth coverage of sharding schemes, bottlenecks (so many bottlenecks!), and throughput math. Would be 5/5 stars if I were still primarily working in this area.
Agentic Engineering Patterns - Simon Willison's Weblog ★★★★☆
Lots of great ideas on how to effectively use coding agents, from small tips and tricks to broader points about test-driven development and managing context. I had already adopted a lot of the practices described here, and I nonetheless took away a few new things. I'd recommend it to anyone not already deep in the weeds on agentic engineering.
AuditBench ★★★★☆
Spiritual successor to "Auditing language models for hidden objectives", but with a greater diversity of model organisms and with agents instead of humans doing the auditing. It found that scaffolded black-box methods were the most effective of the audit methods tested.
Policy on the AI Exponential ★★★★☆
Dario does a good job of laying out the ways in which governments will need to interact with and respond to AI advancements. This is written for a smart lay audience and doesn't go into enough detail in most areas (for example, it doesn't discuss the possibility of coordinating a pause at all).
Auditing language models for hidden objectives ★★★★☆
Enjoyed this paper. Found the writing detailed and easy to digest. Both the construction of the model organism and the analysis of the blue team process (plus the commentary on the relative usefulness of different tools) are valuable contributions. I think this is one of the first papers to observe that merely describing (rather than demonstrating) AI behaviors in pretraining/midtraining corpora is sufficient to cause AIs to exhibit those behaviors (this was later confirmed in AuditBench and Teaching Claude Why).
MiMo V2 Flash Technical Report ★★★★☆
Lots of good ideas, good detail.
LoRA Without Regret ★★★★☆
Helps build a lot of good intuition about LoRA.
Highly Opinionated Advice on How to Write ML Papers — AI Alignment Forum ★★★★☆
Lots of good stuff in here.
Monitoring Monitorability | OpenAI ★★★★☆
This paper says a lot and really does a good job of establishing a framework for thinking about monitorability. The numerical results are sometimes noisy, which makes it hard to interpret them.
Petri: An open-source auditing tool to accelerate AI safety research ★★★⯪☆
Petri seems like a useful tool. No super groundbreaking ideas here, but I like the diversity of misaligned behaviors and audit scenarios the authors describe. This wasn't the main point (the authors present them as a starting point), but it was nonetheless one of my biggest takeaways.
Deepseek V3.2 Technical Report ★★★⯪☆
Reasonably solid. It feels like, for better or for worse, a lot of the ideas are hacks, so I didn't take away a lot of internalizable knowledge from this.
An FAQ on Reinforcement Learning Environments | Epoch AI ★★★⯪☆
Good for getting an understanding of how RL envs work in practice.
Speculative Sampling ★★★⯪☆
A simple but good idea and the math is fun.
Defeating Nondeterminism in LLM Inference - Thinking Machines Lab ★★★⯪☆
Lots of good detail, but the post takes a while to reach its main point: GPU kernels are often not batch-size invariant.
Early work on monitorability evaluations - METR ★★★⯪☆
Good ideas, although I didn't have a lot of takeaways.
Detecting misbehavior in frontier reasoning models | OpenAI ★★★⯪☆
This rating is specifically for the blog post, not the paper. It's a nice result that clearly demonstrates that CoT pressure degrades monitorability without reducing bad behavior much.
Why it's hard to make settings for high-stakes control research ★★★⯪☆
Makes some good points. Buck has a lot of perspective on this.
Teaching Claude Why ★★★⯪☆
A pretty cool set of results. Feels like there's more to be said here.
Claude's Constitution ★★★⯪☆
(I didn't finish the whole thing.) Significant and nuanced contribution to the field of AI alignment with lots of good detail on how Anthropic wants Claude to behave. As the authors note, it's intended more for Claude itself than for human readers, and it's quite lengthy. Emphasizes Claude's autonomous moral judgment over blind deference to human oversight.
Flash Attention ★★★☆☆
It took me a while to digest the algorithm and why it's faster. Feels like this could work better as a blog post with some visualizations.
LoRA ★★★☆☆
LoRA is a very simple concept. You could just ask an LLM to explain it to you in like 30 seconds, but the paper is not bad.
GLU Variants Improve Transformer ★★⯪☆☆
Famously, the authors of this paper don't know why the technique works, so it's not very insightful, but everyone uses this and the research is good.
RoPE paper ★★⯪☆☆
Everyone uses this, but I felt that learning this from the paper wasn't super helpful. It's easier to just ask an LLM to explain it to you.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model ★★☆☆☆
I didn't "get" it from reading the paper, but learned the idea a lot faster just by asking an LLM to explain it to me.
Kimi 2.5 Technical Report ★⯪☆☆☆
It felt very fluffy.
Coherent Extrapolated Volition ★☆☆☆☆
I abandoned this because the writing style made it hard to read. The LessWrong summary is a much better articulation of this idea.

To Read