A newcomer's experience attending a two-day alignment research workshop, and what I learned from it.
March 2026
Epistemic humility note: I'm relatively new to AI safety. I joined the field through the Cambridge ERA:AI Safety Fellowship, and this was my first workshop in this space. What follows is a personal account of my experience, not expert commentary. I may misunderstand nuances that veterans take for granted, and my impressions are shaped by a sample size of one workshop. I'm writing this in the spirit of Substack-style experience sharing: an informal, first-person account that might be useful to others considering entering the field or attending similar events.
The FAR.AI London Alignment Workshop took place on March 2-3, 2026. It's part of a series of alignment workshops that FAR.AI has organized in cities including San Diego, Singapore, Vienna, New Orleans, and San Francisco.
My first surprise was how selective the process was. The application required an Expression of Interest that asked for a summary of who you are (previous experience and interest in alignment work), your goals for attending (questions you'd like answered or progress you'd like to make), and who you'd like to connect with at the event. The ERA:AI fellowship itself has an acceptance rate of less than 1%, but even among fellows, roughly 15% of those who applied were invited to attend; I know of about four fellows from my cohort who were. So the bar to even be in the room was quite high, which shaped the experience in ways I didn't fully anticipate.
The organizers strongly encouraged using an app called SwapCard, which listed all attendees and made it easy to see who was there and to schedule 1:1 meetings. Even before the workshop started, I was getting 1:1 invitations from other attendees, and I ended up scheduling three or four per day.
But the scheduled meetings were just the starting point. Conversations happened organically too — over coffee, over meals, by joining someone else's 1:1 when topics overlapped. By the end, I'd had far more conversations than I had planned.
What surprised me was how willing people were to engage with a newcomer. I think the selectivity of the workshop helped here — everyone had already passed a certain bar, so there was a baseline level of trust. Many attendees seemed genuinely curiosity-driven and eager to talk about research interests, regardless of seniority. The community felt friendlier and more approachable than I expected, and the combination of high selectivity and good facilitation through SwapCard probably contributed to that.
The talk quality was high across the board. The opening and keynotes were strong, but I was also impressed by the shorter ~10-minute talks — they were dense and well-prepared. For a newcomer, these served as a rapid survey of what's going on in the field. I found myself learning new terminology quickly just by being exposed to many talks in sequence.
I also attended a discussion group on interpretability later in the day, which helped me go deeper on the area most relevant to my own research.
Having so many 1:1s meant I got to explain my ERA project — developing evaluation metrics for machine unlearning methods — many times over two days. Each time, my description got a little more refined. By the end, I could articulate what I was doing and why it mattered more clearly than when I arrived.
I had a specific question going in. My project involves evaluating whether unlearning methods truly remove dangerous knowledge from a model or merely suppress it at the output level. Current behavioral metrics (which measure model responses) can't distinguish between the two. So I'm interested in looking at model internals to find structural metrics that can.
This led me to a question I asked many people: why are activation-based approaches to studying model internals more prevalent in the field, when activations are input-dependent and therefore potentially confounded by the choice of evaluation data? Weight-based approaches (like SVD-based metrics) seem like they'd suffer less from this issue. What are the tradeoffs?
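To make the distinction concrete, here is a minimal sketch of the two metric families on a toy linear layer. Everything in it is illustrative: the random matrices stand in for real transformer weights, and both metric definitions are my own simplifications for exposition, not anyone's published method.

```python
# Toy contrast between weight-based and activation-based metrics
# for comparing a model before and after an "unlearning" edit.
import torch

torch.manual_seed(0)

d = 64
W_before = torch.randn(d, d)                   # layer weights before the edit
W_after = W_before + 0.01 * torch.randn(d, d)  # the same layer after a small edit

# Weight-based metric: compare the singular value spectra directly.
# No inputs are involved, so the number cannot be confounded by
# the choice of evaluation prompts.
s_before = torch.linalg.svdvals(W_before)
s_after = torch.linalg.svdvals(W_after)
spectral_shift = torch.linalg.norm(s_before - s_after) / torch.linalg.norm(s_before)
print(f"relative spectral shift:   {spectral_shift:.4f}")

# Activation-based metric: compare activations on a probe dataset.
# The result depends on which inputs X we happened to pick,
# i.e. the input-dependence in question.
X = torch.randn(128, d)                        # a hypothetical probe set
act_before = torch.relu(X @ W_before.T)
act_after = torch.relu(X @ W_after.T)
activation_shift = torch.linalg.norm(act_before - act_after) / torch.linalg.norm(act_before)
print(f"relative activation shift: {activation_shift:.4f}")
```

The point of the toy example is the contrast: the activation-based number changes if you swap in a different probe set X, while the spectral one does not. That input-dependence is exactly what my question was about, though of course real metrics on real transformers involve far more subtlety than this sketch suggests.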
The responses were varied and helpful. Some researchers had a clear preference for activation-based methods. Others acknowledged the tradeoffs of both approaches and asked in return: what specific research question am I trying to answer? That question, turned back on me, was more useful than any direct answer. It helped me think more carefully about what I actually need from a metric, rather than picking an approach first.
This is a genuinely open question in the field. Both approaches have their limitations, and the right choice depends on what you're trying to measure. Having many people push back on my assumptions from different angles over two days helped me refine my thinking more than I could have done alone in the same time.
In two separate 1:1s, different people independently suggested I talk to the same person because his work overlapped closely with mine. When I did, it was immediately productive — he had thought about the problem longer than I had, acknowledged that both SVD-based and activation-based approaches have their own limitations, and gave me feedback as if he were reviewing my project as a paper. That kind of targeted, expert feedback is hard to find outside of a setting like this.
Over a coffee break, I met a researcher who, after hearing about my work, shared a paper she was working on and suggested I reach out after the fellowship for a possible collaboration. This was entirely organic — not something I went in looking for.
One realization that crystallized through the workshop conversations: I'm entering a fundamentally different complexity regime from what I've worked with before.
In my physics research and industry ML work, the data I dealt with was mostly scalar. I worked with models like XGBoost, GAMs, and LSTMs: not simple, but interpreting them doesn't require navigating very high-dimensional spaces in the same way. With a solid background in statistical data analysis, making sense of those models was manageable.
But with transformer-based architectures, the complexity increases by orders of magnitude. Model weights live in very high-dimensional spaces and are sensitive to data and hyperparameters. Even finding a clean, controlled setup where you can make causal claims is hard. This isn't just "harder" — it's a qualitatively different kind of problem. Talking with many people at the workshop accelerated this realization and gave me a better sense of where the real difficulties lie.
If I had to summarize: the talks gave me a rapid survey of the field, but the conversations were where the real value was. Two days of concentrated 1:1s with researchers who have thought about these problems longer than I have compressed a lot of learning into a short time.
The community was more approachable than I expected. People seemed genuinely interested in exchanging ideas, and the workshop structure — the selectivity, SwapCard, the meals, the discussion groups — was designed to make that happen.
On day 1 of the workshop, I reshared a FAR.AI post on LinkedIn (where I hardly ever post), noting that this was my first AI safety workshop and a sign that my career-transition effort is going somewhere. It got more attention than I expected, including a comment from FAR.AI saying they hope to see me at future events. A small thing, but a nice signal of a welcoming community.
I came away with a sharper understanding of my own project, a better grasp of the field's open questions, several new connections, and a possible future collaborator. For a first workshop in a new field, that felt like a good outcome.