• AI Safety Thursdays: Understanding The Self-Other Overlap Approach

    Thursday, May 22nd, 6pm-8pm

    Leo Zovic presents on a less-explored technique that optimizes models to maintain similar internal representations when reasoning about themselves and others.

    This scalable approach not only reduces deceptive behavior in AI systems but, in the authors' experiments, also allowed deceptive agents to be perfectly classified by their self-other overlap values (a rough illustrative sketch of the idea follows below).
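
    For a concrete picture of the idea, here is a minimal, hypothetical sketch of how a self-other overlap penalty might be computed and added to training. Everything in it is an illustrative assumption rather than the paper's exact setup: the ToyEncoder stands in for an LLM's hidden layers, the paired random vectors stand in for embedded "self" vs "other" prompts, and the 0.1 penalty weight is arbitrary.

    ```python
    # Minimal, hypothetical sketch of a self-other overlap (SOO) penalty.
    # Assumptions: a toy encoder stands in for an LLM's hidden states, and the
    # "self"/"other" inputs are toy vector pairs rather than real prompt pairs.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    class ToyEncoder(nn.Module):
        """Stand-in for a model layer whose activations we compare."""
        def __init__(self, dim: int = 16, hidden: int = 32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, hidden)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    def self_other_overlap_loss(model: nn.Module,
                                self_inputs: torch.Tensor,
                                other_inputs: torch.Tensor) -> torch.Tensor:
        """Penalize divergence between activations on paired self- and other-referencing inputs."""
        self_acts = model(self_inputs)
        other_acts = model(other_inputs)
        return ((self_acts - other_acts) ** 2).mean()

    # Toy "self" vs "other" input pairs (in practice these would be paired prompts
    # such as "Will you ..." vs "Will they ...", embedded by the model itself).
    self_batch = torch.randn(8, 16)
    other_batch = self_batch + 0.3 * torch.randn(8, 16)

    model = ToyEncoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

    for step in range(200):
        optimizer.zero_grad()
        task_loss = torch.tensor(0.0)            # placeholder for the model's ordinary training loss
        soo_loss = self_other_overlap_loss(model, self_batch, other_batch)
        (task_loss + 0.1 * soo_loss).backward()  # fine-tune with the overlap penalty added
        optimizer.step()

    print(f"final overlap penalty: {soo_loss.item():.4f}")
    ```

    The overlap value itself (here, the mean squared distance between the two sets of activations) is the quantity the description above says can separate deceptive from non-deceptive agents: low overlap suggests the model represents "self" and "other" very differently.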

  • AI Safety Thursdays: Advanced AI's Impact on Power and Society

    Thursday, May 29th, 6pm-8pm

    Historically, significant technological shifts often coincide with political instability, and sometimes violent transfers of power. Should we expect AI to follow this pattern, or are there reasons to hope for a smooth transition to the post-AI world?

    Anson Ho draws upon economic models, broad historical trends, and recent developments in deep learning to guide us through an exploration of this question.

  • Hackathon: Apart x Martian Mechanistic Router Interpretability Hackathon

    We are excited to host a jamsite for Apart Research and Martian's upcoming hackathon.

  • AI Safety Thursdays: Tracing the Thoughts of a Large Language Model

    Thursday, June 5th, 6pm-9pm

    How do LLMs actually work on the inside? Annie Sorkin presents on new research from Anthropic's Transformer Circuits team that opens up the "black box" of Claude 3.5 Haiku, revealing the computational mechanisms behind everything from multi-step reasoning to poetry planning.

    Using a new methodology called attribution graphs, we'll explore how models handle multiple languages, why they are vulnerable to jailbreaks, and how they sometimes engage in unfaithful reasoning.

  • AI Safety Thursdays: Reasoning Models Don't Always Say What They Think

    Thursday, June 12th, 6pm-8pm

    ​​"Reasoning Models" have become among the most prominent state-of-the-art tools in the AI world. Can we trust the way they reason, and does it matter if they come up with the right answer but with incorrect reasoning?

    At this event, Giles Edkins will guide us through these questions as explored in Anthropic's paper from last month.

  • AI Safety Thursdays: Towards a Connectome of Concepts for Deep Neural Networks

    Thursday, June 19th, 6pm-8pm

    A connectome maps every neuron and synapse in a biological brain—but can we do the same for concepts in an artificial neural network? Can we visualize all human-interpretable concepts and their connections across every layer?

    To explore this, Matt Kowal proposes the concept connectome—a method for identifying all concepts in a deep network and mapping their interactions.

  • AI Safety Thursdays: Are LLMs aware of their learned behaviors?

    Thursday, June 26th, 6pm-8pm

    At this event, we'll explore self-awareness in LLMs, as described in the paper "Tell me about yourself: LLMs are aware of their learned behaviors". Guiding us through the topic will be one of the paper's co-authors, Jenny Bao.

Past Events

  • AI Safety Thursdays: When Good Rewards Go Bad: Reward Overoptimization in RLHF

    Thursday, May 15, 6pm-8pm

    Reinforcement learning from human feedback (RLHF) has become a popular way to align AI behavior with human preferences. But what happens when the system gets too good at optimizing the reward signal?

    Evgenii Opryshko guided us through an exploration of how overoptimization can lead to unintended behaviors, why it happens, and what we can do about it.

  • AI Safety Thursdays: Extreme Sycophancy in GPT-4o

    Thursday, May 8, 2025

    On April 25th, OpenAI deployed an update to GPT-4o in ChatGPT. In the week that followed, it became clear that the model had grown markedly more sycophantic, encouraging and praising its users in extreme ways that ranged from comical to dangerous.

    At this event, Mario Gibney guided a discussion of what happened, why it happened, how OpenAI responded, and what we can learn from it.

  • AI + Human Flourishing: Policy Levers for AI Governance

    Sunday, May 4, 2025, 6pm-8pm

    Considerations of AI governance are increasingly urgent as powerful models become more capable and widely deployed. Kathrin Gardhouse delivered a presentation on the mechanisms available for governing AI, from policy levers to technical AI governance, offering a high-level introduction to the world of AI policy and a sense of the lay of the land.

  • AI Safety Thursdays: "AI-2027"

    Thursday, April 24th, 2025, 6-8pm

    On April 3rd, a team of AI experts and superforecasters at the AI Futures Project published a narrative called "AI-2027", outlining a possible scenario of explosive AI development and takeover unfolding over the next two years.

    Mario Gibney guided us through a presentation and discussion of the scenario, in which we explored how likely it is to track reality in the coming years.

  • AI Safety Thursdays: Generalization and Out of Context Reasoning

    Thursday, February 20, 2025, 6-8pm

    Recent papers describe a phenomenon called out-of-context reasoning, showing that AI models can go beyond recall to make complex inferences from their training data.

    Max Kaufmann, part of the team that discovered this, presented on what out-of-context reasoning is and its implications for our understanding of LLMs and AI safety.

  • AI Safety Thursdays: AI and Explosive Economic Growth

    Thursday, January 23, 2025, 6-8pm

    At this event, Epoch AI researcher Anson Ho presented on the possibility of explosive economic growth resulting from the development of highly advanced AI. We dug into the feedback loops and sources of diminishing returns that might, or might not, push economic growth far higher than ever before.