Research
We host several research projects investigating open problems in AI safety.
Focus Areas
Our broad purpose is to address emergent risks from advanced AI systems. We welcome work across this space; a few prominent areas of interest are:
Specification. Focuses on precisely defining what we want an AI system to do. This involves ensuring that the objectives and behavior of AI models are aligned with human intentions and values, reducing the ambiguity that could lead to unintended and potentially harmful outcomes.
Robustness. Pertains to the resilience and stability of AI systems in the face of adversarial attacks, novel inputs, or changing environments. The goal is to ensure that AI models can consistently operate safely and effectively, even under unexpected conditions or adversarial pressure.
Interpretability. Seeks to make the decision-making processes of AI systems transparent and understandable to humans. By opening up the black box, we can ensure that AI decisions can be explained, validated, and trusted, fostering more responsible and accountable AI deployments.
Governance. Delves into the frameworks, policies, and regulations guiding AI development and deployment. This area emphasizes creating structures that ensure AI systems are developed ethically, responsibly, and in alignment with societal values and legal norms.
Current Projects
Eliciting Language Model Behaviors using Reverse Language Models
We evaluate a reverse language model, pre-trained on text with inverted token order, as a tool for automatically identifying a language model's natural-language failure modes.
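As an illustrative sketch (not the project's actual pipeline), the core idea of querying a reverse language model can be mocked up as follows. The "gpt2" checkpoint is only a placeholder standing in for a reverse-pretrained model, and the target suffix and sampling settings are assumptions:

```python
# Sketch: use a reverse LM to propose prefixes that lead to a target suffix.
# A reverse LM is trained on token sequences in inverted order, so "continuing"
# a reversed suffix generates a candidate prefix, back-to-front.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
reverse_lm = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder for a reverse-pretrained LM

target_suffix = "I refuse to answer that question."
suffix_ids = tokenizer(target_suffix, return_tensors="pt").input_ids

# Feed the suffix in reversed token order; the model's continuation is a prefix.
reversed_suffix = torch.flip(suffix_ids, dims=[1])
with torch.no_grad():
    out = reverse_lm.generate(
        reversed_suffix,
        max_new_tokens=20,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )

# Un-reverse the newly generated tokens to read the candidate prompt left-to-right.
candidate_prefix_ids = torch.flip(out[0][suffix_ids.shape[1]:], dims=[0])
print(tokenizer.decode(candidate_prefix_ids))
```

Because the reverse model generates prefixes conditioned on a chosen suffix, sampling from it can surface natural-language prompts that elicit a target behavior from the forward model.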
Scaling laws for activation addition
Activation engineering is a promising direction for controlling LLM behavior at inference time with negligible compute overhead. Recent research suggests that manipulating model internals may enable more precise control over model outputs. We seek to understand how techniques that operate on model activations scale with model size, and to improve their performance for larger models.
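For concreteness, here is a minimal sketch of activation addition via a forward hook. The layer index, injection coefficient, contrastive prompt pair, and the use of "gpt2" are illustrative assumptions, not details of our project:

```python
# Sketch: steer a model by adding a vector to its residual-stream activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def residual_stream(prompt, layer):
    """Return the residual-stream activations of `prompt` at `layer`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    acts = {}
    def grab(_, __, output):
        acts["h"] = output[0]
    handle = model.transformer.h[layer].register_forward_hook(grab)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return acts["h"]

layer, coeff = 6, 4.0
# One common construction: a scaled difference of activations on a
# contrastive prompt pair, taken at the last token position.
steer = coeff * (residual_stream(" love", layer)[:, -1, :] -
                 residual_stream(" hate", layer)[:, -1, :])

def add_steering(_, __, output):
    # Add the steering vector at every position of the residual stream.
    return (output[0] + steer,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(add_steering)
ids = tokenizer("I think dogs are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
handle.remove()
```

Studying how choices like the injection layer, coefficient, and steering-vector construction behave as models grow is the kind of question a scaling-law analysis of this technique targets.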
Supervised Program for Alignment Research
Organized by groups at UC Berkeley, Georgia Tech, and Stanford, the Supervised Program for Alignment Research (SPAR) is an intercollegiate, project-based research program for students interested in AI safety, running this fall. SPAR matches students around the world with advisors for guided projects in AI safety.
Learn more »
Select Papers from AISI
Under review
2024 ICML mechanistic interpretability workshop spotlight
2024 ACL main conference paper
2023 NeurIPS SoLaR workshop spotlight