Research
We host several research projects investigating open problems in AI safety.
Focus Areas
Our broad purpose is to address emergent risks from advanced AI systems. We welcome work across this space; a few prominent areas of interest are:
Specification. Focuses on precisely defining what we want an AI system to do. This involves ensuring that the objectives and behavior of AI models are aligned with human intentions and values, reducing the ambiguity that could lead to unintended and potentially harmful outcomes.
Robustness. Pertains to the resilience and stability of AI systems in the face of adversarial attacks, novel inputs, or changing environments. The goal is to ensure that AI models can consistently operate safely and effectively, even under unexpected conditions or adversarial pressure.
Interpretability. Seeks to make the decision-making processes of AI systems transparent and understandable to humans. By opening up the black box, we can ensure that AI decisions can be explained, validated, and trusted, fostering more responsible and accountable AI deployments.
Governance. Delves into the frameworks, policies, and regulations guiding AI development and deployment. This area emphasizes creating structures that ensure AI systems are developed ethically, responsibly, and in alignment with societal values and legal norms.
Current Projects
Eliciting Language Model Behaviors using Reverse Language Models
We evaluate a reverse language model, pre-trained on text with inverted token order, as a tool for automatically identifying a language model's natural-language failure modes.
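As an illustrative sketch (not the project's actual pipeline), the core idea of querying a reverse language model can be mocked up as follows. The "gpt2" checkpoint is only a placeholder standing in for a reverse-pretrained model, and the target suffix and sampling settings are assumptions:

```python
# Sketch: use a reverse LM to propose prefixes that lead to a target suffix.
# A reverse LM is trained on token sequences in inverted order, so "continuing"
# a reversed suffix generates a candidate prefix, back-to-front.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
reverse_lm = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder for a reverse-pretrained LM

target_suffix = "I refuse to answer that question."
suffix_ids = tokenizer(target_suffix, return_tensors="pt").input_ids

# Feed the suffix in reversed token order; the model's continuation is a prefix.
reversed_suffix = torch.flip(suffix_ids, dims=[1])
with torch.no_grad():
    out = reverse_lm.generate(
        reversed_suffix,
        max_new_tokens=20,
        do_sample=True,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )

# Un-reverse the newly generated tokens to read the candidate prompt left-to-right.
candidate_prefix_ids = torch.flip(out[0][suffix_ids.shape[1]:], dims=[0])
print(tokenizer.decode(candidate_prefix_ids))
```

Because the reverse model generates prefixes conditioned on a chosen suffix, sampling from it can surface natural-language prompts that elicit a target behavior from the forward model.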
Scaling laws for activation addition
Activation engineering is a promising direction for controlling LLM behavior at inference time with negligible compute overhead. Recent research suggests that manipulating model internals may enable more precise control over model outputs. We seek to understand how techniques that operate on model activations scale with model size, and to improve their performance for larger models.
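For concreteness, here is a minimal sketch of activation addition via a forward hook. The layer index, injection coefficient, contrastive prompt pair, and the use of "gpt2" are illustrative assumptions, not details of our project:

```python
# Sketch: steer a model by adding a vector to its residual-stream activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def residual_stream(prompt, layer):
    """Return the residual-stream activations of `prompt` at `layer`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    acts = {}
    def grab(_, __, output):
        acts["h"] = output[0]
    handle = model.transformer.h[layer].register_forward_hook(grab)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return acts["h"]

layer, coeff = 6, 4.0
# One common construction: a scaled difference of activations on a
# contrastive prompt pair, taken at the last token position.
steer = coeff * (residual_stream(" love", layer)[:, -1, :] -
                 residual_stream(" hate", layer)[:, -1, :])

def add_steering(_, __, output):
    # Add the steering vector at every position of the residual stream.
    return (output[0] + steer,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(add_steering)
ids = tokenizer("I think dogs are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0]))
handle.remove()
```

Studying how choices like the injection layer, coefficient, and steering-vector construction behave as models grow is the kind of question a scaling-law analysis of this technique targets.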
Supervised Program for Alignment Research
Organized by groups at UC Berkeley, Georgia Tech, and Stanford, the Supervised Program for Alignment Research (SPAR) is an intercollegiate, project-based research program for students interested in AI safety, running this fall. SPAR matches students around the world with advisors for guided projects in AI safety.
Learn more »
Select Papers from AISI
Under review
2024 ICML mechanistic interpretability workshop spotlight
2024 ACL main conference paper
2023 NeurIPS SoLaR workshop spotlight