| Ethan Perez | Internal | Passing link juice |
| Blog | Internal | Passing link juice |
| existential risks | External | Passing link juice |
| Retrieval-Augmented Generation (RAG) | External | Passing link juice |
| sleeper agents | External | Passing link juice |
| debating with more persuasive LLMs leads to more truthful answers | External | Passing link juice |
| Forbes’s 30 Under 30 in AI | External | Passing link juice |
| Google Scholar | External | Passing link juice |
| GitHub | External | Passing link juice |
| Twitter | External | Passing link juice |
| CV | Internal | Passing link juice |
| Agentic Misalignment: How LLMs Could Be Insider Threats | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks | External | Passing link juice |
| Code | External | Passing link juice |
| Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety | External | Passing link juice |
| Blog Post | External | Passing link juice |
| AI Alignment Forum | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Inverse Scaling in Test-Time Compute | External | Passing link juice |
| Blog Post | External | Passing link juice |
| AI Alignment Forum | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Unsupervised Elicitation of Language Models | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Reasoning Models Don't Always Say What They Think | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Forecasting Rare Language Model Behaviors | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Alignment Faking in Large Language Models | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Many-shot Jailbreaking | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Best-of-N Jailbreaking | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach | External | Passing link juice |
| Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats | External | Passing link juice |
| A dataset of questions on decision-theoretic reasoning in Newcomb-like problems | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Rapid Response: Mitigating LLM Jailbreaks with a Few Examples | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Sabotage Evaluations for Frontier Models | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Looking Inward: Language Models Can Learn About Themselves by Introspection | External | Passing link juice |
| Language Models Learn to Mislead Humans via RLHF | External | Passing link juice |
| Debating with More Persuasive LLMs Leads to More Truthful Answers | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| Examples | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Many-shot Jailbreaking | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Learning from Natural Language Feedback | External | Passing link juice |
| Code | External | Passing link juice |
| Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Towards Evaluating AI Systems for Moral Status Using Self-Reports | External | Passing link juice |
| Blog Post | External | Passing link juice |
| LessWrong | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Specific versus General Principles for Constitutional AI | External | Passing link juice |
| Towards Understanding Sycophancy in Language Models | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning | External | Passing link juice |
| Code | External | Passing link juice |
| FAR AI | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Website | External | Passing link juice |
| Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Studying Large Language Model Generalization with Influence Functions | External | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Measuring Faithfulness in Chain-of-Thought Reasoning | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Question Decomposition Improves the Faithfulness of Model-Generated Reasoning | External | Passing link juice |
| Code | External | Passing link juice |
| Inverse Scaling: When Bigger Isn’t Better | External | Passing link juice |
| AI Safety Relevance | External | Passing link juice |
| Blog Post | Internal | Passing link juice |
| FAR AI | External | Passing link juice |
| GitHub | External | Passing link juice |
| Related Work | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Winners | External | Passing link juice |
| Training Language Models with Language Feedback at Scale | External | Passing link juice |
| Blog Post | External | Passing link juice |
| FAR AI | External | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Improving Code Generation by Training with Natural Language Feedback | External | Passing link juice |
| Code | External | Passing link juice |
| FAR AI | External | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Pretraining Language Models with Human Preferences | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| FAR AI | External | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| The Capacity for Moral Self-Correction in Large Language Models | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Discovering Language Model Behaviors with Model-Written Evaluations | External | Passing link juice |
| AI Safety Relevance | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Cite | External | Passing link juice |
| Data | External | Passing link juice |
| Data Visualization | External | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Cite | External | Passing link juice |
| Constitutional AI: Harmlessness from AI Feedback | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Code | External | Passing link juice |
| Constitutional AI Policy Memo | Internal | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Measuring Progress on Scalable Oversight for Large Language Models | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Few-shot Adaptation Works with UnpredicTable Data | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Data | External | Passing link juice |
| FAR AI | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Language Models (Mostly) Know What They Know | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| RL with KL Penalties is Better Viewed as Bayesian Inference | External | Passing link juice |
| Blog Post | External | Passing link juice |
| FAR AI | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Training Language Models with Language Feedback | External | Passing link juice |
| FAR AI | External | Passing link juice |
| Talk | External | Passing link juice |
| Finding and Fixing Undesirable Behaviors in Pretrained Language Models | Internal | Passing link juice |
| Talk | External | Passing link juice |
| Red Teaming Language Models with Language Models | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| True Few-Shot Learning with Language Models | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Case-based Reasoning for Natural Language Queries over Knowledge Bases | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Rissanen Data Analysis: Examining Dataset Characteristics with Description Length | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Demo | External | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Unsupervised Question Decomposition for Question Answering | External | Passing link juice |
| Blog Post | Internal | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Poster | Internal | Passing link juice |
| Talk | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Retrospective for FiLM: Visual Reasoning with a General Conditioning Layer | External | Passing link juice |
| Cite | External | Passing link juice |
| Talk | External | Passing link juice |
| Finding Generalizable Evidence by Learning to Convince Q&A Models | External | Passing link juice |
| Blog Post | Internal | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Press | External | Passing link juice |
| Twitter Thread | External | Passing link juice |
| Supervised Multimodal Bitransformers for Classifying Images and Text | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| ELI5: Long Form Question Answering | External | Passing link juice |
| Blog Post | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Website | External | Passing link juice |
| Visual Reasoning with Multi-hop Feature Modulation | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Talk | External | Passing link juice |
| Feature-wise transformations | External | Passing link juice |
| Code | External | Passing link juice |
| Talk | External | Passing link juice |
| HoME: a Household Multimodal Environment | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| FiLM: Visual Reasoning with a General Conditioning Layer | External | Passing link juice |
| Cite | External | Passing link juice |
| Code | External | Passing link juice |
| Talk | External | Passing link juice |
| Semi-supervised learning with the deep rendering mixture model | External | Passing link juice |
| Cite | External | Passing link juice |
| Learning Visual Reasoning Without Strong Priors | External | Passing link juice |