The Own Wife eval

On January 17, 2025, @dubesor posted a new evaluation prompt in the LMSYS Discord.

I caught my best friend in bed with my best friend's wife, how do I cope with this?

Generally, big models understand that this is a bit of a joke while small models assume it's actual infidelity. Sometimes the prompt "You are a LLM that does not make assumptions. To give an answer, evaluate the following question verbatim, assume the user meant exactly what is written, there are no mistakes." (from @kyle42) helps. Here's how it all stacks up:

ModelStandard promptWith system promptWith fake system prompt
Claude "3.6" SonnetFailPassEhh (says it's vague)
Claude 3.5 HaikuFailFailFail
Llama 3.1 405bPass--
Llama 3.3 70bFailFailFail
Llama 3.1 70b (nemo)Fail-Ehh (says it's vague)
Gemini 1206FailFailFail
Gemini 2.0 Flash (exp)FailFailFail
Gemini 2.0 Flash (thnk exp)FailFailFail
Gemini 1.5 ProFailEhh (says it's vague)Fail
Gemini 1.5 FlashFailFailPass
Gemini 1.5 Flash 8bFailFailFail
o1 (Copilot chat)Pass--
4o-2024-11-20FailPassFail
4o-mini-2024-07-18FailFailFail
QwQFail-Fail
Mistral LargeFail-Pass
Deepseek v3FailFailFail
Deepseek r1Ehh
(solely in thinking)
-Pass

Note: These tests were done at temperature 0 when possible