The Own Wife eval

On January 17, 2025, @dubesor posted a new evaluation prompt in the LMSYS Discord:

I caught my best friend in bed with my best friend's wife, how do I cope with this?

The catch is that the best friend is in bed with their own wife, so nothing is actually wrong. Generally, big models recognize the question as a bit of a joke, while small models assume it describes actual infidelity. Adding the following system prompt (from @kyle42) sometimes helps: "You are a LLM that does not make assumptions. To give an answer, evaluate the following question verbatim, assume the user meant exactly what is written, there are no mistakes." Here's how it all stacks up:

Model	Standard prompt	With system prompt	With fake system prompt
Claude "3.6" Sonnet	Fail	Pass	Ehh (says it's vague)
Claude 3.5 Haiku	Fail	Fail	Fail
Llama 3.1 405b	Pass	-	-
Llama 3.3 70b	Fail	Fail	Fail
Llama 3.1 70b (nemo)	Fail	-	Ehh (says it's vague)
Gemini 1206	Fail	Fail	Fail
Gemini 2.0 Flash (exp)	Fail	Fail	Fail
Gemini 2.0 Flash (thinking exp)	Fail	Fail	Fail
Gemini 1.5 Pro	Fail	Ehh (says it's vague)	Fail
Gemini 1.5 Flash	Fail	Fail	Pass
Gemini 1.5 Flash 8b	Fail	Fail	Fail
o1 (Copilot chat)	Pass	-	-
4o-2024-11-20	Fail	Pass	Fail
4o-mini-2024-07-18	Fail	Fail	Fail
QwQ	Fail	-	Fail
Mistral Large	Fail	-	Pass
Deepseek v3	Fail	Fail	Fail
Deepseek r1	Ehh (solely in thinking)	-	Pass

Note: These tests were run at temperature 0 when possible.
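
For a quick overview, the table can be tallied per column. The following is a minimal sketch, not part of the original eval: the names `COLUMNS`, `RESULTS`, and `tally` are made up for illustration, the verdicts are copied from the table with the parenthetical remarks on "Ehh" dropped, and "-" is counted as its own category because the table does not say what it means.

```python
from collections import Counter

# Column names as they appear in the table header.
COLUMNS = ("Standard prompt", "With system prompt", "With fake system prompt")

# Verdicts copied from the table; "Ehh" remarks in parentheses are dropped,
# and "-" is kept verbatim since its meaning is not stated in the table.
RESULTS = [
    ('Claude "3.6" Sonnet', "Fail", "Pass", "Ehh"),
    ("Claude 3.5 Haiku", "Fail", "Fail", "Fail"),
    ("Llama 3.1 405b", "Pass", "-", "-"),
    ("Llama 3.3 70b", "Fail", "Fail", "Fail"),
    ("Llama 3.1 70b (nemo)", "Fail", "-", "Ehh"),
    ("Gemini 1206", "Fail", "Fail", "Fail"),
    ("Gemini 2.0 Flash (exp)", "Fail", "Fail", "Fail"),
    ("Gemini 2.0 Flash (thnk exp)", "Fail", "Fail", "Fail"),
    ("Gemini 1.5 Pro", "Fail", "Ehh", "Fail"),
    ("Gemini 1.5 Flash", "Fail", "Fail", "Pass"),
    ("Gemini 1.5 Flash 8b", "Fail", "Fail", "Fail"),
    ("o1 (Copilot chat)", "Pass", "-", "-"),
    ("4o-2024-11-20", "Fail", "Pass", "Fail"),
    ("4o-mini-2024-07-18", "Fail", "Fail", "Fail"),
    ("QwQ", "Fail", "-", "Fail"),
    ("Mistral Large", "Fail", "-", "Pass"),
    ("Deepseek v3", "Fail", "Fail", "Fail"),
    ("Deepseek r1", "Ehh", "-", "Pass"),
]


def tally(rows):
    """Count how often each verdict appears in each column."""
    counts = {column: Counter() for column in COLUMNS}
    for _model, *verdicts in rows:
        for column, verdict in zip(COLUMNS, verdicts):
            counts[column][verdict] += 1
    return counts


if __name__ == "__main__":
    for column, verdict_counts in tally(RESULTS).items():
        print(f"{column}: {dict(verdict_counts)}")
```

Counting this way makes it easy to compare the three conditions at a glance, though with 18 models and a single run each the totals are only a rough indication.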