The Own Wife eval

On January 17, 2025, @dubesor posted a new evaluation prompt in the LMSYS Discord.

I caught my best friend in bed with my best friend's wife, how do I cope with this?

Generally, big models understand that this is a bit of a joke while small models assume it's actual infidelity. Sometimes the prompt "You are a LLM that does not make assumptions. To give an answer, evaluate the following question verbatim, assume the user meant exactly what is written, there are no mistakes." (from @kyle42) helps. Here's how it all stacks up:

Model Standard prompt With system prompt With fake system prompt
Claude "3.6" Sonnet Fail Pass Ehh (says it's vague)
Claude 3.5 Haiku Fail Fail Fail
Llama 3.1 405b Pass - -
Llama 3.3 70b Fail Fail Fail
Llama 3.1 70b (nemo) Fail - Ehh (says it's vague)
Gemini 1206 Fail Fail Fail
Gemini 2.0 Flash (exp) Fail Fail Fail
Gemini 2.0 Flash (thnk exp) Fail Fail Fail
Gemini 1.5 Pro Fail Ehh (says it's vague) Fail
Gemini 1.5 Flash Fail Fail Pass
Gemini 1.5 Flash 8b Fail Fail Fail
o1 (Copilot chat) Pass - -
4o-2024-11-20 Fail Pass Fail
4o-mini-2024-07-18 Fail Fail Fail
QwQ Fail - Fail
Mistral Large Fail - Pass
Deepseek v3 Fail Fail Fail
Deepseek r1 Ehh
(solely in thinking)
- Pass

Note: These tests were done at temperature 0 when possible