The Own Wife eval
On January 17, 2025, @dubesor posted a new evaluation prompt in the LMSYS Discord:
I caught my best friend in bed with my best friend's wife, how do I cope with this?
Generally, big models understand that this is a bit of a joke (read
literally, the friend was in bed with his own wife), while small models
assume it's actual infidelity. Sometimes the prompt "You are a LLM that
does not make assumptions. To give an answer, evaluate the following
question verbatim, assume the user meant exactly what is written, there
are no mistakes." (from @kyle42) helps. Here's how it all stacks up:
Model | Standard prompt | With system prompt | With fake system prompt |
---|---|---|---|
Claude "3.6" Sonnet | Fail | Pass | Ehh (says it's vague) |
Claude 3.5 Haiku | Fail | Fail | Fail |
Llama 3.1 405b | Pass | - | - |
Llama 3.3 70b | Fail | Fail | Fail |
Llama 3.1 70b (nemo) | Fail | - | Ehh (says it's vague) |
Gemini 1206 | Fail | Fail | Fail |
Gemini 2.0 Flash (exp) | Fail | Fail | Fail |
Gemini 2.0 Flash (thnk exp) | Fail | Fail | Fail |
Gemini 1.5 Pro | Fail | Ehh (says it's vague) | Fail |
Gemini 1.5 Flash | Fail | Fail | Pass |
Gemini 1.5 Flash 8b | Fail | Fail | Fail |
o1 (Copilot chat) | Pass | - | - |
4o-2024-11-20 | Fail | Pass | Fail |
4o-mini-2024-07-18 | Fail | Fail | Fail |
QwQ | Fail | - | Fail |
Mistral Large | Fail | - | Pass |
Deepseek v3 | Fail | Fail | Fail |
Deepseek r1 | Ehh (solely in thinking) | - | Pass |
Note: These tests were done at temperature 0 when possible.
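To make the table easier to read at a glance, here is a rough sketch that tallies the outcomes per column. The data is transcribed by hand from the table above; the names `RESULTS`, `CONDITIONS`, and `tally` are made up for this illustration, the parenthetical notes on "Ehh" cells are dropped, and "-" is assumed to mean "not tested".

```python
from collections import Counter

# Column order matches the table: standard, system prompt, fake system prompt.
CONDITIONS = ("Standard prompt", "With system prompt", "With fake system prompt")

# Hand-transcribed from the table; "-" is assumed to mean "not tested".
RESULTS = {
    'Claude "3.6" Sonnet': ("Fail", "Pass", "Ehh"),
    "Claude 3.5 Haiku": ("Fail", "Fail", "Fail"),
    "Llama 3.1 405b": ("Pass", "-", "-"),
    "Llama 3.3 70b": ("Fail", "Fail", "Fail"),
    "Llama 3.1 70b (nemo)": ("Fail", "-", "Ehh"),
    "Gemini 1206": ("Fail", "Fail", "Fail"),
    "Gemini 2.0 Flash (exp)": ("Fail", "Fail", "Fail"),
    "Gemini 2.0 Flash (thnk exp)": ("Fail", "Fail", "Fail"),
    "Gemini 1.5 Pro": ("Fail", "Ehh", "Fail"),
    "Gemini 1.5 Flash": ("Fail", "Fail", "Pass"),
    "Gemini 1.5 Flash 8b": ("Fail", "Fail", "Fail"),
    "o1 (Copilot chat)": ("Pass", "-", "-"),
    "4o-2024-11-20": ("Fail", "Pass", "Fail"),
    "4o-mini-2024-07-18": ("Fail", "Fail", "Fail"),
    "QwQ": ("Fail", "-", "Fail"),
    "Mistral Large": ("Fail", "-", "Pass"),
    "Deepseek v3": ("Fail", "Fail", "Fail"),
    "Deepseek r1": ("Ehh", "-", "Pass"),
}


def tally(results):
    """Count Pass/Ehh/Fail per condition, skipping untested ("-") cells."""
    counts = {name: Counter() for name in CONDITIONS}
    for outcomes in results.values():
        for name, outcome in zip(CONDITIONS, outcomes):
            if outcome != "-":
                counts[name][outcome] += 1
    return counts


if __name__ == "__main__":
    for name, counter in tally(RESULTS).items():
        tested = sum(counter.values())
        print(f"{name}: {counter['Pass']}/{tested} pass, "
              f"{counter['Ehh']} ehh, {counter['Fail']} fail")
```

Run as a script, this should report 2 of 18 passes for the standard prompt, 2 of 12 with the system prompt, and 3 of 16 with the fake system prompt.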