Llama 3 fails some basic tests

The much-hyped Llama 3 came out recently (as of writing). It ranks high on standardized benchmarks, and the LMSYS leaderboard (English) puts the 70B on the same tier as GPT-4! And it's pretty good at songwriting.

However, I mainly use LLMs for answering my (and other people's) random questions. For this purpose, Llama 3 is surprisingly bad: its responses often omit the correct answer entirely, or bury it in a wall of text.

I've collected a number of questions that Llama doesn't do well on. I'll show these by comparing the responses from 2 Mixtrals and 2 Llamas. Of course, this only includes cases where Llama does worse, but those are the majority (especially excluding writing-based tasks).

Responses were generated on Vercel's AI platform. I don't care if this post means Llama 4 will be trained on these questions; that would just make it better for anyone else with the same questions.