LLMonitor Benchmarks

Weekly benchmarks of popular LLMs using real-world prompts

Vince writes:

Traditional LLM benchmarks have drawbacks: they quickly become part of training datasets, and they are hard to relate to real-world use cases.

I made this as an experiment to address these issues. Here, the dataset is dynamic (changes every week) and composed of crowdsourced real-world prompts.

We then use GPT-4 to grade each model's response against a set of rubrics (more details on the about page). The prompt dataset is easily explorable.
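A minimal sketch of what rubric-based grading with GPT-4 could look like, assuming the official `openai` Node SDK. The rubric wording, the 0-10 scale, and the `gradeResponse` helper are illustrative assumptions, not the benchmark's actual grading prompt:

```typescript
// Sketch of GPT-4 grading a model's response against a rubric.
// Assumes the `openai` Node SDK and OPENAI_API_KEY in the environment.
import OpenAI from "openai";

const openai = new OpenAI();

async function gradeResponse(prompt: string, response: string): Promise<number> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You are a strict grader. Score the assistant response against the rubric " +
          "(accuracy, helpfulness, formatting) on a 0-10 scale. Reply with the number only.",
      },
      { role: "user", content: `Prompt:\n${prompt}\n\nResponse:\n${response}` },
    ],
  });

  // Parse the numeric score from the grader's reply.
  return Number(completion.choices[0].message.content?.trim());
}
```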

Everything is then stored in a Postgres database, and this page shows the raw results.
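For illustration, storing a graded result in Postgres might look like the following, using the `pg` client. The `results` table name and its columns are assumptions, not the project's actual schema:

```typescript
// Sketch of persisting one graded result in Postgres via the `pg` pool.
// Connection settings come from the standard PG* environment variables.
import { Pool } from "pg";

const pool = new Pool();

async function saveResult(model: string, promptId: number, score: number) {
  await pool.query(
    `CREATE TABLE IF NOT EXISTS results (
       id        serial PRIMARY KEY,
       model     text        NOT NULL,
       prompt_id integer     NOT NULL,
       score     numeric     NOT NULL,
       graded_at timestamptz DEFAULT now()
     )`
  );
  await pool.query(
    "INSERT INTO results (model, prompt_id, score) VALUES ($1, $2, $3)",
    [model, promptId, score]
  );
}
```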

Each benchmarked LLM is ranked by score and linked to its detailed results. You can also compare two LLMs' scores side by side.
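Given a table like the hypothetical `results` above, the ranking and side-by-side comparison could be simple aggregate queries; this sketch reuses that assumed schema rather than the project's real one:

```typescript
// Sketch of the leaderboard and two-model comparison over the assumed `results` table.
import { Pool } from "pg";

const pool = new Pool();

// Rank every model by its average score across the current prompt set.
async function leaderboard() {
  const { rows } = await pool.query(
    `SELECT model, round(avg(score), 2) AS avg_score
     FROM results
     GROUP BY model
     ORDER BY avg_score DESC`
  );
  return rows;
}

// Compare two models prompt by prompt.
async function compare(modelA: string, modelB: string) {
  const { rows } = await pool.query(
    `SELECT a.prompt_id, a.score AS score_a, b.score AS score_b
     FROM results a
     JOIN results b USING (prompt_id)
     WHERE a.model = $1 AND b.model = $2`,
    [modelA, modelB]
  );
  return rows;
}
```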

If you are applying LLMs within a security context, having a non-AI execute the benchmark will highlight things an AI wouldn't. It may also help with skeptics who (with some merit) see an AI benchmarking another AI as a potential risk that should be carefully controlled.