Apply for community grant: Personal project (gpu)
I'd like to build a "HEAD TO HEAD EVALUATOR": a head-to-head evaluation tool where, for a given dataset and eval script, people can compare two models directly.
Some more specifications:
Datasets: People can pull in a set of prominent default datasets (e.g., CodeForces, HSM, MMLU, etc.) directly from Hugging Face, or supply their own dataset with a custom eval script and eval metric (or metrics).
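As a rough sketch of what the dataset side could look like (the dataset IDs, the CSV format, and the `exact_match_accuracy` metric below are illustrative placeholders, not a final API):

```python
# Sketch only: pull a default benchmark from the Hub, or accept a
# user-uploaded dataset plus a custom metric function.
from datasets import load_dataset

def exact_match_accuracy(predictions, references):
    """Example of a user-supplied eval metric: exact-match accuracy."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / max(len(references), 1)

# A prominent default benchmark pulled directly from Hugging Face.
mmlu = load_dataset("cais/mmlu", "all", split="test")

# ...or a custom dataset uploaded by the user (CSV with question/answer columns assumed).
custom = load_dataset("csv", data_files="my_eval_set.csv", split="train")
```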
Models:
- People can upload their own models, or pull in models from Hugging Face directly.
- There will also be options to pull in different fine-tuned versions, quantized versions, etc.
- There will also be an option to evaluate closed-weight models (e.g., GPT-4, Claude) by inputting your own API key, to which usage will be billed (see the sketch after this list).
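A minimal sketch of the two model sources, assuming `transformers` for open-weight models and the `openai` client for closed-weight ones (the model IDs are placeholders, and the closed-weight calls are billed to the user's own key):

```python
import os

from openai import OpenAI
from transformers import pipeline

# Open-weight contender pulled directly from Hugging Face (model ID is a placeholder).
open_model = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def generate_open(prompt: str) -> str:
    return open_model(prompt, max_new_tokens=256)[0]["generated_text"]

# Closed-weight contender, billed to the user's own API key.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def generate_closed(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```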
Part of what the space will do is estimate roughly how long a full eval run will take on the available hardware. It will also report how the measured evaluation score compares with the score reported in the relevant paper.
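For the run-time estimate, one simple approach is to time a handful of examples on the available hardware and extrapolate; the sample size, prompt column name, and linear extrapolation below are assumptions of this sketch:

```python
import time

def estimate_eval_seconds(generate_fn, dataset, sample_size=5, prompt_column="question"):
    """Time a few examples on the current hardware, then extrapolate linearly."""
    sample = dataset.select(range(min(sample_size, len(dataset))))
    start = time.perf_counter()
    for row in sample:
        generate_fn(row[prompt_column])
    per_example = (time.perf_counter() - start) / len(sample)
    return per_example * len(dataset)

# e.g., estimate_eval_seconds(generate_open, mmlu) -> rough seconds for a full run
```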
I believe this will be a really neat way to make evals accessible to less technical folks. If people who are new to AI can easily upload two models and evaluate them on any dataset, that will allow them to experiment with new models and new datasets for their specific use case a lot faster.