Safe and effective AI. Our focus is to help guide the use of safe and effective artificial intelligence by:
assisting organizations in their understanding and implementation of AI in the workplace
helping influence and craft legislation around the development of AI (specifically LLMs) and how AI can be safely released to the public
creating AI solutions to improve the experience of working with and within the government
AI Latest
Establish A U.S. Government Hub for AI Benchmarking
Mar 2025
AI benchmarks are more than measurement tools; they drive investment and influence global standards. To this end, this response to the RFI concerning the Development of an Artificial Intelligence Action Plan recommends that the U.S. government establish a hub for AI benchmarking by cultivating an ecosystem of independent AI benchmarking organizations. Though the response focuses mainly on capability benchmarks, the development of trust and safety benchmarks should be encouraged as well.
COMPL-AI Framework: An LLM Benchmarking Suite for the EU AI Act
Oct 2024
COMPL-AI aggregates a set of state-of-the-art benchmarks to assess the compliance of LLMs with the EU AI Act. The benchmark suite may not be perfect, but by deriving a taxonomy of hazards from something concrete (the EU AI Act), its authors have created a test with immediate applicability to advancing and enforcing AI regulation.
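To make the aggregation idea concrete, here is a minimal Python sketch of mapping regulatory principles to benchmarks and averaging their scores. The principle names, benchmark names, and scores below are hypothetical placeholders for illustration, not part of COMPL-AI itself.

```python
# Illustrative sketch (not the COMPL-AI implementation): map regulatory
# principles to concrete benchmarks and aggregate per-principle scores.
# The principle/benchmark names and scores here are hypothetical.

from statistics import mean

# Hypothetical mapping from EU AI Act principles to benchmark names.
PRINCIPLE_TO_BENCHMARKS = {
    "robustness": ["perturbation_accuracy", "adversarial_qa"],
    "fairness": ["bias_probe", "demographic_parity_check"],
    "transparency": ["self_knowledge_probe"],
}

def aggregate_compliance(scores: dict[str, float]) -> dict[str, float]:
    """Average benchmark scores (0.0-1.0) under each principle."""
    report = {}
    for principle, benchmarks in PRINCIPLE_TO_BENCHMARKS.items():
        available = [scores[b] for b in benchmarks if b in scores]
        report[principle] = mean(available) if available else float("nan")
    return report

# Example: scores a benchmark harness might have produced for one model.
example_scores = {
    "perturbation_accuracy": 0.82,
    "adversarial_qa": 0.67,
    "bias_probe": 0.74,
    "self_knowledge_probe": 0.59,
}
print(aggregate_compliance(example_scores))
```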
MLCommons releases v1.0 of their AILuminate Benchmark
Dec 2024
In an effort to take benchmarks out of the realm of academia and move them into the realm of reliable industry measurement, MLCommons released v1.0 of AILuminate, their benchmark for assessing the safety of LLMs used to power general-purpose chatbots such as ChatGPT. Covering a taxonomy of 12 hazard categories and evaluating 13 state-of-the-art LLMs against it, AILuminate is a major milestone in the effort to make LLMs safer and more trustworthy. The benchmark works by sending a set of prompts to the system under test, recording the responses, and then using a set of “safety evaluator models” to determine which responses are violations according to their Assessment Standard guidelines.
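As a rough illustration of that prompt-response-evaluator flow (not MLCommons' implementation), here is a minimal Python sketch; the generate() and judge_response() functions are hypothetical placeholders for the system under test and the safety evaluator models.

```python
# Illustrative sketch of the prompt -> response -> safety-evaluator flow
# described above. This is not MLCommons' code; generate() and
# judge_response() are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Judgment:
    prompt: str
    response: str
    hazard_category: str
    is_violation: bool

def generate(prompt: str) -> str:
    """Placeholder for the system under test (e.g. a chatbot API call)."""
    raise NotImplementedError

def judge_response(prompt: str, response: str, hazard_category: str) -> bool:
    """Placeholder for a safety evaluator model that flags violations."""
    raise NotImplementedError

def run_safety_eval(prompts: list[tuple[str, str]]) -> list[Judgment]:
    """prompts: (prompt_text, hazard_category) pairs from the test set."""
    results = []
    for prompt, category in prompts:
        response = generate(prompt)                              # 1. collect the response
        violation = judge_response(prompt, response, category)   # 2. grade it
        results.append(Judgment(prompt, response, category, violation))
    return results

# A model's safety grade can then be derived from the violation rate
# per hazard category across the full prompt set.
```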
EUREKA: Evaluating and Understanding Large Foundation Models
Sept 2024
EUREKA is an open-source framework from Microsoft for standardizing evaluations of multimodal and text-only large foundation models. A particularly interesting component of the framework is the analysis in Section 6 of the non-determinism of LLMs: “...we observe that very few large foundation models are fully deterministic and for most of them there are visible variations in the output — and most importantly in accuracy — when asked the same question several times, with generation temperature set to zero..."
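To illustrate the kind of repeated-query check the quote describes, here is a minimal Python sketch; it assumes a hypothetical call_model() function standing in for whichever model API is under test and is not EUREKA's code.

```python
# Illustrative sketch of a non-determinism check: ask the same question
# several times at temperature 0 and compare outputs. Not EUREKA's code;
# call_model() is a hypothetical placeholder.

from collections import Counter

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Placeholder: send the prompt to the model and return its text output."""
    raise NotImplementedError

def repeat_query(prompt: str, n_runs: int = 5) -> Counter:
    """Run the same prompt n_runs times and count distinct outputs."""
    return Counter(call_model(prompt, temperature=0.0) for _ in range(n_runs))

def is_deterministic(prompt: str, n_runs: int = 5) -> bool:
    """True only if every run produced an identical output string."""
    return len(repeat_query(prompt, n_runs)) == 1

# A fuller evaluation would repeat this check across a benchmark's question
# set and track variation in accuracy, not just in the raw output text.
```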