Language Model Evaluation

Appier Research Unveils Agentic AI Breakthrough: A Risk-Aware Decision Framework

Appier today announced new research advancing the reliability of Agentic AI systems. To expand the impact of its research and ...

ascopubs.org

Simulation-Based Evaluation of a Large Language Model–Enabled Clinical Decision Support Platform in Oncology

In a remote, within-participant simulation, 26 oncologists from the United Kingdom, United States, Spain, and Singapore reviewed synthetic breast cancer cases and created comprehensive summaries for ...

Communications of the ACM

Measuring What Matters in Large Language Model Performance

As large language models (LLMs) gain momentum worldwide, there’s a growing need for reliable ways to measure their performance. Benchmarks that evaluate LLM outputs allow developers to track ...

Digital Journal

Generative AI: Evaluation of Small Language Models and their applications

Opinions expressed by Digital Journal contributors are their own. Generative AI has rapidly become a cornerstone of modern technology, revolutionizing how people interact with data and digital content ...

ZDNet

With AI models clobbering every benchmark, it's time for human evaluation

Artificial intelligence has traditionally advanced through automatic accuracy tests in tasks meant to approximate human knowledge. Carefully crafted benchmark tests such as The General Language ...

Forbes

The Importance Of Evaluation In The Reinforcement Learning Revolution

David Shan is the Co-Founder and CTO of Clado, who trains in-house small language models to build the best people search algorithm. We celebrate RL breakthroughs, but behind the hype lies a brittle ...

FedScoop

US AI Safety Institute taps Scale AI for model evaluation

Scale AI founder and CEO Alexandr Wang testifies during a House Armed Services Subcommittee on Cyber, Information Technologies and Innovation hearing about artificial intelligence on July 18, 2023, in ...

Communications of the ACM (Opinion)

Communication Bias in Large Language Models: A Regulatory Perspective

Though new regulatory frameworks address fairness, accountability, and safety in AI systems, they often fail to directly ...

Forbes

Augmenting The American Psychiatric Association App Evaluation Model To Include AI-Based Mental Health Apps

Forbes contributors publish independent expert analyses and insights. Dr. Lance B. Eliot is a world-renowned AI scientist and consultant. In today’s column, I examine an existing formalized evaluation ...

The Robot Report

Vision-language-action models are the next leap in autonomous robotics

Explore how vision-language-action models like Helix, GR00T N1, and RT-1 are enabling robots to understand instructions and act autonomously.
