Computers ace IQ tests but still make dumb mistakes. Can different tests help?


  • 📰 NewsfromScience

While AI models can quickly master benchmarks and surpass human baselines, they often fall short in the real world. The solution, most researchers argue, is not to abandon these benchmarks but to make them better.

By strategically adding stickers to a stop sign, for example, researchers in 2018 fooled standard image recognition systems into seeing a speed limit sign instead. And a 2018 project called Gender Shades found the accuracy of gender identification for commercial face-recognition systems dropped from 90% to 65% for dark-skinned women’s faces.

Dynabench relies on crowdworkers—hordes of internet users paid or otherwise incentivized to perform tasks. Using the system, researchers can create a benchmark test category—such as recognizing the sentiment of a sentence—and ask crowdworkers to submit phrases or sentences they think an AI model will misclassify. Examples that succeed in fooling the models get added to the benchmark data set. Models train on the data set, and the process repeats.
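The collect-and-retrain loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dynabench's actual code or API: the names make_model and adversarial_round are hypothetical, the example sentences are invented, and "training" is faked by a toy model that memorizes the examples it has seen and otherwise applies a crude keyword rule.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, true label)


def make_model(training_data: List[Example]) -> Callable[[str], str]:
    """Hypothetical stand-in for training: memorize seen examples,
    otherwise fall back to a crude keyword rule."""
    memory = dict(training_data)

    def predict(text: str) -> str:
        if text in memory:
            return memory[text]
        return "positive" if "good" in text.lower() else "negative"

    return predict


def adversarial_round(
    benchmark: List[Example], submissions: List[Example]
) -> Tuple[List[Example], List[Example]]:
    """Train on the current benchmark, then keep only the submissions
    that the resulting model misclassifies."""
    model = make_model(benchmark)
    fooled = [(text, label) for text, label in submissions if model(text) != label]
    return benchmark + fooled, fooled


# Invented crowdworker submissions for a sentiment task.
submissions = [
    ("Not good at all.", "negative"),
    ("A good film.", "positive"),
    ("Brilliant!", "positive"),
]

benchmark, fooled_first = adversarial_round([], submissions)
_, fooled_second = adversarial_round(benchmark, submissions)
print(len(fooled_first), len(fooled_second))  # prints "2 0"
```

In the first round two of the three sentences fool the toy model and join the benchmark; once the model has trained on them, the same sentences no longer succeed, so crowdworkers would have to find new ones.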

WILDS, a benchmark developed by Stanford University computer scientist Percy Liang and his students Pang Wei Koh and Shiori Sagawa, aims to test how well models cope when real-world data differ from the data they were trained on. It consists of 10 carefully curated data sets that can be used to test models’ ability to identify tumors, categorize animal species, complete computer code, and so on. Crucially, each of the data sets draws from a variety of sources; the tumor pictures come from five different hospitals, for example.
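One way to see why multiple sources matter is to score a model separately on each source and look at the worst one, rather than reporting a single pooled average. The sketch below assumes a simple list of (source, prediction, label) records; it does not use the WILDS package, and the hospital names and results are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (source, prediction, true label)


def accuracy_by_source(records: Iterable[Record]) -> Dict[str, float]:
    """Compute accuracy separately for each data source."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for source, prediction, label in records:
        total[source] += 1
        correct[source] += int(prediction == label)
    return {source: correct[source] / total[source] for source in total}


def worst_source(per_source: Dict[str, float]) -> Tuple[str, float]:
    """Return the source on which the model performs worst."""
    return min(per_source.items(), key=lambda item: item[1])


# Invented results: the model does well on one hospital, poorly on another.
records = [
    ("hospital_A", "tumor", "tumor"),
    ("hospital_A", "normal", "normal"),
    ("hospital_B", "normal", "tumor"),
    ("hospital_B", "tumor", "tumor"),
]

per_source = accuracy_by_source(records)
print(per_source)                # {'hospital_A': 1.0, 'hospital_B': 0.5}
print(worst_source(per_source))  # ('hospital_B', 0.5)
```

The pooled accuracy here would be 75%, which hides the fact that the model is right only half the time on the second hospital's images.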

Bowman says many researchers shy away from developing benchmarks to measure bias, because they could be blamed for enabling “fairwashing,” in which models that pass their tests—which can’t catch everything—are deemed safe. “We were sort of scared to work on this,” he says. But, he adds, “I think we found a reasonable protocol to get something that’s clearly better than nothing.” Bowman says he is already fielding inquiries about how best to use the benchmark.

Bowman has a different approach to closing off shortcuts. For his latest benchmark, posted online in December 2021 and called QuALITY, he hired crowdworkers to generate questions about text passages from short stories and nonfiction articles. He hired another group to answer the questions after reading the passages at their own pace, and a third group to answer them hurriedly under a strict time limit.

A more radical rethinking of scores acknowledges that often there’s no “ground truth” against which to say a model is right or wrong. People disagree on what’s funny or whether a building is tall. Some benchmark designers just toss out ambiguous or controversial examples from their test data, calling it noise.
