AI isn’t very good at history, new paper finds – TechCrunch

AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found.

A team of researchers has created a new benchmark to test three top large language models (LLMs) on historical questions: OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini. The benchmark, Hist-LLM, tests the correctness of answers according to the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.

The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it achieved only about 46% accuracy, not much higher than random guessing.

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared sample historical questions with TechCrunch that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present during a specific time period in ancient Egypt. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.

Why are LLMs bad at answering technical historical questions, when they can be so good at answering very complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it’s likely because LLMs tend to extrapolate from historical data that is very prominent, finding it difficult to retrieve more obscure historical knowledge.

For example, the researchers asked GPT-4 if ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM answered incorrectly that it did. This is likely because there is lots of public information about other ancient empires, like Persia, having standing armies.

“If you get told A and B 100 times, and C 1 time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.

The researchers also identified other trends, including that OpenAI and Llama models performed worse for certain regions like sub-Saharan Africa, suggesting potential biases in their training data.

The results show that LLMs still aren’t a substitute for humans when it comes to certain domains, said Peter Turchin, who led the study and is a faculty member at CSH.

But the researchers are still hopeful LLMs can help historians in the future. They’re working on refining their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential for these models to aid in historical research,” the paper reads.

Source: https://techcrunch.com/2025/01/19/ai-isnt-very-good-at-history-new-paper-finds/
