New LLM developed for under $50 outperforms OpenAI’s o1-preview
UPDATED 15:08 EST / FEBRUARY 06 2025
by Maria Deutscher
Researchers have developed a large language model that can perform some tasks better than OpenAI’s o1-preview at a tiny fraction of the cost. The researchers, from Stanford and the University of Washington, first detailed their model in a paper published last Friday. TechCrunch reported on the project today. The algorithm, named s1-32B, is available on GitHub.

Last September, OpenAI introduced a reasoning-optimized LLM dubbed o1-preview. The main innovation in the algorithm is a technique called test-time compute, which the creators of the new open-source s1-32B model refer to as test-time scaling. The technique boosts LLMs’ output quality by increasing the amount of time and hardware resources they invest in generating answers to prompts.

Following the release of o1-preview, multiple research groups set out to replicate test-time scaling. In their paper, the creators of s1-32B write that their LLM marks the first publicly disclosed successful attempt at replicating “clear test-time scaling behavior.”

“Our model s1-32B exhibits test-time scaling,” the researchers wrote in their paper. “Further, s1-32B is the most sample-efficient reasoning model and outperforms closed-source models like OpenAI’s o1-preview.”

The starting point of the project was Qwen2.5-32B-Instruct, an open-source LLM released by Alibaba Group Holding Ltd. last year. The researchers created s1-32B by customizing Qwen2.5-32B-Instruct using a dataset that comprised 1,000 prompts and AI-generated answers. The answers were sourced from Google LLC’s Gemini Thinking Experimental LLM.

Instead of simply answering the user’s prompt, Gemini Thinking Experimental displays the thought process that led to its response: it provides a natural-language summary of each step in its reasoning. Those summaries were added to s1-32B’s training dataset alongside the 1,000 sample prompts and the corresponding AI-generated answers.

The researchers created the dataset through a multistep process. First, they collected 59,029 questions spanning topics such as math, physics and chemistry from public sources. They then removed questions that contained errors and filtered the dataset again to keep only the 1,000 most challenging questions.

After training s1-32B on the dataset, the researchers applied a new machine learning method dubbed budget forcing. It involves steering the model at generation time to either think longer about a problem than it otherwise would or, conversely, cut the reasoning process short. According to the researchers, the method addresses two of the main obstacles to implementing test-time scaling in LLMs.

The first challenge is that LLMs sometimes spend too little time thinking about a task and consequently make mistakes. Budget forcing addresses the issue by appending the word “wait” to s1-32B’s reasoning when the model doesn’t spend enough time processing a query. According to s1-32B’s creators, this nudge causes the LLM to extend and refine its reasoning workflow. In one test, s1-32B was about to display an incorrect answer to a user prompt. After the researchers instructed it to wait, the model noticed the mistake and generated the correct answer.

The second issue budget forcing addresses is that LLMs sometimes spend too much time thinking about prompts, which can decrease their output quality. For example, an LLM might find the correct answer to a prompt but change it during subsequent processing steps. Budget forcing avoids that issue by requiring the LLM to skip those subsequent processing steps.
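The paper spells out the exact mechanics, but the control flow described above can be sketched in a few lines. The following is a minimal illustration, not the researchers’ code: the generate() callable, the “</think>” delimiter and the word budgets are assumed placeholders for whatever decoding interface the model exposes.

```python
from typing import Callable

# Minimal sketch of budget forcing as described above. Everything here is
# illustrative: generate(), the "</think>" delimiter and the budgets are
# stand-ins, not the researchers' actual implementation.

END_OF_THINKING = "</think>"  # assumed delimiter that closes the reasoning trace
WAIT = "Wait"                 # word appended to push the model to keep reasoning

def budget_forced_answer(
    question: str,
    generate: Callable[[str, str, int], str],  # (prompt, stop_string, budget) -> generated text
    min_thinking_words: int = 500,
    max_thinking_words: int = 4000,
) -> str:
    # Let the model reason, but never past the upper budget: capping the
    # generation budget is what cuts an overly long thought process short.
    trace = generate(question, END_OF_THINKING, max_thinking_words)

    # Too little reasoning: append "Wait" so the model reconsiders instead
    # of committing to a possibly premature answer.
    while len(trace.split()) < min_thinking_words:
        trace += f"\n{WAIT}\n"
        remaining = max_thinking_words - len(trace.split())
        if remaining <= 0:
            break
        trace += generate(question + trace, END_OF_THINKING, remaining)

    # Close the reasoning block and ask only for the final answer.
    return generate(question + trace + END_OF_THINKING, "\n\n", 200)
```

A toy stub for generate() that returns canned text is enough to exercise the control flow here; the point is simply that one knob, the reasoning budget, is pushed up or down at inference time.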
The researchers compared s1-32B against o1-preview on the MATH and AIME24 math benchmarks. The former model achieved scores up to 27% higher than OpenAI’s LLM. In another test involving math questions, s1-32B successfully used test-time compute to boost its score from 50% to 57%.

Budget forcing allows s1-32B not only to outperform o1-preview on some tasks, but also to do so at a lower cost. Niklas Muennighoff, one of the researchers who worked on the model, told TechCrunch today that it cost about $20 worth of compute to develop. The researchers elaborated in their paper that s1-32B took 26 minutes to train using 16 of Nvidia Corp.’s H100 graphics cards.
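Those figures imply only a handful of GPU-hours. As a rough sanity check, here is a back-of-the-envelope calculation; the $2.50-per-H100-hour cloud rate is an assumption for illustration, not a number from the paper or TechCrunch.

```python
# Back-of-the-envelope check of the training cost reported above. The 16
# GPUs and 26 minutes come from the paper as reported; the hourly rate is
# an assumed, illustrative cloud price, not a figure from the source.
gpus = 16
train_hours = 26 / 60                    # about 0.43 hours
assumed_usd_per_gpu_hour = 2.50          # assumption: rough H100 cloud rate

gpu_hours = gpus * train_hours           # about 6.9 GPU-hours
estimated_cost = gpu_hours * assumed_usd_per_gpu_hour
print(f"{gpu_hours:.1f} GPU-hours, ~${estimated_cost:.0f}")  # 6.9 GPU-hours, ~$17
```

That lands in the same ballpark as the roughly $20 figure Muennighoff cited, and comfortably under the $50 in the headline.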
Source: https://siliconangle.com/2025/02/06/new-llm-developed-50-outperforms-openais-o1-preview/