AlphaProof, a New A.I. from Google DeepMind, Scores Big at the International Math Olympiad - The New York Times

A.I. is getting good at math — and might soon make a worthy collaborator for humans.

Ringing the gong at Google Deepmind’s London headquarters, a ritual to celebrate each A.I. milestone, including its recent triumph of reasoning at the International Mathematical Olympiad. Credit Credit: Google Deepmind

By Siobhan Roberts

At the headquarters of Google DeepMind, an artificial intelligence laboratory in London, researchers have a longstanding ritual for announcing momentous results: They bang a big ceremonial gong.

In 2016, the gong sounded for AlphaGo, an A.I. system that excelled at the game Go. In 2017, the gong reverberated when AlphaZero conquered chess. On each occasion the algorithm had beaten human world champions.

Last week the DeepMind researchers got out the gong again to celebrate what Alex Davies, a lead of Google DeepMind’s mathematics initiative, described as a “massive breakthrough” in mathematical reasoning by an A.I. system. A pair of Google DeepMind models tried their luck with the problem set in the 2024 International Mathematical Olympiad, or I.M.O., held from July 11 to July 22 about 100 miles west of London at the University of Bath. The event is said to be the premier math competition for the world’s “brightest mathletes,” according to a promotional post on social media.

The human problem-solvers — 609 high school students from 108 countries — won 58 gold medals, 123 silver and 145 bronze. The A.I. performed at the level of a silver medalist, solving four out of six problems for a total of 28 points. It was the first time that A.I. has achieved a medal-worthy performance on an Olympiad’s problems.

“It’s not perfect, we didn’t solve everything,” Pushmeet Kohli, Google DeepMind’s vice president of research, said in an interview. “We want to be perfect.”

Nonetheless, Dr. Kohli described the result as a “phase transition” — a transformative change — “in the use of A.I. in mathematics and the ability of A.I. systems to do mathematics.”

The lab asked two independent experts to adjudicate the A.I.’s performance: Timothy Gowers, a mathematician at the University of Cambridge in England and a Fields medalist, who has been interested in the math-A. I. interplay for 25 years; and Joseph Myers, a software developer in Cambridge. Both won I.M.O. gold in their day. And at previous Olympiads Dr. Myers was chair of this year’s problem selection committee and at previous Olympiads served as a coordinator, judging human solutions. “I endeavored to assess the A.I. attempts consistently with how human attempts were judged this year,” he said.

Dr. Gowers added in an email: “I was definitely impressed.” The lab had discussed its Olympiad ambitions with him a couple of weeks beforehand, so “my expectations were quite high,” he said. “But the program met them, and in one or two instances significantly surpassed them.” The program found the “magic keys” that unlocked the problems, he said.

Hitting the gong

Google DeepMind’s “superhuman reasoning team” on July 16 in Mountain View, Calif. Credit: Aaron Cohen

After months of rigorous training, the students sat for two exams, three problems per day — in algebra, combinatorics, geometry and number theory.

The A.I. counterpart beavered away roughly in tandem at the lab in London. (The students were not aware that Google DeepMind was competing, in part because the researchers did not want to steal the spotlight.) Researchers moved the gong into the room where they had gathered to watch the system work. “Every time the system solved a problem, we hit the gong to celebrate,” David Silver, a research scientist, said.

Haojia Shi, a student from China, ranked No. 1 and was the only competitor to earn a perfect score — 42 points for six problems; each problem is worth seven points for a full solution. The U.S. team won first place with 192 points; China placed second with 190.

The Google system earned its 28 points for fully solving four problems — two in algebra, one in geometry and one in number theory. (It flopped at two combinatorics problems.) The system was allowed unlimited time; for some problems it took up to three days. The students were allotted only 4.5 hours per exam.

For the Google DeepMind team, speed is secondary to overall success, as it “is really just a matter of how much compute power you’re prepared to put into these things,” Dr. Silver said.

“The fact that we’ve reached this threshold, where it’s even possible to tackle these problems at all, is what represents a step-change in the history of mathematics,” he added. “And hopefully it’s not just a step-change in the I.M.O., but also represents the point at which we went from computers only being able to prove very, very simple things toward computers being able to prove things that humans can’t.”

Algorithmic ingredients

Applying A.I. to mathematics has been part of DeepMind’s mission for several years, often in collaboration with world-class research mathematicians.

“Mathematics requires this interesting combination of abstract, precise and creative reasoning,” Dr. Davies said. In part, he noted, this repertoire of abilities is what makes math a good litmus test for the ultimate goal: reaching so-called artificial general intelligence, or A.G.I., a system with capabilities ranging from emerging to competent to virtuoso to superhuman. Companies such as OpenAI, Meta AI and xAI are tracking similar goals.

Olympiad math problems have come to be considered a benchmark.

In January, a Google DeepMind system named AlphaGeometry solved a sampling of Olympiad geometry problems at nearly the level of a human gold medalist. “AlphaGeometry 2 has now surpassed the gold medalists in solving I.M.O. problems,” Thang Luong, the principal investigator, said in an email.

Riding that momentum, Google DeepMind intensified its multidisciplinary Olympiad effort, with two teams: one led by Thomas Hubert, a research engineer in London, and another led by Dr. Luong and Quoc Le in Mountain View, each with some 20 researchers. For his “superhuman reasoning team,” Dr. Luong said he recruited a dozen I.M.O. medalists — “by far the highest concentration of I.M.O. medalists at Google!”

The lab’s strike at this year’s Olympiad deployed the improved version of AlphaGeometry. Not surprisingly, the model fared rather well on the geometry problem, polishing it off in 19 seconds.

AlphaGeometry 2 solved Problem 4 of the Olympiad in 19 seconds, after receiving the formalized statement of the problem in AlphaGeometry’s own language. Credit: Google DeepMind

Dr. Hubert’s team developed a new model that is comparable but more generalized. Named AlphaProof, it is designed to engage with a broad range of mathematical subjects. All told, AlphaGeometry and AlphaProof made use of a number of different A.I. technologies.

One approach was an informal reasoning system, expressed in natural language. This system leveraged Gemini, Google’s large language model. It used the English corpus of published problems and proofs and the like as training data.

The informal system excels at identifying patterns and suggesting what comes next; it is creative and talks about ideas in an understandable way. Of course, large language models are inclined to make things up — which may (or may not) fly for poetry and definitely not for math. But in this context, the L.L.M. seems to have displayed restraint; it wasn’t immune to hallucination, but the frequency was reduced.

Another approach was a formal reasoning system, based on logic and expressed in code. It used theorem prover and proof-assistant software called Lean, which guarantees that if the system says a proof is correct, then it is indeed correct. “We can exactly check that the proof is correct or not,” Dr. Hubert said. “Every step is guaranteed to be logically sound.”

Another crucial component was a reinforcement learning algorithm in the AlphaGo and AlphaZero lineage. This type of A.I. learns by itself and can scale indefinitely, said Dr. Silver, who is Google DeepMind’s vice-president of reinforcement learning. Since the algorithm doesn’t require a human teacher, it can “learn and keep learning and keep learning until ultimately it can solve the hardest problems that humans can solve,” he said. “And then maybe even one day go beyond those.”

Dr. Hubert added, “The system can rediscover knowledge for itself.” That’s what happened with AlphaZero: It started with zero knowledge, Dr. Hubert said, “and by just playing games, and seeing who wins and who loses, it could rediscover all the knowledge of chess. It took us less than a day to rediscover all the knowledge of chess, and about a week to rediscover all the knowledge of Go. So we thought, Let’s apply this to mathematics.”

Dr. Gowers doesn’t worry — too much — about the long-term consequences. “It is possible to imagine a state of affairs where mathematicians are basically left with nothing to do,” he said. “That would be the case if computers became better, and far faster, at everything that mathematicians currently do.”

“There still seems to be quite a long way to go before computers will be able to do research-level mathematics,” he added. “It’s a fairly safe bet that if Google DeepMind can solve at least some hard I.M.O. problems, then a useful research tool can’t be all that far away.”

A really adept tool might make mathematics accessible to more people, speed up the research process, nudge mathematicians outside the box. Eventually it might even pose novel ideas that resonate.