> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit
I interpreted that bit as meaning they did not manually alter the problem statement before feeding it to the model - they gave it the exact problem text issued by IMO.
It is not clear to me from that paragraph if the model was allowed to call tools on its own or not.
As a side question, do you think using tools like Lean will become a staple of these "deep reasoning" LLM flavors?
It seems that LLMs excel (relative to other paradigms) in the kind of "loose" creative thinking humans do, but are also prone to the same kinds of mistakes humans make (hallucinations, etc). Just as Lean and other formal systems can help humans find subtle errors in their own thinking, they could do the same for LLMs.
I was surprised to see them not using tools for it, that feels like a more reliable way to get useful results for this kind of thing.
I get the impression not using tools is part of the point though - to help demonstrate how much mathematical "reasoning" you can get out of just a model on its own.
Yes, I'm similarly surprised. Intuitively I'd think that it's much better to train on using Lean, since it's much easier to do RL on it (Lean gives you an objective metric on whether you achieved your objective). It also seems more useful in some ways.
But all the model providers are putting emphasis on the "this is only using natural language" angle, which I think is interesting both from a "this is easier for humans to actually use" perspective, but also comes from a place of "look how general the model is".
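To make the "objective metric" point concrete, here is a minimal Lean 4 sketch using only the core library. The theorem name is made up for illustration, and none of this is taken from how any lab actually trains. The point is just that the checker either accepts the proof or rejects it, which is the kind of binary signal that is easy to do RL against.

```lean
-- Lean 4, core library only. The kernel either accepts this proof or
-- rejects it; there is no partial credit.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete arithmetic fact, checked by computation.
example : 2 + 2 = 4 := rfl
```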
End to end in natural language would imply no tool use, I'd imagine. Unless it called another tool which converted it but that would be a real stretch (smoke and mirrors).
I've had a couple of bad experiences with Lyft recently, including one time the driver must have clicked that they picked me up while a block away, because I could see the Lyft driving to the destination without me. I tried to get a refund since I was obviously waiting at my start location the whole time, but the system claimed the drive went from start to finish (even though I wasn't in the car), so no refund.
Same thing happened to me, and the support system automatically decided nothing was wrong whatsoever despite my phone certainly sending a very different location from the driver. And the madness was I couldn't even book another ride as I was technically in one.
So I ended up getting it resolved via the security panic button which did put me through to a real person who was empathetic to the issue.
Is this some sort of a scam? The driver cannot even mark the ride as completed without being in the area right? So they have to drive it anyway. I can’t imagine they would be on the platform for long if this happened on a regular basis. I would say it’s probably an accident but how could this behavior be accidental? Someone might accidentally say that they picked you up, but they couldn’t accidentally then drive an empty car to the destination.
Entirely possible, people do get into wrong rideshare vehicles. Especially late night after people have been drinking. A decent driver will confirm the name when you’re in a place with a lot of pickups happening but if the language barrier is strong that might not happen.
Has anybody tried "driving" for one of these companies using GPS spoofing? You could fake the location of your phone. I suppose it'd only work a few times before the number of reports gets you banned, but I wonder whether on a large enough (and automated) scale it would be profitable for scammers.
I had a driver commit GPS spoofing on me:
I was standing outside and there was no car to be seen anywhere, even though the app showed the driver was there and had been "driving" to the spot.
I tried to report a security incident to Uber, but not sure what happened. It would likely be easier to complain today, as now all taxis (which Uber technically is in Norway) need to be part of a Taxi dispatch central
Given that they track you every inch of your route, it'd be a pain in the butt to attempt to fake it.
I've gotten a refund on food before because my driver picked up my food and then went to spend half an hour in a gas station before returning to their route, even though my home was 2 minutes away.
Can it be both? Maybe semantics, but a lot of folks are taking Waymo because there's no human driver. "No human driver" may now be considered "premium," but saying that automation is not a significant factor doesn't quite ring true. As a single point of reference, the automation is a big part of what makes it attractive to me as a rider, both because there's no human driver (not super critical to my experience, but I prefer being in the car solo) and, more importantly, because of the driving behavior; it just feels like a better driver than most drivers on the road and that's due to the automation.
Comcast gives you the illusion of being able to talk to a human being if you are persistent enough.
What ends up happening is at some point they send you a link to talk to their support bot and tell you they are hanging up on you.
Threatening cancelation is the only way. The only reason they will not care is because of their captive markets. This is what you get with no competition.
Uber lets you enable a PIN for each ride. The driver can't say they picked you up until they punch in the random 4 digit PIN the app gave you for the ride.
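A rough sketch of what that kind of PIN gate could look like on the server side. This is not Uber's actual implementation; the `Ride` class and `issue_pin` helper are hypothetical names made up for illustration.

```python
import hmac
import secrets


def issue_pin() -> str:
    """Return a random 4-digit PIN as a zero-padded string."""
    return f"{secrets.randbelow(10_000):04d}"


class Ride:
    """Hypothetical ride record: the PIN is shown only in the rider's app."""

    def __init__(self) -> None:
        self.pin = issue_pin()
        self.started = False

    def start(self, entered_pin: str) -> bool:
        # The driver can only mark the pickup once the PIN matches.
        # compare_digest avoids leaking digits through timing differences.
        if hmac.compare_digest(entered_pin.encode(), self.pin.encode()):
            self.started = True
        return self.started
```

A per-ride random PIN means a driver who is a block away has nothing to type in, which is exactly the failure mode described upthread.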
It's not unusual to call a taxi for another person. Or to make a multi-stop journey where some people get out before others. You can even send a parcel across town in a taxi.
Checking phone proximity might be helpful in some cases, but it's not a silver bullet.
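For what it's worth, the proximity check itself is cheap. Here is a minimal sketch using the haversine formula; the 150 m threshold and the function names are assumptions for illustration, and it deliberately ignores the cases above (rides booked for someone else, parcels, multi-stop trips), which is why it can only be one signal among several.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def pickup_is_plausible(driver, rider, max_distance_m=150.0):
    """True if the driver's and rider's phones are within max_distance_m.

    driver and rider are (lat, lon) tuples; the threshold is a made-up default.
    """
    return haversine_m(*driver, *rider) <= max_distance_m
```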
I never give location permissions to any app if I can avoid it (indeed I don't even have the spyware app if I can avoid it; e.g. I use the web to order an Uber)
Exactly - I believe it should be required for safety, to limit shenanigans, etc. Apparently, it is required in Puerto Rico, but I don't know if drivers have to enable it themselves or if the app knows where the driver is operating. Are you saying the rider can also turn it on all the time? If so, that's good - I've only ever seen drivers request it (all in PR, and one in mainland US; everywhere else, no PIN).
I waited 40 minutes for a Lyft at an airport because the driver made up a story about an accident and traffic, in the airport. No one else seemed to be affected by this traffic- so eventually I tried booking an Uber. It arrived 3 minutes later.
20 minutes after that the Lyft driver keeps texting me “where are you?!”. Their turn to wait!
Saw later they just started the ride without me and drove to my hotel.
Lyft said “this trip was completed, no refund”. Welp, app deleted.
I've had several cases of drivers just not picking me up: taking their time to move anywhere at all, driving away and getting further and further away, or driving towards me only to turn in some other direction. I always just cancel on them and have never had to pay a cancellation fee. I think once or twice they "picked me up" a block away. I'm pretty sure I was able to cancel or end the ride on that too; I definitely was never charged, though I don't recall if I had to use the support. But I never let it actually complete the trip when I wasn't riding. But I was always very miffed when anything like that happened as I did not appreciate them wasting my time.
On Uber I paid for priority pickup and watched as a driver drove within two blocks of my home and then sat in a neighborhood for 10 minutes. I finally message "Everything OK?" and get no reply but they finish their journey to my place.
That must be annoying, to say the least. In India drivers require an OTP to start a ride.
The OTP is the same for a user across rides, so I have mine memorised which is nifty. No fiddling with the phone during boarding.
On security: exploiting this would require the driver to stay in my vicinity the next time I book a ride, and also get the ride assigned to them.
In a high population density area, it's rare - I've never had the same driver twice.
It solves the problem for 99.99% of the time. Drivers are not going to memorize your OTP; and it is unlikely that an OTP list will be leaked/used anytime soon.
Whether one cares depends very strongly on what "retaliation" means. If they ban your account, not a big deal - you were getting bad service and didn't want to do business with them anyway. If they send an armed hit squad to kill you, that would be worth being concerned about though.
I "purchased" a digital game once on the PlayStation Store. It wasn't clear from the description that it was completely useless without an active subscription to PSN, so I tried to return it. They said no way, sales are final and you've already launched the game. I did a chargeback, and they basically locked down my account until I filed a support ticket and had to lie, saying someone else made a purchase on my account.
I’ve heard the story from the other side as well: the app reports the ride is arriving, people get into the wrong car, it goes the wrong way, and they see their original ride reporting that they are not there and leaving again.
So it may not be intentional. Just coincidence and poor verification.
Companies that cheap out by not performing the basic obligations of business end up paying more for small claims court - provided their ripped-off customers actually take them to small claims court. Did you?
Completely agree on both counts! I loved those two games and felt Conquests of the Longbow didn't get the recognition it deserves.
On the second point, when I read his book (https://kensbook.com/) I was disappointed to not hear about the magic of the games themselves and the creative process behind them. It became clear that his primary goal was to grow a business; he thought being a game distributor was more exciting, but then was disrupted by Steam, shareware, and online distribution.
I'm building such tools at https://sugaku.net, right now there's chatting with a paper and browsing similar papers. Generally arXiv and other repositories want you to link to them and not embed their papers, which makes it hard to build inline reading tools, but it's on my roadmap to support that for uploaded papers. Would love to hear if you have some feature requests there
One feature could be that it automatically fetches the papers that it refers to and also feeds them through the llm. And maybe apply that recursively. This could give the AI a better overview of the related literature.
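A minimal sketch of that recursive fetch, assuming some `get_references(paper_id)` lookup exists (hypothetical here; in practice it would hit a citation index). It is a breadth-first walk with a depth cap and a seen-set, since citation graphs fan out quickly and can contain cycles.

```python
from collections import deque


def collect_references(root, get_references, max_depth=2):
    """Breadth-first walk of the citation graph.

    Returns paper ids in visit order, starting with root, following
    references up to max_depth hops away. get_references is a
    caller-supplied function: paper id -> iterable of cited paper ids.
    """
    seen = {root}
    order = [root]
    queue = deque([(root, 0)])
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand past the depth cap
        for ref in get_references(paper):
            if ref not in seen:
                seen.add(ref)
                order.append(ref)
                queue.append((ref, depth + 1))
    return order
```

The returned list is what you would then feed through the LLM, nearest references first; the depth cap is the knob that keeps context size and cost bounded.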
After I opened up https://sugaku.net to be usable without login, it was astounding how quickly the crawlers started. I'd like the site to be accessible to all, but I've had to restrict most of the dynamic features to logged in users, restrict robots.txt, use cloudflare to block AI crawlers and bad bots, and I'm still getting ~1M automated requests per day (compared to ~1K organic), so I think I'll need to restrict the site to logged in users soon.
What happens if you use the proper rate limiting status of 429? It can include a next retry time via the Retry-After header [1]. I'm curious what (probably small) fraction would respect it.
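For anyone who wants to try it, here is a framework-agnostic sketch of a fixed-window limiter that answers with 429 plus a `Retry-After` header in seconds. The class name and the limits are assumptions for illustration; a real deployment would more likely do this at the proxy or CDN layer.

```python
import math
import time


class FixedWindowLimiter:
    """Allow `limit` requests per client per `window_s` seconds."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock      # injectable so the limiter can be tested
        self.windows = {}       # client id -> (window start, request count)

    def check(self, client):
        """Return (status, headers) for one incoming request."""
        now = self.clock()
        start, count = self.windows.get(client, (now, 0))
        if now - start >= self.window_s:
            start, count = now, 0  # window expired, start a fresh one
        if count >= self.limit:
            # Tell the client how long until the current window ends.
            retry_after = math.ceil(start + self.window_s - now)
            return 429, {"Retry-After": str(max(retry_after, 1))}
        self.windows[client] = (start, count + 1)
        return 200, {}
```

Logging which clients come back before their `Retry-After` has elapsed would answer the question of what fraction respects it.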
I use IP addy. Users behind cgnat are already used to getting captcha the first time around
There’s some stuff you can do, like creating risk scores (if a user changes ip and uses the same captcha token, increase score). Many vendors do that, as does my captcha provider.
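The IP-change rule is simple enough to sketch. This is only an illustration of the idea, not what any vendor actually ships; the class name and penalty value are made up.

```python
from collections import defaultdict


class CaptchaRiskTracker:
    """Raise a token's risk score each time it shows up from a new IP."""

    def __init__(self, ip_change_penalty=1):
        self.ip_change_penalty = ip_change_penalty
        self.last_ip = {}               # captcha token -> last IP seen
        self.score = defaultdict(int)   # captcha token -> risk score

    def observe(self, token, ip):
        """Record one request and return the token's current risk score."""
        previous = self.last_ip.get(token)
        if previous is not None and previous != ip:
            # Same solved captcha, different address: suspicious.
            self.score[token] += self.ip_change_penalty
        self.last_ip[token] = ip
        return self.score[token]
```

Once a score crosses some threshold you would force a fresh captcha rather than block outright, since mobile users legitimately change IP now and then.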
Very cool! This is also one of my beliefs in building tools for research, that if you can solve the problem of predicting and ranking the top references for a given idea, then you've learned to understand a lot about problem solving and decomposing problems into their ingredients. I've been pleasantly surprised by how well LLMs can rank relevance, compared to supervised training of a relevancy score. I'll read the linked paper (shameless plug, here it is on my research tools site: https://sugaku.net/oa/W4401043313/)
I'm serving AI models on Lambda Labs and after some trial and error I found having a single vLLM server along with Caddy, behind Cloudflare DNS, to work really well and be really easy to set up.
It's really best to avoid running web servers as root. It's easy to forward the port 80 with iptables, change the kernel knob to let unprivileged users use port 80 and above, or set the network capability on the binary.
As a former mathematician, I found research to be a very winding path. While that can be fun, I felt there's a lot of opportunity to train LLMs and ML models on the corpus of math papers, to try to make research more deliberate and less reliant on talking to the right person at the right time.
This is very much a work in progress but so far you can:
* Browse through similar papers
* Get recommendations for new papers and collaborators
* Chat with papers and ask questions to all the major reasoning models
* Have it come up with future paper ideas (along with references) giving a potential title or collaborators.
My focus very much is on the exploratory stages since that's where a lot of the time is spent, but I intend to integrate more tools for problem solving, writing, and computation.
I think you should have some "about us" section on your webpage if you want people to give their email addresses. I already get loads of spam that knows my email address belongs to someone with a PhD (though they are often shaky on the details). I looked at your site and there's no information about who is doing it and why.
That's fair, though there's not much to say since I'm building it out myself as a benefit corporation. I also have strict opt-out for any communications and a proper privacy policy.
I've also tried to keep as much as I can accessible without login, but I want to protect some of the more expensive features from being spammed.
No matter how original you think you are, it's almost always already been done. You think you found a new theorem and then you check some old pdf from 20+ years ago and it's already been done.
If you can pull it off, and the result is actually novel and not trivial, you can get a PhD. That is how hard it is.
The flipside of that is seeing hints of a result that would be really helpful. I still remember how excited I was to stumble on a book from 1931 (The Taylor Series by Dienes) since it had the only English-language proofs of some results by Szego and Polya that I felt could unblock my research. My hope is that this discovery problem can be largely solved.
This is also why I'm not as excited by the focus on pure reasoning and olympiad problem solving in the math and AI space. It's like the early career phase of trying to solve Collatz and Riemann but just repeating work from decades ago.