Gordo and Bruce are pioneers in the gliding world. One of their coolest flights, and a great showcase of their creative flight planning, is their 3,000 km flight in the Sierra Nevada, along with the build-up to it.
Some basics: The major challenge in flying gliders is the inherent stochasticity of the weather system. Think of it as a contextual bandit problem with high variance w.r.t. local weather (i.e., even the best planning cannot help if the weather doesn't comply). We have some observability thanks to forecasting tools (skysight.io), and any policy must have affordances for pilot skill and a margin of safety. A good pilot (or 'policy') starts with multiple plans, quickly modifies those plans to suit the environment, and can seamlessly switch between them. The primary "reward signals" are duration of flight, distance covered, and (in competitions) hitting certain waypoints.
Previous world records for longest flight were mostly set in the Andes or the Alps. You want to be in a mountain range to utilize either the [ridge lift](https://en.wikipedia.org/wiki/Orographic_lift) of a mountain face or [mountain wave](https://en.wikipedia.org/wiki/Lee_wave), ideally in a polar region during the summer to maximize the daylight hours so you can fly under VFR for longer.
However, while the Sierra Nevada has great mountain wave and ridge lift, the number of daylight hours is not "competitive". Their main innovation was acclimatizing themselves to flying a glider for long durations with night-vision goggles. There's an article [here](https://magazine.weglide.org/gliding-at-night-breaking-the-3...) which describes the acclimatization flights and the 3,000 km flight in great detail. The flight doesn't get official recognition because the FAI requires it to be done in daylight, but it's still an extremely cool flight!
Can you speak more on why glider pilots need night vision goggles to fly at night but single-engine pilots don't? Is it the risk of landing out? Or are they flying closer to the terrain?
My understanding is that (1) there is, as you say, a very nonzero risk of landing in a field, and good visibility of what is _in_ that field is critical; (2) when riding thermals, many gliders traditionally soar in close proximity to the core of the rising air mass, circling at quite a high bank angle, so collisions need to be avoided (many glider pilots wear parachutes for that reason...) and having visual references, particularly to mountains, really helps; and finally (3) it is common to be flying visually, as one typically staircases through an altitude profile, as seen here, and goes in and out of controlled airspace (or deliberately avoids bumping into it, as I have done at 10 kft in UK airspace a long time ago).
In contrast, general aviation aircraft:
a) Have bright lights
b) Will fly in a straight line at a well defined altitude, meaning that vertical separation is sufficient to deconflict aircraft
c) Typically land on runways rather than in fields, and runways often _also_ have bright lights.
Slightly meta-level: I'm glad the authors find the ICLR reviews useful; this illustrates one of the successes of ICLR's policy of always publishing the reviews (regardless of whether the paper is accepted or rejected).
The authors benefit from having "testimonials" of how anonymous reviewers interpreted their work, and it also opens the door for people outside the classic academic pipeline to see the behind-the-scenes arguments for accepting or rejecting a paper.
In gliding, tow upsets are pretty common and, in rare cases, can be fatal. An out-of-position glider can _easily_ and very quickly overcome the tow plane's elevator authority (its ability to pitch up or down), which leads to accidents like this. This video does a good job explaining the root causes and potential dangers (https://youtu.be/5cpqFzhM9dY?si=J7GxP1dI9Xopy3xu). Also read the comments for testimonials from other glider pilots.
This is my biggest concern with this concept as well. Towing things is challenging because the tow plane's effective center of gravity can change drastically depending on the forces from the glider it is towing -- if the glider deploys its spoilers, crabs in a crosswind, or gets in your wake turbulence, you're not going to be able to predict how it changes your CG (and your control authority) without training or experience. Also, with gliders, the tow plane is traveling at around 60 to 90 MPH, with a decision window of 2-3 seconds. Commercial planes travel at ~500 MPH... The concept seems like a hard sell to the pilot unions. I bet they've thought about this though.
> An out-of-position glider out can _easily_ and very quickly overcome the tow planes elevator authority
Would this not be trivially solvable with a system that detects the situation (e.g. by measuring the forces acting on the towing plane's attachment point) and detaches the tow? If in the final concept the towed plane would be unmanned and wouldn't contain fuel, even a crash would not be particularly catastrophic.
You misread: it causes the crash of the leading plane, not of the following one, so the glider having no fuel is completely irrelevant.
As for a system that measures forces, that’s not likely to work either. Transient forces are OK, but the same force over a little bit of time is enough to force a nose down attitude that is unrecoverable. Attempting to draw the line unequivocally between the two is difficult because it depends on conditions, weights, centers of gravity, and many other things.
They didn't misread; what they're saying is that the lead plane would detect conditions/forces that would result in a tow upset and then cut the tow tether. There's a video in this thread showing that currently, in manned gliders, the glider pilot can release, and has a responsibility to release, if a tow upset is happening.
The force at the attachment point is constantly changing and depends on several factors.
- the weight of either airplane.
- the performance of the engine on that particular day (varies by altitude / airspeed / temp / mixture / type of fuel / ...)
- the instantaneous weather conditions
- the performance characteristics of either plane.
- slack in the rope (no tension to two times the weight of the glider)
- the glider's towing position (below / above wake)
- crosswinds
- the glider's preferred towing position (depends on visibility from the cockpit, e.g. if someone has a phone or a tablet on the dash, the towing position will be different)
So it isn't really a trivial problem, especially when false positive or false negative will lead to a crash.
Notice how he's always on the stick. Also notice how fast it goes from a stable to an unstable position.
> even a crash
Recklessness is never the answer in aviation (or in coding, for that matter). Practically, good luck convincing an insurer to cover a 100-ton (any appreciable cargo load) plane that might fall out of the sky onto any property in the general vicinity.
So now we are dropping shipping containers with wings out of the sky when things go south with the towing.
In order to make such a contingency safe, we'll need swathes of ground that are clear of any population so that these things can crash without collateral damage.
If you have a corridor of land that's void of population between your origin and destination, then you might as well, you know, lay down some tracks or tarmac and get rid of the whole flying business altogether.
Now, if you have a body of water between your points, this plan might be better suited, I think.
This is my research area. I just finished reviewing six NeurIPS papers (myself, no LLM involved) on LLM agents for discovery and generation, and I'm finding that evaluating LLM agents on raw task performance isn't as insightful anymore -- every paper claims a state-of-the-art 10x performance boost from {insert random acronym that devolves into combinatorial search}. Rather, the true test for such algorithms is whether their empirical scaling curves are more computationally favorable than those of an existing baseline search algorithm (like CoT).
Three motivating points:
- GEPA / evolutionary agents perform a zeroth-order (no-gradient) optimization in a combinatorial space. Their loss curves are VERY noisy and stochastic. If we run such agents multiple times, the performance variance is extremely high -- and in some cases it cancels out the gains from a single experiment. However, obtaining error bounds is hard because the API costs are pretty restrictive.
- The problem we face with test-time scaling is not that prompt engineering is ineffective or less effective than fine-tuning. It is that fine-tuning _reliably_ increases a model's performance on any subset of tasks, and the scaling curves for performance per additional data token are well understood.
- Test-time optimization techniques work well on in-distribution problems (e.g., generate and debug this Python code) but fail pretty badly on even slightly out-of-distribution problems (e.g., generate and debug this Julia code). Compare this to gradient search -- it would've been fascinating and confusing if SGD failed to optimize a CNN image classifier on COCO but worked very well on ImageNet.
How do people feel about this? Does this line up with your viewpoints?
- Raw accuracy is now a "vanity" metric, so the benchmarks need to get more sophisticated, and I think they're going to have to be far more task-specific than HotpotQA or HoVer; those have become the MNIST of multi-hop.
- In my use of MIPROv2 and SIMBA, I see a fair amount of improvement on multi-hop tasks (I've published some of these results on HN before). I'm going to try GEPA and see how it performs. So I think we're at the start of what I'd call "meta-learning": tuning across a huge search surface rather than tweaking one prompt -- hyperparameter search for higher-dimensional spaces.
I can't comment on your detailed knowledge of the state of the art, but your points resonate (particularly because I have tried to generate Julia and Lean code).
So, as with any less informed user reviewing LLM output, what you say definitely sounds plausible and correct.
Finally—something directly relevant to my research (https://trishullab.github.io/lasr-web/).
Below are my take‑aways from the blog post, plus a little “reading between the lines.”
- One lesson DeepMind drew from AlphaCode, AlphaTensor, and AlphaChip is that large‑scale pre‑training, combined with carefully chosen inductive biases, enables models to solve specialized problems at—or above—human performance.
- These systems still require curated datasets and experts who can hand‑design task‑specific pipelines.
- In broad terms, FunSearch (and AlphaEvolve) follow three core design principles:
- Off‑the‑shelf LLMs can both generate code and recall domain knowledge. The “knowledge retrieval” stage may hallucinate, but—because the knowledge is expressed as code—we can execute it and validate the result against a custom evaluation function.
- Gradient descent is not an option for discrete code; a zeroth‑order optimizer—specifically evolutionary search—is required.
- During evolution we bias toward (1) _succinct_ programs and (2) _novel_ programs. Succinctness is approximated by program length; novelty is encouraged via a MAP‑Elites–style "novelty bias," yielding a three‑dimensional Pareto frontier whose axes are _performance, simplicity,_ and _novelty_ (see, e.g., OE-Dreamer: https://claireaoi.github.io/OE-Dreamer/).
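To make the three-axis selection above concrete, here is a minimal sketch of a Pareto-front check over (performance, simplicity, novelty). `Candidate` and its fields are illustrative stand-ins, not FunSearch's or AlphaEvolve's actual data structures:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    code: str
    performance: float  # higher is better (score from the evaluation function)
    simplicity: float   # higher is better (e.g., negative program length)
    novelty: float      # higher is better (e.g., distance to nearest archive member)

def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    ge = (a.performance >= b.performance
          and a.simplicity >= b.simplicity
          and a.novelty >= b.novelty)
    gt = (a.performance > b.performance
          or a.simplicity > b.simplicity
          or a.novelty > b.novelty)
    return ge and gt

def pareto_front(population: list[Candidate]) -> list[Candidate]:
    """Keep candidates not dominated by any other: the 3-D frontier."""
    return [c for c in population
            if not any(dominates(o, c) for o in population if o is not c)]
```

A full MAP-Elites archive would additionally bin candidates by feature descriptors and keep the best program per bin; the dominance check above is just the frontier-selection piece.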
Pros
- Any general‑purpose foundation model can be coupled with evolutionary search.
- A domain expert merely supplies a Python evaluation function (with a docstring explaining domain‑specific details). Most scientists I've talked with - astronomers, seismologists, neuroscientists, etc. - already maintain such evaluation functions for their own code.
- The output is an interpretable program; even if it overfits or ignores a corner case, it often provides valuable insight into the regimes where it succeeds.
Cons
- Evolutionary search is compute‑heavy and LLM calls are slow unless heavily optimized. In my projects we need ≈60k LLM calls per iteration to support a reasonable number of islands and populations. In equation discovery we offset cost by making ~99% of mutations purely random; every extra 1% of LLM‑generated mutations yields roughly a 10% increase in high‑performing programs across the population.
- Evaluation functions typically undergo many refinement cycles; without careful curation the search may converge to a useless program that exploits loopholes in the metric.
Additional heuristics make the search practical. If your evaluator is slow, overlap it with LLM calls. To foster diversity, try dissimilar training: run models trained on different data subsets and let them compete. Interestingly, a smaller model (e.g., Llama-3 8B) often outperforms a larger one (Llama-3 70B) simply because it emits shorter programs.
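A minimal sketch of the island-model loop with the ~99%/1% random-vs-LLM mutation split described above. All the callables (`evaluate`, `random_mutate`, `llm_mutate`) are placeholders the user supplies, not any real AlphaEvolve API, and the selection/migration scheme is one simple choice among many:

```python
import random

def evolve_islands(
    init_programs,        # list of seed programs
    evaluate,             # program -> fitness (the domain expert's evaluation function)
    random_mutate,        # cheap, purely random mutation (the ~99% case)
    llm_mutate,           # expensive LLM-guided mutation (the ~1% case)
    n_islands=4, pop_size=16, n_steps=10, llm_frac=0.01, migrate_every=5,
):
    islands = [[random.choice(init_programs) for _ in range(pop_size)]
               for _ in range(n_islands)]
    for step in range(n_steps):
        for i, pop in enumerate(islands):
            # Truncation selection: keep the best half as parents.
            # (In practice you'd cache fitness instead of re-evaluating.)
            parents = sorted(pop, key=evaluate, reverse=True)[: pop_size // 2]
            children = []
            for _ in range(pop_size - len(parents)):
                p = random.choice(parents)
                # Spend the LLM budget sparingly: most mutations are purely random.
                children.append(llm_mutate(p) if random.random() < llm_frac
                                else random_mutate(p))
            islands[i] = parents + children
        if step % migrate_every == migrate_every - 1:
            # Ring migration: each island receives its neighbor's best program.
            best = [max(pop, key=evaluate) for pop in islands]
            for i in range(n_islands):
                islands[i][-1] = best[(i - 1) % n_islands]
    return max((p for pop in islands for p in pop), key=evaluate)
```

Keeping islands loosely coupled via occasional migration preserves diversity (each island explores its own niche) while still letting good programs spread.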
Non-expert here who likes reading lots of this kind of research. I have a few questions.
1. Why does it need a zeroth order optimizer?
2. Most GAs I've seen use thousands of solutions, sometimes ten thousand or more. What leads you to use 60,000 calls per iteration?
3. How do you use populations and "islands?" I never studied using islands.
4. You said the smaller models are often better because of "shorter" code. That makes sense. I've seen people extend a model's context with training passes. Do you think it would help to similarly shrink a larger model to a smaller context instead of using the small models?
1. Because we only have black-box access to the LLM, and the evaluation function might not be differentiable.
2. We're trying to search over the space of all programs in a programming language. To cover enough of this huge search space, we need (1) a large number of programs in each population, (2) a large number of populations, and (3) a large number of update steps for each population.
4. This is an interesting question. I believe so. However, my observations were derived from a non-Turing-complete language (mathematical equations). There might be other ways of enforcing a succinctness pressure.
> The “knowledge retrieval” stage may hallucinate, but—because the knowledge is expressed as code—we can execute it and validate the result against a custom evaluation function.
Can you give a concrete example of this? It's hard for me to conceptualize.
Assume you have data for Hooke's law (a spreadsheet with F, x, and other variables) and you want AlphaEvolve to give you the equation `F = -C_1*x`.
Let's say the model hallucinates in two directions:
1. "There is a trigonometric relationship between variables F and x." It expresses this as `F = -C_1*sin(x)`. You fit the constant C_1 w.r.t. the dataset, execute the program, and your best fit has a high error. You can discard the program.
2. "There is an inverse linear relationship between variables F and x." Now it expresses this as `F = -C_1*x`. You fit the constant C_1 w.r.t. the dataset, execute the program, and your best fit has extremely low error. You now know you're on the right track.
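Here is a toy version of that fit-execute-validate loop. The spring constant, noise level, and data range are made up for illustration:

```python
import math
import random

# Synthetic Hooke's-law data: F = -3.0 * x plus a little measurement noise.
random.seed(0)
xs = [random.uniform(0.1, 2.0) for _ in range(200)]
Fs = [-3.0 * x + random.gauss(0, 0.01) for x in xs]

def fit_and_score(basis):
    """Least-squares fit of F = C_1 * basis(x); returns (C_1, mean squared error)."""
    bs = [basis(x) for x in xs]
    # Closed-form 1-D least squares: C_1 = <b, F> / <b, b>.
    c1 = sum(b * f for b, f in zip(bs, Fs)) / sum(b * b for b in bs)
    mse = sum((f - c1 * b) ** 2 for b, f in zip(bs, Fs)) / len(xs)
    return c1, mse

c_sin, err_sin = fit_and_score(math.sin)       # hallucinated guess: F = C_1 * sin(x)
c_lin, err_lin = fit_and_score(lambda v: v)    # correct structure:  F = C_1 * x
```

Executing both candidate programs against the data is what exposes the hallucination: the trigonometric form's best-fit error stays orders of magnitude above the linear form's, so it gets discarded.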
Location: Pasadena, CA
Remote: No/Yes
Willing to relocate: Yes
Technologies: Computer Vision, Code generation, Program Synthesis, Coq, Lean, PyTorch, Julia
Résumé/CV: https://atharvas.net/cv/
Email: atharvas@utexas.edu
I'm a PhD student at UT Austin. Mainly looking for internships in companies that are working on code generation, mathematical reasoning (theorem provers), or interested in interpretable (yet performant) computer vision models. I'm pretty familiar with the state of the art in code generation and interpretable computer vision algorithms. Open-ended projects (research based) would be great but I'm happy to work on anything codegen/perception based really.
If you're looking to run a .exe file, there are a couple of virtualization options on the market (sometimes found on the high seas). I've tried these on a couple of obscure .exe files:
- Parallels
- VMWare fusion
- Apple's Game Porting Toolkit
Parallels has had the highest coverage, but they've locked Parallels down pretty well, and there is no way to outright own a copy (monthly subscription). I think the Game Porting Toolkit is the most exciting of the three, though!
Parallels still offers a one-time purchase for the most limited edition. If you need to give VMs more than 8GB RAM then you'll need a subscription-only edition, but you can get pretty far with 8GB and 4 cores for a VM that's only running Windows 11 and one application.
I encourage everyone to read this paper. It's well written and easy to follow. For the uninitiated, SR is the problem of finding a mathematical (symbolic) expression that most accurately describes a dataset of input-output examples (regression). The most naive implementation of SR is basically a breadth-first search starting from the simplest program tree: x -> sin(x) -> cos(x) ... sin(cos(tan(x))), until timeout. However, we can prune out equivalent expressions and, in general, the problem is embarrassingly parallel, which gives some hope that we can solve it pretty fast (check out PySR[1] for a modern implementation). I find SR fascinating because it can be used for model distillation: learn a DNN approximation and "distill" it into a symbolic program.
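To illustrate the naive approach, here is a tiny breadth-first enumerator over unary compositions that prunes behaviorally equivalent expressions by fingerprinting them on a few sample points. The DSL (three unary ops), the fingerprinting scheme, and the name `naive_sr` are all simplifications for illustration, not how PySR works:

```python
import math
from collections import deque

def naive_sr(target, ops=(math.sin, math.cos, math.tan),
             samples=(0.1, 0.5, 1.0), depth=3, tol=1e-9):
    """Breadth-first search over unary compositions op_k(...op_1(x)...)."""
    def fingerprint(f):
        # Behavioral signature: outputs on a few probe points, rounded.
        return tuple(round(f(s), 6) for s in samples)

    seen = set()
    queue = deque([("x", lambda v: v)])  # start from the simplest program tree
    while queue:
        name, f = queue.popleft()
        fp = fingerprint(f)
        if fp in seen:
            continue  # prune behaviorally equivalent expressions
        seen.add(fp)
        if all(abs(f(s) - target(s)) < tol for s in samples):
            return name
        if name.count("(") < depth:  # each wrap adds one "(" -> nesting depth
            for op in ops:
                queue.append((f"{op.__name__}({name})",
                              lambda v, op=op, f=f: op(f(v))))
    return None
```

For example, `naive_sr(math.sin)` returns `"sin(x)"`. The exponential blowup with depth and arity is exactly why real systems lean on pruning, parallelism, and smarter search.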
Note that the paper talks about the decision version of the SR problem, i.e., can we discover the globally optimal expression? I think this proof is important for the SR community but not particularly surprising (to me). However, I'm excited by the potential future work from this paper! A couple of discussion points:
* First, SR is technically a bottom-up program synthesis problem where the DSL (math) has an equivalence operator. Can we use this proof to impose stronger guarantees on the "hyperparameters" for bottom-up synthesis? Conversely, does the theoretical foundation of the inductive synthesis literature [2] help us define tighter bounds?
(EDIT: I was thinking a bit more about this, and [2] is a bad reference for this... Jha et al. give proofs for synthesis with CEGIS, where the synthesizer queries an SMT solver for counterexamples until there are none... kinda like a GAN. Apologies.)
* Second, while SR itself is NP-hard, can we say anything about approximate algorithms (e.g., distilling a deep neural network to find a solution [3])? Specifically, what does the proof tell us about the PAC learnability of SR?
Anyhow, pretty cool seeing such work get more attention!
Random question, if you don't mind: what does symbolic fitting give you that normal regression doesn't? Usually in physics, an analytic solution (our term) yields physical insight, although not everyone is a physicist and cares about that, hence the popularity of machine learning in general.
You might find (Cramer, 1985) interesting! IIRC they go into exactly this problem. However, I can't find an open PDF to link to, unfortunately. I'll edit this tomorrow morning in case I misjudged (Cramer, 1985) or if I find a link.
Here are two hand-wavy arguments that may not be 100% correct:
* Structured Bias: Symbolic regression allows you -- the scientist -- to control exactly what sort of expressions you expect to see. If I'm looking at data coming from a spring, I expect to see a lot of damped sinusoids and very little quantum physics. SR gives you control over the "programming language", while parametric regression only allows you to change the number of parameters (not useful in this context).
* Generality: Parametric regression guarantees the best-fit parametric equation as long as you have a comprehensive sample of your data range. A symbolic expression (most of the time) extrapolates beyond the provided data range. In fact, this is one of the constraints in the main proof of the paper (f* should generalize)! Basically: if I only have data for sin(x) from 0 to \pi, PR will find the best fit, but there is no guarantee that the best fit will also work in the range \pi to 2\pi.
I want to stress that these aren't established facts, and each of these pros introduces its own cons (what if you introduce an incorrect structured bias? what if the "general/simple" solution is actually a little imprecise, like Newton's laws vs. Einstein's theory)! This just means there is plenty of exciting work to be done!
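The sin(x) extrapolation point can be demonstrated in a few lines. Here, a best-fit line stands in for parametric regression (a deliberately simple stand-in, not PySR or any real SR system), fit only on [0, pi] and then scored out of range:

```python
import math

# Training data only covers [0, pi], where sin(x) >= 0.
xs = [i * math.pi / 50 for i in range(51)]
ys = [math.sin(x) for x in xs]

# Parametric regression stand-in: closed-form least-squares line y = a*x + b.
n = len(xs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def mse(model, lo, hi, k=50):
    """Mean squared error of `model` against the true sin(x) on [lo, hi]."""
    pts = [lo + (hi - lo) * i / k for i in range(k + 1)]
    return sum((model(x) - math.sin(x)) ** 2 for x in pts) / len(pts)

line = lambda x: a * x + b
symbolic = math.sin  # what an SR system would ideally recover

in_range_line = mse(line, 0, math.pi)
out_range_line = mse(line, math.pi, 2 * math.pi)
```

The line's error blows up on [pi, 2pi] (where sin goes negative but the fitted line stays near its in-range mean), while the symbolic form sin(x) extrapolates exactly.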
The regression will 'set the values of the parameters'. OP's point is that when you want to change the regression, your main 'hyper-parameter' (i.e., one that you get to choose rather than one the system determines for you) is the number of parameters.
Most neural networks have other hyper-parameters, but the ones in SR are probably quite interpretable and intuitive.
My initial guess is that regression determines how well you fit an existing pattern or group of patterns, but doesn't tell you whether there is an even better-fitting pattern you didn't try. Linear regression will determine how well you fit a linear pattern, and while you can use it to search for non-linear patterns (like regressing home prices against area squared instead of area), it only tests the patterns you put in.
Potentially much better extrapolation from the data, for one thing. The hope with SR is that the equation that is found will in some sense have a deeper connection to the process that generated the data, than a neural network made of lots of RelU units does.
A Fourier expansion is a symbolic fit of data to a curve, yet it provides no information about the system sampled, as shown by the fact that it fits a geocentric model just as well as a heliocentric one.
The idea that an arbitrarily large expression is somehow more understandable than the Fourier coefficients of the first few large terms could only be held by someone who hasn't looked at the vast array of semi-empirical formulas out there, which are as clear as mud.
Well, I think the hope is that formulae reveal relationships. Notable applications of SR have mostly been to physics and other science phenomena. So, you do some SR to understand the mathematical form of the relationship, and then carry on by trying to understand it causally.
The algorithms are generally in the area of machine learning + programming languages and are pretty flexible. This paper talks about how we "bias" these algorithms for applications in the hard sciences (it focuses on behavioral neuroscience, but the approach has been / is being applied in other areas as well).