In a modal-logic sense, Chinese is inherently more □-oriented, while the language used in the US leans on ◇ more. So in Chinese, ¬(□¬A) can be used to express the possibility concept ◇A.
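(For reference, that is just the standard duality between the two modal operators; a minimal LaTeX rendering of both directions:)

```latex
% "Possibly A" is definable as "not necessarily not A", and vice versa.
\[
  \Diamond A \;\equiv\; \lnot \Box \lnot A
  \qquad\qquad
  \Box A \;\equiv\; \lnot \Diamond \lnot A
\]
```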
OP here. That's kind of the idea behind listing the number of attempts alongside failures/successes. It's a loose metric for how "compliant" a model is - i.e. how much work you have to put in to get a nominally successful result.
At least most functional language tutorials are explicit about being based on abstract machines, unlike C, whose abstract machine is a spherical cow that people are often not aware of. https://queue.acm.org/detail.cfm?id=3212479
Oh, it is the selected output - yes, that's what I meant; I was a bit confused. So in the initial design, when you first tried it, did you already pass both to the next layer? Or was that something you found out later performs better?
Even in the earliest stages of the DDN concept, we had already decided to pass features down to the next layer.
I never even ran an ablation that disabled the stem features; I assume the network would still train without them, but since the previous layer has already computed the features, it would be wasteful not to reuse them. Retaining the stem features also lets DDN adopt the more efficient single-shot-generator architecture.
Another deeper reason is that, unlike diffusion models, DDN does not need the Markov-chain property between adjacent layers.
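(Not the actual DDN code - just a minimal PyTorch-style sketch of what "passing the selected output together with the stem features to the next layer" could look like. The module names, channel counts, and the `forward` signature here are all hypothetical, and how the selected candidate is chosen is omitted.)

```python
import torch
import torch.nn as nn

class ToyDDNLayer(nn.Module):
    """Hypothetical layer that consumes both the previous layer's
    selected sample and its intermediate ("stem") features."""

    def __init__(self, feat_ch=64, img_ch=3, k=8):
        super().__init__()
        # Fuse the selected output (image-like) with the reused stem features.
        self.fuse = nn.Conv2d(feat_ch + img_ch, feat_ch, 3, padding=1)
        self.body = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
            nn.ReLU(),
        )
        # Produce k candidate outputs; one of them is selected downstream.
        self.heads = nn.Conv2d(feat_ch, k * img_ch, 1)
        self.k, self.img_ch = k, img_ch

    def forward(self, selected_prev, stem_prev):
        x = self.fuse(torch.cat([selected_prev, stem_prev], dim=1))
        stem = self.body(x)  # stem features, reused by the next layer
        b, _, h, w = stem.shape
        candidates = self.heads(stem).view(b, self.k, self.img_ch, h, w)
        return candidates, stem  # next layer receives (selected candidate, stem)
```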
It may be a Claude-specific thing. I tried asking Claude to do various machine learning tasks, like implementing gradient boosting, without specifying the language, thinking it would use Python since it is the most common option and has utilities like Numpy that make things much easier. But Claude mostly chose JavaScript and somehow managed to do it in JS.
Maybe it is a cultural difference, but I feel that the "supermarket workers, gas station attendants" (in an Asian country) that I know should be quite capable of most ARC tasks.
There is a meaningful distinction. They only use backprop one layer at a time, requiring additional space proportional to that layer. Full backprop requires additional space proportional to the whole network.
It's also a bit interesting as an experimental result, since the core idea doesn't require backprop. Since it's just an implementation detail, you could theoretically swap in other layer types or solvers.
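(Rough illustration only, not the paper's code: a sketch of layer-local training in PyTorch, where `.detach()` cuts the autograd graph between layers so backprop only ever stores activations for one layer at a time. The layer structure and the local loss here are made up.)

```python
import torch
import torch.nn as nn

# Hypothetical stack of layers, each with its own local loss and optimizer.
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(128, 128), nn.ReLU()) for _ in range(6)]
)
opts = [torch.optim.Adam(layer.parameters(), lr=1e-3) for layer in layers]

def local_training_step(x, target):
    """Train each layer with backprop confined to that layer.

    The .detach() at the end of each iteration means the autograd graph
    never spans more than one layer, so the extra memory for backprop is
    proportional to a single layer, not the whole network.
    """
    for layer, opt in zip(layers, opts):
        h = layer(x)
        # Made-up local objective; a real method would define its own.
        loss = (h - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()   # gradients flow only through `layer`
        opt.step()
        x = h.detach()    # cut the graph before the next layer
    return x

# Example usage with random data:
out = local_training_step(torch.randn(32, 128), torch.randn(32, 128))
```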