This is amazing, I will def be playing with it (lossy is also my jam) and hi! I made pamplejuce, hope it worked ok for you, lemme know if anything was rough (it's been growing up a bit lately)
Oh my goodness, hello! Pamplejuce has been a life-saver -- having GitHub Actions with pluginval proved to be an absolute necessity when building and linking the hacked encoder libraries. I love reading your JUCE blog posts as well. Thank you for all that you do!
I'm interested in this idea. I think I got confused at some point and mistakenly thought box blur was a 2D kernel and so it wouldn't perform great. vImage does contain a box blur but I haven't checked its performance (I did check the tent blur and it was so-so...) https://developer.apple.com/documentation/accelerate/blurrin...
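For anyone following along: the nice property of a box blur is that it's separable, so two 1D passes (horizontal then vertical) reproduce the full 2D kernel. A naive, clamped-edge sketch of one 1D pass (hypothetical names; a running sum would make it O(1) per pixel regardless of radius):

```cpp
#include <cstddef>
#include <vector>

// Naive 1D box blur pass with clamped edges. Applying this along rows and
// then along columns is equivalent to a (2r+1)x(2r+1) 2D box kernel.
// A real implementation would keep a running sum instead of re-summing.
std::vector<float> box_blur_1d(const std::vector<float>& in, int radius) {
    const int n = static_cast<int>(in.size());
    std::vector<float> out(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        int count = 0;
        // Average every in-bounds tap in [i - radius, i + radius].
        for (int k = i - radius; k <= i + radius; ++k) {
            if (k >= 0 && k < n) {
                sum += in[static_cast<std::size_t>(k)];
                ++count;
            }
        }
        out[static_cast<std::size_t>(i)] = sum / static_cast<float>(count);
    }
    return out;
}
```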
Cache locality, and specifically the vertical pass, was top of mind when trying to come up with good ways to vectorize. In the end (at least in my vector implementations) the difference between the passes wasn't too large. But most of them ended up having to do things like first convert the incoming row/col to its own float vector.
One main issue I never resolved: in the middle of the main loop, data has to be converted and written back to the source image, and the incoming pixels have to be converted and loaded in. Even when doing all rows or cols in bulk (which was somehow always faster than doing batches of 32/64), that seemed pretty brutal.
I also wondered whether it might be more efficient to rotate the entire image before and after the vertical pass, but in my implementations at least, there wasn't a huge difference in the pass timings.
Fixed! Sorry about that. The Rust repo stuff confused me!
My question is — was solving the edge bleed worth the (assumed) performance tradeoff? There were a couple vendor APIs that seemed to have smarter edge bleed options, but they performed worse. I never got around to actually visually comparing the images, though...
Thanks for all the great comments in your repo, they were quite helpful when trying to figure out how Stack Blur worked!
> My question is — was solving the edge bleed worth the (assumed) performance tradeoff?
Absolutely. Edge bleed is a distracting visual artifact! If you take a look at the video in my README, you'll see how bad it is. It's slightly less bad if you implement the sRGB transfer functions properly, but it's still bad.
As for the performance tradeoff - it's probably not as big as you think. However, I can say there is a huge performance tradeoff in using real division, and the only reason I can't use libdivide is the fact that my denominator changes. That's some 30% of the runtime - replacing it with a multiplication (as in libdivide) would probably shave at least 20% off the runtime of my algorithm.
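Roughly, the trick (as I understand libdivide and similar reciprocal methods) is to pay for one real division up front -- precomputing a fixed-point reciprocal -- and then replace every subsequent division with a multiply and shift. A rough sketch with a hypothetical `FastDiv` struct (compiler-specific `__int128`; divisor must be >= 2):

```cpp
#include <cstdint>

// Sketch of a multiply-by-precomputed-reciprocal division, in the spirit of
// libdivide: for a fixed divisor d in [2, 2^32), precompute m = ceil(2^64 / d)
// once; then floor(x / d) == (m * x) >> 64 for any 32-bit x.
// `FastDiv` is a hypothetical name; requires a compiler with unsigned __int128.
struct FastDiv {
    uint64_t m;
    explicit FastDiv(uint32_t d) : m(~uint64_t{0} / d + 1) {}  // ceil(2^64 / d), d >= 2
    uint32_t divide(uint32_t x) const {
        return static_cast<uint32_t>((static_cast<unsigned __int128>(m) * x) >> 64);
    }
};
```

The precompute is itself a real division, which is exactly why a denominator that changes constantly erases the win.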
> Thanks for all the great comments in your repo, they were quite helpful when trying to figure out how Stack Blur worked!
You're welcome! I will admit, I mainly wanted to prove that I actually came up with a better version of the algorithm, and that my library was therefore the best :)
> If you take a look at the video in my README, you'll see how bad it is.
Ahh, the video threw me off originally, I think because of the FPS glitches on the full width — but I see it now! Big bands of blue on the left and right edges. I'll have to cook up some examples to reproduce. Also curious what it means (if anything) in the drop shadow context.
> I can say there is a huge performance tradeoff in using real division
Makes sense. It would also throw a wrench in my vector implementations, where the expectation is to perform the same operation efficiently across groups of pixels.
> and that my library was therefore the best :)
Haha, well it was for me! I don't know Rust, but it helped me figure out my first C++ implementations!
It seems like it should still be a win to use libdivide or something like libdivide even if the divisor varies within a row - you get to reuse those divisors for each row!
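Something like this sketch: assuming clamped-window edge handling, the divisor for a given column depends only on its distance from the edge and the radius -- never on the row -- so the whole table (or a table of libdivide-style structs) can be built once per pass. Names here are hypothetical:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: with a clamped window, the divisor at column x is the
// number of taps that land inside the image. Precompute the table once and
// reuse it for every row of the horizontal pass.
std::vector<uint32_t> window_sizes(int width, int radius) {
    std::vector<uint32_t> div(static_cast<std::size_t>(width));
    for (int x = 0; x < width; ++x) {
        const int lo = std::max(0, x - radius);
        const int hi = std::min(width - 1, x + radius);
        div[static_cast<std::size_t>(x)] = static_cast<uint32_t>(hi - lo + 1);
    }
    return div;
}
```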
Thanks for the proofread! I had so much text juggling between this and the README that it was guaranteed some things would fall through the cracks! I updated the things you mentioned about `stackSum` and thanks for the catch on the `sumIn` definition.
> It couldn't provide this if it was moving every value in the queue.
I actually don't remember anymore what std::deque does under the hood. I did look into it, but the only thing I remember is that it was quite slow!
> You'd have to template the code on `radius` instead of passing it in as a runtime parameter so that the compiler could lower the divisions to bitshifts.
Yes, I really like this idea. Especially because radii only really vary between 1-48px for most drop shadow needs. It would be nice to have a handful of the common radii be ripping fast.
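A sketch of what that could look like, assuming the classic stack-blur divisor of (radius + 1)^2 -- once the radius is a template parameter, the divisor is a compile-time constant and the compiler strength-reduces the division on its own:

```cpp
#include <cstdint>

// Hypothetical sketch: with Radius known at compile time, the divisor is a
// constant, so the compiler lowers the division to multiplies/shifts -- no
// runtime div instruction, no libdivide needed.
template <int Radius>
uint32_t stack_average(uint32_t stackSum) {
    constexpr uint32_t divisor = (Radius + 1) * (Radius + 1);
    return stackSum / divisor;  // constant divisor: strength-reduced by the compiler
}
```

A runtime switch over the common radii (say 1-48) could then dispatch to the right instantiation.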
std::deque will never beat your hand-rolled ring buffer. It's a container, so it will spend time managing memory every time you walk off the end of a chunk. If it's implemented right*, it will hold onto a chunk which fell off, and will only allocate twice. If it's implemented wrong*, it will allocate after `chunksize` pushes and free after every `chunksize` pops (which might be handled properly by your allocator). Either way, it still needs to shuffle those chunks around -- unnecessary overhead, because your access pattern is a perfect fit for a ring: you use the value you're popping every time you push.
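A minimal sketch of such a ring (hypothetical names): one allocation up front, and pop + push collapse into a single exchange, since the popped value is always consumed at the moment a new one goes in.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size ring buffer where pop and push are one operation: the oldest
// value falls out of the slot that the newest value immediately reuses.
class Ring {
    std::vector<uint32_t> buf;
    std::size_t head = 0;  // slot holding the oldest value
public:
    explicit Ring(std::size_t n) : buf(n, 0) {}
    uint32_t exchange(uint32_t in) {
        const uint32_t out = buf[head];  // oldest value falls out...
        buf[head] = in;                  // ...and the newcomer takes its slot
        if (++head == buf.size()) head = 0;
        return out;
    }
};
```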
* for your purpose! Therein lies the challenge of writing standard libraries... choices must be made.
> std::deque will never beat your hand-rolled ring buffer.
std::deque may never beat a hand-rolled ring buffer of fixed size that maintains cursors, but it will beat a hand-rolled linked-list implementation.
The typical implementation will generally exhibit better cache locality than a linked list, and will outperform it for random access as required by the specification (amortized constant, vs linear).
(The typical implementation per cppreference, although I don't think this is formally required by the spec, is a sequence of fixed-length arrays.)
After some thought, I figured out a way to implement a chunked array of the sort specified by the C++ standard library, which does not require reallocation if the queue length stays below the chunk size. It involves storing cyclic offsets for each chunk.
So what I wrote isn't quite correct about needing to fiddle with memory regardless. But there is still overhead -- even if there's well-optimized code for the single chunk case, the container needs to check that condition.
Probably! The std::deque implementation was just for fun, as I miss working in higher-level languages :) Home-cooked circular buffers are usually how one sees real-time audio code written... I haven't had much hands-on experience with std::span, but seeing as it's a non-owning view, maybe it would be perfect?
I must confess I didn't read the whole article yet, but yeah, if you are looking for a view onto some data, spans are awesome and surprisingly easy to use