
The latency numbers they state seem achievable or beatable with Infiniband, Amazon's EFA, or TCPDirect. 2us round-trip is achievable for very simple systems. If this kind of networking sounds good to you, you can buy it today! It's even available on AWS, Azure and Oracle Cloud (but not GCP yet AFAIK).


Latency measurements are tricky; the usual benchmarks kind of suck and aren't predictive of actual performance in real systems under load.

Given that the entire Myrinet team went to work for Google, and the InfiniPath microarchitecture can be discovered by reading the device driver and some open source code, I'm pretty sure Google's team was well aware of what has been done in the recent past.


Thank you upvoters! I wonder why my other comment has so many downvotes, when it's just as relevant as this one.


This is a really cool example of tree diffing via path finding. I noticed that this was the approach I used when I did tree diffing, and sure enough looks like this was inspired by autochrome which was inspired by my post (https://thume.ca/2017/06/17/tree-diffing/).

I'm curious exactly why A* failed here. It worked great for me, as long as you design a good heuristic. I imagine it might have been complicated to design a good heuristic with an expanded move set. I see autochrome had to abandon A* and has an explanation of why, but that explanation shouldn't apply to difftastic I think.


I think (maybe I’m wrong) that your graph searches correspond to diffing single lists and you can have an expensive diagonal step to recurse into two sublists whereas the tool in this post has extra nodes for every token and extra edges for inserting/deleting delimiters. That seems to be the biggest difference to me and I guess is what you mean by it being complicated to design a good heuristic for the expanded move set. I agree it sounds complicated. I think that my guess was that bigger graphs would make things harder but that isn’t a reason for A* to fail.
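To make the graph-search framing concrete, here's a minimal sketch of the flat-list version: A* over the edit graph where nodes are (i, j) positions in the two token lists, deletes/inserts cost 1, matches are free, and the heuristic is the suffix length mismatch (an admissible lower bound on remaining edits). This is purely illustrative; the real difftastic/autochrome move sets with delimiters and recursion are much richer.

```python
import heapq

def astar_diff(a, b):
    """A* over the edit graph: nodes are (i, j) positions in the two
    token sequences; edges are delete (cost 1), insert (cost 1), and
    a free diagonal step when tokens match. Returns the minimum cost."""
    def h(i, j):
        # Admissible heuristic: the length mismatch of the remaining
        # suffixes is a lower bound on the inserts/deletes still needed.
        return abs((len(a) - i) - (len(b) - j))

    start = (0, 0)
    frontier = [(h(0, 0), 0, start)]  # (f = g + h, g, node)
    best = {start: 0}
    while frontier:
        f, g, (i, j) = heapq.heappop(frontier)
        if (i, j) == (len(a), len(b)):
            return g
        if g > best.get((i, j), float("inf")):
            continue  # stale heap entry
        moves = []
        if i < len(a) and j < len(b) and a[i] == b[j]:
            moves.append((0, (i + 1, j + 1)))  # free match
        if i < len(a):
            moves.append((1, (i + 1, j)))      # delete a[i]
        if j < len(b):
            moves.append((1, (i, j + 1)))      # insert b[j]
        for cost, nxt in moves:
            ng = g + cost
            if ng < best.get(nxt, float("inf")):
                best[nxt] = ng
                heapq.heappush(frontier, (ng + h(*nxt), ng, nxt))
    raise ValueError("no path found")  # unreachable for finite inputs
```

With a zero heuristic this degenerates to Dijkstra; the length-mismatch heuristic is what prunes the search on mostly-similar inputs.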


I really hope he can work with cloud vendors and Intel to make Processor Trace a more popular and easier to use capability.

It's unfortunate how https://github.com/janestreet/magic-trace and PMUs in general can't be used by lots of people using cloud VMs.


Yes, getting PMCs enabled in VMs was just the start, I think the next hardware capabilities to enable are:

  - PEBS (Precise/Processor event based sampling, so that we can accurately get instruction pointers on PMC events)
  - uncore PMCs (in a safe manner)
  - LBR (last branch record, to aid stack walking)
  - BTS (branch trace store, " ")
  - Processor trace (for cycle traces)
Processor trace may be the final boss. We've got through level 1, PMCs, now onto PEBS and beyond.


Can this be safely/efficiently virtualized? I love using these tools, but post-Spectre I could understand people being hesitant to expose more internal "state" (i.e. technically unique to a VM, but only one processor bug away from kaboom?).

Congrats on the job.


Thanks! We have to work through each capability carefully. Some won't be safe, and will be available on bare-metal instances only. That may be ok, as it fits with the following evolution of an application (this is something I did for some recent talks):

  1. FaaS
  2. Containers
  3. Lightweight VMs (e.g., Firecracker)
  4. Bare-metal instances
As (and if) an application grows, it migrates to platforms with greater performance and observability.

The ship has sailed on neighbor detection BTW. There's so many ways to know you're a VM with neighbors that disabling PMCs for that reason alone doesn't make sense.


> The ship has sailed on neighbor detection BTW.

In the crudest sense of "do I have a neighbour", sure. Of course, that's hardly secret -- if you're in EC2 you can just count your CPUs to figure that out.

But there's more questions you can ask:

1. Is my neighbour busy right now?

2. Is my neighbour a busy web server, a busy database, or a busy application server?

3. Is my neighbour hosting Brendan's website?

4. Is my neighbour hosting Brendan's website and he's logged in writing a blog post in vi right now?

5. What's Brendan writing right now?

It's not immediately clear which of these questions can be answered using certain capabilities! Few people would have guessed prior to 2005 that you could read text off someone's screen using hyperthreading, for example. (Pretty simple, although I don't know if anyone has published exploit code for it: just look at which cache lines are fetched when glyphs are rendered to the screen.)


Congrats man, it sounds like a dream job for you. It will be fun to follow your blog at your next job. Thanks again for sharing everything that you do, it is so incredibly humbling and such a great learning experience.


On AMD systems, many hardware performance counters are locked behind BIOS flags/configuration.

I admit that I don't know how Intel works, but disabling the use of these performance-counters at startup should be sufficient for any potential security problem.

I'd expect that only development boxes (maybe staging?) would be interested in performance counters anyway. Maybe the occasional development box could be setup for performance-sampling and collecting these counters, but not all production boxes need to be run with performance-counters on.


No I want these performance counters everywhere. Obviously I know they can be disabled but that doesn't really help.

I also really want them in CI but that might be a long way away.


Being able to collect performance data from production boxes is invaluable.


Yes, getting LBR data from production workloads is the whole ballgame for AutoFDO/SamplePGO and BOLT/Propeller. You cannot access the LBR on any EC2 machine short of a "metal" instance.


When it comes to PGO (vs. profiling the whole system) though it's worth noting that a lot of the speedup comes from things which are too trivial for us humans to consider.

When I profiled the D compiler with and without PGO enabled, it became obvious that a lot of the speedup of PGO basically comes just from running the program; the choice of test cases made almost no difference.


> not all production boxes need to be run with performance-counters on.

Production is exactly the place where you want full performance counter support, all the time, everywhere, on every machine.


Right. That's all good, but the important question is: what will your desk look like at Intel?[1]

1. Meta: https://twitter.com/brendangregg/status/1515482126871044098


One question: are you hiring?


Have you seen my Xi CRDT writeup from 2017 before? https://xi-editor.io/docs/crdt-details.html

It's a CRDT in Rust and it uses a lot of similar ideas. Raph and I had a plan for how to make it fast and memory efficient in very similar ways to your implementation. I think the piece I got working during my internship hits most of the memory efficiency goals like using a Rope and segment list representation. However we put off some of the speed optimizations you've done, like using a range tree instead of a Vec of ranges. I think it also uses a different style of algorithm without any parents.

We never finished optimizing and polishing it, so it's awesome that there's now an optimized text CRDT in Rust people can use!


Oooohhhh no I haven’t read that - thanks for the link! I feel embarrassed to say this but I knew about Xi editor years ago but I totally forgot to go read & learn about your crdt implementation when I was learning about Yjs and automerge and others. I’ll have a read.

And thanks for writing such an in depth article. It’s really valuable going forward. Maybe it’s addressed in your write up but are there any plans for that code, or has everyone moved on? I’d love to have a zoom chat about it and hear about your experiences at some point if you’d be willing.


Out of curiosity, what do you use to make those diagrams?


https://www.figma.com/ and putting a lot of effort into them


This is awesome. In theory you could absolutely minimize the latency penalty to just the overhead of the gpu1->memory->gpu2 copy, if the display sync signals from the display the passthrough window was on were passed through to the GPU driver on Windows, and that was combined with fullscreen compositor bypass (available on many Linux WMs) or low-latency compositing (available on sway and now mutter https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1762 on Wayland).


I really hope we get more technical information on how Lumen and Nanite work, and additionally that Epic doesn't patent the techniques in either of them. A patent on either would make me so sad, 20 years is really long in software, absent Epic's amazing work I expect we would have something else like it in like 3 years given what we've seen in things like http://dreams.mediamolecule.com/.


A lot of information has been released here: https://docs.unrealengine.com/5.0/en-US/RenderingFeatures/Na... and here https://docs.unrealengine.com/5.0/en-US/RenderingFeatures/Lu...

There is also a source code release if you want to dive into that level of detail: https://github.com/EpicGames/UnrealEngine/releases/tag/5.0.0...

I don't think the released details are that surprising to those working on realtime computer graphics, but the engineering details and tradeoffs are certainly interesting. Epic has the budget and business case to allocate a team, including some of the best graphics engineers in the industry, to do R&D for over a year to make this a reality.


So Nanite is just traditional LOD baking implemented in a holistic and automatic way?

The major difference seems to be they've done the work end to end to handle all the occlusion corner cases as well as a sophisticated mesh and texture streaming implementation that targets modern SSDs.


It's not traditional LOD baking. There's no LOD baking; the new rasterization system does all the work.

And most of us don't have fast SSDs like the PS5's, yet it works really well. The engineers also said it works just fine with slower HDDs, because they don't stream meshes on every camera movement; it's a continuous setup.


I’m not an expert on this, but there seems to be a custom GPU renderer optimized for dense triangle meshes, with its own occlusion pass. The LOD is also calculated per cluster, multiple per mesh, with a way to fix seams between clusters at different levels. This works best with very dense meshes such as those from photogrammetry or ZBrush sculpting.


Standard Fenwick trees can only do prefix sums, which only get you general range queries on things with a subtraction operator, not operations like maximum.

The Reddit comment I link contains an implementation that allegedly does arbitrary range queries, but it's nigh-incomprehensible, so I can't tell how or why it uses 3 arrays.
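For concreteness, a minimal single-array Fenwick tree sketch (names are mine): range sums fall out of prefix sums only because addition has an inverse, which is exactly what max lacks.

```python
class Fenwick:
    """Classic one-array Fenwick (binary indexed) tree over n values.
    Point updates and prefix sums in O(log n)."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)  # 1-indexed internally

    def add(self, i, delta):
        # Add delta to the value at index i (0-indexed externally).
        i += 1
        while i < len(self.tree):
            self.tree[i] += delta
            i += i & -i  # jump to the next covering node

    def prefix_sum(self, i):
        # Sum of values[0..i-1].
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i  # strip the lowest set bit
        return s

    def range_sum(self, lo, hi):
        # Works only because + has an inverse: sum(lo..hi-1) is
        # prefix(hi) - prefix(lo). You can't recover max(lo..hi-1)
        # from prefix maxima the same way.
        return self.prefix_sum(hi) - self.prefix_sum(lo)
```

Supporting max (or any non-invertible monoid) over arbitrary ranges is what pushes you toward a segment tree with ~2n nodes instead.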


I see, yeah I can't help you there either. I don't see how a tree based approach would ever need more than twice the amount of space.


Cool! I thought about using skip lists a bunch before I settled on this, trying to think of various ways to reduce complexity and memory usage. My best skip list designs still had some pointer overhead that the implicit approach avoids, but it was pretty small and they seemed reasonably simple. I briefly tried thinking of what an implicit skip list would be, but then just ended up thinking about implicit search trees.


Yah mipmaps are an N-dimensional generalization of the breadth first layout of implicit aggregation, where the aggregation function is averaging.

It may in theory be possible to generalize the in-order layout I talk about in a similar way, but I'm not sure it would be that useful, maybe it would allow you to append rows or columns to your mipmapped image more easily, but I don't know of any applications where that's useful.
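A toy 1-D version of that correspondence, just to illustrate (function name is mine, not from any real graphics API): each "mip level" is one breadth-first aggregation layer, with the mean as the combine function.

```python
def build_mip_levels(pixels):
    """Build successively halved levels of a 1-D 'image' by averaging
    adjacent pairs -- a 1-D mipmap, i.e. breadth-first implicit
    aggregation where the aggregation function is averaging.
    Assumes len(pixels) is a power of two."""
    levels = [list(pixels)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([(prev[i] + prev[i + 1]) / 2
                       for i in range(0, len(prev), 2)])
    return levels
```

Swapping the averaging for max or sum gives you the other aggregation trees from the post in the same layout.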


My suggestion would be to automatically add submissions of unique posts to the second chance pool or have a reviewer look at them when they're for a domain or user with a high hit rate but fall off new. I'm mostly thinking about technical blogs with consistent article quality like https://news.ycombinator.com/from?site=ciechanow.ski , https://news.ycombinator.com/from?site=raphlinus.github.io and https://news.ycombinator.com/from?site=scattered-thoughts.ne...

I'm biased on this though, as someone this might impact (https://news.ycombinator.com/from?site=thume.ca). From talking to other technical bloggers the consensus does seem to be that when we put a lot of effort into a technical article it nearly always makes it to the top of https://lobste.rs/ and /r/programming because it starts on the front page there but will sometimes flop off new on HN and maybe only make it months later if someone else resubmits the post.


Yes I agree. Like original content takes a lot of work to produce, and could get an extra chance by default. Whereas news articles, tweets, and content from large tech companies have their own promotional campaigns.

I'd rather have eclectic ideas and projects from HN users not be overlooked (thus encouraging more of such content), and am less worried about GAFAM announcements, CNBC/Axios/BBC news, or things already popular on Twitter/Reddit.

Would this be a doable change to try?


I'm all in favor of doing more to help obscure sites and having less major-media and $BigCo stuff, but there are limits. A site being obscure or having original content by no means implies that it is interesting in HN's sense. If you try to encode those criteria into software (and we've tried many times) the median-quality post comes nowhere close to clearing that bar, so you still need human curation, and that is basically the status quo. If you look at https://news.ycombinator.com/pool you should see a lot of such sites.

Also, a lot of those media and BigCo stories really are of interest to the community. We try to dampen the stuff that's repetitive, and most of those sites are downweighted by default, but HN would not get better if they were excluded. It's all just more complicated than it seems like it might be.

What ultimately matters is how interesting a story is, not what site it comes from. I'm suspicious of encoding proxies for that, because it would be easy to end up optimizing for the wrong things. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...


Yeah that's totally understandable. I'm not advocating for removing/demoting major media stuff or bumping up obscure sites, not even saying anything about the scoring algorithm should change.

Rather, I think obscure sites should get more opportunities to be organically upvoted (and if they don't get voted up, then fine) and not just fall off /new after a few hours, seen by only a few people. The BigCo stuff naturally gets posted often, as several (different) links from different people, whereas obscure stuff is only posted by a single person, once. So this is about evening the odds.


One idea here could be to have some set of guidelines for a domain like: is not commercial, is not promoting something, has had past HN front page discussions. Then those domains could just have a slightly different color in the new stream.


Maybe it would be better to weight the first 50 votes or so: if the site has rarely been submitted to HN, every 2 votes count as 3, or whatever variable weight works. The problem is that you can't give blanket +1 votes to submissions from less mainstream sites either, so initial traction might be harder to achieve anyway. I don't know if mods manually upvote some of the new content with this in mind, but yeah, in the end this second chance pool is pretty equivalent.


I'd love this too. I have a blog and have had a few HN front pagers (https://news.ycombinator.com/from?site=somehowmanage.com), but it's kind of a roll of the dice whether a particular submission makes it or not (sometimes a post will only make it on a resubmit, otherwise it gets lost in the stream).

Would be great if non-commercial blog domains that have produced good discussions on HN in the past had that somehow reflected in future submissions. Sure, not every piece we write will be worth a front page discussion, and obviously we don't want to recreate Digg, where some people got disproportionate power. Writers can put hours and hours of thought into a piece, and we'd probably be fine if no one thought it was interesting, but it's discouraging when it feels like a coin flip.

