Hacker News | ddelnano's comments

Wouldn't the Nsight Systems suite provide coverage here? Are the tricky cases difficult to debug with the standard CUDA tooling stack?


Yes, nsys is very helpful, especially when looking at perf issues. Bugs often present the way they do in this blog post, though: you just notice that training curves have regressed somehow. So even with good tooling it can be hard to figure out where to start looking in these very complex systems. It only gets worse when the symptoms show up only after running for a long time and at scale in a cluster.


Does anyone know how their KV cache sync mechanism compares to newer P2P communication layers like nixl, uccl p2p, etc.?

The authors mention that NCCL and Ray initialization were too slow (see quote below), but from the description it sounds like they’ve reimplemented a layer that’s increasingly being standardized by frameworks like nixl and uccl.

> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.


What kind of NCCL testing are you thinking about? Always curious what’s hardest to validate in people’s setups.


For kernel dev and eBPF, what kinds of resources or tutorials have you tried in the past? Have you ever tried building something small or contributing to an existing project?

Curious to hear what hurdles you ran into.


For the kernel, I have tried writing simple modules and that went okay. But when I go deep into the internals, things become really complicated, and my memories from my OS exams are rusty. For eBPF, I wrote a quite simple DNS visibility tool, and while I was okay with the logic, I struggled with writing the low-level C code that parses the network packets. Moreover, I found the documentation incomplete and really confusing. For example, I wanted to understand how the different queues work, and the only solution was to read the code. It seems this has improved since then, though; it's probably time to play with it again!
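For readers unfamiliar with the kind of packet parsing mentioned above, here is a minimal sketch of it. It is written in Python rather than the eBPF C the commenter describes, and it is not their tool. The function names are hypothetical, the input is assumed to start at the DNS payload (after the UDP header), and name compression pointers are deliberately not handled.

```python
import struct


def parse_dns_header(packet: bytes) -> dict:
    """Decode the fixed 12-byte DNS header (six big-endian 16-bit fields)."""
    if len(packet) < 12:
        raise ValueError("a DNS header needs at least 12 bytes")
    txid, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!HHHHHH", packet[:12]
    )
    return {
        "id": txid,
        "is_response": bool(flags & 0x8000),  # QR bit
        "opcode": (flags >> 11) & 0xF,
        "rcode": flags & 0xF,
        "qdcount": qdcount,
        "ancount": ancount,
        "nscount": nscount,
        "arcount": arcount,
    }


def parse_question_name(packet: bytes, offset: int = 12):
    """Read length-prefixed labels starting at `offset`.

    Returns (name, offset just past the terminating zero byte).
    Compression pointers are rejected to keep this sketch short.
    """
    labels = []
    while True:
        if offset >= len(packet):
            raise ValueError("truncated name")
        length = packet[offset]
        offset += 1
        if length == 0:
            break
        if length & 0xC0:
            raise ValueError("compression pointers are not handled here")
        if offset + length > len(packet):
            raise ValueError("label runs past end of packet")
        labels.append(packet[offset:offset + length].decode("ascii"))
        offset += length
    return ".".join(labels), offset
```

The bounds checks before every read are the part that tends to be painful in eBPF C, where the verifier insists on them.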


For Linux kernel dev, I found Linux Kernel Programming: A Comprehensive Guide to Kernel Internals to be a really helpful resource. For eBPF, the early chapters of Brendan Gregg’s BPF Performance Tools gave me the context I needed to get started.

From there, what’s helped me most is a cycle of reading new material, building prototypes and exploring how an open source system solves similar problems. I've definitely hit that wall as systems programming can get confusing fast.

I’ve also noticed that I sometimes get stuck trying to make something perfect before I’ve even started experimenting. Forcing myself to build the lowest-effort version of an idea has been surprisingly productive. Debugging things that don’t work is frustrating, but that failure often reveals insights I wouldn’t have discovered if I were overanalyzing.

You’ve probably seen some of these resources already, but I'm sharing them in case any of it’s useful. I work with eBPF full-time and had many similar challenges along the way, but I'd recommend jumping back in when you have the time.


Thank you! I already know of Brendan Gregg, but I have never read his book.

> I’ve also noticed that I sometimes get stuck trying to make something perfect before I’ve even started experimenting.

Exactly, this is something that I am struggling with too.


It covers version 4, but it explains differences with v5 as they come up.


Okay, thanks!


Even if you have experience with DWARF, I think you will learn something new from the book.

I work on CNCF Pixie, which uses DWARF for eBPF uprobes and dynamic instrumentation. While I understood how our project uses DWARF, the book made many details much clearer for me.


Also second that the book is a fantastic read. I work in the eBPF profiling space (on CNCF Pixie) and this has helped me understand DWARF concepts that I previously took at face value in our project's implementation.


The approach you describe above is common for similar projects:

- Pixie (https://px.dev) -- which I contribute to

- Beyla (https://github.com/grafana/beyla)

- Coroot (https://github.com/coroot/coroot)

If you are interested in the details and how the strategy for this tracing has evolved, you can learn more in this blog (https://blog.px.dev/ebpf-tls-tracing-past-present-future/).


Disclaimer: I'm a maintainer of the project

Pixie [1] is a similar project and offers the self hosted model you are looking for.

We also support 11 application protocols [2] with TLS handshake tracing and MQTT support coming soon (encrypted traffic tracing has been supported for a long time).

[1] https://px.dev

[2] https://docs.px.dev/reference/datatables/


From a dictionary: The meaning of DISCLAIMER is a denial or disavowal of legal claim : relinquishment of or formal refusal to accept an interest or estate.

Perhaps you meant DISCLOSURE


Disclaimer: I'm a maintainer of the project

Pixie (https://px.dev) can be installed in under 5 mins and gives this level of visibility across all applications. No need to change your application (wrap in `subtrace run`) to get instant visibility.

We also support 11 application protocols (https://docs.px.dev/reference/datatables/) with TLS handshake tracing and MQTT support coming soon (encrypted traffic tracing has been supported for a long time).


Definitely agree. As I've looked into instrumentation/tracing, it has helped me more fearlessly look at the kernel. Ftrace is another tool that's helped me level up as well (https://blog.px.dev/ebpf-probes-and-you/)


Yeah, ftrace is great, and so is the trace-cmd frontend. It's another Swiss-army-knife type of tool, like perf, and it's also available in older kernels.

One thing that I had missed was function call argument tracing, but it looks like ftrace will soon have it too (update: it's actually available on modern kernels already; look for the CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS kernel config option).

https://lwn.net/Articles/1003386/

https://lpc.events/event/11/contributions/1106/attachments/7...
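One way to look for that config option is sketched below. This is an illustration only: the helper names are hypothetical, and the config locations (/proc/config.gz and /boot/config-<release>) are common but not guaranteed to exist on every distribution.

```python
import gzip
import platform
from pathlib import Path

OPTION = "CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS"


def config_value(config_text: str, option: str):
    """Return the value assigned to `option` in kernel config text, or None.

    Lines such as '# CONFIG_FOO is not set' are treated as unset.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


def read_kernel_config():
    """Try the usual config locations; return the text, or None if none is readable."""
    candidates = [
        Path("/proc/config.gz"),                    # needs CONFIG_IKCONFIG_PROC
        Path("/boot/config-" + platform.release()),  # common on many distros
    ]
    for path in candidates:
        try:
            if path.suffix == ".gz":
                with gzip.open(path, "rt") as handle:
                    return handle.read()
            return path.read_text()
        except OSError:
            continue
    return None
```

Calling `config_value(read_kernel_config() or "", OPTION)` returns `"y"` when the running kernel's config sets the option, and `None` when it is unset or no config file could be read.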

