I know these problems are easily avoidable... but I'm finally starting to see th...

chubot · 2025-06-21T12:58:33 1750510713

It would nice if it were true :-)

But it’s not, for reasons that have more to do with the languages themselves, than parsing

e.g. C++ numbers are different than Java numbers are different than Python numbers are different than JavaScript numbers

ditto for strings

CactusRocket · 2025-06-21T16:51:43 1750524703

I imagine that one of the points of a solid protocol buffers library would be to align the types even across programming languages. E.g. explicitly force a 64-bit integer rather than "int" relying on the platform. And to have some custom "string" type which is always UTF-8 encoded in memory rather than depending on the platform-specific encoding.

(I have no idea if that is the case with protobuf, I don't have enough experience with it.)

chubot · 2025-06-21T17:48:46 1750528126

Why would you "imagine" that?

Again, the problem has more to do with the programming languages themselves, rather than with protobufs or parsing.

Protobuf has both signed and unsigned integers - the initial use case was C++ <-> C++ communication

Java doesn't have unsigned integers

Python has arbitrary precision integers

JavaScript traditionally only had doubles, which means it can represent integers up to 53 bit exactly. It has since added arbitrary size integers -- but that doesn't mean that the protobuf libraries actually use them

---

These aren't the only possibilities -- every language is fundamentally different

OCaml has 31- or 63-bit integers IIRC

https://protobuf.dev/programming-guides/encoding/#int-types

And again, strings also differ between all these languages -- there are three main choices, which are basically 8-bit, 16-bit, or 32-bit code units

Go and Rust favor 8-bit units; Java and JavaScript favor 16-bit units; and Python/C/C++ favors 32-bit units (which are code points)

CactusRocket · 2025-06-21T18:04:44 1750529084

I think I didn't explain myself well.

As long as a language has bytes and arrays, you can implement anything on top of them, like unsigned integers, 8-bit strings, UTF-8 strings, UCS-2, whatever you want. Sure it won't be native types, so it will probably be slower and could have an awkward memory layout, but it's possible

Granted, if a language is so gimped that it doesn't even have integers (as you mentioned JavaScript), then that language will not be able to fully support it indeed.

chubot · 2025-06-21T20:44:07 1750538647

Unfortunately that doesn't solve the problem -- it only pushes it around

I recommend writing a protobuf generator for your favorite language. The less it looks like C++, the more hard decisions you'll have to make

If you try your approach, you'll feel the "tax" when interacting with idiomatic code, and then likely make the opposite decision

---

Re: "so gimped" --> this tends to be what protobuf API design discussion are like. Users of certain languages can't imagine the viewpoints of users of other languages

e.g. is unsigned vs. signed the way the world is? Or an implementation detail.

And it's a problem to be MORE expressive than C/C++ -- i.e. from idiomatic Python code, the protobuf data model also causes a problem

Even within C/C++, there is more than one dialect -- C++ 03 versus C++ 11 with smart pointers (and probably more in the future). These styles correspond to the protobuf v1 and protobuf v2 APIs

(I used both protobuf v1 and protobuf v2 for many years, and did a design review for the protobuf v3 Python API)

In other words, protobufs aren't magic; they're another form of parsing, combined with code generation, which solve some technical problems, and not others. They also don't resolve arguments about parsing and serialization!

m3047 · 2025-06-22T16:22:45 1750609365

> you're guaranteed a consistent ser/de experience

Are there that many implementations of protobuf? How many just wrap the C lib and proto compiler? Consistency can be caused by an underlying monoculture, although that's turtles all the way down because protobuf is not YAML is not JSON, etc.

Off in the weeds already, and all because I implemented a pure Python deserializer / dissector simply because there wasn't one.

liampulles · 2025-06-21T13:41:25 1750513285

I think you can get similar benefits here from writing an RPC style JSON API into an OpenAPI spec and generating structs and route handlers from that. That's what I do for most of my Go projects anyway.

nottorp · 2025-06-22T07:55:23 1750578923

But since the article isn't really about parser bugs, I don't think using a different data format will save you from most of the problems described there.