Teleforking a Process onto a Different Computer (thume.ca)
218 points by kaladin-jasnah on May 31, 2022 | 76 comments


Congratulations! You have just reinvented the core idea of UCLA's LOCUS distributed computing project from 1979. https://en.wikipedia.org/wiki/LOCUS

Reinventing LOCUS also has a strong heritage. Bell Labs' Plan 9, for example, did so in part in the late 1980s.

While never a breakout commercial success, tele-forking and its slightly more advanced cousins, machine-to-machine process migration and cluster-wide process pools, intrigued some of the best minds in distributed computing for 20+ years.

Unfortunately "it's complicated" to implement well, especially when you try to tele-spawn and manage resources beyond compute cycles (network connections, files, file handles, ...) that are important to scale up the idea.


Berkeley Sprite, ~1991: https://www2.eecs.berkeley.edu/Research/Projects/CS/sprite/s...

Everything that wasn't immediately successful is tried again in 15 years. This is why the old farts on your project are grumpy about 'new' things that... aren't.


Yes, this is exactly what happens again and again. I remember when mSQL (and later MySQL) + PHP was all the rage and everybody was writing CMSes with it (this could have been the late 90s). People routinely complained about efficiency, and it occurred to me: since these are mostly reads, and writes are quite rare, why not regenerate the necessary elements only on write, rather than on every page visit? I created a simple proof of concept with multiple themes and it worked perfectly. Fast forward 15 years, and static site generators were all the rage. I believe they will resurge again around 2030.


My job satisfaction dropped like a rock the day I realized we're still basically implementing CMSes over, and over, and over again.

I think the 'all apps will evolve until they can send email' axiom is the wrong one. Everything turns into content management.

And I don't think you'll have to wait until 2030. I suspect 2027 ±2 years.


I took that approach with a high-traffic site about 10 years ago. The property was mostly write-once content from official staff writers, with social engagement: comments, follows, light user-generated content, etc. It was built originally with Ruby on Rails, and over time it accreted lots of layers of caching. Eventually I moved to a system of partial caching: partials were stored on the file system, and nginx would use SSI to reassemble them into complete pages. All dynamic content was loaded secondarily via JavaScript with mustache templates and lots of in-memory JSON caches. It would even rsync the file system across a handful of hosts, which turned out to be a pain in the butt - distributed file systems were probably less mature then than they are now, so I never ended up implementing one, though I did a lot of research.
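For a flavor of the SSI trick, a minimal sketch (the paths and upstream name here are hypothetical, not from my actual setup):

    # Stitch cached partials into full pages at request time.
    location / {
        ssi on;                        # expand <!--# include ... --> directives
        root /var/cache/site;          # page shells + partials cached on disk
        try_files $uri $uri/index.html @app;
    }
    location @app {
        proxy_pass http://rails_backend;  # fall back to the app on a cache miss
    }

Each cached page shell then contains directives like <!--# include file="/partials/comments-123.html" --> that nginx expands on the way out.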

Funny, I think probably all of the above would be trivial now on AWS or Cloudflare.


Also: Haskell's Control.Distributed.Fork

> This module provides a common interface for offloading an IO action to remote executors.

> It uses StaticPointers language extension and distributed-closure library for serializing closures to run remotely. This blog post[1] is a good introduction for those.

> In short, if you need a Closure a:

> One important constraint when using this library is that it assumes the remote environment is capable of executing the exact same binary. In most cases, this requires your host environment to be Linux.

https://hackage.haskell.org/package/distributed-fork-0.0.1.3...

[1] https://blog.ocharles.org.uk/blog/guest-posts/2014-12-23-sta...
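The package's hello world gives the flavor (lightly adapted from the Hackage docs; localProcessBackend just runs the closure on the same machine, and cloud backends can be swapped in):

    {-# LANGUAGE StaticPointers #-}
    import Control.Distributed.Fork
    import Control.Distributed.Fork.LocalProcessBackend

    main :: IO ()
    main = do
      initDistributedFork
      -- 'static' builds the serializable closure; the backend decides where it runs
      handle <- fork localProcessBackend (static Dict) (static (return "Hello World!"))
      await handle >>= putStrLn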


Unless your closure closed over the entire operating system.


With a pure language like Haskell, and the now-standard "functional core, imperative shell" approach, it's rather easy to limit the extent of external dependencies of a meaningful computation.


...and many, many more distributed systems. It's basically what PVM was all about.

These days most of the work for this is handled by the CRIU project (https://criu.org/), so the "hard" work has been done.
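For instance, the basic flow looks like this (the PID and directory are illustrative; both machines need compatible kernels, and external resources have to be reestablished):

    # checkpoint a process tree into an image directory, leaving it running
    criu dump -t 1234 --images-dir /tmp/ckpt --shell-job --leave-running
    # ship /tmp/ckpt to another machine, then resurrect the tree there
    criu restore --images-dir /tmp/ckpt --shell-job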


> Bell Labs' Plan 9, for example, did so in part in the late 1980s.

Plan 9 did not support forking to remote machines and was not a single-system-image system. Its support for running a program on a remote machine was implemented entirely in userspace.

If you say Plan 9 was an example of teleforking, then so is ssh. That might be a reasonable categorization, but both Plan 9 and ssh are very different from LOCUS and other SSI systems.


> Unfortunately "it's complicated" to implement well, especially when you try to tele-spawn and manage resources beyond compute cycles (network connections, files, file handles, ...)

Aren't all of these resources namespaced/containerized in modern Linux? That should make it feasible to checkpoint and restore them on the same machine (via, e.g., the CRIU patchset), and true location independence is not that much harder. One of the hardest parts (not even implemented in Plan 9, AFAICT) is distributed shared memory (sharing a single virtual address space across cluster nodes), but even that AIUI has some research-level implementations.


You still have to migrate the kernel state of underlying resources, so containers don't really buy you very much. As you say, in checkpoint/restart you start with the big stuff like TCP sockets, but after a while you get tired of tracking down little flecks of state.

Distributed shared memory, though, does deserve a new look now that baseline network speeds are 100x what the max was when it was first investigated. Unfortunately, temporal consistency is still going to be a major factor: some workloads will run great with some heuristics, and some won't. You'll almost certainly need to migrate threads along with pages to try to keep them running co-located without exhausting per-node resources.


> You still have to migrate the kernel state of underlying resources

DragonflyBSD has been experimenting with this concept for several years now. At the moment it's mostly only useful for snapshotting the kernel for debugging purposes. But there's no reason why it couldn't be extended to transport the state of the computer to another computer and resume execution!

https://www.dragonflybsd.org/docs/handbook/vkernel/


https://criu.org/ indicates it supports migration already, at least at the container level.


There are elements of this also represented in GNU Parallel, a really nifty utility that lets you create a fork-join pool out of one or many systems. IMO it is severely underused these days.


Hey, if they just came up with the question and struggled at the answer, that's half the battle. It's not all just solving open problems; it's also opening new problems. There was a sociology paper that found 100% of the great scientists it reviewed said questions were more important than answers.

(Then the paper went on to make some other point with a more impressive higher margin of error, p-hacking upward, 0% error was too low.)


I would imagine the process of pausing virtual machines and then loading them on other machines is something similar.


So if I don't want to reinvent the wheel, what would I use nowadays to approach this problem?


In the late 90s I attended a tour at Holland Signaal, a Dutch defense company producing radar and anti-missile systems.

I remember vividly how they demonstrated an unbreakable process. They had a computer running a process and no matter what happened to that computer, the next one would flawlessly continue the process down to the cycle, with no change or corruption or skipping a beat.

It may very well be that this is actually not very difficult, but it seemed difficult and impressive.

Perhaps more shocking were the ultra-high-resolution radar screens, some 3 generations ahead of anything I had seen in the consumer space, showing an incredible live visualization of the airspace: exactly which plane is where, the model/type, age, fuel on board, hostile/friendly, all of it.

They even had a "situation room" with a holodeck chair in the middle, full of controls. The entire room was covered in wall-size screens basically showing the airspace of the entire country being analyzed live.

Sounds very 2022, not 1998.


Of all of the amazing things I've learned/seen in IT over the years, the one with the greatest "jaw to floor" effect was when I first learned of VMware's Fault Tolerance feature (shortly after I learned about VMs and live VM migration, the previous "jaw to floor" winner). Fault Tolerance allows for a very similar cycle-accurate continuation concept, but trades away the specialized synchronous design and near-instant failover (accepting a ~1 second hitch) to gain the ability to do this with a non-specialized VM and software stack. For even more fun: if you have more than 2 hosts, when 1 is lost it'll automatically rebuild a fault-tolerant state with another host.


QEMU can do that too nowadays with COLO (COarse-grained LOck-stepping) [1]. It's quite fun to randomly yank a node, wait for resync, repeat, and watch the VM stay alive the whole time. :)

[1] https://wiki.qemu.org/Features/COLO


Is there any cloud provider that offers such VMs? It seems much simpler to fail over like that instead of implementing a userland procedure.


You can run VMware Cloud in AWS to get this, but it's not as useful as it sounds: you can't have the pair of fault tolerant instances across regions or even AZs. You still need another plan for recovering from region or AZ failures, and if you need that, why not just also use that plan for intra-AZ failures instead of shelling out for VMware?


The COLO effort is driven mainly by Intel China and they are currently deploying it at some Chinese cloud provider AFAIK.

Unfortunately, as cool as VMWare FT and QEMU COLO are, you need high-bandwidth, low-latency links between the nodes, and they still increase latency and reduce performance a lot. And in 99.99% of cases, just replicating the disk and restarting the VMs on node failure is good enough.


That's so cool, I'm only a casual VMWare user and had no idea.

With things like this possible, and possible for a long time now, it honestly makes me cringe at how absolutely terrible (slow, inefficient, unreliable, complex, not robust) the typical modern software stack is.


That's not true! The typical software stack is optimized as needs arise.

You wouldn't want to write ultra-fast software for your idea to deliver cats back from the vet with a network of teenagers on bicycles. Imagine spending years writing your website in C++/ASM only to fail in the first week.

What's embarrassing is how much software cost to build in the '80s.


I came here to say the same thing: "telefork" is just a much more limited version of VMware's VMotion. FT is, IIRC, just an extension of the same technique: you're basically continuously monitoring and copying state over to the other host instead of shutting down the original VM after you're done copying. Essentially, ESX is an OS that can already do a full-on telefork, with its "process" being a VM. You can practically get to that state if your VM is made super lightweight.

(Source: I used to work at VMWare and one of the implementers of FT was a cycling buddy.)



Both MOSIX and openMOSIX supported fork()ing to another node on the network. https://en.m.wikipedia.org/wiki/MOSIX


The old 'dsh' (distributed shell) command in DEC Ultrix gave you the ability to do something similar from the shell - kind of rsh on steroids. (FWIW, Ultrix remains my overall favorite Unix - it was very well thought-out, and just thought-about, period. It's a shame they abandoned it for OSF/1 on Alpha, which never really worked - well, the DEC Alpha hardware worked, but OSF/1 and the compilers never did...)

Way back in the early 90s, inspired by this and a few other things, a coworker of mine (one of only three true geniuses I've ever met) came up with an even more powerful way (called vArray) to spread processes across any arbitrary number of POSIX machines. One of the things vArray could do was make remote memory look local (if painfully slow). He was the company's Cray expert, and developed it for the major oil company we worked for to run jobs too large to fit into the Cray's memory. Since some seismic deconvolution vectorizes very well, with fairly infrequent memory access, it worked better than you might expect, even over the slow networks of the day. Obviously, it hammered the network, but it was quite a sight to see every Sun, DEC, IBM, and HP Unix machine on the company's network light up at once, all coordinating to process jobs that were probably bigger than anything run outside the Black world of the Spooks... Fun days.


OpenMOSIX is something I miss so much. I watched people build entire functional clusters out of off-the-shelf hardware, on a very tight budget and in just a few hours (most of the time was just unpacking things), using network boot and (IIRC) ClusterKnoppix. It was just "fork and forget" and boom, your job was now running on multiple processors.

I'd like to do that with my spare ARM SBC's.


I think multicore CPUs did a lot to kill the momentum of OpenMOSIX. I remember playing around with clusterKNOPPIX back in the day and being quite impressed by it, but it seems that multiple cores and threads per core became the norm and the cores themselves got a lot faster.

On the other hand, combining multiple 64-core Threadrippers into a single-system-image cluster has a certain appeal to it.


OpenMOSIX was the coolest project.


Doesn't Erlang support these ideas of distributed computing? And if I recall correctly, Clipper supported remote execution of objects, or sharing object code in a distributed fashion.


Yep. spawn is the basic primitive: https://www.erlang.org/doc/man/erlang.html#spawn-4
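For example, assuming an already-connected node (the node name is illustrative):

    %% spawn/4 runs {Module, Function, Args} on the named node;
    %% the module must be loadable there.
    Pid = spawn('worker@otherhost', io, format, ["hello from a remote node~n"]).

(And since the spawned process inherits the caller's group leader, the output is routed back to the node that called spawn.)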



> About a year later I had to write a paper. One of the disadvantages of being a researcher is that in order to get money you have to write a paper about something or other; the paper is never about what currently interests you at the moment, but must be about what the project that financed your research expects to read about.

> Well, I had my gossip network set up on PlanetLab and I could tell it to become anything, so I told it to become a content distribution network, used a gossip algorithm to make copies of the same file on all machines on the network, and wrote a paper about it, and everybody was happy.

I miss Joe, not that I ever met him, but his attitude and good humour are inspiring.


He was just as good of a man in real life.


Wow, thanks for posting this! I've never used Erlang before, but the "become" functionality and the message-passing style is similar enough to another actor-model system I'm familiar with (Spritely Goblins) that I was inspired to write a version of this program for it:

https://git.sr.ht/~jfred/goblins-test/tree/master/item/unive...


For what it's worth, at least for HPC-ish distributed computing, this sort of thing turns out not to be terribly worthwhile. We have a standard for distribution of computation, shared memory, I/O, and process starting in MPI (and, for instance, DMTCP to migrate the distributed application if necessary, though I think DMTCP needs a release).

I don't know what its current status is, but the HPC-ish Bproc system has/had an rfork [1]. Probably the most HPC-oriented SSI system, Kerrighed died, as did the Plan 9-ish xcpu, though that was a bit different.

1. https://www.penguinsolutions.com/computing/documentation/scy...
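Unlike fork(), MPI's dynamic process API launches a fresh binary rather than copying state, which is exactly the contrast with telefork. A minimal sketch (error handling omitted; "./worker" is a placeholder program):

    /* parent.c: start 4 copies of ./worker somewhere in the cluster */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Comm children;
        MPI_Init(&argc, &argv);
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
        /* parent and workers can now talk over the intercommunicator */
        MPI_Finalize();
        return 0;
    }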


The biggest benefit is arguably that codes designed for "telefork" (and perhaps remote threads) can also be scaled down to a single shared-memory machine and run way more efficiently than if they had been coded using the MPI approach, while adding hardly any overhead when running on a cluster, assuming the codes are designed properly.


Just doing a fork may be sufficient for something embarrassingly parallel, but interesting things are tightly coupled. Obviously MPI scales down to a single node (a distributed system anyway, these days), typically as real forked processes, but possibly with all the ranks in a single process with an appropriate implementation.

Citation needed, as they say, for "run way more efficiently", particularly as the conventional wisdom favors shared memory in a single process (e.g., OpenMP).

“Acknowledgements: ... NUMA and Amdahl’s Law, for holding OpenMP back and keeping MPI-only competitive in spite of the ridiculous cost of Send-Recv within a shared-memory domain.” — Jeff Hammond, ‘MPI+MPI’


I would kill for "vmotion" on our HPC cluster. Tired of draining a node then pestering the remaining users to log their jobs out.


Off-topic, but google-fu is failing me: wasn't there a similar project posted here a while back, that instead shipped a binary + a virtual file system that mirrored all local resources to a remote host so that the entire process could run there? IIRC the author demonstrated it running an ffmpeg transcode?

edit: Got it, after a while: https://github.com/Overv/outrun


This is cool. When it talks about packaging, I was reminded of statifier, which would take a dynamically linked ELF executable and package everything up into a single binary. I last used it in the early 2000s, but it looks like it had an update as recently as 2016.

https://sourceforge.net/projects/statifier/


(2020) and discussed at the time here: https://news.ycombinator.com/item?id=22987747


I've been playing with adjacent ideas in WebAssembly. The idea is that you can write a function that sends and receives messages of arbitrary Rust types, compile it into WebAssembly, and then snapshot and restore it (including across space and time) as you like.

It's very experimental, but I got as far as being able to freeze a JavaScript interpreter (compiled to WebAssembly) and restore it later. The code is here: https://github.com/drifting-in-space/wasmbox


Me too! The entire heap of the Wasm env is up for inspection and mutation. If your data structures are stable, then they can be moved across envs, or use accessors from both sides. Fun stuff.


Yeah! I think it’s an under-appreciated aspect of WebAssembly.


The Cloud Haskell project and language is likely the only one to get this right, thanks to strictly enforced purity. Absent global mutable state, it's much simpler to understand whether it's safe and possible to serialize a closure and run it somewhere else (fork(2) being a closure by another name).

In almost all other languages there's just no way to know if a closure is holding on to a file descriptor.

Critics may say the Haskell closures could contain `unsafePerformIO`, but as the saying goes: now you have two problems.


Is Cloud [sigh] Haskell still alive?

For comparison, two old distributed lexical scope systems were Cardelli's Obliq and Kelsey's(?) Kali Scheme. From what I remember, not like remote forking, though.


I don't think so, sadly.


Isn't fork more of a continuation?

/pedantic


I can't state it with absolute certainty, but I seem to recall OpenVMS clusters back in the '90s transparently supporting "portable" processes across cluster nodes.


That used to be in some UNIX variants, such as UCLA Locus and the IBM derivatives of that. But it never got to be a Linux thing.


Was VMS capable of achieving this as well?


If I am understanding the article correctly, BEAM does this for Elixir and Erlang (and other BEAM languages) out of the box.


No, it does not. Not _this_.

Erlang has, however, excellent support for distributed computing using its own kind of processes.


Termite Scheme did something like this for Scheme programs. It combines Erlang-style concurrency with process migration.

http://www.iro.umontreal.ca/~feeley/papers/GermainFeeleyMonn...


You could do what CRIU does, but leave the original process running, and arrange to have a composite {instance ID, PID} as the return value (different on both sides).

(Trivia: in Unix V6, fork() returned both PIDs, and the C library stub arranged to return 0 to the child and the child's PID to the parent.)
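For reference, the convention that stub establishes, with a comment marking where the composite return value proposed above would differ:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        pid_t pid = fork();   /* one call, two returns */
        if (pid == 0)
            printf("child: fork() returned 0\n");
        else
            printf("parent: fork() returned child pid %d\n", (int) pid);
        /* a telefork() as proposed above would instead return an
           {instance_id, pid} pair, different on each side */
        return 0;
    }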


This is fun. Reminded me of mincemeat (https://github.com/ziyuang/mincemeatpy), which actually serializes the mapping functions and sends them to workers.


I don't think anyone has mentioned Sprite https://en.m.wikipedia.org/wiki/Sprite_(operating_system)

Sprite called this "process migration".


Lambda/Cloud Functions are starting to converge on this idea. It will eventually get streamlined and ergonomic enough that it appears you're executing an expensive or async function locally, except you aren't.


"From laptop to lambda"[0] from a couple of years ago is an interesting step in this direction.

[0] https://dl.acm.org/doi/10.5555/3358807.3358848


This is a fun hack. It would be interesting to look at some real-world workloads and compare whether this "init once, ship the initialized memory image everywhere" style is faster than just initializing everywhere.


This is really cool. Distributed Unix could do this, IIRC.


I also had this idea but I called it networkfork.


Isn't this exactly what VM migration in the cloud is?


Yes.

All live migration systems basically follow the same pattern: https://cloud.google.com/compute/docs/instances/live-migrati...

And as others have mentioned upthread, Mosix was an extension to Linux that also implemented fork() in such a way that the child could be local or remote with file handles retained across the cluster. We had a Linux lan party when I was in college and managed to scrape together a Mosix cluster across a bunch of machines.

Ignore the uncreative pedants.


No. VM migration moves entire virtual computers. Forking makes a copy of a process with the current state; this moves that single duplicated process to a different machine.


Is Star Trek's Transporter Actually a Murder Machine? https://www.youtube.com/watch?v=f-8zEkIaB0c


A virtual computer is a bunch of processes.


And a kernel. And drivers. And devices. And busses. And interrupts.


The word "entire" is in there.


This sounds like Kafka, but more low-level.


Not even close



