Not meaning to bring up a Rust vs Go debate, but since there are a few comments claiming that this is a waste of time, I figured it's worth mentioning the Rust-based re-implementation of coreutils:
Nice. This is the first time I've actually read Rust code, and aside from "something macro" that scares me, it's actually mostly readable. Will definitely check it out.
That rewrite lists its license as MIT, not GPLv2. Is that legitimate? E.g., in general, can you re-implement a library that's under one license and release the re-implementation under a non-compatible license?
From everything I've read I'm pretty sure that if it's a complete rewrite in another language then it's legally "divorced" from the original project. You can own (and ergo license) an implementation of an idea, but you can't own or license the idea itself. Same with patents - you own the how, not the why.
I don't know much about Go, but taking a quick look at the implementations, they seem to be written by a programming novice, and they are quite primitive. I don't mean to be negative; that's just my opinion.
Tail also doesn't always need to read its entire input. When given a file path, it can read backwards from the end of the file until it has found its n lines. That makes tail fast on huge files (as long as they have normal line lengths), but it also complicates the code.
And you may want to mmap the file rather than read it. Whether that speeds things up depends on the OS, OS version, file size, file system, available memory, phase of the moon, etc.
You mean bufio.ReadLine? As the documentation says, it's low-level, since you have to do the buffer handling yourself. ReadBytes/ReadString are nicer interface-wise, but they allocate new buffers on every call.
I like Scanner because it provides a nice high-level interface, but still maintains an internal buffer, reducing GC pressure. Of course, it's not one size fits all.
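For what it's worth, a minimal sketch of the Scanner approach looks something like this (the file name is just a placeholder):

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open("input.txt") // placeholder path
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Scanner reuses an internal buffer across calls, so this stays
        // allocation-friendly; lines longer than the default 64 KiB token
        // limit would need scanner.Buffer() to be handled.
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            fmt.Println(scanner.Text()) // current line, newline stripped
        }
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }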
I tested it on a several-megabyte text file (not that large) and I can see a huge improvement in speed when I use a reader vs. loading the whole thing into memory as I did at first.
I can see now how much of a difference it can make on a really large file, like in the gigabyte range.
BTW, I deleted my earlier comment because the problem I had wasn't anything to do with bufio. I had just made an obvious mistake elsewhere in my code, which I've fixed now.
Great going. This will also help other novice Go programmers learning the language, while at the same time giving them a sense of how to implement their own Unix commands.
GNU started as GNU's Not Unix, reimplementing Unix userland for free. The people who are adamant about calling it GNU/Linux are, on some level, remembering that the userspace is historically a reimplementation.
Give me a busybox that I can 'go build' and that becomes really quite interesting.
Except for the fact that it'll be years before all of the subtle bugs are worked out and you can rely on those apps to be as stable as the ones we've got:
We have years. They're passing whether we like it or not. In the meantime you can choose whichever implementation of the Unix tools that you want. In the future there may be more options that fit the bill, developed in a safe language without the kinds of buffer overflows and things that exist in C tools today.
Yeah, just because Joel said so doesn't mean it's true.
I've been responsible for a bunch of rewrites that were resounding successes and typically reduced the bug load (subtle or not) by orders of magnitude.
Busybox is GPLv2. That's the same license as the Linux kernel; it's not a problem. You just have to release your modified version of the busybox source if you distribute it in a product. The only difference is that the busybox developers enforce the license against the many companies who can't be bothered to comply and who respond to requests with lawyers.
That said, there's a BSD-licensed re-implementation of busybox (still in C) that's pretty far along and has gotten support from some of those companies (including Sony): "toybox", http://www.landley.net/toybox/
1. It has a fairly good test suite that rewrites should leverage. That can easily be done by setting $PATH to prepend the dir of the new tools and running `make check`.
2. To give an indication of the size of coreutils:
$ for r in gnulib coreutils; do (cd $r && git ls-files | tr '\n' '\0' | wc -l --files0-from=- | tail -n1); done
It's also worth pointing out, since busybox was mentioned a few times, that the latest release of coreutils has the ./configure --enable-single-binary option to build it as a multi-call binary like busybox, etc.
Heh, this brings me back. I was a young guy at Sun, Perl 4 was a thing, and I actually argued that we should redo /usr/bin in Perl. In the days of a 20MHz SPARC.
There's "Perl Power Tools: Unix Reconstruction Project" [0], which doesn't seem to have activity since 2004. I remember something older than 2004, I think, back from when perl was first available on Windows, to bring UNIX command line utilities to Windows through perl.
It's great that they all have inline POD documentation too.
I'm curious: what was your main motivation? I can understand it as a worthy challenge, but it would probably have led to worse performance than the C-based utilities, no?
It would be pretty cool if all applications on your system were written as scripts, especially if they are very simple scripts, as it means you can open any of them in a text editor and see what they do.
It means you can modify any of them just as easily.
I remember one time when I needed to know how much CPU a program was using; I could see it in the output from `top`. So to figure out how `top` was getting that info, I had to download the source code and grep through several .c files. If I could just `vim top` and read the code, that would be very cool.
As an educational resource, being able to tell all your students "all binaries in /bin and /usr/bin are editable! Read them! Find out how they do what they do!" would be incredible.
Thanks for your reply. Yes, the ready availability of the code would be a plus. In my mind utilities like grep or sort are very performance-sensitive, but that's just an impression.
One could hope that many of these utils are mostly I/O-bound, and so a scripting language wouldn't change that much. I know at one point I actually had a Python-based md5sum that was faster than the regular GNU one for big files, as it loaded the data off the disk in huge chunks.
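For what it's worth, the same huge-chunks trick is easy to sketch in Go (the 4 MiB buffer size is just a guess at what counts as "huge"):

    package main

    import (
        "crypto/md5"
        "fmt"
        "io"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        h := md5.New()
        // Stream the file through a large buffer instead of reading it
        // line by line or loading it all into memory at once.
        buf := make([]byte, 4*1024*1024)
        if _, err := io.CopyBuffer(h, f, buf); err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%x  %s\n", h.Sum(nil), os.Args[1])
    }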
I believe ack-grep was written in Perl? And that's pretty fast.
Also, how fast does it actually need to be? It's not high-frequency trading or a 60fps game... A computer these days can probably load Perl, run a script, and display the output faster than a computer from 20 years ago could run the original C util. And if everything is written in it, then a lot will already be in memory. If your shell is written in it too, then you could just call the utils from within the shell, rather than having to fork/exec them anyway.
For most higher-level scripting languages (such as Perl, Python, Ruby, etc.), things like regexps and hash tables are written in C underneath anyway.
With a lot of shell scripting, you call many, many utils many times, which would slow everything down hugely if you paid the scripting language's startup cost each time. However, if you use the same higher-level scripting language to write all your scripts, rather than writing sh scripts that call your utilities, then you might even end up faster than sh calling external utils (possibly). (If that makes sense...)
At the time I had written a source management system (nselite, which morphed into teamware) almost entirely in Perl, except for one C program that was performance-sensitive.
Thanks for answering. I guess then what you had in mind was a change in the syntax of the shell itself, right? I'm a casual user of POSIX utilities and the typical shells, and have yet to try Perl! In any event, your last sentence pretty much validates what I had in mind. Thanks again.
While I'm not sure if he posts here, I used to work in the same group with Jim Meyering, who maintained/maintains coreutils. Great guy.
Anyway, he told some great stories about the complexities of POSIX, what happens in Solaris when you have directories 20,000 levels deep (and how to handle that efficiently), and the fun of teaching various coreutils commands about SELinux. Lots of it gets surprisingly low-level quickly.
Coreutils is complex for many good reasons, so while these tools look all nice and clever, they are not dealing with a lot of the same kinds of issues.
(I also recall some of the heck Ansible had to go through to deal with atomic moves, grokking when something was on an NFS partition, and so on. Not the same thing, but small, seemingly easy things get tricky quickly.)
The bottom line is: appreciate those tiny little Unix utilities; a lot went into them, even when you think they aren't doing a whole lot :)
Bloat always does have some reason, but it's still bloat.
Good engineering makes good tradeoffs, rather than throwing in everything anyone wants.
(Not that I think this effort in Go is "awesome". For one, it probably produces bigger executables than even a statically linked coreutils. The suckless sbase/ubase have some real potential, though.)
People are lazy, and we take legacy complexity for granted. Times change, and rarely does a problem look the same. If for no other reason, having a second (and third, and beyond) look at legacy complexity is worth the attention.
It would be cool to have a "busybox" alternative targeted at Plan9 commands. I personally find Plan9 utils much more logical (and easier to implement!). Something like 9base from suckless, but as a single binary and hopefully in a more modern language (most of the code like sam or rc is not so easy to understand).
Plan9 utils usually have less options and smaller functionality, but you can easily compose them.
"rc": I like it much more than sh for scripting. It's easier to learn and has less pitfalls.
"cat" has no options at all, it just concatenates.
"du": I like plan's "du" utility, which is easier than "find" ("du" simply lists file names recursively, while "find"... Can anyone list all the options
of the find utility from memory?).
"rm" is a combination of rm and rmdir (directory is removed only if empty, unless you do "rm -r" to delete it recursively. Only two options: "-r" and
"-f"
"tar" is a simple way to do recursive copying instead of cp, scp etc
"sleep" takes seconds only as an argument, but it can be a floating point (no suffixes like in coreutils, e.g. "sleep 1h" or "sleep 3d")
"who" takes no options at all
I'm not all that fluent with the Plan 9 utils, and I still use coreutils much more, but for embedded systems I would prefer to have a tiny number of simple building bricks like "du", "cp", "rm", "tar", etc., and a few smarter commands like awk/sed/grep/rc. I like toybox a lot, but if only it had an rc shell...
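Just to illustrate how small these can get, here's a rough Go sketch of the "sleep" described above (one argument, seconds, fractions allowed, no unit suffixes):

    package main

    import (
        "fmt"
        "os"
        "strconv"
        "time"
    )

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: sleep seconds")
            os.Exit(1)
        }
        // Seconds only, but fractions are allowed, e.g. "sleep 0.5".
        secs, err := strconv.ParseFloat(os.Args[1], 64)
        if err != nil || secs < 0 {
            fmt.Fprintln(os.Stderr, "sleep: bad argument:", os.Args[1])
            os.Exit(1)
        }
        time.Sleep(time.Duration(secs * float64(time.Second)))
    }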
I'm working on that as well. I just fixed cat to crash and log on error, and I'll be fixing the other commands soon.
By the way, I'm wondering how I should go through the file line by line with a reader.
I think the most efficient way may be to scan byte by byte from the start (head) or the end (tail), counting until reaching n newlines (or stopping at the end of the file), then print the bytes between the start/end and the nth newline.
Good question. I'm not sure. You might want to seek to the end and move back. You probably shouldn't do it byte-by-byte directly from the file since that's very inefficient. As you can tell, this is already starting to get complicated! Maybe you could try mmaping the file so you could treat it as a []byte.
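Something along these lines, for example (Unix-only; uses the raw syscall package and skips edge cases like zero-length files):

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "os"
        "syscall"
    )

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        fi, err := f.Stat()
        if err != nil {
            log.Fatal(err)
        }

        // Map the whole file read-only; data is now an ordinary []byte
        // backed by the page cache, so no explicit read loop is needed.
        data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
            syscall.PROT_READ, syscall.MAP_PRIVATE)
        if err != nil {
            log.Fatal(err)
        }
        defer syscall.Munmap(data)

        // For example, counting lines is just a byte scan over the mapping.
        fmt.Println(bytes.Count(data, []byte{'\n'}))
    }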
Last year I was trying to write a Go routine that read a file backwards. I was amazed how unexpectedly difficult that proved to be.
In the end I settled for reading it from the start which worked 99.999% of the time and enabled me to finish the project to the tight deadline I had. But I've always meant to go back and "fix" that code at some point.
The strategy used by the original GNU coreutils written in C, and the one I used to implement tail in Rust, is to jump to the end of the file, then rewind AVERAGE_CHARS_PER_LINE * NUMBER_OF_LINES_TO_BE_READ, check if enough lines have been read, and repeat until enough lines have been found.
I found the optimal value of AVERAGE_CHARS_PER_LINE to be around 40 characters, but of course it hugely depends on the file being read.
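For anyone curious, a rough Go sketch of that strategy (not the actual GNU or Rust code, and assuming the file ends with a newline) might look like this:

    package main

    import (
        "fmt"
        "io"
        "log"
        "os"
    )

    const avgLineLen = 40 // the ~40 chars/line guess mentioned above

    // tailBytes returns the bytes of the last n lines of f by seeking toward
    // the end and widening the window until enough newlines have been seen.
    func tailBytes(f *os.File, n int) ([]byte, error) {
        fi, err := f.Stat()
        if err != nil {
            return nil, err
        }
        size := fi.Size()
        chunk := int64(n * avgLineLen)
        if chunk < 1 {
            chunk = 1
        }

        for off := size - chunk; ; off -= chunk {
            if off < 0 {
                off = 0
            }
            buf := make([]byte, size-off)
            if _, err := f.ReadAt(buf, off); err != nil && err != io.EOF {
                return nil, err
            }
            if i := startOfLastLines(buf, n); i >= 0 {
                return buf[i:], nil
            }
            if off == 0 { // the whole file has fewer than n lines
                return buf, nil
            }
        }
    }

    // startOfLastLines returns the index where the last n lines begin, or -1
    // if buf does not yet contain n complete lines plus the trailing newline.
    func startOfLastLines(buf []byte, n int) int {
        newlines := 0
        for i := len(buf) - 1; i >= 0; i-- {
            if buf[i] == '\n' {
                newlines++
                if newlines == n+1 { // n full lines back from the trailing newline
                    return i + 1
                }
            }
        }
        return -1
    }

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        out, err := tailBytes(f, 10) // plain `tail` defaults to 10 lines
        if err != nil {
            log.Fatal(err)
        }
        fmt.Print(string(out))
    }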
Very neat. I started rewriting GNU's coreutils in Rust and find it to be a nice way to learn the language.
Also, it is interesting how many obscure and lesser-known features some of the tools provide. In this case I clearly see the 80/20 rule: you can implement 80% of the main functionality in 20% of the time, but if you want to make exact clones you're going to need to invest a lot more time.
The line mentioning using gccgo to make the binaries small intrigued me... it worked! I've only written a couple of small tools in go but the size of the binary always bugged me.
It's just a shame that setting up cross-compilation with gccgo looks a lot more involved vs. gc.
I rewrote "pause" from Windows into my Ubuntu box out of habit of using it in some cases. First it was written in perl, then in Python, I also symlink clear as cls out of habit from Windows. There's small little "hacks" that you can do that are kind of fun.
Windows' cls actually erases the buffer from previous commands (so you can't scroll up past it), clear does not. I alias "cls=printf '\033c'" to get the Windows behavior.
I'm just doing this for fun and to learn about Go and Unix at the same time.
I am a beginner, and a lot of my code is inefficient and/or incomplete, but by putting it on the Internet I can get criticism and find out where I went wrong.
For instance, some of you have told me that the way I've been reading files is very inefficient, so now I'll try and do it the correct way.
That is the best reason. I am glad HN has people like you who are not afraid to put yourself out there and reminds us all that this is what being a hacker is: doing stuff for fun and learning.
Why do Go programmers always want to redo everything? It's rare to see something actually new written in Go. Rewrites prove that Go can do everything C can (except make shared libraries and produce small binaries), but not that it's actually better.
https://github.com/uutils/coreutils/
And another by suckless in plain C:
http://git.suckless.org/sbase/tree/README