Why wouldn't it to be easy to implement? You could probably make a global LD_PRELOAD library that hooks into all open() CREAT calls and tags the inode's xattr with the program that created the file. You could use an attribute like, user.created_by.
Why such a hack? This is a solved problem already. Try Linux audit API. You can set it up for tracking various actions, and you'll get a log with all the information including a process name.
I believe the question was asking for a method of storing the metadata, I was just suggesting xattrs. This will work at the expense of running a daemon.
Can audit be configured to only log file creation, not modification or access? Last time I didn't find a way to do that and I don't want to write a log file entry for every file access.
Programs can use programs like 'touch' to create the files so you'd have to get meta into a chain of process ownership or something while exempting the user's shell.
I think touch is a generic example, but not a practical one. The idea is that program A can launch program B, and you'd have to recursively search for the parent. But it's not easy to find which parent is the true owner, maybe an exception can be made for init but shells can't have that exception as they are sometimes the owner.
MacOS does something sorta like this. It uses extended attributes to mark where a file came from in some cases, such as when an executable is downloaded from a browser, it will mark it as "unsafe" and when you attempt to run it, this causes it to fail to run and you have to whitelist it via security preferences.
Adding an interface to access the tag store won't be very hard.
It would be harder to make it so that all the tag-oblivious programs, like mv or vi, to say nothing of rsync, would preserve the attributes when moving or modifying a file.
f = open("the_file.new")
write(f, new_contents);
close(f);
rename("the_file", "the_file~");
rename("the_file.new", "the_file");
So the old file is never modified, it's renamed to the backup copy, and an entirely new file is created to take its place.
This has a number of advantages, but does not play well with any extended info the old file used to have, unless they are copied explicitly. And in a tagging-oblivious program, they won't be.
f = open( "the_file.new.$$" );
write( f, new_contents );
close( f );
rename( "the_file.new.$$", "the_file.new" )
If not implemented like this, then something that attempts to read "the_file.new" may get partial contents, normally truncated along the file system block size.
Backups often take place, too, like you mentioned, depending on editor.
If the value matches, you renamed the right file. If it does not match, you renamed a wrong file. Without holding the fd, it is impossible to know if it is the right file.
Thanks for teaching me this! I wasn't aware. So, the original xattr (and the program that created it) would be deleted and replaced by vi for example. Still technically right, but I'm unsure how you'd keep a history.
Please no. This article is literally "Dotfile madness", don't make it worse!
I have to deal already with these on file shares, specifically for Apple: .DS_Store, .Trashes and .AppleDouble, or for Windows: Thumbs.db, $RECYCLE.BIN (for some reason Windows sometimes ignores the fact I've disabled the recycle bin on a share and creates this instead) and desktop.ini. Please don't drop crap around directories where there exist a multitude of tidier alternatives.
Because when you have multiple OSs accessing the same shared folder, they all create their own crap, which is then visible to the other OSs, and fills up directories with stuff that confuses normal users.
I don't run multiple OSes, but I do use USB sticks and SD cards on other people's computers and vice versa. I'm pretty sure that is "normal user" behaviour.
It gets annoying real quickly because you see the "crap" files of all the OSes you're not using. And you delete them, you only have to insert the stick and they're back. This happens in different ways for all of Mac, Windows, Android and certain Linuxes.
Sidecar files for storing file tags make querying the tag system a huge chore; it totally kills performance. I understand the conceptual appeal but it's just not the way to go.
There's an entire research field dedicated to the information flow control which could solve just that (if it were actually used).
In my research team, we used a tainting tracing mechanism to understand the behavior of malware. Basically, we installed a malware on a clean phone and we then traced all information flow originating from the APK to processes, to files, to sockets, etc. It helped reverse-engineering the malware.
I'd love something like this on program level. Given contents of a variable, I'd like to know where they came from - which pieces of code contributed to the result. I'd also love to be able to mark a piece of data, and see what code touched it or derived new data from it. A programming equivalent of injecting radioisotopes into the body.
I’d suggest looking into taint analysis tools, though those are usually aimed more at finding things like unsanitized input ending up in a call to system.
BeFileSystem (Haiku)
HFS+ and APFS
Even NTFS has attributes.
The problem is file manager. Tracker/Finder are several steps ahead of Explorer. Under Linux, its a greater mess even though there are no technical limitations.
What's not quite so straight forward is how you query the system. The worst implementations of file tagging only implement retrieving a list of all files that have a given tag. Slightly better than this are tagging systems that will return the intersection between two or more tags. Most tag systems never go beyond this.
Going slightly further, tag exclusions are powerful and are sometimes implemented (given a set of files from some subquery, exclude all files that have a given tag.) What you rarely see are systems that allow you to exclude one subquery from another.
However what you almost never see implemented is a system for preferential tags; given a subquery, reorder the results such that files with preferred tags are raised to the top, ordered by how many of the preferred tags they have. Once you implement this, the system's UX changes dramatically because the user no longer has to make strong assumptions about how well their files have been tagged. Many files might be missing relevant tags and the user may not be sure if the file they're after is one of these incompletely tagged files. When using a system with preferential tag querying, the user will receive files in their result list that don't match all the tags listed, but most of them. This is a bigger deal than it may sound, since the main drawback of using tags for file management is incomplete tagging. By addressing this drawback, you stand to bring file tagging to the next conceptual level, which is rendering hierarchical file management obsolete.
Consider that hierarchical file management is a strict subset of file tagging. You can model file hierarchies inside a file tagging system. To demonstrate this by example, consider /home/joesixpack/Documents/seventh-novel.pdf For each level of the hierarchy, we can create a new tag, such that this document has the tags: '/home/', '/home/joesixpack/', '/home/joesixpack/Documents/'. But because we're using file tagging, we can also automatically tag that file with things like 'pdf' or maybe even 'Documents'.
Now before I go on, there is something to be said about the number of tags in the system exploding when you model a tree as a tagset. In practice this probably isn't the approach any real system should take, if only because most of those tags will be useless and because there is a great deal of redundancy in trees that wouldn't exist in a native filetagging system. Consider /home/joesixpack/Documents/ and /home/johnsmith/Documents/. We have two different Documents directories because they exist in different parts of the hierarchy. However in a file tagging system we'd ideally only have a single Documents tag and one tag per user, such that querying the intersection between 'joesixpack' and 'Documents' returns the files that would otherwise be in /home/joesixpack/Documents. Some tags, such as the user tag, could be implicit; such that if joesixpack simply queries 'Documents' the tag 'joesixpack' is automatically intersected with the results presented to him.
With that out of the way, consider how we can go further if we apply straight forward statistics to the tag system. Suppose 99% of joesixpack's 'Documents' that are 'pdf' are also 'novel'. When joesixpack creates a new file that's in the tags 'Documents' and 'pdf', what are the chances that file should also be tagged with 'novel'? We could analyze the contents of the file if the system has a semantic understanding of what 'novel' means, but let's not go there. Tag systems created by the likes of Facebook and Google do this sort of analysis, but it's heavy, tricky, goes wrong in ways that create PR disasters, etc. We can get pretty good results by ignoring the contents of the file and looking instead at merely what that file is already tagged with. If a file is tagged with 'joesixpack', 'Documents', and 'pdf', it _probably_ should be tagged with 'novel' as well. Probably, but not necessarily. So the system can expose to joesixpack the suggestion that he tag that particular file with 'novel'. By presenting suggestions like that to the user, you greatly reduce the UX friction needed to tag files well and therefore increase the usability of the tagging system, while at the same time improving the quality of the tag suggestions in the future. What we create here is a 'virtuous cycle' of sorts that gives back to the user more as it's used more. Such a tag suggester can be implemented as multi-label classification using naive bayes; it's fairly straightforward.
Going further; if we have such a tag suggester, and an appropriate caching scheme, tag suggestions can be used by queries. If joesixpack queries for 'pdf','novel', the system could return to him the intersection of 'pdf','novel', but also return to him files that are tagged with 'pdf' but not tagged with 'novel', in cases where the tag suggester indicates there is a high likelihood that the file should be tagged with 'novel'.
That last paragraph may or may not play out well, I haven't experimented with it extensively yet. But getting back to my original point: if you're going to implement file tagging in a filesystem, you should think carefully about how the users will query that system, and whether the query system will be extensible in userspace to facilitate new powerful methods of querying it. It would be a total tragedy if the system only supported basic tag intersections and the only way to extend it was to implement a new kernel module.
This would require a ton of difference from how modern desktop and mobile OSes handle files / documents. Maybe some time later, some research OS would implement it.
I've implemented the above in userspace for my own experimentation (not yet ready for public release, but maybe soon.) Nearly everything I've described above can be implemented with relatively straight forward SQL queries and SQLites performance has been everything I could have asked from it (in the neighborhood of ten thousand tags and hundreds of thousands of files.)
Getting file tagging into the kernel level to replace directory hierarchies would be a huge paradigm shift, a very dramatic departure from Unix. To be honest I'm not sure whether or not getting such a system into a kernel would be appropriate or not. Traditional hierarchical file management seems more than sufficient for "system files". But I'm really interested in replacing hierarchical file management from the users' perspective. More or less, put ~/Documents, ~/Downloads, ~/Desktop, etc under control of the tagging system but leave the rest of the system as-is. At least for the proof of concept.
I think a demonstration system could be implemented as a custom desktop environment running on regular old linux, where the open and save dialogs of GUI applications have been replaced with tagging/querying windows. Instead of the user clicking through a few directories to find a file to open or find a directory to save something, they would instead click or type in tags to add or query. The GUI file manager would likewise be replaced with the GUI frontend to the tagging system.
The story I heard was that performance sucked. Granted it could have sucked for other reasons, as other failed efforts like writing more OS components in C# were happening at the same time.
It's not easy to implement nicely, though.