A general tagging mechanism, reasonably supported *across filesystems*, would be...

netsec_burn · on Feb 2, 2019

Why wouldn't it be easy to implement? You could probably make a global LD_PRELOAD library that hooks every open() call that passes O_CREAT and tags the inode's xattr with the program that created the file. You could use an attribute like user.created_by.
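
A minimal Python sketch of the storage half of this idea (not the LD_PRELOAD hook itself, which would have to be written in C). It assumes Linux and a filesystem with user xattrs enabled, and it uses the `user.created_by` name suggested above; `tag_creator` and `read_creator` are hypothetical helper names.

```python
import os
import sys

ATTR = "user.created_by"  # attribute name suggested in the comment above

def tag_creator(path, program=None):
    """Record which program created `path`. Returns True if the tag was stored."""
    if not hasattr(os, "setxattr"):  # os.setxattr is only available on Linux
        return False
    program = program or os.path.basename(sys.argv[0]) or "python"
    try:
        os.setxattr(path, ATTR, program.encode())
        return True
    except OSError:
        return False  # e.g. the filesystem does not support user xattrs

def read_creator(path):
    """Return the stored creator name, or None if there is no tag."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode()
    except OSError:
        return None  # attribute missing or xattrs unsupported
```

Both helpers fail soft, since whether the attribute can be stored at all depends on the filesystem and its mount options.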

megous · on Feb 2, 2019

Why such a hack? This is a solved problem already. Try the Linux audit API. You can set it up to track various actions, and you'll get a log with all the information, including the process name.

https://linux-audit.com/configuring-and-auditing-linux-syste...

netsec_burn · on Feb 2, 2019

I believe the question was asking for a method of storing the metadata; I was just suggesting xattrs. This will work, at the expense of running a daemon.

bennofs · on Feb 2, 2019

Can audit be configured to only log file creation, not modification or access? Last time I didn't find a way to do that and I don't want to write a log file entry for every file access.

fwip · on Feb 2, 2019

You can at least ignore read events, not sure about separating write from creation.

cma · on Feb 2, 2019

Programs can run tools like 'touch' to create the files, so you'd have to record a chain of process ownership or something similar, while exempting the user's shell.

ksherlock · on Feb 2, 2019

Have you seen actual programs (not shell scripts) that do that? I've heard of code smells, but that's a programmer smell right there.

netsec_burn · on Feb 2, 2019

I think touch is a generic example, but not a practical one. The idea is that program A can launch program B, and you'd have to recursively search for the parent. But it's not easy to find which parent is the true owner. Maybe an exception can be made for init, but shells can't have that exception, as they are sometimes the owner.

larkeith · on Feb 2, 2019

Couldn't you just record the creating process and all ancestors?
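
A rough Python sketch of what "record the creating process and all ancestors" could look like on Linux. It assumes `/proc` is mounted and reads the `Name` and `PPid` fields of `/proc/<pid>/status`; `ancestry` is a hypothetical helper name.

```python
import os

def ancestry(pid):
    """Return [(pid, name), ...] from `pid` up to the root of the process tree."""
    chain = []
    while pid > 0:
        fields = {}
        try:
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except OSError:
            break  # no /proc, or the process exited while we were looking
        chain.append((pid, fields.get("Name", "?")))
        pid = int(fields.get("PPid", "0"))  # PPid 0 marks the top of the tree
    return chain
```

Calling `ancestry(os.getpid())` at file-creation time gives a list that could be serialised into the xattr. The walk is inherently racy: a parent can exit mid-walk, after which its children are reparented.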

netsec_burn · on Feb 2, 2019

You're right. I can see anything like scripts getting difficult. It wouldn't work for all use cases.

ecnahc515 · on Feb 2, 2019

macOS does something sort of like this. It uses extended attributes to mark where a file came from in some cases: when an executable is downloaded from a browser, it is marked as "unsafe", and when you attempt to run it, it fails to run until you whitelist it via security preferences.

nine_k · on Feb 2, 2019

Adding an interface to access the tag store won't be very hard.

It would be harder to make it so that all the tag-oblivious programs, like mv or vi, to say nothing of rsync, would preserve the attributes when moving or modifying a file.

netsec_burn · on Feb 2, 2019

Sounds like you might need to hook rename() if you'd like to add that feature too, but I would still say it's straightforward.

nine_k · on Feb 2, 2019

The way most editors save a file is like this:

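  f = open("the_file.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  write(f, new_contents, strlen(new_contents));
  close(f);
  rename("the_file", "the_file~");     /* old inode becomes the backup */
  rename("the_file.new", "the_file");  /* new inode takes over the name */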

So the old file is never modified, it's renamed to the backup copy, and an entirely new file is created to take its place.

This has a number of advantages, but does not play well with any extended info the old file used to have, unless they are copied explicitly. And in a tagging-oblivious program, they won't be.
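
To make the consequence concrete, here is a hedged Python sketch of the save-with-backup sequence described above, plus the explicit attribute copy a tag-aware program would need. `copy_xattrs` and `save_with_backup` are hypothetical names, the `os.*xattr` calls are Linux-only, and this is not a claim about how any particular editor is implemented.

```python
import os

def copy_xattrs(src, dst):
    """Best-effort copy of extended attributes. Returns how many were copied."""
    if not hasattr(os, "listxattr"):  # non-Linux platforms
        return 0
    copied = 0
    try:
        for name in os.listxattr(src):
            os.setxattr(dst, name, os.getxattr(src, name))
            copied += 1
    except OSError:
        pass  # filesystem without xattr support, or permission denied
    return copied

def save_with_backup(path, data):
    """Write `data` to a new file, keep the old one as `path~`, then swap names."""
    tmp = path + ".new"
    with open(tmp, "wb") as f:
        f.write(data)
    if os.path.exists(path):
        copy_xattrs(path, tmp)         # the step a tag-oblivious program skips
        os.replace(path, path + "~")   # old inode becomes the backup
    os.replace(tmp, path)              # new inode takes over the name
```

After `save_with_backup`, `os.stat(path).st_ino` differs from the value before the save, which is why anything attached to the old inode is lost unless it is copied across.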

6c696e7578 · on Feb 2, 2019

I see a lot of saves work like this:

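    /* "$$" stands for the writer's PID, as in the shell */
    f = open( "the_file.new.$$", O_WRONLY | O_CREAT | O_TRUNC, 0644 );
    write( f, new_contents, strlen( new_contents ) );
    close( f );
    rename( "the_file.new.$$", "the_file.new" );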

If it's not implemented like this, then something that attempts to read "the_file.new" may get partial contents, normally truncated at a file system block boundary.

Backups often take place, too, like you mentioned, depending on editor.
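
A sketch of the write-to-a-unique-name-then-rename pattern in Python. It uses `tempfile.mkstemp` for the unique name instead of the PID suffix shown above (my substitution), and `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile

def atomic_write(path, data):
    """Write `data` to `path` so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Unique temp name in the same directory, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=os.path.basename(path) + ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before publishing the name
        os.replace(tmp, path)     # atomic replacement on POSIX
    except BaseException:
        try:
            os.unlink(tmp)        # do not leave a half-written temp file behind
        except FileNotFoundError:
            pass
        raise
```

Note that `mkstemp` creates the file with mode 0600, so a real implementation would probably also want to restore the original permissions.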

notyourday · on Feb 2, 2019

If you do not rename files while holding their open file descriptors there's no guarantee that you are renaming the file you just wrote to.

JdeBP · on Feb 2, 2019

There is no guarantee of that if one does retain the open file descriptor.

notyourday · on Feb 4, 2019

before rename:

fstat the fd, get st_dev and st_ino.

rename

stat the new name. Compare st_dev and st_ino.

If the values match, you renamed the right file. If they do not match, you renamed a wrong file. Without holding the fd, it is impossible to know if it is the right file.
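
The steps above, sketched in Python with `os.fstat` and `os.stat`. `rename_and_verify` is a hypothetical name. The check runs after the rename, so it can only report a mismatch, not prevent one.

```python
import os

def rename_and_verify(fd, old_name, new_name):
    """Rename, then check that `new_name` refers to the file open on `fd`."""
    before = os.fstat(fd)              # identity of the file we wrote to
    os.rename(old_name, new_name)
    after = os.stat(new_name)          # identity of whatever now has the new name
    return (before.st_dev, before.st_ino) == (after.st_dev, after.st_ino)
```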

JdeBP · on Feb 4, 2019

"you renamed a wrong file" shows that "there's no guarantee that you are renaming the file you just wrote to".

notyourday · on Feb 4, 2019

In this case you know that you renamed a wrong file. If you close before the rename, you do not know that you have renamed a wrong file.

JdeBP · on Feb 5, 2019

Knowing that something bad happened after the fact is not the same as your idea that there's a guarantee that something bad will not happen.

netsec_burn · on Feb 2, 2019

Thanks for teaching me this! I wasn't aware. So, the original xattr (and the program that created it) would be deleted and replaced by vi for example. Still technically right, but I'm unsure how you'd keep a history.

sneak · on Feb 2, 2019

Sidecar files might work, and would also be filesystem-portable.

amaccuish · on Feb 2, 2019

Please no. This article is literally "Dotfile madness", don't make it worse!

I already have to deal with these on file shares, specifically for Apple: .DS_Store, .Trashes and .AppleDouble, or for Windows: Thumbs.db, $RECYCLE.BIN (for some reason Windows sometimes ignores the fact I've disabled the recycle bin on a share and creates this instead) and desktop.ini. Please don't drop crap around directories when there exists a multitude of tidier alternatives.

sneak · on Feb 2, 2019

How do you have to “deal with” them? What are the portable tidier alternatives?

amaccuish · on Feb 2, 2019

Because when you have multiple OSs accessing the same shared folder, they all create their own crap, which is then visible to the other OSs, and fills up directories with stuff that confuses normal users.

sneak · on Feb 3, 2019

Normal users don’t run multiple OSes.

tripzilch · on Feb 3, 2019

I don't run multiple OSes, but I do use USB sticks and SD cards on other people's computers and vice versa. I'm pretty sure that is "normal user" behaviour.

It gets annoying really quickly because you see the "crap" files of all the OSes you're not using. And if you delete them, you only have to insert the stick again and they're back. This happens in different ways for all of Mac, Windows, Android and certain Linuxes.

amaccuish · on Feb 4, 2019

Some users however access our file shares from their Macs, or their Windows PCs, or even some from their Linux machines. That's fairly normal.

featherrust · on Feb 2, 2019

A daemon with a CLI and a programmatic interface, backed by a SQLite store, hooked into Linux audit and perhaps an LD_PRELOAD?

recursive · on Feb 3, 2019

For me, mostly exclude them from select all, or perhaps scroll past them to see meaningful files.

darkpuma · on Feb 2, 2019

Sidecar files for storing file tags make querying the tag system a huge chore; it totally kills performance. I understand the conceptual appeal but it's just not the way to go.

netsec_burn · on Feb 2, 2019

This is what Mac does to track download locations, if I remember correctly (._file).

runxel · on Feb 3, 2019

How do you sleep at night?

lgeorget · on Feb 2, 2019

There's an entire research field dedicated to information flow control, which could solve just that (if it were actually used).

In my research team, we used a taint-tracing mechanism to understand the behavior of malware. Basically, we installed a malware sample on a clean phone and then traced all information flows originating from the APK to processes, to files, to sockets, etc. It helped with reverse-engineering the malware.

TeMPOraL · on Feb 2, 2019

I'd love something like this on program level. Given contents of a variable, I'd like to know where they came from - which pieces of code contributed to the result. I'd also love to be able to mark a piece of data, and see what code touched it or derived new data from it. A programming equivalent of injecting radioisotopes into the body.

saagarjha · on Feb 2, 2019

I’d suggest looking into taint analysis tools, though those are usually aimed more at finding things like unsanitized input ending up in a call to system.

0db532a0 · on Feb 2, 2019

Any interesting links to papers?

lgeorget · on Feb 3, 2019

There's the work by Myers and Liskov which is quite central to the whole field: https://dl.acm.org/citation.cfm?id=266669, there's Flume: https://dl.acm.org/citation.cfm?id=1294293, DStar https://www.usenix.org/event/nsdi08/tech/full_papers/zeldovi.... More recently, on Android: https://www.usenix.org/system/files/conference/usenixsecurit.... And this (disclaimer: I'm the main author): https://link.springer.com/chapter/10.1007/978-3-319-66197-1_....

0db532a0 · on Feb 4, 2019

Thanks.

Pxtl · on Feb 2, 2019

That's a killer idea. I hope somebody will reise to the challenge of a filesystem with such a feature.

smallstepforman · on Feb 3, 2019

BeFileSystem (Haiku), HFS+ and APFS. Even NTFS has attributes.

The problem is the file manager. Tracker/Finder are several steps ahead of Explorer. Under Linux, it's an even greater mess, even though there are no technical limitations.

Pxtl · on Feb 3, 2019

I was mostly going for a morbid pun, but it's nice to hear that the FS does support this.

Surprising Explorer is so bad at it since they're good at getting EXIF data and ID3 tags into the Properties tab in Explorer.

agumonkey · on Feb 2, 2019

for non portable, I found it nice that wget or curl adds the original url to the saved file as xattrs

darkpuma · on Feb 2, 2019

Associating tags with files is straightforward.

What's not quite so straightforward is how you query the system. The worst implementations of file tagging only implement retrieving a list of all files that have a given tag. Slightly better than this are tagging systems that will return the intersection between two or more tags. Most tag systems never go beyond this.

Going slightly further, tag exclusions are powerful and are sometimes implemented (given a set of files from some subquery, exclude all files that have a given tag.) What you rarely see are systems that allow you to exclude one subquery from another.

However, what you almost never see implemented is a system for preferential tags: given a subquery, reorder the results such that files with preferred tags are raised to the top, ordered by how many of the preferred tags they have. Once you implement this, the system's UX changes dramatically because the user no longer has to make strong assumptions about how well their files have been tagged. Many files might be missing relevant tags and the user may not be sure if the file they're after is one of these incompletely tagged files. When using a system with preferential tag querying, the user will also receive files in their result list that match most, but not all, of the tags listed. This is a bigger deal than it may sound, since the main drawback of using tags for file management is incomplete tagging. By addressing this drawback, you stand to bring file tagging to the next conceptual level, which is rendering hierarchical file management obsolete.
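
A small in-memory Python sketch of the query levels described above: intersection, exclusion, and preferential ordering. The data layout (a dict of file name to tag set) and the name `ranked_query` are assumptions made for illustration only.

```python
def ranked_query(files, required=(), excluded=(), preferred=()):
    """files: dict mapping file name -> set of tags. Returns names, best match first."""
    required, excluded, preferred = set(required), set(excluded), set(preferred)
    hits = [
        name for name, tags in files.items()
        if required <= tags and not (excluded & tags)  # intersection, then exclusion
    ]
    # Files with more preferred tags come first; ties are broken by name.
    return sorted(hits, key=lambda n: (-len(preferred & files[n]), n))
```

For example, requiring 'pdf' while preferring 'novel' and 'draft' still returns a PDF that was never tagged 'novel', just lower in the list.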

Consider that hierarchical file management is a strict subset of file tagging. You can model file hierarchies inside a file tagging system. To demonstrate this by example, consider /home/joesixpack/Documents/seventh-novel.pdf. For each level of the hierarchy, we can create a new tag, such that this document has the tags: '/home/', '/home/joesixpack/', '/home/joesixpack/Documents/'. But because we're using file tagging, we can also automatically tag that file with things like 'pdf' or maybe even 'Documents'.
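
The mapping in that example can be sketched in a few lines of Python. `path_to_tags` is a hypothetical name, and the choice to add the extension and the last directory name as extra tags simply mirrors the 'pdf' and 'Documents' examples above.

```python
import posixpath

def path_to_tags(path):
    """Turn an absolute POSIX path into one tag per ancestor directory, plus extras."""
    directory, filename = posixpath.split(path)
    parts = [p for p in directory.split("/") if p]
    # One tag per level of the hierarchy: '/home/', '/home/joesixpack/', ...
    tags = ["/" + "/".join(parts[: i + 1]) + "/" for i in range(len(parts))]
    ext = posixpath.splitext(filename)[1].lstrip(".")
    if ext:
        tags.append(ext.lower())   # automatic file-type tag, e.g. 'pdf'
    if parts:
        tags.append(parts[-1])     # bare directory name, e.g. 'Documents'
    return tags
```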

Now before I go on, there is something to be said about the number of tags in the system exploding when you model a tree as a tagset. In practice this probably isn't the approach any real system should take, if only because most of those tags will be useless and because there is a great deal of redundancy in trees that wouldn't exist in a native filetagging system. Consider /home/joesixpack/Documents/ and /home/johnsmith/Documents/. We have two different Documents directories because they exist in different parts of the hierarchy. However in a file tagging system we'd ideally only have a single Documents tag and one tag per user, such that querying the intersection between 'joesixpack' and 'Documents' returns the files that would otherwise be in /home/joesixpack/Documents. Some tags, such as the user tag, could be implicit; such that if joesixpack simply queries 'Documents' the tag 'joesixpack' is automatically intersected with the results presented to him.

With that out of the way, consider how we can go further if we apply straightforward statistics to the tag system. Suppose 99% of joesixpack's 'Documents' that are 'pdf' are also 'novel'. When joesixpack creates a new file that's in the tags 'Documents' and 'pdf', what are the chances that file should also be tagged with 'novel'? We could analyze the contents of the file if the system has a semantic understanding of what 'novel' means, but let's not go there. Tag systems created by the likes of Facebook and Google do this sort of analysis, but it's heavy, tricky, goes wrong in ways that create PR disasters, etc. We can get pretty good results by ignoring the contents of the file and looking instead at merely what that file is already tagged with. If a file is tagged with 'joesixpack', 'Documents', and 'pdf', it _probably_ should be tagged with 'novel' as well. Probably, but not necessarily. So the system can expose to joesixpack the suggestion that he tag that particular file with 'novel'. By presenting suggestions like that to the user, you greatly reduce the UX friction needed to tag files well and therefore increase the usability of the tagging system, while at the same time improving the quality of the tag suggestions in the future. What we create here is a 'virtuous cycle' of sorts that gives back to the user more as it's used more. Such a tag suggester can be implemented as multi-label classification using naive Bayes; it's fairly straightforward.
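
A rough sketch of such a suggester in Python, treating each candidate tag as its own yes/no naive Bayes problem. It simplifies by using only the tags that are present as evidence and by applying add-one smoothing. Both choices, the 0.8 threshold, and the name `suggest` are my assumptions, not details of the system described here.

```python
import math

def suggest(corpus, tags, threshold=0.8):
    """corpus: list of tag sets of already-tagged files. tags: tags on the new file.
    Returns {candidate_tag: estimated probability} for candidates above threshold."""
    n = len(corpus)
    vocab = set().union(*corpus)
    out = {}
    for cand in vocab - set(tags):
        with_c = [t for t in corpus if cand in t]
        without_c = [t for t in corpus if cand not in t]
        # Smoothed log prior, then one smoothed likelihood factor per observed tag.
        log_yes = math.log((len(with_c) + 1) / (n + 2))
        log_no = math.log((len(without_c) + 1) / (n + 2))
        for tag in tags:
            log_yes += math.log((sum(tag in t for t in with_c) + 1) / (len(with_c) + 2))
            log_no += math.log((sum(tag in t for t in without_c) + 1) / (len(without_c) + 2))
        diff = log_no - log_yes
        p = 0.0 if diff > 700 else 1 / (1 + math.exp(diff))  # avoid exp() overflow
        if p >= threshold:
            out[cand] = p
    return out
```

With a corpus where almost every file tagged 'joesixpack', 'Documents' and 'pdf' is also tagged 'novel', a new file carrying the first three tags gets 'novel' suggested and unrelated tags are not.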

Going further; if we have such a tag suggester, and an appropriate caching scheme, tag suggestions can be used by queries. If joesixpack queries for 'pdf','novel', the system could return to him the intersection of 'pdf','novel', but also return to him files that are tagged with 'pdf' but not tagged with 'novel', in cases where the tag suggester indicates there is a high likelihood that the file should be tagged with 'novel'.

That last paragraph may or may not play out well, I haven't experimented with it extensively yet. But getting back to my original point: if you're going to implement file tagging in a filesystem, you should think carefully about how the users will query that system, and whether the query system will be extensible in userspace to facilitate new powerful methods of querying it. It would be a total tragedy if the system only supported basic tag intersections and the only way to extend it was to implement a new kernel module.

nine_k · on Feb 2, 2019

I very much agree.

This would require a big departure from how modern desktop and mobile OSes handle files / documents. Maybe some time later, some research OS will implement it.

darkpuma · on Feb 2, 2019

I've implemented the above in userspace for my own experimentation (not yet ready for public release, but maybe soon.) Nearly everything I've described above can be implemented with relatively straightforward SQL queries, and SQLite's performance has been everything I could have asked from it (in the neighborhood of ten thousand tags and hundreds of thousands of files.)
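
As an illustration of how little SQL this needs, here is a hedged sketch using Python's built-in `sqlite3` module. The single `file_tags` table and the names `make_store` and `sql_query` are invented for this example; the actual schema of the system described above is not known.

```python
import sqlite3

def make_store(tagged_files):
    """tagged_files: dict of file name -> iterable of tags. Returns an in-memory DB."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE file_tags (file TEXT, tag TEXT, PRIMARY KEY (file, tag))")
    db.executemany(
        "INSERT INTO file_tags VALUES (?, ?)",
        [(f, t) for f, tags in tagged_files.items() for t in tags],
    )
    return db

def sql_query(db, required, preferred=()):
    """Files carrying every required tag, ordered by number of preferred tags."""
    required, preferred = sorted(set(required)), sorted(set(preferred))
    marks = lambda xs: ",".join("?" * len(xs))
    # "tag IN (...)" evaluates to 0 or 1 in SQLite, so SUM() counts matching tags.
    sql = f"""
        SELECT file, SUM(tag IN ({marks(preferred)})) AS score
        FROM file_tags
        GROUP BY file
        HAVING SUM(tag IN ({marks(required)})) = ?
        ORDER BY score DESC, file
    """
    return db.execute(sql, [*preferred, *required, len(required)]).fetchall()
```

Each row comes back as `(file, score)`, where `score` is the number of preferred tags the file carries.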

Getting file tagging into the kernel level to replace directory hierarchies would be a huge paradigm shift, a very dramatic departure from Unix. To be honest, I'm not sure whether getting such a system into a kernel would be appropriate. Traditional hierarchical file management seems more than sufficient for "system files". But I'm really interested in replacing hierarchical file management from the users' perspective. More or less, put ~/Documents, ~/Downloads, ~/Desktop, etc. under control of the tagging system but leave the rest of the system as-is. At least for the proof of concept.

I think a demonstration system could be implemented as a custom desktop environment running on regular old Linux, where the open and save dialogs of GUI applications have been replaced with tagging/querying windows. Instead of the user clicking through a few directories to find a file to open or find a directory to save something, they would instead click or type in tags to add or query. The GUI file manager would likewise be replaced with the GUI frontend to the tagging system.

spockz · on Feb 2, 2019

Wasn’t this the whole idea and purpose of the successor to NTFS which was supposed to ship with Longhorn/Vista? What happened there?

asveikau · on Feb 3, 2019

The story I heard was that performance sucked. Granted it could have sucked for other reasons, as other failed efforts like writing more OS components in C# were happening at the same time.

darkpuma · on Feb 3, 2019

WinFS. I don't know much about it, but I understand it was to have some properties similar to what I've described.