Bangin' on a Rok
Amarok, KDE, and all that good stuff

9 Aug 08

Amarok File Tracking

I don't blog often, but when I do it tends to be meaty.  I won't disappoint.  I'll be covering Amarok, Amarok history, and a possible future part of kdelibs.

"We can rebuild him. We have the technology. Better than before. Better, stronger, faster."

A little-known feature in Amarok 1, starting at about 1.4.3, was what was known as Amarok File Tracking, or AFT.  For every single file in your collection, on scan, a unique identifier (UID) was generated from some of the file's attributes.  If you moved your tracks around your folders, as the incremental scan kicked in, the UID would allow for the file to be identified, and integration throughout Amarok would mean that your statistics, your cached lyrics, and the current playlist would all be updated with the new path.  No longer did you have to worry that moving around your files would mean losing years of statistics.  Or losing your files.

But I'm getting ahead of myself.

See, AFT wasn't born AFT.  AFT could not track both a file metadata change and a file location change at once, because the UID was being based on file properties such as file length, plus a portion of the file itself hashed together. So you could still lose track of your files.  This was a limitation that was known in advance.
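As a rough illustration of that kind of non-embedded UID, here is a minimal Python sketch. The post only says that the file length and a portion of the file were hashed together, so the use of MD5, the 16 KiB sample taken from the start of the file, and the function name are all assumptions made for the example, not Amarok's actual algorithm.

```python
import hashlib
import os

SAMPLE_BYTES = 16 * 1024  # assumed sample size, not Amarok's actual value


def non_embedded_uid(path, sample_bytes=SAMPLE_BYTES):
    """Hash the file length together with a leading portion of the file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        sample = handle.read(sample_bytes)
    digest = hashlib.md5()
    digest.update(str(size).encode("ascii"))  # file length as one input
    digest.update(sample)                     # portion of the file as the other
    return digest.hexdigest()
```

A UID built this way survives a move, because neither input depends on the path. A tag edit that changes the length or the sampled bytes produces a different UID, which is why a move and a metadata change between two scans could still lose the file.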

It was also a limitation that didn't originally exist.  As I said, AFT wasn't born AFT.  It was born as Advanced Tag Features, or ATF.  ATF was the same idea, but a little different -- it would store the generated UID directly in the file's metadata.  This allowed for superb file tracking capabilities, because unlike generating a UID from a part of a file, if that part of the file changed, you'd still have your UID.  In fact, the only way you *couldn't* track your file was if you either removed the file's tag entirely (or some other program removed the UID when it shouldn't), or if you removed the corresponding information from Amarok's database. (There are some downsides to this scheme: only certain file types are supported, for instance, determined by the kind of tag they use and the tag's ability to store this kind of information.)

So why the change?  Well, ATF had a problem, which was related to the structure of Amarok itself, and Amarok's historical penchant for crashing (which got much better as the 1.4 series progressed).  The outcome is possibly worthy of an entry in The Daily WTF.  In gory detail, here's the problem.

   1. Amarok would start a collection scan.  The collection scanner was the entity responsible for adding the UIDs to the file metadata.  Important note: the collection scan was a separate process.
   2. Amarok would crash, leaving the collection scan running, although not communicating with anything.  This scanner could be very slow if it was adding the UIDs, depending on whether padding had to be added to the file's tag.  If this was the case, the entire file would have to be rewritten.
   3. Amarok would be restarted by the user.  Another collection scan process would start.  Because UIDs would already exist for the early files, it would very quickly catch up to the first collection scan process.
   4. You now had two collection scan processes generating and writing UIDs at the same time to the same file.  If you were lucky, this would mess up your tag.  If you were unlucky, this toasted your entire file.
   5. Repeat step 4 for the rest of the scan.

ATF was never released in this state, but it did get turned on in SVN.  And a few unlucky users had far too many files end up corrupted, depending on how crashy things became for them.  After we finally realized what the issue was, a user came forward on the mailing list (still trying to find the exact mail or user) proposing a solution that I believe they'd seen in a class.  Essentially, the solution relies on modifications to temporary, uniquely named files instead of the original file, using MD5 checksums to find out if the original file has changed while writing the new file, then using filesystem atomicity guarantees to move the new file back over the old one.  This became the MetaBundleSaver, and it worked quite well, but it was also extremely slow compared to a normal scan.  And most importantly, no one was quite trusting of the whole ATF scheme any more.
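The scheme can be sketched in a few lines of Python. This is not the MetaBundleSaver code, only an illustration of the three ingredients named above: a uniquely named temporary copy, an MD5 check on the original, and an atomic rename. The function names and the `modify` callback are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_modify(path, modify):
    """Apply modify(temp_path) to a copy, then atomically replace the original.

    Returns True on success, False if the original changed in the meantime.
    """
    before = file_md5(path)
    directory = os.path.dirname(os.path.abspath(path))
    # Create the copy in the same directory so the final rename stays on one
    # filesystem; mkstemp guarantees a unique name per process.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(path, temp_path)
        modify(temp_path)
        if file_md5(path) != before:
            return False  # someone else touched the original; give up
        os.replace(temp_path, path)  # atomic rename on POSIX filesystems
        return True
    finally:
        # After a successful replace the temp name no longer exists.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

Two writers working this way never interleave their output inside one file, since each works on its own copy. The cost is visible too: the whole file is read twice and copied once, which fits with the approach being much slower than a normal scan. A small window remains between the checksum comparison and the rename, so this is a sketch of the idea rather than a complete guarantee.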

So, ATF was renamed to AFT and with it came a new algorithm that wouldn't touch anyone's files, but couldn't track as well.

A couple weeks ago, I added AFT to Amarok 2's SqlCollection.  Enjoy, everyone -- statistics, lyrics, and the playlist are already supported, with support for stored playlists coming eventually.  But there's more.
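For readers wondering what "supported" means at the database level, the following Python sketch shows the general shape of the bookkeeping: when a scan sees a known UID under a new path, every table keyed on the old path is updated. The table and column names here are made up for the example and are not SqlCollection's real schema.

```python
import sqlite3


def open_collection():
    """Create an in-memory database with two hypothetical tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE uniqueid (uid TEXT PRIMARY KEY, path TEXT)")
    db.execute(
        "CREATE TABLE statistics (path TEXT PRIMARY KEY, playcount INTEGER)"
    )
    return db


def record_scanned_file(db, uid, path):
    """Register a scanned file; if its UID is already known under another
    path, carry the statistics over to the new path."""
    row = db.execute(
        "SELECT path FROM uniqueid WHERE uid = ?", (uid,)
    ).fetchone()
    if row is None:
        # First time this UID is seen: remember where the file lives.
        db.execute("INSERT INTO uniqueid (uid, path) VALUES (?, ?)", (uid, path))
    elif row[0] != path:
        # Same file, new location: repoint everything keyed on the old path.
        db.execute("UPDATE statistics SET path = ? WHERE path = ?", (path, row[0]))
        db.execute("UPDATE uniqueid SET path = ? WHERE uid = ?", (path, uid))
    db.commit()
```

Lyrics and playlist entries would follow the same pattern as the statistics table in this sketch.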

Fast forward to today (okay, two days ago).  I was taking a shower -- Wade does insist that there's something about showers and KDE coders -- and I had a thought, which was essentially: there's absolutely no reason why Amarok 2 can't use a UID inside a file, if one exists, for superior tracking, and if not, generate a read-only type for normal tracking.
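That decision is simple enough to write down. In this Python sketch, `tags` stands for whatever a tag reader returns for the file, `EMBEDDED_KEY` is a hypothetical key under which an embedded ID would appear, and `compute_fallback` is any function that produces the read-only, non-embedded UID. None of these names come from Amarok itself.

```python
# Hypothetical key under which a tag reader exposes the embedded ID.
EMBEDDED_KEY = "amarok_uid"


def resolve_uid(tags, compute_fallback):
    """Return (uid, kind): prefer an embedded ID, else compute one.

    tags is a dict-like view of the file's metadata; compute_fallback is a
    zero-argument callable that is only invoked when no embedded ID exists.
    """
    embedded = tags.get(EMBEDDED_KEY)
    if embedded:
        return embedded, "embedded"
    return compute_fallback(), "computed"
```

Passing the fallback as a callable means the hashing work is skipped entirely for files that already carry an ID.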

So I created a utility that is built and installed with Amarok 2.  It's called amarok_afttagger, and it will write UIDs into your files, using a class ported from MetaBundleSaver and called SafeFileSaver to ensure that files are not overwritten/interleaved, even if you run the process twice or three times at once.  It optionally supports recursion if you want to pass in directories, and it can also remove UIDs from your files if you like.  Right now it supports MP3s only, but Vorbis and FLAC support will be coming soon.

I've tested it extensively.  I've added UIDs to files, removed them from files, regenerated the ones in files, over and over, and still everything is cherry.  And Amarok 2, when it finds these files, can do some awesomely robust file tracking.

I encourage people to give it a run on their MP3s and check it out -- if you're worried by all the Dark Ages info up above and don't have faith in the implemented solution, back up your files first, or operate on a copy of them, until you're satisfied it won't harm your files.  And if you still don't want to do it, you can enjoy the less awesome but still awesome power of the non-embedded UID file tracking.

Now, I promised this would talk about a possible KDE library.  I'll eventually be submitting the SafeFileSaver class for hopeful inclusion into kdelibs, so that any application that is worried about data integrity and needs to write to a user's files can take advantage of it.  It's very simple to use -- you simply give it a file path, and then operate on the file path that's returned to you when you call prepareToSave(), instead of the original one.  When you're all done, you call doSave() and it will perform the necessary functions.  That's it.
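To show the calling pattern, here is a small Python stand-in with the same two steps. Only the method names mirror prepareToSave() and doSave(); the class name, the internals, and the `append_marker` helper are assumptions for illustration and say nothing definite about how the C++ class is written.

```python
import hashlib
import os
import shutil
import tempfile


class SafeFileSaverSketch:
    """Python stand-in for the two-step interface; internals are assumed."""

    def __init__(self, path):
        self.path = path
        self.temp_path = None
        self.checksum = None

    @staticmethod
    def _md5(path):
        with open(path, "rb") as handle:
            return hashlib.md5(handle.read()).hexdigest()

    def prepare_to_save(self):
        """Copy the original to a uniquely named temp file; return its path."""
        self.checksum = self._md5(self.path)
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, self.temp_path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        shutil.copyfile(self.path, self.temp_path)
        return self.temp_path

    def do_save(self):
        """Move the temp file over the original if the original is unchanged."""
        try:
            if self._md5(self.path) != self.checksum:
                return False
            os.replace(self.temp_path, self.path)
            return True
        finally:
            # Clean up the temp file if it was not moved into place.
            if os.path.exists(self.temp_path):
                os.remove(self.temp_path)


def append_marker(path, marker):
    """Caller side: only ever write to the path returned by prepare_to_save()."""
    saver = SafeFileSaverSketch(path)
    working_copy = saver.prepare_to_save()
    with open(working_copy, "ab") as handle:
        handle.write(marker)
    return saver.do_save()
```

The caller never opens the original for writing, which is the whole point of the interface.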

Hope this has been enjoyable, and enjoy AFT in Amarok 2.  Play with it and be amazed.  Use amarok_afttagger on your files and be even more amazed.  More information is available here: http://amarok.kde.org/wiki/AFT

Filed under: Amarok, KDE
Comments (34) Trackbacks (0)
  1. Hmm this seems like a very bad idea. I don’t think you should silently alter the user’s files.

    What’s wrong with using the MD5 of the first x bytes of audio? That way if the tags or file length changes the ID won’t change.

    Actually you shouldn’t use the first x bytes because they might be zero quite often. Somewhere from the middle.

  2. No one is silently altering anyone’s files. You choose to add the UIDs to your files by running them through the program (and agreeing to the terms of use that are displayed).

    What’s wrong with using MD5 of the first x bytes of audio is that it’s very difficult to do. Our tag library does not have a way to access the audio data only, and you can’t simply find the offset — some formats interleave audio and metadata together. So we’d have to write a new library to find this audio data — it’s not worth it.

    Besides, many tag specifications already have this built in. ID3v2 has a UFID frame, for instance, which is what is used.

    The first x bytes are *not* zero quite often. And it’s not the only value being used in non-embedded hashes.

  3. And Nepomuk? I heard it has something like this to track files, it’s a possible solution?

  4. There are a few issues using Nepomuk for tracking.

    The first is that it can only track what Strigi scans. Many users may never use Strigi/Nepomuk, and of those that do, many won’t have it index their music files.

    The second is that Strigi generates SHA hashes of every file, but this includes (AFAIK) all of the data of each file. As soon as a file has changed (i.e. a metadata update), the SHA is invalidated, and we have to wait until Strigi gets it again, and then somehow sync with it. And if a user moves a file at the same time, we may never find the file again.

    So at best, it’s no more useful than non-embedded AFT. At worst, it’s almost useless.

  5. what if I move an audio file from my laptop to my desktop where there is another file with the same uid ?

  6. Two files with identical IDs will see the path from one or the other being used to update the playlist, statistics, and cached lyrics with.

    The way statistics are currently implemented, you would only receive statistics updates for plays of one of the files (I could possibly make this smarter at some point, so it checks to see if a file has a UID and if it matches somewhere else in the table, and preferentially uses that).

    What won’t happen, however, is that your statistics will be totally lost.

  7. Hope the 2.x series also solves problems with music on an SMB device (in my case a cheap and very simple NAS with SMB). The problem I currently have with 1.4.x is that the collection scan always restarts as soon as it finishes. With a few GB of music connected over a slow WLAN this means scan times of hours for a complete scan.

  8. jefferai, AFT sounds very nice, but what exactly does SafeFileSaver do that KSaveFile (already in kdelibs) doesn’t (or couldn’t be made to do with a little help ;) ?

    http://api.kde.org/4.x-api/kdelibs-apidocs/kdecore/html/classKSaveFile.html

  9. Huh. Who knew. Looks like they pretty much do the same thing.

    Actually, that’s one of my biggest problems with kdelibs. It’s really hard to find out what’s already there, in no small part because the search at api.kde.org sucks. It never finds classes I already do know exist, much less classes I don’t know exist…

  10. i probably would need some time to accept the fact amarok2 will change all my mp3’s checksum. don’t think i could cope with this :p so i sincerely hope there will be a choice whether to allow embedded AFT or not

  11. Nice, I always thought AFT was the superior solution even though it toasted some of my files back then :)

    Two things:

    - All my files also have MBIDs (see http://wiki.musicbrainz.org/MusicBrainzTag), a combination of those should also be a nice way to identify a track.

    - It might be a good idea to document the exact ID3 header you use on the AFT wiki page. maybe some other player will jump the band wagon and use that id as well…

  12. Please, actually read the blog before getting all worried. You have to use the separate utility if you want embedded tracking. You can still take advantage of non-embedded tracking if you don’t want to use it. No one is requiring you to modify all your files, or doing it automatically.

  13. Yeah, sorry about that :-)

    Re: MBIDs, those are ways to identify a track, not a file. There are reasons to do one or the other, but file tracking is more reliable.

    Re: ID3 header, it’s using a UFID frame. It’s the frame designed specifically for Unique File IDentifiers, so I think it’ll be obvious to other players’ coders :-)

  14. That is not quite true. Daniel is working on the Nepomuk support for Amarok and the goal is to decouple the files from the actual track database. That means that files are “only” treated as incarnations of the tracks. You then get play count, ratings and all that based on the actual track/title rather than the file. Thus, playing two files that represent the same track will change the statistics for this one track.

    Example: you have foor.mp3 and bar.ogg, both containing song “foo” by artist “bar”. Play them both and you get a playcount of 2 for the one song.

    The same should then apply for playlists: they consist of tracks, not of files. When you move a file, the relation to the actual track is not lost but maintained automatically by Nepomuk.

    Of course this is not done yet. So at the moment, AFT is the preferred solution. :)

  15. This would be similar to using MusicBrainz identifiers, and as I said in that comment, this would be tracking tracks, not files. Both methods have their merits.

    Remember that the Nepomuk collection is separate from the SQL collection. I see no reason why Daniel’s work couldn’t be used in the Nepomuk collection. But my understanding is that Strigi is required for the kind of situation you’re describing to actually work in Nepomuk, and if this is the case, it’s not something that can be relied upon. We can’t assume that users will have Strigi scanning all their music (especially if their music is on remote shares); we can only assume that users are scanning the music that is configured within Amarok. So it seems to make sense that for the SQL-based local collection, AFT would be the file-tracking method used, whereas for tracks sourced from the Nepomuk collection, a method like you described would be used.

  16. Not to jump on the criticism bandwagon, but an extra utility that needs to be run sounds like a usability nightmare. Hell, I’m already confused as to what the options are (will it still use Amarok 1.4-style tracking if the tagger isn’t run? Can this option be turned off?) Eventually there should probably be an option inside Amarok to enable and run the tagger. I envision some radio buttons:

    * Do not track my files
    * Track my files based on their current tags
    * Embed an ID in my files to track them

    The second option should probably be the default.

    Text below the radio button group would read the following if the given radio button is selected:

    1. If you move a file while this option is enabled, any statistics will be lost in the database
    2. This option will re-discover files if the location OR the tags are changed, but not if both are changed between database updates
    3. This option is the most robust, but will embed an identifier in your files’ tags. This process may take a long time for large collections.

    …or something similar. Choosing the third option and applying the changes would run an Amarok-managed process (with progress bar?) to tag all the files. This would also provide UI hooks (though not necessarily software hooks) in case someone wants to add Nepomuk-based tracking or MusicBrainz-based tracking.

    Now, I’m not a usability expert by any means, so I’m open to criticism as to why this sucks.

  17. “Not to jump on the criticism bandwagon”

    Then don’t.

    “But an extra utility that needs to be run sounds like a usability nightmare.”

    It’s not. It has nothing to do with usability.

    “Hell, I’m already confused as to what the options are.”

    Then actually read the entry.

    “Will it still use Amarok 1.4-style tracking if the tagger isn’t run?”

    It’s very clearly stated that it will.

    “Eventually there should probably be an option inside Amarok to enable and run the tagger.”

    There can be. When everyone is satisfied that it works and is safe. Or it could easily be done through a script.

  18. Could someone clarify some of this for me. The above solution seems like a very nice one.

    However I don’t entirely understand the difficulties that Amarok and NEPOMUK seem to be having with associating metadata with files. Why is it necessary to use file watching rather than hard links?

    p.s. Lest you misinterpret my “tone of voice” I know that I’m ignorant about these things, this is a genuine question not a suggestion that you’ve missed something.

  19. Don’t confuse Amarok and NEPOMUK in terms of associating metadata with files, they’re very different. Totally different scanning mechanisms and backends.

    Amarok has tables in its database with file paths to associate files with scanned (with amarokcollectionscanner) metadata, statistics, etc. When the file path changes, those tables have to be synced somehow, or you lose the association in the database with the actual files. AFT solves this.

    As far as hard links, it’s not an option for Amarok to start placing links everywhere on someone’s filesystem. That’s also assuming that the filesystems people are using support hard links, which in many cases is not true.

    P.S. Why would I misinterpret your “tone of voice”?

  20. ….so much for trying to be constructive.

    If you’re not open to criticism then why blog?

  21. I’m perfectly open to constructive criticism. Yours wasn’t very constructive, because you didn’t bother to pay attention to the content of the post before dashing off a reply.

  22. As I’ve understood, KDE4 brings many shiny frameworks for using them instead of inventing wheels, but now you’re saying that they are not widely used and thus cannot be relied upon. Umm..

    Why can’t Strigi be used for scanning Amarok’s collection (even though user ain’t using it for other stuff)? If there’s no right backend/engine or smth, maybe it’s worth creating/porting one?
    The collection would then be accessible via Nepomuk and AFP-stuff with everything else could come from there? The information would be accessible from the rest of the KDE and some other app using KDE libs etc. Ain’t that the true goodness of all the great libs?

  23. Full ACK. An external application seems not transparent to me, too. Who knows, how many won’t even notice its existence.

    BTW, for me personally the ATF in Amarok 1.4.x is useless because it won’t refresh static playlists after renaming files (which I do quite often).

    However, this feature is needed and I hope we won’t discourage you. I’m sure you will find an elegant solution.

    Thumbs up!
    plangin

  24. Amarok makes use of many of KDE’s core frameworks — Solid, Phonon, and Plasma, plus Oxygen icons.

    So it’s not exactly reinventing the wheel that Amarok is using the same collection scanner it was already using in 1.4. It hasn’t really changed. It still uses TagLib, which is a KDE-style framework if not actually a KDE framework, and which Strigi doesn’t use — so perhaps it’s Strigi that reinvented the wheel when it comes to MPEG/FLAC/Vorbis file parsing.

    So, why isn’t Strigi used? Well, one of the main reasons you said yourself: “even though the user ain’t using it for other stuff?” There is no requirement by Amarok for Strigi, and making it a requirement, along with necessary setup, is needlessly complex and intrusive.

    Not only that, Amarok’s collection scanner is *much* faster. Strigi needs to read every byte of every file. Our scanner often reads only a tiny part of the file. Considering that many people store their files on NFS or SMB/CIFS shares, this is a very big concern.

    It’s also a reason why Amarok’s file tracking capabilities don’t take hashes of the whole file, or of all of the music data.

    Re: “Nepomuk and AFP-stuff with everything else”…what does Apple Filing Protocol have to do with this? That’s Mac-only.

  25. “Full ACK. An external application seems not transparent to me, too. Who knows, how many won’t even notice its existence.”

    It can be integrated into Amarok eventually. The point of making it an external application is two-fold.

    One, it will be easy to integrate into Amarok in some way with a script or the GUI itself.

    Two, people can run it when Amarok isn’t open. This is a big plus, because doing the tagging can be a very time-consuming process.

    “BTW, for me personally…”

    From the article…

    “…AFT is enabled…with support for stored playlists coming eventually”

    With A2 playlists will be stored in the database, and exported to whatever format you desire when you want. As a result, it will be easy to add AFT support to stored playlists, and it will be coming. With A1, the playlists were stored as M3U files on disk, and as a result there was no way for there to be UIDs stored in them.

  26. OK, yeah, instead of “reinventing wheel” I meant more like having one function in multiple places.

    But do you agree that using Strigi would be more beneficial as the information would be widely accessible and more usable? If so, maybe Strigi could be used in some other way than default (either within Amarok or just with music-files) so that it would be quicker and more suitable for Amarok’s purposes?

    And I obviously just mixed T with P, so I meant AFT instead of AFP :P

  27. “But do you agree that using Strigi would be more beneficial as the information would be widely accessible and more usable?”

    No, I don’t. Strigi’s scanning is very general and very slow. Amarok’s scanner is very fast and highly specialized. There’s nothing wrong with having both.

    “If so, maybe Strigi could be used in some other way than default (either within Amarok or just with music-files) so that it would be quicker and more suitable for Amarok’s purposes?”

    But that would defeat the point of Strigi…one of the guarantees it makes is SHA hashes, and that is the real killer as far as speed. I don’t really understand the problem you have with both being used. Amarok has support/is getting support for a Nepomuk-based collection. This would query Nepomuk for files scanned by Strigi and present them as a separate collection from the one that Amarok’s own collection scanner will build. People that have Strigi indexing all their music could take advantage of this collection.

    I don’t understand the issue here. What’s wrong with having both methods available and letting the user choose what they want?

  28. The problem is that (as I understood), Nepomuk-based collections will lack all the AFT-goodness. Did I misunderstand it?

  29. Nepomuk-based collections will need a different method for tracking, but having some tracking put in is not impossible. Especially since as I understand it the metadata like statistics will be stored in the Nepomuk database, not Amarok’s internal one.

  30. Exciting stuff! It sounds like you’ve come up with a very elegant and effective solution to the problem of tracking audio files in Amarok. Yet one more reason I’m looking forward to Amarok 2.

    BTW, I’ve read all the comments up to this point, and I definitely agree with you on all your arguments (why not use Strigi/Nepomuk, why make afttagger a standalone utility, etc.)

  31. In the case of a shared music library that different clients can access. What would happen if each tries to write its own UID into the file?
    Can there be multiple UIDs per file? If not, will that mess up Amarok’s statistics?

  32. Unless you specify that the UID should be recreated, the program will simply not write a new ID to the file. The program is designed such that there should always be one (and only one) Amarok ID (although if the format changes for some reason, a file could I suppose end up with two different versions, although different Amarok versions would only use one or the other).

    Now, if you *did* specify that it should be recreated for some reason, that will not mess up statistics so long as a scan is performed on the file in the same location with the new ID. This will create a new (location, ID) mapping and all will be well.

  33. Not to be snide, but haven’t you ever heard of exclusive file locking? Why copy the entire file to a new location on disk and then move it over the old one, when you can just lock the file and make your change? File systems FTW.

  34. Not to be snide, but haven’t you ever heard of the non-portability of file locking, especially on network-based file systems? Portability FTW.

