Generic PDF File Names Considered Harmful

As we continue to expand our bibliographic datastore, we have noticed an appalling lack of thought going into the names of PDF files offered for download over the internet.

While individual authors may only be working on one book or paper at a time, in aggregate there are countless such projects being authored globally.

So please, for the love of your fellow academics who collect digital artifacts, do not name your book book.pdf, your dissertation dissertation.pdf, your thesis thesis.pdf, your paper paper.pdf, or your program’s manual manual.pdf if you plan to post it on the internet.

Choose a file name that incorporates several semantic elements like your last name, the date, the file’s version number, topic, key title excerpt, or (for a more opaque solution) a cryptographic hash of the file contents.
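
As a rough illustration of the advice above, here is a minimal Python sketch that assembles a file name from several of those elements. The function names, the underscore separator, the four-word title excerpt, the eight-character hash prefix, and the sample author "Nguyen" are all assumptions made for this example, not a recommended standard.

```python
import hashlib
import re
from datetime import date


def slugify(text: str, max_words: int = 4) -> str:
    """Lowercase the text and join its first few alphanumeric words with hyphens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words[:max_words])


def descriptive_pdf_name(last_name: str, published: date, title: str,
                         version: str, contents: bytes) -> str:
    """Combine several semantic elements and a short content hash into one name."""
    # An 8-character SHA-256 prefix is an illustrative choice, not a collision guarantee.
    digest = hashlib.sha256(contents).hexdigest()[:8]
    parts = [slugify(last_name), published.isoformat(), slugify(title),
             f"v{version}", digest]
    return "_".join(parts) + ".pdf"


# Hypothetical author, date, version, and file contents, for illustration only.
name = descriptive_pdf_name("Nguyen", date(2018, 4, 21),
                            "Generic PDF File Names Considered Harmful",
                            "1.2", b"%PDF-1.7 example bytes")
print(name)  # begins with nguyen_2018-04-21_generic-pdf-file-names_v1.2_
```

Prefixing the version with "v" keeps it from being mistaken for a copy counter added by a browser.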

Modern operating systems have no difficulty with longer file names and a sensible name will be deeply appreciated by your readers.

Likewise, if you have written a lot of papers, names like paper-17.pdf are just as problematic, since web browsers and operating systems often resolve conflicting file names automatically with just such a numbering scheme or, even worse, with a scheme that uses the word “copy” to signify “file name copy” as opposed to “file contents copy”.

Thus, your reader won’t know whether something like paper-17.pdf in their download folder is your 17th paper, or a paper that was written in 2017, or their 17th copy of paper.pdf, or a renamed copy of the 17th unique file originally named paper.pdf, or their 15th copy of a file originally named paper.pdf that had previously been automatically renamed to paper-2.pdf because someone else’s paper.pdf had already been downloaded to the same location. (This is probably rather confusing because, well, frankly it is! Which is our point.)

Even worse, browser-level renaming can combine with OS-level renaming to produce horrors like paper-3 copy 2.pdf in the same directory as paper.3 copy.pdf, with two dimensions of ambiguity. Likewise, book.1.pdf and book-1.pdf might be identical files downloaded in different browsers to a directory that already contained a different book.pdf!

Similarly, programmers are often guilty of naming their manuals manual.pdf and then using a numerical extension to designate a version, leading to ambiguous names like manual.2.pdf, which might be a version 2.0 manual or a second copy of a version 1.0 manual generated by a web browser after a redundant download. Is manual.2.1.pdf a copy of a version 2 manual or an original version 2.1 manual?

Why should we be forced to open a file to read its internal title when an unambiguous program_name-manual(version_number).pdf naming convention would eliminate any doubt?

Furthermore, when devising a naming scheme, note that lots of books and papers are written in any given year, for any given conference, or on any given high level topic — so names like 2016-book.pdf, ai-book.pdf, and chi-2018-paper.pdf are almost guaranteed to come into conflict with other downloads.

When a generic file name invites its renaming to something like paper-3.pdf, it is far more serious than just an annoyance to the reader trying to remember what the paper is about.

Generic file names create a clear and present danger that your book or paper will look like a copy of something else — leading to its being accidentally deleted and lost forever!

The Invisible Library

The University Library Catalog is perhaps the most underutilized and underdeveloped resource at our disposal. While we can readily search for catalog entries based on their constituent fields and even browse some collections in “shelf order” with images of dust jackets, the accessible catalog is but the tip of a potentially invaluable sea of metadata and associations.

Moreover, the titles in the formal catalog of the library proper do not always include non-circulating and often uncatalogued departmental holdings, the private collections of individual students and faculty, or the transient titles accessed online or through interlibrary loan, all of which are part of the true “working collection”. To begin to automatically assess the scope of this Invisible Library, one could scan the bibliographies of student and faculty publications and compare them with the traditional catalog to find cited work not in the permanent collection.
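
The comparison described above amounts to a set difference between cited titles and catalogued titles. The Python sketch below assumes both are already available as plain lists of title strings; the function names, the crude title normalization, and the sample lists are hypothetical, and a real system would more likely match on identifiers such as ISBNs or DOIs.

```python
import re


def title_key(title: str) -> str:
    """Normalize a title so case and punctuation differences do not block a match."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def invisible_library(cited_titles, catalog_titles):
    """Return cited titles whose normalized form is absent from the catalog."""
    held = {title_key(t) for t in catalog_titles}
    missing = {}
    for title in cited_titles:
        key = title_key(title)
        if key not in held:
            missing.setdefault(key, title)  # keep the first spelling seen
    return sorted(missing.values())


# Sample data invented for illustration; not drawn from any real catalog.
catalog = ["The Art of Computer Programming",
           "Structure and Interpretation of Computer Programs"]
cited = ["The art of computer programming.", "As We May Think",
         "As we may think", "Design Patterns"]
print(invisible_library(cited, catalog))  # ['As We May Think', 'Design Patterns']
```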

If we could further enrich our analysis to capture the frequency, nature, and importance of use, we could begin to isolate key titles for future acquisition, as well as identify low-value unused and underused portions of the collection whose retention serves no active function other than contributing to aggregate collection size statistics.
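
Frequency is the easiest of these signals to sketch. The Python example below assumes one title identifier per citation and a set of identifiers for the permanent collection; the function name, the threshold of two citations, and the single-letter identifiers are placeholders, and the nature and importance of use are not modeled at all.

```python
from collections import Counter


def collection_signals(citations, holdings, min_citations=2):
    """Split titles into acquisition candidates and unused holdings.

    citations: iterable of title identifiers, one entry per citation.
    holdings: set of title identifiers in the permanent collection.
    """
    counts = Counter(citations)
    # Frequently cited titles we do not hold, most cited first.
    candidates = sorted(
        (t for t, n in counts.items() if t not in holdings and n >= min_citations),
        key=lambda t: (-counts[t], t),
    )
    # Held titles that were never cited in the sample.
    unused = sorted(t for t in holdings if counts[t] == 0)
    return candidates, unused


# Placeholder identifiers for illustration only.
citations = ["A", "B", "A", "C", "B", "A", "D"]
holdings = {"C", "E", "F"}
print(collection_signals(citations, holdings))  # (['A', 'B'], ['E', 'F'])
```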

Working in the other direction, one could begin mapping out the subject matter expertise of borrowers with an eye to soliciting collection development guidance and facilitating expertise matching to proactively suggest co-authorship opportunities.

Likewise, there is no reason not to regard each title and associated subject entry as its own chat room and discussion forum, further enriching the catalog with links to locations, people, organizations, artifacts, experiments, questions, concerns, and all manner of related entities.

In short, we call for making the library catalog a true Knowledge Graph in the richest possible sense.