Thinking about URLs and Overthinking about URLs

Rethinking URLs and Trailing Slashes

There is a level of geekiness beyond which few will tread. This likely crosses that frontier. I’m prompted to write this as a result of a discussion on WebmasterWorld about which URLs should and should not have a trailing slash. One person threw out the idea that URLs for files should not have slashes and URLs for directories should. While that seems sensible, that terminology no longer makes sense when talking about the architecture of a dynamic website. In that case, we would be better off speaking of vertices or nodes and edges or, if the site is strongly hierarchical, thinking in terms of branch nodes and leaf nodes. In the latter case, we might (or might not) consider branch nodes to be “directories” or “folders” and leaf nodes to be “files”, even though the URLs no longer bear any relation to some underlying structure on a hard disk.

So all of this got me thinking about the characteristics of a modern URL and what it means for thinking about site structure and the implications that has for building listing pages that are good for the user first and foremost and good for search engine optimization as a consequence. Of course, in my usual elliptical way, it will take me 2,000 words to cycle back around to that point.

Before I go into all that, to stave off pedantic comments about URLs versus URIs, a URL is simply “a type of URI that identifies a resource via a representation of its primary access mechanism (e.g., its network ‘location’), rather than by some other attributes it may have” according to the W3C report on the subject (see also Dan Meissler’s summary). Translation: what you see in your browser address bar is always a URL and all URLs are URIs (actually Chrome breaks this by omitting the protocol identifier, but close enough). A URL was once considered a specific subset of URIs, but “URL” can now be considered a colloquial but useful term (as per the W3C report) rather than a technical one.

URLs Are Abstract

In the old days, it was usual for URLs to reveal something about server architecture, and you still see this on some sites (especially ASP sites on Windows servers). Examples are URLs like:

  1. http://example.com/page.html
  2. http://example.com/page.php
  3. http://example.com/category/
  4. http://example.com/index.php?p=123

In the first two examples, the file extension suggests that the URL is pointing at a file, and that #1 is a simple HTML file that will be fed directly to the client (i.e. browser) as it exists on the server, while #2 is a PHP file, which needs to be evaluated by the PHP processor on the server, which will then feed the actual data to the client. In #3, especially since it is on the same domain as the others, we assume because of the trailing slash that it points to a directory, that is a collection of files on the server hard disk. In the final example, the URL suggests that it points to a file and that file is getting a parameter passed to it and the variable “p” will have the value “123”.
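The guesses described above can be written down as a small sketch. It reads only the URL string, so everything it returns is an assumption about the server, not a fact. The function name `naive_guess` and its return keys are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs
import posixpath


def naive_guess(url):
    """Guess what a URL 'looks like' from its string alone.

    These are assumptions about the server, nothing more.
    """
    parts = urlsplit(url)
    path = parts.path
    # Extension of the last path segment, e.g. ".html", ".php", or "".
    extension = posixpath.splitext(path)[1]
    return {
        "looks_like_directory": path.endswith("/"),
        "extension": extension,
        "query": parse_qs(parts.query),  # e.g. {"p": ["123"]}
    }


for url in [
    "http://example.com/page.html",
    "http://example.com/page.php",
    "http://example.com/category/",
    "http://example.com/index.php?p=123",
]:
    print(url, naive_guess(url))
```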

Now of course, this could all be wrong. We’re making an assumption that the URL indicates something about the server. However, early on Tim Berners-Lee proposed the Axiom of Opacity of URIs which states simply:

The only thing you can use an identifier for is to refer to an object. When you are not dereferencing, you should not look at the contents of the URI string to gain other information.

source: Tim Berners-Lee, Universal [sic] Resource Identifiers — Axioms of Web Architecture, 1996.

Dereferencing is just a tech geek way of saying retrieving the resource which is a tech geek way of saying displaying the page or image.

So in other words, the URL should only tell you and only does tell you how to locate the thing you’re looking for. It should not and ultimately does not expose the underlying system that is finding that thing. This axiom was usually ignored in the early days of the web because it was so easy to make a URL point to a file or a directory on the hard drive of the server. As a result, URLs betrayed a lot about the server architecture, but this was a consequence of laziness, not an inherent and certainly not a desirable characteristic of a URL.

In my own case, like all beginners at the time, I started with simple URLs that mapped directly to specific files using the actual file names, including file extensions. One of the reasons I started building my own content management system (CMS), though, was that the lack of abstraction bothered me. I hadn’t read any of this stuff by Tim Berners-Lee, but I just felt that storage location and addressing should not be so tightly tied to each other.

As a former programmer, I believed in having an abstraction layer between the user interface and the underlying technology. However, at the time I built my first pages, I was writing my dissertation in history. As a scholar, having URLs point to the host, directory, subdirectory and file location was like having call numbers point to the building, floor, shelf and shelf position of a book, rather than some abstract naming scheme like the Library of Congress classification. We do not expect the call number of a book to reveal anything about the architecture of the storage facility and I could see no reason why the URL of the “resource” would betray anything about the architecture of the server. Rather, the LC call number is based on the information architecture of the LC system without reference to the physical architecture of the building holding the books. It seemed only natural to me that my site URLs should be based on the information architecture of the site, without reference to the server setup.

This observation set me on the path of creating my own content management systems where the page content was split among as many database tables as necessary and the URL was simply an entry in yet another table. The URL table had a column with the URL and that was keyed to some number or whatever that told the program how to put together the page. At this point, the URL was a pure abstraction.
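A minimal sketch of that idea, using an in-memory SQLite database: the URL is just a value in one table, keyed to a page in another. The table names (`page`, `url_lookup`), the columns and the sample rows are all hypothetical; they are not the schema of my old CMS or of any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE url_lookup (
        path TEXT PRIMARY KEY,
        page_id INTEGER REFERENCES page(id)
    );
    INSERT INTO page VALUES (1, 'Lupine', 'Notes on lupines.');
    -- Two unrelated-looking URLs can point at the same page.
    INSERT INTO url_lookup VALUES ('/flowers/lupine', 1);
    INSERT INTO url_lookup VALUES ('/any/string/at/all', 1);
""")


def resolve(path):
    """Look the path up as plain data; return (title, body) or None."""
    return conn.execute(
        "SELECT page.title, page.body FROM url_lookup "
        "JOIN page ON page.id = url_lookup.page_id "
        "WHERE url_lookup.path = ?",
        (path,),
    ).fetchone()
```

Nothing in `resolve` looks inside the path for slashes or extensions. The string either matches a row or it does not.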

Eventually, I began looking for an open-source alternative to my custom CMS, which as a spare-time project of one guy had substantial limitations. One of the things that initially attracted me to Drupal was that it offered complete URL abstraction, using a lookup table as I was doing in my custom CMS. Now, ten years later, Drupal remains my CMS of choice, even though WordPress and many others now have similarly convenient URL abstraction (yes, this blog is on WordPress — I like it for blogging or very simple brochure sites; I like Drupal for almost everything else).

In 1997, to get good URL abstraction, I needed to build a system that had URL abstraction as a basic component and for that I had to know a programming language and be able to interface with a database. Most people were still hand-coding HTML at the time. Only a small number of people, either through reading or just thinking about it, knew about the Axiom of URI Opacity or, as I conceived it, URL Abstraction. Those who did often lacked the technical means to achieve it. So we came to think of URLs as being somehow related to server technology. And we came to think of branch and leaf nodes in terms of directories and files, and we differentiated directories from files by the presence or absence of trailing slashes and file extensions.

Flash forward to 2012. Beginners are now more likely to install WordPress than to learn to hand code HTML. And WordPress and most other CMS now have very convenient URL Abstraction built in. So I now see abstract URLs, that is URLs that are simply a lookup column in a database, as the norm. The old-school URLs that tell you something about the machine architecture are a dying breed. And good riddance.

By implication, the idea that a URL should differentiate a directory or listing from a file or page becomes problematic, and that’s where things get controversial.

URLs As API

If URLs are abstract and are just a lookup column in a table, what does that mean? Among other things, it means that they are no longer a server hardware interface; they are a sort of Application Programming Interface. I can choose to have them map to the server architecture, but I can choose not to as well. They are a means not simply of looking up a resource, but of interacting with the underlying program through an abstracted interface. I say the interface is abstract because I don’t know what a given URL does once captured by the program (because URI Opacity is a fundamental principle), only that it does stuff. I can grab any part of the URL and create very different results, routing data through one template or another based on the second or third or fourth term of the URL. There is ultimately nothing except the hassle to prevent me from parsing a URL as ./filename/subdirectory/directory for example.

In effect, when I build a site in a system like Drupal with abstract URLs, I can make any part of the URL fire any sort of action, in effect exposing an API to users and designers. Yes, the URL still tells the server how to “dereference a resource” but the underlying program can treat any and all parts of the URL as a parameter and in arbitrary ways (i.e. any part can have any meaning/implication). This is a natural consequence of URL Opacity and Abstraction.
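As a rough illustration of treating path segments as parameters, here is a sketch of a router that reads a URL in the reversed order mentioned above (item first, category last). The function name `route`, the three-segment rule and the returned keys are assumptions made for this example; this is not how Drupal routes requests.

```python
def route(path):
    """Treat each path segment as a parameter.

    The segment order is purely this program's own convention.
    """
    segments = [s for s in path.strip("/").split("/") if s]
    if len(segments) != 3:
        return {"action": "not_found"}
    # Reversed convention: item, then subcategory, then category.
    item, subcategory, category = segments
    return {
        "action": "show",
        "category": category,
        "subcategory": subcategory,
        "item": item,
    }


print(route("/lupine/wildflowers/plants"))
```

The same function could just as easily read the segments left to right. The URL itself does not care, which is the point.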

URLs are Content

This may stretch the definition of content a bit, but think about what an abstract URL becomes. It is a column in a lookup table, which means that from an information storage point of view, it is identical to the meta title, the H1 tag content, the navigation menu items and the body of the page, the latter of which may be assembled from many different tables.

We will commonly use the URL as the key to figure out which item to look up in the other tables, but I may decide to list all pages of a certain taxonomic category, in which case the category is the lookup criterion and the URL is simply content like any other content.

Is it really content though? Of course that’s stretching things, but to some degree it is in the sense that it can contain (though doesn’t have to) actual information that tells the user, search engines and site editors what the page is about. If it is stored like any other data and it conveys information about the page like any other data, is it not page-specific content?

It might appear that passing information about the page through the URL violates the axiom of URI opacity. However, URI opacity refers to how the user agent is to treat the URI, not whether or not it can convey information to a user. In the case of a web page served via HTTP to a browser, it means that the browser is not to make assumptions about how to handle the page based on any component of the URL (file extension, presence/absence of a query string). It is, rather, the job of the HTTP headers to pass this information and the browser to do what it’s told. So URI opacity means opaque to the user agent, not necessarily to the user.
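A simplified sketch of that division of labor: the handler below decides what to do from the Content-Type header alone and never inspects the URL. The function names, the three outcomes and the fallback to `application/octet-stream` when the header is missing are assumptions of this sketch. Real browsers are more elaborate.

```python
def media_type(headers):
    """Return the media type from Content-Type, dropping parameters."""
    value = headers.get("Content-Type", "application/octet-stream")
    return value.split(";", 1)[0].strip().lower()


def choose_handler(url, headers):
    # The url argument is deliberately unused: opacity means the
    # user agent does not read meaning into the string.
    kind = media_type(headers)
    if kind == "text/html":
        return "render as HTML"
    if kind.startswith("image/"):
        return "display as image"
    return "offer as download"


# A URL ending in .jpg is still rendered as HTML if the headers say so.
print(choose_handler(
    "http://example.com/photo.jpg",
    {"Content-Type": "text/html; charset=utf-8"},
))
```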

Directories Are Abstract

Since URLs are abstract, opaque and serve as API and content, what then is the distinction between “directory” and “file”? In the discussion that prompted this, someone said that directories are lists of pages. That definition works on a classic server setup where you have a URL that points to a directory and you allow the user to have the “index view” of that directory. In that case you get a list of files in that directory or the default index file (like index.html). We would then typically want (and depending on server setup even need) a trailing slash to indicate what we’re looking at. But such a dependency is a gross violation of the principle of opacity.

If we switch to a dynamic site built on Drupal (and to a much lesser extent WordPress), we quickly lose all sense of listing pages as directories. To take a very simple example, my Yosemite flower identifier page is what would be a classic “listing” page. And yet, what do we find there?

  • Rich Structured Data. What’s here is being pulled from several different database tables to put together a list, yes, but not really a mere list of pages and certainly not a list of files. It has a photo, the common name (which is the only part linked to the page for that specific flower), the Latin name (genus and species) and the Family name. This is not merely a listing or link to the page, but a subset of data on the page. This particular page does not have any introductory content, but it could easily prepend a 30,000 word discourse on the native plants of Yosemite to the listing. In which case, is it primarily a list or primarily a page? And what possible meaning do those terms then have? In any case, the terms “directory” and “file” have no meaning at all here.
  • Arbitrary Data Collections: Infinite Directories for a Given Set of Pages. The structured data presented here is arbitrary. For one, depending on how the filters are applied, the content here can have innumerable variations. It would be hard, though not impossible, to map that content to a static set of files. But imagine this were a search page where the user can enter any combination of letters. In that case it becomes almost impossible to represent it as a collection of static pages. I could list these same pages and have the structured data show as paragraphs rather than columns, or I could have columns with color and number of petals rather than taxonomy. I can have a virtually infinite number of presentations for this same listing of pages. Because we have abstracted URLs, collections of pages are arbitrary and abstract themselves, and they blur the line between listing and page. The distinction between branch nodes and leaf nodes still makes sense, but both become “pages” (though in Drupal parlance only leaf nodes are “nodes”).
  • The URL is an API. The user has access to various select boxes to create a custom listing page. This is simply offering the user a convenient means to hack the URL, which is to say, the API for this list. Again, thinking in terms of a URL that points to a directory loses all meaning.
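The last point can be sketched in a few lines: select boxes are just a friendlier way of writing a query string, and the program reads that query string back as parameters. The base URL and the filter names (`color`, `petals`) are hypothetical and are not the real parameters of my flower page.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://example.com/flowers"  # hypothetical listing URL


def listing_url(**filters):
    """Build the listing URL a set of select-box choices would produce."""
    # Skip empty choices; sort keys so the same choices give the same URL.
    chosen = {k: v for k, v in sorted(filters.items()) if v}
    return BASE + ("?" + urlencode(chosen) if chosen else "")


def filters_from_url(url):
    """Read the filters back out of the URL, as the program would."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


url = listing_url(petals="5", color="blue")
print(url)
print(filters_from_url(url))
```

Anyone who edits the query string by hand gets the same result as someone using the form, which is what makes the URL an API rather than a pointer to a directory.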

Who Cares?

So who cares whether or not URLs are abstract and whether or not a trailing slash is added to a URL? More to the point, what does this have to do with anything practical? I would say there are substantial implications here for the design of pages, specifically branch node pages. The most important of them is this: if the distinction between directories and files is meaningless, if the distinction between listings and pages has become blurred, and if we see all page-specific parts of the page as “content”, this suggests something to us about site architecture.

The main thing it suggests is that we need to look for the “value add” that a listing has. If it is a mere listing, does it help the user or merely add one click layer to her quest to find relevant information? The value add might be a nice introduction and guide to the category, with most popular and best pages highlighted. In the case of my flower finder page, the value add over a straight list is that people can filter by characteristics they know in order to get a small set of photos from which they can, perhaps, visually identify the flower.

All of this has me thinking in terms of paying more attention to the content of my branch nodes, as it were. It has me thinking that laziness and sloth lead me to create simple listing pages, but these have little value to the user and they also make it harder for the search engines to differentiate one collection of pages from another. So ideally, every branch node becomes a significant content page, a guide to both user and search engine and, I would say, to site editor to sharpen her sense of the information architecture of the site. As you can see, sadly, the organizational principle of Raised By Turtles is basically to put things in the blender and then pour… but that’s work for another day.
