URLs as GUIDs
Dave Winer has an essay entitled
"Guids are not just for geeks anymore" which
describes a technique used in his blogging software,
Radio Userland, that caused me some consternation
while developing RSS Bandit.
The primary feature that made me go ahead and
decide to write RSS Bandit was the ability to track
read/unread items in an RSS feed in a persistent
manner. To track such items I had to find a unique
way of identifying an item once downloaded so I
could compare its key with my application's list of
read messages. From perusing various RSS feeds
available online it seemed there were only three
elements that showed up with any regularity in an
RSS feed as children of the item element:
description, title and link. The title of the item was
definitely not guaranteed to be unique although
either the description or link probably would be. I
was then torn between using an MD5 hash of the
description of an item and using the story link as
the unique identifier.
There are pros and cons to both approaches.
However, in many cases hashing the description
seemed to be the better idea, especially since links
are used inconsistently by various RSS feed
providers in a manner that may fail to guarantee
uniqueness. In general, the link in an RSS item is a
link to the story or blog entry about some topic,
but in some cases, such as the feed provided by Eclectic,
it is a link to the item being talked about in the
story or blog entry. In the latter case the link
may not be unique.
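The trade-off between the two candidate keys can be made concrete with a small sketch. It is written in Python rather than the C# that RSS Bandit uses, and the function names, the item dictionaries and the example.com URLs are all hypothetical; this is not RSS Bandit's actual code. It only shows what happens when two items link to the same external page.

```python
import hashlib


def description_key(description: str) -> str:
    """Key an item by the MD5 hex digest of its description text."""
    return hashlib.md5(description.encode("utf-8")).hexdigest()


def link_key(link: str) -> str:
    """Key an item by its story link, used verbatim apart from whitespace."""
    return link.strip()


# Two hypothetical items whose links point at the thing being discussed,
# not at the blog entries themselves, so both carry the same link.
items = [
    {"link": "http://example.com/story",
     "description": "First take on the story."},
    {"link": "http://example.com/story",
     "description": "A follow-up on the same story."},
]

link_keys = {link_key(item["link"]) for item in items}
description_keys = {description_key(item["description"]) for item in items}

# The link collapses both items into one key; the hash keeps them apart.
print(len(link_keys), len(description_keys))  # prints "1 2"
```

With keys like these, an item counts as read when its key is already in the stored set of read keys.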
A further problem was how to detect when a news
item or blog entry had been updated. In this case,
I wanted RSS Bandit to be able to reflag such
messages as unread. This is where I abandoned
hashing the description, since it turned out that
some popular weblogging software, especially Radio
Userland, usually changes the URL when a news
item or blog entry is updated (the URL is
partially constructed from the time of the post)
while failing to update the description in any way.
The only question I have about this is what
happens to people who link to old versions of
the entry before an update?
#// Considered Dangerous
Andy and I were talking recently and he complained
about the XPath abbreviation // which is short for
/descendant-or-self::node()/. From the XPath
recommendation: "For example, //para is
short for /descendant-or-self::node()/child::para
and so will select any para element in the
document (even a para element that is a document
element will be selected by //para since the
document element node is a child of the root
node)".
To Andy such queries are bad and he describes
them as fragile, in that changes in a
document's structure can cause significant issues
with queries such as the above. One example we
constructed of a potential negative consequence of
using // queries is using the query //title
to get all the titles of books
ordered by customers from an XML document
containing customer order information instead of
something like /customers/customer/order/book/title.
The problem Andy had with this query is that if the
XML document is extended in a future version to
include movie info then the former query will
accidentally grab all those titles while the latter
would not.
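Andy's scenario can be reproduced in a few lines. The sketch below uses Python's xml.etree.ElementTree, which supports only a subset of XPath and evaluates paths relative to the root element, so `.//title` and `./customer/order/book/title` stand in for `//title` and `/customers/customer/order/book/title`. The sample documents and the titles in them are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical version 1 of the order document: books only.
ORDERS_V1 = """
<customers>
  <customer>
    <order>
      <book><title>Book A</title></book>
    </order>
  </customer>
</customers>
"""

# Hypothetical version 2: the format is extended with movie info.
ORDERS_V2 = """
<customers>
  <customer>
    <order>
      <book><title>Book A</title></book>
      <movie><title>Movie B</title></movie>
    </order>
  </customer>
</customers>
"""

ANYWHERE = ".//title"                      # stands in for //title
EXPLICIT = "./customer/order/book/title"   # stands in for the full path


def titles(document: str, path: str) -> list:
    """Return the text of every element matched by path, in document order."""
    root = ET.fromstring(document)
    return [element.text for element in root.findall(path)]


# Both queries agree on the original document...
print(titles(ORDERS_V1, ANYWHERE), titles(ORDERS_V1, EXPLICIT))
# ...but only the descendant query picks up the movie title after the change.
print(titles(ORDERS_V2, ANYWHERE), titles(ORDERS_V2, EXPLICIT))
```

On the extended document the descendant query returns both the book and the movie title, while the explicit path still returns only the book title.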
Andy has mentioned that he is considering writing
an article about the evils of //. However, I
disagree that all uses of // are
bad and like referring to examples that involve
multiple XML documents that contain islands of
structure that the user is interested in. For
example, I use the query //rss:item to
get all the RSS items from an RSS feed regardless
of whether it is RSS 0.91, 1.0 or 2.0. This is
because I know the structure of an RSS item is the
same for all three versions although the structure
of the XML document containing the item varies from
version to version. The code that does this is
shown below:
    // requires: using System.Xml;
    string rssNamespaceUri = "";
    if(feed.DocumentElement.LocalName.Equals("RDF") &&
       feed.DocumentElement.NamespaceURI.Equals("http://www.w3.org/1999/02/22-rdf-syntax-ns#")){
        //RSS 1.0
        rssNamespaceUri = "http://purl.org/rss/1.0/";
    }else if(feed.DocumentElement.LocalName.Equals("rss")){
        //RSS 0.91 & RSS 2.0
        rssNamespaceUri = feed.DocumentElement.NamespaceURI;
    }
    //convert RSS items in feed to RssItem objects and add to list
    XmlNamespaceManager nsMgr = new XmlNamespaceManager(feed.NameTable);
    nsMgr.AddNamespace("rss", rssNamespaceUri);
    foreach(XmlNode node in feed.SelectNodes("//rss:item", nsMgr)){
        RssItem item = MakeRssItem((XmlElement)node);
        //... (rest of the loop body is not shown in this excerpt)
    }
In examples like the above I think the usage of
// is acceptable.
#RSS Bandit: A Bad Netizen
I recently discovered that RSS Bandit attempted to
download feeds every five minutes regardless of the
user-specified delay for how often to attempt such
downloads. Although it was correctly sending the
"If-None-Match" and "If-Modified-Since" HTTP
headers, it was still clogging up logfiles with an
average of twelve requests an hour. It turned out
that the problem lay in a difference in assumptions
between myself and the designers of the
System.Net.HttpWebRequest class. Check out the
code below:
    // requires: using System.Net;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(current.link);
    request.Timeout = 1 * 60 * 1000; //one minute timeout
    request.UserAgent = this.UserAgent;
    request.Proxy = this.Proxy;
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    if(response.StatusCode == HttpStatusCode.OK){
        //... (rest of the excerpt is not shown)
    }
On the surface I thought this seemed like
fairly braindead code and couldn't figure out where
I went wrong. It turned out that the problem was
with the request.GetResponse() line [and the fact
that I wasn't logging net exceptions, which would
have helped me catch this more easily]. The
designers of the class felt
that any response that wasn't
successful according to HTTP 1.1 or a certain
class of redirection was an exception. Considering
that an exception is a fatal error, I didn't believe
a message from the server indicating that I already
have the cached message counts as an error, let
alone a fatal one. They disagreed. :)
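The same surprise exists outside .NET, so the handling can be sketched with Python's urllib, whose urlopen also raises an exception (urllib.error.HTTPError) for a 304 Not Modified response. This is an analogous sketch, not the fix that went into RSS Bandit: the function name fetch_if_changed and its opener parameter are hypothetical, and the opener exists only so the logic can be exercised without a network.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, etag=None, last_modified=None,
                     opener=urllib.request.urlopen):
    """Conditionally GET a feed.

    Returns the response body as bytes, or None when the server answers
    304 Not Modified. Any other HTTP error is re-raised to the caller.
    """
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    request = urllib.request.Request(url, headers=headers)
    try:
        # One minute timeout, as in the excerpt above.
        with opener(request, timeout=60) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # The cached copy is still current: a normal outcome,
            # not a failure that should trigger an early retry.
            return None
        raise
```

A caller would treat a None result as "nothing new" and wait for the user-specified delay before polling again.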
I've since fixed my code although from my referrer
logs it does look like there may be one or two
people among my "early adopters" who are still
using the abusive bits. My apologies go out to all
those who are getting too many hits from the
bandit.
#Future Jerry Springer Guests
Last night I was dancing with this girl (Girl A)
whose friend (Girl B) was dancing right beside us
and groping the heck out of the guy she was dancing
with. Shortly afterwards, she came over and tongue
kissed Girl A. This was confusing so I asked
Girl A.
So it turns out that Girl B is Girl A's best friend
and the guy is Girl A's ex. They all live together
and have some sort of three way sexual
relationship. That isn't the kicker.
The kicker is that Girl A is engaged and is moving
to the East coast to get married in a month or so
but would like Girl B to move with her so they
don't have to end their "friendship".
That's deep.
#Your Daily Show Moment of Zen
Excerpted list of Winners of Open Source Product Excellence Awards Announced At Linuxworld:
Best System Integration Software
Microsoft - Services for Unix 3.0
ftp://ftp.microsoft.com/developr/Interix/interix22/GPL.TXT
Get yourself a News Aggregator and subscribe to my RSS feed
Disclaimer:
The above comments do not
represent the thoughts, intentions, plans or
strategies of my employer. They are solely my
opinion.