Christmas Tries (part 1)
Filtering of robot activity from webserver log files using curated lists of known robot User Agent (UA) strings is a well-established practice. Lists such as the one available from the Browser Capabilities Project consist of a series of regular expressions which, when compared to the UA string of an incoming request, can be used to establish the nature of the system making the web request.
The difficulty lies in the sheer size of the computational task involved. At the time of writing the Browser Capabilities Project list of UA-matching regular expressions runs to almost 180,000 patterns, while the RSC's Publishing platform receives somewhere in the region of 8 million separate web requests in any given 24-hour period. A quick bit of maths suggests that in order to check each line's provenance, we would need to perform 180,000 x 8,000,000 = 1.44 trillion regex comparisons per day.
This would of course come down a bit if we stopped the process for each line once it had been matched - on average a robot will be matched after completing half of the total number of comparisons, assuming all patterns are equally likely. However all requests from actual humans would end up being compared to all the patterns, finally coming out the other side unmatched and thus confirmed as a "real user". So if our traffic is made up of 50% robots, we would only cut 25% off the computational "bill" using this approach.
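The early-exit approach described above can be sketched as follows. This is a minimal illustration, not the real Browscap list: the patterns and UA strings here are hypothetical stand-ins, and `fnmatch.translate` is used to turn "*"-style wildcards into regular expressions.

```python
import re
from fnmatch import translate  # converts "*"-style wildcards to regex

# Hypothetical stand-ins for a curated robot-pattern list.
robot_patterns = [
    "*Googlebot*",
    "*bingbot*",
    "*Ask Jeeves*",
]

# Compile once; each wildcard pattern becomes an anchored regular expression.
compiled = [re.compile(translate(p), re.IGNORECASE) for p in robot_patterns]

def is_robot(ua: str) -> bool:
    """Return True at the first matching pattern (early exit).

    Note that a human UA is only confirmed as such after failing
    *every* comparison in the list - this is why early exit saves
    less than half the work when much of the traffic is human.
    """
    return any(rx.match(ua) for rx in compiled)
```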
Tries are a time- and memory-efficient approach to pattern matching that can help reduce the burden. A trie (from "retrieval"; pronounced "tree" or "try") is a form of search tree created from a list of terms.
The nodes (forks) of the tree are each given a "key" or unique ID. For example, you could build a trie of all the words in a dictionary up to the third letter of the word, and then assign keys to each word in the dictionary. If you then wanted to look up a new, unknown word in the dictionary, you would analyse its letter sequence against the trie, determine the relevant key value, and then compare the unknown word to only the dictionary entries that shared that key value. Checking the word "development" would thus involve comparisons to only the words starting with "dev", rather than all of the words in the dictionary.
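A minimal sketch of that three-letter idea: a nested-dict trie whose depth-three nodes hold a key value, plus a map from each key to the dictionary words sharing that prefix. The word list and the `depth=3` cut-off are just illustrative.

```python
from collections import defaultdict

def build_trie(words, depth=3):
    """Build a letter trie of the first `depth` characters of each word."""
    trie, buckets = {}, defaultdict(list)
    for word in words:
        key = word[:depth]          # the "key" is the shared prefix
        node = trie
        for ch in key:
            node = node.setdefault(ch, {})
        node["KEY"] = key           # mark the fork with its key value
        buckets[key].append(word)
    return trie, buckets

def candidates(trie, buckets, word, depth=3):
    """Walk the trie with the unknown word's letters to find its key,
    then return only the dictionary entries sharing that key."""
    node = trie
    for ch in word[:depth]:
        if ch not in node:
            return []               # no dictionary word shares this prefix
        node = node[ch]
    return buckets[node["KEY"]]

trie, buckets = build_trie(["develop", "device", "design", "cat"])
# Checking "development" now only involves the words starting with "dev".
```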
In practice, the costly process of building the trie of the search patterns takes time but only has to be done once - the resultant data structure can be saved once created, and then applied whenever it is needed.
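The build-once, reuse-many step might look like the following, assuming a Python implementation; the filename and the stand-in trie structure are hypothetical.

```python
import pickle

# Stand-in for the real (expensive-to-build) trie structure.
trie = {"d": {"e": {"v": {"KEY": "dev"}}}}

# Serialise the trie to disk so construction never has to be repeated.
with open("ua_trie.pkl", "wb") as f:
    pickle.dump(trie, f)

# Later runs simply load the pregenerated structure.
with open("ua_trie.pkl", "rb") as f:
    loaded = pickle.load(f)
```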
The only fly in the ointment is that tries can't handle wildcards, as wildcards make it impossible to assign a unique key, so we are forced to truncate each search pattern at its first wildcard character when creating the trie. Unfortunately, a large number of patterns begin with a wildcard and so truncate to empty strings, meaning they cannot be built into the trie at all.
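The truncation step is simple enough to show directly; the patterns below are illustrative. Note how a leading-wildcard pattern yields an empty string:

```python
def truncate_at_wildcard(pattern: str) -> str:
    """Keep only the literal text before the first wildcard character."""
    for i, ch in enumerate(pattern):
        if ch in "*?":
            return pattern[:i]
    return pattern

truncate_at_wildcard("Googlebot/2.1*")   # "Googlebot/2.1" - usable trie entry
truncate_at_wildcard("*Ask Jeeves*")     # "" - nothing left to add to the trie
```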
Thankfully the systematic nature of UA strings provides us with a strategy to mitigate this problem. Certain motifs appear repeatedly in UA strings and thus in the regex patterns used to match them. For example, UA strings often contain details of the operating system, browser name etc. Before constructing the trie, we can build a "prefix" for the pattern based on these common elements - essentially we move some of the unique text to the front of the pattern. For example, this pattern starts with an asterisk and is thus initially impossible to add to the trie:
*Mozilla/2.0 (compatible; Ask Jeeves*
However, we can see that it includes details of the web browser further into the pattern, so we can add a prefix:
MOZ020_*Mozilla/2.0 (compatible; Ask Jeeves*
Now when we truncate at the first wildcard, we still have something to add:
MOZ020_
We use this string to assign a key value to the original pattern when creating the trie. When we later come to do the lookup of an unknown UA string, we apply the same process of prefix generation to create a modified UA string, which is then used for the trie lookup to find the key and thus the relevant regex patterns. (It is of course necessary to handle cases where, for example, patterns make no mention of "Mozilla" but do contain a wildcard - these could still match a UA string containing a reference to "Mozilla" thanks to the wildcard.)
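The prefix-then-truncate step can be sketched as below. The exact prefix rule is an assumption on my part - here a hypothetical regex pulls the Mozilla version to the front to form the "MOZ020_" motif from the example - but the shape of the process (same transformation applied to patterns at build time and to UA strings at lookup time) is as described above.

```python
import re

# Hypothetical prefix rule: move the Mozilla version (a common UA motif)
# to the front, so "Mozilla/2.0..." gains the literal prefix "MOZ020_".
MOZ = re.compile(r"Mozilla/(\d)\.(\d)")

def add_prefix(s: str) -> str:
    m = MOZ.search(s)
    return (f"MOZ0{m.group(1)}{m.group(2)}_" if m else "") + s

def trie_key(pattern: str) -> str:
    """Prefix the pattern, then truncate at the first wildcard as before."""
    prefixed = add_prefix(pattern)
    for i, ch in enumerate(prefixed):
        if ch in "*?":
            return prefixed[:i]
    return prefixed

# The leading-wildcard pattern now contributes a usable, non-empty key:
key = trie_key("*Mozilla/2.0 (compatible; Ask Jeeves*")   # "MOZ020_"
# Incoming UA strings get the same treatment before the trie lookup:
ua = add_prefix("Mozilla/2.0 (compatible; Ask Jeeves/Teoma)")
```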
In practical terms this results in much bigger tries, but this is OK as trie generation is a one-off expense, as explained previously. It does take fractionally longer to load the pregenerated trie as a result, but this is more than offset by the much larger reduction in the time taken to do the lookups. Actual processing times depend on many factors (hardware, software, etc.), but for a typical day's worth of logs from the RSC journals website, processing times went from over ten hours (without tries) to 10 minutes (with tries fully implemented).