Computer Science vs Reality

Posted by Michał ‘mina86’ Nazarewicz on 23rd of May 2021

Robin: ‘Let’s use a linked li—’; Batman: *slaps Robin* ‘Vector is faster’

Some years ago, during a friendly discussion about C++, a colleague challenged me with a question: what’s the best way to represent a sequence of numbers if the delete operation needs to be supported? I argued in favour of a linked list, suggesting that with a sufficiently large number of elements it would be much preferred.

In a twist of fate, I’ve recently been discussing an algorithm which reminded me of that conversation from all those years ago. This time, though, I was arguing against a node-based data structure. Rather than leaving things at a conversation, I’ve decided to benchmark a few solutions to determine which approach really is the best.

The problem

The task at hand is simple: design a data structure which stores a set of words, all of the same length, and can return all words matching globs of the form ‘prefix*suffix’, i.e. words which start with a given prefix and end with a given suffix. Either part of the pattern may be empty, and their concatenation is never longer than the length of the words in the collection. Initialisation time and memory footprint are not a concern. The complexity of returning a result can be assumed to be constant.

For example, supposing we were given the words ‘foo’, ‘bar’ and ‘baz’, the following table lists the results of look-ups for various patterns:

Pattern   Result                 Notes
‘*’       ‘foo’, ‘bar’, ‘baz’    Prefix and suffix can be empty.  A sole asterisk matches everything or just a fragment of the word.
‘b*’      ‘bar’, ‘baz’
‘b*r’     ‘bar’
‘f*r’     (no matches)
‘bar*’    ‘bar’                  An asterisk may also match an empty substring if the concatenation of prefix and suffix has length equal to that of the words in the set.
‘ba*r’    ‘bar’

Vector solution

The simplest approach is to store all the words in a list and compare each with the pattern. With \(n\) words, each \(k\) characters long, this has \(\mathcal O(kn)\) complexity. More specifically, the run-time can be expressed in terms of properties of the pattern: the complexity of a lookup is \(\mathcal O(pn+sn)\), where \(p\) is the length of the prefix in the glob and \(s\) the length of the suffix.
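In C++, the naive scan can be sketched as follows (the name match_linear and its signature are my own hypothetical choices, not taken from the benchmarked code):

```cpp
#include <string>
#include <vector>

// Naive lookup: compare every stored word against the pattern's
// prefix and suffix.  O((p+s)*n) for n words.
std::vector<std::string> match_linear(const std::vector<std::string>& words,
                                      const std::string& prefix,
                                      const std::string& suffix) {
    std::vector<std::string> result;
    for (const std::string& word : words) {
        if (word.size() >= prefix.size() + suffix.size() &&
            word.compare(0, prefix.size(), prefix) == 0 &&
            word.compare(word.size() - suffix.size(), suffix.size(),
                         suffix) == 0) {
            result.push_back(word);
        }
    }
    return result;
}
```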

An obvious improvement is to sort the list. With the words ordered, a binary search makes it possible to narrow down, in logarithmic time, the search space to the subset of words which match the pattern’s prefix. This changes the complexity to \(\mathcal O(p\log n + ns)\).
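A sketch of the sorted-list lookup, using std::equal_range with a comparator that only inspects the first \(p\) characters (match_sorted is again a hypothetical name):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Words must already be sorted.  Binary search narrows the range of
// words sharing the prefix in O(p log n); the suffix is then checked
// linearly within that range.
std::vector<std::string> match_sorted(
        const std::vector<std::string>& sorted_words,
        const std::string& prefix, const std::string& suffix) {
    const size_t p = prefix.size();
    // Compare only the first p characters; all words equal to the
    // prefix on those characters form a contiguous range.
    auto cmp = [p](const std::string& a, const std::string& b) {
        return a.compare(0, p, b, 0, p) < 0;
    };
    auto range = std::equal_range(sorted_words.begin(), sorted_words.end(),
                                  prefix, cmp);
    std::vector<std::string> result;
    for (auto it = range.first; it != range.second; ++it) {
        if (it->size() >= p + suffix.size() &&
            it->compare(it->size() - suffix.size(), suffix.size(),
                        suffix) == 0)
            result.push_back(*it);
    }
    return result;
}
```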

Another approach is to have two sorted lists: one called forward containing the given words, another called backward containing the words in their reversed forms, and a fwd_to_bck array specifying, for each word in the forward list, the position of its reversed form in the backward list. In other words, for any valid index i, forward[i] = reverse_word(backward[fwd_to_bck[i]]). Following up with the example of a set with the words foo, bar and baz, the three lists would have the following contents:

Index  forward  fwd_to_bck  backward
    0  bar               1  oof
    1  baz               2  rab
    2  foo               0  zab

With such a data structure prepared, the lookup algorithm becomes two binary searches followed by a scan through fwd_to_bck checking whether each word in the prefix-matching range of the forward list falls within the range of backward-list indexes identified by the second binary search. The complexity of this solution is \(\mathcal O((p+s)\log n + n)\).
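A minimal sketch of this two-list structure (the TwoLists name and its exact layout are my own invention; the benchmarked implementations may differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct TwoLists {
    std::vector<std::string> forward;   // the words, sorted
    std::vector<std::string> backward;  // reversed words, sorted
    std::vector<size_t> fwd_to_bck;     // reverse of forward[i] is backward[fwd_to_bck[i]]

    explicit TwoLists(std::vector<std::string> words)
        : forward(std::move(words)) {
        std::sort(forward.begin(), forward.end());
        std::vector<std::pair<std::string, size_t>> rev;
        for (size_t i = 0; i < forward.size(); ++i)
            rev.emplace_back(
                std::string(forward[i].rbegin(), forward[i].rend()), i);
        std::sort(rev.begin(), rev.end());
        backward.resize(rev.size());
        fwd_to_bck.resize(rev.size());
        for (size_t j = 0; j < rev.size(); ++j) {
            backward[j] = rev[j].first;
            fwd_to_bck[rev[j].second] = j;
        }
    }

    // Half-open range of indexes in sorted `v` whose words start with `key`.
    static std::pair<size_t, size_t> prefix_range(
            const std::vector<std::string>& v, const std::string& key) {
        const size_t n = key.size();
        auto cmp = [n](const std::string& a, const std::string& b) {
            return a.compare(0, n, b, 0, n) < 0;
        };
        auto r = std::equal_range(v.begin(), v.end(), key, cmp);
        return {static_cast<size_t>(r.first - v.begin()),
                static_cast<size_t>(r.second - v.begin())};
    }

    // Two binary searches plus a linear scan through fwd_to_bck.
    std::vector<std::string> match(const std::string& prefix,
                                   const std::string& suffix) const {
        auto [f_lo, f_hi] = prefix_range(forward, prefix);
        auto [b_lo, b_hi] = prefix_range(
            backward, std::string(suffix.rbegin(), suffix.rend()));
        std::vector<std::string> result;
        for (size_t i = f_lo; i < f_hi; ++i)
            if (b_lo <= fwd_to_bck[i] && fwd_to_bck[i] < b_hi)
                result.push_back(forward[i]);
        return result;
    }
};
```

Constructed from foo, bar and baz, the structure reproduces the table above: forward is {bar, baz, foo}, backward is {oof, rab, zab} and fwd_to_bck is {1, 2, 0}.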

Trie

[Figure: a trie storing the words ‘foo’, ‘bar’ and ‘baz’.]

There is another way though, one involving a rarely used data structure: the humble prefix tree. In a trie, which is another name for this data structure, edges are labelled with individual characters. To check the presence of a key, the tree is traversed following the edges corresponding to successive characters of the word. Since there is a constant upper limit on the number of children a node can have, a lookup takes time linear in the length of the key, \(\mathcal O(k)\), which is better than a binary search.

If all the words are put into a trie, finding all that match a prefix*suffix pattern becomes a three-step process. Firstly, follow the edges in the trie matching the characters of the prefix. Secondly, fan out, traversing every path \(k-p-s\) edges deep. And finally, follow the edges matching the characters of the suffix. If a leaf node is reached, then the path to that leaf spells out a word in the set which matches the pattern.
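The three steps can be sketched as follows, assuming a lower-case a–z alphabet and words of a fixed length \(k\) (since every word has the same length, no end-of-word markers are needed; the Trie type here is a hypothetical illustration, not the benchmarked implementation):

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Trie {
    struct Node {
        std::unique_ptr<Node> children[26];  // one slot per letter a-z
    };
    Node root;

    void insert(const std::string& word) {
        Node* node = &root;
        for (char c : word) {
            auto& child = node->children[c - 'a'];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
    }

    // Words matching prefix*suffix; all words are k characters long
    // and prefix.size() + suffix.size() <= k is assumed.
    std::vector<std::string> match(size_t k, const std::string& prefix,
                                   const std::string& suffix) const {
        std::vector<std::string> result;
        std::string buf;
        // Step 1: walk down the edges spelling out the prefix.
        const Node* node = &root;
        for (char c : prefix) {
            node = node->children[c - 'a'].get();
            if (!node) return result;
            buf.push_back(c);
        }
        fan_out(node, k - prefix.size() - suffix.size(), suffix, buf, result);
        return result;
    }

private:
    // Step 2: explore every path `depth` edges deep, then (step 3)
    // follow the edges spelling out the suffix.
    static void fan_out(const Node* node, size_t depth,
                        const std::string& suffix, std::string& buf,
                        std::vector<std::string>& result) {
        if (depth == 0) {
            std::string word = buf;
            for (char c : suffix) {
                node = node->children[c - 'a'].get();
                if (!node) return;
                word.push_back(c);
            }
            result.push_back(word);
            return;
        }
        for (int i = 0; i < 26; ++i) {
            if (node->children[i]) {
                buf.push_back(static_cast<char>('a' + i));
                fan_out(node->children[i].get(), depth - 1, suffix, buf,
                        result);
                buf.pop_back();
            }
        }
    }
};
```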

The complexities of the first and last steps are simple to derive: they are \(\mathcal O(p)\) and \(\mathcal O(sn)\) respectively. The suffix length needs to be multiplied by the number of words because in the worst case all words need to be tested. As for the fan-out step, we need to go \(k-p-s\) edges deep and in the worst case do so for all keys, which results in \(\mathcal O((k-p-s)n)\) complexity.

Overall, the lookup ends up having an \(\mathcal O(p+(k-p)n)\) run-time. Interestingly, it does not depend on the suffix; on the other hand, the longer the prefix, the faster the algorithm runs. At one extreme, if the glob is just a literal string (and thus the prefix length equals the word length), we end up with a lookup linear in the word length. In those cases a trie should be faster than a vector-based data structure.

Benchmarks

To check the theoretical analysis I’ve prepared a benchmark to compare aforementioned approaches. Three different vector-based implementations and four trie-based ones were measured.

Putting all of that data on a graph isn’t feasible. The run-time depends on four parameters — the number of words, the length of each word, the length of the prefix in the pattern and the length of the suffix in the pattern — which doesn’t map well onto a two-dimensional canvas. To wrangle the measurements, I’ve first reduced the data set. For each quadruple of parameters I’ve taken the best result of a trie-based implementation and the worst result of a vector-based one and divided the former by the latter. This gives a ratio \(r\) indicating how much faster (if the value is less than one) or slower (if it is greater than one) the trie-based solution is.

That still leaves quite a few dimensions, so I’ve further grouped the results by word length and prefix length (discarding the number of words and the suffix length). The figure below shows all the results collected into separate bands, one for each word length the benchmark was run with. Within each band, separate columns indicate runs with different prefix lengths, starting from an empty prefix and ending at a pattern which consists of the full word. The vertical axis is logarithmic and shows the aforementioned execution-time ratio \(r\).

[Figure: Trie ÷ Vector execution time (r) on a logarithmic axis from 0.1× to 10 M×, in bands by word length k from 10 to 1 M, with columns within each band corresponding to increasing prefix lengths.]

A few things immediately jump out. First of all, as predicted, the longer the prefix the better tries fare. This shows as the ratio falling between columns within a single band. For example, for a word length of ten characters, if the pattern has a one-character prefix, the trie’s execution time (shown in the middle column of the first band) ranges from on par with the vector-based algorithm to over ten thousand times slower; but for a simple word lookup (where the prefix length equals the word length, shown in the last column), the prefix tree is at times over ten times faster.

The second observation, however, is that overall tries fare very badly compared to the much simpler vector-based structures. Starting at a word length of a thousand, tries are ten times slower than a sorted array of the keys. (I haven’t made any measurements for lengths in between those two sizes.) Furthermore, the longer the words in the set, the worse the prefix tree becomes. But even with short words, depending on the type of queries, the vector-based algorithm can be thousands of times faster. There were only a handful of tests in which the prefix tree fared better than an array:

      n    k   p   s  Worst Vector [µs]  Best Trie [µs]     r
      1   10  10   0              0.452           0.145  0.32
     10   10  10   0              0.548           0.135  0.25
    100   10   1   1              1.779           1.747  0.98
    100   10  10   0              0.896           0.134  0.15
   1000   10  10   0              1.194           0.134  0.11
  10000   10  10   0              1.585           0.135  0.09
 100000   10  10   0              1.874           0.134  0.07
1000000   10  10   0              2.337           0.159  0.07
  10000  100  10   1              8.537           4.687  0.55
 100000  100  10   1            132.68            4.373  0.03
 100000  100  10  10              4.116           3.919  0.95

And let’s not forget that the cards were stacked in favour of the trees. Not only was the best trie pitted against the worst vector solution, but the alphabet consisted of only 26 symbols (specifically, lower-case letters), which also benefited the trie.

Conclusion

The big O notation ‘hides constants’. Any programmer recognises that distinct algorithms with the same complexity can perform vastly differently. It’s also not news that, due to the asymptotic nature of big O, an algorithm with seemingly better complexity may run slower than one with worse complexity. This is why insertion sort is used to sort ten elements rather than heap sort, and why the Coppersmith–Winograd algorithm is never used in practice.

It’s good to remind ourselves of those facts in practice every now and again. Sometimes a naïve, almost brute-force approach may be faster just because it uses tighter loops or has better cache locality. And cache locality is something node-based data structures (such as linked lists or trees) lack.

In other words, it’s nearly never a linked-list and rarely a tree.

If we further assume a uniform distribution of words, the linear search matching suffixes doesn’t need to go through all \(n\) keys. Rather, it needs to test only \(\mathcal O\left({n \over \exp(p)}\right)\) strings, which is noticeably fewer. This makes the complexity equal \(\mathcal O\left(p\log n + {ns \over \exp(p)}\right)\). The same adjustment can be made to the other complexity functions. However, big O expresses the pessimistic time, so it’s arguably more correct to derive the formula for the worst-case scenario in which all words need to be checked.

Of course that is only the worst-case scenario. In practice, in the average case, the presence of a suffix will speed the algorithm up.