
Strcmp For Python Or How To Sort Substrings Efficiently (without Copy) When Building A Suffix Array

Here's a very simple way to build a suffix array from a string in Python:

def sort_offsets(a, b):
    return cmp(content[a:], content[b:])

content = 'foobar baz foo'
suffix_array

Solution 1:

The buffer function does not copy the whole string, but creates an object that only references the source string. Using interjay's suggestion, that would be:

suffix_array.sort(key=lambda a: buffer(content, a))
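Note that buffer exists only in Python 2. In Python 3, memoryview cannot wrap a str and does not define ordering comparisons, so it is not a drop-in replacement here. One hedged alternative is a strcmp-style comparison that walks both suffixes in fixed-size chunks, so no full suffix is ever copied. The helper name compare_suffixes and the default chunk size of 16 are assumptions made for this sketch, not part of the original answer.

```python
from functools import cmp_to_key

def compare_suffixes(content, a, b, chunk=16):
    """Compare content[a:] with content[b:] without slicing whole suffixes.

    Returns -1, 0, or 1, in the spirit of C's strcmp.
    """
    while True:
        x = content[a:a + chunk]
        y = content[b:b + chunk]
        if x != y:
            # First differing chunk decides the order; a shorter chunk that
            # is a prefix of the other sorts first, as with full suffixes.
            return -1 if x < y else 1
        if len(x) < chunk:
            # Equal chunks shorter than `chunk` mean both suffixes ended.
            return 0
        a += chunk
        b += chunk

content = 'foobar baz foo'
suffix_array = sorted(
    range(len(content)),
    key=cmp_to_key(lambda a, b: compare_suffixes(content, a, b)),
)
print(suffix_array)
```

Each comparison copies at most two short chunks per step, so the temporary memory stays small regardless of the string length. The chunk size is worth tuning for the data at hand.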

Solution 2:

I don't know if there's a fast way to compare substrings, but you can make your code much faster (and simpler) by using key instead of cmp:

suffix_array.sort(key=lambda a: content[a:])

This will create the substring just once for each value of a.

Edit: A possible downside is that it will require O(n^2) memory for the substrings.
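To make the memory cost concrete, here is a small sketch written for Python 3 (the answers on this page target Python 2). It builds the suffix array with a key function and counts how many characters the key slices hold in total. The name total_key_chars is an illustrative assumption.

```python
content = 'foobar baz foo'
n = len(content)

# Each offset a gets one key, the slice content[a:], of length n - a.
suffix_array = sorted(range(n), key=lambda a: content[a:])

# Total characters held by all keys: n + (n - 1) + ... + 1 = n * (n + 1) / 2.
total_key_chars = sum(n - a for a in range(n))

print(suffix_array)     # [6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]
print(total_key_chars)  # 105
```

For a 100,000-character string the same formula gives roughly 5 billion characters of keys alive during the sort, which is why this approach only suits short inputs.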


Solution 3:

+1 for a very interesting problem! I can't see any obvious way to do this directly, but I was able to get a significant speedup (an order of magnitude for 100000 character strings) by using the following comparison function in place of yours:

def compare_offsets2(a, b):
    return (cmp(content[a:a+10], content[b:b+10]) or
            cmp(content[a:], content[b:]))

In other words, start by comparing the first 10 characters of each suffix; only if the result of that comparison is 0, indicating that you've got a match for the first 10 characters, do you go on to compare the entire suffixes.

Obviously 10 could be anything: experiment to find the best value.

This comparison function is also a nice example of something that isn't easily replaced with a key function.
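Python 3 removed both the built-in cmp and the cmp argument to sort, so a port of the function above needs a small cmp helper plus functools.cmp_to_key. The following is a minimal sketch under that assumption; the cmp helper, the make_compare wrapper, and the prefix_len parameter name are additions for illustration and do not appear in the original answer.

```python
from functools import cmp_to_key

def cmp(x, y):
    """Python 3 stand-in for the removed built-in cmp()."""
    return (x > y) - (x < y)

def make_compare(content, prefix_len=10):
    """Return a comparison function that tries a short prefix first."""
    def compare_offsets2(a, b):
        # The cheap prefix comparison settles most pairs; the full-suffix
        # comparison only runs when the prefixes are equal.
        return (cmp(content[a:a + prefix_len], content[b:b + prefix_len]) or
                cmp(content[a:], content[b:]))
    return compare_offsets2

content = 'foobar baz foo'
suffix_array = list(range(len(content)))
suffix_array.sort(key=cmp_to_key(make_compare(content)))
print(suffix_array)
```

Passing a different prefix_len to make_compare is a convenient way to run the experiment suggested above.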


Solution 4:

You could use the blist extension type that I wrote. A blist works like the built-in list, but (among other things) uses copy-on-write so that taking a slice takes O(log n) time and memory.

from blist import blist

content = "foobar baz foo"
content = blist(content)
suffix_array = range(len(content))
suffix_array.sort(key = lambda a: content[a:])
print suffix_array
[6, 10, 4, 8, 3, 7, 11, 0, 13, 2, 12, 1, 5, 9]

I was able to create a suffix_array from a randomly generated 100,000-character string in under 5 seconds, and that includes generating the string.

