Friday, February 3, 2017

Hashing and Probability of Collision

Problem D of Codeforces #395 required hashes of rooted trees. In this post I give some examples of good hashes and bad hashes, and then construct a good hash for rooted trees.

The most well-known hash in competitive programming is probably the rolling-hash of strings. The following problem looks stupid, but let's solve it using rolling hash anyway.

Problem 1. You are given two strings s, t of lengths N (<= 10^5) that consist of lowercase letters. Determine if they are the same.

Let MOD=10^9+7. Choose an integer r uniformly at random from the interval [0, MOD). Compute the sum of r^i * s[i] over all i. This is called the rolling hash of s. We can consider it as a polynomial of r, let's denote it as S(r). Similarly, define T(r). The hash collision happens when s and t are different strings (thus S and T are different polynomials), but unluckily S(r) = T(r). Here we use Schwartz-Zippel lemma:

Let P(x_1, ..., x_k) be a non-zero (multi-variable) polynomial on a finite field F_MOD. If the variables are chosen independently and uniformly at random, the probability that P(x_1, ..., x_k) = 0 is at most D/MOD, where D is the total degree of the polynomial.

In the case of rolling hash, we are only interested in single-variable polynomials. From the lemma above, we can prove that the collision probability is at most N/MOD (<= 1/10^4). Thus I'd say this is a "good hash". My intuition tells that in practice the probability is as small as 1/MOD, but that looks at least very hard to prove. Let me give an interesting example that shows the intuition is not always correct. In the lemma F_MOD must be a field, and MOD must be a prime. What happens if the modulo is not prime, for example 2^64 (this would simplify the implementation). It turns out that the rolling hash doesn't work well for a certain case if the modulo is 2^64. Detailed description on Codeforces.

Problem 2. You are given two graphs G, H with N (<= 10^3) vertices and M (<= 10^4) edges. Determine if they are isomorphic.

For each vertex v in the graph and an integer d, define the level-d hash for v, h(d, v). When d = 0, define h(0, v) = 1. Otherwise, let x_1, ..., x_k be the sorted list of level-(d-1) hashes of vertices adjacent to v, and define h(d, v) as the sum of x_i * r^i modulo MOD. Compute the level-N hashes of the two graphs for each vertex, sort them, and compare them.

It looks very hard to find a testcase against this hash. (Actually no - see Petr's comment below). However there's no theoretical guarantee under this hash and I think most people agree that this doesn't look valid as an intended solution of a problem in a contest (graph isomorphism is a well-known difficult problem). I'd say this is a "bad hash".

OK, now we want to construct a good hash for rooted trees. Before that, let's try an easier version. Hashes for (multi)sets.

Problem 3. You are given two (multi)sets A = {a_1, ...., a_n} and B = {b_1, ..., b_n}. n <= 10^5 and each element is in the interval [0, MOD). Determine if they are the same.

One possible solution is to sort the input and then compute the rolling hash of it. Can you find a linear-time hash? The solution is indeed very simple. Let's take a random integer r from the interval [0, MOD), and compute the hash (r+a_1)(r+a_2)...(r+a_n). This is a polynomial of r of degree n, so again from Schwartz-Zippel lemma, the collision probability is at most n/MOD.

Problem 4. You are given two rooted trees with n vertices. Determine if they are isomorphic.

Like previous examples, we assign a polynomial for a rooted tree, and evaluate the polynomial for random variables. However, unlike previous examples, this time we use multi-variable polynomials. We define the polynomial P(G) for a rooted tree G as follows. If G is a single leaf, P(G) = 1. Otherwise, let d be the depth of the tree, k be the number of children of the root of G, and G_1, G_2, ..., G_d be the subtrees corresponding to those children. Define P(G) = (x_d + P(G_1))(x_d + P(G_2))...(x_d + P(G_k)). See the picture for an example. (x_1, x_2, x_3 are replaced with x, y, z).

This way we get a multi-variable polynomial over d variables of degree l, where l is the number of leaves in the tree. We can prove that two polynomials that correspond to non-isomorphic trees are different (by using the uniqueness of factorization of the ring of polynomials). Thus, if we evaluate this polynomial for random variables, we get a good hash with collision probability at most l/MOD!


6 comments:

  1. Awesome explanation!

    For graph isomorphism, it's easy to break such hashing: take two non-isomorphic regular (all degrees the same) graphs.

    ReplyDelete
    Replies
    1. Thank you for pointing out, I'll fix it.
      Maybe computing hashes of graphs of this size is hard even in practice.

      Delete
  2. "It turns out that the rolling hash doesn't work well for a certain case if the modulo is 2^64. Comments in Russian." Comment link is broken Can you please fix it.
    Thanks

    ReplyDelete
    Replies
    1. I found an English description: http://codeforces.com/blog/entry/4898

      Delete
    2. It's not exactly broken, just change the language in codeforces to russian and click the link again :)

      Delete
  3. "and G_1, G_2, ..., G_d be the subtrees corresponding to those children"

    I think it should be G_1, G_2, ..., G_k. Am I right?

    ReplyDelete