Monday, February 13, 2017

Best Performance and Worst Performance

After the last blog, I participated in three contests: GP of Gomel (tourist's contest), SRM 708, and GP of China. The problems were very nice in all contests and I really enjoyed them. However, my performance was quite different: in SRM everything went very well and it was my best performance after the last TCO, but today I performed terribly (well, 17th in Open Cup is not bad, but given that the problems really suited me, it was terrible).

In SRM I was in a very strange situation. A few minutes before the end of the coding phase, I found that my solution may access to out-of-memory and fail. However, it worked on random tests and I decided not to resubmit - and luckily, it passed the systest!

In GP of Gomel, I solved seven tasks (B, D, F, G, H, J, K) quickly and I still had two hours to solve something else. In such situations, I usually read all problems and sort them by expected difficulties for me. The difficulty is estimated by the number of ACs (objective difficulty) and the topic (for example, I'm good at combinatorial/mathematical tasks, but bad at data structures). For me the topic is more important and after spending several minutes on each task I decided to concentrate on A and I because they looked the most interesting for me and no task was solved by many people - however I couldn't finish coding I in time and ended up with nothing.

In GP of China, after solving five tasks (B, E, F, G, J), I had more than 3 hours. Among the remaining tasks, I liked C/D the most and decided to spend 3 hours on them, but again ended up with nothing.

It seems I can quickly solve easy problems, but when the problems become harder I can't do anything. I seriously need to fix this tendency. Maybe I should try Div2 in next Open Cup? Anyway, I will upsolve at least four tasks (Gomel's AI, China's CD) in the near future and post here again. I believe I understood the solutions for these tasks now and they are indeed very nice.

Friday, February 3, 2017

Hashing and Probability of Collision

Problem D of Codeforces #395 required hashes of rooted trees. In this post I give some examples of good hashes and bad hashes, and then construct a good hash for rooted trees.

The most well-known hash in competitive programming is probably the rolling-hash of strings. The following problem looks stupid, but let's solve it using rolling hash anyway.

Problem 1. You are given two strings s, t of lengths N (<= 10^5) that consist of lowercase letters. Determine if they are the same.

Let MOD=10^9+7. Choose an integer r uniformly at random from the interval [0, MOD). Compute the sum of r^i * s[i] over all i. This is called the rolling hash of s. We can consider it as a polynomial of r, let's denote it as S(r). Similarly, define T(r). The hash collision happens when s and t are different strings (thus S and T are different polynomials), but unluckily S(r) = T(r). Here we use Schwartz-Zippel lemma:

Let P(x_1, ..., x_k) be a non-zero (multi-variable) polynomial on a finite field F_MOD. If the variables are chosen independently and uniformly at random, the probability that P(x_1, ..., x_k) = 0 is at most D/MOD, where D is the total degree of the polynomial.

In the case of rolling hash, we are only interested in single-variable polynomials. From the lemma above, we can prove that the collision probability is at most N/MOD (<= 1/10^4). Thus I'd say this is a "good hash". My intuition tells that in practice the probability is as small as 1/MOD, but that looks at least very hard to prove. Let me give an interesting example that shows the intuition is not always correct. In the lemma F_MOD must be a field, and MOD must be a prime. What happens if the modulo is not prime, for example 2^64 (this would simplify the implementation). It turns out that the rolling hash doesn't work well for a certain case if the modulo is 2^64. Detailed description on Codeforces.

Problem 2. You are given two graphs G, H with N (<= 10^3) vertices and M (<= 10^4) edges. Determine if they are isomorphic.

For each vertex v in the graph and an integer d, define the level-d hash for v, h(d, v). When d = 0, define h(0, v) = 1. Otherwise, let x_1, ..., x_k be the sorted list of level-(d-1) hashes of vertices adjacent to v, and define h(d, v) as the sum of x_i * r^i modulo MOD. Compute the level-N hashes of the two graphs for each vertex, sort them, and compare them.

It looks very hard to find a testcase against this hash. (Actually no - see Petr's comment below). However there's no theoretical guarantee under this hash and I think most people agree that this doesn't look valid as an intended solution of a problem in a contest (graph isomorphism is a well-known difficult problem). I'd say this is a "bad hash".

OK, now we want to construct a good hash for rooted trees. Before that, let's try an easier version. Hashes for (multi)sets.

Problem 3. You are given two (multi)sets A = {a_1, ...., a_n} and B = {b_1, ..., b_n}. n <= 10^5 and each element is in the interval [0, MOD). Determine if they are the same.

One possible solution is to sort the input and then compute the rolling hash of it. Can you find a linear-time hash? The solution is indeed very simple. Let's take a random integer r from the interval [0, MOD), and compute the hash (r+a_1)(r+a_2)...(r+a_n). This is a polynomial of r of degree n, so again from Schwartz-Zippel lemma, the collision probability is at most n/MOD.

Problem 4. You are given two rooted trees with n vertices. Determine if they are isomorphic.

Like previous examples, we assign a polynomial for a rooted tree, and evaluate the polynomial for random variables. However, unlike previous examples, this time we use multi-variable polynomials. We define the polynomial P(G) for a rooted tree G as follows. If G is a single leaf, P(G) = 1. Otherwise, let d be the depth of the tree, k be the number of children of the root of G, and G_1, G_2, ..., G_d be the subtrees corresponding to those children. Define P(G) = (x_d + P(G_1))(x_d + P(G_2))...(x_d + P(G_k)). See the picture for an example. (x_1, x_2, x_3 are replaced with x, y, z).

This way we get a multi-variable polynomial over d variables of degree l, where l is the number of leaves in the tree. We can prove that two polynomials that correspond to non-isomorphic trees are different (by using the uniqueness of factorization of the ring of polynomials). Thus, if we evaluate this polynomial for random variables, we get a good hash with collision probability at most l/MOD!


Wednesday, February 1, 2017

January Contests

ピカピカの本のキャラクターNo, I didn't forget this blog. I was just too lazy to write new posts. Anyway, after the last post I was not very active - I participated only in one SRM and Facebook Hacker Cup's rounds (and SnarkNews Series, though I gave up after having trouble with understanding the Russian statements).

In SRM 706, the first two tasks were relatively easy. My idea for the Hard was as follows. In this task you want to find a family of subsets of {1, ..., N}, where no subset contains two consecutive numbers and for each non-consecutive pair (p, q), at least one subset contains both. First add a subset that contains all even numbers, and add another subset that contains all odd numbers. Now we only need to care about pairs with different parities. In my construction, for example when N = 15, add three sets: {2, 4, 6, 9, 11, 13, 15}, {1, 3, 5, 7, 10, 12, 14}, and {1, 3, 5, 8, 11, 13, 15}. Then do the same thing recursively for {1, ..., 7} and {9, ..., 15} (like divide-and-conquer). This way we only need about 3 * lg N sets but it turned out that the numbers can be as large as 10^25. Maybe we can do it with 2 * lg N sets? I still don't know the answer, let's add it to "Upsolving" section.

My rating slightly increased after the contest and it became 3615. It seems this is the 4th in TopCoder's history. However it's far away from the three legends - Petr (3924), ACRush (3902), and tourist (3836).

Do you understand what the picture above means? Facebook. There were four elimination rounds in this year's FHC. The first three rounds were relatively easy. The second task in Round 1 was notable - at first it looked like a very tedious task where you have to find all subsets of points that can be enclosed by a circle, but there's a significantly simpler solution: you only need to consider the union of two squares.

Last Sunday I participated in Round 3, the last elimination round. I knew problem A so I solved it quickly. After that I read everything and decided to solve B first - it's tricky and requires careful coding, but I'm good at counting tasks and it looked like the easiest one. After that the choice was between C and E. D looked overwhelming at that time. I chose C mainly because it looked very easy to do stress-testing and believed that correct A+B+C can qualify (which turned out to be wrong and I was quite close to the cutoff due to too big penalty). Maybe careful A+B+E was a better strategy. (However, I have to admit that I made a small mistake and my E failed during upsolving).

My solution for C was as follows. First for each non-center vertex i, compute two values x[i] and y[i]. x[i] is the position along the cycle and y[i] is the shortest path from the center. Now the problem asks to compute the sum of min{x[j] - x[i], X - (x[j] - x[i]), y[i] + y[j]}, where X is the total cycle length. In my solution I assumed that the arrays x and y are arbitrary. This leads to a solution with BIT. But x and y are not arbitrary arrays and these satisfy triangle inequalities of a certain graph. If we use this fact, we can prove that for a fixed i we can split the cycle into three consecutive parts and in each part we can determine which of x[j] - x[i], X - (x[j] - x[i]), y[i] + y[j] is the smallest. This leads to a simpler linear solution.

During the contest, my first idea for D was as follows. Consider two numbers x and y, and two masks (the set of non-broken positions) mask1 and mask2. We can't distinguish two states (x, mask1) and (y, mask2) after t seconds if ((x + t) & mask1) and ((y + t) & mask2) are the same. For each bit there are several cases: it is broken in both states, broken in one state and off in another, ... and so on. And looked completely unapproachable. Indeed I missed a very simple thing: when the position i is 0 in the input, we can always assume that the display is broken at this position. This observation greatly simplifies the problem and now we know mask1 and mask2 - now we can concentrate on x and y. After that it's still tricky, but quite solvable task.

How should we solve it during the contest? One possible approach is to consider various cases, but I don't prefer it with such weak examples. Digit DP is also possible but it's fairly complicated. Here's a very simple solution: we can prove that in the optimal solution x and y differ in only two bits. Fix two positions p and q - at position p, x is 0 and y is 1, and at position q, vice versa. And in other positions x and y are the same. In this case, we can prove that there's some r such that the bits at positions [r, N) doesn't matter at all, and we want to minimize the numbers in the least-significant r bits. If we fix these three numbers (p, q, r), we can solve the task without considering various cases. This will lead to a very simple O(N^4) solution. Source code in ideone.

Anyway, I managed to qualify to the Finals. If you also qualified, see you in Seattle!