Suffix array explanation


Some conventions

1. String indices start at \(1\).
2. The length of a string \(t\) is \(len(t)\). In particular, the length of the string \(s\) is \(n\).
3. This article assumes that strings contain only lowercase letters.
4. \(s[l..r]\) means \(s_l s_{l+1} \ldots s_r\).
5. Suffix \(i\) means \(s[i..n]\).
6. The suffix of rank \(i\) is the suffix that comes \(i\)-th when all suffixes are sorted in lexicographic order; it is identified by its starting position.

Suffix Array

What is a suffix array

A suffix array, as the name suggests, is an array of suffixes.

More precisely, it is what you get by taking all suffixes of the string and sorting them in lexicographic order.

(Picture borrowed from OI Wiki.)

How to compute the suffix array

There are many ways to compute a suffix array, such as the \(O(n^2 \log n)\) brute force or the \(O(n \log^2 n)\) hashing approach. There are also two \(O(n)\) methods (SA-IS and DC3; the links are the ones recommended by OI Wiki), and the doubling algorithm discussed in this article.
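As a reference point, the \(O(n^2 \log n)\) brute force can be sketched as below: sort the starting positions and compare whole suffixes. This is only an illustration; the function name `naiveSuffixArray` and the use of `std::string` are assumptions of this sketch, not part of the code later in the article.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <string>
#include <vector>

// Brute-force suffix array: sort the starting positions by comparing the
// suffixes themselves. Positions are 1-based, as in this article.
// Returns sa with sa[0] unused and sa[i] = start of the suffix of rank i.
std::vector<int> naiveSuffixArray(const std::string& t) {
    assert(!t.empty());  // assumption of this sketch: non-empty input
    int n = static_cast<int>(t.size());
    std::vector<int> sa(n + 1, 0);
    std::iota(sa.begin() + 1, sa.end(), 1);  // 1, 2, ..., n
    std::sort(sa.begin() + 1, sa.end(), [&](int x, int y) {
        // each comparison is O(n), so the whole sort is O(n^2 log n)
        return t.compare(x - 1, std::string::npos, t, y - 1, std::string::npos) < 0;
    });
    return sa;
}
```

For the string "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so sa[1..6] = 6 4 2 1 5 3.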

Idea of the algorithm

The brute-force idea is to compare the \(1\)st character of every suffix \(i\), then the \(2\)nd, then the \(3\)rd, and so on up to the \(n\)th. But this is \(O(n^2)\).

How can we optimize it? Don't forget that our algorithm is called doubling! Once all suffixes are sorted by their first \(w\) characters, we also know the order of characters \([w+1..2w]\) of every suffix \(i\), because those are exactly the first \(w\) characters of suffix \(i+w\). So we only need to sort with two keywords, the rank of the first \(w\) characters of suffix \(i\) and the rank of characters \([w+1..2w]\) of suffix \(i\), to get the ranking by the first \(2w\) characters of suffix \(i\)!
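To make the two-keyword idea concrete, here is a minimal sketch of a single doubling step that uses an ordinary comparison sort. The name `doubleRanks` and the vector-based interface are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// One doubling step. rk[1..n] holds the ranks by the first w characters
// (rk[0] is unused). Returns the ranks by the first 2w characters.
std::vector<int> doubleRanks(const std::vector<int>& rk, int w) {
    assert(w >= 1);
    int n = static_cast<int>(rk.size()) - 1;
    // key(i) = (rank of the first w characters of suffix i,
    //           rank of characters [w+1..2w] of suffix i);
    // a second part that lies past the end of the string gets rank 0.
    auto key = [&](int i) {
        return std::make_pair(rk[i], i + w <= n ? rk[i + w] : 0);
    };
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i + 1;
    std::sort(order.begin(), order.end(),
              [&](int x, int y) { return key(x) < key(y); });
    std::vector<int> next(n + 1, 0);
    int p = 0;
    for (int k = 0; k < n; ++k) {
        // equal keys mean equal 2w-prefixes, so they share a rank
        if (k == 0 || key(order[k]) != key(order[k - 1])) ++p;
        next[order[k]] = p;
    }
    return next;
}
```

For "banana" with w = 1 the ranks 2 1 3 1 3 1 become 3 2 4 2 4 1, and a second step with w = 2 gives 4 3 6 2 5 1, at which point all ranks are distinct.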

(Another picture borrowed from OI Wiki.)

Concrete implementation

If you don't want to read code that gets TLE, you can skip straight to "Optimized implementation"; everything before it only paves the way for that version.

Variable Convention

• sa[i] - starting position of the suffix of rank \(i\)
• rk[i] - rank of suffix \(i\)
• tp[i] - short for temp, an auxiliary array; its exact meaning is described below
• p - auxiliary variable; its exact meaning is described below
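Once all ranks are distinct, sa and rk are inverse permutations of each other: rk[sa[i]] = i and sa[rk[i]] = i. A small sketch of recovering rk from sa (the helper name `rankFromSa` is an assumption of this example):

```cpp
#include <cassert>
#include <vector>

// Builds rk from a finished sa. Both vectors are 1-based; index 0 is unused.
std::vector<int> rankFromSa(const std::vector<int>& sa) {
    int n = static_cast<int>(sa.size()) - 1;
    std::vector<int> rk(n + 1, 0);
    for (int i = 1; i <= n; ++i) {
        assert(1 <= sa[i] && sa[i] <= n);
        rk[sa[i]] = i;  // the suffix starting at sa[i] has rank i
    }
    return rk;
}
```

For sa = 6 4 2 1 5 3 (the string "banana") this gives rk = 4 3 6 2 5 1. During the intermediate rounds of the doubling algorithm rk may still contain ties, so the inverse relation only holds at the end.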

sort implementation

Now we can write code! In every round, update the first and second keywords and sort by them with sort!

The code comes from OI Wiki, and the variables are roughly the same as above (actually I'm just lazy).

```cpp
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <iostream>

using namespace std;

const int N = 1000010;

char s[N];
int n, w, sa[N], rk[N << 1], oldrk[N << 1];
// rk is double-sized so that accessing rk[i + w] never goes out of bounds.
// You could also check the bound before each access, but doubling the array is more convenient.

int main() {
    int i, p;

    scanf("%s", s + 1);
    n = strlen(s + 1);
    for (i = 1; i <= n; ++i) sa[i] = i, rk[i] = s[i];

    for (w = 1; w < n; w <<= 1) {
        // rk and w are globals, so the lambda needs no capture
        sort(sa + 1, sa + n + 1, [](int x, int y) {
            return rk[x] == rk[y] ? rk[x + w] < rk[y + w] : rk[x] < rk[y];
        });
        memcpy(oldrk, rk, sizeof(rk));
        // rk is overwritten while the new ranks are computed, so copy it first
        for (p = 0, i = 1; i <= n; ++i) {
            if (oldrk[sa[i]] == oldrk[sa[i - 1]] &&
                oldrk[sa[i] + w] == oldrk[sa[i - 1] + w]) {
                rk[sa[i]] = p;
            } else {
                rk[sa[i]] = ++p;
            }  // equal substrings must get the same rk, so duplicates share a rank
        }
    }

    for (i = 1; i <= n; ++i) printf("%d ", sa[i]);

    return 0;
}
```

So now we can AC!

What a slap in the face!

Never mind, the constant factor is not our problem. With O2 it passes.
But wait! Doesn't using sort make the algorithm \(O(n \log^2 n)\)? How can our excellent algorithm allow such a thing to happen?

Radix sort implementation

To bring the complexity of the algorithm down to \(O(n \log n)\), we need to work on the sorting.
For sorting by two keywords we can use radix sort (with counting sort as the stable sort inside it)!
Each round of sorting then takes \(O(n)\), so the whole algorithm is \(O(n \log n)\).
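As a stand-alone illustration of the technique, here is a minimal sketch of radix sort on pairs: a stable counting sort by the second keyword, followed by a stable counting sort by the first keyword. The helper names and the 0-based indices are assumptions of this sketch and differ from the 1-based arrays used in the article's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stable counting sort of the indices in `order` by key[idx], keys in [0, m].
std::vector<int> countingSortBy(const std::vector<int>& order,
                                const std::vector<int>& key, int m) {
    std::vector<int> cnt(m + 1, 0), out(order.size());
    for (int idx : order) {
        assert(0 <= key[idx] && key[idx] <= m);
        ++cnt[key[idx]];
    }
    for (int v = 1; v <= m; ++v) cnt[v] += cnt[v - 1];
    // walk backwards so that equal keys keep their relative order
    for (int k = static_cast<int>(order.size()) - 1; k >= 0; --k)
        out[--cnt[key[order[k]]]] = order[k];
    return out;
}

// Sorts the indices 0..n-1 by the pair (first[i], second[i]).
std::vector<int> radixSortPairs(const std::vector<int>& first,
                                const std::vector<int>& second, int m) {
    std::vector<int> order(first.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    order = countingSortBy(order, second, m);  // less significant keyword
    return countingSortBy(order, first, m);    // more significant keyword
}
```

With first = 2 1 3 1 3 1 and second = 1 3 1 3 1 0 (the keyword pairs of "banana" for \(w = 1\)), the result is the index order 5 1 3 0 2 4. Each pass is \(O(n + m)\).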

Borrowing OI Wiki's code again:

```cpp
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <iostream>

using namespace std;

const int N = 1000010;

char s[N];
int n, sa[N], rk[N << 1], oldrk[N << 1], id[N], cnt[N];

int main() {
    int i, m, p, w;

    scanf("%s", s + 1);
    n = strlen(s + 1);
    m = max(n, 300);  // upper bound on key values (character codes or ranks)
    // initial counting sort by the first character
    for (i = 1; i <= n; ++i) ++cnt[rk[i] = s[i]];
    for (i = 1; i <= m; ++i) cnt[i] += cnt[i - 1];
    for (i = n; i >= 1; --i) sa[cnt[rk[i]]--] = i;

    for (w = 1; w < n; w <<= 1) {
        // pass 1: stable counting sort by the second keyword rk[i + w]
        memset(cnt, 0, sizeof(cnt));
        for (i = 1; i <= n; ++i) id[i] = sa[i];
        for (i = 1; i <= n; ++i) ++cnt[rk[id[i] + w]];
        for (i = 1; i <= m; ++i) cnt[i] += cnt[i - 1];
        for (i = n; i >= 1; --i) sa[cnt[rk[id[i] + w]]--] = id[i];
        // pass 2: stable counting sort by the first keyword rk[i]
        memset(cnt, 0, sizeof(cnt));
        for (i = 1; i <= n; ++i) id[i] = sa[i];
        for (i = 1; i <= n; ++i) ++cnt[rk[id[i]]];
        for (i = 1; i <= m; ++i) cnt[i] += cnt[i - 1];
        for (i = n; i >= 1; --i) sa[cnt[rk[id[i]]]--] = id[i];
        // recompute the ranks; equal keyword pairs share a rank
        memcpy(oldrk, rk, sizeof(rk));
        for (p = 0, i = 1; i <= n; ++i) {
            if (oldrk[sa[i]] == oldrk[sa[i - 1]] &&
                oldrk[sa[i] + w] == oldrk[sa[i - 1] + w]) {
                rk[sa[i]] = p;
            } else {
                rk[sa[i]] = ++p;
            }
        }
    }

    for (i = 1; i <= n; ++i) printf("%d ", sa[i]);

    return 0;
}
```

Now we can finally AC.

Slapped in the face for the second time

Why? The complexity is clearly right!
However, the constant factor of this code is too large.

Optimized implementation

In fact, the second-keyword pass of the radix sort does not need a counting sort at all: the order by the second keyword can be written down directly while filling the array.

Here is my own code, which I will explain in detail. (Not OI Wiki, at last.)

```cpp
// main code
const int N = 1e6 + 5;

char s[N];
int n, m;

// sa and rk are double-sized to guard against out-of-range indices; checking the bounds before each access works too
// tot is the bucket array of the counting sort
int sa[N << 1], rk[N << 1], tot[N], tp[N];
void Sort() { // counting sort that fills sa
    for(int j = 0; j <= m; j++) tot[j] = 0; // clear the buckets
    for(int j = 1; j <= n; j++) tot[rk[j]]++;
    for(int j = 1; j <= m; j++) tot[j] += tot[j - 1]; // prefix sums; start at 1 so that tot[-1] is never read
    for(int j = n; j >= 1; j--) sa[tot[rk[tp[j]]]--] = tp[j];
}
void SA() {
    m = 'z';
    for(int j = 1; j <= n; j++) rk[j] = s[j], tp[j] = j;
    Sort(); // rk needs no renumbering here: the character codes serve as the initial ranks (renumbering would also be correct)
    for(int i = 1, p = 0; p < n; i <<= 1, m = p) { // i is the w from above
        p = 0;
        for(int j = 1; j <= i; j++) tp[++p] = n - i + j;
        for(int j = 1; j <= n; j++) if(sa[j] > i) tp[++p] = sa[j] - i;
        Sort();
        for(int j = 1; j <= n; j++) tp[j] = rk[j]; // tp is no longer needed this round, so reuse it to hold the previous round's rk
        rk[sa[1]] = p = 1;
        for(int j = 2; j <= n; j++)
            rk[sa[j]] = ((tp[sa[j - 1]] == tp[sa[j]] && tp[sa[j - 1] + i] == tp[sa[j] + i]) ? p : ++p);
    }
}
```

This part is fairly complicated, so here is a detailed explanation.

Part I
```cpp
// Code 1.1 (main code, Line 10~15)
void Sort() { // counting sort that fills sa
    for(int j = 0; j <= m; j++) tot[j] = 0; // clear the buckets
    for(int j = 1; j <= n; j++) tot[rk[j]]++;
    for(int j = 1; j <= m; j++) tot[j] += tot[j - 1]; // prefix sums; start at 1 so that tot[-1] is never read
    for(int j = n; j >= 1; j--) sa[tot[rk[tp[j]]]--] = tp[j];
}
```

First, let's look at the counting sort. Its job here is to sort stably by the first keyword. Look at the following statement.

```cpp
for(int j = 1; j <= n; j++) tot[rk[j]]++; // Code 1.1, Line 4
```

This records the rank of every suffix \(i\) in the buckets. Note that different suffixes may still share the same rk at this stage, but by the end all ranks must be distinct (because the suffixes have different lengths).

```cpp
for(int j = n; j >= 1; j--) sa[tot[rk[tp[j]]]--] = tp[j]; // Code 1.1, Line 6
```

Here we update sa. Let's go over what each array means at this point.

• tp[i] - the starting position of the suffix whose second keyword has rank \(i\) (why we don't store the second keyword of suffix \(i\) directly will become clear later)
• rk[i] - the rank of suffix \(i\) after the previous round of sorting; here it is the first keyword of suffix \(i\)

This statement enumerates the second-keyword rank \(j\), and tp[j] is the start of the suffix currently being enumerated (the one whose second keyword has rank \(j\)). Since sa[i] is the start of the suffix of rank \(i\), the statement means:
Take the current tot value of the enumerated suffix's first keyword as the rank of that suffix, then decrement tot (decrementing tot is the usual counting sort routine).

Note that the sort must be stable, so the enumeration runs in reverse order.
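The same step can be tried in isolation with a vector-based sketch. The function name `sortByFirstKeyword` and the vector interface are assumptions of this example; the logic mirrors Sort() with 1-based arrays.

```cpp
#include <cassert>
#include <vector>

// rk[i]: first keyword of suffix i, values in [1, m] (index 0 unused).
// tp[j]: start of the suffix whose second keyword has rank j.
// Returns sa sorted by (first keyword, second keyword).
std::vector<int> sortByFirstKeyword(const std::vector<int>& rk,
                                    const std::vector<int>& tp, int m) {
    int n = static_cast<int>(rk.size()) - 1;
    assert(static_cast<int>(tp.size()) == n + 1);
    std::vector<int> tot(m + 1, 0), sa(n + 1, 0);
    for (int j = 1; j <= n; ++j) ++tot[rk[j]];
    for (int j = 1; j <= m; ++j) tot[j] += tot[j - 1];
    // Reverse order: among equal first keywords, the suffix with the larger
    // second keyword takes the later slot, which keeps the sort stable.
    for (int j = n; j >= 1; --j) sa[tot[rk[tp[j]]]--] = tp[j];
    return sa;
}
```

For "banana" in the round with \(w = 1\), rk = 2 1 3 1 3 1 and tp = 6 1 3 5 2 4 give sa = 6 2 4 1 3 5, which is the order of the suffixes by their first two characters (a, an, an, ba, na, na).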

Part II
```cpp
// Code 1.2 (main code, Line 20)
for(int i = 1, p = 0; p < n; i <<= 1, m = p) // i is w above
```

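Here \(p\) is the number of distinct ranks that have been assigned so far. Once \(p\) reaches \(n\), every suffix has its own rank and the loop stops; m = p also sets the bucket range for the next round.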

Part III
```cpp
// Code 1.3 (main code, Line 21~23)
p = 0;
for(int j = 1; j <= i; j++) tp[++p] = n - i + j;
for(int j = 1; j <= n; j++) if(sa[j] > i) tp[++p] = sa[j] - i;
```

This part sorts by the second keyword, that is, it updates tp.
Here sa keeps its meaning from the previous round, and tp[i] is again the starting position of the suffix whose second keyword has rank \(i\).

```cpp
for(int j = 1; j <= i; j++) tp[++p] = n - i + j; // Code 1.3, Line 3
```

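This handles the suffixes \(n - i + 1..n\). These suffixes have no characters \(i + 1..2i\), so they need to be processed separately.
Their second keyword is empty, which is the smallest possible value, so they go to the front of tp. The order among them does not affect the result, because their first keywords are all different.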

```cpp
for(int j = 1; j <= n; j++) if(sa[j] > i) tp[++p] = sa[j] - i; // Code 1.3, Line 4
```

This part sorts the remaining suffixes (i.e. suffixes \(1..n - i\)). We enumerate sa[j], that is, we go through the suffixes in the order of the previous ranking. Suffix sa[j] starts at character \(i + 1\) of suffix sa[j] - i, so it is exactly the second keyword of that suffix. In other words:
Enumerate the suffixes in the order of the previous ranking; for each suffix sa[j] with sa[j] > i, suffix sa[j] - i takes the next position in tp, and \(p\) is incremented.
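A minimal vector-based sketch of this step, for experimenting outside the main code. The function name `buildSecondKeywordOrder` is an assumption of this example, and `w` plays the role of i.

```cpp
#include <cassert>
#include <vector>

// Builds tp for width w from the previous round's sa (both 1-based).
std::vector<int> buildSecondKeywordOrder(const std::vector<int>& sa, int w) {
    int n = static_cast<int>(sa.size()) - 1;
    assert(1 <= w && w <= n);
    std::vector<int> tp(n + 1, 0);
    int p = 0;
    // Suffixes n-w+1..n have an empty second keyword, so they go first.
    for (int j = 1; j <= w; ++j) tp[++p] = n - w + j;
    // Suffix sa[j] is the second keyword of suffix sa[j] - w.
    for (int j = 1; j <= n; ++j)
        if (sa[j] > w) tp[++p] = sa[j] - w;
    assert(p == n);  // every suffix appears exactly once
    return tp;
}
```

For "banana", the initial sort by first character gives sa = 2 4 6 1 3 5, and with w = 1 this produces tp = 6 1 3 5 2 4.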

Part IV
```cpp
// Code 1.4 (main code, Line 26~28)
rk[sa[1]] = p = 1;
for(int j = 2; j <= n; j++)
    rk[sa[j]] = ((tp[sa[j - 1]] == tp[sa[j]] && tp[sa[j - 1] + i] == tp[sa[j] + i]) ? p : ++p);
```

This updates \(rk\). The meaning of each array:

• rk[i] - rank of suffix \(i\) in this round
• sa[i] - starting position of the suffix of rank \(i\) in this round
• tp[i] - rank of suffix \(i\) in the previous round
• \(p\) - number of distinct ranks assigned so far

Let's go through this part as a whole.

Line 2 sets the rank of the rank-\(1\) suffix, and \(p\), to \(1\). (At this point \(p\) is still just a counter; once the loop finishes it is the number of distinct ranks.)
Then each pair of adjacent suffixes in sa is compared using the previous round's \(rk\) (now stored in tp): if both keywords are equal, the two suffixes share a rank; otherwise the number of distinct ranks \(p\) is increased by one.
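The re-ranking step can also be sketched on its own. The function name `rerank` and the explicit bounds check in `at` are assumptions of this example; the main code avoids the check by relying on the arrays being zero outside \(1..n\).

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Recomputes ranks after a round of width w. sa and old are 1-based;
// old[i] is the previous-round rank of suffix i.
// Returns {new rk, number of distinct ranks p}.
std::pair<std::vector<int>, int> rerank(const std::vector<int>& sa,
                                        const std::vector<int>& old, int w) {
    int n = static_cast<int>(sa.size()) - 1;
    assert(n >= 1 && static_cast<int>(old.size()) == n + 1);
    // A position past the end of the string counts as rank 0.
    auto at = [&](int i) { return i <= n ? old[i] : 0; };
    std::vector<int> rk(n + 1, 0);
    int p = 1;
    rk[sa[1]] = 1;
    for (int j = 2; j <= n; ++j) {
        bool same = at(sa[j - 1]) == at(sa[j]) &&
                    at(sa[j - 1] + w) == at(sa[j] + w);
        rk[sa[j]] = same ? p : ++p;  // equal keyword pairs share a rank
    }
    return {rk, p};
}
```

For "banana" with w = 1, sa = 6 2 4 1 3 5 and previous ranks 2 1 3 1 3 1 give the new ranks 3 2 4 2 4 1 with p = 4, so another round is needed.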

Line-compressed implementation

No.

```cpp
const int N = 1e6 + 5;

char s[N];
int n, m;

int sa[N << 1], rk[N << 1], tot[N], tp[N];
void calcSA() {
    m = 'z';
    // i == 0 is the initial pass (sort by first character); p is still 0 after it,
    // so m keeps its value 'z' until real ranks have been assigned
    for(int i = 0, p = 0; p < n; i = (i ? i << 1 : 1), m = (p ? p : m)) {
        if(i) {
            p = 0;
            for(int j = 1; j <= i; j++) tp[++p] = n - i + j;
            for(int j = 1; j <= n; j++) if(sa[j] > i) tp[++p] = sa[j] - i;
        } else for(int j = 1; j <= n; j++) rk[j] = s[j], tp[j] = j;
        for(int j = 0; j <= m; j++) tot[j] = 0;
        for(int j = 1; j <= n; j++) tot[rk[j]]++;
        for(int j = 1; j <= m; j++) tot[j] += tot[j - 1]; // start at 1 so that tot[-1] is never read
        for(int j = n; j >= 1; j--) sa[tot[rk[tp[j]]]--] = tp[j];
        if(!i) continue;
        for(int j = 1; j <= n; j++) tp[j] = rk[j];
        rk[sa[1]] = p = 1;
        for(int j = 2; j <= n; j++)
            rk[sa[j]] = ((tp[sa[j - 1]] == tp[sa[j]] && tp[sa[j - 1] + i] == tp[sa[j] + i]) ? p : ++p);
    }
}
```

height array

definition

LCP definition

\(lcp(s, t)\) denotes the length of the longest common prefix of the strings \(s\) and \(t\), that is, the largest \(i\) such that \(s[1..i] = t[1..i]\).

height array definition

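\(height[i] = lcp(sa[i], sa[i - 1])\), i.e. \(height[i]\) is the LCP of the suffix of rank \(i\) and the suffix of rank \(i - 1\) (here \(lcp\) applied to two positions means the LCP of the suffixes starting there). By convention \(height[1] = 0\).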
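Read literally, the definition can be computed as in the following sketch, which is \(O(n^2)\) in the worst case. The function names `lcpOfSuffixes` and `naiveHeight` are assumptions of this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Length of the longest common prefix of suffix i and suffix j of t (1-based).
int lcpOfSuffixes(const std::string& t, int i, int j) {
    int n = static_cast<int>(t.size());
    assert(1 <= i && i <= n && 1 <= j && j <= n);
    int k = 0;
    while (i + k <= n && j + k <= n && t[i + k - 1] == t[j + k - 1]) ++k;
    return k;
}

// height[i] = lcp(suffix sa[i], suffix sa[i-1]), with height[1] = 0.
// sa is 1-based (index 0 unused), and so is the returned vector.
std::vector<int> naiveHeight(const std::string& t, const std::vector<int>& sa) {
    int n = static_cast<int>(t.size());
    std::vector<int> height(n + 1, 0);
    for (int i = 2; i <= n; ++i) height[i] = lcpOfSuffixes(t, sa[i], sa[i - 1]);
    return height;
}
```

For "banana" with sa = 6 4 2 1 5 3 this gives height = 0 1 3 0 0 2; for example, the suffixes of rank 3 and 2 are anana and ana, which share 3 characters.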

lemma

Lemma: \(height[rk[i]] \ge height[rk[i - 1]] - 1\). That is, the LCP of suffix \(i\) with the suffix ranked immediately before it is at least the LCP of suffix \(i - 1\) with the suffix ranked immediately before it, minus \(1\).

Proof: (I'm too lazy to write the proof myself, so I went and found the one on OI Wiki. Why is it OI Wiki again?)
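For reference, here is a sketch of the standard argument for this lemma, written in this article's notation. It is a summary, not a quotation of the OI Wiki proof, and the symbols \(k\) and \(j\) are introduced only for this sketch.

```latex
Let $k = height[rk[i-1]]$. If $k \le 1$, the claim holds because $height[rk[i]] \ge 0$.
Otherwise let $j = sa[rk[i-1] - 1]$ be the suffix ranked just before suffix $i-1$. Then
\begin{align*}
  s[j..j+k-1] &= s[i-1..i+k-2], & \text{suffix } j &< \text{suffix } i-1 .
\end{align*}
Since $k \ge 1$, both suffixes start with the same character; removing it gives
\begin{align*}
  s[j+1..j+k-1] &= s[i..i+k-2], & \text{suffix } j+1 &< \text{suffix } i ,
\end{align*}
so $lcp(j+1,\, i) \ge k-1$. Every suffix ranked between suffix $j+1$ and suffix $i$
must share those $k-1$ characters with suffix $i$, in particular the suffix
$sa[rk[i]-1]$ ranked just before suffix $i$. Hence
\[
  height[rk[i]] \ge k - 1 = height[rk[i-1]] - 1 .
\]
```

This is what allows the code below to decrease \(k\) by at most one per step instead of recomputing each LCP from scratch.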

code implementation

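```cpp
int ht[N]; // ht[] is height

void get_height() {
    for(int i = 1, k = 0; i <= n; i++) {
        if(rk[i] == 1) { ht[rk[i]] = 0; continue; }
        if(k) k--; // by the lemma, the LCP drops by at most 1 from suffix i - 1 to suffix i
        int j = sa[rk[i] - 1]; // the suffix ranked just before suffix i
        while(i + k <= n && j + k <= n && s[i + k] == s[j + k]) k++;
        ht[rk[i]] = k;
    }
}
```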

reference material

OI Wiki

Tags: string

Posted on Sun, 31 Oct 2021 01:41:41 -0400 by optik