Troubleshooting a problem caused by ES wildcard queries

When developers with an RDBMS/SQL background first come to Elasticsearch, they often reach for the wildcard query to implement fuzzy matching, because it is the query type closest to the LIKE operator in SQL and feels familiar. However, misusing the wildcard query can have disastrous consequences.

Reproducing the problem

Create an index with only one document

POST test_index/type1/?refresh=true
{
  "foo": "bar"
} 

Run a wildcard query whose value is a long string with the wildcard * at the beginning and at the end

POST /test_index/_search
{
  "query": {
    "wildcard": {
      "foo": {
        "value": "*Gently I left, just as I came gently; I gently waved goodbye to the clouds in the West.
                  The golden willow by the river is the bride in the sunset; The bright shadow in the wave light rippled in my heart.
                  The green fungus on the soft mud swaggers under the water; In the gentle waves of Kang River, I am willing to be a water plant!
                  The pool under the shade of elms is not a clear spring, but a rainbow in the sky; Crumpled in the floating algae, precipitating a rainbow like dream.
                  The Dream Pursued? Hold a long pole and go back to the greener part of the grass; A boat full of stars, singing in the beauty of stars.
                  But I can't play songs. Silence is a farewell Sheng and Xiao; Xia Chong is also silent for me. Silence is Cambridge tonight!
                  I left quietly, just as I came quietly; I waved my sleeves and didn't take away a cloud.*"
      }
    }
  }
}

View results

{
    "took": 3445,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

Even though there are no hits, the query takes a startling 3.4 seconds (the test machine is a MacBook Pro with an i7 CPU), and CPU usage spikes while it runs.

Production queries are far more complex than this example, with several fields queried at once. In actual tests, a single query could run for more than ten seconds. When enough long-string queries like this arrive, the cluster can effectively suffer a denial of service (DoS).

Digging into the root cause

Why is this query so expensive against an index that holds a single document? Intuitively, it should return instantly. Before answering that, try another test: keep increasing the length of the query string, and past a certain length ES throws an exception outright. The cause reported in the ES server log is as follows:
Caused by: org.apache.lucene.util.automaton.TooComplexToDeterminizeException: 
  Determinizing automaton with 22082 states and 34182 transitions would result in more than 10000 states.
  at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:741) ~[lucene-core-6.4.1.jar:6.4.1
The exception comes from the package org.apache.lucene.util.automaton. Taken literally, it says that the automaton is too complex to determinize: determinizing an automaton with this many states and transitions would produce more than the upper limit of 10000 states.
It turns out that, to speed up wildcard and regular expression matching, Lucene has since version 4.0 compiled the input pattern into a DFA (deterministic finite automaton). The DFA built from a pattern containing wildcards can be very complex and therefore expensive to construct. For example, the DFA constructed from a*bc looks like the following figure.

[Figure: DFA constructed from the pattern a*bc]

Looking at the relevant code in Lucene, the construction proceeds as follows:

1. The toAutomaton method in org.apache.lucene.search.WildcardQuery walks the input wildcard pattern, turns each character into a small automaton, and then concatenates these per-character automata into a new automaton.

public static Automaton toAutomaton(Term wildcardquery) {
    List<Automaton> automata = new ArrayList<>();
    
    String wildcardText = wildcardquery.text();
    
    for (int i = 0; i < wildcardText.length();) {
      final int c = wildcardText.codePointAt(i);
      int length = Character.charCount(c);
      switch(c) {
        case WILDCARD_STRING: 
          automata.add(Automata.makeAnyString());
          break;
        case WILDCARD_CHAR:
          automata.add(Automata.makeAnyChar());
          break;
        case WILDCARD_ESCAPE:
          // add the next codepoint instead, if it exists
          if (i + length < wildcardText.length()) {
            final int nextChar = wildcardText.codePointAt(i + length);
            length += Character.charCount(nextChar);
            automata.add(Automata.makeChar(nextChar));
            break;
          } // else fallthru, lenient parsing with a trailing \
        default:
          automata.add(Automata.makeChar(c));
      }
      i += length;
    }
    
    return Operations.concatenate(automata);
  }

At this point the generated state machine is nondeterministic, that is, a nondeterministic finite automaton (NFA).

2. The determinize method in the org.apache.lucene.util.automaton.Operations class then converts the NFA into a DFA.

/**
   * Determinizes the given automaton.
   * <p>
   * Worst case complexity: exponential in number of states.
   * @param maxDeterminizedStates Maximum number of states created when
   *   determinizing.  Higher numbers allow this operation to consume more
   *   memory but allow more complex automatons.  Use
   *   DEFAULT_MAX_DETERMINIZED_STATES as a decent default if you don't know
   *   how many to allow.
   * @throws TooComplexToDeterminizeException if determinizing a creates an
   *   automaton with more than maxDeterminizedStates
   */
  public static Automaton determinize(Automaton a, int maxDeterminizedStates){ 
The code comment says that the worst-case complexity of this process is exponential in the number of states! To keep an explosion of states from consuming too much memory and CPU, the class caps the maximum number of states:
/**
* Default maximum number of states that {@link Operations#determinize} should create.
*/
public static final int DEFAULT_MAX_DETERMINIZED_STATES = 10000;
When there are wildcards at the beginning and end and the string is very long, this determinization step can produce a large number of states and even exceed the upper limit.
What is the difference between an NFA and a DFA, and how is one converted into the other? Roughly speaking, an NFA may move from one state to several states on a given input, whereas a DFA has exactly one possible next state, which is why a DFA is faster at string matching. But although a DFA is fast to search with, it can be expensive to construct, especially for a leading wildcard followed by a long string. The official Elasticsearch documentation has a special note on the wildcard query, advising against terms that start with a wildcard:
"Note that this query can be slow, as it needs to iterate over many terms.
In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?."
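
To make the exponential worst case more concrete, here is a minimal sketch of the NFA-to-DFA subset construction for a simplified wildcard syntax. It is not Lucene's implementation: the class WildcardDfaSize and its methods are hypothetical names, escape characters are ignored, and the alphabet is reduced to the literal characters of the pattern plus one stand-in for "any other character". NFA states are simply positions in the pattern, and the sketch only counts how many distinct DFA states (sets of positions) the construction reaches.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class WildcardDfaSize {

    // Adds NFA position i to the set and follows "empty" moves: a '*' at
    // position i may match nothing, so position i + 1 is reachable too.
    static void addWithClosure(String pattern, BitSet set, int i) {
        while (true) {
            set.set(i);
            if (i < pattern.length() && pattern.charAt(i) == '*') {
                i++;
            } else {
                break;
            }
        }
    }

    // Computes the set of NFA positions reachable from 'from' on character c.
    static BitSet step(String pattern, BitSet from, char c) {
        BitSet to = new BitSet();
        for (int i = from.nextSetBit(0); i >= 0; i = from.nextSetBit(i + 1)) {
            if (i == pattern.length()) {
                continue; // accepting position: no outgoing moves
            }
            char p = pattern.charAt(i);
            if (p == '*') {
                addWithClosure(pattern, to, i);      // '*' consumes c and stays
            } else if (p == '?' || p == c) {
                addWithClosure(pattern, to, i + 1);  // '?' or a matching literal advances
            }
        }
        return to;
    }

    // Subset construction: every distinct, non-empty set of NFA positions
    // becomes one DFA state. Returns how many such states are reachable.
    static int countDfaStates(String pattern) {
        Set<Character> alphabet = new TreeSet<>();
        for (char ch : pattern.toCharArray()) {
            if (ch != '*' && ch != '?') {
                alphabet.add(ch);
            }
        }
        alphabet.add('\u0000'); // assumption: stands in for every other character

        BitSet start = new BitSet();
        addWithClosure(pattern, start, 0);

        Set<BitSet> seen = new HashSet<>();
        Deque<BitSet> work = new ArrayDeque<>();
        seen.add(start);
        work.add(start);
        while (!work.isEmpty()) {
            BitSet current = work.poll();
            for (char c : alphabet) {
                BitSet next = step(pattern, current, c);
                if (!next.isEmpty() && seen.add(next)) {
                    work.add(next);
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countDfaStates("*abc*")); // 6
        StringBuilder pattern = new StringBuilder("*a");
        for (int k = 0; k < 9; k++) {
            pattern.append('?');
        }
        System.out.println(countDfaStates(pattern.toString())); // 1024
    }
}
```

In this toy model, a leading * keeps the start position alive in every DFA state, so the construction has to remember every partial match at once. For *a followed by nine ? characters the count reaches 1024, and each additional ? doubles it. That is the kind of growth the DEFAULT_MAX_DETERMINIZED_STATES limit guards against.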

Summary:

A wildcard query should not start with a wildcard. If it has to, you must limit the length of the string the user can enter. Better still, change the implementation: do the work at index time by choosing a suitable tokenizer, such as the nGram tokenizer, to preprocess the data, and then use a cheaper term query to get the same fuzzy-search behavior. For input-suggestion scenarios, prefer query methods with better performance and slightly less fuzziness, such as the completion suggester and the phrase/term suggesters. Only when the suggester returns no match should you fall back to fuzzier but slower queries such as wildcard, regexp and fuzzy.
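
As a rough illustration of the index-time approach, the sketch below shows conceptually what an n-gram tokenizer produces. NGramSketch and ngrams are hypothetical names, not Elasticsearch APIs, and the code works on Java chars rather than on analyzed tokens, so treat it as a model of the idea rather than of the real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {

    // Returns every substring of text whose length lies between minGram and
    // maxGram (inclusive), ordered by start offset and then by length.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= text.length(); len++) {
                grams.add(text.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("bar", 2, 3)); // [ba, bar, ar]
    }
}
```

Because every substring within the configured length range is stored as its own term at index time, a search for such a substring becomes an exact term lookup and no automaton has to be built at query time. The trade-off is a larger index.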
So, do regexp and fuzzy queries have the same problem? The answer is yes. Under the hood, like wildcard, they speed up string matching by compiling the pattern into a DFA.

Solution

The simplest way to solve this problem is to restrict the keywords users may search for. Baidu and Taobao handle it this way.
If you must show users something, you can analyze popular search terms, or return some popular products instead.
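
One way to apply such a restriction before a wildcard query ever reaches the cluster is a small validation step in the application layer. The sketch below is only an assumption about how that might look: WildcardGuard, isAcceptable and the limit of 32 characters are hypothetical and should be tuned to your own workload.

```java
class WildcardGuard {

    // Hypothetical cap on pattern length; not a value taken from Elasticsearch.
    static final int MAX_PATTERN_LENGTH = 32;

    // Accepts a user-supplied wildcard pattern only if it is non-empty,
    // short enough, and does not start with a wildcard character.
    static boolean isAcceptable(String pattern) {
        if (pattern == null || pattern.isEmpty()) {
            return false;
        }
        if (pattern.length() > MAX_PATTERN_LENGTH) {
            return false;
        }
        char first = pattern.charAt(0);
        return first != '*' && first != '?';
    }

    public static void main(String[] args) {
        System.out.println(isAcceptable("bar*")); // true
        System.out.println(isAcceptable("*bar")); // false
    }
}
```

Rejected input can then be answered with popular search terms or products instead of being sent to Elasticsearch as a wildcard query.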

Posted on Sun, 05 Dec 2021 00:23:59 -0500 by mapostel