A complete guide to regular expressions

Regular expressions define patterns for strings that can be used to search, edit, or process text.

1, Regular expression basics

1.1 Metacharacters

Metacharacters are one of the basic elements in constructing regular expressions.

  • Several commonly used metacharacters:
Metacharacter Explanation
. Matches any character except a newline
\w Matches a letter, digit, or underscore (in Java this is [a-zA-Z_0-9] by default; Chinese characters match only when the UNICODE_CHARACTER_CLASS flag is set)
\s Matches any whitespace character
\d Matches a digit
\b Matches the beginning or end of a word
^ Matches the start of the string
$ Matches the end of the string
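As a quick illustration, the sketch below tries a few of the metacharacters above with java.util.regex. The class name MetacharacterDemo and the helper found() are made up for this example. It assumes Java's default matching mode, in which \w covers only ASCII letters, digits, and the underscore.

```java
import java.util.regex.Pattern;

public class MetacharacterDemo {

    // Hypothetical helper: true if the regex matches anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("a.c", "abc"));             // true:  . stands for 'b'
        System.out.println(found("a.c", "a\nc"));            // false: . skips newlines by default
        System.out.println(found("\\d\\d", "room 42"));      // true:  two consecutive digits
        System.out.println(found("\\bcat\\b", "a cat"));     // true:  "cat" as a whole word
        System.out.println(found("\\bcat\\b", "concat"));    // false: "cat" inside another word
        System.out.println(found("^\\w+$", "user_01"));      // true:  only word characters
        System.out.println(found("^\\w+$", "\u4e2d\u6587")); // false: two Chinese characters
    }
}
```

Note that every backslash is doubled, because the pattern is written as a Java string literal.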

1.2 Quantifiers

A quantifier specifies how many times the element before it may repeat.

Syntax Explanation
* Repeat zero or more times
+ Repeat one or more times
? Repeat zero or one time
{n} Repeat exactly n times
{n,} Repeat n or more times
{n,m} Repeat n to m times
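The table can be checked with a small sketch like the one below. QuantifierDemo and whole() are hypothetical names; whole() simply wraps Pattern.matches, which requires the entire input to match.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {

    // Hypothetical helper: true only if the regex matches the entire input.
    public static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("ab*", "a"));          // true:  zero 'b' is allowed
        System.out.println(whole("ab+", "a"));          // false: at least one 'b' is required
        System.out.println(whole("colou?r", "color"));  // true:  'u' is optional
        System.out.println(whole("\\d{3}", "123"));     // true:  exactly three digits
        System.out.println(whole("\\d{3}", "1234"));    // false: four digits
        System.out.println(whole("\\d{2,}", "1234"));   // true:  two or more digits
        System.out.println(whole("\\d{2,4}", "12345")); // false: more than four digits
    }
}
```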

1.3 Grouping

A quantifier applies only to the single character immediately to its left. What if you want it to apply to several characters at once?

Regular expressions group with parentheses (): everything inside the parentheses is treated as a single unit.
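A minimal sketch of the difference, using made-up names (GroupingDemo, whole(), yearOf()). It also shows that a group captures the text it matched, which can be read back with Matcher.group(int).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupingDemo {

    // Hypothetical helper: true only if the regex matches the entire input.
    public static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: returns the first captured group (the year) or null.
    public static String yearOf(String yearMonth) {
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})").matcher(yearMonth);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(whole("ab+", "abab"));   // false: + applies to 'b' only
        System.out.println(whole("ab+", "abbb"));   // true
        System.out.println(whole("(ab)+", "abab")); // true:  + applies to the group "ab"
        System.out.println(yearOf("2020-06"));      // 2020
    }
}
```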

1.4 Escaping

Regular expressions use parentheses for grouping, which raises a question:

If the string to be matched itself contains parentheses, is there a conflict? What should you do?

For this situation, regular expressions provide escaping: a metacharacter, quantifier, or other special symbol is turned into an ordinary character by putting a backslash (\) in front of it.
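The sketch below shows the effect of escaping in Java. EscapingDemo and whole() are illustrative names only. Because the pattern is a Java string literal, the single regex backslash is written as two backslashes.

```java
import java.util.regex.Pattern;

public class EscapingDemo {

    // Hypothetical helper: true only if the regex matches the entire input.
    public static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("(\\d+)", "(42)"));     // false: ( ) form a group here
        System.out.println(whole("\\(\\d+\\)", "(42)")); // true:  escaped, so literal parentheses
        System.out.println(whole("1.5", "1x5"));         // true:  . matches any character
        System.out.println(whole("1\\.5", "1x5"));       // false: escaped . is a literal dot
        System.out.println(whole("1\\.5", "1.5"));       // true
    }
}
```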

1.5 Alternation (or)

Alternation is written with the | symbol and is also known as a branch condition. If any one of the branches matches, the whole expression is treated as a successful match.
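A short sketch of alternation, with hypothetical names (AlternationDemo, whole(), found()). The last two lines illustrate a common pitfall: | has the lowest precedence, so anchors bind to a single branch unless the branches are wrapped in a group.

```java
import java.util.regex.Pattern;

public class AlternationDemo {

    // Hypothetical helper: true only if the regex matches the entire input.
    public static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: true if the regex matches anywhere in the input.
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(whole("cat|dog", "dog"));           // true
        System.out.println(whole("gr(a|e)y", "grey"));         // true
        System.out.println(found("^cat|dog$", "catalog"));     // true:  the branch ^cat matches
        System.out.println(found("^(cat|dog)$", "catalog"));   // false: anchors cover both branches
    }
}
```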

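1.6 Character classes

Regular expressions use square brackets [] to describe a set or range of allowed characters.

  1. Any digit from 0 to 9 is written [0-9];
  2. Any uppercase letter from A to Z is written [A-Z];
  3. Any one of a specific set of digits, for example 1, 6, or 5, is written [165].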
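The three forms can be tried out as follows. CharClassDemo and whole() are made-up names. The last line shows that a comma inside the brackets is not a separator but one more allowed character.

```java
import java.util.regex.Pattern;

public class CharClassDemo {

    // Hypothetical helper: true only if the regex matches the entire input.
    public static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("[0-9]", "7")); // true
        System.out.println(whole("[A-Z]", "q")); // false: lowercase letter
        System.out.println(whole("[165]", "6")); // true:  6 is in the set
        System.out.println(whole("[165]", "2")); // false: 2 is not in the set
        System.out.println(whole("[5,7]", ",")); // true:  the comma is a member of the set
    }
}
```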

1.7 Negation

So far we have described what to match. Sometimes you want the opposite: to match everything except certain characters.

  • Common negated forms:
Metacharacter Explanation
\W Matches any character that is not a letter, digit, or underscore (the opposite of \w)
\S Matches any character that is not whitespace
\D Matches any character that is not a digit
\B Matches a position that is not the beginning or end of a word
[^x] Matches any character except x
[^aeiou] Matches any character except the letters a, e, i, o, u
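One way to see what the negated forms select is to delete everything they match. NegationDemo and strip() are hypothetical names for this sketch; strip() relies on the standard String.replaceAll.

```java
import java.util.regex.Pattern;

public class NegationDemo {

    // Hypothetical helper: removes every substring matched by the regex.
    public static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("\\D", "a1 b2"));             // 12   (non-digits removed)
        System.out.println(strip("\\W", "a-b_c!"));            // ab_c (non-word characters removed)
        System.out.println(strip("[^aeiou]", "hello world"));  // eoo  (only vowels remain)
        // \B requires a position inside a word, so "cat" must not start a word.
        System.out.println(Pattern.compile("\\Bcat").matcher("concat").find()); // true
        System.out.println(Pattern.compile("\\Bcat").matcher("cat").find());    // false
    }
}
```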

2, Java regular expression basics

2.1 Pattern class

  • Definition

A Pattern object is the compiled representation of a regular expression. Its constructor is private, so it cannot be instantiated directly; instead, the static factory method Pattern.compile(String regex) creates one.

  • Common methods
  1. compile(String regex): compiles a regular expression into a Pattern object;
  2. pattern(): returns the regular expression as a string;
  3. split(CharSequence input): splits the input around matches of the pattern and returns a String[] (in the JDK, String.split(String regex) generally delegates to Pattern.split(CharSequence input));
  4. matches(String regex, CharSequence input): a static shortcut that compiles the regex and checks whether it matches the entire input. It is suitable for one-off checks;
  5. matcher(CharSequence input): returns a Matcher object for the input.

The Matcher class has no public constructor either. An instance can only be obtained through the Pattern.matcher(CharSequence input) method.

  • Summary

On its own, the Pattern class supports only simple matching operations. For more powerful and convenient matching, use Pattern together with Matcher.
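The Pattern methods listed above might be used as in the following sketch. PatternDemo and splitOnCommas() are names invented for the example; the Pattern calls themselves are standard JDK API.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class PatternDemo {

    // Compiled once and reused; the pattern allows optional spaces around the comma.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");

    // Hypothetical helper: splits a comma-separated line into trimmed fields.
    public static String[] splitOnCommas(String input) {
        return COMMA.split(input);
    }

    public static void main(String[] args) {
        System.out.println(COMMA.pattern());                            // \s*,\s*
        System.out.println(Arrays.toString(splitOnCommas("a , b,c"))); // [a, b, c]
        System.out.println(Pattern.matches("\\d+", "12345"));           // true
        System.out.println(Pattern.matches("\\d+", "123ab"));           // false: whole input must match
    }
}
```

Keeping a compiled Pattern in a constant avoids recompiling the same expression on every call, which is worth considering when a pattern is used repeatedly.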

2.2 Matcher class

  • Definition

A Matcher object is the engine that interprets the pattern and performs match operations on an input string. Like the Pattern class, Matcher has no public constructor. Call the matcher() method of a Pattern object to obtain one.

  • Common methods

The Matcher class provides three matching methods, each returning a boolean.

  1. matches(): matches the entire string. Returns true only if the whole input matches the pattern;
  2. lookingAt(): matches from the beginning of the string. Returns true if a prefix of the input matches; the rest of the input need not match;
  3. find(): searches for the next match. The matched substring can be anywhere in the input.
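The difference between the three methods can be seen side by side in the sketch below. MatchModesDemo and probe() are hypothetical names. A fresh Matcher is created for each call on purpose, because find() on a reused Matcher continues after the previous successful match.

```java
import java.util.regex.Pattern;

public class MatchModesDemo {

    // Hypothetical helper: reports the result of all three matching methods.
    public static String probe(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        // A new Matcher per call keeps the three checks independent of each other.
        boolean matches = p.matcher(input).matches();
        boolean lookingAt = p.matcher(input).lookingAt();
        boolean find = p.matcher(input).find();
        return String.format("matches=%b lookingAt=%b find=%b", matches, lookingAt, find);
    }

    public static void main(String[] args) {
        System.out.println(probe("\\d+", "123"));    // matches=true lookingAt=true find=true
        System.out.println(probe("\\d+", "123abc")); // matches=false lookingAt=true find=true
        System.out.println(probe("\\d+", "abc123")); // matches=false lookingAt=false find=true
    }
}
```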

2.3 More Matcher methods

After a successful matches(), lookingAt(), or find(), the following three methods return details about the match:

  1. start(): returns the index of the first character of the matched substring;

  2. end(): returns the index just past the last character of the matched substring (the index of the last character plus one);

  3. group(): returns the matched substring.
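A small sketch of these three methods in a find() loop. MatchPositionDemo and describe() are invented names; each match is reported as text@[start,end), where end is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPositionDemo {

    // Hypothetical helper: lists every match together with its start and end offsets.
    public static List<String> describe(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group() + "@[" + m.start() + "," + m.end() + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [0571@[5,9), 69123456@[10,18)]
        System.out.println(describe("\\d+", "tel: 0571-69123456"));
    }
}
```

Because end() is exclusive, input.substring(m.start(), m.end()) yields the same text as m.group().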

  • The Matcher class also provides four methods for replacing matched substrings with a specified string:
  1. replaceAll();
  2. replaceFirst();
  3. appendReplacement();
  4. appendTail().
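replaceAll() and replaceFirst() take a fixed replacement, while appendReplacement() and appendTail() are used together when the replacement has to be computed for each match. The sketch below doubles every number in a string; ReplaceDemo and doubleNumbers() are hypothetical names, and StringBuffer is used because that overload of appendReplacement exists in every JDK version.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceDemo {

    // Hypothetical helper: replaces each run of digits with twice its value.
    public static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before the match, then the computed replacement.
            m.appendReplacement(sb, String.valueOf(doubled));
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "3 apples and 10 pears";
        Matcher m = Pattern.compile("\\d+").matcher(input);
        System.out.println(m.replaceAll("#"));    // # apples and # pears
        System.out.println(m.replaceFirst("#"));  // # apples and 10 pears
        System.out.println(doubleNumbers(input)); // 6 apples and 20 pears
    }
}
```

In a replacement string, $ and \ have special meaning; Matcher.quoteReplacement(String) can be used when the replacement should be taken literally.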
  • Summary

The Matcher class provides grouping support for regular expressions and multiple matching support for regular expressions.

3, [Recommended] Regex utility class

3.1 Regular expression constants

In order to manage regular expressions, a constant class is used to store regular expressions.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RegexConstant {

    /**
     * Regular: mobile number (simple)
     */
    public static final String REGEX_MOBILE_SIMPLE = "^[1]\\d{10}$";
    /**
     * Regular: mobile number (accurate)
     * <p>China Mobile: 134 (0-8), 135, 136, 137, 138, 139, 147, 150, 151, 152, 157, 158, 159, 178, 182, 183, 184, 187, 188</p>
     * <p>China Unicom: 130, 131, 132, 145, 155, 156, 175, 176, 185, 186</p>
     * <p>China Telecom: 133, 153, 173, 177, 180, 181, 189</p>
     * <p>Globalstar: 1349</p>
     * <p>Virtual operators: 170</p>
     */
    public static final String REGEX_MOBILE_EXACT = "^((13[0-9])|(14[57])|(15[0-35-9])|(17[035-8])|(18[0-9]))\\d{8}$";
    /**
     * Regular: phone number
     */
    public static final String REGEX_TEL = "^0\\d{2,3}[- ]?\\d{7,8}$";
    /**
     * Regular: 15-digit ID card number
     */
    public static final String REGEX_ID_CARD15 = "^[1-9]\\d{7}((0\\d)|(1[0-2]))(([0-2]\\d)|3[0-1])\\d{3}$";
    /**
     * Regular: 18-digit ID card number
     */
    public static final String REGEX_ID_CARD18 = "^[1-9]\\d{5}[1-9]\\d{3}((0\\d)|(1[0-2]))(([0-2]\\d)|3[0-1])\\d{3}([0-9Xx])$";
    /**
     * Regular: email address
     */
    public static final String REGEX_EMAIL = "^\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*$";
    /**
     * Regular: URL
     */
    public static final String REGEX_URL = "[a-zA-Z]+://[^\\s]*";
    /**
     * Regular: Chinese characters
     */
    public static final String REGEX_ZH = "^[\\u4e00-\\u9fa5]+$";
    /**
     * Regular: user name; allowed characters are a-z, A-Z, 0-9, "_" and Chinese characters; must not end with "_"; length must be 6-20 characters
     */
    public static final String REGEX_USERNAME = "^[\\w\\u4e00-\\u9fa5]{6,20}(?<!_)$";
    /**
     * Regular: date in yyyy-MM-dd format, with leap years taken into account
     */
    public static final String REGEX_DATE = "^(?:(?!0000)[0-9]{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-8])|(?:0[13-9]|1[0-2])-(?:29|30)|(?:0[13578]|1[02])-31)|(?:[0-9]{2}(?:0[48]|[2468][048]|[13579][26])|(?:0[48]|[2468][048]|[13579][26])00)-02-29)$";
    /**
     * Regular: IP address
     */
    public static final String REGEX_IP = "((2[0-4]\\d|25[0-5]|[01]?\\d\\d?)\\.){3}(2[0-4]\\d|25[0-5]|[01]?\\d\\d?)";
    /**
     * Regular: double byte characters (including Chinese characters)
     */
    public static final String REGEX_DOUBLE_BYTE_CHAR = "[^\\x00-\\xff]";
    /**
     * Regular: blank line
     */
    public static final String REGEX_BLANK_LINE = "\\n\\s*\\r";
    /**
     * Regular: QQ number
     */
    public static final String REGEX_QQ = "[1-9][0-9]{4,}";
    /**
     * Regular: Chinese postal code
     */
    public static final String REGEX_ZIP_CODE = "[1-9]\\d{5}(?!\\d)";
    /**
     * Regular: positive integer
     */
    public static final String REGEX_POSITIVE_INTEGER = "^[1-9]\\d*$";
    /**
     * Regular: negative integer
     */
    public static final String REGEX_NEGATIVE_INTEGER = "^-[1-9]\\d*$";
    /**
     * Regular: integer (zero itself is not matched)
     */
    public static final String REGEX_INTEGER = "^-?[1-9]\\d*$";
    /**
     * Regular: non negative integer (positive integer + 0)
     */
    public static final String REGEX_NOT_NEGATIVE_INTEGER = "^([1-9]\\d*|0)$";
    /**
     * Regular: non positive integer (negative integer + 0)
     */
    public static final String REGEX_NOT_POSITIVE_INTEGER = "^(-[1-9]\\d*|0)$";
    /**
     * Regular: positive floating point
     */
    public static final String REGEX_POSITIVE_FLOAT = "^([1-9]\\d*\\.\\d*|0\\.\\d*[1-9]\\d*)$";
    /**
     * Regular: negative floating point
     */
    public static final String REGEX_NEGATIVE_FLOAT = "^-([1-9]\\d*\\.\\d*|0\\.\\d*[1-9]\\d*)$";
    /**
     * Regular: only numbers
     */
    public static final String REGEX_NUMBER = "[0-9]*";
    /**
     * Regular: letters, numbers, underscores and hyphens (at least two characters, not starting with an underscore)
     */
    public static final String REGEX_NUMBER_LETTER = "^[0-9a-zA-Z-][\\w-_]{1,}$";
    /**
     * Regular: only letters
     */
    public static final String REGEX_LETTER = "^[A-Za-z]+$";
    /**
     * Regular: whether the string contains parentheses, square brackets or braces
     */
    public static final String REGEX_BRACKETS = ".*[()\\[\\]{}()]+.*";
    /**
     * Regex metacharacters that need to be escaped
     */
    private final static Character[] KEYS_ARRAY = {'$', '(', ')', '*', '+', '.', '[', ']', '?', '\\', '^', '{', '}', '|'};

    public final static Set<Character> RE_KEYS = new HashSet<>(Arrays.asList(KEYS_ARRAY));

}
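As a spot check of the leap-year handling claimed for the date pattern, the following sketch copies the REGEX_DATE string from the class above into a local constant and tries a few dates. DateRegexDemo and isDate() are names made up for this example and are not part of the utility class.

```java
import java.util.regex.Pattern;

public class DateRegexDemo {

    // Copy of RegexConstant.REGEX_DATE, compiled once for reuse.
    private static final Pattern DATE = Pattern.compile(
            "^(?:(?!0000)[0-9]{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-8])"
            + "|(?:0[13-9]|1[0-2])-(?:29|30)|(?:0[13578]|1[02])-31)"
            + "|(?:[0-9]{2}(?:0[48]|[2468][048]|[13579][26])"
            + "|(?:0[48]|[2468][048]|[13579][26])00)-02-29)$");

    // Hypothetical helper: true if the input is a valid yyyy-MM-dd date.
    public static boolean isDate(String input) {
        return input != null && DATE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDate("2020-02-29")); // true:  2020 is a leap year
        System.out.println(isDate("2019-02-29")); // false: 2019 is not
        System.out.println(isDate("1900-02-29")); // false: divisible by 100 but not by 400
        System.out.println(isDate("2000-02-29")); // true:  divisible by 400
        System.out.println(isDate("2020-04-31")); // false: April has 30 days
        System.out.println(isDate("2020-12-31")); // true
    }
}
```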

3.2 Check whether the input matches a regex

Check whether the whole input matches the regular expression.

public static boolean isMatch(String regex, CharSequence input) {
    return input != null && input.length() > 0 && Pattern.matches(regex, input);
}
  • Example
@Test
public void isMatch() {
    String input = "0571-69123456";
    boolean isTel =  RegexUtil.isMatch(RegexConstant.REGEX_TEL, input);
    logger.info("result:{}", isTel);
    
    boolean isPhone =  RegexUtil.isMatch(RegexConstant.REGEX_MOBILE_SIMPLE, input);
    logger.info("result:{}", isPhone);

}

3.3 Extract the matched parts

  • Return the result as a string
public static String getMatches(String regex, CharSequence input) {
    if (input == null || input.length() == 0) {
        return "";
    }
    StringBuilder matches = new StringBuilder();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        matches.append(matcher.group());
    }
    return matches.toString();
}
  • Return the result as a list
public static List<Object> getMatchesList(String regex, CharSequence input) {
    List<Object> matches = new ArrayList<>();
    if (input == null || input.length() == 0) {
        // Return an empty list rather than null so callers can iterate safely
        return matches;
    }
    Matcher matcher = Pattern.compile(regex).matcher(input);
    while (matcher.find()) {
        // Skip empty matches, which patterns such as [0-9]* can produce
        if (!matcher.group().isEmpty()) {
            matches.add(matcher.group());
        }
    }
    return matches;
}
  • Example
@Test
public void getMatches() {
    String input = "0571-69123456";
    String result = RegexUtil.getMatches(RegexConstant.REGEX_NUMBER, input);
    logger.info("result:{}", result);
    List<Object> list = RegexUtil.getMatchesList(RegexConstant.REGEX_NUMBER, input);
    logger.info("result:{}", list);
}

3.4 Count the number of pattern matches in the specified string

public static int count(String regex, CharSequence input) {
    if (null == regex || input == null || input.length() == 0) {
        return 0;
    }
    Pattern pattern = Pattern.compile(regex);
    int count = 0;
    final Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
        if (!matcher.group().isEmpty()) {
            count++;
        }
    }
    return count;
}
  • Example
@Test
public void count() {
    String input = "0571-69123456";
    int result = RegexUtil.count(RegexConstant.REGEX_NUMBER, input);
    logger.info("result:{}", result);
}

3.5 Replace regex matches

  • Replace all matches
public static String replaceAll(String regex, String input, String replacement) {
    if (input == null || input.length() == 0) {
        return "";
    }
    return Pattern.compile(regex).matcher(input).replaceAll(replacement);
}
  • Replace the first match
public static String replaceFirst(String regex, String input, String replacement) {
    if (input == null || input.length() == 0) {
        return "";
    }
    return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
}
  • Delete everything up to and including the first match; if nothing matches, return the input unchanged
public static String delPre(String regex, CharSequence input) {
    if (regex == null || input == null || input.length() == 0) {
        return "";
    }
    Pattern pattern = Pattern.compile(regex, Pattern.DOTALL);
    Matcher matcher = pattern.matcher(input);
    if (matcher.find()) {
        return input.toString().substring(matcher.end(), input.length());
    }
    return input.toString();
}
  • Example
@Test
public void replace() {
    String input = "0571-69123456-hz";
    String result = RegexUtil.replaceAll("-",input,"_");
    logger.info("result:{}", result);
    result = RegexUtil.replaceFirst("-",input,"_");
    logger.info("result:{}", result);
    result = RegexUtil.delPre("-",input);
    logger.info("result:{}", result);
}

3.6 Escape a string

Escape regex metacharacters in a string so that it can be matched literally.

public static String escape(CharSequence input) {
    if (input == null || input.length() == 0) {
        return "";
    }
    final StringBuilder builder = new StringBuilder();
    int len = input.length();
    char current;
    for (int i = 0; i < len; i++) {
        current = input.charAt(i);
        if (RegexConstant.RE_KEYS.contains(current)) {
            builder.append('\\');
        }
        builder.append(current);
    }
    return builder.toString();
}
  • Example
@Test
public void escape() {
    String input = "$123.45";
    String result = RegexUtil.escape(input);
    logger.info("result:{}", result);
}
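The JDK also provides a built-in alternative, Pattern.quote(String), which wraps the text in \Q...\E so that the whole string is treated as a literal. The sketch below compares searching for "$123.45" with and without quoting; QuoteDemo and its two helpers are hypothetical names.

```java
import java.util.regex.Pattern;

public class QuoteDemo {

    // Hypothetical helper: searches for the text literally by quoting it first.
    public static boolean containsLiteral(String input, String literal) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    // Hypothetical helper: uses the text as a regex without any escaping.
    public static boolean containsAsRegex(String input, String regex) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "price: $123.45";
        System.out.println(containsLiteral(input, "$123.45")); // true
        // Unescaped, $ is an end-of-input anchor, so nothing can follow it.
        System.out.println(containsAsRegex(input, "$123.45")); // false
        System.out.println(Pattern.quote("$123.45"));          // \Q$123.45\E
    }
}
```

The character-by-character escape() above produces a pattern that can still be combined with other regex fragments in a readable way, while Pattern.quote is the shorter choice when the whole string is a literal.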

For the complete utility class, see [RegexUtil.java]

4, Summary

Regular expressions are an effective way to manipulate strings. They make development more convenient, but overusing them reduces the readability of the code.

For more Java notes, see [Java knowledge notebook]. Ideas and suggestions are welcome.

  1. [GitHub sample code]
  2. Java online regular expression tool


Tags: Java JDK Mobile github

Posted on Tue, 23 Jun 2020 04:02:40 -0400 by jeanne