A Matching Model for Regular Expressions

1. Model Diagram

Figure 1

2. Model Overview

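During matching, the target string is annotated as in the Figure 1 model: index positions are added to it. Index positions are numbered from 0, and the last one equals the length of the target string.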
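The annotation can be sketched in a few lines of Python. The helper names `index_positions` and `annotate` and the text layout of the diagram are assumptions made for illustration; they are not taken from Figure 1.

```python
def index_positions(target):
    """Return the index positions of a target string: 0 through len(target)."""
    return list(range(len(target) + 1))


def annotate(target):
    """Return a two-line diagram: characters on top, index positions below.

    Hypothetical layout: every cell is 4 columns wide, so each character
    sits between the two index positions that surround it.
    """
    chars = "".join(f"  {ch} " for ch in target)
    positions = "".join(f"{i:<4}" for i in index_positions(target)).rstrip()
    return chars + "\n" + positions


print(annotate("hello world"))
```

A string of length n therefore has n + 1 index positions, which is why the last position equals the length of the string.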
The following points hold for index positions:

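A string matched by a regular expression in the target string corresponds to an index position interval. For example, if “hello” in Figure 1 is the matched string, its index position interval is [0,5).
There are two kinds of index position interval: point intervals and non-point intervals. A point interval is closed on both sides, has length zero, and corresponds to a zero-length match, for example [5,5]. A non-point interval is closed on the left and open on the right, has length greater than zero, and corresponds to a non-zero-length match, for example [5,7).
Index position intervals can overlap. Overlapping examples: [5,7) and [6,8); [5,5] and [5,7). Non-overlapping examples: [5,7) and [7,9); [5,7) and [10,12).
In general, the net effect of regular expression matching is to find, from left to right, a sequence of mutually non-overlapping index position intervals. For example, with the regular expression “[a-z]{2}” and the string in Figure 1 as the target string, the successive matched strings (with their index position intervals) are: he [0,2), ll [2,4), wo [6,8), rl [8,10).
Boundary matchers in a regular expression describe properties of index positions. For example, with the regular expression “\bhel\B” and the string in Figure 1 as the target string, there is one matched string, “hel”: index position 0 of the target string is indeed a word boundary, and index position 3 is indeed a non-word boundary.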
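These points can be checked with a short sketch. It uses Python's `re` module as a stand-in engine, on the assumption that it reports the same spans for these patterns. The names `Interval`, `overlaps`, and `spans` are hypothetical helpers written for this illustration, and the target string “hello world” is inferred from the examples rather than read from Figure 1.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """An index position interval; left == right means a point interval."""
    left: int
    right: int

    def positions(self):
        # A point interval [p,p] contains only p; a non-point interval
        # [l,r) contains l, l+1, ..., r-1.
        if self.left == self.right:
            return {self.left}
        return set(range(self.left, self.right))

    def __str__(self):
        closing = "]" if self.left == self.right else ")"
        return f"[{self.left},{self.right}{closing}"


def overlaps(a, b):
    """Two intervals overlap when they share at least one index position."""
    return bool(a.positions() & b.positions())


def spans(pattern, target):
    """Collect the interval of every match, scanning left to right."""
    return [Interval(m.start(), m.end()) for m in re.finditer(pattern, target)]


target = "hello world"  # assumed to be the string shown in Figure 1

print(", ".join(str(i) for i in spans(r"[a-z]{2}", target)))
# [0,2), [2,4), [6,8), [8,10)

print(overlaps(Interval(5, 5), Interval(5, 7)))  # True
print(overlaps(Interval(5, 7), Interval(7, 9)))  # False

# \b and \B test properties of index positions 0 and 3, not characters.
print(spans(r"\bhel\B", target)[0])  # [0,3)
```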

3. Practice

Regular expression	Target string	Result	Index position interval sequence	Description
`a?`	`abcafdas`	I found the text “a” starting at index 0 and ending at index 1. I found the text “” starting at index 1 and ending at index 1. I found the text “” starting at index 2 and ending at index 2. I found the text “a” starting at index 3 and ending at index 4. I found the text “” starting at index 4 and ending at index 4. I found the text “” starting at index 5 and ending at index 5. I found the text “a” starting at index 6 and ending at index 7. I found the text “” starting at index 7 and ending at index 7. I found the text “” starting at index 8 and ending at index 8.	[0,1), [1,1], [2,2], [3,4), [4,4], [5,5], [6,7), [7,7], [8,8]	Scan from left to right, from index position 0 to index position 8. Left bound 0: matched string “a”, right bound 1, giving [0,1). Left bound 1: zero-length match, right bound 1, giving [1,1]. Left bound 2: zero-length match, right bound 2, giving [2,2]. Left bound 3: matched string “a”, right bound 4, giving [3,4). Left bound 4: zero-length match, right bound 4, giving [4,4]. Left bound 5: zero-length match, right bound 5, giving [5,5]. Left bound 6: matched string “a”, right bound 7, giving [6,7). Left bound 7: zero-length match, right bound 7, giving [7,7]. Left bound 8: zero-length match, right bound 8, giving [8,8].
`a*`	`aabaaaca`	I found the text “aa” starting at index 0 and ending at index 2. I found the text “” starting at index 2 and ending at index 2. I found the text “aaa” starting at index 3 and ending at index 6. I found the text “” starting at index 6 and ending at index 6. I found the text “a” starting at index 7 and ending at index 8. I found the text “” starting at index 8 and ending at index 8.	[0,2), [2,2], [3,6), [6,6], [7,8), [8,8]	Scan from left to right, from index position 0 to index position 8. Left bound 0: matched string “aa”, right bound 2, giving [0,2). Left bound 2: zero-length match, right bound 2, giving [2,2]. Left bound 3: matched string “aaa”, right bound 6, giving [3,6). Left bound 6: zero-length match, right bound 6, giving [6,6]. Left bound 7: matched string “a”, right bound 8, giving [7,8). Left bound 8: zero-length match, right bound 8, giving [8,8].
`^he\B`	`hello world`	I found the text “he” starting at index 0 and ending at index 2.	[0,2)	Scan from left to right, from index position 0 to index position 11. Left bound 0: matched string “he”, right bound 2, giving [0,2). Left bounds 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11, tried in turn: no match.
`aa`	`cdfaacdf`	I found the text “aa” starting at index 3 and ending at index 5.	[3,5)	Scan from left to right, from index position 0 to index position 8. Left bounds 0, 1, and 2, tried in turn: no match. Left bound 3: matched string “aa”, right bound 5, giving [3,5). Left bounds 5, 6, 7, and 8, tried in turn: no match.
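The rows of this table can be reproduced with a small harness that prints results in the same wording as the Result column. This is a sketch in Python rather than the engine that produced the table; it assumes Python 3.7 or later, where `re.finditer` reports a zero-length match directly after a non-empty one. The function name `report` is hypothetical.

```python
import re


def report(pattern, target):
    """Return one result line per match, in left-to-right order."""
    lines = []
    for m in re.finditer(pattern, target):
        lines.append(
            f'I found the text "{m.group()}" starting at index {m.start()} '
            f"and ending at index {m.end()}."
        )
    return lines


# The four (regular expression, target string) pairs from the table.
CASES = [
    ("a?", "abcafdas"),
    ("a*", "aabaaaca"),
    (r"^he\B", "hello world"),
    ("aa", "cdfaacdf"),
]

for pattern, target in CASES:
    print(f"{pattern} on {target}:")
    for line in report(pattern, target):
        print("  " + line)
```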

4. Applicability of the Model

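A regular expression engine built on this model applies it automatically. When we match by hand, however, we usually fall back on coarse matching, that is, matching directly against the characters of the string, and this makes it very easy to miss zero-length matches. Whenever zero-length matches are possible, it is therefore better to follow this model when matching by hand as well.
For example, take the regular expression a? and the target string abcafdas. Coarse matching readily finds three matched strings, “a” (1st character), “a” (4th character), and “a” (7th character), but easily misses the zero-length matches. Following this model, no possible match is likely to be missed (see Section 3, Practice).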
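The contrast can be made concrete with a sketch that walks the index positions by hand. The names `coarse_match` and `model_match` are hypothetical. `Pattern.match(string, pos)` from Python's `re` module is used to try a single left bound. The rule of moving one position to the right after a zero-length match is a simplifying assumption that agrees with the rows in Section 3, not a description of any particular engine's internals.

```python
import re


def coarse_match(target, ch="a"):
    """Coarse matching: look only at the characters themselves."""
    return [(i, i + 1) for i, c in enumerate(target) if c == ch]


def model_match(pattern, target):
    """Try each index position as a left bound, from left to right."""
    regex = re.compile(pattern)
    intervals = []
    left = 0
    while left <= len(target):
        m = regex.match(target, left)  # attempt anchored at this left bound
        if m is None:
            left += 1  # no match here; try the next index position
        elif m.end() == left:
            intervals.append((left, left))  # zero-length match: point interval
            left += 1  # assumed rule: step right so the scan makes progress
        else:
            intervals.append((left, m.end()))  # non-zero-length match
            left = m.end()  # resume at the right bound
    return intervals


print(coarse_match("abcafdas"))       # 3 intervals, all non-zero-length
print(model_match("a?", "abcafdas"))  # 9 intervals, 6 of them zero-length
```

Coarse matching sees only the three occurrences of the character, while the position-by-position walk also records the six point intervals.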

Reference: [1] https://docs.oracle.com/javase/tutorial/essential/regex/literals.html