Introduction to PCRE and Hyperscan
Perl Compatible Regular Expressions (PCRE), is a widely used regular expression matching library written in the C language, inspired by the regular expression capabilities of the Perl programming language. Its syntax is much more powerful and flexible than many other regular expression libraries, such as the Portable Operating System Interface for UNIX* (POSIX).
Hyperscan is a high performance, multi-pattern regular expression matching library developed by Intel, which supports the same syntax as PCRE. In this article, we will describe the API differences and provide a performance contrast between PCRE and Hyperscan, then show how to replace PCRE with Hyperscan in a typical scenario.
Functionality Comparison
PCRE supports only block mode compilation and matching, while Hyperscan supports both block and streaming mode. Streaming mode is more practical and flexible in real network scenarios.
PCRE supports only single pattern compilation and matching, while Hyperscan can support multiple patterns. In real scenarios, it is common that multiple patterns are applied, and Hyperscan can efficiently complete all work in one compilation and scan.
API Comparison
Both PCRE and Hyperscan interfaces have compile time and runtime phases. When replacing PCRE with Hyperscan, the basic idea is to replace the PCRE API with the Hyperscan API at both compile time and runtime.
Compile time API changes
With PCRE, we often use the following API to compile each pattern:
#include <pcre.h> pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr);
It is very easy to replace this API with the Hyperscan compile time API:
#include <hs_compile.h> hs_error_t hs_compile(const char *expression, unsigned int flags, unsigned int mode, const hs_platform_info_t *platform, hs_database_t **db, hs_compile_error_t **error);
Parameters:
expression – single pattern. Corresponding to “pattern” of pcre. flags – flag of single pattern. Corresponding to “options” of pcre. mode – mode selection. What corresponds to pcre is HS_MODE_BLOCK. platform – platform information. db – Generated Hyperscan database. Corresponding to the return value of pcre_compile2. error – return the compile error.
Return value is HS_SUCCESS(0)
or error code.
Hyperscan also provides an API for multi-pattern compilation:
hs_error_t hs_compile_multi(const char *const *expressions, const unsigned int *flags, const unsigned int *ids, unsigned int elements, unsigned int mode, const hs_platform_info_t *platform, hs_database_t **db, hs_compile_error_t **error);
This API supports the compiling of several patterns together to generate one Hyperscan database. There are some differences in the parameters:
expressions – the array of patterns. flags – the flag array of patterns. ids – the id array of patterns. elements – the number of patterns.
The others remain the same.
Run time API replacement
To build the compiled PCRE database, we often use the following API to scan the input data:
#include <pcre.h> int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize);
It is easy to replace it with the Hyperscan runtime API:
#include <hs_runtime.h> hs_error_t hs_scan(const hs_database_t *db, const char *data, unsigned int length, unsigned int flags, hs_scratch_t *scratch, match_event_handler onEvent, void *context);
Parameters:
db – Hypersan data base. Corresponding to “code” of pcre. data – input data. Corresponding to “subject” of pcre. length – length of the input data. Corresponding to “length” of pcre. flags – reserved options. scratch – the space storing temporary state during runtime, allocated by hs_alloc_scratch(). onEvent – callback function at match, user-defined. context – callback function parameter, user-defined.
Return value is HS_SUCCESS(0)
or error code.
Using prefilter mode
Hyperscan does not completely match PCRE syntax; for example, it doesn’t support Back Reference and Zero-Width Assertion. However, Hyperscan’s performance advantage over PCRE makes it worthwhile to convert the unsupported pattern to its superset, which can be supported by Hyperscan’s prefilter mode. For example:
Convert /foo(\d)+bar\1+baz/ to /foo(\d)+bar(\d)+baz/
At compile time, each pattern goes through the classifier first, which decides whether to use Hyperscan, prefilter mode, or PCRE. The patterns supported by Hyperscan and prefilter mode are multi-compiled together to generate one Hyperscan database; in addition, every pattern that uses prefilter mode or PCRE is compiled separately to generate the PCRE database:
Figure 1. Compile time.
At runtime, the input data is scanned against the Hyperscan database and each non-prefilter mode PCRE database. If Hyperscan finds a match, it should be reconfirmed with PCRE:
Figure 2. Runtime.
Please refer to API Reference: Constants for more API information.
Performance Comparison
To contrast Hyperscan and PCRE performance, we used the Hyperscan performance testing tool hsbench. We selected the following 15 regular expressions, which include both plain text strings and regular expression rules:
ID | Signature |
1 | Twain |
2 | (?i)Twain |
3 | [a-z]shing |
4 | Huck[a-zA-Z]+|Saw[a-zA-Z]+ |
5 | \b\w+nn\b |
6 | [a-q][^u-z]{13}x |
7 | Tom|Sawyer|Huckleberry|Finn |
8 | (?i)Tom|Sawyer|Huckleberry|Finn |
9 | .{0,2}(Tom|Sawyer|Huckleberry|Finn) |
10 | .{2,4}(Tom|Sawyer|Huckleberry|Finn) |
11 | Tom.{10,25}river|river.{10,25}Tom |
12 | [a-zA-Z]+ing |
13 | \s[a-zA-Z]{0,12}ing\s |
14 | ([A-Za-z]awyer|[A-Za-z]inn)\s |
15 | ["'][^"']{0,30}[?!\.]["'] |
We ran the test on a single core of an Intel® Core™ i7-8700K processor at 3.70 GHz. We chose an e-book from The Entire Project Gutenberg Works of Mark Twain by Mark Twain, containing about 20M words (18,905,427 bytes) as input, and then looped for 200 times using hsbench. Time spent and throughput of PCRE (v8.41, just-in-time mode) and Hyperscan (v4.7.0) are as follows:
Corpus: mtent.txt | Total Data: 18905427 Bytes x 200 | |||
---|---|---|---|---|
Time(s) | Throughput (Mbit/s) | |||
ID | pcre_jit | hs | pcre_jit | hs |
1 | 3.518 | 1.452 | 8,598.26 | 20,832.43 |
2 | 3.631 | 1.498 | 8,330.68 | 20,192.71 |
3 | 3.327 | 2.355 | 9,091.88 | 12,844.45 |
4 | 1.582 | 2.246 | 19,120.53 | 13,467.80 |
5 | 12.901 | 2.067 | 2,344.68 | 14,634.10 |
6 | 1.462 | 1.585 | 20,689.93 | 19,084.34 |
7 | 4.755 | 3.037 | 6,361.45 | 9,960.05 |
8 | 11.075 | 3.08 | 2,731.26 | 9,821.00 |
9 | 31.684 | 3.037 | 954.70 | 9,960.05 |
10 | 31.014 | 3.042 | 975.32 | 9,943.68 |
11 | 3.143 | 2.097 | 9,624.14 | 14,424.74 |
12 | 7.42 | 3.279 | 4,076.64 | 9,224.97 |
13 | 8.463 | 4.45 | 3,574.23 | 6,797.46 |
14 | 6.423 | 2.89 | 4,709.43 | 10,466.67 |
15 | 2.395 | 4.267 | 12,629.93 | 7,088.98 |
Total | 132.793 | 40.382 | 227.79 | 749.0635234 |
Multi | 132.793 | 13.4 | 227.79 | 2,257.36 |
The results show that Hyperscan has a performance advantage over PCRE for most of the rules tested. The highest throughput (see test 9) is 10.4 times greater using Hyperscan.
Multiple pattern matching test results also show the advantage of using Hyperscan. Multi-pattern matching is very common in practical use. Hyperscan can compile all the patterns simultaneously into one database which is scanned against the input corpus only once. The results above (see the rows labeled Total and Multi) show that it takes Hyperscan only 13.4 seconds to perform multiple pattern matching, and a total of 40.382 seconds when the 15 rules are scanned separately in single pattern scans. Because PCRE supports only single pattern compilation and scanning, each of the 15 rules must be compiled separately and scanned against the input corpus. Altogether, it takes PCRE 132.793 seconds to complete all scans. The throughput histogram is as follows:
Replacement Pseudo-code Samples
Assume that we have a pattern set and input corpus:
// patterns const char *pats[]; unsigned flags[]; // input data const char *buf = "blah...................blah"; size_t buf_len = strlen(buf);
When using PCRE, we may have this kind of implementation:
// pcre compile for each pattern i pcres[i] = pcre_compile2(pats[i], flags[i], …); // pcre runtime for each pattern i ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …); if ret >= 0 report pattern i match at ovector[1]
Now we’ll describe the details of replacing PCRE with Hyperscan.
In addition to a possible requirement for prefiltering mode, we also have to be careful about a pattern having variable width, which means that a pattern may consume a different amount of data to get matches. This is because Hyperscan reports all matches but PCRE only reports one match under greedy or ungreedy mode. We also need to reconfirm the match from a variable width pattern with PCRE. We may use the following function to check whether a pattern has variable width or not:
bool is_variable_width(re, flags) { hs_expr_info_t *info = NULL; ret = hs_expression_info(re, flags, &info, ...); if (ret == HS_SUCCESS) and info and (info->min_width == info->max_width) return false else return true }
Here we show two different scenarios - single compile or multi-compile.
Single pattern compile
// try compile hs, prefilter mode and pcre compile for each pattern i dbs[i].pcre = NULL; ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &dbs[i].hs, …); if ret == HS_SUCCESS hs_alloc_scratch(dbs[i].hs, dbs[i].scratch); if pats[i] has variable width dbs[i].pcre = pcre_compile2(pats[i], flags[i], …); else dbs[i].pcre = pcre_compile2(pats[i], flags[i], …); ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &dbs[i].hs, …); if ret == HS_SUCCESS hs_alloc_scratch(dbs[i].hs, dbs[i].scratch); else dbs[i].hs = NULL; // runtime on_match(…, to, …, ctx) { // hs callback if !ctx->pcre // not prefilter mode pattern report pattern ctx->i match at to return // got a match from a prefilter mode or variable width pattern, need pcre confirm ret = pcre_exec(ctx->pcre, …, buf, buf_len, …, &ovector[0], …); if ret >= 0 report pattern ctx->i match at ovector[1] return } for each pattern i if dbs[i].hs ctx = pack dbs[i].pcre and i hs_scan(dbs[i].hs, buf, buf_len, 0, dbs[i].scratch, on_match, &ctx); else ret = pcre_exec(dbs[i].pcre, …, buf, buf_len, …, &ovector[0], …); if ret >= 0 report pattern i match at ovector[1] // house clean for each pattern i if dbs[i].hs hs_free_scratch(dbs[i].scratch); hs_free_database(dbs[i].hs);
Multi-pattern compile
// try hs, use prefilter mode and pcre compile if failed for each pattern i ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &hs, …); if ret == HS_SUCCESS store pats[i] to hs_pats[] store flags[i] | HS_FLAG_PREFILTER to hs_flags[] store ids[i] to hs_ids[] if pats[i] has variable width id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …); else ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &hs, …); if ret == HS_SUCCESS store pats[i] to hs_pats[] store flags[i] | HS_FLAG_PREFILTER to hs_flags[] store ids[i] to hs_ids[] id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …); else pcres[n_pcre++] = pcre_compile2(pats[i], flags[i], …); // hs multi compile hs_compile_multi(hs_pats, hs_flags, hs_ids, n_hs, HS_MODE_BLOCK, …, &hs, …); hs_alloc_scratch(hs, &scratch); // hs runtime for multi compiled part on_match(id, …, to, …, ctx) { if ctx[id] // got match from a prefilter mode or variable width pattern, need pcre confirm ret = pcre_exec(ctx[id], …, buf, buf_len, …, &ovector[0], …); if ret >= 0 report pattern id match at ovector[1] else report pattern id match at to return } ctx = id2pcre; // user defined on_match context hs_scan(hs, buf, buf_len, 0, scratch, on_match, ctx); // pcre runtime for the rest for each db i in pcres[] ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …); if ret >= 0 report pattern ids[i] match at ovector[1] // house clean hs_free_scratch(scratch); hs_free_database(hs_db);
Summary
Hyperscan is a high performance regular expression matching library that is very suitable for multi-pattern matching and is faster than PCRE. In this article, we showed how to replace PCRE with Hyperscan in a typical scenario. In addition to its performance advantage, Hyperscan has another superior feature, the streaming mode. It enables us to deal with the case that input data is separated into pieces of blocks. For these reasons, Hyperscan is expected to replace many of the existing regular expression matching engines in a greater number of scenarios.