Most of the interfaces are covered by the documentation for GNU regex.
This file will accumulate factoids about other interfaces until
somebody writes a manual.

* Don't Pass Registers Gratuitously

Search and match functions take an optional parameter which is a
pointer to "registers" or "match positions".  This parameter points
to a structure which, during the match, is filled in with the offsets
of parenthesized subexpressions.

Unless you specifically need the values that would be stored in that
structure, you should pass NULL for this parameter.  Rx usually does
less backtracking (and so runs much faster) when subexpression
positions are not being measured.

* Use syntax_parens

Sometimes you need to know the positions of *some* parenthesized
subexpressions, but not others.  You can still help Rx avoid
backtracking by telling it specifically which subexpressions you are
interested in.  You do this by filling in the rxb.syntax_parens field
of a pattern buffer.

  /* If this is a valid pointer, it tells rx not to store the extents of
   * certain subexpressions (those corresponding to non-zero entries).
   * Passing 0x1 is the same as passing an array of all ones.  Passing 0x0
   * is the same as passing an array of all zeros.
   * The array should contain as many entries as there are subexps in the
   * regexp.
   */
  char * syntax_parens;

* RX_SEARCH

For an example of how to use rx_search, you can look at how
re_search_2 is defined (in rx.c).  Basically, you need to define three
functions: GET_BURST, FETCH_CHAR, and BACK_REF.  They each operate on
`struct rx_string_position' and a closure of your design passed as
void *.

  struct rx_string_position
  {
    const unsigned char * pos;    /* The current pos. */
    const unsigned char * string; /* The current string burst.  */
    const unsigned char * end;    /* First invalid position >= POS. */
    int offset;                   /* Integer address of current burst */
    int size;                     /* Current string's size. */
    int search_direction;         /* 1 or -1 */
    int search_end;               /* First position to not try. */
  };

On entry to GET_BURST, all these fields are set, but POS may be >=
END.  In fact, STRING and END might both be 0.

The function of GET_BURST is to make all the fields valid without
changing the logical position in the string.  SEARCH_DIRECTION is a
hint about which way the matcher will move next.  It is usually 1, and
is -1 only when fastmapping during a reverse search.  SEARCH_END
terminates the burst.

  typedef enum rx_get_burst_return
    (*rx_get_burst_fn) (struct rx_string_position * pos,
                        void * app_closure,
                        int stop);

The closure is whatever you pass to rx_search.  STOP is an argument to
rx_search that bounds the search.  You should never return a string
position with SEARCH_END set beyond the position indicated by STOP.

  enum rx_get_burst_return
  {
    rx_get_burst_continuation,
    rx_get_burst_error,
    rx_get_burst_ok,
    rx_get_burst_no_more
  };

Those are the possible return values of GET_BURST.  Normally, you only
ever care about the last two.  An error return indicates something
like trouble reading a file.  A continuation return means: suspend the
search, and resume by retrying GET_BURST if the search is restarted.

GET_BURST is not quite as trivial as you might hope.  If you have a
fragmented string, you really have to keep two adjacent fragments
available at all times, even though the GET_BURST interface looks as
if you only need one.  This is because operators like `word-boundary'
look at two adjacent characters.  Such operators are implemented
with FETCH_CHAR.

  typedef int (*rx_fetch_char_fn) (struct rx_string_position * pos,
                                   int offset,
                                   void * app_closure,
                                   int stop);

FETCH_CHAR takes the same closure that is passed to GET_BURST.  It
returns the character at POS or at one past POS, according to whether
OFFSET is 0 or 1.  It is guaranteed that POS + OFFSET is within the
string being searched.
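As a rough illustration of why two adjacent fragments have to stay
reachable, here is a minimal sketch of a FETCH_CHAR for a string split
into two fragments.  It is not taken from rx.c.  The closure type
`two_frag_closure' and the function `two_frag_fetch_char' are
hypothetical names, and the sketch assumes that the current burst is
the first fragment and that END marks the end of that fragment.  The
struct is repeated from the declaration above only so that the sketch
compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Repeated from the declaration in the text so the sketch stands alone. */
struct rx_string_position
{
  const unsigned char * pos;
  const unsigned char * string;
  const unsigned char * end;
  int offset;
  int size;
  int search_direction;
  int search_end;
};

/* Hypothetical closure: one logical string stored as two fragments. */
struct two_frag_closure
{
  const unsigned char * frag[2];
  int len[2];
};

/* Hypothetical FETCH_CHAR.  Assumes the current burst is frag[0], that
 * POS <= END, and that END is the end of frag[0].  When POS + OFFSET
 * runs past the burst, the character is read from frag[1]; this is the
 * reason the second fragment must stay available.
 */
static int
two_frag_fetch_char (struct rx_string_position * pos, int offset,
                     void * app_closure, int stop)
{
  struct two_frag_closure * c = (struct two_frag_closure *) app_closure;
  ptrdiff_t remaining = pos->end - pos->pos;  /* chars left in the burst */

  (void) stop;                       /* unused in this sketch */
  assert (offset == 0 || offset == 1);
  assert (remaining >= 0);

  if (offset < remaining)
    return pos->pos[offset];         /* still inside the current burst */

  /* Crossed the fragment boundary: index into the second fragment. */
  assert (offset - remaining < c->len[1]);
  return c->frag[1][offset - remaining];
}
```

With the fragments "ab" and "cd" and POS at the `b', OFFSET 0 reads
from the first fragment and OFFSET 1 reads the first character of the
second.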
The last function compares characters at one position with characters
previously matched by a parenthesized subexpression.

  enum rx_back_check_return
  {
    rx_back_check_continuation,
    rx_back_check_error,
    rx_back_check_pass,
    rx_back_check_fail
  };

  typedef enum rx_back_check_return
    (*rx_back_check_fn) (struct rx_string_position * pos,
                         int lparen,
                         int rparen,
                         unsigned char * translate,
                         void * app_closure,
                         int stop);

LPAREN and RPAREN are the integer indexes of the previously matched
characters.  The comparison should translate both characters being
compared by mapping them through TRANSLATE.  POS is the point at which
to begin comparing.  It should be advanced to the last character
matched during backreferencing.

* Compilation Stages

In rx_compile, a string is compiled into a pattern buffer.
Compilation proceeds in these stages:

  1. Make a syntax tree for the regexp.
  2. Duplicate the syntax tree and make both trees nodes in a single
     unifying tree.
  3. In one of the two trees, remove all side effects that aren't
     needed to test for the possibility of a match.  Such side effects
     include the filling in of output registers for subexpressions
     that are not backreferenced.
  4. Optimize the unifying tree.
  5. Translate the tree to an NFA.
  6. Analyze and optimize the NFA.
  7. Copy the NFA into a contiguous region of memory.

* Cache Size

During a search or match, the NFA is translated into a "super NFA".  A
super NFA can match the patterns of the corresponding NFA in no more,
and often fewer, steps.

The catch is that the super NFA may be costly to construct in its
entirety; it may not even fit in memory.  So, states of the super NFA
are constructed on demand and discarded after a period of non-use.
They are kept in a cache so that time is not wasted constructing
existing states twice.

The size of the superstate cache is a contributing factor in the
performance of Rx.  The larger the cache (up to a point), the faster
Rx can run.
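The idea of building superstates on demand and keeping them in a
bounded cache can be sketched with a toy.  This is not Rx's code or
its data structures.  It is a generic lazy subset construction over a
hard-coded three-state NFA for (a|b)*ab, with a least-recently-used
cache.  Every name here (CACHE_BOUND, `lookup', `toy_match', and so
on) is hypothetical, and the input is assumed to contain only the
characters `a' and `b'.

```c
#define N_NFA_STATES 3
#define N_SYMS 2          /* toy alphabet: 'a' is 0, anything else is 1 */
#define CACHE_BOUND 4     /* hypothetical limit on cached superstates */
#define NFA_START 0x1u    /* bit i stands for NFA state i */
#define NFA_FINAL 0x4u

/* Toy NFA for (a|b)*ab: nfa_delta[state][sym] is a set of NFA states. */
static const unsigned nfa_delta[N_NFA_STATES][N_SYMS] = {
  { 0x1u | 0x2u, 0x1u },  /* state 0: a -> {0,1}, b -> {0} */
  { 0x0u, 0x4u },         /* state 1: b -> {2} */
  { 0x0u, 0x0u }          /* state 2: no moves */
};

/* A superstate is a set of NFA states plus lazily filled transitions. */
struct superstate
{
  int in_use;
  unsigned set;
  unsigned next[N_SYMS];
  int have_next[N_SYMS];
  unsigned long last_used;
};

static struct superstate cache[CACHE_BOUND];
static unsigned long clock_ticks;
static int constructions;   /* how many superstates were built */

/* Union of the NFA moves of every state in SET on symbol SYM. */
static unsigned
step_set (unsigned set, int sym)
{
  unsigned out = 0;
  int i;
  for (i = 0; i < N_NFA_STATES; i++)
    if (set & (1u << i))
      out |= nfa_delta[i][sym];
  return out;
}

/* Find the superstate for SET, building it (and evicting the least
 * recently used entry if the cache is full) when it is missing.
 */
static struct superstate *
lookup (unsigned set)
{
  struct superstate * victim = &cache[0];
  int i;

  for (i = 0; i < CACHE_BOUND; i++)
    if (cache[i].in_use && cache[i].set == set)
      {
        cache[i].last_used = ++clock_ticks;
        return &cache[i];
      }

  for (i = 0; i < CACHE_BOUND; i++)
    {
      if (!cache[i].in_use)
        {
          victim = &cache[i];
          break;
        }
      if (cache[i].last_used < victim->last_used)
        victim = &cache[i];
    }

  victim->in_use = 1;
  victim->set = set;
  for (i = 0; i < N_SYMS; i++)
    victim->have_next[i] = 0;
  victim->last_used = ++clock_ticks;
  constructions++;
  return victim;
}

/* Return 1 if the whole of S matches (a|b)*ab, else 0. */
static int
toy_match (const char * s)
{
  unsigned cur = NFA_START;

  for (; *s; s++)
    {
      int sym = (*s == 'a') ? 0 : 1;
      struct superstate * ss = lookup (cur);
      if (!ss->have_next[sym])
        {
          ss->next[sym] = step_set (cur, sym);  /* computed once, reused */
          ss->have_next[sym] = 1;
        }
      cur = ss->next[sym];
    }
  return (cur & NFA_FINAL) != 0;
}
```

Transitions are stored as state sets rather than as pointers to other
cache entries, so evicting an entry never leaves a dangling reference.
The price is a cache lookup on every step, which a real implementation
would presumably avoid.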
The variable rx_cache_bound is an upper limit on the number of
superstates that can exist in the cache.  The default setting is 128.
GNU sed uses 4096.  Neither setting has much justification, although
sed's was chosen after a small number of quick and dirty experiments.

The memory consumed by one superstate is between 4k and 8k.  The
cache only grows to its bounded size if there is actual demand for
that many states.  Sed's setting, for example, may appear quite high,
but in practice that much memory is hardly ever used.  The default
setting was chosen based on the heuristic that a megabyte is the upper
limit on what a good citizen library can allocate without special
arrangement (128 superstates at up to 8k each come to at most one
megabyte).
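The arithmetic behind those figures can be checked with a few lines of
C.  The helper names below are hypothetical and are not part of Rx.
The 4k and 8k per-superstate figures are the ones quoted above, taken
here to mean 4096 and 8192 bytes.

```c
/* Per-superstate memory range quoted in the text, in bytes. */
#define SUPERSTATE_MIN_BYTES 4096UL
#define SUPERSTATE_MAX_BYTES 8192UL

/* Lower estimate of the memory used by a cache that is completely full. */
static unsigned long
cache_bytes_low (unsigned long bound)
{
  return bound * SUPERSTATE_MIN_BYTES;
}

/* Upper estimate of the memory used by a cache that is completely full. */
static unsigned long
cache_bytes_high (unsigned long bound)
{
  return bound * SUPERSTATE_MAX_BYTES;
}
```

For the default bound of 128 this gives between 524288 and 1048576
bytes (half a megabyte to one megabyte).  For sed's bound of 4096 it
gives between 16777216 and 33554432 bytes (16 to 32 megabytes), which
is reached only if that many superstates are actually demanded.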