This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

DESCRIPTION

       The  PCRE  library is a set of functions that implement regular expres-
       sion pattern matching using the same syntax and semantics as Perl, with
       just  a  few  differences.  The current implementation of PCRE (release
       4.x) corresponds approximately with Perl  5.8,  including  support  for
       UTF-8  encoded  strings.   However,  this  support has to be explicitly
       enabled; it is not the default.

       PCRE is written in C and released as a C library. However, a number  of
       people  have  written  wrappers  and interfaces of various kinds. A C++
       class is included in these contributions, which can  be  found  in  the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details  of  exactly which Perl regular expression features are and are
       not supported by PCRE are given in separate documents. See the pcrepat-
       tern and pcrecompat pages.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible  for  a
       client  to  discover  which features are available. Documentation about
       building PCRE for various operating systems can be found in the  README
       file in the source distribution.


USER DOCUMENTATION

       The user documentation for PCRE has been split up into a number of dif-
       ferent sections. In the "man" format, each of these is a separate  "man
       page".  In  the  HTML  format, each is a separate page, linked from the
       index page. In the plain text format, all  the  sections  are  concate-
       nated, for ease of searching. The sections are as follows:

         pcre              this document
         pcreapi           details of PCRE's native API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcregrep          description of the pcregrep command
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible API
         pcresample        discussion of the sample program
         pcretest          the pcretest testing command

       In  addition,  in the "man" and HTML formats, there is a short page for
       each library function, listing its arguments and results.


LIMITATIONS

       There are some size limitations in PCRE but it is hoped that they  will
       never in practice be relevant.

       The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process  regular  expressions  that are truly enormous, you can compile
       PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
       the  source  distribution and the pcrebuild documentation for details).
       In these cases the limit is substantially larger.  However,  the  speed
       of execution will be slower.

       All values in repeating quantifiers must be less than 65536.  The maxi-
       mum number of capturing subpatterns is 65535.

       There is no limit to the number of non-capturing subpatterns,  but  the
       maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,
       including capturing subpatterns, assertions, and other types of subpat-
       tern, is 200.

       The  maximum  length of a subject string is the largest positive number
       that an integer variable can hold. However, PCRE uses recursion to han-
       dle  subpatterns  and indefinite repetition. This means that the avail-
       able stack space may limit the size of a subject  string  that  can  be
       processed by certain patterns.


UTF-8 SUPPORT

       Starting  at  release  3.3,  PCRE  has  had  some support for character
       strings encoded in the UTF-8 format. For  release  4.0  this  has  been
       greatly extended to cover most common requirements.

       In order to process UTF-8 strings, you must build PCRE to include UTF-8
       support in the code, and, in addition,  you  must  call  pcre_compile()
       with  the PCRE_UTF8 option flag. When you do this, both the pattern and
       any subject strings that are matched against it are  treated  as  UTF-8
       strings instead of just strings of bytes.

       If  you compile PCRE with UTF-8 support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time  overhead
       is  limited  to testing the PCRE_UTF8 flag in several places, so should
       not be very large.

       The following comments apply when PCRE is running in UTF-8 mode:

       1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
       subjects  are  checked for validity on entry to the relevant functions.
       If an invalid UTF-8 string is passed, an error return is given. In some
       situations,  you  may  already  know  that  your strings are valid, and
       therefore want to skip these checks in order to improve performance. If
       you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
       PCRE assumes that the pattern or subject  it  is  given  (respectively)
       contains  only valid UTF-8 codes. In this case, it does not diagnose an
       invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
       PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
       crash.

       2. In a pattern, the escape sequence \x{...}, where the contents of the
       braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8
       character whose code number is the given hexadecimal number, for  exam-
       ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,
       the item is not recognized.  This escape sequence can be used either as
       a literal, or within a character class.

       3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte
       UTF-8 character if the value is greater than 127.

       4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
       vidual bytes, for example: \x{100}{3}.

       5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a
       single byte.

       6. The escape sequence \C can be used to match a single byte  in  UTF-8
       mode, but its use can lead to some strange effects.

       7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but the characters that PCRE  recog-
       nizes  as  digits,  spaces,  or  word characters remain the same set as
       before, all with values less than 256.

       8. Case-insensitive matching applies only to  characters  whose  values
       are  less  than  256.  PCRE  does  not support the notion of "case" for
       higher-valued characters.

       9. PCRE does not support the use of Unicode tables  and  properties  or
       the Perl escapes \p, \P, and \X.
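
       Items 2 to 4 depend on how a character value maps to UTF-8  bytes.  The
       following is a minimal sketch of that mapping, not code taken from PCRE:
       the helper name utf8_encode is hypothetical, and  it  is  assumed  that
       only values up to 0x10FFFF are of interest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode code point cp as UTF-8
   into out (which must have room for 4 bytes). Returns the number of
   bytes written, or 0 if cp is above 0x10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                 /* 1 byte:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {              /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {            /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

       Under this mapping the value 0xe9 becomes the two bytes C3 A9, which is
       why \xe9 matches a two-byte character in UTF-8 mode, and  \x{100}{3}
       matches three copies of the two-byte sequence C4 80, six bytes in all.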


AUTHOR

       Philip Hazel <ph10@cam.ac.uk>
       University Computing Service,
       Cambridge CB2 3QG, England.
       Phone: +44 1223 334714

Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

PCRE BUILD-TIME OPTIONS

       This  document  describes  the  optional  features  of PCRE that can be
       selected when the library is compiled. They are all selected, or  dese-
       lected,  by  providing  options  to  the  configure script which is run
       before the make command. The complete list  of  options  for  configure
       (which  includes the standard ones such as the selection of the instal-
       lation directory) can be obtained by running

         ./configure --help

       The following sections describe certain options whose names begin  with
       --enable  or  --disable. These settings specify changes to the defaults
       for the configure command. Because of the  way  that  configure  works,
       --enable  and  --disable  always  come  in  pairs, so the complementary
       option always exists as well, but as it specifies the  default,  it  is
       not described.


UTF-8 SUPPORT

       To build PCRE with support for UTF-8 character strings, add

         --enable-utf8

       to  the  configure  command.  Of  itself, this does not make PCRE treat
       strings as UTF-8. As well as compiling PCRE with this option, you  also
       have  to  set  the PCRE_UTF8 option when you call the pcre_compile()
       function.


CODE VALUE OF NEWLINE

       By default, PCRE treats character 10 (linefeed) as the newline  charac-
       ter. This is the normal newline character on Unix-like systems. You can
       compile PCRE to use character 13 (carriage return) instead by adding

         --enable-newline-is-cr

       to the configure command. For completeness there is  also  a  --enable-
       newline-is-lf  option,  which explicitly specifies linefeed as the new-
       line character.


BUILDING SHARED AND STATIC LIBRARIES

       The PCRE building process uses libtool to build both shared and  static
       Unix  libraries by default. You can suppress one of these by adding one
       of

         --disable-shared
         --disable-static

       to the configure command, as required.


POSIX MALLOC USAGE

       When PCRE is called through the  POSIX  interface  (see  the  pcreposix
       documentation),  additional working storage is required for holding the
       pointers to capturing substrings because PCRE requires  three  integers
       per  substring,  whereas  the POSIX interface provides only two. If the
       number of expected substrings is small, the wrapper function uses space
       on the stack, because this is faster than using malloc() for each call.
       The default threshold above which the stack is no longer used is 10; it
       can be changed by adding a setting such as

         --with-posix-malloc-threshold=20

       to the configure command.


LIMITING PCRE RESOURCE USAGE

       Internally,  PCRE  has a function called match() which it calls repeat-
       edly (possibly recursively) when performing a  matching  operation.  By
       limiting  the  number of times this function may be called, a limit can
       be placed on the resources used by a single call  to  pcre_exec().  The
       limit  can be changed at run time, as described in the pcreapi documen-
       tation. The default is 10 million, but this can be changed by adding  a
       setting such as

         --with-match-limit=500000

       to the configure command.


HANDLING VERY LARGE PATTERNS

       Within  a  compiled  pattern,  offset values are used to point from one
       part to another (for example, from an opening parenthesis to an  alter-
       nation  metacharacter).  By  default two-byte values are used for these
       offsets, leading to a maximum size for a  compiled  pattern  of  around
       64K.  This  is sufficient to handle all but the most gigantic patterns.
       Nevertheless, some people do want to process enormous patterns,  so  it
       is  possible  to compile PCRE to use three-byte or four-byte offsets by
       adding a setting such as

         --with-link-size=3

       to the configure command. The value given must be 2,  3,  or  4.  Using
       longer  offsets slows down the operation of PCRE because it has to load
       additional bytes when handling them.

       If you build PCRE with an increased link size, test 2 (and  test  5  if
       you  are using UTF-8) will fail. Part of the output of these tests is a
       representation of the compiled pattern, and this changes with the  link
       size.
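
       The trade-off between link size and speed can be illustrated  by  a
       sketch of how an offset might be stored in a fixed number of bytes. The
       helper names put_offset, get_offset, and max_offset  are  hypothetical,
       and  storing  the  most significant byte first is an assumption of this
       sketch rather than a statement about PCRE's internal layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers: store and load an offset held in link_size
   bytes, most significant byte first (an assumption of this sketch). */
static void put_offset(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    assert(p != NULL && link_size >= 2 && link_size <= 4);
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

static unsigned long get_offset(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    assert(p != NULL && link_size >= 2 && link_size <= 4);
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];   /* one more load per extra byte */
    return value;
}

/* Largest value that fits in link_size bytes: 65535 for a size of 2. */
static unsigned long max_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (link_size == 4) ? 0xFFFFFFFFUL
                            : (1UL << (8 * link_size)) - 1;
}
```

       With two bytes the largest representable offset is 65535, which  agrees
       with the "around 64K" figure above; each additional byte costs one more
       load and shift every time an offset is read.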


AVOIDING EXCESSIVE STACK USAGE

       PCRE  implements  backtracking while matching by making recursive calls
       to an internal function called match(). In environments where the  size
       of the stack is limited, this can severely limit PCRE's operation. (The
       Unix environment does not usually suffer from this problem.) An  alter-
       native  approach  that  uses  memory  from  the  heap to remember data,
       instead of using recursive function calls, has been implemented to work
       round  this  problem. If you want to build a version of PCRE that works
       this way, add

         --disable-stack-for-recursion

       to the configure command. With this configuration, PCRE  will  use  the
       pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory
       management functions. Separate functions are provided because the usage
       is very predictable: the block sizes requested are always the same, and
       the blocks are always freed in reverse order. A calling  program  might
       be  able  to implement optimized functions that perform better than the
       standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
       slowly when built in this way.


USING EBCDIC CODE

       PCRE  assumes  by  default that it will run in an environment where the
       character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE
       can, however, be compiled to run in an EBCDIC environment by adding

         --enable-ebcdic

       to the configure command.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

SYNOPSIS OF PCRE API

       #include <pcre.h>

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       void pcre_free_substring(const char *stringptr);

       void pcre_free_substring_list(const char **stringptr);

       const unsigned char *pcre_maketables(void);

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

       int pcre_config(int what, void *where);

       char *pcre_version(void);

       void *(*pcre_malloc)(size_t);

       void (*pcre_free)(void *);

       void *(*pcre_stack_malloc)(size_t);

       void (*pcre_stack_free)(void *);

       int (*pcre_callout)(pcre_callout_block *);


PCRE API

       PCRE has its own native API, which is described in this document. There
       is also a set of wrapper functions that correspond to the POSIX regular
       expression API.  These are described in the pcreposix documentation.

       The  native  API  function  prototypes  are  defined in the header file
       pcre.h, and on Unix systems the library itself is called libpcre.a,  so
       can be accessed by adding -lpcre to the command for linking an applica-
       tion which calls it. The header file defines the macros PCRE_MAJOR  and
       PCRE_MINOR  to  contain  the  major  and  minor release numbers for the
       library. Applications can use these to include  support  for  different
       releases.

       The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used
       for compiling and matching regular expressions. A sample  program  that
       demonstrates  the simplest way of using them is given in the file pcre-
       demo.c. The pcresample documentation describes how to run it.

       There are convenience functions for extracting captured substrings from
       a matched subject string. They are:

         pcre_copy_substring()
         pcre_copy_named_substring()
         pcre_get_substring()
         pcre_get_named_substring()
         pcre_get_substring_list()

       pcre_free_substring() and pcre_free_substring_list() are also provided,
       to free the memory used for extracted strings.

       The function pcre_maketables() is used (optionally) to build a  set  of
       character tables in the current locale for passing to pcre_compile().

       The  function  pcre_fullinfo()  is used to find out information about a
       compiled pattern; pcre_info() is an obsolete version which returns only
       some  of  the available information, but is retained for backwards com-
       patibility.  The function pcre_version() returns a pointer to a  string
       containing the version of PCRE and its date of release.

       The  global  variables  pcre_malloc and pcre_free initially contain the
       entry points of the standard  malloc()  and  free()  functions  respec-
       tively. PCRE calls the memory management functions via these variables,
       so a calling program can replace them if it  wishes  to  intercept  the
       calls. This should be done before calling any PCRE functions.
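
       A minimal sketch of replacement functions with the required  signatures
       follows.  The names counting_malloc, counting_free, and live_blocks are
       hypothetical; the wrappers simply count blocks that have been  obtained
       but not yet freed, and they are not thread-safe as written.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrappers with the same signatures as the functions that
   pcre_malloc and pcre_free point to. */
static long live_blocks = 0;   /* blocks obtained and not yet freed */

static void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL)
        live_blocks++;
    return block;
}

static void counting_free(void *block)
{
    if (block != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(block);
}
```

       A program that includes pcre.h would install the wrappers with

         pcre_malloc = counting_malloc;
         pcre_free = counting_free;

       before  calling  any  other PCRE function. A non-zero value of live_blocks
       at exit then points to a compiled pattern or extracted string that  was
       never freed.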

       The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
       indirections to memory management functions.  These  special  functions
       are  used  only  when  PCRE is compiled to use the heap for remembering
       data, instead of recursive function calls. This is a  non-standard  way
       of  building  PCRE,  for  use in environments that have limited stacks.
       Because of the greater use of memory management, it runs  more  slowly.
       Separate  functions  are provided so that special-purpose external code
       can be used for this case. When used, these functions are always called
       in  a  stack-like  manner  (last obtained, first freed), and always for
       memory blocks of the same size.
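
       Because the requests are all for the same size and are released in  re-
       verse  order, a simple free list can satisfy most of them without going
       back to malloc(). The following is a sketch under  those  assumptions;
       the names pool_malloc and pool_free are hypothetical, the code is not
       thread-safe, and it never returns memory to the system.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical replacements for the functions that pcre_stack_malloc
   and pcre_stack_free point to. Freed blocks are kept on a free list
   and handed out again. */
static void  *pool_free_list = NULL;   /* blocks available for reuse */
static size_t pool_block_size = 0;     /* size shared by every block */

static void *pool_malloc(size_t size)
{
    if (size < sizeof(void *))
        size = sizeof(void *);         /* room for the free-list link */
    if (pool_block_size == 0)
        pool_block_size = size;        /* first request fixes the size */
    assert(size == pool_block_size);   /* the sketch assumes one size */

    if (pool_free_list != NULL) {
        void *block = pool_free_list;
        pool_free_list = *(void **)block;   /* pop the most recent block */
        return block;
    }
    return malloc(size);
}

static void pool_free(void *block)
{
    if (block == NULL)
        return;
    *(void **)block = pool_free_list;       /* push onto the free list */
    pool_free_list = block;
}
```

       A  program  that  includes  pcre.h  would  install these functions by
       assigning them to pcre_stack_malloc and pcre_stack_free before  calling
       any other PCRE function.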

       The global variable pcre_callout initially contains NULL. It can be set
       by  the  caller  to  a "callout" function, which PCRE will then call at
       specified points during a matching operation. Details are given in  the
       pcrecallout documentation.


MULTITHREADING

       The PCRE functions can be used in multithreaded applications, with
       the  proviso  that  the  memory  management  functions  pointed  to  by
       pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
       callout function pointed to by pcre_callout, are shared by all threads.

       The  compiled form of a regular expression is not altered during match-
       ing, so the same compiled pattern can safely be used by several threads
       at once.


CHECKING BUILD-TIME OPTIONS

       int pcre_config(int what, void *where);

       The  function pcre_config() makes it possible for a PCRE client to dis-
       cover which optional features have been compiled into the PCRE library.
       The  pcrebuild documentation has more details about these optional fea-
       tures.

       The first argument for pcre_config() is an  integer,  specifying  which
       information is required; the second argument is a pointer to a variable
       into which the information is  placed.  The  following  information  is
       available:

         PCRE_CONFIG_UTF8

       The  output is an integer that is set to one if UTF-8 support is avail-
       able; otherwise it is set to zero.

         PCRE_CONFIG_NEWLINE

       The output is an integer that is set to the value of the code  that  is
       used  for the newline character. It is either linefeed (10) or carriage
       return (13), and should normally be the  standard  character  for  your
       operating system.

         PCRE_CONFIG_LINK_SIZE

       The  output  is  an  integer that contains the number of bytes used for
       internal linkage in compiled regular expressions. The value is 2, 3, or
       4.  Larger  values  allow larger regular expressions to be compiled, at
       the expense of slower matching. The default value of  2  is  sufficient
       for  all  but  the  most massive patterns, since it allows the compiled
       pattern to be up to 64K in size.

         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

       The output is an integer that contains the threshold  above  which  the
       POSIX  interface  uses malloc() for output vectors. Further details are
       given in the pcreposix documentation.

         PCRE_CONFIG_MATCH_LIMIT

       The output is an integer that gives the default limit for the number of
       internal  matching  function  calls in a pcre_exec() execution. Further
       details are given with pcre_exec() below.

         PCRE_CONFIG_STACKRECURSE

       The output is an integer that is set to one if  internal  recursion  is
       implemented  by recursive function calls that use the stack to remember
       their state. This is the usual way that PCRE is compiled. The output is
       zero  if PCRE was compiled to use blocks of data on the heap instead of
       recursive  function  calls.  In  this   case,   pcre_stack_malloc   and
       pcre_stack_free  are  called  to manage memory blocks on the heap, thus
       avoiding the use of the stack.


COMPILING A PATTERN

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);


       The function pcre_compile() is called to  compile  a  pattern  into  an
       internal  form.  The pattern is a C string terminated by a binary zero,
       and is passed in the argument pattern. A pointer to a single  block  of
       memory  that is obtained via pcre_malloc is returned. This contains the
       compiled code and related data.  The  pcre  type  is  defined  for  the
       returned  block;  this  is a typedef for a structure whose contents are
       not externally defined. It is up to the caller to free the memory  when
       it is no longer required.

       Although  the compiled code of a PCRE regex is relocatable, that is, it
       does not depend on memory location, the complete pcre data block is not
       fully relocatable, because it contains a copy of the tableptr argument,
       which is an address (see below).

       The options argument contains independent bits that affect the compila-
       tion.  It  should  be  zero  if  no  options  are required. Some of the
       options, in particular, those that are compatible with Perl,  can  also
       be  set and unset from within the pattern (see the detailed description
       of regular expressions in the  pcrepattern  documentation).  For  these
       options,  the  contents of the options argument specifies their initial
       settings at the start of compilation and execution.  The  PCRE_ANCHORED
       option can be set at the time of matching as well as at compile time.

       If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
       if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
       sets the variable pointed to by errptr to point to a textual error mes-
       sage. The offset from the start of the pattern to the  character  where
       the  error  was  discovered  is  placed  in  the variable pointed to by
       erroffset, which must not be NULL. If it  is,  an  immediate  error  is
       given.

       If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
       character tables which are built when it is compiled, using the default
       C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to
       pcre_maketables(). See the section on locale support below.

       This code fragment shows a typical straightforward  call  to  pcre_com-
       pile():

         pcre *re;
         const char *error;
         int erroffset;
         re = pcre_compile(
           "^A.*Z",          /* the pattern */
           0,                /* default options */
           &error,           /* for error message */
           &erroffset,       /* for error offset */
           NULL);            /* use default character tables */
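
       The fragment above does not test the result. One way  of  reporting  a
       failure  is  sketched  below. The helper name format_compile_error is
       hypothetical; it only formats the message and offset that  pcre_com-
       pile() hands back, and it assumes that the offset does not exceed the
       length of the pattern.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the pattern, a caret under the character
   at erroffset, and the error text into buf. Returns whatever
   snprintf() returns. */
static int format_compile_error(char *buf, size_t size, const char *pattern,
                                const char *error, int erroffset)
{
    assert(buf != NULL && pattern != NULL && error != NULL);
    assert(erroffset >= 0 && (size_t)erroffset <= strlen(pattern));
    /* "%*s" with an empty string pads with erroffset spaces. */
    return snprintf(buf, size, "%s\n%*s^\nerror at offset %d: %s\n",
                    pattern, erroffset, "", erroffset, error);
}
```

       After the call to pcre_compile() shown above, it could be used like
       this:

         if (re == NULL) {
           char msg[256];
           format_compile_error(msg, sizeof msg, "^A.*Z", error, erroffset);
           fputs(msg, stderr);
         }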

       The following option bits are defined:

         PCRE_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the  string
       which is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is  the
       only way to do it in Perl.

         PCRE_CASELESS

       If  this  bit is set, letters in the pattern match both upper and lower
       case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
       changed within a pattern by a (?i) option setting.

         PCRE_DOLLAR_ENDONLY

       If  this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option,  a  dollar  also
       matches  immediately before the final character if it is a newline (but
       not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is
       ignored if PCRE_MULTILINE is set. There is no equivalent to this option
       in Perl, and no way to set it within a pattern.

         PCRE_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches all
       characters, including newlines. Without it, newlines are excluded. This
       option is equivalent to Perl's /s option, and it can be changed  within
       a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]
       always matches a newline character, independent of the setting of  this
       option.

         PCRE_EXTENDED

       If  this  bit  is  set,  whitespace  data characters in the pattern are
       totally ignored except  when  escaped  or  inside  a  character  class.
       Whitespace  does  not  include the VT character (code 11). In addition,
       characters between an unescaped # outside a  character  class  and  the
       next newline character, inclusive, are also ignored. This is equivalent
       to Perl's /x option, and it can be changed within a pattern by  a  (?x)
       option setting.

       This  option  makes  it possible to include comments inside complicated
       patterns.  Note, however, that this applies only  to  data  characters.
       Whitespace   characters  may  never  appear  within  special  character
       sequences in a pattern, for  example  within  the  sequence  (?(  which
       introduces a conditional subpattern.

         PCRE_EXTRA

       This  option  was invented in order to turn on additional functionality
       of PCRE that is incompatible with Perl, but it  is  currently  of  very
       little  use. When set, any backslash in a pattern that is followed by a
       letter that has no special meaning  causes  an  error,  thus  reserving
       these  combinations  for  future  expansion.  By default, as in Perl, a
       backslash followed by a letter with no special meaning is treated as  a
       literal.  There  are  at  present  no other features controlled by this
       option. It can also be set by a (?X) option setting within a pattern.

         PCRE_MULTILINE

       By default, PCRE treats the subject string as consisting  of  a  single
       "line"  of  characters (even if it actually contains several newlines).
       The "start of line" metacharacter (^) matches only at the start of  the
       string,  while  the "end of line" metacharacter ($) matches only at the
       end of the string, or before a terminating  newline  (unless  PCRE_DOL-
       LAR_ENDONLY is set). This is the same as Perl.

       When PCRE_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before  any  new-
       line  in the subject string, respectively, as well as at the very start
       and end. This is equivalent to Perl's /m option, and it can be  changed
       within a pattern by a (?m) option setting. If there are no "\n" charac-
       ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,
       setting PCRE_MULTILINE has no effect.

         PCRE_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing paren-
       theses in the pattern. Any opening parenthesis that is not followed  by
       ?  behaves as if it were followed by ?: but named parentheses can still
       be used for capturing (and they acquire  numbers  in  the  usual  way).
       There is no equivalent of this option in Perl.

         PCRE_UNGREEDY

       This  option  inverts  the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It  is
       not  compatible  with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE_UTF8

       This option causes PCRE to regard both the pattern and the  subject  as
       strings  of  UTF-8 characters instead of single-byte character strings.
       However, it is available only if PCRE has been built to  include  UTF-8
       support.  If  not, the use of this option provokes an error. Details of
       how this option changes the behaviour of PCRE are given in the  section
       on UTF-8 support in the main pcre page.

         PCRE_NO_UTF8_CHECK

       When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
       automatically checked. If an invalid UTF-8 sequence of bytes is  found,
       pcre_compile()  returns an error. If you already know that your pattern
       is valid, and you want to skip this check for performance reasons,  you
       can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of
       passing an invalid UTF-8 string as a pattern is undefined. It may cause
       your  program  to  crash.  Note that there is a similar option for sup-
       pressing the checking of subject strings passed to pcre_exec().
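
       A program that sets PCRE_NO_UTF8_CHECK takes over  responsibility  for
       validity.  The  following  is a sketch of a check that a caller might
       run once on its own data before setting the option. The  helper  name
       utf8_is_valid is hypothetical, and the rules it applies (those of RFC
       3629) are an assumption; PCRE's own check may accept or  reject  dif-
       ferent byte sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not PCRE's own check: returns 1 if the len bytes
   at s are well-formed UTF-8 under RFC 3629 rules (no overlong forms,
   no surrogates, nothing above U+10FFFF), otherwise 0. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }              /* ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                                /* invalid lead byte */

        if (extra > len - i - 1) return 0;            /* truncated sequence */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;  /* not a continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return 0;                       /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
        if (cp > 0x10FFFF) return 0;                  /* out of range */
        i += extra + 1;
    }
    return 1;
}
```

       Skipping PCRE's check is worthwhile only when the same strings are used
       many times; a string that has not passed some such test should never be
       combined with PCRE_NO_UTF8_CHECK.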


STUDYING A PATTERN

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       When a pattern is going to be used several times, it is worth  spending
       more  time analyzing it in order to reduce the time taken for matching.
       The function pcre_study() takes a pointer to a compiled pattern as  its
       first argument. If studying the pattern produces additional information
       that will help speed up matching, pcre_study() returns a pointer  to  a
       pcre_extra  block,  in which the study_data field points to the results
       of the study.

       The  value  returned  by  pcre_study()  can  be  passed  directly   to
       pcre_exec().  However,  the pcre_extra block also contains other fields
       that can be set by the caller before the block  is  passed;  these  are
       described  below.  If  studying  the pattern does not produce any addi-
       tional information, pcre_study() returns NULL. In that circumstance, if
       the  calling  program  wants  to  pass  some  of  the  other  fields to
       pcre_exec(), it must set up its own pcre_extra block.

       The second argument contains option bits. At present,  no  options  are
       defined for pcre_study(), and this argument should always be zero.

       The  third argument for pcre_study() is a pointer for an error message.
       If studying succeeds (even if no data is  returned),  the  variable  it
       points  to  is set to NULL. Otherwise it points to a textual error mes-
       sage. You should therefore test the error pointer for NULL after  call-
       ing pcre_study(), to be sure that it has run successfully.

       This is a typical call to pcre_study():

         const char *error;
         pcre_extra *pe;
         pe = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options exist */
           &error);        /* set to NULL or points to a message */

       At present, studying a pattern is useful only for non-anchored patterns
       that do not have a single fixed starting character. A bitmap of  possi-
       ble starting characters is created.


LOCALE SUPPORT

       PCRE  handles  caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a  set  of  tables.  When
       running  in UTF-8 mode, this applies only to characters with codes less
       than 256. The library contains a default set of tables that is  created
       in  the  default  C locale when PCRE is compiled. This is used when the
       final argument of pcre_compile() is NULL, and is  sufficient  for  many
       applications.

       An alternative set of tables can, however, be supplied. Such tables are
       built by calling the pcre_maketables() function,  which  has  no  argu-
       ments,  in  the  relevant  locale.  The  result  can  then be passed to
       pcre_compile() as often as necessary. For example,  to  build  and  use
       tables that are appropriate for the French locale (where accented char-
       acters with codes greater than 128 are treated as letters), the follow-
       ing code could be used:

         setlocale(LC_CTYPE, "fr");
         tables = pcre_maketables();
         re = pcre_compile(..., tables);

       The  tables  are  built in memory that is obtained via pcre_malloc. The
       pointer that is passed to pcre_compile is saved with the compiled  pat-
       tern, and the same tables are used via this pointer by pcre_study() and
       pcre_exec(). Thus, for any single pattern,  compilation,  studying  and
       matching  all  happen in the same locale, but different patterns can be
       compiled in different locales. It is  the  caller's  responsibility  to
       ensure  that  the memory containing the tables remains available for as
       long as it is needed.


INFORMATION ABOUT A PATTERN

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       The pcre_fullinfo() function returns information about a compiled  pat-
       tern. It replaces the obsolete pcre_info() function, which is neverthe-
       less retained for backwards compatibility (and is documented below).

       The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
       pattern.  The second argument is the result of pcre_study(), or NULL if
       the pattern was not studied. The third argument specifies  which  piece
       of  information  is required, and the fourth argument is a pointer to a
       variable to receive the data. The yield of the  function  is  zero  for
       success, or one of the following negative numbers:

         PCRE_ERROR_NULL       the argument code was NULL
                               the argument where was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found
         PCRE_ERROR_BADOPTION  the value of what was invalid

       Here  is a typical call of pcre_fullinfo(), to obtain the length of the
       compiled pattern:

         int rc;
         size_t length;
         rc = pcre_fullinfo(
           re,               /* result of pcre_compile() */
           pe,               /* result of pcre_study(), or NULL */
           PCRE_INFO_SIZE,   /* what is required */
           &length);         /* where to put the data */

       The possible values for the third argument are defined in  pcre.h,  and
       are as follows:

         PCRE_INFO_BACKREFMAX

       Return  the  number  of  the highest back reference in the pattern. The
       fourth argument should point to an int variable. Zero  is  returned  if
       there are no back references.

         PCRE_INFO_CAPTURECOUNT

       Return  the  number of capturing subpatterns in the pattern. The fourth
       argument should point to an int variable.

         PCRE_INFO_FIRSTBYTE

       Return information about the first byte of any matched  string,  for  a
       non-anchored    pattern.    (This    option    used    to   be   called
       PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards
       compatibility.)

       If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as
       (cat|cow|coyote), it is returned in the integer pointed  to  by  where.
       Otherwise, if either

       (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
       branch starts with "^", or

       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
       set (if it were set, the pattern would be anchored),

       -1  is  returned, indicating that the pattern matches only at the start
       of a subject string or after any newline within the  string.  Otherwise
       -2 is returned. For anchored patterns, -2 is returned.
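
       As a minimal sketch, the three cases above could be told apart as  fol-
       lows.  The  enum  and  function names are hypothetical and are not part
       of the PCRE API; value is assumed to be the integer that pcre_fullinfo()
       stored for PCRE_INFO_FIRSTBYTE.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FB_FIXED,       /* value >= 0: every match starts with this byte */
    FB_LINE_START,  /* -1: matches only at subject start or after a newline */
    FB_NONE         /* -2: nothing recorded, or the pattern is anchored */
};

enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FB_FIXED;
    if (value == -1)
        return FB_LINE_START;
    return FB_NONE;
}
```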

         PCRE_INFO_FIRSTTABLE

       If  the pattern was studied, and this resulted in the construction of a
       256-bit table indicating a fixed set of bytes for the first byte in any
       matching  string, a pointer to the table is returned. Otherwise NULL is
       returned. The fourth argument should point to an unsigned char *  vari-
       able.

         PCRE_INFO_LASTLITERAL

       Return  the  value of the rightmost literal byte that must exist in any
       matched string, other than at its  start,  if  such  a  byte  has  been
       recorded. The fourth argument should point to an int variable. If there
       is no such byte, -1 is returned. For anchored patterns, a last  literal
       byte  is  recorded only if it follows something of variable length. For
       example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
       /^a\dz\d/ the returned value is -1.

         PCRE_INFO_NAMECOUNT
         PCRE_INFO_NAMEENTRYSIZE
         PCRE_INFO_NAMETABLE

       PCRE  supports the use of named as well as numbered capturing parenthe-
       ses. The names are just an additional way of identifying the  parenthe-
       ses,  which still acquire a number. A caller that wants to extract data
       from a named subpattern must convert the name to a number in  order  to
       access  the  correct  pointers  in  the  output  vector (described with
       pcre_exec() below). In order to do this, it must first use these  three
       values to obtain the name-to-number mapping table for the pattern.

       The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
       gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
       of  each  entry;  both  of  these  return  an int value. The entry size
       depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
       a  pointer  to  the  first  entry of the table (a pointer to char). The
       first two bytes of each entry are the number of the capturing parenthe-
       sis,  most  significant byte first. The rest of the entry is the corre-
       sponding name, zero terminated. The names are  in  alphabetical  order.
       For  example,  consider  the following pattern (assume PCRE_EXTENDED is
       set, so white space - including newlines - is ignored):

         (?P<date> (?P<year>(\d\d)?\d\d) -
         (?P<month>\d\d) - (?P<day>\d\d) )

       There are four named subpatterns, so the table has  four  entries,  and
       each  entry  in the table is eight bytes long. The table is as follows,
       with non-printing bytes shown in hex, and undefined bytes shown as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data from named subpatterns, remember that
       the length of each entry may be different for each compiled pattern.
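
       The lookup described above can be sketched as a linear  scan  over  the
       table.  This  is  an  illustration  under stated assumptions, not library
       code: lookup_name() is a hypothetical helper, and table, count and
       entry_size  are  assumed  to  hold  the  values  obtained  for
       PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <string.h>

/* Hypothetical helper (not part of PCRE): find name in a name table laid
   out as described above. Each entry is entry_size bytes long: two bytes
   holding the subpattern number, most significant byte first, followed by
   the zero-terminated name. Returns the number, or -1 if name is absent. */
int lookup_name(const char *table, int count, int entry_size,
                const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        if (strcmp(name, (const char *)entry + 2) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

       For the table shown above, looking up "month" yields 4. Because the en-
       try  size  is  passed in rather than fixed, the same code works for any
       compiled pattern.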

         PCRE_INFO_OPTIONS

       Return  a  copy of the options with which the pattern was compiled. The
       fourth argument should point to an unsigned long  int  variable.  These
       option bits are those specified in the call to pcre_compile(), modified
       by any top-level option settings within the pattern itself.

       A pattern is automatically anchored by PCRE if  all  of  its  top-level
       alternatives begin with one of the following:

         ^     unless PCRE_MULTILINE is set
         \A    always
         \G    always
         .*    if PCRE_DOTALL is set and there are no back
                 references to the subpattern in which .* appears

       For such patterns, the PCRE_ANCHORED bit is set in the options returned
       by pcre_fullinfo().

         PCRE_INFO_SIZE

       Return the size of the compiled pattern, that is, the  value  that  was
       passed as the argument to pcre_malloc() when PCRE was getting memory in
       which to place the compiled data. The fourth argument should point to a
       size_t variable.

         PCRE_INFO_STUDYSIZE

       Returns  the  size of the data block pointed to by the study_data field
       in a pcre_extra block. That is, it is the  value  that  was  passed  to
       pcre_malloc() when PCRE was getting memory into which to place the data
       created by pcre_study(). The fourth argument should point to  a  size_t
       variable.


OBSOLETE INFO FUNCTION

       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

       The  pcre_info()  function is now obsolete because its interface is too
       restrictive to return all the available data about a compiled  pattern.
       New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of
       pcre_info() is the number of capturing subpatterns, or one of the  fol-
       lowing negative numbers:

         PCRE_ERROR_NULL       the argument code was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found

       If  the  optptr  argument is not NULL, a copy of the options with which
       the pattern was compiled is placed in the integer  it  points  to  (see
       PCRE_INFO_OPTIONS above).

       If  the  pattern  is  not anchored and the firstcharptr argument is not
       NULL, it is used to pass back information about the first character  of
       any matched string (see PCRE_INFO_FIRSTBYTE above).


MATCHING A PATTERN

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       The  function pcre_exec() is called to match a subject string against a
       pre-compiled pattern, which is passed in the code argument. If the pat-
       tern  has been studied, the result of the study should be passed in the
       extra argument.

       Here is an example of a simple call to pcre_exec():

         int rc;
         int ovector[30];
         rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector for substring information */
           30);            /* number of elements in the vector */

       If the extra argument is not NULL, it must point to a  pcre_extra  data
       block.  The pcre_study() function returns such a block (when it doesn't
       return NULL), but you can also create one for yourself, and pass  addi-
       tional information in it. The fields in the block are as follows:

         unsigned long int flags;
         void *study_data;
         unsigned long int match_limit;
         void *callout_data;

       The  flags  field  is a bitmap that specifies which of the other fields
       are set. The flag bits are:

         PCRE_EXTRA_STUDY_DATA
         PCRE_EXTRA_MATCH_LIMIT
         PCRE_EXTRA_CALLOUT_DATA

       Other flag bits should be set to zero. The study_data field is  set  in
       the  pcre_extra  block  that is returned by pcre_study(), together with
       the appropriate flag bit. You should not set this yourself, but you can
       add to the block by setting the other fields.

       The match_limit field provides a means of preventing PCRE from using up
       a vast amount of resources when running patterns that are not going  to
       match,  but  which  have  a very large number of possibilities in their
       search trees. The classic  example  is  the  use  of  nested  unlimited
       repeats. Internally, PCRE uses a function called match() which it calls
       repeatedly (sometimes recursively). The limit is imposed on the  number
       of  times  this function is called during a match, which has the effect
       of limiting the amount of recursion  and  backtracking  that  can  take
       place.  For  patterns that are not anchored, the count starts from zero
       for each position in the subject string.

       The default limit for the library can be set when PCRE  is  built;  the
       built-in default is 10 million, which handles all but the most  extreme
       cases.  You  can  lower  the  limit  by  supplying  pcre_exec()  with a
       pcre_extra  block  in  which match_limit is set to a smaller value, and
       PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
       exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

       The  callout_data  field is used in conjunction with the "callout" fea-
       ture, which is described in the pcrecallout documentation.

       The PCRE_ANCHORED option can be passed in the options  argument,  whose
       unused  bits  must  be zero. This limits pcre_exec() to matching at the
       first matching position.  However,  if  a  pattern  was  compiled  with
       PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,
       it cannot be made unanchored at matching time.

       When PCRE_UTF8 was set at compile time, the validity of the subject  as
       a  UTF-8  string is automatically checked, and the value of startoffset
       is also checked to ensure that it points to the start of a UTF-8  char-
       acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()
       returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an
       invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

       If  you  already  know that your subject is valid, and you want to skip
       these   checks   for   performance   reasons,   you   can    set    the
       PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
       do this for the second and subsequent calls to pcre_exec() if  you  are
       making  repeated  calls  to  find  all  the matches in a single subject
       string. However, you should be  sure  that  the  value  of  startoffset
       points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
       set, the effect of passing an invalid UTF-8 string as a subject,  or  a
       value  of startoffset that does not point to the start of a UTF-8 char-
       acter, is undefined. Your program may crash.
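
       When the checks are skipped, a caller may still want a cheap test  that
       startoffset  is  plausible.  The  following sketch relies only on the
       UTF-8 rule that continuation bytes have the form 10xxxxxx. The  helper
       is_utf8_char_start()  is  hypothetical and not part of PCRE; it does not
       validate the subject as a whole, and it assumes that an offset equal  to
       the subject length is acceptable.

```c
/* Hypothetical helper (not part of PCRE): nonzero if offset could be the
   start of a UTF-8 character within subject[0..length). */
int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;                 /* outside the subject altogether */
    if (offset == length)
        return 1;                 /* end of subject: assumed acceptable */
    /* a continuation byte (10xxxxxx) cannot begin a character */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```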

       There are also three further options that can be set only  at  matching
       time:

         PCRE_NOTBOL

       The  first  character  of the string is not the beginning of a line, so
       the circumflex metacharacter should not match before it.  Setting  this
       without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to
       match.

         PCRE_NOTEOL

       The end of the string is not the end of a line, so the dollar metachar-
       acter  should  not  match  it  nor (except in multiline mode) a newline
       immediately before it. Setting this without PCRE_MULTILINE (at  compile
       time) causes dollar never to match.

         PCRE_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set. If there are alternatives in the pattern, they are tried.  If  all
       the  alternatives  match  the empty string, the entire match fails. For
       example, if the pattern

         a?b?

       is applied to a string not beginning with "a" or "b",  it  matches  the
       empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
       match is not valid, so PCRE searches further into the string for occur-
       rences of "a" or "b".

       Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
       cial case of a pattern match of the empty  string  within  its  split()
       function,  and  when  using  the /g modifier. It is possible to emulate
       Perl's behaviour after matching a null string by first trying the match
       again at the same offset with PCRE_NOTEMPTY set, and then if that fails
       by advancing the starting offset (see below)  and  trying  an  ordinary
       match again.

       The  subject string is passed to pcre_exec() as a pointer in subject, a
       length in length, and a starting byte offset in startoffset. Unlike the
       pattern  string,  the  subject  may contain binary zero bytes. When the
       starting offset is zero, the search for a match starts at the beginning
       of the subject, and this is by far the most common case.

       If the pattern was compiled with the PCRE_UTF8 option, the subject must
       be a sequence of bytes that is a valid UTF-8 string, and  the  starting
       offset  must point to the beginning of a UTF-8 character. If an invalid
       UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8
       or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option
       PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not
       defined.

       A  non-zero  starting offset is useful when searching for another match
       in the same subject by calling pcre_exec() again after a previous  suc-
       cess.   Setting  startoffset differs from just passing over a shortened
       string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which  finds  occurrences  of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not  a  word  boundary.)
       When applied to the string "Mississippi" the first call to  pcre_exec()
       finds the first occurrence. If pcre_exec() is called  again  with  just
       the  remainder  of  the  subject,  namely "issippi", it does not match,
       because \B is always false at the start of the subject, which is deemed
       to  be  a  word  boundary. However, if pcre_exec() is passed the entire
       string again, but with startoffset  set  to  4,  it  finds  the  second
       occurrence  of  "iss"  because  it  is able to look behind the starting
       point to discover that it is preceded by a letter.
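
       The difference can be illustrated with a simplified  model  of  the  \B
       test.  This  is  not PCRE's implementation: not_word_boundary() is a hy-
       pothetical helper, word characters are assumed to be those accepted  by
       isalnum()  plus  underscore, and the position before the subject is as-
       sumed to count as a non-word character.

```c
#include <ctype.h>

/* Nonzero if c counts as a word character in this simplified model. */
int is_word_char(int c)
{
    return isalnum(c) || c == '_';
}

/* Simplified model of \B: nonzero if position pos in subject[0..length)
   is NOT a word boundary, that is, if the characters on either side are
   both word characters or both non-word characters. Nothing before
   offset 0 or after the end is visible, so those sides count as non-word. */
int not_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

       Called at position 0 of a subject that starts with a letter, the  func-
       tion  returns  0,  because nothing precedes the subject. Called at posi-
       tion 4 of the full string from the example above, it returns 1, because
       the preceding letter is visible.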

       If a non-zero starting offset is passed when the pattern  is  anchored,
       one  attempt  to match at the given offset is tried. This can only suc-
       ceed if the pattern does not require the match to be at  the  start  of
       the subject.

       In  general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject  may  be  picked  out  by
       parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
       this is called "capturing" in what follows, and the  phrase  "capturing
       subpattern"  is  used for a fragment of a pattern that picks out a sub-
       string. PCRE supports several other kinds of  parenthesized  subpattern
       that do not cause substrings to be captured.

       Captured  substrings are returned to the caller via a vector of integer
       offsets whose address is passed in ovector. The number of  elements  in
       the vector is passed in ovecsize. The first two-thirds of the vector is
       used to pass back captured substrings, each substring using a  pair  of
       integers.  The  remaining  third  of the vector is used as workspace by
       pcre_exec() while matching capturing subpatterns, and is not  available
       for  passing  back  information.  The  length passed in ovecsize should
       always be a multiple of three. If it is not, it is rounded down.

       When a match has been successful, information about captured substrings
       is returned in pairs of integers, starting at the beginning of ovector,
       and continuing up to two-thirds of its length at the  most.  The  first
       element of a pair is set to the offset of the first character in a sub-
       string, and the second is set to the  offset  of  the  first  character
       after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
       tor[1], identify the portion of  the  subject  string  matched  by  the
       entire  pattern.  The next pair is used for the first capturing subpat-
       tern, and so on. The value returned by pcre_exec()  is  the  number  of
       pairs  that  have  been set. If there are no capturing subpatterns, the
       return value from a successful match is 1,  indicating  that  just  the
       first pair of offsets has been set.

       Some  convenience  functions  are  provided for extracting the captured
       substrings as separate strings. These are described  in  the  following
       section.

       It is possible for capturing subpattern number n+1 to match  some
       part of the subject when subpattern n has not been  used  at  all.  For
       example, if the string "abc" is matched against the pattern (a|(z))(bc)
       subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both
       offset values corresponding to the unused subpattern are set to -1.
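
       A sketch of reading one offset pair, including the unset case, is given
       here. capture_span() is a hypothetical helper, not a PCRE function;  rc
       is  assumed  to  be the positive value returned by a successful call to
       pcre_exec().

```c
/* Hypothetical helper (not part of PCRE): locate capture number n in
   subject using the offset pairs in ovector. Returns the length of the
   captured substring and stores its start in *start. Returns -1 if n is
   outside the pairs that were set, or if the subpattern is unset. */
int capture_span(const char *subject, const int *ovector, int rc, int n,
                 const char **start)
{
    if (n < 0 || n >= rc)
        return -1;                    /* no such pair was set */
    if (ovector[2 * n] < 0)
        return -1;                    /* unset: both offsets are -1 */
    *start = subject + ovector[2 * n];
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

       For the example above ("abc" against (a|(z))(bc)), capture  1  has
       length 1, capture 2 is reported as unset, and capture 3 has length 2.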

       If a capturing subpattern is matched repeatedly, it is the last portion
       of the string that it matched that gets returned.

       If the vector is too small to hold all the captured substrings,  it  is
       used as far as possible (up to two-thirds of its length), and the func-
       tion returns a value of zero. In particular, if the  substring  offsets
       are  not  of interest, pcre_exec() may be called with ovector passed as
       NULL and ovecsize as zero. However, if the pattern contains back refer-
       ences  and  the  ovector  isn't big enough to remember the related sub-
       strings, PCRE has to get additional memory  for  use  during  matching.
       Thus it is usually advisable to supply an ovector.

       Note that pcre_fullinfo(), called with PCRE_INFO_CAPTURECOUNT,  can  be
       used to find out how many capturing subpatterns there are in a compiled
       pattern. The smallest size for ovector that will allow for  n  captured
       substrings,  in addition to the offsets of the substring matched by the
       whole pattern, is (n+1)*3.
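
       The sizing arithmetic can be written down  as  two  small  functions.
       Both names are hypothetical; they restate the rules given in this sec-
       tion and are not part of the PCRE API.

```c
/* Smallest ovector size, in ints, that can return n captured substrings
   in addition to the overall match: (n+1)*3. */
int ovector_size_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that an ovector of ovecsize ints can return.
   Each pair uses two ints from the first two-thirds of the vector and the
   last third is workspace, so a size that is not a multiple of three is
   effectively rounded down. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

       For example, the 30-element vector used in the sample call  above  can
       return  10  pairs:  the overall match plus up to 9 captured substrings.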

       If pcre_exec() fails, it returns a negative number. The  following  are
       defined in the header file:

         PCRE_ERROR_NOMATCH        (-1)

       The subject string did not match the pattern.

         PCRE_ERROR_NULL           (-2)

       Either  code  or  subject  was  passed as NULL, or ovector was NULL and
       ovecsize was not zero.

         PCRE_ERROR_BADOPTION      (-3)

       An unrecognized bit was set in the options argument.

         PCRE_ERROR_BADMAGIC       (-4)

       PCRE stores a 4-byte "magic number" at the start of the compiled  code,
       to  catch  the case when it is passed a junk pointer. This is the error
       it gives when the magic number isn't present.

         PCRE_ERROR_UNKNOWN_NODE   (-5)

       While running the pattern match, an unknown item was encountered in the
       compiled  pattern.  This  error  could be caused by a bug in PCRE or by
       overwriting of the compiled pattern.

         PCRE_ERROR_NOMEMORY       (-6)

       If a pattern contains back references, but the ovector that  is  passed
       to pcre_exec() is not big enough to remember the referenced substrings,
       PCRE gets a block of memory at the start of matching to  use  for  this
       purpose.  If the call via pcre_malloc() fails, this error is given. The
       memory is freed at the end of matching.

         PCRE_ERROR_NOSUBSTRING    (-7)

       This error is used by the pcre_copy_substring(),  pcre_get_substring(),
       and  pcre_get_substring_list()  functions  (see  below).  It  is  never
       returned by pcre_exec().

         PCRE_ERROR_MATCHLIMIT     (-8)

       The recursion and backtracking limit, as specified by  the  match_limit
       field  in  a  pcre_extra  structure (or defaulted) was reached. See the
       description above.

         PCRE_ERROR_CALLOUT        (-9)

       This error is never generated by pcre_exec() itself. It is provided for
       use  by  callout functions that want to yield a distinctive error code.
       See the pcrecallout documentation for details.

         PCRE_ERROR_BADUTF8        (-10)

       A string that contains an invalid UTF-8 byte sequence was passed  as  a
       subject.

         PCRE_ERROR_BADUTF8_OFFSET (-11)

       The UTF-8 byte sequence that was passed as a subject was valid, but the
       value of startoffset did not point to the beginning of a UTF-8  charac-
       ter.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       Captured  substrings  can  be  accessed  directly  by using the offsets
       returned by pcre_exec() in  ovector.  For  convenience,  the  functions
       pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
       string_list() are provided for extracting captured substrings  as  new,
       separate,  zero-terminated strings. These functions identify substrings
       by number. The next section describes functions  for  extracting  named
       substrings.  A  substring  that  contains  a  binary  zero is correctly
       extracted and has a further zero added on the end, but  the  result  is
       not, of course, a C string.

       The  first  three  arguments  are the same for all three of these func-
       tions: subject is the subject string which has just  been  successfully
       matched, ovector is a pointer to the vector of integer offsets that was
       passed to pcre_exec(), and stringcount is the number of substrings that
       were  captured  by  the match, including the substring that matched the
       entire regular expression. This is the value returned by  pcre_exec  if
       it  is greater than zero. If pcre_exec() returned zero, indicating that
       it ran out of space in ovector, the value passed as stringcount  should
       be the size of the vector divided by three.
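
       The rule for choosing stringcount can be sketched as follows. The  func-
       tion  name is hypothetical, and returning zero for a failed match is an
       assumption of this sketch rather than something PCRE defines.

```c
/* Hypothetical helper (not part of PCRE): choose the stringcount value to
   pass to the extraction functions, given the value rc returned by
   pcre_exec() and the ovecsize that was passed to it. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small: all pairs used */
    return 0;                   /* match failed: nothing to extract */
}
```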

       The  functions pcre_copy_substring() and pcre_get_substring() extract a
       single substring, whose number is given as  stringnumber.  A  value  of
       zero  extracts  the  substring  that  matched the entire pattern, while
       higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
       string(),  the  string  is  placed  in buffer, whose length is given by
       buffersize, while for pcre_get_substring() a new  block  of  memory  is
       obtained  via  pcre_malloc,  and its address is returned via stringptr.
       The yield of the function is the length of the  string,  not  including
       the terminating zero, or one of

         PCRE_ERROR_NOMEMORY       (-6)

       The  buffer  was too small for pcre_copy_substring(), or the attempt to
       get memory failed for pcre_get_substring().

         PCRE_ERROR_NOSUBSTRING    (-7)

       There is no substring whose number is stringnumber.

       The pcre_get_substring_list()  function  extracts  all  available  sub-
       strings  and  builds  a list of pointers to them. All this is done in a
       single block of memory which is obtained via pcre_malloc.  The  address
       of the memory block is returned via listptr, which is also the start of
       the list of string pointers. The end of the list is marked  by  a  NULL
       pointer. The yield of the function is zero if all went well, or

         PCRE_ERROR_NOMEMORY       (-6)

       if the attempt to get the memory block failed.
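
       Because the list ends with a NULL pointer, it can be walked without know-
       ing  its  length  in  advance.  A  minimal sketch, using a hypothetical
       helper name and assuming list is the  pointer  that  was  returned  via
       listptr:

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): count the strings in a
   NULL-terminated list of string pointers. */
int count_substrings(const char **list)
{
    int n = 0;
    while (list[n] != NULL)
        n++;
    return n;
}
```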

       When any of these functions encounters a substring that is unset, which
       can happen when capturing subpattern number n+1 matches  some  part  of
       the  subject, but subpattern n has not been used at all, they return an
       empty string. This can be distinguished from a genuine zero-length sub-
       string  by inspecting the appropriate offset in ovector, which is nega-
       tive for unset substrings.

       The    two    convenience    functions    pcre_free_substring()     and
       pcre_free_substring_list() can be used to free the memory returned by a
       previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),
       respectively. They do nothing more than call the function pointed to by
       pcre_free, which of course could be called directly from a  C  program.
       However,  PCRE is used in some situations where it is linked via a spe-
       cial  interface  to  another  programming  language  which  cannot  use
       pcre_free  directly;  it is for these cases that the functions are pro-
       vided.


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       To extract a substring by name, you first have to find the  associated
       number.  This can be done by calling pcre_get_stringnumber(). The first
       argument is the compiled pattern, and the second is the name. For exam-
       ple, for this pattern

         ab(?P<xxx>\d+)...

       the  number  of the subpattern called "xxx" is 1. Given the number, you
       can then extract the substring directly, or use one  of  the  functions
       described  in the previous section. For convenience, there are also two
       functions that do the whole job.

       Most   of   the   arguments    of    pcre_copy_named_substring()    and
       pcre_get_named_substring() are the same as those for the functions that
       extract by number, and so are not re-described here. There are just two
       differences.

       First,  instead  of a substring number, a substring name is given. Sec-
       ond, there is an extra argument, given at the start, which is a pointer
       to  the compiled pattern. This is needed in order to gain access to the
       name-to-number translation table.

       These functions call pcre_get_stringnumber(), and if it succeeds,  they
       then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
       ate.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

PCRE CALLOUTS

       int (*pcre_callout)(pcre_callout_block *);

       PCRE provides a feature called "callout", which is a means of temporar-
       ily passing control to the caller of PCRE  in  the  middle  of  pattern
       matching.  The  caller of PCRE provides an external function by putting
       its entry point in the global variable pcre_callout. By  default,  this
       variable contains NULL, which disables all calling out.

       Within  a  regular  expression,  (?C) indicates the points at which the
       external function is to be called.  Different  callout  points  can  be
       identified  by  putting  a number less than 256 after the letter C. The
       default value is zero.  For  example,  this  pattern  has  two  callout
       points:

         (?C1)abc(?C2)def

       During matching, when PCRE reaches a callout point (and pcre_callout is
       set), the external function is called. Its only argument is  a  pointer
       to a pcre_callout_block structure, which contains the following fields:

         int          version;
         int          callout_number;
         int         *offset_vector;
         const char  *subject;
         int          subject_length;
         int          start_match;
         int          current_position;
         int          capture_top;
         int          capture_last;
         void        *callout_data;

       The  version  field  is an integer containing the version number of the
       block format. The current version  is  zero.  The  version  number  may
       change  in  future if additional fields are added, but the intention is
       never to remove any of the existing fields.

       The callout_number field contains the number of the  callout,  as  com-
       piled into the pattern (that is, the number after ?C).

       The  offset_vector field is a pointer to the vector of offsets that was
       passed by the caller to pcre_exec(). The contents can be  inspected  in
       order  to extract substrings that have been matched so far, in the same
       way as for extracting substrings after a match has completed.

       The subject and subject_length fields contain copies of the values that
       were passed to pcre_exec().

       The  start_match  field contains the offset within the subject at which
       the current match attempt started. If the pattern is not anchored,  the
       callout  function  may  be  called several times for different starting
       points.

       The current_position field contains the offset within  the  subject  of
       the current match pointer.

       The  capture_top field contains one more than the number of the highest
       numbered  captured  substring  so  far.  If  no  substrings  have  been
       captured, the value of capture_top is one.

       The  capture_last  field  contains the number of the most recently cap-
       tured substring.

       The callout_data field contains a value that is passed  to  pcre_exec()
       by  the  caller specifically so that it can be passed back in callouts.
       It is passed in the callout_data field of the  pcre_extra  data  struc-
       ture.  If  no  such  data  was  passed,  the value of callout_data in a
       pcre_callout_block is NULL. There is a description  of  the  pcre_extra
       structure in the pcreapi documentation.


RETURN VALUES

       The callout function returns an integer. If the value is zero, matching
       proceeds as normal. If the value is greater than zero,  matching  fails
       at the current point, but backtracking to test other possibilities goes
       ahead, just as if a lookahead assertion had failed.  If  the  value  is
       less  than  zero,  the  match is abandoned, and pcre_exec() returns the
       value.

       Negative  values  should  normally  be   chosen   from   the   set   of
       PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
       dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
       reserved  for  use  by callout functions; it will never be used by PCRE
       itself.
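
       The three kinds of return value can be illustrated with a small sketch.
       To keep the example self-contained it does not include pcre.h: the
       type callout_block_sketch merely mirrors the fields listed above, and
       SKETCH_ERROR_CALLOUT is a placeholder for PCRE_ERROR_CALLOUT. The
       callout_budget type and the function budget_callout() are likewise
       invented for the illustration. Real code would use the declarations
       from pcre.h instead.

```c
#include <stddef.h>

/* Stand-in for pcre_callout_block; the fields follow the list above. */
typedef struct {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
} callout_block_sketch;

/* Placeholder for PCRE_ERROR_CALLOUT; any negative value abandons
   the match and is returned by pcre_exec(). */
#define SKETCH_ERROR_CALLOUT (-9)

/* Hypothetical caller data handed back through callout_data. */
typedef struct {
    int calls;      /* callouts seen so far */
    int max_calls;  /* abandon the match after this many */
} callout_budget;

int budget_callout(callout_block_sketch *block)
{
    callout_budget *budget = (callout_budget *)block->callout_data;

    if (budget == NULL)
        return 0;                     /* no data: proceed as normal */

    if (++budget->calls > budget->max_calls)
        return SKETCH_ERROR_CALLOUT;  /* negative: abandon the match */

    /* Positive: fail at this point, but let backtracking continue.
       Here callout 1 rejects attempts that do not start at offset 0. */
    if (block->callout_number == 1 && block->start_match != 0)
        return 1;

    return 0;                         /* zero: matching proceeds */
}
```

       With the real library, a function of this shape (taking a
       pcre_callout_block pointer) would be installed by storing its address
       in pcre_callout before calling pcre_exec().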

Last updated: 21 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

DIFFERENCES FROM PERL

       This  document describes the differences in the ways that PCRE and Perl
       handle regular expressions. The differences  described  here  are  with
       respect to Perl 5.8.

       1.  PCRE does not have full UTF-8 support. Details of what it does have
       are given in the section on UTF-8 support in the main pcre page.

       2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
       permits  them,  but they do not mean what you might think. For example,
       (?!a){3} does not assert that the next three characters are not "a". It
       just asserts that the next character is not "a" three times.

       3.  Capturing  subpatterns  that occur inside negative lookahead asser-
       tions are counted, but their entries in the offsets  vector  are  never
       set.  Perl sets its numerical variables from any such patterns that are
       matched before the assertion fails to match something (thereby succeed-
       ing),  but  only  if the negative lookahead assertion contains just one
       branch.

       4. Though binary zero characters are supported in the  subject  string,
       they are not allowed in a pattern string because it is passed as a nor-
       mal C string, terminated by zero. The escape sequence "\0" can be  used
       in the pattern to represent a binary zero.

       5.  The  following Perl escape sequences are not supported: \l, \u, \L,
       \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
       string-handling and are not part of its pattern matching engine. If any
       of these are encountered by PCRE, an error is generated.

       6. PCRE does support the \Q...\E escape for quoting substrings. Charac-
       ters  in  between  are  treated as literals. This is slightly different
       from Perl in that $ and @ are  also  handled  as  literals  inside  the
       quotes.  In Perl, they cause variable interpolation (but of course PCRE
       does not have variables). Note the following examples:

           Pattern            PCRE matches      Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

       The \Q...\E sequence is recognized both inside  and  outside  character
       classes.

       7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
       constructions. However, there is some experimental support  for  recur-
       sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).
       Also, the PCRE "callout" feature allows  an  external  function  to  be
       called during pattern matching.

       8.  There  are some differences that are concerned with the settings of
       captured strings when part of  a  pattern  is  repeated.  For  example,
       matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
       unset, but in PCRE it is set to "b".

       9. PCRE  provides  some  extensions  to  the  Perl  regular  expression
       facilities:

       (a)  Although  lookbehind  assertions  must match fixed length strings,
       each alternative branch of a lookbehind assertion can match a different
       length of string. Perl requires them all to have the same length.

       (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
       meta-character matches only at the very end of the string.

       (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
       cial meaning is faulted.

       (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
       fiers is inverted, that is, by default they are not greedy, but if fol-
       lowed by a question mark they are.

       (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at
       the first matching position in the subject string.

       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
       pcre_exec(), and the PCRE_NO_AUTO_CAPTURE option for pcre_compile(),
       have no Perl equivalents.

       (g)  The (?R), (?number), and (?P>name) constructs allow for  recursive
       pattern matching (Perl can do this using  the  (?p{code})  construct,
       which PCRE cannot support).

       (h)  PCRE supports named capturing substrings, using the Python syntax.

       (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from
       Sun's Java package.

       (j) The (R) condition, for testing recursion, is a PCRE extension.

       (k) The callout facility is PCRE-specific.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The  syntax  and semantics of the regular expressions supported by PCRE
       are described below. Regular expressions are also described in the Perl
       documentation  and in a number of other books, some of which have copi-
       ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-
       lished  by  O'Reilly, covers them in great detail. The description here
       is intended as reference documentation.

       The basic operation of PCRE is on strings of bytes. However,  there  is
       also  support for UTF-8 character strings. To use this support you must
       build PCRE to include UTF-8 support, and then call pcre_compile()  with
       the  PCRE_UTF8  option.  How  this affects the pattern matching is men-
       tioned in several places below. There is also a summary of  UTF-8  fea-
       tures in the section on UTF-8 support in the main pcre page.

       A  regular  expression  is  a pattern that is matched against a subject
       string from left to right. Most characters stand for  themselves  in  a
       pattern,  and  match  the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself.  The
       power of regular expressions comes from the ability to include alterna-
       tives and repetitions in the pattern. These are encoded in the  pattern
       by  the  use  of meta-characters, which do not stand for themselves but
       instead are interpreted in some special way.

       There are two different sets of meta-characters: those that are  recog-
       nized  anywhere in the pattern except within square brackets, and those
       that are recognized in square brackets. Outside  square  brackets,  the
       meta-characters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part  of  a  pattern  that is in square brackets is called a "character
       class". In a character class the only meta-characters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the meta-characters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphameric character, it takes  away  any  special  meaning  that
       character  may  have.  This  use  of  backslash  as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write  \*  in  the
       pattern.   This  escaping  action  applies whether or not the following
       character would otherwise be interpreted as a meta-character, so it  is
       always  safe to precede a non-alphameric with backslash to specify that
       it stands for itself. In particular, if you want to match a  backslash,
       you write \\.

       If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
       the pattern (other than in a character class) and characters between  a
       # outside a character class and the next newline character are ignored.
       An escaping backslash can be used to include a whitespace or #  charac-
       ter as part of the pattern.

       If  you  want  to remove the special meaning from a sequence of charac-
       ters, you can do so by putting them between \Q and \E. This is  differ-
       ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
       tion. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.

       A second use of backslash provides a way of encoding non-printing char-
       acters  in patterns in a visible manner. There is no restriction on the
       appearance of non-printing characters, apart from the binary zero  that
       terminates  a  pattern,  but  when  a pattern is being prepared by text
       editing, it is usually easier  to  use  one  of  the  following  escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        newline (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh... (UTF-8 mode only)

       The  precise  effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40)  is
       inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.
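
       The rule can be sketched in C as follows, assuming an ASCII character
       set. The function name control_escape() is invented for this
       illustration and is not part of PCRE.

```c
/* Returns the character code denoted by \cx for a given x.
   Hypothetical helper; assumes ASCII. */
int control_escape(int x)
{
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';   /* lower case letters become upper case */
    return x ^ 0x40;         /* invert bit 6 (hex 40) */
}
```

       With this helper, 'z' yields hex 1A, '{' yields hex 3B, and ';' yields
       hex 7B, in line with the examples above.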

       After \x, from zero to two hexadecimal digits are read (letters can  be
       in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
       its may appear between \x{ and }, but the value of the  character  code
       must  be  less  than  2**31  (that is, the maximum hexadecimal value is
       7FFFFFFF). If characters other than hexadecimal digits  appear  between
       \x{  and }, or if there is no terminating }, this form of escape is not
       recognized. Instead, the initial \x will be interpreted as a basic hex-
       adecimal escape, with no following digits, giving a byte whose value is
       zero.

       Characters whose value is less than 256 can be defined by either of the
       two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
       in the way they are handled. For example, \xdc is exactly the  same  as
       \x{dc}.

       After \0 up to two further octal digits are read. For both \x and \0,
       if there are fewer than two digits, just those that are present are
       used. Thus the sequence \0\x\07 specifies two binary zeros followed by
       a BEL character (code value 7). Make sure you supply two digits after
       the initial zero if the character that follows is itself an octal
       digit.
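
       The reading of \x and \0 described above can be sketched as a small
       decoder. This is an illustration only, not PCRE's own code: the
       function name decode_numeric_escapes() is invented, UTF-8 mode and the
       \x{...} form are not handled, and every other character (including any
       other escape sequence) is simply copied through unchanged.

```c
#include <ctype.h>
#include <stddef.h>

static int hex_value(int c)
{
    return isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
}

/* Decodes \x (zero to two hex digits) and \0 (up to two further
   octal digits) in pat, writing bytes to out. Returns the number
   of bytes written; out must be large enough for the result. */
size_t decode_numeric_escapes(const char *pat, unsigned char *out)
{
    size_t n = 0;

    while (*pat != '\0') {
        if (pat[0] == '\\' && pat[1] == 'x') {
            int v = 0, digits = 0;
            pat += 2;
            while (digits < 2 && isxdigit((unsigned char)*pat)) {
                v = v * 16 + hex_value((unsigned char)*pat++);
                digits++;
            }
            out[n++] = (unsigned char)v;
        } else if (pat[0] == '\\' && pat[1] == '0') {
            int v = 0, digits = 0;
            pat += 2;
            while (digits < 2 && *pat >= '0' && *pat <= '7') {
                v = v * 8 + (*pat++ - '0');
                digits++;
            }
            out[n++] = (unsigned char)v;
        } else {
            out[n++] = (unsigned char)*pat++;
        }
    }
    return n;
}
```

       Given the pattern text \0\x\07 the sketch produces the three bytes 0,
       0, and 7.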

       The handling of a backslash followed by a digit other than 0 is compli-
       cated.  Outside a character class, PCRE reads it and any following dig-
       its  as  a  decimal  number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression,  the  entire  sequence  is  taken  as  a  back reference. A
       description of how this works is given later, following the  discussion
       of parenthesized subpatterns.

       Inside  a  character  class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE  re-reads
       up  to three octal digits following the backslash, and generates a sin-
       gle byte from the least significant 8 bits of the value. Any subsequent
       digits stand for themselves.  For example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note  that  octal  values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.
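
       The two-stage reading of a backslash followed by a non-zero digit can
       be sketched as follows. The type digit_escape and the function
       classify_digit_escape() are invented for this illustration and are not
       part of PCRE; the sketch follows the prose above rather than PCRE's
       source, and it assumes that the first digit is in the range 1 to 9.

```c
#include <stddef.h>

typedef struct {
    int    is_backref;  /* 1: back reference, 0: single byte */
    int    value;       /* group number, or byte value */
    size_t consumed;    /* digits used after the backslash */
} digit_escape;

/* p points at the first digit after the backslash. groups_so_far is
   the number of capturing left parentheses seen before this point. */
digit_escape classify_digit_escape(const char *p, int groups_so_far,
                                   int in_class)
{
    digit_escape r = {0, 0, 0};
    long n = 0;
    size_t ndec = 0;
    int v = 0;

    /* First reading: all following digits as one decimal number. */
    while (p[ndec] >= '0' && p[ndec] <= '9') {
        if (n < 100000)              /* avoid overflow on long runs */
            n = n * 10 + (p[ndec] - '0');
        ndec++;
    }
    if (!in_class && ndec > 0 && (n < 10 || n <= groups_so_far)) {
        r.is_backref = 1;
        r.value = (int)n;
        r.consumed = ndec;
        return r;
    }

    /* Second reading: up to three octal digits, low 8 bits kept. */
    while (r.consumed < 3 && p[r.consumed] >= '0' && p[r.consumed] <= '7') {
        v = v * 8 + (p[r.consumed] - '0');
        r.consumed++;
    }
    r.value = v & 0xFF;
    return r;
}
```

       For \81 with fewer than 81 capturing subpatterns, the second reading
       finds no octal digits, so the result is a binary zero with no digits
       consumed, and the "8" and "1" remain as literal characters.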

       All the sequences that define a single byte value  or  a  single  UTF-8
       character (in UTF-8 mode) can be used both inside and outside character
       classes. In addition, inside a character  class,  the  sequence  \b  is
       interpreted  as  the  backspace character (hex 08). Outside a character
       class it has a different meaning (see below).

       The third use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into  two disjoint sets. Any given character matches one, and only one,
       of each pair.

       In UTF-8 mode, characters with values greater than 255 never match  \d,
       \s, or \w, and always match \D, \S, and \W.

       For  compatibility  with Perl, \s does not match the VT character (code
       11).  This makes it different from the POSIX "space"  class.   The  \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32).
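
       The set can be written out directly in C. The names is_pcre_space()
       and space_only_in_ctype() are invented for this illustration. The
       second function uses the standard isspace() as a stand-in for the
       POSIX "space" class and assumes the default "C" locale.

```c
#include <ctype.h>

/* The five characters matched by \s, as listed above. */
int is_pcre_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* 1 for characters that isspace() accepts but \s does not.
   c must be a value in the range of unsigned char. */
int space_only_in_ctype(int c)
{
    return isspace(c) != 0 && !is_pcre_space(c);
}
```

       In the "C" locale the only character for which the second function
       returns 1 is VT (11).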

       A  "word" character is any letter or digit or the underscore character,
       that is, any character which can be part of a Perl "word". The  defini-
       tion  of  letters  and digits is controlled by PCRE's character tables,
       and may vary if locale-specific matching is taking place (see  "Locale
       support"  in  the  pcreapi  page).  For  example,  in the "fr" (French)
       locale, some character codes greater than 128  are  used  for  accented
       letters, and these are matched by \w.

       These character type sequences can appear both inside and outside char-
       acter classes. They each match one character of the  appropriate  type.
       If  the current matching point is at the end of the subject string, all
       of them fail, since there is no character to match.

       The fourth use of backslash is for certain simple assertions. An asser-
       tion  specifies a condition that has to be met at a particular point in
       a match, without consuming any characters from the subject string.  The
       use  of subpatterns for more complicated assertions is described below.
       The backslashed assertions are

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at start of subject
         \Z     matches at end of subject or before newline at end
         \z     matches at end of subject
         \G     matches at first matching position in subject

       These assertions may not appear in character classes (but note that  \b
       has a different meaning, namely the backspace character, inside a char-
       acter class).

       A word boundary is a position in the subject string where  the  current
       character  and  the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or  end  of  the
       string if the first or last character matches \w, respectively.
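
       The definition amounts to comparing the "word" status of the characters
       on either side of a position, treating the outside of the string as
       non-word. A minimal sketch, assuming single-byte characters and the
       default "C" locale (the function names are invented for the example):

```c
#include <ctype.h>
#include <stddef.h>

/* A "word" character: letter, digit, or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos (0..length) in subject is a word
   boundary in the sense described above, otherwise 0. */
int is_word_boundary(const char *subject, size_t length, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);

    return before != after;
}
```

       For the subject "ab cd" this reports boundaries at offsets 0, 2, 3, and
       5, and none at offsets 1 and 4.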

       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
       and dollar (described below) in that they only ever match at  the  very
       start  and  end  of the subject string, whatever options are set. Thus,
       they are independent of multiline mode.

       These assertions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL
       options. If the startoffset argument of pcre_exec() is non-zero,
       indicating that matching is to start at a point other than the
       beginning of the subject, \A can never match. The difference between \Z
       and \z is that \Z matches before a newline that is the last character
       of the string as well as at the end of the string, whereas \z matches
       only at the end.

       The  \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset  argument
       of  pcre_exec().  It  differs  from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate  argu-
       ments, you can mimic Perl's /g option, and it is in this kind of imple-
       mentation where \G can be useful.

       Note, however, that PCRE's interpretation of \G, as the  start  of  the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can  be  different  when  the
       previously  matched  string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the  expression  is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default matching mode, the circumflex
       character  is  an  assertion which is true only if the current matching
       point is at the start of the subject string. If the  startoffset  argu-
       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
       has an entirely different meaning (see below).

       Circumflex  need  not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in  each
       alternative  in  which  it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that  is,
       if  the  pattern  is constrained to match only at the start of the sub-
       ject, it is said to be an "anchored" pattern.  (There  are  also  other
       constructs that can cause a pattern to be anchored.)

       A  dollar  character  is an assertion which is true only if the current
       matching point is at the end of  the  subject  string,  or  immediately
       before a newline character that is the last character in the string (by
       default). Dollar need not be the last character of  the  pattern  if  a
       number  of alternatives are involved, but it should be the last item in
       any branch in which it appears.  Dollar has no  special  meaning  in  a
       character class.

       The  meaning  of  dollar  can be changed so that it matches only at the
       very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are changed if the
       PCRE_MULTILINE option is set. When this is the case, they match immedi-
       ately  after  and  immediately  before  an  internal newline character,
       respectively, in addition to matching at the start and end of the  sub-
       ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
       string "def\nabc" in multiline mode, but not  otherwise.  Consequently,
       patterns  that  are  anchored  in single line mode because all branches
       start with ^ are not anchored in multiline mode, and a match  for  cir-
       cumflex  is  possible  when  the startoffset argument of pcre_exec() is
       non-zero. The PCRE_DOLLAR_ENDONLY option is ignored  if  PCRE_MULTILINE
       is set.
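
       The rules for circumflex and dollar can be modelled by two small
       predicates. This is a sketch of the prose above, not PCRE's
       implementation: the function names are invented, the flags are plain
       integers rather than PCRE option bits, and PCRE_NOTBOL, PCRE_NOTEOL,
       and the startoffset argument are not modelled.

```c
#include <stddef.h>

/* Does ^ match at offset pos of subject s? */
int circumflex_matches(const char *s, size_t pos, int multiline)
{
    if (pos == 0)
        return 1;                         /* start of subject */
    return multiline && s[pos - 1] == '\n';
}

/* Does $ match at offset pos of subject s, which is len bytes long? */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos == len)
        return 1;                         /* very end of subject */
    if (multiline)
        return s[pos] == '\n';            /* dollar_endonly is ignored */
    if (dollar_endonly)
        return 0;
    return pos + 1 == len && s[pos] == '\n';  /* before a final newline */
}
```

       For the subject "def\nabc", circumflex_matches() accepts offset 4 only
       when multiline is set, which is why /^abc$/ matches that subject in
       multiline mode but not otherwise.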

       Note  that  the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a  pattern
       start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
       not.


FULL STOP (PERIOD, DOT)

       Outside a character class, a dot in the pattern matches any one charac-
       ter  in  the  subject,  including a non-printing character, but not (by
       default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
       which  might  be  more than one byte long, except (by default) for new-
       line. If the PCRE_DOTALL option is set, dots match  newlines  as  well.
       The  handling of dot is entirely independent of the handling of circum-
       flex and dollar, the only relationship being  that  they  both  involve
       newline characters. Dot has no special meaning in a character class.


MATCHING A SINGLE BYTE

       Outside a character class, the escape sequence \C matches any one byte,
       both in and out of UTF-8 mode. Unlike a dot, it always matches  a  new-
       line.  The  feature  is  provided  in Perl in order to match individual
       bytes in UTF-8 mode.  Because it breaks up UTF-8 characters into  indi-
       vidual  bytes,  what  remains  in  the  string may be a malformed UTF-8
       string. For this reason it is best avoided.

       PCRE does not allow \C to appear in lookbehind assertions (see  below),
       because in UTF-8 mode it makes it impossible to calculate the length of
       the lookbehind.


SQUARE BRACKETS

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not spe-
       cial. If a closing square bracket is required as a member of the class,
       it  should  be  the first data character in the class (after an initial
       circumflex, if present) or escaped with a backslash.

       A character class matches a single character in the subject.  In  UTF-8
       mode,  the character may occupy more than one byte. A matched character
       must be in the set of characters defined by the class, unless the first
       character  in  the  class definition is a circumflex, in which case the
       subject character must not be in the set defined by  the  class.  If  a
       circumflex  is actually required as a member of the class, ensure it is
       not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case  vowel,
       while  [^aeiou]  matches  any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters which are in the class by enumerating those that are not. It
       is not an assertion: it still consumes a  character  from  the  subject
       string, and fails if the current pointer is at the end of the string.

       In  UTF-8 mode, characters with values greater than 255 can be included
       in a class as a literal string of bytes, or by using the  \x{  escaping
       mechanism.

       When  caseless  matching  is set, any letters in a class represent both
       their upper case and lower case versions, so for  example,  a  caseless
       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. PCRE does not  support  the
       concept of case for characters with values greater than 255.

       The  newline character is never treated in any special way in character
       classes, whatever the setting  of  the  PCRE_DOTALL  or  PCRE_MULTILINE
       options is. A class such as [^a] will always match a newline.

       The  minus (hyphen) character can be used to specify a range of charac-
       ters in a character  class.  For  example,  [d-m]  matches  any  letter
       between  d  and  m,  inclusive.  If  a minus character is required in a
       class, it must be escaped with a backslash  or  appear  in  a  position
       where  it cannot be interpreted as indicating a range, typically as the
       first or last character in the class.

       It is not possible to have the literal character "]" as the end charac-
       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so  it
       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
       preted  as  a  single class containing a range followed by two separate
       characters. The octal or hexadecimal representation of "]" can also  be
       used to end a range.
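
       The two readings of "]" can be compared side by side. This is a minimal
       sketch using Python's re module, assumed here to parse these two
       classes the same way as PCRE; the test strings are made up for the
       illustration.

```python
import re

# [W-] is a two-character class; the "46]" that follows is literal text.
two_char_class = re.compile(r"[W-]46]")

# With "]" escaped, W-\] is a range (codes 0x57 to 0x5D) plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

literal_hits = [s for s in ("W46]", "-46]", "X46]") if two_char_class.fullmatch(s)]
range_hits = [c for c in "WXZ[]46-" if range_class.fullmatch(c)]
```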

       Ranges  operate in the collating sequence of character values. They can
       also  be  used  for  characters  specified  numerically,  for   example
       [\000-\037].  In UTF-8 mode, ranges can include characters whose values
       are greater than 255, for example [\x{100}-\x{2ff}].

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\^_`wxyzabc], matched caselessly, and if character tables for the
       "fr"  locale  are  in use, [\xc8-\xcb] matches accented E characters in
       both cases.

       The character types \d, \D, \s, \S, \w, and \W may  also  appear  in  a
       character  class,  and add the characters that they match to the class.
       For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
       conveniently  be  used with the upper case character types to specify a
       more restricted set of characters than the matching  lower  case  type.
       For  example,  the  class  [^\W_]  matches any letter or digit, but not
       underscore.
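
       As a rough illustration of character types inside classes, the sketch
       below uses Python's re module (an assumption: for ASCII input it treats
       \d and \W inside a class as described above).

```python
import re

hex_upper = re.compile(r"[\dABCDEF]")   # digits plus the letters A to F
alnum_only = re.compile(r"[^\W_]")      # word characters minus underscore

hex_hits = [c for c in "09AFGa_" if hex_upper.fullmatch(c)]
alnum_hits = [c for c in "aZ5_-" if alnum_only.fullmatch(c)]
```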

       All non-alphanumeric characters other than \, -, ^ (at the start) and
       the terminating ] are non-special in character classes, but it does no
       harm if they are escaped.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation  for  character  classes,  which  uses
       names  enclosed by [: and :] within the enclosing square brackets. PCRE
       also supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
       and space (32). Notice that this list includes the VT  character  (code
       11). This makes "space" different to \s, which does not include VT (for
       Perl compatibility).

       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
       from  Perl  5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       In UTF-8 mode, characters with values greater than 255 do not match any
       of the POSIX character classes.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns.  For
       example, the pattern

         gilbert|sullivan

       matches  either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty  alternative  is  permitted  (matching  the  empty
       string).   The  matching  process  tries each alternative in turn, from
       left to right, and the first one that succeeds is used. If the alterna-
       tives  are within a subpattern (defined below), "succeeds" means match-
       ing the rest of the main pattern as well as the alternative in the sub-
       pattern.
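
       The left-to-right rule has visible consequences, which the following
       sketch tries to show. It uses Python's re module, assumed to order
       alternatives the same way; the patterns other than gilbert|sullivan are
       invented for the example.

```python
import re

# Alternatives are tried left to right; the first that succeeds is used,
# even when a later alternative would match more text.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, "succeeds" includes the rest of the pattern: with "a"
# the following "c" cannot match, so the matcher moves on to "ab".
inner = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|sullivan|", "") is not None
```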


INTERNAL OPTION SETTING

       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
       PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
       sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
       option letters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
       is also permitted. If a  letter  appears  both  before  and  after  the
       hyphen, the option is unset.

       When  an option change occurs at top level (that is, not inside subpat-
       tern parentheses), the change applies to the remainder of  the  pattern
       that follows.  If the change is placed right at the start of a pattern,
       PCRE extracts it into the global options (and it will therefore show up
       in data extracted by the pcre_fullinfo() function).

       An option change within a subpattern affects only that part of the cur-
       rent pattern that follows it, so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
       used).   By  this means, options can be made to have different settings
       in different parts of the pattern. Any changes made in one  alternative
       do  carry  on  into subsequent branches within the same subpattern. For
       example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There  would  be
       some very weird behaviour otherwise.

       The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
       in the same way as the Perl-compatible options by using the  characters
       U  and X respectively. The (?X) flag setting is special in that it must
       always occur earlier in the pattern than any of the additional features
       it turns on, even when it is at top level. It is best put at the start.


SUBPATTERNS

       Subpatterns are delimited by parentheses (round brackets), which can be
       nested.  Marking part of a pattern as a subpattern does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches  one  of the words "cat", "cataract", or "caterpillar". Without
       the parentheses, it would match "cataract",  "erpillar"  or  the  empty
       string.

       2.  It  sets  up  the  subpattern as a capturing subpattern (as defined
       above).  When the whole pattern matches, that portion  of  the  subject
       string that matched the subpattern is passed back to the caller via the
       ovector argument of pcre_exec(). Opening parentheses are  counted  from
       left  to right (starting from 1) to obtain the numbers of the capturing
       subpatterns.

       For example, if the string "the red king" is matched against  the  pat-
       tern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are num-
       bered 1, 2, and 3, respectively.
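
       Both functions of parentheses can be observed in a few lines. The
       sketch assumes Python's re module numbers groups in the same way; in
       PCRE itself the substrings would be read from the ovector argument of
       pcre_exec() instead.

```python
import re

m = re.search(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()   # groups 1, 2, 3 in order of their opening parentheses

# Parentheses also localize alternatives: the empty third branch allows "cat".
words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
         if re.fullmatch(r"cat(aract|erpillar|)", w)]
```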

       The fact that plain parentheses fulfil  two  functions  is  not  always
       helpful.   There are often times when a grouping subpattern is required
       without a capturing requirement. If an opening parenthesis is  followed
       by  a question mark and a colon, the subpattern does not do any captur-
       ing, and is not counted when computing the  number  of  any  subsequent
       capturing  subpatterns. For example, if the string "the white queen" is
       matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1  and 2. The maximum number of capturing subpatterns is 65535, and the
       maximum depth of nesting of all subpatterns, both  capturing  and  non-
       capturing, is 200.
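
       A quick check of how (?: affects numbering, again using Python's re
       module as an assumed stand-in for PCRE:

```python
import re

m = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
captured = m.groups()      # the (?:...) group contributes no entry
group_count = m.re.groups  # number of capturing groups in the pattern
```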

       As  a  convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern,  the  option  letters  may  appear
       between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until  the  end  of
       the  subpattern is reached, an option setting in one branch does affect
       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
       "Saturday".


NAMED SUBPATTERNS

       Identifying  capturing  parentheses  by number is simple, but it can be
       very hard to keep track of the numbers in complicated  regular  expres-
       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
       change. To help with the difficulty, PCRE supports the naming  of  sub-
       patterns,  something  that  Perl  does  not  provide. The Python syntax
       (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
       underscores, and must be unique within a pattern.

       Named  capturing  parentheses  are  still  allocated numbers as well as
       names. The PCRE API provides function calls for extracting the name-to-
       number  translation  table from a compiled pattern. For further details
       see the pcreapi documentation.
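
       Since the syntax is borrowed from Python, it can be tried directly in
       Python's re module. In this sketch the pattern and the names "year" and
       "month" are invented, and the groupindex attribute is Python's own
       counterpart of a name-to-number table, not part of the PCRE API.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2003-08")

by_name = m.group("year")            # look the substring up by name
by_number = m.group(1)               # the same group still has a number
name_table = dict(date.groupindex)   # name-to-number mapping
```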


REPETITION

       Repetition is specified by quantifiers, which can  follow  any  of  the
       following items:

         a literal data character
         the . metacharacter
         the \C escape sequence
         escapes such as \d that match single characters
         a character class
         a back reference (see next section)
         a parenthesized subpattern (unless it is an assertion)

       The  general repetition quantifier specifies a minimum and maximum num-
       ber of permitted matches, by giving the two numbers in  curly  brackets
       (braces),  separated  by  a comma. The numbers must be less than 65536,
       and the first must be less than or equal to the second. For example:

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
       special  character.  If  the second number is omitted, but the comma is
       present, there is no upper limit; if the second number  and  the  comma
       are  both omitted, the quantifier specifies an exact number of required
       matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, while

         \d{8}

       matches exactly 8 digits. An opening curly bracket that  appears  in  a
       position  where a quantifier is not allowed, or one that does not match
       the syntax of a quantifier, is taken as a literal character. For  exam-
       ple, {,6} is not a quantifier, but a literal string of four characters.
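
       The quantifier forms above can be exercised as follows. The sketch uses
       Python's re module and a hypothetical helper called matches(). The
       {,6} case is left out on purpose, because Python reads it as {0,6},
       unlike the behaviour described here.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

z_runs = matches(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])
vowel_runs = matches(r"[aeiou]{3,}", ["ae", "aei", "aeiouaeiou"])
eight_digits = matches(r"\d{8}", ["1234567", "12345678", "123456789"])

# A closing brace on its own is an ordinary character.
lone_brace = matches(r"a}", ["a}"])
```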

       In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
       individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
       acters, each of which is represented by a two-byte sequence.

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present.

       For  convenience  (and  historical compatibility) the three most common
       quantifiers have single-character abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It is possible to construct infinite loops by  following  a  subpattern
       that can match no characters with a quantifier that has no upper limit,
       for example:

         (a?)*

       Earlier versions of Perl and PCRE used to give an error at compile time
       for  such  patterns. However, because there are cases where this can be
       useful, such patterns are now accepted, but if any  repetition  of  the
       subpattern  does in fact match no characters, the loop is forcibly bro-
       ken.

       By default, the quantifiers are "greedy", that is, they match  as  much
       as  possible  (up  to  the  maximum number of permitted times), without
       causing the rest of the pattern to fail. The classic example  of  where
       this gives problems is in trying to match comments in C programs. These
       appear between the sequences /* and */ and within the  sequence,  indi-
       vidual * and / characters may appear. An attempt to match C comments by
       applying the pattern

         /\*.*\*/

       to the string

         /* first comment */  not comment  /* second comment */

       fails, because it matches the entire string owing to the greediness  of
       the .*  item.

       However,  if  a quantifier is followed by a question mark, it ceases to
       be greedy, and instead matches the minimum number of times possible, so
       the pattern

         /\*.*?\*/

       does  the  right  thing with the C comments. The meaning of the various
       quantifiers is not otherwise changed,  just  the  preferred  number  of
       matches.   Do  not  confuse this use of question mark with its use as a
       quantifier in its own right. Because it has two uses, it can  sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.
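
       The difference between greedy and minimizing repeats is easy to see in
       practice. A minimal sketch, assuming Python's re module treats ? after
       a quantifier as PCRE does:

```python
import re

text = "/* first comment */  not comment  /* second comment */"

greedy = re.search(r"/\*.*\*/", text).group(0)   # runs on to the last */
lazy = re.search(r"/\*.*?\*/", text).group(0)    # stops at the first */

# \d??\d prefers one digit but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```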

       If the PCRE_UNGREEDY option is set (an option which is not available in
       Perl),  the  quantifiers are not greedy by default, but individual ones
       can be made greedy by following them with a  question  mark.  In  other
       words, it inverts the default behaviour.

       When  a  parenthesized  subpattern  is quantified with a minimum repeat
       count that is greater than 1 or with a limited maximum, more  store  is
       required  for  the  compiled  pattern, in proportion to the size of the
       minimum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
       alent  to Perl's /s) is set, thus allowing the . to match newlines, the
       pattern is implicitly anchored, because whatever follows will be  tried
       against  every character position in the subject string, so there is no
       point in retrying the overall match at any position  after  the  first.
       PCRE normally treats such a pattern as though it were preceded by \A.

       In  cases  where  it  is known that the subject string contains no new-
       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
       mization, or alternatively using ^ to indicate anchoring explicitly.

       However,  there is one situation where the optimization cannot be used.
       When .*  is inside capturing parentheses that  are  the  subject  of  a
       backreference  elsewhere in the pattern, a match at the start may fail,
       and a later one succeed. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth  charac-
       ter. For this reason, such a pattern is not implicitly anchored.
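
       The match point can be confirmed with a short test. Python's re module
       is used here on the assumption that it finds the same leftmost match;
       offsets are counted from zero, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")
start_offset = m.start()   # where the overall match begins
captured = m.group(1)      # what (.*) held when \1 succeeded
```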

       When a capturing subpattern is repeated, the value captured is the sub-
       string that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is  "tweedledee".  However,  if there are nested capturing subpatterns,
       the corresponding captured values may have been set in previous  itera-
       tions. For example, after

         /(a|(b))+/

       matches "aba" the value of the second captured substring is "b".
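
       Both cases can be reproduced in a few lines. The sketch assumes that
       Python's re module keeps captured values across iterations in the same
       way as described above.

```python
import re

# The outer group reports the substring from its final iteration.
last_iter = re.match(r"(tweedle[dume]{3}\s*)+",
                     "tweedledum tweedledee").group(1)

# Group 2 was set in the second iteration and is not cleared by the third.
nested = re.match(r"(a|(b))+", "aba").groups()
```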


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With both maximizing and minimizing repetition, failure of what follows
       normally causes the repeated item to be re-evaluated to see if a
       different number of repeats allows the rest of the pattern to match.
       Sometimes it is useful to prevent this, either to change the nature of
       the match, or to cause it to fail earlier than it otherwise might, when
       the author of the pattern knows there is no point in carrying on.

       Consider, for example, the pattern \d+foo when applied to  the  subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is to try again with only 5 digits  matching  the
       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
       the  means for specifying that once a subpattern has matched, it is not
       to be re-evaluated in this way.

       If we use atomic grouping for the previous example, the  matcher  would
       give up immediately on failing to match "foo" the first time. The nota-
       tion is a kind of special parenthesis, starting with  (?>  as  in  this
       example:

         (?>\d+)foo

       This  kind  of  parenthesis "locks up" the  part of the pattern it con-
       tains once it has matched, and a failure further into  the  pattern  is
       prevented  from  backtracking into it. Backtracking past it to previous
       items, however, works as normal.

       An alternative description is that a subpattern of  this  type  matches
       the  string  of  characters  that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be thought of as a maximizing repeat that
       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
       pared  to  adjust  the number of digits they match in order to make the
       rest of the pattern match, (?>\d+) can only match an entire sequence of
       digits.
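
       The effect of "swallowing everything" can be demonstrated even without
       the (?> syntax. The sketch below uses Python's re module and a swapped-
       in technique: a lookahead that captures the digits, followed by a back
       reference that consumes them. A lookahead is not re-entered once it has
       matched, so (?=(\d+))\1 behaves like (?>\d+) apart from using up a
       capturing group. Python 3.11 and later also accept (?>\d+) directly.

```python
import re

subject = "123456"

# Ordinary repeat: \d+ gives back one digit so that the final \d can match.
plain = re.fullmatch(r"\d+\d", subject)

# Emulated atomic group: the lookahead captures the whole run of digits and
# \1 then consumes exactly that run, with nothing handed back.
atomic = re.fullmatch(r"(?=(\d+))\1\d", subject)

# When what follows can match after the full run, the result is unchanged.
with_foo = re.search(r"(?=(\d+))\1foo", "123456foo").group(0)
```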

       Atomic  groups in general can of course contain arbitrarily complicated
       subpatterns, and can be nested. However, when  the  subpattern  for  an
       atomic group is just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier" can  be  used.  This
       consists  of  an  additional  + character following a quantifier. Using
       this notation, the previous example can be rewritten as

         \d++foo

       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
       simpler forms of atomic group. However, there is no difference  in  the
       meaning  or  processing  of  a possessive quantifier and the equivalent
       atomic group.

       The possessive quantifier syntax is an extension to the Perl syntax. It
       originates in Sun's Java package.

       When  a  pattern  contains an unlimited repeat inside a subpattern that
       can itself be repeated an unlimited number of  times,  the  use  of  an
       atomic  group  is  the  only way to avoid some failing matches taking a
       very long time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist  of  non-
       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
       matches, it runs quickly. However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting  failure.  This  is  because  the
       string  can  be  divided  between  the two repeats in a large number of
       ways, and all have to be tried. (The example used [!?]  rather  than  a
       single  character  at the end, because both PCRE and Perl have an opti-
       mization that allows for fast failure when a single character is  used.
       They  remember  the last single character that is required for a match,
       and fail early if it is not present in the string.)  If the pattern  is
       changed to

         ((?>\D+)|<\d+>)*[!?]

       sequences  of non-digits cannot be broken, and failure happens quickly.


BACK REFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a capturing sub-
       pattern earlier (that is, to its left) in the pattern,  provided  there
       have been that many previous capturing left parentheses.

       However, if the decimal number following the backslash is less than 10,
       it is always taken as a back reference, and causes  an  error  only  if
       there  are  not that many capturing left parentheses in the entire pat-
       tern. In other words, the parentheses that are referenced need  not  be
       to  the left of the reference for numbers less than 10. See the section
       entitled "Backslash" above for further details of the handling of  dig-
       its following a backslash.

       A  back  reference matches whatever actually matched the capturing sub-
       pattern in the current subject string, rather  than  anything  matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not  "sense and responsibility". If caseful matching is in force at the
       time of the back reference, the case of letters is relevant. For  exam-
       ple,

         ((?i)rah)\s+\1

       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
       original capturing subpattern is matched caselessly.
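
       These two examples can be tried as follows, using Python's re module as
       an assumed stand-in. The second pattern is adapted: the scoped form
       (?i:rah) replaces (?i)rah, because recent Python versions reject an
       option setting in the middle of a pattern.

```python
import re

pair = re.compile(r"(sens|respons)e and \1ibility")
titles = ["sense and sensibility",
          "response and responsibility",
          "sense and responsibility"]
pair_hits = [t for t in titles if pair.fullmatch(t)]

# Caseless capture, caseful back reference.
rah = re.compile(r"((?i:rah))\s+\1")
rah_hits = [s for s in ("rah rah", "RAH RAH", "RAH rah") if rah.fullmatch(s)]
```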

       Back references to named subpatterns use the Python  syntax  (?P=name).
       We could rewrite the above example as follows:

         (?P<p1>(?i)rah)\s+(?P=p1)

       There  may be more than one back reference to the same subpattern. If a
       subpattern has not actually been used in a particular match,  any  back
       references to it always fail. For example, the pattern

         (a|(bc))\2

       always  fails if it starts to match "a" rather than "bc". Because there
       may be many capturing parentheses in a pattern,  all  digits  following
       the  backslash  are taken as part of a potential back reference number.
       If the pattern continues with a digit character, some delimiter must be
       used  to  terminate  the back reference. If the PCRE_EXTENDED option is
       set, this can be whitespace.  Otherwise an empty comment can be used.
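
       A brief check of both points, assuming Python's re module handles unset
       groups and empty comments in the same way. The pattern (a)\1(?#)0 is
       invented for the illustration.

```python
import re

p = re.compile(r"(a|(bc))\2")
unset_fails = p.search("aa")        # group 2 took no part, so \2 fails
set_matches = p.fullmatch("bcbc")   # group 2 is "bc", and \2 repeats it

# An empty comment separates back reference \1 from a following literal 0.
delimited = re.fullmatch(r"(a)\1(?#)0", "aa0")
```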

       A back reference that occurs inside the parentheses to which it  refers
       fails  when  the subpattern is first used, so, for example, (a\1) never
       matches.  However, such references can be useful inside  repeated  sub-
       patterns. For example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of the subpattern,  the  back  reference  matches  the  character
       string  corresponding  to  the previous iteration. In order for this to
       work, the pattern must be such that the first iteration does  not  need
       to  match the back reference. This can be done using alternation, as in
       the example above, or by a quantifier with a minimum of zero.


ASSERTIONS

       An assertion is a test on the characters  following  or  preceding  the
       current  matching  point that does not actually consume any characters.
       The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
       described above.  More complicated assertions are coded as subpatterns.
       There are two kinds: those that look ahead of the current  position  in
       the subject string, and those that look behind it.

       An  assertion  subpattern  is matched in the normal way, except that it
       does not cause the current matching position to be  changed.  Lookahead
       assertions  start with (?= for positive assertions and (?! for negative
       assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the  semi-
       colon in the match, and

         foo(?!bar)

       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar"  that  is  preceded  by  something
       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve this effect.

       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is  with  (?!)  because  an  empty  string
       always  matches, so an assertion that requires there not to be an empty
       string must always fail.
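
       The lookahead examples behave as follows when run through Python's re
       module, which is assumed here to implement (?= and (?! like PCRE. The
       subject strings are made up for the illustration.

```python
import re

# The semicolon is tested but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "alpha;beta").group(0)

# Only the second "foo" is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds "bar" even after "foo": the assertion looks at the text
# starting where "bar" itself begins.
any_bar = re.search(r"(?!foo)bar", "foobar")

# (?!) always fails, so the first alternative can never be used.
forced_fail = re.search(r"a(?!)|b", "ab").group(0)
```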

       Lookbehind assertions start with (?<= for positive assertions and  (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
       contents of a lookbehind assertion are restricted  such  that  all  the
       strings it matches must have a fixed length. However, if there are sev-
       eral alternatives, they do not all have to have the same fixed  length.
       Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes  an  error at compile time. Branches that match different length
       strings are permitted only at the top level of a lookbehind  assertion.
       This  is  an  extension  compared  with  Perl (at least for 5.8), which
       requires all branches to match the same length of string. An  assertion
       such as

         (?<=ab(c|de))

       is  not  permitted,  because  its single top-level branch can match two
       different lengths, but it is acceptable if rewritten to  use  two  top-
       level branches:

         (?<=abc|abde)

       The  implementation  of lookbehind assertions is, for each alternative,
       to temporarily move the current position back by the  fixed  width  and
       then try to match. If there are insufficient characters before the cur-
       rent position, the match is deemed to fail.
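
       The following sketch exercises lookbehind with Python's re module. It
       is only a partial stand-in: Python is stricter than the rules given
       here and also rejects (?<=bullock|donkey), so only fixed-length cases
       are shown, together with the compile-time rejection.

```python
import re

# "bar" at offset 3 follows "foo"; the ones at offsets 8 and 12 do not.
not_after_foo = [m.start()
                 for m in re.finditer(r"(?<!foo)bar", "foobar xbar bar")]

# Fewer than three characters precede "d", so the lookbehind cannot match.
too_short = re.search(r"(?<=abc)d", "bcd")

# A branch that can match two different lengths is refused at compile time.
try:
    re.compile(r"(?<!dogs?|cats?)")
    rejected = False
except re.error:
    rejected = True
```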

       PCRE does not allow the \C escape (which matches a single byte in UTF-8
       mode)  to appear in lookbehind assertions, because it makes it impossi-
       ble to calculate the length of the lookbehind.

       Atomic groups can be used in conjunction with lookbehind assertions  to
       specify efficient matching at the end of the subject string. Consider a
       simple pattern such as

         abcd$

       when applied to a long string that does  not  match.  Because  matching
       proceeds from left to right, PCRE will look for each "a" in the subject
       and then see if what follows matches the rest of the  pattern.  If  the
       pattern is specified as

         ^.*abcd$

       the  initial .* matches the entire string at first, but when this fails
       (because there is no following "a"), it backtracks to match all but the
       last  character,  then all but the last two characters, and so on. Once
       again the search for "a" covers the entire string, from right to  left,
       so we are no better off. However, if the pattern is written as

         ^(?>.*)(?<=abcd)

       or, equivalently,

         ^.*+(?<=abcd)

       there  can  be  no  backtracking for the .* item; it can match only the
       entire string. The subsequent lookbehind assertion does a  single  test
       on  the last four characters. If it fails, the match fails immediately.
       For long strings, this approach makes a significant difference  to  the
       processing time.

       Several assertions (of any sort) may occur in succession. For example,

         (?<=\d{3})(?<!999)foo

       matches "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the same point in
       the subject string. First there is a check that the previous three
       characters are all digits, and then there is a check that the same
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and
       the last three of which are not "999". For example, it doesn't match
       "123abcfoo". A pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the  preceding  six  characters,
       checking that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".
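
       The two patterns can be compared on a handful of subjects. A minimal
       sketch using Python's re module, assumed to apply successive asser-
       tions at the same point as PCRE does; the subject list is illustrative.

```python
import re

same_point = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
same_point_hits = [s for s in subjects if same_point.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]
```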

       Assertions can be nested in any combination. For example,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar" which in  turn
       is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern which matches "foo" preceded by three digits and any
       three characters that are not "999".

       Assertion subpatterns are not capturing subpatterns,  and  may  not  be
       repeated,  because  it  makes no sense to assert the same thing several
       times. If any kind of assertion contains capturing  subpatterns  within
       it,  these are counted for the purposes of numbering the capturing sub-
       patterns in the whole pattern.  However, substring capturing is carried
       out  only  for  positive assertions, because it does not make sense for
       negative assertions.


CONDITIONAL SUBPATTERNS

       It is possible to cause the matching process to obey a subpattern  con-
       ditionally  or to choose between two alternative subpatterns, depending
       on the  result  of  an  assertion,  or  whether  a  previous  capturing
       subpattern  matched  or not. The two possible forms of conditional sub-
       pattern are

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If the condition is satisfied, the yes-pattern is used;  otherwise  the
       no-pattern  (if  present)  is used. If there are more than two alterna-
       tives in the subpattern, a compile-time error occurs.

       There are three kinds of condition. If the text between the parentheses
       consists  of  a  sequence  of digits, the condition is satisfied if the
       capturing subpattern of that number has previously matched. The  number
       must  be  greater than zero. Consider the following pattern, which con-
       tains non-significant white space to make it more readable (assume  the
       PCRE_EXTENDED  option)  and  to  divide it into three parts for ease of
       discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The first part matches an optional opening  parenthesis,  and  if  that
       character is present, sets it as the first captured substring. The sec-
       ond part matches one or more characters that are not  parentheses.  The
       third part is a conditional subpattern that tests whether the first set
       of parentheses matched or not. If they did, that is, if the subject
       started with an opening parenthesis, the condition is true, and so the
       yes-pattern is executed and a closing parenthesis is required. Otherwise,
       since  no-pattern  is  not  present, the subpattern matches nothing. In
       other words,  this  pattern  matches  a  sequence  of  non-parentheses,
       optionally enclosed in parentheses.
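
       The three parts of that pattern can be traced by hand. The following
       is a minimal sketch in plain C of the same decisions, applied only at
       the start of a string. It does not call PCRE, and the function name
       is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written equivalent of  ( \( )?  [^()]+  (?(1) \) )  applied at the
   start of s. Returns the length of the match, or -1 if there is none.
   Illustration only: this does not use the PCRE library. */
static int match_optionally_parenthesized(const char *s)
{
    size_t i = 0;
    size_t run_start;
    int opened;

    assert(s != NULL);

    /* ( \( )?  -- group 1 is set only if an opening parenthesis is present */
    opened = (s[i] == '(');
    if (opened)
        i++;

    /* [^()]+  -- one or more characters that are not parentheses */
    run_start = i;
    while (s[i] != '\0' && s[i] != '(' && s[i] != ')')
        i++;
    if (i == run_start)
        return -1;

    /* (?(1) \) )  -- a closing parenthesis is required only if group 1 matched */
    if (opened) {
        if (s[i] != ')')
            return -1;
        i++;
    }
    return (int)i;
}
```

       A real matcher would also retry at later starting positions in the
       subject; the sketch deliberately leaves that out.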

       If the condition is the string (R), it is satisfied if a recursive call
       to the pattern or subpattern has been made. At "top level", the  condi-
       tion  is  false.   This  is  a  PCRE  extension. Recursive patterns are
       described in the next section.

       If the condition is not a sequence of digits or  (R),  it  must  be  an
       assertion.   This may be a positive or negative lookahead or lookbehind
       assertion. Consider  this  pattern,  again  containing  non-significant
       white space, and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The  condition  is  a  positive  lookahead  assertion  that  matches an
       optional sequence of non-letters followed by a letter. In other  words,
       it  tests  for the presence of at least one letter in the subject. If a
       letter is found, the subject is matched against the first  alternative;
       otherwise  it  is  matched  against  the  second.  This pattern matches
       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
       letters and dd are digits.


COMMENTS

       The sequence (?# marks the start of a comment which continues up to the
       next closing parenthesis. Nested parentheses  are  not  permitted.  The
       characters  that make up a comment play no part in the pattern matching
       at all.

       If the PCRE_EXTENDED option is set, an unescaped # character outside  a
       character class introduces a comment that continues up to the next new-
       line character in the pattern.


RECURSIVE PATTERNS

       Consider the problem of matching a string in parentheses, allowing  for
       unlimited  nested  parentheses.  Without the use of recursion, the best
       that can be done is to use a pattern that  matches  up  to  some  fixed
       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
       depth. Perl has provided an experimental facility that  allows  regular
       expressions to recurse (amongst other things). It does this by interpo-
       lating Perl code in the expression at run time, and the code can  refer
       to the expression itself. A Perl pattern to solve the parentheses prob-
       lem can be created like this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run time, and in this case
       refers  recursively to the pattern in which it appears. Obviously, PCRE
       cannot support the interpolation of Perl  code.  Instead,  it  supports
       some  special  syntax for recursion of the entire pattern, and also for
       individual subpattern recursion.

       The special item that consists of (? followed by a number greater  than
       zero and a closing parenthesis is a recursive call of the subpattern of
       the given number, provided that it occurs inside that  subpattern.  (If
       not,  it  is  a  "subroutine" call, which is described in the next sec-
       tion.) The special item (?R) is a recursive call of the entire  regular
       expression.

       For  example,  this  PCRE pattern solves the nested parentheses problem
       (assume the  PCRE_EXTENDED  option  is  set  so  that  white  space  is
       ignored):

         \( ( (?>[^()]+) | (?R) )* \)

       First it matches an opening parenthesis. Then it matches any number of
       substrings, each of which can be either a sequence of non-parentheses
       or a recursive match of the pattern itself (that is, a correctly
       parenthesized substring). Finally there is a closing parenthesis.
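
       The shape of the recursion may be easier to see as an ordinary
       recursive function. The following is a sketch in plain C of what the
       pattern does when applied at the start of a string. It does not call
       PCRE, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written equivalent of  \( ( (?>[^()]+) | (?R) )* \)  applied at the
   start of s. Returns the match length, or -1 if there is no match there.
   Illustration only: this does not use the PCRE library. */
static int match_nested_parens(const char *s)
{
    size_t i;

    assert(s != NULL);

    if (s[0] != '(')                      /* \( */
        return -1;
    i = 1;

    for (;;) {                            /* ( ... )* */
        if (s[i] != '\0' && s[i] != '(' && s[i] != ')') {
            /* (?>[^()]+): take the whole run and never give any of it back */
            while (s[i] != '\0' && s[i] != '(' && s[i] != ')')
                i++;
        } else if (s[i] == '(') {
            /* (?R): recursive match of the whole pattern */
            int inner = match_nested_parens(s + i);
            if (inner < 0)
                break;
            i += (size_t)inner;
        } else {
            break;                        /* neither alternative applies */
        }
    }

    if (s[i] != ')')                      /* \) */
        return -1;
    return (int)(i + 1);
}
```

       Each nested opening parenthesis costs one level of C recursion here,
       in the same way that each (?R) starts a fresh match of the pattern.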

       If this were part of a larger pattern, you would not  want  to  recurse
       the entire pattern, so instead you could use this:

         ( \( ( (?>[^()]+) | (?1) )* \) )

       We  have  put the pattern into parentheses, and caused the recursion to
       refer to them instead of the whole pattern. In a larger pattern,  keep-
       ing  track  of parenthesis numbers can be tricky. It may be more conve-
       nient to use named parentheses instead. For this, PCRE uses  (?P>name),
       which  is  an  extension  to the Python syntax that PCRE uses for named
       parentheses (Perl does not provide named parentheses). We could rewrite
       the above example as follows:

         (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

       This  particular example pattern contains nested unlimited repeats, and
       so the use of atomic grouping for matching strings  of  non-parentheses
       is  important  when  applying the pattern to strings that do not match.
       For example, when this pattern is applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it yields "no match" quickly. However, if atomic grouping is not  used,
       the  match  runs  for a very long time indeed because there are so many
       different ways the + and * repeats can carve up the  subject,  and  all
       have to be tested before failure can be reported.

       At the end of a match, the values set for any capturing subpatterns are
       those from the outermost level of the recursion at which the subpattern
       value  is  set.   If  you want to obtain intermediate values, a callout
       function can be used (see below and the pcrecallout documentation).  If
       the pattern above is matched against

         (ab(cd)ef)

       the  value  for  the  capturing  parentheses is "ef", which is the last
       value taken on at the top level. If additional parentheses  are  added,
       giving

         \( ( ( (?>[^()]+) | (?R) )* ) \)
            ^                        ^
            ^                        ^

       the  string  they  capture is "ab(cd)ef", the contents of the top level
       parentheses. If there are more than 15 capturing parentheses in a  pat-
       tern, PCRE has to obtain extra memory to store data during a recursion,
       which it does by using pcre_malloc, freeing  it  via  pcre_free  after-
       wards.  If  no  memory  can  be  obtained,  the  match  fails  with the
       PCRE_ERROR_NOMEMORY error.

       Do not confuse the (?R) item with the condition (R),  which  tests  for
       recursion.   Consider  this pattern, which matches text in angle brack-
       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
       brackets  (that is, when recursing), whereas any characters are permit-
       ted at the outer level.

         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional  subpattern,  with
       two  different  alternatives for the recursive and non-recursive cases.
       The (?R) item is the actual recursive call.


SUBPATTERNS AS SUBROUTINES

       If the syntax for a recursive subpattern reference (either by number or
       by  name)  is used outside the parentheses to which it refers, it oper-
       ates like a subroutine in a programming language.  An  earlier  example
       pointed out that the pattern

         (sens|respons)e and \1ibility

       matches  "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match "sense and responsibility" as well as the  other
       two  strings.  Such  references must, however, follow the subpattern to
       which they refer.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
       Perl  code to be obeyed in the middle of matching a regular expression.
       This makes it possible, amongst other things, to extract different sub-
       strings that match the same pair of parentheses when there is a repeti-
       tion.

       PCRE provides a similar feature, but of course it cannot obey arbitrary
       Perl code. The feature is called "callout". The caller of PCRE provides
       an external function by putting its entry point in the global  variable
       pcre_callout.   By default, this variable contains NULL, which disables
       all calling out.

       Within a regular expression, (?C) indicates the  points  at  which  the
       external  function  is  to be called. If you want to identify different
       callout points, you can put a number less than 256 after the letter  C.
       The  default  value is zero.  For example, this pattern has two callout
       points:

         (?C1)abc(?C2)def

       During matching, when PCRE reaches a callout point (and pcre_callout is
       set),  the  external function is called. It is provided with the number
       of the callout, and, optionally, one item of data  originally  supplied
       by  the  caller of pcre_exec(). The callout function may cause matching
       to backtrack, or to fail altogether.  A  complete  description  of  the
       interface  to the callout function is given in the pcrecallout documen-
       tation.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

       Certain  items  that may appear in regular expression patterns are more
       efficient than others. It is more efficient to use  a  character  class
       like  [aeiou]  than  a set of alternatives such as (a|e|i|o|u). In gen-
       eral, the simplest construction that provides the required behaviour is
       usually  the  most  efficient.  Jeffrey Friedl's book contains a lot of
       discussion about optimizing regular expressions for  efficient  perfor-
       mance.

       When  a  pattern  begins  with .* not in parentheses, or in parentheses
       that are not the subject of a backreference, and the PCRE_DOTALL option
       is  set, the pattern is implicitly anchored by PCRE, since it can match
       only at the start of a subject string. However, if PCRE_DOTALL  is  not
       set,  PCRE  cannot  make this optimization, because the . metacharacter
       does not then match a newline, and if the subject string contains  new-
       lines,  the  pattern may match from the character immediately following
       one of them instead of from the very start. For example, the pattern

         .*second

       matches the subject "first\nand second" (where \n stands for a  newline
       character),  with the match starting at the seventh character. In order
       to do this, PCRE has to retry the match starting after every newline in
       the subject.

       If  you  are using such a pattern with subject strings that do not con-
       tain newlines, the best performance is obtained by setting PCRE_DOTALL,
       or  starting  the pattern with ^.* to indicate explicit anchoring. That
       saves PCRE from having to scan along the subject looking for a  newline
       to restart at.

       Beware  of  patterns  that contain nested indefinite repeats. These can
       take a long time to run when applied to a string that does  not  match.
       Consider the pattern fragment

         (a+)*

       This can match "aaaa" in 16 different ways, and this number increases
       very rapidly as the string gets longer. (The * repeat can match 0, 1,
       2, 3, or 4 times, and for each of those cases other than 0 or 4, the
       + repeats can match different numbers of times.) When the remainder of
       the pattern is such that the entire match is going to fail, PCRE has
       in principle to try every possible variation, and this can take an
       extremely long time.
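
       The growth can be checked by brute-force counting. The sketch below
       assumes that a "way" means a choice of how many leading characters
       the whole group consumes, together with a split of those characters
       among successive a+ iterations. That counting convention and the
       function names are assumptions of the example, and nothing here calls
       PCRE.

```c
#include <assert.h>

/* Number of ways to cut exactly k characters into an ordered sequence of
   non-empty pieces, one piece per a+ iteration (zero pieces when k is 0). */
static unsigned long splits_of(int k)
{
    unsigned long total = 0;
    int first;

    assert(k >= 0);
    if (k == 0)
        return 1;                          /* the * repeat matched 0 times */
    for (first = 1; first <= k; first++)   /* length taken by the first a+ */
        total += splits_of(k - first);
    return total;
}

/* Ways (a+)* can match at the start of a run of n 'a' characters,
   counting every matched prefix length from 0 to n. */
static unsigned long ways_to_match(int n)
{
    unsigned long total = 0;
    int k;

    assert(n >= 0);
    for (k = 0; k <= n; k++)
        total += splits_of(k);
    return total;
}
```

       Under that assumption ways_to_match(n) doubles each time n grows by
       one, so a subject of 20 characters already allows more than a million
       variations.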

       An optimization catches some of the more simple cases such as

         (a+)*b

       where  a  literal  character  follows. Before embarking on the standard
       matching procedure, PCRE checks that there is a "b" later in  the  sub-
       ject  string, and if there is not, it fails the match immediately. How-
       ever, when there is no following literal this  optimization  cannot  be
       used. You can see the difference by comparing the behaviour of

         (a+)*\d

       with the pattern above. The pattern above, (a+)*b, gives a failure
       almost instantly when applied to a whole line of "a" characters,
       whereas (a+)*\d takes an appreciable time with strings longer than
       about 20 characters.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

SYNOPSIS OF POSIX API
       #include <pcreposix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);


DESCRIPTION

       This  set  of  functions provides a POSIX-style API to the PCRE regular
       expression package. See the pcreapi documentation for a description  of
       the native API, which contains additional functionality.

       The functions described here are just wrapper functions that ultimately
       call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
       pcreposix.h  header  file,  and  on  Unix systems the library itself is
       called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the
       command  for  linking an application which uses them. Because the POSIX
       functions call the native ones, it is also necessary to add -lpcre.

       I have implemented only those option bits that can be reasonably mapped
       to  PCRE  native  options.  In  addition,  the options REG_EXTENDED and
       REG_NOSUB are defined with the value zero. They  have  no  effect,  but
       since  programs that are written to the POSIX interface often use them,
       this makes it easier to slot in PCRE as a  replacement  library.  Other
       POSIX options are not even defined.

       When  PCRE  is  called  via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of  the  regular  expres-
       sions  themselves  are  still  those of Perl, subject to the setting of
       various PCRE options, as described below. "POSIX-like in  style"  means
       that  the  API  approximates  to  the POSIX definition; it is not fully
       POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
       even less compatible.

       The  header for these functions is supplied as pcreposix.h to avoid any
       potential clash with other POSIX  libraries.  It  can,  of  course,  be
       renamed or aliased as regex.h, which is the "correct" name. It provides
       two structure types, regex_t for  compiled  internal  forms,  and  reg-
       match_t  for  returning  captured substrings. It also defines some con-
       stants whose names start  with  "REG_";  these  are  used  for  setting
       options and identifying error codes.


COMPILING A PATTERN

       The  function regcomp() is called to compile a pattern into an internal
       form. The pattern is a C string terminated by a  binary  zero,  and  is
       passed  in  the  argument  pattern. The preg argument is a pointer to a
       regex_t structure which is used as a base for storing information about
       the compiled expression.

       The argument cflags is either zero, or contains one or more of the bits
       defined by the following macros:

         REG_ICASE

       The PCRE_CASELESS option is set when the expression is passed for  com-
       pilation to the native function.

         REG_NEWLINE

       The PCRE_MULTILINE option is set when the expression is passed for com-
       pilation to the native function. Note that  this  does  not  mimic  the
       defined POSIX behaviour for REG_NEWLINE (see the following section).

       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by . (they aren't) or
       by a negative class such as [^a] (they are).

       The  yield of regcomp() is zero on success, and non-zero otherwise. The
       preg structure is filled in on success, and one member of the structure
       is  public: re_nsub contains the number of capturing subpatterns in the
       regular expression. Various error codes are defined in the header file.


MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views of
       things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
       then  PCRE was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching  newline  characters  in
       PCRE:

                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLARENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE

       This is the equivalent table for POSIX:

                                 Default   Change with

         . matches newline          yes      REG_NEWLINE
         newline matches [^a]       yes      REG_NEWLINE
         $ matches \n at end        no       REG_NEWLINE
         $ matches \n in middle     no       REG_NEWLINE
         ^ matches \n in middle     no       REG_NEWLINE

       PCRE's behaviour is the same as Perl's, except that there is no equiva-
       lent for PCRE_DOLLARENDONLY in Perl. In both PCRE and Perl, there is no
       way to stop newline from matching [^a].

       The   default  POSIX  newline  handling  can  be  obtained  by  setting
       PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way  to  make  PCRE
       behave exactly as for the REG_NEWLINE action.


MATCHING A PATTERN

       The  function  regexec() is called to match a pre-compiled pattern preg
       against a given string, which is terminated by a zero byte, subject  to
       the options in eflags. These can be:

         REG_NOTBOL

       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
       function.

         REG_NOTEOL

       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
       function.

       The  portion of the string that was matched, and also any captured sub-
       strings, are returned via the pmatch argument, which points to an array
       of  nmatch  structures of type regmatch_t, containing the members rm_so
       and rm_eo. These contain the offset to the first character of each sub-
       string and the offset to the first character after the end of each sub-
       string, respectively. The 0th element of  the  vector  relates  to  the
       entire  portion  of string that was matched; subsequent elements relate
       to the capturing subpatterns of the regular expression. Unused  entries
       in the array have both structure members set to -1.

       A  successful  match  yields  a  zero  return;  various error codes are
       defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
       failure code.
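
       The offsets in pmatch can be turned into substrings as sketched
       below. As an assumption of the example, it includes the system
       regex.h so that it builds without PCRE; with PCRE the include would
       be pcreposix.h. The helper name and the fixed vector size of 10 are
       hypothetical choices.

```c
#include <assert.h>
#include <regex.h>   /* stand-in for this sketch; with PCRE use <pcreposix.h> */
#include <string.h>

/* Matches pattern against subject and copies the text captured by
   subpattern `group` (0 = the whole match) into out, truncating if needed.
   Returns 1 if that substring was set, 0 otherwise (compile failure,
   no match, or a subpattern that took no part in the match). */
static int capture_group(const char *pattern, const char *subject,
                         size_t group, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pmatch[10];
    size_t len;
    int rc;

    assert(pattern != NULL && subject != NULL);
    assert(out != NULL && out_size > 0 && group < 10);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    rc = regexec(&re, subject, 10, pmatch, 0);
    regfree(&re);
    if (rc != 0)
        return 0;                      /* REG_NOMATCH or another error */

    if (pmatch[group].rm_so == -1)
        return 0;                      /* unused entry: both members are -1 */

    /* rm_so is the first character; rm_eo is one past the last character */
    len = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, subject + pmatch[group].rm_so, len);
    out[len] = '\0';
    return 1;
}
```

       For the pattern (cat)|(dog) and the subject "the dog sat", elements 0
       and 2 both describe "dog", while element 1 is unused.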


ERROR MESSAGES

       The regerror() function maps a non-zero error code from either
       regcomp() or regexec() to a printable message. If preg is not NULL,
       the error should have arisen from the use of that structure. A message
       terminated by a binary zero is placed in errbuf. The length of the
       message, including the zero, is limited to errbuf_size. The yield of
       the function is the size of buffer needed to hold the whole message.
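
       Because the yield is the size needed, the function can be called
       twice: once to learn the size and once to fetch the message. The
       sketch below assumes the system regex.h so that it builds without
       PCRE; with PCRE the include would be pcreposix.h. The helper name is
       hypothetical.

```c
#include <assert.h>
#include <regex.h>   /* stand-in for this sketch; with PCRE use <pcreposix.h> */
#include <stdlib.h>

/* Returns a malloc'd message describing why pattern fails to compile, or
   NULL if it compiles (or if memory runs out). The caller frees it. */
static char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t needed;
    char *msg;
    int rc;

    assert(pattern != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* First call: a zero-sized buffer, so only the required size comes
       back. That size includes the terminating zero. */
    needed = regerror(rc, &re, NULL, 0);
    msg = malloc(needed);
    if (msg == NULL)
        return NULL;

    /* Second call: the buffer is now large enough for the whole message. */
    regerror(rc, &re, msg, needed);
    return msg;
}
```

       A fixed-size buffer also works, at the cost of a possibly truncated
       message.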


STORAGE

       Compiling a regular expression causes memory to be allocated and  asso-
       ciated  with  the preg structure. The function regfree() frees all such
       memory, after which preg may no longer be used as  a  compiled  expres-
       sion.


AUTHOR

       Philip Hazel <ph10@cam.ac.uk>
       University Computing Service,
       Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)


NAME
       PCRE - Perl-compatible regular expressions

PCRE SAMPLE PROGRAM

       A simple, complete demonstration program, to get you started with using
       PCRE, is supplied in the file pcredemo.c in the PCRE distribution.

       The program compiles the regular expression that is its first argument,
       and  matches  it  against the subject string in its second argument. No
       PCRE options are set, and default character tables are used. If  match-
       ing  succeeds,  the  program  outputs  the  portion of the subject that
       matched, together with the contents of any captured substrings.

       If the -g option is given on the command line, the program then goes on
       to check for further matches of the same regular expression in the same
       subject string. The logic is a little bit tricky because of the  possi-
       bility  of  matching an empty string. Comments in the code explain what
       is going on.

       On a Unix system that has PCRE installed in /usr/local, you can compile
       the demonstration program using a command like this:

         gcc -o pcredemo pcredemo.c -I/usr/local/include \
             -L/usr/local/lib -lpcre

       Then you can run simple tests like this:

         ./pcredemo 'cat|dog' 'the cat sat on the mat'
         ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

       Note  that  there  is  a  much  more comprehensive test program, called
       pcretest, which supports  many  more  facilities  for  testing  regular
       expressions and the PCRE library. The pcredemo program is provided as a
       simple coding example.

       On some operating systems (e.g. Solaris) you may get an error like this
       when you try to run pcredemo:

         ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory

       This is caused by the way shared library support works  on  those  sys-
       tems. You need to add

         -R/usr/local/lib

       to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------