pcre.txt 148 KB
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 900 901 902 903 904 905 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 925 926 927 928 929 930 931 932 933 934 935 936 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 962 963 964 965 966 967 968 969 970 971 972 973 974 975 976 977 978 979 980 981 982 983 984 985 986 987 988 989 990 991 992 993 994 995 996 997 998 999 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015 1016 1017 1018 1019 1020 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031 1032 1033 1034 1035 1036 1037 1038 1039 1040 1041 1042 1043 1044 1045 1046 1047 1048 1049 1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 1067 1068 1069 1070 1071 1072 1073 1074 1075 1076 1077 1078 1079 1080 1081 1082 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319 1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337 1338 1339 1340 1341 1342 1343 1344 1345 1346 1347 1348 1349 1350 1351 1352 1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 1371 1372 1373 1374 1375 1376 1377 1378 1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390 1391 1392 1393 1394 1395 1396 1397 1398 1399 1400 1401 1402 1403 1404 1405 1406 1407 1408 1409 1410 1411 1412 1413 1414 1415 1416 1417 1418 1419 1420 1421 1422 1423 1424 1425 1426 1427 1428 1429 1430 1431 1432 1433 1434 1435 1436 1437 1438 1439 1440 1441 1442 1443 1444 1445 1446 1447 1448 1449 1450 1451 1452 1453 1454 1455 1456 1457 1458 1459 1460 1461 1462 1463 1464 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1477 1478 1479 1480 1481 1482 1483 1484 1485 1486 1487 1488 1489 1490 1491 1492 1493 1494 1495 1496 1497 1498 1499 1500 1501 1502 1503 1504 1505 1506 1507 1508 1509 1510 1511 1512 1513 1514 1515 1516 1517 1518 1519 1520 1521 1522 1523 1524 1525 1526 1527 1528 1529 1530 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 1541 1542 1543 1544 1545 1546 1547 1548 1549 1550 1551 1552 1553 1554 1555 1556 1557 1558 1559 1560 1561 1562 1563 1564 1565 1566 1567 1568 1569 1570 1571 1572 1573 1574 1575 1576 1577 1578 1579 1580 1581 1582 1583 1584 1585 1586 1587 1588 1589 1590 1591 1592 1593 1594 1595 1596 1597 1598 1599 1600 1601 1602 1603 1604 1605 1606 1607 1608 1609 1610 1611 1612 1613 1614 1615 1616 1617 1618 1619 1620 1621 1622 1623 1624 1625 1626 1627 1628 1629 1630 1631 1632 1633 1634 1635 1636 1637 1638 1639 1640 1641 1642 1643 1644 1645 1646 1647 1648 1649 1650 1651 1652 1653 1654 1655 1656 1657 1658 1659 1660 1661 1662 1663 1664 1665 1666 1667 1668 1669 1670 1671 1672 1673 1674 1675 1676 1677 1678 1679 1680 1681 1682 1683 1684 1685 1686 1687 1688 1689 1690 1691 1692 1693 1694 1695 1696 1697 1698 1699 1700 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 1711 1712 1713 1714 1715 1716 1717 1718 1719 1720 1721 1722 1723 1724 1725 1726 1727 1728 1729 1730 1731 1732 1733 1734 1735 1736 1737 1738 1739 1740 1741 1742 1743 1744 1745 1746 1747 1748 1749 1750 1751 1752 1753 1754 1755 1756 1757 1758 1759 1760 1761 1762 1763 1764 1765 1766 1767 1768 1769 1770 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 1881 1882 1883 1884 1885 1886 1887 1888 1889 1890 1891 1892 1893 1894 1895 1896 1897 1898 1899 1900 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039 2040 2041 2042 2043 2044 2045 2046 2047 2048 2049 2050 2051 2052 2053 2054 2055 2056 2057 2058 2059 2060 2061 2062 2063 2064 2065 2066 2067 2068 2069 2070 2071 2072 2073 2074 2075 2076 2077 2078 2079 2080 2081 2082 2083 2084 2085 2086 2087 2088 2089 2090 2091 2092 2093 2094 2095 2096 2097 2098 2099 2100 2101 2102 2103 2104 2105 2106 2107 2108 2109 2110 2111 2112 2113 2114 2115 2116 2117 2118 2119 2120 2121 2122 2123 2124 2125 2126 2127 2128 2129 2130 2131 2132 2133 2134 2135 2136 2137 2138 2139 2140 2141 2142 2143 2144 2145 2146 2147 2148 2149 2150 2151 2152 2153 2154 2155 2156 2157 2158 2159 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 2180 2181 2182 2183 2184 2185 2186 2187 2188 2189 2190 2191 2192 2193 2194 2195 2196 2197 2198 2199 2200 2201 2202 2203 2204 2205 2206 2207 2208 2209 2210 2211 2212 2213 2214 2215 2216 2217 2218 2219 2220 2221 2222 2223 2224 2225 2226 2227 2228 2229 2230 2231 2232 2233 2234 2235 2236 2237 2238 2239 2240 2241 2242 2243 2244 2245 2246 2247 2248 2249 2250 2251 2252 2253 2254 2255 2256 2257 2258 2259 2260 2261 2262 2263 2264 2265 2266 2267 2268 2269 2270 2271 2272 2273 2274 2275 2276 2277 2278 2279 2280 2281 2282 2283 2284 2285 2286 2287 2288 2289 2290 2291 2292 2293 2294 2295 2296 2297 2298 2299 2300 2301 2302 2303 2304 2305 2306 2307 2308 2309 2310 2311 2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325 2326 2327 2328 2329 2330 2331 2332 2333 2334 2335 2336 2337 2338 2339 2340 2341 2342 2343 2344 2345 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355 2356 2357 2358 2359 2360 2361 2362 2363 2364 2365 2366 2367 2368 2369 2370 2371 2372 2373 2374 2375 2376 2377 2378 2379 2380 2381 2382 2383 2384 2385 2386 2387 2388 2389 2390 2391 2392 2393 2394 2395 2396 2397 2398 2399 2400 2401 2402 2403 2404 2405 2406 2407 2408 2409 2410 2411 2412 2413 2414 2415 2416 2417 2418 2419 2420 2421 2422 2423 2424 2425 2426 2427 2428 2429 2430 2431 2432 2433 2434 2435 2436 2437 2438 2439 2440 2441 2442 2443 2444 2445 2446 2447 2448 2449 2450 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 2461 2462 2463 2464 2465 2466 2467 2468 2469 2470 2471 2472 2473 2474 2475 2476 2477 2478 2479 2480 2481 2482 2483 2484 2485 2486 2487 2488 2489 2490 2491 2492 2493 2494 2495 2496 2497 2498 2499 2500 2501 2502 2503 2504 2505 2506 2507 2508 2509 2510 2511 2512 2513 2514 2515 2516 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 2527 2528 2529 2530 2531 2532 2533 2534 2535 2536 2537 2538 2539 2540 2541 2542 2543 2544 2545 2546 2547 2548 2549 2550 2551 2552 2553 2554 2555 2556 2557 2558 2559 2560 2561 2562 2563 2564 2565 2566 2567 2568 2569 2570 2571 2572 2573 2574 2575 2576 2577 2578 2579 2580 2581 2582 2583 2584 2585 2586 2587 2588 2589 2590 2591 2592 2593 2594 2595 2596 2597 2598 2599 2600 2601 2602 2603 2604 2605 2606 2607 2608 2609 2610 2611 2612 2613 2614 2615 2616 2617 2618 2619 2620 2621 2622 2623 2624 2625 2626 2627 2628 2629 2630 2631 2632 2633 2634 2635 2636 2637 2638 2639 2640 2641 2642 2643 2644 2645 2646 2647 2648 2649 2650 2651 2652 2653 2654 2655 2656 2657 2658 2659 2660 2661 2662 2663 2664 2665 2666 2667 2668 2669 2670 2671 2672 2673 2674 2675 2676 2677 2678 2679 2680 2681 2682 2683 2684 2685 2686 2687 2688 2689 2690 2691 2692 2693 2694 2695 2696 2697 2698 2699 2700 2701 2702 2703 2704 2705 2706 2707 2708 2709 2710 2711 2712 2713 2714 2715 2716 2717 2718 2719 2720 2721 2722 2723 2724 2725 2726 2727 2728 2729 2730 2731 2732 2733 2734 2735 2736 2737 2738 2739 2740 2741 2742 2743 2744 2745 2746 2747 2748 2749 2750 2751 2752 2753 2754 2755 2756 2757 2758 2759 2760 2761 2762 2763 2764 2765 2766 2767 2768 2769 2770 2771 2772 2773 2774 2775 2776 2777 2778 2779 2780 2781 2782 2783 2784 2785 2786 2787 2788 2789 2790 2791 2792 2793 2794 2795 2796 2797 2798 2799 2800 2801 2802 2803 2804 2805 2806 2807 2808 2809 2810 2811 2812 2813 2814 2815 2816 2817 2818 2819 2820 2821 2822 2823 2824 2825 2826 2827 2828 2829 2830 2831 2832 2833 2834 2835 2836 2837 2838 2839 2840 2841 2842 2843 2844 2845 2846 2847 2848 2849 2850 2851 2852 2853 2854 2855 2856 2857 2858 2859 2860 2861 2862 2863 2864 2865 2866 2867 2868 2869 2870 2871 2872 2873 2874 2875 2876 2877 2878 2879 2880 2881 2882 2883 2884 2885 2886 2887 2888 2889 2890 2891 2892 2893 2894 2895 2896 2897 2898 2899 2900 2901 2902 2903 2904 2905 2906 2907 2908 2909 2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 2921 2922 2923 2924 2925 2926 2927 2928 2929 2930 2931 2932 2933 2934 2935 2936 2937 2938 2939 2940 2941 2942 2943 2944 2945 2946 2947 2948 2949 2950 2951 2952 2953 2954 2955 2956 2957 2958 2959 2960 2961 2962 2963 2964 2965 2966 2967 2968 2969 2970 2971 2972 2973 2974 2975 2976 2977 2978 2979 2980 2981 2982 2983 2984 2985 2986 2987 2988 2989 2990 2991 2992 2993 2994 2995 2996 2997 2998 2999 3000 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 3011 3012 3013 3014 3015 3016 3017 3018 3019 3020 3021 3022 3023 3024 3025 3026 3027 3028 3029 3030 3031 3032 3033 3034 3035 3036 3037 3038 3039 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 3050 3051 3052 3053 3054 3055 3056 3057 3058 3059 3060 3061 3062 3063 3064 3065 3066 3067 3068 3069 3070 3071 3072 3073 3074 3075 3076 3077 3078 3079 3080 3081 3082 3083 3084 3085 3086 3087 3088 3089 3090 3091 3092 3093 3094 3095 3096 3097 3098 3099 3100 3101 3102 3103 3104 3105 3106 3107 3108 3109 3110 3111 3112 3113 3114 3115 3116 3117 3118 3119 3120 3121 3122 3123 3124 3125 3126 3127 3128 3129 3130 3131 3132 3133 3134 3135 3136 3137 3138 3139 3140 3141 3142 3143 3144 3145 3146 3147 3148 3149 3150 3151 3152 3153 3154 3155 3156 3157 3158 3159 3160 3161 3162 3163 3164 3165 3166 3167 3168 3169
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

DESCRIPTION

       The  PCRE  library is a set of functions that implement regular expres-
       sion pattern matching using the same syntax and semantics as Perl, with
       just  a  few  differences.  The current implementation of PCRE (release
       4.x) corresponds approximately with Perl  5.8,  including  support  for
       UTF-8  encoded  strings.   However,  this  support has to be explicitly
       enabled; it is not the default.

       PCRE is written in C and released as a C library. However, a number  of
       people  have  written  wrappers  and interfaces of various kinds. A C++
       class is included in these contributions, which can  be  found  in  the
       Contrib directory at the primary FTP site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details  of  exactly which Perl regular expression features are and are
       not supported by PCRE are given in separate documents. See the pcrepat-
       tern and pcrecompat pages.

       Some  features  of  PCRE can be included, excluded, or changed when the
       library is built. The pcre_config() function makes it  possible  for  a
       client  to  discover  which features are available. Documentation about
       building PCRE for various operating systems can be found in the  README
       file in the source distribution.


USER DOCUMENTATION

       The user documentation for PCRE has been split up into a number of dif-
       ferent sections. In the "man" format, each of these is a separate  "man
       page".  In  the  HTML  format, each is a separate page, linked from the
       index page. In the plain text format, all  the  sections  are  concate-
       nated, for ease of searching. The sections are as follows:

         pcre              this document
         pcreapi           details of PCRE's native API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcregrep          description of the pcregrep command
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible API
         pcresample        discussion of the sample program
         pcretest          the pcretest testing command

       In  addition,  in the "man" and HTML formats, there is a short page for
       each library function, listing its arguments and results.


LIMITATIONS

       There are some size limitations in PCRE but it is hoped that they  will
       never in practice be relevant.

       The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
       is compiled with the default internal linkage size of 2. If you want to
       process  regular  expressions  that are truly enormous, you can compile
       PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
       the  source  distribution and the pcrebuild documentation for details).
       If these cases the limit is substantially larger.  However,  the  speed
       of execution will be slower.

       All values in repeating quantifiers must be less than 65536.  The maxi-
       mum number of capturing subpatterns is 65535.

       There is no limit to the number of non-capturing subpatterns,  but  the
       maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,
       including capturing subpatterns, assertions, and other types of subpat-
       tern, is 200.

       The  maximum  length of a subject string is the largest positive number
       that an integer variable can hold. However, PCRE uses recursion to han-
       dle  subpatterns  and indefinite repetition. This means that the avail-
       able stack space may limit the size of a subject  string  that  can  be
       processed by certain patterns.


UTF-8 SUPPORT

       Starting  at  release  3.3,  PCRE  has  had  some support for character
       strings encoded in the UTF-8 format. For  release  4.0  this  has  been
       greatly extended to cover most common requirements.

       In  order  process  UTF-8 strings, you must build PCRE to include UTF-8
       support in the code, and, in addition,  you  must  call  pcre_compile()
       with  the PCRE_UTF8 option flag. When you do this, both the pattern and
       any subject strings that are matched against it are  treated  as  UTF-8
       strings instead of just strings of bytes.

       If  you compile PCRE with UTF-8 support, but do not use it at run time,
       the library will be a bit bigger, but the additional run time  overhead
       is  limited  to testing the PCRE_UTF8 flag in several places, so should
       not be very large.

       The following comments apply when PCRE is running in UTF-8 mode:

       1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
       subjects  are  checked for validity on entry to the relevant functions.
       If an invalid UTF-8 string is passed, an error return is given. In some
       situations,  you  may  already  know  that  your strings are valid, and
       therefore want to skip these checks in order to improve performance. If
       you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
       PCRE assumes that the pattern or subject  it  is  given  (respectively)
       contains  only valid UTF-8 codes. In this case, it does not diagnose an
       invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
       PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
       crash.

       2. In a pattern, the escape sequence \x{...}, where the contents of the
       braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8
       character whose code number is the given hexadecimal number, for  exam-
       ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,
       the item is not recognized.  This escape sequence can be used either as
       a literal, or within a character class.

       3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte
       UTF-8 character if the value is greater than 127.

       4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
       vidual bytes, for example: \x{100}{3}.

       5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a
       single byte.

       6. The escape sequence \C can be used to match a single byte  in  UTF-8
       mode, but its use can lead to some strange effects.

       7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
       test characters of any code value, but the characters that PCRE  recog-
       nizes  as  digits,  spaces,  or  word characters remain the same set as
       before, all with values less than 256.

       8. Case-insensitive matching applies only to  characters  whose  values
       are  less  than  256.  PCRE  does  not support the notion of "case" for
       higher-valued characters.

       9. PCRE does not support the use of Unicode tables  and  properties  or
       the Perl escapes \p, \P, and \X.


AUTHOR

       Philip Hazel <ph10@cam.ac.uk>
       University Computing Service,
       Cambridge CB2 3QG, England.
       Phone: +44 1223 334714

Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE BUILD-TIME OPTIONS

       This  document  describes  the  optional  features  of PCRE that can be
       selected when the library is compiled. They are all selected, or  dese-
       lected,  by  providing  options  to  the  configure script which is run
       before the make command. The complete list  of  options  for  configure
       (which  includes the standard ones such as the selection of the instal-
       lation directory) can be obtained by running

         ./configure --help

       The following sections describe certain options whose names begin  with
       --enable  or  --disable. These settings specify changes to the defaults
       for the configure command. Because of the  way  that  configure  works,
       --enable  and  --disable  always  come  in  pairs, so the complementary
       option always exists as well, but as it specifies the  default,  it  is
       not described.


UTF-8 SUPPORT

       To build PCRE with support for UTF-8 character strings, add

         --enable-utf8

       to  the  configure  command.  Of  itself, this does not make PCRE treat
       strings as UTF-8. As well as compiling PCRE with this option, you  also
       have  have to set the PCRE_UTF8 option when you call the pcre_compile()
       function.


CODE VALUE OF NEWLINE

       By default, PCRE treats character 10 (linefeed) as the newline  charac-
       ter. This is the normal newline character on Unix-like systems. You can
       compile PCRE to use character 13 (carriage return) instead by adding

         --enable-newline-is-cr

       to the configure command. For completeness there is  also  a  --enable-
       newline-is-lf  option,  which explicitly specifies linefeed as the new-
       line character.


BUILDING SHARED AND STATIC LIBRARIES

       The PCRE building process uses libtool to build both shared and  static
       Unix  libraries by default. You can suppress one of these by adding one
       of

         --disable-shared
         --disable-static

       to the configure command, as required.


POSIX MALLOC USAGE

       When PCRE is called through the  POSIX  interface  (see  the  pcreposix
       documentation),  additional working storage is required for holding the
       pointers to capturing substrings because PCRE requires  three  integers
       per  substring,  whereas  the POSIX interface provides only two. If the
       number of expected substrings is small, the wrapper function uses space
       on the stack, because this is faster than using malloc() for each call.
       The default threshold above which the stack is no longer used is 10; it
       can be changed by adding a setting such as

         --with-posix-malloc-threshold=20

       to the configure command.


LIMITING PCRE RESOURCE USAGE

       Internally,  PCRE  has a function called match() which it calls repeat-
       edly (possibly recursively) when performing a  matching  operation.  By
       limiting  the  number of times this function may be called, a limit can
       be placed on the resources used by a single call  to  pcre_exec().  The
       limit  can be changed at run time, as described in the pcreapi documen-
       tation. The default is 10 million, but this can be changed by adding  a
       setting such as

         --with-match-limit=500000

       to the configure command.


HANDLING VERY LARGE PATTERNS

       Within  a  compiled  pattern,  offset values are used to point from one
       part to another (for example, from an opening parenthesis to an  alter-
       nation  metacharacter).  By  default two-byte values are used for these
       offsets, leading to a maximum size for a  compiled  pattern  of  around
       64K.  This  is sufficient to handle all but the most gigantic patterns.
       Nevertheless, some people do want to process enormous patterns,  so  it
       is  possible  to compile PCRE to use three-byte or four-byte offsets by
       adding a setting such as

         --with-link-size=3

       to the configure command. The value given must be 2,  3,  or  4.  Using
       longer  offsets slows down the operation of PCRE because it has to load
       additional bytes when handling them.

       If you build PCRE with an increased link size, test 2 (and  test  5  if
       you  are using UTF-8) will fail. Part of the output of these tests is a
       representation of the compiled pattern, and this changes with the  link
       size.


AVOIDING EXCESSIVE STACK USAGE

       PCRE  implements  backtracking while matching by making recursive calls
       to an internal function called match(). In environments where the  size
       of the stack is limited, this can severely limit PCRE's operation. (The
       Unix environment does not usually suffer from this problem.) An  alter-
       native  approach  that  uses  memory  from  the  heap to remember data,
       instead of using recursive function calls, has been implemented to work
       round  this  problem. If you want to build a version of PCRE that works
       this way, add

         --disable-stack-for-recursion

       to the configure command. With this configuration, PCRE  will  use  the
       pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory
       management functions. Separate functions are provided because the usage
       is very predictable: the block sizes requested are always the same, and
       the blocks are always freed in reverse order. A calling  program  might
       be  able  to implement optimized functions that perform better than the
       standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
       slowly when built in this way.


USING EBCDIC CODE

       PCRE  assumes  by  default that it will run in an environment where the
       character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE
       can, however, be compiled to run in an EBCDIC environment by adding

         --enable-ebcdic

       to the configure command.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

SYNOPSIS OF PCRE API

       #include <pcre.h>

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       void pcre_free_substring(const char *stringptr);

       void pcre_free_substring_list(const char **stringptr);

       const unsigned char *pcre_maketables(void);

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

       int pcre_config(int what, void *where);

       char *pcre_version(void);

       void *(*pcre_malloc)(size_t);

       void (*pcre_free)(void *);

       void *(*pcre_stack_malloc)(size_t);

       void (*pcre_stack_free)(void *);

       int (*pcre_callout)(pcre_callout_block *);


PCRE API

       PCRE has its own native API, which is described in this document. There
       is also a set of wrapper functions that correspond to the POSIX regular
       expression API.  These are described in the pcreposix documentation.

       The  native  API  function  prototypes  are  defined in the header file
       pcre.h, and on Unix systems the library itself is called libpcre.a,  so
       can be accessed by adding -lpcre to the command for linking an applica-
       tion which calls it. The header file defines the macros PCRE_MAJOR  and
       PCRE_MINOR  to  contain  the  major  and  minor release numbers for the
       library. Applications can use these to include  support  for  different
       releases.

       The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used
       for compiling and matching regular expressions. A sample  program  that
       demonstrates  the simplest way of using them is given in the file pcre-
       demo.c. The pcresample documentation describes how to run it.

       There are convenience functions for extracting captured substrings from
       a matched subject string. They are:

         pcre_copy_substring()
         pcre_copy_named_substring()
         pcre_get_substring()
         pcre_get_named_substring()
         pcre_get_substring_list()

       pcre_free_substring() and pcre_free_substring_list() are also provided,
       to free the memory used for extracted strings.

       The function pcre_maketables() is used (optionally) to build a  set  of
       character tables in the current locale for passing to pcre_compile().

       The  function  pcre_fullinfo()  is used to find out information about a
       compiled pattern; pcre_info() is an obsolete version which returns only
       some  of  the available information, but is retained for backwards com-
       patibility.  The function pcre_version() returns a pointer to a  string
       containing the version of PCRE and its date of release.

       The  global  variables  pcre_malloc and pcre_free initially contain the
       entry points of the standard  malloc()  and  free()  functions  respec-
       tively. PCRE calls the memory management functions via these variables,
       so a calling program can replace them if it  wishes  to  intercept  the
       calls. This should be done before calling any PCRE functions.

       The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
       indirections to memory management functions.  These  special  functions
       are  used  only  when  PCRE is compiled to use the heap for remembering
       data, instead of recursive function calls. This is a  non-standard  way
       of  building  PCRE,  for  use in environments that have limited stacks.
       Because of the greater use of memory management, it runs  more  slowly.
       Separate  functions  are provided so that special-purpose external code
       can be used for this case. When used, these functions are always called
       in  a  stack-like  manner  (last obtained, first freed), and always for
       memory blocks of the same size.

       The global variable pcre_callout initially contains NULL. It can be set
       by  the  caller  to  a "callout" function, which PCRE will then call at
       specified points during a matching operation. Details are given in  the
       pcrecallout documentation.


MULTITHREADING

       The  PCRE  functions  can be used in multi-threading applications, with
       the  proviso  that  the  memory  management  functions  pointed  to  by
       pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
       callout function pointed to by pcre_callout, are shared by all threads.

       The  compiled form of a regular expression is not altered during match-
       ing, so the same compiled pattern can safely be used by several threads
       at once.


CHECKING BUILD-TIME OPTIONS

       int pcre_config(int what, void *where);

       The  function pcre_config() makes it possible for a PCRE client to dis-
       cover which optional features have been compiled into the PCRE library.
       The  pcrebuild documentation has more details about these optional fea-
       tures.

       The first argument for pcre_config() is an  integer,  specifying  which
       information is required; the second argument is a pointer to a variable
       into which the information is  placed.  The  following  information  is
       available:

         PCRE_CONFIG_UTF8

       The  output is an integer that is set to one if UTF-8 support is avail-
       able; otherwise it is set to zero.

         PCRE_CONFIG_NEWLINE

       The output is an integer that is set to the value of the code  that  is
       used  for the newline character. It is either linefeed (10) or carriage
       return (13), and should normally be the  standard  character  for  your
       operating system.

         PCRE_CONFIG_LINK_SIZE

       The  output  is  an  integer that contains the number of bytes used for
       internal linkage in compiled regular expressions. The value is 2, 3, or
       4.  Larger  values  allow larger regular expressions to be compiled, at
       the expense of slower matching. The default value of  2  is  sufficient
       for  all  but  the  most massive patterns, since it allows the compiled
       pattern to be up to 64K in size.

         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

       The output is an integer that contains the threshold  above  which  the
       POSIX  interface  uses malloc() for output vectors. Further details are
       given in the pcreposix documentation.

         PCRE_CONFIG_MATCH_LIMIT

       The output is an integer that gives the default limit for the number of
       internal  matching  function  calls in a pcre_exec() execution. Further
       details are given with pcre_exec() below.

         PCRE_CONFIG_STACKRECURSE

       The output is an integer that is set to one if  internal  recursion  is
       implemented  by recursive function calls that use the stack to remember
       their state. This is the usual way that PCRE is compiled. The output is
       zero  if PCRE was compiled to use blocks of data on the heap instead of
       recursive  function  calls.  In  this   case,   pcre_stack_malloc   and
       pcre_stack_free  are  called  to manage memory blocks on the heap, thus
       avoiding the use of the stack.


COMPILING A PATTERN

       pcre *pcre_compile(const char *pattern, int options,
            const char **errptr, int *erroffset,
            const unsigned char *tableptr);


       The function pcre_compile() is called to  compile  a  pattern  into  an
       internal  form.  The pattern is a C string terminated by a binary zero,
       and is passed in the argument pattern. A pointer to a single  block  of
       memory  that is obtained via pcre_malloc is returned. This contains the
       compiled code and related data.  The  pcre  type  is  defined  for  the
       returned  block;  this  is a typedef for a structure whose contents are
       not externally defined. It is up to the caller to free the memory  when
       it is no longer required.

       Although  the compiled code of a PCRE regex is relocatable, that is, it
       does not depend on memory location, the complete pcre data block is not
       fully relocatable, because it contains a copy of the tableptr argument,
       which is an address (see below).

       The options argument contains independent bits that affect the compila-
       tion.  It  should  be  zero  if  no  options  are required. Some of the
       options, in particular, those that are compatible with Perl,  can  also
       be  set and unset from within the pattern (see the detailed description
       of regular expressions in the  pcrepattern  documentation).  For  these
       options,  the  contents of the options argument specifies their initial
       settings at the start of compilation and execution.  The  PCRE_ANCHORED
       option can be set at the time of matching as well as at compile time.

       If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
       if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
       sets the variable pointed to by errptr to point to a textual error mes-
       sage. The offset from the start of the pattern to the  character  where
       the  error  was  discovered  is  placed  in  the variable pointed to by
       erroffset, which must not be NULL. If it  is,  an  immediate  error  is
       given.

       If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
       character tables which are built when it is compiled, using the default
       C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to
       pcre_maketables(). See the section on locale support below.

       This code fragment shows a typical straightforward  call  to  pcre_com-
       pile():

         pcre *re;
         const char *error;
         int erroffset;
         re = pcre_compile(
           "^A.*Z",          /* the pattern */
           0,                /* default options */
           &error,           /* for error message */
           &erroffset,       /* for error offset */
           NULL);            /* use default character tables */

       The following option bits are defined:

         PCRE_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the  string
       which is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is  the
       only way to do it in Perl.

         PCRE_CASELESS

       If  this  bit is set, letters in the pattern match both upper and lower
       case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
       changed within a pattern by a (?i) option setting.

         PCRE_DOLLAR_ENDONLY

       If  this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option,  a  dollar  also
       matches  immediately before the final character if it is a newline (but
       not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is
       ignored if PCRE_MULTILINE is set. There is no equivalent to this option
       in Perl, and no way to set it within a pattern.

         PCRE_DOTALL

       If this bit is set, a dot metacharater in the pattern matches all char-
       acters,  including  newlines.  Without  it, newlines are excluded. This
       option is equivalent to Perl's /s option, and it can be changed  within
       a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]
       always matches a newline character, independent of the setting of  this
       option.

         PCRE_EXTENDED

       If  this  bit  is  set,  whitespace  data characters in the pattern are
       totally ignored except  when  escaped  or  inside  a  character  class.
       Whitespace  does  not  include the VT character (code 11). In addition,
       characters between an unescaped # outside a  character  class  and  the
       next newline character, inclusive, are also ignored. This is equivalent
       to Perl's /x option, and it can be changed within a pattern by  a  (?x)
       option setting.

       This  option  makes  it possible to include comments inside complicated
       patterns.  Note, however, that this applies only  to  data  characters.
       Whitespace   characters  may  never  appear  within  special  character
       sequences in a pattern, for  example  within  the  sequence  (?(  which
       introduces a conditional subpattern.

         PCRE_EXTRA

       This  option  was invented in order to turn on additional functionality
       of PCRE that is incompatible with Perl, but it  is  currently  of  very
       little  use. When set, any backslash in a pattern that is followed by a
       letter that has no special meaning  causes  an  error,  thus  reserving
       these  combinations  for  future  expansion.  By default, as in Perl, a
       backslash followed by a letter with no special meaning is treated as  a
       literal.  There  are  at  present  no other features controlled by this
       option. It can also be set by a (?X) option setting within a pattern.

         PCRE_MULTILINE

       By default, PCRE treats the subject string as consisting  of  a  single
       "line"  of  characters (even if it actually contains several newlines).
       The "start of line" metacharacter (^) matches only at the start of  the
       string,  while  the "end of line" metacharacter ($) matches only at the
       end of the string, or before a terminating  newline  (unless  PCRE_DOL-
       LAR_ENDONLY is set). This is the same as Perl.

       When  PCRE_MULTILINE  it  is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before  any  new-
       line  in the subject string, respectively, as well as at the very start
       and end. This is equivalent to Perl's /m option, and it can be  changed
       within a pattern by a (?m) option setting. If there are no "\n" charac-
       ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,
       setting PCRE_MULTILINE has no effect.

         PCRE_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing paren-
       theses in the pattern. Any opening parenthesis that is not followed  by
       ?  behaves as if it were followed by ?: but named parentheses can still
       be used for capturing (and they acquire  numbers  in  the  usual  way).
       There is no equivalent of this option in Perl.

         PCRE_UNGREEDY

       This  option  inverts  the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It  is
       not  compatible  with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE_UTF8

       This option causes PCRE to regard both the pattern and the  subject  as
       strings  of  UTF-8 characters instead of single-byte character strings.
       However, it is available only if PCRE has been built to  include  UTF-8
       support.  If  not, the use of this option provokes an error. Details of
       how this option changes the behaviour of PCRE are given in the  section
       on UTF-8 support in the main pcre page.

         PCRE_NO_UTF8_CHECK

       When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
       automatically checked. If an invalid UTF-8 sequence of bytes is  found,
       pcre_compile()  returns an error. If you already know that your pattern
       is valid, and you want to skip this check for performance reasons,  you
       can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of
       passing an invalid UTF-8 string as a pattern is undefined. It may cause
       your  program  to  crash.  Note that there is a similar option for sup-
       pressing the checking of subject strings passed to pcre_exec().



STUDYING A PATTERN

       pcre_extra *pcre_study(const pcre *code, int options,
            const char **errptr);

       When a pattern is going to be used several times, it is worth  spending
       more  time  analyzing it in order to speed up the time taken for match-
       ing. The function pcre_study() takes a pointer to a compiled pattern as
       its first argument. If studing the pattern produces additional informa-
       tion that will help speed up matching, pcre_study() returns  a  pointer
       to  a  pcre_extra  block,  in  which the study_data field points to the
       results of the study.

       The returned value from  a  pcre_study()  can  be  passed  directly  to
       pcre_exec().  However,  the pcre_extra block also contains other fields
       that can be set by the caller before the block  is  passed;  these  are
       described  below.  If  studying  the pattern does not produce any addi-
       tional information, pcre_study() returns NULL. In that circumstance, if
       the  calling  program  wants  to  pass  some  of  the  other  fields to
       pcre_exec(), it must set up its own pcre_extra block.

       The second argument contains option bits. At present,  no  options  are
       defined for pcre_study(), and this argument should always be zero.

       The  third argument for pcre_study() is a pointer for an error message.
       If studying succeeds (even if no data is  returned),  the  variable  it
       points  to  is set to NULL. Otherwise it points to a textual error mes-
       sage. You should therefore test the error pointer for NULL after  call-
       ing pcre_study(), to be sure that it has run successfully.

       This is a typical call to pcre_study():

         pcre_extra *pe;
         pe = pcre_study(
           re,             /* result of pcre_compile() */
           0,              /* no options exist */
           &error);        /* set to NULL or points to a message */

       At present, studying a pattern is useful only for non-anchored patterns
       that do not have a single fixed starting character. A bitmap of  possi-
       ble starting characters is created.


LOCALE SUPPORT

       PCRE  handles  caseless matching, and determines whether characters are
       letters, digits, or whatever, by reference to a  set  of  tables.  When
       running  in UTF-8 mode, this applies only to characters with codes less
       than 256. The library contains a default set of tables that is  created
       in  the  default  C locale when PCRE is compiled. This is used when the
       final argument of pcre_compile() is NULL, and is  sufficient  for  many
       applications.

       An alternative set of tables can, however, be supplied. Such tables are
       built by calling the pcre_maketables() function,  which  has  no  argu-
       ments,  in  the  relevant  locale.  The  result  can  then be passed to
       pcre_compile() as often as necessary. For example,  to  build  and  use
       tables that are appropriate for the French locale (where accented char-
       acters with codes greater than 128 are treated as letters), the follow-
       ing code could be used:

         setlocale(LC_CTYPE, "fr");
         tables = pcre_maketables();
         re = pcre_compile(..., tables);

       The  tables  are  built in memory that is obtained via pcre_malloc. The
       pointer that is passed to pcre_compile is saved with the compiled  pat-
       tern, and the same tables are used via this pointer by pcre_study() and
       pcre_exec(). Thus, for any single pattern,  compilation,  studying  and
       matching  all  happen in the same locale, but different patterns can be
       compiled in different locales. It is  the  caller's  responsibility  to
       ensure  that  the memory containing the tables remains available for as
       long as it is needed.


INFORMATION ABOUT A PATTERN

       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

       The pcre_fullinfo() function returns information about a compiled  pat-
       tern. It replaces the obsolete pcre_info() function, which is neverthe-
       less retained for backwards compability (and is documented below).

       The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
       pattern.  The second argument is the result of pcre_study(), or NULL if
       the pattern was not studied. The third argument specifies  which  piece
       of  information  is required, and the fourth argument is a pointer to a
       variable to receive the data. The yield of the  function  is  zero  for
       success, or one of the following negative numbers:

         PCRE_ERROR_NULL       the argument code was NULL
                               the argument where was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found
         PCRE_ERROR_BADOPTION  the value of what was invalid

       Here  is a typical call of pcre_fullinfo(), to obtain the length of the
       compiled pattern:

         int rc;
         unsigned long int length;
         rc = pcre_fullinfo(
           re,               /* result of pcre_compile() */
           pe,               /* result of pcre_study(), or NULL */
           PCRE_INFO_SIZE,   /* what is required */
           &length);         /* where to put the data */

       The possible values for the third argument are defined in  pcre.h,  and
       are as follows:

         PCRE_INFO_BACKREFMAX

       Return  the  number  of  the highest back reference in the pattern. The
       fourth argument should point to an int variable. Zero  is  returned  if
       there are no back references.

         PCRE_INFO_CAPTURECOUNT

       Return  the  number of capturing subpatterns in the pattern. The fourth
       argument should point to an int variable.

         PCRE_INFO_FIRSTBYTE

       Return information about the first byte of any matched  string,  for  a
       non-anchored    pattern.    (This    option    used    to   be   called
       PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards
       compatibility.)

       If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as
       (cat|cow|coyote), it is returned in the integer pointed  to  by  where.
       Otherwise, if either

       (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
       branch starts with "^", or

       (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
       set (if it were set, the pattern would be anchored),

       -1  is  returned, indicating that the pattern matches only at the start
       of a subject string or after any newline within the  string.  Otherwise
       -2 is returned. For anchored patterns, -2 is returned.

         PCRE_INFO_FIRSTTABLE

       If  the pattern was studied, and this resulted in the construction of a
       256-bit table indicating a fixed set of bytes for the first byte in any
       matching  string, a pointer to the table is returned. Otherwise NULL is
       returned. The fourth argument should point to an unsigned char *  vari-
       able.

         PCRE_INFO_LASTLITERAL

       Return  the  value of the rightmost literal byte that must exist in any
       matched string, other than at its  start,  if  such  a  byte  has  been
       recorded. The fourth argument should point to an int variable. If there
       is no such byte, -1 is returned. For anchored patterns, a last  literal
       byte  is  recorded only if it follows something of variable length. For
       example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
       /^a\dz\d/ the returned value is -1.

         PCRE_INFO_NAMECOUNT
         PCRE_INFO_NAMEENTRYSIZE
         PCRE_INFO_NAMETABLE

       PCRE  supports the use of named as well as numbered capturing parenthe-
       ses. The names are just an additional way of identifying the  parenthe-
       ses,  which still acquire a number. A caller that wants to extract data
       from a named subpattern must convert the name to a number in  order  to
       access  the  correct  pointers  in  the  output  vector (described with
       pcre_exec() below). In order to do this, it must first use these  three
       values to obtain the name-to-number mapping table for the pattern.

       The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
       gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
       of  each  entry;  both  of  these  return  an int value. The entry size
       depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
       a  pointer  to  the  first  entry of the table (a pointer to char). The
       first two bytes of each entry are the number of the capturing parenthe-
       sis,  most  significant byte first. The rest of the entry is the corre-
       sponding name, zero terminated. The names are  in  alphabetical  order.
       For  example,  consider  the following pattern (assume PCRE_EXTENDED is
       set, so white space - including newlines - is ignored):

         (?P<date> (?P<year>(\d\d)?\d\d) -
         (?P<month>\d\d) - (?P<day>\d\d) )

       There are four named subpatterns, so the table has  four  entries,  and
       each  entry  in the table is eight bytes long. The table is as follows,
       with non-printing bytes shows in hex, and undefined bytes shown as ??:

         00 01 d  a  t  e  00 ??
         00 05 d  a  y  00 ?? ??
         00 04 m  o  n  t  h  00
         00 02 y  e  a  r  00 ??

       When writing code to extract data from named subpatterns, remember that
       the length of each entry may be different for each compiled pattern.

         PCRE_INFO_OPTIONS

       Return  a  copy of the options with which the pattern was compiled. The
       fourth argument should point to an unsigned long  int  variable.  These
       option bits are those specified in the call to pcre_compile(), modified
       by any top-level option settings within the pattern itself.

       A pattern is automatically anchored by PCRE if  all  of  its  top-level
       alternatives begin with one of the following:

         ^     unless PCRE_MULTILINE is set
         \A    always
         \G    always
         .*    if PCRE_DOTALL is set and there are no back
                 references to the subpattern in which .* appears

       For such patterns, the PCRE_ANCHORED bit is set in the options returned
       by pcre_fullinfo().

         PCRE_INFO_SIZE

       Return the size of the compiled pattern, that is, the  value  that  was
       passed as the argument to pcre_malloc() when PCRE was getting memory in
       which to place the compiled data. The fourth argument should point to a
       size_t variable.

         PCRE_INFO_STUDYSIZE

       Returns  the  size of the data block pointed to by the study_data field
       in a pcre_extra block. That is, it is the  value  that  was  passed  to
       pcre_malloc() when PCRE was getting memory into which to place the data
       created by pcre_study(). The fourth argument should point to  a  size_t
       variable.


OBSOLETE INFO FUNCTION

       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

       The  pcre_info()  function is now obsolete because its interface is too
       restrictive to return all the available data about a compiled  pattern.
       New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of
       pcre_info() is the number of capturing subpatterns, or one of the  fol-
       lowing negative numbers:

         PCRE_ERROR_NULL       the argument code was NULL
         PCRE_ERROR_BADMAGIC   the "magic number" was not found

       If  the  optptr  argument is not NULL, a copy of the options with which
       the pattern was compiled is placed in the integer  it  points  to  (see
       PCRE_INFO_OPTIONS above).

       If  the  pattern  is  not anchored and the firstcharptr argument is not
       NULL, it is used to pass back information about the first character  of
       any matched string (see PCRE_INFO_FIRSTBYTE above).


MATCHING A PATTERN

       int pcre_exec(const pcre *code, const pcre_extra *extra,
            const char *subject, int length, int startoffset,
            int options, int *ovector, int ovecsize);

       The  function pcre_exec() is called to match a subject string against a
       pre-compiled pattern, which is passed in the code argument. If the pat-
       tern  has been studied, the result of the study should be passed in the
       extra argument.

       Here is an example of a simple call to pcre_exec():

         int rc;
         int ovector[30];
         rc = pcre_exec(
           re,             /* result of pcre_compile() */
           NULL,           /* we didn't study the pattern */
           "some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           ovector,        /* vector for substring information */
           30);            /* number of elements in the vector */

       If the extra argument is not NULL, it must point to a  pcre_extra  data
       block.  The pcre_study() function returns such a block (when it doesn't
       return NULL), but you can also create one for yourself, and pass  addi-
       tional information in it. The fields in the block are as follows:

         unsigned long int flags;
         void *study_data;
         unsigned long int match_limit;
         void *callout_data;

       The  flags  field  is a bitmap that specifies which of the other fields
       are set. The flag bits are:

         PCRE_EXTRA_STUDY_DATA
         PCRE_EXTRA_MATCH_LIMIT
         PCRE_EXTRA_CALLOUT_DATA

       Other flag bits should be set to zero. The study_data field is  set  in
       the  pcre_extra  block  that is returned by pcre_study(), together with
       the appropriate flag bit. You should not set this yourself, but you can
       add to the block by setting the other fields.

       The match_limit field provides a means of preventing PCRE from using up
       a vast amount of resources when running patterns that are not going  to
       match,  but  which  have  a very large number of possibilities in their
       search trees. The classic  example  is  the  use  of  nested  unlimited
       repeats. Internally, PCRE uses a function called match() which it calls
       repeatedly (sometimes recursively). The limit is imposed on the  number
       of  times  this function is called during a match, which has the effect
       of limiting the amount of recursion  and  backtracking  that  can  take
       place.  For  patterns that are not anchored, the count starts from zero
       for each position in the subject string.

       The default limit for the library can be set when PCRE  is  built;  the
       default  default  is 10 million, which handles all but the most extreme
       cases. You can reduce  the  default  by  suppling  pcre_exec()  with  a
       pcre_extra  block  in  which match_limit is set to a smaller value, and
       PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
       exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

       The  pcre_callout  field is used in conjunction with the "callout" fea-
       ture, which is described in the pcrecallout documentation.

       The PCRE_ANCHORED option can be passed in the options  argument,  whose
       unused  bits  must  be zero. This limits pcre_exec() to matching at the
       first matching position.  However,  if  a  pattern  was  compiled  with
       PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,
       it cannot be made unachored at matching time.

       When PCRE_UTF8 was set at compile time, the validity of the subject  as
       a  UTF-8  string is automatically checked, and the value of startoffset
       is also checked to ensure that it points to the start of a UTF-8  char-
       acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()
       returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an
       invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

       If  you  already  know that your subject is valid, and you want to skip
       these   checks   for   performance   reasons,   you   can    set    the
       PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
       do this for the second and subsequent calls to pcre_exec() if  you  are
       making  repeated  calls  to  find  all  the matches in a single subject
       string. However, you should be  sure  that  the  value  of  startoffset
       points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
       set, the effect of passing an invalid UTF-8 string as a subject,  or  a
       value  of startoffset that does not point to the start of a UTF-8 char-
       acter, is undefined. Your program may crash.

       There are also three further options that can be set only  at  matching
       time:

         PCRE_NOTBOL

       The  first  character  of the string is not the beginning of a line, so
       the circumflex metacharacter should not match before it.  Setting  this
       without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to
       match.

         PCRE_NOTEOL

       The end of the string is not the end of a line, so the dollar metachar-
       acter  should  not  match  it  nor (except in multiline mode) a newline
       immediately before it. Setting this without PCRE_MULTILINE (at  compile
       time) causes dollar never to match.

         PCRE_NOTEMPTY

       An empty string is not considered to be a valid match if this option is
       set. If there are alternatives in the pattern, they are tried.  If  all
       the  alternatives  match  the empty string, the entire match fails. For
       example, if the pattern

         a?b?

       is applied to a string not beginning with "a" or "b",  it  matches  the
       empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
       match is not valid, so PCRE searches further into the string for occur-
       rences of "a" or "b".

       Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
       cial case of a pattern match of the empty  string  within  its  split()
       function,  and  when  using  the /g modifier. It is possible to emulate
       Perl's behaviour after matching a null string by first trying the match
       again at the same offset with PCRE_NOTEMPTY set, and then if that fails
       by advancing the starting offset (see below)  and  trying  an  ordinary
       match again.

       The  subject string is passed to pcre_exec() as a pointer in subject, a
       length in length, and a starting byte offset in startoffset. Unlike the
       pattern  string,  the  subject  may contain binary zero bytes. When the
       starting offset is zero, the search for a match starts at the beginning
       of the subject, and this is by far the most common case.

       If the pattern was compiled with the PCRE_UTF8 option, the subject must
       be a sequence of bytes that is a valid UTF-8 string, and  the  starting
       offset  must point to the beginning of a UTF-8 character. If an invalid
       UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8
       or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option
       PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not
       defined.

       A  non-zero  starting offset is useful when searching for another match
       in the same subject by calling pcre_exec() again after a previous  suc-
       cess.   Setting  startoffset differs from just passing over a shortened
       string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
       with any kind of lookbehind. For example, consider the pattern

         \Biss\B

       which  finds  occurrences  of "iss" in the middle of words. (\B matches
       only if the current position in the subject is not  a  word  boundary.)
       When  applied  to the string "Mississipi" the first call to pcre_exec()
       finds the first occurrence. If pcre_exec() is called  again  with  just
       the  remainder  of  the  subject,  namely  "issipi", it does not match,
       because \B is always false at the start of the subject, which is deemed
       to  be  a  word  boundary. However, if pcre_exec() is passed the entire
       string again, but with startoffset  set  to  4,  it  finds  the  second
       occurrence  of  "iss"  because  it  is able to look behind the starting
       point to discover that it is preceded by a letter.

       If a non-zero starting offset is passed when the pattern  is  anchored,
       one  attempt  to match at the given offset is tried. This can only suc-
       ceed if the pattern does not require the match to be at  the  start  of
       the subject.

       In  general, a pattern matches a certain portion of the subject, and in
       addition, further substrings from the subject  may  be  picked  out  by
       parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
       this is called "capturing" in what follows, and the  phrase  "capturing
       subpattern"  is  used for a fragment of a pattern that picks out a sub-
       string. PCRE supports several other kinds of  parenthesized  subpattern
       that do not cause substrings to be captured.

       Captured  substrings are returned to the caller via a vector of integer
       offsets whose address is passed in ovector. The number of  elements  in
       the vector is passed in ovecsize. The first two-thirds of the vector is
       used to pass back captured substrings, each substring using a  pair  of
       integers.  The  remaining  third  of the vector is used as workspace by
       pcre_exec() while matching capturing subpatterns, and is not  available
       for  passing  back  information.  The  length passed in ovecsize should
       always be a multiple of three. If it is not, it is rounded down.

       When a match has been successful, information about captured substrings
       is returned in pairs of integers, starting at the beginning of ovector,
       and continuing up to two-thirds of its length at the  most.  The  first
       element of a pair is set to the offset of the first character in a sub-
       string, and the second is set to the  offset  of  the  first  character
       after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
       tor[1], identify the portion of  the  subject  string  matched  by  the
       entire  pattern.  The next pair is used for the first capturing subpat-
       tern, and so on. The value returned by pcre_exec()  is  the  number  of
       pairs  that  have  been set. If there are no capturing subpatterns, the
       return value from a successful match is 1,  indicating  that  just  the
       first pair of offsets has been set.

       Some  convenience  functions  are  provided for extracting the captured
       substrings as separate strings. These are described  in  the  following
       section.

       It  is  possible  for  an capturing subpattern number n+1 to match some
       part of the subject when subpattern n has not been  used  at  all.  For
       example, if the string "abc" is matched against the pattern (a|(z))(bc)
       subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both
       offset values corresponding to the unused subpattern are set to -1.

       If a capturing subpattern is matched repeatedly, it is the last portion
       of the string that it matched that gets returned.

       If the vector is too small to hold all the captured substrings,  it  is
       used as far as possible (up to two-thirds of its length), and the func-
       tion returns a value of zero. In particular, if the  substring  offsets
       are  not  of interest, pcre_exec() may be called with ovector passed as
       NULL and ovecsize as zero. However, if the pattern contains back refer-
       ences  and  the  ovector  isn't big enough to remember the related sub-
       strings, PCRE has to get additional memory  for  use  during  matching.
       Thus it is usually advisable to supply an ovector.

       Note  that  pcre_info() can be used to find out how many capturing sub-
       patterns there are in a compiled pattern. The smallest size for ovector
       that  will  allow for n captured substrings, in addition to the offsets
       of the substring matched by the whole pattern, is (n+1)*3.

       If pcre_exec() fails, it returns a negative number. The  following  are
       defined in the header file:

         PCRE_ERROR_NOMATCH        (-1)

       The subject string did not match the pattern.

         PCRE_ERROR_NULL           (-2)

       Either  code  or  subject  was  passed as NULL, or ovector was NULL and
       ovecsize was not zero.

         PCRE_ERROR_BADOPTION      (-3)

       An unrecognized bit was set in the options argument.

         PCRE_ERROR_BADMAGIC       (-4)

       PCRE stores a 4-byte "magic number" at the start of the compiled  code,
       to  catch  the case when it is passed a junk pointer. This is the error
       it gives when the magic number isn't present.

         PCRE_ERROR_UNKNOWN_NODE   (-5)

       While running the pattern match, an unknown item was encountered in the
       compiled  pattern.  This  error  could be caused by a bug in PCRE or by
       overwriting of the compiled pattern.

         PCRE_ERROR_NOMEMORY       (-6)

       If a pattern contains back references, but the ovector that  is  passed
       to pcre_exec() is not big enough to remember the referenced substrings,
       PCRE gets a block of memory at the start of matching to  use  for  this
       purpose.  If the call via pcre_malloc() fails, this error is given. The
       memory is freed at the end of matching.

         PCRE_ERROR_NOSUBSTRING    (-7)

       This error is used by the pcre_copy_substring(),  pcre_get_substring(),
       and  pcre_get_substring_list()  functions  (see  below).  It  is  never
       returned by pcre_exec().

         PCRE_ERROR_MATCHLIMIT     (-8)

       The recursion and backtracking limit, as specified by  the  match_limit
       field  in  a  pcre_extra  structure (or defaulted) was reached. See the
       description above.

         PCRE_ERROR_CALLOUT        (-9)

       This error is never generated by pcre_exec() itself. It is provided for
       use  by  callout functions that want to yield a distinctive error code.
       See the pcrecallout documentation for details.

         PCRE_ERROR_BADUTF8        (-10)

       A string that contains an invalid UTF-8 byte sequence was passed  as  a
       subject.

         PCRE_ERROR_BADUTF8_OFFSET (-11)

       The UTF-8 byte sequence that was passed as a subject was valid, but the
       value of startoffset did not point to the beginning of a UTF-8  charac-
       ter.


EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre_copy_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber, char *buffer,
            int buffersize);

       int pcre_get_substring(const char *subject, int *ovector,
            int stringcount, int stringnumber,
            const char **stringptr);

       int pcre_get_substring_list(const char *subject,
            int *ovector, int stringcount, const char ***listptr);

       Captured  substrings  can  be  accessed  directly  by using the offsets
       returned by pcre_exec() in  ovector.  For  convenience,  the  functions
       pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
       string_list() are provided for extracting captured substrings  as  new,
       separate,  zero-terminated strings. These functions identify substrings
       by number. The next section describes functions  for  extracting  named
       substrings.  A  substring  that  contains  a  binary  zero is correctly
       extracted and has a further zero added on the end, but  the  result  is
       not, of course, a C string.

       The  first  three  arguments  are the same for all three of these func-
       tions: subject is the subject string which has just  been  successfully
       matched, ovector is a pointer to the vector of integer offsets that was
       passed to pcre_exec(), and stringcount is the number of substrings that
       were  captured  by  the match, including the substring that matched the
       entire regular expression. This is the value returned by  pcre_exec  if
       it  is greater than zero. If pcre_exec() returned zero, indicating that
       it ran out of space in ovector, the value passed as stringcount  should
       be the size of the vector divided by three.

       The  functions pcre_copy_substring() and pcre_get_substring() extract a
       single substring, whose number is given as  stringnumber.  A  value  of
       zero  extracts  the  substring  that  matched the entire pattern, while
       higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
       string(),  the  string  is  placed  in buffer, whose length is given by
       buffersize, while for pcre_get_substring() a new  block  of  memory  is
       obtained  via  pcre_malloc,  and its address is returned via stringptr.
       The yield of the function is the length of the  string,  not  including
       the terminating zero, or one of

         PCRE_ERROR_NOMEMORY       (-6)

       The  buffer  was too small for pcre_copy_substring(), or the attempt to
       get memory failed for pcre_get_substring().

         PCRE_ERROR_NOSUBSTRING    (-7)

       There is no substring whose number is stringnumber.

       The pcre_get_substring_list()  function  extracts  all  available  sub-
       strings  and  builds  a list of pointers to them. All this is done in a
       single block of memory which is obtained via pcre_malloc.  The  address
       of the memory block is returned via listptr, which is also the start of
       the list of string pointers. The end of the list is marked  by  a  NULL
       pointer. The yield of the function is zero if all went well, or

         PCRE_ERROR_NOMEMORY       (-6)

       if the attempt to get the memory block failed.

       When  any of these functions encounter a substring that is unset, which
       can happen when capturing subpattern number n+1 matches  some  part  of
       the  subject, but subpattern n has not been used at all, they return an
       empty string. This can be distinguished from a genuine zero-length sub-
       string  by inspecting the appropriate offset in ovector, which is nega-
       tive for unset substrings.

       The    two    convenience    functions    pcre_free_substring()     and
       pcre_free_substring_list() can be used to free the memory returned by a
       previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),
       respectively. They do nothing more than call the function pointed to by
       pcre_free, which of course could be called directly from a  C  program.
       However,  PCRE is used in some situations where it is linked via a spe-
       cial  interface  to  another  programming  language  which  cannot  use
       pcre_free  directly;  it is for these cases that the functions are pro-
       vided.


EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre_copy_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            char *buffer, int buffersize);

       int pcre_get_stringnumber(const pcre *code,
            const char *name);

       int pcre_get_named_substring(const pcre *code,
            const char *subject, int *ovector,
            int stringcount, const char *stringname,
            const char **stringptr);

       To extract a substring by name, you first have to find associated  num-
       ber.  This  can  be  done by calling pcre_get_stringnumber(). The first
       argument is the compiled pattern, and the second is the name. For exam-
       ple, for this pattern

         ab(?<xxx>\d+)...

       the  number  of the subpattern called "xxx" is 1. Given the number, you
       can then extract the substring directly, or use one  of  the  functions
       described  in the previous section. For convenience, there are also two
       functions that do the whole job.

       Most   of   the   arguments    of    pcre_copy_named_substring()    and
       pcre_get_named_substring() are the same as those for the functions that
       extract by number, and so are not re-described here. There are just two
       differences.

       First,  instead  of a substring number, a substring name is given. Sec-
       ond, there is an extra argument, given at the start, which is a pointer
       to  the compiled pattern. This is needed in order to gain access to the
       name-to-number translation table.

       These functions call pcre_get_stringnumber(), and if it succeeds,  they
       then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
       ate.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE CALLOUTS

       int (*pcre_callout)(pcre_callout_block *);

       PCRE provides a feature called "callout", which is a means of temporar-
       ily passing control to the caller of PCRE  in  the  middle  of  pattern
       matching.  The  caller of PCRE provides an external function by putting
       its entry point in the global variable pcre_callout. By  default,  this
       variable contains NULL, which disables all calling out.

       Within  a  regular  expression,  (?C) indicates the points at which the
       external function is to be called.  Different  callout  points  can  be
       identified  by  putting  a number less than 256 after the letter C. The
       default value is zero.  For  example,  this  pattern  has  two  callout
       points:

         (?C1)abc(?C2)def

       During matching, when PCRE reaches a callout point (and pcre_callout is
       set), the external function is called. Its only argument is  a  pointer
       to a pcre_callout block. This contains the following variables:

         int          version;
         int          callout_number;
         int         *offset_vector;
         const char  *subject;
         int          subject_length;
         int          start_match;
         int          current_position;
         int          capture_top;
         int          capture_last;
         void        *callout_data;

       The  version  field  is an integer containing the version number of the
       block format. The current version  is  zero.  The  version  number  may
       change  in  future if additional fields are added, but the intention is
       never to remove any of the existing fields.

       The callout_number field contains the number of the  callout,  as  com-
       piled into the pattern (that is, the number after ?C).

       The  offset_vector field is a pointer to the vector of offsets that was
       passed by the caller to pcre_exec(). The contents can be  inspected  in
       order  to extract substrings that have been matched so far, in the same
       way as for extracting substrings after a match has completed.

       The subject and subject_length fields contain copies  the  values  that
       were passed to pcre_exec().

       The  start_match  field contains the offset within the subject at which
       the current match attempt started. If the pattern is not anchored,  the
       callout  function  may  be  called several times for different starting
       points.

       The current_position field contains the offset within  the  subject  of
       the current match pointer.

       The  capture_top field contains one more than the number of the highest
       numbered  captured  substring  so  far.  If  no  substrings  have  been
       captured, the value of capture_top is one.

       The  capture_last  field  contains the number of the most recently cap-
       tured substring.

       The callout_data field contains a value that is passed  to  pcre_exec()
       by  the  caller specifically so that it can be passed back in callouts.
       It is passed in the pcre_callout field of the  pcre_extra  data  struc-
       ture.  If  no  such  data  was  passed,  the value of callout_data in a
       pcre_callout block is NULL. There is a description  of  the  pcre_extra
       structure in the pcreapi documentation.



RETURN VALUES

       The callout function returns an integer. If the value is zero, matching
       proceeds as normal. If the value is greater than zero,  matching  fails
       at the current point, but backtracking to test other possibilities goes
       ahead, just as if a lookahead assertion had failed.  If  the  value  is
       less  than  zero,  the  match is abandoned, and pcre_exec() returns the
       value.

       Negative  values  should  normally  be   chosen   from   the   set   of
       PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
       dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
       reserved  for  use  by callout functions; it will never be used by PCRE
       itself.

Last updated: 21 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

DIFFERENCES FROM PERL

       This  document describes the differences in the ways that PCRE and Perl
       handle regular expressions. The differences  described  here  are  with
       respect to Perl 5.8.

       1.  PCRE does not have full UTF-8 support. Details of what it does have
       are given in the section on UTF-8 support in the main pcre page.

       2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
       permits  them,  but they do not mean what you might think. For example,
       (?!a){3} does not assert that the next three characters are not "a". It
       just asserts that the next character is not "a" three times.

       3.  Capturing  subpatterns  that occur inside negative lookahead asser-
       tions are counted, but their entries in the offsets  vector  are  never
       set.  Perl sets its numerical variables from any such patterns that are
       matched before the assertion fails to match something (thereby succeed-
       ing),  but  only  if the negative lookahead assertion contains just one
       branch.

       4. Though binary zero characters are supported in the  subject  string,
       they are not allowed in a pattern string because it is passed as a nor-
       mal C string, terminated by zero. The escape sequence "\0" can be  used
       in the pattern to represent a binary zero.

       5.  The  following Perl escape sequences are not supported: \l, \u, \L,
       \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
       string-handling and are not part of its pattern matching engine. If any
       of these are encountered by PCRE, an error is generated.

       6. PCRE does support the \Q...\E escape for quoting substrings. Charac-
       ters  in  between  are  treated as literals. This is slightly different
       from Perl in that $ and @ are  also  handled  as  literals  inside  the
       quotes.  In Perl, they cause variable interpolation (but of course PCRE
       does not have variables). Note the following examples:

           Pattern            PCRE matches      Perl matches

           \Qabc$xyz\E        abc$xyz           abc followed by the
                                                  contents of $xyz
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz

       The \Q...\E sequence is recognized both inside  and  outside  character
       classes.

       7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
       constructions. However, there is some experimental support  for  recur-
       sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).
       Also, the PCRE "callout" feature allows  an  external  function  to  be
       called during pattern matching.

       8.  There  are some differences that are concerned with the settings of
       captured strings when part of  a  pattern  is  repeated.  For  example,
       matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
       unset, but in PCRE it is set to "b".

       9. PCRE  provides  some  extensions  to  the  Perl  regular  expression
       facilities:

       (a)  Although  lookbehind  assertions  must match fixed length strings,
       each alternative branch of a lookbehind assertion can match a different
       length of string. Perl requires them all to have the same length.

       (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
       meta-character matches only at the very end of the string.

       (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
       cial meaning is faulted.

       (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
       fiers is inverted, that is, by default they are not greedy, but if fol-
       lowed by a question mark they are.

       (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at
       the first matching position in the subject string.

       (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-
       TURE options for pcre_exec() have no Perl equivalents.

       (g)  The (?R), (?number), and (?P>name) constructs allows for recursive
       pattern matching (Perl can do  this  using  the  (?p{code})  construct,
       which PCRE cannot support.)

       (h)  PCRE supports named capturing substrings, using the Python syntax.

       (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from
       Sun's Java package.

       (j) The (R) condition, for testing recursion, is a PCRE extension.

       (k) The callout facility is PCRE-specific.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

       The  syntax  and semantics of the regular expressions supported by PCRE
       are described below. Regular expressions are also described in the Perl
       documentation  and in a number of other books, some of which have copi-
       ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-
       lished  by  O'Reilly, covers them in great detail. The description here
       is intended as reference documentation.

       The basic operation of PCRE is on strings of bytes. However,  there  is
       also  support for UTF-8 character strings. To use this support you must
       build PCRE to include UTF-8 support, and then call pcre_compile()  with
       the  PCRE_UTF8  option.  How  this affects the pattern matching is men-
       tioned in several places below. There is also a summary of  UTF-8  fea-
       tures in the section on UTF-8 support in the main pcre page.

       A  regular  expression  is  a pattern that is matched against a subject
       string from left to right. Most characters stand for  themselves  in  a
       pattern,  and  match  the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself.  The
       power of regular expressions comes from the ability to include alterna-
       tives and repetitions in the pattern. These are encoded in the  pattern
       by  the  use  of meta-characters, which do not stand for themselves but
       instead are interpreted in some special way.

       There are two different sets of meta-characters: those that are  recog-
       nized  anywhere in the pattern except within square brackets, and those
       that are recognized in square brackets. Outside  square  brackets,  the
       meta-characters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part  of  a  pattern  that is in square brackets is called a "character
       class". In a character class the only meta-characters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the meta-characters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a non-alphameric character, it takes  away  any  special  meaning  that
       character  may  have.  This  use  of  backslash  as an escape character
       applies both inside and outside character classes.

       For example, if you want to match a * character, you write  \*  in  the
       pattern.   This  escaping  action  applies whether or not the following
       character would otherwise be interpreted as a meta-character, so it  is
       always  safe to precede a non-alphameric with backslash to specify that
       it stands for itself. In particular, if you want to match a  backslash,
       you write \\.

       If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
       the pattern (other than in a character class) and characters between  a
       # outside a character class and the next newline character are ignored.
       An escaping backslash can be used to include a whitespace or #  charac-
       ter as part of the pattern.

       If  you  want  to remove the special meaning from a sequence of charac-
       ters, you can do so by putting them between \Q and \E. This is  differ-
       ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
       tion. Note the following examples:

         Pattern            PCRE matches   Perl matches

         \Qabc$xyz\E        abc$xyz        abc followed by the
                                             contents of $xyz
         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

       The  \Q...\E  sequence  is recognized both inside and outside character
       classes.

       A second use of backslash provides a way of encoding non-printing char-
       acters  in patterns in a visible manner. There is no restriction on the
       appearance of non-printing characters, apart from the binary zero  that
       terminates  a  pattern,  but  when  a pattern is being prepared by text
       editing, it is usually easier  to  use  one  of  the  following  escape
       sequences than the binary character it represents:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any character
         \e        escape (hex 1B)
         \f        formfeed (hex 0C)
         \n        newline (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \ddd      character with octal code ddd, or backreference
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh... (UTF-8 mode only)

       The  precise  effect of \cx is as follows: if x is a lower case letter,
       it is converted to upper case. Then bit 6 of the character (hex 40)  is
       inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
       becomes hex 7B.

       After \x, from zero to two hexadecimal digits are read (letters can  be
       in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
       its may appear between \x{ and }, but the value of the  character  code
       must  be  less  than  2**31  (that is, the maximum hexadecimal value is
       7FFFFFFF). If characters other than hexadecimal digits  appear  between
       \x{  and }, or if there is no terminating }, this form of escape is not
       recognized. Instead, the initial \x will be interpreted as a basic hex-
       adecimal escape, with no following digits, giving a byte whose value is
       zero.

       Characters whose value is less than 256 can be defined by either of the
       two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
       in the way they are handled. For example, \xdc is exactly the  same  as
       \x{dc}.

       After  \0  up  to  two further octal digits are read. In both cases, if
       there are fewer than two digits, just those that are present are  used.
       Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL
       character (code value 7). Make sure you supply  two  digits  after  the
       initial zero if the character that follows is itself an octal digit.

       The handling of a backslash followed by a digit other than 0 is compli-
       cated.  Outside a character class, PCRE reads it and any following dig-
       its  as  a  decimal  number. If the number is less than 10, or if there
       have been at least that many previous capturing left parentheses in the
       expression,  the  entire  sequence  is  taken  as  a  back reference. A
       description of how this works is given later, following the  discussion
       of parenthesized subpatterns.

       Inside  a  character  class, or if the decimal number is greater than 9
       and there have not been that many capturing subpatterns, PCRE  re-reads
       up  to three octal digits following the backslash, and generates a sin-
       gle byte from the least significant 8 bits of the value. Any subsequent
       digits stand for themselves.  For example:

         \040   is another way of writing a space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the byte consisting entirely of 1 bits
         \81    is either a back reference, or a binary zero
                   followed by the two characters "8" and "1"

       Note  that  octal  values of 100 or greater must not be introduced by a
       leading zero, because no more than three octal digits are ever read.

       All the sequences that define a single byte value  or  a  single  UTF-8
       character (in UTF-8 mode) can be used both inside and outside character
       classes. In addition, inside a character  class,  the  sequence  \b  is
       interpreted  as  the  backspace character (hex 08). Outside a character
       class it has a different meaning (see below).

       The third use of backslash is for specifying generic character types:

         \d     any decimal digit
         \D     any character that is not a decimal digit
         \s     any whitespace character
         \S     any character that is not a whitespace character
         \w     any "word" character
         \W     any "non-word" character

       Each pair of escape sequences partitions the complete set of characters
       into  two disjoint sets. Any given character matches one, and only one,
       of each pair.

       In UTF-8 mode, characters with values greater than 255 never match  \d,
       \s, or \w, and always match \D, \S, and \W.

       For  compatibility  with Perl, \s does not match the VT character (code
       11).  This makes it different from the the POSIX "space" class. The  \s
       characters are HT (9), LF (10), FF (12), CR (13), and space (32).

       A  "word" character is any letter or digit or the underscore character,
       that is, any character which can be part of a Perl "word". The  defini-
       tion  of  letters  and digits is controlled by PCRE's character tables,
       and may vary if locale- specific matching is taking place (see  "Locale
       support"  in  the  pcreapi  page).  For  example,  in the "fr" (French)
       locale, some character codes greater than 128  are  used  for  accented
       letters, and these are matched by \w.

       These character type sequences can appear both inside and outside char-
       acter classes. They each match one character of the  appropriate  type.
       If  the current matching point is at the end of the subject string, all
       of them fail, since there is no character to match.

       The fourth use of backslash is for certain simple assertions. An asser-
       tion  specifies a condition that has to be met at a particular point in
       a match, without consuming any characters from the subject string.  The
       use  of subpatterns for more complicated assertions is described below.
       The backslashed assertions are

         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at start of subject
         \Z     matches at end of subject or before newline at end
         \z     matches at end of subject
         \G     matches at first matching position in subject

       These assertions may not appear in character classes (but note that  \b
       has a different meaning, namely the backspace character, inside a char-
       acter class).

       A word boundary is a position in the subject string where  the  current
       character  and  the previous character do not both match \w or \W (i.e.
       one matches \w and the other matches \W), or the start or  end  of  the
       string if the first or last character matches \w, respectively.

       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
       and dollar (described below) in that they only ever match at  the  very
       start  and  end  of the subject string, whatever options are set. Thus,
       they are independent of multiline mode.

       They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
       startoffset argument of pcre_exec() is non-zero, indicating that match-
       ing is to start at a point other than the beginning of the subject,  \A
       can  never  match.  The difference between \Z and \z is that \Z matches
       before a newline that is the last character of the string as well as at
       the end of the string, whereas \z matches only at the end.

       The  \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset  argument
       of  pcre_exec().  It  differs  from \A when the value of startoffset is
       non-zero. By calling pcre_exec() multiple times with appropriate  argu-
       ments, you can mimic Perl's /g option, and it is in this kind of imple-
       mentation where \G can be useful.

       Note, however, that PCRE's interpretation of \G, as the  start  of  the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can  be  different  when  the
       previously  matched  string was empty. Because PCRE does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the  expression  is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       Outside a character class, in the default matching mode, the circumflex
       character  is  an  assertion which is true only if the current matching
       point is at the start of the subject string. If the  startoffset  argu-
       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
       has an entirely different meaning (see below).

       Circumflex  need  not be the first character of the pattern if a number
       of alternatives are involved, but it should be the first thing in  each
       alternative  in  which  it appears if the pattern is ever to match that
       branch. If all possible alternatives start with a circumflex, that  is,
       if  the  pattern  is constrained to match only at the start of the sub-
       ject, it is said to be an "anchored" pattern.  (There  are  also  other
       constructs that can cause a pattern to be anchored.)

       A  dollar  character  is an assertion which is true only if the current
       matching point is at the end of  the  subject  string,  or  immediately
       before a newline character that is the last character in the string (by
       default). Dollar need not be the last character of  the  pattern  if  a
       number  of alternatives are involved, but it should be the last item in
       any branch in which it appears.  Dollar has no  special  meaning  in  a
       character class.

       The  meaning  of  dollar  can be changed so that it matches only at the
       very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar characters are changed if the
       PCRE_MULTILINE option is set. When this is the case, they match immedi-
       ately  after  and  immediately  before  an  internal newline character,
       respectively, in addition to matching at the start and end of the  sub-
       ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
       string "def\nabc" in multiline mode, but not  otherwise.  Consequently,
       patterns  that  are  anchored  in single line mode because all branches
       start with ^ are not anchored in multiline mode, and a match  for  cir-
       cumflex  is  possible  when  the startoffset argument of pcre_exec() is
       non-zero. The PCRE_DOLLAR_ENDONLY option is ignored  if  PCRE_MULTILINE
       is set.

       Note  that  the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a  pattern
       start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
       not.


FULL STOP (PERIOD, DOT)

       Outside a character class, a dot in the pattern matches any one charac-
       ter  in  the  subject,  including a non-printing character, but not (by
       default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
       which  might  be  more than one byte long, except (by default) for new-
       line. If the PCRE_DOTALL option is set, dots match  newlines  as  well.
       The  handling of dot is entirely independent of the handling of circum-
       flex and dollar, the only relationship being  that  they  both  involve
       newline characters. Dot has no special meaning in a character class.


MATCHING A SINGLE BYTE

       Outside a character class, the escape sequence \C matches any one byte,
       both in and out of UTF-8 mode. Unlike a dot, it always matches  a  new-
       line.  The  feature  is  provided  in Perl in order to match individual
       bytes in UTF-8 mode.  Because it breaks up UTF-8 characters into  indi-
       vidual  bytes,  what  remains  in  the  string may be a malformed UTF-8
       string. For this reason it is best avoided.

       PCRE does not allow \C to appear in lookbehind assertions (see  below),
       because in UTF-8 mode it makes it impossible to calculate the length of
       the lookbehind.


SQUARE BRACKETS

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not spe-
       cial. If a closing square bracket is required as a member of the class,
       it  should  be  the first data character in the class (after an initial
       circumflex, if present) or escaped with a backslash.

       A character class matches a single character in the subject.  In  UTF-8
       mode,  the character may occupy more than one byte. A matched character
       must be in the set of characters defined by the class, unless the first
       character  in  the  class definition is a circumflex, in which case the
       subject character must not be in the set defined by  the  class.  If  a
       circumflex  is actually required as a member of the class, ensure it is
       not the first character, or escape it with a backslash.

       For example, the character class [aeiou] matches any lower case  vowel,
       while  [^aeiou]  matches  any character that is not a lower case vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters which are in the class by enumerating those that are not. It
       is not an assertion: it still consumes a  character  from  the  subject
       string, and fails if the current pointer is at the end of the string.

       In  UTF-8 mode, characters with values greater than 255 can be included
       in a class as a literal string of bytes, or by using the  \x{  escaping
       mechanism.

       When  caseless  matching  is set, any letters in a class represent both
       their upper case and lower case versions, so for  example,  a  caseless
       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would. PCRE does not  support  the
       concept of case for characters with values greater than 255.

       The  newline character is never treated in any special way in character
       classes, whatever the setting  of  the  PCRE_DOTALL  or  PCRE_MULTILINE
       options is. A class such as [^a] will always match a newline.

       The  minus (hyphen) character can be used to specify a range of charac-
       ters in a character  class.  For  example,  [d-m]  matches  any  letter
       between  d  and  m,  inclusive.  If  a minus character is required in a
       class, it must be escaped with a backslash  or  appear  in  a  position
       where  it cannot be interpreted as indicating a range, typically as the
       first or last character in the class.

       It is not possible to have the literal character "]" as the end charac-
       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so  it
       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
       preted  as  a  single class containing a range followed by two separate
       characters. The octal or hexadecimal representation of "]" can also  be
       used to end a range.

       Ranges  operate in the collating sequence of character values. They can
       also  be  used  for  characters  specified  numerically,  for   example
       [\000-\037].  In UTF-8 mode, ranges can include characters whose values
       are greater than 255, for example [\x{100}-\x{2ff}].

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\^_`wxyzabc], matched caselessly, and if character tables for the
       "fr"  locale  are  in use, [\xc8-\xcb] matches accented E characters in
       both cases.

       The character types \d, \D, \s, \S, \w, and \W may  also  appear  in  a
       character  class,  and add the characters that they match to the class.
       For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
       conveniently  be  used with the upper case character types to specify a
       more restricted set of characters than the matching  lower  case  type.
       For  example,  the  class  [^\W_]  matches any letter or digit, but not
       underscore.

       All non-alphameric characters other than \, -, ^ (at the start) and the
       terminating ] are non-special in character classes, but it does no harm
       if they are escaped.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation  for  character  classes,  which  uses
       names  enclosed by [: and :] within the enclosing square brackets. PCRE
       also supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits
         space    white space (not quite the same as \s)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
       and space (32). Notice that this list includes the VT  character  (code
       11). This makes "space" different to \s, which does not include VT (for
       Perl compatibility).

       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
       from  Perl  5.8. Another Perl extension is negation, which is indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       In UTF-8 mode, characters with values greater than 255 do not match any
       of the POSIX character classes.


VERTICAL BAR

       Vertical bar characters are used to separate alternative patterns.  For
       example, the pattern

         gilbert|sullivan

       matches  either "gilbert" or "sullivan". Any number of alternatives may
       appear, and an empty  alternative  is  permitted  (matching  the  empty
       string).   The  matching  process  tries each alternative in turn, from
       left to right, and the first one that succeeds is used. If the alterna-
       tives  are within a subpattern (defined below), "succeeds" means match-
       ing the rest of the main pattern as well as the alternative in the sub-
       pattern.


INTERNAL OPTION SETTING

       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
       PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
       sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
       option letters are

         i  for PCRE_CASELESS
         m  for PCRE_MULTILINE
         s  for PCRE_DOTALL
         x  for PCRE_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
       is also permitted. If a  letter  appears  both  before  and  after  the
       hyphen, the option is unset.

       When  an option change occurs at top level (that is, not inside subpat-
       tern parentheses), the change applies to the remainder of  the  pattern
       that follows.  If the change is placed right at the start of a pattern,
       PCRE extracts it into the global options (and it will therefore show up
       in data extracted by the pcre_fullinfo() function).

       An option change within a subpattern affects only that part of the cur-
       rent pattern that follows it, so

         (a(?i)b)c

       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
       used).   By  this means, options can be made to have different settings
       in different parts of the pattern. Any changes made in one  alternative
       do  carry  on  into subsequent branches within the same subpattern. For
       example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There  would  be
       some very weird behaviour otherwise.

       The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
       in the same way as the Perl-compatible options by using the  characters
       U  and X respectively. The (?X) flag setting is special in that it must
       always occur earlier in the pattern than any of the additional features
       it turns on, even when it is at top level. It is best put at the start.


SUBPATTERNS

       Subpatterns are delimited by parentheses (round brackets), which can be
       nested.  Marking part of a pattern as a subpattern does two things:

       1. It localizes a set of alternatives. For example, the pattern

         cat(aract|erpillar|)

       matches  one  of the words "cat", "cataract", or "caterpillar". Without
       the parentheses, it would match "cataract",  "erpillar"  or  the  empty
       string.

       2.  It  sets  up  the  subpattern as a capturing subpattern (as defined
       above).  When the whole pattern matches, that portion  of  the  subject
       string that matched the subpattern is passed back to the caller via the
       ovector argument of pcre_exec(). Opening parentheses are  counted  from
       left  to right (starting from 1) to obtain the numbers of the capturing
       subpatterns.

       For example, if the string "the red king" is matched against  the  pat-
       tern

         the ((red|white) (king|queen))

       the captured substrings are "red king", "red", and "king", and are num-
       bered 1, 2, and 3, respectively.

       The fact that plain parentheses fulfil  two  functions  is  not  always
       helpful.   There are often times when a grouping subpattern is required
       without a capturing requirement. If an opening parenthesis is  followed
       by  a question mark and a colon, the subpattern does not do any captur-
       ing, and is not counted when computing the  number  of  any  subsequent
       capturing  subpatterns. For example, if the string "the white queen" is
       matched against the pattern

         the ((?:red|white) (king|queen))

       the captured substrings are "white queen" and "queen", and are numbered
       1  and 2. The maximum number of capturing subpatterns is 65535, and the
       maximum depth of nesting of all subpatterns, both  capturing  and  non-
       capturing, is 200.

       As  a  convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern,  the  option  letters  may  appear
       between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings. Because alternative branches are
       tried from left to right, and options are not reset until  the  end  of
       the  subpattern is reached, an option setting in one branch does affect
       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
       "Saturday".


NAMED SUBPATTERNS

       Identifying  capturing  parentheses  by number is simple, but it can be
       very hard to keep track of the numbers in complicated  regular  expres-
       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
       change. To help with the difficulty, PCRE supports the naming  of  sub-
       patterns,  something  that  Perl  does  not  provide. The Python syntax
       (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
       underscores, and must be unique within a pattern.

       Named  capturing  parentheses  are  still  allocated numbers as well as
       names. The PCRE API provides function calls for extracting the name-to-
       number  translation  table from a compiled pattern. For further details
       see the pcreapi documentation.


REPETITION

       Repetition is specified by quantifiers, which can  follow  any  of  the
       following items:

         a literal data character
         the . metacharacter
         the \C escape sequence
         escapes such as \d that match single characters
         a character class
         a back reference (see next section)
         a parenthesized subpattern (unless it is an assertion)

       The  general repetition quantifier specifies a minimum and maximum num-
       ber of permitted matches, by giving the two numbers in  curly  brackets
       (braces),  separated  by  a comma. The numbers must be less than 65536,
       and the first must be less than or equal to the second. For example:

         z{2,4}

       matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
       special  character.  If  the second number is omitted, but the comma is
       present, there is no upper limit; if the second number  and  the  comma
       are  both omitted, the quantifier specifies an exact number of required
       matches. Thus

         [aeiou]{3,}

       matches at least 3 successive vowels, but may match many more, while

         \d{8}

       matches exactly 8 digits. An opening curly bracket that  appears  in  a
       position  where a quantifier is not allowed, or one that does not match
       the syntax of a quantifier, is taken as a literal character. For  exam-
       ple, {,6} is not a quantifier, but a literal string of four characters.

       In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
       individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
       acters, each of which is represented by a two-byte sequence.

       The quantifier {0} is permitted, causing the expression to behave as if
       the previous item and the quantifier were not present.

       For  convenience  (and  historical compatibility) the three most common
       quantifiers have single-character abbreviations:

         *    is equivalent to {0,}
         +    is equivalent to {1,}
         ?    is equivalent to {0,1}

       It is possible to construct infinite loops by  following  a  subpattern
       that can match no characters with a quantifier that has no upper limit,
       for example:

         (a?)*

       Earlier versions of Perl and PCRE used to give an error at compile time
       for  such  patterns. However, because there are cases where this can be
       useful, such patterns are now accepted, but if any  repetition  of  the
       subpattern  does in fact match no characters, the loop is forcibly bro-
       ken.

       By default, the quantifiers are "greedy", that is, they match  as  much
       as  possible  (up  to  the  maximum number of permitted times), without
       causing the rest of the pattern to fail. The classic example  of  where
       this gives problems is in trying to match comments in C programs. These
       appear between the sequences /* and */ and within the  sequence,  indi-
       vidual * and / characters may appear. An attempt to match C comments by
       applying the pattern

         /\*.*\*/

       to the string

         /* first command */  not comment  /* second comment */

       fails, because it matches the entire string owing to the greediness  of
       the .*  item.

       However,  if  a quantifier is followed by a question mark, it ceases to
       be greedy, and instead matches the minimum number of times possible, so
       the pattern

         /\*.*?\*/

       does  the  right  thing with the C comments. The meaning of the various
       quantifiers is not otherwise changed,  just  the  preferred  number  of
       matches.   Do  not  confuse this use of question mark with its use as a
       quantifier in its own right. Because it has two uses, it can  sometimes
       appear doubled, as in

         \d??\d

       which matches one digit by preference, but can match two if that is the
       only way the rest of the pattern matches.

       If the PCRE_UNGREEDY option is set (an option which is not available in
       Perl),  the  quantifiers are not greedy by default, but individual ones
       can be made greedy by following them with a  question  mark.  In  other
       words, it inverts the default behaviour.

       When  a  parenthesized  subpattern  is quantified with a minimum repeat
       count that is greater than 1 or with a limited maximum, more  store  is
       required  for  the  compiled  pattern, in proportion to the size of the
       minimum or maximum.

       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
       alent  to Perl's /s) is set, thus allowing the . to match newlines, the
       pattern is implicitly anchored, because whatever follows will be  tried
       against  every character position in the subject string, so there is no
       point in retrying the overall match at any position  after  the  first.
       PCRE normally treats such a pattern as though it were preceded by \A.

       In  cases  where  it  is known that the subject string contains no new-
       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
       mization, or alternatively using ^ to indicate anchoring explicitly.

       However,  there is one situation where the optimization cannot be used.
       When .*  is inside capturing parentheses that  are  the  subject  of  a
       backreference  elsewhere in the pattern, a match at the start may fail,
       and a later one succeed. Consider, for example:

         (.*)abc\1

       If the subject is "xyz123abc123" the match point is the fourth  charac-
       ter. For this reason, such a pattern is not implicitly anchored.

       When a capturing subpattern is repeated, the value captured is the sub-
       string that matched the final iteration. For example, after

         (tweedle[dume]{3}\s*)+

       has matched "tweedledum tweedledee" the value of the captured substring
       is  "tweedledee".  However,  if there are nested capturing subpatterns,
       the corresponding captured values may have been set in previous  itera-
       tions. For example, after

         /(a|(b))+/

       matches "aba" the value of the second captured substring is "b".


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

       With both maximizing and minimizing repetition, failure of what follows
       normally causes the repeated item to be re-evaluated to see if  a  dif-
       ferent number of repeats allows the rest of the pattern to match. Some-
       times it is useful to prevent this, either to change the nature of  the
       match,  or  to  cause it fail earlier than it otherwise might, when the
       author of the pattern knows there is no point in carrying on.

       Consider, for example, the pattern \d+foo when applied to  the  subject
       line

         123456bar

       After matching all 6 digits and then failing to match "foo", the normal
       action of the matcher is to try again with only 5 digits  matching  the
       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
       the  means for specifying that once a subpattern has matched, it is not
       to be re-evaluated in this way.

       If we use atomic grouping for the previous example, the  matcher  would
       give up immediately on failing to match "foo" the first time. The nota-
       tion is a kind of special parenthesis, starting with  (?>  as  in  this
       example:

         (?>\d+)foo

       This  kind  of  parenthesis "locks up" the  part of the pattern it con-
       tains once it has matched, and a failure further into  the  pattern  is
       prevented  from  backtracking into it. Backtracking past it to previous
       items, however, works as normal.

       An alternative description is that a subpattern of  this  type  matches
       the  string  of  characters  that an identical standalone pattern would
       match, if anchored at the current point in the subject string.

       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
       such as the above example can be thought of as a maximizing repeat that
       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
       pared  to  adjust  the number of digits they match in order to make the
       rest of the pattern match, (?>\d+) can only match an entire sequence of
       digits.

       Atomic  groups in general can of course contain arbitrarily complicated
       subpatterns, and can be nested. However, when  the  subpattern  for  an
       atomic group is just a single repeated item, as in the example above, a
       simpler notation, called a "possessive quantifier" can  be  used.  This
       consists  of  an  additional  + character following a quantifier. Using
       this notation, the previous example can be rewritten as

         \d++bar

       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
       simpler forms of atomic group. However, there is no difference  in  the
       meaning  or  processing  of  a possessive quantifier and the equivalent
       atomic group.

       The possessive quantifier syntax is an extension to the Perl syntax. It
       originates in Sun's Java package.

       When  a  pattern  contains an unlimited repeat inside a subpattern that
       can itself be repeated an unlimited number of  times,  the  use  of  an
       atomic  group  is  the  only way to avoid some failing matches taking a
       very long time indeed. The pattern

         (\D+|<\d+>)*[!?]

       matches an unlimited number of substrings that either consist  of  non-
       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
       matches, it runs quickly. However, if it is applied to

         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

       it takes a long time before reporting  failure.  This  is  because  the
       string  can  be  divided  between  the two repeats in a large number of
       ways, and all have to be tried. (The example used [!?]  rather  than  a
       single  character  at the end, because both PCRE and Perl have an opti-
       mization that allows for fast failure when a single character is  used.
       They  remember  the last single character that is required for a match,
       and fail early if it is not present in the string.)  If the pattern  is
       changed to

         ((?>\D+)|<\d+>)*[!?]

       sequences  of non-digits cannot be broken, and failure happens quickly.


BACK REFERENCES

       Outside a character class, a backslash followed by a digit greater than
       0 (and possibly further digits) is a back reference to a capturing sub-
       pattern earlier (that is, to its left) in the pattern,  provided  there
       have been that many previous capturing left parentheses.

       However, if the decimal number following the backslash is less than 10,
       it is always taken as a back reference, and causes  an  error  only  if
       there  are  not that many capturing left parentheses in the entire pat-
       tern. In other words, the parentheses that are referenced need  not  be
       to  the left of the reference for numbers less than 10. See the section
       entitled "Backslash" above for further details of the handling of  dig-
       its following a backslash.

       A  back  reference matches whatever actually matched the capturing sub-
       pattern in the current subject string, rather  than  anything  matching
       the subpattern itself (see "Subpatterns as subroutines" below for a way
       of doing that). So the pattern

         (sens|respons)e and \1ibility

       matches "sense and sensibility" and "response and responsibility",  but
       not  "sense and responsibility". If caseful matching is in force at the
       time of the back reference, the case of letters is relevant. For  exam-
       ple,

         ((?i)rah)\s+\1

       matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
       original capturing subpattern is matched caselessly.

       Back references to named subpatterns use the Python  syntax  (?P=name).
       We could rewrite the above example as follows:

         (?<p1>(?i)rah)\s+(?P=p1)

       There  may be more than one back reference to the same subpattern. If a
       subpattern has not actually been used in a particular match,  any  back
       references to it always fail. For example, the pattern

         (a|(bc))\2

       always  fails if it starts to match "a" rather than "bc". Because there
       may be many capturing parentheses in a pattern,  all  digits  following
       the  backslash  are taken as part of a potential back reference number.
       If the pattern continues with a digit character, some delimiter must be
       used  to  terminate  the back reference. If the PCRE_EXTENDED option is
       set, this can be whitespace.  Otherwise an empty comment can be used.

       A back reference that occurs inside the parentheses to which it  refers
       fails  when  the subpattern is first used, so, for example, (a\1) never
       matches.  However, such references can be useful inside  repeated  sub-
       patterns. For example, the pattern

         (a|b\1)+

       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
       ation of the subpattern,  the  back  reference  matches  the  character
       string  corresponding  to  the previous iteration. In order for this to
       work, the pattern must be such that the first iteration does  not  need
       to  match the back reference. This can be done using alternation, as in
       the example above, or by a quantifier with a minimum of zero.


ASSERTIONS

       An assertion is a test on the characters  following  or  preceding  the
       current  matching  point that does not actually consume any characters.
       The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
       described above.  More complicated assertions are coded as subpatterns.
       There are two kinds: those that look ahead of the current  position  in
       the subject string, and those that look behind it.

       An  assertion  subpattern  is matched in the normal way, except that it
       does not cause the current matching position to be  changed.  Lookahead
       assertions  start with (?= for positive assertions and (?! for negative
       assertions. For example,

         \w+(?=;)

       matches a word followed by a semicolon, but does not include the  semi-
       colon in the match, and

         foo(?!bar)

       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
       that the apparently similar pattern

         (?!foo)bar

       does not find an occurrence of "bar"  that  is  preceded  by  something
       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
       the assertion (?!foo) is always true when the next three characters are
       "bar". A lookbehind assertion is needed to achieve this effect.

       If you want to force a matching failure at some point in a pattern, the
       most convenient way to do it is  with  (?!)  because  an  empty  string
       always  matches, so an assertion that requires there not to be an empty
       string must always fail.

       Lookbehind assertions start with (?<= for positive assertions and  (?<!
       for negative assertions. For example,

         (?<!foo)bar

       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
       contents of a lookbehind assertion are restricted  such  that  all  the
       strings it matches must have a fixed length. However, if there are sev-
       eral alternatives, they do not all have to have the same fixed  length.
       Thus

         (?<=bullock|donkey)

       is permitted, but

         (?<!dogs?|cats?)

       causes  an  error at compile time. Branches that match different length
       strings are permitted only at the top level of a lookbehind  assertion.
       This  is  an  extension  compared  with  Perl (at least for 5.8), which
       requires all branches to match the same length of string. An  assertion
       such as

         (?<=ab(c|de))

       is  not  permitted,  because  its single top-level branch can match two
       different lengths, but it is acceptable if rewritten to  use  two  top-
       level branches:

         (?<=abc|abde)

       The  implementation  of lookbehind assertions is, for each alternative,
       to temporarily move the current position back by the  fixed  width  and
       then try to match. If there are insufficient characters before the cur-
       rent position, the match is deemed to fail.

       PCRE does not allow the \C escape (which matches a single byte in UTF-8
       mode)  to appear in lookbehind assertions, because it makes it impossi-
       ble to calculate the length of the lookbehind.

       Atomic groups can be used in conjunction with lookbehind assertions  to
       specify efficient matching at the end of the subject string. Consider a
       simple pattern such as

         abcd$

       when applied to a long string that does  not  match.  Because  matching
       proceeds from left to right, PCRE will look for each "a" in the subject
       and then see if what follows matches the rest of the  pattern.  If  the
       pattern is specified as

         ^.*abcd$

       the  initial .* matches the entire string at first, but when this fails
       (because there is no following "a"), it backtracks to match all but the
       last  character,  then all but the last two characters, and so on. Once
       again the search for "a" covers the entire string, from right to  left,
       so we are no better off. However, if the pattern is written as

         ^(?>.*)(?<=abcd)

       or, equivalently,

         ^.*+(?<=abcd)

       there  can  be  no  backtracking for the .* item; it can match only the
       entire string. The subsequent lookbehind assertion does a  single  test
       on  the last four characters. If it fails, the match fails immediately.
       For long strings, this approach makes a significant difference  to  the
       processing time.

       Several assertions (of any sort) may occur in succession. For example,

         (?<=\d{3})(?<!999)foo

       matches  "foo" preceded by three digits that are not "999". Notice that
       each of the assertions is applied independently at the  same  point  in
       the  subject  string.  First  there  is a check that the previous three
       characters are all digits, and then there is  a  check  that  the  same
       three characters are not "999".  This pattern does not match "foo" pre-
       ceded by six characters, the first of which are  digits  and  the  last
       three  of  which  are not "999". For example, it doesn't match "123abc-
       foo". A pattern to do that is

         (?<=\d{3}...)(?<!999)foo

       This time the first assertion looks at the  preceding  six  characters,
       checking that the first three are digits, and then the second assertion
       checks that the preceding three characters are not "999".

       Assertions can be nested in any combination. For example,

         (?<=(?<!foo)bar)baz

       matches an occurrence of "baz" that is preceded by "bar" which in  turn
       is not preceded by "foo", while

         (?<=\d{3}(?!999)...)foo

       is another pattern which matches "foo" preceded by three digits and any
       three characters that are not "999".

       Assertion subpatterns are not capturing subpatterns,  and  may  not  be
       repeated,  because  it  makes no sense to assert the same thing several
       times. If any kind of assertion contains capturing  subpatterns  within
       it,  these are counted for the purposes of numbering the capturing sub-
       patterns in the whole pattern.  However, substring capturing is carried
       out  only  for  positive assertions, because it does not make sense for
       negative assertions.


CONDITIONAL SUBPATTERNS

       It is possible to cause the matching process to obey a subpattern  con-
       ditionally  or to choose between two alternative subpatterns, depending
       on the  result  of  an  assertion,  or  whether  a  previous  capturing
       subpattern  matched  or not. The two possible forms of conditional sub-
       pattern are

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

       If the condition is satisfied, the yes-pattern is used;  otherwise  the
       no-pattern  (if  present)  is used. If there are more than two alterna-
       tives in the subpattern, a compile-time error occurs.

       There are three kinds of condition. If the text between the parentheses
       consists  of  a  sequence  of digits, the condition is satisfied if the
       capturing subpattern of that number has previously matched. The  number
       must  be  greater than zero. Consider the following pattern, which con-
       tains non-significant white space to make it more readable (assume  the
       PCRE_EXTENDED  option)  and  to  divide it into three parts for ease of
       discussion:

         ( \( )?    [^()]+    (?(1) \) )

       The first part matches an optional opening  parenthesis,  and  if  that
       character is present, sets it as the first captured substring. The sec-
       ond part matches one or more characters that are not  parentheses.  The
       third part is a conditional subpattern that tests whether the first set
       of parentheses matched or not. If they did, that is, if subject started
       with an opening parenthesis, the condition is true, and so the yes-pat-
       tern is executed and a  closing  parenthesis  is  required.  Otherwise,
       since  no-pattern  is  not  present, the subpattern matches nothing. In
       other words,  this  pattern  matches  a  sequence  of  non-parentheses,
       optionally enclosed in parentheses.

       If the condition is the string (R), it is satisfied if a recursive call
       to the pattern or subpattern has been made. At "top level", the  condi-
       tion  is  false.   This  is  a  PCRE  extension. Recursive patterns are
       described in the next section.

       If the condition is not a sequence of digits or  (R),  it  must  be  an
       assertion.   This may be a positive or negative lookahead or lookbehind
       assertion. Consider  this  pattern,  again  containing  non-significant
       white space, and with the two alternatives on the second line:

         (?(?=[^a-z]*[a-z])
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

       The  condition  is  a  positive  lookahead  assertion  that  matches an
       optional sequence of non-letters followed by a letter. In other  words,
       it  tests  for the presence of at least one letter in the subject. If a
       letter is found, the subject is matched against the first  alternative;
       otherwise  it  is  matched  against  the  second.  This pattern matches
       strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
       letters and dd are digits.


COMMENTS

       The sequence (?# marks the start of a comment which continues up to the
       next closing parenthesis. Nested parentheses  are  not  permitted.  The
       characters  that make up a comment play no part in the pattern matching
       at all.

       If the PCRE_EXTENDED option is set, an unescaped # character outside  a
       character class introduces a comment that continues up to the next new-
       line character in the pattern.


RECURSIVE PATTERNS

       Consider the problem of matching a string in parentheses, allowing  for
       unlimited  nested  parentheses.  Without the use of recursion, the best
       that can be done is to use a pattern that  matches  up  to  some  fixed
       depth  of  nesting.  It  is not possible to handle an arbitrary nesting
       depth. Perl has provided an experimental facility that  allows  regular
       expressions to recurse (amongst other things). It does this by interpo-
       lating Perl code in the expression at run time, and the code can  refer
       to the expression itself. A Perl pattern to solve the parentheses prob-
       lem can be created like this:

         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

       The (?p{...}) item interpolates Perl code at run time, and in this case
       refers  recursively to the pattern in which it appears. Obviously, PCRE
       cannot support the interpolation of Perl  code.  Instead,  it  supports
       some  special  syntax for recursion of the entire pattern, and also for
       individual subpattern recursion.

       The special item that consists of (? followed by a number greater  than
       zero and a closing parenthesis is a recursive call of the subpattern of
       the given number, provided that it occurs inside that  subpattern.  (If
       not,  it  is  a  "subroutine" call, which is described in the next sec-
       tion.) The special item (?R) is a recursive call of the entire  regular
       expression.

       For  example,  this  PCRE pattern solves the nested parentheses problem
       (assume the  PCRE_EXTENDED  option  is  set  so  that  white  space  is
       ignored):

         \( ( (?>[^()]+) | (?R) )* \)

       First  it matches an opening parenthesis. Then it matches any number of
       substrings which can either be a  sequence  of  non-parentheses,  or  a
       recursive  match  of  the pattern itself (that is a correctly parenthe-
       sized substring).  Finally there is a closing parenthesis.

       If this were part of a larger pattern, you would not  want  to  recurse
       the entire pattern, so instead you could use this:

         ( \( ( (?>[^()]+) | (?1) )* \) )

       We  have  put the pattern into parentheses, and caused the recursion to
       refer to them instead of the whole pattern. In a larger pattern,  keep-
       ing  track  of parenthesis numbers can be tricky. It may be more conve-
       nient to use named parentheses instead. For this, PCRE uses  (?P>name),
       which  is  an  extension  to the Python syntax that PCRE uses for named
       parentheses (Perl does not provide named parentheses). We could rewrite
       the above example as follows:

         (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

       This  particular example pattern contains nested unlimited repeats, and
       so the use of atomic grouping for matching strings  of  non-parentheses
       is  important  when  applying the pattern to strings that do not match.
       For example, when this pattern is applied to

         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

       it yields "no match" quickly. However, if atomic grouping is not  used,
       the  match  runs  for a very long time indeed because there are so many
       different ways the + and * repeats can carve up the  subject,  and  all
       have to be tested before failure can be reported.

       At the end of a match, the values set for any capturing subpatterns are
       those from the outermost level of the recursion at which the subpattern
       value  is  set.   If  you want to obtain intermediate values, a callout
       function can be used (see below and the pcrecallout documentation).  If
       the pattern above is matched against

         (ab(cd)ef)

       the  value  for  the  capturing  parentheses is "ef", which is the last
       value taken on at the top level. If additional parentheses  are  added,
       giving

         \( ( ( (?>[^()]+) | (?R) )* ) \)
            ^                        ^
            ^                        ^

       the  string  they  capture is "ab(cd)ef", the contents of the top level
       parentheses. If there are more than 15 capturing parentheses in a  pat-
       tern, PCRE has to obtain extra memory to store data during a recursion,
       which it does by using pcre_malloc, freeing  it  via  pcre_free  after-
       wards.  If  no  memory  can  be  obtained,  the  match  fails  with the
       PCRE_ERROR_NOMEMORY error.

       Do not confuse the (?R) item with the condition (R),  which  tests  for
       recursion.   Consider  this pattern, which matches text in angle brack-
       ets, allowing for arbitrary nesting. Only digits are allowed in  nested
       brackets  (that is, when recursing), whereas any characters are permit-
       ted at the outer level.

         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >

       In this pattern, (?(R) is the start of a conditional  subpattern,  with
       two  different  alternatives for the recursive and non-recursive cases.
       The (?R) item is the actual recursive call.


SUBPATTERNS AS SUBROUTINES

       If the syntax for a recursive subpattern reference (either by number or
       by  name)  is used outside the parentheses to which it refers, it oper-
       ates like a subroutine in a programming language.  An  earlier  example
       pointed out that the pattern

         (sens|respons)e and \1ibility

       matches  "sense and sensibility" and "response and responsibility", but
       not "sense and responsibility". If instead the pattern

         (sens|respons)e and (?1)ibility

       is used, it does match "sense and responsibility" as well as the  other
       two  strings.  Such  references must, however, follow the subpattern to
       which they refer.


CALLOUTS

       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
       Perl  code to be obeyed in the middle of matching a regular expression.
       This makes it possible, amongst other things, to extract different sub-
       strings that match the same pair of parentheses when there is a repeti-
       tion.

       PCRE provides a similar feature, but of course it cannot obey arbitrary
       Perl code. The feature is called "callout". The caller of PCRE provides
       an external function by putting its entry point in the global  variable
       pcre_callout.   By default, this variable contains NULL, which disables
       all calling out.

       Within a regular expression, (?C) indicates the  points  at  which  the
       external  function  is  to be called. If you want to identify different
       callout points, you can put a number less than 256 after the letter  C.
       The  default  value is zero.  For example, this pattern has two callout
       points:

         (?C1)abc(?C2)def

       During matching, when PCRE reaches a callout point (and pcre_callout is
       set),  the  external function is called. It is provided with the number
       of the callout, and, optionally, one item of data  originally  supplied
       by  the  caller of pcre_exec(). The callout function may cause matching
       to backtrack, or to fail altogether.  A  complete  description  of  the
       interface  to the callout function is given in the pcrecallout documen-
       tation.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

       Certain  items  that may appear in regular expression patterns are more
       efficient than others. It is more efficient to use  a  character  class
       like  [aeiou]  than  a set of alternatives such as (a|e|i|o|u). In gen-
       eral, the simplest construction that provides the required behaviour is
       usually  the  most  efficient.  Jeffrey Friedl's book contains a lot of
       discussion about optimizing regular expressions for  efficient  perfor-
       mance.

       When  a  pattern  begins  with .* not in parentheses, or in parentheses
       that are not the subject of a backreference, and the PCRE_DOTALL option
       is  set, the pattern is implicitly anchored by PCRE, since it can match
       only at the start of a subject string. However, if PCRE_DOTALL  is  not
       set,  PCRE  cannot  make this optimization, because the . metacharacter
       does not then match a newline, and if the subject string contains  new-
       lines,  the  pattern may match from the character immediately following
       one of them instead of from the very start. For example, the pattern

         .*second

       matches the subject "first\nand second" (where \n stands for a  newline
       character),  with the match starting at the seventh character. In order
       to do this, PCRE has to retry the match starting after every newline in
       the subject.

       If  you  are using such a pattern with subject strings that do not con-
       tain newlines, the best performance is obtained by setting PCRE_DOTALL,
       or  starting  the pattern with ^.* to indicate explicit anchoring. That
       saves PCRE from having to scan along the subject looking for a  newline
       to restart at.

       Beware  of  patterns  that contain nested indefinite repeats. These can
       take a long time to run when applied to a string that does  not  match.
       Consider the pattern fragment

         (a+)*

       This  can  match "aaaa" in 33 different ways, and this number increases
       very rapidly as the string gets longer. (The * repeat can match  0,  1,
       2,  3,  or  4  times,  and  for each of those cases other than 0, the +
       repeats can match different numbers of times.) When  the  remainder  of
       the pattern is such that the entire match is going to fail, PCRE has in
       principle to try  every  possible  variation,  and  this  can  take  an
       extremely long time.

       An optimization catches some of the more simple cases such as

         (a+)*b

       where  a  literal  character  follows. Before embarking on the standard
       matching procedure, PCRE checks that there is a "b" later in  the  sub-
       ject  string, and if there is not, it fails the match immediately. How-
       ever, when there is no following literal this  optimization  cannot  be
       used. You can see the difference by comparing the behaviour of

         (a+)*\d

       with  the  pattern  above.  The former gives a failure almost instantly
       when applied to a whole line of  "a"  characters,  whereas  the  latter
       takes an appreciable time with strings longer than about 20 characters.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions.

SYNOPSIS OF POSIX API
       #include <pcreposix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);


DESCRIPTION

       This  set  of  functions provides a POSIX-style API to the PCRE regular
       expression package. See the pcreapi documentation for a description  of
       the native API, which contains additional functionality.

       The functions described here are just wrapper functions that ultimately
       call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
       pcreposix.h  header  file,  and  on  Unix systems the library itself is
       called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the
       command  for  linking an application which uses them. Because the POSIX
       functions call the native ones, it is also necessary to add -lpcre.

       I have implemented only those option bits that can be reasonably mapped
       to  PCRE  native  options.  In  addition,  the options REG_EXTENDED and
       REG_NOSUB are defined with the value zero. They  have  no  effect,  but
       since  programs that are written to the POSIX interface often use them,
       this makes it easier to slot in PCRE as a  replacement  library.  Other
       POSIX options are not even defined.

       When  PCRE  is  called  via these functions, it is only the API that is
       POSIX-like in style. The syntax and semantics of  the  regular  expres-
       sions  themselves  are  still  those of Perl, subject to the setting of
       various PCRE options, as described below. "POSIX-like in  style"  means
       that  the  API  approximates  to  the POSIX definition; it is not fully
       POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
       even less compatible.

       The  header for these functions is supplied as pcreposix.h to avoid any
       potential clash with other POSIX  libraries.  It  can,  of  course,  be
       renamed or aliased as regex.h, which is the "correct" name. It provides
       two structure types, regex_t for  compiled  internal  forms,  and  reg-
       match_t  for  returning  captured substrings. It also defines some con-
       stants whose names start  with  "REG_";  these  are  used  for  setting
       options and identifying error codes.


COMPILING A PATTERN

       The  function regcomp() is called to compile a pattern into an internal
       form. The pattern is a C string terminated by a  binary  zero,  and  is
       passed  in  the  argument  pattern. The preg argument is a pointer to a
       regex_t structure which is used as a base for storing information about
       the compiled expression.

       The argument cflags is either zero, or contains one or more of the bits
       defined by the following macros:

         REG_ICASE

       The PCRE_CASELESS option is set when the expression is passed for  com-
       pilation to the native function.

         REG_NEWLINE

       The PCRE_MULTILINE option is set when the expression is passed for com-
       pilation to the native function. Note that  this  does  not  mimic  the
       defined POSIX behaviour for REG_NEWLINE (see the following section).

       In  the  absence  of  these  flags, no options are passed to the native
       function.  This means the the  regex  is  compiled  with  PCRE  default
       semantics.  In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way.  Note  that  setting
       PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by . (they  aren't)  or
       by a negative class such as [^a] (they are).

       The  yield of regcomp() is zero on success, and non-zero otherwise. The
       preg structure is filled in on success, and one member of the structure
       is  public: re_nsub contains the number of capturing subpatterns in the
       regular expression. Various error codes are defined in the header file.


MATCHING NEWLINE CHARACTERS

       This area is not simple, because POSIX and Perl take different views of
       things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
       then  PCRE was never intended to be a POSIX engine. The following table
       lists the different possibilities for matching  newline  characters  in
       PCRE:

                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLARENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE

       This is the equivalent table for POSIX:

                                 Default   Change with

         . matches newline          yes      REG_NEWLINE
         newline matches [^a]       yes      REG_NEWLINE
         $ matches \n at end        no       REG_NEWLINE
         $ matches \n in middle     no       REG_NEWLINE
         ^ matches \n in middle     no       REG_NEWLINE

       PCRE's behaviour is the same as Perl's, except that there is no equiva-
       lent for PCRE_DOLLARENDONLY in Perl. In both PCRE and Perl, there is no
       way to stop newline from matching [^a].

       The   default  POSIX  newline  handling  can  be  obtained  by  setting
       PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way  to  make  PCRE
       behave exactly as for the REG_NEWLINE action.


MATCHING A PATTERN

       The  function  regexec() is called to match a pre-compiled pattern preg
       against a given string, which is terminated by a zero byte, subject  to
       the options in eflags. These can be:

         REG_NOTBOL

       The PCRE_NOTBOL option is set when calling the underlying PCRE matching
       function.

         REG_NOTEOL

       The PCRE_NOTEOL option is set when calling the underlying PCRE matching
       function.

       The  portion of the string that was matched, and also any captured sub-
       strings, are returned via the pmatch argument, which points to an array
       of  nmatch  structures of type regmatch_t, containing the members rm_so
       and rm_eo. These contain the offset to the first character of each sub-
       string and the offset to the first character after the end of each sub-
       string, respectively. The 0th element of  the  vector  relates  to  the
       entire  portion  of string that was matched; subsequent elements relate
       to the capturing subpatterns of the regular expression. Unused  entries
       in the array have both structure members set to -1.

       A  successful  match  yields  a  zero  return;  various error codes are
       defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
       failure code.


ERROR MESSAGES

       The regerror() function maps a non-zero errorcode from either regcomp()
       or regexec() to a printable message. If preg is  not  NULL,  the  error
       should have arisen from the use of that structure. A message terminated
       by a binary zero is placed  in  errbuf.  The  length  of  the  message,
       including  the  zero, is limited to errbuf_size. The yield of the func-
       tion is the size of buffer needed to hold the whole message.


STORAGE

       Compiling a regular expression causes memory to be allocated and  asso-
       ciated  with  the preg structure. The function regfree() frees all such
       memory, after which preg may no longer be used as  a  compiled  expres-
       sion.


AUTHOR

       Philip Hazel <ph10@cam.ac.uk>
       University Computing Service,
       Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE SAMPLE PROGRAM

       A simple, complete demonstration program, to get you started with using
       PCRE, is supplied in the file pcredemo.c in the PCRE distribution.

       The program compiles the regular expression that is its first argument,
       and  matches  it  against the subject string in its second argument. No
       PCRE options are set, and default character tables are used. If  match-
       ing  succeeds,  the  program  outputs  the  portion of the subject that
       matched, together with the contents of any captured substrings.

       If the -g option is given on the command line, the program then goes on
       to check for further matches of the same regular expression in the same
       subject string. The logic is a little bit tricky because of the  possi-
       bility  of  matching an empty string. Comments in the code explain what
       is going on.

       On a Unix system that has PCRE installed in /usr/local, you can compile
       the demonstration program using a command like this:

         gcc -o pcredemo pcredemo.c -I/usr/local/include \
             -L/usr/local/lib -lpcre

       Then you can run simple tests like this:

         ./pcredemo 'cat|dog' 'the cat sat on the mat'
         ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

       Note  that  there  is  a  much  more comprehensive test program, called
       pcretest, which supports  many  more  facilities  for  testing  regular
       expressions and the PCRE library. The pcredemo program is provided as a
       simple coding example.

       On some operating systems (e.g. Solaris) you may get an error like this
       when you try to run pcredemo:

         ld.so.1:  a.out:  fatal:  libpcre.so.0:  open failed: No such file or
       directory

       This is caused by the way shared library support works  on  those  sys-
       tems. You need to add

         -R/usr/local/lib

       to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------