69 posts in All Categories

2013.03.09 Shell Command Summary
2012.08.20 Eclipse Tips
2012.08.20 Quick MySQL Tips
2012.04.19 Primary, Secondary... and Beyond
2012.04.18 Chrome Developer Tools for Android
2012.04.03 Page navigation error when loading an external page in phonegap
2012.03.22 JavaScript error caused by the Content-Disposition value when downloading image attachments on iPhone/iPod
2012.02.16 로서 vs. 로써
2011.11.01 How whitespace changes the scope of a CSS class selector
2011.04.28 Useful Programs and Information
2010.09.28 MySQL String Functions
2010.09.17 VI File Encoding
2010.07.22 SimpleXMLIterator to array function
2009.06.30 favicon.ico in Google Chrome and Safari
2009.05.13 How to Convert Decimal to Binary
2009.05.12 Multiplication and Division Errors with Decimal Numbers
2009.03.02 preg_match Summary
2009.01.13 The Difference Between match and exec in Regular Expressions
2008.06.18 Photo Taken from the Sky in the Rain
2007.07.24 How to Use the SED Command
2007.05.09 RIA
2007.05.09 Loading a Random Image
2007.05.02 돼 vs. 되
2007.05.02 했대? 했데?
2007.04.27 Perl-compatible regular expressions
2007.04.27 pcre syntax, preg
2007.04.26 Azureus
2007.04.18 Things to check when upgrading MySQL [4.0 -> 4.1]
2007.04.16 Line Rider
2007.04.07 I am happy.................................

Shell Command Summary

Programming/Tip

Move the cursor to the start of the current line : ctrl + a
Move the cursor to the end of the current line : ctrl + e

by 뭔일이여 2013. 3. 9. 02:30

Eclipse Tips

Editor/Eclipse

Block (column) selection with the mouse (Mac) : alt(option) + command + a

by 뭔일이여 2012. 8. 20. 18:02

Primary, Secondary... and Beyond

Daily Life/Misc

1. Primary

2. Secondary

3. Tertiary

4. Quaternary

5. Quinary

6. Senary

7. Septenary

8. Octonary

9. Nonary

10. Denary

....

12. Duodenary

.......

20. Vigenary

by 뭔일이여 2012. 4. 19. 14:44

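Chrome Developer Tools for Android

Programming/Mobile

Preparation

- The Android SDK and the Chrome browser must be installed on the PC.

- The Chrome browser must also be installed on the mobile device (supported only on Ice Cream Sandwich).

Getting started

1. On the mobile device, launch Chrome, tap the menu button, go to Settings - Developer tools, and check "Enable USB Web debugging".

2. Click Start - Run to open a cmd window, go to the path where the Android SDK is installed, and enter the platform-tools folder. Then run the following command.

adb forward tcp:9222 localabstract:chrome_devtools_remote

3. On the desktop, launch Chrome and enter localhost:9222 in the address bar. The pages currently open in mobile Chrome are listed like thumbnail images; click one and a web page identical to the Chrome developer tools opens.

4. Start debugging.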
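The page list shown at localhost:9222 can also be read from a script. The following is a minimal sketch, assuming the forwarded port serves Chrome's usual DevTools listing as JSON at /json and that each entry has `title` and `url` fields. The helper names `summarizeTargets` and `listMobilePages` and the sample data are hypothetical, not part of the original post.

```javascript
// Sketch: turn a DevTools target listing into short, readable lines.
// Assumption: each entry carries `title` and `url` fields; missing
// fields are tolerated.
function summarizeTargets(targets) {
  return targets.map(function (t, i) {
    return (i + 1) + '. ' + (t.title || '(untitled)') + ' <' + (t.url || '') + '>';
  });
}

// Hypothetical helper: query the port forwarded by `adb forward`.
// Needs a runtime with a global fetch (a recent browser or Node 18+).
async function listMobilePages(port) {
  const res = await fetch('http://localhost:' + port + '/json');
  return summarizeTargets(await res.json());
}

// Sample data for illustration only (not captured from a real device).
const sampleTargets = [
  { title: 'Example', url: 'http://example.com/' },
  { url: 'about:blank' }
];
const summary = summarizeTargets(sampleTargets);
console.log(summary.join('\n'));
```

With a device attached, `listMobilePages(9222).then(console.log)` would print one line per open tab, provided the assumptions above hold.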

Source - https://developers.google.com/chrome/mobile/docs/debugging?hl=ko-KR

by 뭔일이여 2012. 4. 18. 11:56

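Page navigation error when loading an external page in phonegap

Programming/Mobile

Environment - phonegap, jquery mobile, code igniter

When phonegap is configured to load an external page (http://~) at startup and a link on that page is clicked,

a prompt window appears and navigation fails if the address ends with /

Simply removing the trailing / from the initial URL resolves the problem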

by 뭔일이여 2012. 4. 3. 14:17

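JavaScript error caused by the Content-Disposition value when downloading image attachments on iPhone/iPod

Programming/Mobile

Overview
While developing a mobile web site, attachment downloads were handled with PHP's header()
Not every device was tested, but in tests on Android and iPhone (iPod)
the iPhone family displayed the image directly and then raised an error
Investigation showed that it happens when the header is set to Content-Disposition: attachment

Symptoms
On the iPhone family, clicking an attachment does not download it; an image
is displayed directly in the browser instead
After viewing the image, going back to the web page raises a JavaScript error
The error is SECURITY_ERR: DOM Exception 18, which usually appears with cross-domain requests such as ajax calls
Once the error occurs, it keeps occurring until the browser is closed and reopened

Solution
When serving an attachment to the iPhone family and the file is an image, set the header's Content-Disposition to
inline instead of attachment; the attachment is then displayed on the iPhone without any error afterwards
This fixes this particular case, but it does not help if the SECURITY_~ error is raised for some other reason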
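The decision described above can be sketched as a small function. This is only an illustration: the user-agent test, the "image/" MIME prefix check, the inclusion of iPad, and the name `dispositionFor` are assumptions, not the original PHP code.

```javascript
// Sketch: choose the Content-Disposition value for an attachment response.
// Assumptions (not from the original post): iOS devices are detected with a
// simple user-agent test, and images by an "image/" MIME type prefix.
function dispositionFor(userAgent, mimeType, filename) {
  const isIos = /iPhone|iPod|iPad/.test(userAgent);
  const isImage = /^image\//.test(mimeType);
  const type = (isIos && isImage) ? 'inline' : 'attachment';
  // Escape double quotes and backslashes inside the quoted filename.
  const safeName = filename.replace(/(["\\])/g, '\\$1');
  return type + '; filename="' + safeName + '"';
}

const iosUa = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X)';
const androidUa = 'Mozilla/5.0 (Linux; Android 4.0.3)';
console.log(dispositionFor(iosUa, 'image/png', 'photo.png'));
console.log(dispositionFor(androidUa, 'image/png', 'photo.png'));
```

Only the combination "iOS device and image" switches to inline; every other request keeps the attachment behavior.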

by 뭔일이여 2012. 3. 22. 15:55

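로서 vs. 로써

Beautiful Hangul

현재로써=X

현재,현재로서=O

'-로서' indicates a qualification or status, while '-로써' indicates a means or instrument. Adverbs of time are usually used on their own or together with '-로서'. e.g.) 현재로써 (X) -> 현재, 현재로서 (O)

Source - Naver Knowledge iN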

by 뭔일이여 2012. 2. 16. 16:00

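How whitespace changes the scope of a CSS class selector

Programming/CSS

When a class is written after a target element in a style rule, it matters whether there is a space between the element and the class.
With a space, the rule means: apply the style to any descendant of the target element that has the class.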

<style type="text/css">

<!--

div .aa {

color: #ddd;

}

-->

</style>
<div class="aa">

no style
<span class="aa">color : #ddd</span>
</div>

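As above, when there is a space after div followed by .aa, the color is applied only to other tags inside a div

that carry the aa class. That is why the text 'no style' is displayed without the aa style.

If the rule had been written as div.aa instead, the style would apply to the 'no style' text as well.

In other words, the rule then matches the div tag that itself declares the aa class, and because color is inherited, all text beneath that tag gets the same style.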

by 뭔일이여 2011. 11. 1. 10:28

Useful Programs and Information

Programming/Tip

MySQL

SQLyog
MySQL Query Browser(MySQL GUI Tools Bundle)

PHP

by 뭔일이여 2011. 4. 28. 10:46

MySQL String Functions

Programming/MySQL

Source :: http://blog.naearu.com/2982717/trackback/

- ASCII(str) : Returns the ASCII value of the argument. If the string has more than one character, the value of the first character is returned. Returns 0 for an empty string and NULL for NULL.
- Example : select ASCII('2');

- CONCAT(X,Y,...) : Returns the string that results from concatenating the arguments. Returns NULL if any argument is NULL.
- Example : select CONCAT('My', 'S', 'QL');

- LENGTH(str) : Returns the length of the string in bytes.
- Example : select LENGTH('text');

- OCTET_LENGTH(str) : Same as LENGTH(str).

- CHARACTER_LENGTH(str) : Returns the length of the string in characters. The result equals LENGTH(str) only when every character is a single byte.

- LOCATE(substr,str) : Returns the position at which the first argument occurs in the second argument. Returns 0 if it is not found.
- Example : select LOCATE('bar', 'foobarbar');

- POSITION(substr IN str) : Same as LOCATE(substr,str).

- LOCATE(substr,str,pos) : Searches the second argument starting at the position given by the third argument and returns the position at which the first argument is found.
- Example : select LOCATE('bar', 'foobarbar',5);

- INSTR(str,substr) : Does the same as LOCATE(substr,str); the only difference is that the first and second arguments are swapped.
- Example : select INSTR('foobarbar', 'bar');

- LPAD(str,len,padstr) : Returns the first argument padded to the length given by the second argument. The missing space is filled on the left with the third argument.
- Example : select LPAD('hi',4,' ');

- RPAD(str,len,padstr) : The opposite of LPAD; the missing space is filled on the right.
- Example : select RPAD('hi',5,'?');

- LEFT(str,len) : Returns the leftmost len characters of the string.
- Example : select LEFT('foobarbar', 5);

- RIGHT(str,len) : Like LEFT(str,len), except that the characters are taken from the right.
- Example : select RIGHT('foobarbar', 4);

- SUBSTRING(str,pos,len) : Returns len characters of the first argument, starting at the position given by the second argument.
- Example : select SUBSTRING('Quadratically',5,6);

- SUBSTRING(str FROM pos FOR len) : Same as SUBSTRING(str,pos,len).

- MID(str,pos,len) : Same as SUBSTRING(str,pos,len).

- SUBSTRING(str,pos) : Returns everything in the first argument from the position given by the second argument onward.
- Example : select SUBSTRING('Quadratically',5);

- SUBSTRING(str FROM pos) : Same as SUBSTRING(str,pos).
- Example : select SUBSTRING('foobarbar' FROM 4);

- SUBSTRING_INDEX(str,delim,count) : Splits the first argument on the delimiter given as the second argument and returns everything before the count-th delimiter. For example, select SUBSTRING_INDEX('www.mysql.com', '.', 2) returns 'www.mysql'. If the third argument is negative, counting starts from the right and everything after that delimiter is returned.
- Example : select SUBSTRING_INDEX('www.mysql.com', '.', -2);

- LTRIM(str) : Returns the string with leading spaces removed.
- Example : select LTRIM(' barbar');

- RTRIM(str) : Returns the string with trailing spaces removed.
- Example : select RTRIM('barbar ');

- TRIM([[BOTH | LEADING | TRAILING] [remstr] FROM] str) : Returns the string with remstr removed from the front, the end, or both. If nothing is specified, spaces are removed from both ends.
- Example : select TRIM(' bar ');
select TRIM(LEADING 'x' FROM 'xxxbarxxx');
select TRIM(BOTH 'x' FROM 'xxxbarxxx');
select TRIM(TRAILING 'xyz' FROM 'barxxyz');

- REPLACE(str,from_str,to_str) : Replaces every occurrence of from_str in the string with to_str.
- Example : select REPLACE('www.mysql.com', 'www', 'ftp');

- REVERSE(str) : Reverses the string. For example, select REVERSE('abc') returns 'cba'.

- LCASE(str) : Converts the string to lowercase.
- Example : select LCASE('QUADRATICALLY');

- LOWER(str) : Same as LCASE(str).

- UCASE(str) : Converts the string to uppercase.
- Example : select UCASE('Hej');

- UPPER(str) : Same as UCASE(str).
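Of the functions above, SUBSTRING_INDEX is the one whose behavior is least obvious, so here is a rough JavaScript sketch of the same idea. The function name `substringIndex` is hypothetical, and the handling of a zero count or empty delimiter (returning an empty string) is an assumption about the edge cases rather than something stated in the list.

```javascript
// Sketch of SUBSTRING_INDEX(str, delim, count) semantics in JavaScript.
function substringIndex(str, delim, count) {
  // Assumed edge cases: nothing to return for a zero count or empty delimiter.
  if (count === 0 || delim === '') return '';
  const parts = str.split(delim);
  if (count > 0) {
    // Everything to the left of the count-th delimiter.
    return parts.slice(0, count).join(delim);
  }
  // Negative count: everything to the right of the count-th delimiter
  // counted from the end of the string.
  return parts.slice(count).join(delim);
}

console.log(substringIndex('www.mysql.com', '.', 2));
console.log(substringIndex('www.mysql.com', '.', -2));
```

When the count exceeds the number of delimiters, the sketch returns the whole string unchanged.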

by 뭔일이여 2010. 9. 28. 10:14

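VI File Encoding

Programming/Tip

To load or save a file in VI with a different encoding, use the following commands.

Reload the file, reading it with the given encoding.

:e ++enc=euc-kr

Then set the target encoding with the command below and save the file (for example with :w).

:set fileencoding=utf-8

On Linux, the file command can be used to check a file's encoding.

The output of the file command is based on the metadata defined in /etc/magic.

Source - http://decoder.tistory.com/426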

by 뭔일이여 2010. 9. 17. 10:25

SimpleXMLIterator to array function

Programming/PHP

function xml2array($xml_obj) {
    $sxi = new SimpleXMLIterator($xml_obj); // read from xml string
    //$sxi = new SimpleXMLIterator($xml_obj, null, true); // read from filename
    return sxiToArray($sxi);
}

function sxiToArray($sxi) {
    $a = array();
    for ($sxi->rewind(); $sxi->valid(); $sxi->next()) {
        $key = $sxi->key();
        // Recurse into child elements; leaf elements become strings.
        $value = $sxi->hasChildren() ? sxiToArray($sxi->current()) : strval($sxi->current());
        if (!array_key_exists($key, $a)) {
            // First occurrence of this element name: store the value directly.
            $a[$key] = $value;
        } else if (is_array($a[$key]) && isset($a[$key][0])) {
            // Already a list of repeated elements: append.
            $a[$key][] = $value;
        } else {
            // Second occurrence: turn the single value into a list.
            $a[$key] = array($a[$key], $value);
        }
    }
    return $a;
}

Slightly modified for ease of use

Source : http://php.net/manual/en/class.simplexmliterator.php

by 뭔일이여 2010. 7. 22. 10:37

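favicon.ico in Google Chrome and Safari

Programming/Tip

While building a site that checks the session through a cookie, the session kept getting dropped on every refresh.

Only in Chrome and Safari...

For reference, the site was built with codeigniter.

It was structured so that a visit to http://domain/groupID stored the group ID in a cookie, and that cookie was then used for the login check.

While looking for the cause, printing the cookie value showed the value favicon.ico, meaning the cookie was being overwritten.

No matter how closely the source was examined, that word appeared nowhere.

Then, remembering that the site was written with codeigniter, it occurred to me that a request to http://domain/favicon.ico might be taking place.

After changing the source to stop processing when such a request comes in, the problem went away.

It still needs to be verified, but the cause is now known.

On refresh, Chrome and Safari

request http://domain/favicon.ico to check for a favicon.

When they make that request, favicon.ico ends up being interpreted as the group ID.

This happens because codeigniter is used together with the rewrite module on the server.

In other words, it only occurs in this special setup.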
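The fix, ending the request early when the first path segment is favicon.ico, can be sketched roughly as follows. This is not the original CodeIgniter code; the function name `groupIdFromPath` and the list `IGNORED_SEGMENTS` are hypothetical.

```javascript
// Sketch: derive the group ID from a request path, skipping requests that the
// browser issues on its own, such as /favicon.ico.
const IGNORED_SEGMENTS = ['favicon.ico'];

function groupIdFromPath(path) {
  // Drop the query string, then take the first non-empty path segment.
  const first = path.split('?')[0].split('/').filter(Boolean)[0];
  if (first === undefined || IGNORED_SEGMENTS.includes(first)) {
    return null; // the caller should leave the cookie untouched
  }
  return first;
}

console.log(groupIdFromPath('/mygroup'));      // a normal group URL
console.log(groupIdFromPath('/favicon.ico'));  // browser favicon request
```

A caller would write the cookie only when the function returns a non-null value.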

by 뭔일이여 2009. 6. 30. 17:53

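How to Convert Decimal to Binary

Programming/Misc

Converting the integer part
10 divided by 2 is 5, remainder 0
5 divided by 2 is 2, remainder 1
2 divided by 2 is 1, remainder 0
1 divided by 2 is 0, remainder 1

List the remainders in order from the last one upward -> 1010

Converting the fractional part
0.35 times 2 is 0.7, take the integer part 0 away from 0.7
0.7 times 2 is 1.4, take the integer part 1 away from 1.4
0.4 times 2 is 0.8, take the integer part 0 away from 0.8
0.8 times 2 is 1.6, take the integer part 1 away from 1.6
0.6 times 2 is 1.2, take the integer part 1 away from 1.2
0.2 times 2 is 0.4, take the integer part 0 away from 0.4
0.4 times 2 is 0.8, take the integer part 0 away from 0.8
0.8 times 2 is 1.6, take the integer part 1 away from 1.6
0.6 times 2 is 1.2, take the integer part 1 away from 1.2
0.2 times 2 is 0.4, take the integer part 0 away from 0.4
0.4 times 2 is 0.8, take the integer part 0 away from 0.8
0.8 times 2 is 1.6, take the integer part 1 away from 1.6
................................
From here it repeats forever (if multiplying by 2 ever gives exactly 1, output that 1 and stop there)
List the removed integer parts in order from the top -> 0.010110011001100.....

Many fractional values cannot be converted to binary exactly

Converting decimal 11.75 to binary
Integer part
11 ÷ 2 = 5 remainder 1
5 ÷ 2 = 2 remainder 1
2 ÷ 2 = 1 remainder 0
1 ÷ 2 = 0 remainder 1
Result : 1011 (binary) = 11 (decimal)

Fractional part
0.75 × 2 = 1.50 => 1
0.50 × 2 = 1.00 => 1
Result : 0.11 (binary) => 0.75 (decimal)

11.75 (decimal) = 1011.11 (binary)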
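The two procedures above (repeated division for the integer part, repeated multiplication for the fractional part) can be written as a short sketch. The function names and the `maxBits` cap are assumptions added for illustration; the cap is needed because many fractions never terminate. Non-negative input is assumed.

```javascript
// Integer part: divide by 2 repeatedly and collect the remainders,
// reading them from the last one back to the first.
function intToBinary(n) {
  if (n === 0) return '0';
  let bits = '';
  while (n > 0) {
    bits = (n % 2) + bits;   // prepend, so the last remainder ends up first
    n = Math.floor(n / 2);
  }
  return bits;
}

// Fractional part: multiply by 2 repeatedly and collect the integer parts.
// Stops when nothing is left or after maxBits digits.
function fracToBinary(f, maxBits) {
  let bits = '';
  while (f > 0 && bits.length < maxBits) {
    f *= 2;
    const bit = Math.floor(f);
    bits += bit;
    f -= bit;
  }
  return bits;
}

function toBinary(x, maxBits) {
  const whole = Math.floor(x);
  const frac = fracToBinary(x - whole, maxBits);
  return intToBinary(whole) + (frac ? '.' + frac : '');
}

console.log(toBinary(11.75, 16));
console.log(toBinary(10, 16));
console.log('0.' + fracToBinary(0.35, 15) + '...');
```

For 0.35 the loop only stops because of the cap, which matches the endless repetition shown in the worked example.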

by 뭔일이여 2009. 5. 13. 11:02

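Multiplication and Division Errors with Decimal Numbers

Programming/Misc

<?php
$string = floatval(3300);
$float = 3000 * 1.1;
if($string === $float) {
    echo 'String '.$string.' is equal to Float '.$float;
} else {
    echo 'String '.$string.' is not equal to Float '.$float;
}
echo '<br />';
var_dump($string);
echo '<br />';
var_dump($float);
exit();
?>
////////////////Result////////////////////
String 3300 is not equal to Float 3300
float(3300)
float(3300)
///////////////////////////////////////////

<?php
$string = floatval(5500);
$float = 5000 * 1.1;
if($string === $float) {
    echo 'String '.$string.' is equal to Float '.$float;
} else {
    echo 'String '.$string.' is not equal to Float '.$float;
}
echo '<br />';
var_dump($string);
echo '<br />';
var_dump($float);
exit();
?>
////////////////Result////////////////////
String 5500 is equal to Float 5500
float(5500)
float(5500)
///////////////////////////////////////////

As shown above, multiplying 1.1 by 3000 prints the same value, yet the comparison reports the two as different.
After looking into it in several ways, the explanation is that computers do their arithmetic in binary.
The fractional part of a decimal number often cannot be converted to binary exactly.
Converting 1.1 to binary gives 1.000110011001100..., with 0011 repeating forever.
So the multiplication produces a value that differs very slightly, even though the difference is not visible when printed.
But why is the result different for 3000 and the same for 5000? Damn...
(The product is rounded to the nearest representable float: for 5000 * 1.1 that happens to be exactly 5500, while for 3000 * 1.1 it is a value slightly above 3300.)

As a workaround, multiply by the power of 10 that matches the number of decimal places, do the calculation, and then divide by that same power of 10.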
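The same effect and the workaround can be reproduced in JavaScript, which uses the same double-precision floats as PHP. This is a minimal sketch; the helper names `mulDecimal` and `nearlyEqual` are hypothetical, and the tolerance-based comparison is an additional common technique that the original post does not mention.

```javascript
// The comparison from the post, in JavaScript.
const product3000 = 3000 * 1.1;
const product5000 = 5000 * 1.1;
console.log(product3000 === 3300);
console.log(product5000 === 5500);

// Workaround from the post: scale the decimal factor to an integer,
// multiply, then divide by the same power of 10.
function mulDecimal(a, b, decimals) {
  const scale = Math.pow(10, decimals);
  return (a * Math.round(b * scale)) / scale;
}

// Alternative (not in the original post): compare with a relative tolerance.
function nearlyEqual(x, y, eps) {
  return Math.abs(x - y) <= eps * Math.max(1, Math.abs(x), Math.abs(y));
}

console.log(mulDecimal(3000, 1.1, 1) === 3300);
console.log(nearlyEqual(product3000, 3300, 1e-9));
```

The scaling approach works here because 3000 * 11 and the division by 10 both give exactly representable results; it is not a general guarantee for every pair of numbers.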

by 뭔일이여 2009. 5. 12. 17:28

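preg_match Summary

My/PHP

Meaning of each modifier in the pattern argument of preg_match

i : case-insensitive
u : utf-8 (details still being checked)

Splitting a utf-8 string into its individual characters

Example)
<?php
$str = '한글 english どをウィ中國＃＆＊§※☆★';
preg_match_all('/./u', $str, $match);
// The pattern has no capture group, so the matches are in $match[0].
echo implode(',', $match[0]);
?>

Result)
한,글, ,e,n,g,l,i,s,h, ,ど,を,ウ,ィ, ,中,國, ,＃,＆,＊,§,※,☆,★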
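For comparison, the same per-character split can be sketched in JavaScript, where the `u` flag plays a similar role by making `.` match whole code points. The function name `splitChars` is hypothetical, and the `s` flag is added here so that newlines are kept as well.

```javascript
// Sketch: split a string into individual characters (code points).
function splitChars(str) {
  // u: treat the string as code points; s: let "." match newlines too.
  return str.match(/./gsu) || [];
}

console.log(splitChars('한글 ab中').join(','));
```

Without the `u` flag, characters outside the Basic Multilingual Plane (for example most emoji) would be split into two halves.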

Pattern Modifiers

The current possible PCRE modifiers are listed below. The names in parentheses refer to internal PCRE names for these modifiers. Spaces and newlines are ignored in modifiers, other characters cause error.

i (PCRE_CASELESS)

If this modifier is set, letters in the pattern match both upper and lower case letters.

m (PCRE_MULTILINE)

By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl. When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

s (PCRE_DOTALL)

If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

x (PCRE_EXTENDED)

If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.

e (PREG_REPLACE_EVAL)

If this modifier is set, preg_replace() does normal substitution of backreferences in the replacement string, evaluates it as PHP code, and uses the result for replacing the search string. Single quotes, double quotes, backslashes and NULL chars will be escaped by backslashes in substituted backreferences.

Only preg_replace() uses this modifier; it is ignored by other PCRE functions.

A (PCRE_ANCHORED)

If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the string which is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.

D (PCRE_DOLLAR_ENDONLY)

If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set. There is no equivalent to this modifier in Perl.

S

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

U (PCRE_UNGREEDY)

This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by "?". It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).

X (PCRE_EXTRA)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features controlled by this modifier.

J (PCRE_INFO_JCHANGED)

The (?J) internal option setting changes the local PCRE_DUPNAMES option. Allow duplicate names for subpatterns.

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

preg_match

(PHP 4, PHP 5)

preg_match -- Performs a regular expression match.

Description

int preg_match ( string $pattern, string $subject [, array $matches [, int $flags [, int $offset]]] )

Searches subject for the regular expression given in pattern.

If matches is given, it is filled with the results of the search. $matches[0] contains the text that matched the full pattern, and $matches[1] onward contain the parenthesized subpatterns.

The following flag can be used for flags:

PREG_OFFSET_CAPTURE: If this flag is passed, the string offset is returned along with every match. Note that this changes the return value into an array whose elements are arrays holding the matched string at index 0 and its offset at index 1. This flag is available since PHP 4.3.0.

The flags parameter is available since PHP 4.3.0.

Normally the search starts at the beginning of the subject string. The optional offset parameter specifies a different place to start the search. This is similar to passing substr($subject, $offset) to preg_match() as the subject string, except that assertions such as ^ are still evaluated against the original string. The offset parameter is available since PHP 4.3.3.

preg_match() returns the number of times pattern matched. This is either 0 (no match) or 1, because preg_match() stops searching after the first match. In contrast, preg_match_all() continues until the end of subject. If an error occurs, preg_match() returns FALSE.

Tip

Do not use preg_match() if you only want to check whether one string is contained in another. Using strpos() or strstr() instead is faster.

Example 1620. Finding the string "php"

<?php
// The "i" after the pattern delimiter makes the search case-insensitive.
if (preg_match("/php/i", "PHP is the web scripting language of choice.")) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}
?>

Example 1621. Finding the word "web"

<?php
/* \b in the pattern indicates a word boundary. Only the word "web" is matched,
 * not partial occurrences such as "webbing" or "cobweb". */
if (preg_match("/\bweb\b/i", "PHP is the web scripting language of choice.")) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}

if (preg_match("/\bweb\b/i", "PHP is the website scripting language of choice.")) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}
?>

Example 1622. Getting the domain name out of a URL

<?php
// Get the host name from the URL
preg_match("/^(http:\/\/)?([^\/]+)/i",
    "http://www.php.net/index.html", $matches);
$host = $matches[2];

// Get the last two segments of the host name
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
echo "The domain name is: {$matches[0]}\n";
?>

Output of this example:

The domain name is: php.net

by 뭔일이여 2009. 3. 2. 13:57

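The Difference Between match and exec in Regular Expressions

My/Javascript

How the behavior differs with and without the g option

match : searches the whole string repeatedly and returns every match as an array

The part wrapped in parentheses would normally be returned as the second value of the result, but when match is used with the g option it is not assigned. Without the g option only the first match is returned, so the group appears as the second value.

ex)

<script type="text/javascript">

<!--

    var temp = "\"first\" \"second\"\n\"third\"";

    var reg = /"([^"]+)"/igm;

    document.write(temp.match(reg));

    //Result : "first","second","third"

    var reg = /"([^"]+)"/im;

    document.write(temp.match(reg));

    //Result : "first",first    <=== the first without quotes is the value captured by the parentheses

//-->

</script>

exec : returns the match found in the string as an array

Each call returns only the first match it finds

<script type="text/javascript">

<!--

    var temp = "\"first\" \"second\"\n\"third\"";

    var reg = /"([^"]+)"/igm;

    document.write(reg.exec(temp));

    document.write(reg.exec(temp));

    //Result : "first",first"second",second

//-->

</script>

With the g option, each search moves the pointer (lastIndex) past the match so the next call continues from there; without it, only the first match is ever returned

To search the whole string, do the following

<script type="text/javascript">

<!--

    var temp = "\"first\" \"second\"\n\"third\"";

    // The g option is required here; without it the loop would never end.
    var reg = /"([^"]+)"/g;

    var res = null;

    while((res = reg.exec(temp)) != null) {

        document.write(res+' - ');

    }

    //Result : "first",first - "second",second - "third",third -

//-->

</script>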
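The three behaviors can be compared side by side in a version that runs outside a browser. This is a sketch; the helper `execAll` is a hypothetical name, and `String.prototype.matchAll` is a newer alternative that the original post does not use.

```javascript
const text = '"first" "second"\n"third"';

// match with g: every full match, capture groups are dropped.
const allMatches = text.match(/"([^"]+)"/g);

// match without g: only the first match, followed by its capture group.
const firstMatch = text.match(/"([^"]+)"/);

// exec with g in a loop: every match together with its capture group.
function execAll(re, s) {
  if (!re.global) throw new Error('the g flag is required, otherwise the loop never ends');
  const out = [];
  let m;
  re.lastIndex = 0; // always start from the beginning of the string
  while ((m = re.exec(s)) !== null) {
    out.push([m[0], m[1]]);
  }
  return out;
}
const pairs = execAll(/"([^"]+)"/g, text);

// matchAll: same information as the exec loop, as an iterator.
const groups = Array.from(text.matchAll(/"([^"]+)"/g), function (m) { return m[1]; });

console.log(allMatches, firstMatch[1], pairs, groups);
```

The guard in `execAll` reflects the point made above: without the g option, exec keeps returning the first match and the loop would not terminate.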

by 뭔일이여 2009. 1. 13. 13:16

Photo Taken from the Sky in the Rain

Daily Life/Misc

by 뭔일이여 2008. 6. 18. 13:22

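How to Use the SED Command

OS/Linux

SUBJECT: How to Use the SED Command

o sed, the stream editor
The sed (stream editor) command combines parts of the functionality of the ed and grep commands.

Like grep, sed is a filter, but it can modify what it reads; unlike ed, it cannot be used interactively.
sed reads its input one line at a time and writes it to standard output.

Each time sed reads a line, it carries out substitutions of the form used in ed.

If a matching string is found, it is replaced and the line is printed; if not, the line is
printed unchanged.

The advantage of sed over ed is that it reads, modifies, and prints lines one at a time, so it does not use
an in-memory buffer. Without a buffer, it can work on files of any size.

When a buffer is used, as in ed, files larger than the buffer cannot be processed, and the buffer is usually

about 1MB. sed is therefore mainly used to process very large files.

sed is invoked the same way as grep, except that it uses a complete substitution operator.

# sed "s/hello/goodbye/" in.file

The command above replaces the first occurrence of hello on each line of the file in.file with goodbye
and writes the line to standard output.

# echo "1234hello5678" | sed "s/hello/goodbye/"

The substitution command must be enclosed in quotes to work properly. The string can also be a regular expression.
sed supports several other operators as well. The following command deletes

every line that contains the string hello.

# sed "/hello/d" in.file

The command above means "find the lines that contain the string hello and delete them".

This sed command has the same meaning as the following command.

# grep -v hello in.file

To delete only the string hello rather than the whole line, use the following command.

# sed "s/hello//" in.file

As in ed, a range of lines can be given when only part of the file should be processed.

# sed "3,7s/hello//" in.file

The command above deletes the first hello on lines 3 through 7 of the file in.file and leaves
the rest of the file unchanged. A context can also be given as the range instead of line numbers,
as follows.

# sed "/hello/,/goodbye/s/bad/good/g" in.file

The command above works from the first line containing the word hello through the line containing the word goodbye
and replaces every occurrence of bad with good.

If another hello appears after goodbye has been seen, the substitution is applied again until the next
goodbye.

sed is more powerful than what has been shown so far.

With the -f (file) option, commands can be stored in a file and used from there instead of being typed
on the keyboard each time.

# sed -f command.file in.file

A command file is useful when several commands are frequently used in sequence.

For example, if a file contains the following commands

   # vi command.file
   s/hello/goodbye/
   s/good/bad/

and the following command is entered

# echo "1234hello5678" | sed -f command.file

the output is as follows.

   # echo "1234hello5678" | sed -f command.file
   1234badbye5678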
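Why the result is 1234badbye5678 becomes clearer when the two commands are applied one after the other to the same line. The sketch below imitates that in JavaScript under heavy simplification: only the s command, only "/" as the delimiter, an optional g flag, and JavaScript rather than sed regular-expression syntax. The function name `applySedScript` is hypothetical.

```javascript
// Sketch: apply sed-style "s/old/new/" commands to every line, in order.
function applySedScript(script, input) {
  const commands = script.split('\n')
    .map(function (line) { return line.trim(); })
    .filter(Boolean)
    .map(function (line) {
      const m = /^s\/(.*?)\/(.*?)\/(g?)$/.exec(line);
      if (!m) throw new Error('unsupported command: ' + line);
      return { pattern: new RegExp(m[1], m[3]), replacement: m[2] };
    });
  return input.split('\n').map(function (line) {
    // Each command sees the output of the previous one.
    return commands.reduce(function (acc, c) {
      return acc.replace(c.pattern, c.replacement);
    }, line);
  }).join('\n');
}

console.log(applySedScript('s/hello/goodbye/\ns/good/bad/', '1234hello5678'));
```

The first command turns hello into goodbye, and the second then finds good inside goodbye and turns it into bad.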

o sed basics
   # sed '' ljs --> same as cat ljs

o sed editing commands
Common sed commands
---------------------------------------------------------------------------------------
   a\    Appends the following line(s) after the addressed lines
   c\    Replaces the addressed lines with the following line(s)
   d     Deletes the addressed lines
   g     Applies the substitution to every match on the line, not just the first
   i\    Inserts the following line(s) before the addressed lines
   p     Prints the line, even under the -n option
   q     Quits when the specified line is reached
   r filename    Reads filename and appends its contents to the output
   s/old/new/    Replaces "old" with "new"
   =     Prints the line number
   !command    Applies command to the lines that are not selected
--------------------------------------------------------------------------------------

o Specifying lines
sed uses two ways of addressing lines. The first is to give the address as a number; the second is to give a pattern between slashes.
A single number can be used to refer to one particular line.

   # sed '3d' ljs --> deletes the third line
Two numbers separated by a comma (,) can be used to refer to a range of lines.
   # sed '2,4 s/e/#/' ljs --> the substitution is applied only to lines 2-4. (Remember that a plain substitution
   applies only to the first occurrence on a line, so only the first e on each addressed line
   is replaced by #)
   # sed -n '/kingdom/p' ljs --> prints only the lines that contain kingdom
   # sed '/kingdom/p' ljs --> prints every line, and the kingdom lines appear twice
   # sed '/[Pp]rincess/d' ljs --> deletes the lines that contain princess or Princess
   # sed '1,/fragrant/d' ljs --> deletes all lines from line 1 through the first line that contains
fragrant

o sed command highlights
   # more ljs
   I am a boy
   You are a girl
   He is a doctor
   # sed 'a\\
   Hey la la\! Doo de dah\!' ljs --> appends Hey la la! Doo de dah! after each line
I am a boy
Hey la la! Doo de dah!
You are a girl
Hey la la! Doo de dah!
He is a doctor
Hey la la! Doo de dah!
   # sed 'a\\
   Oh\! good\\ --> using \\ makes it possible to append more than one line
   yeh' ljs
   # sed '3a\\
   Good Morning' ljs --> inserts the text after line 3
   # sed 'c\\
   Oh marvelous delight! sing to me! ' ljs --> replaces the existing lines with this one
   Oh marvelous delight! sing to me!
   Oh marvelous delight! sing to me!
   Oh marvelous delight! sing to me!

# sed '2q' ljs = sed 2q ljs --> the q command makes the editor stop after it reaches the specified line.
In other words, only the first 2 lines are shown
   # sed -n '1s/a/#/gp' ljs --> replaces every a on line 1 with # and prints that line

o Pattern matching in sed
sed metacharacters for pattern matching
   -------------------------------------------------------------------
   Metacharacter    Effect
   -------------------------------------------------------------------
   \    Removes the special meaning of the next character
   ^    Matches the beginning of a line
   $    Matches the end of a line
   .    Matches any single character
   [ ]    Matches any one of the enclosed characters
   [^...] Matches any character that is not in the ... list
   pat* Matches zero or more occurrences of pat,
where pat is a single character or a [ ] pattern
   &    Used in the newpattern part of an s command to stand for
the text matched by the oldpattern part
   -------------------------------------------------------------------

o Brief examples
----------------------------------------------------------------------------------------
Command    Result
----------------------------------------------------------------------------------------
   /Second/ Matches any line that contains Second.
   /^Second/    Matches any line that starts with Second.
   /^$/    Matches an empty line, that is, a line with nothing between its start and its end.
It does not match a line made up of spaces, because a space is itself
a character.
   /c.t/    Matches lines that contain cat, cot, and so on. Note that the pattern can be part of
a word. For example, apricot and acute also match.
   /./    Matches lines that contain at least one character.
   /\./ Matches lines that contain a period. The \ removes the special meaning of .
   /s[oa]p/ Matches sop or sap, but not sip or sup.
   /s[^oa]p/ Matches sip or sup, but not sop or sap.
   s/cow/s&s/ Replaces cow with scows.
   /co*t/ * means any number of the preceding character, so this matches ct, cot, coot, ...
----------------------------------------------------------------------------------------

Simple sed recipes
   # sed '/^$/d' ljs --> removes all empty lines
   # sed '/^ *$/d' --> also removes blank lines made of spaces (careful! there must be a space between ^ and *)
   # sed 'a\\
   ' ljs --> adds an empty line after each line
   # sed '/^#/d' ljs --> removes lines that have # in the first column
   # sed 's/^/     /' ljs --> inserts 5 spaces at the start of each line

o Multiple commands
   # sed 's/Bob/Robert/g\
s/Pat/Patricia/g' ljs --> omit the \ when using sh
   # sed 's/cat/dog/g\
s/dog/pigs/g' ljs --> first converts every cat to dog, then every dog to pigs.
   # sed 's/Bob/Robert/g\
s/Pat[^a-z]/Patricia/g' ljs --> recall that [^a-z] means any character other than
   the letters a through z

o Tags
In the command above, when something like Pat! is found, the whole string including the ! is replaced by Patricia, so the ! is lost.
What is needed is a way to replace Pat while keeping the !. This can be done with tags.
To "tag" part of a pattern, enclose it with \( on the left and \) on the right.
Then, in the newpattern part of the command, the first enclosed pattern can be referred to as \1,
the second as \2, and so on. Using this method gives the following command.
# sed 's/\(Pat\)\([^a-z]\)/\1ricia\2/g' ljs
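The effect of tagging can be checked with the equivalent JavaScript replacement, where tagged groups are written with plain parentheses and referenced as $1 and $2. The sample sentence is made up for illustration.

```javascript
// Sample text for illustration only.
const line = 'Pat! said hello to Pat, not to Patrick.';

// Without tags: the character after Pat is part of the match and gets lost.
const naive = line.replace(/Pat[^a-z]/g, 'Patricia');

// With tags: group 1 is "Pat", group 2 is the following character,
// and both are put back around the inserted "ricia".
const tagged = line.replace(/(Pat)([^a-z])/g, '$1ricia$2');

console.log(naive);
console.log(tagged);
```

Patrick is left alone in both cases because the character after Pat is a lowercase letter.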

o Shell scripts and sed
# vi twospace
sed 'a\\
' $* --> $* stands for all of the arguments
# twospace ljs | pr | lpr
The example above shows how well sed fits into UNIX programming and shell scripts.

Source : Tong - schick's linux Tong

by 뭔일이여 2007. 7. 24. 16:47

RIA

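Glossary

RIA ( Rich Internet Application )

RIA ( Rich Internet Application, hereafter RIA ) refers to technology that replaces the flat presentation and sequential processes of conventional web application technology with a dynamic user interface connected to a database, so that every process can be handled within a single interface at low cost.

The term was first used in a 2003 white paper by Macromedia, which has since merged with Adobe. It does not refer to a specific product; it defines applications that provide a rich GUI ( Graphical User Interface ). It was later used widely for Flash-based applications and solutions, and is now used generically for web applications that provide a richer, improved user interface than conventional web applications.

RIA, the shift to next-generation applications

In the era when mainframes dominated, applications built on mainframes did all their data processing on a central server, and users at a terminal saw nothing but a blinking cursor on a black screen. Later, as the client/server model became common, GUIs came into use on operating systems such as Windows. However, distributing client programs in this model took enormous time and money, so applications moved again, this time to the web.

A web application only needs a browser as its client, so accessibility improved greatly compared with earlier models, but its weakness is the UI. For this reason, Rich Internet Application ( RIA ) or X-Internet technology is drawing attention as a way to provide a rich GUI while keeping distribution efficient. Both are concepts for overcoming the presentation limits of conventional web applications, and since both aim for a rich UI, RIA and X-Internet are fundamentally the same idea. However, X-Internet centers on charts and grids, whereas solutions built on the Web 2.0 notion of RIA can use a variety of multimedia to build a more intuitive and richer UI. This is why RIA is attracting attention as a key term of the Web 2.0 era.

The growing importance of RIA

RIA allows more dynamic screens than plain HTML and can present all information on one page instead of a series of pages, which makes it a new web technology centered on user convenience. Users' demands and expectations for web applications are moving ahead of the pace of technical progress and exceed what existing technology can deliver, so the importance of implementing RIA is expected to keep growing. Implementing RIA calls for two pieces of software from Adobe that have been in use the longest, from the earliest days of the concept: Flash and Flex. Since the features and technology of Flash are already well known among developers, the following looks at Flex.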

Adobe Flex

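Adobe's 'Flex 2' is drawing attention as one of the best solutions for implementing Web 2.0. Flex 2 uses vector graphics and rich media to make striking screens possible at the presentation tier, setting it apart from the conventional web. In particular, it can implement the rich and intuitive user interfaces that Web 2.0 emphasizes. Flex 2 consists of 'Flex Builder 2', with improved visual layout and charting, the free developer tool 'Flex 2 SDK', 'Flex Data Service 2', 'Flex Charting 2', and others. The Flex 2 SDK is a developer tool similar to Java's JDK; it contains the Flex framework, the compiler, and so on, and is the basic tool for Flex development. Anyone can download it from the web and develop by writing MXML code directly in an editor. ( Flex SDK 2.0 download http://labs.adobe.com ). Flex Builder 2 is the Flex development tool; because it is Eclipse-based and WYSIWYG, development is far more convenient and productive than writing code by hand. Flex Data Service 2 is a server technology that adds a message service ( JMS ) to what was called the Flex presentation server up to Flex 1.5, supporting data communication between the server and the client ( the Flex application ). Flex Charting 2 is an extended version of the Flex chart components provided up to version 1.5. In earlier versions the server and development tools were sold as a bundle, which was a heavy burden for small companies and individual developers, but from Flex 2 on only the needed solutions can be purchased separately, which greatly lowers the barrier to Flex development.

Advantages of Adobe Flex

First, because Flex is a Flash-based technology, it runs regardless of platform and browser. Operating systems and browsers keep diversifying, and developing an application that looks and behaves the same on every operating system and every browser is quite difficult, but with Flex it is possible to develop cross-platform applications. Flex can also be used on the web, on computers, on mobile phones, and on personal devices; it can reproduce every effect available in Flash, and its data- and code-centered approach gives it better maintainability and development productivity than Flash.

Another strength of Flex is that it is highly extensible. When a new UI component is needed or part of the UI of an existing component has to change, some recently developed RIA solutions do not let developers or designers extend components themselves, so a request has to go to the solution vendor. Flex, in contrast, provides the ActionScript classes of its UI components, and this source is open, so developers can implement new UI components themselves and define them as MXML tags for reuse. Flex provides every API of the framework. In addition, Flex Builder is based on Eclipse, one of the most influential development tools in the world, so it inherits the strengths and stable platform of Eclipse, and existing Eclipse developers can adapt to it easily.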

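** Put in one phrase, RIA is 'a web application that can be implemented on a single page'. Buying goods or booking tickets online used to require moving through many pages, but now users can use every function within one page. Being able to do everything you want within a single page is the strength and appeal of RIA.

At present, the most popular way to build RIA ( Rich Internet Application ) is with Adobe's existing Flash and the newer Flex. RIA applications built these two ways have been developed and deployed abroad since 2002. One Flash RIA system built at that time is still in use: the one-screen reservation system of the Broadmoor ( BroadMoor, www.broadmoor.com ) hotel. This reservation system was the first case in which RIA was applied, and it was built in 2002 by TravelClick ( www.travelclick.net ). It was made with Flash and ColdFusion, and it used Flash's rich graphical user interface to turn the previously complicated booking pages into a single page that requires no page navigation. This was a landmark event for web user interfaces, which until then had implemented such systems across multiple pages. In the years that followed, many sites using RIA were developed. RIA sites built with Flex also began to appear. Flex and Flash started to be used side by side, and the fields in which RIA is applied are widening. Abroad, RIA systems are currently used as online customizing tools and data service tools.

For example, on the Harley-Davidson site in the United States, a user can pick a motorcycle, select and fit the options they want, and preview how the motorcycle will change.
Harley-Davidson ( www.harley-davidson.com ) Users can select accessories and add or swap them themselves.

Sherwin-Williams ( www.sherwin-williams.com.com ) Sherwin-Williams is a paint company that lets users choose the living room, kitchen, or other room they want to change and try out different colors in an RIA environment.

Home Locater ( www.asfusion.com/apps/homelocator ) An intuitive real-estate search application. Users can search simply by clicking on a map and can filter by price, number of bedrooms, and other key criteria.

In Korea, RIA has been adopted and developed since 2005 by leading companies, mainly for cinema ticket booking and some shopping malls. As last year, many companies, especially online shopping malls and car makers, are moving to adopt RIA this year.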

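RIA projects in Korea

* Amorepacific RIA ( Flex )
* Interpark RIA movie booking system
* Purchasing systems of etv & Mple ( www.cjmall.com/etv & www.MPLE.com )

CJ Systems, Good Concert ( 2006 ) : RIA booking system for performances - users pick a performance through a page-turning animation that resembles leafing through a real brochure, and the system goes all the way through seat selection for the venue

Design Group International ( 2006 ) - Design Group web site renewal : data registered on an admin page written in PHP is received as XML and rendered dynamically in Flash

Hangul and Computer ( 2006 ) - Crepot, StyleLog library page : StyleLog data for the selected genre is received as XML and presented as a rotating navigation driven by the mouse wheel

CJ Systems ( 2006 ) Primus Cinema Flash booking system renewal : each time the user selects a movie, date, time, seat, and so on, the booking process advances by exchanging XML data with the server

Samsung Card ( 2006 ) Samsung Card early repayment RIA : the first early repayment payment system at a card company to be implemented in Flash, exchanging data with the server via XML.

CJ Systems ( 2006 ) : CGV on-site ticketing kiosk seat module : a Flash seat-selection module for offline kiosks rather than online use, with data integration implemented in the same way as the online booking system.

SK Communications ( 2005 ) NATE.com Cizle booking system development [ RIA & Application Design ] : a booking system whose flow changes according to four selection modes (movie first, theater first, date first, and today's movies), with data integration in XML

CJ Systems ( 2006 ) CGV site booking system renewal : the first booking system built as a layer that can be reached easily from any page of the web site, and also the first seat-selection booking system in which users pick actual theater seats themselves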

이렇게 RIA 는 웹과 함께 데스크탑까지 확장을 하고 있으며 머지않아 웹 제작의 일반 기술로 채용될 것으로 보인다. 웹 애플리케이션에 대한 사용자의 요구와 기대는 현재 개발되는 기술 발전 속도보다 더 빠르게 증가하는 것 또한 RIA의 활용을 부추기는 요소로 작용하고 있다.

마케팅에는 캐즘 ( Chasm ) 현상이라는 것이 있는데 혁신성을 중시하는 소비자가 중심이 되는 초기 시장과 실용성을 중시하는 소비자가 중심이 되는 주류 시장 사이에 일시적으로 수요가 정체하거나 후퇴하는 단절 현상을 말한다. 이는 일반 사용자들의 고정 관념으로 인해 혁신적인 기술이 빠르게 확산되지 못하는 것을 의미한다. 이러한 현상처럼 Adobe의 Flash와 Flex로 대변되던 RIA의 환경이 다소 주춤하고 있는 사이 RIA를 구현할 수 있는 많은 기술들이 속속 등장하고 있다. AJAX / DHTML이 자바스크립트와 XML 기술을 이용해 웹 2.0에서 많은 주목을 받고 있으며 현재 많은 업체에 의해 AJAX 를 쉽게 개발할 수 있도록 툴 킷들이 공개되고 있다.

또한 마이크로소프트의 WPF 는 차세대 벡터 방식의 그래픽 환경으로 WPF/E를 활용해 RIA를 구현할 수 있다. 여기에 엑티브엑스, 자바애플릿 기술을 들이 약진하고 있다. 더불어 전세계 온라인 검색의 최강자로 군림하고 있는 구글이 에이작스 ( AJAX ) 를 활용하면서 빠른 속도로 Ajax 기술이 전파되고 있다. 이러한 모든 것이 RIA ( Rich Internet Application ) 이며 때문에 RIA 를 처음 도입한 Adobe의 영향력이 위협받고 있다. 하지만 사용자에게 중요한 것은 웹이 되었던 데스크탑이 되었던 모든 어플리케이션에 RIA가 활용될 것이란 사실이다.

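RIA has to mean convenience for users and guaranteed revenue for businesses. Whatever the approach, developers will build with a variety of tools. Before long, when we shop online we will dress a character that looks just like us before buying a product, or see ourselves seated in a car before making a purchase. The world you imagine will soon unfold on the web, and RIA is at the center of these services.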

From spotlight 1. "RIA, a New User Experience", W.E.B magazine

by 뭔일이여 2007. 5. 9. 18:01

Loading a random image

My/Javascript

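/*
 randomLoad: picks a random image (or swf/movie) and writes it into the page.

 Constructor arguments
   obj   : base path used together with num/to, or an array of paths
   type  : 'image', 'swf' or 'movie'
   ext   : file extension appended to the chosen path

 Optional properties, set before calling init()
   num, to : numeric range handed to random_number()
   prepath : prefix prepended to the chosen path
   w, h    : width and height of the element
   link    : array of link URLs indexed by the random number
   target  : id of the element to fill (document.write is used if unset)
   debug   : when true, shows the chosen number in document.test.num

 Set by init(): randNum, objSrc, objLen

 random_number(), object_load() and el_id() are helpers defined elsewhere.
*/
var randomLoad = function(randObj, type, ext) {
    this.obj = randObj;
    this.type = type;
    this.ext = ext;

    this.link = [];
};
randomLoad.prototype.init = function() {
    if (this.num != undefined) {
        // Numbered files: obj is a base path and num/to give the range.
        if (this.to != undefined) { this.randNum = random_number(this.num, this.to); }
        else { this.randNum = random_number(this.num); }
        this.objSrc = this.obj + this.randNum + this.ext;
    } else {
        // Listed files: obj is an array of paths.
        try {
            this.objLen = this.obj.length - 1;
            this.randNum = random_number(this.objLen);
            this.objSrc = this.obj[this.randNum] + this.ext;
        } catch (e) {
            alert("The input values are invalid.\nPlease set them again.");
            return false;
        }
    }
    if (this.prepath != undefined) { this.objSrc = this.prepath + this.objSrc; }
    if (this.debug == true) { document.test.num.value = 'rand:' + this.randNum + ' from:' + this.num + ' to:' + this.to; }
    this.load();
};
randomLoad.prototype.load = function() {
    var src;
    if (this.type == 'image') {
        src = '<img src="' + this.objSrc + '" border="0"';
        if (this.w != undefined) { src += ' width="' + this.w + '"'; }
        // The original wrote a second width attribute here; h is the height.
        if (this.h != undefined) { src += ' height="' + this.h + '"'; }
        src += ' />';
    } else if (this.type == 'swf' || this.type == 'movie') {
        var w = this.w != undefined ? this.w : false;
        var h = this.h != undefined ? this.h : false;
        src = object_load(this.type, this.objSrc + this.ext, w, h, false);
    }
    // Wrap in a link when link URLs were supplied.
    if (this.link.length > 1) { src = '<a href="' + this.link[this.randNum] + '">' + src + '</a>'; }
    if (this.target != undefined) { el_id(this.target).innerHTML = src; }
    else { document.write(src); }
};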
//-->
</script>
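The script calls random_number() but the post does not show it. Below is a minimal sketch of what such a helper might look like. The semantics are an assumption inferred from how init() uses it (one argument means an integer from 0 up to and including that value, two arguments mean an inclusive range), and the image names are made up for illustration.

```javascript
// Hypothetical stand-in for the random_number() helper that the post calls
// but does not define. Assumed semantics:
//   random_number(max)      -> integer in [0, max]
//   random_number(from, to) -> integer in [from, to]
function random_number(a, b) {
  var from = (b === undefined) ? 0 : a;
  var to = (b === undefined) ? a : b;
  return from + Math.floor(Math.random() * (to - from + 1));
}

// Picking an index into a list of three images, the way init() does
// with objLen (array length minus one).
var images = ['banner_a', 'banner_b', 'banner_c'];
var index = random_number(images.length - 1);
console.log(images[index] + '.jpg');
```

With an inclusive upper bound, passing length - 1 (as init() does) can never index past the end of the array.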

by 뭔일이여 2007. 5. 9. 13:05

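돼 vs. 되

Beautiful Hangul

돼
Use 돼 where substituting 해 still reads correctly.
My own reading: used when a situation has been completed or done (e.g. 됐다)
e.g. 안돼요 (O), 안해요 (O)
안되요 (X), 안하요 (X)

되
Use 되 where substituting 하 (as in 하다) still reads correctly.
My own reading: used while a situation is in progress or being done (e.g. 하다)
e.g. 저장하다 (O), 저장되다 (O)
저장해다 (X), 저장돼다 (X)

If anything here looks wrong, please leave a comment.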

by 뭔일이여 2007. 5. 2. 11:14

했대 or 했데?

Beautiful Hangul

Unless you are recalling something you experienced yourself, do not use '-데(요)'.

누가 어떻게 했데? (X)
어딜 갔데? (X)

누가 어떻게 했대? (O)
어딜 갔대? (O)

누가누가 머머 했대요? (O)
누가누가 머머 했데요? (X)

-------------------------------------------------------

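'대(요)' and '데(요)' sound alike and are easy to confuse, but they are used differently, so writing the wrong one changes the meaning.

'-대요' is a contraction of '-다고 해요'. It is used to report something you heard or saw, or to ask whether it is true.

e.g. 워낙 바빠서 서울 한 번 올라오기가 힘들대요. (They say they are so busy that it is hard to come up to Seoul even once.)
그럼 왜 휴학을 했대요? (Then why do they say he took a leave of absence?)

By contrast, '-데요' is a sentence-final ending in the 해요 style. The speaker uses it to relate something recalled from first-hand experience, or something they came to think as a result; it means the same as '-더군요'.

e.g. 선생들이 학생들은 어떡하냐며 눈물을 글썽이데요. (I saw the teachers tearing up, asking what would become of the students.)
고향에 가니까 옛날 생각이 나데요. (When I went back to my hometown, I found old memories coming back.)

When you are reporting something heard or seen, write '공부했대요' and '말했대요'.

(우리말배움터)
Source - Naver Knowledge iN (네이버 지식인)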

by 뭔일이여 2007. 5. 2. 11:02

Perl-compatible regular expressions

Original text

-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------

PCRE(3) PCRE(3)

NAME
PCRE - Perl-compatible regular expressions

INTRODUCTION

   The PCRE library is a set of functions that implement regular expres-
   sion pattern matching using the same syntax and semantics as Perl, with
   just a few differences. (Certain features that appeared in Python and
   PCRE before they appeared in Perl are also available using the Python
   syntax.)

   The current implementation of PCRE (release 7.x) corresponds approxi-
   mately with Perl 5.10, including support for UTF-8 encoded strings and
   Unicode general category properties. However, UTF-8 and Unicode support
   has to be explicitly enabled; it is not the default. The Unicode tables
   correspond to Unicode release 5.0.0.

   In addition to the Perl-compatible matching function, PCRE contains an
   alternative matching function that matches the same compiled patterns
   in a different way. In certain circumstances, the alternative function
   has some advantages. For a discussion of the two matching algorithms,
   see the pcrematching page.

   PCRE is written in C and released as a C library. A number of people
   have written wrappers and interfaces of various kinds. In particular,
   Google Inc. have provided a comprehensive C++ wrapper. This is now
   included as part of the PCRE distribution. The pcrecpp page has details
   of this interface. Other people's contributions can be found in the
   Contrib directory at the primary FTP site, which is:

ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

   Details of exactly which Perl regular expression features are and are
   not supported by PCRE are given in separate documents. See the pcrepat-
   tern and pcrecompat pages.

   Some features of PCRE can be included, excluded, or changed when the
   library is built. The pcre_config() function makes it possible for a
   client to discover which features are available. The features them-
   selves are described in the pcrebuild page. Documentation about build-
   ing PCRE for various operating systems can be found in the README file
   in the source distribution.

   The library contains a number of undocumented internal functions and
   data tables that are used by more than one of the exported external
   functions, but which are not intended for use by external callers.
   Their names all begin with "_pcre_", which hopefully will not provoke
   any name clashes. In some environments, it is possible to control which
   external symbols are exported when a shared library is built, and in
   these cases the undocumented symbols are not exported.

USER DOCUMENTATION

   The user documentation for PCRE comprises a number of different sec-
   tions. In the "man" format, each of these is a separate "man page". In
   the HTML format, each is a separate page, linked from the index page.
   In the plain text format, all the sections are concatenated, for ease
   of searching. The sections are as follows:

   pcre            this document
   pcre-config     show PCRE installation configuration information
   pcreapi         details of PCRE's native C API
   pcrebuild       options for building PCRE
   pcrecallout     details of the callout feature
   pcrecompat      discussion of Perl compatibility
   pcrecpp         details of the C++ wrapper
   pcregrep        description of the pcregrep command
   pcrematching    discussion of the two matching algorithms
   pcrepartial     details of the partial matching facility
   pcrepattern     syntax and semantics of supported
                   regular expressions
   pcreperform     discussion of performance issues
   pcreposix       the POSIX-compatible C API
   pcreprecompile  details of saving and re-using precompiled patterns
   pcresample      discussion of the sample program
   pcrestack       discussion of stack usage
   pcretest        description of the pcretest testing command

In addition, in the "man" and HTML formats, there is a short page for
each C library function, listing its arguments and results.

LIMITATIONS

There are some size limitations in PCRE but it is hoped that they will
never in practice be relevant.

   The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
   is compiled with the default internal linkage size of 2. If you want to
   process regular expressions that are truly enormous, you can compile
   PCRE with an internal linkage size of 3 or 4 (see the README file in
   the source distribution and the pcrebuild documentation for details).
   In these cases the limit is substantially larger. However, the speed
   of execution is slower.

   All values in repeating quantifiers must be less than 65536. The maxi-
   mum compiled length of subpattern with an explicit repeat count is
   30000 bytes. The maximum number of capturing subpatterns is 65535.

There is no limit to the number of parenthesized subpatterns, but there
can be no more than 65535 capturing subpatterns.

The maximum length of name for a named subpattern is 32 characters, and
the maximum number of named subpatterns is 10000.

   The maximum length of a subject string is the largest positive number
   that an integer variable can hold. However, when using the traditional
   matching function, PCRE uses recursion to handle subpatterns and indef-
   inite repetition. This means that the available stack space may limit
   the size of a subject string that can be processed by certain patterns.
   For a discussion of stack issues, see the pcrestack documentation.

UTF-8 AND UNICODE PROPERTY SUPPORT

   From release 3.3, PCRE has had some support for character strings
   encoded in the UTF-8 format. For release 4.0 this was greatly extended
   to cover most common requirements, and in release 5.0 additional sup-
   port for Unicode general category properties was added.

   In order to process UTF-8 strings, you must build PCRE to include UTF-8
   support in the code, and, in addition, you must call pcre_compile()
   with the PCRE_UTF8 option flag. When you do this, both the pattern and
   any subject strings that are matched against it are treated as UTF-8
   strings instead of just strings of bytes.

   If you compile PCRE with UTF-8 support, but do not use it at run time,
   the library will be a bit bigger, but the additional run time overhead
   is limited to testing the PCRE_UTF8 flag occasionally, so should not be
   very big.

   If PCRE is built with Unicode character property support (which implies
   UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup-
   ported. The available properties that can be tested are limited to the
   general category properties such as Lu for an upper case letter or Nd
   for a decimal number, the Unicode script names such as Arabic or Han,
   and the derived properties Any and L&. A full list is given in the
   pcrepattern documentation. Only the short names for properties are sup-
   ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
   ter}, is not supported. Furthermore, in Perl, many properties may
   optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
   does not support this.
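As a rough illustration of general-category and script tests, here is a sketch in JavaScript rather than in PCRE. It assumes an ES2018 or later engine, where the u flag enables \p{..}. The helper names are invented for the example, and JavaScript spells script tests as \p{Script=Han}, which differs from the PCRE form described above.

```javascript
// JavaScript analogue of Unicode property tests; this is not PCRE itself.
const isUpper = (ch) => /^\p{Lu}$/u.test(ch);        // general category Lu
const isDecimal = (ch) => /^\p{Nd}$/u.test(ch);      // general category Nd
const isLetter = (ch) => /^\p{L}$/u.test(ch);        // any letter
const isHan = (ch) => /^\p{Script=Han}$/u.test(ch);  // script name

console.log(isUpper('A'), isUpper('a'));        // true false
console.log(isDecimal('7'), isDecimal('x'));    // true false
console.log(isLetter('\u00e9'), isLetter('!')); // true false
console.log(isHan('\u6f22'), isHan('k'));       // true false
```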

The following comments apply when PCRE is running in UTF-8 mode:

   1. When you set the PCRE_UTF8 flag, the strings passed as patterns and
   subjects are checked for validity on entry to the relevant functions.
   If an invalid UTF-8 string is passed, an error return is given. In some
   situations, you may already know that your strings are valid, and
   therefore want to skip these checks in order to improve performance. If
   you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time,
   PCRE assumes that the pattern or subject it is given (respectively)
   contains only valid UTF-8 codes. In this case, it does not diagnose an
   invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
   PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may
   crash.

2. An unbraced hexadecimal escape sequence (such as \xb3) matches a
two-byte UTF-8 character if the value is greater than 127.

3. Octal numbers up to \777 are recognized, and match two-byte UTF-8
characters for values greater than \177.

4. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
vidual bytes, for example: \x{100}{3}.

5. The dot metacharacter matches one UTF-8 character instead of a sin-
gle byte.

   6. The escape sequence \C can be used to match a single byte in UTF-8
   mode, but its use can lead to some strange effects. This facility is
   not available in the alternative matching function, pcre_dfa_exec().

   7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
   test characters of any code value, but the characters that PCRE recog-
   nizes as digits, spaces, or word characters remain the same set as
   before, all with values less than 256. This remains true even when PCRE
   includes Unicode property support, because to do otherwise would slow
   down PCRE in many common cases. If you really want to test for a wider
   sense of, say, "digit", you must use Unicode property tests such as
   \p{Nd}.

8. Similarly, characters that match the POSIX named character classes
are all low-valued characters.

   9. Case-insensitive matching applies only to characters whose values
   are less than 128, unless PCRE is built with Unicode property support.
   Even when Unicode property support is available, PCRE still uses its
   own character tables when checking the case of low-valued characters,
   so as not to degrade performance. The Unicode property information is
   used only for characters with higher values. Even when Unicode property
   support is available, PCRE supports case-insensitive matching only when
   there is a one-to-one mapping between a letter's cases. There are a
   small number of many-to-one mappings in Unicode; these are not sup-
   ported by PCRE.
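Several of the points above have close analogues in JavaScript, which may make them easier to try out. The sketch below is an analogy only: JavaScript strings are UTF-16 rather than UTF-8, the u flag plays the role that PCRE_UTF8 plays here, and isValidUtf8 is a name made up for this example (it relies on the standard TextDecoder with the fatal option).

```javascript
// Without the `u` flag the engine works on code units; with it, on characters.
const smile = '\u{1F600}';             // one character, two UTF-16 code units

const dotUnits = /^.$/.test(smile);    // false: dot sees a single code unit
const dotChars = /^.$/u.test(smile);   // true: dot sees the whole character

// A quantifier applies to the complete character, as with \x{100}{3}.
const threeChars = /^\u{100}{3}$/u.test('\u0100\u0100\u0100'); // true

// \d stays ASCII-only; a wider sense of "digit" needs a property test.
const arabicThree = '\u0663';          // ARABIC-INDIC DIGIT THREE
const asciiDigit = /\d/u.test(arabicThree);    // false
const anyDigit = /\p{Nd}/u.test(arabicThree);  // true

// Validity check on raw bytes, comparable in spirit to the check that is
// made on entry when PCRE_UTF8 is set.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(isValidUtf8(new Uint8Array([0xc3, 0xa9]))); // true  (U+00E9)
console.log(isValidUtf8(new Uint8Array([0xc3, 0x28]))); // false (bad continuation byte)
```

Skipping the validity check, as PCRE_NO_UTF8_CHECK does, only makes sense when the bytes have already passed a check like this somewhere upstream.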

AUTHOR

   Philip Hazel
   University Computing Service
   Cambridge CB2 3QH, England.

   Putting an actual email address here seems to have been a spam magnet,
   so I've taken it away. If you want to email me, use my two initials,
   followed by the two digits 10, at the domain cam.ac.uk.

REVISION

Last updated: 18 April 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------

PCREBUILD(3) PCREBUILD(3)

NAME
PCRE - Perl-compatible regular expressions

PCRE BUILD-TIME OPTIONS

   This document describes the optional features of PCRE that can be
   selected when the library is compiled. They are all selected, or dese-
   lected, by providing options to the configure script that is run before
   the make command. The complete list of options for configure (which
   includes the standard ones such as the selection of the installation
   directory) can be obtained by running

./configure --help

   The following sections include descriptions of options whose names
   begin with --enable or --disable. These settings specify changes to the
   defaults for the configure command. Because of the way that configure
   works, --enable and --disable always come in pairs, so the complemen-
   tary option always exists as well, but as it specifies the default, it
   is not described.

C++ SUPPORT

   By default, the configure script will search for a C++ compiler and C++
   header files. If it finds them, it automatically builds the C++ wrapper
   library for PCRE. You can disable this by adding

--disable-cpp

to the configure command.

UTF-8 SUPPORT

To build PCRE with support for UTF-8 character strings, add

--enable-utf8

   to the configure command. Of itself, this does not make PCRE treat
   strings as UTF-8. As well as compiling PCRE with this option, you also
   have to set the PCRE_UTF8 option when you call the pcre_compile()
   function.

UNICODE CHARACTER PROPERTY SUPPORT

   UTF-8 support allows PCRE to process character values greater than 255
   in the strings that it handles. On its own, however, it does not pro-
   vide any facilities for accessing the properties of such characters. If
   you want to be able to use the pattern escapes \P, \p, and \X, which
   refer to Unicode character properties, you must add

--enable-unicode-properties

to the configure command. This implies UTF-8 support, even if you have
not explicitly requested it.

   Including Unicode property support adds around 30K of tables to the
   PCRE library. Only the general category properties such as Lu and Nd
   are supported. Details are given in the pcrepattern documentation.

CODE VALUE OF NEWLINE

   By default, PCRE interprets character 10 (linefeed, LF) as indicating
   the end of a line. This is the normal newline character on Unix-like
   systems. You can compile PCRE to use character 13 (carriage return, CR)
   instead, by adding

--enable-newline-is-cr

to the configure command. There is also a --enable-newline-is-lf
option, which explicitly specifies linefeed as the newline character.

Alternatively, you can specify that line endings are to be indicated by
the two character sequence CRLF. If you want this, add

--enable-newline-is-crlf

to the configure command. There is a fourth option, specified by

--enable-newline-is-anycrlf

which causes PCRE to recognize any of the three sequences CR, LF, or
CRLF as indicating a line ending. Finally, a fifth option, specified by

--enable-newline-is-any

causes PCRE to recognize any Unicode newline sequence.

   Whatever line ending convention is selected when PCRE is built can be
   overridden when the library functions are called. At build time it is
   conventional to use the standard for your operating system.
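To make the "anycrlf" convention concrete, here is a small JavaScript sketch that treats CR, LF, and CRLF alike as line endings. It illustrates the convention only and has nothing to do with how PCRE implements the build option; splitLinesAnyCrlf is a name invented for the example.

```javascript
// Split text on any of CR, LF, or CRLF.
// CRLF is listed first so that it counts as one line ending, not two.
function splitLinesAnyCrlf(text) {
  return text.split(/\r\n|\r|\n/);
}

const lines = splitLinesAnyCrlf('one\r\ntwo\rthree\nfour');
console.log(lines); // [ 'one', 'two', 'three', 'four' ]
```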

BUILDING SHARED AND STATIC LIBRARIES

   The PCRE building process uses libtool to build both shared and static
   Unix libraries by default. You can suppress one of these by adding one
   of

--disable-shared
--disable-static

to the configure command, as required.

POSIX MALLOC USAGE

   When PCRE is called through the POSIX interface (see the pcreposix doc-
   umentation), additional working storage is required for holding the
   pointers to capturing substrings, because PCRE requires three integers
   per substring, whereas the POSIX interface provides only two. If the
   number of expected substrings is small, the wrapper function uses space
   on the stack, because this is faster than using malloc() for each call.
   The default threshold above which the stack is no longer used is 10; it
   can be changed by adding a setting such as

--with-posix-malloc-threshold=20

to the configure command.

HANDLING VERY LARGE PATTERNS

   Within a compiled pattern, offset values are used to point from one
   part to another (for example, from an opening parenthesis to an alter-
   nation metacharacter). By default, two-byte values are used for these
   offsets, leading to a maximum size for a compiled pattern of around
   64K. This is sufficient to handle all but the most gigantic patterns.
   Nevertheless, some people do want to process enormous patterns, so it
   is possible to compile PCRE to use three-byte or four-byte offsets by
   adding a setting such as

--with-link-size=3

   to the configure command. The value given must be 2, 3, or 4. Using
   longer offsets slows down the operation of PCRE because it has to load
   additional bytes when handling them.

AVOIDING EXCESSIVE STACK USAGE

   When matching with the pcre_exec() function, PCRE implements backtrack-
   ing by making recursive calls to an internal function called match().
   In environments where the size of the stack is limited, this can se-
   verely limit PCRE's operation. (The Unix environment does not usually
   suffer from this problem, but it may sometimes be necessary to increase
   the maximum stack size. There is a discussion in the pcrestack docu-
   mentation.) An alternative approach to recursion that uses memory from
   the heap to remember data, instead of using recursive function calls,
   has been implemented to work round the problem of limited stack size.
   If you want to build a version of PCRE that works this way, add

--disable-stack-for-recursion

   to the configure command. With this configuration, PCRE will use the
   pcre_stack_malloc and pcre_stack_free variables to call memory manage-
   ment functions. Separate functions are provided because the usage is
   very predictable: the block sizes requested are always the same, and
   the blocks are always freed in reverse order. A calling program might
   be able to implement optimized functions that perform better than the
   standard malloc() and free() functions. PCRE runs noticeably more
   slowly when built in this way. This option affects only the pcre_exec()
   function; it is not relevant for the pcre_dfa_exec() function.

LIMITING PCRE RESOURCE USAGE

   Internally, PCRE has a function called match(), which it calls repeat-
   edly (sometimes recursively) when matching a pattern with the
   pcre_exec() function. By controlling the maximum number of times this
   function may be called during a single matching operation, a limit can
   be placed on the resources used by a single call to pcre_exec(). The
   limit can be changed at run time, as described in the pcreapi documen-
   tation. The default is 10 million, but this can be changed by adding a
   setting such as

--with-match-limit=500000

to the configure command. This setting has no effect on the
pcre_dfa_exec() matching function.
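The idea of capping the number of calls to a matching function can be sketched with a toy backtracking matcher in JavaScript. This is not PCRE's match(): it is a deliberately tiny matcher (literals, '.', and '*' only, anchored at the start of the subject) in the style of the classic Kernighan and Pike example, and the names limitedMatch and MatchLimitError are invented here.

```javascript
class MatchLimitError extends Error {}

// Toy backtracking matcher with a cap on the number of internal calls.
function limitedMatch(pattern, subject, matchLimit) {
  let calls = 0;

  // Match pattern[pi..] against subject[si..], anchored at si.
  function matchHere(pi, si) {
    calls += 1;
    if (calls > matchLimit) {
      throw new MatchLimitError('match limit exceeded');
    }
    if (pi === pattern.length) return true;
    const fits = (i) =>
      i < subject.length && (pattern[pi] === '.' || pattern[pi] === subject[i]);
    if (pattern[pi + 1] === '*') {
      // Greedy star: take as many as possible, then give back one at a time.
      let end = si;
      while (fits(end)) end += 1;
      for (let k = end; k >= si; k -= 1) {
        if (matchHere(pi + 2, k)) return true;
      }
      return false;
    }
    return fits(si) ? matchHere(pi + 1, si + 1) : false;
  }

  return { matched: matchHere(0, 0), calls: calls };
}

console.log(limitedMatch('a*b', 'aaab', 1000).matched); // true

// A pattern with heavy backtracking runs into the limit instead of
// grinding through every combination.
try {
  limitedMatch('a*a*a*a*c', 'a'.repeat(25), 1000);
} catch (e) {
  console.log(e instanceof MatchLimitError); // true
}
```

A count like this bounds the total work; bounding recursion depth, as the next paragraph describes, would need a separate counter that is decremented on return.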

   In some environments it is desirable to limit the depth of recursive
   calls of match() more strictly than the total number of calls, in order
   to restrict the maximum amount of stack (or heap, if --disable-stack-
   for-recursion is specified) that is used. A second limit controls this;
   it defaults to the value that is set for --with-match-limit, which
   imposes no additional constraints. However, you can set a lower limit
   by adding, for example,

--with-match-limit-recursion=10000

to the configure command. This value can also be overridden at run
time.

CREATING CHARACTER TABLES AT BUILD TIME

   PCRE uses fixed tables for processing characters whose code values are
   less than 256. By default, PCRE is built with a set of tables that are
   distributed in the file pcre_chartables.c.dist. These tables are for
   ASCII codes only. If you add

--enable-rebuild-chartables

   to the configure command, the distributed tables are no longer used.
   Instead, a program called dftables is compiled and run. This outputs
   the source for new set of tables, created in the default locale of your
   C runtime system. (This method of replacing the tables does not work if
   you are cross compiling, because dftables is run on the local host. If
   you need to create alternative tables when cross compiling, you will
   have to do so "by hand".)

USING EBCDIC CODE

   PCRE assumes by default that it will run in an environment where the
   character code is ASCII (or Unicode, which is a superset of ASCII).
   PCRE can, however, be compiled to run in an EBCDIC environment by
   adding

--enable-ebcdic

to the configure command. This setting implies --enable-rebuild-charta-
bles.

SEE ALSO

pcreapi(3), pcre_config(3).

AUTHOR

   Philip Hazel
   University Computing Service
   Cambridge CB2 3QH, England.

REVISION

Last updated: 16 April 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------

PCREMATCHING(3) PCREMATCHING(3)

NAME
PCRE - Perl-compatible regular expressions

PCRE MATCHING ALGORITHMS

   This document describes the two different algorithms that are available
   in PCRE for matching a compiled regular expression against a given sub-
   ject string. The "standard" algorithm is the one provided by the
   pcre_exec() function. This works in the same way as Perl's matching
   function, and provides a Perl-compatible matching operation.

   An alternative algorithm is provided by the pcre_dfa_exec() function;
   this operates in a different way, and is not Perl-compatible. It has
   advantages and disadvantages compared with the standard algorithm, and
   these are described below.

   When there is only one possible way in which a given subject string can
   match a pattern, the two algorithms give the same answer. A difference
   arises, however, when there are multiple possibilities. For example, if
   the pattern

^<.*>

is matched against the string

<something> <something else> <something further>

there are three possible answers. The standard algorithm finds only one
of them, whereas the alternative algorithm finds all three.

REGULAR EXPRESSIONS AS TREES

   The set of strings that are matched by a regular expression can be rep-
   resented as a tree structure. An unlimited repetition in the pattern
   makes the tree of infinite size, but it is still a tree. Matching the
   pattern to a given subject string (from a given starting point) can be
   thought of as a search of the tree. There are two ways to search a
   tree: depth-first and breadth-first, and these correspond to the two
   matching algorithms provided by PCRE.

THE STANDARD MATCHING ALGORITHM

   In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
   sions", the standard algorithm is an "NFA algorithm". It conducts a
   depth-first search of the pattern tree. That is, it proceeds along a
   single path through the tree, checking that the subject matches what is
   required. When there is a mismatch, the algorithm tries any alterna-
   tives at the current point, and if they all fail, it backs up to the
   previous branch point in the tree, and tries the next alternative
   branch at that level. This often involves backing up (moving to the
   left) in the subject string as well. The order in which repetition
   branches are tried is controlled by the greedy or ungreedy nature of
   the quantifier.

   If a leaf node is reached, a matching string has been found, and at
   that point the algorithm stops. Thus, if there is more than one possi-
   ble match, this algorithm returns the first one that it finds. Whether
   this is the shortest, the longest, or some intermediate length depends
   on the way the greedy and ungreedy repetition quantifiers are specified
   in the pattern.

   Because it ends up with a single path through the tree, it is rela-
   tively straightforward for this algorithm to keep track of the sub-
   strings that are matched by portions of the pattern in parentheses.
   This provides support for capturing parentheses and back references.
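JavaScript's regular expression engine is also a backtracking matcher, so it can serve as a quick stand-in for trying out the behaviour described here. The sketch below is not PCRE; it uses a subject with three bracketed items to show that the first match found depends on whether the quantifier is greedy, and that captures and back references are available.

```javascript
const subject = '<something> <something else> <something further>';

// Greedy: .* takes everything, then backs up to the last '>'.
const greedy = subject.match(/^<.*>/)[0];
// Ungreedy: .*? stops at the first '>' that lets the match succeed.
const lazy = subject.match(/^<.*?>/)[0];

console.log(greedy === subject); // true
console.log(lazy);               // <something>

// A single path through the pattern makes captures and back references easy.
const repeated = 'hello hello world'.match(/(\w+) \1/);
console.log(repeated[0], '|', repeated[1]); // hello hello | hello
```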

THE ALTERNATIVE MATCHING ALGORITHM

   This algorithm conducts a breadth-first search of the tree. Starting
   from the first matching point in the subject, it scans the subject
   string from left to right, once, character by character, and as it does
   this, it remembers all the paths through the tree that represent valid
   matches. In Friedl's terminology, this is a kind of "DFA algorithm",
   though it is not implemented as a traditional finite state machine (it
   keeps multiple states active simultaneously).

   The scan continues until either the end of the subject is reached, or
   there are no more unterminated paths. At this point, terminated paths
   represent the different matching possibilities (if there are none, the
   match has failed). Thus, if there is more than one possible match,
   this algorithm finds all of them, and in particular, it finds the long-
   est. In PCRE, there is an option to stop the algorithm after the first
   match (which is necessarily the shortest) has been found.

Note that all the matches that are found start at the same point in the
subject. If the pattern

cat(er(pillar)?)?

   is matched against the string "the caterpillar catchment", the result
   will be the three strings "cat", "cater", and "caterpillar" that start
   at the fifth character of the subject. The algorithm does not automat-
   ically move on to find matches that start at later positions.
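The "all matches from one starting point" result can be imitated in JavaScript with a brute-force sketch. It assumes the pattern with an optional outer group, cat(er(pillar)?)?, so that "cat" on its own can match. It does not implement a breadth-first scan: it simply tries every end position, which is enough to show what the result set looks like. allMatchesAt is a name invented for the example.

```javascript
// Brute-force emulation: try every end position from one fixed start and
// keep the substrings that the whole pattern matches. A real breadth-first
// matcher would find these in a single left-to-right scan instead.
function allMatchesAt(source, subject, start) {
  const whole = new RegExp('^(?:' + source + ')$');
  const found = [];
  for (let end = subject.length; end >= start; end -= 1) {
    const candidate = subject.slice(start, end);
    if (whole.test(candidate)) found.push(candidate);
  }
  return found; // longest first
}

const subject = 'the caterpillar catchment';
const start = subject.indexOf('cat'); // offset 4, i.e. the fifth character
console.log(allMatchesAt('cat(er(pillar)?)?', subject, start));
// [ 'caterpillar', 'cater', 'cat' ]
```

The later "cat" in "catchment" is not reported, because every candidate starts at the same offset.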

There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:

   1. Because the algorithm finds all possible matches, the greedy or
   ungreedy nature of repetition quantifiers is not relevant. Greedy and
   ungreedy quantifiers are treated in exactly the same way. However, pos-
   sessive quantifiers can make a difference when what follows could also
   match what is quantified, for example in a pattern like this:

^a++\w!

   This pattern matches "aaab!" but not "aaa!", which would be matched by
   a non-possessive quantifier. Similarly, if an atomic group is present,
   it is matched as if it were a standalone pattern at the current point,
   and the longest match is then "locked in" for the rest of the overall
   pattern.

   2. When dealing with multiple paths through the tree simultaneously, it
   is not straightforward to keep track of captured substrings for the
   different matching possibilities, and PCRE's implementation of this
   algorithm does not attempt to do this. This means that no captured sub-
   strings are available.

3. Because no substrings are captured, back references within the pat-
tern are not supported, and cause errors if encountered.

   4. For the same reason, conditional expressions that use a backrefer-
   ence as the condition or test for a specific group recursion are not
   supported.

5. Callouts are supported, but the value of the capture_top field is
always 1, and the value of the capture_last field is always -1.

   6. The \C escape sequence, which (in the standard algorithm) matches a
   single byte, even in UTF-8 mode, is not supported because the alterna-
   tive algorithm moves through the subject string one character at a
   time, for all active paths through the tree.
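The possessive-quantifier example in item 1 can be tried in JavaScript, with one caveat: JavaScript has no a++ syntax, so the sketch below emulates it with a lookahead plus a back reference. That emulation is an assumption of this example, not something PCRE does internally.

```javascript
// (?=(a+))\1 grabs the whole run of a's and, because the lookahead is not
// re-entered on backtracking, never gives any of them back: an emulation
// of the possessive a++.
const possessive = /^(?=(a+))\1\w!/;
// The ordinary quantifier may give back one 'a' for \w to match.
const ordinary = /^a+\w!/;

console.log(possessive.test('aaab!')); // true
console.log(possessive.test('aaa!'));  // false
console.log(ordinary.test('aaa!'));    // true
```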

ADVANTAGES OF THE ALTERNATIVE ALGORITHM

Using the alternative matching algorithm provides the following advan-
tages:

   1. All possible matches (at a single point in the subject) are automat-
   ically found, and in particular, the longest match is found. To find
   more than one match using the standard algorithm, you have to do kludgy
   things with callouts.

   2. There is much better support for partial matching. The restrictions
   on the content of the pattern that apply when using the standard algo-
   rithm for partial matching do not apply to the alternative algorithm.
   For non-anchored patterns, the starting position of a partial match is
   available.

   3. Because the alternative algorithm scans the subject string just
   once, and never needs to backtrack, it is possible to pass very long
   subject strings to the matching function in several pieces, checking
   for partial matching each time.

DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

The alternative algorithm suffers from a number of disadvantages:

   1. It is substantially slower than the standard algorithm. This is
   partly because it has to search for all possible matches, but is also
   because it is less susceptible to optimization.

2. Capturing parentheses and back references are not supported.

3. Although atomic groups are supported, their use does not provide the
performance advantage that it does for the standard algorithm.

AUTHOR

   Philip Hazel
   University Computing Service
   Cambridge CB2 3QH, England.

REVISION

Last updated: 06 March 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------

PCREAPI(3) PCREAPI(3)

NAME
PCRE - Perl-compatible regular expressions

PCRE NATIVE API

#include <pcre.h>

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre *pcre_compile2(const char *pattern, int options,
     int *errorcodeptr,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize,
     int *workspace, int wscount);

int pcre_copy_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     char *buffer, int buffersize);

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     const char **stringptr);

int pcre_get_stringnumber(const pcre *code,
     const char *name);

int pcre_get_stringtable_entries(const pcre *code,
     const char *name, char **first, char **last);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

void pcre_free_substring(const char *stringptr);

void pcre_free_substring_list(const char **stringptr);

const unsigned char *pcre_maketables(void);

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

int pcre_refcount(pcre *code, int adjust);

int pcre_config(int what, void *where);

char *pcre_version(void);

void *(*pcre_malloc)(size_t);

void (*pcre_free)(void *);

void *(*pcre_stack_malloc)(size_t);

void (*pcre_stack_free)(void *);

int (*pcre_callout)(pcre_callout_block *);

PCRE API OVERVIEW

   PCRE has its own native API, which is described in this document. There
   are also some wrapper functions that correspond to the POSIX regular
   expression API. These are described in the pcreposix documentation.
   Both of these APIs define a set of C function calls. A C++ wrapper is
   distributed with PCRE. It is documented in the pcrecpp page.

   The native API C function prototypes are defined in the header file
   pcre.h, and on Unix systems the library itself is called libpcre. It
   can normally be accessed by adding -lpcre to the command for linking an
   application that uses PCRE. The header file defines the macros
   PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num-
   bers for the library. Applications can use these to include support
   for different releases of PCRE.

   The functions pcre_compile(), pcre_compile2(), pcre_study(), and
   pcre_exec() are used for compiling and matching regular expressions in
   a Perl-compatible manner. A sample program that demonstrates the sim-
   plest way of using them is provided in the file called pcredemo.c in
   the source distribution. The pcresample documentation describes how to
   run it.

   A second matching function, pcre_dfa_exec(), which is not Perl-compati-
   ble, is also provided. This uses a different algorithm for the match-
   ing. The alternative algorithm finds all possible matches (at a given
   point in the subject), and scans the subject just once. However, this
   algorithm does not return captured substrings. A description of the two
   matching algorithms and their advantages and disadvantages is given in
   the pcrematching documentation.

   In addition to the main compiling and matching functions, there are
   convenience functions for extracting captured substrings from a subject
   string that is matched by pcre_exec(). They are:

   pcre_copy_substring()
   pcre_copy_named_substring()
   pcre_get_substring()
   pcre_get_named_substring()
   pcre_get_substring_list()
   pcre_get_stringnumber()
   pcre_get_stringtable_entries()

   pcre_free_substring() and pcre_free_substring_list() are also provided,
   to free the memory used for extracted strings.

   The function pcre_maketables() is used to build a set of character
   tables in the current locale for passing to pcre_compile(),
   pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
   provided for specialist use. Most commonly, no special tables are
   passed, in which case internal tables that are generated when PCRE is
   built are used.

   The function pcre_fullinfo() is used to find out information about a
   compiled pattern; pcre_info() is an obsolete version that returns only
   some of the available information, but is retained for backwards com-
   patibility. The function pcre_version() returns a pointer to a string
   containing the version of PCRE and its date of release.

   The function pcre_refcount() maintains a reference count in a data
   block containing a compiled pattern. This is provided for the benefit
   of object-oriented applications.

   The global variables pcre_malloc and pcre_free initially contain the
   entry points of the standard malloc() and free() functions, respec-
   tively. PCRE calls the memory management functions via these variables,
   so a calling program can replace them if it wishes to intercept the
   calls. This should be done before calling any PCRE functions.

   The global variables pcre_stack_malloc and pcre_stack_free are also
   indirections to memory management functions. These special functions
   are used only when PCRE is compiled to use the heap for remembering
   data, instead of recursive function calls, when running the pcre_exec()
   function. See the pcrebuild documentation for details of how to do
   this. It is a non-standard way of building PCRE, for use in environ-
   ments that have limited stacks. Because of the greater use of memory
   management, it runs more slowly. Separate functions are provided so
   that special-purpose external code can be used for this case. When
   used, these functions are always called in a stack-like manner (last
   obtained, first freed), and always for memory blocks of the same size.
   There is a discussion about PCRE's stack usage in the pcrestack docu-
   mentation.

   The global variable pcre_callout initially contains NULL. It can be set
   by the caller to a "callout" function, which PCRE will then call at
   specified points during a matching operation. Details are given in the
   pcrecallout documentation.

NEWLINES

   PCRE supports five different conventions for indicating line breaks in
   strings: a single CR (carriage return) character, a single LF (line-
   feed) character, the two-character sequence CRLF, any of the three pre-
   ceding, or any Unicode newline sequence. The Unicode newline sequences
   are the three just mentioned, plus the single characters VT (vertical
   tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
   separator, U+2028), and PS (paragraph separator, U+2029).

   Each of the first three conventions is used by at least one operating
   system as its standard newline sequence. When PCRE is built, a default
   can be specified. If none is specified, the default is LF, which is the
   Unix standard. When PCRE is run, the default can be overridden, either
   when a pattern is compiled, or when it is matched.

   In the PCRE documentation the word "newline" is used to mean "the char-
   acter or pair of characters that indicate a line break". The choice of
   newline convention affects the handling of the dot, circumflex, and
   dollar metacharacters, the handling of #-comments in /x mode, and, when
   CRLF is a recognized line ending sequence, the match position advance-
   ment for a non-anchored pattern. The choice of newline convention does
   not affect the interpretation of the \n or \r escape sequences.

MULTITHREADING

   The PCRE functions can be used in multi-threading applications, with
   the proviso that the memory management functions pointed to by
   pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
   callout function pointed to by pcre_callout, are shared by all threads.

   The compiled form of a regular expression is not altered during match-
   ing, so the same compiled pattern can safely be used by several threads
   at once.

SAVING PRECOMPILED PATTERNS FOR LATER USE

   The compiled form of a regular expression can be saved and re-used at a
   later time, possibly by a different program, and even on a host other
   than the one on which it was compiled. Details are given in the
   pcreprecompile documentation. However, compiling a regular expression
   with one version of PCRE for use with a different version is not guar-
   anteed to work and may cause crashes.

CHECKING BUILD-TIME OPTIONS

int pcre_config(int what, void *where);

   The function pcre_config() makes it possible for a PCRE client to dis-
   cover which optional features have been compiled into the PCRE library.
   The pcrebuild documentation has more details about these optional fea-
   tures.

   The first argument for pcre_config() is an integer, specifying which
   information is required; the second argument is a pointer to a variable
   into which the information is placed. The following information is
   available:

PCRE_CONFIG_UTF8

   The output is an integer that is set to one if UTF-8 support is
   available; otherwise it is set to zero.

PCRE_CONFIG_UNICODE_PROPERTIES

   The output is an integer that is set to one if support for Unicode
   character properties is available; otherwise it is set to zero.

PCRE_CONFIG_NEWLINE

   The output is an integer whose value specifies the default character
   sequence that is recognized as meaning "newline". The five values that
   are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
   and -1 for ANY. The default should normally be the standard sequence
   for your operating system.

PCRE_CONFIG_LINK_SIZE

   The output is an integer that contains the number of bytes used for
   internal linkage in compiled regular expressions. The value is 2, 3, or
   4. Larger values allow larger regular expressions to be compiled, at
   the expense of slower matching. The default value of 2 is sufficient
   for all but the most massive patterns, since it allows the compiled
   pattern to be up to 64K in size.

PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

   The output is an integer that contains the threshold above which the
   POSIX interface uses malloc() for output vectors. Further details are
   given in the pcreposix documentation.

PCRE_CONFIG_MATCH_LIMIT

   The output is an integer that gives the default limit for the number of
   internal matching function calls in a pcre_exec() execution. Further
   details are given with pcre_exec() below.

PCRE_CONFIG_MATCH_LIMIT_RECURSION

   The output is an integer that gives the default limit for the depth of
   recursion when calling the internal matching function in a pcre_exec()
   execution. Further details are given with pcre_exec() below.

PCRE_CONFIG_STACKRECURSE

   The output is an integer that is set to one if internal recursion when
   running pcre_exec() is implemented by recursive function calls that use
   the stack to remember their state. This is the usual way that PCRE is
   compiled. The output is zero if PCRE was compiled to use blocks of data
   on the heap instead of recursive function calls. In this case,
   pcre_stack_malloc and pcre_stack_free are called to manage memory
   blocks on the heap, thus avoiding the use of the stack.
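
   The PCRE_CONFIG_NEWLINE values listed above can be turned into
   readable names with a small lookup. The sketch below is deliberately
   independent of the library so that the arithmetic is visible: 3338 is
   simply CR (13) in the high byte and LF (10) in the low byte. The
   function newline_name is a hypothetical helper, not a PCRE function.

```c
#include <stdio.h>

/* Maps a value obtained from
       int nl; pcre_config(PCRE_CONFIG_NEWLINE, &nl);
   to a readable name. */
const char *newline_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";      /* (13 << 8) | 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}
```

   In a real program the value would come from pcre_config(), whose
   return value should be checked before the output variable is used.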

COMPILING A PATTERN

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre *pcre_compile2(const char *pattern, int options,
     int *errorcodeptr,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

   Either of the functions pcre_compile() or pcre_compile2() can be called
   to compile a pattern into an internal form. The only difference between
   the two interfaces is that pcre_compile2() has an additional argument,
   errorcodeptr, via which a numerical error code can be returned.

   The pattern is a C string terminated by a binary zero, and is passed in
   the pattern argument. A pointer to a single block of memory that is
   obtained via pcre_malloc is returned. This contains the compiled code
   and related data. The pcre type is defined for the returned block; this
   is a typedef for a structure whose contents are not externally defined.
   It is up to the caller to free the memory (via pcre_free) when it is no
   longer required.

   Although the compiled code of a PCRE regex is relocatable, that is, it
   does not depend on memory location, the complete pcre data block is not
   fully relocatable, because it may contain a copy of the tableptr argu-
   ment, which is an address (see below).

   The options argument contains various bit settings that affect the com-
   pilation. It should be zero if no options are required. The available
   options are described below. Some of them, in particular, those that
   are compatible with Perl, can also be set and unset from within the
   pattern (see the detailed description in the pcrepattern documenta-
   tion). For these options, the contents of the options argument speci-
   fies their initial settings at the start of compilation and execution.
   The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time
   of matching as well as at compile time.

   If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
   if compilation of a pattern fails, pcre_compile() returns NULL, and
   sets the variable pointed to by errptr to point to a textual error mes-
   sage. This is a static string that is part of the library. You must not
   try to free it. The offset from the start of the pattern to the charac-
   ter where the error was discovered is placed in the variable pointed to
   by erroffset, which must not be NULL. If it is, an immediate error is
   given.

   If pcre_compile2() is used instead of pcre_compile(), and the error-
   codeptr argument is not NULL, a non-zero error code number is returned
   via this argument in the event of an error. This is in addition to the
   textual error message. Error codes and messages are listed below.

   If the final argument, tableptr, is NULL, PCRE uses a default set of
   character tables that are built when PCRE is compiled, using the
   default C locale. Otherwise, tableptr must be an address that is the
   result of a call to pcre_maketables(). This value is stored with the
   compiled pattern, and used again by pcre_exec(), unless another table
   pointer is passed to it. For more discussion, see the section on locale
   support below.

   This code fragment shows a typical straightforward call to
   pcre_compile():

     pcre *re;
     const char *error;
     int erroffset;
     re = pcre_compile(
       "^A.*Z",          /* the pattern */
       0,                /* default options */
       &error,           /* for error message */
       &erroffset,       /* for error offset */
       NULL);            /* use default character tables */

   The following names for option bits are defined in the pcre.h header
   file:

PCRE_ANCHORED

   If this bit is set, the pattern is forced to be "anchored", that is, it
   is constrained to match only at the first matching point in the string
   that is being searched (the "subject string"). This effect can also be
   achieved by appropriate constructs in the pattern itself, which is the
   only way to do it in Perl.

PCRE_AUTO_CALLOUT

   If this bit is set, pcre_compile() automatically inserts callout items,
   all with number 255, before each pattern item. For discussion of the
   callout facility, see the pcrecallout documentation.

PCRE_CASELESS

   If this bit is set, letters in the pattern match both upper and lower
   case letters. It is equivalent to Perl's /i option, and it can be
   changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
   always understands the concept of case for characters whose values are
   less than 128, so caseless matching is always possible. For characters
   with higher values, the concept of case is supported if PCRE is com-
   piled with Unicode property support, but not otherwise. If you want to
   use caseless matching for characters 128 and above, you must ensure
   that PCRE is compiled with Unicode property support as well as with
   UTF-8 support.

PCRE_DOLLAR_ENDONLY

   If this bit is set, a dollar metacharacter in the pattern matches only
   at the end of the subject string. Without this option, a dollar also
   matches immediately before a newline at the end of the string (but not
   before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
   if PCRE_MULTILINE is set. There is no equivalent to this option in
   Perl, and no way to set it within a pattern.

PCRE_DOTALL

   If this bit is set, a dot metacharacter in the pattern matches all
   characters, including those that indicate newline. Without it, a dot
   does not match when the current position is at a newline. This option
   is equivalent to Perl's /s option, and it can be changed within a
   pattern by a (?s) option setting. A negative class such as [^a] always
   matches newline characters, independent of the setting of this option.

PCRE_DUPNAMES

   If this bit is set, names used to identify capturing subpatterns need
   not be unique. This can be helpful for certain types of pattern when it
   is known that only one instance of the named subpattern can ever be
   matched. There are more details of named subpatterns below; see also
   the pcrepattern documentation.

PCRE_EXTENDED

   If this bit is set, whitespace data characters in the pattern are
   totally ignored except when escaped or inside a character class. White-
   space does not include the VT character (code 11). In addition, charac-
   ters between an unescaped # outside a character class and the next new-
   line, inclusive, are also ignored. This is equivalent to Perl's /x
   option, and it can be changed within a pattern by a (?x) option set-
   ting.

   This option makes it possible to include comments inside complicated
   patterns. Note, however, that this applies only to data characters.
   Whitespace characters may never appear within special character
   sequences in a pattern, for example within the sequence (?( which
   introduces a conditional subpattern.

PCRE_EXTRA

   This option was invented in order to turn on additional functionality
   of PCRE that is incompatible with Perl, but it is currently of very
   little use. When set, any backslash in a pattern that is followed by a
   letter that has no special meaning causes an error, thus reserving
   these combinations for future expansion. By default, as in Perl, a
   backslash followed by a letter with no special meaning is treated as a
   literal. (Perl can, however, be persuaded to give a warning for this.)
   There are at present no other features controlled by this option. It
   can also be set by a (?X) option setting within a pattern.

PCRE_FIRSTLINE

   If this option is set, an unanchored pattern is required to match
   before or at the first newline in the subject string, though the
   matched text may continue over the newline.

PCRE_MULTILINE

   By default, PCRE treats the subject string as consisting of a single
   line of characters (even if it actually contains newlines). The "start
   of line" metacharacter (^) matches only at the start of the string,
   while the "end of line" metacharacter ($) matches only at the end of
   the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
   is set). This is the same as Perl.

   When PCRE_MULTILINE is set, the "start of line" and "end of line"
   constructs match immediately following or immediately before internal
   newlines in the subject string, respectively, as well as at the very
   start and end. This is equivalent to Perl's /m option, and it can be
   changed within a pattern by a (?m) option setting. If there are no
   newlines in a subject string, or no occurrences of ^ or $ in a pattern,
   setting PCRE_MULTILINE has no effect.

PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY

   These options override the default newline definition that was chosen
   when PCRE was built. Setting the first or the second specifies that a
   newline is indicated by a single character (CR or LF, respectively).
   Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
   two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
   that any of the three preceding sequences should be recognized. Setting
   PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
   recognized. The Unicode newline sequences are the three just mentioned,
   plus the single characters VT (vertical tab, U+000B), FF (formfeed,
   U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
   (paragraph separator, U+2029). The last two are recognized only in
   UTF-8 mode.

   The newline setting in the options word uses three bits that are
   treated as a number, giving eight possibilities. Currently only six are
   used (default plus the five values above). This means that if you set
   more than one newline option, the combination may or may not be sensi-
   ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
   PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
   cause an error.

   The only time that a line break is specially recognized when compiling
   a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
   character class is encountered. This indicates a comment that lasts
   until after the next line break sequence. In other circumstances, line
   break sequences are treated as literal data, except that in
   PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
   and are therefore ignored.

   The newline option that is set at compile time becomes the default that
   is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.

PCRE_NO_AUTO_CAPTURE

   If this option is set, it disables the use of numbered capturing paren-
   theses in the pattern. Any opening parenthesis that is not followed by
   ? behaves as if it were followed by ?: but named parentheses can still
   be used for capturing (and they acquire numbers in the usual way).
   There is no equivalent of this option in Perl.

PCRE_UNGREEDY

   This option inverts the "greediness" of the quantifiers so that they
   are not greedy by default, but become greedy if followed by "?". It is
   not compatible with Perl. It can also be set by a (?U) option setting
   within the pattern.

PCRE_UTF8

   This option causes PCRE to regard both the pattern and the subject as
   strings of UTF-8 characters instead of single-byte character strings.
   However, it is available only when PCRE is built to include UTF-8 sup-
   port. If not, the use of this option provokes an error. Details of how
   this option changes the behaviour of PCRE are given in the section on
   UTF-8 support in the main pcre page.

PCRE_NO_UTF8_CHECK

   When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
   automatically checked. If an invalid UTF-8 sequence of bytes is found,
   pcre_compile() returns an error. If you already know that your pattern
   is valid, and you want to skip this check for performance reasons, you
   can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of
   passing an invalid UTF-8 string as a pattern is undefined. It may cause
   your program to crash. Note that this option can also be passed to
   pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check-
   ing of subject strings.
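
   The remark under the PCRE_NEWLINE_xxx options, that the newline
   setting is a three-bit number rather than a set of independent flags,
   can be made concrete with a little arithmetic. The constants below are
   illustrative copies that are ASSUMED to match the values in pcre.h;
   real code should include pcre.h and use the PCRE_NEWLINE_xxx macros
   directly. NL_MASK and newline_field are hypothetical names.

```c
#include <stdio.h>

/* Illustrative copies of the newline option bits (assumed values). */
enum {
    NL_CR      = 0x00100000,
    NL_LF      = 0x00200000,
    NL_CRLF    = 0x00300000,
    NL_ANY     = 0x00400000,
    NL_ANYCRLF = 0x00500000,
    NL_MASK    = 0x00700000    /* the three-bit newline field */
};

/* Extracts the newline field from an options word as a number 0..7;
   0 means "use the default chosen when PCRE was built". */
int newline_field(int options)
{
    return (options & NL_MASK) >> 20;
}
```

   Under these assumed values, OR-ing NL_CR with NL_LF gives the same
   field as NL_CRLF, as the text states, whereas OR-ing NL_LF with NL_ANY
   gives 6, which corresponds to none of the five options and would be
   rejected.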

COMPILATION ERROR CODES

   The following table lists the error codes that may be returned by
   pcre_compile2(), along with the error messages that may be returned by
   both compiling functions. As PCRE has developed, some error codes have
   fallen out of use. To avoid confusion, they have not been re-used.

    0 no error
    1 \ at end of pattern
    2 \c at end of pattern
    3 unrecognized character follows \
    4 numbers out of order in {} quantifier
    5 number too big in {} quantifier
    6 missing terminating ] for character class
    7 invalid escape sequence in character class
    8 range out of order in character class
    9 nothing to repeat
   10 [this code is not in use]
   11 internal error: unexpected repeat
   12 unrecognized character after (?
   13 POSIX named classes are supported only within a class
   14 missing )
   15 reference to non-existent subpattern
   16 erroffset passed as NULL
   17 unknown option bit(s) set
   18 missing ) after comment
   19 [this code is not in use]
   20 regular expression too large
   21 failed to get memory
   22 unmatched parentheses
   23 internal error: code overflow
   24 unrecognized character after (?<
   25 lookbehind assertion is not fixed length
   26 malformed number or name after (?(
   27 conditional group contains more than two branches
   28 assertion expected after (?(
   29 (?R or (?digits must be followed by )
   30 unknown POSIX class name
   31 POSIX collating elements are not supported
   32 this version of PCRE is not compiled with PCRE_UTF8 support
   33 [this code is not in use]
   34 character value in \x{...} sequence is too large
   35 invalid condition (?(0)
   36 \C not allowed in lookbehind assertion
   37 PCRE does not support \L, \l, \N, \U, or \u
   38 number after (?C is > 255
   39 closing ) for (?C expected
   40 recursive call could loop indefinitely
   41 unrecognized character after (?P
   42 syntax error in subpattern name (missing terminator)
   43 two named subpatterns have the same name
   44 invalid UTF-8 string
   45 support for \P, \p, and \X has not been compiled
   46 malformed \P or \p sequence
   47 unknown property name after \P or \p
   48 subpattern name is too long (maximum 32 characters)
   49 too many named subpatterns (maximum 10,000)
   50 repeated subpattern is too long
   51 octal value is greater than \377 (not in UTF-8 mode)
   52 internal error: overran compiling workspace
   53 internal error: previously-checked referenced subpattern not found
   54 DEFINE group contains more than one branch
   55 repeating a DEFINE group is not allowed
   56 inconsistent NEWLINE options

STUDYING A PATTERN

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

   If a compiled pattern is going to be used several times, it is worth
   spending more time analyzing it in order to speed up the time taken for
   matching. The function pcre_study() takes a pointer to a compiled pat-
   tern as its first argument. If studying the pattern produces additional
   information that will help speed up matching, pcre_study() returns a
   pointer to a pcre_extra block, in which the study_data field points to
   the results of the study.

   The returned value from pcre_study() can be passed directly to
   pcre_exec(). However, a pcre_extra block also contains other fields
   that can be set by the caller before the block is passed; these are
   described below in the section on matching a pattern.

   If studying the pattern does not produce any additional information,
   pcre_study() returns NULL. In that circumstance, if the calling program
   wants to pass any of the other fields to pcre_exec(), it must set up
   its own pcre_extra block.

   The second argument of pcre_study() contains option bits. At present,
   no options are defined, and this argument should always be zero.

   The third argument for pcre_study() is a pointer for an error message.
   If studying succeeds (even if no data is returned), the variable it
   points to is set to NULL. Otherwise it is set to point to a textual
   error message. This is a static string that is part of the library. You
   must not try to free it. You should test the error pointer for NULL
   after calling pcre_study(), to be sure that it has run successfully.

   This is a typical call to pcre_study():

     pcre_extra *pe;
     pe = pcre_study(
       re,             /* result of pcre_compile() */
       0,              /* no options exist */
       &error);        /* set to NULL or points to a message */

   At present, studying a pattern is useful only for non-anchored patterns
   that do not have a single fixed starting character. A bitmap of possi-
   ble starting bytes is created.

LOCALE SUPPORT

   PCRE handles caseless matching, and determines whether characters are
   letters, digits, or whatever, by reference to a set of tables, indexed
   by character value. When running in UTF-8 mode, this applies only to
   characters with codes less than 128. Higher-valued codes never match
   escapes such as \w or \d, but can be tested with \p if PCRE is built
   with Unicode character property support. The use of locales with Uni-
   code is discouraged. If you are handling characters with codes greater
   than 128, you should either use UTF-8 and Unicode, or use locales, but
   not try to mix the two.

   PCRE contains an internal set of tables that are used when the final
   argument of pcre_compile() is NULL. These are sufficient for many
   applications. Normally, the internal tables recognize only ASCII char-
   acters. However, when PCRE is built, it is possible to cause the inter-
   nal tables to be rebuilt in the default "C" locale of the local system,
   which may cause them to be different.

   The internal tables can always be overridden by tables supplied by the
   application that calls PCRE. These may be created in a different locale
   from the default. As more and more applications change to using Uni-
   code, the need for this locale support is expected to die away.

   External tables are built by calling the pcre_maketables() function,
   which has no arguments, in the relevant locale. The result can then be
   passed to pcre_compile() or pcre_exec() as often as necessary. For
   example, to build and use tables that are appropriate for the French
   locale (where accented characters with values greater than 128 are
   treated as letters), the following code could be used:

   setlocale(LC_CTYPE, "fr_FR");
   tables = pcre_maketables();
   re = pcre_compile(..., tables);

   The locale name "fr_FR" is used on Linux and other Unix-like systems;
   if you are using Windows, the name for the French locale is "french".

   When pcre_maketables() runs, the tables are built in memory that is
   obtained via pcre_malloc. It is the caller's responsibility to ensure
   that the memory containing the tables remains available for as long as
   it is needed.

   The pointer that is passed to pcre_compile() is saved with the compiled
   pattern, and the same tables are used via this pointer by pcre_study()
   and normally also by pcre_exec(). Thus, by default, for any single pat-
   tern, compilation, studying and matching all happen in the same locale,
   but different patterns can be compiled in different locales.

   It is possible to pass a table pointer or NULL (indicating the use of
   the internal tables) to pcre_exec(). Although not intended for this
   purpose, this facility could be used to match a pattern in a different
   locale from the one in which it was compiled. Passing table pointers at
   run time is discussed below in the section on matching a pattern.

INFORMATION ABOUT A PATTERN

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

   The pcre_fullinfo() function returns information about a compiled
   pattern. It replaces the obsolete pcre_info() function, which is
   nevertheless retained for backwards compatibility (and is documented
   below).

   The first argument for pcre_fullinfo() is a pointer to the compiled
   pattern. The second argument is the result of pcre_study(), or NULL if
   the pattern was not studied. The third argument specifies which piece
   of information is required, and the fourth argument is a pointer to a
   variable to receive the data. The yield of the function is zero for
   success, or one of the following negative numbers:

     PCRE_ERROR_NULL       the argument code was NULL
                           the argument where was NULL
     PCRE_ERROR_BADMAGIC   the "magic number" was not found
     PCRE_ERROR_BADOPTION  the value of what was invalid

   The "magic number" is placed at the start of each compiled pattern as
   a simple check against passing an arbitrary memory pointer. Here is a
   typical call of pcre_fullinfo(), to obtain the length of the compiled
   pattern:

     int rc;
     size_t length;
     rc = pcre_fullinfo(
       re,               /* result of pcre_compile() */
       pe,               /* result of pcre_study(), or NULL */
       PCRE_INFO_SIZE,   /* what is required */
       &length);         /* where to put the data */

   The possible values for the third argument are defined in pcre.h, and
   are as follows:

PCRE_INFO_BACKREFMAX

   Return the number of the highest back reference in the pattern. The
   fourth argument should point to an int variable. Zero is returned if
   there are no back references.

PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.

PCRE_INFO_DEFAULT_TABLES

   Return a pointer to the internal default character tables within PCRE.
   The fourth argument should point to an unsigned char * variable. This
   information call is provided for internal use by the pcre_study() func-
   tion. External callers can cause PCRE to use its internal tables by
   passing a NULL table pointer.

PCRE_INFO_FIRSTBYTE

   Return information about the first byte of any matched string, for a
   non-anchored pattern. The fourth argument should point to an int vari-
   able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
   is still recognized for backwards compatibility.)

If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),

   -1 is returned, indicating that the pattern matches only at the start
   of a subject string or after any newline within the string. Otherwise
   -2 is returned. For anchored patterns, -2 is returned.
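
   The three kinds of result described above can be told apart with a
   small helper. This is only a sketch: classify_first_byte() and the
   enum names are hypothetical and are not part of PCRE. It assumes the
   int passed in was already filled in by a successful
   PCRE_INFO_FIRSTBYTE call.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRST_BYTE_LINE_START,  /* value == -1: start of subject or after newline */
    FIRST_BYTE_NONE         /* value == -2: no information, or anchored */
};

static enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

   For a pattern such as (cat|cow|coyote) the value would be 'c', which
   this helper classifies as FIRST_BYTE_FIXED.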

PCRE_INFO_FIRSTTABLE

   If the pattern was studied, and this resulted in the construction of a
   256-bit table indicating a fixed set of bytes for the first byte in any
   matching string, a pointer to the table is returned. Otherwise NULL is
   returned. The fourth argument should point to an unsigned char * vari-
   able.
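
   A 256-bit table occupies 32 bytes, one bit per possible byte value.
   The sketch below tests one bit in such a table. The bit layout it
   uses (bit c & 7 of element c / 8) is an assumption that the text
   above does not state, and first_byte_allowed() is a hypothetical
   name, so check both against the PCRE sources before relying on them.

```c
/* Returns 1 if byte c is marked in a 32-byte (256-bit) table, else 0.
   ASSUMPTION: byte c is represented by bit (c & 7) of table[c / 8]. */
static int first_byte_allowed(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] >> (c & 7)) & 1;
}
```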

PCRE_INFO_LASTLITERAL

   Return the value of the rightmost literal byte that must exist in any
   matched string, other than at its start, if such a byte has been
   recorded. The fourth argument should point to an int variable. If there
   is no such byte, -1 is returned. For anchored patterns, a last literal
   byte is recorded only if it follows something of variable length. For
   example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
   /^a\dz\d/ the returned value is -1.

   PCRE_INFO_NAMECOUNT
   PCRE_INFO_NAMEENTRYSIZE
   PCRE_INFO_NAMETABLE

   PCRE supports the use of named as well as numbered capturing parenthe-
   ses. The names are just an additional way of identifying the parenthe-
   ses, which still acquire numbers. Several convenience functions such as
   pcre_get_named_substring() are provided for extracting captured sub-
   strings by name. It is also possible to extract the data directly, by
   first converting the name to a number in order to access the correct
   pointers in the output vector (described with pcre_exec() below). To do
   the conversion, you need to use the name-to-number map, which is
   described by these three values.

   The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
   gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
   of each entry; both of these return an int value. The entry size
   depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
   a pointer to the first entry of the table (a pointer to char). The
   first two bytes of each entry are the number of the capturing parenthe-
   sis, most significant byte first. The rest of the entry is the corre-
   sponding name, zero terminated. The names are in alphabetical order.
   When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
   theses numbers. For example, consider the following pattern (assume
   PCRE_EXTENDED is set, so white space - including newlines - is
   ignored):

(?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) )

   There are four named subpatterns, so the table has four entries, and
   each entry in the table is eight bytes long. The table is as follows,
   with non-printing bytes shown in hexadecimal, and undefined bytes shown
   as ??:

   00 01 d a t e 00 ??
   00 05 d a y 00 ?? ??
   00 04 m o n t h 00
   00 02 y e a r 00 ??

   When writing code to extract data from named subpatterns using the
   name-to-number map, remember that the length of the entries is likely
   to be different for each compiled pattern.
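
   As an illustration of the layout described above, the following sketch
   walks a name table by hand. The function name lookup_group_number()
   and its -1 "not found" result are hypothetical. The table pointer,
   entry count and entry size are assumed to have come from
   PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.
   In real code pcre_get_stringnumber() does this job.

```c
#include <string.h>

/* Look up `name` in a name table with `name_count` entries of
   `entry_size` bytes each. Each entry holds a two-byte group number
   (most significant byte first) followed by a zero-terminated name.
   Returns the group number, or -1 if the name is not present. */
static int lookup_group_number(const unsigned char *table, int name_count,
                               int entry_size, const char *name)
{
    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

   A linear scan keeps the sketch short. Because the names are stored in
   alphabetical order, a binary search would also work.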

PCRE_INFO_OPTIONS

   Return a copy of the options with which the pattern was compiled. The
   fourth argument should point to an unsigned long int variable. These
   option bits are those specified in the call to pcre_compile(), modified
   by any top-level option settings within the pattern itself.

A pattern is automatically anchored by PCRE if all of its top-level
alternatives begin with one of the following:

   ^     unless PCRE_MULTILINE is set
   \A    always
   \G    always
   .*    if PCRE_DOTALL is set and there are no back
         references to the subpattern in which .* appears

For such patterns, the PCRE_ANCHORED bit is set in the options returned
by pcre_fullinfo().

PCRE_INFO_SIZE

   Return the size of the compiled pattern, that is, the value that was
   passed as the argument to pcre_malloc() when PCRE was getting memory in
   which to place the compiled data. The fourth argument should point to a
   size_t variable.

PCRE_INFO_STUDYSIZE

   Return the size of the data block pointed to by the study_data field in
   a pcre_extra block. That is, it is the value that was passed to
   pcre_malloc() when PCRE was getting memory into which to place the data
   created by pcre_study(). The fourth argument should point to a size_t
   variable.

OBSOLETE INFO FUNCTION

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

   The pcre_info() function is now obsolete because its interface is too
   restrictive to return all the available data about a compiled pattern.
   New programs should use pcre_fullinfo() instead. The yield of
   pcre_info() is the number of capturing subpatterns, or one of the fol-
   lowing negative numbers:

   PCRE_ERROR_NULL       the argument code was NULL
   PCRE_ERROR_BADMAGIC   the "magic number" was not found

   If the optptr argument is not NULL, a copy of the options with which
   the pattern was compiled is placed in the integer it points to (see
   PCRE_INFO_OPTIONS above).

   If the pattern is not anchored and the firstcharptr argument is not
   NULL, it is used to pass back information about the first character of
   any matched string (see PCRE_INFO_FIRSTBYTE above).

REFERENCE COUNTS

int pcre_refcount(pcre *code, int adjust);

   The pcre_refcount() function is used to maintain a reference count in
   the data block that contains a compiled pattern. It is provided for the
   benefit of applications that operate in an object-oriented manner,
   where different parts of the application may be using the same compiled
   pattern, but you want to free the block when they are all done.

   When a pattern is compiled, the reference count field is initialized to
   zero. It is changed only by calling this function, whose action is to
   add the adjust value (which may be positive or negative) to it. The
   yield of the function is the new value. However, the value of the count
   is constrained to lie between 0 and 65535, inclusive. If the new value
   is outside these limits, it is forced to the appropriate limit value.
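
   The clamping rule can be written out as follows. This is a sketch of
   the documented behaviour only, not PCRE's implementation, and
   adjusted_refcount() is a hypothetical name.

```c
/* Add `adjust` to `current` and force the result into 0..65535,
   as described for pcre_refcount(). */
static int adjusted_refcount(int current, int adjust)
{
    long long next = (long long)current + adjust;  /* avoid int overflow */
    if (next < 0)
        next = 0;
    if (next > 65535)
        next = 65535;
    return (int)next;
}
```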

   Except when it is zero, the reference count is not correctly preserved
   if a pattern is compiled on one host and then transferred to a host
   whose byte-order is different. (This seems a highly unlikely scenario.)

MATCHING A PATTERN: THE TRADITIONAL FUNCTION

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

   The function pcre_exec() is called to match a subject string against a
   compiled pattern, which is passed in the code argument. If the pattern
   has been studied, the result of the study should be passed in the extra
   argument. This function is the main matching facility of the library,
   and it operates in a Perl-like manner. For specialist use there is also
   an alternative matching function, which is described below in the sec-
   tion about the pcre_dfa_exec() function.

   In most applications, the pattern will have been compiled (and option-
   ally studied) in the same process that calls pcre_exec(). However, it
   is possible to save compiled patterns and study data, and then use them
   later in different processes, possibly even on different hosts. For a
   discussion about this, see the pcreprecompile documentation.

Here is an example of a simple call to pcre_exec():

   int rc;
   int ovector[30];
   rc = pcre_exec(
     re,             /* result of pcre_compile() */
     NULL,           /* we didn't study the pattern */
     "some string",  /* the subject string */
     11,             /* the length of the subject string */
     0,              /* start at offset 0 in the subject */
     0,              /* default options */
     ovector,        /* vector of integers for substring information */
     30);            /* number of elements (NOT size in bytes) */

Extra data for pcre_exec()

   If the extra argument is not NULL, it must point to a pcre_extra data
   block. The pcre_study() function returns such a block (when it doesn't
   return NULL), but you can also create one for yourself, and pass addi-
   tional information in it. The pcre_extra block contains the following
   fields (not necessarily in this order):

   unsigned long int flags;
   void *study_data;
   unsigned long int match_limit;
   unsigned long int match_limit_recursion;
   void *callout_data;
   const unsigned char *tables;

The flags field is a bitmap that specifies which of the other fields
are set. The flag bits are:

   PCRE_EXTRA_STUDY_DATA
   PCRE_EXTRA_MATCH_LIMIT
   PCRE_EXTRA_MATCH_LIMIT_RECURSION
   PCRE_EXTRA_CALLOUT_DATA
   PCRE_EXTRA_TABLES

   Other flag bits should be set to zero. The study_data field is set in
   the pcre_extra block that is returned by pcre_study(), together with
   the appropriate flag bit. You should not set this yourself, but you may
   add to the block by setting the other fields and their corresponding
   flag bits.

   The match_limit field provides a means of preventing PCRE from using up
   a vast amount of resources when running patterns that are not going to
   match, but which have a very large number of possibilities in their
   search trees. The classic example is the use of nested unlimited
   repeats.

   Internally, PCRE uses a function called match() which it calls repeat-
   edly (sometimes recursively). The limit set by match_limit is imposed
   on the number of times this function is called during a match, which
   has the effect of limiting the amount of backtracking that can take
   place. For patterns that are not anchored, the count restarts from zero
   for each position in the subject string.

   The default value for the limit can be set when PCRE is built; the
   built-in default is 10 million, which handles all but the most extreme
   cases. You can override the default by supplying pcre_exec() with a
   pcre_extra block in which match_limit is set, and
   PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
   exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

   The match_limit_recursion field is similar to match_limit, but instead
   of limiting the total number of times that match() is called, it limits
   the depth of recursion. The recursion depth is a smaller number than
   the total number of calls, because not all calls to match() are recur-
   sive. This limit is of use only if it is set smaller than match_limit.

   Limiting the recursion depth limits the amount of stack that can be
   used, or, when PCRE has been compiled to use memory on the heap instead
   of the stack, the amount of heap memory that can be used.

   The default value for match_limit_recursion can be set when PCRE is
   built; the built-in default is the same value as the default for
   match_limit. You can override the default by supplying pcre_exec() with
   a pcre_extra block in which match_limit_recursion is set, and
   PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
   limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.

   The callout_data field is used in conjunction with the "callout"
   feature, which is described in the pcrecallout documentation.

   The tables field is used to pass a character tables pointer to
   pcre_exec(); this overrides the value that is stored with the compiled
   pattern. A non-NULL value is stored with the compiled pattern only if
   custom tables were supplied to pcre_compile() via its tableptr argu-
   ment. If NULL is passed to pcre_exec() using this mechanism, it forces
   PCRE's internal tables to be used. This facility is helpful when re-
   using patterns that have been saved after compiling with an external
   set of tables, because the external tables might be at a different
   address when pcre_exec() is called. See the pcreprecompile documenta-
   tion for a discussion of saving compiled patterns for later use.

Option bits for pcre_exec()

   The unused bits of the options argument for pcre_exec() must be zero.
   The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
   PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and
   PCRE_PARTIAL.

PCRE_ANCHORED

   The PCRE_ANCHORED option limits pcre_exec() to matching at the first
   matching position. If a pattern was compiled with PCRE_ANCHORED, or
   turned out to be anchored by virtue of its contents, it cannot be made
   unanchored at matching time.

   PCRE_NEWLINE_CR
   PCRE_NEWLINE_LF
   PCRE_NEWLINE_CRLF
   PCRE_NEWLINE_ANYCRLF
   PCRE_NEWLINE_ANY

   These options override the newline definition that was chosen or
   defaulted when the pattern was compiled. For details, see the descrip-
   tion of pcre_compile() above. During matching, the newline choice
   affects the behaviour of the dot, circumflex, and dollar metacharac-
   ters. It may also alter the way the match position is advanced after a
   match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF,
   PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt
   fails when the current position is at a CRLF sequence, the match posi-
   tion is advanced by two characters instead of one, in other words, to
   after the CRLF.

PCRE_NOTBOL

   This option specifies that the first character of the subject string
   is not the beginning of a line, so the circumflex metacharacter should
   not match before it. Setting this without PCRE_MULTILINE (at compile
   time) causes circumflex never to match. This option affects only the
   behaviour of the circumflex metacharacter. It does not affect \A.

PCRE_NOTEOL

   This option specifies that the end of the subject string is not the end
   of a line, so the dollar metacharacter should not match it nor (except
   in multiline mode) a newline immediately before it. Setting this with-
   out PCRE_MULTILINE (at compile time) causes dollar never to match. This
   option affects only the behaviour of the dollar metacharacter. It does
   not affect \Z or \z.

PCRE_NOTEMPTY

   An empty string is not considered to be a valid match if this option is
   set. If there are alternatives in the pattern, they are tried. If all
   the alternatives match the empty string, the entire match fails. For
   example, if the pattern

a?b?

   is applied to a string not beginning with "a" or "b", it matches the
   empty string at the start of the subject. With PCRE_NOTEMPTY set, this
   match is not valid, so PCRE searches further into the string for occur-
   rences of "a" or "b".

   Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
   cial case of a pattern match of the empty string within its split()
   function, and when using the /g modifier. It is possible to emulate
   Perl's behaviour after matching a null string by first trying the match
   again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
   if that fails by advancing the starting offset (see below) and trying
   an ordinary match again. There is some code that demonstrates how to do
   this in the pcredemo.c sample program.

PCRE_NO_UTF8_CHECK

   When PCRE_UTF8 is set at compile time, the validity of the subject as a
   UTF-8 string is automatically checked when pcre_exec() is subsequently
   called. The value of startoffset is also checked to ensure that it
   points to the start of a UTF-8 character. If an invalid UTF-8 sequence
   of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
   startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is
   returned.

   If you already know that your subject is valid, and you want to skip
   these checks for performance reasons, you can set the
   PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
   do this for the second and subsequent calls to pcre_exec() if you are
   making repeated calls to find all the matches in a single subject
   string. However, you should be sure that the value of startoffset
   points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
   set, the effect of passing an invalid UTF-8 string as a subject, or a
   value of startoffset that does not point to the start of a UTF-8 char-
   acter, is undefined. Your program may crash.

PCRE_PARTIAL

   This option turns on the partial matching feature. If the subject
   string fails to match the pattern, but at some point during the match-
   ing process the end of the subject was reached (that is, the subject
   partially matches the pattern and the failure to match occurred only
   because there were not enough subject characters), pcre_exec() returns
   PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
   used, there are restrictions on what may appear in the pattern. These
   are discussed in the pcrepartial documentation.

The string to be matched by pcre_exec()

   The subject string is passed to pcre_exec() as a pointer in subject, a
   length in length, and a starting byte offset in startoffset. In UTF-8
   mode, the byte offset must point to the start of a UTF-8 character.
   Unlike the pattern string, the subject may contain binary zero bytes.
   When the starting offset is zero, the search for a match starts at the
   beginning of the subject, and this is by far the most common case.

   A non-zero starting offset is useful when searching for another match
   in the same subject by calling pcre_exec() again after a previous suc-
   cess. Setting startoffset differs from just passing over a shortened
   string and setting PCRE_NOTBOL in the case of a pattern that begins
   with any kind of lookbehind. For example, consider the pattern

\Biss\B

   which finds occurrences of "iss" in the middle of words. (\B matches
   only if the current position in the subject is not a word boundary.)
   When applied to the string "Mississippi" the first call to pcre_exec()
   finds the first occurrence. If pcre_exec() is called again with just
   the remainder of the subject, namely "issippi", it does not match,
   because \B is always false at the start of the subject, which is deemed
   to be a word boundary. However, if pcre_exec() is passed the entire
   string again, but with startoffset set to 4, it finds the second
   occurrence of "iss" because it is able to look behind the starting
   point to discover that it is preceded by a letter.
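
   The difference can be reproduced without PCRE by emulating \Biss\B by
   hand. The sketch below is not how PCRE works internally; not_boundary()
   and find_inner_iss() are hypothetical helpers, and "word character" is
   taken to mean an ASCII letter, digit or underscore. It shows only why
   looking behind the start offset changes the result.

```c
#include <ctype.h>
#include <string.h>

static int is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True when `pos` is NOT a word boundary within s[0..length). Positions
   before the start and after the end count as non-word characters. */
static int not_boundary(const char *s, int length, int pos)
{
    int before = pos > 0 && is_word((unsigned char)s[pos - 1]);
    int after = pos < length && is_word((unsigned char)s[pos]);
    return before == after;
}

/* Find the first "iss" at or after `start` that has no word boundary on
   either side. Returns its offset, or -1 if there is none. */
static int find_inner_iss(const char *s, int length, int start)
{
    for (int i = start; i + 3 <= length; i++) {
        if (memcmp(s + i, "iss", 3) == 0 &&
            not_boundary(s, length, i) &&
            not_boundary(s, length, i + 3))
            return i;
    }
    return -1;
}
```

   Searching the whole of "Mississippi" from offset 4 succeeds because
   the byte at offset 3 is visible, whereas searching the shortened
   string "issippi" from offset 0 fails.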

   If a non-zero starting offset is passed when the pattern is anchored,
   one attempt to match at the given offset is made. This can only succeed
   if the pattern does not require the match to be at the start of the
   subject.

How pcre_exec() returns captured substrings

   In general, a pattern matches a certain portion of the subject, and in
   addition, further substrings from the subject may be picked out by
   parts of the pattern. Following the usage in Jeffrey Friedl's book,
   this is called "capturing" in what follows, and the phrase "capturing
   subpattern" is used for a fragment of a pattern that picks out a sub-
   string. PCRE supports several other kinds of parenthesized subpattern
   that do not cause substrings to be captured.

   Captured substrings are returned to the caller via a vector of integer
   offsets whose address is passed in ovector. The number of elements in
   the vector is passed in ovecsize, which must be a non-negative number.
   Note: this argument is NOT the size of ovector in bytes.

   The first two-thirds of the vector is used to pass back captured sub-
   strings, each substring using a pair of integers. The remaining third
   of the vector is used as workspace by pcre_exec() while matching cap-
   turing subpatterns, and is not available for passing back information.
   The length passed in ovecsize should always be a multiple of three. If
   it is not, it is rounded down.

   When a match is successful, information about captured substrings is
   returned in pairs of integers, starting at the beginning of ovector,
   and continuing up to two-thirds of its length at the most. The first
   element of a pair is set to the offset of the first character in a sub-
   string, and the second is set to the offset of the first character
   after the end of a substring. The first pair, ovector[0] and ovec-
   tor[1], identify the portion of the subject string matched by the
   entire pattern. The next pair is used for the first capturing subpat-
   tern, and so on. The value returned by pcre_exec() is one more than the
   highest numbered pair that has been set. For example, if two substrings
   have been captured, the returned value is 3. If there are no capturing
   subpatterns, the return value from a successful match is 1, indicating
   that just the first pair of offsets has been set.

If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.

   If the vector is too small to hold all the captured substring offsets,
   it is used as far as possible (up to two-thirds of its length), and the
   function returns a value of zero. In particular, if the substring off-
   sets are not of interest, pcre_exec() may be called with ovector passed
   as NULL and ovecsize as zero. However, if the pattern contains back
   references and the ovector is not big enough to remember the related
   substrings, PCRE has to get additional memory for use during matching.
   Thus it is usually advisable to supply an ovector.

   The pcre_fullinfo() function, called with PCRE_INFO_CAPTURECOUNT, can
   be used to find out how many capturing subpatterns there are in a
   compiled pattern. The smallest size for ovector that will allow for n
   captured substrings, in addition to the offsets of the substring
   matched by the whole pattern, is (n+1)*3.

   It is possible for capturing subpattern number n+1 to match some part
   of the subject when subpattern n has not been used at all. For example,
   if the string "abc" is matched against the pattern (a|(z))(bc) the
   return from the function is 4, and subpatterns 1 and 3 are matched, but
   2 is not. When this happens, both values in the offset pairs corre-
   sponding to unused subpatterns are set to -1.

   Offset values that correspond to unused subpatterns at the end of the
   expression are also set to -1. For example, if the string "abc" is
   matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
   matched. The return from the function is 2, because the highest used
   capturing subpattern number is 1. However, you can refer to the offsets
   for the second and third capturing subpatterns if you wish (assuming
   the vector is large enough, of course).
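
   The offset pairs can be read directly, as in the following sketch. The
   helper copy_capture() and its return codes (-1 for an unset or
   out-of-range group, -2 for a buffer that is too small) are hypothetical
   and deliberately differ from pcre_copy_substring(). The ovector
   contents are assumed to have been filled in by a successful
   pcre_exec() call.

```c
#include <string.h>

/* Copy capture number `n` out of `subject` using the offset pairs in
   `ovector`. `pair_count` is the number of pairs that may be inspected.
   Returns the length copied, -1 if the group is unset or out of range,
   or -2 if `buf` cannot hold the text plus a terminating zero. */
static int copy_capture(const char *subject, const int *ovector,
                        int pair_count, int n, char *buf, int bufsize)
{
    if (n < 0 || n >= pair_count)
        return -1;
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    if (start < 0)              /* unset groups have both offsets set to -1 */
        return -1;
    int len = end - start;
    if (len + 1 > bufsize)
        return -2;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

   For "abc" matched against (a|(z))(bc), the pairs are 0,3 for the whole
   match, 0,1 for group 1, -1,-1 for group 2 and 1,3 for group 3.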

Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.

Error return values from pcre_exec()

If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

PCRE_ERROR_NOMATCH (-1)

The subject string did not match the pattern.

PCRE_ERROR_NULL (-2)

Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.

PCRE_ERROR_BADOPTION (-3)

An unrecognized bit was set in the options argument.

PCRE_ERROR_BADMAGIC (-4)

   PCRE stores a 4-byte "magic number" at the start of the compiled code,
   to catch the case when it is passed a junk pointer and to detect when a
   pattern that was compiled in an environment of one endianness is run in
   an environment with the other endianness. This is the error that PCRE
   gives when the magic number is not present.

PCRE_ERROR_UNKNOWN_OPCODE (-5)

   While running the pattern match, an unknown item was encountered in the
   compiled pattern. This error could be caused by a bug in PCRE or by
   overwriting of the compiled pattern.

PCRE_ERROR_NOMEMORY (-6)

   If a pattern contains back references, but the ovector that is passed
   to pcre_exec() is not big enough to remember the referenced substrings,
   PCRE gets a block of memory at the start of matching to use for this
   purpose. If the call via pcre_malloc() fails, this error is given. The
   memory is automatically freed at the end of matching.

PCRE_ERROR_NOSUBSTRING (-7)

   This error is used by the pcre_copy_substring(), pcre_get_substring(),
   and pcre_get_substring_list() functions (see below). It is never
   returned by pcre_exec().

PCRE_ERROR_MATCHLIMIT (-8)

   The backtracking limit, as specified by the match_limit field in a
   pcre_extra structure (or defaulted) was reached. See the description
   above.

PCRE_ERROR_CALLOUT (-9)

   This error is never generated by pcre_exec() itself. It is provided for
   use by callout functions that want to yield a distinctive error code.
   See the pcrecallout documentation for details.

PCRE_ERROR_BADUTF8 (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject.

PCRE_ERROR_BADUTF8_OFFSET (-11)

   The UTF-8 byte sequence that was passed as a subject was valid, but the
   value of startoffset did not point to the beginning of a UTF-8 charac-
   ter.

PCRE_ERROR_PARTIAL (-12)

The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.

PCRE_ERROR_BADPARTIAL (-13)

   The PCRE_PARTIAL option was used with a compiled pattern containing
   items that are not supported for partial matching. See the pcrepartial
   documentation for details of partial matching.

PCRE_ERROR_INTERNAL (-14)

An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.

PCRE_ERROR_BADCOUNT (-15)

This error is given if the value of the ovecsize argument is negative.

PCRE_ERROR_RECURSIONLIMIT (-21)

   The internal recursion limit, as specified by the match_limit_recursion
   field in a pcre_extra structure (or defaulted) was reached. See the
   description above.

PCRE_ERROR_NULLWSLIMIT (-22)

   When a group that can match an empty substring is repeated with an
   unbounded upper limit, the subject position at the start of the group
   must be remembered, so that a test for an empty string can be made when
   the end of the group is reached. Some workspace is required for this;
   if it runs out, this error is given.

PCRE_ERROR_BADNEWLINE (-23)

An invalid combination of PCRE_NEWLINE_xxx options was given.

Error numbers -16 to -20 are not used by pcre_exec().
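
   When logging failures it can help to turn these codes into text. The
   sketch below uses the numeric values listed above rather than the
   names from pcre.h so that it stands alone. exec_error_text() and its
   message strings are hypothetical, and real code would normally switch
   on the PCRE_ERROR_xxx names.

```c
/* Map a negative pcre_exec() return value to a short description,
   using the numeric values listed in this section. */
static const char *exec_error_text(int rc)
{
    switch (rc) {
    case -1:  return "no match";
    case -2:  return "NULL argument";
    case -3:  return "bad option bit";
    case -4:  return "bad magic number";
    case -5:  return "unknown opcode";
    case -6:  return "out of memory";
    case -7:  return "no such substring";
    case -8:  return "match limit reached";
    case -9:  return "callout error";
    case -10: return "invalid UTF-8 in subject";
    case -11: return "start offset inside a UTF-8 character";
    case -12: return "partial match";
    case -13: return "pattern not usable for partial matching";
    case -14: return "internal error";
    case -15: return "negative ovecsize";
    case -21: return "recursion limit reached";
    case -22: return "null workspace limit reached";
    case -23: return "bad newline options";
    default:  return rc >= 0 ? "not an error" : "unknown error";
    }
}
```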

EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

   Captured substrings can be accessed directly by using the offsets
   returned by pcre_exec() in ovector. For convenience, the functions
   pcre_copy_substring(), pcre_get_substring(), and
   pcre_get_substring_list() are provided for extracting captured
   substrings as new, separate, zero-terminated strings. These functions
   identify substrings by number. The next section describes functions
   for extracting named substrings.

   A substring that contains a binary zero is correctly extracted and has
   a further zero added on the end, but the result is not, of course, a C
   string. However, you can process such a string by referring to the
   length that is returned by pcre_copy_substring() and pcre_get_sub-
   string(). Unfortunately, the interface to pcre_get_substring_list() is
   not adequate for handling strings containing binary zeros, because the
   end of the final string is not independently indicated.

   The first three arguments are the same for all three of these func-
   tions: subject is the subject string that has just been successfully
   matched, ovector is a pointer to the vector of integer offsets that was
   passed to pcre_exec(), and stringcount is the number of substrings that
   were captured by the match, including the substring that matched the
   entire regular expression. This is the value returned by pcre_exec() if
   it is greater than zero. If pcre_exec() returned zero, indicating that
   it ran out of space in ovector, the value passed as stringcount should
   be the number of elements in the vector divided by three.
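
   The rule for choosing stringcount can be captured in a few lines. This
   is a sketch: substring_count() is a hypothetical helper, and returning
   0 for a negative pcre_exec() result is this example's own convention.

```c
/* Work out the stringcount argument from the pcre_exec() return value
   `rc` and the number of elements `ovecsize` in the offset vector. */
static int substring_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;  /* vector was too small: use what fits */
    return 0;                 /* error or no match: nothing captured */
}
```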

   The functions pcre_copy_substring() and pcre_get_substring() extract a
   single substring, whose number is given as stringnumber. A value of
   zero extracts the substring that matched the entire pattern, whereas
   higher values extract the captured substrings. For pcre_copy_sub-
   string(), the string is placed in buffer, whose length is given by
   buffersize, while for pcre_get_substring() a new block of memory is
   obtained via pcre_malloc, and its address is returned via stringptr.
   The yield of the function is the length of the string, not including
   the terminating zero, or one of these error codes:

PCRE_ERROR_NOMEMORY (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

PCRE_ERROR_NOSUBSTRING (-7)

There is no substring whose number is stringnumber.

   The pcre_get_substring_list() function extracts all available sub-
   strings and builds a list of pointers to them. All this is done in a
   single block of memory that is obtained via pcre_malloc. The address of
   the memory block is returned via listptr, which is also the start of
   the list of string pointers. The end of the list is marked by a NULL
   pointer. The yield of the function is zero if all went well, or the
   error code

PCRE_ERROR_NOMEMORY (-6)

if the attempt to get the memory block failed.

   When any of these functions encounter a substring that is unset, which
   can happen when capturing subpattern number n+1 matches some part of
   the subject, but subpattern n has not been used at all, they return an
   empty string. This can be distinguished from a genuine zero-length sub-
   string by inspecting the appropriate offset in ovector, which is nega-
   tive for unset substrings.

   The two convenience functions pcre_free_substring() and pcre_free_sub-
   string_list() can be used to free the memory returned by a previous
   call of pcre_get_substring() or pcre_get_substring_list(), respec-
   tively. They do nothing more than call the function pointed to by
   pcre_free, which of course could be called directly from a C program.
   However, PCRE is used in some situations where it is linked via a spe-
   cial interface to another programming language that cannot use
   pcre_free directly; it is for these cases that the functions are pro-
   vided.

EXTRACTING CAPTURED SUBSTRINGS BY NAME

int pcre_get_stringnumber(const pcre *code,
     const char *name);

int pcre_copy_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     char *buffer, int buffersize);

int pcre_get_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     const char **stringptr);

To extract a substring by name, you first have to find the associated
number. For example, for this pattern

(a+)b(?<xxx>\d+)...

   the number of the subpattern called "xxx" is 2. If the name is known to
   be unique (PCRE_DUPNAMES was not set), you can find the number from the
   name by calling pcre_get_stringnumber(). The first argument is the com-
   piled pattern, and the second is the name. The yield of the function is
   the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
   subpattern of that name.

   Given the number, you can extract the substring directly, or use one of
   the functions described in the previous section. For convenience, there
   are also two functions that do the whole job.

   Most of the arguments of pcre_copy_named_substring() and
   pcre_get_named_substring() are the same as those for the similarly
   named functions that extract by number. As these are described in the
   previous section, they are not re-described here. There are just two
   differences:

   First, instead of a substring number, a substring name is given. Sec-
   ond, there is an extra argument, given at the start, which is a pointer
   to the compiled pattern. This is needed in order to gain access to the
   name-to-number translation table.

   These functions call pcre_get_stringnumber(), and if it succeeds, they
   then call pcre_copy_substring() or pcre_get_substring(), as appropri-
   ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
   behaviour may not be what you want (see the next section).

DUPLICATE SUBPATTERN NAMES

int pcre_get_stringtable_entries(const pcre *code,
    const char *name, char **first, char **last);

   When a pattern is compiled with the PCRE_DUPNAMES option, names for
   subpatterns are not required to be unique. Normally, patterns with
   duplicate names are such that in any one match, only one of the named
   subpatterns participates. An example is shown in the pcrepattern docu-
   mentation. When duplicates are present, pcre_copy_named_substring() and
   pcre_get_named_substring() return the first substring corresponding to
   the given name that is set. If none are set, an empty string is
   returned. The pcre_get_stringnumber() function returns one of the num-
   bers that are associated with the name, but it is not defined which it
   is.

   If you want to get full details of all captured substrings for a given
   name, you must use the pcre_get_stringtable_entries() function. The
   first argument is the compiled pattern, and the second is the name. The
   third and fourth are pointers to variables which are updated by the
   function. After it has run, they point to the first and last entries in
   the name-to-number table for the given name. The function itself
   returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
   there are none. The format of the table is described above in the sec-
   tion entitled Information about a pattern. Given all the relevant
   entries for the name, you can extract each of their numbers, and hence
   the captured data, if any.
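
   Walking the entries between first and last can be sketched as follows.
   The sketch assumes the table layout given in the section it refers to:
   each entry starts with the two-byte subpattern number, most significant
   byte first, followed by the zero-terminated name. The function names
   here are hypothetical and are not part of the PCRE API.

```c
/* Hypothetical helpers, not part of PCRE. They assume each name table
   entry is: 2 bytes of group number (most significant byte first),
   then the zero-terminated name, padded to entry_size bytes. */
int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    return (p[0] << 8) | p[1];
}

/* Collects the group numbers of all entries from first to last
   (inclusive), stepping by entry_size, which is the value returned by
   pcre_get_stringtable_entries(). Returns how many were stored. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *out, int max_out)
{
    int count = 0;
    const char *entry;

    for (entry = first; entry <= last && count < max_out;
         entry += entry_size)
        out[count++] = entry_group_number(entry);
    return count;
}
```

   Each number obtained this way can then be used with the functions that
   extract substrings by number, skipping any that turn out to be unset.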

FINDING ALL POSSIBLE MATCHES

   The traditional matching function uses a similar algorithm to Perl,
   which stops when it finds the first match, starting at a given point in
   the subject. If you want to find all possible matches, or the longest
   possible match, consider using the alternative matching function (see
   below) instead. If you cannot use the alternative function, but still
   need to find all possible matches, you can kludge it up by making use
   of the callout facility, which is described in the pcrecallout documen-
   tation.

   What you have to do is to insert a callout right at the end of the pat-
   tern. When your callout function is called, extract and save the cur-
   rent matched substring. Then return 1, which forces pcre_exec() to
   backtrack and try other alternatives. Ultimately, when it runs out of
   matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
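
   A callout function of that kind might look like the sketch below. To
   keep it self-contained it does not include pcre.h; struct callout_view
   is a hypothetical stand-in that carries only three of the fields of the
   real pcre_callout_block (subject, start_match and current_position),
   and the fixed-size storage is an arbitrary choice for illustration.

```c
#include <string.h>

/* Hypothetical stand-in for the few pcre_callout_block fields used. */
struct callout_view {
    const char *subject;
    int start_match;
    int current_position;
};

#define MAX_SAVED 16
#define MAX_LEN 64

static char saved[MAX_SAVED][MAX_LEN];
static int saved_count = 0;

/* Saves the text matched so far, then returns 1 so that the matcher
   treats this point as a failure and tries other alternatives. */
int collect_match(const struct callout_view *cb)
{
    int len = cb->current_position - cb->start_match;

    if (saved_count < MAX_SAVED && len >= 0 && len < MAX_LEN) {
        memcpy(saved[saved_count], cb->subject + cb->start_match,
               (size_t)len);
        saved[saved_count][len] = '\0';
        saved_count++;
    }
    return 1;
}
```

   In a real program the same logic would take a pcre_callout_block
   pointer and be installed through the pcre_callout variable.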

MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
    const char *subject, int length, int startoffset,
    int options, int *ovector, int ovecsize,
    int *workspace, int wscount);

   The function pcre_dfa_exec() is called to match a subject string
   against a compiled pattern, using a matching algorithm that scans the
   subject string just once, and does not backtrack. This has different
   characteristics to the normal algorithm, and is not compatible with
   Perl. Some of the features of PCRE patterns are not supported. Never-
   theless, there are times when this kind of matching can be useful. For
   a discussion of the two matching algorithms, see the pcrematching docu-
   mentation.

   The arguments for the pcre_dfa_exec() function are the same as for
   pcre_exec(), plus two extras. The ovector argument is used in a differ-
   ent way, and this is described below. The other common arguments are
   used in the same way as for pcre_exec(), so their description is not
   repeated here.

   The two additional arguments provide workspace for the function. The
   workspace vector should contain at least 20 elements. It is used for
   keeping track of multiple paths through the pattern tree. More
   workspace will be needed for patterns and subjects where there are a
   lot of potential matches.

Here is an example of a simple call to pcre_dfa_exec():

   int rc;
   int ovector[10];
   int wspace[20];
   rc = pcre_dfa_exec(
     re,             /* result of pcre_compile() */
     NULL,           /* we didn't study the pattern */
     "some string",  /* the subject string */
     11,             /* the length of the subject string */
     0,              /* start at offset 0 in the subject */
     0,              /* default options */
     ovector,        /* vector of integers for substring information */
     10,             /* number of elements (NOT size in bytes) */
     wspace,         /* working space vector */
     20);            /* number of elements (NOT size in bytes) */

Option bits for pcre_dfa_exec()

   The unused bits of the options argument for pcre_dfa_exec() must be
   zero. The only bits that may be set are PCRE_ANCHORED,
   PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
   PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, PCRE_DFA_SHORTEST, and
   PCRE_DFA_RESTART. All but the last three of these are the same as for
   pcre_exec(), so their description is not repeated here.

PCRE_PARTIAL

   This has the same general effect as it does for pcre_exec(), but the
   details are slightly different. When PCRE_PARTIAL is set for
   pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into
   PCRE_ERROR_PARTIAL if the end of the subject is reached, there have
   been no complete matches, but there is still at least one matching pos-
   sibility. The portion of the string that provided the partial match is
   set as the first matching string.

PCRE_DFA_SHORTEST

   Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
   stop as soon as it has found one match. Because of the way the alterna-
   tive algorithm works, this is necessarily the shortest possible match
   at the first possible matching point in the subject string.

PCRE_DFA_RESTART

   When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and
   returns a partial match, it is possible to call it again, with
   additional subject characters, and have it continue with the same
   match. The PCRE_DFA_RESTART option requests this action; when it is
   set, the workspace and wscount arguments must reference the same
   vector as before because data about the match so far is left in the
   workspace after a partial match. There is more discussion of this
   facility in the pcrepartial documentation.

Successful returns from pcre_dfa_exec()

   When pcre_dfa_exec() succeeds, it may have matched more than one sub-
   string in the subject. Note, however, that all the matches from one run
   of the function start at the same point in the subject. The shorter
   matches are all initial substrings of the longer matches. For example,
   if the pattern

<.*>

is matched against the string

This is <something> <something else> <something further> no more

the three matched strings are

<something>
<something> <something else>
<something> <something else> <something further>

   On success, the yield of the function is a number greater than zero,
   which is the number of matched substrings. The substrings themselves
   are returned in ovector. Each string uses two elements; the first is
   the offset to the start, and the second is the offset to the end. In
   fact, all the strings have the same start offset. (Space could have
   been saved by giving this only once, but it was decided to retain some
   compatibility with the way pcre_exec() returns data, even though the
   meaning of the strings is different.)

   The strings are returned in reverse order of length; that is, the long-
   est matching string is given first. If there were too many matches to
   fit into ovector, the yield of the function is zero, and the vector is
   filled with the longest matches.
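
   Reading the results back can be sketched as below. The helper names
   are hypothetical and are not part of the PCRE API. The sketch assumes
   that a zero return means every pair in ovector has been filled, so
   ovecsize / 2 matches are available.

```c
/* Hypothetical helpers, not part of PCRE, for interpreting what
   pcre_dfa_exec() leaves in ovector. */

/* Number of matches available in ovector for a given return code.
   rc > 0: that many matches; rc == 0: the vector overflowed and is
   assumed to hold ovecsize / 2 of the longest matches; rc < 0: none. */
int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 2;
    return 0;
}

/* Length of the i-th match; i == 0 is the longest one. */
int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

   With the <.*> example above, the three pairs would all start at offset
   8 and describe matches of length 48, 28 and 11, in that order.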

Error returns from pcre_dfa_exec()

   The pcre_dfa_exec() function returns a negative number when it fails.
   Many of the errors are the same as for pcre_exec(), and these are
   described above. There are in addition the following errors that are
   specific to pcre_dfa_exec():

PCRE_ERROR_DFA_UITEM (-16)

   This return is given if pcre_dfa_exec() encounters an item in the pat-
   tern that it does not support, for instance, the use of \C or a back
   reference.

PCRE_ERROR_DFA_UCOND (-17)

   This return is given if pcre_dfa_exec() encounters a condition item
   that uses a back reference for the condition, or a test for recursion
   in a specific group. These are not supported.

PCRE_ERROR_DFA_UMLIMIT (-18)

   This return is given if pcre_dfa_exec() is called with an extra block
   that contains a setting of the match_limit field. This is not supported
   (it is meaningless).

PCRE_ERROR_DFA_WSSIZE (-19)

This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.

PCRE_ERROR_DFA_RECURSE (-20)

   When a recursive subpattern is processed, the matching function calls
   itself recursively, using private vectors for ovector and workspace.
   This error is given if the output vector is not large enough. This
   should be extremely rare, as a vector of size 1000 is used.

SEE ALSO

pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
pcrestack(3).

AUTHOR

   Philip Hazel
   University Computing Service
   Cambridge CB2 3QH, England.

REVISION

Last updated: 24 April 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------

PCRECALLOUT(3) PCRECALLOUT(3)

NAME
PCRE - Perl-compatible regular expressions

PCRE CALLOUTS

int (*pcre_callout)(pcre_callout_block *);

   PCRE provides a feature called "callout", which is a means of temporar-
   ily passing control to the caller of PCRE in the middle of pattern
   matching. The caller of PCRE provides an external function by putting
   its entry point in the global variable pcre_callout. By default, this
   variable contains NULL, which disables all calling out.

   Within a regular expression, (?C) indicates the points at which the
   external function is to be called. Different callout points can be
   identified by putting a number less than 256 after the letter C. The
   default value is zero. For example, this pattern has two callout
   points:

(?C1)abc(?C2)def

   If the PCRE_AUTO_CALLOUT option bit is set when pcre_compile() is
   called, PCRE automatically inserts callouts, all with number 255,
   before each item in the pattern. For example, if PCRE_AUTO_CALLOUT is
   used with the pattern

A(\d{2}|--)

it is processed as if it were

(?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)

   Notice that there is a callout before and after each parenthesis and
   alternation bar. Automatic callouts can be used for tracking the
   progress of pattern matching. The pcretest command has an option that
   sets automatic callouts; when it is used, the output indicates how the
   pattern is matched. This is useful information when you are trying to
   optimize the performance of a particular pattern.

MISSING CALLOUTS

   You should be aware that, because of optimizations in the way PCRE
   matches patterns, callouts sometimes do not happen. For example, if the
   pattern is

ab(?C4)cd

   PCRE knows that any matching string must contain the letter "d". If the
   subject string is "abyz", the lack of "d" means that matching doesn't
   ever start, and the callout is never reached. However, with "abyd",
   though the result is still no match, the callout is obeyed.

THE CALLOUT INTERFACE

   During matching, when PCRE reaches a callout point, the external func-
   tion defined by pcre_callout is called (if it is set). This applies to
   both the pcre_exec() and the pcre_dfa_exec() matching functions. The
   only argument to the callout function is a pointer to a pcre_callout
   block. This structure contains the following fields:

   int          version;
   int          callout_number;
   int         *offset_vector;
   const char  *subject;
   int          subject_length;
   int          start_match;
   int          current_position;
   int          capture_top;
   int          capture_last;
   void        *callout_data;
   int          pattern_position;
   int          next_item_length;

   The version field is an integer containing the version number of the
   block format. The initial version was 0; the current version is 1. The
   version number will change again in future if additional fields are
   added, but the intention is never to remove any of the existing fields.

   The callout_number field contains the number of the callout, as com-
   piled into the pattern (that is, the number after ?C for manual call-
   outs, and 255 for automatically generated callouts).

   The offset_vector field is a pointer to the vector of offsets that was
   passed by the caller to pcre_exec() or pcre_dfa_exec(). When
   pcre_exec() is used, the contents can be inspected in order to extract
   substrings that have been matched so far, in the same way as for
   extracting substrings after a match has completed. For pcre_dfa_exec()
   this field is not useful.

The subject and subject_length fields contain copies of the values that
were passed to pcre_exec().

   The start_match field contains the offset within the subject at which
   the current match attempt started. If the pattern is not anchored, the
   callout function may be called several times from the same point in the
   pattern for different starting points in the subject.

The current_position field contains the offset within the subject of
the current match pointer.

   When the pcre_exec() function is used, the capture_top field contains
   one more than the number of the highest numbered captured substring so
   far. If no substrings have been captured, the value of capture_top is
   one. This is always the case when pcre_dfa_exec() is used, because it
   does not support captured substrings.

   The capture_last field contains the number of the most recently cap-
   tured substring. If no substrings have been captured, its value is -1.
   This is always the case when pcre_dfa_exec() is used.

   The callout_data field contains a value that is passed to pcre_exec()
   or pcre_dfa_exec() specifically so that it can be passed back in
   callouts. It is passed in the callout_data field of the pcre_extra
   data structure. If no such data was passed, the value of callout_data
   in a pcre_callout block is NULL. There is a description of the
   pcre_extra structure in the pcreapi documentation.

   The pattern_position field is present from version 1 of the
   pcre_callout structure. It contains the offset to the next item to be
   matched in the pattern string.

   The next_item_length field is present from version 1 of the
   pcre_callout structure. It contains the length of the next item to be
   matched in the pattern string. When the callout immediately precedes
   an alternation bar, a closing parenthesis, or the end of the pattern,
   the length is zero. When the callout precedes an opening parenthesis,
   the length is that of the entire subpattern.

   The pattern_position and next_item_length fields are intended to help
   in distinguishing between different automatic callouts, which all have
   the same callout number. However, they are set for all callouts.

RETURN VALUES

   The external callout function returns an integer to PCRE. If the value
   is zero, matching proceeds as normal. If the value is greater than
   zero, matching fails at the current point, but the testing of other
   matching possibilities goes ahead, just as if a lookahead assertion had
   failed. If the value is less than zero, the match is abandoned, and
   pcre_exec() (or pcre_dfa_exec()) returns the negative value.

   Negative values should normally be chosen from the set of
   PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
   dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
   reserved for use by callout functions; it will never be used by PCRE
   itself.
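
   The three kinds of return value can be illustrated with a callout that
   enforces a step budget. This is a sketch only: struct callout_fields
   is a hypothetical stand-in for the two pcre_callout_block fields it
   uses, SKETCH_ERROR_CALLOUT repeats what is believed to be the value of
   PCRE_ERROR_CALLOUT in pcre.h (-9), and treating callout number 2 as
   "fail here" is an arbitrary rule chosen for the example.

```c
#include <stddef.h>

/* Assumed to equal PCRE_ERROR_CALLOUT from pcre.h; check your header. */
#define SKETCH_ERROR_CALLOUT (-9)

/* Hypothetical stand-in for the pcre_callout_block fields used here. */
struct callout_fields {
    int callout_number;
    void *callout_data;   /* here: pointer to an int step budget */
};

int budget_callout(const struct callout_fields *cb)
{
    int *remaining = (int *)cb->callout_data;

    if (remaining == NULL)
        return 0;                     /* no data: proceed as normal */
    if (*remaining <= 0)
        return SKETCH_ERROR_CALLOUT;  /* negative: abandon the match */
    (*remaining)--;
    if (cb->callout_number == 2)
        return 1;                     /* positive: fail at this point */
    return 0;                         /* zero: proceed as normal */
}
```

   The budget itself would be supplied by the caller as the callout data,
   so that each call of the matching function gets its own counter.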

AUTHOR

   Philip Hazel
   University Computing Service
   Cambridge CB2 3QH, England.

REVISION

Last updated: 06 March 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------

PCRECOMPAT(3) PCRECOMPAT(3)

NAME
PCRE - Perl-compatible regular expressions

DIFFERENCES BETWEEN PCRE AND PERL

   This document describes the differences in the ways that PCRE and Perl
   handle regular expressions. The differences described here are mainly
   with respect to Perl 5.8, though PCRE version 7.0 contains some fea-
   tures that are expected to be in the forthcoming Perl 5.10.

   1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
   of what it does have are given in the section on UTF-8 support in the
   main pcre page.

   2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
   permits them, but they do not mean what you might think. For example,
   (?!a){3} does not assert that the next three characters are not "a". It
   just asserts that the next character is not "a" three times.

   3. Capturing subpatterns that occur inside negative lookahead asser-
   tions are counted, but their entries in the offsets vector are never
   set. Perl sets its numerical variables from any such patterns that are
   matched before the assertion fails to match something (thereby succeed-
   ing), but only if the negative lookahead assertion contains just one
   branch.

   4. Though binary zero characters are supported in the subject string,
   they are not allowed in a pattern string because it is passed as a nor-
   mal C string, terminated by zero. The escape sequence \0 can be used in
   the pattern to represent a binary zero.

   5. The following Perl escape sequences are not supported: \l, \u, \L,
   \U, and \N. In fact these are implemented by Perl's general string-han-
   dling and are not part of its pattern matching engine. If any of these
   are encountered by PCRE, an error is generated.

   6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
   is built with Unicode character property support. The properties that
   can be tested with \p and \P are limited to the general category prop-
   erties such as Lu and Nd, script names such as Greek or Han, and the
   derived properties Any and L&.

   7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
   ters in between are treated as literals. This is slightly different
   from Perl in that $ and @ are also handled as literals inside the
   quotes. In Perl, they cause variable interpolation (but of course PCRE
   does not have variables). Note the following examples:

   Pattern            PCRE matches   Perl matches

   \Qabc$xyz\E        abc$xyz        abc followed by the
                                       contents of $xyz
   \Qabc\$xyz\E       abc\$xyz       abc\$xyz
   \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

   8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
   constructions. However, there is support for recursive patterns. This
   is not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
   "callout" feature allows an external function to be called during pat-
   tern matching. See the pcrecallout documentation for details.

   9. Subpatterns that are called recursively or as "subroutines" are
   always treated as atomic groups in PCRE. This is like Python, but
   unlike Perl.

   10. There are some differences that are concerned with the settings of
   captured strings when part of a pattern is repeated. For example,
   matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
   unset, but in PCRE it is set to "b".

   11. PCRE provides some extensions to the Perl regular expression facil-
   ities. Perl 5.10 will include new features that are not in earlier
   versions, some of which (such as named parentheses) have been in PCRE
   for some time. This list is with respect to Perl 5.10:

   (a) Although lookbehind assertions must match fixed length strings,
   each alternative branch of a lookbehind assertion can match a different
   length of string. Perl requires them all to have the same length.

(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.

   (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
   cial meaning is faulted. Otherwise, like Perl, the backslash is
   ignored. (Perl can be made to issue a warning.)

   (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
   fiers is inverted, that is, by default they are not greedy, but if fol-
   lowed by a question mark they are.

(e) PCRE_ANCHORED can be used at matching time to force a pattern to be
tried only at the first matching position in the subject string.

(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equivalents.

(g) The callout facility is PCRE-specific.

(h) The partial matching facility is PCRE-specific.

(i) Patterns compiled by PCRE can be saved and re-used at a later time,
even on different hosts that have the other endianness.

(j) The alternative matching function (pcre_dfa_exec()) matches in a
different way and is not Perl-compatible.

AUTHOR

   Philip Hazel
   University Computing Service
   Cambridge CB2 3QH, England.

REVISION

Last updated: 06 March 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------

PCREPATTERN(3) PCREPATTERN(3)

NAME
PCRE - Perl-compatible regular expressions

PCRE REGULAR EXPRESSION DETAILS

   The syntax and semantics of the regular expressions supported by PCRE
   are described below. Regular expressions are also described in the Perl
   documentation and in a number of books, some of which have copious
   examples. Jeffrey Friedl's "Mastering Regular Expressions", published
   by O'Reilly, covers regular expressions in great detail. This descrip-
   tion of PCRE's regular expressions is intended as reference material.

   The original operation of PCRE was on strings of one-byte characters.
   However, there is now also support for UTF-8 character strings. To use
   this, you must build PCRE to include UTF-8 support, and then call
   pcre_compile() with the PCRE_UTF8 option. How this affects pattern
   matching is mentioned in several places below. There is also a summary
   of UTF-8 features in the section on UTF-8 support in the main pcre
   page.

   The remainder of this document discusses the patterns that are sup-
   ported by PCRE when its main matching function, pcre_exec(), is used.
   From release 6.0, PCRE offers a second matching function,
   pcre_dfa_exec(), which matches using a different algorithm that is not
   Perl-compatible. The advantages and disadvantages of the alternative
   function, and how it differs from the normal function, are discussed in
   the pcrematching page.

CHARACTERS AND METACHARACTERS

   A regular expression is a pattern that is matched against a subject
   string from left to right. Most characters stand for themselves in a
   pattern, and match the corresponding characters in the subject. As a
   trivial example, the pattern

The quick brown fox

   matches a portion of a subject string that is identical to itself. When
   caseless matching is specified (the PCRE_CASELESS option), letters are
   matched independently of case. In UTF-8 mode, PCRE always understands
   the concept of case for characters whose values are less than 128, so
   caseless matching is always possible. For characters with higher val-
   ues, the concept of case is supported if PCRE is compiled with Unicode
   property support, but not otherwise. If you want to use caseless
   matching for characters 128 and above, you must ensure that PCRE is
   compiled with Unicode property support as well as with UTF-8 support.

   The power of regular expressions comes from the ability to include
   alternatives and repetitions in the pattern. These are encoded in the
   pattern by the use of metacharacters, which do not stand for themselves
   but instead are interpreted in some special way.

   There are two different sets of metacharacters: those that are recog-
   nized anywhere in the pattern except within square brackets, and those
   that are recognized within square brackets. Outside square brackets,
   the metacharacters are as follows:

   \      general escape character with several uses
   ^      assert start of string (or line, in multiline mode)
   $      assert end of string (or line, in multiline mode)
   .      match any character except newline (by default)
   [      start character class definition
   |      start of alternative branch
   (      start subpattern
   )      end subpattern
   ?      extends the meaning of (
          also 0 or 1 quantifier
          also quantifier minimizer
   *      0 or more quantifier
   +      1 or more quantifier
          also "possessive quantifier"
   {      start min/max quantifier

Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:

   \      general escape character
   ^      negate the class, but only if the first character
   -      indicates character range
   [      POSIX character class (only if followed by POSIX
            syntax)
   ]      terminates the character class

The following sections describe the use of each of the metacharacters.

BACKSLASH

   The backslash character has several uses. Firstly, if it is followed by
   a non-alphanumeric character, it takes away any special meaning that
   character may have. This use of backslash as an escape character
   applies both inside and outside character classes.

   For example, if you want to match a * character, you write \* in the
   pattern. This escaping action applies whether or not the following
   character would otherwise be interpreted as a metacharacter, so it is
   always safe to precede a non-alphanumeric with backslash to specify
   that it stands for itself. In particular, if you want to match a back-
   slash, you write \\.

   If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
   the pattern (other than in a character class) and characters between a
   # outside a character class and the next newline are ignored. An escap-
   ing backslash can be used to include a whitespace or # character as
   part of the pattern.

   If you want to remove the special meaning from a sequence of charac-
   ters, you can do so by putting them between \Q and \E. This is differ-
   ent from Perl in that $ and @ are handled as literals in \Q...\E
   sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
   tion. Note the following examples:

   Pattern            PCRE matches   Perl matches

   \Qabc$xyz\E        abc$xyz        abc followed by the
                                       contents of $xyz
   \Qabc\$xyz\E       abc\$xyz       abc\$xyz
   \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz

The \Q...\E sequence is recognized both inside and outside character
classes.

Non-printing characters

   A second use of backslash provides a way of encoding non-printing char-
   acters in patterns in a visible manner. There is no restriction on the
   appearance of non-printing characters, apart from the binary zero that
   terminates a pattern, but when a pattern is being prepared by text
   editing, it is usually easier to use one of the following escape
   sequences than the binary character it represents:

   \a         alarm, that is, the BEL character (hex 07)
   \cx        "control-x", where x is any character
   \e         escape (hex 1B)
   \f         formfeed (hex 0C)
   \n         newline (hex 0A)
   \r         carriage return (hex 0D)
   \t         tab (hex 09)
   \ddd       character with octal code ddd, or backreference
   \xhh       character with hex code hh
   \x{hhh..}  character with hex code hhh..

   The precise effect of \cx is as follows: if x is a lower case letter,
   it is converted to upper case. Then bit 6 of the character (hex 40) is
   inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
   becomes hex 7B.
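
   The rule for \cx is small enough to write out directly. The function
   name control_escape() is hypothetical, and the sketch assumes an ASCII
   character set and the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical helper, not part of PCRE: the character code that \cx
   stands for, following the rule described above. Assumes ASCII. */
int control_escape(int x)
{
    if (islower((unsigned char)x))
        x = toupper((unsigned char)x);   /* lower case becomes upper */
    return x ^ 0x40;                     /* invert bit 6 (hex 40) */
}
```

   Calling it with 'z', '{' and ';' gives hex 1A, 3B and 7B, matching the
   three cases mentioned in the text.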

   After \x, from zero to two hexadecimal digits are read (letters can be
   in upper or lower case). Any number of hexadecimal digits may appear
   between \x{ and }, but the value of the character code must be less
   than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
   the maximum hexadecimal value is 7FFFFFFF). If characters other than
   hexadecimal digits appear between \x{ and }, or if there is no termi-
   nating }, this form of escape is not recognized. Instead, the initial
   \x will be interpreted as a basic hexadecimal escape, with no following
   digits, giving a character whose value is zero.
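
   The two forms of \x can be sketched as a small parser. The function
   parse_x_escape() is hypothetical and is not how PCRE itself is
   written; it takes a pointer to the text just after "\x" and reports
   how many characters the escape uses. It does not enforce the limits of
   256 or 2**31 on the value, and it treats any brace form it cannot
   parse as the basic form with no digits, as the text describes.

```c
#include <ctype.h>

static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower(c) - 'a' + 10;
}

/* Hypothetical helper, not part of PCRE. s points just after "\x".
   Returns the character value; *consumed receives the number of
   characters of s that belong to the escape. */
unsigned long parse_x_escape(const char *s, int *consumed)
{
    unsigned long value = 0;
    int i = 0;

    if (s[0] == '{') {
        i = 1;
        while (isxdigit((unsigned char)s[i])) {
            value = value * 16 + (unsigned long)hex_digit_value(s[i]);
            i++;
        }
        if (s[i] == '}') {          /* well-formed \x{hhh..} */
            *consumed = i + 1;
            return value;
        }
        value = 0;                  /* not recognized: basic form instead */
        i = 0;
    }
    while (i < 2 && isxdigit((unsigned char)s[i])) {
        value = value * 16 + (unsigned long)hex_digit_value(s[i]);
        i++;
    }
    *consumed = i;
    return value;
}
```

   A malformed brace form therefore yields the value zero with nothing
   consumed, leaving the "{" to be treated as an ordinary character.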

   Characters whose value is less than 256 can be defined by either of the
   two syntaxes for \x. There is no difference in the way they are han-
   dled. For example, \xdc is exactly the same as \x{dc}.

   After \0 up to two further octal digits are read. If there are fewer
   than two digits, just those that are present are used. Thus the
   sequence \0\x\07 specifies two binary zeros followed by a BEL character
   (code value 7). Make sure you supply two digits after the initial zero
   if the pattern character that follows is itself an octal digit.

   The handling of a backslash followed by a digit other than 0 is compli-
   cated. Outside a character class, PCRE reads it and any following dig-
   its as a decimal number. If the number is less than 10, or if there
   have been at least that many previous capturing left parentheses in the
   expression, the entire sequence is taken as a back reference. A
   description of how this works is given later, following the discussion
   of parenthesized subpatterns.

   Inside a character class, or if the decimal number is greater than 9
   and there have not been that many capturing subpatterns, PCRE re-reads
   up to three octal digits following the backslash, and uses them to gen-
   erate a data character. Any subsequent digits stand for themselves. In
   non-UTF-8 mode, the value of a character specified in octal must be
   less than \400. In UTF-8 mode, values up to \777 are permitted. For
   example:

   \040   is another way of writing a space
   \40    is the same, provided there are fewer than 40
            previous capturing subpatterns
   \7     is always a back reference
   \11    might be a back reference, or another way of
            writing a tab
   \011   is always a tab
   \0113  is a tab followed by the character "3"
   \113   might be a back reference, otherwise the
            character with octal code 113
   \377   might be a back reference, otherwise
            the byte consisting entirely of 1 bits
   \81    is either a back reference, or a binary zero
            followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.

   All the sequences that define a single character value can be used both
   inside and outside character classes. In addition, inside a character
   class, the sequence \b is interpreted as the backspace character (hex
   08), and the sequences \R and \X are interpreted as the characters "R"
   and "X", respectively. Outside a character class, these sequences have
   different meanings (see below).

Absolute and relative back references

   The sequence \g followed by a positive or negative number, optionally
   enclosed in braces, is an absolute or relative back reference. Back
   references are discussed later, following the discussion of parenthe-
   sized subpatterns.

Generic character types

Another use of backslash is for specifying generic character types. The
following are always recognized:

   \d any decimal digit
   \D any character that is not a decimal digit
   \s any whitespace character
   \S any character that is not a whitespace character
   \w any "word" character
   \W any "non-word" character

   Each pair of escape sequences partitions the complete set of characters
   into two disjoint sets. Any given character matches one, and only one,
   of each pair.

   These character type sequences can appear both inside and outside char-
   acter classes. They each match one character of the appropriate type.
   If the current matching point is at the end of the subject string, all
   of them fail, since there is no character to match.
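
   The partition property can be checked mechanically. The sketch below
   uses Python's re module as a stand-in for PCRE, which is an assumption:
   the two engines agree on the pairing of each lower case type with its
   upper case complement, but they are not guaranteed to agree on every
   member of each set. The names PAIRS and matches_exactly_one are
   illustrative only.

```python
import re

# Each (lower, upper) pair should split the ASCII range into two disjoint sets.
PAIRS = [(r"\d", r"\D"), (r"\w", r"\W"), (r"\s", r"\S")]

def matches_exactly_one(ch):
    """Return True if ch matches exactly one member of every pair."""
    for lower, upper in PAIRS:
        in_lower = re.fullmatch(lower, ch, re.ASCII) is not None
        in_upper = re.fullmatch(upper, ch, re.ASCII) is not None
        if in_lower == in_upper:
            return False
    return True

partition_holds = all(matches_exactly_one(chr(c)) for c in range(128))

# At the end of the subject there is no character left, so the type fails.
fails_at_end = re.compile(r"a\d").fullmatch("a") is None
```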

   For compatibility with Perl, \s does not match the VT character (code
   11). This makes it different from the POSIX "space" class. The \s
   characters are HT (9), LF (10), FF (12), CR (13), and space (32). (If
   "use locale;" is included in a Perl script, \s may match the VT charac-
   ter. In PCRE, it never does.)

   A "word" character is an underscore or any character less than 256 that
   is a letter or digit. The definition of letters and digits is con-
   trolled by PCRE's low-valued character tables, and may vary if locale-
   specific matching is taking place (see "Locale support" in the pcreapi
   page). For example, in a French locale such as "fr_FR" in Unix-like
   systems, or "french" in Windows, some character codes greater than 128
   are used for accented letters, and these are matched by \w.

   In UTF-8 mode, characters with values greater than 128 never match \d,
   \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
   code character property support is available. The use of locales with
   Unicode is discouraged.

Newline sequences

   Outside a character class, the escape sequence \R matches any Unicode
   newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is
   equivalent to the following:

(?>\r\n|\n|\x0b|\f|\r|\x85)

   This is an example of an "atomic group", details of which are given
   below. This particular group matches either the two-character sequence
   CR followed by LF, or one of the single characters LF (linefeed,
   U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
   return, U+000D), or NEL (next line, U+0085). The two-character sequence
   is treated as a single unit that cannot be split.

   In UTF-8 mode, two additional characters whose codepoints are greater
   than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
   rator, U+2029). Unicode character property support is not needed for
   these characters to be recognized.

Inside a character class, \R matches the letter "R".
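
   Python's re module has no \R escape, so the following sketch spells
   out the non-UTF-8 alternation shown above by hand. A plain
   non-capturing group is used instead of an atomic group so that the
   example does not depend on atomic-group support; the names
   ANY_NEWLINE and split_lines are hypothetical.

```python
import re

# CRLF is listed first so that it is consumed as one unit before a
# lone CR gets a chance to match.
ANY_NEWLINE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

def split_lines(text):
    """Split text at every newline sequence the alternation recognizes."""
    return ANY_NEWLINE.split(text)

lines = split_lines("one\r\ntwo\nthree\rfour\x85five")
```

   Because CRLF is tried before the single characters, "one\r\ntwo" is
   split once, not twice.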

Unicode character properties

   When PCRE is built with Unicode character property support, three addi-
   tional escape sequences to match character properties are available
   when UTF-8 mode is selected. They are:

   \p{xx} a character with the xx property
   \P{xx} a character without the xx property
   \X an extended Unicode sequence

   The property names represented by xx above are limited to the Unicode
   script names, the general category properties, and "Any", which matches
   any character (including newline). Other properties such as "InMusical-
   Symbols" are not currently supported by PCRE. Note that \P{Any} does
   not match any characters, so always causes a match failure.

   Sets of Unicode characters are defined as belonging to certain scripts.
   A character from one of these sets can be matched using a script name.
   For example:

\p{Greek}
\P{Han}

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

   Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
   Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
   Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
   Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
   gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
   Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
   Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
   Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
   Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.

   Each character has exactly one general category property, specified by
   a two-letter abbreviation. For compatibility with Perl, negation can be
   specified by including a circumflex between the opening brace and the
   property name. For example, \p{^Lu} is the same as \P{Lu}.

   If only one letter is specified with \p or \P, it includes all the gen-
   eral category properties that start with that letter. In this case, in
   the absence of negation, the curly brackets in the escape sequence are
   optional; these two examples have the same effect:

\p{L}
\pL

The following general category property codes are supported:

   C     Other
   Cc    Control
   Cf    Format
   Cn    Unassigned
   Co    Private use
   Cs    Surrogate

   L     Letter
   Ll    Lower case letter
   Lm    Modifier letter
   Lo    Other letter
   Lt    Title case letter
   Lu    Upper case letter

   M     Mark
   Mc    Spacing mark
   Me    Enclosing mark
   Mn    Non-spacing mark

   N     Number
   Nd    Decimal number
   Nl    Letter number
   No    Other number

   P     Punctuation
   Pc    Connector punctuation
   Pd    Dash punctuation
   Pe    Close punctuation
   Pf    Final punctuation
   Pi    Initial punctuation
   Po    Other punctuation
   Ps    Open punctuation

   S     Symbol
   Sc    Currency symbol
   Sk    Modifier symbol
   Sm    Mathematical symbol
   So    Other symbol

   Z     Separator
   Zl    Line separator
   Zp    Paragraph separator
   Zs    Space separator

   The special property L& is also supported: it matches a character that
   has the Lu, Ll, or Lt property, in other words, a letter that is not
   classified as a modifier or "other".
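
   Python's re module does not support \p or \P, but the general
   category codes themselves can be explored with the standard
   unicodedata module. The helper below is a sketch of the matching rule
   described above (two-letter code, one-letter prefix, and L&); the
   name has_property is hypothetical and is not part of PCRE or Python.

```python
import unicodedata

def has_property(ch, code):
    """Approximate a property test for codes such as 'Lu', 'L' or 'L&'."""
    category = unicodedata.category(ch)   # always a two-letter code
    if code == "L&":
        return category in ("Lu", "Ll", "Lt")
    # A one-letter code covers every category that starts with that letter.
    return category.startswith(code)
```

   For instance, has_property("A", "Lu") and has_property("a", "L") are
   both true, while has_property("a", "Lu") is false.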

   The long synonyms for these properties that Perl supports (such as
   \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
   any of these properties with "Is".

   No character that is in the Unicode table has the Cn (unassigned) prop-
   erty. Instead, this property is assumed for any code point that is not
   in the Unicode table.

Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to

(?>\PM\pM*)

   That is, it matches a character without the "mark" property, followed
   by zero or more characters with the "mark" property, and treats the
   sequence as an atomic group (see below). Characters with the "mark"
   property are typically accents that affect the preceding character.
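
   The grouping rule behind \X (one non-mark character followed by any
   number of mark characters) can be sketched without a regex engine,
   using Python's unicodedata module. The function names are
   hypothetical. One assumption is labeled in the code: a mark with no
   preceding base character starts its own group, whereas \X itself
   would not match at such a position.

```python
import unicodedata

def is_mark(ch):
    """True for characters whose general category is Mc, Me or Mn."""
    return unicodedata.category(ch).startswith("M")

def extended_sequences(text):
    """Group text into runs of one base character plus trailing marks."""
    sequences = []
    for ch in text:
        if is_mark(ch) and sequences:
            sequences[-1] += ch      # attach the mark to the preceding base
        else:
            # A base character starts a new sequence. Assumption: a leading
            # mark with nothing before it also starts one.
            sequences.append(ch)
    return sequences
```

   With this sketch, "e" followed by U+0301 (combining acute accent) and
   then "a" yields two sequences, the first being two characters long.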

   Matching characters by Unicode property is not fast, because PCRE has
   to search a structure that contains data for over fifteen thousand
   characters. That is why the traditional escape sequences such as \d and
   \w do not use Unicode properties in PCRE.

Simple assertions

   The final use of backslash is for certain simple assertions. An asser-
   tion specifies a condition that has to be met at a particular point in
   a match, without consuming any characters from the subject string. The
   use of subpatterns for more complicated assertions is described below.
   The backslashed assertions are:

   \b     matches at a word boundary
   \B     matches when not at a word boundary
   \A     matches at the start of the subject
   \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
   \z     matches only at the end of the subject
   \G     matches at the first matching position in the subject

   These assertions may not appear in character classes (but note that \b
   has a different meaning, namely the backspace character, inside a char-
   acter class).

   A word boundary is a position in the subject string where the current
   character and the previous character do not both match \w or \W (i.e.
   one matches \w and the other matches \W), or the start or end of the
   string if the first or last character matches \w, respectively.
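
   A short sketch of the word boundary rule, using Python's re module on
   the assumption that \b and \B behave the same way for plain ASCII
   text. The subject string is made up for illustration.

```python
import re

text = "cat concat cat. bobcat"

# \b on both sides keeps only occurrences that are whole words.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B requires that there is no boundary before "cat", so it is a suffix.
inner = [m.start() for m in re.finditer(r"\Bcat", text)]

# Inside a character class \b is the backspace character (hex 08).
backspace = re.fullmatch(r"[\b]", "\x08") is not None
```

   The first "cat" matches at offset 0 because the start of the string
   counts as a boundary when the first character matches \w.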

   The \A, \Z, and \z assertions differ from the traditional circumflex
   and dollar (described in the next section) in that they only ever match
   at the very start and end of the subject string, whatever options are
   set. Thus, they are independent of multiline mode. These three asser-
   tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
   affect only the behaviour of the circumflex and dollar metacharacters.
   However, if the startoffset argument of pcre_exec() is non-zero, indi-
   cating that matching is to start at a point other than the beginning of
   the subject, \A can never match. The difference between \Z and \z is
   that \Z matches before a newline at the end of the string as well as at
   the very end, whereas \z matches only at the end.

   The \G assertion is true only when the current matching position is at
   the start point of the match, as specified by the startoffset argument
   of pcre_exec(). It differs from \A when the value of startoffset is
   non-zero. By calling pcre_exec() multiple times with appropriate argu-
   ments, you can mimic Perl's /g option, and it is in this kind of imple-
   mentation where \G can be useful.

   Note, however, that PCRE's interpretation of \G, as the start of the
   current match, is subtly different from Perl's, which defines it as the
   end of the previous match. In Perl, these can be different when the
   previously matched string was empty. Because PCRE does just one match
   at a time, it cannot reproduce this behaviour.

   If all the alternatives of a pattern begin with \G, the expression is
   anchored to the starting match position, and the "anchored" flag is set
   in the compiled regular expression.
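
   Python's re module has no \G, but the repeated-call technique
   described above has a close analogue: a compiled pattern's match()
   method accepts a start position and only succeeds if the match begins
   exactly there. The tokenizer below is a sketch under that assumption;
   TOKEN and tokenize are hypothetical names, and every alternative
   consumes at least one character so the loop always advances.

```python
import re

TOKEN = re.compile(r"\d+|[a-z]+|\s+")

def tokenize(subject):
    """Split subject into tokens, each starting where the last one ended."""
    pos = 0
    tokens = []
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # anchored at pos, like \G
        if m is None:
            raise ValueError(f"no token at offset {pos}")
        tokens.append(m.group())
        pos = m.end()
    return tokens
```

   Calling tokenize("abc 123") walks the subject in three steps and
   never skips over unmatched text, which is the point of anchoring each
   attempt at the current offset.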

CIRCUMFLEX AND DOLLAR

   Outside a character class, in the default matching mode, the circumflex
   character is an assertion that is true only if the current matching
   point is at the start of the subject string. If the startoffset argu-
   ment of pcre_exec() is non-zero, circumflex can never match if the
   PCRE_MULTILINE option is unset. Inside a character class, circumflex
   has an entirely different meaning (see below).

   Circumflex need not be the first character of the pattern if a number
   of alternatives are involved, but it should be the first thing in each
   alternative in which it appears if the pattern is ever to match that
   branch. If all possible alternatives start with a circumflex, that is,
   if the pattern is constrained to match only at the start of the sub-
   ject, it is said to be an "anchored" pattern. (There are also other
   constructs that can cause a pattern to be anchored.)

   A dollar character is an assertion that is true only if the current
   matching point is at the end of the subject string, or immediately
   before a newline at the end of the string (by default). Dollar need not
   be the last character of the pattern if a number of alternatives are
   involved, but it should be the last item in any branch in which it
   appears. Dollar has no special meaning in a character class.

   The meaning of dollar can be changed so that it matches only at the
   very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
   compile time. This does not affect the \Z assertion.

   The meanings of the circumflex and dollar characters are changed if the
   PCRE_MULTILINE option is set. When this is the case, a circumflex
   matches immediately after internal newlines as well as at the start of
   the subject string. It does not match after a newline that ends the
   string. A dollar matches before any newlines in the string, as well as
   at the very end, when PCRE_MULTILINE is set. When newline is specified
   as the two-character sequence CRLF, isolated CR and LF characters do
   not indicate newlines.

   For example, the pattern /^abc$/ matches the subject string "def\nabc"
   (where \n represents a newline) in multiline mode, but not otherwise.
   Consequently, patterns that are anchored in single line mode because
   all branches start with ^ are not anchored in multiline mode, and a
   match for circumflex is possible when the startoffset argument of
   pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
   PCRE_MULTILINE is set.

   Note that the sequences \A, \Z, and \z can be used to match the start
   and end of the subject in both modes, and if all branches of a pattern
   start with \A it is always anchored, whether or not PCRE_MULTILINE is
   set.
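
   The /^abc$/ example can be reproduced with Python's re module, taking
   re.MULTILINE as the counterpart of PCRE_MULTILINE (an assumption that
   holds for this subject string). Only \A is shown from the backslashed
   assertions, because Python spells the end-of-subject assertions
   differently.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)                # default mode
multi_line = re.search(r"^abc$", subject, re.MULTILINE)   # multiline mode

# By default, dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

# \A is tied to the very start of the subject in both modes.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)
```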

FULL STOP (PERIOD, DOT)

   Outside a character class, a dot in the pattern matches any one charac-
   ter in the subject string except (by default) a character that signi-
   fies the end of a line. In UTF-8 mode, the matched character may be
   more than one byte long.

   When a line ending is defined as a single character, dot never matches
   that character; when the two-character sequence CRLF is used, dot does
   not match CR if it is immediately followed by LF, but otherwise it
   matches all characters (including isolated CRs and LFs). When any Uni-
   code line endings are being recognized, dot does not match CR or LF or
   any of the other line ending characters.

   The behaviour of dot with regard to newlines can be changed. If the
   PCRE_DOTALL option is set, a dot matches any one character, without
   exception. If the two-character sequence CRLF is present in the subject
   string, it takes two dots to match it.

   The handling of dot is entirely independent of the handling of circum-
   flex and dollar, the only relationship being that they both involve
   newlines. Dot has no special meaning in a character class.
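
   A minimal sketch of the dot behaviour, using Python's re module with
   re.DOTALL standing in for PCRE_DOTALL. This assumes LF is the only
   line ending in effect, which is how Python's dot behaves.

```python
import re

no_dotall = re.fullmatch(r"a.c", "a\nc")                # dot refuses the newline
with_dotall = re.fullmatch(r"a.c", "a\nc", re.DOTALL)    # dot matches anything

# CRLF is two characters, so with DOTALL it takes two dots to cover it.
one_dot = re.fullmatch(r"a.c", "a\r\nc", re.DOTALL)
two_dots = re.fullmatch(r"a..c", "a\r\nc", re.DOTALL)
```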

MATCHING A SINGLE BYTE

   Outside a character class, the escape sequence \C matches any one byte,
   both in and out of UTF-8 mode. Unlike a dot, it always matches any
   line-ending characters. The feature is provided in Perl in order to
   match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
   acters into individual bytes, what remains in the string may be a mal-
   formed UTF-8 string. For this reason, the \C escape sequence is best
   avoided.

   PCRE does not allow \C to appear in lookbehind assertions (described
   below), because in UTF-8 mode this would make it impossible to calcu-
   late the length of the lookbehind.
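
   The hazard of byte-wise matching can be illustrated in Python, which
   has no \C but matches one byte per dot when the pattern and subject
   are bytes objects. This is an analogy rather than PCRE behaviour; the
   helper is_valid_utf8 is a hypothetical name.

```python
import re

data = "\u00e9".encode("utf-8")     # one character, two bytes: C3 A9

# On a bytes subject, a DOTALL dot consumes exactly one byte.
first_byte = re.match(rb"(?s).", data).group()
remainder = data[len(first_byte):]

def is_valid_utf8(chunk):
    """Return True if chunk decodes cleanly as UTF-8."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

   The complete two-byte sequence is valid UTF-8, but neither the byte
   that was matched nor the byte left behind is valid on its own.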

SQUARE BRACKETS AND CHARACTER CLASSES

   An opening square bracket introduces a character class, terminated by a
   closing square bracket. A closing square bracket on its own is not spe-
   cial. If a closing square bracket is required as a member of the class,
   it should be the first data character in the class (after an initial
   circumflex, if present) or escaped with a backslash.

   A character class matches a single character in the subject. In UTF-8
   mode, the character may occupy more than one byte. A matched character
   must be in the set of characters defined by the class, unless the first
   character in the class definition is a circumflex, in which case the
   subject character must not be in the set defined by the class. If a
   circumflex is actually required as a member of the class, ensure it is
   not the first character, or escape it with a backslash.

   For example, the character class [aeiou] matches any lower case vowel,
   while [^aeiou] matches any character that is not a lower case vowel.
   Note that a circumflex is just a convenient notation for specifying the
   characters that are in the class by enumerating those that are not. A
   class that starts with a circumflex is not an assertion: it still con-
   sumes a character from the subject string, and therefore it fails if
   the current pointer is at the end of the string.

   In UTF-8 mode, characters with values greater than 255 can be included
   in a class as a literal string of bytes, or by using the \x{ escaping
   mechanism.

   When caseless matching is set, any letters in a class represent both
   their upper case and lower case versions, so for example, a caseless
   [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
   match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
   understands the concept of case for characters whose values are less
   than 128, so caseless matching is always possible. For characters with
   higher values, the concept of case is supported if PCRE is compiled
   with Unicode property support, but not otherwise. If you want to use
   caseless matching for characters 128 and above, you must ensure that
   PCRE is compiled with Unicode property support as well as with UTF-8
   support.

   Characters that might indicate line breaks are never treated in any
   special way when matching character classes, whatever line-ending
   sequence is in use, and whatever setting of the PCRE_DOTALL and
   PCRE_MULTILINE options is used. A class such as [^a] always matches one
   of these characters.

   The minus (hyphen) character can be used to specify a range of charac-
   ters in a character class. For example, [d-m] matches any letter
   between d and m, inclusive. If a minus character is required in a
   class, it must be escaped with a backslash or appear in a position
   where it cannot be interpreted as indicating a range, typically as the
   first or last character in the class.

   It is not possible to have the literal character "]" as the end charac-
   ter of a range. A pattern such as [W-]46] is interpreted as a class of
   two characters ("W" and "-") followed by a literal string "46]", so it
   would match "W46]" or "-46]". However, if the "]" is escaped with a
   backslash it is interpreted as the end of range, so [W-\]46] is inter-
   preted as a class containing a range followed by two other characters.
   The octal or hexadecimal representation of "]" can also be used to end
   a range.
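
   The rules for "]" and "-" inside a class can be tried out with
   Python's re module, which parses these particular classes the same
   way as described above (an assumption checked only for the patterns
   shown here).

```python
import re

# "]" placed first (or escaped) is an ordinary member of the class.
bracket_first = re.fullmatch(r"[]ab]", "]") is not None
bracket_escaped = re.fullmatch(r"[ab\]]", "]") is not None

# [W-]46] is the class [W-] followed by the literal text "46]".
literal_tail = [s for s in ("W46]", "-46]", "X46]")
                if re.fullmatch(r"[W-]46]", s)]

# With the "]" escaped, W-\] is a range (W through ]) plus "4" and "6".
range_members = [c for c in "WZ]46-"
                 if re.fullmatch(r"[W-\]46]", c)]
```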

   Ranges operate in the collating sequence of character values. They can
   also be used for characters specified numerically, for example
   [\000-\037]. In UTF-8 mode, ranges can include characters whose values
   are greater than 255, for example [\x{100}-\x{2ff}].

   If a range that includes letters is used when caseless matching is set,
   it matches the letters in either case. For example, [W-c] is equivalent
   to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
   character tables for a French locale are in use, [\xc8-\xcb] matches
   accented E characters in both cases. In UTF-8 mode, PCRE supports the
   concept of case for characters with values greater than 128 only when
   it is compiled with Unicode property support.

   The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
   in a character class, and add the characters that they match to the
   class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
   flex can conveniently be used with the upper case character types to
   specify a more restricted set of characters than the matching lower
   case type. For example, the class [^\W_] matches any letter or digit,
   but not underscore.
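
   Both class examples from the paragraph above, written as a small
   Python sketch. re.ASCII is passed for [^\W_] so that \W is limited to
   the ASCII definition of a word character; the variable names are
   illustrative.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]")
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

hex_hits = [c for c in "09AFGa" if hex_digit.fullmatch(c)]
word_hits = [c for c in "aZ5_- " if letter_or_digit.fullmatch(c)]
```

   Note that [\dABCDEF] rejects "a": the class lists only the upper case
   hexadecimal letters unless caseless matching is set.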

   The only metacharacters that are recognized in character classes are
   backslash, hyphen (only where it can be interpreted as specifying a
   range), circumflex (only at the start), opening square bracket (only
   when it can be interpreted as introducing a POSIX class name - see the
   next section), and the terminating closing square bracket. However,
   escaping other non-alphanumeric characters does no harm.

POSIX CHARACTER CLASSES

   Perl supports the POSIX notation for character classes. This uses names
   enclosed by [: and :] within the enclosing square brackets. PCRE also
   supports this notation. For example,

[01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class
names are

   alnum    letters and digits
   alpha    letters
   ascii    character codes 0 - 127
   blank    space or tab only
   cntrl    control characters
   digit    decimal digits (same as \d)
   graph    printing characters, excluding space
   lower    lower case letters
   print    printing characters, including space
   punct    printing characters, excluding letters and digits
   space    white space (not quite the same as \s)
   upper    upper case letters
   word     "word" characters (same as \w)
   xdigit   hexadecimal digits

   The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
   and space (32). Notice that this list includes the VT character (code
   11). This makes "space" different to \s, which does not include VT (for
   Perl compatibility).

   The name "word" is a Perl extension, and "blank" is a GNU extension
   from Perl 5.8. Another Perl extension is negation, which is indicated
   by a ^ character after the colon. For example,

[12[:^digit:]]

   matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
   POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
   these are not supported, and an error is given if they are encountered.

In UTF-8 mode, characters with values greater than 128 do not match any
of the POSIX character classes.

VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

gilbert|sullivan

   matches either "gilbert" or "sullivan". Any number of alternatives may
   appear, and an empty alternative is permitted (matching the empty
   string). The matching process tries each alternative in turn, from left
   to right, and the first one that succeeds is used. If the alternatives
   are within a subpattern (defined below), "succeeds" means matching the
   rest of the main pattern as well as the alternative in the subpattern.
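
   The left-to-right rule, and the wider meaning of "succeeds" inside a
   subpattern, can be seen in this sketch. It uses Python's re module,
   which follows the same ordered-alternation rule.

```python
import re

# Alternatives are tried left to right; "a" succeeds, so "ab" is never tried.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern matches too, so the engine falls back from "a" to "ab" here.
needs_rest = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"gilbert|sullivan|", "") is not None
```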

INTERNAL OPTION SETTING

   The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
   PCRE_EXTENDED options can be changed from within the pattern by a
   sequence of Perl option letters enclosed between "(?" and ")". The
   option letters are

   i for PCRE_CASELESS
   m for PCRE_MULTILINE
   s for PCRE_DOTALL
   x for PCRE_EXTENDED

   For example, (?im) sets caseless, multiline matching. It is also possi-
   ble to unset these options by preceding the letter with a hyphen, and a
   combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
   LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
   is also permitted. If a letter appears both before and after the
   hyphen, the option is unset.

   When an option change occurs at top level (that is, not inside subpat-
   tern parentheses), the change applies to the remainder of the pattern
   that follows. If the change is placed right at the start of a pattern,
   PCRE extracts it into the global options (and it will therefore show up
   in data extracted by the pcre_fullinfo() function).

   An option change within a subpattern (see below for a description of
   subpatterns) affects only that part of the current pattern that follows
   it, so

(a(?i)b)c

   matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
   used). By this means, options can be made to have different settings
   in different parts of the pattern. Any changes made in one alternative
   do carry on into subsequent branches within the same subpattern. For
   example,

(a(?i)b|c)

   matches "ab", "aB", "c", and "C", even though when matching "C" the
   first branch is abandoned before the option setting. This is because
   the effects of option settings happen at compile time. There would be
   some very weird behaviour otherwise.
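
   A sketch of option scoping in Python's re module. One difference is
   an assumption worth stating: recent Python versions accept an
   unscoped setting such as (?i) only at the very start of a pattern, so
   changes in the middle of a pattern are written here with the scoped
   group form (?i:...), which confines the change explicitly.

```python
import re

# A setting at the very start applies to the whole pattern.
whole = [s for s in ("abc", "ABC", "aBc") if re.fullmatch(r"(?i)abc", s)]

# Scoped form: only the "b" inside the group is matched caselessly.
scoped = [s for s in ("abc", "aBc", "Abc", "abC")
          if re.fullmatch(r"a(?i:b)c", s)]

# A letter after the hyphen unsets the option for the group.
unset = [s for s in ("abc", "aBc", "ABC")
         if re.fullmatch(r"(?i)a(?-i:b)c", s)]
```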

   The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
   can be changed in the same way as the Perl-compatible options by using
   the characters J, U and X respectively.

SUBPATTERNS

Subpatterns are delimited by parentheses (round brackets), which can be
nested. Turning part of a pattern into a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern

cat(aract|erpillar|)

   matches one of the words "cat", "cataract", or "caterpillar". Without
   the parentheses, it would match "cataract", "erpillar" or an empty
   string.

   2. It sets up the subpattern as a capturing subpattern. This means
   that, when the whole pattern matches, that portion of the subject
   string that matched the subpattern is passed back to the caller via the
   ovector argument of pcre_exec(). Opening parentheses are counted from
   left to right (starting from 1) to obtain numbers for the capturing
   subpatterns.

For example, if the string "the red king" is matched against the pat-
tern

the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.

   The fact that plain parentheses fulfil two functions is not always
   helpful. There are often times when a grouping subpattern is required
   without a capturing requirement. If an opening parenthesis is followed
   by a question mark and a colon, the subpattern does not do any captur-
   ing, and is not counted when computing the number of any subsequent
   capturing subpatterns. For example, if the string "the white queen" is
   matched against the pattern

the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
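
   The numbering of captured substrings in the examples above can be
   confirmed with Python's re module, whose groups() method returns the
   captures in order of their opening parentheses. pcre_exec() and its
   ovector argument are not involved in this sketch.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
groups_all = capturing.groups()

non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")
groups_fewer = non_capturing.groups()

# Parentheses also localize alternatives: the empty branch allows plain "cat".
cat_words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
             if re.fullmatch(r"cat(aract|erpillar|)", w)]
```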

   As a convenient shorthand, if any option settings are required at the
   start of a non-capturing subpattern, the option letters may appear
   between the "?" and the ":". Thus the two patterns

(?i:saturday|sunday)
(?:(?i)saturday|sunday)

   match exactly the same set of strings. Because alternative branches are
   tried from left to right, and options are not reset until the end of
   the subpattern is reached, an option setting in one branch does affect
   subsequent branches, so the above patterns match "SUNDAY" as well as
   "Saturday".

NAMED SUBPATTERNS

   Identifying capturing parentheses by number is simple, but it can be
   very hard to keep track of the numbers in complicated regular expres-
   sions. Furthermore, if an expression is modified, the numbers may
   change. To help with this difficulty, PCRE supports the naming of sub-
   patterns. This feature was not added to Perl until release 5.10. Python
   had the feature earlier, and PCRE introduced it at release 4.0, using
   the Python syntax. PCRE now supports both the Perl and the Python syn-
   tax.

   In PCRE, a subpattern can be named in one of three ways: (?<name>...)
   or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
   to capturing parentheses from other parts of the pattern, such as back-
   references, recursion, and conditions, can be made by name as well as
   by number.

   Names consist of up to 32 alphanumeric characters and underscores.
   Named capturing parentheses are still allocated numbers as well as
   names, exactly as if the names were not present. The PCRE API provides
   function calls for extracting the name-to-number translation table from
   a compiled pattern. There is also a convenience function for extracting
   a captured substring by name.
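
   The following sketch uses the Python syntax (?P<name>...) in Python's
   re module. The groupindex attribute plays the role of the
   name-to-number table mentioned above; it is Python's interface, not
   the PCRE API, and the pattern itself is made up for illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2007-04")

by_name = m.group("year")          # extract a substring by name
by_number = m.group(1)             # the same group still has a number
name_table = dict(date.groupindex) # name-to-number translation
```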

   By default, a name must be unique within a pattern, but it is possible
   to relax this constraint by setting the PCRE_DUPNAMES option at compile
   time. This can be useful for patterns where only one instance of the
   named parentheses can match. Suppose you want to match the name of a
   weekday, either as a 3-letter abbreviation or as the full name, and in
   both cases you want to extract the abbreviation. This pattern (ignoring
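   the line breaks) does the job:

(?<DN>Mon|Fri|Sun)(?:day)?|
(?<DN>Tue)(?:sday)?|
(?<DN>Wed)(?:nesday)?|
(?<DN>Thu)(?:rsday)?|
(?<DN>Sat)(?:urday)?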

   There are five capturing substrings, but only one is ever set after a
   match. The convenience function for extracting the data by name
   returns the substring for the first (and in this example, the only)
   subpattern of that name that matched. This saves searching to find
   which numbered subpattern it was. If you make a reference to a non-
   unique named subpattern from elsewhere in the pattern, the one that
   corresponds to the lowest number is used. For further details of the
   interfaces for handling named subpatterns, see the pcreapi documenta-
   tion.

REPETITION

Repetition is specified by quantifiers, which can follow any of the
following items:

   a literal data character
   the dot metacharacter
   the \C escape sequence
   the \X escape sequence (in UTF-8 mode with Unicode properties)
   the \R escape sequence
   an escape such as \d that matches a single character
   a character class
   a back reference (see next section)
   a parenthesized subpattern (unless it is an assertion)

   The general repetition quantifier specifies a minimum and maximum num-
   ber of permitted matches, by giving the two numbers in curly brackets
   (braces), separated by a comma. The numbers must be less than 65536,
   and the first must be less than or equal to the second. For example:

z{2,4}

   matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
   special character. If the second number is omitted, but the comma is
   present, there is no upper limit; if the second number and the comma
   are both omitted, the quantifier specifies an exact number of required
   matches. Thus

[aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

\d{8}

   matches exactly 8 digits. An opening curly bracket that appears in a
   position where a quantifier is not allowed, or one that does not match
   the syntax of a quantifier, is taken as a literal character. For exam-
   ple, {,6} is not a quantifier, but a literal string of four characters.
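
   The three quantifier examples above, checked with a small helper in
   Python. The literal reading of {,6} is deliberately not exercised,
   because Python's re module treats that form as a quantifier with a
   lower bound of zero; the helper name accepted is hypothetical.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

z_runs = accepted(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])
vowels = accepted(r"[aeiou]{3,}", ["ae", "aei", "aeiou"])
digits = accepted(r"\d{8}", ["1234567", "12345678", "123456789"])
```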

   In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
   individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
   acters, each of which is represented by a two-byte sequence. Similarly,
   when Unicode property support is available, \X{3} matches three Unicode
   extended sequences, each of which may be several bytes long (and they
   may be of different lengths).

The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present.

For convenience, the three most common quantifiers have single-charac-
ter abbreviations:

   *    is equivalent to {0,}
   +    is equivalent to {1,}
   ?    is equivalent to {0,1}

   It is possible to construct infinite loops by following a subpattern
   that can match no characters with a quantifier that has no upper limit,
   for example:

(a?)*

   Earlier versions of Perl and PCRE used to give an error at compile time
   for such patterns. However, because there are cases where this can be
   useful, such patterns are now accepted, but if any repetition of the
   subpattern does in fact match no characters, the loop is forcibly bro-
   ken.

   By default, the quantifiers are "greedy", that is, they match as much
   as possible (up to the maximum number of permitted times), without
   causing the rest of the pattern to fail. The classic example of where
   this gives problems is in trying to match comments in C programs. These
   appear between /* and */ and within the comment, individual * and /
   characters may appear. An attempt to match C comments by applying the
   pattern

/\*.*\*/

to the string

/* first comment */ not comment /* second comment */

fails, because it matches the entire string owing to the greediness of
the .* item.

   However, if a quantifier is followed by a question mark, it ceases to
   be greedy, and instead matches the minimum number of times possible, so
   the pattern

/\*.*?\*/

   does the right thing with the C comments. The meaning of the various
   quantifiers is not otherwise changed, just the preferred number of
   matches. Do not confuse this use of question mark with its use as a
   quantifier in its own right. Because it has two uses, it can sometimes
   appear doubled, as in

\d??\d

which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches.
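The difference between greedy and minimizing repeats can be sketched with Python's re module, which treats a trailing question mark the same way for these patterns. The variable names are illustrative only.

```python
import re

SUBJECT = "/* first comment */ not comment /* second comment */"

# Greedy .* runs on to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", SUBJECT)
# Minimizing .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", SUBJECT)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```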

   If the PCRE_UNGREEDY option is set (an option that is not available in
   Perl), the quantifiers are not greedy by default, but individual ones
   can be made greedy by following them with a question mark. In other
   words, it inverts the default behaviour.

   When a parenthesized subpattern is quantified with a minimum repeat
   count that is greater than 1 or with a limited maximum, more memory is
   required for the compiled pattern, in proportion to the size of the
   minimum or maximum.

   If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
   alent to Perl's /s) is set, thus allowing the dot to match newlines,
   the pattern is implicitly anchored, because whatever follows will be
   tried against every character position in the subject string, so there
   is no point in retrying the overall match at any position after the
   first. PCRE normally treats such a pattern as though it were preceded
   by \A.

   In cases where it is known that the subject string contains no new-
   lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
   mization, or alternatively using ^ to indicate anchoring explicitly.

   However, there is one situation where the optimization cannot be used.
   When .* is inside capturing parentheses that are the subject of a
   backreference elsewhere in the pattern, a match at the start may fail
   where a later one succeeds. Consider, for example:

(.*)abc\1

If the subject is "xyz123abc123" the match point is the fourth charac-
ter. For this reason, such a pattern is not implicitly anchored.
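The match point in this example can be checked with Python's re module, used here only as a stand-in for PCRE. Offsets are zero-based, so the fourth character is offset 3.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = match.start()   # zero-based offset where the match begins
captured = match.group(1)     # text held by the capturing parentheses

print(match_start, captured)
```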

When a capturing subpattern is repeated, the value captured is the sub-
string that matched the final iteration. For example, after

(tweedle[dume]{3}\s*)+

   has matched "tweedledum tweedledee" the value of the captured substring
   is "tweedledee". However, if there are nested capturing subpatterns,
   the corresponding captured values may have been set in previous itera-
   tions. For example, after

/(a|(b))+/

matches "aba" the value of the second captured substring is "b".
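Both capture rules can be observed in Python's re module, which behaves the same way for these two patterns. This is a sketch, not a statement about every engine, since some engines reset nested captures on each iteration.

```python
import re

# The outer group is repeated; group 1 keeps the final iteration only.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")

# Group 2 is set in the second iteration and not touched by the third.
nested = re.match(r"(a|(b))+", "aba")

print(repeated.group(1), nested.groups())
```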

ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

   With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
   repetition, failure of what follows normally causes the repeated item
   to be re-evaluated to see if a different number of repeats allows the
   rest of the pattern to match. Sometimes it is useful to prevent this,
   either to change the nature of the match, or to cause it to fail earlier
   than it otherwise might, when the author of the pattern knows there is
   no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject
line

123456bar

   After matching all 6 digits and then failing to match "foo", the normal
   action of the matcher is to try again with only 5 digits matching the
   \d+ item, and then with 4, and so on, before ultimately failing.
   "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
   the means for specifying that once a subpattern has matched, it is not
   to be re-evaluated in this way.

   If we use atomic grouping for the previous example, the matcher gives
   up immediately on failing to match "foo" the first time. The notation
   is a kind of special parenthesis, starting with (?> as in this example:

(?>\d+)foo

   This kind of parenthesis "locks up" the part of the pattern it con-
   tains once it has matched, and a failure further into the pattern is
   prevented from backtracking into it. Backtracking past it to previous
   items, however, works as normal.

   An alternative description is that a subpattern of this type matches
   the string of characters that an identical standalone pattern would
   match, if anchored at the current point in the subject string.

   Atomic grouping subpatterns are not capturing subpatterns. Simple cases
   such as the above example can be thought of as a maximizing repeat that
   must swallow everything it can. So, while both \d+ and \d+? are pre-
   pared to adjust the number of digits they match in order to make the
   rest of the pattern match, (?>\d+) can only match an entire sequence of
   digits.
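The effect of an atomic group can be sketched in Python. Python's re module gained (?>...) only in version 3.11, so this example swaps in a different technique that works on older versions too: a lookahead that captures the digits, followed by a backreference that consumes them. A completed lookahead is not re-entered on backtracking, which gives the same "locked up" behaviour. The constant names are assumptions for illustration.

```python
import re

# Ordinary greedy repeat: \d+ gives back a digit so the final \d can match.
BACKTRACKING = re.compile(r"\d+\d")

# Emulated atomic group: the lookahead captures the whole digit run and \1
# consumes it, so no digit is left over for the final \d.
ATOMIC_EMULATED = re.compile(r"(?=(\d+))\1\d")

backtracking_hit = BACKTRACKING.fullmatch("123456") is not None
atomic_hit = ATOMIC_EMULATED.search("123456") is not None
atomic_foo_hit = re.search(r"(?=(\d+))\1foo", "123456foo") is not None

print(backtracking_hit, atomic_hit, atomic_foo_hit)
```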

   Atomic groups in general can of course contain arbitrarily complicated
   subpatterns, and can be nested. However, when the subpattern for an
   atomic group is just a single repeated item, as in the example above, a
   simpler notation, called a "possessive quantifier" can be used. This
   consists of an additional + character following a quantifier. Using
   this notation, the previous example can be rewritten as

\d++foo

   Possessive quantifiers are always greedy; the setting of the
   PCRE_UNGREEDY option is ignored. They are a convenient notation for the
   simpler forms of atomic group. However, there is no difference in the
   meaning of a possessive quantifier and the equivalent atomic group,
   though there may be a performance difference; possessive quantifiers
   should be slightly faster.

   The possessive quantifier syntax is an extension to the Perl 5.8 syn-
   tax. Jeffrey Friedl originated the idea (and the name) in the first
   edition of his book. Mike McCloskey liked it, so implemented it when he
   built Sun's Java package, and PCRE copied it from there. It ultimately
   found its way into Perl at release 5.10.

   PCRE has an optimization that automatically "possessifies" certain sim-
   ple pattern constructs. For example, the sequence A+B is treated as
   A++B because there is no point in backtracking into a sequence of A's
   when B must follow.

   When a pattern contains an unlimited repeat inside a subpattern that
   can itself be repeated an unlimited number of times, the use of an
   atomic group is the only way to avoid some failing matches taking a
   very long time indeed. The pattern

(\D+|<\d+>)*[!?]

   matches an unlimited number of substrings that either consist of non-
   digits, or digits enclosed in <>, followed by either ! or ?. When it
   matches, it runs quickly. However, if it is applied to

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

   it takes a long time before reporting failure. This is because the
   string can be divided between the internal \D+ repeat and the external
   * repeat in a large number of ways, and all have to be tried. (The
   example uses [!?] rather than a single character at the end, because
   both PCRE and Perl have an optimization that allows for fast failure
   when a single character is used. They remember the last single charac-
   ter that is required for a match, and fail early if it is not present
   in the string.) If the pattern is changed so that it uses an atomic
   group, like this:

((?>\D+)|<\d+>)*[!?]

sequences of non-digits cannot be broken, and failure happens quickly.

BACK REFERENCES

   Outside a character class, a backslash followed by a digit greater than
   0 (and possibly further digits) is a back reference to a capturing sub-
   pattern earlier (that is, to its left) in the pattern, provided there
   have been that many previous capturing left parentheses.

   However, if the decimal number following the backslash is less than 10,
   it is always taken as a back reference, and causes an error only if
   there are not that many capturing left parentheses in the entire pat-
   tern. In other words, the parentheses that are referenced need not be
   to the left of the reference for numbers less than 10. A "forward back
   reference" of this type can make sense when a repetition is involved
   and the subpattern to the right has participated in an earlier itera-
   tion.

   It is not possible to have a numerical "forward back reference" to a
   subpattern whose number is 10 or more using this syntax because a
   sequence such as \50 is interpreted as a character defined in octal.
   See the subsection entitled "Non-printing characters" above for further
   details of the handling of digits following a backslash. There is no
   such problem when named parentheses are used. A back reference to any
   subpattern is possible using named parentheses (see below).

   Another way of avoiding the ambiguity inherent in the use of digits
   following a backslash is to use the \g escape sequence, which is a fea-
   ture introduced in Perl 5.10. This escape must be followed by a posi-
   tive or a negative number, optionally enclosed in braces. These exam-
   ples are all identical:

   (ring), \1
   (ring), \g1
   (ring), \g{1}

   A positive number specifies an absolute reference without the ambiguity
   that is present in the older syntax. It is also useful when literal
   digits follow the reference. A negative number is a relative reference.
   Consider this example:

(abc(def)ghi)\g{-1}

   The sequence \g{-1} is a reference to the most recently started captur-
   ing subpattern before \g, that is, it is equivalent to \2. Similarly,
   \g{-2} would be equivalent to \1. The use of relative references can be
   helpful in long patterns, and also in patterns that are created by
   joining together fragments that contain references within themselves.

   A back reference matches whatever actually matched the capturing sub-
   pattern in the current subject string, rather than anything matching
   the subpattern itself (see "Subpatterns as subroutines" below for a way
   of doing that). So the pattern

(sens|respons)e and \1ibility

   matches "sense and sensibility" and "response and responsibility", but
   not "sense and responsibility". If caseful matching is in force at the
   time of the back reference, the case of letters is relevant. For exam-
   ple,

((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.

   Back references to named subpatterns use the Perl syntax \k<name> or
   \k'name' or the Python syntax (?P=name). We could rewrite the above
   example in either of the following ways:

(?<p1>(?i)rah)\s+\k<p1>
(?P<p1>(?i)rah)\s+(?P=p1)

A subpattern that is referenced by name may appear in the pattern
before or after the reference.
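Numbered and named back references can be tried in Python's re module, which uses the Python syntax mentioned above, (?P<name>...) and (?P=name). The group name stem and the helper matches are made up for this sketch.

```python
import re

NUMBERED = re.compile(r"(sens|respons)e and \1ibility")
NAMED = re.compile(r"(?P<stem>sens|respons)e and (?P=stem)ibility")

def matches(regex: "re.Pattern", text: str) -> bool:
    """Return True when the whole of text matches regex."""
    return regex.fullmatch(text) is not None

for text in ("sense and sensibility",
             "response and responsibility",
             "sense and responsibility"):
    print(text, matches(NUMBERED, text), matches(NAMED, text))
```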

   There may be more than one back reference to the same subpattern. If a
   subpattern has not actually been used in a particular match, any back
   references to it always fail. For example, the pattern

(a|(bc))\2

   always fails if it starts to match "a" rather than "bc". Because there
   may be many capturing parentheses in a pattern, all digits following
   the backslash are taken as part of a potential back reference number.
   If the pattern continues with a digit character, some delimiter must be
   used to terminate the back reference. If the PCRE_EXTENDED option is
   set, this can be whitespace. Otherwise an empty comment (see "Com-
   ments" below) can be used.

   A back reference that occurs inside the parentheses to which it refers
   fails when the subpattern is first used, so, for example, (a\1) never
   matches. However, such references can be useful inside repeated sub-
   patterns. For example, the pattern

(a|b\1)+

   matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
   ation of the subpattern, the back reference matches the character
   string corresponding to the previous iteration. In order for this to
   work, the pattern must be such that the first iteration does not need
   to match the back reference. This can be done using alternation, as in
   the example above, or by a quantifier with a minimum of zero.

ASSERTIONS

   An assertion is a test on the characters following or preceding the
   current matching point that does not actually consume any characters.
   The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
   described above.

   More complicated assertions are coded as subpatterns. There are two
   kinds: those that look ahead of the current position in the subject
   string, and those that look behind it. An assertion subpattern is
   matched in the normal way, except that it does not cause the current
   matching position to be changed.

   Assertion subpatterns are not capturing subpatterns, and may not be
   repeated, because it makes no sense to assert the same thing several
   times. If any kind of assertion contains capturing subpatterns within
   it, these are counted for the purposes of numbering the capturing sub-
   patterns in the whole pattern. However, substring capturing is carried
   out only for positive assertions, because it does not make sense for
   negative assertions.

Lookahead assertions

Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,

\w+(?=;)

matches a word followed by a semicolon, but does not include the semi-
colon in the match, and

foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

(?!foo)bar

   does not find an occurrence of "bar" that is preceded by something
   other than "foo"; it finds any occurrence of "bar" whatsoever, because
   the assertion (?!foo) is always true when the next three characters are
   "bar". A lookbehind assertion is needed to achieve the other effect.

   If you want to force a matching failure at some point in a pattern, the
   most convenient way to do it is with (?!) because an empty string
   always matches, so an assertion that requires there not to be an empty
   string must always fail.
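The three lookahead patterns discussed above behave the same way in Python's re module, so a short sketch can show where each one matches. The subject strings are invented for the example.

```python
import re

# \w+(?=;) matches a word only when a semicolon follows, without consuming it.
words_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# foo(?!bar) skips the "foo" that is followed by "bar".
foo_not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": the assertion looks ahead from
# the start of "bar", where the next characters are never "foo".
bar_anywhere = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar")]

print(words_before_semicolon, foo_not_before_bar, bar_anywhere)
```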

Lookbehind assertions

Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,

(?<!foo)bar

   does find an occurrence of "bar" that is not preceded by "foo". The
   contents of a lookbehind assertion are restricted such that all the
   strings it matches must have a fixed length. However, if there are sev-
   eral top-level alternatives, they do not all have to have the same
   fixed length. Thus

(?<=bullock|donkey)

is permitted, but

(?<!dogs?|cats?)

   causes an error at compile time. Branches that match different length
   strings are permitted only at the top level of a lookbehind assertion.
   This is an extension compared with Perl (at least for 5.8), which
   requires all branches to match the same length of string. An assertion
   such as

(?<=ab(c|de))

   is not permitted, because its single top-level branch can match two
   different lengths, but it is acceptable if rewritten to use two top-
   level branches:

(?<=abc|abde)

   The implementation of lookbehind assertions is, for each alternative,
   to temporarily move the current position back by the fixed length and
   then try to match. If there are insufficient characters before the cur-
   rent position, the assertion fails.
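A lookbehind can be sketched in Python's re module, with one caveat: Python is stricter than PCRE and requires every alternative in a lookbehind to have the same length, so (?<=bullock|donkey) is rejected there as well. The helper compiles is an assumption made for this example.

```python
import re

# (?<!foo)bar finds only the "bar" that does not follow "foo".
bar_positions = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

def compiles(pattern: str) -> bool:
    """Return True when the re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed_ok = compiles(r"(?<=abc|abd)")                 # fixed length: accepted
variable_rejected = not compiles(r"(?<!dogs?|cats?)")  # variable length: error

print(bar_positions, fixed_ok, variable_rejected)
```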

   PCRE does not allow the \C escape (which matches a single byte in UTF-8
   mode) to appear in lookbehind assertions, because it makes it impossi-
   ble to calculate the length of the lookbehind. The \X and \R escapes,
   which can match different numbers of bytes, are also not permitted.

   Possessive quantifiers can be used in conjunction with lookbehind
   assertions to specify efficient matching at the end of the subject
   string. Consider a simple pattern such as

abcd$

   when applied to a long string that does not match. Because matching
   proceeds from left to right, PCRE will look for each "a" in the subject
   and then see if what follows matches the rest of the pattern. If the
   pattern is specified as

^.*abcd$

   the initial .* matches the entire string at first, but when this fails
   (because there is no following "a"), it backtracks to match all but the
   last character, then all but the last two characters, and so on. Once
   again the search for "a" covers the entire string, from right to left,
   so we are no better off. However, if the pattern is written as

^.*+(?<=abcd)

   there can be no backtracking for the .*+ item; it can match only the
   entire string. The subsequent lookbehind assertion does a single test
   on the last four characters. If it fails, the match fails immediately.
   For long strings, this approach makes a significant difference to the
   processing time.

Using multiple assertions

Several assertions (of any sort) may occur in succession. For example,

(?<=\d{3})(?<!999)foo

   matches "foo" preceded by three digits that are not "999". Notice that
   each of the assertions is applied independently at the same point in
   the subject string. First there is a check that the previous three
   characters are all digits, and then there is a check that the same
   three characters are not "999". This pattern does not match "foo" pre-
   ceded by six characters, the first three of which are digits and the
   last three of which are not "999". For example, it doesn't match
   "123abcfoo". A pattern to do that is

(?<=\d{3}...)(?<!999)foo

   This time the first assertion looks at the preceding six characters,
   checking that the first three are digits, and then the second assertion
   checks that the preceding three characters are not "999".

Assertions can be nested in any combination. For example,

(?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

(?<=\d{3}(?!999)...)foo

is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
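The two lookbehind combinations can be compared in Python's re module, which accepts both because each lookbehind has a fixed length. The names SAME_THREE, SIX_BACK and found are illustrative.

```python
import re

# Both assertions inspect the same three characters before "foo".
SAME_THREE = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first assertion looks six characters back, the second only three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(regex: "re.Pattern", text: str) -> bool:
    """Return True when regex matches somewhere in text."""
    return regex.search(text) is not None

for text in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(text, found(SAME_THREE, text), found(SIX_BACK, text))
```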

CONDITIONAL SUBPATTERNS

   It is possible to cause the matching process to obey a subpattern con-
   ditionally or to choose between two alternative subpatterns, depending
   on the result of an assertion, or whether a previous capturing subpat-
   tern matched or not. The two possible forms of conditional subpattern
   are

(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)

   If the condition is satisfied, the yes-pattern is used; otherwise the
   no-pattern (if present) is used. If there are more than two alterna-
   tives in the subpattern, a compile-time error occurs.

There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.

Checking for a used subpattern by number

   If the text between the parentheses consists of a sequence of digits,
   the condition is true if the capturing subpattern of that number has
   previously matched.

   Consider the following pattern, which contains non-significant white
   space to make it more readable (assume the PCRE_EXTENDED option) and to
   divide it into three parts for ease of discussion:

( \( )?    [^()]+    (?(1) \) )

   The first part matches an optional opening parenthesis, and if that
   character is present, sets it as the first captured substring. The sec-
   ond part matches one or more characters that are not parentheses. The
   third part is a conditional subpattern that tests whether the first set
   of parentheses matched or not. If they did, that is, if the subject started
   with an opening parenthesis, the condition is true, and so the yes-pat-
   tern is executed and a closing parenthesis is required. Otherwise,
   since no-pattern is not present, the subpattern matches nothing. In
   other words, this pattern matches a sequence of non-parentheses,
   optionally enclosed in parentheses.
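Python's re module supports the same (?(1)...) conditional form, so the optional-parenthesis pattern described above can be sketched directly. re.VERBOSE plays the role of PCRE_EXTENDED here. The names OPTIONAL_PARENS and fully_matches are assumptions.

```python
import re

OPTIONAL_PARENS = re.compile(
    r"""
    ( \( )?        # part 1: optional opening parenthesis, captured as group 1
    [^()]+         # part 2: one or more non-parenthesis characters
    (?(1) \) )     # part 3: require ")" only if group 1 took part in the match
    """,
    re.VERBOSE,
)

def fully_matches(text: str) -> bool:
    """Return True when the whole of text matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(text) is not None

for text in ("abc", "(abc)", "(abc", "abc)"):
    print(text, fully_matches(text))
```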

Checking for a used subpattern by name

   Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
   used subpattern by name. For compatibility with earlier versions of
   PCRE, which had this facility before Perl, the syntax (?(name)...) is
   also recognized. However, there is a possible ambiguity with this syn-
   tax, because subpattern names may consist entirely of digits. PCRE
   looks first for a named subpattern; if it cannot find one and the name
   consists entirely of digits, PCRE looks for a subpattern of that num-
   ber, which must be greater than zero. Using subpattern names that con-
   sist entirely of digits is not recommended.

Rewriting the above example to use a named subpattern gives this:

(?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )

Checking for pattern recursion

   If the condition is the string (R), and there is no subpattern with the
   name R, the condition is true if a recursive call to the whole pattern
   or any subpattern has been made. If digits or a name preceded by amper-
   sand follow the letter R, for example:

(?(R3)...) or (?(R&name)...)

   the condition is true if the most recent recursion is into the subpat-
   tern whose number or name is given. This condition does not check the
   entire recursion stack.

At "top level", all these recursion test conditions are false. Recur-
sive patterns are described below.

Defining subpatterns for use by reference only

   If the condition is the string (DEFINE), and there is no subpattern
   with the name DEFINE, the condition is always false. In this case,
   there may be only one alternative in the subpattern. It is always
   skipped if control reaches this point in the pattern; the idea of
   DEFINE is that it can be used to define "subroutines" that can be ref-
   erenced from elsewhere. (The use of "subroutines" is described below.)
   For example, a pattern to match an IPv4 address could be written like
   this (ignore whitespace and line breaks):

(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b

   The first part of the pattern is a DEFINE group inside which another
   group named "byte" is defined. This matches an individual component of
   an IPv4 address (a number less than 256). When matching takes place,
   this part of the pattern is skipped because DEFINE acts like a false
   condition.

   The rest of the pattern uses references to the named group to match the
   four dot-separated components of an IPv4 address, insisting on a word
   boundary at each end.
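Python's re module has neither DEFINE nor (?&name) calls, so the following sketch swaps in a different technique: the "byte" fragment is kept in a Python string and spliced into the full pattern. The effect is similar for this example, but it is plain string composition rather than PCRE's subroutine mechanism. BYTE and IPV4 are names invented here.

```python
import re

# One component of an IPv4 address: a number less than 256.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

for text in ("192.168.0.1", "255.255.255.255", "256.1.1.1", "1.2.3"):
    print(text, IPV4.fullmatch(text) is not None)
```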

Assertion conditions

   If the condition is not in any of the above formats, it must be an
   assertion. This may be a positive or negative lookahead or lookbehind
   assertion. Consider this pattern, again containing non-significant
   white space, and with the two alternatives on the second line:

(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

   The condition is a positive lookahead assertion that matches an
   optional sequence of non-letters followed by a letter. In other words,
   it tests for the presence of at least one letter in the subject. If a
   letter is found, the subject is matched against the first alternative;
   otherwise it is matched against the second. This pattern matches
   strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
   letters and dd are digits.
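Python's re module accepts only group references as conditions, not assertions, so this sketch emulates the assertion condition with a different construction: the lookahead guards the first alternative and its negation guards the second. For this pattern the two forms should accept the same strings. DATE_LIKE is a hypothetical name.

```python
import re

DATE_LIKE = re.compile(
    r"""
    (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}    # a letter lies ahead: dd-aaa-dd
    |
    (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}       # no letter ahead: dd-dd-dd
    """,
    re.VERBOSE,
)

for text in ("12-abc-34", "12-34-56", "12-ab-34"):
    print(text, DATE_LIKE.fullmatch(text) is not None)
```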

COMMENTS

   The sequence (?# marks the start of a comment that continues up to the
   next closing parenthesis. Nested parentheses are not permitted. The
   characters that make up a comment play no part in the pattern matching
   at all.

   If the PCRE_EXTENDED option is set, an unescaped # character outside a
   character class introduces a comment that continues to immediately
   after the next newline in the pattern.

RECURSIVE PATTERNS

   Consider the problem of matching a string in parentheses, allowing for
   unlimited nested parentheses. Without the use of recursion, the best
   that can be done is to use a pattern that matches up to some fixed
   depth of nesting. It is not possible to handle an arbitrary nesting
   depth.
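The fixed-depth limitation can be made concrete with a small Python sketch that builds such a pattern by wrapping it around itself a chosen number of times. Python's re module has no recursion, which makes it a convenient place to show the workaround. The function names and the max_depth parameter are assumptions for illustration.

```python
import re

def nested_parens_pattern(max_depth: int) -> str:
    """Build a pattern for parentheses nested at most max_depth levels deep."""
    pattern = r"\([^()]*\)"  # depth 0: no inner parentheses at all
    for _ in range(max_depth):
        # Wrap the previous level: text, then any number of (inner group, text).
        pattern = r"\([^()]*(?:" + pattern + r"[^()]*)*\)"
    return pattern

def balanced_up_to(text: str, max_depth: int) -> bool:
    """Return True when text is one parenthesized group within max_depth."""
    return re.fullmatch(nested_parens_pattern(max_depth), text) is not None

print(balanced_up_to("(ab(cd)ef)", 1))
print(balanced_up_to("(a(b(c)))", 1))
```

Each extra level makes the pattern longer, and any subject nested more deeply than max_depth is rejected.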

   For some time, Perl has provided a facility that allows regular expres-
   sions to recurse (amongst other things). It does this by interpolating
   Perl code in the expression at run time, and the code can refer to the
   expression itself. A Perl pattern using code interpolation to solve the
   parentheses problem can be created like this:

$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears.

   Obviously, PCRE cannot support the interpolation of Perl code. Instead,
   it supports special syntax for recursion of the entire pattern, and
   also for individual subpattern recursion. After its introduction in
   PCRE and Python, this kind of recursion was introduced into Perl at
   release 5.10.

   A special item that consists of (? followed by a number greater than
   zero and a closing parenthesis is a recursive call of the subpattern of
   the given number, provided that it occurs inside that subpattern. (If
   not, it is a "subroutine" call, which is described in the next sec-
   tion.) The special item (?R) or (?0) is a recursive call of the entire
   regular expression.

   In PCRE (like Python, but unlike Perl), a recursive subpattern call is
   always treated as an atomic group. That is, once it has matched some of
   the subject string, it is never re-entered, even if it contains untried
   alternatives and there is a subsequent matching failure.

This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):

\( ( (?>[^()]+) | (?R) )* \)

   First it matches an opening parenthesis. Then it matches any number of
   substrings which can either be a sequence of non-parentheses, or a
   recursive match of the pattern itself (that is, a correctly parenthe-
   sized substring). Finally there is a closing parenthesis.

If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

( \( ( (?>[^()]+) | (?1) )* \) )

   We have put the pattern into parentheses, and caused the recursion to
   refer to them instead of the whole pattern. In a larger pattern, keep-
   ing track of parenthesis numbers can be tricky. It may be more conve-
   nient to use named parentheses instead. The Perl syntax for this is
   (?&name); PCRE's earlier syntax (?P>name) is also supported. We could
   rewrite the above example as follows:

(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )

   If there is more than one subpattern with the same name, the earliest
   one is used. This particular example pattern contains nested unlimited
   repeats, and so the use of atomic grouping for matching strings of non-
   parentheses is important when applying the pattern to strings that do
   not match. For example, when this pattern is applied to

(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

   it yields "no match" quickly. However, if atomic grouping is not used,
   the match runs for a very long time indeed because there are so many
   different ways the + and * repeats can carve up the subject, and all
   have to be tested before failure can be reported.

   At the end of a match, the values set for any capturing subpatterns are
   those from the outermost level of the recursion at which the subpattern
   value is set. If you want to obtain intermediate values, a callout
   function can be used (see below and the pcrecallout documentation). If
   the pattern above is matched against

(ab(cd)ef)

   the value for the capturing parentheses is "ef", which is the last
   value taken on at the top level. If additional parentheses are added,
   giving

   \( ( ( (?>[^()]+) | (?R) )* ) \)
      ^                        ^
      ^                        ^

   the string they capture is "ab(cd)ef", the contents of the top level
   parentheses. If there are more than 15 capturing parentheses in a pat-
   tern, PCRE has to obtain extra memory to store data during a recursion,
   which it does by using pcre_malloc, freeing it via pcre_free after-
   wards. If no memory can be obtained, the match fails with the
   PCRE_ERROR_NOMEMORY error.

   Do not confuse the (?R) item with the condition (R), which tests for
   recursion. Consider this pattern, which matches text in angle brack-
   ets, allowing for arbitrary nesting. Only digits are allowed in nested
   brackets (that is, when recursing), whereas any characters are permit-
   ted at the outer level.

< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

   In this pattern, (?(R) is the start of a conditional subpattern, with
   two different alternatives for the recursive and non-recursive cases.
   The (?R) item is the actual recursive call.

SUBPATTERNS AS SUBROUTINES

   If the syntax for a recursive subpattern reference (either by number or
   by name) is used outside the parentheses to which it refers, it oper-
   ates like a subroutine in a programming language. The "called" subpat-
   tern may be defined before or after the reference. An earlier example
   pointed out that the pattern

(sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

(sens|respons)e and (?1)ibility

   is used, it does match "sense and responsibility" as well as the other
   two strings. Another example is given in the discussion of DEFINE
   above.
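The contrast between a back reference and a subroutine call can be approximated in Python, whose re module has no (?1) syntax. As a stand-in, this sketch reuses the subpattern text through string composition, which re-applies the subpattern rather than the text it matched. STEM, BACKREF and REUSED are names made up for the example.

```python
import re

STEM = r"(?:sens|respons)"

# Back reference: the second stem must repeat the text matched by the first.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")
# Reused subpattern: the second stem is matched afresh.
REUSED = re.compile(rf"{STEM}e and {STEM}ibility")

text = "sense and responsibility"
print(BACKREF.fullmatch(text) is not None, REUSED.fullmatch(text) is not None)
```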

   Like recursive subpatterns, a "subroutine" call is always treated as an
   atomic group. That is, once it has matched some of the subject string,
   it is never re-entered, even if it contains untried alternatives and
   there is a subsequent matching failure.

   When a subpattern is used as a subroutine, processing options such as
   case-independence are fixed when the subpattern is defined. They cannot
   be changed for different calls. For example, consider this pattern:

(abc)(?i:(?1))

It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.

CALLOUTS

   Perl has a feature whereby using the sequence (?{...}) causes arbitrary
   Perl code to be obeyed in the middle of matching a regular expression.
   This makes it possible, amongst other things, to extract different sub-
   strings that match the same pair of parentheses when there is a repeti-
   tion.

   PCRE provides a similar feature, but of course it cannot obey arbitrary
   Perl code. The feature is called "callout". The caller of PCRE provides
   an external function by putting its entry point in the global variable
   pcre_callout. By default, this variable contains NULL, which disables
   all calling out.

   Within a regular expression, (?C) indicates the points at which the
   external function is to be called. If you want to identify different
   callout points, you can put a number less than 256 after the letter C.
   The default value is zero. For example, this pattern has two callout
   points:

(?C1)abc(?C2)def

   If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
   automatically installed before each item in the pattern. They are all
   numbered 255.

   During matching, when PCRE reaches a callout point (and pcre_callout is
   set), the external function is called. It is provided with the number
   of the callout, the position in the pattern, and, optionally, one item
   of data originally supplied by the caller of pcre_exec(). The callout
   function may cause matching to proceed, to backtrack, or to fail
   altogether. A complete description of the interface to the callout
   function is given in the pcrecallout documentation.

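PCRE syntax, preg

Uncategorized

Pattern Modifiers
Pattern Modifiers -- describes the modifiers available in regular expression patterns
Description
The modifiers currently available in PCRE are listed below. The names in parentheses are the internal PCRE names for these modifiers.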

i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper and lower case letters.

m (PCRE_MULTILINE)
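By default, PCRE treats the subject string as consisting of a single "line" of characters (even if it actually contains several newlines). The "start of line" metacharacter (^) matches only at the start of the string, while the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (unless the D modifier is set). This is the same as Perl.

When this modifier is set, the "start of line" and "end of line" constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m modifier. If there are no "\n" characters in the subject string, or no occurrences of ^ or $ in the pattern, setting this modifier has no effect.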
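The effect of the m modifier can be tried out with a small sketch. It uses Python's re module rather than PHP's preg functions (an assumption made for illustration; re.MULTILINE plays the role of PCRE_MULTILINE here), and the subject string is made up.

```python
import re

subject = "first line\nsecond line\nthird line"

# Without MULTILINE, ^ matches only at the very start of the subject.
default_hits = re.findall(r"^\w+", subject)

# With MULTILINE, ^ also matches immediately after each newline.
multiline_hits = re.findall(r"^\w+", subject, flags=re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second', 'third']
```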

s (PCRE_DOTALL)
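If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.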
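A minimal sketch of the same behaviour, again assuming Python's re module as a stand-in for PCRE (re.DOTALL corresponds to the s modifier). The subject string is hypothetical.

```python
import re

subject = "a\nb"

# By default the dot refuses to match the newline between "a" and "b".
without_dotall = re.fullmatch(r"a.b", subject)

# With DOTALL the dot matches any character, newline included.
with_dotall = re.fullmatch(r"a.b", subject, flags=re.DOTALL)

# A negated class matches the newline regardless of the flag.
negated_class = re.fullmatch(r"a[^x]b", subject)

print(without_dotall is None)     # True
print(with_dotall is not None)    # True
print(negated_class is not None)  # True
```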

x (PCRE_EXTENDED)
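If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.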
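As an illustration of a commented pattern, here is a short sketch in Python, whose re.VERBOSE flag behaves like the x modifier for this purpose. The date pattern and the name date_pattern are invented for the example.

```python
import re

# Whitespace and "# ..." comments inside the pattern are ignored.
date_pattern = re.compile(
    r"""
    (\d{4})   # year
    -         # literal separator
    (\d{2})   # month
    -
    (\d{2})   # day
    """,
    re.VERBOSE,
)

match = date_pattern.fullmatch("2007-04-27")
parts = match.groups()
print(parts)  # ('2007', '04', '27')
```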

e
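If this modifier is set, preg_replace() does the normal substitution of references in the replacement string, evaluates it as PHP code, and uses the result to replace the matched string.

Only preg_replace() uses this modifier; it is ignored by other PCRE functions.

Note: This modifier is not available in PHP 3.

A (PCRE_ANCHORED)
If this modifier is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the start of the string which is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.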

D (PCRE_DOLLAR_ENDONLY)
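If this modifier is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if the m modifier is set. There is no equivalent to this modifier in Perl.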
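Python's re module has no counterpart of the D modifier, so the sketch below is only an approximation: it contrasts the default dollar with Python's \Z, which matches strictly at the end of the string (note that this differs from PCRE's \Z, described further down). The subject string is made up.

```python
import re

subject = "abc\n"

# Default dollar: also matches just before a trailing newline.
dollar = re.search(r"abc$", subject)

# Python's \Z: matches only at the very end, so the trailing newline blocks it.
strict_end = re.search(r"abc\Z", subject)

print(dollar is not None)  # True
print(strict_end is None)  # True
```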

S
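When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.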

U (PCRE_UNGREEDY)
This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by "?". It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern.
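Python's re module has no U flag, so the following sketch shows the underlying distinction instead: a greedy quantifier versus the same quantifier followed by "?". Under the U modifier the two spellings would swap roles. The HTML-like subject is hypothetical.

```python
import re

subject = "<b>one</b><b>two</b>"

# Greedy: .* runs as far as it can, so the match spans both elements.
greedy = re.search(r"<b>.*</b>", subject).group()

# Lazy: .*? stops at the first point where the rest of the pattern matches.
lazy = re.search(r"<b>.*?</b>", subject).group()

print(greedy)  # <b>one</b><b>two</b>
print(lazy)    # <b>one</b>
```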

X (PCRE_EXTRA)
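This modifier turns on additional functionality of PCRE that is incompatible with Perl. Any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features controlled by this modifier.

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 on Unix and from PHP 4.2.3 on win32.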

/////////////////////////////////////////////////////////////////////////////////////////////

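Pattern Syntax
Pattern Syntax -- describes PCRE regular expression syntax
Description
The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5, with just a few differences (see below). The current implementation corresponds to Perl 5.005.

Differences from Perl
The differences described here are with respect to Perl 5.005.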

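By default, a whitespace character is any character that the C library function isspace() recognizes, though it is possible to compile PCRE with alternative character type tables. Normally isspace() matches space, formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 no longer includes vertical tab in its set of whitespace characters. The \v escape that was in the Perl documentation for a long time was never in fact recognized. However, the character itself was treated as whitespace at least up to 5.002. In 5.004 and 5.005 it does not match \s.

PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits them, but they do not mean what you might think. For example, (?!a){3} does not assert that the next three characters are not "a". It just asserts three times that the next character is not "a".

Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl sets its numerical variables from any such patterns that are matched before the assertion fails to match something (thereby succeeding), but only if the negative lookahead assertion contains just one branch.

Though binary zero characters are supported in the subject string, they are not allowed in a pattern string because it is passed as a normal C string, terminated by zero. The escape sequence "\x00" can be used in the pattern to represent a binary zero.

The following Perl escape sequences are not supported: \l, \u, \L, \U, \E, \Q. In fact these are implemented by Perl's general string handling and are not part of its pattern matching engine.

The Perl \G assertion is not supported as it is not relevant to single pattern matches.

Fairly obviously, PCRE does not support the (?{code}) construction.

There are at the time of writing some oddities in Perl 5.005_02 concerned with the settings of captured strings when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the future Perl changes to a consistent state that is different, PCRE may change to follow.

Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset.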

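PCRE provides some extensions to the Perl regular expression facilities:

Although lookbehind assertions must match fixed length strings, each alternative branch of a lookbehind assertion can match a different length of string. Perl 5.005 requires them all to have the same length.

If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ metacharacter matches only at the very end of the string.

If PCRE_EXTRA is set, a backslash followed by a letter with no special meaning is faulted.

If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are.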

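Regular Expression Details
Introduction
The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation.

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern The quick brown fox matches a portion of a subject string that is identical to itself.

Meta-characters
The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows: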

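\
general escape character with several uses

^
assert start of subject (or line, in multiline mode)

$
assert end of subject (or line, in multiline mode)

.
match any character except newline (by default)

[
start character class definition

]
end character class definition

|
start of alternative branch

(
start subpattern

)
end subpattern

?
extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer

*
0 or more quantifier

+
1 or more quantifier

{
start min/max quantifier

}
end min/max quantifier

The part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:

\
general escape character

^
negate the class, but only if the first character

-
indicates character range

]
terminates the character class

The following sections describe the use of each of the meta-characters.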

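Backslash
The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.

For example, if you want to match a "*" character, you write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself. In particular, if you want to match a backslash, you write "\\".

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the pattern (other than in a character class) and characters between a "#" outside a character class and the next newline character are ignored. An escaping backslash can be used to include a whitespace or "#" character as part of the pattern.

A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents: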

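\a
alarm, that is, the BEL character (hex 07)

\cx
"control-x", where x is any character

\e
escape (hex 1B)

\f
formfeed (hex 0C)

\n
newline (hex 0A)

\r
carriage return (hex 0D)

\t
tab (hex 09)

\xhh
character with hex code hh

\ddd
character with octal code ddd, or backreference

The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.

After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case).

After "\0" up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash, and generates a single byte from the least significant 8 bits of the value. Any subsequent digits stand for themselves. For example: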

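\040
is another way of writing a space

\40
is the same, provided there are fewer than 40 previous capturing subpatterns

\7
is always a back reference

\11
might be a back reference, or another way of writing a tab

\011
is always a tab

\0113
is a tab followed by the character "3"

\113
is the character with octal code 113 (since there can be no more than 99 back references)

\377
is a byte consisting entirely of 1 bits

\81
is either a back reference, or a binary zero followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.

All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below).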
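The rule for the control-character escape given earlier in this section (convert a lower case letter to upper case, then invert the hex 40 bit) can be written out as a small Python function. The helper name control_char_code is hypothetical; it only reproduces the arithmetic and does not call into PCRE.

```python
def control_char_code(x: str) -> int:
    """Return the byte value that backslash-c followed by x stands for."""
    if x.islower():
        x = x.upper()     # lower case letters are converted to upper case
    return ord(x) ^ 0x40  # then the hex 40 bit is inverted


for ch in "z{;":
    print(ch, hex(control_char_code(ch)))
# z 0x1a
# { 0x3b
# ; 0x7b
```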

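The third use of backslash is for specifying generic character types:

\d
any decimal digit

\D
any character that is not a decimal digit

\s
any whitespace character

\S
any character that is not a whitespace character

\w
any "word" character

\W
any "non-word" character

Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place (see "Locale support" above). For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.

The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are: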

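\b
word boundary

\B
not a word boundary

\A
start of subject (independent of multiline mode)

\Z
end of subject or newline at end (independent of multiline mode)

\z
end of subject (independent of multiline mode)

These assertions may not appear in character classes (but note that "\b" has a different meaning, namely the backspace character, inside a character class).

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.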
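A short sketch of the word-boundary assertions, using Python's re module as a stand-in (its \b and \B follow the same definition for ASCII text). The subject string is invented.

```python
import re

subject = "cat concatenate cat."

# \bcat\b: "cat" with a word boundary on both sides (a standalone word).
whole_words = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \Bcat\B: "cat" with word characters on both sides (inside a longer word).
inside_words = [m.start() for m in re.finditer(r"\Bcat\B", subject)]

print(whole_words)   # [0, 16]
print(inside_words)  # [7]
```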

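The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described below) in that they only ever match at the very start and end of the subject string, whatever options are set. They are not affected by the PCRE_MULTILINE or PCRE_DOLLAR_ENDONLY options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end.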

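Circumflex and dollar
Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)

A dollar character is an assertion which is true only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the PCRE_MULTILINE option is set. When this is the case, they match immediately after and immediately before an internal "\n" character, respectively, in addition to matching at the start and end of the subject string. For example, the pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, but not otherwise. Consequently, patterns that are anchored in single line mode because all branches start with "^" are not anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.

Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A it is always anchored, whether PCRE_MULTILINE is set or not.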

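Full stop
Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. If the PCRE_DOTALL option is set, then dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class.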

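Square brackets
An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters which are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and fails if the current pointer is at the end of the string.

When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would.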
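The class behaviour described above can be checked with a few lines of Python. This is a sketch that assumes Python's re module in place of PCRE (re.IGNORECASE stands in for caseless matching); the variable names are made up.

```python
import re

vowel = re.compile(r"[aeiou]")
non_vowel = re.compile(r"[^aeiou]")
caseless_vowel = re.compile(r"[aeiou]", re.IGNORECASE)
caseless_non_vowel = re.compile(r"[^aeiou]", re.IGNORECASE)

print(bool(vowel.fullmatch("a")))               # True
print(bool(non_vowel.fullmatch("A")))           # True: caseful, "A" is not a lower case vowel
print(bool(caseless_vowel.fullmatch("A")))      # True
print(bool(caseless_non_vowel.fullmatch("A")))  # False
print(bool(non_vowel.fullmatch("")))            # False: a class must consume one character
```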

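The newline character is never treated in any special way in character classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class such as [^a] will always match a newline.

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.

It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by a literal string "46]", so it would match "W46]" or "-46]". However, if the "]" is escaped with a backslash it is interpreted as the end of range, so [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range.

Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. For example, [W-c] is equivalent to [][\^_`wxyzabc], matched caselessly, and if character tables for the "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases.

The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore.

All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped.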

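Vertical bar
Vertical bar characters are used to separate alternative patterns. For example, the pattern gilbert|sullivan matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.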
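A small sketch of left-to-right alternation, written with Python's re module as an assumed stand-in for PCRE. The second and third patterns are invented to show that the first alternative wins unless the rest of the pattern forces a later one.

```python
import re

# Either alternative may match.
names = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# Alternatives are tried left to right: "a" succeeds first, so "ab" is never tried.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the engine falls back to the second alternative here.
backtracked = re.fullmatch(r"(a|ab)c", "abc").group(1)

print(names)        # ['gilbert', 'sullivan']
print(first_wins)   # a
print(backtracked)  # ab
```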

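Internal option setting
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED can be changed from within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are:

Table 1. Internal option letters

i for PCRE_CASELESS
m for PCRE_MULTILINE
s for PCRE_DOTALL
x for PCRE_EXTENDED

For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and a combined setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. If a letter appears both before and after the hyphen, the option is unset.

The scope of these option changes depends on where in the pattern the setting occurs. For settings that are outside any subpattern (defined below), the effect is the same as if the options were set or unset at the start of matching. The following patterns all behave in exactly the same way:

(?i)abc
a(?i)bc
ab(?i)c
abc(?i)

which in turn is the same as compiling the pattern abc with PCRE_CASELESS set. In other words, such "top level" settings apply to the whole pattern (unless there are other changes inside subpatterns). If there is more than one setting of the same option at top level, the rightmost setting is used.

If an option change occurs inside a subpattern, the effect is different. This is a change of behaviour in Perl 5.005. An option change inside a subpattern affects only that part of the subpattern that follows it, so (a(?i)b)c matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example, (a(?i)b|c) matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise.

The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the same way as the Perl-compatible options by using the characters U and X respectively. The (?X) flag setting is special in that it must always occur earlier in the pattern than any of the additional features it turns on, even when it is at top level. It is best put at the start.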

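Subpatterns
Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern cat(aract|erpillar|) matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.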

2. It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern the ((red|white) (king|queen)) the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3.

The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200.
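The numbering of capturing and non-capturing subpatterns can be confirmed with a short sketch. It assumes Python's re module, where groups() returns the captured substrings in the same left-to-right order of opening parentheses.

```python
import re

# Three capturing groups, numbered by their opening parentheses.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())  # ('red king', 'red', 'king')

# "?:" makes the inner alternation non-capturing, so only two groups remain.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())  # ('white queen', 'queen')
```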

As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns

(?i:saturday|sunday)
(?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".
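The shorthand form can be tried in Python, with some caveats: the (?i:...) syntax is assumed to require Python 3.6 or later, and recent Python versions reject a bare (?i) that is not at the start of the pattern, so only the first of the two spellings is shown. The pattern a(?i:b)c is an invented extra to show the scope of the option.

```python
import re

# The option applies to every branch inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")
print(bool(weekend.fullmatch("SUNDAY")))    # True
print(bool(weekend.fullmatch("Saturday")))  # True

# The option is limited to the group: only "b" is matched caselessly.
scoped = re.compile(r"a(?i:b)c")
print(bool(scoped.fullmatch("aBc")))  # True
print(bool(scoped.fullmatch("ABc")))  # False
```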

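Repetition
Repetition is specified by quantifiers, which can follow any of the following items:

a single character, possibly escaped

the . metacharacter

a character class

a back reference (see next section)

a parenthesized subpattern (unless it is an assertion - see below)

The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example: z{2,4} matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus [aeiou]{3,} matches at least 3 successive vowels, but may match many more, while \d{8} matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.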
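The three quantifier examples above can be exercised with Python's re module, used here as an assumed stand-in for PCRE. One difference worth knowing: Python treats {,6} as a quantifier meaning {0,6}, so the literal-string behaviour described for PCRE is not reproduced and is left out of the sketch.

```python
import re

candidates = ["z", "zz", "zzzz", "zzzzz"]

# z{2,4}: between two and four "z" characters.
z_runs = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# [aeiou]{3,}: three or more vowels, with no upper limit.
many_vowels = re.fullmatch(r"[aeiou]{3,}", "aeiouaeiou") is not None

# \d{8}: exactly eight digits.
eight_digits = re.fullmatch(r"\d{8}", "20070427") is not None
seven_digits = re.fullmatch(r"\d{8}", "2007042") is not None

print(z_runs)        # ['zz', 'zzzz']
print(many_vowels)   # True
print(eight_digits)  # True
print(seven_digits)  # False
```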

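The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present.

For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations:

Table 2. Single-character quantifiers

* equivalent to {0,}
+ equivalent to {1,}
? equivalent to {0,1}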

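I put this together separately in case I need it later when I don't have the book at hand.

. : matches any single character

| : matches either the left or the right side

[] : matches one of the members of the character set

[^] : matches any character except the members of the set

- : defines a range (as in [A-Z])

\ : escapes the next character

* : matches zero or more of the preceding character

*? : lazy version of *

+ : matches one or more of the preceding character

+? : lazy version of +

? : matches zero or one of the preceding character

{n} : matches the element exactly n times

{m,n} : matches the element from m to n times

{n,} : matches the element n or more times

{n,}? : lazy version of {n,}

^ : matches the start of the string

\A : matches the start of the string

$ : matches the end of the string

\Z : matches the end of the string

\< : matches the start of a word

\> : matches the end of a word

\b : matches a word boundary

\B : the opposite of \b

[\b] : backspace

\c : matches a control character

\d : matches any digit

\D : the opposite of \d

\f : form feed

\n : newline

\r : carriage return

\s : matches a whitespace character

\t : tab

\v : vertical tab

\w : matches an alphanumeric character or underscore

\W : the opposite of \w

\x : matches a hexadecimal number

\0 : matches an octal number

() : defines a subexpression

\1 : the first matched subexpression; the second is \2

?= : lookahead

?<= : lookbehind

?! : negative lookahead

?<! : negative lookbehind

?(backreference)true : conditional

?(backreference)true|false : conditional with an else expression

\E : ends a \L or \U conversion

\l : converts the next character to lower case

\L : converts all characters to lower case until \E

\u : converts the next character to upper case

\U : converts all characters to upper case until \E

(?m) : multiline mode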
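A few entries from the list above (backreference, lookahead, lookbehind, negative lookbehind) can be tried out as follows. This is a sketch that assumes Python's re module, which shares the syntax for these particular constructs; the sample strings are made up.

```python
import re

# \1 refers back to whatever the first subexpression matched.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")

# (?=...) lookahead: digits that are followed by " USD", without consuming it.
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")

# (?<=...) lookbehind: digits that are preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $15, id 42")

# (?<!...) negative lookbehind: whole numbers not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", "cost $15, id 42")

print(doubled)           # ['is', 'test']
print(usd_amounts)       # ['10', '30']
print(after_dollar)      # ['15']
print(not_after_dollar)  # ['42']
```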

Source - http://blog.naver.com/techbug/150007346770

by 뭔일이여 2007. 4. 27. 17:13

Azureus

Project/Azureus

Azureus, a Java-based BitTorrent P2P program
Let's look into it... heh
http://sourceforge.net/projects/azureus/

by 뭔일이여 2007. 4. 26. 10:37

Things to check when upgrading MySQL [4.0 -> 4.1]

Programming/MySQL

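# Items marked with (*) are Incompatible Changes. Take care!!

# At the time of writing, the latest 4.1 release is 4.1.10.

# This post is a translation of "Upgrading from Version 4.0 to 4.1" from the MySQL documentation.

- Character set support has been improved. The server supports multiple character sets.

- 4.1 stores table names and column names in UTF8. If you have table or column names that use characters outside the standard 7-bit US-ASCII range, use dump & restore.

Otherwise (if the files were copied or moved directly) the tables cannot be used and a "table not found" error occurs. In that case, downgrading to 4.0 makes them usable again.

- The password column in the grant tables is longer. Use mysql_fix_privilege_tables to update it.

- The format used by the Berkeley DB table handler has changed. If you have to downgrade to 4.0, back up with mysqldump, delete all log.XXXXXXXXXX files before starting the 4.0 server, and then reload the data.

- Per-connection time zones are supported. To use time zones by name, you must create the time zone tables.

- If you use the old DBD-mysql module (Msql-MySQL-modules), upgrade to a newer version (DBD-mysql 2.xx or later). If you do not upgrade, some methods (such as DBI->do()) will not report error status correctly.

- The --defaults-file=option-file-name option now gives an error if the option file does not exist.

- With version 4.0.12 or later you can try out the 4.1 changes in advance. Start mysqld with the --new option. You can also turn the behaviour on with SET @@new=1 and off with SET @@new=0.

If you think some of the 4.1 changes might cause problems (when upgrading), it is recommended that you try them out with the --new option before upgrading. You can do so by adding the following to your option file: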

[mysqld-4.0]
new

[Server Changes]

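1. All tables and columns now have a character set. The character set is shown by SHOW CREATE TABLE, and mysqldump output now includes character set settings. (Versions 4.0.6 and later can read this new dump format; earlier versions cannot.) This has no effect in environments that use a single character set. (That is, the character set settings in the mysqldump output mean nothing to the earlier versions.)

2. 4.0 data that uses a character set directly supported by 4.1 can be used as is. In addition, 4.1 stores database names, table names, and column names in Unicode (UTF8) regardless of the default character set.

(*)3. If you used TIMESTAMP columns in InnoDB tables with versions 4.1.0 to 4.1.3, you must dump and restore them. Those versions stored TIMESTAMP columns incorrectly. Version 4.0, or 4.1.4 and later, is not affected.

(*)4. Starting with 4.1.3, InnoDB uses the character set comparison functions to compare strings that are neither latin1 nor BINARY. This changes the sort order of the space character and of characters with a code below ASCII(32) within a character set. For latin1 and BINARY strings InnoDB still compares by padding the end of the string with spaces. If, in 4.1.2 or earlier, you have an index on a non-latin1 column, or a table with non-BINARY CHAR/VARCHAR/TEXT columns containing characters with a code below ASCII(32), rebuild the indexes with ALTER TABLE or OPTIMIZE TABLE after upgrading to 4.1.3. MyISAM tables in the same situation must also be rebuilt or repaired.

(*)5. If you used a prefix index on a UTF8 column or on a column in another multi-byte character set with versions 4.1.0 to 4.1.5, you must rebuild the table to upgrade to 4.1.6 or later.

(*)6. If you used accented characters (code values 128 to 255) in database, table, or column names before 4.1, you cannot upgrade directly to 4.1 (because 4.1 stores metadata in UTF8). Use RENAME TABLE to change them to table, database, and column names that UTF8 supports.

(*)7. CHAR(N) now means N characters. In earlier versions it meant N bytes. This makes no difference for single-byte character sets, but it matters if you use a multi-byte character set.

8. The format of frm files changed in 4.1. Versions 4.0.11 and later can read the new format, but earlier versions cannot. To move tables from 4.1 to such an earlier version, use dump & restore.

9. After upgrading to 4.1.1 or later, downgrading to 4.0 or 4.1.0 is difficult.

Earlier versions do not recognize InnoDB multiple tablespaces.

(*)10. With the per-connection time zone support in 4.1.3, the timezone system variable was renamed to system_time_zone.

11. Windows servers support connections from local clients through shared memory when started with the --shared-memory option. If you run more than one server on a single Windows machine, use the --shared-memory-base-name option for each server.

12. The interface for aggregate UDF functions changed slightly. You must now declare an xxx_clear() function for each aggregate function XXX().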

[Client Changes]

mysqldump now has the --opt and --quote-names options enabled by default. They can be disabled with --skip-opt and --skip-quote-names.

[SQL Changes]

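(*)1. Starting with 4.1.2, the Type column in SHOW TABLE STATUS output was renamed to Engine.

(The TYPE option is still supported throughout the 4.x series, but will be removed in 5.1.)

(*)2. String comparison now follows the SQL standard. Instead of stripping trailing spaces before comparing, the shorter string is extended with spaces. One consequence is that 'a' > 'a\t' now holds, which was not the case in earlier versions. If you have CHAR or VARCHAR columns whose values end with a character with a code below ASCII(32), repair them with REPAIR TABLE or myisamchk.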
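Why padding makes 'a' sort after 'a\t' can be seen in a tiny model. The sketch below is plain Python, not MySQL: the hypothetical helper pad_space_compare pads the shorter string with spaces and then compares by code point, which ignores real collation rules and is meant only to show the arithmetic (space is code 32, tab is code 9).

```python
def pad_space_compare(left: str, right: str) -> int:
    """Compare two strings after padding the shorter one with spaces.

    Returns -1, 0 or 1, like a three-way comparison.
    """
    width = max(len(left), len(right))
    a, b = left.ljust(width), right.ljust(width)
    return (a > b) - (a < b)


print(pad_space_compare("a", "a\t"))  # 1: "a " sorts after "a\t" because 32 > 9
print(pad_space_compare("a", "a "))   # 0: trailing spaces make no difference
```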

3. When using multiple-table DELETE, you must refer to the table being deleted from by its alias.

Instead of this:

DELETE test FROM test AS t1, test2 WHERE ...

Do this:

DELETE t1 FROM test AS t1, test2 WHERE ...

This corrects a problem that occurred in 4.0.

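(*)4. TIMESTAMP is now returned as a string in 'YYYY-MM-DD HH:MM:SS' format. (From 4.0.12, the --new option makes the server behave this way.)

If you want it returned as a number, as in 4.0, add +0 when selecting it.

mysql> SELECT ts_col + 0 FROM tbl_name;

Display widths for TIMESTAMP columns are no longer supported. In TIMESTAMP(10), the (10) is ignored.

These changes were necessary for SQL standard compliance. In a future version, a further change that is backward compatible with this one will allow the timestamp length to indicate the desired number of digits for fractions of a second.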

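(*)5. Binary values such as 0xFFDF are now treated as strings rather than numbers. This solves character set problems when entering a string as a binary value. Use CAST() to treat such a value as a number.

mysql> SELECT CAST(0xFEFF AS UNSIGNED INTEGER) < CAST(0xFF AS UNSIGNED INTEGER);
-> 0

Without CAST(), the values are compared as strings.

mysql> SELECT 0xFEFF < 0xFF;
-> 1

The same applies when a binary value is used in a numeric context or with the = operator. (From 4.0.13, the --new option lets you try this behaviour on a 4.0 server.)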
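The difference between the two results can be reproduced without a server. The sketch below is plain Python, not MySQL: it compares the same two values once as integers and once as byte strings, which is an assumed model of the numeric and string interpretations described above.

```python
# Numeric interpretation: 65279 < 255 is false, matching the CAST() result of 0.
as_numbers = 0xFEFF < 0xFF

# String interpretation: bytes are compared left to right, and 0xFE < 0xFF
# decides the result at the first byte, matching the result of 1.
as_strings = bytes.fromhex("FEFF") < bytes.fromhex("FF")

print(as_numbers)  # False
print(as_strings)  # True
```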

6. For functions that produce a DATE, DATETIME, or TIME value, the result returned to the client is now fixed up to have a temporal type. For example, 4.1 returns the following:

mysql> SELECT CAST('2001-1-1' AS DATETIME);
-> '2001-01-01 00:00:00'

In 4.0 the result format is different.

mysql> SELECT CAST('2001-1-1' AS DATETIME);
-> '2001-01-01'

7. DEFAULT can no longer be specified for an AUTO_INCREMENT column. (It was ignored in 4.0, but 4.1 raises an error.)

8. LIMIT no longer accepts negative arguments. (Use a large integer instead of -1.)

9. SERIALIZE is no longer a valid mode value for the sql_mode variable. Use SET TRANSACTION ISOLATION LEVEL SERIALIZABLE instead. SERIALIZE is also no longer a valid value for the --sql-mode option;

use --transaction-isolation=SERIALIZABLE instead.

[C API Changes]

(*)1. In 4.1.3 the mysql_shutdown() function takes an additional argument (a SHUTDOWN level). Existing calls of mysql_shutdown(X) must be changed to mysql_shutdown(X,SHUTDOWN_DEFAULT).

2. Some APIs, such as mysql_real_query(), now return 1 rather than -1 on error. If you have code like the following, it must be changed.

if (mysql_real_query(mysql_object, query, query_length) == -1)
{
    printf("Got error");
}

Change it to check for a non-zero value.

if (mysql_real_query(mysql_object, query, query_length) != 0)
{
    printf("Got error");
}

[Password-Handling Changes]

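- The password hashing mechanism has changed to improve security. This can cause problems with clients that use the 4.0 or earlier client library. (It can happen when a client that has not yet been upgraded to 4.1 connects remotely.) The following items describe strategies for dealing with this. They represent compromises between supporting old clients and security.

1. Upgrade only the clients to use the 4.1 library. No changes are needed (apart from the return values of some API calls), but neither the server nor the clients can use the new 4.1 features.

2. Upgrade to 4.1 and run the mysql_fix_privilege_tables script to widen the password column. However, start the server with the --old-passwords option to keep backward compatibility. Once all clients have been upgraded to 4.1, you can stop using --old-passwords. You can also change the passwords to the new format.

3. Upgrade to 4.1 and run the mysql_fix_privilege_tables script to widen the password column. If all clients have been upgraded to 4.1, run the server without the --old-passwords option. Instead, change the passwords of all accounts to the new format.

4. In a NetWare environment, upgrade your Perl and PHP versions when upgrading from 4.0 to 4.1.

Source - 데이터베이스 사랑넷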

by 뭔일이여 2007. 4. 18. 10:30

Line Rider

Daily life/Misc.

http://www.official-linerider.com/play.html

by 뭔일이여 2007. 4. 16. 18:58

I am happy.................................

Daily life/Misc.

by 뭔일이여 2007. 4. 7. 12:51
