Tokenizer: Brackets in regular expression
- nanuqcz
- Member | 822
Hello,
I'm trying to create a simple tokenizer:
$text = 'foo OR bar';
$tokenizer = new Tokenizer([
'or' => '((OR)|(or))', // <== Here is the problem. This works fine: 'or' => 'OR',
'whitespace' => '\s+',
'word' => '\w+',
]);
$tokens = $tokenizer->tokenize($text);
dump($tokens);
Expected output:
array(5) {
[0]=>array(3) {
[0]=>string(3) "foo"
[1]=>int(0)
[2]=>string(4) "word"
}
[1]=>array(3) {
[0]=>string(1) " "
[1]=>int(3)
[2]=>string(10) "whitespace"
}
[2]=>array(3) {
[0]=>string(2) "OR"
[1]=>int(4)
[2]=>string(2) "or"
}
[3]=>array(3) {
[0]=>string(1) " "
[1]=>int(6)
[2]=>string(10) "whitespace"
}
[4]=>array(3) {
[0]=>string(3) "bar"
[1]=>int(7)
[2]=>string(4) "word"
}
}
Actual output:
array(5) {
[0]=>array(3) {
[0]=>string(3) "foo"
[1]=>int(0)
[2]=>NULL
}
[1]=>array(3) {
[0]=>string(1) " "
[1]=>int(3)
[2]=>NULL
}
[2]=>array(3) {
[0]=>string(2) "OR"
[1]=>int(4)
[2]=>string(2) "or"
}
[3]=>array(3) {
[0]=>string(1) " "
[1]=>int(6)
[2]=>NULL
}
[4]=>array(3) {
[0]=>string(3) "bar"
[1]=>int(7)
[2]=>NULL
}
}
Nette Tokenizer v2.2.4, PHP 7.0.22, Ubuntu 16.04
Am I wrong, or is there a bug in Nette/Tokenizer?
Thank you.
- Jan Tvrdík
- Nette guru | 2595
Untested, but try non-capturing groups:
$text = 'foo OR bar';
$tokenizer = new Tokenizer([
'or' => '(?:OR)|(?:or)', // non-capturing groups instead of the original capturing ones
'whitespace' => '\s+',
'word' => '\w+',
]);
$tokens = $tokenizer->tokenize($text);
dump($tokens);
- David Matějka
- Moderator | 6445
Your pattern creates additional capturing groups, and that breaks the tokenizer,
because the tokenizer itself creates exactly one capturing group for every token type
and uses the group position to decide the type. You have to turn your own groups
into non-capturing groups using ?:
'or' => '(?:(?:OR)|(?:or))',
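To illustrate why the extra groups matter, here is a minimal sketch of how such a tokenizer could work. It is an assumption for illustration only, not Nette's actual code: sketchTokenize() is a hypothetical helper that joins all patterns into one regex with one capturing group per token type and then maps the first non-empty group back to a type name.

```php
<?php
// Hypothetical helper, NOT the real Nette\Tokenizer implementation.
// It wraps every pattern in exactly one capturing group and assumes that
// group number N belongs to the N-th token type.
function sketchTokenize(array $patterns, string $input): array
{
    $re = '~(' . implode(')|(', $patterns) . ')~A'; // A = anchored at the offset
    $types = array_keys($patterns);
    $tokens = [];
    $offset = 0;

    while ($offset < strlen($input) && preg_match($re, $input, $m, 0, $offset)) {
        if ($m[0] === '') {
            break; // guard against zero-length matches
        }
        $type = null;
        // Only groups 1..count($types) are inspected. Any extra capturing
        // group inside a pattern shifts the later types out of this range.
        foreach ($types as $i => $name) {
            if (isset($m[$i + 1]) && $m[$i + 1] !== '') {
                $type = $name;
                break;
            }
        }
        $tokens[] = [$m[0], $offset, $type];
        $offset += strlen($m[0]);
    }
    return $tokens;
}

$rest = ['whitespace' => '\s+', 'word' => '\w+'];

// Capturing groups: '\w+' ends up as group 6, but only groups 1-3 are checked.
$broken = sketchTokenize(['or' => '((OR)|(or))'] + $rest, 'foo OR bar');

// Non-capturing groups: '\s+' is group 2 and '\w+' is group 3, as expected.
$fixed = sketchTokenize(['or' => '(?:(?:OR)|(?:or))'] + $rest, 'foo OR bar');

$typeOf = function (array $token) { return $token[2]; };
var_dump(array_map($typeOf, $broken)); // NULL, NULL, "or", NULL, NULL
var_dump(array_map($typeOf, $fixed));  // "word", "whitespace", "or", "whitespace", "word"
```

With the capturing version the combined regex is (((OR)|(or)))|(\s+)|(\w+), which has six groups for three token types, so only the 'or' token still lands on the group number the lookup expects. That matches the NULL types in the output posted above.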
- nanuqcz
- Member | 822
David Matějka wrote:
because you are creating another capturing group, which breaks tokenizer, because tokenizer creates exactly one capturing group for every token. you have to change it to non-capturing group using ?:
Now the (?: syntax is clear to me.
Thanks, everyone ;-)