Tokenizer: Brackets in regular expression

nanuqcz
Member | 822

Hello,
I'm trying to create a simple tokenizer:

$text = 'foo OR bar';
$tokenizer = new Tokenizer([
	'or' => '((OR)|(or))',  // <== Here is the problem. This works fine: 'or' => 'OR',
	'whitespace' => '\s+',
	'word' => '\w+',
]);
$tokens = $tokenizer->tokenize($text);
dump($tokens);

Expected output:

array(5) {
  [0]=>array(3) {
    [0]=>string(3) "foo"
    [1]=>int(0)
    [2]=>string(4) "word"
  }
  [1]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(3)
    [2]=>string(10) "whitespace"
  }
  [2]=>array(3) {
    [0]=>string(2) "OR"
    [1]=>int(4)
    [2]=>string(2) "or"
  }
  [3]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(6)
    [2]=>string(10) "whitespace"
  }
  [4]=>array(3) {
    [0]=>string(3) "bar"
    [1]=>int(7)
    [2]=>string(4) "word"
  }
}

Actual output:

array(5) {
  [0]=>array(3) {
    [0]=>string(3) "foo"
    [1]=>int(0)
    [2]=>NULL
  }
  [1]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(3)
    [2]=>NULL
  }
  [2]=>array(3) {
    [0]=>string(2) "OR"
    [1]=>int(4)
    [2]=>string(2) "or"
  }
  [3]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(6)
    [2]=>NULL
  }
  [4]=>array(3) {
    [0]=>string(3) "bar"
    [1]=>int(7)
    [2]=>NULL
  }
}

Nette Tokenizer v2.2.4, PHP 7.0.22, Ubuntu 16.04

Am I wrong, or is there a bug in Nette/Tokenizer?

Thank you.

Jan Tvrdík
Nette guru | 2595

Untested, but try non-capturing groups:

$text = 'foo OR bar';
$tokenizer = new Tokenizer([
    'or' => '(?:OR)|(?:or)',  // <== non-capturing groups instead of capturing ones
    'whitespace' => '\s+',
    'word' => '\w+',
]);
$tokens = $tokenizer->tokenize($text);
dump($tokens);

nanuqcz
Member | 822

It's working :-O You are a magician, thank you :-)


EDIT: But I don't get it.

This one works:

	'or' => '(?:OR)|(?:or)',

but this one does not:

	'or' => '((?:OR)|(?:or))',

Last edited by nanuqcz (2017-09-08 10:10)

David Matějka
Moderator | 6445

Because you are creating another capturing group, and that breaks the tokenizer: it creates exactly one capturing group for every token and expects no others. You have to change the outer group to a non-capturing group using ?:

'or' => '(?:(?:OR)|(?:or))',
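
To see why an extra capturing group shifts everything, here is a minimal sketch in plain PHP. It is not the Tokenizer's actual code; `combinePatterns()` and `firstMatchedGroup()` are hypothetical helpers that only assume what is described above: each token pattern is wrapped in exactly one capturing group, and the index of the matched group identifies the token type.

```php
<?php
// Hypothetical helper (an assumption, not the library's API): wrap each
// token pattern in exactly one capturing group and join them with "|".
function combinePatterns(array $patterns): string
{
    return '~(' . implode(')|(', $patterns) . ')~A';
}

// Hypothetical helper: index of the first non-empty capturing group for a
// match anchored at the start of $input, or 0 if nothing matched.
function firstMatchedGroup(string $regex, string $input): int
{
    if (!preg_match($regex, $input, $m)) {
        return 0;
    }
    for ($i = 1; $i < count($m); $i++) {
        if ($m[$i] !== '') {
            return $i;
        }
    }
    return 0;
}

$broken = combinePatterns([
    'or' => '((OR)|(or))',   // adds three extra capturing groups
    'whitespace' => '\s+',
    'word' => '\w+',
]);
$fixed = combinePatterns([
    'or' => 'OR|or',         // no extra capturing groups
    'whitespace' => '\s+',
    'word' => '\w+',
]);

// With three token types, only group indexes 1..3 can be mapped to a type.
$brokenGroup = firstMatchedGroup($broken, 'foo'); // 6: beyond the three types
$fixedGroup  = firstMatchedGroup($fixed, 'foo');  // 3: the third type, 'word'

var_dump($brokenGroup, $fixedGroup);
```

In the broken variant the `\w+` pattern ends up in group 6, so a lookup that expects one group per token type finds no type for "foo". That is consistent with the NULL types in the actual output above.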

David Grudl
Nette Core | 8129

Simply use 'or' => 'OR|or'

nanuqcz
Member | 822

David Matějka wrote:

Because you are creating another capturing group, and that breaks the tokenizer: it creates exactly one capturing group for every token and expects no others. You have to change the outer group to a non-capturing group using ?:

Now the (?: syntax is clear to me.

Thanks to everyone ;-)