Tokenizer: Brackets in regular expression

14 days ago

nanuqcz
Member | 844

Hello,
I'm trying to create a simple tokenizer:

$text = 'foo OR bar';
$tokenizer = new Tokenizer([
    'or' => '((OR)|(or))',  // <== Here is the problem. This works fine: 'or' => 'OR',
    'whitespace' => '\s+',
    'word' => '\w+',
]);
$tokens = $tokenizer->tokenize($text);
dump($tokens);

Expected output:

array(5) {
  [0]=>array(3) {
    [0]=>string(3) "foo"
    [1]=>int(0)
    [2]=>string(4) "word"
  }
  [1]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(3)
    [2]=>string(10) "whitespace"
  }
  [2]=>array(3) {
    [0]=>string(2) "OR"
    [1]=>int(4)
    [2]=>string(2) "or"
  }
  [3]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(6)
    [2]=>string(10) "whitespace"
  }
  [4]=>array(3) {
    [0]=>string(3) "bar"
    [1]=>int(7)
    [2]=>string(4) "word"
  }
}

Actual output:

array(5) {
  [0]=>array(3) {
    [0]=>string(3) "foo"
    [1]=>int(0)
    [2]=>NULL
  }
  [1]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(3)
    [2]=>NULL
  }
  [2]=>array(3) {
    [0]=>string(2) "OR"
    [1]=>int(4)
    [2]=>string(2) "or"
  }
  [3]=>array(3) {
    [0]=>string(1) " "
    [1]=>int(6)
    [2]=>NULL
  }
  [4]=>array(3) {
    [0]=>string(3) "bar"
    [1]=>int(7)
    [2]=>NULL
  }
}

Nette Tokenizer v2.2.4, PHP 7.0.22, Ubuntu 16.04

Am I wrong, or is there a bug in Nette/Tokenizer?

Thank you.

14 days ago

Jan Tvrdík
Nette guru | 2529

Untested, but try this:

$text = 'foo OR bar';
$tokenizer = new Tokenizer([
    'or' => '(?:OR)|(?:or)',  // non-capturing groups instead of capturing ones
    'whitespace' => '\s+',
    'word' => '\w+',
]);
$tokens = $tokenizer->tokenize($text);
dump($tokens);
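
For readers unfamiliar with the difference, here is a minimal sketch in plain PHP (no Nette involved) of what a capturing group adds to the match array compared with a non-capturing one. The variable names are only illustrative.

```php
<?php
// Capturing groups: each (...) gets its own slot in the match array.
preg_match('~(OR)|(or)~', 'or', $capturing);
// $capturing holds the full match plus one entry per group,
// including an empty string for the (OR) group that did not take part.

// Non-capturing groups: (?:...) only groups, it adds no slot.
preg_match('~(?:OR)|(?:or)~', 'or', $nonCapturing);
// $nonCapturing holds the full match only.

var_dump(count($capturing), count($nonCapturing));
```

Any code that relies on group positions, as a tokenizer typically does, sees a different layout depending on which form is used.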

13 days ago

nanuqcz
Member | 844

It works :-O You are a magician, thank you :-)


EDIT: But I don't get it.

This one works:

'or' => '(?:OR)|(?:or)',

but this one doesn't:

'or' => '((?:OR)|(?:or))',

Last edited by nanuqcz (2017-09-08 10:10)

13 days ago

David Matějka
Moderator | 5082

Because you are creating another capturing group, which breaks the tokenizer: it creates exactly one capturing group for every token. You have to change it to a non-capturing group using ?:

'or' => '(?:(?:OR)|(?:or))',
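
To see why an extra group shifts things, here is a rough sketch of how such a tokenizer might work. This is an assumption for illustration, not Nette's actual code: the hypothetical helper sketchTokenize() wraps every pattern in one capturing group, joins them with |, and takes the index of the group that matched as the token type.

```php
<?php
// Hypothetical stand-in for the tokenizer, NOT Nette's implementation.
// Assumption: one capturing group per pattern, group index selects the type.
function sketchTokenize(array $patterns, string $input): array
{
    $types = array_keys($patterns);
    // e.g. ~(OR)|(\s+)|(\w+)~A  (A = anchored, each match starts where the last ended)
    $re = '~(' . implode(')|(', $patterns) . ')~A';
    preg_match_all($re, $input, $matches, PREG_SET_ORDER);

    $tokens = [];
    $offset = 0;
    foreach ($matches as $match) {
        $type = null;
        // Only groups 1..count($types) are inspected, one per token type.
        for ($i = 1; $i <= count($types); $i++) {
            if (!isset($match[$i])) {
                break;
            }
            if ($match[$i] !== '') {
                $type = $types[$i - 1];
                break;
            }
        }
        $tokens[] = [$match[0], $offset, $type];
        $offset += strlen($match[0]);
    }
    return $tokens;
}

// Extra capturing groups push \s+ and \w+ past the inspected range.
$broken = sketchTokenize(
    ['or' => '((OR)|(or))', 'whitespace' => '\s+', 'word' => '\w+'],
    'foo OR bar'
);

// Non-capturing groups keep one capturing group per token.
$fixed = sketchTokenize(
    ['or' => '(?:(?:OR)|(?:or))', 'whitespace' => '\s+', 'word' => '\w+'],
    'foo OR bar'
);
```

Under this assumption, $broken has a type only for the OR token, while $fixed labels all five tokens.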

13 days ago

David Grudl
founder | 6692

Simply use 'or' => 'OR|or'
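
A small sketch of why no brackets are needed at all, assuming (as described earlier in the thread) that the tokenizer already wraps each pattern in its own capturing group. The alternation then stays inside that group. The variable names are illustrative only.

```php
<?php
$patterns = ['or' => 'OR|or', 'whitespace' => '\s+', 'word' => '\w+'];

// Assumed combination step: one capturing group per pattern.
$re = '~(' . implode(')|(', $patterns) . ')~A';
// $re is now: ~(OR|or)|(\s+)|(\w+)~A

preg_match_all($re, 'foo or bar', $matches, PREG_SET_ORDER);

// For every match, record which group (1-based) actually matched.
$groupIndexes = [];
foreach ($matches as $match) {
    for ($i = 1; $i < count($match); $i++) {
        if ($match[$i] !== '') {
            $groupIndexes[] = $i;
            break;
        }
    }
}
```

Each index in $groupIndexes maps directly onto the position of the pattern in $patterns, so 'or' stays group 1, 'whitespace' group 2 and 'word' group 3.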

13 days ago

nanuqcz
Member | 844

David Matějka wrote:

Because you are creating another capturing group, which breaks the tokenizer: it creates exactly one capturing group for every token. You have to change it to a non-capturing group using ?:

Now the (?: syntax is clear to me.

Thanks, everyone ;-)