Standard Tokenizer
A tokenizer of type standard provides grammar-based tokenization that works well for most European-language documents. It implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
The following settings can be configured for a standard tokenizer:
Setting | Description |
---|---|
max_token_length | The maximum token length. A token that exceeds this length is discarded. Defaults to 255. |
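
As a rough illustration of how this setting might be used, the sketch below builds an index settings body that defines a standard tokenizer with a custom max_token_length and wires it into a custom analyzer. The names my_standard and my_analyzer are hypothetical, chosen only for this example. The helper drop_long_tokens is likewise hypothetical: it only mirrors the rule stated in the table above (tokens longer than the limit are discarded) and is not the tokenizer's actual implementation.

```python
import json


def standard_tokenizer_settings(max_token_length=255):
    """Build an index settings body that configures a standard tokenizer.

    "my_standard" and "my_analyzer" are hypothetical names for this sketch.
    """
    return {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "my_standard": {
                        "type": "standard",
                        "max_token_length": max_token_length,
                    }
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "my_standard",
                    }
                },
            }
        }
    }


def drop_long_tokens(tokens, max_token_length=255):
    """Illustrate the documented rule: discard tokens longer than the limit.

    This is a stand-in for the behavior described in the table, not the
    tokenizer's real code. Tokens whose length equals the limit are kept.
    """
    return [t for t in tokens if len(t) <= max_token_length]


if __name__ == "__main__":
    # The JSON body that would be sent when creating the index.
    print(json.dumps(standard_tokenizer_settings(5), indent=2))
    # "jumped" has 6 characters, so it is dropped with a limit of 5.
    print(drop_long_tokens(["quick", "brown", "jumped"], 5))
```

The printed JSON is the kind of body that would accompany an index creation request. Setting a small limit such as 5 is only useful for experimenting; the default of 255 suits most text.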