Tokenizers #
Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
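As a quick illustration (a minimal sketch, assuming a local Elasticsearch node listening on localhost:9200), the _analyze API can be used to see the tokens a given tokenizer produces:

```
curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&pretty' -d 'The QUICK brown fox'
```

The whitespace tokenizer above would return the terms The, QUICK, brown and fox, each with its position and character offsets.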
Standard Tokenizer #
standard
A grammar-based tokenizer that works well for most European-language documents. It implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
Settings:
max_token_length
: The maximum token length (defaults to 255)
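For example, a custom tokenizer based on standard with a shorter maximum token length could be registered in the index settings like this (a sketch; my_index, my_std and my_std_analyzer are placeholder names):

```
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_std": { "type": "standard", "max_token_length": 5 }
      },
      "analyzer": {
        "my_std_analyzer": { "type": "custom", "tokenizer": "my_std" }
      }
    }
  }
}'
```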
Edge NGram Tokenizer #
edgeNGram
Works like the nGram tokenizer, but only emits n-grams that are anchored to the start of the input, which makes it useful for prefix matching such as search-as-you-type.
Settings:
min_gram
: Minimum length of the emitted n-grams (defaults to 1)
max_gram
: Maximum length of the emitted n-grams (defaults to 2)
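A sketch of a custom edgeNGram tokenizer wired into a custom analyzer (all names here are placeholders):

```
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_edge": { "type": "edgeNGram", "min_gram": 2, "max_gram": 5 }
      },
      "analyzer": {
        "autocomplete": { "type": "custom", "tokenizer": "my_edge", "filter": ["lowercase"] }
      }
    }
  }
}'
```

With these settings, the input quick would produce the terms qu, qui, quic and quick.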
Keyword Tokenizer #
keyword
Emits the entire input as a single token.
Settings:
buffer_size
: Term buffer size (defaults to 256)
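Because the whole input becomes one term, the keyword tokenizer is mostly useful in combination with token filters. For example (assuming a local node):

```
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&pretty' -d 'New York City'
```

This returns the single term New York City.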
Letter Tokenizer #
letter
Divides text at non-letter characters, emitting maximal runs of consecutive letters as terms. This works reasonably well for most European languages, but poorly for languages in which words are not separated by spaces.
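For example, analyzing a string containing digits and punctuation (assuming a local node):

```
curl -XGET 'localhost:9200/_analyze?tokenizer=letter&pretty' -d "You're the No. 1 fan"
```

This produces the terms You, re, the, No and fan; the digit is dropped because it is not a letter.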
Lowercase Tokenizer #
lowercase
Works like the letter tokenizer, but also lowercases every term it emits. It is equivalent to combining the letter tokenizer with the lowercase token filter, but does both in a single pass.
NGram Tokenizer #
nGram
Breaks the input into n-grams (overlapping character sequences) of configurable length, which is useful for partial matching and for languages that do not separate words with whitespace.
Settings:
min_gram
: Minimum length of the emitted n-grams (defaults to 1)
max_gram
: Maximum length of the emitted n-grams (defaults to 2)
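A sketch of a custom nGram tokenizer (placeholder names):

```
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram": { "type": "nGram", "min_gram": 2, "max_gram": 3 }
      },
      "analyzer": {
        "my_ngram_analyzer": { "type": "custom", "tokenizer": "my_ngram" }
      }
    }
  }
}'
```

With these settings, the input fox would produce the grams fo, ox and fox.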
Whitespace Tokenizer #
whitespace
Divides text at whitespace.
Pattern Tokenizer #
pattern
Uses a regular expression to split text into terms.
Settings:
pattern
: The regular expression (defaults to \W+, i.e. split at non-word characters)
group
: Which capture group to extract as tokens; defaults to -1, meaning the pattern is used as a splitter
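For example, a tokenizer that splits comma-separated values (a sketch; the index, tokenizer and analyzer names are placeholders):

```
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "tokenizer": {
        "comma_tokenizer": { "type": "pattern", "pattern": "," }
      },
      "analyzer": {
        "comma_analyzer": { "type": "custom", "tokenizer": "comma_tokenizer" }
      }
    }
  }
}'
```

Analyzing red,green,blue with comma_analyzer yields the terms red, green and blue.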
UAX Email URL Tokenizer #
uax_url_email
Works like the standard tokenizer, but recognizes URLs and email addresses and keeps each one as a single token.
Settings:
max_token_length
: The maximum token length (defaults to 255)
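For example (assuming a local node):

```
curl -XGET 'localhost:9200/_analyze?tokenizer=uax_url_email&pretty' -d 'Email me at john.smith@example.com'
```

The standard tokenizer would break the address into several terms, while uax_url_email keeps john.smith@example.com as a single term.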
Path Hierarchy Tokenizer #
path_hierarchy
Takes a path-like value, splits it on the path separator, and emits a term for the path at every level of the hierarchy.
Settings:
delimiter
: The character used to split the path (defaults to /)
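For example (assuming a local node):

```
curl -XGET 'localhost:9200/_analyze?tokenizer=path_hierarchy&pretty' -d '/usr/local/bin'
```

This produces the terms /usr, /usr/local and /usr/local/bin, which makes it easy to match a document by any of its ancestor paths.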
Classic Tokenizer #
classic
A grammar-based tokenizer that works well for English documents. It splits words at punctuation and at hyphens (unless the token contains a number), and keeps email addresses and internet host names as single tokens.
Settings:
max_token_length
: The maximum token length (defaults to 255)
Thai Tokenizer #
thai
Segments Thai text into words, using the Thai segmentation rules built into the JDK. Text in other languages is treated the same way as by the standard tokenizer.