Tokenizers #

Tokenizers are used to break a string down into a stream of terms or tokens. A simple tokenizer might split the string up into terms wherever it encounters whitespace or punctuation.
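
For example, the _analyze API can be used to see which terms a tokenizer produces for a given string. The request below is only a sketch; it assumes a cluster reachable on localhost and an Elasticsearch version whose _analyze endpoint accepts a JSON request body:

    POST /_analyze
    {
      "tokenizer": "whitespace",
      "text": "The quick brown fox."
    }

With the whitespace tokenizer this yields the terms The, quick, brown and fox. (the trailing period stays attached, because only whitespace acts as a boundary); the standard tokenizer would strip the punctuation and emit fox instead.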

Standard Tokenizer #

standard

A grammar-based tokenizer that works well for most European-language documents. It implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Settings:

  • max_token_length: The maximum token length; tokens longer than this are split at max_token_length intervals (defaults to 255). See the example below.
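
As an illustration, the analysis settings below cap tokens at 5 characters. This is only a sketch; my_index, my_tokenizer and my_analyzer are placeholder names:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "my_tokenizer": {
              "type": "standard",
              "max_token_length": 5
            }
          },
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "tokenizer": "my_tokenizer"
            }
          }
        }
      }
    }

With this tokenizer, a long word such as elasticsearch is emitted as the terms elast, icsea and rch, because the token is split at max_token_length intervals.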

Edge NGram Tokenizer #

edgeNGram

Like the nGram tokenizer below, except that the emitted n-grams are anchored to the start of the input, which makes this tokenizer useful for prefix matching such as search-as-you-type. The main settings are min_gram (defaults to 1), max_gram (defaults to 2) and token_chars.

Keyword Tokenizer #

keyword

Emits the entire input as a single token (see the analyzer sketch below).

Settings:

  • buffer_size: Term buffer size (defaults to 256)
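
A common use is to pair the keyword tokenizer with the lowercase token filter for case-insensitive exact matching. The sketch below uses placeholder names (my_index, exact_lowercase):

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "exact_lowercase": {
              "type": "custom",
              "tokenizer": "keyword",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }

With this analyzer, the input New York is indexed as the single term new york.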

Letter Tokenizer #

letter

Divides text at non-letter characters, so each maximal run of letters becomes a term. It does a reasonable job for most European languages, but not for languages in which words are not separated by spaces.

Lowercase Tokenizer #

lowercase

Acts like the letter tokenizer, but also lowercases every term. It is functionally equivalent to combining the letter tokenizer with the lowercase token filter, but more efficient because both steps are performed in a single pass.

NGram Tokenizer #

nGram

Breaks text into n-grams: a sliding window of character sequences of every length between min_gram (defaults to 1) and max_gram (defaults to 2). The token_chars setting controls which character classes may appear in a token; any other character is treated as a boundary.
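
As a sketch, the settings below define an nGram tokenizer that emits grams of two and three characters. my_index and trigram_tokenizer are placeholder names, and newer Elasticsearch versions spell the type name ngram:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "trigram_tokenizer": {
              "type": "nGram",
              "min_gram": 2,
              "max_gram": 3
            }
          }
        }
      }
    }

For the input fox, this tokenizer emits the terms fo, fox and ox.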

Whitespace Tokenizer #

whitespace

Divides text at whitespace characters.

Pattern Tokenizer #

pattern

Uses a regular expression to split text into terms. The pattern setting defaults to \W+, so by default the text is split whenever non-word characters are encountered; the flags and group settings are also supported.
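
For instance, a pattern tokenizer can split comma-separated values. The sketch below uses the placeholder names my_index and comma_tokenizer:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "comma_tokenizer": {
              "type": "pattern",
              "pattern": ","
            }
          }
        }
      }
    }

Applied to red,green,blue this emits the terms red, green and blue.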

UAX Email URL Tokenizer #

uax_url_email

Works like the standard tokenizer, except that it recognizes URLs and email addresses and emits each of them as a single token. It also supports the max_token_length setting.
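
A quick way to see the difference from the standard tokenizer is the _analyze API (again assuming a version that accepts a JSON request body):

    POST /_analyze
    {
      "tokenizer": "uax_url_email",
      "text": "Email john.smith@example.com for details"
    }

This keeps john.smith@example.com as a single term, whereas the standard tokenizer would split it apart.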

Path Hierarchy Tokenizer #

path_hierarchy

Takes a path-like value, splits it on the path separator and emits a term for every ancestor path. The delimiter defaults to /; the replacement, buffer_size, reverse and skip settings are also supported.
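
As a sketch, the default behaviour can be seen with the _analyze API (assuming a version that accepts a JSON request body):

    POST /_analyze
    {
      "tokenizer": "path_hierarchy",
      "text": "/usr/local/bin"
    }

This emits the terms /usr, /usr/local and /usr/local/bin; the reverse setting emits suffixes instead of prefixes.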

Classic Tokenizer #

classic

A grammar-based tokenizer that is good for English-language documents. It has special handling for acronyms, company names, email addresses and internet host names, and supports the max_token_length setting.

Thai Tokenizer #

thai

Segments Thai text into words, using the Thai segmentation algorithm bundled with Java. Text in other languages is treated the same way as by the standard tokenizer.