MySQL ngram Full Text Parser

This blog post shows you how to use the MySQL ngram full-text parser to support full-text searches for languages such as Chinese, Japanese, and Korean.

Introduction to MySQL ngram full-text parser

The built-in MySQL full-text parser determines the beginning and end of words by means of white space. For languages such as Chinese, Japanese, or Korean, this is a limitation because these languages do not use word delimiters.

To address this issue, MySQL provides the ngram full-text parser. Since version 5.7.6, MySQL has included the ngram full-text parser as a built-in server plugin, meaning that MySQL loads this plugin automatically when the MySQL database server starts. MySQL supports the ngram full-text parser for both the InnoDB and MyISAM storage engines.

An ngram is a contiguous sequence of n characters taken from a sequence of text. The main function of the ngram full-text parser is to tokenize a sequence of text into contiguous sequences of n characters.

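For example, the ngram full-text parser tokenizes the text mysql as follows for different values of n: with n = 1 the tokens are m, y, s, q, l; with n = 2 they are my, ys, sq, ql; with n = 3 they are mys, ysq, sql; with n = 4 they are mysq, ysql; and with n = 5 the only token is mysql.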
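The sliding-window idea behind this tokenization can be emulated in plain SQL. The sketch below is only an illustration, not how the parser is implemented; it assumes MySQL 8.0 or later (for recursive common table expressions), and the user variables @txt and @n are hypothetical names chosen for this example.

```sql
-- Emulate ngram tokenization with a sliding window (illustration only).
SET @txt = 'mysql';  -- text to tokenize (assumed sample value)
SET @n = 2;          -- token size; 2 means bigrams

WITH RECURSIVE positions (pos) AS (
    -- Start at the first character
    SELECT 1
    UNION ALL
    -- Advance one character at a time while a full n-character window still fits
    SELECT pos + 1
    FROM positions
    WHERE pos + 1 <= CHAR_LENGTH(@txt) - @n + 1
)
SELECT pos, SUBSTRING(@txt, pos, @n) AS token
FROM positions;
-- With @n = 2 the tokens are: my, ys, sq, ql
```

Changing @n to 3 produces mys, ysq, and sql, which shows that each token overlaps the previous one by n - 1 characters.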

Creating FULLTEXT indexes with ngram parser

To create a FULLTEXT index that uses the ngram full-text parser, you add the WITH PARSER ngram clause to the CREATE TABLE, ALTER TABLE, or CREATE INDEX statement.

For instance, a CREATE TABLE statement can create a new posts table and add its heading and body columns to a FULLTEXT index that uses the ngram full-text parser.
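A minimal sketch of such a statement might look like the following. The id column, the column sizes, and the index name ft_posts are assumptions made for illustration; only the posts table and its heading and body columns come from the text.

```sql
-- Assumed schema: id, the column sizes, and the index name ft_posts are illustrative.
CREATE TABLE posts (
    id INT AUTO_INCREMENT PRIMARY KEY,
    heading VARCHAR(255),
    body TEXT,
    -- WITH PARSER ngram selects the ngram parser for this index
    FULLTEXT INDEX ft_posts (heading, body) WITH PARSER ngram
) ENGINE = InnoDB CHARACTER SET utf8mb4;

-- Equivalent forms for a table that already exists (run one of them, not both):
-- ALTER TABLE posts ADD FULLTEXT INDEX ft_posts (heading, body) WITH PARSER ngram;
-- CREATE FULLTEXT INDEX ft_posts ON posts (heading, body) WITH PARSER ngram;
```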

An INSERT statement then adds a new row to the posts table.
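As a sketch, the insert could be written as follows. The sample text is an assumption chosen so that it contains both Latin and Chinese characters; any utf8mb4 text would do.

```sql
-- Make client and server exchange data as utf8mb4
SET NAMES utf8mb4;

-- Sample row (assumed content) mixing Latin and Chinese text
INSERT INTO posts (heading, body)
VALUES ('MySQL全文搜索', 'MySQL提供了具有许多好的功能的内置全文搜索');
```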

Note that the SET NAMES statement sets the character set that both the client and the server use to send and receive data; in this case, it is utf8mb4.

To see how ngram tokenizes the text, you can query the full-text index metadata that InnoDB exposes in the information_schema database.
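One way to do this for an InnoDB table is sketched below. It assumes the posts table lives in a database named test, which is a placeholder to replace with your own database name, and that the session has the privileges needed to set a global variable.

```sql
-- Point the InnoDB full-text metadata tables at test/posts
-- ('test' is an assumed database name).
SET GLOBAL innodb_ft_aux_table = 'test/posts';

-- List the tokens of recently inserted rows held in the full-text index cache
SELECT word, doc_id, position
FROM information_schema.INNODB_FT_INDEX_CACHE
ORDER BY doc_id, position;
```

INNODB_FT_INDEX_CACHE covers rows that have been inserted but not yet merged into the main index; after the cache is flushed (for example, by OPTIMIZE TABLE), the tokens appear in INNODB_FT_INDEX_TABLE instead.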

This query is useful for troubleshooting. For instance, if a word does not appear in the search results, the word may not have been indexed because it is a stopword, or for some other reason.

Setting ngram token size

As you can see from the preceding example, the default token size for ngram is 2. To change the token size, you use the ngram_token_size configuration option, which takes a value between 1 and 10.

Note that a smaller token size produces a smaller full-text search index and allows faster searches.

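Because ngram_token_size is a read-only variable, you can set its value in only two ways.

First, as a startup option when launching the server, for example mysqld --ngram_token_size=1.

Second, in the configuration file, by adding a line such as ngram_token_size=1 under the [mysqld] section.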
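After restarting the server, it may be worth confirming which value is actually in effect. A quick check, using only standard system-variable syntax:

```sql
-- Show the token size the running server is using
SHOW VARIABLES LIKE 'ngram_token_size';

-- Equivalent form that returns just the value
SELECT @@GLOBAL.ngram_token_size;
```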

ngram parser phrase search

MySQL converts a phrase search into ngram phrase searches. For instance, "abc" is converted into "ab bc", which returns documents that contain "ab bc" and "abc".

For example, you can search for the phrase 搜索 in the posts table with a MATCH ... AGAINST query.
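A sketch of that search, assuming the posts table has an id column and a FULLTEXT index on (heading, body) that uses the ngram parser:

```sql
-- Search for 搜索; with the default token size of 2 the term is a single bigram.
-- No mode is specified, so MySQL uses natural language mode.
SELECT id, heading, body
FROM posts
WHERE MATCH (heading, body) AGAINST ('搜索');
```

The column list inside MATCH must be the same set of columns that the FULLTEXT index was defined on.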

Processing search results with ngram

Natural language mode

In NATURAL LANGUAGE MODE searches, the search term is converted to a union of ngram values. Suppose the token size is 2 (bigram): the search term mysql is converted to my, ys, sq, and ql.
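As an illustration, a natural language search against the posts table might be written like this. The search term 全文搜索 is an assumed example; with a token size of 2 it is split into 全文, 文搜, and 搜索, and a document containing any of them can match.

```sql
-- Natural language mode: the term becomes a union of its bigrams
SELECT id, heading, body
FROM posts
WHERE MATCH (heading, body)
      AGAINST ('全文搜索' IN NATURAL LANGUAGE MODE);
```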

Boolean mode

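In BOOLEAN MODE searches, the search term is converted to an ngram phrase search. For instance, with a token size of 2, the search term "abc" is converted to the phrase "ab bc", which matches a document containing "abc" but not one containing only "ab".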
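The same assumed search term can be run in boolean mode to compare the behaviour. Here the bigrams must appear together as a phrase, so the match is stricter than in natural language mode.

```sql
-- Boolean mode: the term is searched as an ngram phrase
SELECT id, heading, body
FROM posts
WHERE MATCH (heading, body)
      AGAINST ('全文搜索' IN BOOLEAN MODE);
```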

ngram wildcard search

An ngram FULLTEXT index contains only ngrams, so it does not know where terms begin. When you perform wildcard searches, it may return unexpected results.

The following rules apply to wildcard searches that use ngram FULLTEXT indexes.

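If the prefix term in the wildcard is shorter than the ngram token size, the query returns all documents that contain ngram tokens starting with the prefix term. For instance, with a token size of 2, a search for m* returns documents that contain any bigram starting with m.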
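A short-prefix wildcard search can be sketched as follows, assuming the default token size of 2 and the posts table used earlier. The * operator is part of the boolean full-text search syntax, so the query specifies BOOLEAN MODE.

```sql
-- Prefix 'm' is shorter than the token size (2):
-- matches documents containing any bigram that starts with 'm'
SELECT id, heading, body
FROM posts
WHERE MATCH (heading, body)
      AGAINST ('m*' IN BOOLEAN MODE);
```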

If the prefix term in the wildcard is longer than the ngram token size, MySQL converts the prefix term into an ngram phrase and ignores the wildcard operator. For instance, with a token size of 2, consider a search for mysqld*.

In this case, the term "mysqld" is converted into the ngram phrase "my ys sq ql ld". Documents that match this phrase are returned, and the wildcard has no effect.
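The mysqld* case can be tried with a query like the one below, again assuming a token size of 2 and the posts table from the earlier examples.

```sql
-- Prefix 'mysqld' is longer than the token size (2):
-- it is searched as an ngram phrase and the trailing * is ignored
SELECT id, heading, body
FROM posts
WHERE MATCH (heading, body)
      AGAINST ('mysqld*' IN BOOLEAN MODE);
```

Running it next to the m* query makes the difference visible: the short prefix behaves like a prefix match on tokens, while the long prefix behaves like a plain phrase search.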

Handling stopwords

The ngram parser excludes tokens that contain a stopword from the stopword list. For instance, suppose that ngram_token_size is 2 and a document contains "abc". The ngram parser tokenizes the document into "ab" and "bc". If "b" is a stopword, ngram ignores both "ab" and "bc" because they contain "b".

Note that you must define your own stopword list if the language is other than English. In addition, stopwords that are longer than ngram_token_size are ignored.
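For InnoDB tables, a custom stopword list can be supplied through a user-defined stopword table. The following is a rough sketch under several assumptions: the database is named test, the table name my_stopwords and the index name ft_posts are hypothetical, and the single stopword shown is only a sample.

```sql
-- An InnoDB stopword table must have a single VARCHAR column named value
CREATE TABLE my_stopwords (value VARCHAR(30)) ENGINE = InnoDB;

-- Sample stopword (assumed); it is no longer than the token size
INSERT INTO my_stopwords (value) VALUES ('的');

-- Tell InnoDB to use this table, in db_name/table_name form
SET GLOBAL innodb_ft_server_stopword_table = 'test/my_stopwords';

-- The list is applied when an index is built, so rebuild the FULLTEXT index
-- (ft_posts is the assumed name of the existing index)
ALTER TABLE posts DROP INDEX ft_posts;
ALTER TABLE posts ADD FULLTEXT INDEX ft_posts (heading, body) WITH PARSER ngram;
```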

In this blog post, you have learned how to use the MySQL ngram full-text parser to handle full-text searches for other languages.

GoplarDB

The Database Experts