Natural language processing (NLP) functions

Experimental feature. Learn more.

Not supported in ClickHouse Cloud

Caution

This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.

detectCharset

The detectCharset function detects the character set of the non-UTF8-encoded input string.

Syntax

detectCharset('text_to_be_analyzed')

Arguments

text_to_be_analyzed — A collection (or sentences) of strings to analyze. String.

Returned value

A String containing the code of the detected character set

Examples

Query:

SELECT detectCharset('Ich bleibe für ein paar Tage.');

Result:

┌─detectCharset('Ich bleibe für ein paar Tage.')─┐
│ WINDOWS-1252                                   │
└────────────────────────────────────────────────┘

detectLanguage

Detects the language of the UTF8-encoded input string. The function uses the CLD2 library for detection, and it returns the 2-letter ISO language code.

The detectLanguage function works best when providing over 200 characters in the input string.

Syntax

detectLanguage('text_to_be_analyzed')

Arguments

text_to_be_analyzed — A collection (or sentences) of strings to analyze. String.

Returned value

The 2-letter ISO code of the detected language

Other possible results:

un = unknown, can not detect any language.
other = the detected language does not have 2 letter code.

Examples

Query:

SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there's a will, there's a way.');

Result:

fr

detectLanguageMixed

Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes that are mapped to the percentage of the certain language in the text. Syntax

detectLanguageMixed('text_to_be_analyzed')

Arguments

text_to_be_analyzed — A collection (or sentences) of strings to analyze. String.

Returned value

Map(String, Float32): The keys are 2-letter ISO codes and the values are a percentage of text found for that language Examples

Query:

SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.');

Result:

┌─detectLanguageMixed()─┐
│ {'ja':0.62,'fr':0.36  │
└───────────────────────┘

detectProgrammingLanguage

Determines the programming language from the source code. Calculates all the unigrams and bigrams of commands in the source code. Then using a marked-up dictionary with weights of unigrams and bigrams of commands for various programming languages finds the biggest weight of the programming language and returns it.

Syntax

detectProgrammingLanguage('source_code')

Arguments

source_code — String representation of the source code to analyze. String.

Returned value

Programming language. String.

Examples

Query:

SELECT detectProgrammingLanguage('#include <iostream>');

Result:

┌─detectProgrammingLanguage('#include <iostream>')─┐
│ C++                                              │
└──────────────────────────────────────────────────┘

detectLanguageUnknown

Similar to the detectLanguage function, except the detectLanguageUnknown function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32. Syntax

detectLanguageUnknown('text_to_be_analyzed')

Arguments

text_to_be_analyzed — A collection (or sentences) of strings to analyze. String.

Returned value

The 2-letter ISO code of the detected language

Other possible results:

un = unknown, can not detect any language.
other = the detected language does not have 2 letter code.

Examples

Query:

SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.');

Result:

┌─detectLanguageUnknown('Ich bleibe für ein paar Tage.')─┐
│ de                                                     │
└────────────────────────────────────────────────────────┘

detectTonality

Determines the sentiment of text data. Uses a marked-up sentiment dictionary, in which each word has a tonality ranging from -12 to 6. For each text, it calculates the average sentiment value of its words and returns it in the range [-1,1].

Note

This function is limited in its current form. Currently it makes use of the embedded emotional dictionary at /contrib/nlp-data/tonality_ru.zst and only works for the Russian language.

Syntax

detectTonality(text)

Arguments

text — The text to be analyzed. String.

Returned value

The average sentiment value of the words in text. Float32.

Examples

Query:

SELECT detectTonality('Шарик - хороший пёс'), -- Sharik is a good dog 
       detectTonality('Шарик - пёс'), -- Sharik is a dog
       detectTonality('Шарик - плохой пёс'); -- Sharkik is a bad dog

Result:

┌─detectTonality('Шарик - хороший пёс')─┬─detectTonality('Шарик - пёс')─┬─detectTonality('Шарик - плохой пёс')─┐
│                               0.44445 │                             0 │                                 -0.3 │
└───────────────────────────────────────┴───────────────────────────────┴──────────────────────────────────────┘

lemmatize

Performs lemmatization on a given word. Needs dictionaries to operate, which can be obtained here.

Syntax

lemmatize('language', word)

Arguments

language — Language which rules will be applied. String.
word — Word that needs to be lemmatized. Must be lowercase. String.

Examples

Query:

SELECT lemmatize('en', 'wolves');

Result:

┌─lemmatize("wolves")─┐
│              "wolf" │
└─────────────────────┘

Configuration

This configuration specifies that the dictionary en.bin should be used for lemmatization of English (en) words. The .bin files can be downloaded from here.

<lemmatizers>
    <lemmatizer>
        <!-- highlight-start -->
        <lang>en</lang>
        <path>en.bin</path>
        <!-- highlight-end -->
    </lemmatizer>
</lemmatizers>

stem

Performs stemming on a given word.

Syntax

stem('language', word)

Arguments

language — Language which rules will be applied. Use the two letter ISO 639-1 code.
word — word that needs to be stemmed. Must be in lowercase. String.

Examples

Query:

SELECT arrayMap(x -> stem('en', x), ['I', 'think', 'it', 'is', 'a', 'blessing', 'in', 'disguise']) AS res;

Result:

┌─res────────────────────────────────────────────────┐
│ ['I','think','it','is','a','bless','in','disguis'] │
└────────────────────────────────────────────────────┘

Supported languages for stem()

Note

The stem() function uses the Snowball stemming library, see the Snowball website for updated languages etc.

Arabic
Armenian
Basque
Catalan
Danish
Dutch
English
Finnish
French
German
Greek
Hindi
Hungarian
Indonesian
Irish
Italian
Lithuanian
Nepali
Norwegian
Porter
Portuguese
Romanian
Russian
Serbian
Spanish
Swedish
Tamil
Turkish
Yiddish

synonyms

Finds synonyms to a given word. There are two types of synonym extensions: plain and wordnet.

With the plain extension type we need to provide a path to a simple text file, where each line corresponds to a certain synonym set. Words in this line must be separated with space or tab characters.

With the wordnet extension type we need to provide a path to a directory with WordNet thesaurus in it. Thesaurus must contain a WordNet sense index.

Syntax

synonyms('extension_name', word)

Arguments

extension_name — Name of the extension in which search will be performed. String.
word — Word that will be searched in extension. String.

Examples

Query:

SELECT synonyms('list', 'important');

Result:

┌─synonyms('list', 'important')────────────┐
│ ['important','big','critical','crucial'] │
└──────────────────────────────────────────┘

Configuration

<synonyms_extensions>
    <extension>
        <name>en</name>
        <type>plain</type>
        <path>en.txt</path>
    </extension>
    <extension>
        <name>en</name>
        <type>wordnet</type>
        <path>en/</path>
    </extension>
</synonyms_extensions>

detectCharset

Introduced in: v22.2

Detects the character set of a non-UTF8-encoded input string.

Syntax

detectCharset(s)

Arguments

s — The text to analyze. String

Returned value

Returns a string containing the code of the detected character set String

Examples

Basic usage

SELECT detectCharset('Ich bleibe für ein paar Tage.')

WINDOWS-1252

detectLanguage

Introduced in: v22.2

Detects the language of the UTF8-encoded input string. The function uses the CLD2 library for detection and returns the 2-letter ISO language code.

The longer the input, the more precise the language detection will be.

Syntax

detectLanguage(s)

Arguments

text_to_be_analyzed — The text to analyze. String

Returned value

Returns the 2-letter ISO code of the detected language. Other possible results: un = unknown, can not detect any language, other = the detected language does not have 2 letter code. String

Examples

Mixed language text

SELECT detectLanguage('Je pense que je ne parviendrai jamais à parler français comme un natif. Where there\'s a will, there\'s a way.')

fr

detectLanguageMixed

Introduced in: v22.2

Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes that are mapped to the percentage of the certain language in the text.

Syntax

detectLanguageMixed(s)

Arguments

s — The text to analyze String

Returned value

Returns a map with keys which are 2-letter ISO codes and corresponding values which are a percentage of the text found for that language Map(String, Float32)

Examples

Mixed languages

SELECT detectLanguageMixed('二兎を追う者は一兎をも得ず二兎を追う者は一兎をも得ず A vaincre sans peril, on triomphe sans gloire.')

{'ja':0.62,'fr':0.36}

detectLanguageUnknown

Introduced in: v22.2

Similar to the detectLanguage function, except the detectLanguageUnknown function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.

Syntax

detectLanguageUnknown('s')

Arguments

s — The text to analyze. String

Returned value

Returns the 2-letter ISO code of the detected language. Other possible results: un = unknown, can not detect any language, other = the detected language does not have 2 letter code. String

Examples

Basic usage

SELECT detectLanguageUnknown('Ich bleibe für ein paar Tage.')

de

detectProgrammingLanguage

Introduced in: v22.2

Determines the programming language from a given source code snippet.

Syntax

detectProgrammingLanguage('source_code')

Arguments

source_code — String representation of the source code to analyze. String

Returned value

Returns programming language String

Examples

C++ code detection

SELECT detectProgrammingLanguage('#include <iostream>')

C++

detectTonality

Introduced in: v22.2

Determines the sentiment of the provided text data.

Limitation

This function is limited in its current form in that it makes use of the embedded emotional dictionary and only works for the Russian language.

Syntax

detectTonality(s)

Arguments

s — The text to be analyzed. String

Returned value

Returns the average sentiment value of the words in text Float32

Examples

Russian sentiment analysis

SELECT
    detectTonality('Шарик - хороший пёс'),
    detectTonality('Шарик - пёс'),
    detectTonality('Шарик - плохой пёс')

0.44445, 0, -0.3

lemmatize

Introduced in: v21.9

Performs lemmatization on a given word. This function needs dictionaries to operate, which can be obtained from github. For more details on loading a dictionary from a local file see page "Defining Dictionaries".

Syntax

lemmatize(lang, word)

Arguments

lang — Language which rules will be applied. String
word — Lowercase word that needs to be lemmatized. String

Returned value

Returns the lemmatized form of the word String

Examples

English lemmatization

SELECT lemmatize('en', 'wolves')

wolf

stem

Introduced in: v21.9

Performs stemming on a given word.

Syntax

stem(lang, word)

Arguments

lang — Language which rules will be applied. Use the two letter ISO 639-1 code. String
word — Lowercase word that needs to be stemmed. String

Returned value

Returns the stemmed form of the word String

Examples

English stemming

SELECT arrayMap(x -> stem('en', x),
['I', 'think', 'it', 'is', 'a', 'blessing', 'in', 'disguise']) AS res

['I','think','it','is','a','bless','in','disguis']

synonyms

Introduced in: v21.9

Finds synonyms of a given word.

There are two types of synonym extensions:

plain
wordnet

With the plain extension type you need to provide a path to a simple text file, where each line corresponds to a certain synonym set. Words in this line must be separated with space or tab characters.

With the wordnet extension type you need to provide a path to a directory with the WordNet thesaurus in it. The thesaurus must contain a WordNet sense index.

Syntax

synonyms(ext_name, word)

Arguments

ext_name — Name of the extension in which search will be performed. String
word — Word that will be searched in extension. String

Returned value

Returns array of synonyms for the given word. Array(String)

Examples

Find synonyms

SELECT synonyms('list', 'important')

['important','big','critical','crucial']

detectCharset​

detectLanguage​

detectLanguageMixed​

detectProgrammingLanguage​

detectLanguageUnknown​

detectTonality​

lemmatize​

stem​

synonyms​

detectCharset​

detectLanguage​

detectLanguageMixed​

detectLanguageUnknown​

detectProgrammingLanguage​

detectTonality​

lemmatize​

stem​

synonyms​

detectCharset

detectLanguage

detectLanguageMixed

detectProgrammingLanguage

detectLanguageUnknown

detectTonality

lemmatize

stem

synonyms

detectCharset

detectLanguage

detectLanguageMixed

detectLanguageUnknown

detectProgrammingLanguage

detectTonality

lemmatize

stem

synonyms