Natural language processing (NLP) functions
This is an experimental feature that is currently in development and is not ready for general use. It will change in unpredictable, backwards-incompatible ways in future releases. Set allow_experimental_nlp_functions = 1 to enable it.
detectCharset
The detectCharset function detects the character set of a non-UTF8-encoded input string.
Syntax
Arguments
text_to_be_analyzed — The text to analyze, such as one or more sentences. String.
Returned value
- A String containing the code of the detected character set.
Examples
Query:
Result:
detectLanguage
Detects the language of the UTF8-encoded input string. The function uses the CLD2 library for detection, and it returns the 2-letter ISO language code.
The detectLanguage function works best when the input string contains more than 200 characters.
Syntax
Arguments
text_to_be_analyzed — The text to analyze, such as one or more sentences. String.
Returned value
- The 2-letter ISO code of the detected language
Other possible results:
- un = unknown, cannot detect any language.
- other = the detected language does not have a 2-letter code.
Examples
Query:
Result:
detectLanguageMixed
Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes mapped to the percentage of the text detected in each language.
Syntax
Arguments
text_to_be_analyzed — The text to analyze, such as one or more sentences. String.
Returned value
- Map(String, Float32): The keys are 2-letter ISO codes and the values are the percentage of the text found for each language.
Examples
Query:
Result:
detectProgrammingLanguage
Determines the programming language of a source code snippet. The function computes all unigrams and bigrams of commands in the source code. Then, using a marked-up dictionary with weights of command unigrams and bigrams for various programming languages, it finds the programming language with the largest weight and returns it.
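The weighting scheme described above can be sketched in a few lines of Python. This is only an illustration, not the function's actual implementation: the WEIGHTS dictionary, its values, the whitespace tokenizer, and the detect_programming_language helper are all hypothetical, and returning None when nothing matches is an assumption made for the sketch.

```python
from collections import defaultdict

# Hypothetical weights for illustration only. The real dictionary is
# much larger and its entries and values are not reproduced here.
WEIGHTS = {
    "Python": {"def": 3.0, "import": 2.0, "return": 0.5, "import os": 1.0},
    "C++": {"#include": 3.0, "int": 1.0, "return": 0.5, "int main()": 2.0},
}


def detect_programming_language(source_code):
    """Return the language with the largest summed n-gram weight, or None."""
    # Assumption: "commands" are approximated by whitespace-separated tokens.
    tokens = source_code.split()
    unigrams = tokens
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

    # Sum the weight of every unigram and bigram for every known language.
    scores = defaultdict(float)
    for gram in unigrams + bigrams:
        for language, weights in WEIGHTS.items():
            scores[language] += weights.get(gram, 0.0)

    # No token matched any dictionary entry: report no detection.
    if not scores or max(scores.values()) == 0.0:
        return None
    return max(scores, key=scores.get)
```

With these made-up weights, a snippet that starts with #include and defines int main() scores highest for C++, while a snippet built around import and def scores highest for Python.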
Syntax
Arguments
source_code — String representation of the source code to analyze. String.
Returned value
- Programming language. String.
Examples
Query:
Result:
detectLanguageUnknown
Similar to the detectLanguage function, except the detectLanguageUnknown function works with non-UTF8-encoded strings. Prefer this version when your character set is UTF-16 or UTF-32.
Syntax
Arguments
text_to_be_analyzed — The text to analyze, such as one or more sentences. String.
Returned value
- The 2-letter ISO code of the detected language
Other possible results:
- un = unknown, cannot detect any language.
- other = the detected language does not have a 2-letter code.
Examples
Query:
Result:
detectTonality
Determines the sentiment of text data. Uses a marked-up sentiment dictionary, in which each word has a tonality ranging from -12 to 6.
For each text, it calculates the average sentiment value of its words and returns it in the range [-1,1].
This function is limited in its current form. Currently it makes use of the embedded emotional dictionary at /contrib/nlp-data/tonality_ru.zst and only works for the Russian language.
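As a rough illustration of the averaging described above, the following Python sketch maps per-word tonalities in the range -12 to 6 onto [-1, 1]. It is not the function's actual implementation. The TONALITY dictionary and its values are invented for the example, and two details are assumptions: only words found in the dictionary are averaged, and the average is divided by 6 when positive and by 12 otherwise.

```python
import re

# Hypothetical mini-dictionary with made-up weights in the range -12..6.
# The embedded Russian dictionary is far larger and is not reproduced here.
TONALITY = {"хорошо": 4, "отлично": 6, "плохо": -6, "ужасно": -12}


def detect_tonality(text):
    """Return an average sentiment value in the range [-1, 1]."""
    words = re.findall(r"\w+", text.lower())

    # Assumption: only words present in the dictionary contribute.
    values = [TONALITY[w] for w in words if w in TONALITY]
    if not values:
        return 0.0

    average = sum(values) / len(values)

    # Assumed normalization: scale by the largest magnitude on each side
    # (6 for positive averages, 12 for negative ones).
    scale = 6 if average > 0 else 12
    return average / scale
```

Under these assumptions the most positive dictionary word alone yields 1.0 and the most negative one yields -1.0, which matches the documented output range.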
Syntax
Arguments
text — The text to be analyzed. String.
Returned value
- The average sentiment value of the words in text. Float32.
Examples
Query:
Result:
lemmatize
Performs lemmatization on a given word. Needs dictionaries to operate, which can be obtained here.
Syntax
Arguments
language — Language whose rules will be applied. String.
word — Word that needs to be lemmatized. Must be lowercase. String.
Examples
Query:
Result:
Configuration
This configuration specifies that the dictionary en.bin should be used for lemmatization of English (en) words. The .bin files can be downloaded from here.
stem
Performs stemming on a given word.
Syntax
Arguments
language — Language whose rules will be applied. Use the two-letter ISO 639-1 code. String.
word — Word that needs to be stemmed. Must be lowercase. String.
Examples
Query:
Result:
Supported languages for stem()
The stem() function uses the Snowball stemming library. See the Snowball website for the current list of supported languages.
- Arabic
- Armenian
- Basque
- Catalan
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Lithuanian
- Nepali
- Norwegian
- Porter
- Portuguese
- Romanian
- Russian
- Serbian
- Spanish
- Swedish
- Tamil
- Turkish
- Yiddish
synonyms
Finds synonyms of a given word. There are two types of synonym extensions: plain and wordnet.
With the plain extension type you need to provide a path to a simple text file, where each line corresponds to one synonym set. Words on a line must be separated by space or tab characters.
With the wordnet extension type you need to provide a path to a directory containing the WordNet thesaurus. The thesaurus must contain a WordNet sense index.
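To make the plain file format concrete, here is a minimal Python sketch that parses such lines and looks up a word. It is an illustration under stated assumptions rather than the server's implementation: load_plain_synonyms and lookup_synonyms are hypothetical helper names, the sample lines are invented, a lookup is assumed to return the whole synonym set containing the word, and when a word appears on several lines the first line is assumed to win.

```python
def load_plain_synonyms(lines):
    """Build a word -> synonym set index from plain-format lines."""
    index = {}
    for line in lines:
        # split() with no arguments separates on spaces and tabs alike.
        words = line.split()
        for word in words:
            # Assumption: the first synonym set that mentions a word wins.
            index.setdefault(word, words)
    return index


def lookup_synonyms(index, word):
    """Return the synonym set that contains the word, or an empty list."""
    return index.get(word, [])


# Invented sample content: one synonym set per line.
sample_lines = [
    "important big critical crucial",
    "fast\tquick rapid",
]
index = load_plain_synonyms(sample_lines)
print(lookup_synonyms(index, "quick"))  # ['fast', 'quick', 'rapid']
```

Reading an actual file would only require passing the open file object (an iterable of lines) to load_plain_synonyms.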
Syntax
Arguments
extension_name — Name of the extension in which the search will be performed. String.
word — Word that will be searched for in the extension. String.
Examples
Query:
Result:
Configuration
detectCharset
Introduced in: v22.2
Detects the character set of a non-UTF8-encoded input string.
Syntax
Arguments
s — The text to analyze. String
Returned value
Returns a string containing the code of the detected character set. String
Examples
Basic usage
detectLanguage
Introduced in: v22.2
Detects the language of the UTF8-encoded input string. The function uses the CLD2 library for detection and returns the 2-letter ISO language code.
The longer the input, the more precise the language detection will be.
Syntax
Arguments
text_to_be_analyzed — The text to analyze. String
Returned value
Returns the 2-letter ISO code of the detected language. Other possible results: un = unknown, cannot detect any language; other = the detected language does not have a 2-letter code. String
Examples
Mixed language text
detectLanguageMixed
Introduced in: v22.2
Similar to the detectLanguage function, but detectLanguageMixed returns a Map of 2-letter language codes mapped to the percentage of the text detected in each language.
Syntax
Arguments
s — The text to analyze. String
Returned value
Returns a map whose keys are 2-letter ISO codes and whose values are the percentage of the text found for each language. Map(String, Float32)
Examples
Mixed languages
detectLanguageUnknown
Introduced in: v22.2
Similar to the detectLanguage function, except the detectLanguageUnknown function works with non-UTF8-encoded strings.
Prefer this version when your character set is UTF-16 or UTF-32.
Syntax
Arguments
s — The text to analyze. String
Returned value
Returns the 2-letter ISO code of the detected language. Other possible results: un = unknown, cannot detect any language; other = the detected language does not have a 2-letter code. String
Examples
Basic usage
detectProgrammingLanguage
Introduced in: v22.2
Determines the programming language from a given source code snippet.
Syntax
Arguments
source_code — String representation of the source code to analyze. String
Returned value
Returns the detected programming language. String
Examples
C++ code detection
detectTonality
Introduced in: v22.2
Determines the sentiment of the provided text data.
This function is limited in its current form in that it makes use of the embedded emotional dictionary and only works for the Russian language.
Syntax
Arguments
s — The text to be analyzed. String
Returned value
Returns the average sentiment value of the words in the text. Float32
Examples
Russian sentiment analysis
lemmatize
Introduced in: v21.9
Performs lemmatization on a given word. This function needs dictionaries to operate, which can be obtained from GitHub. For more details on loading a dictionary from a local file, see the page "Defining Dictionaries".
Syntax
Arguments
lang — Language whose rules will be applied. String
word — Lowercase word that needs to be lemmatized. String
Returned value
Returns the lemmatized form of the word. String
Examples
English lemmatization
stem
Introduced in: v21.9
Performs stemming on a given word.
Syntax
Arguments
lang — Language whose rules will be applied. Use the two-letter ISO 639-1 code. String
word — Lowercase word that needs to be stemmed. String
Returned value
Returns the stemmed form of the word. String
Examples
English stemming
synonyms
Introduced in: v21.9
Finds synonyms of a given word.
There are two types of synonym extensions:
- plain
- wordnet
With the plain extension type you need to provide a path to a simple text file, where each line corresponds to one synonym set.
Words on a line must be separated by space or tab characters.
With the wordnet extension type you need to provide a path to a directory with the WordNet thesaurus in it.
The thesaurus must contain a WordNet sense index.
Syntax
Arguments
ext_name — Name of the extension in which the search will be performed. String
word — Word that will be searched for in the extension. String
Returned value
Returns an array of synonyms for the given word. Array(String)
Examples
Find synonyms