Tokenization Technique for Non-spacing Words

#python #tutorial #learning

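In Natural Language Processing, identifying the words in a sentence written in English, or in most other languages that use Latin characters, is not too hard because the words are separated by spaces. Scripts such as Khmer are different: words are written without spaces between them, so we need to find the word boundaries by comparing the text against existing words from a dictionary.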
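
To make the problem concrete, here is a small sketch. The sample sentence and the `known_words` set are assumptions made up for illustration, reusing words from the dictionary later in this post. It shows that splitting on whitespace finds no boundaries, while a dictionary comparison can find the first word.

```python
# Hypothetical sample sentence, built from words used later in this post.
sentence = "មានសៀវភៅច្រើនណាស់"

# Whitespace splitting finds no boundaries: the whole sentence comes back as one item.
pieces = sentence.split()
print(len(pieces))

# A dictionary comparison can find one: take the longest known word
# that the sentence starts with.
known_words = {"មាន", "សៀវភៅ", "ច្រើន", "ណាស់"}
first_word = max((w for w in known_words if sentence.startswith(w)), key=len)
print(first_word)
```

Repeating that lookup from the end of each matched word is the basic idea behind dictionary-based tokenization.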

Dictionary Format:

You can structure your dictionary so that each word stores its part of speech, related words, and a short explanation. Here's an example format:

khmer_dictionary = {
    'មាន': {'POS': 'Verb', 'Related': ['មានសៀវភៅ', 'មានទិន្នន័យ'], 'Explanation': 'to have'},
    'សៀវភៅ': {'POS': 'Noun', 'Related': [], 'Explanation': 'book'},
    'ច្រើន': {'POS': 'Adjective', 'Related': [], 'Explanation': 'many'},
    'ណាស់': {'POS': 'Adverb', 'Related': [], 'Explanation': 'very'},
    'នៅ': {'POS': 'Verb', 'Related': [], 'Explanation': 'to be at'},
    'ទីនេះ': {'POS': 'Noun', 'Related': [], 'Explanation': 'this place'}
}

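Improving the Tokenization Method:

To handle multi-word phrases and out-of-vocabulary (OOV) words, meaning text that is not in the dictionary, the tokenization function has to check the input against the dictionary as it scans the sentence. Here's a version that does that: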

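def tokenize_with_dictionary(sentence):
    tokens = []
    oov = ''  # buffer for characters that match no dictionary entry
    # Length of the longest dictionary word, so we never test longer candidates.
    max_len = max((len(word) for word in khmer_dictionary), default=0)
    i = 0

    while i < len(sentence):
        # Try the longest candidate first so longer entries win over their prefixes.
        for end in range(min(len(sentence), i + max_len), i, -1):
            candidate = sentence[i:end]
            if candidate in khmer_dictionary:
                if oov:
                    # Flush pending unknown text before the matched word.
                    tokens.append((oov, 'OOV'))
                    oov = ''
                tokens.append((candidate, khmer_dictionary[candidate]))
                i = end
                break
        else:
            # No dictionary word starts here: keep the character as OOV text.
            oov += sentence[i]
            i += 1

    if oov:
        tokens.append((oov, 'OOV'))

    return tokens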
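
Once a sentence is tokenized, the OOV items are useful for growing the dictionary. The sketch below is only an illustration: the `tokens` list is hypothetical sample output in the `(word, entry)` format returned above, and `count_oov` is a helper name I made up. It counts how often each unknown string appears so the most frequent ones can be reviewed first.

```python
from collections import Counter

# Hypothetical tokenizer output: known words carry their dictionary entry,
# unknown text carries the string 'OOV'.
tokens = [
    ("មាន", {"POS": "Verb", "Related": [], "Explanation": "to have"}),
    ("abc", "OOV"),
    ("សៀវភៅ", {"POS": "Noun", "Related": [], "Explanation": "book"}),
    ("abc", "OOV"),
]

def count_oov(tokens):
    """Count each out-of-vocabulary string (illustrative helper, not from the post)."""
    return Counter(word for word, entry in tokens if entry == "OOV")

oov_counts = count_oov(tokens)
print(oov_counts.most_common())
```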

Then you can save the tokens to a database.
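
As a rough sketch of that step, the example below stores tokens with Python's built-in `sqlite3` module. The table name, the column names, the `save_tokens` helper, and the sample `tokens` list are all assumptions for illustration; adapt them to whatever database and schema you actually use.

```python
import json
import sqlite3

# Hypothetical sample in the (word, entry) format produced by the tokenizer;
# unknown text carries the string 'OOV' instead of a dictionary entry.
tokens = [
    ("មាន", {"POS": "Verb", "Related": [], "Explanation": "to have"}),
    ("xyz", "OOV"),
]

def save_tokens(conn, sentence_id, tokens):
    """Store one row per token. Table and column names are illustrative."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        "sentence_id INTEGER, position INTEGER, word TEXT, pos TEXT, entry TEXT)"
    )
    rows = []
    for position, (word, entry) in enumerate(tokens):
        is_known = isinstance(entry, dict)
        rows.append((
            sentence_id,
            position,
            word,
            entry.get("POS") if is_known else "OOV",
            # Keep the full entry as JSON; ensure_ascii=False keeps Khmer readable.
            json.dumps(entry, ensure_ascii=False) if is_known else None,
        ))
    conn.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
save_tokens(conn, 1, tokens)
saved = conn.execute("SELECT word, pos FROM tokens ORDER BY position").fetchall()
print(saved)
```

Storing the position alongside each word keeps the original order, so the sentence can be rebuilt from the rows later.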
If you have a better idea or a suggestion for improvement, please comment below.