Kuth


Tokenization Technique for Non-spacing Words

In Natural Language Processing, identifying the words in a sentence written in English or another Latin-script language is not too hard, because words are separated by spaces. Scripts such as Khmer are different: words are written without spaces between them, so we need to compare the text against known words from a dictionary.
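
To make the problem concrete, here is a small sketch comparing whitespace splitting on an English sentence and on a Khmer string. The Khmer string is an assumption for illustration: it is simply the dictionary words used later in this post joined without spaces.

```python
english = 'There are many books here'
# Illustrative string: the dictionary words from this post, joined with no spaces.
khmer = 'មានសៀវភៅច្រើនណាស់នៅទីនេះ'

# Whitespace splitting recovers the English words.
print(english.split())     # ['There', 'are', 'many', 'books', 'here']

# The Khmer string contains no spaces, so it comes back as a single token.
print(len(khmer.split()))  # 1
```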

Dictionary Format:

Structure the dictionary so that each word maps to its part of speech (POS), a list of related phrases (Related), and a short English explanation (Explanation). For example:

khmer_dictionary = {
    'មាន': {'POS': 'Verb', 'Related': ['មានសៀវភៅ', 'មានទិន្នន័យ'], 'Explanation': 'to have'},
    'សៀវភៅ': {'POS': 'Noun', 'Related': [], 'Explanation': 'book'},
    'ច្រើន': {'POS': 'Adjective', 'Related': [], 'Explanation': 'many'},
    'ណាស់': {'POS': 'Adverb', 'Related': [], 'Explanation': 'very'},
    'នៅ': {'POS': 'Verb', 'Related': [], 'Explanation': 'to be at'},
    'ទីនេះ': {'POS': 'Noun', 'Related': [], 'Explanation': 'this place'}
}
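
Reading an entry back is a plain key lookup. The following is a minimal sketch, assuming a trimmed two-entry copy of the dictionary format; the helper name describe is hypothetical and only there for illustration.

```python
# Trimmed copy of the dictionary format (two entries only).
khmer_dictionary = {
    'មាន': {'POS': 'Verb', 'Related': ['មានសៀវភៅ', 'មានទិន្នន័យ'], 'Explanation': 'to have'},
    'សៀវភៅ': {'POS': 'Noun', 'Related': [], 'Explanation': 'book'},
}


def describe(word, dictionary):
    """Return a one-line summary for a word, or None if it is not in the dictionary."""
    entry = dictionary.get(word)
    if entry is None:
        return None
    # An empty Related list is shown as 'none'.
    related = ', '.join(entry['Related']) or 'none'
    return f"{word} [{entry['POS']}]: {entry['Explanation']} (related: {related})"


print(describe('សៀវភៅ', khmer_dictionary))
print(describe('ច្រើន', khmer_dictionary))  # not in the trimmed copy, so this prints None
```

Using dict.get instead of indexing means a missing word returns None rather than raising KeyError, which is convenient when most substrings of a sentence are not dictionary words.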

Improving Tokenization Method:

To handle multi-word phrases and out-of-vocabulary (OOV) words better, the tokenization function needs to be adjusted. Here is a revised version:

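def tokenize_with_dictionary(sentence):
    """Split a sentence into (token, entry) pairs using greedy longest-match lookup."""
    tokens = []
    unknown = ''  # characters seen so far that match no dictionary entry
    max_len = max(map(len, khmer_dictionary), default=0)
    i = 0

    while i < len(sentence):
        # Try the longest candidate first so phrases win over their prefixes.
        for end in range(min(len(sentence), i + max_len), i, -1):
            candidate = sentence[i:end]
            if candidate in khmer_dictionary:
                if unknown:
                    # Flush pending unknown text before the matched word.
                    tokens.append((unknown, 'OOV'))
                    unknown = ''
                tokens.append((candidate, khmer_dictionary[candidate]))
                i = end
                break
        else:
            # No dictionary entry starts here: collect the character as OOV.
            unknown += sentence[i]
            i += 1

    if unknown:
        tokens.append((unknown, 'OOV'))

    return tokens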
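
One design choice in a dictionary tokenizer is whether to accept the first (shortest) match or to keep looking for the longest one. This matters as soon as the dictionary contains both a word and a phrase that starts with it. The sketch below compares the two strategies; the function segment and the three-item word set are assumptions made for illustration, not code from the post.

```python
def segment(text, words, longest=True):
    """Greedy dictionary segmentation; unmatched characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Candidate end positions, ordered longest-first or shortest-first.
        if longest:
            ends = range(len(text), i, -1)
        else:
            ends = range(i + 1, len(text) + 1)
        match = next((text[i:e] for e in ends if text[i:e] in words), None)
        token = match if match is not None else text[i]
        tokens.append(token)
        i += len(token)
    return tokens


# Illustrative word set: a verb, a noun, and the phrase made from both.
verb, noun = 'មាន', 'សៀវភៅ'
words = {verb, noun, verb + noun}

print(segment(verb + noun, words, longest=False))  # two tokens: the verb, then the noun
print(segment(verb + noun, words, longest=True))   # one token: the whole phrase
```

Shortest-first never reaches the phrase entry because the verb matches first, so longest-first is usually the safer default when the dictionary stores multi-word phrases.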

Then you can save the tokens to a database.
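
As a rough sketch of that step, the example below stores (token, entry) pairs in SQLite using Python's built-in sqlite3 module. The table name, column names, and sample tokens are assumptions for illustration; any database would work.

```python
import json
import sqlite3

# Sample tokenizer output: (token, dictionary entry) pairs, or 'OOV' for unknown text.
tokens = [
    ('មាន', {'POS': 'Verb', 'Related': [], 'Explanation': 'to have'}),
    ('សៀវភៅ', {'POS': 'Noun', 'Related': [], 'Explanation': 'book'}),
    ('xyz', 'OOV'),
]

conn = sqlite3.connect(':memory:')  # use a file path instead to persist the data
conn.execute(
    'CREATE TABLE tokens ('
    'position INTEGER PRIMARY KEY, token TEXT NOT NULL, '
    'is_oov INTEGER NOT NULL, entry TEXT)'
)

# Store the dictionary entry as JSON text; OOV tokens have no entry.
rows = [
    (pos, word, int(info == 'OOV'),
     None if info == 'OOV' else json.dumps(info, ensure_ascii=False))
    for pos, (word, info) in enumerate(tokens)
]
conn.executemany('INSERT INTO tokens VALUES (?, ?, ?, ?)', rows)
conn.commit()

for row in conn.execute('SELECT position, token, is_oov FROM tokens ORDER BY position'):
    print(row)
```

Keeping the position column preserves the original word order, so the sentence can be rebuilt from the table later.
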
If you have a better idea or a suggestion for improvement, please comment below.
