Pallavi Ratra for WinkJS

Posted on • Originally published at

How to tokenize a string?

To tokenize a string with winkNLP, first read the text with readDoc. Then call the tokens method on the resulting document to extract a collection of tokens, and call the out method on that collection to get it as a JavaScript array:

// Load the wink-nlp package.
const winkNLP = require( 'wink-nlp' );
// Load "its" helper to extract item properties.
const its = require( 'wink-nlp/src/its.js' );
// Load the English language model (light version).
const model = require( 'wink-eng-lite-model' );
// Instantiate winkNLP.
const nlp = winkNLP( model );

// Input string
const text = '#Breaking:D Can’t get over this #Oscars selfie from 
// Read text
const doc = nlp.readDoc( text );
// Tokenize the string
const tokens = doc.tokens();
console.log( tokens.out() );

This returns an array of tokens:

  '#Breaking', ':D', 'Ca', 'n’t', 'get', 'over', 'this', '#Oscars',
  'selfie', 'from', '@TheEllenShow', '🤩',

winkNLP has a lossless tokenizer, which preserves the original text so that it can be reproduced from the tokens. The tokenizer handles hyphenation, contractions, and abbreviations; in the output above, for example, the contraction Can’t is split into Ca and n’t. It also detects the type of each token, such as ‘word’, ‘number’, ‘punctuation’, or ‘symbol’.
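
To make the idea of a lossless tokenizer with token types more concrete, here is a minimal plain-JavaScript sketch. It is not winkNLP's implementation, which this post does not describe. The function names (classify, tokenize, detokenize), the regular expressions, and the four coarse types are all assumptions chosen for illustration. The sketch assumes that "lossless" can be achieved by storing the whitespace that precedes each token, so the input can be rebuilt exactly.

```javascript
// Conceptual sketch only: this is NOT winkNLP's tokenizer.
// classify, tokenize, and detokenize are hypothetical names.

// Assign a coarse type to a token using simple regular expressions.
function classify(value) {
  if (/^\d+(\.\d+)?$/.test(value)) return 'number';
  if (/^[A-Za-z]+$/.test(value)) return 'word';
  if (/^[.,!?;:'"()\-]$/.test(value)) return 'punctuation';
  return 'symbol';
}

// Split text into tokens and remember the whitespace before each one,
// so that no character of the input is discarded.
function tokenize(text) {
  const tokens = [];
  // Optional whitespace, then a number, a run of letters,
  // or any single non-space character (the "u" flag keeps emoji whole).
  const pattern = /(\s*)(\d+(?:\.\d+)?|[A-Za-z]+|\S)/gu;
  let consumed = 0;
  let match;
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({ value: match[2], type: classify(match[2]), before: match[1] });
    consumed = pattern.lastIndex;
  }
  // Whitespace after the last token is kept separately.
  return { tokens, trailing: text.slice(consumed) };
}

// Rebuild the original text from the tokens and the stored whitespace.
function detokenize({ tokens, trailing }) {
  return tokens.map((t) => t.before + t.value).join('') + trailing;
}

const sample = 'Pay  $20 now, please! ';
const result = tokenize(sample);
console.log(result.tokens.map((t) => t.value));
// [ 'Pay', '$', '20', 'now', ',', 'please', '!' ]
console.log(result.tokens.map((t) => t.type));
// [ 'word', 'symbol', 'number', 'word', 'punctuation', 'word', 'punctuation' ]
console.log(detokenize(result) === sample);
// true
```

The double space and the trailing space in the sample survive the round trip because each token carries its preceding whitespace. A real tokenizer such as winkNLP's handles far more cases (contractions, hashtags, emoticons, and so on) than this sketch attempts.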
