
Vee Satayamas


Thai word tokenizers benchmark: nlpo3 vs newmm

Thanathip Suntorntip Gorlph ported newmm, Korakot Chaovavanich's Thai word tokenizer written in Python, to Rust; the port is called nlpo3. The nlpo3 website claims that nlpo3 is 2X faster than newmm. I suspected the real gap was larger: unlike Python's regex engine, Rust's regex engine runs in linear time, a guarantee it can make because it does not support look-behind or look-ahead. Moreover, "2X faster" is ambiguous.

So I ran a slightly different experiment on a Mac mini M1. Both nlpo3 and newmm were run from Zsh instead of a Python notebook. I tested them on 1 million lines of a Thai Wikipedia snapshot. The result: newmm took 3.66X the time that nlpo3 needed to tokenize the same text on the same computer.
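To make the regex argument concrete, here is a small sketch of how a backtracking engine such as Python's `re` can blow up on a failing match. The pattern is a textbook illustration chosen for this example; it is not taken from newmm, and the helper name `time_failed_match` is hypothetical.

```python
import re
import time

# Illustrative pattern only (not from newmm): the nested quantifiers let a
# backtracking engine split the run of "a" characters in exponentially many ways.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Match PATTERN against n 'a's followed by 'b'; return (match, seconds)."""
    text = "a" * n + "b"
    start = time.perf_counter()
    match = PATTERN.match(text)  # fails, but only after heavy backtracking
    return match, time.perf_counter() - start


# The match always fails; the time to discover that grows quickly with n.
for n in (14, 16, 18):
    match, seconds = time_failed_match(n)
    print(n, match, f"{seconds:.4f}s")
```

A linear-time engine avoids this growth by construction, which is the trade-off behind dropping look-around support. Whether newmm's own patterns ever hit such a case is not something this sketch shows.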

Setup

  • Computer: Scaleway's Mac mini M1
  • Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
  • Python: Python 3.8.2
  • OS: Darwin 506124d8-4acf-4595-9d46-8ca4b44b8110 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
  • Script:
#!/bin/bash

set -x

INPUT=thwik-head1m.txt

for i in {1..10}
do
  { time python3 newmm.py < $INPUT > newmm.out ; } 2>> bench_newmm.txt
  { time nlpo3 segment < $INPUT > cham.out ; } 2>> bench_o3.txt
done
  • A command line interface for newmm:
from pythainlp import word_tokenize
import sys

for line in sys.stdin:
    # Strip only the trailing newline so a final line without one is not truncated.
    print("|".join(word_tokenize(line.rstrip("\n"))))
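For readers who would rather drive the runs from Python than from a shell loop, the measurement could be sketched roughly as follows. This is not the script used for the numbers in this post; `time_command` is a hypothetical helper, and it measures wall-clock time only (the `real` figure), not user or system time.

```python
import subprocess
import time


def time_command(argv, input_path, output_path, runs=10):
    """Run argv `runs` times with stdin/stdout redirected to files.

    Returns a list of wall-clock durations in seconds, one per run.
    """
    timings = []
    for _ in range(runs):
        with open(input_path, "rb") as src, open(output_path, "wb") as dst:
            start = time.perf_counter()
            subprocess.run(argv, stdin=src, stdout=dst, check=True)
            timings.append(time.perf_counter() - start)
    return timings


# Hypothetical usage mirroring the shell script above:
#   time_command(["python3", "newmm.py"], "thwik-head1m.txt", "newmm.out")
#   time_command(["nlpo3", "segment"], "thwik-head1m.txt", "cham.out")
```

Note one difference from the shell loop: this helper runs all repetitions of one command back to back instead of alternating between the two tokenizers, so thermal or caching effects may be distributed differently.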

Result

nlpo3

% grep real bench_o3.txt
real    2m10.923s
real    2m12.014s
real    2m10.931s
real    2m9.448s
real    2m9.055s
real    2m10.570s
real    2m10.672s
real    2m10.140s
real    2m11.220s
real    2m9.941s


newmm

% grep real bench_newmm.txt 
real    7m52.180s
real    7m58.090s
real    7m57.071s
real    8m9.779s
real    7m54.576s
real    7m52.807s
real    7m59.109s
real    7m58.489s
real    7m59.604s
real    7m57.844s

Average

  • nlpo3
% grep real bench_o3.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}' 
130.49140000000003
  • newmm
% grep real bench_newmm.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
477.9549

Performance ratio

477.9549 / 130.4914 ≈ 3.66, so newmm took about 3.66 times as long as nlpo3.
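The averaging and the ratio can also be reproduced in Python. The sketch below assumes the `bench_*.txt` files contain bash-style `real XmY.YYYs` lines as shown in the results; the function names are hypothetical, and the standard deviation is an addition that the Ruby one-liner does not compute.

```python
import re
import statistics

# Matches bash `time` output lines such as "real    2m10.923s".
REAL_LINE = re.compile(r"^real\s+(\d+)m([\d.]+)s$")


def parse_real_seconds(text):
    """Return every `real` timing found in `text`, converted to seconds."""
    seconds = []
    for line in text.splitlines():
        m = REAL_LINE.match(line.strip())
        if m:
            seconds.append(int(m.group(1)) * 60 + float(m.group(2)))
    return seconds


def summarize(path):
    """Return (mean, sample standard deviation) of the timings in a bench file."""
    with open(path) as f:
        runs = parse_real_seconds(f.read())
    return statistics.mean(runs), statistics.stdev(runs)


def slowdown(slow_path, fast_path):
    """Return mean(slow) / mean(fast), i.e. how many times longer the slow tool took."""
    return summarize(slow_path)[0] / summarize(fast_path)[0]
```

Calling `slowdown("bench_newmm.txt", "bench_o3.txt")` on the files produced by the benchmark script should give the ratio of about 3.66 reported here.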
