Thanathip Suntorntip Gorlph ported Newmm, Korakot Chaovavanich's Thai word tokenizer written in Python, to Rust; the port is called nlpo3. The nlpo3 website claims that nlpo3 is 2X faster than Newmm. I suspected the real speedup was larger than claimed because, unlike Python's regex engine, Rust's regex engine runs in linear time, a guarantee it gets by not supporting look-behind/look-ahead. Moreover, "2X faster" is ambiguous.
So I ran a slightly different experiment on a Mac mini M1. Both nlpo3 and Newmm were run from Zsh instead of a Python notebook. I tested on 1 million lines of a Thai Wikipedia snapshot. The result: Newmm took 3.66X as long as nlpo3 to tokenize the same text on the same computer.
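To make the regex point concrete, here is a minimal Python sketch of how a backtracking engine can blow up on a failed match. The pattern `(a+)+$` is a textbook pathological case chosen purely for illustration; it is an assumption of this example and is not taken from Newmm, so it says nothing about whether Newmm actually hits such a case. Rust's regex engine rules this behavior out by design.

```python
import re
import time

# Hypothetical pathological pattern (nested quantifiers); NOT taken from Newmm.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return the seconds Python's backtracking engine spends failing to match."""
    text = "a" * n + "b"  # the trailing 'b' makes a full match impossible
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


if __name__ == "__main__":
    # The time grows roughly exponentially with n on a backtracking engine.
    for n in (10, 14, 18):
        print(n, f"{time_failed_match(n):.6f}s")
```

Each extra `a` roughly doubles the work here, which is why a linear-time engine can matter on large inputs.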
Setup
- Computer: Scaleway's Mac mini M1
- Rustc: rustc 1.54.0 (a178d0322 2021-07-26)
- Python: Python 3.8.2
- OS: Darwin 506124d8-4acf-4595-9d46-8ca4b44b8110 20.6.0 Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101 arm64
- Script:
#!/bin/bash
set -x
INPUT=thwik-head1m.txt
for i in {1..10}
do
    { time python3 newmm.py < "$INPUT" > newmm.out ; } 2>> bench_newmm.txt
    { time nlpo3 segment < "$INPUT" > cham.out ; } 2>> bench_o3.txt
done
- A command line interface for newmm:
from pythainlp import word_tokenize
import sys

for line in sys.stdin:
    # line[:-1] drops the trailing newline before tokenizing
    print("|".join(word_tokenize(line[:-1])))
- nlpo3 version: 1.1.2
- nlpo3-cli version: 0.0.1
- chamkho version: 0.5.0
- dataset: https://file.veer66.rocks/langbench/thwik-head1m.txt
Result
nlpo3
% grep real bench_o3.txt
real 2m10.923s
real 2m12.014s
real 2m10.931s
real 2m9.448s
real 2m9.055s
real 2m10.570s
real 2m10.672s
real 2m10.140s
real 2m11.220s
real 2m9.941s
newmm
% grep real bench_newmm.txt
real 7m52.180s
real 7m58.090s
real 7m57.071s
real 8m9.779s
real 7m54.576s
real 7m52.807s
real 7m59.109s
real 7m58.489s
real 7m59.604s
real 7m57.844s
Average
- nlpo3
% grep real bench_o3.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
130.49140000000003
- newmm
% grep real bench_newmm.txt | ruby -lane 'BEGIN { all = 0.0; cnt = 0 }; cols = $F[1].split(/[ms]/).map {|x| x.to_f }; v = cols[0]*60 + cols[1]; all += v; cnt += 1; END { p all/cnt}'
477.9549
Performance ratio
477.9549 / 130.4914 ≈ 3.66 (average Newmm time divided by average nlpo3 time)
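As a cross-check, the averages and the ratio can be recomputed in Python from the `real` lines listed above. This is only a sketch: the helper names `to_seconds` and `mean_seconds` are made up for this example, and the timings are pasted in as strings rather than read from `bench_o3.txt` and `bench_newmm.txt`.

```python
import re

# "real" timings copied from the Result section above.
NLPO3_REAL = """
real 2m10.923s
real 2m12.014s
real 2m10.931s
real 2m9.448s
real 2m9.055s
real 2m10.570s
real 2m10.672s
real 2m10.140s
real 2m11.220s
real 2m9.941s
"""

NEWMM_REAL = """
real 7m52.180s
real 7m58.090s
real 7m57.071s
real 8m9.779s
real 7m54.576s
real 7m52.807s
real 7m59.109s
real 7m58.489s
real 7m59.604s
real 7m57.844s
"""


def to_seconds(line):
    """Convert a line such as 'real 2m10.923s' to seconds."""
    match = re.fullmatch(r"real\s+(\d+)m([\d.]+)s", line.strip())
    if match is None:
        raise ValueError(f"unexpected line: {line!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def mean_seconds(text):
    """Average all non-empty 'real' lines in text."""
    values = [to_seconds(line) for line in text.splitlines() if line.strip()]
    return sum(values) / len(values)


nlpo3_mean = mean_seconds(NLPO3_REAL)
newmm_mean = mean_seconds(NEWMM_REAL)
ratio = newmm_mean / nlpo3_mean

print(f"nlpo3 mean: {nlpo3_mean:.4f} s")
print(f"newmm mean: {newmm_mean:.4f} s")
print(f"ratio: {ratio:.2f}")  # prints "ratio: 3.66"
```

The ten runs of each tokenizer vary by only a few seconds, so the mean is a reasonable summary here.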