I recently worked on an NLP classifier for open borders related to COVID-19 restrictions. The tech stack includes Node.js, TypeScript, NestJS as the back-end framework, Redis as the database, node-nlp for natural language processing, puppeteer and cheerio for scraping, @nestjs/schedule for the cron job, and React with Next.js for the front-end.
This blog post covers its main parts and their potential improvements.
Cron job
The data on the official website is updated once every several days on average, so running the cron job twice daily is enough to pick up any changes. The job is first invoked when the database connection is established.
The cron job scrapes the data and maps every country to its information. The countries are then classified with the trained classifier and stored in the database.
@Cron(CronExpression.EVERY_12_HOURS)
async upsertData() {
  // Scrape the page, extract per-country text, then classify each country.
  const pageSource = await this.scraperService.getPageSource(WEBPAGE_URL);
  const countriesInfo = this.scraperService.getCountriesInfo(pageSource);
  const classifiedCountries = await this.nlpService.getClassifiedCountries(countriesInfo);
  // Persist the classified countries as a single JSON string.
  return this.databaseService.set('countries', JSON.stringify(classifiedCountries));
}
Scraper
Countries have text information that may contain links and/or e-mail addresses.
A headless browser is used for scraping since some JavaScript code has to be executed in order to reveal the e-mail addresses. To run it on a Heroku dyno, an additional buildpack has to be added.
Natural language processing
Training
The classifier is trained with utterances and several intents, and the trained classifier is saved to a JSON file. The training data consists of 76 utterances, and it is used to classify 188 countries.
// nlp.data.ts
export const trainingData = [
  // ...
  {
    utterance,
    intent
  }
  // ...
];
// nlp.service.ts
trainAndSaveModel = async (): Promise<void> => {
  const modelFileName = this.getModelFileName();
  const manager = this.getNlpManager(modelFileName);
  this.addTrainingData(manager);
  await manager.train();
  manager.save(modelFileName);
};
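The body of `addTrainingData` is not shown above. Here is a minimal sketch of what it might look like, assuming each pair is registered through node-nlp's `addDocument(locale, utterance, intent)` method. The `DocumentSink` interface, the `'en'` locale, and the sample utterances and intent names are my assumptions for illustration, not the project's actual data.

```typescript
// Shape of one training example, mirroring the entries in nlp.data.ts.
interface TrainingExample {
  utterance: string;
  intent: string;
}

// Minimal structural type covering the single manager method used here.
// node-nlp's NlpManager exposes addDocument(locale, utterance, intent).
interface DocumentSink {
  addDocument(locale: string, utterance: string, intent: string): void;
}

// Registers every example with the manager and returns how many were added.
// The default locale 'en' is an assumption.
function addTrainingData(
  manager: DocumentSink,
  data: TrainingExample[],
  locale: string = 'en'
): number {
  for (const { utterance, intent } of data) {
    manager.addDocument(locale, utterance, intent);
  }
  return data.length;
}

// Made-up examples; the real utterances and intent names are not shown in this post.
const sampleTrainingData: TrainingExample[] = [
  { utterance: 'Borders are open for all travellers', intent: 'open' },
  { utterance: 'Entry is not permitted for foreign nationals', intent: 'closed' },
  { utterance: 'For more information contact the embassy', intent: 'skipped' }
];
```

Typing the parameter as a small interface instead of the concrete manager class keeps the function easy to test with a fake object.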
Preprocessing
Before processing, the data is split into sentences, links and e-mail addresses are skipped, and characters with diacritics are converted to their plain Latin equivalents.
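The preprocessing step could be sketched roughly as follows. This is not the project's actual code: the function names and regular expressions are illustrative, and the sketch assumes that a sentence containing a link or an e-mail address is dropped entirely and that sentences end with `.`, `!` or `?` followed by whitespace.

```typescript
// Rough patterns for links and e-mail addresses (assumptions, not exhaustive).
const URL_PATTERN = /\b(?:https?:\/\/|www\.)\S+/i;
const EMAIL_PATTERN = /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/;

// Decomposes accented characters (NFD) and strips the combining marks,
// e.g. "São Tomé" becomes "Sao Tome".
function removeDiacritics(text: string): string {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Naive splitter: breaks after ., ! or ? when followed by whitespace,
// so dots inside URLs and e-mail addresses do not end a sentence.
function splitIntoSentences(text: string): string[] {
  return text
    .replace(/([.!?])\s+/g, '$1\n')
    .split('\n')
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}

// Splits the text, drops sentences with links or e-mail addresses,
// and normalizes the remaining sentences.
function preprocess(text: string): string[] {
  return splitIntoSentences(text)
    .filter((sentence) => !URL_PATTERN.test(sentence) && !EMAIL_PATTERN.test(sentence))
    .map(removeDiacritics);
}
```

A splitter this simple will mis-split abbreviations such as "e.g. ", so real input may need a more careful tokenizer.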
Processing
Information is processed sentence by sentence using the trained model. Some sentences are classified as skipped and jumped over since they do not provide enough information for classification. The first sentence with a non-skipped intent determines the country's status.
for (let i = 0; i < sentences.length; i += 1) {
  // ...
  const { intent } = await nlpManager.process(sentences[i]);
  // ...
  if (!SKIPPED_INTENTS.includes(intent)) {
    return {
      ...country,
      status: intent
    };
  }
  // ...
}
API
There is one endpoint to get all of the data. Some potential improvements include pagination and filtering of the classified data.
const classifiedCountries = await this.databaseService.get('countries');
if (!classifiedCountries) return [];
return JSON.parse(classifiedCountries);
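Pagination and filtering are not implemented, but a minimal sketch of how they might work on the parsed array looks like this. The `name` field, the `ListOptions` and `Page` types, the default page size of 20, and the sample status values are assumptions for illustration; only the `status` field comes from the classification code above.

```typescript
// Assumed shape of one stored entry; only `status` is taken from the post.
interface ClassifiedCountry {
  name: string;
  status: string;
}

// Hypothetical query options for the endpoint.
interface ListOptions {
  status?: string;
  page?: number;
  pageSize?: number;
}

interface Page<T> {
  items: T[];
  total: number; // number of entries after filtering
  page: number;
  pageSize: number;
}

// Filters by status (when given) and returns one 1-indexed page of results.
function filterAndPaginate(
  countries: ClassifiedCountry[],
  options: ListOptions = {}
): Page<ClassifiedCountry> {
  const { status, page = 1, pageSize = 20 } = options;
  const filtered = status === undefined
    ? countries
    : countries.filter((country) => country.status === status);
  // Guard against zero, negative, or fractional values.
  const safePage = Math.max(1, Math.floor(page));
  const safePageSize = Math.max(1, Math.floor(pageSize));
  const start = (safePage - 1) * safePageSize;
  return {
    items: filtered.slice(start, start + safePageSize),
    total: filtered.length,
    page: safePage,
    pageSize: safePageSize
  };
}
```

Since the whole data set is under 1MB and stored under a single key, filtering in memory after `JSON.parse` is probably good enough here; a larger data set would call for filtering at the database level.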
Database
Redis is chosen as the primary database since reading is the main operation, in-memory reads are fast, and the total amount of stored data is less than 1MB.
Front-end
The front-end is a Progressive Web App that uses IndexedDB (not supported in Firefox when private mode is used) for caching the data, Bootstrap for styling, and React with Next.js for server-side rendering.