Measuring the quality of model outputs gets a lot more complex when we’re dealing with language.
This becomes clear very quickly in many NLP problems: how do we measure the accuracy of a generated sequence in tasks like text summarization or translation?
For this, we can use Recall-Oriented Understudy for Gisting Evaluation (ROUGE). Fortunately, the name is deceptively complicated: the metric itself is easy to understand, and even easier to implement.
Let’s jump straight into it.
- What is ROUGE
- F1 Score
- In Python
- For Datasets