DEV Community

Moritz Wiesinger
A language processing model that determines the releasing newspaper from its articles

My Final Project (at least one of them)

For one of my final projects during my Computer Science studies, I (together with my two awesome teammates) built a small language processing tool that determines, from the text of an article, which newspaper published it.

We used a few technologies that we hadn't worked with before, which made the project all the more exciting. We wrote custom web crawlers to extract newspaper articles from their websites. Then we processed the extracted texts to get them into a more standardized, machine-readable format. The resulting article texts were then used to train a language model with Facebook FastText.
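As a rough illustration of the preprocessing step described above, here is a minimal sketch that turns crawled articles into the line format FastText expects for supervised training (one sample per line, with the class marked by the `__label__` prefix). The helper names, the newspaper names, and the simple regex-based normalization are assumptions made for this example; the actual project used SpaCy for its language processing, and the details are in the repository.

```python
import re

LABEL_PREFIX = "__label__"  # default label prefix used by FastText


def normalize(text: str) -> str:
    """Lowercase, put spaces around punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"([.,!?;:()\"'])", r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()


def to_training_line(newspaper: str, article: str) -> str:
    """Build one training line: '__label__<newspaper> <normalized text>'."""
    # Labels must not contain whitespace, so join the name with underscores.
    label = re.sub(r"\s+", "_", newspaper.strip().lower())
    return f"{LABEL_PREFIX}{label} {normalize(article)}"


def write_training_file(samples, path: str) -> None:
    """Write (newspaper, article) pairs to a file, one sample per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for newspaper, article in samples:
            fh.write(to_training_line(newspaper, article) + "\n")


if __name__ == "__main__":
    # Hypothetical sample data, only to show the output format.
    print(to_training_line("Daily Example", "Hello,  World!\nNew line."))
```

A file written this way could then be passed to FastText's Python bindings, for example with `fasttext.train_supervised(input="train.txt")`, where `train.txt` is a placeholder file name.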

With that model, we could determine which newspaper had published a new article solely by analyzing its content.
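To make the prediction step a bit more concrete: FastText's Python `predict` method returns a pair of labels and probabilities for a given text. The sketch below shows one way to turn such a result into a readable ranking. The function name `decode_prediction` and the newspaper labels are hypothetical and not taken from our code.

```python
def decode_prediction(labels, probabilities, prefix="__label__"):
    """Pair each label with its probability, strip the label prefix,
    and return the pairs sorted from most to least likely."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return [
        (label[len(prefix):] if label.startswith(prefix) else label, float(prob))
        for label, prob in ranked
    ]


if __name__ == "__main__":
    # With a trained model, the inputs would come from something like:
    #   labels, probabilities = model.predict(article_text, k=3)
    # Here we use made-up values in the same shape instead.
    labels = ("__label__paper_b", "__label__paper_a")
    probabilities = [0.2, 0.7]
    for newspaper, prob in decode_prediction(labels, probabilities):
        print(f"{newspaper}: {prob:.2f}")
```

Note that `predict` works on a single line of text, so the article should go through the same normalization as the training data, with line breaks removed, before it is passed in.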

Link to Code

(The actual model is not part of the repo for file size reasons, but it can easily be generated from the source data that is included.)

GitHub: mowies / newspaper-finder

A model to determine the releasing newspaper from its articles

Please read the full report here




How we built it (what's the stack?)

For this project, my team used Python as the base language with a few additional packages. Those include:

  • Scrapy (used to crawl newspaper websites and extract articles)
  • spaCy (used for further language processing like lemmatization and stemming)

In addition, we used Facebook's FastText to train the model.

Thoughts

As one of the last projects of my time at university, this was a very exciting and interesting one to tackle. We learned many new things and got great feedback, including ideas for how to improve and expand the tool in the future.