Mike Young

Posted on • Originally published at aimodels.fyi

Towards a Brazilian History Knowledge Graph

This is a Plain English Papers summary of a research paper called Towards a Brazilian History Knowledge Graph. If you like this kind of analysis, you should subscribe to the AImodels.fyi newsletter or follow me on Twitter.

Introduction

The paper addresses the task of automatically constructing a knowledge graph from text, focusing on recent Brazilian history as described in the Dicionário Histórico Biográfico Brasileiro (DHBB), the Brazilian Historical-Biographical Dictionary. The authors aim to apply recent natural language processing (NLP) techniques to the DHBB corpus and build a knowledge graph of recent Brazilian history from it.

Section 2 describes the DHBB dictionary and its maintenance over the years, explicitly stating the goal of the paper. Section 3 reports on previous works that applied NLP techniques to the DHBB. Section 4 discusses why the DHBB is maintained as a self-contained project at the Getulio Vargas Foundation, rather than being ingested into Wikipedia.

Section 5 presents the authors' strategy for mapping the titles of DHBB articles to Wikidata entries, highlighting the most relevant issues and evaluating their mapping approach. The paper concludes with final considerations in Section 6.

The DHBB corpus

The Brazilian Historical-Biographical Dictionary (DHBB) is an encyclopedic resource that provides organized, systematic information about notable personalities and events in recent Brazilian history. It covers the period starting with the "Revolução de 1930" (Revolution of 1930), a significant political upheaval in Brazil.

The DHBB currently contains 7,863 entries, with over 6,800 biographies and around 1,000 thematic entries on institutions, organizations, and events. It was initiated and is maintained by CPDOC (Centro de Pesquisa e Documentação de História Contemporânea do Brasil), an organization within the Fundação Getúlio Vargas (FGV), dedicated to preserving historical documents and developing research tools for Brazilian cultural heritage.

The first edition of the DHBB was published in 1984, with major updates in 2010, 2014, and 2015. The 2010 update made the contents fully available online, while the 2014 update involved collaboration with researchers from the School of Applied Mathematics (EMAp) to store the contents in a GitHub repository.

The DHBB aims to provide objective and unbiased entries, avoiding ideological or personal judgments. CPDOC researchers carefully revise all entries to ensure accuracy and consistent style.

The thematic entries, which describe political parties, movements, organizations, events, constitutions, laws, and foreign relations topics, are of particular interest in this work. The goal is to ensure that named entities detected in the DHBB corpus are present in Wikidata and, where they are missing, to add them so that Wikidata can serve as the backbone of a knowledge graph for Brazilian history.

However, the authors were surprised to find that many named entities from the DHBB entry titles could not be automatically mapped to Wikidata, suggesting potential disambiguation challenges.

Previous work

This section surveys computational linguistics research that uses the Brazilian Historical-Biographical Dictionary (DHBB) as a primary source of information about Brazilian history. It highlights the following work:

  1. Early exploration of the DHBB using tools like FreeLing and OpenWN-PT, a Portuguese version of WordNet (de Paiva et al., 2014).
  2. Research on extracting semantic information from appositives in the DHBB (Higuchi et al., 2018).
  3. Distant reading of the DHBB corpus to extract information without reading individual entries, identifying around 48,500 persons, 27,500 organizations, 5,000 places, and 36,900 other named entities (Higuchi et al., 2019).
  4. Investigation of different tools and methods for detecting errors in the syntactical processing of the DHBB corpus, including differences in sentence segmentation and tokenization (Ribeiro et al., 2020).
  5. Exploration of topics like the age of entrance in Brazilian politics, academic backgrounds of politicians, and family ties among political elites using the DHBB corpus (Higuchi et al., 2022).
  6. Observation that existing lexical resources may lack terms specific to Brazilian culture, highlighting the need for culture-specific knowledge graphs (de Paiva et al., 2022).

Taken together, these works show that the DHBB is a valuable resource for computational linguistics research on Brazilian history and culture, and that extracting and analyzing information from this large corpus is an ongoing effort.

Wikipedia vs. Wikidata

This section reviews the origins and growth of Wikipedia, a free online encyclopedia written and maintained by volunteers. While the English Wikipedia has over 6.7 million articles, other language editions such as the Portuguese Wikipedia have significantly fewer, even though Portuguese is the 8th most spoken language globally, with around 263 million native speakers.

The paper argues that while not every Brazilian politician or municipality is notable enough for its own Wikipedia page, all of them should ideally be part of Wikidata, a structured knowledge base that aims to be as comprehensive as possible. The authors set out to map named entities from the DHBB to Wikidata, starting with the titles of thematic and biographical entries and using the wikimapper tool.
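To make the mapping step concrete, here is a minimal sketch of a title lookup against a precomputed title-to-identifier index, which is the general idea behind tools like wikimapper. It is not the authors' pipeline and does not call wikimapper itself: the index contents, the placeholder identifiers, the accent-stripping normalization, and the function names are all assumptions made for illustration.

```python
import unicodedata


def normalize(title: str) -> str:
    """Lowercase, strip accents, and collapse whitespace and underscores."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().replace("_", " ").split())


def build_index(pairs):
    """Group identifiers by normalized title; one title may map to several."""
    index = {}
    for title, item_id in pairs:
        index.setdefault(normalize(title), set()).add(item_id)
    return index


def map_title(title, index):
    """Return (status, ids), where status is 'mapped', 'ambiguous' or 'unmapped'."""
    ids = sorted(index.get(normalize(title), ()))
    if not ids:
        return ("unmapped", [])
    if len(ids) == 1:
        return ("mapped", ids)
    return ("ambiguous", ids)


# Hypothetical index: the identifiers are placeholders, not real Wikidata IDs,
# and "Partido Exemplo" is an invented title used to show a name clash.
toy_index = build_index([
    ("Revolução de 1930", "Q-HYPOTHETICAL-1"),
    ("Partido Exemplo", "Q-HYPOTHETICAL-2"),
    ("Partido Exemplo", "Q-HYPOTHETICAL-3"),
])

for entry_title in ["REVOLUCAO DE 1930", "Partido Exemplo", "Aliança Inexistente"]:
    print(entry_title, "->", map_title(entry_title, toy_index))
```

Even in this toy form, the three outcomes mirror the cases a human reviewer has to handle: a clean match, a title shared by several items, and a title with no counterpart at all.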

The paper notes that although the information in the DHBB is of high quality, its researchers may not want to replicate their work on Wikipedia because of copyright concerns. The entities referenced in these works (historical characters, locations, organizations, events) should nonetheless be available in the structured format of Wikidata.

Mapping Historical Brazilian Entities

The paper describes the challenges of mapping entry titles from the Brazilian Historical-Biographical Dictionary (DHBB) to Wikidata, a free and open knowledge base. Key points:

  • Out of 973 thematic entry titles in the DHBB, only 498 (51%) could be mapped to Wikidata items automatically using existing tools.

  • Examples of problematic cases include entries for entities that no longer exist, sub-entities mentioned within larger Wikipedia pages, misspellings, and naming ambiguities.

  • For biographical entries (6,980 total), only 38% did not get an automatic Wikidata mapping. However, human evaluation showed that many of the mappings that were produced automatically are incorrect.

  • A random sample evaluation revealed around 30% of unmapped entries (thematic and biographical) could actually be found manually in Wikidata. Conversely, 16-34% of the automatic mappings were incorrect.

  • The authors aim to crowdsource improving the mappings and adding missing entries to Wikidata, an effort useful for preserving Brazilian historical knowledge.

  • Overall, Wikidata appears lacking in coverage of Brazilian entities, organizations, and events, necessitating significant manual effort to represent this information adequately.
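The figures above can be combined into a rough back-of-the-envelope estimate. The sketch below takes the thematic counts from the summary (973 titles, 498 mapped) and applies the sampled rates (about 30% of unmapped entries findable by hand, 16-34% of automatic mappings incorrect) to them. Applying the sample rates to the whole thematic set is an assumption made here for illustration, so the derived counts are estimates and not results reported in the paper; the function and variable names are likewise hypothetical.

```python
def adjusted_counts(total: int, mapped: int,
                    recoverable_rate: float, error_rate: float) -> dict:
    """Project sample-based rates onto the full set of entries."""
    unmapped = total - mapped
    wrong = mapped * error_rate
    return {
        "unmapped": unmapped,
        # Unmapped entries a human could likely still find in Wikidata.
        "recoverable": unmapped * recoverable_rate,
        # Automatic mappings expected to be incorrect.
        "wrong": wrong,
        "plausibly_correct": mapped - wrong,
    }


THEMATIC_TOTAL = 973   # thematic entry titles, from the summary
THEMATIC_MAPPED = 498  # titles mapped automatically, from the summary

share = THEMATIC_MAPPED / THEMATIC_TOTAL
best = adjusted_counts(THEMATIC_TOTAL, THEMATIC_MAPPED, 0.30, 0.16)
worst = adjusted_counts(THEMATIC_TOTAL, THEMATIC_MAPPED, 0.30, 0.34)

print(f"automatic coverage: {share:.1%}")
print(f"unmapped titles: {best['unmapped']}")
print(f"estimated recoverable by hand: {best['recoverable']:.0f}")
print(f"estimated incorrect mappings: {best['wrong']:.0f} to {worst['wrong']:.0f}")
```

The point of the exercise is that the headline coverage figure overstates the usable mappings and understates what Wikidata already contains, which is why the authors turn to manual review and crowdsourcing.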

Conclusion

The paper closes by stressing the importance of building knowledge graphs for academic subjects such as contemporary Brazilian history and modern Brazilian art as a way to preserve cultural heritage. Projects such as the DHBB (Brazilian Historical-Biographical Dictionary) and the 'Enciclopédia Cultural Itaú' are valuable resources, but their information needs to be connected and structured to enable querying and reasoning.

Linking DHBB entries to Wikipedia and Wikidata can improve their visibility in search engine results and make the wealth of information more widely accessible. Currently, the information contained in DHBB is not being shared as widely as desired.

The proposed project aims to add DHBB information to Wikidata, making it more broadly available. This approach is presented as a straightforward way to enhance the dissemination of this cultural knowledge.
