Josue Luzardo Gebrim

Posted on Dec 18, 2021

Creating a Podcast of book summaries and articles in PDF in Portuguese with AI in Python!

#machinelearning #podcast #python #googlecloud

A Podcast? Of Abstracts? PDF? In Python? Automatically? With AI? NLP?

Note: This post was first published on Medium.

Motivation: Have you ever come across a gigantic book or article and wondered how long it would take to read, when in the end you might not need all of it? What if automation could read it, summarize it, and turn it into a podcast episode for you?

I had this idea after watching a video showing that, with a few lines of code, it is possible to create an audiobook from a PDF file.

But we can go further: imagine summarizing a 300 to 400 page book into a few paragraphs and then creating a podcast episode of just a few minutes…

I don’t want to spend hours listening to a book or article for no reason…

For this tutorial, I’m going to use the short story “O Alienista” from Machado de Assis’s book “Papéis Avulsos” (1882). It is public domain content, which I am using only to validate my attempt at automation, and it is available at:

http://machado.mec.gov.br/obra-completa-lista/itemlist/category/24-conto

1) Extracting text from a PDF file with Python

1.1. I downloaded the PDF file of “Papéis Avulsos” (1882) and created a folder on Google Drive for it. Using Google Colab and the following code, we can extract the text:

# Installing the Python library to read the PDF
!pip install pypdf2
import PyPDF2

# Path where the book is located (Google Drive mounted in Colab)
URL_livro = './drive/MyDrive/machado_assis/pixarAvulsos.pdf'

# Opening the file at the indicated location
book = open(URL_livro, 'rb')

# Reading the book
# (PdfReader, .pages and extract_text() replace PdfFileReader, getPage() and
# extractText(), which were removed in PyPDF2 3.0)
pdfReader = PyPDF2.PdfReader(book)
text = ''  # variable that will hold all the text of the story

# The story is on page indices 3 to 31 (zero-based; range excludes 32)
for num in range(3, 32):
    page = pdfReader.pages[num]
    text = text + page.extract_text()

book.close()

With a few lines of code, we easily extracted the text of all 29 pages from the PDF file. Now we can summarize it.
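Text extracted from a PDF usually keeps the line breaks of the printed page, which can confuse sentence splitting later on. Here is a minimal sketch of a cleanup step, assuming that collapsing all whitespace into single spaces is good enough for this story; `clean_extracted_text` is a hypothetical helper name, not part of PyPDF2:

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Collapse line breaks and repeated whitespace left over from PDF extraction."""
    # Replace any run of whitespace (spaces, tabs, newlines) with a single space
    return re.sub(r"\s+", " ", raw).strip()


# Illustrative input only, not taken from the actual PDF
sample = "A casa verde\nfoi  construída\n\nem Itaguaí. "
print(clean_extracted_text(sample))
```

In the notebook this could be applied as `text = clean_extracted_text(text)` before the summarization step.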

2) Summarizing texts with NLTK

NLTK is an easy-to-use toolkit for Natural Language Processing tasks such as text summarization and sentiment analysis. In this example, NLTK will be used to create a summary of the story:

#Package installation
!pip install nltk 

#Installation of word dictionaries (corpus)
!python -m nltk.downloader all 

# Dividing our text into sentences and then into words
# (language='portuguese' selects NLTK's Portuguese sentence tokenizer
# instead of the English default)
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
sentencas = sent_tokenize(text, language='portuguese')
palavras = word_tokenize(text.lower(), language='portuguese')

# Removing the stopwords and punctuation
from nltk.corpus import stopwords
from string import punctuation
stopwords_pt = set(stopwords.words('portuguese') + list(punctuation))
palavras_sem_stopwords = [palavra for palavra in palavras if palavra not in stopwords_pt]

#Creating the frequency distribution
from nltk.probability import FreqDist
frequencia = FreqDist(palavras_sem_stopwords)

#Separating the most important sentences
from collections import defaultdict
sentencas_importantes = defaultdict(int)

#Loop to go through all the sentences and collect all the statistics
for i, sentenca in enumerate(sentencas):
    for palavra in word_tokenize(sentenca.lower()):
        if palavra in frequencia:
            sentencas_importantes[i] += frequencia[palavra]

#"n" most important sentences
from heapq import nlargest
idx_sentencas_importantes = nlargest(4, sentencas_importantes, sentencas_importantes.get)

# We have the summary! :)
# Join the selected sentences in their original order, separated by spaces
resumo = ' '.join(sentencas[i] for i in sorted(idx_sentencas_importantes))

The code above is based on the NLP example below, which I recommend reading:

https://medium.com/@viniljf/utilizando-processamento-de-linguagem-natural-para-criar-um-sumariza%C3%A7%C3%A3o-autom%C3%A1tica-de-textos-775cb428c84e
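To make the idea easier to see, here is the same frequency-based scoring as a small self-contained function. It is only a sketch: it uses regular expressions instead of NLTK's tokenizers and a tiny hand-written stop-word list (both are my simplifications), and `summarize` is a hypothetical name:

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny illustrative stop-word list (an assumption; NLTK's Portuguese list is far larger)
STOPWORDS = {"a", "o", "e", "de", "do", "da", "em", "um", "uma",
             "que", "para", "com", "no", "na", "os", "as"}


def summarize(text: str, n: int = 2) -> str:
    """Return the n highest-scoring sentences of text, in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text, ignoring stop words
    freq = Counter(w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS)
    # Score each sentence as the sum of the frequencies of its words
    scores = {
        i: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
        for i, s in enumerate(sentences)
    }
    top = nlargest(n, scores, key=scores.get)
    return " ".join(sentences[i] for i in sorted(top))


exemplo = "O gato dorme. O gato e o cachorro brincam no jardim. Chove."
print(summarize(exemplo, n=1))  # O gato e o cachorro brincam no jardim.
```

Note that summing frequencies favors long sentences, since they contain more words. Dividing each score by the sentence length would be one way to compensate, at the cost of sometimes favoring very short sentences.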

3) Converting text to audio in Python

Now that we have a variable with the summary, we can convert the text to an audio file:

!pip install gTTS

from gtts import gTTS
import os

tts = gTTS(resumo, lang='pt-br')

tts.save('resumo.mp3')
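Before generating the audio, it can help to estimate how long the episode will be. This is a minimal sketch, assuming a speaking rate of about 150 words per minute (my assumption, not a figure from gTTS); `estimate_minutes` is a hypothetical helper:

```python
def estimate_minutes(text: str, words_per_minute: int = 150) -> float:
    """Rough spoken duration of text, in minutes."""
    # Count whitespace-separated words and divide by the assumed speaking rate
    n_words = len(text.split())
    return n_words / words_per_minute


# 600 words at the assumed 150 words per minute
print(estimate_minutes("palavra " * 600))  # 4.0
```

If the estimate for `resumo` is too long, picking fewer sentences in the summarization step is the simplest way to shorten the episode.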

Conclusion?

With a few lines of code, it was possible to take the text from 29 pages, summarize it, and convert that summary into less than 5 minutes of audio. The result may be a little confusing, but it is a good starting point. Check it out:

https://anchor.fm/josue-luzardo-gebrim/episodes/Descobrindo-Random-Forest-e-CNNs-eng36g

As next steps, I see automating the publication of the mp3 file on the podcast channel and trying other algorithms, tools, and solutions, such as BERT, to create a more cohesive summary.

What do you think? In your view, what is missing to create an automated podcast with AI? :)

NOTE: In my research before creating this automation, I came across a video about creating an audiobook using the Machine Learning resources of Google Cloud.

References:

https://medium.com/@viniljf

https://sdhilip.medium.com/

https://www.youtube.com/channel/UCStj-ORBZ7TGK1FwtGAUgbQ

Follow me on Medium :)

Latest comments (1)

bredmond1019 • Jan 10 '22

Super cool! I have been learning Portuguese for several years and I think this is a great tool for taking the books I have and turning them into audiobooks to help with listening practice. I will give this a try. Thanks