Turning Long Textfiles into Speech with AWS Polly and Python

Joseph D. Marhee
Systems & Infrastructure Engineer
Updated on ・3 min read

I recently published a Python package that ingests a text file and creates a text-to-speech rendering of that text using AWS Polly. Part of the challenge is that there is a character limit for each generated recording. Simply slicing a long string into 250-character chunks could work, but you run the risk of breaking up words: when the audio chunks are reassembled, a word like "the" becomes "tuh" and "he" when heard across the boundary. This package was partly written to handle this kind of behavior and trim as needed (possibly creating shorter and longer chunks, rather than splitting evenly).
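
To make the boundary problem concrete, here is a minimal sketch of the naive fixed-width slicing described above. The function name `naive_chunks` and the tiny limit of 6 are illustrative assumptions, not part of the package.

```python
def naive_chunks(text, limit):
    """Slice text into fixed-width pieces, ignoring word boundaries."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]

text = "read the book"
chunks = naive_chunks(text, 6)
print(chunks)  # ['read t', 'he boo', 'k']
```

The word "the" ends up split between the first and second chunk, which is exactly the case that produces "tuh" and "he" once each chunk is synthesized separately.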

After installing the package:

pip3 install polly-textfile-cli

and running something like:

polly-textfile --path input.txt --name output-name

the package first splits the text file into a list of individual words:

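def fileChunkList(filePath, limit):
    with open(filePath, 'r') as file:
        # Turn newlines into spaces so words on adjacent lines are not fused together.
        data = file.read().replace('\n', ' ')
    # Naive fixed-width slicing, kept for reference; it can cut words in half:
    #lines = [data[i:i+limit] for i in range(0, len(data), limit)]
    # split() with no argument also drops the empty strings left by repeated spaces.
    lines_in = data.split()
    lines = constructSentences(lines_in, limit)
    return lines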

The call to constructSentences(lines_in, limit) then regroups those words into the segments that Polly will render as audio:

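def constructSentences(words, limit):
    ss = []  # completed chunks
    s = []   # words in the chunk currently being built
    for w in words:
        # Measure the chunk as it would be joined, so the separating space counts.
        if len(" ".join(s + [w])) <= limit:
            s.append(w)
        else:
            # A single word longer than limit still becomes its own chunk.
            if s:
                ss.append(" ".join(s))
            s = [w]
    # Flush the final chunk; the loop only appends when a chunk overflows.
    if s:
        ss.append(" ".join(s))
    return ss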
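
As a quick check of how this kind of greedy packing behaves, here is a self-contained sketch with a small limit. The name `pack_words` is hypothetical and the function is a stand-alone illustration rather than the package's own code; it counts the joining space toward the limit and flushes the last partial chunk after the loop.

```python
def pack_words(words, limit):
    """Greedily pack words into space-joined chunks of at most `limit` characters."""
    chunks, current = [], []
    for word in words:
        # Measure the chunk as it would look with the new word appended.
        if len(" ".join(current + [word])) <= limit:
            current.append(word)
        else:
            if current:
                chunks.append(" ".join(current))
            current = [word]
    # Emit whatever is left once the words run out.
    if current:
        chunks.append(" ".join(current))
    return chunks

sentence = "the quick brown fox jumps over the lazy dog"
chunks = pack_words(sentence.split(), 15)
print(chunks)  # ['the quick brown', 'fox jumps over', 'the lazy dog']
```

Every chunk ends on a word boundary, and joining the chunks with spaces gives back the original sentence.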

So if limit is 250, then before adding a new word to a "sentence" (a string of at most limit characters), the function checks whether the limit would be exceeded; if it would, the current sentence is added to the list and a new one is started with that word. The result is the lines list in fileChunkList(), which becomes the text script for the recordings created in the next function:

def createChunkAudio(id, linesList):
    partsIdList = []
    # Number parts from 1 and cover every chunk, including the last one.
    for i, line in enumerate(linesList, start=1):
        resp = streamAudio(line)
        # AudioStream is a botocore StreamingBody; read() returns the MP3 bytes.
        with open("%s-part-%s.mp3" % (id, i), 'wb') as file:
            file.write(resp['AudioStream'].read())
            partsIdList.append(file.name)
    return partsIdList

For each item in the lines list (each at most 250 characters), the loop passes the text to streamAudio() and writes the result to an mp3 file (i.e. ${whatever-output-name}-part-1.mp3). streamAudio() itself is just the one-off call to Polly that creates the audio stream:

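from boto3 import client

def streamAudio(inString):
    # One synthesize_speech call per chunk; region and voice are hard-coded here.
    polly = client("polly", region_name="us-east-2")
    response = polly.synthesize_speech(
        Text=inString,
        OutputFormat="mp3",
        VoiceId="Matthew")
    # The MP3 data is available as response['AudioStream'].
    return response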
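
The per-chunk loop can be tried out without AWS credentials if the synthesizer is passed in as a parameter. The sketch below is an illustration under that assumption: `write_parts`, `fake_synthesize`, and `read_bytes` are hypothetical names, and the fake synthesizer returns the text as bytes rather than real MP3 audio.

```python
import os
import tempfile

def write_parts(name, chunks, synthesize, directory="."):
    """Write one file per chunk, numbered from 1, and return the paths in order."""
    paths = []
    for index, chunk in enumerate(chunks, start=1):
        path = os.path.join(directory, "%s-part-%s.mp3" % (name, index))
        with open(path, "wb") as handle:
            handle.write(synthesize(chunk))
        paths.append(path)
    return paths

def fake_synthesize(text):
    # Stand-in for the Polly call: returns the text as bytes, not real audio.
    return text.encode("utf-8")

def read_bytes(path):
    with open(path, "rb") as handle:
        return handle.read()

with tempfile.TemporaryDirectory() as tmp:
    parts = ["first chunk", "second chunk", "third chunk"]
    paths = write_parts("output", parts, fake_synthesize, tmp)
    names = [os.path.basename(p) for p in paths]
    contents = [read_bytes(p) for p in paths]

print(names)  # ['output-part-1.mp3', 'output-part-2.mp3', 'output-part-3.mp3']
```

To use real audio, the `synthesize` argument could be a small wrapper that calls Polly and returns the bytes read from the response's `AudioStream`.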

At this point, for a textfile that needed to be split across 3 segments, you've created mp3 files like output-part-1.mp3, output-part-2.mp3, and output-part-3.mp3. That is not terribly convenient, so the last step is to combine them, using the list of paths for the audio chunks that the above functions created:

import os

def concatPartsAudio(pathList, id):
    print(pathList)
    # ffmpeg's concat protocol takes its inputs as "concat:a.mp3|b.mp3|c.mp3".
    cmdStr = "concat:" + "|".join(pathList)
    print(cmdStr)
    # -acodec copy joins the parts without re-encoding the audio.
    concat = os.system("ffmpeg -i '%s' -acodec copy '%s.mp3'" % (cmdStr, id))
    # Return the exit status of stat; 0 means the combined file exists.
    s = os.system("stat %s.mp3" % (id))
    return s

This could be done any number of ways, depending on your preferred audio output settings, but in the simplest form, we're just concatenating each of the files into output.mp3.
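
One of those other ways, sketched here as an assumption rather than what the package does, is to build the same ffmpeg invocation as an argument list and run it with `subprocess.run`, which avoids shell quoting problems with unusual file names. The function names `build_concat_command` and `concat_parts` are hypothetical.

```python
import subprocess

def build_concat_command(paths, output_name):
    """Build the ffmpeg argument list that joins the parts without re-encoding."""
    concat_input = "concat:" + "|".join(paths)
    return ["ffmpeg", "-i", concat_input, "-acodec", "copy", "%s.mp3" % output_name]

def concat_parts(paths, output_name):
    # check=True raises CalledProcessError if ffmpeg exits with a non-zero status.
    subprocess.run(build_concat_command(paths, output_name), check=True)

command = build_concat_command(["output-part-1.mp3", "output-part-2.mp3"], "output")
print(command)
```

Calling `concat_parts(paths, "output")` with the list of part files would then run ffmpeg directly, provided ffmpeg is installed and on the PATH.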
