<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Chris Murphy</title>
    <description>The latest articles on DEV Community by Chris Murphy (@mdhornet90).</description>
    <link>https://dev.to/mdhornet90</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F132733%2F5d3f0fa6-1639-4a39-b119-0cc792b08fcd.png</url>
      <title>DEV Community: Chris Murphy</title>
      <link>https://dev.to/mdhornet90</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mdhornet90"/>
    <language>en</language>
    <item>
      <title>Converting Word to PDF Using A Python-Based Lambda</title>
      <dc:creator>Chris Murphy</dc:creator>
      <pubDate>Thu, 29 Aug 2019 22:22:46 +0000</pubDate>
      <link>https://dev.to/mdhornet90/converting-word-to-pdf-using-a-python-based-lambda-3d82</link>
      <guid>https://dev.to/mdhornet90/converting-word-to-pdf-using-a-python-based-lambda-3d82</guid>
      <description>&lt;h3&gt;
  
  
  The Mission
&lt;/h3&gt;

&lt;p&gt;TL;DR or: abort mission&lt;/p&gt;

&lt;p&gt;I was recently put on a new assignment that makes heavy use of AWS for, among other things, serverless architecture. The goal of my first task was to trigger a Lambda &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html" rel="noopener noreferrer"&gt;when documents are uploaded to an S3 bucket&lt;/a&gt;, and convert files of varying formats to &lt;code&gt;.pdf&lt;/code&gt;s. Among the formats expected to be supported were &lt;code&gt;.doc&lt;/code&gt; and &lt;code&gt;.docx&lt;/code&gt;. While I knew those files are packed with metadata for use during document editing, I figured I could just scrape the document until I found ASCII characters. That was until I forced VS Code to open the file raw:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc7jgv2c2qt61neflpwyq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fc7jgv2c2qt61neflpwyq.png" alt="The nightmare that is a raw Word Doc"&gt;&lt;/a&gt;&lt;br&gt;
The horror. Clearly, I was about to have my hands full.&lt;/p&gt;
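&lt;p&gt;For context, here is a minimal sketch of how a handler might pick the uploaded Word documents out of an S3 notification event. The function name &lt;code&gt;uploaded_documents&lt;/code&gt; and the extension filter are my own assumptions for illustration; the only thing taken from AWS is the shape of the event (&lt;code&gt;Records&lt;/code&gt;, then &lt;code&gt;s3.bucket.name&lt;/code&gt; and &lt;code&gt;s3.object.key&lt;/code&gt;).&lt;/p&gt;

```python
from urllib.parse import unquote_plus

# Assumption: only the two Word formats named above are of interest here.
CONVERTIBLE_EXTENSIONS = ('.doc', '.docx')


def uploaded_documents(event):
    """Return (bucket, key) pairs for uploaded Word documents in an S3 event."""
    documents = []
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(record['s3']['object']['key'])
        if key.lower().endswith(CONVERTIBLE_EXTENSIONS):
            documents.append((bucket, key))
    return documents
```

&lt;p&gt;Decoding the key matters more than it looks: a file named &lt;code&gt;Q3 summary.docx&lt;/code&gt; shows up in the event as &lt;code&gt;Q3+summary.docx&lt;/code&gt;, and fetching it by the encoded name would fail.&lt;/p&gt;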

&lt;h3&gt;
  
  
  Exploration
&lt;/h3&gt;

&lt;p&gt;I think we can all agree that writing code to solve problems should be a &lt;a href="https://programmingisterrible.com/post/139222674273/write-code-that-is-easy-to-delete-not-easy-to" rel="noopener noreferrer"&gt;last resort&lt;/a&gt;, so first I wondered whether I could leverage a (hopefully free) service to do the heavy lifting.&lt;/p&gt;

&lt;h4&gt;
  
  
  How about Google Docs?
&lt;/h4&gt;

&lt;p&gt;I considered using Google Docs as the conversion workhorse, but a coworker who had been on the project longer told me that Google Docs always dropped certain formatting elements, typically symbols like open parentheses. The ask from the business was that the document formatting be preserved completely, so I couldn't risk an incomplete solution.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ok, so what else is there?
&lt;/h4&gt;

&lt;p&gt;It turns out a popular strategy for converting Word documents to PDF is to use the &lt;a href="https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters" rel="noopener noreferrer"&gt;CLI capabilities of LibreOffice&lt;/a&gt;. In fact, there already exists a JS-based library that &lt;a href="https://github.com/shelfio/aws-lambda-libreoffice" rel="noopener noreferrer"&gt;does exactly that&lt;/a&gt;!&lt;/p&gt;

&lt;h4&gt;
  
  
  Oh! So why not use JavaScript instead of Python?
&lt;/h4&gt;

&lt;p&gt;Because I felt like using Python and wanted a challenge? Forget about what I said earlier about avoiding writing code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ok then.
&lt;/h4&gt;

&lt;h3&gt;
  
  
  The tools
&lt;/h3&gt;

&lt;p&gt;So we've established that I wanted to replicate the functionality of the JavaScript Word-to-PDF conversion library in a Python-based AWS Lambda for valid and totally non-ego-related reasons. The first step was to pick apart the code of the aforementioned JS library to figure out how the magic happens. Let's take a look at &lt;code&gt;Shelf&lt;/code&gt;'s description for their AWS-Lambda-ified LibreOffice:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;85 MB LibreOffice to fit inside AWS Lambda compressed with &lt;code&gt;brotli&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And sure enough the code &lt;a href="https://github.com/shelfio/aws-lambda-libreoffice/blob/master/src/convert.ts" rel="noopener noreferrer"&gt;proves that out&lt;/a&gt;. It uses &lt;del&gt;Richard's&lt;/del&gt; Google's &lt;a href="https://github.com/google/brotli" rel="noopener noreferrer"&gt;&lt;code&gt;brotli&lt;/code&gt; compression algorithm&lt;/a&gt; to unpack a &lt;code&gt;lo.tar.br&lt;/code&gt; file provided by the &lt;a href="https://github.com/shelfio/libreoffice-lambda-layer" rel="noopener noreferrer"&gt;LibreOffice Lambda Layer&lt;/a&gt; into a given AWS Lambda Function's &lt;code&gt;/tmp&lt;/code&gt; folder. &lt;/p&gt;

&lt;p&gt;This sure seems like a lot of effort, why can't we just upload an unpacked instance of what's contained in that LibreOffice Layer ourselves? Well, at this point it's time to take a dive off a technical cliff...&lt;/p&gt;

&lt;h3&gt;
  
  
  Constraints
&lt;/h3&gt;

&lt;p&gt;It's been pretty well-established that the maximum unzipped size of a Lambda deployment package, layers included, &lt;a href="https://hackernoon.com/exploring-the-aws-lambda-deployment-limits-9a8384b0bec3" rel="noopener noreferrer"&gt;is 250MB&lt;/a&gt;. You might see that &lt;code&gt;85MB&lt;/code&gt; number up there and think "what's the problem, exactly?"&lt;/p&gt;

&lt;h4&gt;
  
  
  You read my mind, what is it?
&lt;/h4&gt;

&lt;p&gt;While &lt;code&gt;85MB&lt;/code&gt; is indeed a much smaller number than &lt;code&gt;250MB&lt;/code&gt;, it's a testament to how efficient the &lt;code&gt;brotli&lt;/code&gt; algorithm is at packing up its contents; uncompressed and unpacked, the size of the package is just north of &lt;code&gt;300MB&lt;/code&gt;! So if we were to upload the package ourselves, we'd still have to do the work of decompressing its contents. And in that case, why don't we just leverage the existing LibreOffice layer to &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html" rel="noopener noreferrer"&gt;keep our deployment package small&lt;/a&gt; and reduce iteration time whenever we upload new code, which is certainly subject to change far more often than our use of LibreOffice?&lt;/p&gt;
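&lt;p&gt;The arithmetic is simple enough to write down. This is only a sketch using the numbers quoted above; the constant and function names are mine, and the &lt;code&gt;300MB&lt;/code&gt; figure is the approximate "just north of" size, not a measured value.&lt;/p&gt;

```python
MB = 1024 * 1024

UNZIPPED_PACKAGE_LIMIT = 250 * MB      # limit cited above, layers included
LAYER_COMPRESSED_SIZE = 85 * MB        # lo.tar.br as shipped in the layer
LIBREOFFICE_UNPACKED_SIZE = 300 * MB   # approximate size after unpacking


def fits_in_package(size_in_bytes, limit=UNZIPPED_PACKAGE_LIMIT):
    """True when a payload of the given size fits under the package limit."""
    return limit >= size_in_bytes
```

&lt;p&gt;With these numbers, &lt;code&gt;fits_in_package(LAYER_COMPRESSED_SIZE)&lt;/code&gt; holds while &lt;code&gt;fits_in_package(LIBREOFFICE_UNPACKED_SIZE)&lt;/code&gt; does not, which is the whole reason the archive ships compressed.&lt;/p&gt;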

&lt;h4&gt;
  
  
  You've convinced me, but how do we move forward?
&lt;/h4&gt;

&lt;p&gt;As I mentioned before, the JS library unpacks LibreOffice to &lt;code&gt;/tmp&lt;/code&gt;, and this is beneficial for &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/running-lambda-code.html" rel="noopener noreferrer"&gt;two reasons&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The size of &lt;code&gt;/tmp&lt;/code&gt; is capped at &lt;code&gt;512MB&lt;/code&gt;, more than enough for a decompressed and unpacked instance of LibreOffice and all the fixins of a given run of a (sane) Lambda Function!&lt;/li&gt;
&lt;li&gt;The contents of &lt;code&gt;/tmp&lt;/code&gt; are &lt;em&gt;cached between runs&lt;/em&gt; (as long as the same execution environment is reused), meaning that we can add logic to reuse a previously unpacked instance of LibreOffice. Considering my testing showed initial extraction of the program took between 10s and 12s, this is a critical performance improvement to keep Lambdas that rely on PDF conversion speedy!&lt;/li&gt;
&lt;/ol&gt;
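&lt;p&gt;Both points can be checked at runtime. The helpers below are a hedged sketch, not part of &lt;code&gt;Shelf&lt;/code&gt;'s library: &lt;code&gt;is_cached&lt;/code&gt; and &lt;code&gt;has_room_for&lt;/code&gt; are hypothetical names, and they rely only on the standard library's &lt;code&gt;os.path.isdir&lt;/code&gt; and &lt;code&gt;shutil.disk_usage&lt;/code&gt;.&lt;/p&gt;

```python
import os
import shutil


def is_cached(install_dir):
    """True when a previous invocation already unpacked LibreOffice there."""
    return os.path.isdir(install_dir)


def has_room_for(required_bytes, path='/tmp'):
    """True when the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes
```

&lt;p&gt;Checking free space before extracting is optional, but it turns a confusing half-written install into an early, readable failure.&lt;/p&gt;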

&lt;h3&gt;
  
  
  The Approach
&lt;/h3&gt;

&lt;p&gt;Ok finally, we get to come up with an algorithm! First let's recap what we know about how all of these pieces fit together.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LibreOffice Lambda Layer, like all other Lambda Layers, dumps its contents into the &lt;code&gt;/opt&lt;/code&gt; folder. So we know we have an &lt;code&gt;/opt/lo.tar.br&lt;/code&gt; file with size &lt;code&gt;85MB&lt;/code&gt; that needs decompressing and unpacking.&lt;/li&gt;
&lt;li&gt;We know that for any given run of a Lambda Function, we have &lt;code&gt;512MB&lt;/code&gt; of space in &lt;code&gt;/tmp&lt;/code&gt;, so we're going to want to unpack everything there.&lt;/li&gt;
&lt;li&gt;We also know that &lt;code&gt;/tmp&lt;/code&gt; is cacheable between Lambda runs, so we're going to want to check whether a previous run of the Lambda already did the unpacking for us.&lt;/li&gt;
&lt;li&gt;Finally, we know that LibreOffice has been compressed with the &lt;code&gt;brotli&lt;/code&gt; compression algorithm. I'm going to cut the suspense short and tell you that a Python-specific &lt;a href="https://pypi.org/project/Brotli/" rel="noopener noreferrer"&gt;implementation exists&lt;/a&gt;, complete with &lt;a href="https://python-hyper.org/projects/brotlipy/en/latest/api.html" rel="noopener noreferrer"&gt;acceptable levels of documentation&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With all of this in mind, we now have enough context to port the JS code of &lt;code&gt;Shelf&lt;/code&gt;'s library to Python!&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Build Tools
&lt;/h4&gt;

&lt;p&gt;Keep in mind all of this decompressing and unpacking needs to be done in the AWS Lambda Function itself, so any external tools we need to use (like the &lt;code&gt;brotli&lt;/code&gt; module) must be bundled in the Function code we send up. I highly recommend checking out the &lt;a href="https://pypi.org/project/juniper/" rel="noopener noreferrer"&gt;juniper tool&lt;/a&gt; for this task - it bundles standalone versions of your dependencies along with all of your source code into a &lt;code&gt;.zip&lt;/code&gt; file. From there it's just a matter of &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html#python-package-dependencies" rel="noopener noreferrer"&gt;uploading your bundled code to AWS&lt;/a&gt; (note that &lt;code&gt;juniper&lt;/code&gt; handles steps 1 through 3 for you).&lt;/p&gt;

&lt;h4&gt;
  
  
  Finally, the Code
&lt;/h4&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tarfile&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;brotli&lt;/span&gt;

&lt;span class="n"&gt;LIBRE_OFFICE_INSTALL_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp/instdir&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_libre_office&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIBRE_OFFICE_INSTALL_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIBRE_OFFICE_INSTALL_DIR&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;We have a cached copy of LibreOffice, skipping extraction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;No cached copy of LibreOffice exists, extracting tar stream from Brotli file...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nb"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/opt/lo.tar.br&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;brotli_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;decompressor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brotli&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Decompressor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;brotli_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;decompressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Extracting tar stream to /tmp for caching...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tarfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileobj&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;tar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Done caching LibreOffice!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{}/program/soffice&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIBRE_OFFICE_INSTALL_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Breaking it down
&lt;/h4&gt;

&lt;p&gt;There's a little to unpack (&lt;a href="https://chumley.barstoolsports.com/wp-content/uploads/2018/11/26/5aa22d7456db5.image_.jpg" rel="noopener noreferrer"&gt;sorry&lt;/a&gt;) in the module above, so I'm going to call out some of the more interesting chunks of code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIBRE_OFFICE_INSTALL_DIR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIBRE_OFFICE_INSTALL_DIR&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;We have a cached copy of LibreOffice, skipping extraction&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As I mentioned before, with how long it takes to decompress and unpack LibreOffice, we're going to want to reuse the efforts of previous runs of the Lambda. Our Lambda alone controls all the space in &lt;code&gt;/tmp&lt;/code&gt; and, as far as I can tell, Lambda executions happen &lt;a href="https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/" rel="noopener noreferrer"&gt;serially by default&lt;/a&gt;, so a simple sanity check that &lt;code&gt;instdir&lt;/code&gt; (the root of the LibreOffice program after unpacking) exists is sufficient.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Missing this line led me down a 20-minute rabbit trail trying to figure out why attempting to unpack the &lt;code&gt;.tar&lt;/code&gt; file contained in &lt;code&gt;buffer&lt;/code&gt; produced no files or folders. Make sure you set the read pointer to the beginning of the buffer if you plan on reading after writing!&lt;/p&gt;
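&lt;p&gt;The effect is easy to reproduce in isolation. In this small sketch (the function name &lt;code&gt;read_after_write&lt;/code&gt; is just for illustration), reading without rewinding starts at the end of the data and returns nothing:&lt;/p&gt;

```python
from io import BytesIO


def read_after_write(rewind):
    """Write to a buffer, optionally rewind, then read from it."""
    buffer = BytesIO()
    buffer.write(b'tar bytes')
    if rewind:
        buffer.seek(0)  # move the position back to the start of the data
    return buffer.read()
```

&lt;p&gt;&lt;code&gt;read_after_write(False)&lt;/code&gt; returns &lt;code&gt;b''&lt;/code&gt;, while &lt;code&gt;read_after_write(True)&lt;/code&gt; returns &lt;code&gt;b'tar bytes'&lt;/code&gt;.&lt;/p&gt;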

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tarfile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileobj&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;tar&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tar&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extractall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/tmp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You'll see here I'm leveraging &lt;code&gt;tarfile&lt;/code&gt;'s &lt;a href="https://docs.python.org/3/library/tarfile.html" rel="noopener noreferrer"&gt;open&lt;/a&gt; function with a &lt;code&gt;fileobj&lt;/code&gt;. Why not write the decompressed &lt;code&gt;.tar&lt;/code&gt; file to the filesystem in &lt;code&gt;/tmp&lt;/code&gt; and then open it? Well, it turns out trying to have both packed and unpacked instances of LibreOffice exceeds even the &lt;code&gt;512MB&lt;/code&gt; limit of &lt;code&gt;/tmp&lt;/code&gt;! If you refer to the source of &lt;code&gt;Shelf&lt;/code&gt;'s &lt;a href="https://github.com/shelfio/aws-lambda-brotli-unpacker/blob/master/src/index.js" rel="noopener noreferrer"&gt;Brotli Unpacker Library&lt;/a&gt;, you'll see that it's piping the decompression result through a tar-extractor (implying it's an in-memory operation), so I assume they were working around the same issue.&lt;/p&gt;

&lt;p&gt;I don't code in Python for my day job too often so I might be missing out on a more pythonic way to express what's essentially the same piping operation, but it certainly gets the job done. As long as you're willing to allocate an appropriate amount of memory for your Lambda this shouldn't be a problem.&lt;/p&gt;
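&lt;p&gt;One possible way to get closer to a true pipe is to hand &lt;code&gt;tarfile&lt;/code&gt; a file-like object that decompresses on demand and open it in stream mode (&lt;code&gt;'r|'&lt;/code&gt;), so the full &lt;code&gt;.tar&lt;/code&gt; never has to sit in memory. The sketch below is an untested-on-Lambda idea rather than what the module above does: &lt;code&gt;DecompressingReader&lt;/code&gt; and &lt;code&gt;extract_stream&lt;/code&gt; are hypothetical names, and the demo uses the standard library's &lt;code&gt;zlib&lt;/code&gt; as a stand-in for &lt;code&gt;brotli&lt;/code&gt; so it runs without extra dependencies.&lt;/p&gt;

```python
import io
import os
import tarfile
import tempfile
import zlib


class DecompressingReader(io.RawIOBase):
    """File-like view over a compressed stream, decompressed chunk by chunk."""

    def __init__(self, source, decompress, chunk_size=64 * 1024):
        super().__init__()
        self._source = source          # binary file object with compressed bytes
        self._decompress = decompress  # callable: compressed bytes to plain bytes
        self._chunk_size = chunk_size
        self._pending = b''

    def readable(self):
        return True

    def readinto(self, target):
        # Refill until there is output to hand over or the source runs dry.
        while not self._pending:
            chunk = self._source.read(self._chunk_size)
            if not chunk:
                return 0  # end of stream
            self._pending = self._decompress(chunk)
        count = min(len(target), len(self._pending))
        target[:count] = self._pending[:count]
        self._pending = self._pending[count:]
        return count


def extract_stream(source, decompress, destination):
    """Extract a compressed tar stream without buffering the whole archive."""
    reader = io.BufferedReader(DecompressingReader(source, decompress))
    with tarfile.open(fileobj=reader, mode='r|') as tar:
        tar.extractall(destination)


def demo_roundtrip():
    """Round-trip a tiny archive, with zlib standing in for brotli."""
    archive = io.BytesIO()
    with tarfile.open(fileobj=archive, mode='w') as tar:
        content = b'hello'
        info = tarfile.TarInfo('instdir/hello.txt')
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    compressed = io.BytesIO(zlib.compress(archive.getvalue()))
    with tempfile.TemporaryDirectory() as destination:
        extract_stream(compressed, zlib.decompressobj().decompress, destination)
        path = os.path.join(destination, 'instdir', 'hello.txt')
        with open(path, 'rb') as extracted:
            return extracted.read()
```

&lt;p&gt;In the Lambda, the equivalent call would presumably pass the opened &lt;code&gt;/opt/lo.tar.br&lt;/code&gt; file and &lt;code&gt;brotli.Decompressor().process&lt;/code&gt; in place of the &lt;code&gt;zlib&lt;/code&gt; pieces. The trade-off is a little more code in exchange for a lower peak memory footprint.&lt;/p&gt;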

&lt;h3&gt;
  
  
  Wrap Up
&lt;/h3&gt;

&lt;p&gt;I &lt;a href="https://chumley.barstoolsports.com/wp-content/uploads/2018/11/26/5aa22d7456db5.image_.jpg" rel="noopener noreferrer"&gt;didn't formally performance test this solution&lt;/a&gt;, but on average, with 512 MB of memory allocated to the Lambda and assuming the Lambda is using a cached copy of LibreOffice, the function converts a document to PDF in about &lt;code&gt;1s&lt;/code&gt; to &lt;code&gt;1.5s&lt;/code&gt;, depending on the document's size.&lt;/p&gt;

&lt;p&gt;Figuring out this approach taught me a lot about the finer points of AWS Lambda, and it ended up being a fun challenge working within the constraints of that ecosystem. &lt;/p&gt;

&lt;p&gt;Finally, this is my first post so it should go without saying (but I'll say it anyway) that if you see a way this explanation can be improved, definitely let me know!&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html" rel="noopener noreferrer"&gt;Using AWS Lambda with Amazon S3&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.libreoffice.org/Common/Starting_the_Software_With_Parameters" rel="noopener noreferrer"&gt;LibreOffice CLI documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/shelfio/aws-lambda-libreoffice" rel="noopener noreferrer"&gt;&lt;code&gt;Shelf&lt;/code&gt;'s Lambda-Based LibreOffice&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/shelfio/libreoffice-lambda-layer" rel="noopener noreferrer"&gt;&lt;code&gt;Shelf&lt;/code&gt;'s LibreOffice Lambda Layer&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/shelfio/aws-lambda-brotli-unpacker" rel="noopener noreferrer"&gt;&lt;code&gt;Shelf&lt;/code&gt;'s Brotli Unpacker Library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://python-hyper.org/projects/brotlipy/en/latest/api.html" rel="noopener noreferrer"&gt;&lt;code&gt;brotli&lt;/code&gt; module documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/juniper/" rel="noopener noreferrer"&gt;&lt;code&gt;juniper&lt;/code&gt; lambda packaging tool&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html#python-package-dependencies" rel="noopener noreferrer"&gt;AWS Lambda deployment procedure for code with dependencies&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>aws</category>
      <category>awslambda</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
