<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dave Pearson</title>
    <description>The latest articles on DEV Community by Dave Pearson (@davep).</description>
    <link>https://dev.to/davep</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F117670%2F08e50170-5cb2-4441-aba2-e0e4bc8dbf08.jpeg</url>
      <title>DEV Community: Dave Pearson</title>
      <link>https://dev.to/davep</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/davep"/>
    <language>en</language>
    <item>
      <title>A little speed issue with openpyxl</title>
      <dc:creator>Dave Pearson</dc:creator>
      <pubDate>Tue, 27 Nov 2018 16:57:46 +0000</pubDate>
      <link>https://dev.to/davep/a-little-speed-issue-with-openpyxl-3li7</link>
      <guid>https://dev.to/davep/a-little-speed-issue-with-openpyxl-3li7</guid>
      <description>&lt;p&gt;&lt;em&gt;This is taken from a blog post &lt;a href="https://blog.davep.org/2018/06/02/a_little_speed_issue_with_openpyxl.html"&gt;I wrote a little earlier this year&lt;/a&gt;. I'm reposting a version of it here to better get a feel for this site.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Recently I've needed to write a library of code for loading data from Excel Workbooks. Given that the vast majority of coding I do at the moment is in Python, it made sense to make use of &lt;a href="https://openpyxl.readthedocs.io/"&gt;openpyxl&lt;/a&gt;. The initial prototype code I wrote worked well and it soon grew into a full-blown library that'll be used in a couple of work-related projects.&lt;/p&gt;

&lt;p&gt;But one thing kept niggling me... It just wasn't as fast as I'd expected. The workbooks I'm pulling data from aren't that large, and yet it was taking a noticeable number of seconds to read in the data, and when I let the code have a go at a directory full of such workbooks... even the fan on the machine would ramp up.&lt;/p&gt;

&lt;p&gt;It didn't seem right.&lt;/p&gt;

&lt;p&gt;I did a little bit of profiling and could see that the code was spending most of its time deep in the guts of some XML-parsing functions. While I know that an &lt;code&gt;xlsx&lt;/code&gt; file is pretty much an XML document, it seemed odd to me that it would take so much time and effort to pull the data out from it.&lt;/p&gt;

&lt;p&gt;Given that I had other code to be writing, and given that the workbook-parsing code was "good enough" for the moment, I moved on for a short while.&lt;/p&gt;

&lt;p&gt;But, a couple of weeks back, I had a bit of spare time and decided to revisit it. I did some more searching on openpyxl and speed issues and almost everything I found said that the common problem was failing to open the workbook in &lt;code&gt;read_only&lt;/code&gt; mode. That can't have been my problem because I'd being doing that from the very start.&lt;/p&gt;

&lt;p&gt;Eventually I came across a post somewhere (sorry, I've lost it for now -- I'll try and track it down again) that suggested that openpyxl was very slow to read from a workbook if you were reading one cell at a time, rather than using generators. The suggestion being that every time you pull a value form a cell, it has to parse the whole sheet up to that cell. Generators, on the other hand, would allow access to all the cells during one parse.&lt;/p&gt;

&lt;p&gt;This seemed a little unlikely to me -- I'd have expected the code to cache the parsing results or something like that -- but it also would explain what I was seeing. So I decided to give it a test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/davep/openpyxl-speed-issue"&gt;&lt;code&gt;openpyxl-speed-issue&lt;/code&gt;&lt;/a&gt; is a version of the tests I wrote and ran and they absolutely show that there's a huge difference between cell-by-cell access vs generator access.&lt;/p&gt;

&lt;p&gt;Code like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_row&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_column&lt;/span&gt; &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sheet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="p"&gt;][&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;is &lt;em&gt;far slower&lt;/em&gt; than something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;wb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="s"&gt;"Test Sheet"&lt;/span&gt; &lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Here's an example of the difference in time, as seen on my iMac:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;make &lt;span class="nb"&gt;test
&lt;/span&gt;pipenv run &lt;span class="nb"&gt;time&lt;/span&gt; ./read-using-generators
        1.59 real         0.44 user         0.04 sys
pipenv run &lt;span class="nb"&gt;time&lt;/span&gt; ./read-using-peeking
       25.02 real        24.88 user         0.10 sys
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As you can see, the cell-by-cell approach is about 16 times slower than the generator approach.&lt;/p&gt;

&lt;p&gt;In most circumstances the generator approach would make most sense anyway, and in any other situation I probably would have used it and never have noticed this. However, the nature of the workbooks I need to pull data from means I need to "peek ahead" to make decisions about what I'm doing, so a more traditional loop over, with an index, made more sense.&lt;/p&gt;

&lt;p&gt;I can easily "fix" this by using the generator approach to build up a two-dimensional array of cells, acquired via the generator; so I can still do what I want &lt;em&gt;and&lt;/em&gt; benefit from using generators.&lt;/p&gt;

&lt;p&gt;In conclusion: given that I found it difficult to find information about my speed issue, and given that the one off-hand comment I saw that suggested it was this wasn't exactly easy to find, I thought I'd write it all down too and &lt;a href="https://github.com/davep/openpyxl-speed-issue"&gt;create a repository of some test code to illustrate the issue&lt;/a&gt;. Hopefully someone else will benefit from this in the future.&lt;/p&gt;

</description>
      <category>python</category>
      <category>openpyxl</category>
    </item>
  </channel>
</rss>
