<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: brthanmathwoag</title>
    <description>The latest articles on DEV Community by brthanmathwoag (@brthanmathwoag).</description>
    <link>https://dev.to/brthanmathwoag</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F28870%2F520480ec-2930-40e1-a37f-4d5d18062caa.png</url>
      <title>DEV Community: brthanmathwoag</title>
      <link>https://dev.to/brthanmathwoag</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brthanmathwoag"/>
    <language>en</language>
    <item>
      <title>Helping pandoc generate a correct table of contents from HTML input</title>
      <dc:creator>brthanmathwoag</dc:creator>
      <pubDate>Thu, 07 Sep 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/brthanmathwoag/helping-pandoc-generate-a-correct-table-of-contents-from-html-input-27im</link>
      <guid>https://dev.to/brthanmathwoag/helping-pandoc-generate-a-correct-table-of-contents-from-html-input-27im</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;pandoc expects chapter headers to be placed directly inside the &lt;code&gt;body&lt;/code&gt; node. No &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; wrappers allowed.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;pandoc takes the book title from the last &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tag it sees (e.g. the one in the last file on the command line).&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's been a long time since I last had to convert an HTML ebook to EPUB. Last time I did, I couldn't make calibre put the chapters in the correct order &lt;sup&gt;1&lt;/sup&gt; and got so angry that I tried to hand-craft the file with &lt;a href="https://github.com/brthanmathwoag/ebooks/" rel="noopener noreferrer"&gt;bash and a bunch of regexen&lt;/a&gt;. It was certainly an interesting experiment and I've learned much about EPUB internals in the process. I've also learned that while &lt;a href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1733489" rel="noopener noreferrer"&gt;it's OK to parse a limited, known set of HTML with regex&lt;/a&gt;, it's much more convenient to use an actual HTML parser.&lt;/p&gt;

&lt;p&gt;Since then, I have fallen in love with &lt;a href="https://pandoc.org/" rel="noopener noreferrer"&gt;pandoc&lt;/a&gt; and have been using it extensively for various projects &lt;sup&gt;2&lt;/sup&gt;. So when I recently wanted to read the &lt;a href="https://dev.to/brthanmathwoag/f-programming-wikibook-in-epub-and-mobi-formats-temp-slug-7806038"&gt;F# Programming Wikibook&lt;/a&gt; on my Kindle, I knew I would use pandoc for the conversion.&lt;/p&gt;

&lt;p&gt;My enthusiasm somewhat dropped when I examined the resulting file and found out that the generated table of contents consisted of a single entry, named after the last chapter of the book. And this was not just a problem of broken navigation.&lt;/p&gt;

&lt;p&gt;An EPUB file is essentially a zip archive with chapters stored in separate HTML files. Thanks to that, ebook readers can open them one by one, which means quicker load times and a lower memory footprint. Because pandoc didn't know how to split the book into chapters, it put them all into a single file, so my reader had to slurp and format the entire text before displaying anything, grinding to a halt for over a minute each time the book was opened.&lt;/p&gt;

&lt;p&gt;For HTML input, pandoc is supposed to generate the TOC automatically from the &lt;code&gt;&amp;lt;h1&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;h2&amp;gt;&lt;/code&gt;, ... &lt;code&gt;&amp;lt;h6&amp;gt;&lt;/code&gt; markup. After some experimentation, it turned out that &lt;strong&gt;pandoc expects chapter headers to be placed directly inside the &lt;code&gt;body&lt;/code&gt; node&lt;/strong&gt;. While this makes sense for documents written for the sole purpose of being packaged as EPUB, this is rarely the case with HTML pages on the Internet, where you will often find the actual content wrapped in several layers of &lt;code&gt;div&lt;/code&gt;s (or tables, if you are unfortunate enough to roam such dangerous, godforsaken places).&lt;/p&gt;
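
&lt;p&gt;To check your own input files before running pandoc, something along these lines might help. It is only a minimal sketch built on Python's standard &lt;code&gt;html.parser&lt;/code&gt; module: it lists the ancestors of every &lt;code&gt;h1&lt;/code&gt; that is not a direct child of &lt;code&gt;body&lt;/code&gt;, following the behaviour observed above. The names &lt;code&gt;HeadingDepthChecker&lt;/code&gt; and &lt;code&gt;find_nested_h1&lt;/code&gt; are made up for this illustration, and reasonably well-formed HTML is assumed.&lt;/p&gt;

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HeadingDepthChecker(HTMLParser):
    """Hypothetical helper: record h1 headings that are wrapped in other nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags, outermost first
        self.nested_headings = []  # ancestor path of each wrapped h1

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.stack and self.stack[-1] != "body":
            self.nested_headings.append("/".join(self.stack))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass


def find_nested_h1(html_text):
    """Return the ancestor paths of all h1 nodes that are not children of body."""
    checker = HeadingDepthChecker()
    checker.feed(html_text)
    return checker.nested_headings
```

&lt;p&gt;An empty result means every &lt;code&gt;h1&lt;/code&gt; already sits directly under &lt;code&gt;body&lt;/code&gt;; anything else shows which wrappers are in the way.&lt;/p&gt;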

&lt;p&gt;Here's a test case. Say, we have a book titled &lt;em&gt;The Book&lt;/em&gt;, which consists of four chapters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;seq 1 4 | while read idx; do
    cat &amp;gt; "ch$idx.html" &amp;lt;&amp;lt;EOF
&amp;lt;html&amp;gt;
    &amp;lt;head&amp;gt;
        &amp;lt;title&amp;gt;Chapter $idx - The Book&amp;lt;/title&amp;gt;
    &amp;lt;/head&amp;gt;
    &amp;lt;body&amp;gt;
        &amp;lt;div&amp;gt;
            &amp;lt;div&amp;gt;
                &amp;lt;div id="content"&amp;gt;
                    &amp;lt;h1&amp;gt;Chapter $idx&amp;lt;/h1&amp;gt;
                    &amp;lt;p&amp;gt;Lorem ipsum, dolor sit amet.&amp;lt;/p&amp;gt;
                &amp;lt;/div&amp;gt;
            &amp;lt;/div&amp;gt;
        &amp;lt;/div&amp;gt;
    &amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;    
EOF

done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we feed them to pandoc, we get a broken TOC with a title page and a single chapter, spanning all the input files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pandoc -o the_book.epub ch*.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.tznvy.eu%2Fi%2Fpandoc-chapters1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.tznvy.eu%2Fi%2Fpandoc-chapters1.png" title="An incorrectly constructed TOC with a single chapter entry" alt="An incorrectly constructed TOC with a single chapter entry"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To fix the table of contents, we have to help pandoc a little and move the &lt;code&gt;&amp;lt;h1&amp;gt;&lt;/code&gt;s up the tree until they are children of &lt;code&gt;body&lt;/code&gt;. Here's how we can do this with Python and &lt;a href="https://www.crummy.com/software/BeautifulSoup/" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;fix.py:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bs4

filenames = [
    'ch1.html', 'ch2.html', 'ch3.html', 'ch4.html'
]

for filename in filenames:
    with open(filename, 'r') as f:
        soup = bs4.BeautifulSoup(f, 'lxml')

    current = soup.find(id='content')
    while current.name != 'body':
        parent = current.parent
        current.unwrap()
        current = parent

    out_filename = filename.replace('.', '-flat.')
    with open(out_filename, 'w') as f:
        f.write(soup.prettify())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each file specified, the script builds the DOM and finds the node with the actual content, in this case the one with the &lt;code&gt;content&lt;/code&gt; ID (if your div doesn't have an id assigned but has a specific class, you can get it with &lt;code&gt;soup.find(class_='...')&lt;/code&gt; instead). The call to the &lt;code&gt;unwrap&lt;/code&gt; method replaces the node with its children, and we move up the tree to the parent of the removed node. The loop repeats until the &lt;code&gt;body&lt;/code&gt; node is reached. Finally, the DOM is saved to a file with &lt;code&gt;-flat&lt;/code&gt; appended to its name.&lt;br&gt;
&lt;/p&gt;
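
&lt;p&gt;A side note on the output name: &lt;code&gt;filename.replace('.', '-flat.')&lt;/code&gt; rewrites every dot, which is fine for &lt;code&gt;ch1.html&lt;/code&gt; but would mangle a name with more than one dot. If your files are named like that, a variant along these lines may be safer. This is just a sketch; &lt;code&gt;flat_name&lt;/code&gt; is a hypothetical helper, not part of the script above.&lt;/p&gt;

```python
from pathlib import Path


def flat_name(filename):
    """Insert -flat before the final extension only (hypothetical helper)."""
    path = Path(filename)
    return str(path.with_name(path.stem + "-flat" + path.suffix))
```

&lt;p&gt;For the test files it gives the same names as before, so the pandoc invocation does not change.&lt;/p&gt;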

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python fix.py

pandoc \
    -o the_book.epub \
    ch1-flat.html \
    ch2-flat.html \
    ch3-flat.html \
    ch4-flat.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.tznvy.eu%2Fi%2Fpandoc-chapters2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.tznvy.eu%2Fi%2Fpandoc-chapters2.png" title="A TOC with chapters split correctly, but an incorrect book title" alt="A TOC with chapters split correctly, but an incorrect book title"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's better. The chapters were detected and split correctly, but the top two entries, which are supposed to be the title of the book, are incorrectly captioned &lt;em&gt;Chapter 4 - The Book&lt;/em&gt;. You might have noticed that this is the text in the &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; of the last file on the command line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pandoc takes the book title from the contents of the &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; node. When it is invoked with multiple input files and there is more than one &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tag, pandoc uses the last one seen.&lt;/strong&gt; But for ebooks spanning several HTML documents, the &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt;s usually denote the chapter names, and shouldn't affect the title of the book.&lt;/p&gt;

&lt;p&gt;To fix that, we have to dive into the HTML once again &lt;sup&gt;3&lt;/sup&gt; and make sure that there is only one &lt;code&gt;&amp;lt;title&amp;gt;&lt;/code&gt; tag in our input files, and that it is set to the desired book title:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;actual_title = 'The Book'

title_node = soup.find('title')
if filename == filenames[0]:
    title_node.string = actual_title
else:
    title_node.extract()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, the table of contents looks as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pandoc \
    -o the_book.epub \
    ch1-flat.html \
    ch2-flat.html \
    ch3-flat.html \
    ch4-flat.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.tznvy.eu%2Fi%2Fpandoc-chapters3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.tznvy.eu%2Fi%2Fpandoc-chapters3.png" title="A correctly constructed TOC" alt="A correctly constructed TOC"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;ol&gt;
&lt;li&gt;&lt;p&gt;This was supposedly controlled by the &lt;a href="https://manual.calibre-ebook.com/faq.html#how-do-i-convert-a-collection-of-html-files-in-a-specific-order" rel="noopener noreferrer"&gt;breadth-first order toggle in Preferences → Plugins → HTML to ZIP plugin&lt;/a&gt; but setting it on seemed to have no effect at all.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;My thesis being the obvious one, but also static website generation - both this blog and &lt;a href="https://ninjastyles.tznvy.eu" rel="noopener noreferrer"&gt;ninjastyles.tznvy.eu&lt;/a&gt; run on pandoc and some Python magic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This actually sounds like a good use case for a regex. Or &lt;code&gt;$EDITOR&lt;/code&gt;, if there are only a few files. But let's do this in Python, just to be consistent.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://blog.tznvy.eu/2017-09-07-why-does-pandoc-generate-broken-toc-from-html-input" rel="noopener noreferrer"&gt;blog.tznvy.eu&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>pandoc</category>
      <category>python</category>
    </item>
    <item>
      <title>Dealing with dependence on generated source files in a Makefile</title>
      <dc:creator>brthanmathwoag</dc:creator>
      <pubDate>Thu, 27 Jul 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/brthanmathwoag/dealing-with-dependence-on-generated-source-files-in-a-makefile-1b9b</link>
      <guid>https://dev.to/brthanmathwoag/dealing-with-dependence-on-generated-source-files-in-a-makefile-1b9b</guid>
      <description>&lt;p&gt;I think I finally got it right today.&lt;/p&gt;

&lt;p&gt;I have a static website, with pages generated from markdown documents. Some of them are written by hand, but most are generated with a Python script from a flat database in a JSON file. The build process is automated with a Makefile.&lt;/p&gt;

&lt;p&gt;I wanted to trigger the generator only when necessary, that is, when either the JSON file or the script has changed. There are too many markdown docs, however, to hardcode their names in the Makefile. Normally, this is solved by using a wildcard file mask to enumerate the relevant file names in a source directory and then prepending the path of the target directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALL_MDS_IN_OBJ = $(wildcard $(OBJ)/*.md)
ALL_HTML_IN_BIN = $(addprefix $(BIN)/, $(notdir $(ALL_MDS_IN_OBJ:.md=.html)))

default: $(ALL_HTML_IN_BIN)

$(BIN)/%.html: \
    $(OBJ)/%.md \
    | $(BIN)

    ... convert mds to html with pandoc ...

$(OBJ)/%.md: \
    $(SRC)/source_data.json \
    generate-mds-from-json.py \
    | $(OBJ)

    ./generate-mds-from-json.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem is, when &lt;code&gt;make&lt;/code&gt; is run for the first time after cloning the repo or doing &lt;code&gt;make clean&lt;/code&gt;, there are no markdown files in &lt;code&gt;$(OBJ)&lt;/code&gt; yet. The wildcard doesn't match anything, so &lt;code&gt;make&lt;/code&gt; happily announces that nothing needs to be done and exits.&lt;/p&gt;
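
&lt;p&gt;To make the failure mode concrete, here is a rough Python analogue of the two variable assignments. It is a sketch only: &lt;code&gt;html_targets&lt;/code&gt; and its parameters are names invented for this illustration, and it mimics just the file name mapping, not how &lt;code&gt;make&lt;/code&gt; schedules rules.&lt;/p&gt;

```python
from pathlib import Path


def html_targets(obj_dir, bin_dir):
    """Mimic $(wildcard $(OBJ)/*.md) followed by the .md to .html renaming."""
    md_files = sorted(Path(obj_dir).glob("*.md"))
    # One HTML target in bin_dir for every markdown file found in obj_dir.
    return [Path(bin_dir) / (md.stem + ".html") for md in md_files]
```

&lt;p&gt;With an empty &lt;code&gt;obj_dir&lt;/code&gt; the glob finds nothing and the target list is empty, which is the situation &lt;code&gt;make&lt;/code&gt; is in right after a fresh clone.&lt;/p&gt;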

&lt;p&gt;To cope with that, I was running the generator manually before &lt;code&gt;make&lt;/code&gt; each time I changed the json file or the script, which, of course, in the long run turned out to be tedious and error-prone.&lt;/p&gt;

&lt;p&gt;Then I saw &lt;a href="https://stackoverflow.com/questions/991841/makefile-dependency-for-unknown-files-in-known-directory-for-docbook"&gt;this question on stackoverflow&lt;/a&gt; and was finally enlightened.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://stackoverflow.com/questions/991841/makefile-dependency-for-unknown-files-in-known-directory-for-docbook#answer-992029"&gt;ChrisW&lt;/a&gt; explained, the first generator run can be forced by introducing a dependency on a sentinel file, kept separately from the build artifacts, whose sole purpose is to carry the timestamp of the last generator run. The file is touched if the generation was successful, and deleted on &lt;code&gt;make clean&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MDS_SENTINEL = .mds_sentinel

default: \
    $(MDS_SENTINEL) \
    $(ALL_HTML_IN_BIN) \
    ... other deps ...

$(MDS_SENTINEL): \
    $(SRC)/source_data.json \
    generate-mds-from-json.py

    ./generate-mds-from-json.py \
        &amp;amp;&amp;amp; touch $(MDS_SENTINEL)

clean:
    rm -r $(MDS_SENTINEL) $(OBJ)/* $(BIN)/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
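
&lt;p&gt;The decision that &lt;code&gt;make&lt;/code&gt; takes for the sentinel rule can be approximated in a few lines of Python. This is a simplified sketch that compares modification times only; &lt;code&gt;sentinel_is_stale&lt;/code&gt; is a hypothetical name, and all prerequisites are assumed to exist.&lt;/p&gt;

```python
import os


def sentinel_is_stale(sentinel, prerequisites):
    """Return True when the generator should run again: the sentinel is
    missing, or some prerequisite was modified after it was last touched."""
    if not os.path.exists(sentinel):
        return True
    sentinel_mtime = os.path.getmtime(sentinel)
    newest = max((os.path.getmtime(p) for p in prerequisites),
                 default=sentinel_mtime)
    # Stale exactly when the newest prerequisite is newer than the sentinel.
    return max(newest, sentinel_mtime) != sentinel_mtime
```

&lt;p&gt;Right after a clone or &lt;code&gt;make clean&lt;/code&gt; the sentinel does not exist, so the check always says "run the generator", regardless of what is in &lt;code&gt;$(OBJ)&lt;/code&gt;.&lt;/p&gt;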



&lt;p&gt;But this was still not sufficient. As mentioned above, the wildcard function would still be evaluated before running the generator, resulting in an empty file list and an early exit. It would be necessary to invoke &lt;code&gt;make&lt;/code&gt; twice: the first time to generate the intermediate files, the second time to process them further. And compared to &lt;code&gt;./generate-mds-from-json.py &amp;amp;&amp;amp; make&lt;/code&gt;, having to do &lt;code&gt;make &amp;amp;&amp;amp; make&lt;/code&gt; was not that huge a win.&lt;/p&gt;

&lt;p&gt;Luckily, this problem was solved in the same thread by &lt;a href="https://stackoverflow.com/questions/991841/makefile-dependency-for-unknown-files-in-known-directory-for-docbook#answer-994248"&gt;Paul Roub&lt;/a&gt;, who suggested running &lt;code&gt;make&lt;/code&gt; recursively from the recipe. This &lt;em&gt;inner&lt;/em&gt; make would have wildcards expanded after all files are generated and process the files correctly.&lt;/p&gt;

&lt;p&gt;So the final solution looked something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MDS_SENTINEL = .mds_sentinel

ALL_MDS_IN_OBJ = $(wildcard $(OBJ)/*.md)
ALL_HTML_IN_BIN = $(addprefix $(BIN)/, $(notdir $(ALL_MDS_IN_OBJ:.md=.html)))

default: \
    $(MDS_SENTINEL) \
    $(ALL_HTML_IN_BIN) \
    ... other deps ...

$(MDS_SENTINEL): \
    $(SRC)/source_data.json \
    generate-mds-from-json.py

    ./generate-mds-from-json.py \
        &amp;amp;&amp;amp; touch $(MDS_SENTINEL) \
        &amp;amp;&amp;amp; $(MAKE) mds

mds: $(ALL_HTML_IN_BIN)

clean:
    rm -r $(MDS_SENTINEL) $(OBJ)/* $(BIN)/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://blog.tznvy.eu/2017-07-27-dealing-with-dependence-on-generated-source-files-in-a-makefile"&gt;blog.tznvy.eu&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>makefile</category>
    </item>
    <item>
      <title>Hands-on Powershell: Pruning an exported VCF contact list</title>
      <dc:creator>brthanmathwoag</dc:creator>
      <pubDate>Fri, 09 Jun 2017 00:00:00 +0000</pubDate>
      <link>https://dev.to/brthanmathwoag/hands-on-powershell-pruning-an-exported-vcf-contact-list-5if</link>
      <guid>https://dev.to/brthanmathwoag/hands-on-powershell-pruning-an-exported-vcf-contact-list-5if</guid>
      <description>&lt;p&gt;Recently, when hiking in the mountains, I got caught in a raincloud and got completely soaked. When I eventually arrived at the alpine hut, I found out the power switch in my phone no longer worked. It would power on just fine, but once the screen was turned off, it was impossible to turn it on again. Luckily, this was just enough to back up all my data while I was looking for a new phone.&lt;/p&gt;

&lt;p&gt;When the new phone arrived in the mail and I was about to import the contact list, I remembered I've always wanted to clean it up a bit. See, back when I had my previous phone set up, I logged in to my Google account, which caused the whole Gmail address book to download. Thanks, Google, I guess, but I don't even do email on my phone. Some emails were appended to existing contacts, but quite a few did not match, resulting in duplicated entries for some people, and junk entries for individuals and companies that I messaged only once 10 years ago or so.&lt;/p&gt;

&lt;p&gt;So I thought, this would be a good opportunity to get rid of them.&lt;/p&gt;

&lt;p&gt;I looked at the exported contact list and it luckily turned out to be just a flat text file with concatenated &lt;a href="https://en.wikipedia.org/wiki/VCard"&gt;vCards&lt;/a&gt;. This is what it looked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN:VCARD
VERSION:2.1
N:;Lastname;Firstname;;
FN:Nickname
TEL;CELL;PREF:123456789
END:VCARD
BEGIN:VCARD
VERSION:2.1
N:Surname;Firstname;;;
FN:Firstname Surname
EMAIL;PREF:email@test.com
PHOTO;ENCODING=BASE64;JPEG:/9j/4AAQSkZJRgABAQAAAQABAAD/2wCEAAUDBAoJCAsJCQk
 ...
 LS5apuBYSw02xdyPEsRPJxMByx33iEoQZdohkFlugEIHljtYiFfNHtnp/9k=

END:VCARD
BEGIN:VCARD
VERSION:2.1
N:;Nickname1;;;
FN:Nickname1
TEL;CELL;PREF:456789123
END:VCARD
BEGIN:VCARD
VERSION:2.1
N:;Nickname2;;;
FN:Nickname2
TEL;CELL;PREF:789123456
END:VCARD

etc...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what I needed to do was to split the text into separate vCard records, then filter out those which don't contain a line beginning with &lt;code&gt;TEL;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Powershell, this could be achieved with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$in = 'unfiltered.vcf'
$out = 'filtered.vcf'

Get-Content $in `
    | Out-String `
    | Select-String -pattern '(?s)BEGIN:VCARD.*?END:VCARD' -AllMatches `
    | % { $_.Matches} `
    | % { $_.Value } `
    | ? { $_.Contains("`nTEL;") } `
    | Out-File -Encoding ascii $out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Pardon the broken highlighting: there is currently no powershell highlighter in pandoc. The upcoming release will have &lt;a href="https://github.com/jgm/pandoc/issues/3334"&gt;loadable language definitions&lt;/a&gt;, though, so I will be able to use &lt;a href="https://github.com/m0t0k1/kate-powershell"&gt;m0t0k1's kate-powershell&lt;/a&gt; then.)&lt;/p&gt;

&lt;p&gt;Now, there are some interesting parts here. Because &lt;code&gt;Select-String&lt;/code&gt; cannot match across multiple elements on the pipeline, we have to use &lt;code&gt;Out-String&lt;/code&gt; on the &lt;code&gt;Get-Content&lt;/code&gt; output so that the whole file is processed as one string, rather than line by line. &lt;code&gt;(?s)&lt;/code&gt; makes &lt;code&gt;.&lt;/code&gt; match newlines as well, so a pattern can span multiple lines, just like the &lt;a href="https://perldoc.perl.org/perlre.html#Modifiers"&gt;&lt;code&gt;/s&lt;/code&gt; modifier in traditional perl regexp&lt;/a&gt;. Also, because we are processing one huge string, &lt;code&gt;-AllMatches&lt;/code&gt; has to be set so that Powershell doesn't stop after finding the first match, similar to &lt;code&gt;/g&lt;/code&gt; in perl. Finally, we need to set the encoding on &lt;code&gt;Out-File&lt;/code&gt; explicitly. Otherwise, the file would be saved in UTF-16LE with a BOM, which, at least on my phone, caused a generic &lt;code&gt;Could not import contacts&lt;/code&gt; error.&lt;/p&gt;
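
&lt;p&gt;For comparison, the same idea can be sketched in Python, where the pieces map over fairly directly: the inline &lt;code&gt;(?s)&lt;/code&gt; flag behaves the same way, and &lt;code&gt;findall&lt;/code&gt; plays the role of &lt;code&gt;-AllMatches&lt;/code&gt;. The names &lt;code&gt;VCARD_PATTERN&lt;/code&gt; and &lt;code&gt;cards_with_phone&lt;/code&gt; are made up for this illustration; it is not a full vCard parser.&lt;/p&gt;

```python
import re

# (?s) lets "." match newlines; ".*?" is non-greedy, so each match stops
# at the first END:VCARD instead of swallowing the rest of the file.
VCARD_PATTERN = re.compile(r"(?s)BEGIN:VCARD.*?END:VCARD")


def cards_with_phone(text):
    """Return the vCard records that contain a line starting with TEL;"""
    cards = VCARD_PATTERN.findall(text)  # every match, not just the first
    return [card for card in cards if "\nTEL;" in card]
```

&lt;p&gt;As in the Powershell version, the whole file has to be read into one string first, e.g. with &lt;code&gt;open(...).read()&lt;/code&gt;.&lt;/p&gt;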

&lt;p&gt;Now, the above code ran in under a second on my machine for a 200-some KB file, but if your contact list contains many base64-encoded photos, slurping the whole file at once and applying a regex to it might not be the best idea. Instead, we could read the file line by line and buffer the lines up manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$in = 'unfiltered.vcf'
$out = 'filtered.vcf'

$curr = @()
$has_tel = $false
Get-Content $in | % {
    $curr += $_
    if ($_.StartsWith("TEL;")) {
        $has_tel = $true
    }
    if ($_ -eq 'END:VCARD') {
        if ($has_tel) {
            $curr
        }
        $curr = @()
        $has_tel = $false
    }
} | Out-File -Encoding ascii $out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Less clever, longer, but probably scaling better.&lt;/p&gt;
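
&lt;p&gt;The same buffering approach translates to a Python generator almost line for line. Again, this is only a sketch under the same assumptions as the script above (every card ends with a bare &lt;code&gt;END:VCARD&lt;/code&gt; line); &lt;code&gt;filter_cards&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
def filter_cards(lines):
    """Yield the lines of every vCard that has a TEL; line, one card at a time."""
    current = []     # lines of the card being read
    has_tel = False  # whether the current card has a phone number
    for line in lines:
        line = line.rstrip("\r\n")
        current.append(line)
        if line.startswith("TEL;"):
            has_tel = True
        if line == "END:VCARD":
            if has_tel:
                yield from current
            # Start over for the next card.
            current = []
            has_tel = False
```

&lt;p&gt;Because it accepts any iterable of lines, an open file object can be passed in directly, so only one card is held in memory at a time.&lt;/p&gt;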

&lt;p&gt;&lt;em&gt;This post was originally published on &lt;a href="https://blog.tznvy.eu/2017-06-09-pruning-exported-vcf-contact-list"&gt;blog.tznvy.eu&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>powershell</category>
    </item>
  </channel>
</rss>
