Chris Howard

Posted on Jul 16, 2018

Need help with editing a massive file

#discuss #learning #advice

I have a very large sitemap XML file that I need to alter. I need to remove all nodes that do not contain a reference to ".jpg".

Here's an example:

  <url>
    <loc>https://www.mywebsite.com/page-without-image</loc>
    <lastmod>2018-06-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
      <image:image/>
  </url>
  <url>
    <loc>https://www.mywebsite.com/page-with-image</loc>
    <lastmod>2018-06-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
      <image:image>
        <image:loc>https://www.mywebsite.com/images/image-on-this-page.jpg</image:loc>
        <image:title>image title</image:title>
      </image:image>
  </url>

I need to go through the entire file, using SublimeText or any capable program, and remove all nodes (from <url> to </url>) that do not have .jpg in them.

Any suggestions?

THANKS!

Top comments (6)

Dian Fay • Jul 16 '18

Don't use an editor. Write a script that loads the XML file and walks through the tree looking for and eliminating nodes meeting the criteria before writing the file back out. Python's a good choice especially if you're already familiar with it, or you could use Node or any other lightweight scripting language. It'll still take less time than doing it by hand.
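
A minimal sketch of that approach in Python, using only the standard library. The namespace URIs are an assumption (the usual sitemap and Google image-sitemap ones, since the question does not show the root element), and the function names are made up for illustration. It loads the whole tree into memory, so it only suits files that fit in RAM.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the question does not show the <urlset> root element.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Keep the familiar prefixes when the tree is written back out.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def has_jpg(url_element):
    """Return True if any text inside the <url> element contains '.jpg'."""
    return any(".jpg" in text for text in url_element.itertext())


def remove_urls_without_jpg(root):
    """Remove every direct <url> child of root that never mentions '.jpg'."""
    removed = 0
    # findall() returns a list, so removing children inside the loop is safe.
    for url in root.findall(f"{{{SITEMAP_NS}}}url"):
        if not has_jpg(url):
            root.remove(url)
            removed += 1
    return removed


def filter_sitemap_file(in_path, out_path):
    """Load in_path, drop <url> nodes without '.jpg', write to out_path."""
    tree = ET.parse(in_path)
    removed = remove_urls_without_jpg(tree.getroot())
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    return removed
```

Calling filter_sitemap_file("sitemap.xml", "sitemap-images.xml") (both file names are placeholders) writes the filtered copy to a new file and leaves the original untouched.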

Ryosuke • Jul 16 '18

@hoahchris

Basically this.

I had to do this a lot with XML exports from WordPress. Here's a script I wrote on GitHub that sorts through XML using PHP to get you started. It converts XML to TXT files, but you can see how I parse and loop through the XML nodes.

rhymes • Jul 16 '18

Why don't you use an XPath-based approach to extract the nodes?

Casey Brooks • Jul 16 '18

Pretty much any text editor out there isn't going to do a good job working on really large files. You're better off writing a small script to do the conversion for you. Here's an overview of different ways of parsing an XML document in Java, and in particular you should be looking at the SAX- or StAX-style parsers.

A DOM parser will typically load the entire file contents into memory, which isn't going to work with a large XML document. SAX and StAX parsers instead read the file as a stream, handing you one element or event at a time, so they should be able to handle files of practically any size. The differences between a SAX and StAX parser are subtle, but this page has some useful insights into their differences, and when to use one over the other.
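
For comparison, here is a rough streaming sketch in Python rather than Java. It uses the standard library's ElementTree.iterparse, a pull-style incremental parser, as a stand-in for SAX/StAX. The namespace URIs and the stream_filter name are assumptions, not something from the question.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"    # assumed
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"  # assumed
URL_TAG = f"{{{SITEMAP_NS}}}url"

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def stream_filter(source, sink):
    """Copy <url> elements that mention '.jpg' from source to sink.

    source: a path or binary file object; sink: a binary file object.
    Returns a (kept, dropped) tuple.
    """
    kept = dropped = 0
    sink.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    header = f'<urlset xmlns="{SITEMAP_NS}" xmlns:image="{IMAGE_NS}">\n'
    sink.write(header.encode("utf-8"))
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event is the document root
        if event == "end" and elem.tag == URL_TAG:
            if any(".jpg" in text for text in elem.itertext()):
                elem.tail = "\n"  # one kept <url> per line in the output
                sink.write(ET.tostring(elem, encoding="unicode").encode("utf-8"))
                kept += 1
            else:
                dropped += 1
            root.clear()  # forget processed children so memory stays flat
    sink.write(b"</urlset>\n")
    return kept, dropped
```

Only one url element is held in memory at a time. Each kept element is serialized on its own, so it repeats the xmlns declarations; that is redundant but still well-formed XML.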

Of course, you could use any language for parsing XML; you just need to find the appropriate parser for that language.

But whatever you do, do not try to parse the XML with Regex.

Vlastimil Pospichal • Jul 16 '18

xmllint --xpath "//*[local-name()='image']/*[local-name()='loc' and substring(., string-length(.)-3, 4) = '.jpg']/../.." sitemap.xml
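
The same selection can be sketched in Python with the standard library, in case the matching nodes need further processing. ElementTree's XPath subset has no substring() or string-length(), so the "ends with .jpg" test is done in plain Python. The namespace URIs, the "sm" prefix and the function name are assumptions for illustration; unlike the local-name() expression above, this version depends on those exact namespaces.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs (standard sitemap and Google image extension).
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}


def urls_with_jpg_image(root):
    """Return <url> elements that have an <image:loc> ending in '.jpg'."""
    matches = []
    for url in root.findall("sm:url", NS):
        locs = url.findall("image:image/image:loc", NS)
        # Case-sensitive suffix check, like the XPath comparison.
        if any((loc.text or "").endswith(".jpg") for loc in locs):
            matches.append(url)
    return matches
```

Note the stricter criterion compared with a plain ".jpg anywhere" search: a page whose only image is a .png, or whose page URL merely contains ".jpg", is not selected.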


Chris Howard • Jul 17 '18

Thank you, everyone. You've given me a lot to chew on. I really appreciate the input!