DEV Community

Chris Howard


Need help with editing a massive file

I have a very large sitemap XML file that I need to alter. I need to remove all nodes that do not contain a reference to ".jpg".

Here's an example:

  <url>
    <loc>https://www.mywebsite.com/page-without-image</loc>
    <lastmod>2018-06-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
      <image:image>
  </url>
  <url>
    <loc>https://www.mywebsite.com/page-with-image</loc>
    <lastmod>2018-06-05</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.9</priority>
      <image:image>
        <image:loc>https://www.mywebsite.com/images/image-on-this-page.jpg</image:loc>
        <image:title>image title</image:title>
      </image:image>
  </url>

I need to go through the entire file, using Sublime Text or any capable program, and remove all nodes (from <url> to </url>) that do not have .jpg in them.

Any suggestions?

THANKS!

Top comments (6)

Dian Fay

Don't use an editor. Write a script that loads the XML file and walks through the tree looking for and eliminating nodes meeting the criteria before writing the file back out. Python's a good choice especially if you're already familiar with it, or you could use Node or any other lightweight scripting language. It'll still take less time than doing it by hand.
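A minimal sketch of that approach in Python, using only the standard library. It assumes the file's root is a `<urlset>` element with the standard sitemap and image namespaces (the post does not show the root, so both URIs are assumptions), and the function names are hypothetical. The match on ".jpg" is case-sensitive.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the standard sitemap and Google image-sitemap URIs.
# The original post does not show the root element, so adjust if yours differ.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Keep the familiar prefixes when the tree is written back out.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def has_jpg(url_elem):
    """Return True if the element or any descendant has text containing '.jpg'."""
    return any(".jpg" in (node.text or "") for node in url_elem.iter())


def strip_urls_without_jpg(root):
    """Remove every <url> child of root that has no '.jpg' reference.

    Returns the number of <url> elements removed.
    """
    removed = 0
    # Iterate over a copy so removing children does not disturb the loop.
    for url_elem in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        if not has_jpg(url_elem):
            root.remove(url_elem)
            removed += 1
    return removed


def filter_sitemap(src_path, dst_path):
    """Load src_path, drop <url> nodes without '.jpg', and write dst_path."""
    tree = ET.parse(src_path)
    removed = strip_urls_without_jpg(tree.getroot())
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)
    return removed
```

Calling `filter_sitemap("sitemap.xml", "sitemap-images.xml")` (both file names are placeholders) would write a filtered copy and leave the original untouched. This version loads the whole tree into memory, so it only suits files that fit comfortably in RAM.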

Ryosuke

@hoahchris

Basically this.

I had to do this a lot with XML exports from WordPress. Here's a script I wrote on GitHub that sorts through XML using PHP to get you started. It converts XML to TXT files, but you can see how I parse and loop through the XML nodes.

rhymes

Why don't you use an XPath-based tool to extract the nodes?

Casey Brooks

Pretty much any text editor out there isn't going to do a good job working on really large files. You're better off writing a small script to do the conversion for you. Here's an overview of different ways of parsing an XML document in Java, and in particular you should be looking at the SAX- or StAX-style parsers.

A DOM parser will typically load the entire file contents into memory, which isn't going to work with a very large XML document. SAX and StAX parsers instead process the file as a stream, one parsing event at a time, so they should be able to handle very large files without holding the whole document in memory. The differences between a SAX and a StAX parser are subtle (SAX pushes events to your callbacks, while StAX lets your code pull the next event), but this page has some useful insights into their differences, and when to use one over the other.
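For comparison, here is a rough streaming sketch in Python rather than Java. It uses the standard library's `xml.etree.ElementTree.iterparse`, which is not SAX or StAX itself but fills a similar role: it yields elements as they are parsed, so each `<url>` can be checked and then discarded. The namespace URIs and the function name `stream_filter` are assumptions, since the post does not show the root element.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the standard sitemap and Google image-sitemap URIs.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
URL_TAG = f"{{{SITEMAP_NS}}}url"

# Keep the familiar prefixes when elements are serialized.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def stream_filter(src, dst):
    """Copy a sitemap from src to dst, keeping only <url> nodes that mention '.jpg'.

    src is a file path or a binary file object; dst is a binary file object.
    Returns a (kept, dropped) tuple of <url> counts.
    """
    kept = dropped = 0
    dst.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
    dst.write(f'<urlset xmlns="{SITEMAP_NS}" xmlns:image="{IMAGE_NS}">\n'.encode("utf-8"))

    context = ET.iterparse(src, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element

    for event, elem in context:
        if event == "end" and elem.tag == URL_TAG:
            if any(".jpg" in (node.text or "") for node in elem.iter()):
                # Each serialized <url> repeats the xmlns declarations:
                # valid XML, but more verbose than the input.
                dst.write(ET.tostring(elem, encoding="unicode").encode("utf-8"))
                kept += 1
            else:
                dropped += 1
            root.clear()  # drop processed children so memory use stays flat

    dst.write(b"</urlset>\n")
    return kept, dropped
```

With real files this would be called with something like `open("sitemap-images.xml", "wb")` as `dst` (a placeholder name). The key design choice is `root.clear()` after each `<url>`: without it, the parser would still accumulate the whole tree in memory, which defeats the purpose of streaming.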

Of course, you could use any language for parsing XML, you just need to find the appropriate parser for that language.

But whatever you do, do not try to parse the XML with Regex.

Vlastimil Pospichal
xmllint --xpath "//*[local-name()='image']/*[local-name()='loc' and substring(., string-length(.)-3, 4) = '.jpg']/../.." sitemap.xml
 
Chris Howard

Thank you, everyone. You've given me a lot to chew on. I really appreciate the input!