DEV Community

Thoughts about sanitizing this Python RSS-scraping code?

Katie on November 09, 2018

I've banged out a quick-and-dirty Python script to generate HTML that displays the bilingual tech comic "CommitStrip" side-by-side in French and in...
 
rhymes

Salut Katie!

A few observations:

  • I wouldn't worry too much about the images; you trust the source, don't you? All images on CommitStrip seem to have URLs of the form commitstrip.com/wp-content/uploads... so you could add a filter that checks whether they come from that domain.

  • If CommitStrip is ever compromised and someone takes over the website and the RSS feed, there is a (remote) possibility of script injection, but you can tell BeautifulSoup to remove <script> tags with decompose() or extract().

With the <script> tags stripped, you can be sure you never inject a script into your HTML in place of an image, though it's probably not necessary anyway.

I have a general sense that you're not supposed to just "take stuff from strangers and display it in a browser," but I'm not sure how that plays out when it comes to "scraping web sites and redisplaying their contents."

This depends on their content policy.

 
Katie

Thanks -- these are exactly the kinds of tips I was looking for!

 
Jamal Al

Awesome share😀

 
Katie

Thanks! I've kind of fallen in love with CommitStrip as a language tool. They do great work, writing bilingual humor -- tough stuff!