DEV Community

Discussion on: Thoughts about sanitizing this Python RSS-scraping code?

rhymes

Salut Katie!

A few observations:

  • I wouldn't worry that much about the images; you trust the source, don't you? It seems that all images on CommitStrip are of the form commitstrip.com/wp-content/uploads... so you could add a filter to check that they come from that domain.

  • If CommitStrip is ever compromised and someone takes over the website and the RSS feed, there is a (remote) possibility of script injection, BUT you can tell BeautifulSoup to remove <script> tags with decompose or extract.

This way you can be sure you're never going to inject a script tag into your HTML instead of an image. It's probably not necessary anyway.
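Here is a rough sketch of both ideas together, in case it helps. The names `TRUSTED_HOSTS`, `TRUSTED_PATH_PREFIX`, `is_trusted_image` and `clean_fragment` are made up for illustration, and accepting the `www.` host is my assumption, so adjust them to whatever your feed actually contains. It uses `decompose`, which destroys the tag; `extract` would also work if you wanted to keep the removed tag around.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Assumption: the hosts and path prefix we are willing to show images from.
TRUSTED_HOSTS = {"commitstrip.com", "www.commitstrip.com"}
TRUSTED_PATH_PREFIX = "/wp-content/uploads"


def is_trusted_image(url):
    """Return True only for http(s) URLs on a trusted host and upload path."""
    parts = urlparse(url)
    return (
        parts.scheme in ("http", "https")
        and parts.hostname in TRUSTED_HOSTS
        and parts.path.startswith(TRUSTED_PATH_PREFIX)
    )


def clean_fragment(html):
    """Drop <script> tags and any <img> whose src is not trusted."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so removing tags while looping is safe.
    for script in soup.find_all("script"):
        script.decompose()
    for img in soup.find_all("img"):
        if not is_trusted_image(img.get("src", "")):
            img.decompose()
    return str(soup)
```

Comparing the parsed hostname, rather than checking whether the string contains "commitstrip.com", means a lookalike such as commitstrip.com.evil.example is rejected.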

I have a general sense that you're not supposed to just "take stuff from strangers and display it in a browser," but I'm not sure how that plays out when it comes to "scraping web sites and redisplaying their contents."

This depends on their content policy.

Katie

Thanks -- these are exactly the kinds of tips I was looking for!