Discussion on: Thoughts about sanitizing this Python RSS-scraping code?

Hi Katie!

A few observations:

I wouldn't worry that much about the images; you trust the source, don't you? All images on CommitStrip seem to have URLs of the form commitstrip.com/wp-content/uploads..., so you could add a filter that checks they come from that domain.
If CommitStrip is ever compromised and someone takes over the website and the RSS feed, there is a (remote) possibility of script injection, BUT you can tell BeautifulSoup to remove <script> tags with decompose or extract.
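To make the second point concrete, here is a minimal sketch of dropping <script> tags with BeautifulSoup. The function name strip_scripts and the sample HTML are made up for illustration, and I'm assuming the built-in html.parser backend. Note that decompose() removes a tag and destroys it, while extract() removes it and hands it back to you; either works here. It only deals with <script> elements, nothing else.

```python
from bs4 import BeautifulSoup


def strip_scripts(html: str) -> str:
    """Return the HTML fragment with every <script> element removed.

    Hypothetical helper: the name and signature are for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script"):
        # decompose() removes the tag (and its contents) from the tree.
        # tag.extract() would also remove it, but return the removed tag.
        tag.decompose()
    return str(soup)


# Made-up sample input, not taken from the real feed.
sample = '<p>Comic</p><script>alert("x")</script><img src="comic.jpg">'
cleaned = strip_scripts(sample)
print(cleaned)
```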

This way you can be sure you never inject a script tag into your HTML instead of an image, though it's probably not necessary anyway.
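As for the domain filter from the first observation, it could be sketched roughly like this using only the standard library. The allowed hosts, the path prefix, the function name is_trusted_image and the sample URLs are all my assumptions based on the URL pattern mentioned above, so adjust them to what the feed really returns.

```python
from urllib.parse import urlparse

# Assumptions: hosts and path prefix are inferred from the URL pattern
# described in the comment, not verified against the live site.
ALLOWED_HOSTS = {"commitstrip.com", "www.commitstrip.com"}
ALLOWED_PATH_PREFIX = "/wp-content/uploads/"


def is_trusted_image(url: str) -> bool:
    """Return True only for http(s) URLs on an allowed host and path."""
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        # hostname is lowercased by urlparse and is None when missing.
        and parsed.hostname in ALLOWED_HOSTS
        and parsed.path.startswith(ALLOWED_PATH_PREFIX)
    )


# Made-up URLs for illustration only.
urls = [
    "https://www.commitstrip.com/wp-content/uploads/made-up-comic.jpg",
    "https://evil.example/wp-content/uploads/made-up-comic.jpg",
    "https://commitstrip.com.evil.example/wp-content/uploads/made-up-comic.jpg",
]
trusted = [u for u in urls if is_trusted_image(u)]
print(trusted)
```

Comparing the parsed hostname, rather than checking whether the string "commitstrip.com" appears somewhere in the URL, is what keeps lookalike hosts such as the third one out.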

I have a general sense that you're not supposed to just "take stuff from strangers and display it in a browser," but I'm not sure how that plays out when it comes to "scraping web sites and redisplaying their contents."

This depends on their content policy.

Katie • Nov 9 '18

Thanks -- these are exactly the kinds of tips I was looking for!