Checking links & ignoring domains

#opensource #python

For the past several weeks, I've been working on Gustavo, a URL-checking command line tool written in Python. I wanted to gain some experience forking repos, setting up remotes, and merging branches, so I got permission to work on a new feature for checkThatLink.

TDDR / checkThatLink

The feature I wanted to add would allow a command line argument, -i or --ignore, to be passed along with a path to a file. The program should open the file, read the contents, and if any URLs are present, omit their domains from the HTTP status checking process.
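Here is a minimal sketch of how such a flag could be declared with argparse. It is not the actual checkThatLink parser, which isn't shown in this post; build_parser, the positional file argument, and the file names are assumptions made for illustration.

```python
import argparse

def build_parser():
    # Hypothetical parser; the real checkThatLink argument setup may differ.
    parser = argparse.ArgumentParser(description="Check the links found in a file")
    parser.add_argument("file", help="path to the file whose links should be checked")
    parser.add_argument(
        "-i", "--ignore",
        metavar="IGNORE_FILE",
        default=None,
        help="path to a file listing URLs whose domains should be skipped",
    )
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["index.html", "--ignore", "ignore-urls.txt"])
print(args.file, args.ignore)
```

When the flag is left out, args.ignore stays None, which gives the rest of the program a simple way to tell whether an ignore file was supplied.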

Before I could add any features, I needed to understand how the program worked. It just so happens that Gustavo and checkThatLink are very similar in purpose and language. I'm by no means an expert in Python, but I felt pretty comfortable reading through the source code. As I walked through the program, I determined that it followed these steps:

1. Create a parsed command line arguments object and pass it into a new checkFile object: checkFile(args)
2. checkFile initializes all of its member variables from the args object
3. checkFile runs its main function, checkThatFile()
4. Each line of the source file is inspected to see if it contains a URL
5. An HTTP connection is made for each URL, requesting the HEAD, and the URL and its response status are appended to a list of dictionaries, allLinks
6. A series of conditional statements determines which function should generate the output from allLinks
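The middle of that flow (scan each line for URLs, look up a status, collect the results) can be sketched roughly as follows. This is my own simplified illustration, not checkThatLink's code: URL_PATTERN, check_lines, and fake_status are hypothetical names, and the stub stands in for the real HEAD request so the example runs offline.

```python
import re

# Assumed URL pattern for the sketch; the project's actual regex may differ.
URL_PATTERN = re.compile(r'https?://[^\s"\'<>]+')

def check_lines(lines, get_status):
    """Collect a {'url', 'status'} dictionary for every URL found in lines.

    get_status is a hypothetical callable that takes a URL and returns an
    HTTP status code, standing in for the HEAD request made by the real tool.
    """
    all_links = []
    for line in lines:
        for url in URL_PATTERN.findall(line):
            all_links.append({"url": url, "status": get_status(url)})
    return all_links

def fake_status(url):
    # Stub so the sketch runs without network access.
    return 404 if "missing" in url else 200

results = check_lines(
    ['<a href="https://example.com/">home</a>', "see https://example.com/missing"],
    fake_status,
)
for link in results:
    print(link["status"], link["url"])
```

Passing the status lookup in as a function keeps the scanning logic separate from the network call, which also makes it easy to skip a URL before any request is made.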

I didn't want the program to waste time checking a link that would later be omitted from the output, so I figured I should insert my check between steps 4 and 5.

I first created a function that would open the provided file of URLs to ignore, use a regular expression to pick out all the valid URLs, and return the ignoreList of domains.

Next, I added a condition before step 5 above, so that each URL's domain would be checked against the domains in the ignoreList.
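A domain comparison like this could look something like the sketch below, which uses urlparse from the standard library. The helper names domain_of and is_ignored and the sample URLs are assumptions for illustration; the condition I actually added to checkThatLink isn't reproduced here.

```python
from urllib.parse import urlparse

def domain_of(url):
    # netloc is the host (plus port, if any) portion of the URL.
    return urlparse(url).netloc.lower()

def is_ignored(url, ignore_list):
    """Return True when url shares a domain with any entry in ignore_list."""
    ignored_domains = {domain_of(entry) for entry in ignore_list}
    return domain_of(url) in ignored_domains

ignore_list = ["https://www.google.com", "http://example.org"]
urls = [
    "https://www.google.com/search?q=python",
    "https://github.com/",
    "http://example.org/about",
]

# Only URLs whose domain is not on the ignore list go on to the status check.
to_check = [url for url in urls if not is_ignored(url, ignore_list)]
print(to_check)
```

Note that this sketch compares hosts exactly, so www.google.com and google.com would count as different domains. Whether that is the right behavior depends on how strict the ignore feature is meant to be.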

At this point, I thought I was finished. I had the author review my work before merging, and I was informed that some updates were needed. The issue requirements stated that if the program receives a file containing domains to ignore, but no comments or URLs are present, then the program should exit. It was a pretty straightforward update to make:

I added another regular expression to find comments.
If both regular expressions (domains and comments) find no matches, the program exits.

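import re
import sys

def getIgnoreList(self, ignoreFile):
  found = []
  try:
    if ignoreFile:
      with open(ignoreFile) as src:
        text = src.read()
        # Collect each URL, dropping any trailing slash or whitespace
        found = re.findall(r'^https?://.*[^\s/]', text, flags=re.MULTILINE)
        # Look for at least one comment line
        comment = re.search(r'^#.*', text, flags=re.MULTILINE)
        # Exit if the file has neither URLs nor comments
        if not (comment or found):
          sys.exit(1)
    return found
  except OSError:
    # Catch only file errors; a bare except would also intercept sys.exit()
    print(f'Error with {ignoreFile}')
    sys.exit(1)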

The above update passed the test and the code was merged upstream!

On the other side of the ball, I had a volunteer to implement the same ignore feature in Gustavo. We worked through the bugs together. I found it pretty easy to make a new branch from the contributor's remote and make some fixes. I hope I wasn't too overbearing in this regard: I didn't just point out the existence of bugs, I made suggestions and provided the code to fix them.

I did learn the hard way that it is a good idea to run git status or git branch to know where you are in your git tree before fetching or pulling. It wasn't too serious: I just got issue-8 and issue-8-fix mixed up. I'm getting more comfortable with git each week, but clearly I need more practice.
