DEV Community

Jimmy Yeung

Check if a file is a subset of another file using bash script

Scenario

I need to check in a CI pipeline whether one file is a subset of another. I chose a bash script because it is fast and doesn't require installing extra dependencies in the pipeline.

  1. diff
    The first command that came to mind was diff, a powerful command that reports the differences between two files.

    However, it's more powerful than I need. diff works out which lines must be changed to make the two files identical, which is unnecessary for my use case.

    E.g. (example from GeeksforGeeks)

    $ cat a.txt
    Gujarat
    Uttar Pradesh
    Kolkata
    Bihar
    Jammu and Kashmir
    
    $ cat b.txt
    Tamil Nadu
    Gujarat
    Andhra Pradesh
    Bihar
    Uttar pradesh
    
    $ diff a.txt b.txt
    0a1
    > Tamil Nadu
    2,3c3
    < Uttar Pradesh
    < Kolkata
    ---
    > Andhra Pradesh
    5c5
    < Jammu and Kashmir
    ---
    > Uttar pradesh
    
  2. comm
    Without digging further into diff, I found another command, comm, which is simpler and fits my use case.

    comm prints three columns:

    • the first column contains lines present only in the 1st file
    • the second column contains lines present only in the 2nd file
    • the third column contains lines common to both files

    E.g. (example from GeeksforGeeks)

    // displaying contents of file1 //
    $ cat file1.txt
    Apaar
    Ayush Rajput
    Deepak
    Hemant
    
    // displaying contents of file2 //
    $ cat file2.txt
    Apaar
    Hemant
    Lucky
    Pranjal Thakral
    
    $ comm file1.txt file2.txt
                    Apaar
    Ayush Rajput
    Deepak
                    Hemant
            Lucky
            Pranjal Thakral
    

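    To check whether one file is a subset of another, we only need the first column: the lines that appear in the first file but not in the second. The -23 option suppresses the second and third columns, so if the command below prints nothing, file1.txt is a subset of file2.txt: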

    comm -23 file1.txt file2.txt
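
To make the effect of -23 concrete, here is a minimal sketch that recreates the two sample files above and runs the command against them. The temporary directory and the variable name `only_in_file1` are assumptions made for illustration, not part of the original example.

```shell
#!/bin/bash
# Recreate the two sample files in a temporary directory (illustrative location).
tmpdir=$(mktemp -d)
printf '%s\n' 'Apaar' 'Ayush Rajput' 'Deepak' 'Hemant' > "$tmpdir/file1.txt"
printf '%s\n' 'Apaar' 'Hemant' 'Lucky' 'Pranjal Thakral' > "$tmpdir/file2.txt"

# -2 and -3 suppress columns 2 and 3, leaving only the lines unique to file1.txt.
only_in_file1=$(comm -23 "$tmpdir/file1.txt" "$tmpdir/file2.txt")
printf '%s\n' "$only_in_file1"
# prints:
# Ayush Rajput
# Deepak

# Empty output would mean every line of file1.txt also appears in file2.txt.
if [[ -z "$only_in_file1" ]]; then
  echo "file1.txt is a subset of file2.txt"
else
  echo "file1.txt is NOT a subset of file2.txt"
fi

rm -rf "$tmpdir"
```

Both sample files happen to be sorted already, which is why comm can be run on them directly here.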
    

Conclusion

In the end, I settled on this simple bash script to check the subset condition:

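#!/bin/bash
SUBSET="<subset_file_path>"
SUPERSET="<superset_file_path>"
# comm expects sorted input, so sort and deduplicate both files first.
# -23 keeps only the lines unique to the first file; head -1 keeps the first of them.
CHECK=$(comm -23 <(sort "$SUBSET" | uniq) <(sort "$SUPERSET" | uniq) | head -1)

# Any output means at least one line of SUBSET is missing from SUPERSET.
if [[ -n "$CHECK" ]]; then
  echo "Detected a line in $SUBSET that is not in $SUPERSET."
  echo "$CHECK"
  exit 1
fi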

I added the extra sort and uniq commands because comm expects its inputs to be sorted; together they make sure we're comparing two sorted and deduplicated files.
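
The same check could also be packaged as a reusable function, which may be handy when a pipeline compares several pairs of files. The following is only a sketch: the function name `is_subset` is hypothetical, and forcing `LC_ALL=C` is an addition of mine (not in the script above), on the assumption that sort and comm should share one byte-wise ordering regardless of the CI machine's locale.

```shell
#!/bin/bash
# is_subset SUBSET_FILE SUPERSET_FILE   (hypothetical helper name)
# Prints the lines found only in SUBSET_FILE and returns 1 if there are any;
# prints nothing and returns 0 when SUBSET_FILE is a subset of SUPERSET_FILE.
is_subset() {
  local extra
  # LC_ALL=C (an assumption, see above) keeps sort and comm on the same collation.
  extra=$(LC_ALL=C comm -23 <(LC_ALL=C sort "$1" | uniq) <(LC_ALL=C sort "$2" | uniq))
  if [[ -n "$extra" ]]; then
    printf '%s\n' "$extra"
    return 1
  fi
  return 0
}

# Usage with two throwaway files (illustrative contents).
demo_dir=$(mktemp -d)
printf '%s\n' b a a > "$demo_dir/small.txt"
printf '%s\n' c b a > "$demo_dir/big.txt"

if is_subset "$demo_dir/small.txt" "$demo_dir/big.txt"; then
  echo "small.txt is a subset of big.txt"
fi
```

Unlike the script above, this variant reports every missing line instead of only the first one, since it omits head -1.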
