This is my first time posting anything about programming, ever, so I want to use this opportunity to talk about the ML project I am currently working on (:
This project aims to mine association rules with the Apriori algorithm, linking genetic variants to their pathogenicity so we can help pinpoint the causative variant behind a patient's disease.
The first step was to build the model in Python. We (my team and I) trained the Apriori model on data we acquired from the ClinVar database. The genetic data we collected was not clean, so we wrote a function to clean it, removing entries that were empty, unclear, or useless for our purposes. To do that, we first had to extract the data from the VCF files (Variant Call Format) using the scikit-allel library (by reading its documentation), and then we saved it into a pandas DataFrame.
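The saving step itself was just wrapping the extracted fields in a DataFrame. A minimal sketch of the idea, using made-up stand-in data in place of a real `allel.read_vcf('clinvar.vcf')` call (the file name, positions, and values here are illustrative, not our real data):

```python
import pandas as pd

# Stand-in for the dict of arrays that allel.read_vcf('clinvar.vcf') would
# return; the keys follow scikit-allel's 'variants/<FIELD>' naming, and the
# values below are made up for illustration.
callset = {
    'variants/CHROM': ['1', '1', '2'],
    'variants/POS': [949523, 949696, 278062],
    'variants/CLNSIG': ['Pathogenic', 'Benign', 'Likely_pathogenic'],
}

# Wrap the extracted fields in a pandas DataFrame for cleaning.
Variant = pd.DataFrame(callset)
```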
Then, to clean the data, we excluded the unwanted rows. For example:
Variant = Variant[Variant['variants/CLNSIG'] != 'Benign']
This removes any row whose clinical significance is Benign; benign variants are not harmful, so they don't contribute anything useful to our rules.
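Our cleaning function looked roughly like this (a sketch: the name `clean_variants` and the exact significance labels filtered out are illustrative, not the full list we used):

```python
import pandas as pd

def clean_variants(df):
    """Drop rows that are empty or carry no useful signal for rule mining."""
    df = df.dropna(subset=['variants/CLNSIG'])        # remove rows with empty significance
    df = df[df['variants/CLNSIG'] != 'Benign']        # benign variants don't help the rules
    df = df[df['variants/CLNSIG'] != 'not_provided']  # unclear significance (illustrative label)
    return df

# Toy data standing in for the real ClinVar-derived DataFrame.
raw = pd.DataFrame({
    'variants/CLNSIG': ['Pathogenic', 'Benign', None, 'Likely_pathogenic'],
})
cleaned = clean_variants(raw)
# Only the Pathogenic and Likely_pathogenic rows survive.
```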
To train the model, we also had to transform the DataFrame into a list of lists, first casting every value to a string:
Variant = Variant.astype('str')
listV = Variant.values.tolist()
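On a toy DataFrame (illustrative columns, not our real ones), this transform behaves like so:

```python
import pandas as pd

# Toy DataFrame standing in for the cleaned variant table.
Variant = pd.DataFrame({'GENE': ['BRCA1', 'TP53'], 'POS': [100, 200]})

Variant = Variant.astype('str')   # every cell becomes a string, including the ints
listV = Variant.values.tolist()   # one inner list per row: [['BRCA1', '100'], ['TP53', '200']]
```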
Now we can invoke the apriori function, passing it the cleaned list of variants:
apriori(listV, min_support=0.05, min_confidence=0.6, min_lift=3, min_length=2)
Note that the support, confidence, and lift thresholds are experimental: you have to tune them until you find values that produce meaningful rules. In our field, evaluation is done by consulting domain experts, since it is hard for us to judge the accuracy of the resulting rules ourselves.
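To make those three thresholds concrete, here is a small stdlib-only sketch of how support, confidence, and lift are computed for a single candidate rule A → B over a list of transactions (the function name and the toy gene/label data are illustrative):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    a, b = set(antecedent), set(consequent)
    count_a = sum(1 for t in transactions if a <= set(t))
    count_b = sum(1 for t in transactions if b <= set(t))
    count_ab = sum(1 for t in transactions if (a | b) <= set(t))
    support = count_ab / n                # how often A and B appear together
    confidence = count_ab / count_a       # P(B | A)
    lift = confidence / (count_b / n)     # how much A boosts B over chance
    return support, confidence, lift

# Toy transactions: each inner list plays the role of one variant record.
transactions = [
    ['BRCA1', 'Pathogenic'],
    ['BRCA1', 'Pathogenic'],
    ['BRCA1', 'Likely_benign'],
    ['TP53', 'Pathogenic'],
]
s, c, l = rule_metrics(transactions, ['BRCA1'], ['Pathogenic'])
# Here support = 0.5 and lift < 1, i.e. 'BRCA1' does not boost
# 'Pathogenic' above chance in this toy data, so the rule would be rejected.
```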
Hope this was informative, and thank you for reading.