<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: nasircsecu</title>
    <description>The latest articles on DEV Community by nasircsecu (@sapnilcsecu).</description>
    <link>https://dev.to/sapnilcsecu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F369944%2F99a6942e-7aa8-4162-a5b7-3827ae7e4cd4.jpeg</url>
      <title>DEV Community: nasircsecu</title>
      <link>https://dev.to/sapnilcsecu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sapnilcsecu"/>
    <language>en</language>
    <item>
      <title>Step-by-step web application firewall (WAF) development using the multinomial naive Bayes algorithm</title>
      <dc:creator>nasircsecu</dc:creator>
      <pubDate>Wed, 22 Apr 2020 19:53:15 +0000</pubDate>
      <link>https://dev.to/sapnilcsecu/step-by-step-web-application-firewall-waf-development-by-using-multinomial-native-bayes-algorithm-fdd</link>
      <guid>https://dev.to/sapnilcsecu/step-by-step-web-application-firewall-waf-development-by-using-multinomial-native-bayes-algorithm-fdd</guid>
      <description>&lt;p&gt;A web application firewall (WAF) is a firewall that monitors, filters and blocks web parameter as they travel to and from a website or web application. It typically protects web applications from attacks such as cross-site forgery, cross-site-scripting (XSS), file inclusion, and SQL injection, among others.A WAF is differentiated from a regular firewall in that a WAF is able to filter the content of specific web applications while regular firewalls serve as a safety gate between servers.&lt;/p&gt;

&lt;p&gt;Web application firewall development steps using supervised machine learning:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Step-1: Prepare the dataset&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To prepare the dataset, load the training dataset into a pandas dataframe containing two columns, txt_label and txt_text. txt_label contains the attack type and txt_text contains the attack sample.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;trainDF = load_cvs_dataset(input_dataset)
txt_label = trainDF[payload_label]
txt_text = trainDF[payload_col_name]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/classifier/train_model.py"&gt;train_model.py&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_cvs_dataset(dataset_path):

    # Set Random seed
    np.random.seed(500)
    # Add the Data using pandas
    Corpus = pd.read_csv(dataset_path, encoding='latin-1', error_bad_lines=False)

    return Corpus 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/dataset_pre/dataset_load.py"&gt;dataset_load.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Step-2: Text feature engineering&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The next step is feature engineering. In this step, raw text data is transformed into feature vectors, and new features are created from the existing dataset. We implement count vectors as features in order to obtain relevant features from our dataset.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Count vectors as features:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
A count vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell holds the frequency count of a particular term in a particular document.&lt;br&gt;
Clean the text of each document before generating the feature frequency matrix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt; doc=re.sub("\d+"," ",doc)
 result_doc=word_tokenize(doc)
 tagged_sentence = nltk.pos_tag(result_doc)
 edited_sentence = [word for word,tag in tagged_sentence if tag != 'NNP' and tag != 'NNPS' and tag != 'NNS' and tag != 'NN' and tag != 'JJ' and tag != 'JJR' and tag != 'JJS']

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/feature_eng/count_word.py"&gt;count_word_fit.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After cleaning the text of each document, generate the feature frequency matrix for each document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_class_token = {}

    # print(vocabulary)
    class_eachtoken_count = {} 

    for class_label in class_labels: 
        total_class_token[class_label] = 0
        class_eachtoken_count[class_label] = {}
        for voc in vocabulary:
            class_eachtoken_count[class_label] [voc] = 0

    doccount = 0
    total_voca_count = 0
    for doc in doc_list:
        words = word_tokenize(doc);

        class_label = temp_class_labels[doccount]

        for word in words:
            if word in vocabulary:
                class_eachtoken_count[class_label][word] = class_eachtoken_count[class_label][word] + 1 
                total_class_token[class_label] = total_class_token[class_label] + 1
                #print("total_class_token is ",total_class_token)
                total_voca_count = total_voca_count + 1

        doccount = doccount + 1



&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/feature_eng/count_word.py"&gt;count_word_fit.py&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Step-3: Build the trained model&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
The following code segment implements the training step of the multinomial naive Bayes algorithm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def multi_nativebayes_train(model_data):
    #

    class_eachtoken_likelihood = {} 
    vocabulary = model_data.get_vocabulary()
    for class_label in model_data.get_class_labels(): 
        class_eachtoken_likelihood[class_label] = {}
        for voc in vocabulary:
            class_eachtoken_likelihood[class_label] [voc] = 0
    logprior={}
    vocabularyCount = model_data.get_vocabularyCount()
    class_eachtoken_count = model_data.get_class_eachtoken_count()
    for class_label in model_data.get_class_labels(): 


        total_class_token = model_data.get_total_class_token()

        logprior[class_label]=math.log(total_class_token[class_label] / vocabularyCount)

        for word in vocabulary:

            if(class_eachtoken_count[class_label][word]==0):
                class_eachtoken_likelihood[class_label][word]=0

            else:
                class_eachtoken_likelihood[class_label][word]=math.log(class_eachtoken_count[class_label][word] / total_class_token[class_label])
    train_model_data = train_model(logprior,class_eachtoken_likelihood,vocabulary,model_data.get_class_labels())       
    return train_model_data;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/classifier/multinomial_nativebayes.py"&gt;multinomial_nativebayes.py&lt;/a&gt;&lt;/p&gt;
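
&lt;p&gt;To make the training computation concrete, here is a tiny worked example of the same log-prior and log-likelihood formulas on hypothetical token counts (the numbers and class names are illustrative only):&lt;/p&gt;

```python
import math

# Hypothetical per-class token totals for a two-class toy model
total_class_token = {"sqli": 6, "xss": 4}
class_eachtoken_count = {
    "sqli": {"select": 4, "script": 0, "union": 2},
    "xss":  {"select": 0, "script": 3, "union": 1},
}
vocabularyCount = 10  # total tokens seen across all classes

logprior = {}
likelihood = {c: {} for c in total_class_token}
for c, total in total_class_token.items():
    # log prior: the class's share of all tokens
    logprior[c] = math.log(total / vocabularyCount)
    for word, count in class_eachtoken_count[c].items():
        # unseen words keep likelihood 0, as in the trainer above
        likelihood[c][word] = math.log(count / total) if count else 0
```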

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Step-4: Predict on the test dataset&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
After the training process we get the trained model and save it on the web server. Now feed in the list of test data, which contains both normal and abnormal samples, and get the list of prediction results from the trained model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def multi_nativebayes_verna_predict(train_model_data, test_dataset):

    condProbabilityOfTermClass = {}
    final_doc_class_label = {}
    doccount = 0;
    logprior = train_model_data.get_logprior()

    for doc in test_dataset:

        doc=re.sub("\d+", " ", doc)
        final_doc_class_label['doc' + '-' + str(doccount)] = ''
        words = word_tokenize(doc)
        score_Class = 0
        max_score = 0
        final_class_label = ''
        is_norm = 0


        for class_label in train_model_data.get_class_labels(): 
            condProbabilityOfTermClass[class_label] = 0

            logprior_val=logprior[class_label]
            for word in words:
                word=word.lower()
                get_class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()
                vocabulary = train_model_data.get_vocabulary()
                if(word in vocabulary):

                    if(get_class_eachtoken_likelihood[class_label][word]==0):

                        condProbabilityOfTermClass[class_label] = condProbabilityOfTermClass[class_label]+0;
                    else:
                        condProbabilityOfTermClass[class_label] = condProbabilityOfTermClass[class_label] + get_class_eachtoken_likelihood[class_label][word]
                else:

                    condProbabilityOfTermClass[class_label] = condProbabilityOfTermClass[class_label]+0;

            if(condProbabilityOfTermClass[class_label] == 0):

                is_norm = 1  
                continue      
            score_Class = logprior_val + condProbabilityOfTermClass[class_label]
            if(max_score &amp;gt; score_Class):
                max_score = score_Class
                final_class_label = class_label

        if(is_norm == 1):
            final_doc_class_label['doc' + '-' + str(doccount)] = "norm" 
        else:         
            final_doc_class_label['doc' + '-' + str(doccount)] = final_class_label

        doccount = doccount + 1    


    return final_doc_class_label 

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/classifier/multinomial_nativebayes.py"&gt;multinomial_nativebayes.py&lt;/a&gt;&lt;/p&gt;
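
&lt;p&gt;The scoring loop can be condensed into a small self-contained sketch showing how one parameter is scored against each class: a class with no vocabulary evidence marks the document as normal, otherwise the highest-scoring class wins. The trained tables below are hypothetical, illustrative values:&lt;/p&gt;

```python
import math

# Hypothetical trained tables for a two-class toy model
logprior = {"sqli": math.log(0.6), "xss": math.log(0.4)}
likelihood = {
    "sqli": {"select": math.log(0.5), "union": math.log(0.3)},
    "xss":  {"script": math.log(0.7)},
}
vocabulary = {"select", "union", "script"}

def predict(words):
    """Score each class as logprior + sum of word log-likelihoods."""
    best_label, best_score = "", None
    for label, prior in logprior.items():
        cond = sum(likelihood[label].get(w, 0) for w in words if w in vocabulary)
        if cond == 0:
            return "norm"  # a class with no evidence: treat doc as normal
        score = prior + cond
        if best_score is None or score > best_score:
            best_label, best_score = label, score
    return best_label
```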

&lt;p&gt;At the final stage, calculate the accuracy of the algorithm in web parameter filtering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def accuracy_score(testlabelcopy, final_doc_class_label):
    label_count = 0
    wrong_count = 0
    for label in testlabelcopy:
        #print(final_doc_class_label['doc' + '-' + str(label_count)]+' '+str(label_count))
        if label != final_doc_class_label['doc' + '-' + str(label_count)] :
            wrong_count = wrong_count + 1
        label_count = label_count + 1

    accuracy = ((len(testlabelcopy) - wrong_count)*100 )/ len(testlabelcopy)

    return accuracy     

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/classifier/multinomial_nativebayes.py"&gt;multinomial_nativebayes.py&lt;/a&gt;&lt;/p&gt;
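
&lt;p&gt;For example, the accuracy computation can be exercised on hypothetical labels (a compact restatement of the same function; the label values below are illustrative):&lt;/p&gt;

```python
def accuracy_score(testlabelcopy, final_doc_class_label):
    # percentage of documents whose predicted label matches the true label
    wrong_count = 0
    for label_count, label in enumerate(testlabelcopy):
        if label != final_doc_class_label['doc-' + str(label_count)]:
            wrong_count += 1
    return (len(testlabelcopy) - wrong_count) * 100 / len(testlabelcopy)

true_labels = ["sqli", "norm", "xss", "norm"]
predicted = {"doc-0": "sqli", "doc-1": "norm", "doc-2": "norm", "doc-3": "norm"}
print(accuracy_score(true_labels, predicted))  # one of four labels is wrong
```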

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Step-5: Prediction in live text classification&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
In production, this trained model is used in text classification to verify whether a web parameter contains normal data or a vulnerable script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def live_multi_nativebayes_verna_predict(train_model_data, input_doc):

    condProbabilityOfTermClass = {}

    doc=re.sub("\d+", " ", input_doc)
    final_doc_class_label = ''
    words = word_tokenize(doc)
    score_Class = 0
    max_score = 0
    final_class_label = ''
    is_norm = 0

    vocabulary = train_model_data.get_vocabulary() 
    logprior = train_model_data.get_logprior()
    class_label_list=train_model_data.get_class_labels()

    for class_label in class_label_list: 
        condProbabilityOfTermClass[class_label] = 0

        logprior=logprior[class_label]
        for word in words:
            word=word.lower()
            class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()

            if(word in vocabulary):

                if(class_eachtoken_likelihood[class_label][word]==0):

                    condProbabilityOfTermClass[class_label] = condProbabilityOfTermClass[class_label]+0;
                else:
                    condProbabilityOfTermClass[class_label] = condProbabilityOfTermClass[class_label] + class_eachtoken_likelihood[class_label][word]
            else:

                condProbabilityOfTermClass[class_label] = condProbabilityOfTermClass[class_label]+0;


        if(condProbabilityOfTermClass[class_label] == 0):

            is_norm = 1  
            continue      
        score_Class = logprior + condProbabilityOfTermClass[class_label]
        if(max_score &amp;gt; score_Class):
            max_score = score_Class
            final_class_label = class_label

    if(is_norm == 1):
        final_doc_class_label= "norm" 
    else:         
        final_doc_class_label = final_class_label


    return final_doc_class_label

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This code segment is found in &lt;a href="https://github.com/sapnilcsecu/Web-application-firewall-WAF/blob/master/sapnil_machinelearning/classifier/multinomial_nativebayes.py"&gt;multinomial_nativebayes.py&lt;/a&gt;&lt;/p&gt;

</description>
      <category>waf</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
