A web application firewall (WAF) is a firewall that monitors, filters, and blocks web parameters as they travel to and from a website or web application. It typically protects web applications from attacks such as cross-site request forgery (CSRF), cross-site scripting (XSS), file inclusion, and SQL injection, among others. A WAF is differentiated from a regular firewall in that a WAF is able to filter the content of specific web applications, while regular firewalls serve as a safety gate between servers.
Steps for developing a web application firewall using supervised machine learning:
*Step-1: Prepare the dataset*
To prepare the dataset, load the training dataset into a pandas dataframe containing two columns, txt_label and txt_text, where txt_label contains the attack type and txt_text contains the attack sample:
```python
trainDF = load_cvs_dataset(input_dataset)
txt_label = trainDF[payload_label]
txt_text = trainDF[payload_col_name]
```
This code segment is found in train_model.py.
```python
import numpy as np
import pandas as pd

def load_cvs_dataset(dataset_path):
    # Set the random seed so runs are reproducible
    np.random.seed(500)
    # Load the data using pandas; skip malformed lines
    Corpus = pd.read_csv(dataset_path, encoding='latin-1', error_bad_lines=False)
    return Corpus
```
This code segment is found in dataset_load.py.
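As a usage sketch, loading might look like this; the file path and the column names stored in payload_label and payload_col_name are assumptions, since they depend on the headers of your CSV:

```python
from dataset_load import load_cvs_dataset

# Hypothetical column headers; adjust to match your dataset
payload_label = 'Label'       # e.g. "sqli", "xss", "norm"
payload_col_name = 'Payload'  # the raw web parameter text

trainDF = load_cvs_dataset('dataset/train.csv')  # hypothetical path
txt_label = trainDF[payload_label]
txt_text = trainDF[payload_col_name]
print(trainDF.head())
```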
*Step-2: Text Feature Engineering*
The next step is feature engineering. In this step, the raw text data is transformed into feature vectors, and new features are created from the existing dataset. We implement Count Vectors to obtain relevant features from our dataset.
*Count Vectors as features:*
Count Vector is a matrix notation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.
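For a quick illustration of the idea (using scikit-learn's CountVectorizer here purely for demonstration; the post's own counting code in count_word_fit.py follows below):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["select id from users",
        "alert document cookie",
        "select name from users"]
counts = CountVectorizer().fit(docs)
matrix = counts.transform(docs)
print(counts.get_feature_names_out())
# ['alert' 'cookie' 'document' 'from' 'id' 'name' 'select' 'users']
print(matrix.toarray())  # rows = documents, columns = terms, cells = counts
```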
Before generating the feature frequency matrix, clean the text of each document:
```python
import re
import nltk
from nltk.tokenize import word_tokenize

# Strip digits, tokenize, and drop nouns and adjectives by POS tag
doc = re.sub(r"\d+", " ", doc)
result_doc = word_tokenize(doc)
tagged_sentence = nltk.pos_tag(result_doc)
edited_sentence = [word for word, tag in tagged_sentence
                   if tag not in ('NNP', 'NNPS', 'NNS', 'NN', 'JJ', 'JJR', 'JJS')]
```
This code segment is found in count_word_fit.py.
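To see what this cleaning does, you can run it on a sample payload. This sketch assumes the NLTK punkt and averaged_perceptron_tagger resources are installed; the exact tokens kept depend on the POS tagger:

```python
import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

doc = "select 123 from users where id = 7"
doc = re.sub(r"\d+", " ", doc)             # digits removed
tagged = nltk.pos_tag(word_tokenize(doc))  # [(token, POS tag), ...]
kept = [w for w, t in tagged
        if t not in ('NNP', 'NNPS', 'NNS', 'NN', 'JJ', 'JJR', 'JJS')]
print(kept)
```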
After cleaning the text, generate the per-class frequency matrix of features over all documents:
```python
# Initialize per-class token totals and per-class word counts
total_class_token = {}
class_eachtoken_count = {}
for class_label in class_labels:
    total_class_token[class_label] = 0
    class_eachtoken_count[class_label] = {}
    for voc in vocabulary:
        class_eachtoken_count[class_label][voc] = 0

# Count every vocabulary word per class across all documents
doccount = 0
total_voca_count = 0
for doc in doc_list:
    words = word_tokenize(doc)
    class_label = temp_class_labels[doccount]
    for word in words:
        if word in vocabulary:
            class_eachtoken_count[class_label][word] += 1
            total_class_token[class_label] += 1
            total_voca_count += 1
    doccount += 1
```
This code segment is found in count_word_fit.py.
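To make this concrete, here is a hand-worked toy example (not output from the repo) of what these structures hold for two classes:

```python
# Three tiny documents:
#   "select from users"  -> sqli
#   "select from where"  -> sqli
#   "alert cookie"       -> xss
# With vocabulary = ['select', 'from', 'users', 'where', 'alert', 'cookie'],
# the loop above produces:
total_class_token = {'sqli': 6, 'xss': 2}
class_eachtoken_count = {
    'sqli': {'select': 2, 'from': 2, 'users': 1, 'where': 1, 'alert': 0, 'cookie': 0},
    'xss':  {'select': 0, 'from': 0, 'users': 0, 'where': 0, 'alert': 1, 'cookie': 1},
}
```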
*Step-3: Build the trained model*
The following code segment implements the training step of the Multinomial Naive Bayes algorithm:
```python
import math

def multi_nativebayes_train(model_data):
    # Initialize the per-class likelihood of every vocabulary term
    class_eachtoken_likelihood = {}
    vocabulary = model_data.get_vocabulary()
    for class_label in model_data.get_class_labels():
        class_eachtoken_likelihood[class_label] = {}
        for voc in vocabulary:
            class_eachtoken_likelihood[class_label][voc] = 0

    logprior = {}
    vocabularyCount = model_data.get_vocabularyCount()
    class_eachtoken_count = model_data.get_class_eachtoken_count()
    for class_label in model_data.get_class_labels():
        total_class_token = model_data.get_total_class_token()
        # Log prior: the share of all counted tokens that belong to this class
        logprior[class_label] = math.log(total_class_token[class_label] / vocabularyCount)
        for word in vocabulary:
            if class_eachtoken_count[class_label][word] == 0:
                class_eachtoken_likelihood[class_label][word] = 0
            else:
                # Log likelihood of the word given the class (no smoothing)
                class_eachtoken_likelihood[class_label][word] = math.log(
                    class_eachtoken_count[class_label][word] / total_class_token[class_label])

    train_model_data = train_model(logprior, class_eachtoken_likelihood,
                                   vocabulary, model_data.get_class_labels())
    return train_model_data
```
This code segment is found in multinomial_nativebayes.py.
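In other words, training reduces to two formulas: the log prior log(T_c / N), where T_c is the token total for class c and N is the overall token count, and the log likelihood log(count(w, c) / T_c) for each word with a nonzero count. Note that no smoothing is applied, so zero counts simply stay at 0. Continuing the toy counts above:

```python
import math

# Toy counts from the example in Step-2
total_class_token = {'sqli': 6, 'xss': 2}
N = 8  # total tokens across both classes

logprior = {c: math.log(t / N) for c, t in total_class_token.items()}
likelihood_select_given_sqli = math.log(2 / 6)  # 'select' appeared twice in 6 sqli tokens
print(logprior, likelihood_select_given_sqli)
```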
*Step-4: Predict on the test dataset*
After the training process, we get the trained model and save it on the web server. Now pass in the list of test data, which contains both normal and abnormal samples, and get the list of prediction results from the trained model:
```python
def multi_nativebayes_verna_predict(train_model_data, test_dataset):
    condProbabilityOfTermClass = {}
    final_doc_class_label = {}
    doccount = 0
    logprior = train_model_data.get_logprior()
    vocabulary = train_model_data.get_vocabulary()
    class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()
    for doc in test_dataset:
        doc = re.sub(r"\d+", " ", doc)
        final_doc_class_label['doc' + '-' + str(doccount)] = ''
        words = word_tokenize(doc)
        max_score = float('-inf')  # start below any log score
        final_class_label = ''
        is_norm = 0
        for class_label in train_model_data.get_class_labels():
            condProbabilityOfTermClass[class_label] = 0
            logprior_val = logprior[class_label]
            # Sum the log likelihood of every known word for this class
            for word in words:
                word = word.lower()
                if word in vocabulary:
                    condProbabilityOfTermClass[class_label] += class_eachtoken_likelihood[class_label][word]
            if condProbabilityOfTermClass[class_label] == 0:
                # No attack token matched this class: treat the document as normal
                is_norm = 1
                continue
            score_Class = logprior_val + condProbabilityOfTermClass[class_label]
            # Keep the class with the highest log score (argmax)
            if score_Class > max_score:
                max_score = score_Class
                final_class_label = class_label
        if is_norm == 1:
            final_doc_class_label['doc' + '-' + str(doccount)] = "norm"
        else:
            final_doc_class_label['doc' + '-' + str(doccount)] = final_class_label
        doccount += 1
    return final_doc_class_label
```
This code segment is found in multinomial_nativebayes.py.
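As a usage sketch (the payloads here are made up, and train_model_data is the object returned by multi_nativebayes_train in Step-3):

```python
# Hypothetical test payloads: one SQL-injection-like, one benign
test_dataset = ["' or 1=1 --", "john smith"]

predictions = multi_nativebayes_verna_predict(train_model_data, test_dataset)
print(predictions)
# e.g. {'doc-0': 'sqli', 'doc-1': 'norm'}, depending on the training data
```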
At the final stage, calculate the accuracy of the algorithm at web parameter filtering:
```python
def accuracy_score(testlabelcopy, final_doc_class_label):
    label_count = 0
    wrong_count = 0
    for label in testlabelcopy:
        if label != final_doc_class_label['doc' + '-' + str(label_count)]:
            wrong_count += 1
        label_count += 1
    accuracy = ((len(testlabelcopy) - wrong_count) * 100) / len(testlabelcopy)
    return accuracy
```
This code segment is found in multinomial_nativebayes.py.
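A quick self-contained check: with two correct predictions out of four, the function returns 50.0.

```python
testlabelcopy = ['sqli', 'xss', 'norm', 'norm']
final_doc_class_label = {'doc-0': 'sqli', 'doc-1': 'norm',
                         'doc-2': 'norm', 'doc-3': 'xss'}
print(accuracy_score(testlabelcopy, final_doc_class_label))  # 50.0
```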
*Step-5: Prediction for live text classification*
In live operation, the trained model is used for text classification to verify, or filter, whether a web parameter is normal data or a malicious script:
```python
def live_multi_nativebayes_verna_predict(train_model_data, input_doc):
    condProbabilityOfTermClass = {}
    doc = re.sub(r"\d+", " ", input_doc)
    final_doc_class_label = ''
    words = word_tokenize(doc)
    max_score = float('-inf')  # start below any log score
    final_class_label = ''
    is_norm = 0
    vocabulary = train_model_data.get_vocabulary()
    logprior = train_model_data.get_logprior()
    class_eachtoken_likelihood = train_model_data.get_class_eachtoken_likelihood()
    class_label_list = train_model_data.get_class_labels()
    for class_label in class_label_list:
        condProbabilityOfTermClass[class_label] = 0
        # Use a separate variable so the logprior dict is not overwritten
        logprior_val = logprior[class_label]
        for word in words:
            word = word.lower()
            if word in vocabulary:
                condProbabilityOfTermClass[class_label] += class_eachtoken_likelihood[class_label][word]
        if condProbabilityOfTermClass[class_label] == 0:
            # No attack token matched this class: treat the parameter as normal
            is_norm = 1
            continue
        score_Class = logprior_val + condProbabilityOfTermClass[class_label]
        # Keep the class with the highest log score (argmax)
        if score_Class > max_score:
            max_score = score_Class
            final_class_label = class_label
    if is_norm == 1:
        final_doc_class_label = "norm"
    else:
        final_doc_class_label = final_class_label
    return final_doc_class_label
```
This code segment is found in multinomial_nativebayes.py.
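As a sketch of how this predictor could sit in front of a live application, here is a minimal Flask hook. Flask, the 403 response, and the startup loading of train_model_data are assumptions on my part, not part of the repo:

```python
from flask import Flask, abort, request
from multinomial_nativebayes import live_multi_nativebayes_verna_predict

app = Flask(__name__)
# Assumed: train_model_data was built by multi_nativebayes_train at startup
# (for example, unpickled from a file saved after training)

@app.before_request
def waf_filter():
    # Screen every incoming parameter value before the app handles the request
    for value in request.values.values():
        if live_multi_nativebayes_verna_predict(train_model_data, value) != "norm":
            abort(403)  # block requests whose parameters look like attack payloads
```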
Top comments (4)
Regular-expression-based detection of vulnerable web parameters is used extensively in web application firewall (WAF) development, but machine learning is still not used in most WAF development, because the accuracy of machine-learning-based solutions has not yet reached the expected level. In this post I try to explain a way of developing a WAF based on supervised machine learning. Read the post and start a discussion about my github.com/sapnilcsecu/Web-applica... solution. If you find any limitation in this solution, start a discussion about that limitation.
Can we extend this project by adding some other features, i.e. security, monitoring incoming web traffic, and encrypted communication between client and server, as a CS final year project to make a WAF? Your effort & response are appreciated.
Need more explanation about your fantastic web firewall program.
This is just an initial step. But I plan another article in which I will explain in detail how machine learning can be used for web application vulnerability detection in a web application firewall (WAF).