BlastAI Program

#webdev #programming #beginners #ai

Recently, I created another Biopython program to further my expertise in this field. This time I used BLAST (Basic Local Alignment Search Tool), an algorithm that compares biological sequences such as DNA or protein sequences. To become familiar with the tool, I built a program that takes a sequence from the user and runs BLAST on it to look for alignments. Then, for people who are not so familiar with BLAST, it uses an LLM to explain what the results mean so the user can interpret them better.

Let me explain how the program works:

First, I had to get the user's input in FASTA format and ask which BLAST program they want to run (blastn or blastp). blastn compares a nucleotide sequence to a nucleotide database, while blastp compares a protein sequence to a protein database. The database is picked to match: nt for blastn and nr for blastp.

Code:

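import streamlit as st
from Bio.Blast import NCBIWWW, NCBIXML
from openai import OpenAI

client = OpenAI()  # assumption: OPENAI_API_KEY is already in the environment (e.g. loaded from the .env file)

# User Input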
fasta_input = st.text_area("🧬 Enter your FASTA sequence", height=150, value=">example\nATGCGTACGTAGCTAGCTAGCTAGCTAGCTGACT")

blast_program = st.selectbox("⚙️ Select BLAST program", ["blastn", "blastp"])
database = "nt" if blast_program == "blastn" else "nr"
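
Because blastn expects nucleotides and blastp expects amino acids, a quick check on the letters in the sequence can catch a mismatched choice before waiting on NCBI. The following is only a sketch and is not in the program above: the helper names `guess_sequence_type` and `matches_program` and the 90% threshold are assumptions made up for illustration, and the heuristic can be fooled by short proteins made mostly of A, C, G and T.

```python
# Hypothetical helpers, not part of the original program.
NUCLEOTIDE_LETTERS = set("ACGTUN")


def guess_sequence_type(sequence: str, threshold: float = 0.9) -> str:
    """Return "nucleotide" or "protein" based on the letters in the sequence."""
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    # Fraction of letters that could be DNA/RNA bases.
    nucleotide_fraction = sum(ch in NUCLEOTIDE_LETTERS for ch in letters) / len(letters)
    return "nucleotide" if nucleotide_fraction >= threshold else "protein"


def matches_program(sequence: str, blast_program: str) -> bool:
    """Check that the sequence type fits blastn (nucleotide) or blastp (protein)."""
    expected = {"blastn": "nucleotide", "blastp": "protein"}[blast_program]
    return guess_sequence_type(sequence) == expected
```

In the app, a check like this could run just before the BLAST call and show a warning with `st.warning` when `matches_program` returns `False`.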

Next, I wrote the part that runs BLAST on the FASTA input. Here is the code:

# Run BLAST
if st.button("🚀 Run BLAST and Explain"):
    if not fasta_input.startswith(">"):
        st.error("Please provide a valid FASTA format (must start with '>').")
    else:
        try:
            with st.spinner("Submitting BLAST to NCBI..."):
                result_handle = NCBIWWW.qblast(blast_program, database, fasta_input)

            with st.spinner("Parsing BLAST results..."):
                blast_record = NCBIXML.read(result_handle)
                alignment_summaries = ""

                for alignment in blast_record.alignments[:3]:
                    for hsp in alignment.hsps:
                        if hsp.expect < 0.001:
                            alignment_summaries += f"""
Match: {alignment.hit_def}
E-value: {hsp.expect}
Score: {hsp.score}
Identities: {hsp.identities}/{hsp.align_length}
Query: {hsp.query[:60]}...
Subject: {hsp.sbjct[:60]}...
---
"""
                            break  # only one HSP per hit

FASTA is a text format for biological sequences such as DNA or proteins. For BLAST to work here, the input has to begin with > followed by a sequence identifier, with the sequence itself on the following lines. So the first part of the code checks that the input starts with >.
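
The check in the program only looks at the first character. A slightly stricter version could also confirm that there is an identifier and an actual sequence. Here is a minimal sketch of that idea, assuming a single-record input; `parse_fasta` and its rules about allowed characters are hypothetical and not something BLAST or Biopython requires in this exact form.

```python
import string

# Letters plus '*' (stop) and '-' (gap); an assumption made for this sketch.
ALLOWED_CHARS = set(string.ascii_letters + "*-")


def parse_fasta(text):
    """Split a single-record FASTA string into (identifier, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA input must start with a '>' header line")
    identifier = lines[0][1:].strip()
    if not identifier:
        raise ValueError("the header line needs an identifier after '>'")
    # The sequence may be wrapped over several lines, so join them.
    sequence = "".join(lines[1:])
    if not sequence:
        raise ValueError("no sequence found after the header line")
    unexpected = set(sequence) - ALLOWED_CHARS
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    return identifier, sequence.upper()
```

A second record in the same input would also be rejected, because its `>` is not an allowed sequence character.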

Next, the NCBIWWW.qblast function sends the BLAST request to NCBI's servers. Its arguments are:

blast_program: the BLAST algorithm to use
database: the target database
fasta_input: the sequence entered in FASTA format

It returns result_handle, a file-like object containing the BLAST results as XML.

Then the function NCBIXML.read() parses the XML returned by NCBI into an object called blast_record.
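
To get a feel for what is being parsed, the sketch below reads a tiny BLAST-style XML fragment with Python's standard library. The fragment is hand-written for illustration (the hit and its numbers are invented, not real NCBI output), and `list_hits` is a hypothetical helper. Biopython's parser handles far more than this, so it is not a replacement for NCBIXML.read().

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment imitating the layout of BLAST XML output.
# The hit and its numbers are invented for illustration only.
TOY_BLAST_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>toy sequence (made up)</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_score>40</Hsp_score>
              <Hsp_evalue>1e-10</Hsp_evalue>
              <Hsp_identity>20</Hsp_identity>
              <Hsp_align-len>20</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def list_hits(xml_text):
    """Return one dict per hit, using the first HSP of each hit."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("Hit"):
        first_hsp = hit.find("Hit_hsps/Hsp")
        if first_hsp is None:
            continue  # a hit without any HSP has nothing to report
        hits.append({
            "hit_def": hit.findtext("Hit_def"),
            "expect": float(first_hsp.findtext("Hsp_evalue")),
            "score": float(first_hsp.findtext("Hsp_score")),
            "identities": int(first_hsp.findtext("Hsp_identity")),
            "align_length": int(first_hsp.findtext("Hsp_align-len")),
        })
    return hits
```

The dictionary keys are named after the Biopython attributes used in the program (hit_def, expect, score, identities, align_length) to show which XML element each one comes from.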

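Then, for each of the top three hits, the loop appends a summary of the first HSP (high-scoring pair, meaning one aligned region) with an E-value below 0.001 to alignment_summaries.

alignment.hit_def: description of the matched sequence.

hsp.expect: the E-value, the number of matches this good expected by chance; smaller means more significant.

hsp.score: the alignment score.

hsp.identities: how many positions match exactly; hsp.align_length is the total length of the alignment.

hsp.query[:60] and hsp.sbjct[:60]: the first 60 characters of the aligned query and subject segments (truncated for readability).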
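
The filtering logic can be tried out without calling NCBI by using small stand-in objects. This is a sketch under assumptions: `ToyHsp` and `ToyAlignment` are hypothetical classes that only mimic the attributes used above, `summarize` is a made-up name, and the percent identity line is an addition that the original program does not print.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHsp:
    """Stand-in for a Biopython HSP with only the attributes used here."""
    expect: float
    score: float
    identities: int
    align_length: int


@dataclass
class ToyAlignment:
    """Stand-in for a Biopython alignment (one database hit)."""
    hit_def: str
    hsps: list = field(default_factory=list)


def summarize(alignments, max_hits=3, e_threshold=0.001):
    """Summarize the first significant HSP of each of the top hits."""
    summaries = []
    for alignment in alignments[:max_hits]:
        for hsp in alignment.hsps:
            if hsp.expect < e_threshold:
                percent = 100 * hsp.identities / hsp.align_length
                summaries.append(
                    f"Match: {alignment.hit_def}\n"
                    f"E-value: {hsp.expect}\n"
                    f"Score: {hsp.score}\n"
                    f"Identities: {hsp.identities}/{hsp.align_length} ({percent:.1f}%)"
                )
                break  # only one HSP per hit
    return summaries
```

Returning a list instead of one long string makes it easy to check whether anything passed the filter (an empty list means no strong hits) and to join the summaries later with `"\n---\n".join(...)`.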

After this comes the LLM part. Using a prompt template filled in with the alignment summaries, I call an LLM (GPT-4 through the OpenAI API) to explain the BLAST results in simple terms.

# Use OpenAI to explain
            if alignment_summaries:  # at least one hit passed the E-value filter
                st.subheader("🤖 GPT Explanation")
                with st.spinner("Generating explanation..."):
                    prompt = f"""
You are a bioinformatics tutor. Explain the following BLAST results to a high school student in simple terms.

Here are the matches:

{alignment_summaries}

Explain:
- If the sequence matched anything known
- What organisms the matches came from
- How strong the matches were
- What alignments or similarities were found

Keep your explanation under 500 tokens. Be clear and easy to understand.
"""
                    try:
                        response = client.chat.completions.create(
                            model="gpt-4",
                            messages=[{"role": "user", "content": prompt}],
                            max_tokens=500,  # matches the 500-token limit stated in the prompt
                            temperature=0.5
                        )
                        explanation = response.choices[0].message.content
                        st.write(explanation)
                    except Exception as e:
                        st.error(f"Error generating explanation: {str(e)}")
                        st.info("Please check your OpenAI API key in the .env file.")
            else:
                st.warning("No strong hits found (E-value < 0.001). Try another sequence.")

        except Exception as e:
            st.error(f"Error running BLAST: {str(e)}")
            st.info("Please check your internet connection and try again.")

It is always important for the LLM to know its role. In this case, I assigned it the role of a "bioinformatics tutor," so it knows what to focus on.
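
One common way to set the role is to put it in a separate system message instead of at the top of the user prompt. The program above sends everything as a single user message, so the following is just a sketch of the alternative; `build_messages` and its `audience` parameter are hypothetical names.

```python
def build_messages(alignment_summaries, audience="a high school student"):
    """Build a chat message list with the tutor role in a system message."""
    system_message = {
        "role": "system",
        "content": (
            "You are a bioinformatics tutor. "
            f"Explain BLAST results to {audience} in simple terms."
        ),
    }
    user_message = {
        "role": "user",
        "content": (
            "Here are the matches:\n\n"
            f"{alignment_summaries}\n\n"
            "Explain:\n"
            "- If the sequence matched anything known\n"
            "- What organisms the matches came from\n"
            "- How strong the matches were\n"
            "- What alignments or similarities were found\n"
        ),
    }
    return [system_message, user_message]
```

The returned list could be passed as the `messages` argument of `client.chat.completions.create(...)` in place of the single user message.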

Here is a video of the program in action: https://youtu.be/edJgN316OXA

As always, please let me know if you have any suggestions on the code, or any advice on what to do next or how to make this program better.