Protein Structure & Function Investigator

#ai #machinelearning #programming #beginners

Today, I would like to share a new program that I created on my journey to learn more about bioinformatics and Biopython. This is the second project in my bioinformatics series, and it involves investigating protein structure and function.

What the program is supposed to do: take user input in the form of a PDB ID (Protein Data Bank ID) and return an interactive 3D view of the protein. The user can then ask an AI agent, which is given a summary parsed from the structure file, specific questions about it.

To create this program, I needed to use Biopython's PDB module, which takes a 4-character PDB ID and downloads the corresponding PDB-format file. Then, using functions within this module, the file can be parsed so that it is easier for the LLM to read and easier to work with in the code. Some key features that can be taken from a protein's PDB file are the structure resolution, secondary structure elements, and amino acid sequence (among many others).

Overall, these are the tools that I used:

Tech: Python, Streamlit, LangChain Agents.
Tools: Biopython (PDB module), py3Dmol (for 3D visualization in Streamlit).
Data: Live data from the Protein Data Bank (PDB).

Now let's dive into the code for the program. I split this up into a couple of key parts and steps:

Getting the input and setting up a download path

pdb_id_input = st.text_input("Enter a PDB ID (e.g., 1TUP):", value="1TUP").lower().strip()
download_path = "pdb_files"

The PDB module needs a place to save the files it downloads, so I set up a folder path. Each new input creates a new file in that folder, and the program can then read the structure from it.
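As a rough sketch of this step, the helper below cleans up the typed ID and works out where the downloaded file should end up. The function names `normalize_pdb_id` and `expected_pdb_path` are my own and are not part of Biopython. The `pdb<id>.ent` file name is an assumption based on how PDBList names PDB-format downloads as far as I know, and the check assumes the classic 4-character ID format.

```python
import re
from pathlib import Path

# Classic PDB IDs: 4 characters, a digit followed by three letters or digits.
# Assumption: newer extended-length IDs are out of scope for this sketch.
_PDB_ID_PATTERN = re.compile(r"[0-9][a-z0-9]{3}")


def normalize_pdb_id(raw: str) -> str:
    """Strip whitespace, lowercase the ID, and reject anything malformed."""
    cleaned = raw.strip().lower()
    if not _PDB_ID_PATTERN.fullmatch(cleaned):
        raise ValueError(f"Not a 4-character PDB ID: {raw!r}")
    return cleaned


def expected_pdb_path(pdb_id: str, download_dir: str = "pdb_files") -> Path:
    """Guess where the PDB-format file lands inside the download folder.

    Hypothetical helper: the 'pdb<id>.ent' pattern is an assumption about
    Biopython's naming, so check it against your installed version.
    """
    return Path(download_dir) / f"pdb{pdb_id}.ent"
```

Validating the ID before calling the downloader means a typo like "TUP" can be reported to the user right away instead of triggering a failed download.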

Parsing the input after getting the file and saving it to a variable named "info"

from Bio.PDB import PDBList, PDBParser  # normally placed at the top of the script

if pdb_id_input:
    pdbl = PDBList()
    file_path = pdbl.retrieve_pdb_file(pdb_id_input, pdir=download_path, file_format="pdb")

    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(pdb_id_input, file_path)

    # Extract structure info
    model = structure[0]
    chains = list(model.get_chains())
    residues = list(model.get_residues())
    atoms = list(model.get_atoms())

    # Create the `info` string with summary
    info = f"""
PDB ID: {pdb_id_input.upper()}
Structure Name: {structure.header.get('name', 'N/A')}
Experiment Method: {structure.header.get('structure_method', 'N/A')}
Resolution: {structure.header.get('resolution', 'N/A')} Å

Number of Chains: {len(chains)}
Chain IDs: {[chain.id for chain in chains]}
Number of Residues: {len(residues)}
Number of Atoms: {len(atoms)}
"""

Here, I used PDBParser, which is built to read the PDB file format and organize its contents so they are easier to extract. The parsed structure is hierarchical (structure, then models, chains, residues, and atoms), so to get the first model, for example, you index it as structure[0]. From the model, you can get a lot of other information, like the chains, residues, and atoms. These are some of the most commonly used details, so I wanted to ensure they were part of the variable that I created.
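To make the chain, residue, and atom counts less abstract, here is a standard-library-only illustration of the fixed-column ATOM records that a PDB file contains. It is not how PDBParser is implemented, and it ignores models, alternate locations, and other details that Biopython handles. `SAMPLE_PDB` and `summarize_atoms` are made-up names, and the coordinates are invented for the example.

```python
# Four hand-written ATOM records (invented values) in PDB fixed-column layout.
SAMPLE_PDB = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614
ATOM      2  CA  MET A   1      26.266  25.413   2.842
ATOM      3  N   GLY A   2      25.100  24.900   3.500
ATOM      4  N   SER B   1      10.000  11.000  12.000
"""


def summarize_atoms(pdb_text: str) -> dict:
    """Count chains, residues, and atoms from ATOM/HETATM lines."""
    chains = []        # chain IDs in order of first appearance
    residues = set()   # (chain ID, residue number, insertion code)
    atom_count = 0
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        line = line.ljust(80)             # pad so column slices never run short
        chain_id = line[21]               # column 22: chain identifier
        res_number = line[22:26].strip()  # columns 23-26: residue number
        insertion_code = line[26]         # column 27: insertion code
        if chain_id not in chains:
            chains.append(chain_id)
        residues.add((chain_id, res_number, insertion_code))
        atom_count += 1
    return {"chains": chains, "residues": len(residues), "atoms": atom_count}


summary = summarize_atoms(SAMPLE_PDB)
print(summary)
```

The same idea, grouping atoms into residues and residues into chains, is what the hierarchy returned by the parser gives you through `get_chains()`, `get_residues()`, and `get_atoms()`.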

When creating the "info" variable, I made sure to save a few key features: the ID, structure name, experiment method, resolution, number of chains, chain IDs, and the residue and atom counts. A PDB file is very large, so I included only the ones I thought were most important.
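One detail worth guarding against: as far as I can tell, the parsed header can keep a "resolution" key whose value is None (for example, for structures without a reported resolution), in which case `.get('resolution', 'N/A')` would return None rather than the fallback. The small helper below is a sketch under that assumption; `format_resolution` is a hypothetical name and it works on any plain dictionary.

```python
def format_resolution(header: dict) -> str:
    """Return the resolution as text, falling back to 'N/A' when it is absent.

    Assumption: the header may contain the key with a value of None,
    so a default passed to .get() alone would not catch that case.
    """
    value = header.get("resolution")
    if value is None:
        return "N/A"
    return f"{float(value):.2f} Å"


print(format_resolution({"resolution": 2.2}))   # value present
print(format_resolution({"resolution": None}))  # key present, no value
print(format_resolution({}))                    # key missing
```

In the summary string, the resolution line could then use `format_resolution(structure.header)` so the unit only appears when there is a number to attach it to.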


    # Output
    st.subheader("Structure Info:")
    st.code(info)

    # --- 3D Visualization ---
    st.subheader("🧬 3D Structure Viewer")
    with open(file_path, "r") as f:
        pdb_data = f.read()

    view = py3Dmol.view(width=700, height=500)
    view.addModel(pdb_data, "pdb")
    view.setStyle({'cartoon': {'color': 'spectrum'}})
    view.zoomTo()

    view_html = view._make_html()
    st.components.v1.html(view_html, height=500, width=700)

This block of code visualizes the protein structure using the coordinates stored in the same PDB file. I looked up the py3Dmol syntax and came up with this basic layout for displaying the structure in the program. Below is an example of what it would look like:

Setting up the LLM with all the necessary information

    # Ask if user has a question
    st.subheader("🤖 Ask About the Structure")
    ask_question = st.radio("Do you have any questions?", ["No", "Yes"])

    if ask_question == "Yes":
        user_question = st.text_input("Enter your question about this structure:")

        if user_question:
            # Initialize LLM
            llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.4, max_tokens=300)

            # Prompt Template
            prompt = PromptTemplate(
                input_variables=["pdb_id", "info", "question"],
                template="""
You are an expert in protein structures. Given the following:

PDB ID: {pdb_id}
Structure Info:
{info}

Answer this question:
{question}

Make sure the answer is simple enough for a high school student to understand.
"""
            )

            chain = LLMChain(llm=llm, prompt=prompt)
            response = chain.run({
                "pdb_id": pdb_id_input.upper(),
                "info": info,
                "question": user_question
            })

After lots of practice, I found this part of the program pretty simple. First, I ask the user whether they have any questions. If they answer yes, they are prompted to type in their question, which is saved in a temporary variable. Then I initialize the LLM and create a simple prompt template that takes in the variables. Finally, the chain runs and returns the answer.
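To show what the prompt template does with its three variables, here is a plain-Python stand-in that fills the same placeholders with `str.format`. It does not use LangChain at all; `PROMPT_TEMPLATE` and `build_prompt` are illustrative names, and the example values are made up.

```python
PROMPT_TEMPLATE = """You are an expert in protein structures. Given the following:

PDB ID: {pdb_id}
Structure Info:
{info}

Answer this question:
{question}

Make sure the answer is simple enough for a high school student to understand.
"""


def build_prompt(pdb_id: str, info: str, question: str) -> str:
    """Fill the three placeholders and return the final text sent to the model."""
    return PROMPT_TEMPLATE.format(
        pdb_id=pdb_id.upper(),
        info=info.strip(),
        question=question.strip(),
    )


# Example values invented for illustration only.
prompt_text = build_prompt("1tup", "Number of Chains: 5", "What does this protein do? ")
print(prompt_text)
```

Printing the filled-in prompt like this is a handy way to check exactly what the model will see before spending any API calls.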

Just an FYI, all of my Streamlit UI code is woven in between these parts. I thought it would be easier to incorporate it into each part of the program than to keep it separate.

Here is a brief walkthrough of the program in action: https://youtu.be/6UzOTgFaA9c