<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: tenexcoder</title>
    <description>The latest articles on DEV Community by tenexcoder (@tenexcoder).</description>
    <link>https://dev.to/tenexcoder</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F511944%2Fe9234823-c083-4b28-a9d2-7769613d8f9c.png</url>
      <title>DEV Community: tenexcoder</title>
      <link>https://dev.to/tenexcoder</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tenexcoder"/>
    <language>en</language>
    <item>
      <title>Reverse Engineering the Corona Virus</title>
      <dc:creator>tenexcoder</dc:creator>
      <pubDate>Wed, 09 Dec 2020 10:00:34 +0000</pubDate>
      <link>https://dev.to/tenexcoder/reverse-engineering-the-corona-virus-c2i</link>
      <guid>https://dev.to/tenexcoder/reverse-engineering-the-corona-virus-c2i</guid>
<description>&lt;p&gt;In this post we will analyze the novel coronavirus, SARS-CoV-2, from first principles.&lt;/p&gt;

&lt;h1&gt;
  
  
  Let’s Talk Molecular Biology
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3XzEwT1P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/512/1%2Au2wvvV4Pi2s-8cyGYlg_RQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3XzEwT1P--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/512/1%2Au2wvvV4Pi2s-8cyGYlg_RQ.png" alt="DNA to protein"&gt;&lt;/a&gt;&lt;br&gt;
At a high level we can think of DNA as the blueprint, RNA as the instructions, and proteins as the functions. From high school biology we remember that DNA is made up of four nucleotides: guanine, adenine, cytosine, and thymine; RNA uses uracil in place of thymine, and DNA is copied into RNA in a process called transcription. There are 20 common amino acids, each encoded by a set of three mRNA nucleotides (A, U, G, and C) called a codon. A protein is a chain of amino acids that folds into a specific shape, and that shape defines its function.&lt;br&gt;
&lt;/p&gt;
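The central dogma described above can be sketched in a few lines of pure Python. This is a toy illustration, not the notebook's code: the codon table below is a tiny subset of the real 64-codon genetic code, and the sequences are made up for the example.

```python
# Minimal sketch of transcription and translation.
# CODON_TABLE is an illustrative subset, not the full 64-codon genetic code.
CODON_TABLE = {
    "AUG": "M",  # methionine, the canonical start codon
    "UUU": "F",  # phenylalanine
    "GGU": "G",  # glycine
    "UAA": "*",  # stop codon
}

def transcribe(dna: str) -> str:
    """Transcribe a DNA coding strand into mRNA: thymine (T) becomes uracil (U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read the mRNA one codon (three nucleotides) at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # a stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGTTTGGTTAA")
print(mrna)             # AUGUUUGGUUAA
print(translate(mrna))  # MFG
```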
&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;br&gt;
Let's summarize what we learned from the interactive notebook above. We obtained both the genome and the protein structure from the NCBI data bank. First we analyzed the genome's annotation and discovered that its taxonomy is virus and that its molecule type is single-stranded RNA. From there we explored the genome and walked through the transcription and translation process to obtain its amino acid sequence. To expedite the process we used the metadata to retrieve its protein coding sequences (CDS). 
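To make the idea of a coding sequence concrete, here is a toy scanner that finds the first open reading frame in an mRNA string, running from a start codon (AUG) through the first in-frame stop codon. A real analysis (as in the notebook) uses the annotated CDS metadata from NCBI rather than scanning; the input sequence here is invented for illustration.

```python
# Toy illustration of a protein coding sequence (CDS): a stretch of RNA from a
# start codon (AUG) to the first in-frame stop codon (UAA, UAG, or UGA).
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_first_cds(mrna: str) -> str:
    """Return the first open reading frame, from AUG through its stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            return mrna[start:i + 3]  # include the stop codon
    return ""  # no in-frame stop codon found

print(find_first_cds("GGAUGUUUGGUUAAGG"))  # AUGUUUGGUUAA
```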

&lt;p&gt;The 10 main CDS are a chain of proteins (ORF1ab), the Spike protein (S), the escape artist (ORF3a), the envelope protein (E), the membrane protein (M), the signal blocker (ORF6), the virus liberator (ORF7a), a mystery protein (ORF8), the nucleocapsid protein (N), and another mystery protein (ORF10). The two important proteins to remember are ORF1ab, which functions as the payload, and S, which is the exploit. The rest of the proteins assist to make everything happen. Lastly, we visualized a 3D model of the main protein structure.&lt;/p&gt;

&lt;p&gt;For a visual walkthrough visit &lt;a href="https://www.nytimes.com/interactive/2020/04/03/science/coronavirus-genome-bad-news-wrapped-in-protein.html"&gt;The New York Times Infographic&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>showdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Deploying your React app has never been simpler with Github Pages</title>
      <dc:creator>tenexcoder</dc:creator>
      <pubDate>Mon, 23 Nov 2020 23:17:54 +0000</pubDate>
      <link>https://dev.to/tenexcoder/deploying-your-react-app-has-never-been-simpler-with-github-pages-1jmi</link>
      <guid>https://dev.to/tenexcoder/deploying-your-react-app-has-never-been-simpler-with-github-pages-1jmi</guid>
<description>&lt;p&gt;Remember the time you were trying to share progress with a client or wanted to showcase your next side project? We've all been there, hoping things could be only a few clicks away.&lt;/p&gt;

&lt;p&gt;Well, fear not: your wish has been granted. There is now a free and simple way to deploy your React apps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Package: gh-pages
&lt;/h2&gt;

&lt;p&gt;I present to you &lt;a href="https://github.com/tschaub/gh-pages"&gt;gh-pages&lt;/a&gt;, which, to quote its README, lets you “Publish files to a &lt;code&gt;gh-pages&lt;/code&gt; branch on GitHub (or any other branch anywhere else).”&lt;br&gt;
The package automates the mundane work of deploying your React app to GitHub Pages down to three simple steps.&lt;br&gt;
Technically this package can help you deploy any static site, as long as the base directory of the static files is set accordingly (more on this in Step 2).&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 1: Add &lt;code&gt;homepage&lt;/code&gt; to &lt;code&gt;package.json&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The step below is important!&lt;/em&gt;&lt;br&gt;
&lt;em&gt;If you skip it, your app will not deploy correctly.&lt;/em&gt;&lt;br&gt;
Open your &lt;code&gt;package.json&lt;/code&gt; and add a &lt;code&gt;homepage&lt;/code&gt; field for your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“homepage”: “https://myusername.github.io/my-app",
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or for a GitHub user page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“homepage”: “https://myusername.github.io",
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or for a custom domain page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“homepage”: “https://mywebsite.com",
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create React App uses the &lt;code&gt;homepage&lt;/code&gt; field to determine the root URL in the built HTML file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Install &lt;code&gt;gh-pages&lt;/code&gt; and add &lt;code&gt;deploy&lt;/code&gt; to &lt;code&gt;scripts&lt;/code&gt; in &lt;code&gt;package.json&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Once the &lt;code&gt;homepage&lt;/code&gt; field is set, &lt;code&gt;npm run build&lt;/code&gt; will print a cheat sheet with instructions on how to deploy to GitHub Pages.&lt;br&gt;
To publish the app at &lt;a href="https://myusername.github.io/my-app"&gt;https://myusername.github.io/my-app&lt;/a&gt;, first install the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install — save gh-pages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you may use yarn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;yarn add gh-pages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then add the following scripts to your &lt;code&gt;package.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“scripts”: {
+ “predeploy”: “npm run build”,
+ “deploy”: “gh-pages -d build”,
“start”: “react-scripts start”,
“build”: “react-scripts build”,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;predeploy&lt;/code&gt; script will run automatically before &lt;code&gt;deploy&lt;/code&gt; is run.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;deploy&lt;/code&gt; script will automagically deploy your app.&lt;/p&gt;

&lt;p&gt;Note: The &lt;code&gt;-d&lt;/code&gt; option is to point to the base directory of the static files. Set it according to your project’s configuration. For example the base directory for &lt;code&gt;create-react-app&lt;/code&gt; is &lt;code&gt;build&lt;/code&gt; by default, meanwhile for a &lt;code&gt;webpack&lt;/code&gt; configuration it is &lt;code&gt;dist&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you are deploying to a GitHub user page instead of a project page you’ll need to make one additional modification:&lt;br&gt;
Tweak your &lt;code&gt;package.json&lt;/code&gt; scripts to push deployments to master:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“scripts”: {
“predeploy”: “npm run build”,
- “deploy”: “gh-pages -d build”,
+ “deploy”: “gh-pages -b master -d build”,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Deploy the site by running &lt;code&gt;npm run deploy&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run deploy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  For a project page, ensure your project’s settings use &lt;code&gt;gh-pages&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Finally, make sure the GitHub Pages option in your GitHub project settings is set to use the &lt;code&gt;gh-pages&lt;/code&gt; branch:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0alkkrs_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/HUjEr9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0alkkrs_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://i.imgur.com/HUjEr9l.png" alt="gh-pages branch"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Optionally, configure the domain
&lt;/h2&gt;

&lt;p&gt;You can configure a custom domain with GitHub Pages by adding a &lt;code&gt;CNAME&lt;/code&gt; file to the &lt;code&gt;public/&lt;/code&gt; folder.&lt;br&gt;
Your CNAME file should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mywebsite.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;For more details, check out the gh-pages repository or the Create React App deployment docs, on which this guide is heavily based.&lt;br&gt;
&lt;a href="https://github.com/tschaub/gh-pages"&gt;https://github.com/tschaub/gh-pages&lt;/a&gt;&lt;br&gt;
&lt;a href="https://create-react-app.dev/docs/deployment/#github-pages"&gt;https://create-react-app.dev/docs/deployment/#github-pages&lt;/a&gt;&lt;/p&gt;

</description>
      <category>react</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>github</category>
    </item>
    <item>
      <title>🤗 BERT tokenizer from scratch</title>
      <dc:creator>tenexcoder</dc:creator>
      <pubDate>Wed, 11 Nov 2020 02:48:19 +0000</pubDate>
      <link>https://dev.to/tenexcoder/bert-tokenizer-from-scratch-4ma2</link>
      <guid>https://dev.to/tenexcoder/bert-tokenizer-from-scratch-4ma2</guid>
<description>&lt;p&gt;As part of the &lt;a href="https://github.com/huggingface/tokenizers"&gt;🤗 Tokenizers&lt;/a&gt; 0.9 release, it has never been easier to create extremely fast and versatile tokenizers for your next NLP task.&lt;br&gt;
There is no better way to showcase the library's new capabilities than to build a BERT tokenizer from scratch.&lt;/p&gt;
&lt;h2&gt;
  
  
  Tokenizer
&lt;/h2&gt;

&lt;p&gt;First, BERT relies on WordPiece, so we instantiate a new &lt;a href="https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Tokenizer"&gt;Tokenizer&lt;/a&gt; with this model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tokenizers import Tokenizer
from tokenizers.models import WordPiece

bert_tokenizer = Tokenizer(WordPiece())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We know that BERT preprocesses texts by removing accents and lowercasing, so we add a unicode normalizer alongside those two steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tokenizers import normalizers
from tokenizers.normalizers import Lowercase, NFD, StripAccents

bert_tokenizer.normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pre-tokenizer simply splits on whitespace and punctuation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tokenizers.pre_tokenizers import Whitespace

bert_tokenizer.pre_tokenizer = Whitespace()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the post-processing uses the classic BERT template, wrapping single sentences and sentence pairs in &lt;code&gt;[CLS]&lt;/code&gt; and &lt;code&gt;[SEP]&lt;/code&gt; tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tokenizers.processors import TemplateProcessing

bert_tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", 1),
        ("[SEP]", 2),
    ],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now train this tokenizer on wikitext, as in the &lt;a href="https://huggingface.co/docs/tokenizers/python/latest/quicktour.html"&gt;Quicktour&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tokenizers.trainers import WordPieceTrainer

trainer = WordPieceTrainer(
    vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)
files = [f"data/wikitext-103-raw/wiki.{split}.raw" for split in ["test", "train", "valid"]]
bert_tokenizer.train(trainer, files)

model_files = bert_tokenizer.model.save("data", "bert-wiki")
bert_tokenizer.model = WordPiece.from_file(*model_files, unk_token="[UNK]")

bert_tokenizer.save("data/bert-wiki.json")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the BERT tokenizer has been configured and trained, we can reload it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tokenizers import Tokenizer

bert_tokenizer = Tokenizer.from_file("data/bert-wiki.json")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Decoding
&lt;/h2&gt;

&lt;p&gt;On top of encoding the input texts, a &lt;a href="https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Tokenizer"&gt;&lt;code&gt;Tokenizer&lt;/code&gt;&lt;/a&gt; also has an API for decoding, that is, converting IDs generated by your model back to a text. This is done by the methods &lt;a href="https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Tokenizer.decode"&gt;&lt;code&gt;decode()&lt;/code&gt;&lt;/a&gt; (for one predicted text) and &lt;a href="https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.Tokenizer.decode_batch"&gt;&lt;code&gt;decode_batch()&lt;/code&gt;&lt;/a&gt; (for a batch of predictions).&lt;br&gt;
The decoder will first convert the IDs back to tokens (using the tokenizer’s vocabulary) and remove all special tokens, then join those tokens with spaces.&lt;br&gt;
If you used a model that adds special characters to represent subtokens of a given “word” (like the &lt;code&gt;"##"&lt;/code&gt; in WordPiece), you will need to customize the decoder to handle them properly. Taking our previous &lt;code&gt;bert_tokenizer&lt;/code&gt;, for instance, the default decoding gives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;output = bert_tokenizer.encode("Welcome to the 🤗 Tokenizers library.")
print(output.tokens)
# ["[CLS]", "welcome", "to", "the", "[UNK]", "tok", "##eni", "##zer", "##s", "library", ".", "[SEP]"]

bert_tokenizer.decode(output.ids)
# "welcome to the tok ##eni ##zer ##s library ."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But by changing it to a proper decoder, we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from tokenizers import decoders

bert_tokenizer.decoder = decoders.WordPiece()
bert_tokenizer.decode(output.ids)
# "welcome to the tokenizers library."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Documentation: &lt;a href="https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#all-together-a-bert-tokenizer-from-scratch"&gt;https://huggingface.co/docs/tokenizers/python/latest/pipeline.html#all-together-a-bert-tokenizer-from-scratch&lt;/a&gt;&lt;br&gt;
Colab: &lt;a href="https://colab.research.google.com/github/tenexcoder/huggingface-tutorials/blob/main/BERT_tokenizer_from_scratch.ipynb"&gt;https://colab.research.google.com/github/tenexcoder/huggingface-tutorials/blob/main/BERT_tokenizer_from_scratch.ipynb&lt;/a&gt;&lt;br&gt;
Gist: &lt;a href="https://gist.github.com/tenexcoder/85b38e17a5557f0bb7c44bda4a08271d"&gt;https://gist.github.com/tenexcoder/85b38e17a5557f0bb7c44bda4a08271d&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Credit
&lt;/h3&gt;

&lt;p&gt;All credit goes to the Hugging Face Tokenizers documentation (see Resources for more details).&lt;br&gt;
I simply packaged the example in a digestible and shareable form.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
