<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rohit Farmer</title>
    <description>The latest articles on DEV Community by Rohit Farmer (@rohitfarmer).</description>
    <link>https://dev.to/rohitfarmer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F153306%2Fc64a1c88-d3e9-4214-b59d-dc2451bbff8c.jpg</url>
      <title>DEV Community: Rohit Farmer</title>
      <link>https://dev.to/rohitfarmer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rohitfarmer"/>
    <language>en</language>
    <item>
      <title>Keyoxide</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Sun, 03 Mar 2024 22:09:01 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/keyoxide-1g10</link>
      <guid>https://dev.to/rohitfarmer/keyoxide-1g10</guid>
      <description>&lt;p&gt;aspe:keyoxide.org:LYZNNNWNTWKBSQEDI46BBOEU3E&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Tweets from heads of governments and states</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Mon, 19 Sep 2022 20:27:28 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/tweets-from-heads-of-governments-and-states-29h2</link>
      <guid>https://dev.to/rohitfarmer/tweets-from-heads-of-governments-and-states-29h2</guid>
      <description>&lt;p&gt;Since October 2018, I have been maintaining a bot written in Python and running on a Raspberry Pi 3B+ that collects tweets from heads of governments and offices (worldwide) followed by &lt;a href="https://twitter.com/headoffice"&gt;https://twitter.com/headoffice&lt;/a&gt;. It was an excellent exercise learning Python, Twitter API, SQLite database, and using a Raspberry Pi for hobby projects. I have now released the data on Kaggle at &lt;a href="https://doi.org/10.34740/KAGGLE/DSV/4208877"&gt;https://doi.org/10.34740/KAGGLE/DSV/4208877&lt;/a&gt; for the community to use. &lt;/p&gt;

&lt;p&gt;The dataset contains one Excel workbook per year, with data points on the rows and features on the columns. Features include the timestamp (UTC), the language in which the tweet is written, user id, user name, tweet id, and tweet text. The first version includes data from October 2018 until September 15, 2022; subsequent releases will be quarterly. It is a textual dataset and is primarily useful for analyses related to natural language processing.&lt;/p&gt;

&lt;p&gt;In the Kaggle submission, I have also included a notebook (&lt;a href="https://www.kaggle.com/code/rohitfarmer/dont-run-tweet-collection-and-preprocessing"&gt;https://www.kaggle.com/code/rohitfarmer/dont-run-tweet-collection-and-preprocessing&lt;/a&gt;) with the Python code that collected the tweets and the additional code that I used to pre-process the data before submission. After releasing the first data set, I updated the code and moved the bot from Python to R using the &lt;code&gt;rtweet&lt;/code&gt; library instead of &lt;code&gt;tweepy&lt;/code&gt;. I found &lt;code&gt;rtweet&lt;/code&gt; to perform better, especially in filtering out duplicated tweets. &lt;/p&gt;

&lt;p&gt;In the current setup (&lt;a href="https://github.com/rohitfarmer/government-tweets"&gt;https://github.com/rohitfarmer/government-tweets&lt;/a&gt;), which is still running on my Raspberry Pi 3B+, the main bot script runs every fifteen minutes via &lt;code&gt;crontab&lt;/code&gt; and fetches data more recent than the latest tweet collected in the previous run. The data is stored in an SQLite database, which is backed up to MEGA cloud storage via Rclone every night at midnight ET.&lt;/p&gt;
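&lt;p&gt;As a sketch, the schedule described above could be wired up with two crontab entries like the ones below. The script and database paths are only placeholders for illustration, not the actual files in the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fetch new tweets every fifteen minutes (hypothetical script path).
*/15 * * * * /usr/bin/Rscript /home/pi/government-tweets/bot.R

# Back up the SQLite database to MEGA at midnight (system clock on ET).
0 0 * * * /usr/bin/rclone copy /home/pi/government-tweets/tweets.db mega:backups/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;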

&lt;p&gt;I enjoyed creating the bot and being able to run it for a couple of years, and I hope I will soon find some time to look into the data and extract some exciting insights. Until then, the data is available to the data science community to use as they please. Please open a discussion on the Kaggle page for questions, comments, or collaborations. &lt;/p&gt;

&lt;p&gt;Day 13 of #100DaysToOffLoad&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>twitterbot</category>
      <category>nlp</category>
      <category>kaggle</category>
    </item>
    <item>
      <title>How to use Neovim or VIM Editor as an IDE for R</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Wed, 13 May 2020 00:43:32 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/how-to-use-neovim-or-vim-editor-as-an-ide-for-r-2ehi</link>
      <guid>https://dev.to/rohitfarmer/how-to-use-neovim-or-vim-editor-as-an-ide-for-r-2ehi</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/nm45WagtV3w"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In this tutorial, I demonstrate how to turn the Neovim or Vim text editor into a fully functional IDE for the R programming language. In addition to the primary plugin, Nvim-R, I also show five other plugins that we can use to enhance our code editing capabilities in the Neovim text editor.&lt;/p&gt;
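&lt;p&gt;As a minimal sketch, assuming the vim-plug plugin manager, the Nvim-R plugin mentioned above can be declared in your init.vim with a snippet like the one below; the five extra plugins from the video would be added in the same block.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;" Plugins managed by vim-plug (~/.config/nvim/init.vim)
call plug#begin('~/.local/share/nvim/plugged')

" Turns Neovim/Vim into an R IDE: sends code to an R console,
" browses the workspace, renders R Markdown, etc.
Plug 'jalvesaq/Nvim-R'

" Add the other editing-enhancement plugins here.

call plug#end()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;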

</description>
      <category>vim</category>
      <category>neovim</category>
      <category>rstats</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to Rip an Audio CD in Linux?</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Mon, 06 Apr 2020 13:39:53 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/how-to-rip-an-audio-cd-in-linux-4603</link>
      <guid>https://dev.to/rohitfarmer/how-to-rip-an-audio-cd-in-linux-4603</guid>
      <description>&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/SZP_--gXrpo"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Ripping an audio CD means copying the audio tracks from the CD to your computer in a format that is playable on devices other than a CD player, for example, your smartphone. There are multiple formats in which you can rip an audio CD, and they can be broadly divided into lossy (compressed) and lossless (uncompressed) formats. MP3, for example, is a lossy format: when you convert an audio track from a CD to an MP3, it loses some information. The loss of information makes the file smaller and easier to store without sacrificing much detectable audio quality, at least for most people listening on everyday non-specialized audio equipment. Lossless formats, on the other hand, preserve all the information that was originally recorded in the studio.&lt;/p&gt;

&lt;p&gt;The uncompressed audio on a CD is usually ripped to the lossless WAV format. A typical 5-minute song in WAV format takes around 50 MB of disk space. With storage costs falling day by day, it may be acceptable for many of us to keep the WAV files without going through the hassle of converting them into any other format. However, a disadvantage of the WAV format, in addition to the bulky file size, is that it has minimal capability to store metadata, for example, the name of the artist, the album name, genre, album art, etc. A lossless format like FLAC (Free Lossless Audio Codec) can store metadata the way MP3 does while preserving the same audio quality as a WAV file. FLAC files also tend to be smaller than WAV files: FLAC is a compressed format like MP3, but unlike MP3 it does not lose any information. &lt;/p&gt;

&lt;p&gt;Links:&lt;br&gt;
Asunder: &lt;a href="http://www.littlesvr.ca/asunder/"&gt;http://www.littlesvr.ca/asunder/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>linux</category>
      <category>flac</category>
      <category>ripping</category>
      <category>asunder</category>
    </item>
    <item>
      <title>Essentials for Reproducibility</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Wed, 04 Sep 2019 21:30:55 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/essentials-for-reproducibility-1l79</link>
      <guid>https://dev.to/rohitfarmer/essentials-for-reproducibility-1l79</guid>
      <description>&lt;p&gt;Reproducibility of results is imperative for a sound research project. However, in computational research, especially in academic environments, we often overlook the reproducibility factor and does not put enough effort into establishing essential environments and workflows early on in the project lifecycle. Readers are often expected to figure things out themselves from the minimal information in the methods section if they want to reproduce the results themselves.  In this article, I discuss some of the practices that I have adopted in the past two years to make my work more reproducible. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version Control:&lt;/strong&gt; Most important of all is to keep your code base under a &lt;a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control"&gt;version control system (VCS)&lt;/a&gt;. I prefer &lt;a href="https://github.com"&gt;GitHub&lt;/a&gt;; however, other popular options like &lt;a href="https://about.gitlab.com"&gt;GitLab&lt;/a&gt; and &lt;a href="https://bitbucket.org"&gt;Bitbucket&lt;/a&gt; are equally good. Posting your code on GitHub not only versions it but also acts as a backup. I have lost too much code in the past for various reasons; now I keep all of it on GitHub. If something is embarrassingly trivial, you can keep it in a private repository; at least it will still be there in case you need to refer to it in the future. With all the code related to a project on GitHub, it becomes straightforward to share it with collaborators or readers just by pointing them to the repository. And if you write a package or a module, many programming languages can install it directly from its GitHub repository without it being posted to a language-specific archive network. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linux:&lt;/strong&gt; Most academic projects run for two to four years before you can publish an excellent paper. In addition to the project runtime, it is desirable for the results to be reproducible even after a decade. That is a very long time for most operating systems to stay on the same version, and in the case of Windows and macOS it might not even be possible to get a legal copy of an older version. In contrast, at least in theory, you can always get an older version of a popular &lt;a href="https://distrowatch.com"&gt;Linux distribution&lt;/a&gt;, install it in a virtual machine, and run the code written against old libraries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Containers:&lt;/strong&gt; Containers serve as a lighter-weight version of a virtual machine that encapsulates an operating system along with all the installed software and packages. A container image can be stored in a file and transferred easily from one computer to another. &lt;a href="https://www.docker.com"&gt;Docker&lt;/a&gt; is one of the most widely used container systems; Docker containers can be built from scratch or downloaded pre-built from their official vendors. If you are into scientific computing and frequently work in high-performance computing (HPC) environments, then &lt;a href="https://sylabs.io/singularity/"&gt;Singularity&lt;/a&gt; containers should be your choice. Singularity is now available by default on most academic HPCs, for example, all the HPCs provided by &lt;a href="https://www.xsede.org"&gt;XSEDE&lt;/a&gt; in the USA. Once a non-writable container is built, it freezes its contents in time and can be executed with its engine at any time in the future. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Environments:&lt;/strong&gt; Setting up a virtual environment for every project and pinning the library versions has multiple advantages. It gives you an isolated environment to work in without tampering with global settings, each project can have its own versions of libraries, and you are saved the hassle of resolving conflicts between multiple versions of globally installed libraries. However, in contrast to virtual machines and containers, a virtual environment only controls a specific set of libraries; it does not pin the operating system. I find &lt;a href="https://docs.conda.io/en/latest/"&gt;conda&lt;/a&gt; to be excellent for setting up environments, and I use it for both Python and R. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Documentation:&lt;/strong&gt; Never forget to document your code, both inside the scripts and outside in a lab book or a readme file. In the hurry to produce results and get feedback from our supervisors, documentation is often the first thing we skip. It also seems least necessary while the project is running, because we have all the facts in our heads. However, as time passes, there is only so much the human brain can remember and recall when needed; it would not be surprising if you forget the very details of the project that you could recite in your sleep while you were actively working on it. Markdown readme files, Dropbox Paper, and GitHub wikis are some of my preferred ways to document my projects. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
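&lt;p&gt;For the environments point above, a minimal conda workflow might look like the sketch below; the environment name and versions are arbitrary examples.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create an isolated environment with pinned interpreter versions.
conda create --name myproject python=3.7 r-base=3.6

conda activate myproject

# Record the exact package versions so others can reproduce the setup.
conda env export --file environment.yml

# On another machine, rebuild the identical environment.
conda env create --file environment.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;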

&lt;p&gt;Originally posted at &lt;a href="https://rohitfarmer.github.io/datascience/2019/09/04/reproducibility/"&gt;https://rohitfarmer.github.io/datascience/2019/09/04/reproducibility/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>reproducibility</category>
      <category>datascience</category>
      <category>research</category>
      <category>data</category>
    </item>
    <item>
      <title>Learning R: Rant 2</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Wed, 08 May 2019 19:57:05 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/learning-r-rant-2-3m44</link>
      <guid>https://dev.to/rohitfarmer/learning-r-rant-2-3m44</guid>
      <description>&lt;p&gt;Continuing my rant series while I am learning R. R is pretty much like Perl in terms of the philosophy that they follow; that is "There Is More Than One Way To Do It (TIMTOWTDI)". Although I appreciate this philosophy as it opens up the possibility of adopting multiple ways of doing the same thing. It becomes problematic when multiple people are working on the same project or worse if you are taking over someone else's project. Understanding the code becomes a pain in the back. Python, in this case, is a good language because it is more readable to a human and there are pretty much-set things on how to solve a particular kind of problem. Packages in R also have too many overlapping functions that mask each other in the order the packages are loaded. Therefore, you always need to be cautious about a function's source before calling it. &lt;/p&gt;

</description>
      <category>r</category>
      <category>datascience</category>
      <category>rant</category>
    </item>
    <item>
      <title>How-to run Jupyter notebook in an interactive node on a High-Performance Computer (HPC).</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Tue, 23 Apr 2019 23:00:00 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/how-to-run-jupyter-notebook-in-an-interactive-node-on-a-high-performance-computer-hpc-27mg</link>
      <guid>https://dev.to/rohitfarmer/how-to-run-jupyter-notebook-in-an-interactive-node-on-a-high-performance-computer-hpc-27mg</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DZ9T-ekE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9509rdfdw2ad3r18lf6b.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DZ9T-ekE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/9509rdfdw2ad3r18lf6b.jpg" alt="Computer" width="880" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is an example protocol to run a Jupyter notebook in an interactive node on a high-performance computer (HPC). Most HPCs have their own specialised way of interacting with them, so you may have to tweak this protocol to your needs. I would be happy to discuss and troubleshoot with you; contact me at &lt;a href="mailto:contact@rohitfarmer.dev"&gt;contact@rohitfarmer.dev&lt;/a&gt;. Any suggestions to augment this protocol with more advanced features are welcome.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SSH to the HPC.&lt;/li&gt;
&lt;li&gt;Claim an interactive node (follow the standard procedure for your HPC, in my case it is &lt;code&gt;qrsh&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Note the interactive node name.&lt;/li&gt;
&lt;li&gt;Run Jupyter on the claimed interactive node with &lt;code&gt;jupyter notebook --no-browser --ip='0.0.0.0'&lt;/code&gt;, or create an alias in your bashrc as a shortcut, for example &lt;code&gt;alias jup='jupyter notebook --no-browser --ip=0.0.0.0'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;On your computer start another SSH session with tunnelling using the interactive node name as noted above &lt;code&gt;ssh user@host -L8888:nodeName:8888 -N&lt;/code&gt;. &lt;em&gt;The prompt probably won’t return and you may also not see any message in your terminal, but as long as there is no error message, it’s probably running fine.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;To avoid re-typing the command in step 5 every time you tunnel, you can use the shell script below.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I named it &lt;code&gt;jupssh&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/sh

# Check that a node name argument is passed.
if [ "$#" -eq 0 ];
then
    echo 'Usage: jupssh &amp;lt;node name&amp;gt;'
    exit 1
fi

ssh user@host -L8888:$1:8888 -N
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Copy the URL that the Jupyter daemon generated in step 4 and paste it into the browser on your computer. The URL should look something like &lt;code&gt;http://(nodeName or 127.0.0.1):8888/?token=3f7c3a8949b3fa1961c63653873fea075a93a29bffe373b5&lt;/code&gt;; replace the part in parentheses with either the node name or 127.0.0.1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Top banner photo by &lt;br&gt;
&lt;a href="https://unsplash.com/@fedechanw?utm_medium=referral&amp;amp;utm_campaign=photographer-credit&amp;amp;utm_content=creditBadge" rel="noopener noreferrer" title="Download free do whatever you want high-resolution photos from Federica Galli"&gt;&lt;span&gt;unsplash-logo&lt;/span&gt;&lt;span&gt;Federica Galli&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>jupyter</category>
      <category>hpc</category>
      <category>supercomputing</category>
    </item>
    <item>
      <title>Learning R: Rant 1</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Wed, 17 Apr 2019 19:44:08 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/learning-r-rant-1-3n04</link>
      <guid>https://dev.to/rohitfarmer/learning-r-rant-1-3n04</guid>
      <description>&lt;p&gt;I learned machine learning and data science last year with Python in my previous job, and this year I am more or less doing the same stuff in R for my current job. Learning R seems to be a pain right now, because to me coming from Python R appears to be a giant mess. There is n number of packages with multiple of them doing more or less the same thing. For example, right now I am learning about data frames/tables in R. So you have traditional R data frames that are not very easy to work with. Therefore, there is an enhanced version of it in the form of data.tables(). Former generates a data frame and the later generates a data table. In addition to the data frame and table, there is something else called tibble that is produced by the packages in the tidyverse library. Why there is not just one package like Pandas for data frames and Numpy for nd arrays?&lt;/p&gt;

</description>
      <category>r</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>lifescience</category>
    </item>
    <item>
      <title>One Year of Doing Data Science &amp; Deep Learning</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Thu, 07 Feb 2019 00:00:00 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/one-year-of-doing-data-science-amp-deep-learning-39ip</link>
      <guid>https://dev.to/rohitfarmer/one-year-of-doing-data-science-amp-deep-learning-39ip</guid>
      <description>&lt;p&gt;For the past year, I have been doing data science (DS) &amp;amp; deep learning (DL) to understand drug bioactivation, toxicity and their implications on drug-induced liver injuries (DILI). I come from a background of molecular modeling which is a computational approach to analyse molecules, but I would not consider it to be DS or DL. Since, DS and DL were a new subject that I picked up to master, this blog post is an account of my year-long journey.&lt;/p&gt;

&lt;p&gt;While waiting for the visa for my current position, I asked my employer whether there was anything I should learn to get a good start in deep learning. The answer was Keras and Tensorflow, which run on Python. So I learned rudimentary Keras and Tensorflow and built intermediate-level expertise in the Python programming language. However, if someone wants to learn DL and all they know is vanilla Python, then Keras or Tensorflow is probably not the first place to start. Most of the time in a DL project is spent on data wrangling, which means getting the input data into the right format to present to a DL algorithm and re-formatting the DL output for inference.&lt;/p&gt;

&lt;p&gt;Therefore, in my opinion, an excellent place to start is learning the data handling and manipulation libraries Numpy and Pandas. These are the two most crucial tools you will use in any DL project. Next, learning Matplotlib will help you plot graphs of your input data and results. You can always use a code editor to practice these tools, but nothing beats a Jupyter notebook. In Jupyter you can write markdown notes as you learn, execute small snippets of code in individual cells, and plot graphs in the same space. It is well suited for trying and testing code, especially during the data wrangling stage, as it lets you visualize Numpy arrays and Pandas data frames effectively as you go.&lt;/p&gt;

&lt;p&gt;Once you have mastered Numpy, Pandas, and Matplotlib, familiarizing yourself with Scipy and Scikit-Learn is highly recommended. They are general-purpose data science and machine learning libraries that come in very handy at various points in a DL project; I use them more often than Keras or Tensorflow. Some of the uses are data scaling, creating train-test splits, generating stratified folds for cross-validation, metrics like ROC-AUC, MSE, and MAE, feature selection, and statistical tests, among others.&lt;/p&gt;

&lt;p&gt;In my opinion, these are the bare minimum tools you need for a DL project in addition to the main DL libraries such as Keras or Tensorflow. If you want a few extra feathers in your cap, I recommend learning Seaborn and Plotly for graphs and plots; they are based on Matplotlib but with enhanced features. PyArrow's Parquet module can take advantage of the Parquet format for columnar data storage. On occasion, I have also used the Statsmodels library for conventional statistical procedures.&lt;/p&gt;

&lt;p&gt;Now, coming to Keras and Tensorflow: people usually start with Keras, a higher-level abstraction API for several DL engines including Tensorflow, because of its ease of use. However, learning Tensorflow itself has a lot of incentives: it gives you much more control over what you can do with your DL algorithm. Keras is undoubtedly the right place to start, is very good for prototyping, and is suitable for many projects that can be solved with off-the-shelf DL procedures. However, if you are picking up DL for the long run, then Tensorflow is inevitable.&lt;/p&gt;

&lt;p&gt;If you are a person like me who comes from a non-programming background, then you will also want to learn some additional tools that are part of any professional programmer’s toolbox. Probably nothing is more important than a version control system (VCS). The two most widely used VCSs at the moment are Git and Mercurial. Both can be used on your local machine or in the cloud: Bitbucket supports both Git and Mercurial repositories, and GitHub supports Git. A GitHub profile is not only good for keeping track of your programs; it also serves as a code portfolio that you can showcase during a job interview. At this point, I should also mention my favorite code editor, Microsoft Visual Studio Code. It’s free, cross-platform, and officially supports Python and GitHub integration.&lt;/p&gt;

&lt;p&gt;One last piece of advice: DL is only as useful as your domain knowledge of the subject you are trying to study. DL is not going to magically find answers for you if you are not asking the right question and providing it the correct data to look into.&lt;/p&gt;

&lt;p&gt;Top Banner Photo by &lt;a href="https://unsplash.com/@thefredyjacob?utm_medium=referral&amp;amp;utm_campaign=photographer-credit&amp;amp;utm_content=creditBadge"&gt;unsplash-logo&lt;br&gt;
Fredy Jacob&lt;/a&gt; on Unsplash&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>deeplearning</category>
      <category>pandas</category>
    </item>
    <item>
      <title>How to use docker to run multiple neo4j servers simultaneously</title>
      <dc:creator>Rohit Farmer</dc:creator>
      <pubDate>Sun, 19 Aug 2018 23:00:00 +0000</pubDate>
      <link>https://dev.to/rohitfarmer/how-to-use-docker-to-run-multiple-neo4j-servers-simultaneously-3cmo</link>
      <guid>https://dev.to/rohitfarmer/how-to-use-docker-to-run-multiple-neo4j-servers-simultaneously-3cmo</guid>
      <description>&lt;p&gt;Neo4J graph database server can only mount one database at a time. To run more than one instances of the neo4j server with different databases mounted on them one of the efficient methods is to use &lt;a href="https://hub.docker.com/_/neo4j/" rel="noopener noreferrer"&gt;neo4j docker image&lt;/a&gt;. The key to using more than one neo4j servers simultaneously is to use different ports for http, https and bolt connections which is relatively easy to do with the docker image. For my purpose, I also had to configure neo4j in such a way that it can access the database from a non-default location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Installing the neo4j docker image.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Presuming that you already have docker installed and that you are working in a Linux environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull neo4j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Running neo4j docker image.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By default, the neo4j docker image mounts the following folders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;home: /var/lib/neo4j
config: /var/lib/neo4j/conf
logs: /var/lib/neo4j/logs
plugins: /var/lib/neo4j/plugins
import: /import
data: /var/lib/neo4j/data
certificates: /var/lib/neo4j/certificates
run: /var/lib/neo4j/run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These directories may correspond to already existing directories on your system. In my case, I already had a neo4j community server running on my machine, so all these locations existed and were being used by the server; therefore, I had to provide custom locations that would supply the same information. These locations will not be on your computer if you have not installed the server version and only intend to use the docker image. To my knowledge, the most important of the above-mentioned folders are &lt;code&gt;data&lt;/code&gt;, where your actual database is created/stored; &lt;code&gt;import&lt;/code&gt;, where you put, for example, CSV files for import; and &lt;code&gt;conf&lt;/code&gt;, where you put the &lt;code&gt;neo4j.conf&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;The command below runs the neo4j docker image, taking care of running the server on non-default ports and also creating or mounting the required folders from the desired locations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --detach --name=my-neo4j --rm --env=NEO4J_AUTH=none \
--publish=7475:7474 --publish=7476:7473 --publish=7688:7687 \
--volume=$HOME/neo4j/data:/data \
--volume=$HOME/neo4j/import:/import \
--volume=$HOME/neo4j/conf:/conf \
neo4j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Breaking down the above-mentioned command.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;docker run … neo4j runs the neo4j docker image.&lt;br&gt;&lt;br&gt;
--detach runs the container in the background and returns the prompt.&lt;br&gt;&lt;br&gt;
--name=my-neo4j gives the desired name to the docker instance; otherwise docker will choose a random name, which might not be easy to remember if we want to refer to this session in the future.&lt;br&gt;&lt;br&gt;
--rm deletes the docker instance from the list upon session termination. This is useful if we want to reuse the same name.&lt;br&gt;&lt;br&gt;
--env=NEO4J_AUTH=none sets up the environment for passwordless login to the neo4j database.&lt;br&gt;&lt;br&gt;
--publish=7475:7474 --publish=7476:7473 --publish=7688:7687 forwards the default http, https and bolt ports to the desired ports. In this case, the http, https and bolt ports are forwarded to 7475, 7476 and 7688 respectively.&lt;br&gt;&lt;br&gt;
--volume=$HOME/neo4j/data:/data \&lt;br&gt;&lt;br&gt;
--volume=$HOME/neo4j/import:/import \&lt;br&gt;&lt;br&gt;
--volume=$HOME/neo4j/conf:/conf \&lt;br&gt;&lt;br&gt;
mount the desired locations for database creation or access.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: If you are running this command for the first time, it will create the folders mentioned in the --volume flags. Otherwise, it will mount the existing folders to the neo4j docker defaults.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If no error is returned, your neo4j server is running and has been mapped to the desired ports and folders.&lt;/p&gt;
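&lt;p&gt;A second server can then be started the same way: give the container a different name, a different set of host ports, and its own folders. The name, ports, and paths below are arbitrary examples, not values from my actual setup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --detach --name=my-neo4j-2 --rm --env=NEO4J_AUTH=none \
--publish=7575:7474 --publish=7576:7473 --publish=7788:7687 \
--volume=$HOME/neo4j-2/data:/data \
--volume=$HOME/neo4j-2/import:/import \
--volume=$HOME/neo4j-2/conf:/conf \
neo4j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;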

&lt;p&gt;&lt;strong&gt;Step 3. Check the docker and neo4j server running status.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
To check the current running docker session(s).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This should give you an output something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
1afa157d9caa neo4j "/sbin/tini -g -- ..." 36 minutes ago Up 36 minutes 7473/tcp, 0.0.0.0:7475-&amp;gt;7474/tcp, 0.0.0.0:7688-&amp;gt;7687/tcp my-neo4j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To terminate this session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker kill my-neo4j
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check neo4j running status: open a web browser and then navigate to&lt;/p&gt;

&lt;p&gt;&lt;a href="http://localhost:7475" rel="noopener noreferrer"&gt;http://localhost:7475&lt;/a&gt; (or the port that you have used for forwarding in step 2)&lt;/p&gt;

&lt;p&gt;It should render a page like the one below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frohitfarmer.github.io%2Fimages%2Fneo4j-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frohitfarmer.github.io%2Fimages%2Fneo4j-1.png" alt="Neo4j Login Page"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Change the bolt port to 7688 (or the port that you have used for forwarding in step 2), leave the password field empty (as we have asked for no authentication in step 2) and click connect.&lt;/p&gt;

&lt;p&gt;It should connect you to your default graph.db database which should look something like below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frohitfarmer.github.io%2Fimages%2Fneo4j-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Frohitfarmer.github.io%2Fimages%2Fneo4j-2.png" alt="Neo4j Connection Established"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>neo4j</category>
      <category>graphdatabase</category>
      <category>howto</category>
    </item>
  </channel>
</rss>
