DEV Community: Roberto Preste

Adding users to the sudo group

Roberto Preste — Tue, 28 May 2019 16:54:43 +0000

One of the most important things to do after setting up a new Linux server (or after taking over an existing one) is to create a new user, possibly with sudo powers. Sudo is a special Linux command that allows users to perform administrator tasks even if they are not system admins.

The main reason for having a sudo user (or sudoer) is that logging in as root is usually not desirable, since any mistake is made with full privileges, but we may still want to perform administrator tasks with a non-root user. Moreover, adding one or more users to the sudo group avoids the need to share root credentials, because a sudo command asks for the user’s own password, not root’s.

The privileges granted to the sudo group, along with any per-user rules and restrictions, are defined in the /etc/sudoers configuration file. Explaining this file and sudo usage in general is quite an extensive topic, so we will only cover the case where we want to create a new user (or we already have one) and add it to the sudoers.

Creating a new user

If you already have a fully functioning non-root user and you just want to give it sudo privileges, you can skip to the next section.

First of all, we may want to create a new user, which we will later add to the sudoers. In order to do this, we can use the following command in the terminal:

adduser <username>

A new user called <username> will be created, together with their own home folder, usually located in /home/<username>/. This new user will of course require a password, which we will need to type twice; the password will not be visible for security reasons.

The command will also prompt us for some basic information about the new user, such as name, telephone number, etc. It is possible to leave these fields blank, though it is recommended to at least fill in the name field.

Adding a user to the sudo group

It is possible to add a user to the sudo group without having to mess around with the /etc/sudoers file. This can be accomplished using the following command:

usermod -aG sudo <username>

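This command will add the user <username> to the sudo group, and that’s it. The -G option specifies the supplementary group, while -a appends it to the groups the user already belongs to instead of replacing them.

From now on (starting with the user’s next login), the <username> user will be able to access administrator privileges just by prepending sudo to any command and providing their own password.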
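To double-check that the change went through, we can look up the group membership from Python. This is just a minimal sketch using the standard grp and pwd modules (Unix only); the user_in_group helper and the alice username are made up for illustration, and the group is assumed to be called sudo, as in the command above.

```python
import grp
import pwd


def user_in_group(username, group_name="sudo"):
    """Return True if `username` belongs to `group_name`, else False."""
    try:
        group = grp.getgrnam(group_name)
        user = pwd.getpwnam(username)
    except KeyError:
        # Unknown group or unknown user on this machine.
        return False
    # Supplementary members are listed in gr_mem; the primary group is
    # stored as a numeric id in the user's passwd entry.
    return username in group.gr_mem or user.pw_gid == group.gr_gid


# "alice" is a hypothetical username; replace it with the real one.
print(user_in_group("alice"))
```

This only reads the system's user and group databases, so it can be run without root privileges.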

Phred quality score

Roberto Preste — Tue, 28 May 2019 16:06:39 +0000

Next Generation Sequencing techniques have brought new insights into -omics data analysis, mostly thanks to their reliability in detecting biological variants. This reliability is usually measured using a value called Phred quality score (or Q score).

The Phred score of a base is an integer value that represents the estimated probability of an error in base calling. Mathematically, a Q score is logarithmically related to the base-calling error probability P, and can be calculated using the following formula:

Q = -10 log10 P

In practice, a quality score of 20 means that there is a 1 in 100 chance that the base is incorrect; a quality score of 40 means the chance that the base is called incorrectly is 1 in 10000.

Since the Phred score is inversely related to the error probability, a higher Q score means a more accurate and reliable base call. Here is a useful table which shows this simple relationship:
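The formula can be checked with a few lines of Python. This is only a small sketch; the function names phred_to_error_prob and error_prob_to_phred are chosen here for illustration and do not come from any library.

```python
import math


def phred_to_error_prob(q):
    """Error probability P for a Phred score Q, i.e. P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score Q for an error probability P, i.e. Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print error probability and accuracy for a few common scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q={q}: P={p:g}, accuracy={(1 - p) * 100:g}%")
```

Going the other way, error_prob_to_phred(0.01) gives back a score of 20 (up to floating point rounding).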

Phred quality score	Probability of incorrect base call	Base call accuracy
10	1 in 10	90%
20	1 in 100	99%
30	1 in 1000	99.9%
40	1 in 10000	99.99%

In fastq files, Phred quality scores are usually represented using ASCII characters, so that the quality score of each base can be specified using a single character. While older Illumina data used an ASCII offset of 64 (ASCII_BASE 64), nowadays the offset of 33 (ASCII_BASE 33) is the standard encoding for NGS data:

Q Score	ASCII char	Q Score	ASCII char	Q Score	ASCII char	Q Score	ASCII char
0	!	11	,	22	7	32	A
1	"	12	-	23	8	33	B
2	#	13	.	24	9	34	C
3	$	14	/	25	:	35	D
4	%	15	0	26	;	36	E
5	&	16	1	27	<	37	F
6	'	17	2	28	=	38	G
7	(	18	3	29	>	39	H
8	)	19	4	30	?	40	I
9	*	20	5	31	@	41	J
10	+	21	6

Even though there are lots of Python, Biopython and stand-alone software tools for dealing with Phred quality scores, a simple command to convert an ASCII character to its corresponding quality score is the following (from the terminal):

python3 -c 'print(ord("<ASCII>")-33)'

Or, when working in a Python3 session:

print(ord("<ASCII>")-33)

In both cases, just replace <ASCII> with the actual ASCII character and that will do the trick.
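The same idea extends to a whole quality string. Here is a minimal sketch: decode_qualities is a name made up for this example, and the offset parameter defaults to 33 but can be set to 64 for older Illumina data.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    scores = [ord(char) - offset for char in quality_string]
    if any(score < 0 for score in scores):
        # A negative score usually means the offset is wrong for this data.
        raise ValueError("negative score: wrong offset for this quality string?")
    return scores


print(decode_qualities("!5?I"))  # [0, 20, 30, 40]
```

The printed values match the ASCII_BASE 33 table above: ! is 0, 5 is 20, ? is 30 and I is 40.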

Counting sequences in Fasta/Fastq files

Roberto Preste — Mon, 27 May 2019 18:05:57 +0000

A well-established bioinformatician usually has a handful of appropriate informatics tools to manipulate and analyse genomic data, for example counting sequences in a file.

Nonetheless, in some cases it may be useful to rely on standard Unix commands, for example when your trusty laptop is not available or you’re working on someone else’s machine.

FASTA files

A .fasta file is a simple plain text file in which every sequence is represented by a header line, beginning with > and containing the sequence identifier and details, followed by a number of lines containing the actual sequence:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL

>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

So if you want to count the number of sequences contained in a .fasta file, you can easily have it done using the grep command:

grep ">" file.fasta | wc -l

This line selects every line that contains a > character and then counts those lines. More specifically, the grep command prints all the lines containing > (in a FASTA file these are the header lines, since > does not occur in sequence data; the stricter pattern "^>" matches only lines that start with it), and its output is then piped to the wc (word count) command, which thanks to the -l option counts lines instead of words.

Another way of using grep on modern systems is to use the following command:

grep -c ">" file.fasta

The -c option will instruct the command to count the matching lines, instead of just printing them to the screen, without the need for wc -l as seen above.
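If Python happens to be available on the machine, the same count takes only a few lines. This is a sketch, not a replacement for a proper parser: count_fasta_sequences is a name invented here, and the demo writes a throwaway file with two made-up records just to have something to count.

```python
import os
import tempfile


def count_fasta_sequences(path):
    """Count FASTA records, i.e. lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Small demo on a temporary file with two made-up records.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">SEQUENCE_1\nMTEITAAMVK\n>SEQUENCE_2\nSATVSEINSE\n")

print(count_fasta_sequences(tmp.name))  # 2
os.remove(tmp.name)
```

Reading the file line by line keeps memory usage low even for large files.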

FASTQ files

It’s not uncommon to work with .fastq files too, which are similar to .fasta files but also report base qualities. In this case the > character, used to specify the beginning of a sequence in .fasta files, is replaced by @; however, searching for its occurrences as shown above may be misleading, because the @ character is also used as a quality score symbol and can therefore appear in (or even at the start of) a quality line.

There is a trick for counting sequences in a .fastq file anyway, and it’s related to the usual layout of this kind of file. Each sequence is represented by four lines: the first one is the sequence identifier, the second one is the actual sequence, the third one is a separator that usually contains only a + (optionally followed by the same identifier), and the last one contains the sequence quality scores:

@SEQ_ID1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

This means that counting the number of sequences is easier than expected, and will only require dividing the number of lines in the file by four. This can be done on Bourne shells using these commands:

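# LINES is reserved by some shells for the terminal height, so use another name
NLINES=`wc -l < file.fastq`
# One read takes four lines
READS=`expr $NLINES / 4`
echo $READS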

On modern shells, such as Bash, this can be done with a simple one-liner:

expr $(cat file.fastq | wc -l) / 4
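The divide-by-four trick is also easy to write in Python, with the added benefit of a sanity check on the line count. This is a sketch that assumes every record takes exactly four lines (no wrapped sequences, no trailing blank lines); count_fastq_reads is a name made up for this example, and the demo file simply repeats the record shown above three times.

```python
import os
import tempfile


def count_fastq_reads(path):
    """Count FASTQ reads, assuming exactly four lines per record."""
    with open(path) as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # The four-lines-per-read assumption does not hold for this file.
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


# One four-line record, taken from the example above.
record = (
    "@SEQ_ID1\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

# Small demo on a temporary file containing the record three times.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(record * 3)

print(count_fastq_reads(tmp.name))  # 3
os.remove(tmp.name)
```

Raising an error when the line count is not a multiple of four is a design choice: the shell commands above would silently round down instead.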

With these simple tricks, you can easily find the number of sequences in your .fasta or .fastq files, right from your Unix shell.
