Hi Dev Community,
I have over 100 files in the following structure (called a FASTA -- very familiar to those of you who spend time looking at genetic data):
Original File 1:
>Gene1_id1
GATCGATCCGA
ATGCAGTCCAG
>Gene2_id1
ATGCATGCAGC
ACTAGGCCACG
CCGTAGCGGAC
>Gene1_id2
TAGCTAGCAGT
TAGCTAGCCGA
Each of these ~100 files contains ~20,000 of these genes. The problem is that my files are organized such that Gene1 IDs are blended alongside Gene2 IDs.
For my analysis, I need all of my Gene1 IDs organized in one place. I will ideally end up with one file for each GeneX, like so:
Desired Final Gene1 File:
>Gene1_id1
GATCGATCCGA
ATGCAGTCCAG
>Gene1_id2
TAGCTAGCAGT
TAGCTAGCCGA
The length of sequence varies between genes and between individuals within genes, so I need all the lines below a header line and above the next header line to be associated with the header.
My current solution has been to take each file and then create a new file for each header line. So the first file creates three new files: one for >Gene1_id1, one for >Gene2_id1, and one for >Gene1_id2. From there, I was planning on re-organizing to suit my needs.
The problem with the above approach is that it has created ~800,000 similarly-named files which are killing my computer. There must be a better way.
Any advice on how to proceed? Thanks!!!
-Hannah
Top comments (7)
I made a couple of assumptions to generate a way to split the files by gene. Rather than doing this in bash, because that is far from my strong suit, I wrote a simple Ruby script that takes the raw files from one folder, parses them by gene, and writes the new files to a new folder. With this you only end up with one file for each gene.
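In case it helps as a starting point, here is a minimal sketch of that kind of splitter. It's in Python rather than Ruby just for illustration, and the folder names raw/ and by_gene/, plus the assumption that the gene name is everything before the first underscore in the header, are mine:

import glob
import os

os.makedirs("by_gene", exist_ok=True)        # output folder (name is an assumption)
handles = {}                                 # one open output file per gene

for path in glob.glob("raw/*.fasta"):        # input folder (name is an assumption)
    with open(path) as fasta:
        out = None
        for line in fasta:
            if line.startswith(">"):
                # assume headers look like >Gene1_id1: gene = text before the first "_"
                gene = line[1:].split("_", 1)[0]
                if gene not in handles:
                    handles[gene] = open(os.path.join("by_gene", gene + ".fasta"), "w")
                out = handles[gene]
            if out is not None:
                out.write(line)

for f in handles.values():
    f.close()

This keeps one file handle open per gene, which is fine here since there are only a handful of genes; each input file is read once, line by line, so memory use stays flat.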
From what you wrote, I assume that you are looking for an intermediate data structure to be able to further process your data, is that correct?
The second question is: do you strictly want to stick to plain files?
And finally, is searching or retrieving files an issue for you? (Is it enough to know the gene and the ID of a file, or do you want to be able to list all IDs for gene X?)
If you want to stick to one file per gene and ID, and the number of files inside one folder is an issue, you could create subfolders: either by gene name or, if that's still too many, break down by the first letter of the gene name first, then have a folder with the gene name, and finally the IDs in there.
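A rough sketch of that layout, assuming headers like >Gene1_id1 with a single underscore between gene and ID, and a hypothetical root folder by_gene/:

import os

def path_for(header, root="by_gene"):
    # Map a header like ">Gene1_id1" to by_gene/G/Gene1/id1.fasta
    # (the folder names and the one-underscore header format are assumptions).
    name = header.lstrip(">").strip()
    gene, seq_id = name.split("_", 1)
    return os.path.join(root, gene[0], gene, seq_id + ".fasta")

path = path_for(">Gene1_id1")
os.makedirs(os.path.dirname(path), exist_ok=True)   # create by_gene/G/Gene1/ as needed
print(path)                                         # by_gene/G/Gene1/id1.fasta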
If you want to be more flexible with what you do with your data, you can parse the data into a CSV file or a SQLite database. There you can have one column for the gene, one for the ID, and one for the actual genetic sequence. Then you can iterate over the rows (which is a bit more comfortable with SQLite) depending on your desired analysis.
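As a sketch of the SQLite idea, assuming a table with gene, ID, and sequence columns and a hypothetical input file genes.fasta with >Gene1_id1 style headers:

import sqlite3

con = sqlite3.connect("genes.db")
con.execute("CREATE TABLE IF NOT EXISTS sequences (gene TEXT, id TEXT, sequence TEXT)")

def insert(gene, seq_id, chunks):
    con.execute("INSERT INTO sequences VALUES (?, ?, ?)", (gene, seq_id, "".join(chunks)))

gene, seq_id, chunks = None, None, []
with open("genes.fasta") as fasta:                  # hypothetical input file name
    for line in fasta:
        line = line.strip()
        if line.startswith(">"):
            if gene is not None:
                insert(gene, seq_id, chunks)        # store the record we just finished
            gene, seq_id = line[1:].split("_", 1)   # assumes >Gene1_id1 style headers
            chunks = []
        elif line:
            chunks.append(line)
if gene is not None:
    insert(gene, seq_id, chunks)                    # don't forget the last record

con.commit()

# Everything for one gene then comes out of a single query:
for seq_id, sequence in con.execute("SELECT id, sequence FROM sequences WHERE gene = ?", ("Gene1",)):
    print(seq_id, sequence)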
If I got something wrong or you need to achieve something more with your data, I am happy to hear about it.
+1 for sqlite, will make your life easier down the line.
FYI, this is my sister.
Hope you get some helpful replies, Hannah!
What's the file format of each gene file? If it's something like a .csv, you could probably run each file through a script that detects the header line and then adds it to an array. Once you've run through all your files, you would have that array and could then write a new file like all_gene1.csv or something. Since you already have key-value pairs with the header and the genetic data, you have a good data structure to work off of. I think Ruby can handle this pretty well, but I'd have to play with the data in order to figure out if it's really possible, and you probably can't hand over the data so easily.
Edit: actually, the file format probably doesn't matter for Ruby. I think you can read line by line in Ruby regardless of file format.
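Sketched in Python rather than Ruby, that array-then-write idea might look like this; the folder name raw/, the output name all_gene1.fasta, and the >Gene1_ header prefix are all assumptions:

import glob

gene1_lines = []                               # every line belonging to a Gene1 record
for path in glob.glob("raw/*.fasta"):          # input folder name is an assumption
    with open(path) as fasta:
        keep = False
        for line in fasta:
            if line.startswith(">"):
                keep = line.startswith(">Gene1_")   # assumes >Gene1_id1 style headers
            if keep:
                gene1_lines.append(line)

with open("all_gene1.fasta", "w") as out:      # name borrowed from the all_gene1.csv idea
    out.writelines(gene1_lines)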
Same with Python's read-lines approach, though she said ~20K gene IDs per file, so I'm wondering how many lines each gene ID spans. If these genes are very large, I would run a Python or Ruby script that looks for the Gene1 IDs, adds them to an array, and then writes them to a new file. If memory becomes an issue, you will have to flush the array, say every 1M lines, or whatever number your system can handle.
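A minimal sketch of that flushing pattern, reusing the hypothetical raw/ folder and all_gene1.fasta output from the sketch above:

import glob

FLUSH_EVERY = 1_000_000                        # lines to buffer before writing; tune to your RAM
buffer = []

with open("all_gene1.fasta", "w") as out:      # hypothetical combined output file
    for path in glob.glob("raw/*.fasta"):
        with open(path) as fasta:
            keep = False
            for line in fasta:
                if line.startswith(">"):
                    keep = line.startswith(">Gene1_")
                if keep:
                    buffer.append(line)
                if len(buffer) >= FLUSH_EVERY:
                    out.writelines(buffer)     # flush the buffer to disk
                    buffer.clear()
    out.writelines(buffer)                     # write whatever is left over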
Thank you all! It's in a .fasta file which is basically a plain text file. This is all really interesting -- I'll definitely look into sqlite and how to convert the data into a CSV so it's easier to work with. Also seems like a good time to brush up on my python! :)