DEV Community

Discussion on: Advice on using a header in text to re-organize a large dataset

Molly Struve (she/her)

I made a couple of assumptions to come up with a way to split the files by gene.
Assumptions:

  • You are working with plain .txt files
  • You want the newly organized genes in plain .txt files (there are a million other formats you might want, but I figured this was the simplest)

Rather than doing this in bash, which is far from my strong suit, I wrote a simple Ruby script that reads the raw files from one folder, parses them by gene, and writes the new files to a new folder. With this you end up with exactly one file for each gene.

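require 'fileutils'

path = 'genes'
output_path = 'new_gene_files'
FileUtils.mkdir_p(output_path) # make sure the output folder exists before writing

Dir.foreach(path) do |filename|
  next if filename == '.' || filename == '..' || filename == '.DS_Store'
  gene_file = nil
  header = true # the first non-blank line of each block is treated as the gene header
  puts "working on #{filename}"

  File.foreach("#{path}/#{filename}") do |line|
    if line.strip.empty?
      # a blank line ends the current block; the next non-blank line is a header
      header = true
      gene_file&.close
      gene_file = nil
    elsif header
      # drop the "_id<digit>" suffix and the ">" marker to get the gene name
      gene_name = line.gsub(/_id\d/, '').gsub('>', '').strip
      # append mode, so blocks for the same gene from different raw files share one file
      gene_file = File.open("#{output_path}/#{gene_name}.txt", 'a')
      gene_file.puts line
      header = false
    else
      gene_file&.puts line
    end
  end

  # close the last gene file when the raw file does not end with a blank line
  gene_file&.close
end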
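
To make the file naming easier to check, here is a small sketch that wraps the same substitutions the script applies to a header line. The helper name gene_file_name and the sample headers are made up for illustration; I am assuming your headers look something like ">geneA_id1", so adjust the pattern if they do not.

```ruby
# Hypothetical helper (not part of the script above): applies the same
# gsub chain the script uses to turn a header line into a file name.
def gene_file_name(header_line)
  header_line.gsub(/_id\d/, '').gsub('>', '').strip + '.txt'
end

# Made-up header lines, assuming a ">name_id<digit>" layout
puts gene_file_name(">geneA_id1\n") # geneA.txt
puts gene_file_name(">geneA_id2\n") # geneA.txt (same gene, same file)
puts gene_file_name(">geneB_id1\n") # geneB.txt
```

Note that /_id\d/ only removes a single digit, so a header like ">geneA_id12" would end up in geneA2.txt. If your ids can have more than one digit, changing the pattern to /_id\d+/ is probably what you want.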