Transform Your Codebase into Comprehensive Documentation with Markdown

#chatgpt #markdown #ruby #ai

Introduction

Welcome to the age of AI. The world is moving at lightning speed towards artificial intelligence, and programmers now have an array of AI tools built into code editors like Zed, VS Code, and Cursor. These editors can analyze large codebases and help resolve issues or build features.

I've tested many of these editors, but sometimes even including the codebase in chat doesn't provide the full picture, which means the results often fall short or lack the quality the repository requires. The worst-case scenario arises when an AI chat starts producing circular problems: solving one problem introduces a new issue, and fixing that issue reintroduces the first problem.

The core challenge here is the limited access to the whole codebase due to constraints on the number of files or file sizes that AI tools can process.

Then there are the chat assistants themselves, such as Claude, Perplexity, and ChatGPT. These tools have come a long way since their launch. ChatGPT now lets you attach files to a chat, but it still doesn't accept zipped folders or entire repositories, so it cannot take your whole codebase into account. The same limitation applies to Claude and Perplexity. It would be incredibly useful to hand these tools our code in a condensed form: something the AI can read that isn't spread across hundreds of files.

Solution: A Ruby Script to Convert a Codebase into Markdown

Why Ruby?

The first question that comes to mind is: why Ruby?

Simply put, I like and work in Ruby. There is no ulterior or mind-boggling reason.

What Does This Ruby Script Do?

You pass the root folder of your codebase to the script, and it writes one or more Markdown files, each holding at most 100KB of content. The output starts with a project structure tree, followed by a code block with the content of each file. Files and folders listed in your .gitignore or in the script's ignore lists are excluded.
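
To make the "project structure tree" concrete, here is a small sketch of how a list of relative paths can become a nested bullet list. The names `build_tree` and `render_tree` are illustrative only; the full script further down does the equivalent work inside `process_directory` and `generate_tree_markdown`.

```ruby
# Illustrative helpers (not part of the original script).

# Turns ['app/models/user.rb', ...] into a nested hash; files map to nil.
def build_tree(paths)
  paths.each_with_object({}) do |path, tree|
    *dirs, file = path.split('/')
    node = dirs.inject(tree) { |current, dir| current[dir] ||= {} }
    node[file] = nil
  end
end

# Renders the nested hash as an indented Markdown bullet list.
def render_tree(tree, indent = '')
  tree.map do |name, children|
    line = "#{indent}- #{name}\n"
    children.is_a?(Hash) ? line + render_tree(children, indent + '  ') : line
  end.join
end

puts render_tree(build_tree(['app/models/user.rb', 'app/models/post.rb', 'Gemfile']))
# - app
#   - models
#     - user.rb
#     - post.rb
# - Gemfile
```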

Why Markdown and 100KB Files?

Markdown: Markdown is a lightweight markup language that works well for documentation. It’s text-based, making it easy to read and compatible with most personal knowledge management tools like Notion or Obsidian.
File Size Limitation: Some AI tools, especially Claude, do not read files larger than 100KB. Therefore, the script enforces this limit. You can adjust the limit by changing the `CHUNK_SIZE` constant in the script.
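
One detail worth keeping in mind: the script measures chunks with `String#length`, which counts characters, while upload limits are usually about bytes. For mostly-ASCII source code the two are nearly the same, but non-ASCII text takes more bytes than characters. The sketch below shows the difference and a simple after-the-fact check; `LIMIT_BYTES` and `oversized_files` are hypothetical names, and the 100,000-byte figure is an assumption you should adjust to the tool you use.

```ruby
LIMIT_BYTES = 100_000 # assumed upload limit, not taken from any tool's documentation

# Returns the paths whose size on disk exceeds the limit.
def oversized_files(paths, limit = LIMIT_BYTES)
  paths.select { |path| File.size(path) > limit }
end

text = 'é' * 10
puts text.length   # => 10 (characters)
puts text.bytesize # => 20 (bytes in UTF-8)
```

After generating the Markdown parts, something like `oversized_files(Dir.glob('*_part*.md'))` would list any part that still needs a smaller chunk size.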

The Ruby Script: `ruby_to_md.rb`

Below is the Ruby script that converts your codebase into Markdown files:

https://gist.github.com/sulmanweb/ee1541b1739b06db6695370cbc8a480d

require 'fileutils'
require 'digest'
ALWAYS_IGNORE = ['.git', 'tmp', 'log', '.ruby-lsp', '.github', '.devcontainer', 'storage', '.annotaterb.yml', 'public', '.cursorrules'].freeze
IGNORED_EXTENSIONS = %w[.jpg .jpeg .png .gif .bmp .svg .webp .ico .pdf .tiff .raw .keep .gitkeep .sample .staging].freeze
MAX_FILE_SIZE = 1_000_000 # 1MB
CHUNK_SIZE = 100_000 # 100KB
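def read_gitignore(directory_path)
  gitignore_path = File.join(directory_path, '.gitignore')
  return [] unless File.exist?(gitignore_path)
  # Drop surrounding whitespace, blank lines, and comment lines.
  File.readlines(gitignore_path).map(&:strip).reject { |line| line.empty? || line.start_with?('#') }
end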
def ignored?(path, base_path, ignore_patterns)
  relative_path = path.sub("#{base_path}/", '')
  return true if ALWAYS_IGNORE.any? { |dir| relative_path.start_with?(dir + '/') || relative_path == dir }
  return true if IGNORED_EXTENSIONS.include?(File.extname(path).downcase) || File.basename(path) == '.keep'
  ignore_patterns.any? do |pattern|
    File.fnmatch?(pattern, relative_path, File::FNM_PATHNAME | File::FNM_DOTMATCH) ||
      File.fnmatch?(File.join('**', pattern), relative_path, File::FNM_PATHNAME | File::FNM_DOTMATCH)
  end
end
def convert_to_markdown(file_path)
  extension = File.extname(file_path).downcase[1..]
  format = extension.nil? || extension.empty? ? 'text' : extension
  begin
    content = File.read(file_path, encoding: 'UTF-8')
    "## #{File.basename(file_path)}\n\n```#{format}\n#{content.strip}\n```\n\n"
  rescue StandardError => e
    "## #{File.basename(file_path)}\n\n[File content not displayed: #{e.message}]\n\n"
  end
end
def generate_tree_markdown(tree, prefix = '')
  result = ''
  tree.each do |key, value|
    result += "#{prefix}- #{key}\n"
    result += generate_tree_markdown(value, prefix + '  ') if value.is_a?(Hash)
  end
  result
end
def write_chunked_output(output_file, content)
  base_name = File.basename(output_file, '.*')
  extension = File.extname(output_file)
  dir_name = File.dirname(output_file)
  chunk_index = 1
  offset = 0
  while offset < content.length
    chunk = content[offset, CHUNK_SIZE]
    chunk_file = File.join(dir_name, "#{base_name}_part#{chunk_index}#{extension}")
    File.open(chunk_file, 'w:UTF-8') do |file|
      file.write("---\n")
      file.write("chunk: #{chunk_index}\n")
      file.write("total_chunks: #{(content.length.to_f / CHUNK_SIZE).ceil}\n")
      file.write("---\n\n")
      file.write(chunk)
    end
    puts "Markdown file created: #{chunk_file}"
    offset += CHUNK_SIZE
    chunk_index += 1
  end
end
def process_directory(directory_path, output_file)
  ignore_patterns = read_gitignore(directory_path)
  markdown_content = "---\nencoding: utf-8\n---\n\n# Project Structure\n\n"
  file_contents = []
  file_tree = {}
  Dir.glob("#{directory_path}/**/*", File::FNM_DOTMATCH).each do |file_path|
    next if File.directory?(file_path)
    next if ['.', '..'].include?(File.basename(file_path))
    next if ignored?(file_path, directory_path, ignore_patterns)
    next if File.size(file_path) > MAX_FILE_SIZE
    relative_path = file_path.sub("#{directory_path}/", '')
    parts = relative_path.split('/')
    current = file_tree
    parts.each_with_index do |part, index|
      if index == parts.size - 1
        current[part] = nil
      else
        current[part] ||= {}
        current = current[part]
      end
    end
    file_contents << convert_to_markdown(file_path)
  end
  markdown_content += generate_tree_markdown(file_tree)
  markdown_content += "\n# File Contents\n\n"
  markdown_content += file_contents.join("\n")
  write_chunked_output(output_file, markdown_content)
end
if ARGV.length != 2
  puts "Usage: ruby ruby_to_md.rb <input_directory> <output_file>"
  exit 1
end
input_directory = ARGV[0]
output_file = ARGV[1]
process_directory(input_directory, output_file)

Script Usage

The script requires two arguments:

Path to Codebase: The root directory of your project.
Output File Name: The base name for the generated Markdown files.

You can run the script, for example, as follows:

ruby ruby_to_md.rb ~/st/gradwinner gradwinner.md

This will create files like:

gradwinner_part1.md
gradwinner_part2.md
gradwinner_part3.md and so on.

Note: Files and folders listed in .gitignore will be ignored in the resultant documentation files.
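
Keep in mind that the script hands each .gitignore line to `File.fnmatch?`, which follows shell glob rules rather than the full gitignore syntax. The sketch below mirrors the two-step check from `ignored?` for a single pattern so you can try your own patterns; `FLAGS` and `matches_like_script?` are names made up for this illustration.

```ruby
FLAGS = File::FNM_PATHNAME | File::FNM_DOTMATCH

# Same two checks that ignored? applies to each .gitignore pattern.
def matches_like_script?(pattern, relative_path)
  File.fnmatch?(pattern, relative_path, FLAGS) ||
    File.fnmatch?(File.join('**', pattern), relative_path, FLAGS)
end

# With FNM_PATHNAME, '*' does not cross a '/', so the bare pattern fails here...
puts File.fnmatch?('*.log', 'log/development.log', FLAGS)        # => false
# ...but the '**/' prefixed variant matches at any depth.
puts matches_like_script?('*.log', 'log/development.log')         # => true
# A trailing-slash directory pattern does not match the files inside it.
puts matches_like_script?('node_modules/', 'node_modules/a.js')   # => false
```

If a directory from your .gitignore still shows up in the output, adding its name to `ALWAYS_IGNORE` is the simplest workaround, since that list is matched by path prefix rather than by glob.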

Key Script Features

ALWAYS_IGNORE: This constant lists folders and files that need to be ignored in addition to those specified in .gitignore.
IGNORED_EXTENSIONS: This constant lists the file extensions (e.g., images) that should not be included in the documentation.
CHUNK_SIZE: You can modify this constant to increase or decrease the amount of data in each Markdown file.
MAX_FILE_SIZE: Files larger than this size (1MB by default) are skipped so that a few very large files don't crowd out the rest of the documentation.
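
Because `write_chunked_output` cuts the combined text every `CHUNK_SIZE` characters, a file's code block can end up split across two parts. One possible variation, sketched here with a hypothetical `pack_sections` helper that is not in the original script, packs whole per-file sections greedily so that each part ends on a file boundary.

```ruby
# Greedily packs whole sections into chunks of at most `limit` characters.
# A single section longer than the limit still gets a chunk of its own and
# will exceed the limit, so this is a sketch rather than a strict guarantee.
def pack_sections(sections, limit)
  chunks = []
  current = +''
  sections.each do |section|
    # Start a new chunk when adding this section would overflow the current one.
    if !current.empty? && current.length + section.length > limit
      chunks << current
      current = +''
    end
    current << section
  end
  chunks << current unless current.empty?
  chunks
end

sections = ['a' * 60, 'b' * 30, 'c' * 50]
pack_sections(sections, 100).each_with_index do |chunk, index|
  puts "part #{index + 1}: #{chunk.length} characters"
end
# part 1: 90 characters
# part 2: 50 characters
```

In the script, the `file_contents` array built in `process_directory` already holds one string per file, so it would be the natural input for a helper like this.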

Note: Many chat tools limit uploads to 25 files at a time, so for a large codebase you may need to adjust CHUNK_SIZE or the ignore lists to stay within that limit.
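
As a rough guide for that adjustment, the smallest chunk size that keeps the output within a given number of files is the total length divided by the file count, rounded up. A minimal sketch, where `chunk_size_for` is a hypothetical helper and the 4,200,000-character total is an invented example figure:

```ruby
# Smallest chunk size (in characters) that keeps the output within max_files parts.
def chunk_size_for(total_length, max_files)
  (total_length.to_f / max_files).ceil
end

puts chunk_size_for(4_200_000, 25) # => 168000
```

A result above 100,000 means the codebase cannot satisfy both limits at once, in which case excluding more paths through `ALWAYS_IGNORE` or `IGNORED_EXTENSIONS` is probably the better lever.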

Conclusion

This Ruby script helps you convert a code repository into documentation in Markdown format, providing a structured overview and content breakdown. It’s an ideal solution when working with AI tools that have file size or file number limitations, making your codebase more accessible to them in a condensed and readable form.

If you have feedback or improvements to suggest, feel free to contribute!

Happy Coding!
