A deep dive into the real pipeline behind CodeAtlas: AST parsing, import resolution, force-directed graphs, and everything in between.
When I join a new open source project, I do the same thing every time. I open the entry point, follow an import, open that file, follow another import, lose track of where I started, open the entry point again. Fifteen minutes later I have eight tabs open and a vague understanding of what the project does.
This is the default experience of reading code. It is also completely unnecessary.
Code is not text. Code is a graph. Every file has edges: its imports, pointing to other files. Every module sits inside a dependency structure that has a shape, and that shape tells you almost everything you need to know about how the codebase is organised. But virtually every tool we have forces us to experience that graph one node at a time, linearly, like reading a book.
I built CodeAtlas to fix this. It takes any GitHub repository URL, clones it, parses every file using two separate AST parsers, resolves every import to an actual file, builds a dependency graph, and renders it as an interactive force-directed visualisation. You can see the entire structure of a codebase in seconds, click any node to read the file in a Monaco editor, filter by depth, and understand architecture that would otherwise take hours to infer.
This post goes through every layer of how it actually works, using the real code.
(link to live demo on GitHub page)
The Architecture at a Glance
CodeAtlas has three distinct layers:
Backend (server.js) — receives a GitHub URL, clones it with simple-git, indexes the file tree, resolves imports, and returns a graph JSON object via a REST endpoint.
Parsers (parser.js, parser_py.py) — two separate parsing pipelines: one using Babel’s AST for JavaScript and TypeScript files, one using Python’s standard library ast module for Python files.
Frontend (App.tsx) — React application that calls the API, formats the graph data, runs a BFS traversal for focus mode, and renders everything via a D3 force simulation with Monaco editor for file inspection.
Let me go through each in detail.
The Backend: Cloning, Indexing, and Graph Construction
Cloning
The entry point is a POST /analyze endpoint in server.js. The first thing it does is clone the repository:
async function cloneRepo(repoUrl) {
  await fs.remove(TEMP_DIR);
  await fs.mkdir(TEMP_DIR);
  await simpleGit().clone(repoUrl, TEMP_DIR, ["--depth", "1"]);
}
The --depth 1 flag is critical. A shallow clone fetches only the latest commit, not the full history. For large repositories this is the difference between a 2-second clone and a 45-second clone. CodeAtlas never needs git history: it only needs the current state of the files, so shallow cloning is always correct here.
fs-extra's remove call before mkdir ensures the temp directory is clean before each clone. Without this, a previous failed run could leave stale files that contaminate the new analysis.
File Tree Indexing
After cloning, buildIndex walks the file tree and builds a map of every relevant file and its raw imports:
function buildIndex(dir) {
  const index = {};
  function walk(folder) {
    const items = fs.readdirSync(folder, { withFileTypes: true });
    for (const item of items) {
      if (IGNORE.has(item.name)) continue;
      const fullPath = path.join(folder, item.name);
      if (item.isDirectory()) {
        walk(fullPath);
        continue;
      }
      if (!item.name.match(/\.(js|ts|tsx|py)$/)) continue;
      try {
        const content = fs.readFileSync(fullPath, "utf8");
        const imports = [
          ...content.matchAll(/from\s+['"](.*?)['"]/g),
          ...content.matchAll(/require\(['"](.*?)['"]\)/g),
        ].map(m => m[1]);
        const rel = path.relative(dir, fullPath);
        index[rel] = { imports: imports.slice(0, 30) };
      } catch (e) {}
    }
  }
  walk(dir);
  return index;
}
The IGNORE set is doing important work here:
const IGNORE = new Set([
  "node_modules", "dist", "build",
  ".git", "coverage", ".next", ".cache"
]);
node_modules alone can contain tens of thousands of files. Including it would make the graph useless: you'd be visualising the entire npm ecosystem rather than the project's own code. dist and build are generated code that duplicates the source. .next contains Next.js build artefacts. None of these contain information about the project's architecture.
The withFileTypes: true option on readdirSync is a performance detail worth noting. It returns Dirent objects which already know whether each entry is a file or directory, avoiding a separate stat call per entry. On repos with thousands of files this is meaningfully faster.
Import Resolution
Raw import strings like ./utils need to be resolved to actual files. The resolveImport function handles this:
function resolveImport(file, imp, allFiles) {
  if (!imp.startsWith(".")) return null;
  const base = path.dirname(file);
  const possiblePaths = [
    path.normalize(path.join(base, imp)),
    path.normalize(path.join(base, imp + ".js")),
    path.normalize(path.join(base, imp + ".ts")),
    path.normalize(path.join(base, imp + ".tsx")),
    path.normalize(path.join(base, imp, "index.js")),
    path.normalize(path.join(base, imp, "index.ts")),
  ];
  for (const p of possiblePaths) {
    if (allFiles.has(p)) return p;
  }
  return null;
}
The first thing it does is discard any import that doesn't start with a dot. This filters out all third-party packages (react, lodash, express), which live in node_modules and aren't part of the project's own dependency graph. Only relative imports (starting with ./ or ../) represent relationships between the project's own files.
The resolution order tries the import path as-is first, then appends common extensions, then checks for index files inside a directory of that name. This mirrors the shape of Node.js's own module resolution (extension probing, then index files), extended with the TypeScript extensions that the Node runtime itself wouldn't resolve.
allFiles is a Set: each resolution check is O(1). Multiply that across potentially thousands of imports in a large repo and the total resolution step stays fast.
Graph Construction
Once the index is built, indexToGraph assembles the final data structure:
function indexToGraph(index) {
  const fileList = Object.keys(index);
  const fileSet = new Set(fileList);
  const nodes = fileList.map(id => ({ id }));
  const links = [];
  for (const file of fileList) {
    for (const imp of index[file].imports) {
      const resolved = resolveImport(file, imp, fileSet);
      if (resolved) {
        links.push({ source: file, target: resolved });
      }
    }
  }
  return { nodes, links, backLinks: {} };
}
The graph format (nodes, links, backLinks) is designed specifically for D3's force simulation on the frontend. nodes is an array of objects with an id. links is an array of { source, target } pairs using those same ids. backLinks is the reverse dependency index: for any given file, which files import it. (The regex-based backend path above returns it empty; the Babel parser, covered next, populates it.)
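For a toy three-file project, the assembled payload might look like this (hypothetical paths, with backLinks populated as the Babel parser would):

```javascript
// A hypothetical graph payload for a three-file project where
// index.js imports api.js, and api.js imports utils.js.
const graph = {
  nodes: [
    { id: "src/index.js" },
    { id: "src/api.js" },
    { id: "src/utils.js" },
  ],
  links: [
    { source: "src/index.js", target: "src/api.js" },
    { source: "src/api.js", target: "src/utils.js" },
  ],
  // Reverse index: for each file, who imports it.
  backLinks: {
    "src/api.js": ["src/index.js"],
    "src/utils.js": ["src/api.js"],
  },
};
```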
The JavaScript/TypeScript Parser: Babel AST
parser.js is the more powerful of the two parsers. Instead of regex, it uses Babel to parse source files into full Abstract Syntax Trees and then traverses those trees to extract imports.
An AST is a tree representation of source code where every construct (a function declaration, an import statement, a variable assignment) becomes a typed node. Parsing text into an AST is the first step every compiler and linter performs. Using ASTs means the parser understands code structure rather than matching text patterns.
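The difference matters in practice. The backend's regexes happily match import-like text inside comments and string literals, which an AST-based parser would never report as imports. A quick illustration, reusing the same two regexes as buildIndex:

```javascript
const source = `
// TODO: stop using require('./legacy')
const msg = "import x from './not-real'";
const real = require("./utils");
`;

// The same patterns buildIndex uses on raw file content.
const matches = [
  ...source.matchAll(/from\s+['"](.*?)['"]/g),
  ...source.matchAll(/require\(['"](.*?)['"]\)/g),
].map((m) => m[1]);

// The regexes report three "imports", but only "./utils" is real:
// './legacy' lives in a comment and './not-real' inside a string literal.
// An AST traversal only sees the one genuine require() call.
```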
Parsing Files
export function parseFile(filePath) {
  try {
    const code = fs.readFileSync(filePath, "utf-8");
    const ast = parser.parse(code, {
      sourceType: "unambiguous",
      plugins: [
        "typescript",
        "jsx",
        "dynamicImport",
        "classProperties",
      ],
      errorRecovery: true,
    });
    const imports = [];
    traverse.default(ast, {
      ImportDeclaration({ node }) {
        imports.push(node.source.value);
      },
      CallExpression({ node }) {
        if (
          node.callee.name === "require" &&
          node.arguments.length === 1 &&
          node.arguments[0].type === "StringLiteral"
        ) {
          imports.push(node.arguments[0].value);
        }
      },
    });
    return imports;
  } catch (err) {
    return [];
  }
}
Several configuration decisions here are worth explaining.
sourceType: "unambiguous" tells Babel to figure out whether the file is a CommonJS module or an ES module by looking at whether it contains any import/export statements, rather than requiring you to specify upfront. Real codebases are messy and mix both styles.
errorRecovery: true is essential in practice. Real codebases contain files that don’t parse cleanly: files with experimental syntax, partially written code, or syntax errors that have been introduced but not yet caught. Without error recovery, one bad file would crash the entire parsing pipeline for the whole repo. With it, Babel does its best and returns whatever AST it can construct from the valid portions.
The CallExpression handler catches require() calls. These show up differently in the AST from import statements (they're function calls rather than declarations), so they need their own handler. The check that node.callee.name === "require" and that the single argument is a StringLiteral ensures we only capture simple require('./path') patterns and not dynamic requires like require(getModuleName()).
Walking the Folder
parseFolderJS handles the file tree walk and graph construction for the JS/TS parser:
export function parseFolderJS(folderPath) {
  const graph = { nodes: [], links: [], backLinks: {} };
  const filesMap = {};

  function walk(dir) {
    const entries = fs.readdirSync(dir, { withFileTypes: true });
    for (let entry of entries) {
      if (IGNORE.has(entry.name)) continue;
      const fullPath = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(fullPath);
      } else if (entry.isFile() && fullPath.match(/\.(js|ts|jsx|tsx)$/)) {
        filesMap[fullPath] = fullPath;
      }
    }
  }
  walk(folderPath);

  for (let fullPath in filesMap) {
    const fileId = toRelative(fullPath);
    graph.nodes.push({ id: fileId });
    const imports = parseFile(fullPath);
    for (let imp of imports) {
      if (!imp.startsWith(".")) continue;
      let resolved = path.resolve(path.dirname(fullPath), imp);
      const possible = [
        resolved,
        resolved + ".ts",
        resolved + ".tsx",
        resolved + ".js",
        resolved + ".jsx",
        resolved + "/index.ts",
        resolved + "/index.tsx",
        resolved + "/index.js",
        resolved + "/index.jsx",
      ];
      const found = possible.find((p) => filesMap[p]);
      if (found) {
        const targetId = toRelative(found);
        graph.links.push({ source: fileId, target: targetId });
        if (!graph.backLinks[targetId]) graph.backLinks[targetId] = [];
        graph.backLinks[targetId].push(fileId);
      }
    }
  }

  // Deduplicate links
  graph.links = Array.from(
    new Set(graph.links.map((l) => `${l.source}->${l.target}`))
  ).map((str) => {
    const [source, target] = str.split("->");
    return { source, target };
  });

  return graph;
}
The filesMap object serves a dual purpose: it stores all file paths so they can be looked up during resolution, and its keys are exactly the absolute paths we need to check against during the possible.find() call.
Building backLinks inline during the main loop is efficient: each time a link is added forward (source → target), the reverse index is updated simultaneously. By the end of the loop, backLinks[file] contains every file that imports file, with no second pass needed.
The deduplication at the end handles a real edge case: a file might import from the same module in multiple ways, a static import at the top and a dynamic import inside a function, or two different named imports from the same module in separate import statements. Both would produce the same source → target edge. The Set-based deduplication collapses these into a single edge before the data reaches the frontend.
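The collapse is easy to see in isolation, using the same Set round-trip technique as the parser (file names here are hypothetical):

```javascript
const links = [
  { source: "a.js", target: "b.js" },
  { source: "a.js", target: "b.js" }, // duplicate: e.g. static + dynamic import
  { source: "b.js", target: "c.js" },
];

// Serialise each edge to a string key, dedupe via a Set, then parse back.
const deduped = Array.from(
  new Set(links.map((l) => `${l.source}->${l.target}`))
).map((str) => {
  const [source, target] = str.split("->");
  return { source, target };
});
// deduped has 2 edges: a.js->b.js appears once
```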
The Python Parser
parser_py.py handles Python repositories using Python’s own ast module:
def parse_file(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        try:
            tree = ast.parse(f.read(), filename=file_path)
        except Exception:
            return []
    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for name in node.names:
                imports.append(name.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.append(node.module)
    return imports
Python has two import syntaxes that map to different AST node types. import os produces an ast.Import node where node.names is a list of alias objects (there can be multiple: import os, sys). from pathlib import Path produces an ast.ImportFrom node where node.module is the module name being imported from.
The folder parser maps module names to file paths:
for imp in imports:
    target = imp.replace(".", "/") + ".py"
Python uses dots as namespace separators: from utils.helpers import something maps to utils/helpers.py. Replacing dots with slashes converts the module path back to a filesystem path. This is a heuristic: it works correctly for relative project imports but doesn’t distinguish between standard library imports (os, sys) and project files. Standard library modules simply won’t exist as files in the repo, so they produce dangling target nodes in the graph rather than edges to real files. This is an area for future improvement.
The Frontend: React, BFS, and D3
State Architecture
App.tsx is the application root and manages all state:
const [graphData, setGraphData] = useState<GraphData>(null);
const [selectedFile, setSelectedFile] = useState<string | null>(null);
const [fileContent, setFileContent] = useState<string | null>(null);
const [focusMode, setFocusMode] = useState(true);
const [depth, setDepth] = useState(2);
const [loading, setLoading] = useState(false);
graphData holds the raw graph from the API: the complete set of nodes and links for the entire repository. displayData is a derived version, computed by useMemo, that represents what's actually shown in the graph at any given moment based on focus mode and depth settings. Separating raw data from display data means toggling focus mode or changing depth never triggers a new API call: it just recomputes the view over the existing data.
The API URL switches automatically based on the Vite environment flag:
const API = import.meta.env.DEV
? "http://localhost:8080"
: "https://codeatlas-production-e4f8.up.railway.app";
API Response Normalisation
The handleAnalyze function contains some defensive normalisation worth explaining:
const graph = raw.graph ?? raw;
const formattedGraph: GraphData = {
  nodes: graph.nodes.map((n: any) =>
    typeof n === "string" ? { id: n } : n
  ),
  links: graph.links.map((l: any) => ({
    source: typeof l.source === "string" ? l.source : l.source?.id,
    target: typeof l.target === "string" ? l.target : l.target?.id,
  })),
  backLinks: graph.backLinks || {},
};
The raw.graph ?? raw fallback handles two different response shapes from the backend, one where the graph is nested under a graph key, one where it’s the root object. This kind of defensive normalisation is common when a frontend is evolving alongside its backend.
The source/target normalisation in the links map addresses a D3 behaviour: D3’s force simulation mutates link objects during the simulation, replacing string ids with references to the actual node objects. So after the simulation runs, link.source is no longer the string "src/App.tsx" but the node object { id: "src/App.tsx", x: 123, y: 456 }. The frontend normalises both forms everywhere it needs to compare or display link endpoints.
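Because this dual shape shows up everywhere, it is worth factoring into one helper rather than repeating the ternary. A small refactor sketch (endpointId is a hypothetical name, not code from the repo):

```javascript
// D3's force simulation may have replaced a string id with the node
// object it resolved to; accept either shape and return the id.
const endpointId = (end) => (typeof end === "string" ? end : end?.id ?? null);

endpointId("src/App.tsx");                         // "src/App.tsx"
endpointId({ id: "src/App.tsx", x: 123, y: 456 }); // "src/App.tsx"
endpointId(undefined);                             // null (defensive default)
```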
Focus Mode: BFS Traversal
The most technically interesting part of the frontend is the focus mode implementation using useMemo:
const displayData = useMemo(() => {
  if (!graphData) return null;
  if (!focusMode || !selectedFile) return graphData;

  const visited = new Set<string>();
  const queue: { id: string; level: number }[] = [
    { id: selectedFile, level: 0 },
  ];

  while (queue.length) {
    const { id, level } = queue.shift()!;
    if (visited.has(id) || level > depth) continue;
    visited.add(id);
    for (const l of graphData.links || []) {
      const source = typeof l.source === "string" ? l.source : l.source?.id;
      const target = typeof l.target === "string" ? l.target : l.target?.id;
      if (!source || !target) continue;
      if (source === id && !visited.has(target)) {
        queue.push({ id: target, level: level + 1 });
      }
      if (target === id && !visited.has(source)) {
        queue.push({ id: source, level: level + 1 });
      }
    }
  }

  return {
    nodes: (graphData.nodes || []).filter((n) => visited.has(n.id)),
    links: (graphData.links || []).filter((l) => {
      const s = typeof l.source === "string" ? l.source : l.source?.id;
      const t = typeof l.target === "string" ? l.target : l.target?.id;
      return s && t && visited.has(s) && visited.has(t);
    }),
    backLinks: graphData.backLinks || {},
  };
}, [graphData, focusMode, selectedFile, depth]);
This is a bidirectional BFS: it traverses both forward edges (files that the selected file imports) and backward edges (files that import the selected file) up to depth hops away. The level counter on each queue entry tracks how many hops from the origin each node is, and nodes beyond depth are not enqueued.
The result is a subgraph centred on the selected file that shows its immediate neighbourhood in the dependency graph. Depth 1 shows only direct imports and importers. Depth 2 shows imports of imports. Depth 5 shows almost everything reachable.
Using useMemo with [graphData, focusMode, selectedFile, depth] as dependencies means the BFS only re-runs when one of those values changes. The computation is pure: same inputs, same output, so memoisation is safe and effective.
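Because the computation is pure, the same traversal can be lifted into a plain function and tested without React. A sketch (neighbourhood is a hypothetical name; the repo keeps this logic inline in the useMemo):

```javascript
// Bidirectional BFS over { source, target } links, up to `depth` hops.
function neighbourhood(links, start, depth) {
  const visited = new Set();
  const queue = [{ id: start, level: 0 }];
  while (queue.length) {
    const { id, level } = queue.shift();
    if (visited.has(id) || level > depth) continue;
    visited.add(id);
    for (const l of links) {
      // Forward edge: files this node imports.
      if (l.source === id && !visited.has(l.target))
        queue.push({ id: l.target, level: level + 1 });
      // Backward edge: files that import this node.
      if (l.target === id && !visited.has(l.source))
        queue.push({ id: l.source, level: level + 1 });
    }
  }
  return visited;
}

const links = [
  { source: "a", target: "b" },
  { source: "b", target: "c" },
  { source: "d", target: "a" }, // d imports a: reached via the back edge
];
neighbourhood(links, "a", 1); // Set { "a", "b", "d" }
neighbourhood(links, "a", 2); // Set { "a", "b", "d", "c" }
```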
File Inspection
Clicking a node triggers both a local state update and an API call:
const handleNodeClick = async (id: string) => {
  setSelectedFile(id);
  const res = await fetch(`${API}/file?path=${encodeURIComponent(id)}`);
  const data = await res.json();
  setFileContent(data?.content || "");
};
The encodeURIComponent call is important: file paths can contain characters like +, #, or spaces that would corrupt a URL query parameter without encoding.
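For example, a path containing a + would otherwise be decoded as a space on the server (the file name below is made up):

```javascript
const id = "src/components/A+B Panel.tsx";
const url = `/file?path=${encodeURIComponent(id)}`;
// "/file?path=src%2Fcomponents%2FA%2BB%20Panel.tsx"

// Without encoding, "+" in a query string is conventionally decoded as a
// space, and a "#" would truncate the URL at the fragment marker.
```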
The file content is passed to Monaco editor, which provides VS Code-quality syntax highlighting and navigation in the browser:
<Editor
  height="100%"
  language={getLanguage(selectedFile)}
  value={fileContent}
  theme="vs-dark"
  options={{
    readOnly: true,
    minimap: { enabled: false },
    fontSize: 13,
  }}
/>
Language detection is handled by file extension:
const getLanguage = (file: string | null) => {
  if (!file) return "plaintext";
  if (file.endsWith(".ts") || file.endsWith(".tsx")) return "typescript";
  if (file.endsWith(".js") || file.endsWith(".jsx")) return "javascript";
  if (file.endsWith(".py")) return "python";
  return "plaintext";
};
Monaco uses this to apply the correct grammar for syntax highlighting, bracket matching, and token colouring.
The Sidebar
The sidebar shows imports and dependents for the selected file:
{/* IMPORTS */}
{(graphData.links || [])
  .filter((l) => {
    const s = typeof l.source === "string" ? l.source : l.source?.id;
    return s === selectedFile;
  })
  .map((l, i) => {
    const t = typeof l.target === "string" ? l.target : l.target?.id;
    return <li key={i}>→ {t}</li>;
  })}

{/* DEPENDENTS */}
{(graphData.backLinks?.[selectedFile] || []).map((f, i) => (
  <li key={i}>← {f}</li>
))}
Imports are computed by filtering graphData.links for edges where the selected file is the source. Dependents come directly from the backLinks index: an O(1) lookup rather than a scan over all links. This is why maintaining backLinks during graph construction matters: the sidebar is queried on every node click, and a linear scan over potentially thousands of links on each click would be noticeably slow.
What the Graph Reveals
Running CodeAtlas on real repositories produces genuinely interesting results.
The React source graph makes immediately visible what would take an hour of reading to infer. react-dom has the highest in-degree of any node, more files depend on it than anything else in the codebase. The reconciler (react-reconciler) sits nearly isolated, with very few incoming edges from the surrounding code. This is good architecture: the reconciler is the most complex part of React’s internals, and its isolation means changes to it have limited blast radius.
Codebases with tight coupling produce visually chaotic graphs: a dense hairball where every node connects to many others with no clear structure. Modular codebases produce the opposite: distinct clusters connected by sparse bridges, making module boundaries immediately visible.
Entry points are always obvious. The main entry file typically has the lowest in-degree (almost nothing imports it) and the highest out-degree (it imports everything else). It sits at the edge of the graph rather than the centre.
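Spotting these roles programmatically is a small fold over the links array. A sketch with hypothetical file names (degrees is not a function from the repo):

```javascript
// Count in-degree and out-degree per node from the links array.
function degrees(links) {
  const inDeg = {}, outDeg = {};
  for (const { source, target } of links) {
    outDeg[source] = (outDeg[source] || 0) + 1;
    inDeg[target] = (inDeg[target] || 0) + 1;
  }
  return { inDeg, outDeg };
}

const { inDeg, outDeg } = degrees([
  { source: "src/index.js", target: "src/app.js" },
  { source: "src/index.js", target: "src/router.js" },
  { source: "src/app.js", target: "src/utils.js" },
  { source: "src/router.js", target: "src/utils.js" },
]);
// src/index.js: out-degree 2, in-degree 0  -> looks like an entry point
// src/utils.js: in-degree 2                -> looks like a shared utility
```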
Current Limitations
JS/TS/Python only. The parsing pipeline is language-specific. Adding Go, Rust, or Java requires a different parser for each, though the rest of the pipeline is language-agnostic.
Import-level only. The graph shows file dependencies, not function-level dependencies. You can see that fileA depends on fileB, but not which specific functions in fileB are called. Function-level graphs are significantly more complex (you need to track symbol exports and resolve them across files) but would make the tool much more powerful.
Large repo performance. Repos above roughly 800 files start to lag in the browser because the D3 force simulation has to handle thousands of nodes and edges simultaneously. Chunked parsing on the backend is already in place. Frontend performance is the next bottleneck to address: lazy loading, only simulating the visible subgraph, is the planned approach.
Concurrent requests. The current backend uses a single shared TEMP_DIR. Two simultaneous requests would overwrite each other’s cloned repo. Per-request directories using unique IDs is the fix.
Why This Approach
The shift from reading code to exploring systems matters because the linear reading model scales poorly as codebases grow. A new contributor to a 500k-line codebase cannot read their way to understanding; they need tools that let them navigate at the right level of abstraction.
Code intelligence tools (language servers, linters, type checkers) have improved dramatically over the last decade. But they are all file-centric: they tell you things about the file you currently have open. What is mostly missing is a system-level view: how does this file relate to everything else? What does the codebase look like as a whole?
That is the gap CodeAtlas is trying to fill.
The repository is open source. If you try it on a codebase and the graph looks wrong, imports are missing, or something crashes, open an issue. The project is actively being developed.
If you found this useful, starring the repo is the best thing you can do: it helps other developers find it.
(link to live demo on GitHub page)