For this lab I've added another command-line argument, --remove-comments, which does just what it says: it removes comments.
This reduces the number of tokens an LLM needs to read the context packager's output, by cutting away unnecessary comments and showing just the code.
The idea came from Repomix, which has a similar argument that removes comments. It just made sense to include here (along with the removal of spaces).
My solution doesn't have too many similarities to Repomix's at this time. They use a manipulator pattern and separate classes for each file type.
class PythonManipulator extends BaseManipulator {
  removeDocStrings(content: string): string {
    if (!content) return '';
    const lines = content.split('\n');
    let result = '';
    let buffer = '';
    let quoteType: '' | "'" | '"' = '';
    let tripleQuotes = 0;
    const doubleQuoteRegex = /^\s*(?<!\\)(?:""")\s*(?:\n)?[\s\S]*?(?<!("""))(?<!\\)(?:""")/gm;
    const singleQuoteRegex = /^\s*(?<!\\)(?:''')\s*(?:\n)?[\s\S]*?(?<!('''))(?<!\\)(?:''')/gm;
I'd love to implement something like this eventually, but for now we broadly handle Python-style comments (#, """ and ''') and C-style comments (//, /* */).
Example of removing Python style comments line by line:
for line in lines:
    stripped = line.lstrip()  # Remove leading whitespace for checks
    # Skip # comment-only lines
    if stripped.startswith('#'):
        continue
    # If already in docstring, look for the closing quotation mark
    if docstring:
        if docstring in line:
            docstring = None
        continue
    # If line starts a docstring
    if stripped.startswith('"""') or stripped.startswith("'''"):
        quote = '"""' if stripped.startswith('"""') else "'''"
        # Check if docstring closes on same line
        if stripped.count(quote) >= 2:
            continue  # Single-line docstring, skip it
        else:
            docstring = quote
            continue
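The excerpt above leaves out the setup and the part that keeps the surviving lines. A self-contained sketch of the same idea might look like this. The function name `strip_comment_lines` is hypothetical, not the name used in the project, and the docstring check runs first so that a `#` line inside a docstring cannot hide the closing quotes.

```python
def strip_comment_lines(code: str) -> str:
    """Drop '#' comment-only lines and triple-quoted blocks that start a line."""
    kept = []
    docstring = None  # holds the open quote style while inside a docstring
    for line in code.splitlines():
        stripped = line.lstrip()
        if docstring:
            # Inside a docstring: skip lines until the closing quotes appear
            if docstring in line:
                docstring = None
            continue
        if stripped.startswith('#'):
            continue  # comment-only line
        if stripped.startswith('"""') or stripped.startswith("'''"):
            quote = stripped[:3]
            if stripped.count(quote) < 2:
                docstring = quote  # docstring continues on later lines
            continue
        kept.append(line)
    return "\n".join(kept)


sample = 'def f():\n    """Doc."""\n    # note\n    return 1\n'
print(strip_comment_lines(sample))
```

Because it only looks at how a line starts, this approach leaves `x = 1  # note` untouched, and it would wrongly drop a `#` line that sits inside a multi-line string assigned to a variable.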
Both implementations check the file type to decide which removal routine to use.
if file_extension == '.py':
    code = remove_hash_comments(code)
elif file_extension in ['.js', '.java', '.c', '.cpp']:
    code = remove_slash_comments(code)
elif file_extension == '.html':
    code = re.sub(r'<!--.*?-->', '', code, flags=re.DOTALL)
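To show how that dispatch could run end to end, here is a rough sketch. The wrapper `remove_comments` and the bodies of `remove_hash_comments` and `remove_slash_comments` are stand-ins written for this example (the real functions in the project may differ); only the extension checks and the HTML regex come from the snippet above.

```python
import os
import re


def remove_hash_comments(code: str) -> str:
    # Stand-in body: drop lines that contain only a '#' comment
    return "\n".join(
        line for line in code.splitlines()
        if not line.lstrip().startswith('#')
    )


def remove_slash_comments(code: str) -> str:
    # Stand-in body: drop /* ... */ blocks, then '//' comment-only lines
    code = re.sub(r'/\*.*?\*/', '', code, flags=re.DOTALL)
    return "\n".join(
        line for line in code.splitlines()
        if not line.lstrip().startswith('//')
    )


def remove_comments(path: str, code: str) -> str:
    """Pick a removal routine from the file extension; unknown types pass through."""
    file_extension = os.path.splitext(path)[1].lower()
    if file_extension == '.py':
        return remove_hash_comments(code)
    if file_extension in ['.js', '.java', '.c', '.cpp']:
        return remove_slash_comments(code)
    if file_extension == '.html':
        return re.sub(r'<!--.*?-->', '', code, flags=re.DOTALL)
    return code


print(remove_comments('app.js', '// hi\nlet x = 1; /* c */\n'))
```

Returning the code unchanged for unknown extensions is a choice made for this sketch, so that files the tool does not understand are never mangled.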
For next steps, one thing I didn't include was removing inline comments. This will require using a tokenizer or scanning each line character by character to determine whether it contains a comment, as outlined in this issue I've filed.
I'd also like to add a --remove-empty-lines feature, which is already implemented in commented-out code. Since it wasn't part of the scope of the original issue I was fixing, I left it commented out and created a new issue to add it as another argument.
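For Python files specifically, the standard library's `tokenize` module is one possible way to do the tokenizer approach. This is only a sketch under that assumption: the function name `strip_inline_comments` is made up for illustration, and other languages would need their own tokenizer or a character scan.

```python
import io
import tokenize


def strip_inline_comments(code: str) -> str:
    """Cut '#' comments found by the tokenizer, so '#' inside strings survives."""
    lines = code.split("\n")
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is the offset of '#'
            # Keep everything before the comment, minus trailing spaces
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines)


source = 'x = "# not a comment"  # real comment\ny = 2\n'
print(strip_inline_comments(source))
```

A comment-only line becomes an empty line here rather than disappearing, and `tokenize` raises an error on source it cannot tokenize (for example an unterminated triple-quoted string), so a real implementation would need a fallback.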
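As a rough idea of what that follow-up could look like, here is a minimal sketch. It assumes the tool parses its arguments with `argparse`, which may not match the project, and both `build_parser` and `remove_empty_lines` are hypothetical names.

```python
import argparse


def remove_empty_lines(code: str) -> str:
    """Drop lines that are empty or contain only whitespace."""
    return "\n".join(line for line in code.splitlines() if line.strip())


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Both flags are simple on/off switches
    parser.add_argument("--remove-comments", action="store_true")
    parser.add_argument("--remove-empty-lines", action="store_true")
    return parser


args = build_parser().parse_args(["--remove-empty-lines"])
code = "a = 1\n\n   \nb = 2\n"
if args.remove_empty_lines:
    code = remove_empty_lines(code)
print(code)
```

Keeping the two flags separate means comment removal can leave blank lines behind, and the second flag can then clean them up when both are passed.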
