Ben Steadman

TPP Topic 21: Text Manipulation

This post originally appeared on

See the first post in The Pragmatic Programmer 20th Anniversary Edition series for an introduction.

Exercise 11

You’re rewriting an application that used to use YAML as a configuration language. Your company has now standardized on JSON, so you have a bunch of .yaml files that need to be turned into .json. Write a script that takes a directory and converts each .yaml file into a corresponding .json file (so database.yaml becomes database.json, and the contents are valid JSON).

Conversion between YAML and JSON can be done easily in Python using PyYAML and the standard library json module. PyYAML and json both convert basic Python objects (dict, list, str, int, etc.) to and from YAML and JSON respectively, and they provide almost identical APIs.

Algorithm outline:

  1. Find all YAML files in a given directory
  2. Load a YAML file from disk into a Python object using PyYaml
  3. Serialize the Python object to JSON
  4. Write the serialized JSON to a new .json file
  5. Delete the YAML file
  6. Repeat steps 2-5 for all YAML files from step 1

import json
from pathlib import Path

import yaml

def main(d: Path):
    # make sure we have a valid directory to search in
    assert d.exists()
    assert d.is_dir()

    for yaml_f in d.glob("*.yaml"):
        # load original YAML data
        with yaml_f.open() as f:
            data = yaml.safe_load(f)
        # write data to new json file
        json_f = d / f"{yaml_f.stem}.json"
        with json_f.open("w") as f:
            json.dump(data, f, indent=4)
        # delete original YAML file
        yaml_f.unlink()
        print(f"{yaml_f} -> {json_f}")

See on GitHub for the full CLI script.
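
The renaming in step 4 can also be expressed with pathlib alone. The following is a minimal sketch rather than part of the original script: json_path_for is a hypothetical helper name, and it uses Path.with_suffix instead of rebuilding the name from the stem. It only computes a path and touches no files.

```python
from pathlib import Path

def json_path_for(yaml_path: Path) -> Path:
    # hypothetical helper: swap the final suffix for ".json",
    # keeping the directory and the rest of the file name
    return yaml_path.with_suffix(".json")

print(json_path_for(Path("config/database.yaml")))
```

Only the final suffix is replaced, so a name such as app.prod.yaml would become app.prod.json.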

Exercise 12

Your team initially chose to use camelCase names for variables, but then changed their collective mind and switched to snake_case. Write a script that scans all the source files for camelCase names and reports on them.

In a real-world scenario where I just need to accomplish the task, I would definitely solve this with egrep:

$ egrep -rn --color "\b[a-z]+(([0-9])|([A-Z0-9][a-z0-9]+))+([A-Z])?" /path/to/source/directory

However, for educational purposes, I have implemented this functionality in a Python script along with unit tests. See on GitHub for the full CLI script and tests.

The task breaks down into the following high level steps:

  1. Iterate over a sequence of files (provided as arguments to the script)
  2. Iterate over the lines of each file
  3. Identify any camelCase strings in each line
  4. Output a report of each identified camelCase string

Steps 1 and 2 are trivial in Python:

import sys
from pathlib import Path
from typing import List

def main(files: List[Path]):
    for file in files:
        with file.open() as f:
            for line in f:
                ...  # do steps 3 and 4

if __name__ == "__main__":
    if len(sys.argv) == 1:
        # print the usage message to stderr and exit with a non-zero status
        sys.exit("Usage: ./ FILES...")

    main([Path(f) for f in sys.argv[1:]])

Step 3 can be achieved with a regular expression, as shown in the egrep example. For this exercise I am using the Google Java style guide definition of lower camelCase, not PascalCase.

import re

CAMEL_RE = re.compile(
    # 1st character must be lower case
    r"\b[a-z]+"
    # followed by a single digit
    r"((\d)"
    # OR upper case character/number followed by lower case characters or number
    r"|([A-Z0-9][a-z0-9]+))+"
    # final character *may* be upper case
    r"([A-Z])?"
)
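
The expression from the egrep example can be exercised directly to see which names it accepts. This is a small sketch for illustration: camel_names is a hypothetical helper that is not part of the script, and the sample identifiers are made up.

```python
import re

# same expression as the egrep example, written as a Python raw string
CAMEL_RE = re.compile(r"\b[a-z]+((\d)|([A-Z0-9][a-z0-9]+))+([A-Z])?")

def camel_names(text: str) -> list:
    # hypothetical helper: return the full text of every match
    return [m.group(0) for m in CAMEL_RE.finditer(text)]

print(camel_names("camelCase PascalCase snake_case value1"))
# ['camelCase', 'value1']
```

PascalCase is rejected because the first character must be lower case, and snake_case is rejected because an underscore is neither a digit nor the start of a hump.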

Using CAMEL_RE, the positions of camelCase substrings can be extracted from a string. A line and the positions of any camelCase substrings within it are represented by two NamedTuple objects: MatchGroup and Match. These are modelled on the standard library re.Match objects, which supply their data. This separates the concept of "a line of text with camelCase substrings" from "a regular expression that matches camelCase", so the underlying mechanism for finding camelCase substrings can be changed more easily if necessary.

from typing import Generator, Iterable, List, NamedTuple

class MatchGroup(NamedTuple):
    start: int
    end: int

class Match(NamedTuple):
    lineno: int
    groups: List[MatchGroup]
    line: str

def find_camel(lines: Iterable[str]) -> Generator[Match, None, None]:
    for i, line in enumerate(lines):
        groups = [MatchGroup(*m.span()) for m in CAMEL_RE.finditer(line)]
        if groups:
            yield Match(i, groups, line)

Example usage:

>>> lines = ["a camelCase line", "thisIs a line with two camelCase words", "a line without camel case"]
>>> for match in find_camel(lines):
...     print(match)
Match(lineno=0, groups=[MatchGroup(start=2, end=11)], line='a camelCase line')
Match(lineno=1, groups=[MatchGroup(start=0, end=6), MatchGroup(start=23, end=32)], line='thisIs a line with two camelCase words')

Step 4 (without installing third-party packages) involves using ANSI colour escape codes. To emulate the --color option of grep, each matched line should be output with the following format:

<optional purple filename>:<green line number>: <white text> <red camelCase match> <white text>...

For example:

/path/to/my/file.txt:5: this line has camelCase words as subStrings.

The 'pretty match' string is built up by iterating through each MatchGroup of a given Match. For each one, the slice of the original line containing the non-matched text before the MatchGroup is extracted, followed by the slice at the position of the MatchGroup, which is wrapped in the red escape code: \033[31m.

  • Note that after each use of a colour escape code, the colour is reset using \033[m

def pretty_match(m: Match, filename: str = None) -> str:
    """
    Build a 'pretty' string representation of `m`, with coloured text and line
    numbers; optionally prefixed with `filename`.

        - Line numbers = green
        - Matches = red
        - Filenames = purple
        - Non-matching text = white
    """
    pretty_name = f"\033[35m{filename}\033[m:" if filename else ""
    l = []
    prev = 0
    for g in m.groups:
        # text up until match in white, match in red
        l.append(f"{m.line[prev:g.start]}\033[31m{m.line[g.start:g.end]}\033[m")
        prev = g.end
    # remaining text after the final match
    l.append(m.line[prev:])
    return "".join([f"{pretty_name}\033[32m{m.lineno}\033[m:"] + l)
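
The escape-code pattern used above (colour code, text, reset) can be isolated in a small helper. This is only a sketch: colour and the constant names are hypothetical and do not appear in the original script, while the codes 31, 32 and 35 are the ones quoted in this post.

```python
RESET = "\033[m"
RED, GREEN, PURPLE = 31, 32, 35

def colour(text: str, code: int) -> str:
    # wrap `text` in a colour escape code and reset straight afterwards
    return f"\033[{code}m{text}{RESET}"

line = "a camelCase line"
# green line number, plain text, red match, plain text
print(colour("3", GREEN) + ":" + line[:2] + colour(line[2:11], RED) + line[11:])
```

Resetting after every coloured segment means non-matching text falls back to the terminal's default colour.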

Bringing it all together in the original main function:

def main(files: List[Path]):
    """
    Print a report on the locations of all camelCase strings in `files`. See
    `pretty_match` for output format.
    """
    show_filenames = len(files) > 1
    for file in files:
        with file.open() as f:
            for m in find_camel(f):
                print(pretty_match(m, filename=file if show_filenames else None))

Exercise 13

Following on from the previous exercise, add the ability to change those variable names automatically in one or more files. Remember to keep a backup of the originals in case something goes horribly, horribly wrong.

Again, see on GitHub for the full CLI script and tests.

Once a camelCase substring has been found, converting it to snake_case requires two steps:

  1. Insert an _ character between each camelCase 'hump'
    • camelCase -> camel_Case
    • camelCamelCase -> camel_Camel_Case
  2. Convert to lower case

Regex substitution using a capture group can be used for step 1: a 'hump' character is matched and captured, then an _ character is inserted before it by referencing the capture group in the substitution: '_\1'.
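
As a minimal illustration of the capture-group substitution, the sketch below uses a deliberately naive hump pattern that treats every upper case letter as a hump. NAIVE_HUMP_RE and naive_snake are hypothetical names for this example only, and the pattern is an assumption chosen for simplicity, not the one used in the script.

```python
import re

# deliberately naive: every upper case letter counts as a hump
NAIVE_HUMP_RE = re.compile(r"([A-Z])")

def naive_snake(word: str) -> str:
    # step 1: insert "_" before each captured hump; step 2: lower case
    return NAIVE_HUMP_RE.sub(r"_\1", word).lower()

print(naive_snake("camelCamelCase"))  # camel_camel_case
```

The naive pattern breaks down on runs of capitals: naive_snake("HTTPError") gives "_h_t_t_p_error", which is why a more selective expression is needed.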

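CONVERT_CAMEL_RE = re.compile(
    # pattern assumed from the two rules described in the comments below
    # match a normal 'hump' i.e. camel(C)ase or camel1(C)ase
    "((?<=[a-z0-9])[A-Z]"
    # match mid-string uppercase humps, ignoring existing underscores
    # i.e. CAMEL(C)ase or HTTP(E)rror
    "|(?!^)(?<!_)[A-Z](?=[a-z]))"
)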

def convert_camel_word(w: str) -> str:
    return CONVERT_CAMEL_RE.sub(r"_\1", w).lower()

Example usage:

>>> convert_camel_word("camelCase")
'camel_case'
>>> convert_camel_word("camelCamelCase")
'camel_camel_case'
>>> convert_camel_word("snakey_camelCase")
'snakey_camel_case'

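convert_camel_word is designed to convert a single camelCase word, not strings of arbitrary text containing camelCase words. For example, the whole string is lower-cased and names that are not camelCase are altered too:

>>> convert_camel_word("System.out.println(Arrays.toString(myArray));")
'system.out.println(_arrays.to_string(my_array));'

Instead of:

'System.out.println(Arrays.to_string(my_array));'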

To convert strings of arbitrary text, the original camelCase matching from exercise 12 is used to first find
the locations of individual camelCase strings. Once found, they can be passed to
convert_camel_word and inserted into the correct position of the original text.

# factored out of existing find_camel function
def find_match_groups(s: str) -> List[MatchGroup]:
    return [MatchGroup(*m.span()) for m in CAMEL_RE.finditer(s)]

def convert_camel_line(l: str) -> str:
    # find individual camelCase strings within `l`, working right to left so
    # that earlier match positions stay valid as underscores lengthen the line
    for g in reversed(find_match_groups(l)):
        # replace with snake_case equivalent
        l = l[0 : g.start] + convert_camel_word(l[g.start : g.end]) + l[g.end :]
    return l

def convert_camel(lines: Iterable[str]) -> Generator[str, None, None]:
    return (convert_camel_line(l) for l in lines)
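
An alternative to slicing by position is to let re.sub do the splicing by passing it a function. The sketch below is an assumption-laden illustration rather than the script's approach: HUMP_RE is a simplified hump pattern, and to_snake and convert_line are hypothetical names.

```python
import re

# same expression as the egrep example in exercise 12
CAMEL_RE = re.compile(r"\b[a-z]+((\d)|([A-Z0-9][a-z0-9]+))+([A-Z])?")
# simplified hump pattern, assumed for this sketch only
HUMP_RE = re.compile(r"(?<=[a-z0-9])([A-Z])")

def to_snake(word: str) -> str:
    return HUMP_RE.sub(r"_\1", word).lower()

def convert_line(line: str) -> str:
    # re.sub calls the function once per match and splices in its result,
    # so no manual bookkeeping of positions is needed
    return CAMEL_RE.sub(lambda m: to_snake(m.group(0)), line)

print(convert_line("System.out.println(Arrays.toString(myArray));"))
# System.out.println(Arrays.to_string(my_array));
```

Because only the matched camelCase words are passed to to_snake, names such as System and Arrays are left untouched.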

Applying the conversion to a file involves backing up the original file, converting
each line in turn and writing to a new file:

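def transform_camel(file: Path):
    """
    Transform all occurrences of camelCase strings in `file` to snake_case. The
    original file is renamed with a ".backup" extension to prevent data loss.
    """
    original = Path(str(file) + ".backup")
    # move the original aside, then write the converted lines under the old name
    file.rename(original)
    with original.open() as source:
        with file.open("w") as dest:
            dest.writelines(convert_camel(source))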
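
The backup-then-rewrite pattern can be tried in isolation against a temporary directory. This is a sketch under stated assumptions: rewrite_with_backup is a hypothetical helper, and str.upper stands in for the camelCase conversion so the example stays self-contained.

```python
import tempfile
from pathlib import Path
from typing import Callable

def rewrite_with_backup(file: Path, transform: Callable[[str], str]) -> Path:
    # move the original aside first, then write the transformed lines
    # back under the original name
    backup = Path(str(file) + ".backup")
    file.rename(backup)
    with backup.open() as source, file.open("w") as dest:
        dest.writelines(transform(line) for line in source)
    return backup

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.txt"
    target.write_text("Hello\nWorld\n")
    backup = rewrite_with_backup(target, str.upper)
    result = (target.read_text(), backup.read_text())

print(result)
```

Renaming before writing means the untouched original still exists under the ".backup" name if the conversion fails part-way through.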

To wrap it up, the existing main entry point delegates according to whether reporting
or transformation is required and an additional --convert command line argument is added. Command line argument parsing is now performed by argparse instead of manually inspecting sys.argv:

import argparse

def main(files: List[Path], convert=False):
    for f in files:
        if convert:
            transform_camel(f)
        else:
            report_camel(f, show_filenames=len(files) > 1)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=sys.modules[__name__].__doc__)
    parser.add_argument("files", nargs="+", help="source files to scan for camelCase")
    parser.add_argument(
        "--convert",
        action="store_true",
        help="perform camelCase to snake_case conversion",
    )
    args = parser.parse_args()
    main([Path(f) for f in args.files], convert=args.convert)
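
The argument handling can be checked without running the script from a shell, because parse_args accepts an explicit list of arguments. In this sketch build_parser is a hypothetical helper and the file names are made up.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # hypothetical helper mirroring the two arguments described above
    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+")
    parser.add_argument("--convert", action="store_true")
    return parser

args = build_parser().parse_args(["--convert", "a.py", "b.py"])
print(args.convert, args.files)  # True ['a.py', 'b.py']
```

With action="store_true", args.convert defaults to False when the flag is omitted, so reporting remains the default behaviour.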

CLI help text:

$ ./ -h
usage: [-h] [--convert] files [files ...]

Scan source files for camelCase strings, reporting (grep style) on locations
or converting to snake_case. During conversion, original files are renamed with
a ".backup" extension. Yes I'm aware this can probably be achieved with a bash

positional arguments:
  files       source files to scan for camelCase

optional arguments:
  -h, --help  show this help message and exit
  --convert   perform camelCase to snake_case conversion
