Replace Broken Characters in Text Copied from Copilot Using PowerShell or Python

#githubcopilot #powershell #python #ai

You may have noticed the following:

If you copy text from a Copilot response by selecting it in your web browser and pressing Ctrl+C, the text carries over fine when you paste it. However, if you click the Copy message button instead, some of the characters, in particular single and double quotes, hyphens, and em-dashes, end up looking janky (mojibake).

Example of broken text:

I'm referring to the web-based chat UI, copilot.microsoft.com, not GitHub Copilot. I haven't tested this in the GitHub version.

I assumed it was a character-encoding issue; web browsers and most applications assume UTF-8 Unicode encoding nowadays, but perhaps Copilot was using some legacy Windows encoding like CP1252. But it seems to be more nuanced than that.
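
For comparison, here is a minimal sketch of what a classic encoding mix-up looks like, assuming UTF-8 bytes are misread as CP1252. The function name `misread_utf8_as_cp1252` is made up for this illustration; it is not part of Copilot or of the scripts below. If your pasted text shows sequences like these, an encoding mismatch somewhere along the way is a plausible suspect.

```python
# Simulate classic mojibake: UTF-8 bytes decoded as Windows-1252 (CP1252).
# The function name is hypothetical, chosen only for this illustration.
def misread_utf8_as_cp1252(text):
    """Encode text as UTF-8, then decode those bytes as CP1252."""
    # errors="replace" avoids an exception for the few byte values
    # that Python's cp1252 codec leaves undefined (for example 0x9D).
    return text.encode("utf-8").decode("cp1252", errors="replace")


sample = "I\u2019m"  # "I'm" written with a curly apostrophe (U+2019)
garbled = misread_utf8_as_cp1252(sample)
print(garbled)  # prints: Iâ€™m
```

The three bytes that encode U+2019 in UTF-8 (E2 80 99) each become a separate CP1252 character, which is why one apostrophe turns into three odd-looking symbols.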

Here's Copilot's own explanation for why this happens:

It's almost always a Unicode normalization issue rather than a different encoding. The "Copy message" button tends to output characters like en-dashes, em-dashes, curly quotes, and non-breaking spaces in their full Unicode forms, while your browser's plain copy normalizes them to simpler ASCII-friendly equivalents.

Two patterns usually show up:

Smart punctuation: U+2013, U+2014, U+2018, U+2019, U+201C, U+201D.

Non-breaking spaces: U+00A0 or U+202F.
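
To check which of these characters are actually present in a piece of pasted text, something like the following sketch may help. The names `SUSPECTS` and `report_suspects` are my own for this example, and the character list simply combines the code points mentioned above with U+2011, which the scripts below also handle.

```python
import unicodedata
from collections import Counter

# Code points from the list above, plus U+2011 (non-breaking hyphen).
SUSPECTS = "\u2011\u2013\u2014\u2018\u2019\u201c\u201d\u00a0\u202f"


def report_suspects(text):
    """Count each suspect character in text, keyed by code point and name."""
    counts = Counter(ch for ch in text if ch in SUSPECTS)
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch)}": n
        for ch, n in counts.items()
    }


for label, n in report_suspects("a\u2014b\u00a0c\u2014").items():
    print(label, n)
```

Running this on a suspicious paste shows at a glance whether the problem characters are smart punctuation, non-breaking spaces, or both.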

So I wrote some quick scripts to fix this.

The Fix

!!!Important: both scripts overwrite the original file.

Note: both scripts replace em-dashes with " - " (a space, a regular hyphen, and another space). If you prefer something else, find the line that replaces \u2014 and substitute your preferred replacement.

Both versions of the script are meant to be run from the command line and expect a single argument, a path to a text file.

PowerShell

if (!$args[0]){
    write-host "Please provide a path to a TXT file."
    exit
}
$path = $args[0]
if (!(test-path $path)){
    write-host "Path $path does not exist"
    exit
}
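try {
    # -ErrorAction Stop makes read failures terminating, so the catch block actually runs
    $content = gc $path -encoding utf8 -ErrorAction Stop
} catch {
    write-host "Could not read $path"
    exit
}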
$content = $content -replace "\u2011", '-'
$content = $content -replace "\u2013", '-'
$content = $content -replace "\u2014", ' - '
$content = $content -replace "\u2018", "'"
$content = $content -replace "\u2019", "'"
$content = $content -replace "\u201C", '"'
$content = $content -replace "\u201D", '"'
set-content $path $content -encoding utf8

Python

#! /usr/bin/env python3

import sys, os

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print("Please provide a path to a TXT file.")
        sys.exit()
    path = sys.argv[1]
    if not os.path.exists(path):
        print("Path", path, "does not exist")
        sys.exit()
    try:
        with open(path, encoding='utf-8') as f:
            content = f.read()
    except (OSError, UnicodeDecodeError):
        # Catch only read and decoding failures, not every exception
        print("Could not read", path)
        sys.exit()
    content = content.replace("\u2011", '-')
    content = content.replace("\u2013", '-')
    content = content.replace("\u2014", ' - ')
    content = content.replace("\u2018", "'")
    content = content.replace("\u2019", "'")
    content = content.replace("\u201C", '"')
    content = content.replace("\u201D", '"')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
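
If you would rather keep all the replacements in one place, a table-driven variant is one option. This is a sketch, not a drop-in replacement for the script above: the names `REPLACEMENTS` and `clean` are mine, and mapping the two non-breaking spaces (U+00A0, U+202F) to a regular space is an extra assumption that the scripts above do not make.

```python
# Table-driven variant: one mapping, applied in a single pass.
# str.maketrans accepts single-character keys and string values of any length.
REPLACEMENTS = str.maketrans({
    "\u2011": "-",    # non-breaking hyphen
    "\u2013": "-",    # en-dash
    "\u2014": " - ",  # em-dash; change this value if you prefer something else
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u00a0": " ",    # no-break space (assumption: not handled by the scripts above)
    "\u202f": " ",    # narrow no-break space (same assumption)
})


def clean(text):
    """Return text with the characters above replaced by ASCII equivalents."""
    return text.translate(REPLACEMENTS)


print(clean("\u201cHi\u201d\u2014it\u2019s\u00a0ok"))  # prints: "Hi" - it's ok
```

Changing how em-dashes are handled then means editing a single dictionary entry rather than hunting for a particular `replace` call.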

Notes

For PowerShell:

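This only works with -replace, not the string Replace() method. The -replace operator treats its pattern as a regular expression, and it is the .NET regex engine, not PowerShell, that interprets \u2011 as a Unicode code point.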
The correct notation for referencing the Unicode characters is "\u2011" and not, for example, "`u2013" with a backtick (`), which is what Copilot originally suggested when I asked it to code this. (Incidentally, not to brag or anything, but neither script above is AI-generated.)

Copilot's response when I commented that it was leading me astray:

You're right on both counts:

In PowerShell, the correct escape form for a Unicode code point inside -replace is the string literal "\u2011".

The backtick form `u2011 only works in some contexts (mostly double-quoted strings), but not reliably inside .Replace() because .Replace() treats the string literally and doesn't interpret escape sequences.

So the reliable pattern is:
$text = $text -replace "\u2011", "-"
And yes - .Replace() does not interpret \uXXXX escapes at all, so it won't work there.
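
The same distinction can be reproduced in Python, which may make it easier to see what is going on. This is only an analogy, not PowerShell code: a Python raw string `r"\u2011"` plays the role of the PowerShell string "\u2011", since in both cases the string holds six literal characters (a backslash, a "u", and four digits) and only a regex engine turns them into a code point.

```python
import re

text = "non\u2011breaking"  # contains a real U+2011 (non-breaking hyphen)
pattern = r"\u2011"         # raw string: six literal characters, no escape processing

# Literal replacement looks for the six characters and finds nothing.
literal_result = text.replace(pattern, "-")

# The regex engine interprets \u2011 as the code point and matches it.
regex_result = re.sub(pattern, "-", text)

print(len(pattern))            # prints: 6
print(literal_result == text)  # prints: True
print(regex_result)            # prints: non-breaking
```

In the PowerShell script, -replace corresponds to `re.sub` here and the Replace() method corresponds to `str.replace`.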