You may have noticed the following:
If you copy text from a Copilot response by selecting it in your web browser and pressing Ctrl+C, the text carries over fine when you paste it. However, if you click the Copy message button instead, some of the characters, in particular single and double quotes, hyphens, and em-dashes, end up looking janky (mojibake).
Example of broken text:
I'm referring to the web-based chat UI, copilot.microsoft.com, not GitHub Copilot. I haven't tested this in the GitHub version.
I assumed it was a character-encoding issue; web browsers and most applications assume UTF-8 Unicode encoding nowadays, but perhaps Copilot was using some legacy Windows encoding like CP1252. But it seems to be more nuanced than that.
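To make that initial hypothesis concrete, here is a minimal Python sketch of what classic mojibake looks like when UTF-8 bytes are decoded as CP1252. The function name `misdecode` is made up for this illustration, and the snippet only demonstrates the encoding-mismatch theory; it does not claim to reproduce what the Copy message button actually does.

```python
def misdecode(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode the bytes as CP1252."""
    # errors="replace" is needed because a few byte values (e.g. 0x9D)
    # have no character assigned in Python's cp1252 codec.
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    # U+2019 (right single quote) is the bytes E2 80 99 in UTF-8.
    print(misdecode("it\u2019s"))  # itâ€™s
```

Plain ASCII text passes through unchanged, which is why only the "fancy" punctuation ends up garbled under this kind of mismatch.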
Here's Copilot's own explanation for why this happens:
It's almost always a Unicode normalization issue rather than a different encoding. The "Copy message" button tends to output characters like en-dashes, em-dashes, curly quotes, and non-breaking spaces in their full Unicode forms, while your browser's plain copy normalizes them to simpler ASCII-friendly equivalents.
Two patterns usually show up:
Smart punctuation: U+2013, U+2014, U+2018, U+2019, U+201C, U+201D.
Non-breaking spaces: U+00A0 or U+202F.
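Before changing anything, it can help to check which of these code points a pasted text actually contains. The following is a rough Python sketch; `SUSPECTS`, `find_suspects`, and `report` are hypothetical names chosen for this example, and the character list is just the one quoted above.

```python
import unicodedata
from collections import Counter

# Code points named in the explanation above.
SUSPECTS = "\u2013\u2014\u2018\u2019\u201C\u201D\u00A0\u202F"


def find_suspects(text: str) -> Counter:
    """Count occurrences of each suspect character in text."""
    return Counter(ch for ch in text if ch in SUSPECTS)


def report(text: str) -> list[str]:
    """Return lines like 'U+2014 EM DASH x2', sorted by code point."""
    counts = find_suspects(text)
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)} x{n}"
        for ch, n in sorted(counts.items())
    ]
```

Calling `report()` on the pasted text gives a quick inventory, so you can decide which replacements you actually need.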
So I wrote some quick scripts to fix this.
The Fix
!!!Important: both scripts overwrite the original file.
Note: the script replaces em-dashes with " - " (a space, a regular hyphen, and another space). If you prefer something else, find the line that replaces \u2014 and change the replacement string to whatever you like.
Both versions of the script are meant to be run from the command line and expect a single argument, a path to a text file.
PowerShell
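# Require a path argument.
if (!$args[0]) {
    Write-Host "Please provide a path to a TXT file."
    exit 1
}
$path = $args[0]
if (!(Test-Path $path)) {
    Write-Host "Path $path does not exist"
    exit 1
}
try {
    # -ErrorAction Stop turns read errors into terminating errors, so catch runs.
    $content = Get-Content $path -Encoding utf8 -ErrorAction Stop
} catch {
    Write-Host "Could not read $path"
    exit 1
}
# -replace takes a regex pattern; the regex engine interprets the \uXXXX escapes.
$content = $content -replace "\u2011", '-'    # non-breaking hyphen
$content = $content -replace "\u2013", '-'    # en-dash
$content = $content -replace "\u2014", ' - '  # em-dash
$content = $content -replace "\u2018", "'"    # left single quote
$content = $content -replace "\u2019", "'"    # right single quote
$content = $content -replace "\u201C", '"'    # left double quote
$content = $content -replace "\u201D", '"'    # right double quote
# Overwrite the original file.
Set-Content $path $content -Encoding utf8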
Python
#!/usr/bin/env python3
import os
import sys

if __name__ == "__main__":
    # Require a path argument.
    if len(sys.argv) <= 1:
        print("Please provide a path to a TXT file.")
        sys.exit(1)
    path = sys.argv[1]
    if not os.path.exists(path):
        print("Path", path, "does not exist")
        sys.exit(1)
    try:
        with open(path, encoding='utf-8') as f:
            content = f.read()
    except (OSError, UnicodeDecodeError):
        print("Could not read", path)
        sys.exit(1)
    content = content.replace("\u2011", '-')    # non-breaking hyphen
    content = content.replace("\u2013", '-')    # en-dash
    content = content.replace("\u2014", ' - ')  # em-dash
    content = content.replace("\u2018", "'")    # left single quote
    content = content.replace("\u2019", "'")    # right single quote
    content = content.replace("\u201C", '"')    # left double quote
    content = content.replace("\u201D", '"')    # right double quote
    # Overwrite the original file.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(content)
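If overwriting the original makes you nervous, or you want a different em-dash replacement without editing the script, something like the following variant might work. It is only a sketch: `build_table`, `clean_file`, and the `.bak` suffix are names I made up for this example, and the two non-breaking-space entries are an addition that the scripts above do not handle.

```python
import shutil
from pathlib import Path


def build_table(em_dash: str = " - ") -> dict[int, str]:
    """Map each code point to its ASCII replacement for str.translate()."""
    return str.maketrans({
        "\u2011": "-",      # non-breaking hyphen
        "\u2013": "-",      # en-dash
        "\u2014": em_dash,  # em-dash (configurable)
        "\u2018": "'",      # left single quote
        "\u2019": "'",      # right single quote
        "\u201C": '"',      # left double quote
        "\u201D": '"',      # right double quote
        "\u00A0": " ",      # assumption: not handled by the scripts above
        "\u202F": " ",      # assumption: not handled by the scripts above
    })


def clean_file(path: str, em_dash: str = " - ") -> Path:
    """Copy path to path.bak, then rewrite path with ASCII punctuation."""
    src = Path(path)
    backup = src.with_name(src.name + ".bak")
    shutil.copy2(src, backup)  # keep the untouched original
    text = src.read_text(encoding="utf-8")
    src.write_text(text.translate(build_table(em_dash)), encoding="utf-8")
    return backup
```

For example, `clean_file("notes.txt", em_dash="--")` would leave the original in `notes.txt.bak` and use a double hyphen for em-dashes. A single `translate()` pass also avoids scanning the text once per character.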
Notes
For PowerShell:
- This only works with -replace, not the string Replace() method.
- The correct notation for referencing the Unicode characters is "\u2011" and not, for example, "`u2013" with a backtick (`), which is what Copilot originally suggested when I asked it to code this. (Incidentally, not to brag or anything, but neither script above is AI-generated.)
Copilot's response when I commented that it was leading me astray:
You're right on both counts:
- In PowerShell, the correct escape form for a Unicode code point inside -replace is the string literal "\u2011".
- The backtick form `u2011 only works in some contexts (mostly double-quoted strings), but not reliably inside .Replace() because .Replace() treats the string literally and doesn't interpret escape sequences.
So the reliable pattern is:
$text = $text -replace "\u2011", "-"
And yes - .Replace() does not interpret \uXXXX escapes at all, so it won't work there.
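The underlying distinction is who interprets the \uXXXX escape. PowerShell's -replace hands the pattern to the .NET regex engine, which understands the escape, while a literal string replace searches for the backslash sequence verbatim. The same effect can be sketched in Python as a rough analogy (this is not PowerShell; the variable names are made up for the example):

```python
import re

# Here Python itself turns \u2011 into the non-breaking hyphen character.
text = "non\u2011breaking"

# Raw string: the pattern is the six literal characters \ u 2 0 1 1.
# The regex engine interprets them as an escape for U+2011.
via_regex = re.sub(r"\u2011", "-", text)

# Same six literal characters, but str.replace() looks for them verbatim,
# finds no backslash in the text, and changes nothing.
via_literal = text.replace(r"\u2011", "-")
```

After running this, `via_regex` holds the cleaned string and `via_literal` is still identical to `text`.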
References
A similar PowerShell script. It uses [char]0x2013 notation, which may or may not work; I'd need to try it:
https://www.linkedin.com/posts/kyle-jones-312b17185_having-issues-with-unicode-characters-in-activity-7398192899500326914-fGQM/
This is where I found the correct notation for Unicode characters in PowerShell:
https://serverfault.com/questions/313695/powershell-2-how-to-strip-a-specific-character-from-a-body-of-ascii-text
This helped to figure out which Unicode code points correspond to which characters:
https://decodeunicode.org/en/u+2012