`grep` has lots of options. This one's really a combination of three to do something nifty.
In the previous post, I was converting a bunch of `iso-8859-1` files to `utf-8`. Setting aside for the moment the fact that I was using `iconv` for this (read the fine man page!), you might perhaps be wondering: how did I know which files I wanted to convert?
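For context, the conversion itself looked roughly like this sketch - the filenames are just placeholders, and the temporary output file is there because `iconv` can't overwrite the file it's reading (check `iconv -l` for the exact encoding names on your system):

```sh
# Convert one file from iso-8859-1 to utf-8 (placeholder filenames)
iconv -f ISO-8859-1 -t UTF-8 old.txt > old.txt.utf8 && mv old.txt.utf8 old.txt
```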
First off, this is not magic. It presupposes that you know the files are `iso-8859-1` and that your locale is set to `utf-8`. The latter is easy enough - check your `LANG` environment variable and set it to something suitable if it doesn't already end in `.UTF-8` or `.utf8` (a detailed discussion of that is beyond the scope of this article).
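A quick sanity check might look like this - the `en_GB.UTF-8` value is only an example, pick one that `locale -a` actually lists on your system:

```sh
echo "$LANG"               # see what you've currently got
locale -a | grep -i utf    # list the UTF-8 locales available
export LANG=en_GB.UTF-8    # example only - use one from the list above
```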
The former is your problem. I can't really help you with it - data is a bunch of bytes plus an encoding you may or may not know :D So, assuming you have a reasonable level of certainty that these files are encoded as `iso-8859-1`, run this:
```sh
grep -axv '.*' mysuspectfile
```
It will return any lines with `iso-8859-1` characters that are not legal `utf-8`, which is to say pretty much anything with its high bit set.
Sidebar: before anyone jumps on me, yes, there are obscure combinations of high-bit-set `iso-8859-1` characters that are legal `utf-8`, but they are sufficiently unlikely in any normal text written by people not on weird psychoactive substances for this test to be pretty reliable. Reference here for more details.
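If you want to convince yourself before pointing this at real data, here's a purely illustrative way to do it - the filename is made up, and `\351` is just octal for 0xE9, which is 'é' in `iso-8859-1` but an invalid byte sequence in `utf-8`:

```sh
# Create a two-line file: one plain ascii line, one with a raw iso-8859-1 byte
printf 'plain ascii line\ncaf\351\n' > suspect.txt

# Only the line containing the raw 0xE9 byte should be printed
grep -axv '.*' suspect.txt
```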
Why does this work?
- `-a` says 'treat this file as printable text'
- `-v` says 'invert this match'
- `-x` says 'match the whole line against the pattern `.*`'
`.*` (because of `-a` and your locale) means 'any sequence of legal `utf-8` characters'. The `-x` requires every character in the line to match that pattern, and the `-v` will then spit out the lines for which that is not the case.
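Incidentally, this is why the locale matters: in the C locale every byte counts as a character, so `.` matches the high-bit bytes too and the check goes quiet. A quick way to see it, reusing the same placeholder file:

```sh
# In your utf-8 locale, invalid byte sequences fail to match '.':
grep -axv '.*' mysuspectfile          # flags the bad lines

# In the C locale every byte is a valid character, so every line matches
# '.*' and (barring stray NUL bytes) nothing is reported:
LC_ALL=C grep -axv '.*' mysuspectfile
```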
To generate just a list of offending files from a directory, then, we can use `-r` to recurse down the directory and `-l` to just report matching filenames.
```sh
grep -r -l -axv '.*' mydirectory
```
Bingo. All ready to throw at `xargs -P` :D
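To close the loop, the whole thing might end up looking something like this sketch - the `4` is an arbitrary parallelism level, the temp-file-then-`mv` dance is there because `iconv` can't overwrite its input, and the filename is passed as a positional argument so odd characters don't get re-parsed by the shell (filenames containing newlines would still confuse it):

```sh
# Find the offenders, then convert each one in place via a temp file
grep -r -l -axv '.*' mydirectory |
  xargs -P 4 -I '{}' sh -c \
    'iconv -f ISO-8859-1 -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' sh '{}'
```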