This is part of the Semicolon&Sons Code Diary - consisting of lessons learned on the job. You're in the encoding category.
Last Updated: 2024-11-23
A junior colleague had an HTML form that was oddly broken. The code looked perfect in the Chrome DevTools inspector, but the form wasn't functioning correctly (e.g. the default value didn't show in a form input, and the input's value was not passed to JavaScript).
When I inspected the code on my computer, I used the following vim setting to display characters that are similar to, but different from, ordinary whitespace as special characters:
" can be disabled with :set list! (or :set nolist)
:set list
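Depending on your vim's default listchars, a non-breaking space may or may not be displayed specially with :set list alone, so I also set listchars explicitly. The marker characters below are just my preference:
" show tabs, trailing spaces, and non-breaking spaces with visible markers
:set listchars=tab:>-,trail:.,nbsp:+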
In order to figure out what exact character I was dealing with, I typed ga
(mnemonic: get ascii) in normal mode when my cursor was over
the character. In the status bar, the following was displayed:
< > 160, Hex 00a0, Oct 240, Digr NS
In order, this shows: the character itself (between the angle brackets), its decimal value (160), its hex value (00a0), its octal value (240), and its vim digraph (NS, i.e. non-breaking space).
By using these precise representations, I was able to get rid of these characters throughout the whole file using a search and replace
:%s/\%u00a0/ /g
\%u is the prefix needed to represent a character by its hex code in a vim regex.
Generally speaking, one should not copy code from fancy editors due to the random crud they add in. Also, if you copy something from elsewhere (e.g. the internet), the encoding may not be what you think it is and you could be in for a surprise.
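If you would rather do this cleanup outside vim, a byte-level one-liner also works. This is just a sketch that assumes the file is UTF-8, where U+00A0 is encoded as the bytes 0xC2 0xA0 (broken.html is a placeholder filename):
# replace every UTF-8 encoded non-breaking space with a regular space, in place
perl -i -pe 's/\xC2\xA0/ /g' broken.html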
As for a general solution, one should always have a script to detect/remove odd whitespace characters. Here's one I found online:
#!/usr/bin/env bash
# Assume this is saved in a file called `removewhitespace`
# (and made executable with: chmod +x removewhitespace)
LC_ALL=en_US.UTF-8
# build a string containing the various unicode space-like characters
spaces=$(printf "%b" "\U00A0\U1680\U180E\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U200B\U202F\U205F\U3000\UFEFF")
# replace any of those characters with a plain space, line by line
# (IFS= keeps leading/trailing whitespace intact)
while IFS= read -r line; do
echo "${line//[$spaces]/ }"
done < "${1:-/dev/stdin}"
line="cats&dogs"
# Notice that the variable name, `line`, doesn't have a $ in front of it when inside the ${} structure
# i.e. the format is: ${stringToOperateOn/thingToRemove/thingToReplaceWith/}
echo "${line/&/*}"
=> "cats*dogs"
The last bit, ${1:-/dev/stdin}, means "expand to the parameter before the :- ($1, the first argument) if it is set and not blank, otherwise default to what comes after it (/dev/stdin)". It allows the program to act on a filename given as the first argument, or on standard input otherwise. So you can call it with removewhitespace myfile or cat myfile | removewhitespace.
Note that the \U00A0 escape with printf only works on bash version 4 and above. Luckily, it works fine with zsh too.
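For detection (as opposed to removal), grepping for the raw UTF-8 bytes of the non-breaking space is often enough. This sketch assumes a UTF-8 file and uses bash's $'...' quoting to pass the two bytes literally:
# list every line (with line number) containing a UTF-8 non-breaking space
grep -n $'\xc2\xa0' broken.html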
I added highlighting in vim:
" syntax match using a hex regex and store matches as `nonascii`
syntax match nonascii "[^\x00-\x7F]"
" highlight this nonascii group in a particular way
highlight nonascii guibg=Green ctermbg=2
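The same character class also works as a search pattern, so you can jump between any offending characters with n and N:
" search for the next character outside the 7-bit ASCII range
/[^\x00-\x7F]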
To insert such a character directly while in insert mode, type Ctrl-v u followed by the hex code - e.g. Ctrl-v u 00a0 for a non-breaking space, or Ctrl-v u 00a9 for the copyright sign: ©. Similarly, Ctrl-v followed by <tab> inserts a literal tab.
Which encoding does vim itself use? You can see it with :set encoding? - in practice it is almost always utf-8 internally. For an individual file, you can set the encoding it will be written with via :set fileencoding - the conversion is done with iconv when writing the file.
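For example, to make vim reread a file under a specific encoding and then write it back out as utf-8, something like this works (latin1 is just an example source encoding):
" reread the current file, interpreting its bytes as latin1
:e ++enc=latin1
" write it back out converted to utf-8
:set fileencoding=utf-8
:w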
But is the encoding set by vim or by the file itself?
Some files indicate their encoding with a header, such as a byte order mark. However, even when reading the header, you can never be sure what encoding a file is really using. For example, a file whose first three bytes are 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.
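Two quick ways to investigate from the shell: dump the first few bytes to check for a byte order mark, and ask file to guess the encoding (its answer is only a heuristic):
# show the first three bytes of the file in hex
head -c 3 broken.html | xxd
# let file(1) guess the encoding
file --mime-encoding broken.html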
ASCII itself has only 128 code points (it is a 7-bit encoding), therefore 160 is out of its range. Instead it comes under "extended ASCII" (8-bit encodings that keep ASCII in their lower half). There is no single official extended ASCII; instead there are many variants, and Unicode can be considered one of them, since its first 128 code points coincide with ASCII.
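To see this concretely, the single byte 160 (0xA0) is a non-breaking space in ISO-8859-1; converted to UTF-8 with iconv it becomes the two-byte sequence 0xC2 0xA0 - the same bytes we grepped for earlier:
# byte 0xA0 in ISO-8859-1 is a non-breaking space; in UTF-8 it becomes 0xC2 0xA0
printf '\xa0' | iconv -f ISO-8859-1 -t UTF-8 | xxd
# output begins: 00000000: c2a0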
Eventually, ISO standardized its own set of eight-bit ASCII extensions as ISO 8859. The most popular is ISO 8859-1, also called ISO Latin 1, which contains characters sufficient for the most common Western European languages. Variations were standardized for other languages as well: ISO 8859-2 for Eastern European languages and ISO 8859-5 for Cyrillic languages, for example.
Because the full English alphabet and the most-used characters in English are included in the seven-bit code points of ASCII, which are shared by most encodings (even many proprietary ones), English-language text is less damaged by interpreting it with the wrong encoding, whereas text in other languages can display as complete nonsense.