July 2, 2009

Charset issues between iso-8859-1 and UTF-8 on Ubuntu

Category: TYPO3

By: Søren Andersen

I was given the task of moving a client's website from a development server to a live server. The servers were hosted by two different companies, so they didn't have the same configuration. That proved to be a challenge.

The consultancy that had developed the new website had done so using UTF-8 as the charset of choice. A week before the expected launch of the site, I had asked for a dump so I could test the move in advance. I didn't find any big issues at the time. Everything seemed to work nicely except for some minor (I thought) issues with charsets on an AJAX-based slideshow.

Then came the day when the website was to go online. Once again I got hold of a dump of the website. After some minor trouble, where I forgot to update the domain record, I got everything to work (I thought).

I put the website online without any trouble, and I asked a friend to test it for me. He immediately pointed out that in his browser the Danish characters Æ, Ø and Å didn't show up. I hurried to test this in my own browser, and I soon discovered the nightmare of anyone involved with computers: an inconsistent error, that is, inconsistent until you know what causes it.

Sometimes the charset would load correctly, at other times - on the same page - it wouldn't. In Firefox it was consistently showing the wrong charset, though. The problem was that Firefox decided to show the text as ISO-8859-1 even though the meta tag said UTF-8. It was getting late and I simply couldn't figure out what was going wrong, so I decided to pull the new site offline and ask the developing consultancy for help.
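
To show what this kind of misreading does to the text, here is a minimal Python sketch. It is only an illustration: the function name misread_as_latin1 is made up for this example and has nothing to do with the actual site. It encodes the Danish letters as UTF-8 and then decodes the same bytes as ISO-8859-1, which is roughly what a browser does when it is told the wrong charset.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1.

    Hypothetical helper for illustration only: it mimics a browser
    that receives UTF-8 bytes but is told they are ISO-8859-1.
    """
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode("iso-8859-1")


original = "Æ Ø Å"
garbled = misread_as_latin1(original)

# Each Danish letter is two bytes in UTF-8, so it turns into two
# characters when every byte is read as its own ISO-8859-1 character.
print(len(original), len(garbled))
print(ascii(garbled))

# Plain ASCII survives, which is why most of a page still looks fine.
print(misread_as_latin1("TYPO3"))
```

Since only the non-ASCII letters break, a page can look almost right at first glance, which makes this kind of error easy to miss in testing.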

This morning I got a reply saying that the error was probably that my webserver (Apache 2 on Ubuntu) was sending a Content-Type header with the charset set to ISO-8859-1, thus causing every standards-compliant browser (as in, not IE ;) ) to ignore the meta tag. He also sent me some links where I could read that this is a common problem on Debian (which is the foundation of Ubuntu). Apparently the solution would be to comment out a directive in the Apache configuration (/etc/apache2/apache2.conf) that said:
DefaultType text/plain

This directive makes Apache use text/plain as the MIME type for any content whose MIME type it can't determine. In this case, where realurl was used, the URLs had no file extension, so no MIME type could be determined from them. Changing this to:
#DefaultType text/plain
seemed to help a little, but not always. Sometimes everything would be shown correctly, at other times it would be wrong. I discovered that it was related to the extension static_filecache, which was enabled. Every time I made a non-cached hit to the site, Firefox would interpret the content as UTF-8. On a cached hit, Firefox would interpret it as ISO-8859-1.

Another email to the consultancy helped me once again. This time I was encouraged to look for a second configuration file. I immediately went into the Apache directory (/etc/apache2/), and there I found a folder called conf.d, which contained a file named "Charset". In this file there was only one directive:
AddDefaultCharset ISO-8859-1
Changing this to
#AddDefaultCharset ISO-8859-1
did the trick.

You see, this directive sets the charset to ISO-8859-1 on every KNOWN file type. Since static_filecache turns the pages into .html files, every cached hit was served as an HTML file with the charset ISO-8859-1, while every non-cached hit was an unknown file type, where the meta tag was used to determine the charset (resulting in UTF-8).
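
The precedence at work here can be sketched in a few lines of Python. This is a simplified model, not how any particular browser is implemented: the function names are invented for the example, and the two header values are my assumption of what the cached and non-cached hits looked like, not captured output from the server. The only point is that a charset in the Content-Type header wins over the meta tag, and the meta tag is consulted only when the header names no charset.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lowercase,
    # or None when the header carries no charset parameter.
    return msg.get_content_charset()


def effective_charset(content_type, meta_charset):
    """Simplified model: the HTTP header beats the HTML meta tag."""
    header_charset = charset_from_header(content_type)
    if header_charset is not None:
        return header_charset
    return meta_charset.lower()


# Assumed header on a cached hit (a static .html file, charset added by Apache).
print(effective_charset("text/html; charset=ISO-8859-1", "UTF-8"))  # iso-8859-1

# Assumed header on a hit where no charset was added, so the meta tag decides.
print(effective_charset("text/html", "UTF-8"))  # utf-8
```

With the AddDefaultCharset line commented out, the first case no longer occurs, and the meta tag is free to do its job on cached pages too.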

This was a little story about how you can spend several hours on a bug and end up using two well-placed #'s to solve the entire problem. I certainly hope that this will save someone else from spending as much time on a problem like this.

