Date: June 17th, 2004
Cate: Geekism

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”
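To see the source/target pair in action with an encoding PHP does accept, here is a minimal sketch. The sample document is made up for illustration, and it assumes the xml extension is loaded. Note that on PHP 5 and later the manual says the input encoding is detected automatically, so there the first argument mostly matters on PHP 4.

```php
<?php
// Sample input (an assumption for illustration): one raw Latin-1 byte (0xE9)
// plus the same character written as a numeric entity.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\xE9 &#233;</t>";

// Source encoding goes to the constructor, target encoding is set as an option.
$parser = xml_parser_create("ISO-8859-1");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

// Collect the parsed elements; $values[0] describes the <t> element.
xml_parse_into_struct($parser, $xml, $values);

// Print the text as hex so the byte-level encoding is visible.
echo bin2hex($values[0]['value']), "\n";
```

Printing hex makes it easy to check the result by eye: é encoded as UTF-8 is the two bytes c3 a9, so that is what to look for in place of both the raw byte and the entity.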

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

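// Find the encoding declared in the XML prolog, e.g. encoding="ISO-8859-1"
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  // No declaration found: fall back to the XML default
  $encoding = "UTF-8";
}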

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
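If you want to poke at the detection step on its own, it can be wrapped in a small function. This is only a sketch: the name detect_xml_encoding is hypothetical (it is not a MagpieRSS function), and the pattern is a slightly tightened variation on the regex above that only looks at a prolog at the very start of the document and never reads past its closing bracket.

```php
<?php
// Hypothetical helper (not part of MagpieRSS): return the encoding named in
// the XML prolog, upper-cased, or "UTF-8" when no declaration is present.
function detect_xml_encoding($xml) {
    // Anchored at the start of the document; [^>]* keeps the match inside
    // the prolog instead of wandering into the rest of the feed.
    $rx = '/^\s*<\?xml[^>]*encoding=[\'"]([^\'"]+)[\'"][^>]*\?>/';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8"; // XML's default when nothing is declared
}

echo detect_xml_encoding('<?xml version="1.0" encoding="iso-8859-15"?><rss/>'), "\n";
echo detect_xml_encoding('<rss version="2.0"/>'), "\n";
```

The first call prints ISO-8859-15 and the second prints UTF-8. A regex like this cannot see a declaration in UTF-16 or other non-ASCII-compatible encodings, so treat it as a best-effort guess.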

So the full code is now:

// Find the encoding declared in the XML prolog
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence: all the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1, BIG5, or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

// Find the encoding declared in the XML prolog
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if ($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  // PHP can handle these natively
  $parser = xml_parser_create($encoding);
} else {
  // Try to convert to UTF-8 ourselves; mbstring is an optional extension
  $encoded_source = NULL;
  if (function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  // On success, swap in a prolog that matches the new encoding
  if ($encoded_source != NULL) {
    $source = str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
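The mbstring fallback can also be pulled out into a function of its own, which makes the "give up and pass it straight through" behavior easier to see. This is a sketch, not MagpieRSS code: the name recode_to_utf8 and its parameters are hypothetical. It also guards against something that postdates this post: on PHP 8 and later, mb_convert_encoding throws a ValueError for an unknown encoding name instead of returning false.

```php
<?php
// Hypothetical helper, not MagpieRSS code: re-encode $source to UTF-8 and
// rewrite its prolog, or return it untouched when conversion is impossible.
function recode_to_utf8($source, $encoding, $prolog) {
    if (!function_exists('mb_convert_encoding')) {
        return $source; // mbstring is optional and may be missing
    }
    try {
        $converted = @mb_convert_encoding($source, "UTF-8", $encoding);
    } catch (\ValueError $e) {
        return $source; // PHP 8+ throws on an unknown encoding name
    }
    if ($converted === false || $converted === null) {
        return $source; // older PHP versions signal failure this way
    }
    // Replace the old prolog so the declaration matches the new bytes.
    return str_replace($prolog, '<?xml version="1.0" encoding="utf-8"?>', $converted);
}

// Made-up sample: 0xA4 is the euro sign in ISO-8859-15.
$prolog = '<?xml version="1.0" encoding="ISO-8859-15"?>';
$feed = $prolog . "<price>5 \xA4</price>";
echo recode_to_utf8($feed, "ISO-8859-15", $prolog), "\n";
```

Returning the original string on every failure path keeps the caller simple: it can hand whatever comes back to the parser without checking what happened.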

Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.

