Date: June 17th, 2004
Cate: Geekism
Tags:

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”
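To make the numeric-entity part concrete, here is a minimal sketch. It assumes the xml extension is loaded; the one-element document and the variable names are made up for illustration and are not Magpie's code.

```php
<?php
// Hypothetical one-element document containing a numeric entity for "é".
$xml = '<t>&#233;</t>';

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

// Collect the parsed elements into $values.
xml_parse_into_struct($parser, $xml, $values);

// If the entity was resolved into UTF-8, the text is the two bytes c3 a9.
$text_hex = bin2hex($values[0]['value']);
echo $text_hex, "\n";
```

Dumping the bytes as hex sidesteps whatever charset your terminal happens to be using.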

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
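The same idea can be wrapped in a small helper. This is only a sketch: the function name detect_xml_encoding is hypothetical, and the pattern is a slightly tightened variant of the regex above (an assumption on my part, not what Magpie ships) that refuses to look past the end of the XML declaration.

```php
<?php
// Hypothetical helper: return the upper-cased encoding named in the XML
// declaration, or "UTF-8" when none is declared.
function detect_xml_encoding($xml) {
  // [^>]* keeps the match inside the declaration itself.
  $rx = '/<\?xml[^>]*\bencoding\s*=\s*["\']([^"\']+)["\']/i';

  if (preg_match($rx, $xml, $m)) {
    return strtoupper($m[1]);
  }
  return "UTF-8";
}

// A declared encoding is picked up and upper-cased.
$declared = detect_xml_encoding('<?xml version="1.0" encoding="iso-8859-15"?><rss/>');

// An "encoding" attribute on some later element is not mistaken for the prolog.
$fallback = detect_xml_encoding('<?xml version="1.0"?><item encoding="big5"/>');
```

The tightening matters mostly for feeds served as one long line, where a greedy .* could wander off into the document body.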

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence: all the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1, BIG5, or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

// Find the encoding declared in the XML prolog; XML defaults to UTF-8.
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if ($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  // PHP's XML parser handles these three natively.
  $parser = xml_parser_create($encoding);
} else {
  // Otherwise, try to convert to UTF-8 with mbstring before parsing.
  $encoded_source = NULL;

  if (function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  // On success, swap in a prolog that matches the converted text.
  if ($encoded_source != NULL) {
    $source = str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
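For anyone reading this on a newer PHP, the mbstring step can be isolated into a helper, sketched below. The function name convert_to_utf8 is hypothetical, and the sketch rests on one assumption: that newer PHP versions may throw a ValueError for an unknown encoding where older ones warn and return false, so both cases are handled.

```php
<?php
// Hypothetical helper: convert $source from $encoding to UTF-8, or return
// null when mbstring is missing or does not recognize the encoding.
function convert_to_utf8($source, $encoding) {
  if (!function_exists('mb_convert_encoding')) {
    return null; // mbstring is an optional extension
  }

  try {
    // @ hides the warning that older PHP versions emit for unknown encodings.
    $converted = @mb_convert_encoding($source, "UTF-8", $encoding);
  } catch (\ValueError $e) {
    // Assumed behavior on newer PHP: unknown encodings throw instead.
    return null;
  }

  if ($converted === false || $converted === null) {
    return null;
  }
  return $converted;
}

// In ISO-8859-15, the byte \xA4 is the euro sign.
$euro = convert_to_utf8("\xA4", "ISO-8859-15");
$unknown = convert_to_utf8("abc", "NOT-A-REAL-CHARSET");
```

A null return is the cue to give up and hand the untouched XML to the parser, as described above.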


Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
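Here is a small sketch of that auto-detect path in action. It assumes a libxml2-based xml extension, and the test document is made up. I have only reasoned through what it should do, and the behavior may differ between PHP versions, so treat it as something to try rather than a guarantee.

```php
<?php
// Hypothetical ISO-8859-1 document: the byte \xE9 is "é" in that charset.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>\xE9</t>";

// Empty string: let the parser detect the source encoding itself.
$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

xml_parse_into_struct($parser, $xml, $values);

// Hex dump of the element text, to check what actually came out.
$text_hex = bin2hex($values[0]['value']);
echo $text_hex, "\n";
```

If the detection and translation both work, the dump should show the UTF-8 bytes for "é" rather than the single ISO-8859-1 byte that went in.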

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.
