Date: June 17th, 2004
Cate: Geekism
Tags:

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCIDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCIDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<?xml.*encoding=['"](.*?)['"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.

So the full code is now:

$rx = '/<?xml.*encoding=['"](.*?)['"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence. All the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double checking in the source code because it was just so surprising I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1 or BIG5 of SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<?xml.*encoding=['"](.*?)['"].*?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  $parser = xml_parser_create($encoding);
} else {

  if(function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  if($encoded_source != NULL) {
    $source = str_replace ( $m[0],'<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.


Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artifically from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.

410 Comments

  1. Les Les  
    November 18th, 2014
    REPLY))

  2. Great blog right here! Also your web site rather a lot up very fast!
    What host are you the usage of? Can I get your affiliate hyperlink to your host?

    I wish my site loaded up as quickly as yours lol

    1F

  3. November 19th, 2014
    REPLY))

  4. The vice president of marketing for Trojan, Bruce Weiss states that “People are more comfortable than ever talking about vibrators and the idea of having one. Would you leave your mate if you could use your toys. The movement of the material in the screen box is not relied on the dip angle of the screen face, but determined by the directional angle of the vibration.

    2F

    The majority of these American men are older than their wives.
    Once I was for a friend’s house, he had stepped out couple of minutes
    to talk to help you his girlfriend. It is not ladylike for Filipina girls
    to be involved in discussions about their relationships or marital affairs.

    3F

    If some one wishes expert view about blogging and site-building after that i propose him/her to
    pay a visit this website, Keep up the pleasant job.

    4F

    Salut cela vous dérangerait de me laisser savoir qui hébergeur vous utilisant ?
    J’ai chargé sur votre blog en 3 divers navigateurs et je dois dire ce blog charge beaucoup plus rapide vite alors plus .
    Pouvez-vous suggérer recommandent un bon hébergement fournisseur à un honnête
    juste prix ? Merci , je l’apprécie !

    5F

  5. November 21st, 2014
    REPLY))

  6. Simply want to say your article is as astonishing. The clearness
    in your post is simply great and i can assume you
    are an expert on this subject. Well with your permission let me to
    grab your RSS feed to keep updated with forthcoming
    post. Thanks a million and please keep up the gratifying work.

    6F

  7. November 21st, 2014
    REPLY))

  8. Everything is very open with a clear clarification of the issues.
    It was truly informative. Your website is very helpful.
    Many thanks for sharing!

    7F

  9. ddd ddd  
    November 22nd, 2014
    REPLY))

  10. They must be removed first and then wait 15 minutes to replace
    them. com trap particles and gases and send 250 cubic feet of fresh
    air into your home every 60 seconds. We haven’t been able to
    investigate those issues, but until then – you may just have to choose a DIY air purifier over a substantially
    higher cost store bought unit.

    8F

    It’s hard to find experienced people for this subject, however, you seem like you know what you’re talking about!
    Thanks

    9F

    Hello, all the time i used to check blog posts
    here early in the daylight, since i like to find out more and more.

    10F

Leave a Reply

 Name

 Mail

 Home

[Name and Mail is required. Mail won't be published.]