Date: June 17th, 2004
Cate: Geekism

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”
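
To see the numeric-entity part in isolation, here is a minimal sketch. The helper name collect_text is made up for illustration, and it assumes the xml extension is loaded and a PHP new enough to have closures. The input is plain ASCII, with the only interesting character written as a numeric entity:

```php
<?php
// Hypothetical helper (not part of Magpie or FoF): parse $xml and return
// all character data concatenated, with the target encoding set to UTF-8.
function collect_text($xml)
{
    $text = '';
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
    // The handler may fire several times for one text node, so append.
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data;
    });
    return xml_parse($parser, $xml, true) ? $text : false;
}

// &#233; is U+00E9, which in UTF-8 is the two bytes C3 A9.
echo bin2hex(collect_text('<p>caf&#233;</p>')), "\n";
```

If the entity was resolved into UTF-8, the hex dump ends in c3a9 rather than a single e9 byte.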

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
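
Wrapped up as a function, the lookup can be tried against a couple of prologs. This is only a sketch: the name declared_encoding is hypothetical, and the quotes and the leading question mark are escaped so the pattern is a valid single-quoted PHP string.

```php
<?php
// Hypothetical wrapper around the regex above: return the declared
// encoding in upper case, or "UTF-8" when the prolog does not name one.
function declared_encoding($xml)
{
    // \? matches a literal question mark; \' keeps the PHP string intact.
    $rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8";
}

echo declared_encoding('<?xml version="1.0" encoding="iso-8859-15"?><rss/>'), "\n";
echo declared_encoding('<?xml version="1.0"?><rss/>'), "\n";
```

The first call prints ISO-8859-15 and the second falls back to UTF-8.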

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence: all the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1, BIG5, or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  $parser = xml_parser_create($encoding);
} else {

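  // Stay NULL if mbstring is missing or the conversion fails.
  $encoded_source = NULL;

  if(function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  // mb_convert_encoding returns false for an unrecognized encoding.
  if($encoded_source != NULL) {
    $source = str_replace ( $m[0],'<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }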

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
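
For anyone who wants to poke at the conversion step on its own, here is a sketch of it pulled out into a function. The name xml_to_utf8 is hypothetical. It assumes the declared encoding is ASCII-compatible, so the prolog text survives the conversion unchanged. It also catches the exception that newer PHP versions throw for an unknown encoding, which is not something the PHP 4 code above had to worry about.

```php
<?php
// Hypothetical helper: return $source as UTF-8 XML when mbstring can
// convert it, or unchanged when it cannot.
function xml_to_utf8($source)
{
    $rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';
    if (!preg_match($rx, $source, $m)) {
        return $source;               // no declaration: XML default is UTF-8
    }
    $encoding = strtoupper($m[1]);
    if (in_array($encoding, array("UTF-8", "US-ASCII", "ISO-8859-1"))) {
        return $source;               // the parser handles these natively
    }
    if (!function_exists('mb_convert_encoding')) {
        return $source;               // mbstring is not installed
    }
    try {
        // Older PHP warns and returns false for an unknown encoding;
        // PHP 8 throws ValueError instead.
        $converted = @mb_convert_encoding($source, "UTF-8", $encoding);
    } catch (\Throwable $e) {
        $converted = false;
    }
    if ($converted === false || $converted === null) {
        return $source;               // give up, parse it as it came
    }
    $prolog = '<?xml version="1.0" encoding="utf-8"?>';
    return str_replace($m[0], $prolog, $converted);
}

// Byte A4 is the euro sign in ISO-8859-15; in UTF-8 it is E2 82 AC.
$in = '<?xml version="1.0" encoding="ISO-8859-15"?><p>' . "\xA4" . '</p>';
echo bin2hex(xml_to_utf8($in)), "\n";
```

The result can then be handed to a parser created with xml_parser_create("UTF-8"), as in the code above.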


Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are artificially restricted from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
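
A quick way to check that behaviour on a given install is a sketch like the following. The helper name text_as_utf8 is made up, and it assumes a PHP build with the xml extension where the empty-string form behaves as described above. It feeds the parser a small ISO-8859-1 document and dumps what comes out:

```php
<?php
// Hypothetical helper: let the parser auto-detect the source encoding
// (empty string), force UTF-8 output, and return the text it found.
function text_as_utf8($xml)
{
    $text = '';
    $parser = xml_parser_create("");
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
    // The handler may fire several times for one text node, so append.
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data;
    });
    return xml_parse($parser, $xml, true) ? $text : false;
}

// Byte E9 is e-acute in ISO-8859-1.
$latin1 = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf' . "\xE9" . '</p>';
echo bin2hex(text_as_utf8($latin1)), "\n";
```

If auto-detection and the UTF-8 target both took effect, the single e9 byte comes back as the two bytes c3 a9.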

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.
