Date: June 17th, 2004
Cate: Geekism
Tags:

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
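As a rough sketch, the detection step can be wrapped in a small function. The name detect_xml_encoding is made up for illustration (it is not Magpie's), and this version assumes the declaration sits at the very start of the document with no byte-order mark in front of it. It anchors the pattern there so that an encoding="..." attribute later in the feed can't match by accident.

```php
<?php
// Hypothetical helper: returns the declared encoding in upper case,
// or "UTF-8" (the XML default) when no declaration is found.
function detect_xml_encoding($xml)
{
    // Only look inside an XML declaration at the start of the document.
    $rx = '/^\s*<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']/i';

    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8";
}
```

Called on a feed that starts with <?xml version="1.0" encoding="iso-8859-15"?>, this returns "ISO-8859-15"; called on a feed with no declaration at all, it returns "UTF-8".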

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence: all the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1, BIG5, or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  $parser = xml_parser_create($encoding);
} else {

  // Make sure the variable exists even when mbstring is not installed.
  $encoded_source = NULL;
  if(function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  if($encoded_source != NULL) {
    $source = str_replace ( $m[0],'<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
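For anyone who wants the conversion step on its own, here is a minimal sketch of it as a standalone function. The name xml_source_to_utf8 is hypothetical, and unlike the PHP 4 code above this version assumes PHP 5.3 or later, because it uses try/catch to cope with newer PHP versions that throw a ValueError for an unknown encoding instead of returning false. It also assumes the XML declaration is the first thing in the document.

```php
<?php
// Hypothetical helper, not part of MagpieRSS: returns the XML re-encoded as
// UTF-8 when mbstring can do it, or the original string untouched otherwise.
function xml_source_to_utf8($source)
{
    $rx = '/^\s*<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\'][^>]*\?>/i';

    // No declared encoding: XML defaults to UTF-8, so leave it alone.
    if (!preg_match($rx, $source, $m)) {
        return $source;
    }

    // Encodings the PHP 4 parser handles natively need no conversion.
    $encoding = strtoupper($m[1]);
    if (in_array($encoding, array("UTF-8", "US-ASCII", "ISO-8859-1"))) {
        return $source;
    }

    // mbstring is an optional extension, so it may be missing entirely.
    if (!function_exists('mb_convert_encoding')) {
        return $source;
    }

    // Unknown encodings give false (plus a warning) on older PHP and a
    // ValueError on PHP 8; treat both as "give up".
    try {
        $converted = @mb_convert_encoding($source, "UTF-8", $encoding);
    } catch (\ValueError $e) {
        $converted = false;
    }
    if (!is_string($converted) || $converted === '') {
        return $source;
    }

    // Swap the old declaration for one that matches the new bytes.
    $prolog = '<?xml version="1.0" encoding="utf-8"?>';
    return preg_replace($rx, $prolog, $converted, 1);
}
```

The result can then be handed to a parser created with xml_parser_create("UTF-8"). When the function gives up, the caller gets the original bytes back, which matches the "avert your eyes" fallback described above.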


Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
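To see those two lines in context, here is a small sketch that parses a string and collects its character data as UTF-8. It assumes PHP 5.3 or later (it uses a closure) built against libxml2, and the function name xml_text_as_utf8 is invented for this example. I have not checked it against every PHP 5 release candidate, so treat it as an illustration rather than a guarantee.

```php
<?php
// Hypothetical helper: returns all character data in $xml as one UTF-8
// string, or false if the document fails to parse.
function xml_text_as_utf8($xml)
{
    // Empty string: let the parser auto-detect the source encoding.
    $parser = xml_parser_create("");
    // Ask for UTF-8 output, including resolved numeric entities.
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

    // The handler may be called several times per text node,
    // so append each chunk rather than overwrite.
    $text = "";
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data;
    });

    $ok = xml_parse($parser, $xml, true);
    return $ok ? $text : false;
}
```

Feeding it a document declared as ISO-8859-1 that contains both a raw accented byte and a numeric entity is a quick way to check that both come out as UTF-8.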

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.
