Date: June 17th, 2004
Cate: Geekism
Tags:

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”
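As a rough illustration of what the target encoding controls, here is a minimal sketch. It assumes PHP 5.3 or later (it uses closures, which the PHP 4 discussed in this post does not have), and the helper name `collect_text` and the sample document are made up for the example.

```php
<?php
// Hypothetical helper (not part of MagpieRSS): parse $xml and return all
// character data, delivered to the handler in the requested target encoding.
function collect_text($xml, $target)
{
    $text = '';
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $target);
    // The handler may fire several times for one text node, so append.
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data;
    });
    xml_parse($parser, $xml, true);
    return $text;
}

// &#233; is the numeric entity for U+00E9 (e with acute accent).
$xml = '<?xml version="1.0"?><t>caf&#233;</t>';

// Dump the bytes so the difference between the two targets is visible.
echo bin2hex(collect_text($xml, 'UTF-8')), "\n";
echo bin2hex(collect_text($xml, 'ISO-8859-1')), "\n";
```

The same entity should come out as two bytes when the target is UTF-8 and as a single byte when the target is ISO-8859-1.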

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
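The prolog scan can be wrapped in a small function and tried on a few inputs. This is only a sketch: the name `detect_xml_encoding` is invented here, and the pattern is a slightly stricter variant of the one above (it uses `[^>]*` so the match cannot run past the end of the prolog). It also assumes the document is in an ASCII-compatible encoding; a UTF-16 document would not match at all.

```php
<?php
// Hypothetical helper wrapping the prolog scan.
// Returns the declared encoding in upper case, or "UTF-8" if none is found.
function detect_xml_encoding($xml)
{
    // The backslashes escape the literal question marks of the XML
    // declaration; [^>]* keeps the match inside the declaration itself.
    $rx = '/<\?xml[^>]*encoding=[\'"](.*?)[\'"][^>]*\?>/';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';
}

echo detect_xml_encoding('<?xml version="1.0" encoding="iso-8859-15"?><rss/>'), "\n";
echo detect_xml_encoding('<?xml version="1.0"?><rss/>'), "\n";
echo detect_xml_encoding('<rss/>'), "\n";
```

The first call reports ISO-8859-15; the other two fall back to UTF-8 because no encoding is declared.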

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence: all the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1 or BIG5 or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  // PHP 4 handles these three natively
  $parser = xml_parser_create($encoding);
} else {
  // anything else: try to convert to UTF-8 before the parser sees it
  $encoded_source = NULL;

  if(function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  // on success, swap the old prolog for one that declares UTF-8
  if($encoded_source != NULL) {
    $source = str_replace ( $m[0],'<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
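To make the mbstring fallback concrete, here is a hedged sketch that isolates just the "convert and rewrite the prolog" step and runs it on an ISO-8859-15 document. The function name `xml_source_to_utf8` and the sample data are invented for illustration. It assumes PHP 7 or later, and unlike the code above it converts every declared non-UTF-8 encoding (including ISO-8859-1) rather than only the ones PHP cannot handle natively.

```php
<?php
// Hypothetical helper: return $source re-encoded as UTF-8 with a matching
// prolog, or the untouched $source if the conversion cannot be done.
function xml_source_to_utf8($source)
{
    $rx = '/<\?xml[^>]*encoding=[\'"](.*?)[\'"][^>]*\?>/';
    if (!preg_match($rx, $source, $m)) {
        return $source; // no declaration: XML defaults to UTF-8
    }
    $encoding = strtoupper($m[1]);
    if ($encoding === 'UTF-8' || !function_exists('mb_convert_encoding')) {
        return $source; // nothing to do, or mbstring is not installed
    }
    try {
        // Older PHP versions warn and return false for an unknown encoding.
        $converted = @mb_convert_encoding($source, 'UTF-8', $encoding);
    } catch (\ValueError $e) {
        // PHP 8 throws instead.
        $converted = false;
    }
    if ($converted === false || $converted === null) {
        return $source; // give up and let the parser do what it can
    }
    // The prolog is plain ASCII, so for ASCII-compatible encodings it
    // survives the conversion unchanged and can be swapped by string match.
    return str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $converted);
}

// Byte \xA4 is the euro sign in ISO-8859-15.
$latin9 = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?><t>5 \xA4</t>";
echo bin2hex(xml_source_to_utf8($latin9)), "\n";
```

If mbstring is available, the single euro byte should come back as the three-byte UTF-8 sequence and the prolog should declare utf-8; otherwise the input is returned unchanged.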


Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.
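The auto-detecting construction from the footnote can be tried on a current PHP install. The following is a sketch under assumptions: PHP 5.3 or later (for the closure), and a made-up ISO-8859-1 sample document containing one raw non-ASCII byte.

```php
<?php
// A Latin-1 document: byte \xE9 is e-acute in ISO-8859-1.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\xE9</t>";

$text = '';
// Empty string: ask the parser to detect the source encoding itself.
$parser = xml_parser_create('');
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'UTF-8');
// The handler may be called more than once per text node, so append.
xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
    $text .= $data;
});
xml_parse($parser, $xml, true);

// Dump the bytes handed to the handler.
echo bin2hex($text), "\n";
```

If detection and translation both work, the single Latin-1 byte arrives at the handler as a two-byte UTF-8 sequence.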

89 Comments

  1. Asha
    January 25th, 2006

    Hello,
    I am using PHP and MYSQL to store Japanese data. I am facing encoding problem. I am not able to store it in table with proper encoding can any one help me?

  2. K
    February 22nd, 2006

    Too late probably, but just replace all “” from the string before executing the regexp, and it should work with UTF-16

  3. K
    February 22nd, 2006

    ok, my last comment was “mangled”

    Replace all 0-bytes from the string. That’s \ by the way

  4. oink
    March 30th, 2006

    i suggest to go a little further with the regular expression in a network world full of screwed up source codes:

    preg_match( “/]+encoding\s*=\s*['"]?([\w._]+(-[\w._]+)*)['"]?[^>]\?>/i”, $xml, $m )

    it’s possible i screwed up here myself, double check advised, i didn’t test it yet.

  5. oink
    March 30th, 2006

    well seems like your comment page doesn’t translate “smaller than”.

  6. steve
    March 30th, 2006

    Sorry about that… I think Wordpress probably has a similar screwed up regex that tries to sanitize comments that munged yours.

  7. pepe
    April 11th, 2006

    hola egañádos

  8. June 11th, 2006

    Hi, I am using MagpieRSS 0.7 but I still am getting character substitution! please see this site still in progress:
    http://www.andrewzahn.com/crissa

    can you tell what could cause this?
    thanks

  9. Jason Judge
    June 23rd, 2006

    Just a note on this bit:

    “If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it.”

    It should be noted that any UTF-8 stream can be treated as a valid ISO-8859 stream, since an ISO stream is a series of bytes. However, the reverse does not hold true. There are ISO-8859 (and other single-byte and multibyte streams) that turn out to be invalid as UTF-8 streams.

    The reason is that a series of independant bytes is a series of bytes, but UTF-8 has strict rules in which ranges of byte values can follow certain other bytes.

    The parser may not fall over now when it hits these invalid sequences, but I am not sure it is safe to assume that will always be the case.

    I think it would be safer to send an unknown or unhandlable encoding into the parser as ISO8859, and then convert the entities afterwards. *That* is why the parser defaults to ISO and not UTF.

  10. August 2nd, 2006

    Steve,

    Great stuff here. I’m using Magpie v 7a. The parser looks to have I incorporated your encoding fix for PHP4. On my page, particularly in the Yahoo News feeds, several of the characters are converted to question marks. I looked at the original feed, and the special characters are an apostrophe and an mdash. Apparently, all apostrophes aren’t equal as some are handled well and others are converted to ‘?’. The apostrophe causing problems slants backwards (almost like an accent).

    All in all, I would say that the output is still pretty good, and completely readable. It would be icing on the cake to fix this problem with special characters.

    The Yahoo feed is UTF-8. Here is the link to the actual XML file:

    link

    Off topic, comparing Yahoo News and Google News feeds. Yahoo is much better in terms of their advanced search options. Google has no provision to exclude based on keywords. However, Google incorporates nice thumbnails into their output descriptions. This makes them very appealing, and it will be interesting to see how long it lasts before others begin incorporating small thumbnails as well. I could see including a conditional that would only display stories which contained thumbnails.

  11. unclepiak ลุงเปี๊ยก
    September 28th, 2006

    sound interesting ! ขออนุญาตทดสอบภาษาไทย

  12. October 1st, 2006

    I had the same problem using magpie .72 and found that the only solution that worked, was to ditch my UTF-8 feed and replace it with an ISO-8859 feed instead. After changing that setting in WordPress and clearing magpie’s cache, everything worked perfectly.

  13. October 28th, 2006

    Hey all.. Thanks for the post. I am actually trying to mod this plugin right now for my site for something having to do nothing with favicons and would love some help if anyone is willing to trade a few emails.

  14. Matt
    January 26th, 2007

    If you, like me, found no luck here in resolving your char. munging issue in Magpie 0.72 and RSS (Atom works fine, right?) then perhaps what I did may help you…

    when you include and define your Magpie params be sure to include this line:

    define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');

    and in the head of your HTML docuemnt be sure (for browser compatability and user preference consideration) to add/change the following:

    (less than sign) meta http-equiv="Content-Type" content="text/html; charset=UTF-8" / (greater than sign)

    of course you will replace the signs indicated in parentesis… without the parenthesis… hehehe

  15. March 8th, 2007

    Hi,
    I’ve exactly the same problem with my project. Here is a good conversation that also includes useful links about this problem, and its solution.
    http://groups.google.com.tr/group/comp.infosystems.www.authoring.html/browse_thread/thread/d5faecca4ac6e44c/fd7fd436ef3187af?lnk=gst&q=iconv&rnum=6&hl=tr#fd7fd436ef3187af

  16. March 11th, 2007

    i tried to do it on http://www.katalog.wzrost.com but have problems with showing rss in ISO-8859-2….

  17. April 4th, 2007

    I could not overcome an error in cyrillic encoding ISO in Java Struts…
    All letters normally decoded, except for russian “I”..
    What is that?

  18. May 22nd, 2007

    As you may already know, Wordpress adds a Non-breakable space in between the last two words of a title.

    The problem is, when it is run through feedburner, it gets converted to some raw character that looks like Â. I can’t match the ‘Â’ and I can’t find any way to get rid of it or convert it.

    Have you encountered this?

  19. January 22nd, 2008

    I’ve been using magpieRSS for a couple of years now. I have not found an easy way to get “well formed xml”. Instead I have an intermediate parser that looks for these bad characters and strips them out, so every time I find a bad character or embedded code, I add it to the list and it gets cleaned up. It’s more work than I wanted, but it works 99% of the time. If anyone knows of a better way, let me know.

  20. May 24th, 2008

    I wrote a XML parser function in PHP…
    http://www.bin-co.com/php/scripts/xml2array/
    and I was getting complaints that it was not working for some languages.

    Now I know why.

  21. Hannu
    August 2nd, 2008

    I so respect you for writing this article. It is hard to remember how much work is behind well writen accurate text. I hope that there will at some point some webpage where this kind or diamonds are aggregated as one library on knowledge.

    And my problem was UTF-8 rss to ISO-8859-1 page, which php can not handle automatically using version 4. (customer won’t upgrade to php5, I would…)

  22. September 30th, 2008

    great help guide, i will change the value in magpierss, to utf-8.

  23. November 30th, 2008

    I had a strange problem using php DOM with a site that has iso-8859-1 in its head, but sends actual utf-8 header through apache. It’s a cyrillic wordpress blog that is not well configured. The solution is somekind strange- you have to do sth like this:
    $html=preg_replace(’/]*>/’,’

    ‘,$html);
    , e.g. add a charset header before anything else in the head. It didn’t work even if I replaced the normal header coming from the page on its normal position. The DOM just thought it was 8859-1 encoded. This is a very strange bug as normally you can get page headers using curl or sth like that and convert the page to utf-8…

  24. CFD
    March 2nd, 2009

    This is giving me a headache of the worst kind, cant seem to get to grips with it at all. Im really needing to parse out an xml feed, was hoping i could use the magpie flow for it, but its next stage huh, I need something for pure xml based parsing.

  25.
    I am doing a simple translation in IE7 using the domdocument and my page comes out fine. When I do the same thing PHP5 I get a capital letter A with a hat on it. I have been all over the net and cant figure what is going on. I am using UTF-8 encoding in all of my documents and my PHP code even ensures the UTF-8 encode. Here are my only lines of PHP:
    xslt_process($xsltproc,'webconfig.xml','webcontentbios.xsl');
    xslt_free($xsltproc);
    echo $xslt_result;

    So why does this work fine in IE7 and firefox using straight Javascript and dom xml/xsl transformation but in PHP I get these characters. What is the cure? Thanks.

  26. October 27th, 2009

    I find this international character problem comes up a lot on my site too. Thanks for the updat go magpie.
    Snuggie

  27. March 22nd, 2010

    Le Poste Agréable thanks you for the solution

  28. April 28th, 2010

    I have a problem related to parsing the russian feed xml . So plz let me know how can I parse russian characters and store them in database. As well as I again want to fetch them and display in list……
    plz help……………

  29. May 3rd, 2010

    NiksJank say: I can not participate now in discussion – it is very occupied. I will return – I will necessarily express the opinion.

  30. July 30th, 2010

    Many thanks for the article.

    I will use some of the codes on my own site.

    Regards John

  31.
    This was a great help to me, thank you.

    Brad

  32. August 19th, 2010

    @Michael: Here we are, some 6 years after the original post and although I admit I am not an ace programmer, browser cross-compatibility still makes me insane! As much of a capitalist as I am, I sometimes wish there were just one brower in the world! :)

  33. September 20th, 2010

    Well written article. I laughed out loud when you wrote…. “Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked” Thanks for wrtting it.

    -David

  34. May 19th, 2011

    @ Binny V A
    hi ,

    i passed a string into your function in of the attribute value is table format like

    test

    in window same value result i can seee
    test

    when comes to linex iam getting result as like this

    table>test</table

    please help me

  35. August 31st, 2012

    http://feedonfeeds.com/ does not work.

  36. August 31st, 2012

    It’s working now.

  37. October 23rd, 2012

    It’ genius !! I spend hours looking for encoding solution, big thanks

  38. Johng286
    May 13th, 2014

    Heya im for the first time here. I discovered this board and I to uncover It truly helpful & it helped me out a whole lot. I hope to supply something back and aid other people such as you helped me. cafeakedckea

  39.
    Perfectly composed content material , thankyou for selective information .
