Date: June 17th, 2004
Cate: Geekism

PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss

Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!

So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.

Here’s how Magpie was creating the XML parser:

$parser = xml_parser_create();

Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”
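As a quick sanity check of the numeric-entity part, here is a minimal sketch (mine, not Magpie's code; the sample XML string and the `$text` variable are made up for illustration). It parses a tiny ASCII-only document containing `&#233;` and prints the bytes the parser hands back:

```php
<?php
// Collect character data; the handler may be called more than once per text node.
$text = '';

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
    $text .= $data;
});

// The input is pure ASCII; the only non-ASCII character arrives as a numeric entity.
xml_parse($parser, '<t>caf&#233;</t>', true);

// With a UTF-8 target, U+00E9 should come back as the two bytes c3 a9.
echo bin2hex($text), "\n"; // 636166c3a9
```

The closure syntax needs a PHP newer than the 4.x discussed here; on old versions a named handler function would do the same job.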

So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:

$parser = xml_parser_create("EBCDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

This means, “I’m about to give you some XML, in EBCDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”

That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
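To make that behaviour concrete, here is the same regex wrapped in a small function. The function name `detect_xml_encoding` and the two sample strings are my own, purely for illustration; this is a sketch of the idea, not the code that shipped in Magpie.

```php
<?php
// Hypothetical helper: returns the encoding declared in the XML prolog,
// upper-cased, or "UTF-8" when no encoding is declared.
function detect_xml_encoding($xml) {
    // The question marks are escaped so they match literally, and the quotes
    // are escaped so the pattern survives inside a single-quoted PHP string.
    $rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8";
}

echo detect_xml_encoding('<?xml version="1.0" encoding="iso-8859-15"?><rss/>'), "\n"; // ISO-8859-15
echo detect_xml_encoding('<?xml version="1.0"?><rss/>'), "\n";                        // UTF-8
```

Note that this only works when the prolog is readable as ASCII-compatible bytes, so something like UTF-16 would slip past it.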

So the full code is now:

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $xml, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence: all the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double-checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1, BIG5, or SHIFT-JIS is still screwed.

Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.

So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.

$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*\?>/m';

if (preg_match($rx, $source, $m)) {
  $encoding = strtoupper($m[1]);
} else {
  $encoding = "UTF-8";
}

if ($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
  // PHP can handle these three natively
  $parser = xml_parser_create($encoding);
} else {
  // anything else: try to convert to UTF-8 before the parser sees it
  $encoded_source = NULL;

  if (function_exists('mb_convert_encoding')) {
    $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
  }

  if ($encoded_source != NULL) {
    // conversion worked: swap in a prolog that matches the new encoding
    $source = str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
  }

  $parser = xml_parser_create("UTF-8");
}

xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
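To see why the conversion step matters for something as ordinary as ISO-8859-15, here is a tiny sketch of it on its own. The one-byte sample string and the variable names are mine, and the whole thing is skipped if mbstring isn't installed:

```php
<?php
// Byte 0xA4 is the euro sign in ISO-8859-15, but the generic currency sign
// in ISO-8859-1, so a parser told "ISO-8859-1" would pick the wrong character.
$latin9 = "\xA4";

if (function_exists('mb_convert_encoding')) {
    $utf8 = mb_convert_encoding($latin9, "UTF-8", "ISO-8859-15");
    echo bin2hex($utf8), "\n"; // e282ac, the UTF-8 bytes for U+20AC
} else {
    echo "mbstring is not available, nothing converted\n";
}
```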

Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:

$parser = xml_parser_create("");

Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:

$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
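Here is a minimal sketch of what that looks like end to end, assuming a PHP build where the empty-string form is accepted and the parser honours the encoding declared in the prolog. The sample document and the `$text` variable are made up for illustration:

```php
<?php
// Assumption: "" is accepted here and means "auto-detect the source encoding".
// The document declares ISO-8859-1 and contains the single byte 0xE9.
$xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>caf\xE9</t>";

$text = '';
$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
    $text .= $data; // the handler may fire more than once per text node
});

if (!xml_parse($parser, $xml, true)) {
    echo xml_error_string(xml_get_error_code($parser)), "\n";
}

// If detection and translation both worked, the last two bytes are c3 a9.
echo bin2hex($text), "\n";
```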

At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.


  1. Asha
    January 25th, 2006

  2. Hello,
    I am using PHP and MYSQL to store Japanese data. I am facing encoding problem. I am not able to store it in table with proper encoding can any one help me?


  3. KK  
    February 22nd, 2006

  4. Too late probably, but just replace all “” from the string before executing the regexp, and it should work with UTF-16


  5. KK  
    February 22nd, 2006

  6. ok, my last comment was “mangled”

    Replace all 0-bytes from the string. That’s \0, by the way


  7. oink
    March 30th, 2006

  8. i suggest to go a little further with the regular expression in a network world full of screwed up source codes:

    preg_match( “/]+encoding\s*=\s*['"]?([\w._]+(-[\w._]+)*)['"]?[^>]\?>/i”, $xml, $m )

    it’s possible i screwed up here myself, double check advised, i didn’t test it yet.


  9. oink
    March 30th, 2006

  10. well seems like your comment page doesn’t translate “smaller than”.


  11. steve
    March 30th, 2006

  12. Sorry about that… I think Wordpress probably has a similar screwed up regex that tries to sanitize comments that munged yours.


  13. pepe
    April 11th, 2006

  14. hola egañádos


  15. June 11th, 2006

  16. Hi, I am using MagpieRSS 0.7 but I still am getting character substitution! please see this site still in progress:

    can you tell what could cause this?


  17. Jason Judge
    June 23rd, 2006

  18. Just a note on this bit:

    “If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding=”utf-8″ and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it.”

    It should be noted that any UTF-8 stream can be treated as a valid ISO-8859 stream, since an ISO stream is a series of bytes. However, the reverse does not hold true. There are ISO-8859 (and other single-byte and multibyte streams) that turn out to be invalid as UTF-8 streams.

    The reason is that a series of independant bytes is a series of bytes, but UTF-8 has strict rules in which ranges of byte values can follow certain other bytes.

    The parser may not fall over now when it hits these invalid sequences, but I am not sure it is safe to assume that will always be the case.

    I think it would be safer to send an unknown or unhandlable encoding into the parser as ISO8859, and then convert the entities afterwards. *That* is why the parser defaults to ISO and not UTF.


  19. August 2nd, 2006

  20. Steve,

    Great stuff here. I’m using Magpie v 7a. The parser looks to have I incorporated your encoding fix for PHP4. On my page, particularly in the Yahoo News feeds, several of the characters are converted to question marks. I looked at the original feed, and the special characters are an apostrophe and an mdash. Apparently, all apostrophes aren’t equal as some are handled well and others are converted to ‘?’. The apostrophe causing problems slants backwards (almost like an accent).

    All in all, I would say that the output is still pretty good, and completely readable. It would be icing on the cake to fix this problem with special characters.

    The Yahoo feed is UTF-8. Here is the link to the actual XML file:


    Off topic, comparing Yahoo News and Google News feeds. Yahoo is much better in terms of their advanced search options. Google has no provision to exclude based on keywords. However, Google incorporates nice thumbnails into their output descriptions. This makes them very appealing, and it will be interesting to see how long it lasts before others begin incorporating small thumbnails as well. I could see including a conditional that would only display stories which contained thumbnails.


  21. unclepiak ลุงเปี๊ยก
    September 28th, 2006

  22. sound interesting ! ขออนุญาตทดสอบภาษาไทย


  23. October 1st, 2006

  24. I had the same problem using magpie .72 and found that the only solution that worked, was to ditch my UTF-8 feed and replace it with an ISO-8859 feed instead. After changing that setting in WordPress and clearing magpie’s cache, everything worked perfectly.


  25. October 28th, 2006

  26. Hey all.. Thanks for the post. I am actually trying to mod this plugin right now for my site for something having to do nothing with favicons and would love some help if anyone is willing to trade a few emails.


  27. Matt
    January 26th, 2007

  28. If you, like me, found no luck here in resolving your char. munging issue in Magpie 0.72 and RSS (Atom works fine, right?) then perhaps what I did may help you…

    when you include and define your Magpie params be sure to include this line:

    define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');

    and in the head of your HTML document be sure (for browser compatibility and user preference consideration) to add/change the following:

    (less than sign) meta http-equiv=”Content-Type” content=”text/html; charset=UTF-8″ / (greater than sign)

    of course you will replace the signs indicated in parentheses… without the parentheses… hehehe


  29. March 8th, 2007

  30. Hi,
    I’ve exactly the same problem with my project. Here is a good conversation that also includes useful links about this problem, and its solution.


  31. March 11th, 2007

  32. i tried to do it on but have problems with showing rss in ISO-8859-2….


  33. April 4th, 2007

  34. I could not overcome an error in cyrillic encoding ISO in Java Struts…
    All letters normally decoded, except for russian “I”..
    What is that?


  35. May 22nd, 2007

  36. As you may already know, Wordpress adds a Non-breakable space in between the last two words of a title.

    The problem is, when it is run through feedburner, it gets converted to some raw character that looks like Â. I can’t match the ‘Â’ and I can’t find any way to get rid of it or convert it.

    Have you encountered this?


  37. January 22nd, 2008

  38. I’ve been using magpieRSS for a couple of years now. I have not found an easy way to get “well formed xml”. Instead I have an intermediate parser that looks for these bad characters and strips them out, so every time I find a bad character or embedded code, I add it to the list and it gets cleaned up. It’s more work than I wanted, but it works 99% of the time. If anyone knows of a better way, let me know.


  39. May 24th, 2008

  40. I wrote a XML parser function in PHP…
    and I was getting complaints that it was not working for some languages.

    Now I know why.


  41. Hannu
    August 2nd, 2008

    I so respect you for writing this article. It is hard to remember how much work is behind well-written, accurate text. I hope that at some point there will be a webpage where this kind of diamond is aggregated into one library of knowledge.

    And my problem was UTF-8 rss to ISO-8859-1 page, which php can not handle automatically using version 4. (customer won’t upgrade to php5, I would…)


  43. September 30th, 2008

  44. great help guide, i will change the value in magpierss, to utf-8.


  45. November 30th, 2008

  46. I had a strange problem using php DOM with a site that has iso-8859-1 in its head, but sends actual utf-8 header through apache. It’s a cyrillic wordpress blog that is not well configured. The solution is somekind strange- you have to do sth like this:

    , e.g. add a charset header before anything else in the head. It didn’t work even if I replaced the normal header coming from the page on its normal position. The DOM just thought it was 8859-1 encoded. This is a very strange bug as normally you can get page headers using curl or sth like that and convert the page to utf-8…


  47. CFD
    March 2nd, 2009

  48. This is giving me a headache of the worst kind, cant seem to get to grips with it at all. Im really needing to parse out an xml feed, was hoping i could use the magpie flow for it, but its next stage huh, I need something for pure xml based parsing.


    I am doing a simple translation in IE7 using the domdocument and my page comes out fine. When I do the same thing PHP5 I get a capital letter A with a hat on it. I have been all over the net and cant figure what is going on. I am using UTF-8 encoding in all of my documents and my PHP code even ensures the UTF-8 encode. Here are my only lines of PHP:
    echo $xslt_result;

    So why does this work fine in IE7 and firefox using straight Javascript and dom xml/xsl transformation but in PHP I get these characters. What is the cure? Thanks.


  49. October 27th, 2009

    I find this international character problem comes up a lot on my site too. Thanks for the update. Go Magpie.


  51. March 22nd, 2010

  52. Le Poste Agréable remercie de la solution


  53. April 28th, 2010

  54. I have a problem related to parsing the russian feed xml . So plz let me know how can I parse russian characters and store them in database. As well as I again want to fetch them and display in list……
    plz help……………




  57. July 30th, 2010

  58. Many thanks for the article.

    I will use some of the codes on my own site.

    Regards John


    This was a great help to me, thank you.



  59. August 19th, 2010

    @Michael: Here we are, some 6 years after the original post and although I admit I am not an ace programmer, browser cross-compatibility still makes me insane! As much of a capitalist as I am, I sometimes wish there were just one browser in the world! :)


  61. September 20th, 2010

    Well written article. I laughed out loud when you wrote… “Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked”. Thanks for writing it.



  63. May 19th, 2011

  64. @ Binny V A
    hi ,

    i passed a string into your function in of the attribute value is table format like


    in window same value result i can seee

    when comes to linex iam getting result as like this


    please help me


  65. August 31st, 2012

  66. does not work.


  67. August 31st, 2012

  68. It’s working now.


  69. October 23rd, 2012

  70. It’ genius !! I spend hours looking for encoding solution, big thanks


