Update: This code has been finalized and debugged, and is now shipped as part of MagpieRSS 0.7! Sadness and rage no more!
So I have this little program, called Feed on Feeds. It’s an RSS and Atom aggregator. For a long time I’ve known that it doesn’t quite handle international characters that well, so I set out to fix it. I knew that somewhere between input feed and output HTML page, characters were getting messed up. I adopted a policy of “UTF-8 Everywhere”: since FoF has to deal with feeds in lots of different charsets, but display them all on one page, I’d translate everything into UTF-8. I UTF-ized everything in the display code, and made sure that the DB wasn’t mucking with the characters, finally closing in on the place where it seemed characters were being munged: the XML parser itself, called by MagpieRSS, the RSS and Atom parser used by FoF.
Here’s how Magpie was creating the XML parser:
$parser = xml_parser_create();
Nice! Simple! But it munges characters, especially numeric entities. After reading some PHP docs, I found that there are two things you can set in PHP’s XML parser: the source encoding, and the target encoding. You can set the target encoding this way:
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
This means: “Whatever charset the XML is in, I want you to translate it into UTF-8. And if you happen to find any numeric entities in there, resolve them into UTF-8 characters, too.”
So I tried that. But it still wasn’t working. Some feeds were translated into UTF-8 properly, but others weren’t. Feeds already in UTF-8 were re-encoded, resulting in gibberish. Reading some more documentation and bug reports, I found that if you don’t set the source encoding, PHP assumes your XML is in ISO-8859-1! I was amazed that PHP’s XML parser didn’t examine the XML prolog to determine the encoding, and further shocked that they chose such an insane default. But anyway. You can set both source and target encodings this way:
$parser = xml_parser_create("EBCIDIC");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
This means, “I’m about to give you some XML, in EBCIDIC. I want you to translate all those characters into UTF-8 while you’re parsing it. Don’t forget to turn any numeric entities you find into UTF-8, too.”
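Here’s a small sketch of that source-plus-target setup on a made-up document, so you can see what comes out the other end. It assumes a reasonably current PHP (closures need 5.3 or later) with the xml extension loaded; the sample XML and the `$text` variable are invented for illustration and aren’t from Magpie or FoF.

```php
<?php
// Sketch only. The sample document is made up for illustration.
// "\xE9" is an e-acute as a single ISO-8859-1 byte; &#8364; is a numeric
// entity for the euro sign.
$xml = '<?xml version="1.0" encoding="ISO-8859-1"?>'
     . "<title>caf\xE9 &#8364;5</title>";

// Source encoding goes to the constructor, target encoding is an option.
$parser = xml_parser_create("ISO-8859-1");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");

// Character data can arrive in several chunks, so append each one.
$text = '';
xml_set_character_data_handler($parser, function ($parser, $data) use (&$text) {
    $text .= $data;
});
xml_parse($parser, $xml, true);

// Dump the bytes as hex so the encoding is visible regardless of terminal.
echo bin2hex($text), "\n";
```

If the translation worked, the e-acute should show up as the two bytes c3 a9 and the euro sign as e2 82 ac, both valid UTF-8.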
That works… but presents a problem. How do you know the charset the XML is in? The only answer I could come up with: scan the XML myself, and find the encoding!
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';
if (preg_match($rx, $xml, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}
That regex finds the charset declaration in the XML prolog itself, and if found, saves it in the variable $encoding. If it wasn’t found, it assumes the XML is in UTF-8 already, which is the default for XML.
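To make that easier to poke at, here is the same regex wrapped in a function with a few sample inputs. The function name `detect_xml_encoding` is hypothetical (it is not something Magpie or FoF defines), and the quotes and question marks in the pattern are backslash-escaped so the string is valid PHP. Treat it as a sketch, not a real XML prolog parser.

```php
<?php
// Hypothetical helper, not part of Magpie or FoF: returns the encoding
// declared in the XML prolog, upper-cased, or "UTF-8" when none is declared.
function detect_xml_encoding($xml)
{
    $rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return "UTF-8"; // XML's default when no encoding is declared
}

echo detect_xml_encoding('<?xml version="1.0" encoding="big5"?><rss/>'), "\n";
echo detect_xml_encoding('<?xml version="1.0"?><rss/>'), "\n";
echo detect_xml_encoding('<rss version="2.0"/>'), "\n";
```

The first call should report BIG5 and the other two should fall back to UTF-8. Note that this only works when the prolog itself is readable as ASCII-compatible bytes.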
So the full code is now:
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';
if (preg_match($rx, $xml, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}
$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
That, finally, worked. All my feeds were reliably translated into UTF-8. But that was just by coincidence. All the feeds I subscribe to were already in UTF-8 or ISO-8859-1. After making this release, people complained that feeds in ISO-8859-15 and BIG-5 weren’t working. Consulting the PHP docs again, and double checking in the source code because it was just so surprising, I found that PHP 4.x only supports UTF-8, ISO-8859-1, and US-ASCII. So anybody out there who wants to subscribe to a feed in ISO-8859-anything-but-1 or BIG5 or SHIFT-JIS is still screwed.
Even PHP 5 won’t help here, when it is released: It sort-of supports a longer list of encodings, but not BIG5 or GB2312, the two main Chinese encodings.
So I searched the PHP docs some more, and came up with a potential solution: mbstring! The mbstring family of functions supports a huge long list of encodings, and can translate between them. So here’s the final solution: use a regex to find the source encoding. If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding="utf-8" and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it. If mb_convert_encoding blows up (which it will if the source encoding is not recognized, or if the function completely doesn’t exist, which I’m told is highly probable, since it is an optional extension) just give up and pass the XML straight to the parser and avert your eyes as it makes mincemeat of the characters. At least I tried.
$rx = '/<\?xml.*encoding=[\'"](.*?)[\'"].*?\?>/m';
if (preg_match($rx, $source, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}
if ($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
    // PHP's XML parser handles these three natively.
    $parser = xml_parser_create($encoding);
} else {
    // Anything else: try to convert to UTF-8 before the parser sees it.
    $encoded_source = NULL;
    if (function_exists('mb_convert_encoding')) {
        $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
    }
    if ($encoded_source != NULL) {
        // Swap the old prolog for one that declares UTF-8.
        $source = str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
    }
    $parser = xml_parser_create("UTF-8");
}
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
Surprisingly, this hack on top of a hack wrapped up in a hack with extra hack on top… worked! It was able to parse ISO-8859-15, BIG-5, even GB2312 feeds just fine, and translate them all into UTF-8 for display on a single page. I have these changes in my local copy of FoF now, and I’m going to let them burn in for a few days before I release them to the wider world, who will probably point out, within minutes, the multiple and tragic ways that even this solution fails. But until then, I proclaim that this is the state of the art in PHP XML charset-aware parsing. I think this is as good as it gets in PHP 4.x.
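For anyone who wants to try the convert-before-parsing step on its own, here is a sketch of it pulled out into a function. This is not the code in FoF or Magpie: the name `convert_to_utf8_or_null` is made up, the second attempt with `iconv()` is an addition of this sketch, and it assumes PHP 7 or later (for catching `\Throwable`, since newer PHP versions throw on unknown encodings instead of returning false).

```php
<?php
// Hypothetical helper, not the code shipped in FoF or Magpie.
// Returns $source converted to UTF-8, or null if no converter could do it.
// Both mbstring and iconv are optional extensions, so each is checked first.
function convert_to_utf8_or_null($source, $encoding)
{
    if (function_exists('mb_convert_encoding')) {
        try {
            $out = @mb_convert_encoding($source, "UTF-8", $encoding);
            if (is_string($out)) {
                return $out;
            }
        } catch (\Throwable $e) {
            // PHP 8 throws for unknown encodings; fall through to iconv.
        }
    }
    if (function_exists('iconv')) {
        // Argument order differs from mbstring: from-charset comes first.
        $out = @iconv($encoding, "UTF-8", $source);
        if (is_string($out)) {
            return $out;
        }
    }
    return null; // Give up; the caller passes the raw XML to the parser.
}

// "\xA4" is the euro sign in ISO-8859-15.
var_dump(bin2hex((string) convert_to_utf8_or_null("\xA4", "ISO-8859-15")));
var_dump(convert_to_utf8_or_null("abc", "NO-SUCH-CHARSET"));
```

Returning null rather than the original string keeps the "give up" decision with the caller, which matches the check on `$encoded_source` in the code above.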
Footnote: when I say PHP5 sort-of supports more encodings, this is what I mean: PHP5 (I looked at RC3, maybe these bugs will be fixed by the final release) is completely nuts. The XML parser supports a bunch more encodings, but they are really hard to get to. If you try to explicitly set the input encoding, the PHP code limits you to UTF-8, ISO-8859-1, or US-ASCII, even though libxml2, the underlying parser, supports many more. But, if you know the super secret codes, you can construct the parser this way:
$parser = xml_parser_create("");
Notice the difference? In PHP5 bizarro world, passing in an empty string means “do what you should have done all along, auto detect the stupid encoding!” But, there’s another problem: if you auto-detect the stupid encoding this way, the stupid target encoding is stupidly set to ISO-8859-1. I don’t know who would want that. And it goes against the documentation, which says by default the target encoding is set to the source encoding. And again, you are restricted artificially from setting the output encoding to anything other than UTF-8, ISO-8859-1, or US-ASCII. So you could, if you want, use a regex (yuck!) to find the source encoding, but you wouldn’t be allowed to set the target encoding to match. But, at least, you can do this:
$parser = xml_parser_create("");
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
Meaning, “Auto detect the source encoding, and then translate everything, including numeric entities, into UTF-8.”
At least you will be able to… when PHP5 comes out, and is installed on the server where your application needs to run, which for me (I get complaints that FoF won’t work on PHP 3) probably won’t be for several years.


Phil Ringnalda
If you really want to cover all the bases, don’t just look for mb_convert_encoding(): you can also look for (and rarely find) iconv() and recode(). Then, on *nix, try to shell_exec('iconv…'), though you’ll fail in safe mode and most shared hosts disable shell_exec. But, there’s still one last hope! Most of them don’t realize that they should also disable the strange and terrible proc_open(), so you can actually fork a process to run iconv, and open input and output pipes to feed and read.
Or, sigh, write ten lines of Python to call Mark’s Universal Feed Parser and return the output as a PHP include. Sometimes, PHP really ticks me off.
steve
OOooooo… I didn’t know about those!
Or I could just create a web service: http://xml-transcoder.com/?to=utf-8&url=http://nasty.feed/in/EBCIDIC/or/some/such . Then you’d just seamlessly subscribe to the transcoded version of the feed.
isis
good job and funny title.
steve
Thanks, isis. I used your feed (since it’s big5) as one of the test cases!
zonble
Steve. May you permit me to translate your post into Chinese and share it to the Chinese readers? I consider that there might be lots of people in Asia would like to know how to handle with the International characters while programing PHP.
steve
Yes! You certainly can! If you can wait until Monday, you can post your translation of this article along with a pointer to Feed on Feeds 0.1.7 which will contain the working code.
Mark Wu
Hi Steve:
I just plan integrate the FoF into pLog, this tale really help me learn a lot about PHP’s stupid encoding …
Regards, mark
Mark Wu
Only one thing, my ISP does not support iconv … my god … How can I do?
steve
Uh oh… how about ‘mbstring’? If you don’t have iconv or mbstring, then you will only be able to work with feeds that are in UTF-8, ISO-8859-1, or ASCII.
Thomas Clavier
for me it isn’t a good idea to search charset in xml header. it’s good to search in http header because for w3c if charset it isn’t specify in http header you can use xml charset.
http://www.w3.org/TR/WD-html40-970708/charset.html
steve
To really get it right you have to check both. In practice, I haven’t yet found any feeds where checking the headers is necessary or even helpful.
Mark Wu
Hi Steve:
Thanks, It works,I already asked my ISP install iconv …. It works..now!!
And, I just only use the hacked magpie-rss that shiped with FoF. With it, I can let pLog support the encoding, convert GB & BIG5 to UTF without change any code … really thanks!!
Please go my site to see the result, you really help me a lot!! ^-^.
RSS Feeds by Site ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedulebysite
RSS Feeds by Time ==> http://blog.markplace.net/index.php?blogId=1&op=Template&show=schedule
And, may you allow me to submit the my code with your hacked magpie-rss version to pLog ??
Thanks!
Regards, Mark
steve
Mark: Great news! The code is GPL, so you can share it with anybody you like, as long as they follow the terms of the GPL as well. The author of Magpie is looking at these changes now, and refining them, and they may be included in a future “official” version of MagpieRSS.
Anonymous
god. this is a boring blog.
possibly, because i’m stupid and know nothing of php encoding.
but for all our sakes, go to a party and get hammered.
David
Thomas makes a good point, but it’s even worse than that–not only are there character encodings specified by both HTTP and the XML prolog, but either or both of them could be completely wrong. I would imagine there are many feeds served as ISO-8859-1 that are actually windows-1252 (or whatever it is) and they just don’t happen to contain any of the characters that are different between the two formats, yet. When one of them does, your code might handle it, or maybe it’ll start spitting out gibberish again. And if someone gets UTF-8 and UTF-16 mixed up, I think you could wind up with everything shifted off by a byte, and now the feed is completely unreadable again.
Basically, until we can convince everyone to use UTF-8/16/32 for everything, this will be a bloody pain in the arse. It looks like PHP just makes it even more painful. I would second Phil’s recommendation: use Mark’s Universal Feed Parser, and when something breaks, just get him to fix it. :-D
steve
At this point I’m still in the “get it right when the feed is right” stage. FoF still doesn’t even do that, all the time. Once I’ve got that one licked, I may move on to “get it right even when the feed doesn’t”.
So Much Geek, So Little Time » Reject Incorrectly encoded Pingbacks
[…] ject Incorrectly encoded Pingbacks Filed under: Life — unteins @ 12:11 pm This article has some info about […]
Mark Wu
Hi Steve:
I wrote a blog about pLog RSSFetcher Plug-in. Thanks for such good work.
http://blog.markplace.net/index.php?op=ViewArticle&articleId=119&blogId=1
Regards, Mark
Harry Fuecks
Great post. Many thanks. Seems many php developers are largely oblivious to character encoding issues. Looking at some of the feed generation libraries out there, seems it’s a similar story - last time I looked only RssWriter pays any attention to UTF8. The author is Portuguese I believe, which may be why…
Gives me some ideas for further features for HTMLSax - right now it shouldn’t (not that I’ve tested carefully) choke on anything but also doesn’t support the user by taking care of encoding issues.
inertia
hi Steve,
I heard this good tool from isis, and install it on the share hosting to test. And one thing I can’t figure out is that after reading all docs carefully and asking my ISP had mbstring and iconv both compiled, I still can’t fetch big5 blogs, ex isis’s blog. Do you think where may I get wrong?
I know this problem is quite “ambiguous”, say, the ISP didn’t compiled well ,or some installation step got wrong. but I also wish get some ideas from you.
regards
inertia
Peter Van Dijck’s Guide to Ease
Steve Minutillo :: messy-78 PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss: a must read if you’re wrestling with PHP, XML and character encodings!…
MeriBlog : Meri Williams’ Weblog
Multitasking Mice Fertilitiy
Interesting article over at OK/Cancel all about multitasking PHP, XML, and Character Encodings: a tale of sadness, rage, and (data-)loss — worth reading just for the title ;-) Very comprehensive guidebook to developing with web standards I love the…
Grace and peace to you! » 2004 » June
[…] Grace and peace to you!
» PHP, XML, and Character Encodings: a […]
Pete Prodoehl
I too must suggest that using Mark’s Universal Feed Parser would be a good idea. In fact, since the process of harvesting the feeds, parsing them, and storing them in MySQL can be separated out from the whole UI/reading part of things, this is a great suggestion. I know it sort of makes fof a weird combo PHP/Python app, but it could be an option for those of use who don’t have a problem with that.
LinuxBrit
PHP, XML, and Character Encodings
Man, PHP can be really unbelievably stupid sometimes :(
Elaine
yipes…I started taking a look at the problem when I was mucking about with a personal variant of FoF and could never quite figure out what was going on. I always thought it was a problem with Magpie, but I had no idea it went that deep. thanks for all your work on FoF, and for sharing all the gory details!
Simon Jessey
Multibyte string functions are not part of the PHP default install, so many webhosts do not include it - my own webhost refused to add it. Just thought I’d let you know.
Dominic Mitchell
What about UTF-16? Your regex won’t be able to pick up the XML declaration then…
I love spanners. :-)
-Dom
steve
inertia: I’m lucky enough to be on a host where mbstring and iconv are both included and work perfectly. I’ve actually never even compiled PHP myself, so I don’t really know what can go wrong there, other than the obvious “check phpinfo()”
Simon: I know. There’s nothing I can really do about that. In the next version of FoF the installer will inspect your system and tell you what it finds, so at least this will be less confusing.
Dominic: I knew there’d be a hitch somewhere. Are you sure? Have you tried it with UTF-16? If it doesn’t work, is there any workaround?
David
Thanks, excellent post. Good to know someone’s working so we don’t have to :)
snapping links » lookin’ good
[…] n into the problem with character encoding (darn curly quotes)…so I went back to look at what Steve Minutillo had to say abou […]
Andrew
Thanks for your post. It’s unfortunate that php’s support for i18n is so poor here. For what it’s worth (probably not much), EBCDIC is not spelled with an extra ‘I’, even though people pronounce it as if it did. Man, I hope nobody sends out news feeds in EBCDIC O:-).
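车东Blog^2
Lilina: building a personal portal with an RSS aggregator (Write once, publish anywhere)
While collecting RSS parsing tools recently, I found MagpieRSS and Lilina, which is built on top of it. Lilina’s main features:
1. Web-based RSS management: add, delete, OPML export, background RSS caching (to avoid putting too much load on the source servers), ScriptLet…
车东BLOG
Analysis of parsing UTF-8 and GBK RSS in MagpieRSS: character-oriented programming in PHP explained
My first attempt with MagpieRSS failed because iconv and mbstring weren’t installed. Today I installed iconv and mbstring support on the server, and I took a careful look at how rss_fetch is used in Lilina: the most important thing is to set the RSS output format to ’MAGPIE_O…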
grace
how to do it even without using XML ? I just wan my page to be able to key in chinese character, submit to mysql in code form, then display on page the chinese character.
What is the testing script which I can test for this?
Thank you
steve
grace: I don’t have any links to tutorials on that, but what you’re talking about is fairly easy to do. In fact, the free weblog engine that powers this site, Wordpress, is capable of doing just that. You could download Wordpress and examine how international characters are handled.
Sascha Carlins Linkdump
This page has been linkdumped
Charsets…
Andy
I’m developing a library called mbstring emulator which emulate mbstring functions. I already published mbstring emulator for Japanese(supports Shift_JIS, EUC-JP, UTF-8),and now I’m working with western language version(iso-8859-1 and utf-8). I’d like to know what encoding do you need .
steve
BIG-5 would be nice.
ryh
Good job, I’m using this on my site.
+CMS
Andy your mbstring emulator is perfect. thx
Charl47
Hi steve
I’m using magpierss 7.0. Where may i put your synthaxe exactly. I don’t unsderstand all your explain, i’m newbee in RSS and I’m french. So it’s difficult for me to translate this.
Thank’s.
steve
If you are using MagpieRSS 0.7, then this code is already built in! This was written before MagpieRSS 0.7 was created.
nwestwood
I’m using magpieRSS to read multiple feeds and create 1 feed with news of interest to our industry and it works, mostly. When I write out the data I retrieve from magpie, I get “not well-formed” XML, for example & symbols in the URL’s that Feed Validator doesn’t like. Is there a way to get the data back and have it encoded correctly? or what do you suggest?
-thanks - Neal
steve
Magpie is working the way Feed on Feeds uses it. You could try asking on the magpierss mailing list with some more specifics on your problem.
junesnow17
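I don’t know whether the Chinese characters I type here will display.
My English is poor, so I can only write in Chinese.
For various reasons, while learning PHP I found that the PHP version I had installed was too old, so I spent two days upgrading to the latest version.
As a result, all the member names in the forum turned into gibberish @@
After reading this article, the problem still isn’t solved, and my husband ended up restoring the database to the previous version, but I was very moved that someone is paying attention to this.
Because of our script, those of us who write in Chinese characters get overlooked, and when writing programs we often end up dizzy from the limitations around text. I hope the day comes soon when there is a PHP that our people can use too.
Don’t make us feel restricted at every turn anymore.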
Jari
Hello Steve
I was looking for this script for a long time. It looks great. But I get a parse error , unexpected ‘*’ , in the line $rx=
I made a copy/paste in Dreamweaver. What’s wrong in the syntax? Thanks to help me. Best regards
jari
Hello
I tried magpie rss_parse.inc script that use your post. Doing copy with notepad I have now no syntax error. I want to read xml files in ISO or UTF-8 . I tried the function : function php4_create_parser to test the encoding of the input files. I use the regular expression : ‘//m’ with preg_match, but if I do an echo of this regular expression I get nothing. What’s wrong? How can I get the encoding of the xml files ? May you help me.
Best regards
Javier
Thanks so much man, this has solve a huge problem I had. Ive been searching for a solution for a while, parsing XML with PHP 4.x could be frustrating specially if your XML files need to use entities for different languages (in my case, spanish).
Ive learned a lot with this, thanks again ^_^
Asha
Hello,
I am using PHP and MYSQL to store Japanese data. I am facing encoding problem. I am not able to store it in table with proper encoding can any one help me?
K
Too late probably, but just replace all “” from the string before executing the regexp, and it should work with UTF-16
K
ok, my last comment was “mangled”
Replace all 0-bytes from the string. That’s \0, by the way
oink
i suggest to go a little further with the regular expression in a network world full of screwed up source codes:
preg_match( “/]+encoding\s*=\s*[’”]?([\w._]+(-[\w._]+)*)[’”]?[^>]\?>/i”, $xml, $m )
it’s possible i screwed up here myself, double check advised, i didn’t test it yet.
oink
well seems like your comment page doesn’t translate “smaller than”.
steve
Sorry about that… I think Wordpress probably has a similar screwed up regex that tries to sanitize comments that munged yours.
pepe
hola egañádos
Andrew
Hi, I am using MagpieRSS 0.7 but I still am getting character substitution! please see this site still in progress:
www.andrewzahn.com/crissa
can you tell what could cause this?
thanks
Jason Judge
Just a note on this bit:
“If PHP can handle it natively, fine. If not, lop off the XML prolog, replace it with one that says encoding=”utf-8″ and pass the whole XML file through mb_convert_encoding to convert it to UTF-8 before the parser even sees it.”
It should be noted that any UTF-8 stream can be treated as a valid ISO-8859 stream, since an ISO stream is a series of bytes. However, the reverse does not hold true. There are ISO-8859 (and other single-byte and multibyte streams) that turn out to be invalid as UTF-8 streams.
The reason is that a series of independant bytes is a series of bytes, but UTF-8 has strict rules in which ranges of byte values can follow certain other bytes.
The parser may not fall over now when it hits these invalid sequences, but I am not sure it is safe to assume that will always be the case.
I think it would be safer to send an unknown or unhandlable encoding into the parser as ISO8859, and then convert the entities afterwards. *That* is why the parser defaults to ISO and not UTF.
matt
Steve,
Great stuff here. I’m using Magpie v 7a. The parser looks to have incorporated your encoding fix for PHP4. On my page, particularly in the Yahoo News feeds, several of the characters are converted to question marks. I looked at the original feed, and the special characters are an apostrophe and an mdash. Apparently, all apostrophes aren’t equal as some are handled well and others are converted to ‘?’. The apostrophe causing problems slants backwards (almost like an accent).
All in all, I would say that the output is still pretty good, and completely readable. It would be icing on the cake to fix this problem with special characters.
The Yahoo feed is UTF-8. Here is the link to the actual XML file:
link
Off topic, comparing Yahoo News and Google News feeds. Yahoo is much better in terms of their advanced search options. Google has no provision to exclude based on keywords. However, Google incorporates nice thumbnails into their output descriptions. This makes them very appealing, and it will be interesting to see how long it lasts before others begin incorporating small thumbnails as well. I could see including a conditional that would only display stories which contained thumbnails.
unclepiak ลุงเปี๊ยก
sound interesting ! Permission requested to test the Thai language
Rasmus
I had the same problem using magpie .72 and found that the only solution that worked, was to ditch my UTF-8 feed and replace it with an ISO-8859 feed instead. After changing that setting in WordPress and clearing magpie’s cache, everything worked perfectly.
Oliver
Hey all.. Thanks for the post. I am actually trying to mod this plugin right now for my site for something having to do nothing with favicons and would love some help if anyone is willing to trade a few emails.
Matt
If you, like me, found no luck here in resolving your char. munging issue in Magpie 0.72 and RSS (Atom works fine, right?) then perhaps what I did may help you…
when you include and define your Magpie params be sure to include this line:
define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');
and in the head of your HTML document be sure (for browser compatibility and user preference consideration) to add/change the following:
(less than sign) meta http-equiv="Content-Type" content="text/html; charset=UTF-8" / (greater than sign)
of course you will replace the signs indicated in parenthesis… without the parenthesis… hehehe
sempsteen
Hi,
I’ve exactly the same problem with my project. Here is a good conversation that also includes useful links about this problem, and its solution.
http://groups.google.com.tr/group/comp.infosystems.www.authoring.html/browse_thread/thread/d5faecca4ac6e44c/fd7fd436ef3187af?lnk=gst&q=iconv&rnum=6&hl=tr#fd7fd436ef3187af
purr
i tried to do it on http://www.katalog.wzrost.com but have problems with showing rss in ISO-8859-2….
Alex
I could not overcome an error in cyrillic encoding ISO in Java Struts…
All letters normally decoded, except for russian “I”..
What is that?
Chrys Bader
As you may already know, Wordpress adds a Non-breakable space in between the last two words of a title.
The problem is, when it is run through feedburner, it gets converted to some raw character that looks like Â. I can’t match the ‘Â’ and I can’t find any way to get rid of it or convert it.
Have you encountered this?
nwestwood
I’ve been using magpieRSS for a couple of years now. I have not found an easy way to get “well formed xml”. Instead I have an intermediate parser that looks for these bad characters and strips them out, so every time I find a bad character or embedded code, I add it to the list and it gets cleaned up. It’s more work than I wanted, but it works 99% of the time. If anyone knows of a better way, let me know.
Binny V A
I wrote a XML parser function in PHP…
http://www.bin-co.com/php/scripts/xml2array/
and I was getting complaints that it was not working for some languages.
Now I know why.
Hannu
I so respect you for writing this article. It is hard to remember how much work is behind well written accurate text. I hope that there will at some point be some webpage where these kinds of diamonds are aggregated as one library of knowledge.
And my problem was UTF-8 rss to ISO-8859-1 page, which php can not handle automatically using version 4. (customer won’t upgrade to php5, I would…)
Solar Power
great help guide, i will change the value in magpierss, to utf-8.
atanas
I had a strange problem using php DOM with a site that has iso-8859-1 in its head, but sends actual utf-8 header through apache. It’s a cyrillic wordpress blog that is not well configured. The solution is somekind strange- you have to do sth like this:
$html=preg_replace(’/]*>/’,’
‘,$html);
, e.g. add a charset header before anything else in the head. It didn’t work even if I replaced the normal header coming from the page on its normal position. The DOM just thought it was 8859-1 encoded. This is a very strange bug as normally you can get page headers using curl or sth like that and convert the page to utf-8…
CFD
This is giving me a headache of the worst kind, cant seem to get to grips with it at all. Im really needing to parse out an xml feed, was hoping i could use the magpie flow for it, but its next stage huh, I need something for pure xml based parsing.
Michael Donnel
I am doing a simple translation in IE7 using the domdocument and my page comes out fine. When I do the same thing PHP5 I get a capital letter A with a hat on it. I have been all over the net and cant figure what is going on. I am using UTF-8 encoding in all of my documents and my PHP code even ensures the UTF-8 encode. Here are my only lines of PHP:
xslt_process($xsltproc,'webconfig.xml','webcontentbios.xsl');
xslt_free($xsltproc);
echo $xslt_result;
So why does this work fine in IE7 and firefox using straight Javascript and dom xml/xsl transformation but in PHP I get these characters. What is the cure? Thanks.