
Strange behaviour with special French characters


hartmut
02-19-2010, 04:00 AM
Problem:
When I copy a French webpage from Firefox or IE directly into UR, the French characters are not shown correctly.
When I save the same webpage to ScrapBook, export it to a folder, and then import it into UR via file import, the characters are displayed correctly.

Please see the following example:

direct copy:
Décoration et loisirs créatifs

via scrapbook:
Décoration et loisirs créatifs
I am using UR 4.1b, Firefox 3.6 and Windows XP.


Hartmut

kinook
02-19-2010, 08:55 AM
That works ok in our tests. The first Google result for "Décoration et loisirs créatifs" was http://www.creamalice.com/. After importing that page from Firefox 3.6 into UR 4.1b using the UR Firefox extension 'Copy to Ultra Recall' button, the item text displays correctly in UR (see attached .urd file and screen shot).

hartmut
02-20-2010, 01:07 AM
Thank you. I tried the site you mention and it works fine.
I suppose the problem lies with the site I downloaded this page from, as all the pages of that site have the same problem.

The original page was

http://www.tourisme-hautemarne.com/terroir-tradition-en-haute-marne/artisanat-et-savoir-faire/decoration-et-loisirs-creatifs,810,1283.html?


Hartmut

kinook
02-22-2010, 08:41 AM
It appears that what is happening is the web page text is UTF-8 encoded, but without a BOM (byte order mark), and within the web page itself, the content is declared to be encoded as iso-8859-1 (the encoding for Western European text), which is inconsistent with the actual encoding. UR imports the data correctly, but when displaying the page, the embedded IE browser assumes the page is encoded as iso-8859-1 rather than UTF-8, which results in the accented characters displaying incorrectly. My guess is that scrapbook converts everything to the current code page or UTF-8 (adding a BOM), but UR doesn't do this (and even Firefox's Save Page As does something similar to UR, except that it doesn't capture images).
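To illustrate the mismatch described above, here is a minimal Python sketch. It is not UR or IE code: the function name is made up for this example, and it only assumes that the viewer decodes UTF-8 bytes as ISO-8859-1, as the page's own declaration tells it to.

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1.

    Hypothetical helper: it mimics a viewer that trusts a wrong
    iso-8859-1 declaration on a page that is really UTF-8.
    """
    return text.encode("utf-8").decode("iso-8859-1")


original = "Décoration et loisirs créatifs"

# Each accented letter is two bytes in UTF-8, so it shows up as two characters.
garbled = misread_utf8_as_latin1(original)
print(garbled)

# The bytes themselves are intact, so reversing the wrong step restores the text.
restored = garbled.encode("iso-8859-1").decode("utf-8")
print(restored)
```

The second print shows that nothing is lost on import; only the interpretation at display time is wrong.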

http://en.wikipedia.org/wiki/UTF-8

http://en.wikipedia.org/wiki/Byte-order_mark

http://en.wikipedia.org/wiki/ISO/IEC_8859-1

One workaround is to select the page content (Ctrl+A) in the browser before importing into UR -- the HTML clipboard data has consistent encodings and is handled correctly.
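For a page that has already been saved to disk, something like the following might also help. This is only a sketch under assumptions: it does not describe what UR or ScrapBook actually do, the function name is invented for illustration, and it assumes the viewer gives a UTF-8 BOM priority over the charset declared inside the page.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend a UTF-8 BOM if data is valid UTF-8 and has none yet.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data  # already marked as UTF-8
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return data  # not valid UTF-8, so leave the bytes untouched
    return codecs.BOM_UTF8 + data


# A page that claims iso-8859-1 but is really stored as UTF-8.
page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    "Décoration et loisirs créatifs"
).encode("utf-8")

fixed = add_utf8_bom(page)
print(fixed[:3] == codecs.BOM_UTF8)
```

Bytes that are not valid UTF-8 (for example a page that really is ISO-8859-1 and contains accented characters) are returned unchanged, so running this over correctly declared pages should not damage them.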