Python Latin Characters and Unicode
I have a tree structure in which keywords may contain some Latin characters. I have a function that loops through all the leaves of the tree and adds each keyword to a list under certain conditions.
Here is the code I have for adding these keywords to the list:

    print "Adding: " + self.keyword
    leaf_list.append(self.keyword)
    print leaf_list

If the keyword is université, then my output is:

    Adding: université
    ['universit\xc3\xa9']

It appears that the print function shows the Latin character correctly, but when I add it to the list, it gets decoded.
How can I change this? I need to be able to print the list with the standard Latin characters, not their decoded versions.
You do not have Unicode objects, but byte strings containing UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.
When a list is converted to a string, its contents are shown as representations, that is, the result of the repr() function. The representation of a string object uses escape codes for any byte outside the printable ASCII range; newlines are replaced by \n, for example. Your UTF-8 bytes are represented by \xhh escape sequences.
If you were using Unicode objects, the representation would still use \xhh escapes, but only for Unicode codepoints in the Latin-1 range (outside ASCII); the rest are shown with \uhhhh and \Uhhhhhhhh escapes, depending on their codepoint. When printing, Python automatically encodes such values to the correct encoding for your terminal:

    >>> u'université'
    u'universit\xe9'
    >>> len(u'université')
    10
    >>> print u'université'
    université

Compare this with a byte string:

    >>> 'université'
    'universit\xc3\xa9'
    >>> len('université')
    11
    >>> 'université'.decode('utf8')
    u'universit\xe9'
    >>> print 'université'
    université

Note that the length reflects that the é codepoint is encoded as two bytes. It was my terminal that presented Python with the \xc3\xa9 bytes when I pasted the é character into the Python session, by the way, as it is configured to use UTF-8, and Python detected this and decoded the bytes when I defined a u'..' Unicode object literal.
I strongly recommend that you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
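
One possible way to print the list with readable characters is to decode each item and build the output string yourself, rather than relying on the list's repr(). The following is only a minimal sketch, written for Python 3, where the byte string from the question corresponds to a bytes object. The helper name format_keywords is hypothetical, and it assumes every byte string in the list is UTF-8 encoded.

```python
# -*- coding: utf-8 -*-
# Sketch for Python 3; in Python 2 the plain 'str' type plays the role of bytes.

keyword = u"universit\u00e9"        # Unicode text: "université", 10 codepoints
encoded = keyword.encode("utf-8")   # UTF-8 bytes: the final é becomes \xc3\xa9

leaf_list = [encoded]               # the list holds encoded bytes, as in the question


def format_keywords(items, encoding="utf-8"):
    """Hypothetical helper: decode byte strings and join them for display.

    Assumes every bytes item in `items` is encoded with `encoding`.
    """
    decoded = [
        item.decode(encoding) if isinstance(item, bytes) else item
        for item in items
    ]
    return u"[" + u", ".join(decoded) + u"]"


# Converting the list itself to a string uses repr() on each item,
# so the non-ASCII bytes show up as \xhh escapes.
list_repr = repr(leaf_list)

# Decoding first and joining manually keeps the readable characters.
readable = format_keywords(leaf_list)
print(readable)
```

Decoding to Unicode text as early as possible (for example, when the keywords are first read) and encoding only on output would avoid the need for such a helper in most places, but whether that fits depends on where the keywords in the tree come from.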