Python Latin Characters and Unicode


I have a tree structure in which keywords may contain some Latin characters. I have a function that loops through all the leaves of the tree and adds each keyword to a list under certain conditions.

Here I have the code to add these keywords to the list:

  print "Adding: " + self.keyword
  leaf_list.append(self.keyword)
  print leaf_list

If the keyword is université, then my output is:

  Adding: université
  ['universit\xc3\xa9']

It appears that the print function shows the Latin character correctly, but when I add the keyword to the list, it gets decoded.

How can I change it? I need to be able to print the list with standard Latin characters, not their decoded versions.

You do not have Unicode objects, but byte strings containing UTF-8 encoded text. Printing such byte strings to your terminal may work if your terminal is configured to handle UTF-8 text.

When a list is converted to a string, its contents are shown as representations, that is, as the result of the repr() function. The representation of a string object uses escape codes for any byte outside the printable ASCII range; newlines are replaced by \n, for example. Your UTF-8 bytes are represented by \xhh escape sequences.
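The effect can be reproduced in a few lines. This is a minimal sketch in Python 3 syntax, which is an assumption on my part since the code in the question is Python 2; the Python 3 bytes type plays the role of the Python 2 byte string here, and the variable names are illustrative only.

```python
# Python 3 sketch; bytes stands in for the Python 2 byte string.
keyword = 'universit\u00e9'.encode('utf-8')  # UTF-8 bytes for "université"
leaf_list = [keyword]

# Converting a list to a string calls repr() on every element,
# so bytes outside printable ASCII appear as \xhh escapes.
list_text = str(leaf_list)
element_repr = repr(keyword)

print(list_text)     # [b'universit\xc3\xa9']
print(element_repr)  # b'universit\xc3\xa9'
```

The list output contains the element's repr() verbatim, which is why the escapes appear only when the whole list is printed and not when the keyword is printed on its own.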

If you were using Unicode objects, the representation would still use \xhh escapes, but only for Unicode codepoints in the Latin-1 range (outside ASCII); the rest are shown with \uhhhh and \Uhhhhhhhh escapes, depending on their codepoint. When printing, Python automatically encodes such values to the correct encoding for your terminal:

  >>> u'université'
  u'universit\xe9'
  >>> len(u'université')
  10
  >>> print u'université'
  université

Compare it with a byte string:

  >>> 'université'
  'universit\xc3\xa9'
  >>> len('université')
  11
  >>> 'université'.decode('utf8')
  u'universit\xe9'
  >>> print 'université'
  université

Note that the length reflects that the é codepoint is encoded to two bytes. It was my terminal, by the way, that presented Python with the \xc3\xa9 bytes when I pasted the é character into the Python session, as it is configured to use UTF-8, and Python detected this and decoded the bytes when I defined a u'..' Unicode object literal.
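To get readable output for the whole list, one option is to decode the byte strings to text and join them into a single string before printing, rather than printing the list object itself. The following is a sketch under assumptions: it uses Python 3 syntax (bytes and str correspond to the Python 2 byte string and Unicode object), and leaf_list holds stand-in data rather than keywords from the real tree.

```python
# Python 3 sketch; leaf_list is stand-in data, not the asker's real tree.
leaf_list = [b'universit\xc3\xa9', b'caf\xc3\xa9']

# Decode each UTF-8 byte string to text.
decoded = [keyword.decode('utf-8') for keyword in leaf_list]

# Printing the list itself would go through repr() for each element;
# joining the elements gives one text string that is printed as-is.
readable = ', '.join(decoded)
print(readable)  # université, café

# The byte string is one unit longer than the text: é takes two bytes in UTF-8.
print(len(leaf_list[0]), len(decoded[0]))  # 11 10
```

In Python 2 the same idea would be written as print u', '.join(k.decode('utf8') for k in leaf_list).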

I strongly recommend that you read the following articles to understand how Python handles Unicode, and what the difference is between Unicode text and encoded byte strings:

  • Joel Spolsky

  • Ned Batchelder
