Python: Converting Between Unicode and Plain Strings
1.1. Problem
You need to deal with data that doesn't fit in the ASCII character set.
1.2. Solution
Unicode strings can be encoded as plain strings in a variety of ways, depending on the encoding you choose:
# Convert Unicode to a plain Python string: "encode" (Python 2 syntax)
unicodestring = u"Hello world"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")


# Convert a plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")

# All four decodings recover the same Unicode string
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
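The recipe's code relies on Python 2 names (u"..." literals and the unicode() built-in). As a rough sketch, assuming Python 3, where str is the Unicode type and bytes is the byte string, the same round trip might look like this. The variable names are illustrative only.

```python
# Python 3 sketch: str holds Unicode text, bytes holds encoded data.
unicode_string = "Hello world"

# Encode: Unicode text -> bytes
utf8_bytes = unicode_string.encode("utf-8")
ascii_bytes = unicode_string.encode("ascii")
iso_bytes = unicode_string.encode("ISO-8859-1")
utf16_bytes = unicode_string.encode("utf-16")

# Decode: bytes -> Unicode text
text1 = utf8_bytes.decode("utf-8")
text2 = ascii_bytes.decode("ascii")
text3 = iso_bytes.decode("ISO-8859-1")
text4 = utf16_bytes.decode("utf-16")

# Every round trip recovers the original text.
assert text1 == text2 == text3 == text4 == unicode_string
```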
1.3. Discussion
If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.
Unicode is a big topic. Luckily, you don't need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold only one of 256 values, these environments are limited to 256 characters. Unicode, on the other hand, has tens of thousands of characters. That means a Unicode character can take more than one byte, so you need to make the distinction between characters and bytes.
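To make the byte/character distinction concrete, here is a small sketch, assuming Python 3 (where len() of a str counts characters and len() of a bytes object counts bytes). The sample text is an arbitrary choice for illustration.

```python
# "naïve café", written with escapes so the source file stays pure ASCII.
text = "na\u00efve caf\u00e9"

encoded = text.encode("utf-8")

char_count = len(text)      # number of characters
byte_count = len(encoded)   # number of bytes in the UTF-8 encoding

# The two non-ASCII characters each need two bytes in UTF-8,
# so the byte count exceeds the character count.
print(char_count, byte_count)
```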
Standard Python strings (the str type in Python 2, which this recipe uses) are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are "8-bit string" and "plain string." In this recipe we will call them byte strings, to remind you of their byte-orientedness.
Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
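The encode-on-write, decode-on-read pattern can be sketched as follows, assuming Python 3. Here io.BytesIO stands in for a binary file or a socket; the message text and variable names are made up for the example.

```python
import io

message = "Gr\u00fc\u00dfe"  # "Grüße", with two non-ASCII characters

# A byte-oriented sink: it accepts bytes only, not Unicode text.
sink = io.BytesIO()
sink.write(message.encode("utf-8"))  # encode before writing

# Reading back yields bytes, which must be decoded into characters.
raw = sink.getvalue()
received = raw.decode("utf-8")

assert received == message
```

Passing the unencoded text straight to sink.write() would raise a TypeError in Python 3, which is the point at which an encoding has to be chosen.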
There are many ways of converting Unicode objects to byte strings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few you should know about:
The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for Eastern texts.
The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for Eastern ones. A variant of UTF-16 is sometimes known as UCS-2.
The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; they can support only some particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.
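The trade-offs among these three encodings can be checked with a short sketch, assuming Python 3. The sample strings and the size_in helper are hypothetical, and "utf-16-le" is used instead of plain "utf-16" so that the byte-order mark does not inflate the counts.

```python
import codecs

western = "Hello world"
eastern = "\u4f60\u597d\u4e16\u754c"  # four Chinese characters

def size_in(text, encoding):
    """Return how many bytes `text` occupies in `encoding` (illustrative helper)."""
    return len(text.encode(encoding))

# Pure ASCII text encodes to identical bytes in ASCII and UTF-8.
ascii_matches_utf8 = western.encode("ascii") == western.encode("utf-8")

# Byte counts as (UTF-8, UTF-16) pairs for each sample.
western_sizes = (size_in(western, "utf-8"), size_in(western, "utf-16-le"))
eastern_sizes = (size_in(eastern, "utf-8"), size_in(eastern, "utf-16-le"))

# ISO-8859-1 covers only 256 characters, so Chinese text cannot be encoded.
try:
    eastern.encode("ISO-8859-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# Encoding names are case-insensitive: both spellings resolve to one codec.
same_codec = codecs.lookup("UTF-8").name == codecs.lookup("utf-8").name

print(ascii_matches_utf8, western_sizes, eastern_sizes, latin1_failed, same_codec)
```

For these samples, UTF-8 is the smaller encoding for the Western text and UTF-16 is the smaller one for the Eastern text, matching the descriptions above.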
If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.
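When another application hands you data in one of the other encodings, one possible approach is to decode it at the boundary and re-encode it as UTF-8. The sketch below assumes Python 3 and that the source encoding is already known; transcode is a hypothetical helper, not a standard library function.

```python
def transcode(data, source_encoding, target_encoding="utf-8"):
    """Decode bytes from source_encoding, then re-encode them (illustrative helper)."""
    text = data.decode(source_encoding)   # bytes -> Unicode characters
    return text.encode(target_encoding)   # Unicode characters -> bytes

# "café" as produced by an application that writes ISO-8859-1.
latin1_bytes = b"caf\xe9"

utf8_bytes = transcode(latin1_bytes, "ISO-8859-1")
print(utf8_bytes)
```

Decoding with the wrong source encoding may either raise an error or silently produce the wrong characters, so the source encoding has to come from the producer of the data rather than from guesswork.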