[CPython 3.6 Source Analysis] PyBytesObject/PyUnicodeObject

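Preface

As is well known, Python 2 has str, bytes, and unicode, while Python 3 has only str and bytes. The names do not mean the same thing across versions, though: str in Python 3 corresponds to unicode in Python 2.

According to the CPython 3 documentation, Sequence Objects include Bytes Objects and Unicode Objects. Since PEP 393, the Unicode type has used a layered structure to reduce memory usage.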

Bytes Objects

PyBytesObject

// bytesobject.h.12
Type PyBytesObject represents a character string. An extra zero byte is
reserved at the end to ensure it is zero-terminated, but a size is
present so strings with null bytes in them can be represented. This
is an immutable object type.

As usual, the header opens with a comment:

  • The data ends with a \0 byte
  • The length (size) does not count that \0
  • The type itself is immutable

// pyport.h.93
/* Py_hash_t is the same size as a pointer. */
typedef Py_ssize_t Py_hash_t;

// bytesobject.h.31
typedef struct {
PyObject_VAR_HEAD
Py_hash_t ob_shash;
char ob_sval[1];

/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
* ob_sval[ob_size] == 0.
* ob_shash is the hash of the string or -1 if not computed yet.
*/
} PyBytesObject;

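From the source we can see:

  • PyBytesObject is a variable-length object
  • The data is stored in a char array that is declared with length 1
  • At runtime the ob_sval array holds ob_size+1 elements
  • PyBytesObject caches the hash value in ob_shash, which starts out as -1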
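
The trailing zero byte can be observed from Python on a CPython build. This is only a rough sketch, not something taken from the CPython source: it assumes that id() returns the object's address (a CPython implementation detail) and that ob_sval is the last struct member, so that it starts at bytes.__basicsize__ - 1.

```python
import ctypes

payload = b"a\x00b"   # contains an embedded NUL byte on purpose

# Assumption: ob_sval is the last member, so its offset is tp_basicsize - 1.
data_offset = bytes.__basicsize__ - 1

# Read ob_size + 1 bytes straight from the object's memory (CPython only).
raw = ctypes.string_at(id(payload) + data_offset, len(payload) + 1)

print(len(payload))   # the size does not count the trailing NUL
print(raw)            # the payload followed by one extra zero byte
```

The embedded NUL survives because the length comes from ob_size, not from scanning for a terminator.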

PyBytes_Type

// bytesobject.c.25
#define PyBytesObject_SIZE (offsetof(PyBytesObject, ob_sval) + 1)

// bytesobject.c.2837
PyTypeObject PyBytes_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"bytes",
PyBytesObject_SIZE, // tp_basicsize
sizeof(char), // tp_itemsize
...
(reprfunc)bytes_repr, /* tp_repr */
&bytes_as_number, /* tp_as_number */
&bytes_as_sequence, /* tp_as_sequence */
&bytes_as_mapping, /* tp_as_mapping */
(hashfunc)bytes_hash, /* tp_hash */
...
};

Unsurprisingly, it also starts with PyVarObject_HEAD_INIT(&PyType_Type, 0).
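
The two size slots are visible from Python as bytes.__basicsize__ and bytes.__itemsize__. A small sketch, assuming a CPython interpreter where sys.getsizeof reports tp_basicsize plus ob_size times tp_itemsize for bytes (expected_size is a helper made up for this illustration):

```python
import sys

def expected_size(data):
    # tp_basicsize already includes the 1 byte reserved for the trailing NUL
    return bytes.__basicsize__ + len(data) * bytes.__itemsize__

for data in (b"", b"a", b"hello", b"x" * 1000):
    print(len(data), sys.getsizeof(data), expected_size(data))
```

On such a build the last two columns should agree on every row, which is the `PyBytesObject_SIZE + size` allocation seen from the outside.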

Bytes sharing mechanism

When bytes objects are created, an object pool named characters comes into play, similar to small_ints:

// bytesobject.c.22
static PyBytesObject *characters[UCHAR_MAX + 1];
static PyBytesObject *nullstring;

  • When size==1, the object pointer is looked up in characters first. UCHAR_MAX is the maximum value of unsigned char, i.e. 255.
  • When size==0, the single shared empty-bytes pointer nullstring is used.

>>> a = b'a'
>>> b = b'a'
>>> id(a),id(b)
(1618457902176, 1618457902176)
>>>
>>> a = b'aa'
>>> b = b'aa'
>>> id(a),id(b)
(1618457901016, 1618457902256)
>>>
>>> c = b'a'
>>> id(c)
1618457902176
>>>
>>> d = b''
>>> e = b''
>>> id(d),id(e)
(1618427315824, 1618427315824)
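
In a script file, equal literals can be merged by the compiler, which would hide the effect of the pool. The sketch below therefore builds the objects at run time by slicing. It assumes a CPython interpreter in which slicing a bytes object goes through the same small-object pool; head is a helper made up for this illustration.

```python
def head(data, n):
    # Slicing at run time, so the compiler cannot fold the result into a constant
    return data[:n]

one_a = head(b"ab", 1)
one_b = head(b"ac", 1)
two_a = head(b"abc", 2)
two_b = head(b"abd", 2)
empty_a = head(b"ab", 0)
empty_b = head(b"cd", 0)

print(one_a is one_b)      # single byte: the pooled object
print(two_a is two_b)      # two bytes: separately allocated objects
print(empty_a is empty_b)  # empty: the shared empty object
```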

PyBytes_FromString

PyObject* PyBytes_FromString(const char *v)
PyObject* PyBytes_FromStringAndSize(const char *v, Py_ssize_t len)
PyObject* PyBytes_FromFormat(const char *format, ...)
PyObject* PyBytes_FromFormatV(const char *format, va_list vargs)
PyObject* PyBytes_FromObject(PyObject *o)

As before, CPython defines many functions for creating a bytes object; only one of them is examined below.

// bytesobject.c.132
/*
For PyBytes_FromString(), the parameter `str' points to a null-terminated
string containing exactly `size' bytes.
*/
PyObject * PyBytes_FromString(const char *str)
{
size_t size;
PyBytesObject *op;

assert(str != NULL);
size = strlen(str);
if (size == 0 && (op = nullstring) != NULL) {
Py_INCREF(op);
return (PyObject *)op;
}
if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL) {
Py_INCREF(op);
return (PyObject *)op;
}

/* Inline PyObject_NewVar */
op = (PyBytesObject *)PyObject_MALLOC(PyBytesObject_SIZE + size);
(void)PyObject_INIT_VAR(op, &PyBytes_Type, size); // Py_TYPE(op) = &PyBytes_Type, Py_SIZE(op) = size
op->ob_shash = -1;
memcpy(op->ob_sval, str, size+1);
/* share short strings */
if (size == 0) {
nullstring = op;
Py_INCREF(op);
} else if (size == 1) {
characters[*str & UCHAR_MAX] = op;
Py_INCREF(op);
}
return (PyObject *) op;
}

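As the source above shows, PyBytes_FromString consists of roughly four parts:

  • Compute the string length with strlen(str)
  • Handle the empty string (size == 0) by trying the global nullstring
  • Handle a single-byte string (size == 1) by trying the shared array characters
  • Allocate memory, initialize the object, copy the data, and return the result

Things to note:

  • PyObject_MALLOC allocates PyBytesObject_SIZE + size bytes, a fixed amount that cannot change afterwards
  • memcpy(op->ob_sval, str, size+1): the size+1 means the terminating '\0' is copied into op->ob_sval as well, matching the invariant above
  • op->ob_shash = -1 sets the cached hash to -1, also matching the invariant above
  • The shared array characters is filled in gradually as objects are created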
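
The control flow above can be modelled in a few lines of pure Python. This is a simplified model for illustration only: FakeBytes and bytes_from_string are made-up names, and reference counting and error handling are left out.

```python
class FakeBytes:
    """Stand-in for PyBytesObject: payload, size, and a lazily computed hash."""
    def __init__(self, payload):
        self.ob_sval = payload + b"\0"   # trailing NUL, not counted in ob_size
        self.ob_size = len(payload)
        self.ob_shash = -1               # -1 means "hash not computed yet"

characters = [None] * 256   # models `characters[UCHAR_MAX + 1]`
nullstring = None           # models `nullstring`

def bytes_from_string(payload):
    global nullstring
    size = len(payload)
    # Fast paths: reuse a pooled object if one already exists
    if size == 0 and nullstring is not None:
        return nullstring
    if size == 1 and characters[payload[0]] is not None:
        return characters[payload[0]]
    # Slow path: create the object, then register short ones in the pool
    op = FakeBytes(payload)
    if size == 0:
        nullstring = op
    elif size == 1:
        characters[payload[0]] = op
    return op

a = bytes_from_string(b"a")
b = bytes_from_string(b"a")
c = bytes_from_string(b"b")
filled = sum(slot is not None for slot in characters)
print(a is b, a is c, filled)
```

Only the slots for bytes that were actually requested are filled, which is the "filled in gradually" behaviour noted in the last bullet.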

Unicode Objects

PyUnicodeObject

/* There are 4 forms of Unicode strings:

- compact ascii:
* structure = PyASCIIObject
* kind = PyUnicode_1BYTE_KIND
* ASCII characters only (7 bit)
* created through PyUnicode_New

- compact:
* structure = PyCompactUnicodeObject
* kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or PyUnicode_4BYTE_KIND
* non-ASCII strings only (at least one character >= 128)
* created through PyUnicode_New

- legacy string, not ready:
* structure = PyUnicodeObject
* kind = PyUnicode_WCHAR_KIND
* created through PyUnicode_FromUnicode(NULL, len)

- legacy string, ready:
* structure = PyUnicodeObject structure
* kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or PyUnicode_4BYTE_KIND
* a legacy string after PyUnicode_READY() has been called

// The key difference between compact and legacy:
Compact strings use only one memory block (structure + characters),
whereas legacy strings use one block for the structure and one block
for characters.
*/

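Again there is a long opening comment; see PEP 393 for the details. The design is this complicated in order to balance generality against memory efficiency. Now for the code: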

// unicodeobject.h.197
/* ASCII-only strings created through PyUnicode_New;
utf8_length == wstr_length == length;
the utf8 pointer == data pointer == wstr */
typedef struct {
PyObject_HEAD
Py_ssize_t length; /* Number of code points */
Py_hash_t hash; /* Hash value; -1 if not set */
struct {
unsigned int interned:2; // interning (sharing) state
unsigned int kind:3;
unsigned int compact:1;
unsigned int ascii:1;
unsigned int ready:1;
unsigned int :24;
} state;
wchar_t *wstr; /* wchar_t representation (null-terminated) */
} PyASCIIObject;

/* Non-ASCII strings allocated through PyUnicode_New;
the data immediately follow the structure. */
typedef struct {
PyASCIIObject _base;
Py_ssize_t utf8_length; /* Number of bytes in utf8, excluding the
terminating \0. */
char *utf8; /* UTF-8 representation (null-terminated) */
Py_ssize_t wstr_length; /* Number of code points in wstr, possible
* surrogates count as two code points. */
} PyCompactUnicodeObject;

/* Strings allocated through PyUnicode_FromUnicode(NULL, len);
The actual string data is initially in the wstr block;
and copied into the data block using _PyUnicode_Ready. */
typedef struct {
PyCompactUnicodeObject _base;
union {
void *any;
Py_UCS1 *latin1;
Py_UCS2 *ucs2;
Py_UCS4 *ucs4;
} data; /* Canonical, smallest-form Unicode buffer */
} PyUnicodeObject;

As shown above, three object structs are defined; their purposes and the ways they are created are described in the comments.
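
The difference between PyASCIIObject and PyCompactUnicodeObject shows up as a different fixed overhead for strings that both use 1 byte per character. A small sketch, assuming a CPython interpreter with PEP 393 strings (overhead is a helper made up for this illustration):

```python
import sys

def overhead(ch, n):
    # Bytes used beyond the n one-byte characters themselves
    return sys.getsizeof(ch * n) - n

ascii_overhead = overhead("a", 100)       # compact ASCII form
latin1_overhead = overhead("\xe9", 100)   # U+00E9: non-ASCII, still 1 byte wide

print(ascii_overhead, latin1_overhead)
```

The exact numbers depend on the platform and the Python version, so none are quoted here; the point is that the non-ASCII string pays for the extra utf8 fields of the larger struct.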

PyUnicode_Type

// unicodeobject.c.15170
PyTypeObject PyUnicode_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"str", /* tp_name */
sizeof(PyUnicodeObject), /* tp_basicsize */
0, /* tp_itemsize */
/* Slots */
(destructor)unicode_dealloc, /* tp_dealloc */
...
unicode_repr, /* tp_repr */
&unicode_as_number, /* tp_as_number */
&unicode_as_sequence, /* tp_as_sequence */
&unicode_as_mapping, /* tp_as_mapping */
(hashfunc) unicode_hash, /* tp_hash*/
...
unicode_new, /* tp_new */
PyObject_Del, /* tp_free */
};

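So PyUnicode_Type is str in Python 3.

Creating objects

Like PyBytesObject, PyUnicodeObject can be created in several ways; see python.org for the full list. Because there are several kinds of Unicode object, each created differently, they are examined separately below.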

PyUnicode_New

// unicodeobject.c.1220
PyObject * PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar)
{
PyObject *obj;
PyCompactUnicodeObject *unicode;
void *data;
enum PyUnicode_Kind kind;
int is_sharing, is_ascii;
Py_ssize_t char_size;
Py_ssize_t struct_size;
... /* checks that compute the values of the variables above */
obj = (PyObject *) PyObject_MALLOC(struct_size + (size + 1) * char_size);
obj = PyObject_INIT(obj, &PyUnicode_Type);

unicode = (PyCompactUnicodeObject *)obj;
_PyUnicode_LENGTH(unicode) = size;
_PyUnicode_HASH(unicode) = -1;
_PyUnicode_STATE(unicode).interned = 0;
_PyUnicode_STATE(unicode).kind = kind;
_PyUnicode_STATE(unicode).compact = 1;
_PyUnicode_STATE(unicode).ready = 1;
_PyUnicode_STATE(unicode).ascii = is_ascii;
...
/* Depending on those values, assign:
unicode->utf8 = ?
unicode->utf8_length = ?
_PyUnicode_WSTR_LENGTH(unicode) = ?
_PyUnicode_WSTR(unicode) = ?
*/
return obj;
}

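PyUnicode_New is how compact strings are created. The full code is long, and most of it is error handling. In the end it calls MALLOC, sets the initial values, and returns. Which raises a question: do Unicode objects have no sharing mechanism?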
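
The elided part of PyUnicode_New picks char_size from maxchar. The sketch below models that choice with the PEP 393 thresholds and checks it against how much a string grows per added character. char_size_for and growth_per_char are made-up helper names, and a CPython interpreter is assumed.

```python
import sys

def char_size_for(maxchar):
    """Bytes per character for a given maxchar, following PEP 393."""
    if maxchar < 256:
        return 1      # PyUnicode_1BYTE_KIND (ASCII or Latin-1)
    if maxchar < 65536:
        return 2      # PyUnicode_2BYTE_KIND
    return 4          # PyUnicode_4BYTE_KIND

def growth_per_char(ch, n=100):
    # Size difference between a string of n+1 and one of n copies of ch
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

for ch in ("a", "\xe9", "\u4e2d", "\U0001f600"):
    print(hex(ord(ch)), char_size_for(ord(ch)), growth_per_char(ch))
```

The last two columns should match, reflecting the `(size + 1) * char_size` term in the PyObject_MALLOC call.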

PyUnicode_FromUnicode

// unicodeobject.c.1993
PyObject * PyUnicode_FromUnicode(const Py_UNICODE *u, Py_ssize_t size)
{
PyObject *unicode;
Py_UCS4 maxchar = 0;
Py_ssize_t num_surrogates;

if (u == NULL)
return (PyObject*)_PyUnicode_New(size);

/* Nested macros; in the end the shared unicode_empty = PyUnicode_New(0, 0) is returned */
if (size == 0)
_Py_RETURN_UNICODE_EMPTY();

/* Shared single characters */
if (size == 1 && (Py_UCS4)*u < 256)
return get_latin1_char((unsigned char)*u);

/* (omitted) find_maxchar_surrogates() computes maxchar and num_surrogates */

/* Create a new object for everything else */
unicode = PyUnicode_New(size - num_surrogates, maxchar);
switch (PyUnicode_KIND(unicode)) {
/* each case performs the conversion for its kind */
}
return unicode_result(unicode);
}

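The original code is long; only the more important parts are excerpted above. At last there are signs of a Unicode sharing mechanism, but the pieces still need to be examined one by one.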

_PyUnicode_New
// unicodeobject.c.1067
static PyUnicodeObject * _PyUnicode_New(Py_ssize_t length)
{
PyUnicodeObject *unicode;
size_t new_size;

/* Share empty strings */
if (length == 0 && unicode_empty != NULL) {
Py_INCREF(unicode_empty);
return (PyUnicodeObject*)unicode_empty;
}

/* Error handling (omitted): length must be neither too large nor < 0 */

// Create the object
unicode = PyObject_New(PyUnicodeObject, &PyUnicode_Type);
new_size = sizeof(Py_UNICODE) * ((size_t)length + 1);

// Initialize the fields
_PyUnicode_WSTR_LENGTH(unicode) = length;
_PyUnicode_HASH(unicode) = -1;
_PyUnicode_STATE(unicode).interned = 0;
_PyUnicode_STATE(unicode).kind = 0;
_PyUnicode_STATE(unicode).compact = 0;
_PyUnicode_STATE(unicode).ready = 0;
_PyUnicode_STATE(unicode).ascii = 0;
_PyUnicode_DATA_ANY(unicode) = NULL;
_PyUnicode_LENGTH(unicode) = 0;
_PyUnicode_UTF8(unicode) = NULL;
_PyUnicode_UTF8_LENGTH(unicode) = 0;

// The actual character buffer
_PyUnicode_WSTR(unicode) = (Py_UNICODE*) PyObject_MALLOC(new_size);

// Shortcut: only the two ends of the array are initialized
_PyUnicode_WSTR(unicode)[0] = 0;
_PyUnicode_WSTR(unicode)[length] = 0;
return unicode;
}

// Call site:
if (u == NULL)
return (PyObject*)_PyUnicode_New(size);

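The source is still long. From the condensed code we can see when _PyUnicode_New is used: the length of the string is known, but its content is not. It only allocates the buffer; the first and last elements are set to 0 and the rest is left uninitialized.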

get_latin1_char (Latin-1 sharing mechanism)
static PyObject* get_latin1_char(unsigned char ch)
{
PyObject *unicode = unicode_latin1[ch];
if (!unicode) {
unicode = PyUnicode_New(1, ch);
if (!unicode)
return NULL;
PyUnicode_1BYTE_DATA(unicode)[0] = ch;
assert(_PyUnicode_CheckConsistency(unicode, 1));
unicode_latin1[ch] = unicode;
}
Py_INCREF(unicode);
return unicode;
}

// Call site:
if (size == 1 && (Py_UCS4)*u < 256)
return get_latin1_char((unsigned char)*u);

The code above follows a familiar pattern: an array, unicode_latin1.

// unicodeobject.c.213
/* Single character Unicode strings in the Latin-1 range are being
shared as well. */
static PyObject *unicode_latin1[256] = {NULL};

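So the unicode_latin1 array exists from the start but is initially empty; it is filled on demand. This is the sharing mechanism for single-character Unicode strings.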
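
The 256-entry boundary can be observed with chr(). This is a sketch that assumes a CPython interpreter in which chr() goes through the same Latin-1 cache for code points below 256 and allocates a fresh object otherwise:

```python
a1, a2 = chr(0x61), chr(0x61)        # 'a', ASCII
e1, e2 = chr(0xE9), chr(0xE9)        # U+00E9, non-ASCII but inside Latin-1
c1, c2 = chr(0x4E2D), chr(0x4E2D)    # U+4E2D, outside the Latin-1 range

print(a1 is a2)   # pooled single character
print(e1 is e2)   # pooled single character
print(c1 is c2)   # two separate objects with equal content
```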

unicode_result
// unicodeobject.c.548
static PyObject* unicode_result(PyObject *unicode)
{
if (PyUnicode_IS_READY(unicode))
return unicode_result_ready(unicode);
else
return unicode_result_wchar(unicode);
}

static PyObject* unicode_result_ready(PyObject *unicode)
{
Py_ssize_t length;

length = PyUnicode_GET_LENGTH(unicode);
if (length == 0) {
// shared unicode_empty
return unicode_empty;
}

if (length == 1) {
void *data = PyUnicode_DATA(unicode);
int kind = PyUnicode_KIND(unicode);
Py_UCS4 ch = PyUnicode_READ(kind, data, 0);
if (ch < 256) {
// shared latin1_char (simplified here; the real code consults unicode_latin1[ch])
return unicode;
}
}
return unicode;
}

// Call site:
/* Create a new object for everything that is not a single character;
PyUnicode_New sets unicode.ready = 1 */
unicode = PyUnicode_New(size - num_surrogates, maxchar);
return unicode_result(unicode);

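Judging from the code above, unicode_result looks purely redundant for an object that PyUnicode_New has just created.

Looking at PyUnicode_FromUnicode as a whole, only single Latin-1 characters are shared. That cannot explain the following session: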

>>> a = 'abcde'
>>> b = 'abcde'
>>> id(a), id(b),id(a)==id(b)
(1605538588408, 1605538588408, True)
>>> del a
>>> del b
>>> a = 'abcde'
>>> id(a)
1605538115744
>>> b = 'abcde'
>>> id(a) == id(b)
True

Unicode sharing mechanism

// unicodeobject.h.412
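#define SSTATE_NOT_INTERNED 0 // not interned
#define SSTATE_INTERNED_MORTAL 1 // interned; the interned dict's references are not counted
#define SSTATE_INTERNED_IMMORTAL 2 // interned permanently; never destroyed

Both PyUnicode_New and _PyUnicode_New above assign unicode.interned = 0, and the source shows that 0 means "not interned".

unicodeobject.c also contains four suspicious-looking functions: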

void PyUnicode_InternInPlace(PyObject **p)
void PyUnicode_InternImmortal(PyObject **p)
PyObject * PyUnicode_InternFromString(const char *cp)
void _Py_ReleaseInternedUnicodeStrings(void)

Elsewhere in the CPython source there is a lot of code like true_str = PyUnicode_InternFromString("True"), and PyUnicode_InternFromString in turn calls PyUnicode_InternInPlace.

PyUnicode_InternInPlace

// unicodeobject.c.174
/* Note: references held by interned are not counted in ob_refcnt, so interning does not prevent deallocation */
static PyObject *interned = NULL;
static PyObject *unicode_empty = NULL;

// unicodeobject.c.15278
void PyUnicode_InternInPlace(PyObject **p)
{
PyObject *s = *p;
PyObject *t;
// Type checks; subclasses are not interned
if (s == NULL || !PyUnicode_Check(s))
return;
/* If it's a subclass, we don't really know what putting
it in the interned dict might do. */
if (!PyUnicode_CheckExact(s))
return;
if (PyUnicode_CHECK_INTERNED(s))
return;

// Lazily create the interned dict
if (interned == NULL) {
interned = PyDict_New();
}
Py_ALLOW_RECURSION // ceval.h.113: saves the thread's recursion_critical
t = PyDict_SetDefault(interned, s, s);
Py_END_ALLOW_RECURSION // restores recursion_critical

// An equal string was already in the dict: hand back that one
if (t != s) {
Py_INCREF(t);
Py_SETREF(*p, t);
return;
}

// s was just inserted into the dict
Py_REFCNT(s) -= 2; // once for the key, once for the value
_PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL; // == 1
}

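We can see that:

  • Before interning, the type is checked, and so is whether the string is already interned
  • Interning is implemented with the dict object interned
  • PyDict_SetDefault returns a pointer to the object stored in the dict
  • If t != s, an equal string was already in the dict: take a new reference to t, repoint *p at t (which releases the old reference to s), and return
  • If t == s, the string was not in the dict before: it has now been inserted, and s.interned is set to 1
  • The pointers held by the interned dict do not count as real references to the object, hence Py_REFCNT(s) -= 2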
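
The core of the function is a single setdefault call, which can be modelled in pure Python. This is a simplified model for illustration: intern_in_place is a made-up name, and the reference-count adjustments and the interned state flag are left out.

```python
interned = {}   # stand-in for the C-level `interned` dict

def intern_in_place(s):
    """Return the canonical object for s."""
    if type(s) is not str:            # models PyUnicode_CheckExact: skip subclasses
        return s
    t = interned.setdefault(s, s)     # models PyDict_SetDefault(interned, s, s)
    return t                          # t is s on first insert, else the earlier object

# Build two equal strings at run time so that they are distinct objects
first = "".join(["inter", "ned!"])
second = "".join(["inter", "ned!"])

canon_first = intern_in_place(first)
canon_second = intern_in_place(second)
print(first is second, canon_first is first, canon_second is first)
```

The second call finds an equal key already in the dict and returns the earlier object, which is the `t != s` branch in the C code.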

That settles the question for now: sharing is implemented with the interned dict plus the PyUnicode_Intern* functions.
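
From Python, the same mechanism is reachable through sys.intern. A short sketch, assuming a CPython interpreter; build is a helper made up for this illustration, and the string contains a space so that it does not look like an identifier:

```python
import sys

def build():
    # Joined at run time, so the result is a fresh, non-interned string
    return "".join(["intern", " me"])

a = build()
b = build()
print(a == b, a is b)   # equal content, but two separate objects

a = sys.intern(a)
b = sys.intern(b)
print(a is b)           # both names now refer to the one interned object
```

As far as I can tell, the compiler also interns string constants that look like identifiers, which would account for the 'abcde' session shown earlier.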