【CPython3.6源码分析】PyDictObject

参考

前言

Python官网把 PyDictObject 归类与 Concrete Objects Layer,享受同样待遇的还有PySetObject。在前面 PythonUnicodeObject 中,我们已经见到了 PythonDict 的运用,即共享机制 interned。在 Python 世界里,字典被用于建立字节码的运行环境,用来存放变量名和变量值,意味着做任何操作几乎都要设计到 PythonDict,

因此,对搜索的效率要求及其苛刻。因而采用的 HashTable(散列表),在最优情况下能达到O(1)。散列表的基本思想是,将键映射为一个整数,把整数作为索引访问内存。主要逻辑是:查询键值 ——散列函数 hash function —— 散列值 hash value —— 内存区域 —— 查询结果——散列冲突。Python 处理散列冲突的问题,采用的是 开放定址法。删除探测链上元素,采用的是伪删除。

因为字典的重要性,Python 甚至单独在Objects/dictnotes.txt中写入了关于字典的说明,下面仅挑选部分内容:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
1. 主要应用
1. 传递关键字参数(1~3个元素)
2. 类方法查找:
1. 通常包含 8~16 个元素。
2. 通常只写入一次,但多次查找
3. 当使用基类时,会频繁在基类中查找
3. 实例属性查找、全局变量查找
1. 通常包含 4~10 个元素。
2. 写入和读取都非常频繁
4. Builtins(内置命令)
1. 频繁的读取,几乎不写入
2. About 150 interned strings (as of Py3.3).
3. 其中一切访问频率远大于其他

2. 数据存储
由3部分组成:
1. dictobject 自身
2. A dict-keys object (keys & hashes)
3. A values array

仅涉及单个键的字典操作可以是O(1),除非涉及到调整大小。

现在的版本与之前的差别:
1. key value 可以分开存储
2. 分离表中 key-val 新增组合 (key, NULL),代表被删除
3. key-val表中,不能嵌套小表
4. 一般字典比以前略大
5. 单个类的所有对象,共享key表,节约大量内存

PyDict_Type

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// dictobject.c.3282
PyTypeObject PyDict_Type = {
PyVarObject_HEAD_INIT(&PyType_Type, 0)
"dict",
sizeof(PyDictObject),
(destructor)dict_dealloc, /* tp_dealloc */
&dict_as_sequence, /* tp_as_sequence */
&dict_as_mapping, /* tp_as_mapping */
PyObject_HashNotImplemented, /* tp_hash */
(getiterfunc)dict_iter, /* tp_iter */
mapp_methods, /* tp_methods */
dict_init, /* tp_init */
PyType_GenericAlloc, /* tp_alloc */
dict_new, /* tp_new */
PyObject_GC_Del, /* tp_free */
};

PyDictObject

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// dictobject.h.18
typedef struct _dictkeysobject PyDictKeysObject;

typedef struct {
PyObject_HEAD

/* Number of items in the dictionary */
Py_ssize_t ma_used;

/* Dictionary version: globally unique,
value change each time the dictionary is modified */
uint64_t ma_version_tag;

PyDictKeysObject *ma_keys;

/* If ma_values is NULL, the table is "combined":
keys and values are stored in ma_keys.

If ma_values is not NULL, the table is splitted:
keys are stored in ma_keys and values are stored in ma_values
*/
PyObject **ma_values;
} PyDictObject;

如上,不同于之前的 PyListObject/PyUnicodeObject,他们被归类于 Sequence Object。而PyDictObject 归类于 Concrete Objects。因此,直接定义为 PyObject_HEAD。其他字段含义,见注释。

既然是分离设计,那么必然存在两个储存表的地方,一个是 ma_keys,一个是ma_values。具体内容见下文注释。

1
2
3
4
5
6
7
8
9
10
11
The DictObject can be in one of two forms.

A combined table:
ma_values == NULL, dk_refcnt == 1.
Values are stored in the me_value field of the PyDictKeysObject.

A split table:
ma_values != NULL, dk_refcnt >= 1
Values are stored in the ma_values array.
Only string (unicode) keys are allowed.
All dicts sharing same key must have same insertion order.

PyDictKeysObject

1
2
3
4
5
6
7
8
9
10
11
12
13
14
// dict-common.h/23
struct _dictkeysobject {
Py_ssize_t dk_refcnt;
Py_ssize_t dk_size;
dict_lookup_func dk_lookup;
Py_ssize_t dk_usable;
Py_ssize_t dk_nentries;
union {
int8_t as_1[8];
int16_t as_2[4];
int32_t as_4[2];
int64_t as_8[1];
} dk_indices;
};

PyDictKeysObject 实现了字典的 hash table,布局如下:

layout

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
+---------------+
| dk_refcnt |
| dk_size | /* Size of the hash table.
| | It must be a power of 2. */
| dk_lookup | /* Function to lookup in the hash table. */
| dk_usable | /* Number of usable entries in dk_entries. */
| dk_nentries | /* Number of used entries in dk_entries. */
+---------------+
| dk_indices | // Actual hash table of dk_size entries.
| |
+---------------+
| dk_entries | /* array of PyDictKeyEntry.
| | len(dk_entries) == USABLE_FRACTION(dk_size) */
+---------------+

The size in bytes of an indice depends on dk_size:

- 1 byte if dk_size <= 0xff (char*)
- 2 bytes if dk_size <= 0xffff (int16_t*)
- 4 bytes if dk_size <= 0xffffffff (int32_t*)
- 8 bytes otherwise (int64_t*)

Dynamically sized, 8 is minimum.

需要注意的是,dk_indices 是一个共用体,会根据 dk_size 的值,决定存储 index 的类型。

PyDictKeyEntry

1
2
3
4
5
6
7
8
9
10
11
// dict-common.h.17
#define DKIX_EMPTY (-1)
#define DKIX_DUMMY (-2) /* Used internally */
#define DKIX_ERROR (-3)

// dict-common.h.4
typedef struct {
Py_hash_t me_hash; /* Cached hash code of me_key. */
PyObject *me_key;
PyObject *me_value; /* only meaningful for combined tables */
} PyDictKeyEntry;

dk_entries中存储的是 PyDictKeyEntry对象,其中每个元素都可以称为一个 enrty。因为使用了负数作为 entry 的状态,因此dk_indices中存储的是有符号整数。可以通过 宏 DK_ENTRIES 访问 entry:

1
2
3
4
5
6
7
8
9
10
11
// dictobject.c.289
#define DK_SIZE(dk) ((dk)->dk_size)

#define DK_IXSIZE(dk) \
(DK_SIZE(dk) <= 0xff ? \
1 : DK_SIZE(dk) <= 0xffff ? \
2 : DK_SIZE(dk) <= 0xffffffff ? \
4 : sizeof(int64_t))

#define DK_ENTRIES(dk) \
((PyDictKeyEntry*)(&(dk)->dk_indices.as_1[DK_SIZE(dk) * DK_IXSIZE(dk)]))

dk_indices

dk_indices 即是真正的 hash table,对应一个 slot 数组,每个slot 有四种状态

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// dictobject.c.64
1. Unused. index == DKIX_EMPTY
This is each slot's initial state.
Does not hold an active (key, value) pair now and never did.
Unused can transition to Active upon key insertion.

2. Active. index >= 0, me_key != NULL and me_value != NULL
Holds an active (key, value) pair.
This is the only case in which me_value != NULL.
Active can transition to Dummy or Pending upon key deletion~~~~
(for combined and split tables respectively).

3. Dummy. index == DKIX_DUMMY (combined only)
Previously held an active (key, value) pair,
but that was deleted and an active pair has not yet overwritten the slot.
Dummy can transition to Active upon key insertion.
Dummy slots cannot be made Unused again.

4. Pending. index >= 0, key != NULL, and value == NULL (split only)
Not yet inserted in split-table.

简单来说就是

  1. Unused,初始状态,该 slot 没有被使用,index = -1
  2. Active,正在使用,index > 0
  3. Dummy,曾经使用过,但现在被删除了,index = -2 。(仅限 combined-table)
  4. Pending,还未插入。(仅限 split-table)
    正因为 Dummy 态不能转换为 Unused,所以保证了探测链的连续性,对应前文说的 伪删除。

创建

1
2
3
4
5
6
7
8
9
10
#define PyDict_MINSIZE 8

// dictobject.c.620
PyObject * PyDict_New(void)
{
PyDictKeysObject *keys = new_keys_object(PyDict_MINSIZE);
if (keys == NULL)
return NULL;
return new_dict(keys, NULL);
}

PyDict_MINSIZE 是任何新 Dict 的起始大小,默认为8,满足运行过程中大量的函数参数传递过程。

new_keys_object

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// dictobject.c.374
#define USABLE_FRACTION(n) (((n) << 1)/3)

static PyDictKeysObject *new_keys_object(Py_ssize_t size)
{
PyDictKeysObject *dk;
Py_ssize_t es, usable;

// dk_size >= 8 and must be a power of 2.
assert(size >= PyDict_MINSIZE);
assert(IS_POWER_OF_2(size));

// len(dk_entries) == USABLE_FRACTION(dk_size),最小为5
usable = USABLE_FRACTION(size);
... // es = 1/2/3/4; 根据 size 大小,确定存储位数

// 尝试共享
if (size == PyDict_MINSIZE && numfreekeys > 0) {
dk = keys_free_list[--numfreekeys];
}
else {
dk = PyObject_MALLOC(sizeof(PyDictKeysObject)
- Py_MEMBER_SIZE(PyDictKeysObject, dk_indices)
+ es * size
+ sizeof(PyDictKeyEntry) * usable);
if (dk == NULL) {
PyErr_NoMemory();
return NULL;
}
}
DK_DEBUG_INCREF dk->dk_refcnt = 1;
dk->dk_size = size;
dk->dk_usable = usable;
dk->dk_lookup = lookdict_unicode_nodummy;
dk->dk_nentries = 0;

// 初始化 dk_indices == -1
memset(&dk->dk_indices.as_1[0], 0xff, es * size);

// 初始化 dk_entries == 0
memset(DK_ENTRIES(dk), 0, sizeof(PyDictKeyEntry) * usable);
return dk;
}

从代码中也清晰的看见,对象缓冲池 keys_free_list 的身影。其中需要注意的是:

1
usable = USABLE_FRACTION(size);    // (((n) << 1)/3)

size 默认为 PyDict_MINSIZE 即8。通过计算可以得出默认存放5个对象。

new_dict

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
// dictobject.c.573
static PyObject *
new_dict(PyDictKeysObject *keys, PyObject **values)
{
PyDictObject *mp;

// 尝试共享
if (numfree) {
mp = free_list[--numfree];
_Py_NewReference((PyObject *)mp);
}
else {
mp = PyObject_GC_New(PyDictObject, &PyDict_Type);
if (mp == NULL) {
DK_DECREF(keys);
free_values(values);
return NULL;
}
}
mp->ma_keys = keys;
mp->ma_values = values;
mp->ma_used = 0;
mp->ma_version_tag = DICT_NEXT_VERSION();

return (PyObject *)mp;
}

代码不长,很好懂。再次发现了对象池 free_list。

共享机制

前面 创建 PyDictObject 和 PyDictKeysObject 时,都利用了对象缓冲池。

1
2
3
4
5
#define PyDict_MAXFREELIST 80
static PyDictObject *free_list[PyDict_MAXFREELIST];
static int numfree = 0;
static PyDictKeysObject *keys_free_list[PyDict_MAXFREELIST];
static int numfreekeys = 0;

宏定义如上,与 PyListObject 类似,甚至连名字都类似。那么必然的,销毁对象时,会把对象放入缓冲池。

dict_dealloc

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
static PyObject *empty_values[1] = { NULL };

// dictobject.c.2003
static void dict_dealloc(PyDictObject *mp)
{
PyObject **values = mp->ma_values;
PyDictKeysObject *keys = mp->ma_keys;
Py_ssize_t i, n;

/* bpo-31095: UnTrack is needed before calling any callbacks */
PyObject_GC_UnTrack(mp);
Py_TRASHCAN_SAFE_BEGIN(mp)
if (values != NULL) {
if (values != empty_values) {
for (i = 0, n = mp->ma_keys->dk_nentries; i < n; i++) {
Py_XDECREF(values[i]);
}
free_values(values);
}
DK_DECREF(keys);
}
else if (keys != NULL) {
assert(keys->dk_refcnt == 1);
DK_DECREF(keys);
}

// 尝试共享
if (numfree < PyDict_MAXFREELIST && Py_TYPE(mp) == &PyDict_Type)
free_list[numfree++] = mp;
else
Py_TYPE(mp)->tp_free((PyObject *)mp);
Py_TRASHCAN_SAFE_END(mp)
}

// dictobject.c.554
static void free_keys_object(PyDictKeysObject *keys)
{
PyDictKeyEntry *entries = DK_ENTRIES(keys);
Py_ssize_t i, n;
for (i = 0, n = keys->dk_nentries; i < n; i++) {
Py_XDECREF(entries[i].me_key);
Py_XDECREF(entries[i].me_value);
}
// 尝试共享
if (keys->dk_size == PyDict_MINSIZE && numfreekeys < PyDict_MAXFREELIST) {
keys_free_list[numfreekeys++] = keys;
return;
}
PyObject_FREE(keys);
}

CRUD

Mapping方法簇

1
2
3
4
5
6
7
8
9
10
static PyMappingMethods dict_as_mapping = {
(lenfunc)dict_length, /*mp_length*/
(binaryfunc)dict_subscript, /*mp_subscript*/
(objobjargproc)dict_ass_sub, /*mp_ass_subscript*/
};

static Py_ssize_t dict_length(PyDictObject *mp)
{
return mp->ma_used;
}

PyDict_Type 中定义的 tp_as_mapping == &dict_as_mapping。从代码中可以看见len(dict)时间复杂度为O(1)。执行 dict[item] 即调用 dict_subscript()。

GetItem

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
// dictobject.c.2172
static PyObject * dict_subscript(PyDictObject *mp, PyObject *key)
{
PyObject *v;
Py_ssize_t ix;
Py_hash_t hash;
PyObject **value_addr;

// 1. 获取/计算 hash 值
if (!PyUnicode_CheckExact(key) ||
(hash = ((PyASCIIObject *) key)->hash) == -1) {
hash = PyObject_Hash(key);
if (hash == -1)
return NULL;
}
// 2. 查找 index ,返回 ix>0 或 ix == -1 / -3
ix = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr, NULL);
if (ix == DKIX_ERROR) // -3
return NULL;

// 3. 根据 ix 结果值,执行 __miss__ 或 返回结果
if (ix == DKIX_EMPTY || *value_addr == NULL) {
if (!PyDict_CheckExact(mp)) {
/* Look up __missing__ method if we're a subclass. */
PyObject *missing, *res;
_Py_IDENTIFIER(__missing__);
missing = _PyObject_LookupSpecial((PyObject *)mp, &PyId___missing__);
if (missing != NULL) {
res = PyObject_CallFunctionObjArgs(missing,
key, NULL);
Py_DECREF(missing);
return res;
}
else if (PyErr_Occurred())
return NULL;
}
_PyErr_SetKeyError(key);
return NULL;
}
v = *value_addr;
Py_INCREF(v);
return v;
}

源码如上,需要注意的的是,返回的是 对象引用。并且,时间复杂度为O(1)。

ix 获取方式 mp->ma_keys->dk_lookup,即调用 PyDictObject 自身的 dk_lookup,在前面的 PyDict_New 中,可以发现 dk->dk_lookup = lookdict_unicode_nodummy。实际上,dk_lookup 存在不止一种。

dk_lookup

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// ictobject.c.224
/* lookdict() is general-purpose, and may return DKIX_ERROR if (and only if) a
comparison raises an exception. */
static Py_ssize_t lookdict(PyDictObject *mp, PyObject *key,
Py_hash_t hash, PyObject ***value_addr,
Py_ssize_t *hashpos);

/* Specialized version for string-only keys */
static Py_ssize_t lookdict_unicode(PyDictObject *mp, PyObject *key,
Py_hash_t hash, PyObject ***value_addr,
Py_ssize_t *hashpos);

/* Faster version of lookdict_unicode when it is known that no <dummy> keys
* will be present. */
static Py_ssize_t
lookdict_unicode_nodummy(PyDictObject *mp, PyObject *key,
Py_hash_t hash, PyObject ***value_addr,
Py_ssize_t *hashpos);

/* Version of lookdict for split tables.
* All split tables and only split tables use this lookup function.
* Split tables only contain unicode keys and no dummy keys,
* so algorithm is the same as lookdict_unicode_nodummy.
*/
static Py_ssize_t lookdict_split(PyDictObject *mp, PyObject *key,
Py_hash_t hash, PyObject ***value_addr,
Py_ssize_t *hashpos);

如上,dk_lookup 有4个非常相似的 查找函数。因为在 Python 中,大量使用 str 作为字典的 key,因此会有2个特定针对 str 对象的 优化版本。

lookdict

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
static Py_ssize_t
lookdict(PyDictObject *mp, PyObject *key,
Py_hash_t hash, PyObject ***value_addr, Py_ssize_t *hashpos)
{
size_t i, mask;
Py_ssize_t ix, freeslot;
int cmp;
PyDictKeysObject *dk;
PyDictKeyEntry *ep0, *ep;
PyObject *startkey;

top:
dk = mp->ma_keys;
mask = DK_MASK(dk);
ep0 = DK_ENTRIES(dk);
i = (size_t)hash & mask;

ix = dk_get_index(dk, i);
if (ix == DKIX_EMPTY) {
if (hashpos != NULL)
*hashpos = i;
*value_addr = NULL;
return DKIX_EMPTY;
}
if (ix == DKIX_DUMMY) {
freeslot = i;
}
else {
ep = &ep0[ix];
assert(ep->me_key != NULL);
if (ep->me_key == key) {
*value_addr = &ep->me_value;
if (hashpos != NULL)
*hashpos = i;
return ix;
}
if (ep->me_hash == hash) {
startkey = ep->me_key;
Py_INCREF(startkey);
cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
Py_DECREF(startkey);
if (cmp < 0) {
*value_addr = NULL;
return DKIX_ERROR;
}
if (dk == mp->ma_keys && ep->me_key == startkey) {
if (cmp > 0) {
*value_addr = &ep->me_value;
if (hashpos != NULL)
*hashpos = i;
return ix;
}
}
else {
/* The dict was mutated, restart */
goto top;
}
}
freeslot = -1;
}

for (size_t perturb = hash;;) {
perturb >>= PERTURB_SHIFT;
i = ((i << 2) + i + perturb + 1) & mask;
ix = dk_get_index(dk, i);
if (ix == DKIX_EMPTY) {
if (hashpos != NULL) {
*hashpos = (freeslot == -1) ? (Py_ssize_t)i : freeslot;
}
*value_addr = NULL;
return ix;
}
if (ix == DKIX_DUMMY) {
if (freeslot == -1)
freeslot = i;
continue;
}
ep = &ep0[ix];
assert(ep->me_key != NULL);
if (ep->me_key == key) {
if (hashpos != NULL) {
*hashpos = i;
}
*value_addr = &ep->me_value;
return ix;
}
if (ep->me_hash == hash) {
startkey = ep->me_key;
Py_INCREF(startkey);
cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
Py_DECREF(startkey);
if (cmp < 0) {
*value_addr = NULL;
return DKIX_ERROR;
}
if (dk == mp->ma_keys && ep->me_key == startkey) {
if (cmp > 0) {
if (hashpos != NULL) {
*hashpos = i;
}
*value_addr = &ep->me_value;
return ix;
}
}
else {
/* The dict was mutated, restart */
goto top;
}
}
}
assert(0); /* NOT REACHED */
return 0;
}

如上,通用的 lookdict 代码很长。大概可以分为 以下几部分

ix = ?
1
2
3
4
5
6
7
8
9
10
11
12
13
#define DK_MASK(dk) (((dk)->dk_size)-1)

static Py_ssize_t
lookdict(PyDictObject *mp, PyObject *key,
Py_hash_t hash, PyObject ***value_addr, Py_ssize_t *hashpos)
{
size_t i, mask;
PyDictKeysObject *dk;

dk = mp->ma_keys;
mask = DK_MASK(dk);
i = (size_t)hash & mask; // hash%dk_size == hash & (dk_size - 1)
}

前面已经知道:

  • hash == PyObject_Hash(key)
  • dk_size >= 8 && IS_POWER_OF_2
  • 注意最后一行,将 hash 值映射到数组上
if ix == DKIX_EMPTY
1
2
3
4
5
6
7
8
9
10
11
12
13
14
static Py_ssize_t
lookdict(PyDictObject *mp, PyObject *key,
Py_hash_t hash, PyObject ***value_addr, Py_ssize_t *hashpos)
{
/* lookup indices. returns DKIX_EMPTY, DKIX_DUMMY, or ix >=0 */
ix = dk_get_index(dk, i);

if (ix == DKIX_EMPTY) {
if (hashpos != NULL)
*hashpos = i;
*value_addr = NULL;
return DKIX_EMPTY;
}
}

如上,当 ix == DKIX_EMPTY 表明,slot 可用,直接返回。需要注意的是,返回之前,把 *value_addr 设为了 NULL,相当于清空了数据。

if ix != DKIX_EMPTY
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
PyDictKeyEntry *ep0, *ep;

// DK_ENTRIES(keys)[index] if index >= 0
ep0 = DK_ENTRIES(dk);

if (ix == DKIX_DUMMY) {
freeslot = i; // 伪删除,me_value==NULL
}
else {
ep = &ep0[ix];
assert(ep->me_key != NULL);
if (ep->me_key == key) { // 引用相同
*value_addr = &ep->me_value;
if (hashpos != NULL)
*hashpos = i;
return ix;
}
if (ep->me_hash == hash) { // hash 相同
startkey = ep->me_key;
Py_INCREF(startkey);
cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
/* cmp = startkey == key ? 1 : 0 or -1 (raise ERROR) */
Py_DECREF(startkey);
if (cmp < 0) {
*value_addr = NULL;
return DKIX_ERROR;
}
if (dk == mp->ma_keys && ep->me_key == startkey) {
if (cmp > 0) { // 值相同
*value_addr = &ep->me_value;
if (hashpos != NULL)
*hashpos = i;
return ix;
}
}
else {
/* The dict was mutated, restart */
goto top;
}
}
freeslot = -1;
}

如上,当 ix == DKIX_DUMMY 时,将 freeslot 设置为该 slot。若后续搜索没有成功找到,那么将返回该 slot。

当 ix>=0 && ix != DKIX_DUMMY 时,获取 Entry 对象,对其值进行判断。前面已经知道,me_hash 是 hash(me_key) 的 缓存。

接着两个 if 判断,是为了优先考虑引用相同,接着再考虑引用不同而值相同。因为 Python 内部存在对象缓冲池,小整数,字符串等具有相同的对象引用,而对于1024等大整数,就必须判断 引用不一样而值一样。

PyObject_RichCompareBool 是传入操作数与操作符,返回数值。

  • 若cmp==1,即 hash,value 都相等,正确返回。
  • 若cmp==0,即 hash相等,值不等,即 hash 冲突。

解决 hash 冲突

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
/*
执行循环的 前提条件:
if ix == DKIX_DUMMY:
freeslot = i
elif ix>=0 and hash冲突:
freeslot = -1;
*/
#define PERTURB_SHIFT 5

for (size_t perturb = hash;;) {
// 探测链算法
perturb >>= PERTURB_SHIFT;
i = ((i << 2) + i + perturb + 1) & mask;
ix = dk_get_index(dk, i);
if (ix == DKIX_EMPTY) {
if (hashpos != NULL) {
*hashpos = (freeslot == -1) ? (Py_ssize_t)i : freeslot;
/* 找到一个 UnusedEnpty,表明搜索失败
1. freeslot 不存在,返回现在的
2. freeslot 已经有一个,返回第一个
*/
}
*value_addr = NULL;
return ix;
}
if (ix == DKIX_DUMMY) { // 探测链 继续寻找
if (freeslot == -1)
freeslot = i;
continue;
}
ep = &ep0[ix];
assert(ep->me_key != NULL);
if (ep->me_key == key) { // 引用相同
if (hashpos != NULL) {
*hashpos = i;
}
*value_addr = &ep->me_value;
return ix;
}
if (ep->me_hash == hash) {
startkey = ep->me_key;
Py_INCREF(startkey);
cmp = PyObject_RichCompareBool(startkey, key, Py_EQ);
Py_DECREF(startkey);
if (cmp < 0) {
*value_addr = NULL;
return DKIX_ERROR;
}
if (dk == mp->ma_keys && ep->me_key == startkey) {
if (cmp > 0) { // 值相同
if (hashpos != NULL) {
*hashpos = i;
}
*value_addr = &ep->me_value;
return ix;
}
}
else {
/* The dict was mutated, restart */
goto top;
}
}
}

如上,整个过程是,当发生冲突时,再次计算获取一个 i 值,最终返回 NUll 或 目标值。

SetItem

dict_ass_sub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
// dictobject.c.2163
static int
dict_ass_sub(PyDictObject *mp, PyObject *v, PyObject *w)
{
if (w == NULL)
return PyDict_DelItem((PyObject *)mp, v);
else
return PyDict_SetItem((PyObject *)mp, v, w);
}

// dictobject.c.1554
int
PyDict_SetItem(PyObject *op, PyObject *key, PyObject *value)
{
PyDictObject *mp;
Py_hash_t hash;
// 类型检查
if (!PyDict_Check(op)) {
PyErr_BadInternalCall();
return -1;
}
assert(key);
assert(value);
mp = (PyDictObject *)op;
// 获取 hash 值
if (!PyUnicode_CheckExact(key) ||
(hash = ((PyASCIIObject *) key)->hash) == -1)
{
hash = PyObject_Hash(key);
if (hash == -1)
return -1;
}

/* insertdict() handles any resizing that might be necessary */
return insertdict(mp, key, hash, value);
}

执行 dict[item] = value 时,调用 dict_ass_sub()。
如上,与 List 类似,SetItem 和 DelItem 都是调用同一个方法。在 PyDict_SetItem 中,仅进行类型检查,计算 hash 值,实际插入调用 insertdict。

insertdict

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
// dictobject.c.1110
static int
insertdict(PyDictObject *mp, PyObject *key, Py_hash_t hash, PyObject *value)
{
PyObject *old_value;
PyObject **value_addr;
PyDictKeyEntry *ep, *ep0;
Py_ssize_t hashpos, ix;

Py_INCREF(key);
Py_INCREF(value);

// split-table 插入 key(not Unicode)
if (mp->ma_values != NULL && !PyUnicode_CheckExact(key)) {
if (insertion_resize(mp) < 0)
goto Fail;
}
// 计算 ix 的值 ix = ?
ix = mp->ma_keys->dk_lookup(mp, key, hash, &value_addr, &hashpos);
if (ix == DKIX_ERROR)
goto Fail;

assert(PyUnicode_CheckExact(key) || mp->ma_keys->dk_lookup == lookdict);
MAINTAIN_TRACKING(mp, key, value);

// split-table 插入 key(different order)
if (_PyDict_HasSplitTable(mp) &&
((ix >= 0 && *value_addr == NULL && mp->ma_used != ix) ||
(ix == DKIX_EMPTY && mp->ma_used != mp->ma_keys->dk_nentries))) {
if (insertion_resize(mp) < 0)
goto Fail;
find_empty_slot(mp, key, hash, &value_addr, &hashpos);
ix = DKIX_EMPTY;
}

// Insert when ix == DKIX_EMPTY
if (ix == DKIX_EMPTY) {
/* Insert into new slot. */
if (mp->ma_keys->dk_usable <= 0) {
/* Need to resize. */
if (insertion_resize(mp) < 0)
goto Fail;
find_empty_slot(mp, key, hash, &value_addr, &hashpos);
}
ep0 = DK_ENTRIES(mp->ma_keys);
ep = &ep0[mp->ma_keys->dk_nentries];
dk_set_index(mp->ma_keys, hashpos, mp->ma_keys->dk_nentries);
ep->me_key = key;
ep->me_hash = hash;
if (mp->ma_values) {
assert (mp->ma_values[mp->ma_keys->dk_nentries] == NULL);
mp->ma_values[mp->ma_keys->dk_nentries] = value;
}
else {
ep->me_value = value;
}
mp->ma_used++;
mp->ma_version_tag = DICT_NEXT_VERSION();
mp->ma_keys->dk_usable--;
mp->ma_keys->dk_nentries++;
assert(mp->ma_keys->dk_usable >= 0);
assert(_PyDict_CheckConsistency(mp));
return 0;
}

assert(value_addr != NULL);

// Insert when ix != DKIX_EMPTY
old_value = *value_addr;
if (old_value != NULL) {
*value_addr = value;
mp->ma_version_tag = DICT_NEXT_VERSION();
assert(_PyDict_CheckConsistency(mp));

Py_DECREF(old_value); /* which **CAN** re-enter (see issue #22653) */
Py_DECREF(key);
return 0;
}

/* pending state */
assert(_PyDict_HasSplitTable(mp));
assert(ix == mp->ma_used);
*value_addr = value;
mp->ma_used++;
mp->ma_version_tag = DICT_NEXT_VERSION();
assert(_PyDict_CheckConsistency(mp));
Py_DECREF(key);
return 0;

Fail:
Py_DECREF(value);
Py_DECREF(key);
return -1;
}

如上,实际插入的对象的函数,代码也很长,需要分块处理。

split-table 插入 key(not Unicode)
1
2
3
4
if (mp->ma_values != NULL && !PyUnicode_CheckExact(key)) {
if (insertion_resize(mp) < 0)
goto Fail;
}

在 前面 PyDictObject 结构体定义中,有提到:

1
2
3
4
A split table:
ma_values != NULL, dk_refcnt >= 1
Values are stored in the ma_values array.
Only string (unicode) keys are allowed.

当,split table 插入的 key 不是 Unicode 时,调用 insertion_resize。

split-table 插入 key(different order)
1
2
3
4
5
6
7
8
9
10
11
/* When insertion order is different from shared key, we can't share
* the key anymore. Convert this instance to combine table.
*/
if (_PyDict_HasSplitTable(mp) &&
((ix >= 0 && *value_addr == NULL && mp->ma_used != ix) ||
(ix == DKIX_EMPTY && mp->ma_used != mp->ma_keys->dk_nentries))) {
if (insertion_resize(mp) < 0)
goto Fail;
find_empty_slot(mp, key, hash, &value_addr, &hashpos);
ix = DKIX_EMPTY;
}

在 前面 PyDictObject 结构体定义中,同时有提到:

1
2
A split table:
All dicts sharing same key must have same insertion order.

如源码中注释所述,当插入 顺序不一致时,将调用 insertion_resize。

Insert when ix == DKIX_EMPTY
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
if (ix == DKIX_EMPTY) {
/* Insert into new slot. */
if (mp->ma_keys->dk_usable <= 0) {
/* Need to resize. */
if (insertion_resize(mp) < 0)
goto Fail;
find_empty_slot(mp, key, hash, &value_addr, &hashpos);
}
// 初始化
ep0 = DK_ENTRIES(mp->ma_keys);
ep = &ep0[mp->ma_keys->dk_nentries];
dk_set_index(mp->ma_keys, hashpos, mp->ma_keys->dk_nentries);
ep->me_key = key;
ep->me_hash = hash;

// 插入不同地方
if (mp->ma_values) {
assert (mp->ma_values[mp->ma_keys->dk_nentries] == NULL);
mp->ma_values[mp->ma_keys->dk_nentries] = value;
}
else {
ep->me_value = value;
}

// 调整属性值
mp->ma_used++;
mp->ma_version_tag = DICT_NEXT_VERSION();
mp->ma_keys->dk_usable--;
mp->ma_keys->dk_nentries++;
assert(mp->ma_keys->dk_usable >= 0);
assert(_PyDict_CheckConsistency(mp));
return 0;
}

如上,ix == DKIX_EMPTY 时,执行插入动作。

  1. if 可用空间 dk_usable <=0,调用 insertion_resize
  2. dk_set_index,初始化 me_key, me_hash
  3. 根据 ma_values,判断 table 类型,插入 不同的地方
  4. 调整 mp 自身属性
Insert when ix != DKIX_EMPTY
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
assert(value_addr != NULL);
old_value = *value_addr;
if (old_value != NULL) {
*value_addr = value;
mp->ma_version_tag = DICT_NEXT_VERSION();
assert(_PyDict_CheckConsistency(mp));

Py_DECREF(old_value); /* which **CAN** re-enter (see issue #22653) */
Py_DECREF(key);
return 0;
}

/* pending state */
assert(_PyDict_HasSplitTable(mp));
assert(ix == mp->ma_used);
*value_addr = value;
mp->ma_used++;
mp->ma_version_tag = DICT_NEXT_VERSION();
assert(_PyDict_CheckConsistency(mp));
Py_DECREF(key);
return 0;

如上,ix != DKIX_EMPTY 即代表修改值。同时,根据 old_value 可以判断出 表的类型。

PyDict_DelItem

前面已经提到 Del 和 Set 有相同的入口 dict_ass_sub。Del 最终实现是靠 PyDict_DelItem()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
// dictobject.c/1621
int PyDict_DelItem(PyObject *op, PyObject *key)
{
Py_hash_t hash;
assert(key);
if (!PyUnicode_CheckExact(key) ||
(hash = ((PyASCIIObject *) key)->hash) == -1) {
hash = PyObject_Hash(key);
if (hash == -1)
return -1;
}

return _PyDict_DelItem_KnownHash(op, key, hash);
}

如上,先验证 hash,在调用 _PyDict_DelItem_KnownHash

_PyDict_DelItem_KnownHash

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
int
_PyDict_DelItem_KnownHash(PyObject *op, PyObject *key, Py_hash_t hash)
{
Py_ssize_t hashpos, ix;
PyDictObject *mp;
PyObject **value_addr;

if (!PyDict_Check(op)) {
PyErr_BadInternalCall();
return -1;
}
assert(key);
assert(hash != -1);
mp = (PyDictObject *)op;
ix = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr, &hashpos);
if (ix == DKIX_ERROR)
return -1;
if (ix == DKIX_EMPTY || *value_addr == NULL) {
_PyErr_SetKeyError(key);
return -1;
}
assert(dk_get_index(mp->ma_keys, hashpos) == ix);

// Split table doesn't allow deletion. Combine it.
if (_PyDict_HasSplitTable(mp)) {
if (dictresize(mp, DK_SIZE(mp->ma_keys))) {
return -1;
}
ix = (mp->ma_keys->dk_lookup)(mp, key, hash, &value_addr, &hashpos);
assert(ix >= 0);
}
return delitem_common(mp, hashpos, ix, value_addr);
}

如上,_PyDict_DelItem_KnownHash 代码很好理解,根据 key hash 找到相应的 enrty。
需要注意的是其中的,Split-table 不允许删除操作,必然要转换为 combined table。

delitem_common

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
static int
delitem_common(PyDictObject *mp, Py_ssize_t hashpos, Py_ssize_t ix,
PyObject **value_addr)
{
PyObject *old_key, *old_value;
PyDictKeyEntry *ep;

old_value = *value_addr;
assert(old_value != NULL);
*value_addr = NULL;
mp->ma_used--;
mp->ma_version_tag = DICT_NEXT_VERSION();
ep = &DK_ENTRIES(mp->ma_keys)[ix];
dk_set_index(mp->ma_keys, hashpos, DKIX_DUMMY);
ENSURE_ALLOWS_DELETIONS(mp);
old_key = ep->me_key;
ep->me_key = NULL;
Py_DECREF(old_key);
Py_DECREF(old_value);

assert(_PyDict_CheckConsistency(mp));
return 0;
}

如上,代码很简单。。就酱。。

insertion_resize

1
2
3
4
5
6
// dictobject.c.1099
static int
insertion_resize(PyDictObject *mp)
{
return dictresize(mp, GROWTH_RATE(mp));
}

前面,已经提到,在 CRUD 时,有几种情况下会调用 insertion_resize

  • dk_usable <=0
  • split-table insert key which is not Unicode.
  • split-table insertion order is different from shared key.

GROWTH_RATE

1
#define GROWTH_RATE(d) (((d)->ma_used*2)+((d)->ma_keys->dk_size>>1))

如上,在调用 dictresize 之前,会先计算一个 GROWTH_RATE。这个东西的作用可以参考链接已失效

1
2
3
4
5
6
7
8
9
10
11
12
如果原有的数据量小于原有大小的1/4,那么它也会小于现有大小的1/4,但新的dict缩小了;
如果原有的数据量大于原有大小的1/4,那么它也会大于现有大小的1/4,但新的dict扩大了。

if used < size/4:
used*4 < used*2 + size/4*2 = used*2 + size/2 = newsize
used < newsize/4
newsize = used*2 + size/2 < size/4*2 + size/2 = size

if used > size/4:
used*4 > used*2 + size/4*2 = used*2 + size/2 = newsize
used > newsize/4
newsize = used*2 + size/2 > size/4*2 + size/2 = size

emmm…好吧,看看就好

dictresize

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
// dictobject.c.1250
/*
Restructure the table by allocating a new table and reinserting all
items again. When entries have been deleted, the new table may
actually be smaller than the old one.
If a table is split (its keys and hashes are shared, its values are not),
then the values are temporarily copied into the table, it is resized as
a combined table, then the me_value slots in the old table are NULLed out.
After resizing a table is always combined,
but can be resplit by make_keys_shared().
*/
static int
dictresize(PyDictObject *mp, Py_ssize_t minsize)
... // 很多很多
/* Allocate a new table. */
// 注:static PyDictKeysObject *new_keys_object(Py_ssize_t size)
mp->ma_keys = new_keys_object(newsize);

/* Main loop */
for (i = 0; i < oldkeys->dk_nentries; i++) {
PyDictKeyEntry *ep = &ep0[i];
if (ep->me_value != NULL) {
insertdict_clean(mp, ep->me_key, ep->me_hash, ep->me_value);
}
}
... // 很多很多
}

/*
Internal routine used by dictresize() to insert an item which is
known to be absent from the dict. This routine also assumes that
the dict contains no deleted entries.
Besides the performance benefit,
using insertdict() in dictresize() is dangerous (SF bug #1456209).
Note that no refcounts are changed by this routine;
if needed, the caller is responsible for incref'ing `key` and `value`.
Neither mp->ma_used nor k->dk_usable are modified by this routine;
the caller must set them correctly
*/
static void
insertdict_clean(PyDictObject *mp, PyObject *key, Py_hash_t hash,
PyObject *value)

emm…代码很长,看注释就够了。

  • dictresize 返回的都是 combined table,跟上文一致
  • combined split 两者是可以相互转换的
  • resize 后,表是可以被 扩容或缩小的
  • insertdict_clean 只干插入的活,不干其他的
  • 只要涉及到 resize ,就涉及到 Malloc memcpy,耗时 耗力