Cuckoo Filter: Design and Implementation

For massive data processing services, we usually need an index data structure to help queries, quickly determine if the data record exists, and this data structure is typically also called filters. Consider such a scene, you need to enter the URL on your browser when you go online. At this time, the browser needs to determine if this is a malicious website, which will filter the thousands of URL indexes of local cache, if not, It is released, if (possibly) exists, initiate a verification request to the remote server and give back to the client to give a warning.

The stored storage is divided into order and disorder, the former uses an associated container, such as B tree, the latter uses a hash algorithm. These two types of algorithms have their own advantages and disadvantages, for example, the associated container time complexity stabilizes O (logn), and supports the range query; for example, the query of the hash algorithm, the increase is more fast o (1), but this is ideal In the case of the situation, when the collision is serious, the time complexity of the hash algorithm will be degraded to O (n). Therefore, choosing a good hash algorithm is very important.

The next very popular hash index structure is Bloom Filter, which is similar to the Hashset like Bitmap, so the spatial utilization is high. Its unique place is that it uses multiple hash functions to avoid haveh collision, as shown (source wikipedia), the bit array is initialized to all 0, when X, X is mapped to 3 by 3 hash functions Different Bit bits 1, when query X, only the BIT bit mapped by these three functions is 1 to explain that x may exist, but at least one 0 indicates that x is definitely not existed.

However, this bitmap mode of Bloom Filter brought two questions: one is a false Positives, which can provide “certain no existence” when query, but only provide “possible existence” because there are other elements Map to the same bit bit, causing this location 1, then an unsatisfactory element may be falsely positively; the other is a sluggery (false nagatives), the same reason, if an element is deleted, resulting in the mapping The Bit bit is set 0, then the element that originally exists will be missing. Because the latter problem is much more, the Bloom Filter must ensure “definitely no” to tolerate “probably yes” and does not allow deletion of elements.

With regard to the problem of elemental deletion, a modified scheme is to introduce the Bloom Filter, but in this way, the original BIT space is to expand into a count value, and the spatial efficiency is reduced.

Cuckoo Hashing

In order to solve this problem, this article introduces a new hash algorithm – Cuckoo Filter, which can ensure the inevitability of the element, but also delete any elements without violating the previous mention, only a micro space than Bitmap effectiveness. First, this algorithm’s source is a CMU paper, and the author has made a simple implementation according to its C language (Github), attached to the correctness test of a text data for import.

Next, I will explain the implementation of the hash algorithm in conjunction with my sample code. Let’s take a look at Cuckoo Hashing, its hash function is paired (specific implementation can be designed according to demand), each element is two, mapped to two locations, one is the location of the record The other is a standby location. This spare position is used in processing collision. This is to say that Cuckoo’s noun’s allusions, the Chinese name is a grana bird, this bird has a kind of ¼´»« ¼´ ÓÖ ¼´, it is not willing to build a nest, but The egg is in another bird’s nest, and its young bird will be born more than other birds. The cracked mother has a cruel action. The young bird will desperately squeeze out the nest of the unbornful other birds, in the future Enjoy the “parent” food. With biological practice, cuckoo hashing handled the collision method, which is to kick the element of the original position, but the element being kicked out is lucky than the bird egg, because it also has a spare position can be placed, if There are people in the spare position, kick it, so reciprocated. Until the number of times reached a maximum, the hash table was full, and the REHASH operation was performed. As shown below (Source):

We can’t help but ask how much the space utilization before the hash collision? Unfortunately tells you that there is no difference between a one-dimensional number of groups with other hash functions, it is 50%. But if it is two-dimensional?

An improved hash table is shown in the following figure, there is a 4-way slot (slot). When the hash function is mapped to the same bucket, it does not have an element to be kicked before the other three Slot is not filled, which is greatly buffered the chance of collision. The author’s own simple implementation, using a two-dimensional hash table (4-way SLOT) to account for approximately 80% (CMU paper data is said to be more than 90%, should be expanded).

Cuckoo Filter Design and Implement Cuckoo Hashing The principle is over, let’s demonstrate a cuckoo filter application you implement yourself, easy to use, less than 500 line C code. The application scenario is like this: Suppose there is a piece of text data, we import it into a virtual flash via CUCKOO FILTER, and export it to another text file. Flash stored unit page is a log_entry, which contains a pair of key / value, value is the text data, and Key is the SHA1 value of this size data (the sector SHA1 can be generated by data source, no need to store to flash, But here is mainly designed to test, there is no derivation of relationships between KEY and VALUE).

#define sector_size (1

By the way, the DAT_LEN settings, before we design a virtual flash (available in Malloc), because the Flash unit is read and written by page size sector_size, here each log_entry is just a page size, of course, can be adjusted according to the actual situation.

The above is the Flash storage structure, so that Slot in the hash table has three members Tag, Status, and Offset, respectively, the hash value, status value, and the offset position in Flash. STATUS has three enumeration values: availible, occupied, deleted, indicating that this slot is idle, and it is still deleted. As for Tag, according to the reason, there should be two hash values, corresponding to two hash functions, but one of them has been in the position of Bucket, so we only need to save another alternate bucket, so that it is kicked, Just use this tag to find another place where it is safe.

enum {AVAILIBLE, OCCUPIED, DELETED,};. / * The in-memory hash bucket cache is to filter keys (which is assumed SHA1) via * cuckoo hashing function and map keys to log entries stored on flash * / struct hash_slot_cache {uint32_t Tag: 30; / * summary of key * / uint32_t status: 2; / * fsm * / uint32_t offset; / * offset on flash memory * /};

Is it a little big below? It doesn’t matter, you can also adjust the data type size according to the situation, such as uint16_t, just for the test correctness.

As for the creation of the hash table and the creation of BUCKET and SLOT, see the initialization code. BUCKETS is a secondary pointer, each bucket pointing to 4 Slot size cache, namely 4-way slot, then bucket_num is 1/4 of Slot_Num. Here we deliberately tune slot_num, for testing the occurrence of REHASH.

#define ASSOC_WAY (4) / * 4-way association * / struct hash_table {struct hash_slot_cache ** buckets; struct hash_slot_cache * slots; uint32_t slot_num; uint32_t bucket_num;}; int cuckoo_filter_init (size_t size) {… / * Allocate hash slots * / hash_table.slot_num nvrom_size / SECTOR_SIZE; / * Make rehashing happen * / hash_table.slot_num / 4; hash_table.slots calloc (hash_table.slot_num, sizeof (struct hash_slot_cache)); if (hash_table.slots NULL) {return -1 ;} / * Allocate hash buckets associated with slots * / hash_table.bucket_num hash_table.slot_num / ASSOC_WAY; hash_table.buckets malloc (hash_table.bucket_num * sizeof (struct hash_slot_cache *)); if (hash_table.buckets NULL) {free (hash_table. Slots); return -1;} for (i 0; i

#define cuckoo_hash_lsb (key, count) ((size_t *) (key)) [0] & (count – 1)) # define cuckoo_hash_msb (key, count) ((SIZE_T *) (KEY)) [1] & (count – 1)))

Finally, I want to explain the three most important operations of Cuckoo Filter – query, inserting and deleting. The query operation is simple. We carry back the parameters of the Parameters to Tag [0] and tag [1], and first locate the BUCKET first with tag [0], from 4 Slot to compare Tag [1]. Only after the TAG is more than the tag, because of the part of Key, we go to verify the complete key from Flash and return the data in the flash value READDR output in Flash. Correspondingly, if bucket [tag [0]] is not in the middle, we go to Bucket [Tag [1]] to compare (code slightly), if it is more than not, you can affirm this key does not exist. . The advantage of this design is to reduce unnecessary Flash read operations, each comparative is the TAG in memory without the full key.

static int cuckoo_hash_get (struct hash_table * table, uint8_t * key, uint8_t ** read_addr) {int i, j; uint8_t * addr; uint32_t tag [2], offset; struct hash_slot_cache * slot; tag [0] cuckoo_hash_lsb (key, table -> bucket_num); tag [1] cuckoo_hash_msb (key, table-> bucket_num); / * filter the key and verify if it exists * / slot table- & gt; buckets [tag [0]]; for (i 0; I bucket_num) slot [i] .tag) {if (slot [i] .status occupied) {offset slot [i] .offset; addr key_verify (key, offset); if (addr! null) {if (read_addr! null ) {* read_addr addr;} Break;}} else if (slot [i] .status deleted) {return deled;}} …} Next, the simple deletion will first, which is simple because Delete will correspond Slot’s status value settings, in fact, there is no dry, that is, it will not really remove data in Flash. Why? Very simple, there is no need. There is also a reason that the Flash’s write operation needs to erase the entire page, which will be able to go, so many Flash supports random read, but must be written sequentially.

static void cuckoo_hash_delete (struct hash_table * table, uint8_t * key) {uint32_t i, j, tag [2]; struct hash_slot_cache * slot; tag [0] cuckoo_hash_lsb (key, table-> bucket_num); tag [1] cuckoo_hash_msb (key , Table-> Bucket_Num); slot table-> buckets [tag [0]]; for (i 0; i bucket_num) slot [i] .tag) {slot [i] .status deleded; return;} …}

Understand the literacy features of Flash, you know that the insert operation is to be an APPEND. However, we don’t discuss excessive flash details, the insert logic of the hash table level is similar to the query, I will not post the code. What you have to post here is how to judge and process the collision, in fact, there is no mystery here, just save the temporary variable with Old_Tag and Old_offset, so that an element can be used to find the ready-to-use place. But there will be a judgment here. Every time I kicked people, it will be a full fullness of the hash table, it is necessary to rehash when Alt_Cnt is greater than 512.

static int cuckoo_hash_collide (struct hash_table * table, uint32_t * tag, uint32_t * p_offset) {int i, j, k, alt_cnt; uint32_t old_tag [2], offset, old_offset; struct hash_slot_cache * slot; / * Kick out the old bucket and Move it to the alternative bucket. * / offset * p_offset; slot table-> buckets [tag [0]]; OLD_TAG [0] tag [0]; OLD_TAG [1] slot [0] .tag; old_offset slot [0] .offset; slot [0] .tag tag [1]; slot [0] .offset offset; i 0 ^ 1; k 0; alt_cnt 0; kick_out: slot table-> buckets [old_tag ??[i]]; for (J 0; j 512) {if (k assoc_way – 1) {/ * hash table is almost full and needs to be resized * / return 1;} else {k ++;}} uint32_t TMP_TAG SLOT [K] .TAG; UINT32_T TMP_OFFSET SLOT [K] .offset; Slot [K] .tag old_tag ??[i ^ 1]; Slot [K] .offset Old_offset; OLD_TAG [I ^ 1] TMP_TAG; OLD_OFFSET TMP_OFFSET; I ^ 1; goto kick_out;} return 0;} Rehash logic is also very simple, It is nothing more than to re-realloc in the hash table, and the space is doubled, and then re-inserts the key in the flash to the new hash table. There is a trap here to pay attention, don’t have the same key mixed! Although cuckoo hashing is unlike an O (N), it has two hash values, but each element has two hash values, and each calculated hash value varies with the size of the hash table Rehash, the same Key did not immediately detect conflicts, but when the same key reached a certain scale, the nightmare began. Due to the insert operation in Rehash, once the collision is triggered, it triggers the rehash, which is a process of rehabilitation of REHASH. Since the old memory is not released, the new memory is constantly redistributing, and the entire program is like a DOS attack. Therefore, each time the operation must be judged whether Key has existed, and the insertion of the insertion of the REHASH will prevent such situations. The author is unfortunately in the test. It is aware of the reason for a long time, and the students who engage in IT must prevent the minister ~

static void cuckoo_rehash (struct hash_table * table) {… uint8_t * read_addr nvrom_base_addr; uint32_t entries log_entries; while (entries–) {uint8_t key [20]; uint32_t offset read_addr – nvrom_base_addr; for (i 0; i & lt; 20 ; i ++) {key [i] flash_read (read_addr); read_addr ++;} / * Duplicated keys in hash table which can cause eternal * hashing collision Be careful of that * / assert (cuckoo_hash_put (table, key, & offset!!!)) ; if (cuckoo_hash_get (& old_table, keys, null) deleted) {cuckoo_hash_delete (Table, Key);}} …} What is the logic of this code? Let me help you find a big file unqlite.c test, this is an embedded database source code, a total of 59959 line code. As a file that needs to be imported, compile our Cuckoo Filter, then execute:

./cuckoo_db unqlite.c output.c

You will find that generate Output.c is just 59959 line code, one point is not bad, probably yes finally turned into definitely yes. It can also be seen at the same time, Cuckoo Filter is really fast! If you want to see the entire process of Hashing, you can open the debug macro in the ReadME. Finally, welcome to this little or submit PR!

Reference

Cuckoo Filter papers and PPT: Cuckoo Filter: Practically Better Than Bloom