#1 Correctness of the implementation

Opened by Kerollmops on January 22, 2021
Kerollmops on January 22, 2021

I found out that we can talk about changes/patches by using a discussion, so here is a code change to check that the writing algorithm is right. Spoiler: it is wrong, at least for the lookup_ord algorithm.

Kerollmops added a change on January 22, 2021
TCMY4ZZ7MQRK6RVE6JYO6NFAGDQVWUHSZ42GXQBKQOHSXUDCCU6QC
main
pmeunier on January 22, 2021

Actually, using the value changes the picture completely, because the compiler can no longer optimise the calls away. The results are now (for 1 million insertions; times are insertion then lookup):

  • linked-list: 1.114093703s 52.911632629s
  • ordered-vec: 9.82672031s 3.473954564s
  • skip-list: 3.26403791s 2.440610396s
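
For reference, the ordered-vec variant presumably boils down to something like the following sketch: a sorted Vec<(u64, u64)> with binary-search lookups and shifting inserts (a guess at the shape of the benchmark, not the actual code).

    // A guess at the ordered-vec operations being benchmarked, not the actual code:
    // a sorted Vec<(u64, u64)> with binary-search lookups and shifting inserts.
    fn ordered_vec_insert(v: &mut Vec<(u64, u64)>, key: u64, value: u64) {
        match v.binary_search_by_key(&key, |&(k, _)| k) {
            Ok(i) => v[i].1 = value,             // key already present: overwrite
            Err(i) => v.insert(i, (key, value)), // shift the tail to make room
        }
    }

    fn ordered_vec_lookup(v: &[(u64, u64)], key: u64) -> Option<u64> {
        v.binary_search_by_key(&key, |&(k, _)| k).ok().map(|i| v[i].1)
    }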
Kerollmops on January 22, 2021

Wow, and have you found the bug in the ordered-vec implementation? So yeah it seems that it’s better to use skip-lists when data size is variable/unknown.

pmeunier on January 22, 2021

Yes, I did fix it. I don’t think that changed anything in the timings, because the number of operations is the same, but using the values prevents the compiler from optimising the lookups away.

Why do you say “when data size is variable/unknown”? Here we’re doing u64/u64, and even hardcoding the size everywhere, and skiplists seem much better.

I just noticed that my benchmark was inserting into the skiplist in a very favourable order. Reversing the insertion order makes it even faster, so I ran a few extra benchmarks:

Using a good RNG (the current implementation in Pijul), I get:

skip-list: 10.708713313s 7.700753636s

I have no idea why this makes lookups so slow. When using a counter instead of an RNG, and inserting the elements in a random order, I get:

skip-list: 5.505557367s 4.336896327s

And now I’m really confused… Insertions in skiplists seem to always be much faster, while lookups are somewhat random.

pmeunier on January 22, 2021

Running the experiment again, this time reshuffling the insertion order every time, I get closer to the average performance. The results (insertion then lookup) are:

  • linked-list: 19.304200185s 34.911236107s
  • ordered-vec: 15.841584038s 2.273466672s
  • skip-list: 7.489056262s 5.168968129s

So, skiplists are much better for insertions on average, but might sometimes be slightly unbalanced, leading to lookups about 2.2 times slower. This cannot be seen when the keys are inserted in order, because then the skiplist becomes a perfectly balanced tree, and lookups are exactly equivalent to a binary search. I’m not sure why the plain binary search is slower; it might be related to the fact that the indices looked up in the skiplist have already been written.
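
For what it’s worth, one counter-based level scheme that produces exactly this perfectly balanced shape when keys arrive in order derives the level from the trailing zero bits of the counter. A minimal sketch of that idea (not necessarily the scheme used here):

    // Illustration only: a deterministic skiplist level function driven by a counter.
    // Element i gets level trailing_zeros(i + 1) + 1, so a node of level >= l shows
    // up every 2^(l-1) elements and, with in-order insertion, the skiplist has the
    // shape of a perfectly balanced binary search tree.
    const MAX_LEVEL: u32 = 32;

    fn level(counter: u64) -> u32 {
        ((counter + 1).trailing_zeros() + 1).min(MAX_LEVEL)
    }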

Kerollmops on January 22, 2021

I am not sure I understand what you mean by “using the value changes”, were the values not written on the page yet?

Have you thought about putting all the keys next to each other in the ordered-vec version? Even without using SIMD instructions yet, you would benefit from keeping the CPU cache clean by not interleaving keys with data.
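
As an illustration of that layout (hypothetical types, not the actual page format):

    // Hypothetical page layouts, for illustration only.

    // Interleaved (array of structs): every comparison during the search also
    // drags neighbouring values into the cache line.
    struct InterleavedPage {
        entries: [(u64, u64); 255],
    }

    // Keys packed together (struct of arrays): the search touches only `keys`,
    // and the matching value is read once at the end.
    struct PackedPage {
        keys: [u64; 255],
        values: [u64; 255],
    }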

Those last benchmarks are really interesting. It seems that a more complex version with amortised costs (like you said), where you use a skip-list for insertion but transform it into an ordered-vec before committing, would be great. Would the final sort double the total writing time? It doesn’t seem quite possible without cloning the page, though, so forget what I said.

I like the fact that the ordered-vec is 2.27x faster than the skip-list for reading (random?) keys. I am really more interested in reading performance. Write performance is important, but if you need the best speed in that domain, an LSM tree is far better suited (I’m unsure whether this page format could be used in an LSM).

pmeunier on January 22, 2021

I am not sure I understand what you mean by “using the value changes”, were the values not written on the page yet?

I meant (using the value) (changes) (the picture). When you call a function with only immutable parameters but don’t use the result, the compiler may skip the call entirely. This is annoying for benchmarks; another place where it’s annoying is security, where you have to make sure that some operations are actually performed.
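
A minimal sketch of the problem, using std::hint::black_box (or the equivalent helper from a benchmarking crate) as the opaque “use” of the value:

    use std::hint::black_box;

    fn lookup(v: &[(u64, u64)], key: u64) -> Option<u64> {
        v.binary_search_by_key(&key, |&(k, _)| k).ok().map(|i| v[i].1)
    }

    fn bench_lookups(v: &[(u64, u64)]) {
        for key in 0..1_000_000u64 {
            // If the result were simply discarded, the compiler could drop the
            // whole call; black_box forces the value to count as used.
            black_box(lookup(v, key));
        }
    }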

Have you thought about putting all keys next to each other in the ordered-vec version? Even without using SIMD instructions yet, you would have advantages by keeping the CPU cache clean by not interleaving keys with data.

My L2 cache is 256k, and my L1 cache is 32k. The stack is 4k (the array) + at most two small stack frames, so I would bet that the array is in the cache all the time.

I like the fact that the ordered-vec is 2.27x faster than the skip-list for reading (random?) keys. I am really more interested in reading performance.

If not all your entries are of the same size, this isn’t really doable, though. One thing to note about skiplists is that they get perfectly balanced every time you copy them, which makes them even faster than ordered vecs (as shown in one of my previous messages). Sanakirja copies them a lot; in fact, every transaction does that. When you put or delete, many pages get rebalanced, split or merged, which again perfectly rebalances everything.

Kerollmops on January 22, 2021

My L2 cache is 256k, and my L1 cache is 32k. The stack is 4k (the array) + at most two small stack frames, so I would bet that the array is in the cache all the time.

Yeah, it makes total sense indeed. I thought cache sizes were smaller than that, apologies. The only gain you would get would be at the register level, and therefore only with SIMD instructions. If you want to try, I advise you to use packed_simd, if only to be sure that the compiler emits SIMD instructions: with Rust types like u64x4, an eq operation gives an m64x4 mask (or an m64x8, depending on your CPU registers), which you can then convert into an array.

I am not even sure that a SIMD based binary search would make any sense.
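
For what it’s worth, a rough sketch of the eq-mask scan described above, assuming the packed_simd crate and its u64x4/m64x4 types (a linear scan over a small key array, four keys at a time, rather than a binary search):

    use packed_simd::u64x4;

    // Sketch only: find `needle` in a small sorted key array, comparing four keys
    // at a time. Assumes keys.len() is a multiple of 4 to keep the example short.
    fn simd_find(keys: &[u64], needle: u64) -> Option<usize> {
        let splat = u64x4::splat(needle);
        for (chunk_idx, chunk) in keys.chunks_exact(4).enumerate() {
            let mask = u64x4::from_slice_unaligned(chunk).eq(splat); // m64x4 mask
            if mask.any() {
                for lane in 0..4 {
                    if mask.extract(lane) {
                        return Some(chunk_idx * 4 + lane);
                    }
                }
            }
        }
        None
    }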

pmeunier on January 22, 2021

I’ve also calculated a few things for B+ trees vs B trees, since I haven’t found benchmarks anywhere. B+ trees are significantly harder to implement when we have multiple values for each key, so I wanted to make sure they were worth it.

In the most favourable case, we’re storing lots of entries on each page, so let’s consider the best case for B+ trees, which is Db<u64, u64>: using linked lists/sorted vecs, we can store 255 entries on a B+ page, and 170 on a B page.

B+ trees store some of their entries twice. If we can store a entries on a page, the number of duplicated entries with N entries in total is N/a + N/a^2 + N/a^3 + … + N/a^d, where d is the depth (meaning N/a^d = a if the root page is full). Summing the geometric series, this is at most N/(a - 1), so the total number of stored entries is at most N/(1 - 1/a), which in our case is N/(1 - 1/255), approximately N * 1.004. Therefore, I will ignore this in the rest of the calculation.

Now, the difference in arity is significant: B trees will be log 255 / log 170 ≈ 1.08 times deeper than B+ trees, i.e. 8% deeper. Because the depth is an integer, you would start to see the first difference with a depth of 16 (because 15 * 1.08 ≈ 16.2 and 16 * 1.08 ≈ 17.3), which means that you have 255^16 entries in your database, i.e. 319626579315078487616775634918212890625 ≈ 3.19e38.

This is theoretical of course, but it shows how much of an improvement we can expect in the most favourable case (i.e. tiny keys and values). In the case of multi-values, this might very well be offset by the increased bookkeeping, even for very large pages.
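
A quick way to check these constants (not part of the original discussion):

    fn main() {
        let a = 255.0_f64;
        // Total entries stored by a B+ tree (leaves plus duplicated internal keys),
        // as a multiple of N: 1 + 1/a + 1/a^2 + … = 1 / (1 - 1/a).
        println!("duplication factor: {:.4}", 1.0 / (1.0 - 1.0 / a)); // ≈ 1.0039
        // Depth ratio between B trees (arity 170) and B+ trees (arity 255).
        println!("depth ratio: {:.4}", 255f64.ln() / 170f64.ln()); // ≈ 1.0789
        // Entries in a full B+ tree of depth 16.
        println!("255^16 = {:e}", 255f64.powi(16)); // ≈ 3.196e38
    }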

pmeunier on January 22, 2021

No, wait, this is actually an off-by-one error, since 171 entries already require depth 2 for B trees, but only depth 1 for B+ trees. So that bound is for the next improvement after this one.

I believe the main reason B+ trees are so popular is their chained leaves, which make iteration faster.

pmeunier on January 22, 2021

Just found http://carlosproal.com/ir/papers/p121-comer.pdf which confirms this. One easy optimisation over B trees would be to use B* trees, and avoid splitting pages unless their siblings are full too. This isn’t hard to do, and packs the trees way more than B+. Moreover, using the same trick as B* trees for B+ trees would be quite complicated, since you would need to update entire subtrees when balancing internal nodes.
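
A tiny sketch of that split policy, with a hypothetical Page type rather than Sanakirja’s actual page format:

    // Hypothetical page type for this sketch: a sorted vector with a fixed capacity.
    struct Page {
        entries: Vec<(u64, u64)>,
        cap: usize,
    }

    impl Page {
        fn is_full(&self) -> bool {
            self.entries.len() >= self.cap
        }
    }

    // B*-style decision: an overflowing page first shifts an entry into a sibling
    // that still has room, and only splits when the siblings are full too.
    fn sibling_with_room(siblings: &mut [Page]) -> Option<&mut Page> {
        siblings.iter_mut().find(|s| !s.is_full())
    }

    fn must_split(page: &Page, siblings: &[Page]) -> bool {
        page.is_full() && siblings.iter().all(Page::is_full)
    }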

pmeunier on January 22, 2021

packs the trees way more than B+

Wait, in our example, this would bring benefits similar to B+ trees in terms of arity, i.e. not much. In our specific case of implementing these structures on-disk, this could even duplicate more pages, consuming more disk space.