fd05037e1a changed the allocation batch
size from 256 to 128 under the assumption that an indexEntry is 60 bytes
on amd64, but it's 64: structs are padded out to a multiple of 8 for
alignment reasons. That means we'd waste no space in malloc even without
the batch allocation, at least on 64-bit machines. While that strategy
cuts the overallocation down dramatically for many small indexes, it also
seems to slow allocation down (Go 1.18, Linux, amd64, -benchtime=2s):
name old time/op new time/op delta
DecodeIndex-8 4.67s ± 5% 4.60s ± 1% ~ (p=0.953 n=10+5)
DecodeIndexParallel-8 4.67s ± 3% 4.60s ± 1% ~ (p=0.953 n=10+5)
IndexHasUnknown-8 37.8ns ± 8% 36.5ns ±14% ~ (p=0.841 n=5+5)
IndexHasKnown-8 38.5ns ±12% 37.7ns ±10% ~ (p=0.968 n=5+5)
IndexAlloc-8 615ms ±18% 607ms ± 1% ~ (p=1.000 n=10+5)
IndexAllocParallel-8 245ms ±11% 285ms ± 6% +16.40% (p=0.001 n=10+5)
MasterIndexAlloc-8 286ms ± 9% 275ms ± 2% ~ (p=1.000 n=10+5)
LoadIndex/v1-8 27.0ms ± 4% 26.8ms ± 1% ~ (p=0.690 n=5+5)
LoadIndex/v2-8 22.4ms ± 1% 22.8ms ± 2% +1.48% (p=0.016 n=5+5)
name old alloc/op new alloc/op delta
IndexAlloc-8 446MB ± 0% 446MB ± 0% -0.00% (p=0.000 n=8+4)
IndexAllocParallel-8 446MB ± 0% 446MB ± 0% -0.00% (p=0.008 n=8+5)
MasterIndexAlloc-8 213MB ± 0% 159MB ± 0% -25.47% (p=0.000 n=10+5)
name old allocs/op new allocs/op delta
IndexAlloc-8 913k ± 0% 2632k ± 0% +188.19% (p=0.008 n=5+5)
IndexAllocParallel-8 913k ± 0% 2632k ± 0% +188.21% (p=0.008 n=5+5)
MasterIndexAlloc-8 318k ± 0% 1172k ± 0% +267.86% (p=0.008 n=5+5)
Instead, this patch sets a batch size of 4, which means no space is
wasted by malloc on 64-bit and very little on 32-bit. It still gets very
close to the savings from not allocating in batches, without requiring
special code for bits.UintSize==64. Benchmark results, again for
Linux/amd64:
name old time/op new time/op delta
DecodeIndex-8 4.67s ± 5% 4.83s ± 9% ~ (p=0.315 n=10+10)
DecodeIndexParallel-8 4.67s ± 3% 4.68s ± 4% ~ (p=0.315 n=10+10)
IndexHasUnknown-8 37.8ns ± 8% 44.5ns ±19% ~ (p=0.095 n=5+5)
IndexHasKnown-8 38.5ns ±12% 36.9ns ± 8% ~ (p=0.690 n=5+5)
IndexAlloc-8 615ms ±18% 628ms ±18% ~ (p=0.218 n=10+10)
IndexAllocParallel-8 245ms ±11% 262ms ± 9% +7.02% (p=0.043 n=10+10)
MasterIndexAlloc-8 286ms ± 9% 287ms ±13% ~ (p=1.000 n=10+10)
LoadIndex/v1-8 27.0ms ± 4% 26.8ms ± 0% ~ (p=1.000 n=5+5)
LoadIndex/v2-8 22.4ms ± 1% 22.5ms ± 0% ~ (p=0.056 n=5+5)
name old alloc/op new alloc/op delta
IndexAlloc-8 446MB ± 0% 446MB ± 0% ~ (p=1.000 n=8+10)
IndexAllocParallel-8 446MB ± 0% 446MB ± 0% -0.00% (p=0.000 n=8+8)
MasterIndexAlloc-8 213MB ± 0% 160MB ± 0% -25.02% (p=0.000 n=10+9)
name old allocs/op new allocs/op delta
IndexAlloc-8 913k ± 0% 1333k ± 0% +45.94% (p=0.000 n=8+10)
IndexAllocParallel-8 913k ± 0% 1333k ± 0% +45.94% (p=0.000 n=8+8)
MasterIndexAlloc-8 318k ± 0% 525k ± 0% +64.99% (p=0.000 n=10+10)
The allocation method indexmap.newEntry has also been rewritten in a
form that is a few instructions shorter.