Internals and advanced usage
AggregateBy.jl's most important design goal is to be a lightweight tool that simplifies interactive analysis at the command line. However, it also tries to achieve reasonable performance, and that often means inferring the key and value types of the returned `Dict`. In detail, here is what actually happens for a fictitious `aggregator` (e.g., `count`, `sum`, or `collect`) and "by" function `By(fkey, fval)`:
- If you call `aggregator(By{K,V}(fkey, fval), itr)`, it returns a `Dict{K,V}`. It does not rely on inference.
- If you call `aggregator(By(fkey, fval), itr)`, it will determine whether `itr` has a known eltype (see `Base.IteratorEltype`):
  - if the eltype `T` is known, it infers `K` from `fkey(::T)` and `V` from `aggregator` and `fval(::T)`
  - if the eltype is unknown, it will aggregate to `Dict{Any,Any}` internally, and then try to "tighten" the eltype upon return.
You can use `AggregateBy.UNKNOWN` if you want to tighten selectively, e.g., `By{UNKNOWN,Any}` will tighten the keytype but not the valtype, and so on.
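The two paths described above can be sketched with Base Julia alone. This is an illustration, not AggregateBy.jl's actual implementation: `sketch_keytype`, `sketch_tighten`, and `OpaqueItr` are hypothetical names, and the use of `Base.promote_op` for inference and `typejoin` for tightening is an assumption about how such logic might be written.

```julia
# Hypothetical helpers for illustration; not part of AggregateBy.jl.

# Guess the key type from the iterator's eltype when it is known;
# otherwise fall back to Any.
function sketch_keytype(fkey, itr)
    if Base.IteratorEltype(itr) isa Base.HasEltype
        # Ask inference what fkey returns for an argument of type eltype(itr)
        return Base.promote_op(fkey, eltype(itr))
    end
    return Any
end

# Narrow a Dict{Any,Any} to the closest common supertypes of the
# keys and values it actually contains.
function sketch_tighten(d::Dict{Any,Any})
    isempty(d) && return d
    K = mapreduce(typeof, typejoin, keys(d))
    V = mapreduce(typeof, typejoin, values(d))
    return Dict{K,V}(d)
end

# An iterator that declares its eltype unknown, wrapping any container.
struct OpaqueItr
    container
end
Base.IteratorEltype(::Type{OpaqueItr}) = Base.EltypeUnknown()
Base.iterate(u::OpaqueItr, s...) = iterate(u.container, s...)

known_K = sketch_keytype(isodd, [1, 2, 3])
unknown_K = sketch_keytype(isodd, OpaqueItr([1, 2, 3]))

# Count odd/even items into a loosely typed Dict, then tighten it.
loose = Dict{Any,Any}()
for x in OpaqueItr([1, 2, 3])
    k = isodd(x)
    loose[k] = get(loose, k, 0) + 1
end
tight = sketch_tighten(loose)
```

In this sketch the known-eltype path never needs to look at the data, whereas the unknown-eltype path pays for `Any`-typed storage during aggregation and for a copy into the tightened `Dict` at the end.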
To illustrate these considerations, let's experiment with both `By(fkey)` and `By{K,V}(fkey)` in three different cases:
- where the eltype is known and concrete (e.g., `Vector{Int}`)
- where the eltype is known and abstract (e.g., `Vector{Any}`)
- where the eltype is unknown
For the third case, it will help if we define a custom container type:
struct UnknownEltype
    container
end
Base.IteratorEltype(::Type{UnknownEltype}) = Base.EltypeUnknown()
Base.iterate(u::UnknownEltype) = iterate(u.container)
Base.iterate(u::UnknownEltype, s) = iterate(u.container, s)
Now, let's set up the data we need for benchmarking:
julia> by1 = By{Bool,Int}(isodd)
By{Bool, Int64, typeof(isodd)}(isodd)
julia> by2 = By(isodd)
By{AggregateBy.UNKNOWN, AggregateBy.UNKNOWN, typeof(isodd)}(isodd)
julia> vconcrete = collect(1:11); # Vector{Int}
julia> vabstract = Any[(1:11)...]; # Vector{Any}
julia> vunknown = UnknownEltype(vconcrete); # unknown eltype
On the author's machine,
julia> using BenchmarkTools
julia> @btime sum($by, $v); # supply either by1 or by2, and either vconcrete, vabstract, or vunknown
yields the following results:
Container | By{K,V}(fkey) | By(fkey) |
---|---|---|
vconcrete | 150.153 ns (4 allocations: 432 bytes) | 150.652 ns (4 allocations: 432 bytes) |
vabstract | 343.598 ns (4 allocations: 432 bytes) | 906.775 ns (4 allocations: 512 bytes) |
vunknown | 861.281 ns (26 allocations: 960 bytes) | 1.327 μs (30 allocations: 1.44 KiB) |
In the `vunknown` row, much of the cost in the `By{K,V}(fkey)` case is due to the unknown type of `vunknown.container`; the alternative definition
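struct UnknownEltype2
    container::Vector{Int}   # concretely typed field
end
Base.IteratorEltype(::Type{UnknownEltype2}) = Base.EltypeUnknown()
# iteration forwards to the wrapped vector, as for UnknownEltype
Base.iterate(u::UnknownEltype2) = iterate(u.container)
Base.iterate(u::UnknownEltype2, s) = iterate(u.container, s)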
yields substantially better performance (`302.490 ns (4 allocations: 432 bytes)`). The `By(fkey)` case improves less dramatically (`1.039 μs (8 allocations: 944 bytes)`).
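The difference between the two wrapper definitions can be inspected directly. The sketch below uses hypothetical stand-in types (`UntypedWrapper`, `TypedWrapper`) rather than the definitions above, and only shows what the compiler knows about each field; it does not reproduce the timings.

```julia
# Hypothetical stand-ins for the two wrapper definitions discussed above.
struct UntypedWrapper
    container                # no annotation: the field type is Any
end

struct TypedWrapper
    container::Vector{Int}   # concrete field type
end

untyped_field = fieldtype(UntypedWrapper, :container)
typed_field = fieldtype(TypedWrapper, :container)

# A concrete field type lets the compiler specialize calls such as
# iterate(u.container); an Any field forces dynamic dispatch instead.
untyped_is_concrete = isconcretetype(untyped_field)
typed_is_concrete = isconcretetype(typed_field)
```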