Author: Dmitry Lomov
NestedSets (an implementation used in Skylark
set data type) is an essential
data structure for passing transitive cumulative data during Bazel's analysis.
The reason for that is its memory efficiency: a union of two sets carries a
constant memory overhead (compare that to lists, for example, where a union of
two lists has an O(length of original lists) memory overhead).
When Bazel builds a dependency graph, all data returned by providers coming from rule implementation stays in memory roughly throughout the lifetime of a Bazel server. If providers use lists for their cumulative data, the total amount of memory consumed by them will grow as O(N^2) where N is the number of nodes in the build graph. Sets are the only data structure available in Skylark that can reduce that amount to O(N).
However, sets in Skylark suffer from several deficiencies, which preclude their usage in providers, which leads in turn to performance issues.
This document discusses the reasons for these deficiencies and suggests several ways to resolve them.
Skylark sets are, in a certain sense, typed. Heterogeneous Skylark sets (e.g. ints and strings in the same set) are not allowed. Skylark set carries its contentType (the type of its elements) with it.
First of all, all elements of Skylark set must be immutable. Sets of sets are not allowed either.
Second, when a union of skylark sets is computed, their contentTypes are intersected to find a type that contains elements of both sets. Union operation fails if that intersection is empty.
Type intersection disallows heterogeneous sets since intersection of two primitive types is empty unless it is one and the same type. However, all tuples in Skylark have the same type, TUPLE, therefore sets of heterogeneous tuples, and even tuples of different length, are allowed. Also, since all tuples share the same type, Skylark set has no information about components of its constituent tuples.
All values returned as providers from rule and aspect implementations are required to be safe. Safe values are defined as follows (note that the notion is defined on values, not on types):
Note that in this entire definition, only the definition of safe sets involves any types. For all other composite values, their constituents are examined directly, but in case of sets, this is too expensive (because sets supposedly hold transitive cumulative information), so their content type is examined instead.
This poses a problem if we want to allow safe sets of tuples, as set of tuples forget what are the types of constituents of those tuples.
As discussed in the "Motivation" section, we need to make sure that Skylark rules and aspects can use sets to pass information during the build analysis; that is we need to allow two things:
Allowing sets to contain structs will require equality semantics on structs to change (from reference to structural equality). We deem the risk of this change very low; even if structs are compared for equality now in user code, that is likely a bug, and since structs are immutable, the observed behavior should not change.
Structs, just like tuples, can be (deeply) mutable and immutable:
l = [27, 42, 30] # This list is mutable (inside a function definition) t = (l, 42) # This tuple is mutable, since l can be modified s = struct(field = l) # This struct is mutable, since l can be modified
Only immutable structs will be allowed as elements of sets, similarly to how tuples are handled today.
To allow sets of structs inside providers, we need to reconcile value safety check with set typing (Recall that sets forget what constituents their elements have, so fast safety check is impossible).
We can record more precise element types in sets (not just STRUCT, but the entire list of fields of those structs). We will need to define what does struct intersection mean (for example, do we allow sets of structs with disjoint collections of fields?). If we want to record types for sets of tuples, we will need to have a very permissive type system to preserve current behavior. For any restriction we introduce, we need a careful rationale.
If we implement "Declared providers" proposal and provide more extesive type information on top of that, sets can get typing information from there.
Looking at all the issues in this document, it appears that safe value check causes more trouble than the value it brings. The only substantive requirement for provider values is that they are immutable. Another motivation for a safe value check was serialization, but if we are serious about serializing configured targets graph, we will need to learn to serialize arbitrary Skylark values anyway.