Recently I read the code of Chunk
and felt a little bit confused. I found the code look messy since there are some legacy problems existing for 3 years. So it may need refactoring.
Shortly speacking, I think there are too many "size"s: capacity
, requiredRows
and maxChunkSize
.
-
capacity
andmaxChunkSize
were introduced in chunk: support capacity grow #7473 -
requiredRows
was introduced in Control the number of rows in chunks returned by SelectResult #9293
They both left some TODOs and FIXMEs, but there weren’t any follow-ups and the code hasn’t changed.
type Chunk struct {
...
// capacity indicates the max number of rows this chunk can hold.
// TODO: replace all usages of capacity to requiredRows and remove this field
capacity int
// requiredRows indicates how many rows the parent executor want.
requiredRows int
}
// NewChunkWithCapacity creates a new chunk with field types and capacity.
func NewChunkWithCapacity(fields []*types.FieldType, cap int) *Chunk {
return New(fields, cap, cap) //FIXME: in following PR.
}
// New creates a new chunk.
// cap: the limit for the max number of rows.
// maxChunkSize: the max limit for the number of rows.
func New(fields []*types.FieldType, cap, maxChunkSize int) *Chunk {
chk := &Chunk{
columns: make([]*Column, 0, len(fields)),
capacity: mathutil.Min(cap, maxChunkSize),
// set the default value of requiredRows to maxChunkSize to let chk.IsFull() behave
// like how we judge whether a chunk is full now, then the statement
// "chk.NumRows() < maxChunkSize"
// equals to "!chk.IsFull()".
requiredRows: maxChunkSize,
}
for _, f := range fields {
chk.columns = append(chk.columns, NewColumn(f, chk.capacity))
}
return chk
}
There seems to be these "size"s:
-
current allocated space (the commonly known capacity):
capacity
seems to be used like this, but it is not safe:AppendRow
may break it.capacity
is only updated whenGrowAndReset
. What’s more, its comment “cap: the limit for the max number of rows” is misleading.However, there are some use cases where
requiredRows
is used. IMOIsFull
is a capacity concept and is usingrequiredRows
.In practice, both
Capacity()
andRequiredRows()
are used somewhere to represent “batchsize”. -
limit, max size: no this limit actually. Paramater
maxChunkSize
can restrictGrowAndReset
, but we canAppendRow
arbitrarily. -
needed size:
SetRequiredRows
+IsFull
is used. However, I’m wondering why not simply resize the chunk. Is this concept redundant? What does it mean whenneeded size != capacity
?