Refactor `util/chunk`

Recently I read the code of Chunk and felt a little confused. The code looks messy because some legacy problems have been around for three years, so it may need refactoring.

In short, I think there are too many "size"s: capacity, requiredRows, and maxChunkSize.

Both carry TODOs and FIXMEs, but there were never any follow-ups and the code hasn’t changed.

type Chunk struct {
	...

	// capacity indicates the max number of rows this chunk can hold.
	// TODO: replace all usages of capacity to requiredRows and remove this field
	capacity int

	// requiredRows indicates how many rows the parent executor want.
	requiredRows int
}

// NewChunkWithCapacity creates a new chunk with field types and capacity.
func NewChunkWithCapacity(fields []*types.FieldType, cap int) *Chunk {
	return New(fields, cap, cap) //FIXME: in following PR.
}

// New creates a new chunk.
//  cap: the limit for the max number of rows.
//  maxChunkSize: the max limit for the number of rows.
func New(fields []*types.FieldType, cap, maxChunkSize int) *Chunk {
	chk := &Chunk{
		columns:  make([]*Column, 0, len(fields)),
		capacity: mathutil.Min(cap, maxChunkSize),
		// set the default value of requiredRows to maxChunkSize to let chk.IsFull() behave
		// like how we judge whether a chunk is full now, then the statement
		// "chk.NumRows() < maxChunkSize"
		// equals to "!chk.IsFull()".
		requiredRows: maxChunkSize,
	}

	for _, f := range fields {
		chk.columns = append(chk.columns, NewColumn(f, chk.capacity))
	}

	return chk
}

There seem to be these "size"s:

  • current allocated space (the commonly known capacity):
    capacity seems to be used like this, but it is not safe: AppendRow may break it, since capacity is only updated in GrowAndReset (see the sketch after this list). What’s more, the comment “cap: the limit for the max number of rows” is misleading.

    However, requiredRows is used in some of these places instead. IMO IsFull is a capacity concept, yet it is implemented with requiredRows.

    In practice, both Capacity() and RequiredRows() are used in various places to represent the “batch size”.

  • limit, max size: there is actually no such limit. The maxChunkSize parameter restricts GrowAndReset, but AppendRow can append arbitrarily many rows.

  • needed size: SetRequiredRows + IsFull are used for this. However, I wonder why we don’t simply resize the chunk. Is this concept redundant? What does it mean when the needed size != capacity?
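
To make the first two points concrete, here is a minimal sketch (the field-type setup assumes the types and mysql packages of the TiDB version discussed, and AppendInt64 takes a column index plus a value):

fieldTypes := []*types.FieldType{types.NewFieldType(mysql.TypeLonglong)}
chk := New(fieldTypes, 4, 4) // capacity = 4, requiredRows = 4

for i := 0; i < 10; i++ {
	chk.AppendInt64(0, int64(i)) // nothing here enforces capacity or maxChunkSize
}
fmt.Println(chk.NumRows()) // 10: we appended well past capacity
fmt.Println(chk.IsFull())  // true, but only because NumRows() >= requiredRows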

> needed size: SetRequiredRows + IsFull are used for this. However, I wonder why we don’t simply resize the chunk. Is this concept redundant? What does it mean when the needed size != capacity?

I think it is to reduce allocations, just like std::vector in C++. One idea sounds reasonable to me:

requiredRows can be moved outside util/chunk. Then the chunk only needs len(), the length of the chunk, which cannot change except by pushing/popping elements, and how many rows may be pushed is decided by the caller. If the chunk is not responsible for required/max, it behaves much more like len() and cap() in other languages. A sketch of this idea follows.
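
A minimal sketch of what that could look like, with hypothetical names (sizedChunk and full are illustrative, not existing code):

// Hypothetical: Chunk keeps only container semantics, and the
// caller owns the "required rows" policy itself.
type sizedChunk struct {
	chk          *Chunk
	requiredRows int // decided and stored by the caller, not by util/chunk
}

// full reports whether the caller already has all the rows it asked
// for; the chunk itself no longer knows about requiredRows.
func (s *sizedChunk) full() bool {
	return s.chk.NumRows() >= s.requiredRows
}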

I think len() and cap() are very easy to understand, since Chunk is just a container, and I agree that we can move requiredRows outside. However, I found that before requiredRows was added to Chunk, there was an attempt to add a wrapper: https://github.com/pingcap/tidb/pull/9273

// RecordBatch is input parameter of Executor.Next method.
type RecordBatch struct {
	*Chunk
	requiredRows int
}
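
Presumably the wrapper lets the parent executor pass the desired row count down without touching Chunk itself. A hypothetical caller might look like this (the Executor interface and fetch function are illustrative, not from the PR):

// Executor is a minimal stand-in for TiDB's executor interface;
// the real one lives in the executor package.
type Executor interface {
	Next(ctx context.Context, batch *RecordBatch) error
}

// fetch wraps the chunk together with the number of rows the parent
// wants, and hands both to the child in one call.
func fetch(ctx context.Context, child Executor, chk *Chunk, required int) error {
	batch := &RecordBatch{Chunk: chk, requiredRows: required}
	return child.Next(ctx, batch)
}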

This PR was closed without detailed reasons.

@qw4990: After discussing with @zz-jason and @lysu , this approach to control the number of rows has been deprecated, and the new solution is in #9293

By the way, I found the terminology also differs from Apache Arrow:

In arrow, there are

  • Array (or Vector)
  • Column = ChunkedArray: a collection of arrays treated as one logical large array ([]Array)
  • 2D
    • Table = []Column
    • Record (or RecordBatch) = []Array

In util/chunk, it seems that

  • Column = arrow Array
  • Chunk = []Column = arrow Record
  • List = []Chunk = arrow Table

Oh, I found a comment here: https://github.com/pingcap/tidb/pull/5200/files#r153395314

> all Chunk’s API should not do the validation work.

func (c *Column) finishAppendFixed() {
	c.data = append(c.data, c.elemBuf...)
	c.appendNullBitmap(true)
	c.length++
}

// AppendInt64 appends an int64 value into this Column.
func (c *Column) AppendInt64(i int64) {
	*(*int64)(unsafe.Pointer(&c.elemBuf[0])) = i
	c.finishAppendFixed()
}

Why do we need elemBuf?
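
For comparison, a version that skips the shared scratch buffer might look like this (a hypothetical alternative for illustration, not the actual implementation):

// Hypothetical: write the int64 into a local buffer and append its
// bytes directly, without going through the shared elemBuf.
func (c *Column) appendInt64Direct(i int64) {
	var buf [8]byte
	*(*int64)(unsafe.Pointer(&buf[0])) = i
	c.data = append(c.data, buf[:]...)
	c.appendNullBitmap(true)
	c.length++
}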

I submitted a PR with some simple refactoring as the first step. (It doesn’t solve the critical problems mentioned here.)
