Proposal: make parser support all the charsets

xiongjiwei · October 15, 2021, 8:33am

Background

Currently, TiDB supports only a few charsets, and so does the parser. This is unfriendly to users of the parser; see, e.g., https://github.com/pingcap/parser/issues/1140.

So I propose making the parser support all the charsets that MySQL supports.

How

Make the parser recognize the charsets. This is easy; see https://github.com/pingcap/tidb/pull/28824 for an example.
When rewriting the AST to an expression, TiDB should check the charset and report an error if it is not supported yet. TiDB will maintain its own table of supported charsets.
Clean up code such as GetSupportedCharsets. This function lives in the parser but is used by TiDB for the SQL statement show charset; TiDB should keep its own GetSupportedCharsets.
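
To make the second step concrete, here is a minimal sketch of what a TiDB-side check could look like. The names supportedCharsets and checkCharset and the contents of the table are assumptions made for illustration; they are not taken from the TiDB code base.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedCharsets is a hypothetical table kept on the TiDB side.
// The entries are illustrative only.
var supportedCharsets = map[string]struct{}{
	"utf8":    {},
	"utf8mb4": {},
	"ascii":   {},
	"latin1":  {},
	"binary":  {},
}

// checkCharset is a hypothetical helper that would run while the AST is
// rewritten to an expression. The parser has already accepted the name;
// this only decides whether TiDB can actually handle it.
func checkCharset(name string) error {
	if _, ok := supportedCharsets[strings.ToLower(name)]; !ok {
		return fmt.Errorf("unsupported charset: %q", name)
	}
	return nil
}

func main() {
	for _, cs := range []string{"utf8mb4", "GBK"} {
		if err := checkCharset(cs); err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(cs, "is supported")
	}
}
```

With this split, the parser only needs to know which names are charsets, while the decision about what is actually supported stays in TiDB.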

Something to discuss

At the moment, _gbk is parsed as an identifier rather than a charset introducer, because the parser does not support gbk yet (the same holds for other unsupported charsets). The SQL select _gbk a from t works if table t has a column named _gbk. Once the parser supports all the charsets, this query will report a syntax error, because _gbk will be recognized as a charset introducer. This is a compatibility break.
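
The behavior change can be illustrated with a simplified sketch. This is not the real lexer; classify, tokenKind, and the two charset tables are hypothetical and only show why the same token changes meaning once gbk becomes a known charset.

```go
package main

import (
	"fmt"
	"strings"
)

type tokenKind int

const (
	identifier tokenKind = iota
	charsetIntroducer
)

// classify treats "_name" as a charset introducer only when "name" is in
// the table of known charsets; otherwise it stays a plain identifier.
func classify(word string, known map[string]bool) tokenKind {
	if strings.HasPrefix(word, "_") && known[strings.ToLower(word[1:])] {
		return charsetIntroducer
	}
	return identifier
}

func main() {
	// Illustrative tables: the parser before and after gbk is recognized.
	before := map[string]bool{"utf8": true, "utf8mb4": true}
	after := map[string]bool{"utf8": true, "utf8mb4": true, "gbk": true}

	fmt.Println(classify("_gbk", before) == identifier)       // prints true
	fmt.Println(classify("_gbk", after) == charsetIntroducer) // prints true
}
```

In the first case select _gbk a from t reads as a column _gbk aliased to a; in the second the grammar expects a string literal after the introducer, so the same query fails to parse.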

My suggestion is to let it break. The main reason is that we cannot avoid it: if we support the gbk charset in the future, we will hit the same problem anyway. MySQL cannot avoid it either: if a newer MySQL version adds support for some charset abc, _abc becomes a charset introducer there too.

Alternative

As you can see in https://github.com/pingcap/parser/pull/1301, one alternative is to add a function named AddCharset, which provides a way to add a custom charset. That is fine, but every parser user would then have to do a lot of work to make the parser support the charsets that MySQL supports. It would be better to provide an out-of-the-box parser. This proposal also helps decouple the parser and TiDB in the charset part.