Background
For now, TiDB supports only a few charsets, and so does the parser. This is unfriendly to parser users, e.g. https://github.com/pingcap/parser/issues/1140.
So I propose making the parser support all the charsets that MySQL supports.
How
- Make the parser recognize the charsets. This is easy, just like https://github.com/pingcap/tidb/pull/28824.
- When rewriting the AST to an expression, TiDB should check the charset and report an error if it is not supported yet. TiDB will maintain its own table of supported charsets.
- Clean up code such as GetSupportedCharsets. This function lives in the parser but is used by TiDB for the SQL statement "show charset"; TiDB should keep its own GetSupportedCharsets.
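The second step above can be sketched roughly as follows. This is only an illustration of the idea, not the actual TiDB code: the supportedCharsets table, its entries, and the checkCharset helper are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedCharsets is a hypothetical table owned by TiDB rather than by
// the parser. The entries are illustrative only.
var supportedCharsets = map[string]struct{}{
	"utf8":    {},
	"utf8mb4": {},
	"ascii":   {},
	"latin1":  {},
	"binary":  {},
}

// checkCharset is a hypothetical helper that would be called while
// rewriting the AST to an expression. The parser has already accepted the
// charset name; this is where TiDB rejects the ones it cannot handle yet.
func checkCharset(name string) error {
	if _, ok := supportedCharsets[strings.ToLower(name)]; !ok {
		return fmt.Errorf("unsupported charset: %q", name)
	}
	return nil
}

func main() {
	for _, cs := range []string{"utf8mb4", "GBK"} {
		if err := checkCharset(cs); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("ok:", cs)
	}
}
```

With this split, the parser accepts every MySQL charset name, and the decision about what is actually usable stays on the TiDB side.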
Something to discuss
At this moment, _gbk is parsed as an identifier rather than a charset introducer, since the parser does not support gbk yet (the same goes for other unsupported charsets). The SQL statement "select _gbk a from t" works if table t has a column named _gbk. Once the parser supports all the charsets, this query will report a syntax error, because _gbk will be recognized as a charset introducer. This is a compatibility break.
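To make the behavior change concrete, here is a simplified sketch of the decision a lexer has to make. It is not the parser's real code: tokenKind, classify, and the two charset sets are hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenKind is a simplified stand-in for the parser's token types.
type tokenKind string

const (
	identifier        tokenKind = "identifier"
	charsetIntroducer tokenKind = "charset introducer"
)

// classify treats a word as a charset introducer only when it starts with
// '_' and the rest of the word names a charset the parser knows about.
// Anything else stays an ordinary identifier.
func classify(word string, known map[string]bool) tokenKind {
	if strings.HasPrefix(word, "_") && known[strings.ToLower(word[1:])] {
		return charsetIntroducer
	}
	return identifier
}

func main() {
	// Hypothetical charset sets before and after the proposed change.
	before := map[string]bool{"utf8": true, "utf8mb4": true}
	after := map[string]bool{"utf8": true, "utf8mb4": true, "gbk": true}

	fmt.Println("before:", classify("_gbk", before))
	fmt.Println("after: ", classify("_gbk", after))
}
```

The same word changes token kind as soon as gbk enters the known set, which is why a query that selects a column named _gbk stops parsing.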
My suggestion is to let it break. The main reason is that we cannot avoid it: if we support the gbk charset in the future, we will hit this problem anyway. MySQL cannot avoid it either, if a newer MySQL version adds support for some new charset abc.
Alternative
As you can see in https://github.com/pingcap/parser/pull/1301, the alternative is to add a function named AddCharset, which provides a way to add a custom charset. It is fun, but every parser user would have to do a lot of work to make the parser support the charsets that MySQL supports. It would be great if we provided an out-of-the-box parser. This proposal also helps decouple the parser and TiDB in the charset part.
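For comparison, the sketch below shows what the alternative could look like from a parser user's side. The Charset struct, the registry map, and the AddCharset signature are assumptions made for this example; the function in the linked pull request may look different.

```go
package main

import "fmt"

// Charset is a hypothetical, simplified description of a charset.
type Charset struct {
	Name             string
	DefaultCollation string
	Maxlen           int // maximum bytes per character
}

// registry is a hypothetical stand-in for the parser's charset table.
var registry = map[string]*Charset{}

// AddCharset mirrors the idea of the alternative: the user registers
// every charset by hand before parsing. The signature is an assumption.
func AddCharset(c *Charset) {
	registry[c.Name] = c
}

func main() {
	// Each parser user would have to repeat calls like these for every
	// MySQL charset they care about.
	AddCharset(&Charset{Name: "gbk", DefaultCollation: "gbk_chinese_ci", Maxlen: 2})
	AddCharset(&Charset{Name: "big5", DefaultCollation: "big5_chinese_ci", Maxlen: 2})

	fmt.Println(len(registry), "charsets registered by hand")
}
```

Shipping the full MySQL list inside the parser removes this per-user setup, which is the out-of-the-box behavior argued for above.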