Calcite (1): The JavaCC Grammar Framework and Its Usage -- 688IT编程网

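Apache Calcite is a dynamic data management framework. It contains many of the pieces that make up a typical database management system but omits the storage primitives. It provides an industry-standard SQL parser and validator, a customizable optimizer with pluggable rules and cost functions, logical and physical algebraic operators, and various transformations from SQL to algebra (and back).

That is the official description. In plain terms, Calcite implements a standard SQL parsing layer (for example, it can parse standard Hive SQL), which spares you from tedious, error-prone grammar work, and it exposes extension points that users can customize. It also lets you rewrite the logical plan, so you can implement your own optimizations. (Still a bit of a mouthful, but let's move on.)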

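1. The two main parts of Calcite

In terms of core functionality, Calcite can be roughly divided into two parts: one parses SQL syntax, the other transforms and implements the semantics.

Why split it up? Almost all layering exists to simplify the logic of each layer. If all the logic lived in a single layer, it would be heavily coupled and intertwined, and it would be hard to let specialists handle their own specialty. Parsing is hard in its own right, but it rests on mature compiler theory, so there are plenty of ready-made implementations to reuse, and even writing this part yourself is not too much trouble. This layer therefore has to be separated out.

Semantic transformation and implementation is the layer users care about more. If the syntax is the standard, the semantics is what implementers care about most. A standard lowers the barrier for users, while the logic behind it can differ enormously from one implementation to another. Once the parse tree is available, handling the semantics becomes much easier, but it is still complex work, because context-dependent semantics are not easy to deal with.

This article covers only the parsing part. Calcite uses JavaCC as its parser generator, so JavaCC is our main focus. ANTLR is a similar tool, which we leave for a later article.

In Calcite, running JavaCC can be regarded as the first compilation stage. The Java code then pulls in the boilerplate that JavaCC generated and runs its own logic on top of it, which can be regarded as the second stage. The figure below illustrates the idea.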

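2. The JavaCC grammar framework

This article explains JavaCC purely from a user's point of view. JavaCC itself is complex and hard to cover fully, and anyone who wants the details will need the official documentation anyway.

First, the overall layout of a JavaCC grammar file:

javacc_options                       /* JavaCC option settings, written as key = value pairs once you know what each option means */
"PARSER_BEGIN" "(" <IDENTIFIER> ")"  /* start of the parser definition; the code that follows is plain Java */
java_compilation_unit                /* the parser's entry code, plain Java; this decides how callers invoke the parser */
"PARSER_END" "(" <IDENTIFIER> ")"    /* end of the parser definition; JavaCC copies the code above verbatim into the generated parser */
( production )*                      /* grammar productions, written much like productions in compiler theory; JavaCC analyzes them and compiles them into methods of the parser above, so the parser code can call them directly */
<EOF>                                /* end of file */

That is the framework of a JavaCC grammar definition: a single parser.jj file. Write the file according to this layout, run javacc on it, and you get a set of generated parser source files.

But how do you actually write the grammar? Not knowing where to start is awkward. No rush: see the next section.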

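3. JavaCC keywords and how to use them

The reason it is hard to get started on a .jj file is that we do not know which keywords exist and have not seen any examples. Practice makes perfect.

JavaCC has very few keywords. One reason is that the methodology behind this kind of parser generator is very mature: a small set of constructs is enough to describe the grammar you want parsed. The other is that it stays out of business logic: it only has to understand the grammar and translate it. Even among these few keywords some are merely auxiliary, so the truly essential ones are fewer still. They are:

TOKEN            /* defines the tokens (keywords, literals, regular expressions) that the grammar productions refer to */
SPECIAL_TOKEN    /* defines tokens that are recognized but not passed to the parser; they stay attached to neighboring tokens, which suits comments you want to keep */
SKIP             /* defines text to match and throw away, typically whitespace or comments */
MORE             /* token helper: keeps the matched text and continues matching, so it becomes the prefix of the next token */
EOF              /* end-of-file (end-of-input) marker */
IGNORE_CASE      /* helper option: match case-insensitively */
JAVACODE         /* helper: marks a production whose body is written in plain Java */
LOOKAHEAD        /* resolves grammar ambiguity by reading ahead several tokens before choosing a branch */
PARSER_BEGIN     /* boilerplate, fixed start marker */
PARSER_END       /* boilerplate, fixed end marker */
TOKEN_MGR_DECLS  /* helper: declarations to be inserted into the generated token manager */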
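To make the difference between TOKEN, SKIP, IGNORE_CASE and EOF more concrete, here is a small hand-written Java sketch of a tokenizer that behaves roughly the way those keywords are described in the list above. It is not JavaCC output: the class name MiniTokenizer, the two token names and the string-based token kinds are made up for illustration, and a generated token manager works from an automaton rather than from chained comparisons.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written stand-in for what TOKEN, SKIP, IGNORE_CASE and EOF mean; not JavaCC output. */
final class MiniTokenizer {

    /** Splits the input into token kinds ("HELLO", "WORLD") followed by a final "EOF". */
    static List<String> tokenize(String input) {
        List<String> kinds = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            // SKIP: matched and thrown away, so it never reaches the parser
            if (c == ' ' || c == '\t' || c == '\r' || c == '\n') {
                pos++;
                continue;
            }
            // TOKEN with IGNORE_CASE: the literals are compared case-insensitively
            if (input.regionMatches(true, pos, "hello", 0, 5)) {
                kinds.add("HELLO");
                pos += 5;
            } else if (input.regionMatches(true, pos, "world", 0, 5)) {
                kinds.add("WORLD");
                pos += 5;
            } else {
                // roughly the point where a token manager would report a lexical error
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at offset " + pos);
            }
        }
        kinds.add("EOF"); // end of input is delivered as a token of its own
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello \n\t WORLD")); // prints [HELLO, WORLD, EOF]
    }
}
```

The whitespace handled by the SKIP-style branch never shows up in the result, and the mixed-case input is still recognized because of the case-insensitive comparison.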

With these keywords in hand, we can write a hello world. Its only job is to check whether the input is "hello world".

options {
    STATIC = false;
    ERROR_REPORTING = true;
    JAVA_UNICODE_ESCAPE = true;
    UNICODE_INPUT = false;
    IGNORE_CASE = true;
    DEBUG_PARSER = false;
    DEBUG_LOOKAHEAD = false;
    DEBUG_TOKEN_MANAGER = false;
}

PARSER_BEGIN(HelloWorldParser)

package my;

import java.io.FileInputStream;

/**
 * hello world parser
 */
@SuppressWarnings({"nls", "unused"})
public class HelloWorldParser {

    /**
     * Test entry point
     */
    public static void main( String args[] ) throws Throwable {
        // JavaCC generates the constructors automatically
        String sqlFilePath = args[0];
        final HelloWorldParser parser = new HelloWorldParser(new FileInputStream(sqlFilePath));
        try {
            parser.hello();
        } catch(Throwable t) {
            t.printStackTrace();
            return;
        }
        System.out.println("ok");
    }

    public void hello () throws ParseException {
        helloEof();
    }

} // end class

PARSER_END(HelloWorldParser)

void helloEof() :
{}
{
    // once "hello world" is matched, print a message; otherwise an exception is thrown
    (
        <HELLO>
        | "HELLO2"
    )
    <WORLD>
    { System.out.println("ok to match hello world."); }
}

TOKEN :
{
    <HELLO: "hello">
    | <WORLD: "world">
}

SKIP:
{
    " "
    | "\t"
    | "\r"
    | "\n"
}

Save the grammar as hello.jj and run javacc on it. By default the generated .java files are written to the current directory, so compile them with -d to get the my package layout, then pass the file to check as the first argument (input.txt is just a placeholder for a text file containing hello world):

> javacc hello.jj
> javac -d . *.java
> java my.HelloWorldParser input.txt

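4. Compiler theory in JavaCC

JavaCC's main job is lexical and syntactic analysis. Understanding the grammar itself is not enough, of course: its other essential feature is that it translates the grammar into a parser written in Java (and not only Java), which users can then call to implement their business logic. In that sense it is a kind of scaffolding tool that generates boilerplate code for us.

Parsing is a very general topic, and researchers distilled it into a large body of methodology long ago: compiler theory. Understanding that theory in depth is still hard, so how far you go is up to you. Here are a few terms for reference:

Productions
Terminals and non-terminals, operands
Predictive parsing, left recursion, backtracking, context-free grammars
DFA, NFA, regular-expression matching, patterns, the KMP algorithm, tries
Semantic actions, declarations
LL, LR, ambiguity
Lexical analysis
Syntax analysis

You could say that JavaCC as a whole implements one small part of compiler theory. We run into compilation all the time, since the languages we use have to be translated into a lower-level form (bytecode, assembly or machine code) before they can run. Compiler theory is everywhere.

Here we only look at how a .jj file is turned into Java source. Broadly, it just follows compiler theory, so we only list some correspondences between grammar constructs and the generated code:

"a" "b"   -> a sequence of consecutive tokens
|         -> if or switch
(..)*     -> while loop
["a"]     -> if statement; matches 0 or 1 times
(): {}    -> a grammar production
{}        -> an action, executed inline after the preceding match
<id>      -> a reference to a named token (a constant word or an easily described token pattern)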
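One way to read these correspondences is to write by hand the kind of Java a parser generator would produce. The sketch below does that for a small made-up grammar, list := "[" [ item ( "," item )* ] "]" with item := NUM | NAME. The grammar, the class ListParserSketch and its helper methods are assumptions for illustration only; real JavaCC output uses its own token-consuming helpers and lookahead machinery and looks quite different in detail.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hand-written recursive-descent parser for the hypothetical grammar
 *   list := "[" [ item ( "," item )* ] "]"
 *   item := NUM | NAME
 * It illustrates the control-flow shapes only; it is not actual JavaCC output.
 */
final class ListParserSketch {
    private final String[] tokens;
    private int pos = 0;

    ListParserSketch(String input) {
        String trimmed = input.trim();
        // Tokens are whitespace-separated here to keep the lexer out of the picture.
        this.tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** One token of lookahead, without consuming it. */
    private String peek() {
        return pos < tokens.length ? tokens[pos] : "<EOF>";
    }

    private void expect(String literal) {
        if (!peek().equals(literal)) {
            throw new IllegalStateException("Expected " + literal + " but found " + peek());
        }
        pos++;
    }

    /** A sequence becomes consecutive statements, in grammar order. */
    List<String> list() {
        List<String> items = new ArrayList<>();
        expect("[");
        if (!peek().equals("]")) {          // [ ... ]   -> if (0 or 1 times)
            items.add(item());
            while (peek().equals(",")) {    // ( ... )*  -> while
                expect(",");
                items.add(item());
            }
        }
        expect("]");
        if (pos != tokens.length) {         // nothing may follow the closing bracket
            throw new IllegalStateException("Expected <EOF> but found " + peek());
        }
        return items;
    }

    /** A choice becomes if / else-if (or switch) on the lookahead token. */
    private String item() {
        String t = peek();
        if (t.matches("[0-9]+")) {          // first alternative: NUM
            pos++;
            return "NUM:" + t;              // { ... } action runs right after the match
        } else if (t.matches("[A-Za-z]+")) { // second alternative: NAME
            pos++;
            return "NAME:" + t;
        }
        throw new IllegalStateException("Expected NUM or NAME but found " + t);
    }

    public static void main(String[] args) {
        // prints [NAME:a, NUM:12, NAME:b]
        System.out.println(new ListParserSketch("[ a , 12 , b ]").list());
    }
}
```

Each production turns into a method, and the values returned from the action positions are where you would assemble your own syntax tree instead of plain strings.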

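By default JavaCC generates several helper classes:

XXConstants: defines constants, for example mapping each TOKEN definition to an integer;
HelloWorldParserTokenManager: the token manager, which reads tokens; its behavior can be customized;
JavaCharStream: a CharStream implementation; which class is generated depends on the options;
ParseException: the exception thrown on a parse error;
Token: the class describing a token that was read;
TokenMgrError: the error thrown when reading a token fails;

From a coding point of view, knowing the basic boilerplate layout and regular expressions is about all you need to write a JavaCC grammar. To use the result in real Java code, you have to build the syntax tree (or whatever other structure you need) yourself.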

5. Walking through the JavaCC source

The entry point is src/main/java/org/javacc/parser/Main.java

/**
 * A main program that exercises the parser.
 */
public static void main(String args[]) throws Exception {
    int errorcode = mainProgram(args);
    System.exit(errorcode);
}

/**
 * The method to call to exercise the parser from other Java programs.
 * It returns an error code. See how the main program above uses
 * this method.
 */
public static int mainProgram(String args[]) throws Exception {
    if (args.length == 1 && args[args.length -1].equalsIgnoreCase("-version")) {
        System.out.println(Version.versionNumber);
        return 0;
    }

    // Initialize all static state
    reInitAll();

    JavaCCGlobals.bannerLine("Parser Generator", "");

    JavaCCParser parser = null;
    if (args.length == 0) {
        System.out.println("");
        help_message();
        return 1;
    } else {
        System.out.println("(type \"javacc\" with no arguments for help)");
    }

    if (Options.isOption(args[args.length-1])) {
        System.out.println("Last argument \"" + args[args.length-1] + "\" is not a filename.");
        return 1;
    }
    for (int arg = 0; arg < args.length-1; arg++) {
        if (!Options.isOption(args[arg])) {
            System.out.println("Argument \"" + args[arg] + "\" must be an option setting.");
            return 1;
        }
        Options.setCmdLineOption(args[arg]);
    }

    try {
        java.io.File fp = new java.io.File(args[args.length-1]);
        if (!fp.exists()) {
            System.out.println("File " + args[args.length-1] + " not found.");
            return 1;
        }
        if (fp.isDirectory()) {
            System.out.println(args[args.length-1] + " is a directory. Please use a valid file name.");
            return 1;
        }
        // JavaCCParser is itself generated by JavaCC from a grammar file, so JavaCC is self-hosting
        parser = new JavaCCParser(new java.io.BufferedReader(new java.io.InputStreamReader(
                new java.io.FileInputStream(args[args.length-1]), Options.getGrammarEncoding())));
    } catch (SecurityException se) {
        System.out.println("Security violation while trying to open " + args[args.length-1]);
        return 1;
    } catch (java.io.FileNotFoundException e) {
        System.out.println("File " + args[args.length-1] + " not found.");
        return 1;
    }

    try {
        System.out.println("Reading from file " + args[args.length-1] + " . . .");
        // static fields are used to share data globally
        JavaCCGlobals.fileName = JavaCCGlobals.origFileName = args[args.length-1];
        JavaCCGlobals.jjtreeGenerated = JavaCCGlobals.isGeneratedBy("JJTree", args[args.length-1]);
        // entry point for parsing the grammar file;
        // after parsing, the results are stored in global (static) variables
        parser.javacc_input();

        // 2012/05/02 - Moved this here as cannot evaluate output language
        // until the cc file has been processed. Was previously setting the 'lg' variable
        // to a lexer before the configuration override in the cc file had been read.
        String outputLanguage = Options.getOutputLanguage();
        // TODO :: CBA -- Require Unification of output language specific processing into a single Enum class
        boolean isJavaOutput = Options.isOutputLanguageJava();
        boolean isCPPOutput = outputLanguage.equals(Options.OUTPUT_LANGUAGE__CPP);

        // 2013/07/22 Java Modern is a
        boolean isJavaModern = isJavaOutput && Options.getJavaTemplateType().equals(Options.JAVA_TEMPLATE_TYPE_MODERN);

        if (isJavaOutput) {
            lg = new LexGen();
        } else if (isCPPOutput) {
            lg = new LexGenCPP();
        } else {
            return unhandledLanguageExit(outputLanguage);
        }

        if (Options.getUnicodeInput())
        {
            NfaState.unicodeWarningGiven = true;
            System.out.println("Note: UNICODE_INPUT option is specified. " +
                "Please make sure you create the parser/lexer using a Reader with the correct character encoding.");
        }

        // semantic pass over the parsed grammar: checks it and builds the
        // cross-referenced context information used by the later steps
        Semanticize.start();
        boolean isBuildParser = Options.getBuildParser();

        // 2012/05/02 -- This is not the best way to add-in GWT support, really the code needs to turn supported languages into enumerations
        // and have the enumerations describe the deltas between the outputs. The current approach means that per-language configuration is distributed
        // and small changes between targets does not benefit from inheritance.
        if (isJavaOutput) {
            if (isBuildParser) {
                // 1. generate the parser class
                new ParseGen().start(isJavaModern);
            }

            // Must always create the lexer object even if not building a parser.
            // 2. generate the token manager (lexer)
            new LexGen().start();

            // 3. generate the other helper classes
            Options.setStringOption(Options.NONUSER_OPTION__PARSER_NAME, JavaCCGlobals.cu_name);
            OtherFilesGen.start(isJavaModern);
        } else if (isCPPOutput) { // C++ for now
            if (isBuildParser) {
                new ParseGenCPP().start();
            }
            if (isBuildParser) {
                new LexGenCPP().start();
            }
            Options.setStringOption(Options.NONUSER_OPTION__PARSER_NAME, JavaCCGlobals.cu_name);
            OtherFilesGenCPP.start();
        } else {
            unhandledLanguageExit(outputLanguage);
        }

        // decide and report the outcome of the run
        if ((JavaCCErrors.get_error_count() == 0) && (isBuildParser || Options.getBuildTokenManager())) {
            if (JavaCCErrors.get_warning_count() == 0) {
                if (isBuildParser) {
                    System.out.println("Parser generated successfully.");
                }
            } else {
                System.out.println("Parser generated with 0 errors and "
                    + JavaCCErrors.get_warning_count() + " warnings.");
            }
            return 0;
        } else {
            System.out.println("Detected " + JavaCCErrors.get_error_count() + " errors and "
                + JavaCCErrors.get_warning_count() + " warnings.");
            return (JavaCCErrors.get_error_count()==0)?0:1;
        }
    } catch (MetaParseException e) {
        System.out.println("Detected " + JavaCCErrors.get_error_count() + " errors and "
            + JavaCCErrors.get_warning_count() + " warnings.");
        return 1;
    } catch (ParseException e) {
        System.out.println(e.toString());
        System.out.println("Detected " + (JavaCCErrors.get_error_count()+1) + " errors and "
            + JavaCCErrors.get_warning_count() + " warnings.");
        return 1;
    }
}

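That is the overall flow of a JavaCC run. The grammar file is parsed by JavaCCParser, which is in turn generated from JavaCC's own .jj file:

1. Create the JavaCCParser and call javacc_input() to parse the grammar file;
2. Store the parsed grammar information in global (static) variables;
3. Use Semanticize to check and enrich that information, turning it into structures JavaCC can process;
4. Use ParseGen to generate the parser class;
5. Use LexGen to generate the token manager (the lexer);
6. Use OtherFilesGen to generate the accompanying helper classes;

Below we expand a few of these key classes to see roughly how they are implemented.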

5.1. JavaCC's own grammar definition

As mentioned, to process other grammars JavaCC defines a grammar file of its own, used for the first step of parsing the input, which shows how general the approach is. We only look at the entry production here; the complete source is in src/main/javacc/JavaCC.jj

void javacc_input() :
{
    String id1, id2;
    initialize();
}
{
    javacc_options()
    {
    }
    "PARSER_BEGIN" "(" id1=identifier()
    {
        addcuname(id1);
    }
    ")"
    {
        processing_cu = true;
        parser_class_name = id1;
        if (!isJavaLanguage()) {
            while(getToken(1).kind != _PARSER_END) {
                getNextToken();
            }
        }
    }
    CompilationUnit()
    {
        processing_cu = false;
    }
    "PARSER_END" "(" id2=identifier()
    {
        compare(getToken(0), id1, id2);
    }
    ")"
    ( production() )+
    <EOF>
}
...

As you can see, this grammar definition differs little from the layout given in the documentation; it reads fairly close to natural language.

5.2. Semanticize: semantic processing

Semanticize takes the data produced by parsing the grammar file, validates it, and converts it into lookup tables and other structures that are easier for the later steps to work with.

// org.javacc.parser.Semanticize#start
static public void start() throws MetaParseException {
    if (JavaCCErrors.get_error_count() != 0) throw new MetaParseException();

    if (Options.getLookahead() > 1 && !Options.getForceLaCheck() && Options.getSanityCheck()) {
        JavaCCErrors.warning("Lookahead adequacy checking not being performed since option LOOKAHEAD " +
            "is more than 1. Set option FORCE_LA_CHECK to true to force checking.");
    }

    /*
     * The following walks the entire parse tree to convert all LOOKAHEAD's
     * that are not at choice points (but at beginning of sequences) and converts
     * them to trivial choices. This way, their semantic lookahead specification
     * can be evaluated during other lookahead evaluations.
     */
    for (Iterator<NormalProduction> it = bnfproductions.iterator(); it.hasNext();) {
        ExpansionTreeWalker.postOrderWalk(((NormalProduction)it.next()).getExpansion(),
            new LookaheadFixer());
    }

    /*
     * The following loop populates "production_table"
     */
    for (Iterator<NormalProduction> it = bnfproductions.iterator(); it.hasNext();) {
        NormalProduction p = it.next();
        if (production_table.put(p.getLhs(), p) != null) {
            JavaCCErrors.semantic_error(p, p.getLhs() + " occurs on the left hand side of more than one production.");
        }
    }

    /*
     * The following walks the entire parse tree to make sure that all
     * non-terminals on RHS's are defined on the LHS.
     */
    for (Iterator<NormalProduction> it = bnfproductions.iterator(); it.hasNext();) {
        ExpansionTreeWalker.preOrderWalk((it.next()).getExpansion(), new ProductionDefinedChecker());
    }

    /*
     * The following loop ensures that all target lexical states are
     * defined. Also piggybacking on this loop is the detection of
     * <EOF> and <name> in token productions. After reporting an
     * error, these entries are removed. Also checked are definitions
     * on inline private regular expressions.
     * This loop works slightly differently when USER_TOKEN_MANAGER
     * is set to true. In this case, <name> occurrences are OK, while
     * regular expression specs generate a warning.
     */
    for (Iterator<TokenProduction> it = rexprlist.iterator(); it.hasNext();) {
        TokenProduction tp = (TokenProduction)(it.next());
        List<RegExprSpec> respecs = tp.respecs;
        for (Iterator<RegExprSpec> it1 = respecs.iterator(); it1.hasNext();) {
            RegExprSpec res = (RegExprSpec)(it1.next());
            if (res.nextState != null) {
                if (lexstate_S2I.get(res.nextState) == null) {
                    JavaCCErrors.semantic_error(res.nsTok, "Lexical state \"" + res.nextState +
                        "\" has not been defined.");
                }
            }
            if (res.rexp instanceof REndOfFile) {
                //JavaCCErrors.semantic_error(res.rexp, "Badly placed <EOF>.");
                if (tp.lexStates != null)
                    JavaCCErrors.semantic_error(res.rexp, "EOF action/state change must be specified for all states, " +
                        "i.e., <*>TOKEN:.");
                if (tp.kind != TokenProduction.TOKEN)
                    JavaCCErrors.semantic_error(res.rexp, "EOF action/state change can be specified only in a " +
                        "TOKEN specification.");
                if (nextStateForEof != null || actForEof != null)
                    JavaCCErrors.semantic_error(res.rexp, "Duplicate action/state change specification for <EOF>.");
                actForEof = res.act;
                nextStateForEof = res.nextState;
                prepareToRemove(respecs, res);
            } else if (tp.isExplicit && Options.getUserTokenManager()) {
                JavaCCErrors.warning(res.rexp, "Ignoring regular expression specification since " +
                    "option USER_TOKEN_MANAGER has been set to true.");
            } else if (tp.isExplicit && !Options.getUserTokenManager() && res.rexp instanceof RJustName) {
                JavaCCErrors.warning(res.rexp, "Ignoring free-standing regular expression reference. " +
                    "If you really want this, you must give it a different label as <NEWLABEL:<"
                    + res.rexp.label + ">>.");
                prepareToRemove(respecs, res);
            } else if (!tp.isExplicit && res.rexp.private_rexp) {
                JavaCCErrors.semantic_error(res.rexp, "Private (#) regular expression cannot be defined within " +
                    "grammar productions.");
            }
        }
    }

    removePreparedItems();

    /*
     * The following loop inserts all names of regular expressions into
     * "named_tokens_table" and "ordered_named_tokens".
     * Duplications are flagged as errors.
     */
    for (Iterator<TokenProduction> it = rexprlist.iterator(); it.hasNext();) {
        TokenProduction tp = (TokenProduction)(it.next());
        List<RegExprSpec> respecs = tp.respecs;
        for (Iterator<RegExprSpec> it1 = respecs.iterator(); it1.hasNext();) {
            RegExprSpec res = (RegExprSpec)(it1.next());
            if (!(res.rexp instanceof RJustName) && !res.rexp.label.equals("")) {
                String s = res.rexp.label;
                Object obj = named_tokens_table.put(s, res.rexp);
                if (obj != null) {
                    JavaCCErrors.semantic_error(res.rexp, "Multiply defined lexical token name \"" + s + "\".");
                } else {
                    ordered_named_tokens.add(res.rexp);
...
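The duplicate-name check in the listing, where the result of named_tokens_table.put(...) is compared against null, relies on the fact that put on a java.util.Map returns the value previously stored under the key. The sketch below isolates that idiom, assuming the table behaves like an ordinary Map; the class DuplicateNameCheck, its method and the message format are hypothetical and not part of JavaCC.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the "put returns the previous value" idiom for duplicate detection. */
final class DuplicateNameCheck {

    /** Returns one message for every name that had already been registered. */
    static List<String> findDuplicates(List<String> names) {
        Map<String, Integer> table = new LinkedHashMap<>(); // name -> index of latest definition
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            String name = names.get(i);
            // Map.put returns the old value, or null if the key was absent
            Integer previous = table.put(name, i);
            if (previous != null) {
                errors.add(name + " defined at index " + previous + " and again at index " + i);
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // prints [hello defined at index 0 and again at index 2]
        System.out.println(findDuplicates(List.of("hello", "world", "hello")));
    }
}
```

A single put both registers the name and reports the clash, so no separate containsKey lookup is needed; note that the later definition overwrites the earlier one in the table.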
