Presto Source Code Analysis - From SQL to AST (Abstract Syntax Tree)

The previous article in this Presto source code series described how SQL is submitted from the client to the Coordinator in both CLI mode and JDBC mode. In this article, we look at how the submitted SQL is preprocessed on the Coordinator and how it becomes an AST (abstract syntax tree).

The source sequence diagram is as follows:

Next, let's take a closer look at the most important classes and methods in the whole process (some are temporarily ignored):

QueuedStatementResource: responsible for handling the client's RESTful requests, including receiving queries, reporting query execution status, etc. Its key endpoints are:

URL                                Request Method    Effect
/v1/statement                      POST              Submit query
queued/{queryId}/{slug}/{token}    GET               Get query execution status

The first request URL is /v1/statement, which maps to the QueuedStatementResource.postStatement method. This method handles the HTTP parameters of the request, constructs the SessionContext (used later to build the Session object), and returns nextUri (the URL of the client's next request). One important step while constructing the SessionContext is parsing the X-Presto-Prepared-Statement request header with the SQL parser; this is the same syntax parsing described later in this article, and EXECUTE-type SQL enters this logic.

preparedStatements = parsePreparedStatementsHeaders(headers);
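As a rough illustration of what such header parsing involves, here is a hypothetical sketch. The class name and the exact header format are assumptions for illustration, not Presto's implementation; we assume each header value carries a name=sql pair with both halves URL-encoded.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of prepared-statement header decoding, assuming each
// X-Presto-Prepared-Statement header value is "name=sql" with both halves URL-encoded.
final class PreparedStatementHeaders {
    static Map<String, String> parse(Iterable<String> headerValues) {
        Map<String, String> statements = new LinkedHashMap<>();
        for (String value : headerValues) {
            int eq = value.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Invalid prepared statement header: " + value);
            }
            // decode the statement name and the SQL text separately
            String name = URLDecoder.decode(value.substring(0, eq), StandardCharsets.UTF_8);
            String sql = URLDecoder.decode(value.substring(eq + 1), StandardCharsets.UTF_8);
            statements.put(name, sql);
        }
        return statements;
    }
}
```

The decoded map ends up in the SessionContext, so the named statements are available when an EXECUTE later refers to them.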

The query is not submitted immediately by the first /v1/statement request. The second request, made to the nextUri returned by the first, triggers the actual submission; the specific logic is in the Query.waitForDispatched method, which calls DispatchManager.createQuery.

private ListenableFuture<?> waitForDispatched()
{
    // if query submission has not finished, wait for it to finish
    synchronized (this) {
        if (querySubmissionFuture == null) {
            querySubmissionFuture = dispatchManager.createQuery(queryId, slug, sessionContext, query);
        }
        if (!querySubmissionFuture.isDone()) {
            return querySubmissionFuture;
        }
    }

    // otherwise, wait for the query to finish
    return dispatchManager.waitForDispatched(queryId);
}
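The pattern in waitForDispatched (the first caller lazily starts an asynchronous submission; every caller then waits on the same future) can be sketched with plain CompletableFuture. This is a simplified analogue with illustrative names, not Presto's actual classes:

```java
import java.util.concurrent.CompletableFuture;

// Simplified analogue of Query.waitForDispatched: the first caller lazily
// kicks off an asynchronous submission; subsequent callers share the same future.
class LazyDispatcher {
    private CompletableFuture<String> submissionFuture;

    synchronized CompletableFuture<String> waitForDispatched(String query) {
        if (submissionFuture == null) {
            // first call: start the (simulated) asynchronous submission
            submissionFuture = CompletableFuture.supplyAsync(() -> "dispatched: " + query);
        }
        return submissionFuture;
    }
}
```

The synchronized block guarantees the submission is started exactly once even if multiple polling requests race on it, which is the same guard the real method implements.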

The results of SQL execution are fetched through the /v1/statement/executing/{queryId}/{slug}/{token} endpoint of ExecutingStatementResource. This resource does not interact with the client until the query has actually been submitted.

// ExecutingStatementResource
    @GET
    @Path("{queryId}/{slug}/{token}")
    @Produces(MediaType.APPLICATION_JSON)
    public void getQueryResults(
            @PathParam("queryId") QueryId queryId,
            @PathParam("slug") String slug,
            @PathParam("token") long token,
            @QueryParam("maxWait") Duration maxWait,
            @QueryParam("targetResultSize") DataSize targetResultSize,
            @HeaderParam(X_FORWARDED_PROTO) String proto,
            @Context UriInfo uriInfo,
            @Suspended AsyncResponse asyncResponse)
    {
        Query query = getQuery(queryId, slug, token);
        if (isNullOrEmpty(proto)) {
            proto = uriInfo.getRequestUri().getScheme();
        }

        DataSize targetResultSizeToUse = Optional.ofNullable(targetResultSize).map(size -> Ordering.natural().min(size, MAX_TARGET_RESULT_SIZE))
                .orElse(defaultTargetResultSize);

        asyncQueryResults(query, token, maxWait, targetResultSizeToUse, uriInfo, proto, asyncResponse);
    }

The Query.waitForDispatched method above leads into DispatchManager's createQuery method, which delegates the actual work to the createQueryInternal method on an asynchronous thread. The logic of createQueryInternal covers lexical analysis, syntax analysis, semantic analysis, resource group selection, execution plan generation, and so on (each step is of course implemented in its own class, but the method itself reads almost like procedural code).

/**
     *  Creates and registers a dispatch query with the query tracker.  This method will never fail to register a query with the query
     *  tracker.  If an error occurs while creating a dispatch query, a failed dispatch will be created and registered.
     */
    private <C> void createQueryInternal(QueryId queryId, Slug slug, SessionContext sessionContext, String query, ResourceGroupManager<C> resourceGroupManager)
    {
        Session session = null;
        PreparedQuery preparedQuery = null;
        try {
            if (query.length() > maxQueryLength) {
                int queryLength = query.length();
                query = query.substring(0, maxQueryLength);
                throw new PrestoException(QUERY_TEXT_TOO_LARGE, format("Query text length (%s) exceeds the maximum length (%s)", queryLength, maxQueryLength));
            }

            // decode session
            session = sessionSupplier.createSession(queryId, sessionContext);

            // prepare query
            preparedQuery = queryPreparer.prepareQuery(session, query);
            
            // select resource group
            Optional<String> queryType = getQueryType(preparedQuery.getStatement().getClass()).map(Enum::name);
            SelectionContext<C> selectionContext = resourceGroupManager.selectGroup(new SelectionCriteria(
                    sessionContext.getIdentity().getPrincipal().isPresent(),
                    sessionContext.getIdentity().getUser(),
                    Optional.ofNullable(sessionContext.getSource()),
                    sessionContext.getClientTags(),
                    sessionContext.getResourceEstimates(),
                    queryType));

            // apply system default session properties (does not override user set properties)
            session = sessionPropertyDefaults.newSessionWithDefaultProperties(session, queryType, selectionContext.getResourceGroupId());

            // mark existing transaction as active
            transactionManager.activateTransaction(session, isTransactionControlStatement(preparedQuery.getStatement()), accessControl);

            DispatchQuery dispatchQuery = dispatchQueryFactory.createDispatchQuery(
                    session,
                    query,
                    preparedQuery,
                    slug,
                    selectionContext.getResourceGroupId());

            boolean queryAdded = queryCreated(dispatchQuery);
            if (queryAdded && !dispatchQuery.isDone()) {
                submitQuerySync(dispatchQuery, selectionContext);
            }
        }
        catch (Throwable throwable) {
            // creation must never fail, so register a failed query in this case
            if (session == null) {
                session = Session.builder(new SessionPropertyManager())
                        .setQueryId(queryId)
                        .setIdentity(sessionContext.getIdentity())
                        .setSource(sessionContext.getSource())
                        .build();
            }
            Optional<String> preparedSql = Optional.ofNullable(preparedQuery).flatMap(PreparedQuery::getPrepareSql);
            DispatchQuery failedDispatchQuery = failedDispatchQueryFactory.createFailedDispatchQuery(session, query, preparedSql, Optional.empty(), throwable);
            queryCreated(failedDispatchQuery);
        }
    }

As you can see, createQueryInternal starts by limiting the length of the SQL text. The next step is to construct a Session object from the SessionContext and QueryId obtained earlier; the Session includes the queryId, catalog, schema, system properties, connector properties, client tags, and so on.

public final class Session
{
    private final QueryId queryId;
    private final Optional<TransactionId> transactionId;
    private final boolean clientTransactionSupport;
    private final Identity identity;
    private final Optional<String> source;
    private final Optional<String> catalog;
    private final Optional<String> schema;
    private final SqlPath path;
    private final TimeZoneKey timeZoneKey;
    private final Locale locale;
    private final Optional<String> remoteUserAddress;
    private final Optional<String> userAgent;
    private final Optional<String> clientInfo;
    private final Optional<String> traceToken;
    private final Optional<String> labelInfo;
    private Set<String> clientTags;
    private final Set<String> clientCapabilities;
    private final ResourceEstimates resourceEstimates;
    private final long startTime;
    private Map<String, String> systemProperties;
    private Map<CatalogName, Map<String, String>> connectorProperties;
    private final Map<String, Map<String, String>> unprocessedCatalogProperties;
    private final SessionPropertyManager sessionPropertyManager;
    private final Map<String, String> preparedStatements;
}

The next step is to parse the SQL into an AST. On the surface this is only one line of code, but it contains several key processes, such as lexical and syntax analysis, that generate the AST. The returned PreparedQuery contains a Statement object (the AST produced by parsing). Different types of SQL correspond to different subclasses of Statement, such as Query for SELECT statements and CreateTable for CREATE TABLE. Subsequent semantic analysis visits this AST.

// prepare query
            preparedQuery = queryPreparer.prepareQuery(session, query);

When the SQL comes in, SqlParser's invokeParser method is called to do the parsing; the grammar rule to start from is passed in as the parseFunction argument:

private Node invokeParser(String name, String sql, Function<SqlBaseParser, ParserRuleContext> parseFunction, ParsingOptions parsingOptions)
    {
        try {
            SqlBaseLexer lexer = new SqlBaseLexer(new CaseInsensitiveStream(CharStreams.fromString(sql)));
            CommonTokenStream tokenStream = new CommonTokenStream(lexer);
            SqlBaseParser parser = new SqlBaseParser(tokenStream);

            // Override the default error strategy to not attempt inserting or deleting a token.
            // Otherwise, it messes up error reporting
            parser.setErrorHandler(new DefaultErrorStrategy()
            {
                @Override
                public Token recoverInline(Parser recognizer)
                        throws RecognitionException
                {
                    if (nextTokensContext == null) {
                        throw new InputMismatchException(recognizer);
                    }
                    else {
                        throw new InputMismatchException(recognizer, nextTokensState, nextTokensContext);
                    }
                }
            });

            parser.addParseListener(new PostProcessor(Arrays.asList(parser.getRuleNames()), parser));

            lexer.removeErrorListeners();
            lexer.addErrorListener(LEXER_ERROR_LISTENER);

            parser.removeErrorListeners();

            if (enhancedErrorHandlerEnabled) {
                parser.addErrorListener(PARSER_ERROR_HANDLER);
            }
            else {
                parser.addErrorListener(LEXER_ERROR_LISTENER);
            }

            ParserRuleContext tree;
            try {
                // first, try parsing with potentially faster SLL mode
                parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
                tree = parseFunction.apply(parser);
            }
            catch (ParseCancellationException ex) {
                // if we fail, parse with LL mode
                tokenStream.seek(0); // rewind input stream
                parser.reset();

                parser.getInterpreter().setPredictionMode(PredictionMode.LL);
                tree = parseFunction.apply(parser);
            }

            return new AstBuilder(parsingOptions).visit(tree);
        }
        catch (StackOverflowError e) {
            throw new ParsingException(name + " is too large (stack overflow while parsing)");
        }
    }

Lexical analysis is done with ANTLR, converting the character stream into a stream of tokens:

SqlBaseLexer lexer = new SqlBaseLexer(new CaseInsensitiveStream(CharStreams.fromString(sql)));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
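To get a feel for what the lexer produces, here is a toy tokenizer (hand-written for illustration, not ANTLR) that splits a SQL string into word and symbol tokens. The generated SqlBaseLexer does the same job at a much larger scale, with token types, positions, and case handling:

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer sketch (illustrative, not ANTLR): splits a SQL string into
// word tokens (keywords, identifiers, numbers) and single-character symbols.
final class ToyLexer {
    static List<String> tokenize(String sql) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip whitespace between tokens
            }
            else if (Character.isLetterOrDigit(c) || c == '_') {
                // consume a run of word characters as one token
                int start = i;
                while (i < sql.length() && (Character.isLetterOrDigit(sql.charAt(i)) || sql.charAt(i) == '_')) {
                    i++;
                }
                tokens.add(sql.substring(start, i));
            }
            else {
                // everything else becomes a single-character symbol token
                tokens.add(String.valueOf(c));
                i++;
            }
        }
        return tokens;
    }
}
```

The CommonTokenStream above buffers exactly this kind of token sequence so the parser can look ahead and rewind over it.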

The SqlBaseParser object is constructed from the tokenStream produced by lexical analysis. addParseListener and addErrorListener register event listeners for the parsing process, and then the prediction mode of the ATN interpreter is set (the ATN is a graph data structure in ANTLR that represents the grammar as a state machine). PredictionMode has two values, SLL and LL. Both drive a top-down parser for context-free grammars that processes input from left to right (interested readers can dig into the differences between the two themselves). As the code shows, if parsing in SLL mode fails, the parser immediately switches to LL mode and retries.

try {
    // first, try parsing with potentially faster SLL mode
    parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
    tree = parseFunction.apply(parser);
}
catch (ParseCancellationException ex) {
    // if we fail, parse with LL mode
    tokenStream.seek(0); // rewind input stream
    parser.reset();

    parser.getInterpreter().setPredictionMode(PredictionMode.LL);
    tree = parseFunction.apply(parser);
}
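The fall-back structure above can be captured generically. This is a sketch with illustrative names, not ANTLR API: try the fast but incomplete mode first, and on failure rewind the input and retry with the slower, fully general mode:

```java
import java.util.function.Supplier;

// Generic "fast mode first, general mode on failure" sketch mirroring the
// SLL -> LL fallback: rewind() plays the role of tokenStream.seek(0) + parser.reset().
final class FastThenSlow {
    static <T> T parse(Supplier<T> fastMode, Runnable rewind, Supplier<T> generalMode) {
        try {
            return fastMode.get();
        }
        catch (RuntimeException e) {
            // fast mode gave up; restore the input and retry exhaustively
            rewind.run();
            return generalMode.get();
        }
    }
}
```

The design pays off because SLL succeeds on the vast majority of real queries, so the expensive LL pass only runs for the rare ambiguous inputs.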

ANTLR parsing produces a ParserRuleContext (ANTLR's parse-tree node type). The parsing process follows the lexical and grammar rules defined in the SqlBase.g4 file (learning ANTLR is strongly recommended; it helps greatly when reading the source code of SQL parsers). Using the ANTLR plug-in, we can write a SQL statement and look at the parse tree ANTLR generates for it:

SELECT A FROM TEST_TABLE WHERE A = 'abc' GROUP BY A

After ANTLR finishes parsing, the resulting parse tree is handed to AstBuilder for visiting (the visitor pattern is used heavily in Presto). AstBuilder walks ANTLR's parse tree, recursively visiting each child node from top to bottom, and produces the AST defined in Presto, which is easier to understand and manipulate. A logical sketch is roughly as follows:

AstBuilder extends the SqlBaseBaseVisitor generated by antlr4, overriding the visitXxx methods as follows:


For the details of how AstBuilder traverses each node in visitor mode, the reader is encouraged to debug the source code; a verbal description does not convey it well (the author's limited ability is also part of the reason), and stepping through it yourself leaves a much deeper impression.
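The visitor shape that AstBuilder follows can still be shown with a toy AST (illustrative classes, not Presto's): each node accepts a visitor, the visitor has one visitXxx method per concrete node type, and the traversal recurses into children:

```java
// Toy illustration of the visitor pattern used by AstBuilder (illustrative
// classes, not Presto's): nodes accept a visitor, which dispatches per node type.
interface AstNode {
    <R> R accept(AstVisitor<R> visitor);
}

interface AstVisitor<R> {
    R visitLiteral(Literal node);
    R visitAdd(Add node);
}

final class Literal implements AstNode {
    final long value;
    Literal(long value) { this.value = value; }
    public <R> R accept(AstVisitor<R> visitor) { return visitor.visitLiteral(this); }
}

final class Add implements AstNode {
    final AstNode left;
    final AstNode right;
    Add(AstNode left, AstNode right) { this.left = left; this.right = right; }
    public <R> R accept(AstVisitor<R> visitor) { return visitor.visitAdd(this); }
}

// Evaluates the tree by recursively visiting children, the same top-down
// traversal shape AstBuilder applies to ANTLR's parse tree.
final class Evaluator implements AstVisitor<Long> {
    public Long visitLiteral(Literal node) { return node.value; }
    public Long visitAdd(Add node) { return node.left.accept(this) + node.right.accept(this); }
}
```

Because the return type is generic, the same node classes support many visitors: Presto reuses this shape for semantic analysis, formatting, and rewriting without touching the AST classes themselves.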

At this point, the process from SQL to AST is complete. Semantic analysis, execution plan generation, scheduling, and the other stages will be written up when time permits. There may be mistakes above; criticism and corrections are welcome.

Tags: presto

Posted on Wed, 22 Sep 2021 13:59:48 -0400 by brad_fears