The Spark SQL predicate pushdown optimizer batch, PushDownPredicates, contains three rules:


PushPredicateThroughNonJoin is the logical-plan optimizer rule in Spark SQL that pushes predicates down in non-join cases.
The premise for pushing a predicate down is that it must not affect the query result: the SQL must return the same result before and after the pushdown.
The PushPredicateThroughNonJoin optimizer handles six kinds of cases that can be pushed down.

Handle the case where the child of the Filter node is a Project node

Why must the fields in the Project be deterministic for the predicate to be pushed down?
Because if a field in the Project is non-deterministic, the query result differs before and after the pushdown.
For example, SQL that uses the monotonically_increasing_id() function (which generates a non-deterministic, monotonically increasing 64-bit integer id):

select a,b,id from ( select a,b,monotonically_increasing_id() as id from testdata2 where a>2 )tmp where b<1

Pushing the predicate down would make it equivalent to:

select a,b,id from ( select a,b,monotonically_increasing_id() as id from testdata2 where a>2 and b<1 )tmp

Comparing the two, the ids assigned to the surviving rows differ between filtering on a>2 alone and filtering on (a>2 and b<1).
The same applies to other non-deterministic functions such as rand(): its effect necessarily differs between filtering on a>2 and filtering on (a>2 and b<1).
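The id drift described above can be demonstrated with a pure-Python sketch (no Spark involved; the rows in `testdata2` and the `assign_ids` helper are made-up stand-ins for the real table and for monotonically_increasing_id()):

```python
# testdata2 rows are (a, b) pairs -- made-up sample data.
testdata2 = [(1, 0), (3, 1), (4, 0), (5, 3)]

def assign_ids(rows):
    """Stand-in for monotonically_increasing_id(): the id a row gets
    depends on which rows reach this operator, and in what order."""
    return [(a, b, i) for i, (a, b) in enumerate(rows)]

# Original plan: filter a > 2, assign ids, then filter b < 1.
original = [r for r in assign_ids([r for r in testdata2 if r[0] > 2]) if r[1] < 1]

# "Pushed down" plan: filter a > 2 and b < 1 first, then assign ids.
pushed_down = assign_ids([r for r in testdata2 if r[0] > 2 and r[1] < 1])

# The surviving row (4, 0) gets id 1 in the original plan but id 0 after
# pushdown, so pushing the filter below the id assignment changes the result.
print(original)     # [(4, 0, 1)]
print(pushed_down)  # [(4, 0, 0)]
```

The same row survives in both plans, but with a different id, which is exactly why the optimizer refuses to push through non-deterministic Project fields.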
Handle the case where the child of the Filter node is an Aggregate node

select a,b from ( select A,B,count(1) as c from testdata2 where a>2 group by a,b )tmp where c=1 and b<5

The c field comes from the subquery's count(1) aggregate, so c=1 cannot be pushed below the aggregation, but b<5 can be, because b is a grouping key.
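The split can be sketched in plain Python (no Spark; `testdata2` is made-up sample data): b < 5 touches only the grouping keys (a, b), so it runs before the group-by, while c = 1 references the aggregate and must stay above it.

```python
from collections import Counter

# Made-up (a, b) rows standing in for testdata2.
testdata2 = [(3, 1), (3, 1), (4, 7), (5, 2)]

# Pushed-down plan: apply a > 2 and the pushed b < 5 before grouping,
# then group by (a, b), count, and only afterwards keep groups with c = 1.
counts = Counter((a, b) for a, b in testdata2 if a > 2 and b < 5)
result = [(a, b, c) for (a, b), c in counts.items() if c == 1]

print(result)  # [(5, 2, 1)]
```

Filtering on a grouping key before the aggregation removes whole groups, never partial ones, which is why the result is unchanged by the pushdown.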
Handle the case where the child of the Filter node is a Window node
This is similar to the Aggregate handling. A predicate can be pushed down when:
- The predicate's expressions are all partition keys of the window
- The predicate is deterministic

select a,b,rn from ( select A,B,row_number() over(partition by a order by b desc ) as rn from testdata2 )tmp where a>1 and b<5

a>1 is pushed down below the window function because a is the partition key; b is not among the partition-by fields, so b<5 is not pushed down.
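The window case can also be sketched in plain Python (no Spark; `testdata2` is made-up sample data): a > 1 drops whole partitions, so it is safe to run before row_number(), while b < 5 can only run after the row numbers exist.

```python
from itertools import groupby

# Made-up (a, b) rows standing in for testdata2.
testdata2 = [(1, 9), (2, 3), (2, 8), (3, 6)]

# Pushed-down plan: filter a > 1 first, then compute
# row_number() over (partition by a order by b desc).
filtered = sorted((r for r in testdata2 if r[0] > 1), key=lambda r: (r[0], -r[1]))
with_rn = []
for _, part in groupby(filtered, key=lambda r: r[0]):      # partition by a
    for rn, (a, b) in enumerate(part, start=1):            # row_number() starts at 1
        with_rn.append((a, b, rn))

# b < 5 is not pushed: it runs above the window operator.
result = [row for row in with_rn if row[1] < 5]

print(result)  # [(2, 3, 2)]
```

Because a is the partition key, removing a=1 rows early cannot change any row number inside the surviving partitions; filtering on b early would.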
Handle the case where the child of the Filter node is a Union node

select tmpc from ( select A as tmpc from testdata2 union all select b as tmpc from testdata2 ) tmp where tmpc>1 and rand()>0.1

The deterministic predicate tmpc>1 is pushed down into each branch of the union, while the non-deterministic rand()>0.1 is not.
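A minimal sketch of that split, with (sql, deterministic) pairs as a made-up stand-in for Spark's Expression objects: the conditions above the Union are partitioned by determinism, and only the deterministic part is copied into every branch.

```python
# Made-up representation of the filter's conjuncts: (sql_text, is_deterministic).
condition = [("tmpc > 1", True), ("rand() > 0.1", False)]

# Deterministic conjuncts are pushed into each union branch;
# non-deterministic ones stay in a Filter above the Union.
push_down = [sql for sql, det in condition if det]
stay_up = [sql for sql, det in condition if not det]

print(push_down)  # ['tmpc > 1']
print(stay_up)    # ['rand() > 0.1']
```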
Handle the case where other nodes appear under the Filter node
For the remaining node types, a single check lists the operators that predicates can be pushed through:
```scala
def canPushThrough(p: UnaryNode): Boolean = p match {
  // Note that some operators (e.g. project, aggregate, union) are being handled separately
  // (earlier in this rule).
  case _: AppendColumns => true
  case _: Distinct => true
  case _: Generate => true
  case _: Pivot => true
  case _: RepartitionByExpression => true
  case _: Repartition => true
  case _: ScriptTransformation => true
  case _: Sort => true
  case _: BatchEvalPython => true
  case _: ArrowEvalPython => true
  case _ => false
}
```
Unified processing logic: the Filter's conjunctive conditions are split by determinism, and only the leading run of deterministic predicates is pushed below the node; everything from the first non-deterministic predicate onward stays above it.
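A rough Python sketch of that split (hedged: this models the order-preserving span over the conjuncts as I understand it, not actual Spark code; the (sql, deterministic) pairs are made up):

```python
# Made-up conjuncts: (sql_text, is_deterministic).
conjuncts = [("a > 2", True), ("rand() > 0.1", False), ("b < 5", True)]

def span(pred, xs):
    """Longest prefix whose elements satisfy pred, plus the rest
    (like Scala's span on collections)."""
    for i, x in enumerate(xs):
        if not pred(x):
            return xs[:i], xs[i:]
    return xs, []

push_down, stay_up = span(lambda c: c[1], conjuncts)

print([sql for sql, _ in push_down])  # ['a > 2']
print([sql for sql, _ in stay_up])    # ['rand() > 0.1', 'b < 5']
```

Note that b < 5 is deterministic but still stays above: an order-preserving span, unlike a plain partition, never reorders a predicate around a non-deterministic one, so how often rand() is evaluated does not change.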

Summary
- In non-join cases, PushPredicateThroughNonJoin can optimize when the child of the Filter node is a Project, Aggregate, Window, Union, EventTimeWatermark (streaming), AppendColumns, Distinct, Generate, Pivot, RepartitionByExpression, Repartition, ScriptTransformation, Sort, BatchEvalPython, or ArrowEvalPython node.
- Whether a field or expression is deterministic is a crucial condition. During optimization, the conditions in the Filter are generally split by determinism into a pushable part and a non-pushable part, which are then handled separately.