Handling the "partition.assignment.strategy" exception when connecting Spark Streaming to Kafka

Server environment: Spark 2.4.4 + Scala 2.11.12 + Kafka 2.2.2

Because the business was fairly simple and Kafka only had fixed topics, the following script had always been used for the real-time streaming computation:

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4 --py-files /data/service/xxx.zip /data/service/xxx.py

In the code, KafkaUtils from pyspark.streaming.kafka is used to connect Spark Streaming to Kafka. It had been running for a long time without any problems.

With new services coming on board, Kafka needs to handle dynamic topics in the new functionality, which requires subscribing by regular expression. After checking the KafkaUtils source code and related material, it turns out that KafkaUtils (based on spark-streaming-kafka-0-8) does not support dynamic topics; spark-streaming-kafka-0-10 is needed for that.
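To illustrate the gap (with made-up topic names): the 0-8 KafkaUtils API only accepts an explicit topic list, so matching a pattern against whatever topics exist would have to be done by hand, up front, before creating the stream. A pure-Python sketch of that workaround:

```python
import re

# Hypothetical topics that exist on the broker at submit time
broker_topics = ["event.log.click", "event.log.order", "metrics.cpu"]

# What we actually want: subscribe to every topic matching a pattern,
# including topics created AFTER the job starts -- which a static list
# computed once at startup can never cover.
wanted = re.compile(r"event\.log\..*")
static_list = [t for t in broker_topics if wanted.fullmatch(t)]
print(static_list)  # ['event.log.click', 'event.log.order']
```

This is exactly why pattern subscription has to be supported by the Kafka integration itself rather than emulated in application code.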

After reading the documentation at http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html and http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html, we implemented it with Structured Streaming.

Implementation code:

import sys

from pyspark.sql import SparkSession

def process_row(row):
    # Write row to storage
    pass

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("""
        Usage: structured_kafka_wordcount.py <bootstrap-servers> <subscribe-type> <topics>
        """, file=sys.stderr)
        sys.exit(-1)

    bootstrapServers = sys.argv[1]
    subscribeType = sys.argv[2]
    topics = sys.argv[3]

    spark = SparkSession\
        .builder\
        .appName("StructuredKafkaWordCount")\
        .getOrCreate()

    # Create DataFrame representing the stream of input lines from kafka
    ds = spark\
        .readStream\
        .format("kafka")\
        .option("kafka.bootstrap.servers", bootstrapServers)\
        .option(subscribeType, topics)\
        .load()\
        .selectExpr("CAST(value AS STRING)")

    query = ds.writeStream.foreach(process_row).start()

    query.awaitTermination()

Submit the task with the following command:

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 /data/service/demo.py master:9092 subscribePattern event.log.*
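One subtlety worth noting about the command above: subscribePattern is interpreted as a regular expression, so the unescaped dots in event.log.* are wildcards and the pattern matches more topic names than intended. Python's re module illustrates this (Spark uses Java regexes, but the behaviour is the same here):

```python
import re

pattern = re.compile("event.log.*")  # the pattern from the command line
assert pattern.fullmatch("event.log.click")  # intended match
assert pattern.fullmatch("eventXlogother")   # unintended: '.' matches any character

# Escaping the dots restricts it to the intended topic family
strict = re.compile(r"event\.log\..*")
assert strict.fullmatch("event.log.click")
assert strict.fullmatch("eventXlogother") is None
```

Harmless here if no other topic names can collide with the pattern, but worth escaping on a busier cluster.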

The following error was reported as soon as the job was submitted:

Missing required configuration "partition.assignment.strategy" which has no default value

After a lot of searching, the advice found was to add a parameter configuring the Kafka partition assignment strategy, modifying the readStream call to:

    ds = spark\
        .readStream\
        .format("kafka")\
        .option("kafka.bootstrap.servers", bootstrapServers)\
        .option("kafka.partition.assignment.strategy", "range")\
        .option(subscribeType, topics)\
        .load()\
        .selectExpr("CAST(value AS STRING)")

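As an aside: even where this option is honoured, the 0.10+ Kafka consumer expects partition.assignment.strategy to be a fully qualified assignor class name, not a short name like "range" -- something along the lines of the fragment below (it would not have fixed the underlying problem here, which turned out to be a jar conflict):

```
        .option("kafka.partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RangeAssignor")
```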
Running it again, the exception changed to being unable to connect to Kafka. A whole day had gone by on this and it was maddening, and it still wasn't solved.

Finally this was found: https://xbuba.com/questions/44959483, where an expert suggested it might be caused by a conflict between the Kafka 0.8 jar and the Kafka 0.10 jar.

Searching with the following commands:

find / -name 'spark-streaming-kafka*'
find / -name 'spark-sql-kafka*'

It turned out that both the spark-streaming-kafka-0-8_2.11 and spark-sql-kafka-0-10_2.11 folders, with their related jar files, existed under /root/.ivy2/cache/org.apache.spark/.

After removing spark-streaming-kafka-0-8_2.11, the code ran normally.

Since the old script still uses Kafka 0.8, in order to keep both versions running side by side, the kafka-0-8 and kafka-0-10 jars under /root/.ivy2/cache/org.apache.spark/ need to be cleared out.

Then download the jars for spark-streaming-kafka-0-8 and spark-sql-kafka-0-10 from https://repo1.maven.org/maven2/org/apache/spark/, and change the spark-submit commands to pass them explicitly with --jars:

spark-submit --jars /data/service/spark-streaming-kafka-0-8-assembly_2.11-2.4.4.jar --py-files /data/service/xxx.zip /data/service/xxx.py
spark-submit --jars /data/service/spark-sql-kafka-0-10_2.11-2.4.4.jar /data/service/demo.py master:9092 subscribePattern event.log.*
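In the spirit of the find commands above, a small (hypothetical) pure-Python helper that flags when both integration jars are present in a cache directory at once -- demonstrated here on a throwaway temporary directory standing in for ~/.ivy2/cache:

```python
import os
import tempfile

def find_kafka_jars(root):
    """Return paths of spark-streaming-kafka / spark-sql-kafka jars under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(("spark-streaming-kafka", "spark-sql-kafka")) \
                    and name.endswith(".jar"):
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)

# Demonstration: a fake cache containing both jars, i.e. the conflicting setup
with tempfile.TemporaryDirectory() as cache:
    for jar in ("spark-streaming-kafka-0-8-assembly_2.11-2.4.4.jar",
                "spark-sql-kafka-0-10_2.11-2.4.4.jar"):
        open(os.path.join(cache, jar), "w").close()
    jars = find_kafka_jars(cache)
    print(len(jars))  # 2 -> both present, which is exactly the conflict
```

Running the same scan against the real ivy cache after the cleanup should report at most one of the two.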

Both scripts ran without problems after the modification. (PS: the old script was originally going to be launched with the org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 package directly, but it errored out immediately on execution, showing that it had to stay on org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.4.)

Tags: Python Spark kafka Apache SQL

Posted on Tue, 17 Mar 2020 23:48:00 -0400 by echo64