0485 - How to specify the Python running environment of PySpark in code


1 Purpose of this document

Fayson's previous article, "0483 - How to specify the Python running environment of PySpark", described how to specify the Python runtime environment when submitting a job with the spark2-submit command. Some users also need to specify the Python runtime environment directly in the PySpark code. In this article, Fayson introduces how to do that.

  • Testing environment:

1. CM and CDH version: 5.15.0
2. Python versions: 2.7.5 and 3.6

2 Prepare the Python environments

Here, Fayson prepares two environments, Python 2 and Python 3, as follows:

1. Download the Python 2 and Python 3 installation packages from the Anaconda official website (the installation process is not covered here):

Anaconda3-5.2.0-Linux-x86_64.sh and Anaconda2-5.3.1-Linux-x86_64.sh

2. Package the two environments. Enter the Python 2 and Python 3 installation directories and use the zip command to package each environment separately:

[root@cdh05 anaconda2]# cd /opt/cloudera/anaconda2
[root@cdh05 anaconda2]# zip -r /data/disk1/anaconda2.zip ./*

[root@cdh05 anaconda3]# cd /opt/cloudera/anaconda3
[root@cdh05 anaconda3]# zip -r /data/disk1/anaconda3.zip ./*

Note: compress from inside the Python installation directory, so that the archive entries do not include the parent directory.
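The same relative-path layout can also be produced, and verified, from Python with the standard `zipfile` module. This is a minimal sketch of the `cd <env_dir> && zip -r <archive> ./*` step above; the helper names are the author's own:

```python
import os
import zipfile

def zip_env(env_dir, archive_path):
    """Package env_dir so that entries are relative (no parent directory
    prefix), matching `cd env_dir && zip -r archive ./*`."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(env_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to env_dir, e.g. "bin/python3.6"
                zf.write(full, os.path.relpath(full, env_dir))

def archive_is_relative(archive_path):
    """True if no entry starts with an absolute or parent-directory path."""
    with zipfile.ZipFile(archive_path) as zf:
        return all(not n.startswith(("/", "..")) for n in zf.namelist())
```

For example, `zip_env("/opt/cloudera/anaconda3", "/data/disk1/anaconda3.zip")` would produce an archive whose top-level entries are `bin/`, `lib/`, and so on, which is exactly the layout the relative interpreter path `python/bin/python3.6` later relies on.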

3. Upload the packaged Python 2 and Python 3 archives to HDFS:

[root@cdh05 disk1]# hadoop fs -put anaconda2.zip /tmp
[root@cdh05 disk1]# hadoop fs -put anaconda3.zip /tmp
[root@cdh05 disk1]# hadoop fs -ls /tmp/anaconda*

After completing the above steps, the PySpark runtime environments are ready; the next step is to specify the environment in the code when submitting the job.

3 Prepare the PySpark sample job

Here, a simple PySpark job that estimates Pi is used as the example. The code differs slightly from the previous article: it adds the configuration that specifies the Python runtime environment. The example code is as follows:

from __future__ import print_function
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .config("spark.pyspark.python", "python/bin/python3.6") \
    .config("spark.pyspark.driver.python", "python3.6") \
    .config("spark.yarn.dist.archives", "hdfs://nameservice1/tmp/anaconda3.zip#python") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.memory", "4g") \
    .getOrCreate()
# To use the local Anaconda2 environment for the driver instead:
# .config("spark.pyspark.driver.python", "/opt/cloudera/anaconda2/bin/python2.7")

partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 < 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
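The `#python` suffix on the `spark.yarn.dist.archives` value is worth a note: YARN extracts the archive into each container's working directory under a link named by the fragment after `#`, which is why `spark.pyspark.python` can use the relative path `python/bin/python3.6`. The following helper (the author's own sketch, not Spark's actual parsing code) illustrates how the link name is derived from the configured value:

```python
def archive_alias(uri):
    """Return (archive_uri, link_name) for a spark.yarn.dist.archives entry.
    With no '#fragment', the link name defaults to the archive file name."""
    if "#" in uri:
        path, alias = uri.rsplit("#", 1)
        return path, alias
    return uri, uri.rstrip("/").rsplit("/", 1)[-1]
```

So `archive_alias("hdfs://nameservice1/tmp/anaconda3.zip#python")` yields the link name `python`, while omitting the fragment would localize the archive under `anaconda3.zip`, and the relative interpreter path in the configuration above would have to change accordingly.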


4 Run the example

Before running, first export the environment variables that load Spark and PySpark; otherwise executing the Python code fails with a module-not-found error for "SparkSession" (pyspark cannot be imported). The node running the code must also have the Spark2 Gateway client configuration.

1. Execute the following commands to load the Spark and Python environment variables:

export SPARK_HOME=/opt/cloudera/parcels/SPARK2/lib/spark2
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$SPARK_HOME/python/lib/pyspark.zip:$PYTHONPATH

2. Run the pi_test.py code from the command line with the python command:

[root@cdh05 ~]# python pi_test.py 

Job submitted successfully

3. Job executed successfully

4. View the Python environment of the job

5 Summary

When executing PySpark code with the python command, ensure that the Spark environment variables are set on the node where the code runs.

Before running the code, set the SPARK_HOME and PYTHONPATH environment variables so that the Python libraries shipped with Spark are on the Python path.
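A small pre-flight check along these lines makes that failure mode explicit before the job is launched. This is a sketch with a made-up helper name; the variable names match the exports above:

```python
def missing_spark_env(env):
    """Return the names of required Spark variables absent or empty in env
    (pass os.environ in practice)."""
    required = ("SPARK_HOME", "PYTHONPATH")
    return [name for name in required if not env.get(name)]
```

Calling `missing_spark_env(os.environ)` before `from pyspark.sql import SparkSession` lets a script fail with a clear message listing the missing variables instead of an import error.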

After the Python 2 and Python 3 runtime environments are packaged and placed on HDFS, job startup is slower than before, because the Python environment must first be fetched from HDFS and distributed to the containers.

Tags: Python Spark Hadoop github

Posted on Sun, 12 Jan 2020 02:51:33 -0500 by elmas156