Configuring Spark in Kylin and Building a Cube

HDP version: 2.6.4.0

Kylin version: 2.5.1

Machines: three CentOS 7 nodes, 8 GB memory

In addition to MapReduce, Kylin supports Spark as a faster build engine. This article measures how fast Spark builds a Cube, using Kylin's own example, kylin_sales_cube.

1. Configure Kylin's Spark parameters

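Before running Spark cubing, review these settings and adjust them to suit your cluster. The following is the recommended configuration. It enables the external shuffle service, which Spark's dynamic resource allocation requires; dynamic allocation itself would additionally need spark.dynamicAllocation.enabled=true, which this listing does not set: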

## Spark conf (default is in spark/conf/spark-defaults.conf)
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
kylin.engine.spark-conf.spark.yarn.queue=default
kylin.engine.spark-conf.spark.driver.memory=2G
kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.executor.instances=40
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.eventLog.enabled=true
kylin.engine.spark-conf.spark.eventLog.dir=hdfs\:///kylin/spark-history
kylin.engine.spark-conf.spark.history.fs.logDirectory=hdfs\:///kylin/spark-history
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
#
#### Spark conf for specific job
#kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
#kylin.engine.spark-conf-mergedict.spark.memory.fraction=0.2
#
## manually upload spark-assembly jar to HDFS and then set this property will avoid repeatedly uploading jar
## at runtime
kylin.engine.spark-conf.spark.yarn.archive=hdfs://node71.data:8020/kylin/spark/spark-libs.jar
kylin.engine.spark-conf.spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
#
## If it is an HDP version, uncomment the following three lines of configuration
kylin.engine.spark-conf.spark.driver.extraJavaOptions=-Dhdp.version=current
kylin.engine.spark-conf.spark.yarn.am.extraJavaOptions=-Dhdp.version=current
kylin.engine.spark-conf.spark.executor.extraJavaOptions=-Dhdp.version=current
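Kylin hands every property under the kylin.engine.spark-conf. prefix to spark-submit with the prefix removed. As a quick sanity check before a build, the sketch below lists the Spark settings that a properties file effectively defines. The function name spark_overrides and the sample file are illustrative and not part of Kylin; in practice you would point it at your own kylin.properties.

```shell
#!/bin/sh
# Hypothetical helper (not part of Kylin): print the Spark settings that a
# kylin.properties-style file defines, with the Kylin prefix stripped.
spark_overrides() {
    # $1: path to a properties file
    # The pattern is anchored at line start, so commented-out lines ('#...')
    # and job-specific prefixes such as spark-conf-mergedict are skipped.
    grep '^kylin\.engine\.spark-conf\.' "$1" \
        | sed 's/^kylin\.engine\.spark-conf\.//'
}

# Demo on a small sample file; use your real kylin.properties instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.executor.memory=4G
#kylin.engine.spark-conf.spark.hadoop.yarn.timeline-service.enabled=false
kylin.engine.spark-conf-mergedict.spark.executor.memory=6G
EOF
spark_overrides "$sample"
rm -f "$sample"
```

For the sample file, only the spark.master and spark.executor.memory lines are printed, because the other two lines are either commented out or belong to a job-specific prefix.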

The kylin.engine.spark-conf.spark.yarn.archive setting specifies the jar package of Spark libraries that Kylin's Spark jobs run with. You need to generate it yourself and upload it to HDFS. Because the Kylin service runs as the kylin user, switch to that user first. The commands are as follows:

su - kylin
cd /usr/hdp/2.6.4.0-91/kylin
# Generate spark-libs.jar file
jar cv0f spark-libs.jar -C $KYLIN_HOME/spark/jars/ ./
# Upload to a specified directory on HDFS
hadoop fs -mkdir -p /kylin/spark/
hadoop fs -put spark-libs.jar /kylin/spark/

2. Modify the configuration of Cube

After configuring Kylin's Spark parameters, change the Cube's build engine to Spark as follows:

First run the sample Cube script that ships with Kylin: sh $/bin/sample.sh. It loads two Cubes, which then appear on the Kylin Web page.

Next, open the Kylin Web UI and click Model -> Action -> Edit:

Click Step 5: Advanced Setting, scroll down the page, and change the Cube Engine type from MapReduce to Spark. Then save the configuration changes, as shown in the following figure:

Click Next to enter the Configuration Overwrites page, click + Property to add the attribute kylin.engine.spark.rdd-partition-cut-mb with a value of 500 (for the following reasons):

The sample Cube has two memory-hungry measures: COUNT DISTINCT and TOPN(100). When the source data is small, their size estimates are inaccurate: the estimated size is much larger than the actual size, so the data is sliced into more RDD partitions than necessary, which slows down the build. 500 is a reasonable value in this case. Click Next and Save to save the Cube.

For Cubes without COUNT DISTINCT and TOPN, keep the default configuration.
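To see why a larger cut size leads to fewer partitions, here is a rough model. It assumes the partition count is approximately the estimated data size divided by kylin.engine.spark.rdd-partition-cut-mb, with a floor of one partition. The function name estimate_partitions and the 2000 MB estimate are made up for illustration, and Kylin's actual calculation may apply further bounds.

```shell
#!/bin/sh
# Simplified model (an assumption, not Kylin's exact code):
# partitions ~= estimated_size_mb / cut_mb, never fewer than 1.
estimate_partitions() {
    # $1: estimated size in MB, $2: rdd-partition-cut-mb
    n=$(( $1 / $2 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Hypothetical over-estimated size of 2000 MB for a small data set.
estimate_partitions 2000 10    # small cut size: many tiny partitions
estimate_partitions 2000 500   # cut size used in this article: few partitions
```

With these made-up numbers the model gives 200 partitions for a cut size of 10 MB and 4 partitions for 500 MB, which is the effect the override is after when the size estimate is inflated.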

3. Build the Cube

After saving the modified Cube configuration, click Action -> Build and select the start time of the build (make sure the selected time range contains data, otherwise building the Cube is meaningless), then start the build.

During the build, you can open the Yarn ResourceManager UI to view the status of the task. When the build reaches step 7, you can open Spark's UI page, which shows the progress of each stage as well as detailed information.

Kylin uses its own bundled Spark, so we need to start a Spark History Server for it separately.

$/spark/sbin/start-history-server.sh hdfs://<namenode_host>:8020/kylin/spark-history

Visit http://ip:18080/ to see the job details of Spark building the Cube, which can be very helpful for troubleshooting and performance tuning.

4. FAQ

While building the Cube with Spark, I ran into two errors and resolved both. They are recorded here for reference.

1. Spark on Yarn Configuration Adjustment

Error Content:

Exception in thread "main" java.lang.IllegalArgumentException: Required executor memory (4096+1024 MB) is above the max threshold (4096 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.

Problem Analysis:

According to the error log, the memory required per executor (4096 + 1024 MB) is higher than the maximum threshold of this cluster. You can either lower the memory requested by the Spark executors or raise the related Yarn limits.

The settings that determine the executor memory requested by the Spark task (4096 + 1024 MB) are:

kylin.engine.spark-conf.spark.executor.memory=4G
kylin.engine.spark-conf.spark.yarn.executor.memoryOverhead=1024

Yarn related configuration:

yarn.nodemanager.resource.memory-mb: The NodeManager is Yarn's agent on a single node; it interacts with the application's ApplicationMaster and with the cluster's ResourceManager. This property is the total amount of physical memory on this node that Yarn can use.
yarn.scheduler.maximum-allocation-mb: The maximum amount of physical memory a single task (container) can request. This value cannot be larger than the value of yarn.nodemanager.resource.memory-mb.

Solution:

One option is to adjust the Yarn configuration by raising yarn.scheduler.maximum-allocation-mb. Because it is bounded by yarn.nodemanager.resource.memory-mb, set both to a value that covers the required executor memory (4096 + 1024 = 5120 MB), for example 5888 MB.
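The check behind this error comes down to simple arithmetic: executor memory plus memory overhead must not exceed yarn.scheduler.maximum-allocation-mb. The sketch below replays it with the values from this article. The function name fits_in_yarn is hypothetical, and the messages are not Spark's own.

```shell
#!/bin/sh
# Hypothetical helper: does one executor container fit into Yarn's limit?
fits_in_yarn() {
    # $1: spark.executor.memory in MB
    # $2: spark.yarn.executor.memoryOverhead in MB
    # $3: yarn.scheduler.maximum-allocation-mb
    required=$(( $1 + $2 ))
    if [ "$required" -le "$3" ]; then
        echo "OK: need ${required} MB, limit is $3 MB"
    else
        echo "TOO LARGE: need ${required} MB, limit is $3 MB"
        return 1
    fi
}

fits_in_yarn 4096 1024 4096 || true   # the failing setup from the error above
fits_in_yarn 4096 1024 5888           # after raising the Yarn limits
```

Running the same check with a smaller executor memory instead shows the other way out: lowering kylin.engine.spark-conf.spark.executor.memory until the sum fits under the existing limit.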

2. Error in step 8 of building the Cube: Convert Cuboid Data to HFile

Error Content:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.hbase.io.hfile.HFile

Problem Analysis:

The spark-libs.jar file specified by the kylin.engine.spark-conf.spark.yarn.archive parameter is missing the HBase-related class files.

Solution:

Because many HBase-related class files were missing, the solution given on Kylin's website still failed with class-not-found errors, so I added all HBase-related jar packages to spark-libs.jar. If you have already generated spark-libs.jar and uploaded it to HDFS, you need to repackage it and upload it again. The steps are as follows:

su - kylin
cd /usr/hdp/2.6.4.0-91/kylin
cp -r /usr/hdp/2.6.4.0-91/hbase/lib/hbase* /usr/hdp/2.6.4.0-91/kylin/spark/jars/
rm -rf spark-libs.jar
jar cv0f spark-libs.jar -C spark/jars/ ./
hadoop fs -rm -r /kylin/spark/spark-libs.jar
hadoop fs -put spark-libs.jar /kylin/spark/
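Before uploading, it can be worth confirming that the repackaged archive really contains the HBase jars. The sketch below counts them in an archive listing read from standard input. The function name count_hbase_jars is hypothetical, and the jar names in the demo are placeholders rather than the actual contents of an HDP installation.

```shell
#!/bin/sh
# Hypothetical helper: read an archive listing (one entry per line, as
# printed by `jar tf spark-libs.jar`) on stdin and count the HBase jars.
count_hbase_jars() {
    # grep -c prints the count; '|| true' keeps a count of 0 from
    # being reported as a failure.
    grep -c 'hbase-[^/]*\.jar$' || true
}

# Demo with a made-up listing; the jar names below are placeholders.
printf '%s\n' 'spark-core_2.11-2.1.2.jar' 'hbase-common-1.1.2.jar' \
              'hbase-server-1.1.2.jar' | count_hbase_jars

# Against the real archive (needs the JDK's jar tool on the PATH):
#   jar tf spark-libs.jar | count_hbase_jars
```

A count of 0 against the real archive would mean the copy step did not pick up any HBase jars, and the HFile step is likely to fail again.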

Then return to the Kylin Web page and resume the Cube build.

5. Comparison between Spark and MapReduce

Building Cube with Spark takes about seven minutes, as shown in the following figure:

Building Cube with MapReduce takes about 15 minutes, as shown in the following figure:

Building Cubes with Spark is clearly a lot faster, taking roughly half the time!

6. Summary

This article mainly introduces:

How to configure Kylin's Spark parameters
How to change a Cube's build engine
How to generate the spark-libs.jar package and upload it to HDFS
FAQ for building a Cube with Spark
A comparison of how fast Spark and MapReduce build a Cube
