Flink SQL on Zeppelin -- environment preparation

Environment preparation


  • Why SQL
    • There are currently many ways to develop Flink jobs. Typically, developers write a Java, Scala, or Python project and submit it to the cluster to run
      • This approach is flexible, because everything is done in code: any dimension-table JOIN and any parameter tuning is easy to implement
      • However, it demands a lot from developers and carries a real learning cost. Some developers are strong in Java and others in Python, but a project usually cannot let multiple languages coexist. In general Java is chosen as the development language, which means a developer who is strong in Python has to climb the Java mountain from scratch and become proficient in a short time. That is hard.
      • So the best choice is a language with a low learning cost, one that most developers have already learned and used, or can pick up easily. That language is SQL
  • I already have a Flink SQL blog series, so why start a new one?
    • First, the code in that series is written and run in IDEA and mixes Java with SQL, which leaves some readers wondering whether it is a Flink Java tutorial or a Flink SQL tutorial
    • Second, although my company has its own Flink SQL platform, for various reasons I can't demonstrate it to you
    • The community is also promoting pure-SQL tooling, such as the SQL Client command-line tool that ships with Flink. It supports most features, including statements that are not yet supported from code, such as CREATE VIEW. But its functionality is limited and it does not support submitting code over REST, and we can't expect everyone to install a Flink client on their own computer, right? It has many other shortcomings that I won't list one by one. In my opinion, the SQL Client is currently just a big toy that you will abandon once your usage matures
    • Ververica has also released a SQL client: Flink SQL Gateway + Flink JDBC Driver, which can likewise be used to build a pure-SQL development platform. The drawbacks are just as obvious. First, there is no visual interface; you use it from the command line or wrap it yourself. Second, the community is small and not very active, and many people don't even know it exists.
    • So, is there a tool with a graphical interface, complete functionality, and an active community?
  • Apache Zeppelin!
    • Zeppelin, a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.
    • The introduction above comes from the official website. In short: Zeppelin is a web-based notebook for interactive data analysis that makes it easy to build data-driven, interactive, collaborative documents in multiple languages, including Scala and SQL. For more information, visit the official website
    • One note: I won't list Zeppelin's advantages up front. Follow along and experience them gradually as we go~~

Installation & Configuration

  • To use Flink in Zeppelin, you need Zeppelin 0.9.0 and Flink 1.10. As of June 8, 2020, Zeppelin 0.9.0 has not been officially released, but Chien Feng has compiled a build for you. Link: https://pan.baidu.com/s/1P93evudRiUzh6y-6X5lNFg Extraction code: n1rd
  • You can also pull the source code from GitHub and build it yourself
  • I use Flink 1.10 with Scala 2.11, already downloaded and installed, and I assume you have it set up as well. If you have only run Flink inside IDEA so far, as I did, download Flink and unzip it first. If you run into any problem, please contact me
  • Now, follow me and configure Zeppelin
    	#1.1 extract the archive
    	tar -zxvf zeppelin-0.9.0-SNAPSHOT.tar.gz
    	#1.2 enter the conf directory
    	cd zeppelin-0.9.0-SNAPSHOT/conf
    	#1.3 rename the configuration template, otherwise Zeppelin will not load it
    	mv zeppelin-env.sh.template zeppelin-env.sh
    	#1.4.1 edit the configuration file
    	vim zeppelin-env.sh
    	#1.4.2 in the editor, add the following two lines
    	export JAVA_HOME=/path/to/your/jdk   # placeholder: replace with your own JDK directory, do not copy as-is
    	export ZEPPELIN_ADDR=YOUR_BIND_IP    # placeholder: the IP to bind. If Zeppelin is not installed on your local machine, do not use a loopback address here, otherwise other machines cannot reach it via IP and port
    	#1.4.3 save and exit
    	#2.1 I plan to run Flink on Yarn and connect it to Hive, so next I go to Flink's directory and add a few jars. If you don't plan to run on Yarn, skip straight to step 3.1
    	cd ~/flink/lib
    	#2.2 download the jars needed for Flink on Yarn. The jar versions should match your Flink and Hadoop versions. My Hadoop version is 2.7.1
    	wget https://repo1.maven.org/maven2/org/apache/flink/flink-hadoop-compatibility_2.11/1.10.0/flink-hadoop-compatibility_2.11-1.10.0.jar
    	wget https://repo1.maven.org/maven2/org/apache/flink/flink-shaded-hadoop-2-uber/2.7.5-9.0/flink-shaded-hadoop-2-uber-2.7.5-9.0.jar
    	#2.3 download the jars Flink needs to connect to Hive. My Hive version is 2.1.1. Your Hive version may differ, so refer to the official documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/table/hive/#dependencies
    	wget https://repo1.maven.org/maven2/org/apache/hive/hive-exec/2.1.1/hive-exec-2.1.1.jar
    	wget https://repo1.maven.org/maven2/org/apache/flink/flink-connector-hive_2.11/1.10.0/flink-connector-hive_2.11-1.10.0.jar
    	#3.1 after the above steps are completed, go to the bin directory of Zeppelin
    	cd ~/zeppelin-0.9.0-SNAPSHOT/bin
    	#3.2 start up!
    	./zeppelin-daemon.sh start
  • If the console prints zeppelin start [OK], the installation is complete. Otherwise, go to Zeppelin's log directory, check the logs, and work out why startup failed
  • Then open a browser and enter the server address and port (the default port is 8080). If the Zeppelin home page loads, everything is working. Otherwise, check the logs
  • Next, configure the interpreter on the web page: click the username anonymous in the upper right corner, then click Interpreter to open the configuration page
  • Filter for the Flink interpreter there
  • Zeppelin can submit Flink jobs in three different modes. You need to configure FLINK_HOME and flink.execution.mode: the first is the Flink installation directory, and the second is an enumerated value with three options
    • Local
      • Starts a MiniCluster, which suits the POC stage. Only the two parameters above need to be configured
    • Remote
      • Connects to a Standalone cluster. In addition to FLINK_HOME and flink.execution.mode, you also need to configure flink.execution.remote.host and flink.execution.remote.port. For the values, refer to flink-conf.yaml
    • Yarn
      • This is the mode we will use later. It starts a Flink cluster in Yarn session mode on Yarn. In addition to FLINK_HOME and flink.execution.mode, you also need to configure HADOOP_CONF_DIR
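
As a quick reference, the settings each mode needs can be summarised in a small shell helper. This is only a sketch: required_settings is a hypothetical function name, and writing the mode names in lowercase is my assumption. The setting names themselves are the ones listed above.

```shell
# Hypothetical helper (not part of Zeppelin or Flink): print the interpreter
# settings that each flink.execution.mode value needs, as described above.
# Assumption: mode names are passed in lowercase.
required_settings() {
  case "$1" in
    local)
      echo "FLINK_HOME flink.execution.mode" ;;
    remote)
      echo "FLINK_HOME flink.execution.mode flink.execution.remote.host flink.execution.remote.port" ;;
    yarn)
      echo "FLINK_HOME flink.execution.mode HADOOP_CONF_DIR" ;;
    *)
      echo "unknown mode: $1" >&2
      return 1 ;;
  esac
}
```

Calling required_settings yarn prints the three names to fill in on the Interpreter page before using Yarn mode.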


  • Go to the home page and open the existing demo notebook
  • It is a simple WordCount in batch mode, written in Scala
  • Click the run button and wait for the output
  • Meanwhile, open the Yarn web UI. You will find that a Flink application has been launched; click through to it to enter the Flink Yarn session cluster
  • There you can see that the job we submitted is running
  • After the job finishes, go back to the Zeppelin page and you will find the result has been output
  • At this point we have installed and configured Zeppelin, and can successfully submit a Flink job to run on the Yarn cluster and get the correct result
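
If the demo does not reach Yarn or Hive, one thing worth ruling out is a jar from steps 2.2 and 2.3 that never landed in Flink's lib directory. The sketch below assumes the exact file names downloaded above; missing_jars is a hypothetical helper name, and you would need to adjust the list if your Flink, Hadoop, or Hive versions differ.

```shell
# Hypothetical helper: print the jars from steps 2.2 and 2.3 that are absent
# from the given directory, one file name per line. Prints nothing when all
# four are present. The file names assume the versions used in this post.
missing_jars() {
  lib_dir=$1
  for jar in \
    flink-hadoop-compatibility_2.11-1.10.0.jar \
    flink-shaded-hadoop-2-uber-2.7.5-9.0.jar \
    hive-exec-2.1.1.jar \
    flink-connector-hive_2.11-1.10.0.jar
  do
    [ -f "$lib_dir/$jar" ] || echo "$jar"
  done
}

# Example (not run here): missing_jars ~/flink/lib
```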

Pitfalls

  • Error when submitting a task -- Java version is too low
    org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Fail to launch interpreter process:
    Apache Zeppelin requires either Java 8 update 151 or newer
    	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:134)
    	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:298)
    	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:433)
    	at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:75)
    	at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
    	at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:130)
    	at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:159)
    	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    	at java.lang.Thread.run(Thread.java:745)
    Caused by: java.io.IOException: Fail to launch interpreter process:
    Apache Zeppelin requires either Java 8 update 151 or newer
    	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:130)
    	at org.apache.zeppelin.interpreter.ManagedInterpreterGroup.getOrCreateInterpreterProcess(ManagedInterpreterGroup.java:65)
    	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getOrCreateInterpreterProcess(RemoteInterpreter.java:110)
    	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:163)
    	at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:131)
    	... 13 more

This happens because the Java version that the JAVA_HOME environment variable points to on our server is 1.8.0_72-b15. Although we set JAVA_HOME in zeppelin-env.sh above, Zeppelin did not pass that environment variable along when it started the interpreter. I will check whether the community has already fixed this bug; if not, I will file it in JIRA. There are two ways to fix it:
  • Change the JAVA_HOME environment variable to point to a newer JDK, then restart Zeppelin
  • If you cannot upgrade the JDK because other applications on the server depend on it, modify common.sh instead:
    	vim ~/zeppelin/bin/common.sh
    	# Jump to line 66 and, in the java_ver_output=$("${JAVA:- ... assignment, change the java that follows ${JAVA:- to the path of your newer JDK's java binary.
    	# For example, my JDK is at /home/data/programs/jdk, so line 66 becomes:
    	java_ver_output=$("${JAVA:-/home/data/programs/jdk/bin/java}" -version 2>&1)
Restart after the change and submit the task again. It should now be submitted normally
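
Before restarting, it can help to check whether a given JDK would pass the rule quoted in the error message ("Java 8 update 151 or newer"). The function below is a sketch that mirrors that sentence only. It is not Zeppelin's own check, and java_version_ok is a hypothetical name.

```shell
# Hypothetical helper: succeed (status 0) if a Java version string satisfies
# "Java 8 update 151 or newer", fail (status 1) otherwise.
java_version_ok() {
  ver=$1
  case "$ver" in
    1.8.0_*)
      update=${ver#1.8.0_}   # strip the "1.8.0_" prefix, e.g. "72-b15"
      update=${update%%-*}   # drop a build suffix such as "-b15"
      [ "$update" -ge 151 ]
      ;;
    1.*)
      return 1               # Java 7 and older, or Java 8 without an update number
      ;;
    *)
      major=${ver%%.*}       # "11.0.2" -> "11"
      [ "$major" -ge 9 ]
      ;;
  esac
}

# Example (not run here): take the quoted version from the first line of
# "java -version" and test it.
# java_version_ok "$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')" && echo "Java is new enough"
```

With the version from this post, java_version_ok "1.8.0_72-b15" fails, which matches the error above.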

  • Error when submitting a task -- network unreachable
    Exception in thread "main" org.apache.zeppelin.shaded.org.apache.thrift.transport.TTransportException: java.net.SocketException: Network is unreachable (connect failed)
            at org.apache.zeppelin.shaded.org.apache.thrift.transport.TSocket.open(TSocket.java:226)
            at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.<init>(RemoteInterpreterServer.java:167)
            at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.<init>(RemoteInterpreterServer.java:152)
            at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.main(RemoteInterpreterServer.java:321)
    Caused by: java.net.SocketException: Network is unreachable (connect failed)
            at java.net.PlainSocketImpl.socketConnect(Native Method)
            at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
            at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
            at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
            at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
            at java.net.Socket.connect(Socket.java:606)
            at org.apache.zeppelin.shaded.org.apache.thrift.transport.TSocket.open(TSocket.java:221)
            ... 3 more
            at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:166)
            at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:131)
            ... 13 more
    I haven't yet pinned down the cause of this error. It fails on both of my computers. Both run Ubuntu as a subsystem installed on Windows, and both also have Docker installed. I don't know whether either of these is the actual cause; I will test again when I get the chance. The workaround is to set the ZEPPELIN_LOCAL_IP environment variable, then restart Zeppelin and submit the task again. If you run into this problem too, leave a comment below and I will reply promptly
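
A minimal sketch of that workaround follows. The post does not say where the variable was set or which address was used, so both are assumptions here: pick_local_ip is a hypothetical helper that returns the first address that is neither loopback nor IPv6, and the export could go in zeppelin-env.sh or in the shell that starts Zeppelin. On a machine with Docker or a Windows subsystem the first candidate may be a virtual interface, so check the result by hand.

```shell
# Hypothetical helper: print the first address from the argument list that is
# neither a 127.x.x.x loopback address nor an IPv6 address.
# Returns 1 if no such address is found.
pick_local_ip() {
  for addr in "$@"; do
    case "$addr" in
      127.*|*:*) continue ;;          # skip loopback and IPv6 entries
      *) echo "$addr"; return 0 ;;
    esac
  done
  return 1
}

# Example (not run here; "hostname -I" is available on Ubuntu):
# export ZEPPELIN_LOCAL_IP="$(pick_local_ip $(hostname -I))"
# ./zeppelin-daemon.sh restart
```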

Finally, a plug for the Flink on Zeppelin DingTalk group. If you have any questions, you can discuss them there. Chien Feng is in the group too, so just ask directly

Tags: Java Apache SQL hive

Posted on Mon, 08 Jun 2020 22:13:41 -0400 by Dave Liebman