Previously:
- 1. CentOS 7 Hadoop 3.3.1 Installation (standalone, pseudo-distributed, fully distributed)
- 2. Implementation of HDFS Operations with the Java API
- 3. MapReduce Programming Examples
- 4. Zookeeper 3.7 Installation
- 5. Shell Operations on Zookeeper
- 6. Java API Operations on Zookeeper Nodes
Hadoop 3.3.1 HA (High Availability) Cluster Setup
(NameNode HA + YARN HA based on Zookeeper)
NameNode HA with QJM
(HA can be built with the Quorum Journal Manager or with conventional shared storage; this guide uses QJM)
Hadoop HA Mode Setup (High Availability)
1. Cluster Planning
There are three virtual machines: master, worker1 and worker2.
There are three NameNodes; the ResourceManagers run on worker1 and worker2.
 | master | worker1 | worker2 |
---|---|---|---|
NameNode | yes | yes | yes |
DataNode | no | yes | yes |
JournalNode | yes | yes | yes |
NodeManager | no | yes | yes |
ResourceManager | no | yes | yes |
Zookeeper | yes | yes | yes |
ZKFC | yes | yes | yes |
Because the virtual machines were not recreated but modified from the earlier setup, the hostnames are still hadoop1, hadoop2 and hadoop3:
hadoop1 = master
hadoop2 = worker1
hadoop3 = worker2
2. Zookeeper Cluster Setup
Reference: 4. Zookeeper 3.7 Installation
3. Modify the Hadoop Cluster Configuration Files
Modify core-site.xml
vim core-site.xml
core-site.xml:
<configuration>
  <!-- The HDFS entry point. mycluster is just the logical name of the cluster and can be changed
       at will, but it must stay consistent with the value of dfs.nameservices in hdfs-site.xml -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
  </property>
  <!-- By default hadoop.tmp.dir points to the /tmp directory, which would cause all NameNode and
       DataNode data to be saved in a volatile directory, so it is changed here -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/export/servers/data/hadoop/tmp</value>
  </property>
  <!-- Static web user configuration; without it the web UI reports errors -->
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>root</value>
  </property>
  <!-- Zookeeper cluster address; a single node can be configured here, or a cluster separated by commas -->
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
  <!-- Timeout for Hadoop's connection to Zookeeper -->
  <property>
    <name>ha.zookeeper.session-timeout.ms</name>
    <value>1000</value>
    <description>ms</description>
  </property>
</configuration>
Replace hadoop1, hadoop2 and hadoop3 in the Zookeeper address above with your own hostnames (you need to configure the hostname-to-IP mapping first) or with IP addresses.
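For reference, a minimal sketch of such a mapping in /etc/hosts; the IP addresses are placeholders and must be replaced with your own:
# /etc/hosts on every node (placeholder IPs)
192.168.1.101 hadoop1
192.168.1.102 hadoop2
192.168.1.103 hadoop3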
Modify hadoop-env.sh
vim hadoop-env.sh
hadoop-env.sh
When the cluster management scripts start daemons over ssh, the java command is not found, because a non-interactive ssh login does not read the environment variables configured in /etc/profile. You therefore need to configure the absolute path of the JDK explicitly in this file (if the JDK paths differ between nodes, set JAVA_HOME to the local machine's path in each node's hadoop-env.sh).
Hadoop 3.x also enforces stricter role permissions than Hadoop 2.x: the user that each daemon runs as must be specified explicitly.
The settings below only cover the HDFS daemons. If YARN is involved, you should also modify the corresponding variables in yarn-env.sh (see the sketch after the export block below).
Add the following at the end of the script:
export JAVA_HOME=/opt/jdk1.8.0_241
export HDFS_NAMENODE_USER="root"
export HDFS_DATANODE_USER="root"
export HDFS_ZKFC_USER="root"
export HDFS_JOURNALNODE_USER="root"
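For the YARN daemons started later in this guide, a minimal sketch of the corresponding user variables (they can go in yarn-env.sh, or in hadoop-env.sh alongside the lines above; root matches the HDFS settings used here):
# YARN daemon users, matching the root user chosen above
export YARN_RESOURCEMANAGER_USER="root"
export YARN_NODEMANAGER_USER="root"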
Modify hdfs-site.xml
vim hdfs-site.xml
hdfs-site.xml
<configuration>
  <!-- Number of replicas -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Working directories (data storage) for the NameNode and DataNode -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/export/servers/data/hadoop/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/export/servers/data/hadoop/tmp/dfs/data</value>
  </property>
  <!-- Enable webhdfs -->
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
  <!-- The HDFS nameservice, which must stay consistent with core-site.xml.
       dfs.ha.namenodes.[nameservice id] sets a unique identifier for each NameNode in the
       nameservice as a comma-separated list of NameNode IDs, which lets the DataNodes identify
       all NameNodes. Here "mycluster" is the nameservice ID and "nn1", "nn2", "nn3" are the
       NameNode identifiers. -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The cluster has three NameNodes: nn1, nn2 and nn3 -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2,nn3</value>
  </property>
  <!-- RPC address of nn1 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>hadoop1:9000</value>
  </property>
  <!-- HTTP address of nn1 -->
  <property>
    <name>dfs.namenode.http-address.mycluster.nn1</name>
    <value>hadoop1:9870</value>
  </property>
  <!-- RPC address of nn2 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>hadoop2:9000</value>
  </property>
  <!-- HTTP address of nn2 -->
  <property>
    <name>dfs.namenode.http-address.mycluster.nn2</name>
    <value>hadoop2:9870</value>
  </property>
  <!-- RPC address of nn3 -->
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn3</name>
    <value>hadoop3:9000</value>
  </property>
  <!-- HTTP address of nn3 -->
  <property>
    <name>dfs.namenode.http-address.mycluster.nn3</name>
    <value>hadoop3:9870</value>
  </property>
  <!-- Shared storage location for the NameNode edits (metadata), i.e. the JournalNode list.
       URL format: qjournal://host1:port1;host2:port2;host3:port3/journalId
       Using the nameservice as the journalId is recommended; the default port is 8485. -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485/mycluster</value>
  </property>
  <!-- Local disk location where the JournalNode stores its data -->
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/export/servers/data/hadoop/tmp/journaldata</value>
  </property>
  <!-- Enable automatic failover for the NameNode -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
  <!-- Failover proxy provider (the automatic failover implementation) -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  <!-- Fencing methods; multiple mechanisms are separated by line breaks, i.e. each mechanism on its own line -->
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>
      sshfence
      shell(/bin/true)
    </value>
  </property>
  <!-- Passwordless ssh key required by the sshfence mechanism -->
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
  </property>
  <!-- Timeout for the sshfence mechanism -->
  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>30000</value>
  </property>
  <property>
    <name>ha.failover-controller.cli-check.rpc-timeout.ms</name>
    <value>60000</value>
  </property>
  <!-- Secondary NameNode address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop3:9868</value>
  </property>
</configuration>
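The sshfence method above requires passwordless ssh as root between the NameNode hosts, using the private key configured in dfs.ha.fencing.ssh.private-key-files. A minimal sketch, assuming the key has not already been set up in the earlier articles:
ssh-keygen -t rsa            # generates /root/.ssh/id_rsa
ssh-copy-id root@hadoop1
ssh-copy-id root@hadoop2
ssh-copy-id root@hadoop3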
Create the journaldata folder configured above (dfs.journalnode.edits.dir) on each node.
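For example, using the path from hdfs-site.xml:
mkdir -p /export/servers/data/hadoop/tmp/journaldata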
workers
In Hadoop 2.x this file was called slaves; it lists the host addresses of all DataNodes. Simply fill in all the DataNode hostnames, one per line.
hadoop1
hadoop2
hadoop3
YARN High Availability
Modify mapred-site.xml
vim mapred-site.xml
<configuration>
  <!-- Use YARN as the MapReduce framework -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- MapReduce JobHistory Server address, default port 10020 -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop1:10020</value>
  </property>
  <!-- MapReduce JobHistory Server web UI address, default port 19888 -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop1:19888</value>
  </property>
</configuration>
Modify yarn-site.xml
vim yarn-site.xml
<configuration>
  <!-- Enable ResourceManager high availability -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <!-- Cluster id of the RM -->
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yrc</value>
  </property>
  <!-- Logical ids of the RMs -->
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <!-- Addresses of the individual RMs -->
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>hadoop2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>hadoop3</value>
  </property>
  <!-- Zookeeper cluster address -->
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181</value>
  </property>
  <!-- How reducers fetch data -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Enable log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
  <!-- Keep aggregated logs for one day -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
  </property>
  <!-- Enable automatic recovery -->
  <property>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <!-- Store ResourceManager state information in the Zookeeper cluster -->
  <property>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
</configuration>
After all the files are modified, distribute them to the other cluster nodes (from the hadoop/etc path):
scp /export/servers/hadoop-3.3.1/etc/hadoop/* hadoop2:/export/servers/hadoop-3.3.1/etc/hadoop/
scp /export/servers/hadoop-3.3.1/etc/hadoop/* hadoop3:/export/servers/hadoop-3.3.1/etc/hadoop/
Start zookeeper cluster
Start on each machine:
zkServer.sh start
zkServer.sh status
Format the NameNode and ZKFC
First, start journalnode on all virtual machines:
hdfs --daemon start journalnode
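Before formatting, you can optionally confirm that the JournalNode process is up on every node:
jps    # the output should include a JournalNode process on each machine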
Once they are all started, format the NameNode on the master (hadoop1) node:
hadoop namenode -format
The NameNode has to be formatted again even though the cluster was fully distributed before, because the DataNodes and NameNodes in a cluster are tied to the clusterID stored in the current/VERSION file. So format it once more, start it, and let the other two nodes synchronize from the formatted NameNode so that the clusterIDs do not conflict. The same applies to formatZK.
Then start this NameNode on its own:
hdfs namenode
Then, on the other two machines, synchronize from the formatted NameNode:
hdfs namenode -bootstrapStandby
The transfer output should be visible on the master.
After the transfer is complete, on the master node, format zkfc:
hdfs zkfc -formatZK
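Optionally, you can check in the Zookeeper client that the HA znode was created (assuming the default parent znode /hadoop-ha):
zkCli.sh -server hadoop1:2181
ls /hadoop-ha      # should list the mycluster nameservice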
Start hdfs
On the master node, start dfs first:
start-dfs.sh
Then start yarn:
start-yarn.sh
Start the mapreduce task history server:
mapred --daemon start historyserver
You can then check which processes have started on each node:
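A quick way is to run jps on every machine; depending on the node's roles in the table above, you should see processes such as the following:
jps
# NameNode, DataNode, JournalNode, DFSZKFailoverController (ZKFC), QuorumPeerMain (Zookeeper),
# ResourceManager, NodeManager, JobHistoryServer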
Try HA mode
First look at the status of each namenode host:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
hdfs haadmin -getServiceState nn3
You can see that there are two standbies and one active.
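Since YARN HA is enabled as well, the ResourceManager states can be checked in the same way, using the rm-ids configured in yarn-site.xml:
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2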
On the node whose NameNode is currently active, kill the NameNode process:
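A minimal sketch; the PID is whatever jps reports for the NameNode on that host:
jps                      # find the NameNode PID
kill -9 <NameNode PID>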
Then check the NameNode states again.
As you can see, nn1 has switched to active, and the Hadoop high availability cluster is basically set up.