Hadoop HDFS 2.x High Availability Setup

Architecture specification

HDFS 2.x HA

HDFS High Availability Using the Quorum Journal Manager

Setup instructions

virtual machine   NN-1   NN-2   DN   ZK   ZKFC   JNN
node01             *                         *     *
node02                    *     *    *       *     *
node03                          *    *             *
node04                          *    *

Steps to build

Official document: https://hadoop.apache.org/docs/r2.6.5/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

  1. Install jdk, hadoop and configure environment variables

  2. Set up SSH passwordless (key-based) login on all nodes; in particular, node01 and node02 must be able to log in to each other without a password.
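
    A minimal sketch of the key setup, assuming the root user and RSA keys (the sample hdfs-site.xml later in this article points dfs.ha.fencing.ssh.private-key-files at /root/.ssh/id_dsa, so match the key type to your own setup):

    # On every node: generate a key pair if one does not exist yet
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

    # On node01: push the public key to the other nodes
    ssh-copy-id root@node02
    ssh-copy-id root@node03
    ssh-copy-id root@node04

    # On node02: node01 and node02 must reach each other without a password
    ssh-copy-id root@node01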

  3. Configure the hdfs-site.xml file and core-site.xml through official documentation

    Configuration details

    To configure HA NameNodes, you must add several configuration options to your hdfs-site.xml configuration file.

    The order in which you set these configurations is unimportant, but the values you choose for dfs.nameservices and dfs.ha.namenodes.[nameservice ID] will determine the keys of those that follow. Thus, you should decide on these values before setting the rest of the configuration options.

    • dfs.nameservices

    - the logical name for this new nameservice

    Choose a logical name for this nameservice, for example "mycluster", and use this logical name for the value of this config option. The name you choose is arbitrary. It will be used both for configuration and as the authority component of absolute HDFS paths in the cluster.

    Note: If you are also using HDFS Federation, this configuration setting should also include the list of other nameservices, HA or otherwise, as a comma-separated list.

    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    
    • dfs.ha.namenodes.[nameservice ID]

    - unique identifiers for each NameNode in the nameservice

    Configure with a list of comma-separated NameNode IDs. This will be used by DataNodes to determine all the NameNodes in the cluster. For example, if you used "mycluster" as the nameservice ID previously, and you wanted to use "nn1" and "nn2" as the individual IDs of the NameNodes, you would configure this as such:

    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    

    Note: Currently, only a maximum of two NameNodes may be configured per nameservice.

    • dfs.namenode.rpc-address.[nameservice ID].[name node ID]

    - the fully-qualified RPC address for each NameNode to listen on

    For both of the previously-configured NameNode IDs, set the full address and IPC port of the NameNode process. Note that this results in two separate configuration options. For example:

    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>machine1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>machine2.example.com:8020</value>
    </property>
    

    Note: You may similarly configure the "servicerpc-address" setting if you so desire.

    • dfs.namenode.http-address.[nameservice ID].[name node ID]

    - the fully-qualified HTTP address for each NameNode to listen on

    Similarly to rpc-address above, set the addresses for both NameNodes' HTTP servers to listen on. For example:

    <property>
      <name>dfs.namenode.http-address.mycluster.nn1</name>
      <value>machine1.example.com:50070</value>
    </property>
    <property>
      <name>dfs.namenode.http-address.mycluster.nn2</name>
      <value>machine2.example.com:50070</value>
    </property>
    

    Note: If you have Hadoop's security features enabled, you should also set the https-address similarly for each NameNode.

    • dfs.namenode.shared.edits.dir

    - the URI which identifies the group of JNs where the NameNodes will write/read edits

    This is where one configures the addresses of the JournalNodes which provide the shared edits storage, written to by the Active NameNode and read by the Standby NameNode to stay up-to-date with all the file system changes the Active NameNode makes. Though you must specify several JournalNode addresses, you should only configure one of these URIs. The URI should be of the form: "qjournal://host1:port1;host2:port2;host3:port3/journalId". The Journal ID is a unique identifier for this nameservice, which allows a single set of JournalNodes to provide storage for multiple federated namesystems. Though not a requirement, it's a good idea to reuse the nameservice ID for the journal identifier.

    For example, if the JournalNodes for this cluster were running on the machines "node1.example.com", "node2.example.com", and "node3.example.com" and the nameservice ID were "mycluster", you would use the following as the value for this setting (the default port for the JournalNode is 8485):

    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
    </property>
    
    • dfs.client.failover.proxy.provider.[nameservice ID]

    - the Java class that HDFS clients use to contact the Active NameNode

    Configure the name of the Java class which will be used by the DFS Client to determine which NameNode is the current Active, and therefore which NameNode is currently serving client requests. The only implementation which currently ships with Hadoop is the ConfiguredFailoverProxyProvider, so use this unless you are using a custom one. For example:

    <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    
    • dfs.ha.fencing.methods

    - a list of scripts or Java classes which will be used to fence the Active NameNode during a failover

    It is desirable for correctness of the system that only one NameNode be in the Active state at any given time. Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario. However, when a failover occurs, it is still possible that the previous Active NameNode could serve read requests to clients, which may be out of date until that NameNode shuts down when trying to write to the JournalNodes. For this reason, it is still desirable to configure some fencing methods even when using the Quorum Journal Manager. However, to improve the availability of the system in the event the fencing mechanisms fail, it is advisable to configure a fencing method which is guaranteed to return success as the last fencing method in the list. Note that if you choose to use no actual fencing methods, you still must configure something for this setting, for example "shell(/bin/true)".

    The fencing methods used during a failover are configured as a carriage-return-separated list, which will be attempted in order until one indicates that fencing has succeeded. There are two methods which ship with Hadoop: shell and sshfence. For information on implementing your own custom fencing method, see the org.apache.hadoop.ha.NodeFencer class.

    • sshfence

      - SSH to the Active NameNode and kill the process

      The sshfence option SSHes to the target node and uses fuser to kill the process listening on the service's TCP port. In order for this fencing option to work, it must be able to SSH to the target node without providing a passphrase. Thus, one must also configure the dfs.ha.fencing.ssh.private-key-files option, which is a comma-separated list of SSH private key files. For example:

      <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence</value>
      </property>
      
      <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/exampleuser/.ssh/id_rsa</value>
      </property>
      

      Optionally, one may configure a non-standard username or port to perform the SSH. One may also configure a timeout, in milliseconds, for the SSH, after which this fencing method will be considered to have failed. It may be configured like so:

      <property>
        <name>dfs.ha.fencing.methods</name>
        <value>sshfence([[username][:port]])</value>
      </property>
      <property>
        <name>dfs.ha.fencing.ssh.connect-timeout</name>
        <value>30000</value>
      </property>
      
    • shell

      - run an arbitrary shell command to fence the Active NameNode

      The shell fencing method runs an arbitrary shell command. It may be configured like so:

      <property>
        <name>dfs.ha.fencing.methods</name>
        <value>shell(/path/to/my/script.sh arg1 arg2 ...)</value>
      </property>
      

      The string between '(' and ')' is passed directly to a bash shell and may not include any closing parentheses.

      The shell command will be run with an environment set up to contain all of the current Hadoop configuration variables, with the '_' character replacing any '.' characters in the configuration keys. The configuration used has already had any namenode-specific configurations promoted to their generic forms – for example dfs_namenode_rpc-address will contain the RPC address of the target node, even though the configuration may specify that variable as dfs.namenode.rpc-address.ns1.nn1.

      Additionally, the following variables referring to the target node to be fenced are also available:

      $target_host hostname of the node to be fenced
      $target_port IPC port of the node to be fenced
      $target_address the above two, combined as host:port
      $target_nameserviceid the nameservice ID of the NN to be fenced
      $target_namenodeid the namenode ID of the NN to be fenced

      These environment variables may also be used as substitutions in the shell command itself. For example:

      <property>
        <name>dfs.ha.fencing.methods</name>
        <value>shell(/path/to/my/script.sh --nameservice=$target_nameserviceid $target_host:$target_port)</value>
      </property>
      

      If the shell command returns an exit code of 0, the fencing is determined to be successful. If it returns any other exit code, the fencing was not successful and the next fencing method in the list will be attempted.

      Note: This fencing method does not implement any timeout. If timeouts are necessary, they should be implemented in the shell script itself (eg by forking a subshell to kill its parent in some number of seconds).
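
      A minimal sketch of such a self-timing script (the path and the do_fence helper are hypothetical; the actual fencing action is site-specific):

      #!/bin/bash
      # Hypothetical fencing script with a built-in 30-second timeout.
      # Fork a watchdog subshell that kills this script if it runs too long.
      ( sleep 30 && kill -9 $$ ) &
      WATCHDOG=$!

      # Site-specific fencing action goes here, e.g. power-cycling the host
      # identified by $target_host (do_fence is a placeholder, not a real command).
      do_fence "$target_host" "$target_port"
      STATUS=$?

      # Cancel the watchdog and report success (0) or failure (non-zero),
      # so the next method in dfs.ha.fencing.methods is attempted on failure.
      kill "$WATCHDOG" 2>/dev/null
      exit "$STATUS"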

    • fs.defaultFS

    - the default path prefix used by the Hadoop FS client when none is given

    Optionally, you may now configure the default path for Hadoop clients to use the new HA-enabled logical URI. If you used "mycluster" as the nameservice ID earlier, this will be the value of the authority portion of all of your HDFS paths. This may be configured like so, in your core-site.xml file:

    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://mycluster</value>
    </property>
    
    • dfs.journalnode.edits.dir

    - the path where the JournalNode daemon will store its local state

    This is the absolute path on the JournalNode machines where the edits and other local state used by the JNs will be stored. You may only use a single path for this configuration. Redundancy for this data is provided by running multiple separate JournalNodes, or by configuring this directory on a locally-attached RAID array. For example:

    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/path/to/journal/node/local/data</value>
    </property>
    

    hdfs-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
        http://www.apache.org/licenses/LICENSE-2.0
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>2</value>
        </property>
        <property>
            <name>dfs.nameservices</name>
            <value>mycluster</value>
        </property>
        <property>
            <name>dfs.ha.namenodes.mycluster</name>
            <value>nn1,nn2</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.mycluster.nn1</name>
            <value>node01:8020</value>
        </property>
        <property>
            <name>dfs.namenode.rpc-address.mycluster.nn2</name>
            <value>node02:8020</value>
        </property>
        <property>
            <name>dfs.namenode.http-address.mycluster.nn1</name>
            <value>node01:50070</value>
        </property>
        <property>
            <name>dfs.namenode.http-address.mycluster.nn2</name>
            <value>node02:50070</value>
        </property>
        <property>
            <name>dfs.namenode.shared.edits.dir</name>
            <value>qjournal://node01:8485;node02:8485;node03:8485/mycluster</value>
        </property>
        <property>
            <name>dfs.client.failover.proxy.provider.mycluster</name>
            <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
        </property>
        <property>
            <name>dfs.ha.fencing.methods</name>
            <value>sshfence</value>
        </property>
        <property>
            <name>dfs.ha.fencing.ssh.private-key-files</name>
            <value>/root/.ssh/id_dsa</value>
        </property>
        <property>
            <name>dfs.journalnode.edits.dir</name>
            <value>/var/hadoop/ha/journalnode</value>
        </property>
    </configuration>
    
    

    core-site.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at
    
        http://www.apache.org/licenses/LICENSE-2.0
    
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License. See accompanying LICENSE file.
    -->
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://mycluster</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/var/hadoop/ha</value>
        </property>
    </configuration>
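
    After both files are in place, a quick sanity check can be run on a NameNode host (a sketch; hdfs getconf only reads the local configuration files):

    # NameNodes resolved from dfs.ha.namenodes.mycluster
    hdfs getconf -namenodes

    # Spot-check individual keys
    hdfs getconf -confKey dfs.nameservices
    hdfs getconf -confKey dfs.namenode.shared.edits.dir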
    
  4. Following the official documentation, enable ZooKeeper-based automatic failover by adding the corresponding settings to hdfs-site.xml and core-site.xml

    Configuring automatic failover

    The configuration of automatic failover requires the addition of two new parameters to your configuration. In your hdfs-site.xml file, add:

    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
    

    This specifies that the cluster should be set up for automatic failover. In your core-site.xml file, add:

    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>
    

    This lists the host-port pairs running the ZooKeeper service.

    As with the parameters described earlier in the document, these settings may be configured on a per-nameservice basis by suffixing the configuration key with the nameservice ID. For example, in a cluster with federation enabled, you can explicitly enable automatic failover for only one of the nameservices by setting dfs.ha.automatic-failover.enabled.my-nameservice-id.

    There are also several other configuration parameters which may be set to control the behavior of automatic failover; however, they are not necessary for most installations. Please refer to the configuration key specific documentation for details.

    hdfs-site.xml

     <property>
       <name>dfs.ha.automatic-failover.enabled</name>
       <value>true</value>
     </property>
    
    

    core-site.xml

     <property>
       <name>ha.zookeeper.quorum</name>
       <value>node02:2181,node03:2181,node04:2181</value>
     </property>
    
  5. Distribute core-site.xml and hdfs-site.xml to the other virtual machines
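
    A sketch of the distribution from node01, assuming the configuration directory is /opt/hadoop/hadoop-2.6.5/etc/hadoop (the Hadoop home suggested by the start-dfs.sh log paths later in this article):

    cd /opt/hadoop/hadoop-2.6.5/etc/hadoop
    for n in node02 node03 node04; do
        scp core-site.xml hdfs-site.xml root@$n:"$PWD"
    done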

  6. Set up ZooKeeper

    tar xf zookeeper-3.4.6.tar.gz -C /opt/
    
  7. Modify the ZooKeeper configuration file

    cd /opt/zookeeper-3.4.6/conf
    mv zoo_sample.cfg zoo.cfg
    vi zoo.cfg
    

    In zoo.cfg, change dataDir to the directory where ZooKeeper should store its data, and append the server entries (server address, peer-communication port and leader-election port) at the end:

    # The number of milliseconds of each tick
    tickTime=2000
    # The number of ticks that the initial 
    # synchronization phase can take
    initLimit=10
    # The number of ticks that can pass between 
    # sending a request and getting an acknowledgement
    syncLimit=5
    # the directory where the snapshot is stored.
    # do not use /tmp for storage, /tmp here is just 
    # example sakes.
    dataDir=/var/zookeeper
    # the port at which the clients will connect
    clientPort=2181
    # the maximum number of client connections.
    # increase this if you need to handle more clients
    #maxClientCnxns=60
    #
    # Be sure to read the maintenance section of the 
    # administrator guide before turning on autopurge.
    #
    # http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
    #
    # The number of snapshots to retain in dataDir
    #autopurge.snapRetainCount=3
    # Purge task interval in hours
    # Set to "0" to disable auto purge feature
    #autopurge.purgeInterval=1
    server.1=node02:2888:3888
    server.2=node03:2888:3888
    server.3=node04:2888:3888
    
    
  8. Distribute the entire ZooKeeper directory to the other ZooKeeper nodes

  9. Create the directory specified by dataDir, then create the myid file in it (the id of the current ZooKeeper node, matching the server.N entries in zoo.cfg)

    mkdir /var/zookeeper
    echo 1 > /var/zookeeper/myid
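
    The ids must match the server.N entries in zoo.cfg, so each ZooKeeper node gets a different value. A sketch, run from node02 (which holds id 1 above):

    # node02 -> 1, node03 -> 2, node04 -> 3, matching server.1/2/3 in zoo.cfg
    ssh root@node03 'mkdir -p /var/zookeeper && echo 2 > /var/zookeeper/myid'
    ssh root@node04 'mkdir -p /var/zookeeper && echo 3 > /var/zookeeper/myid'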
    
  10. Add ZooKeeper to the environment variables and distribute the updated profile to the other ZooKeeper nodes
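
    A sketch of the environment setup, assuming ZooKeeper lives in /opt/zookeeper-3.4.6 and that /etc/profile is used for environment variables:

    # Append to /etc/profile on node02, node03 and node04, then run: source /etc/profile
    export ZOOKEEPER_HOME=/opt/zookeeper-3.4.6
    export PATH=$PATH:$ZOOKEEPER_HOME/bin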

  11. Start ZooKeeper on each ZooKeeper node: zkServer.sh start
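
    Run this on node02, node03 and node04; zkServer.sh status should then report one leader and two followers (a minimal sketch):

    zkServer.sh start
    zkServer.sh status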

  12. Start the JournalNodes: hadoop-daemon.sh start journalnode

  13. Format the NameNode on the NN master node (node01): hdfs namenode -format

  14. Start the NameNode that was just formatted (not mentioned explicitly in the official document): hadoop-daemon.sh start namenode

  15. On the NN standby (slave) node, synchronize the metadata from the formatted NameNode as the official documentation describes: hdfs namenode -bootstrapStandby

    Deployment details

    After all of the necessary configuration options have been set, you must start the JournalNode daemons on the set of machines where they will run. This can be done by running the command "hadoop-daemon.sh start journalnode" and waiting for the daemon to start on each of the relevant machines.

    Once the JournalNodes have been started, one must initially synchronize the two HA NameNodes' on-disk metadata.

    • If you are setting up a fresh HDFS cluster, you should first run the format command (hdfs namenode -format) on one of NameNodes.
    • If you have already formatted the NameNode, or are converting a non-HA-enabled cluster to be HA-enabled, you should now copy over the contents of your NameNode metadata directories to the other, unformatted NameNode by running the command "hdfs namenode -bootstrapStandby" on the unformatted NameNode. Running this command will also ensure that the JournalNodes (as configured by dfs.namenode.shared.edits.dir) contain sufficient edits transactions to be able to start both NameNodes.
    • If you are converting a non-HA NameNode to be HA, you should run the command "hdfs namenode -initializeSharedEdits", which will initialize the JournalNodes with the edits data from the local NameNode edits directories.

    At this point you may start both of your HA NameNodes as you normally would start a NameNode.

    You can visit each of the NameNodes' web pages separately by browsing to their configured HTTP addresses. You should notice that next to the configured address will be the HA state of the NameNode (either "standby" or "active".) Whenever an HA NameNode starts, it is initially in the Standby state.
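
    Putting steps 12 to 15 together, the first-time initialization looks roughly like this (a sketch; node01 is taken as the NameNode that gets formatted):

    # On node01, node02 and node03: start the JournalNodes first
    hadoop-daemon.sh start journalnode

    # On node01 only: format and start the first NameNode
    hdfs namenode -format
    hadoop-daemon.sh start namenode

    # On node02 only: copy the metadata over from node01
    hdfs namenode -bootstrapStandby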

  16. On the NN master node, run hdfs zkfc -formatZK to register the cluster's HA state in ZooKeeper

    Initializing HA state in ZooKeeper

    After the configuration keys have been added, the next step is to initialize required state in ZooKeeper. You can do so by running the following command from one of the NameNode hosts.

    $ hdfs zkfc -formatZK
    

    This will create a znode in ZooKeeper inside of which the automatic failover system stores its data.

  17. The registration information can then be viewed through the ZooKeeper client (zkCli.sh):

    WatchedEvent state:SyncConnected type:None path:null
    [zk: localhost:2181(CONNECTED) 0] ls /
    [hadoop-ha, zookeeper]
    [zk: localhost:2181(CONNECTED) 1] ls /hadoop-ha/mycluster
    []
    [zk: localhost:2181(CONNECTED) 2] 
    
    
  18. Start the cluster with start-dfs.sh on the NN master node (the node that has passwordless SSH access to the other nodes)

    [root@node01 hadoop]# start-dfs.sh
    Starting namenodes on [node01 node02]
    node02: starting namenode, logging to /opt/hadoop/hadoop-2.6.5/logs/hadoop-root-namenode-node02.out
    node01: namenode running as process 1754. Stop it first.
    node03: starting datanode, logging to /opt/hadoop/hadoop-2.6.5/logs/hadoop-root-datanode-node03.out
    node04: starting datanode, logging to /opt/hadoop/hadoop-2.6.5/logs/hadoop-root-datanode-node04.out
    node02: starting datanode, logging to /opt/hadoop/hadoop-2.6.5/logs/hadoop-root-datanode-node02.out
    Starting journal nodes [node01 node02 node03]
    node01: journalnode running as process 1464. Stop it first.
    node03: journalnode running as process 1288. Stop it first.
    node02: journalnode running as process 1565. Stop it first.
    Starting ZK Failover Controllers on NN hosts [node01 node02]
    node01: starting zkfc, logging to /opt/hadoop/hadoop-2.6.5/logs/hadoop-root-zkfc-node01.out
    node02: starting zkfc, logging to /opt/hadoop/hadoop-2.6.5/logs/hadoop-root-zkfc-node02.out
    
    
  19. After startup, use jps to verify that the expected Java processes are running on each node
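
    Based on the role table at the top of this article, the process names reported by jps should roughly be (a sketch; QuorumPeerMain is the ZooKeeper server and DFSZKFailoverController is the zkfc):

    # node01: NameNode, JournalNode, DFSZKFailoverController
    # node02: NameNode, DataNode, JournalNode, QuorumPeerMain, DFSZKFailoverController
    # node03: DataNode, JournalNode, QuorumPeerMain
    # node04: DataNode, QuorumPeerMain
    jps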

  20. Stop the NameNode on the master node with hadoop-daemon.sh stop namenode to test failover

    You can also stop the master node's zkfc with hadoop-daemon.sh stop zkfc to check whether the standby NameNode takes over
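
    A sketch of a failover test, run from node01 (hdfs haadmin -getServiceState reports which NameNode currently holds the active role):

    # Check the current state of both NameNodes
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Stop the active NameNode (or its zkfc), then check again:
    # nn2 should take over as active within a few seconds
    hadoop-daemon.sh stop namenode
    hdfs haadmin -getServiceState nn2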

Other commands

  • Zookeeper

    zkServer.sh start - start the ZooKeeper server

    zkServer.sh stop - shut it down

    zkServer.sh status - view its status

    zkCli.sh - open the ZooKeeper client; ls / lists the root znode

  • Start sequence: ZooKeeper -> JournalNodes -> HDFS

  • ss -nal shows listening ports (see the example after this list)

  • start-dfs.sh starts the HDFS cluster
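
  For example, to confirm that the key ports from this setup are listening (a sketch; 8020 = NameNode RPC, 50070 = NameNode HTTP, 8485 = JournalNode, 2181 = ZooKeeper):

    ss -nal | grep -E '8020|50070|8485|2181'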
