Start learning big data again - hadoop - day46 HDFS and YARN HA, MapReduce

HA (High Availability)

HA for HDFS

Compared with Hadoop 1.x, HDFS in Hadoop 2.x adds two important features: HA and Federation. HA (High Availability) solves the single point of failure of the NameNode. It provides a hot standby for the primary NameNode, so that once the primary NameNode fails the cluster can quickly switch to the standby NameNode and continue to provide service without interruption. Federation allows multiple NameNodes in one HDFS cluster to provide service at the same time. Each NameNode is in charge of part of the directory tree (horizontal partitioning) and the NameNodes are isolated from each other, but they share the underlying DataNode storage resources.

Failover principle of HA

With HDFS HA there are two NameNodes in a cluster, running on separate physical nodes. At any point in time only one NameNode is in the Active state and the other is in the Standby state. The Active NameNode handles all client operations, while the Standby NameNode synchronizes the state of the Active NameNode so that it can take over quickly when a failure occurs.

To keep the state of the Active NN and the Standby NN synchronized (i.e. the metadata consistent), DataNodes report block locations to both NameNodes, and in addition a group of independent daemons called JournalNodes is used to share the edit log. Whenever the Active NN performs a namespace modification, it must persist the edit record to a majority of the JournalNodes. The Standby NN watches the JournalNodes, reads the edits written by the Active NN and applies them to its own namespace. If the Active NN fails, the Standby NN first makes sure it has read all remaining edits from the JournalNodes and then switches to the Active state.
Note that when HA is enabled the SecondaryNameNode must not be started; starting it will cause an error.
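
A minimal illustration (a sketch, using the nn1/nn2 NameNode IDs configured in the build steps below): the current role of each NameNode can be queried with the haadmin tool, and automatic failover can be exercised by stopping the active NameNode.

    # Query the HA state of each NameNode (nn1/nn2 are defined by dfs.ha.namenodes.cluster below)
    hdfs haadmin -getServiceState nn1     # prints "active" or "standby"
    hdfs haadmin -getServiceState nn2
    # Rough failover test: stop the NameNode process on the active host,
    # then the ZKFC should promote the standby within a few seconds
    hadoop-daemon.sh stop namenode        # run on the active NameNode's host
    hdfs haadmin -getServiceState nn2     # should now report "active"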

Federation of HDFS

HDFS Federation solves the following problems of a single namespace:
(1) Cluster scalability. Multiple NameNodes each manage part of the directory tree, so the cluster can scale out to more nodes, and the number of files stored is no longer limited by the memory of a single NameNode as in 1.x.
(2) Better performance. Multiple NameNodes manage different data and serve clients at the same time, giving users higher aggregate read and write throughput.
(3) Good isolation. Different business data can be assigned to different NameNodes as needed, so that different businesses have little impact on each other.

Federation architecture diagram

HA Construction of HDFS

(If steps 1-3 below have already been done, skip them.)
Cluster planning before implementation:

             master         node1          node2
             NameNode       NameNode
             JournalNode    JournalNode    JournalNode
             DataNode       DataNode       DataNode

Note: JournalNode - a journal node used to keep the edit log safe (it stores the shared edits).

1. Turn off the firewall
service firewalld stop
2. Time synchronization
yum install ntp
ntpdate -u s2c.time.edu.cn

3. Passwordless SSH (for remote command execution)
Generate a key pair on each of the two NameNode hosts (master and node1) and copy it to every node (a sketch follows the list below):
ssh-keygen -t rsa
ssh-copy-id <ip>

master-->master,node1,node2
node1-->master,node1,node2
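
A hedged sketch of the key distribution (it assumes the hostnames master/node1/node2 resolve and the root passwords are known):

    # Run on master and again on node1
    ssh-keygen -t rsa
    for host in master node1 node2; do
        ssh-copy-id $host
    done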

4. Modify hadoop configuration file
core-site.xml (modify the original as follows)

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://cluster</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/usr/local/soft/hadoop-2.7.6/tmp</value>
	</property>
	<property>
		<name>fs.trash.interval</name>
		<value>1440</value>
	</property>
	<property>
	      <name>ha.zookeeper.quorum</name>
	      <value>master:2181,node1:2181,node2:2181</value>
	</property>
</configuration>

hdfs-site.xml (modify the original as follows)

<configuration>
<!-- Specify the path where the NameNode stores metadata -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/soft/hadoop-2.7.6/data/namenode</value>
</property>

<!-- Specify the path where the DataNode stores data blocks -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/soft/hadoop-2.7.6/data/datanode</value>
</property>

<!-- Number of data backups -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<!-- Turn off permission verification -->
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>

<!-- Enable WebHDFS (REST-based interface) -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

<!-- The following is the HDFS HA configuration -->
<!-- Specify the HDFS nameservice name as cluster -->
<property>
<name>dfs.nameservices</name>
<value>cluster</value>
</property>

<!-- Name the two NameNodes of the cluster nameservice nn1 and nn2 -->
<property>
<name>dfs.ha.namenodes.cluster</name>
<value>nn1,nn2</value>
</property>

<!-- Configure the RPC addresses of nn1 and nn2 -->
<property>
<name>dfs.namenode.rpc-address.cluster.nn1</name>
<value>master:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster.nn2</name>
<value>node1:8020</value>
</property>

<!-- Configure the HTTP addresses of nn1 and nn2 -->
<property>
<name>dfs.namenode.http-address.cluster.nn1</name>
<value>master:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster.nn2</name>
<value>node1:50070</value>
</property>

<!-- Specify the JournalNode quorum where the NameNode's shared edit log is stored -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://master:8485;node1:8485;node2:8485/cluster</value>
</property>

<!-- Specify the local path where the JournalNodes store the edit log -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/soft/hadoop-2.7.6/data/journal</value>
</property>

<!-- Specify the Java class HDFS clients use to locate the active NameNode -->
<property>
<name>dfs.client.failover.proxy.provider.cluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<!-- Configure the fencing methods: sshfence, falling back to shell(/bin/true) -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>
sshfence
shell(/bin/true)
</value>
</property>

<!-- Specify the location of the SSH private key -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>

<!-- Turn on automatic failover -->
<property>  
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
</configuration>
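
Before continuing, the HA settings can be sanity-checked with hdfs getconf (a quick sketch; it assumes HADOOP_CONF_DIR points at this etc/hadoop directory):

    hdfs getconf -confKey dfs.nameservices   # expected output: cluster
    hdfs getconf -namenodes                  # expected output: master node1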

Stop HDFS cluster: stop-dfs.sh

Synchronize to other nodes

	cd /usr/local/soft/hadoop-2.7.6/etc/hadoop
	scp ./* node1:`pwd`
	scp ./* node2:`pwd`

5. Delete the Hadoop data storage directory. This must be done on every node:
rm -rf /usr/local/soft/hadoop-2.7.6/tmp

6. Start ZooKeeper. It must be started on all three machines:
zkServer.sh start
zkServer.sh status

7. Start the JournalNodes, which store the shared HDFS edit log.
Execute the start command on all three JournalNode hosts:
/usr/local/soft/hadoop-2.7.6/sbin/hadoop-daemon.sh start journalnode

8. Format HDFS on one NameNode. master is chosen here:
hdfs namenode -format
Start the NameNode that was just formatted:
hadoop-daemon.sh start namenode

9. Synchronize metadata on the NameNode that was not formatted. Execute on the other NameNode, node1 here:
/usr/local/soft/hadoop-2.7.6/bin/hdfs namenode -bootstrapStandby

10. Format the HA state in ZooKeeper. Execute on master.
!! Make sure the ZooKeeper cluster is running normally first.
/usr/local/soft/hadoop-2.7.6/bin/hdfs zkfc -formatZK
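
Optional check (a sketch; the znode layout can differ slightly between versions): after formatZK, a znode for the nameservice should exist under /hadoop-ha in ZooKeeper.

    zkCli.sh -server master:2181 ls /hadoop-ha    # expected to list [cluster]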

11. Start the HDFS cluster. Execute on master:
start-dfs.sh
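
A rough verification sketch (the process names are what jps reports and may vary slightly by version):

    # On master and node1 expect: NameNode, DataNode, JournalNode, DFSZKFailoverController, QuorumPeerMain
    # On node2 expect:            DataNode, JournalNode, QuorumPeerMain
    jps
    # One NameNode should be active and the other standby
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2
    # Web UIs: http://master:50070 and http://node1:50070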

HA of YARN

In a Hadoop YARN cluster, the ResourceManager is responsible for tracking the resources in the cluster and scheduling applications (such as MapReduce jobs). Before Hadoop 2.4 there was only one ResourceManager per cluster, so if that machine went down the whole cluster was affected. The high-availability feature adds redundancy in the form of an active/standby ResourceManager pair that can fail over.

RMStateStore

ResourceManager HA consists of a pair of Active and Standby ResourceManagers, which persist internal state and the data and tokens of running applications through the RMStateStore.
The RMStateStore implementations currently supported include:
a memory-based store, a filesystem-based store, and the ZooKeeper-based ZKRMStateStore.
The architecture of ResourceManager HA is basically the same as that of NameNode HA: shared state is provided by the RMStateStore, while the failover controller (the equivalent of ZKFC) runs as a service inside the ResourceManager process rather than as an independent daemon.
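
With the ZooKeeper-based store configured as in the yarn-site.xml below (parent path /rmstore), the persisted ResourceManager state can be inspected from ZooKeeper. A sketch only; the child znode names depend on the Hadoop version:

    zkCli.sh -server master:2181 ls /rmstore
    zkCli.sh -server master:2181 ls /rmstore/ZKRMStateRoot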

YARN HA Construction

YARN high availability
1. Modify the configuration file
yarn-site.xml (modify the original as follows)

<configuration>
<!-- The auxiliary service running on the NodeManager must be set to mapreduce_shuffle in order to run MapReduce programs -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>

<!-- Keep aggregated logs for 7 days (value in seconds; -1 disables deletion) -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>


<!-- The following is the YARN HA configuration -->
<!-- Enable YARN HA -->
<property>
<name>yarn.resourcemanager.ha.enabled</name>
<value>true</value>
</property>

<!-- Enable automatic failover -->
<property>
<name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
<value>true</value>
</property>

<!-- Specify the YARN HA cluster id -->
<property>
<name>yarn.resourcemanager.cluster-id</name>
<value>yarncluster</value>
</property>

<!-- Name the two ResourceManagers -->
<property>
<name>yarn.resourcemanager.ha.rm-ids</name>
<value>rm1,rm2</value>
</property>

<!-- Configure the hosts for rm1 and rm2 -->
<property>
<name>yarn.resourcemanager.hostname.rm1</name>
<value>master</value>
</property>
<property>
<name>yarn.resourcemanager.hostname.rm2</name>
<value>node1</value>
</property>

<!-- Configure the ResourceManager web UI (HTTP) addresses -->
<property>
<name>yarn.resourcemanager.webapp.address.rm1</name>
<value>master:8088</value>
</property>	
<property>
<name>yarn.resourcemanager.webapp.address.rm2</name>
<value>node1:8088</value>
</property>

<!-- Configure the ZooKeeper quorum address -->
<property>
<name>yarn.resourcemanager.zk-address</name>
<value>master:2181,node1:2181,node2:2181</value>
</property>

<!-- Configure the ZooKeeper path where ResourceManager state is stored -->
<property>
<name>yarn.resourcemanager.zk-state-store.parent-path</name>
<value>/rmstore</value>
</property>

<!-- Enable ResourceManager recovery -->
<property>
<name>yarn.resourcemanager.recovery.enabled</name>
<value>true</value>
</property>

<!-- Store the ResourceManager state in ZooKeeper -->
<property>
<name>yarn.resourcemanager.store.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>

<!-- Enable NodeManager recovery -->
<property>
<name>yarn.nodemanager.recovery.enabled</name>
<value>true</value>
</property>

<!-- Configure the NodeManager IPC address -->
<property>
<name>yarn.nodemanager.address</name>
<value>0.0.0.0:45454</value>
</property>
</configuration>

Stop the yarn cluster: stop-yarn.sh

Synchronize to other nodes

	cd /usr/local/soft/hadoop-2.7.6/etc/hadoop
	scp ./* node1:`pwd`
	scp ./* node2:`pwd`

2. Start YARN. Execute on master:
start-yarn.sh

3. Start the second ResourceManager on node1:
/usr/local/soft/hadoop-2.7.6/sbin/yarn-daemon.sh start resourcemanager
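
To verify the YARN HA state (rm1/rm2 are the ResourceManager IDs from yarn-site.xml):

    yarn rmadmin -getServiceState rm1     # prints "active" or "standby"
    yarn rmadmin -getServiceState rm2
    # The standby ResourceManager's web UI (e.g. http://node1:8088) redirects to the active one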

HDFS FAQs

Cluster fails to start

  • Check the logs.

HDFS files cannot be operated on

  • Usually because HDFS is in safe mode
    • Leave safe mode: hdfs dfsadmin -safemode leave
    • Enter safe mode: hdfs dfsadmin -safemode enter
    • Check safe mode status: hdfs dfsadmin -safemode get

MapReduce (1)

MapReduce overview

  • MapReduce is a distributed computing model proposed by Google. It originated in the search field and is mainly used to solve computation over massive amounts of data.
  • MapReduce runs in a distributed way and consists of two stages: Map and Reduce. The Map stage is an independent program that runs on many nodes at the same time, with each node processing part of the data. The Reduce stage is also an independent program that runs on many nodes at the same time, with each node processing part of the data [Reduce can be understood as a separate aggregation program].
  • The MapReduce framework provides default implementations; users only need to override the map() and reduce() functions to get distributed computing, which is very simple.
  • The parameters and return values of these two functions are <key, value> pairs, so be sure to construct <K, V> pairs when using them.

Complete MapReduce diagram

Mapper, Shuffle, Reducer (MapReduce diagram)

Ring buffer
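
Map output is first collected into an in-memory circular (ring) buffer and spilled to disk when the buffer fills up; in Hadoop 2.x the buffer size and spill threshold are controlled by mapreduce.task.io.sort.mb (default 100 MB) and mapreduce.map.sort.spill.percent (default 0.80). A hedged example of overriding them per job, using the bundled examples jar (whose driver parses -D generic options; the jar path assumes a standard 2.7.6 layout):

    hadoop jar /usr/local/soft/hadoop-2.7.6/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount \
        -D mapreduce.task.io.sort.mb=200 \
        -D mapreduce.map.sort.spill.percent=0.80 \
        /input1 /output_sort_test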


MapReduce task - a simple WordCount

package com.shujia.MapReduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
/**
 * read file
 * Count the number of each word
 */
public class Demo2WordCount {
    //  Mapper<LongWritable, Text, Text, LongWritable>
    // Mapper < type of key input Map, type of value input Map, type of key output Map, type of value output Map >
    // The default inputformat of Map is TextInputFormat
    // TextInputFormat: the offset of each row of data will be input to the Map side as key and each row as value
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // Implement your own Map logic

        /**
         * @param key: The input key to the Map
         * @param value: The input value to the Map
         * @param context: The context of the MapReduce program; the Map output is sent to the Reduce side through this context
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
            // Get a row of data
            // hadoop,hive,hbase => hadoop,1  hive,1 hbase,1
            String line = value.toString();
            // Separated by commas
            String[] splits = line.split(",");
            for (String word : splits) {
                // Send the result to the reduce end
                context.write(new Text(word), new LongWritable(1));
            }

        }
    }

    // Reducer<Text, LongWritable,Text, LongWritable>
    // Reducer < type of Key output from Map, type of Value output from Map, type of Key output from Reduce, type of Value output from Reduce >
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        /**
         * @param key: A key output from the Map side after grouping (equivalent to a group by on the Map output keys)
         * @param values: The set of values corresponding to that grouped key
         * @param context: The context of the MapReduce program; the Reduce output is finally written to HDFS through this context
         * @throws IOException
         * @throws InterruptedException
         */
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
            // hadoop,{1,1,1,1,1}
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            // Output the final result
            context.write(key, new LongWritable(sum));
        }
    }

    // Assemble MapReduce task
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // Use '#' as the separator between the key and value in the final reduce output
        // (the new-API TextOutputFormat reads mapreduce.output.textoutputformat.separator)
        conf.set("mapreduce.output.textoutputformat.separator", "#");
        // Make other configurations

        // Create a Job
        Job job = Job.getInstance(conf);

        // Set the number of reduce
        job.setNumReduceTasks(2);

        // Set the name of the Job
        job.setJobName("MyWordCountMapReduceApp");
        // Set the classes that MapReduce runs
        job.setJarByClass(Demo2WordCount.class);

        // Configure Map
        // Configure the classes that Map Task runs
        job.setMapperClass(MyMapper.class);
        // Set the type of key output by the Map task
        job.setMapOutputKeyClass(Text.class);
        // Set the type of value output by the Map task
        job.setMapOutputValueClass(LongWritable.class);

        // Configure Reduce
        // Configure the classes that Reduce Task runs
        job.setReducerClass(MyReducer.class);
        // Set the type of key output by the Reduce task
        job.setOutputKeyClass(Text.class);
        // Set the type of value output by the Reduce task
        job.setOutputValueClass(LongWritable.class);

        // Configure I / O path
        // Take the first input parameter as the input path and the second parameter as the output path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the task and wait for the run to end
        job.waitForCompletion(true);

    }
    /**
     * Package and upload to master
     * Prepare the input data and upload it to /input1 on HDFS
     *  hadoop jar <path-to-jar> com.shujia.MapReduce.Demo2WordCount /input1 /output1
     */
}
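
A rough end-to-end run, following the comment at the bottom of the class (the jar name and the input file name are assumptions):

    # Prepare input data on HDFS
    hadoop fs -mkdir -p /input1
    hadoop fs -put words.txt /input1        # words.txt: comma-separated words per line
    # Submit the job (package the project into a jar first and upload it to master)
    hadoop jar hadoop-mr-demo.jar com.shujia.MapReduce.Demo2WordCount /input1 /output1
    # Two reduce tasks were configured, so two output files are produced
    hadoop fs -ls /output1
    hadoop fs -cat /output1/part-r-*        # each line: word, separator, count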


Previous chapter - hadoop - day45 HDFS parsing and zookeeper installation
Next chapter - to be updated when I get to it

It's said that amazing things happen to people who give this post a thumbs up 👍: they found a partner within a month 💑💑💑, won the grand prize 💴$$$, got full marks on their exams 💯, and suddenly became better looking 😎. Although you probably don't need that last one, do you, Wu Yanzu 🤵!

Tags: Big Data Hadoop HA

Posted on Wed, 01 Sep 2021 19:20:23 -0400 by deras