Senior Big Data Development Engineer - Hadoop Learning Notes

Hadoop advanced level

YARN: Hadoop resource scheduling system

What is YARN

  • Apache Hadoop YARN (Yet Another Resource Negotiator) is a subproject of Hadoop, introduced in Hadoop 2.0 to separate resource management from the computation components.
  • YARN is general-purpose enough that other distributed computing frameworks can also run on it.

Analysis of YARN architecture

  • Similar to HDFS, YARN is also a classic Master/Slave architecture.
  • A YARN cluster consists of one ResourceManager (RM) and multiple NodeManagers (NM).
  • The ResourceManager is the master node and the NodeManagers are the slave nodes.

1. ResourceManager

  • The ResourceManager (RM) is the master role in YARN. It is the global resource manager; only one active RM serves the cluster at a time.
    • Responsible for resource management and allocation of the whole system;
    • Processing client requests;
    • Start / monitor ApplicationMaster;
    • Monitor NodeManager, resource allocation and scheduling.
  • It is mainly composed of two components:
    • Scheduler;
    • Application Manager (ASM).
  • Scheduler: allocates resources in the system to running applications subject to constraints such as queues and capacities (for example, allocating certain resources to each queue, or executing at most a certain number of jobs). Note that the Scheduler is a "pure scheduler":
    • It does not engage in any work related to specific applications, such as monitoring or tracking the execution status of applications;
    • It is not responsible for restarting tasks that fail because of application errors or hardware faults; that is handled by the application's ApplicationMaster;
    • The scheduler allocates resources according to the resource requirements of each application, and the resource allocation unit is represented by an abstract concept "Resource Container" (hereinafter referred to as container). Container is a dynamic resource allocation unit, which encapsulates resources such as memory, CPU, disk and network, so as to limit the amount of resources used by each task.
  • Application Manager: it is mainly responsible for managing all applications in the whole system, receiving Job submission requests, and assigning the first Container to the application to run the ApplicationMaster
    • Including application submission;
    • Negotiating resources with the Scheduler to start the ApplicationMaster;
    • Monitor the running status of the ApplicationMaster and restart it in case of failure, etc.

2. NodeManager

  • NodeManager is the slave role in YARN:
    • When a node starts, it will register with the resource manager and tell the resource manager how many resources it has available;
    • Each compute node runs a NodeManager process that reports the node's resource status (disk, memory, CPU and other usage information) via heartbeat (controlled by yarn.resourcemanager.nodemanagers.heartbeat-interval-ms, once per second by default)
  • Function:
    • Receive and process command requests from the resource manager, and assign a Container to a task of the application;
    • NodeManager monitors the resource usage on this node and the running status of each Container (cpu, memory and other resources);
    • Monitor and report Container usage information to resource manager;
    • Report to RM regularly to ensure the smooth operation of the whole cluster. RM tracks the health status of the whole cluster by collecting the report information of each NodeManager, and NodeManager is responsible for monitoring its own health status;
    • Handle requests from the ApplicationMaster (a container-launch sketch follows this section);
    • Manages the lifecycle of each Container on the node.
  • Manage logs on each node:
    • While applications run, the NodeManager and ResourceManager work together to keep this information continuously updated, which helps keep the whole cluster performing well;
    • The NodeManager only manages its own Containers; it knows nothing about the applications running in them. The component that manages application information is the ApplicationMaster.
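  • As a rough illustration of the NodeManager's role in launching Containers, here is a minimal, hypothetical sketch of the AM-side NMClient call that asks a NodeManager to start a container. This is not code from these notes; it assumes the hadoop-yarn-client dependency, must run inside an ApplicationMaster (where the NM tokens are available), and uses a placeholder echo command:
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LaunchContainerSketch {
    // Hypothetical helper: 'container' would come from an AMRMClient allocation.
    public static void launch(Container container) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // Describe what the container should run: local resources, environment, and the command line.
        Map<String, LocalResource> localResources = Collections.emptyMap();
        Map<String, String> env = Collections.emptyMap();
        List<String> commands = Collections.singletonList("echo hello-from-container");
        ContainerLaunchContext ctx = ContainerLaunchContext.newInstance(
                localResources, env, commands, null, null, null);

        // The NodeManager receives this request from the AM and starts the container.
        nmClient.startContainer(container, ctx);
    }
}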

3. Container

  • Container is the resource abstraction in YARN; YARN allocates resources in units of Containers. A Container encapsulates multi-dimensional resources on a node, such as memory, CPU, disk and network. When the AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers.
  • YARN will assign a Container to each task, and the task can only use the specified number of resources in the Container.
  • The relationship between the Container and the cluster NodeManager node is:
    • A NodeManager node can run multiple containers, but one Container will not cross nodes;
    • Any Job or Application must run in one or more containers;
    • In the YARN framework, the ResourceManager only tells the ApplicationMaster which Containers it may use; the ApplicationMaster still has to ask the NodeManager to actually start each specific Container (a Container resource-request sketch follows this section).
  • It should be noted that:
    • Container is a dynamic resource division unit, which is dynamically generated according to the requirements of the application;
    • So far, YARN only supports CPU and memory resources, and uses the lightweight resource isolation mechanism Cgroups for resource isolation.
  • Function:
    • Abstraction of task environment;
    • Describe a series of information;
    • Collection of task running resources (cpu, memory, io, etc.);
    • Task running environment.
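  • To make the Container abstraction concrete, the following is a minimal sketch (not from the original notes) of how an AM describes a Container's resource demand through the AMRMClient API; the 1024 MB / 1 vcore figures and priority 0 are illustrative only:
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ContainerRequestSketch {
    public static ContainerRequest buildRequest() {
        // A Container encapsulates a multi-dimensional resource slice: here 1024 MB of memory and 1 virtual core.
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);
        // nodes/racks left null: let the scheduler pick any node (a map task would pass its split's hosts here
        // to get data locality).
        return new ContainerRequest(capability, null, null, priority);
    }
}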

4. ApplicationMaster

  • Function:
    • Obtain the input split data;
    • Request resources for the application and further allocate them to internal tasks;
    • Task monitoring and fault tolerance;
    • Negotiates resources with the ResourceManager and, together with the NodeManagers, monitors container execution and resource usage.
  • Communication between ApplicationMaster and ResourceManager:
    • It is the core part of the whole Yarn application from submission to operation, and it is the fundamental step for Yarn to carry out dynamic resource management for the whole cluster;
    • The ApplicationMaster periodically sends heartbeats to the ResourceManager so that the RM can confirm the AM is still healthy (a heartbeat-loop sketch follows this list);
    • YARN's dynamic nature comes from the ApplicationMasters of multiple applications continuously communicating with the ResourceManager to apply for, release, and re-apply for resources.
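  • A minimal sketch of the AM-to-RM conversation described above, using the AMRMClient API (an assumption-laden example, not code from these notes: it must run inside an ApplicationMaster container, and the progress value and stop condition are illustrative only). Each allocate() call doubles as the heartbeat:
import java.util.List;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRmLoopSketch {
    public static void run(ContainerRequest request) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();

        // Register with the RM; afterwards the client can discover this AM and talk to it directly.
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for resources, then heartbeat: each allocate() call both reports progress
        // (so the RM knows the AM is alive) and returns newly allocated containers.
        rmClient.addContainerRequest(request);
        boolean done = false;
        while (!done) {
            AllocateResponse response = rmClient.allocate(0.1f); // 0.1 = reported progress (illustrative)
            List<Container> allocated = response.getAllocatedContainers();
            // ... launch tasks in the allocated containers via NMClient, release what is no longer needed ...
            done = !allocated.isEmpty(); // illustrative stop condition only
            Thread.sleep(1000);
        }

        // Deregister when the application is finished, returning all resources to the cluster.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}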

5. JobHistoryServer

  • Job history service: records the history of jobs scheduled on YARN. Completed Hadoop jobs can be inspected through the history server; when something fails, its logs are the first place to look.

Configure and enable the history service JobHistoryServer

  • Step 1: modify mapred-site.xml and add the following
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>node01:10020</value>
</property>

<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>node01:19888</value>
</property>
  • Step 2: modify yarn-site.xml and add the following contents
<!-- Enable log aggregation: when an application finishes, the logs of each container are aggregated and saved to the file system, usually HDFS -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>
<!-- How long aggregated logs are retained before being deleted -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>2592000</value><!-- 30 days -->
</property>
<!-- How many seconds to keep container logs on the NodeManager. Only effective when log aggregation (yarn.log-aggregation-enable) is NOT enabled; since aggregation is enabled above, this block stays commented out:
<property>
    <name>yarn.nodemanager.log.retain-seconds</name>
    <value>604800</value>
</property>
-->
<!-- Specifies the compression algorithm for logs generated by aggregation -->
<property>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>gz</value>
</property>
<!-- NodeManager local file storage directory -->
<property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/bigdata/install/hadoop-3.1.4/hadoopDatas/yarn/local</value>
</property>
<!-- Maximum number of completed applications that the ResourceManager keeps -->
<property>
    <name>yarn.resourcemanager.max-completed-applications</name>
    <value>1000</value>
</property>
<!-- URL of the log server, i.e. the JobHistoryServer web UI configured in mapred-site.xml -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://node01:19888/jobhistory/logs</value>
</property>
  • Step 3: synchronize the modified files to the other machines:
cd /bigdata/install/hadoop-3.1.4/etc/hadoop
scp mapred-site.xml yarn-site.xml node02:$PWD
scp mapred-site.xml yarn-site.xml node03:$PWD

  • Step 4: restart the yarn and jobhistory services
cd /bigdata/install/hadoop-3.1.4/
sbin/stop-yarn.sh
sbin/start-yarn.sh
mapred --daemon stop historyserver
mapred --daemon start historyserver
  • Submit the MR program and view the application log:
hadoop jar hadoop-demo-1.0.jar com.yw.hadoop.mr.p13_map_join.MapJoinMain /order.txt /map_join_out

  • Click history to view:

6. TimelineServer

  • It provides a service for storing and retrieving application timeline data, and is generally used together with third-party frameworks such as Spark.
  • It is an effective complement to the JobHistoryServer, which can only record information about MapReduce jobs.
  • It records not only the information the JobHistoryServer records while a job runs, but also more fine-grained information, such as the queue the task ran in and the user settings used when running the task.
  • The TimelineServer is more powerful, but it is not a substitute for the JobHistoryServer; the two complement each other (a small write sketch follows this list).
  • Official website tutorial
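  • A minimal sketch of writing custom data to the TimelineServer through the TimelineClient API (Timeline Service v1); the entity type, id and attributes below are made-up examples, and it assumes yarn.timeline-service.enabled is true with a running TimelineServer:
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TimelinePutSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        // Requires yarn.timeline-service.enabled=true and a reachable TimelineServer.
        TimelineClient client = TimelineClient.createTimelineClient();
        client.init(conf);
        client.start();

        // A hypothetical entity recording extra, fine-grained information about a run.
        TimelineEntity entity = new TimelineEntity();
        entity.setEntityType("DEMO_APP");           // illustrative type
        entity.setEntityId("demo_run_001");         // illustrative id
        entity.setStartTime(System.currentTimeMillis());
        entity.addPrimaryFilter("queue", "hadoop"); // e.g. the queue the task ran in
        entity.addOtherInfo("user", "hadoop");

        client.putEntities(entity);
        client.stop();
    }
}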

YARN application operation principle

1. YARN application submission process

  • The execution of an Application in YARN can be summarized in three steps:
    • Application submission;
    • Start the ApplicationMaster instance of the application;
    • The ApplicationMaster instance manages the execution of the application
  • The specific submission process is as follows:

  • The client program submits an application to the resource manager and requests an ApplicationMaster instance;
  • The resource manager finds a NodeManager that can run a Container and starts the ApplicationMaster instance in the Container;
  • The ApplicationMaster registers with the ResourceManager. After registration, the client can query the ResourceManager for the details of its ApplicationMaster and then interact with it directly (from this point on, the client communicates actively with the ApplicationMaster, and the application first hands its resource requirements to the ApplicationMaster);
  • The ApplicationMaster sends resource requests to the ResourceManager according to the resource-request protocol;
  • After the Container is successfully allocated, the ApplicationMaster starts the Container by sending the Container launch specification information to the NodeManager. The Container launch specification information contains the data required for the Container and ApplicationMaster to communicate;
  • The code of the application runs in the launched Container in the form of task, and sends the running progress, status and other information to the ApplicationMaster through the application specific protocol;
  • During the operation of the application, the client submitting the application actively communicates with the ApplicationMaster to obtain the operation status, progress update and other information of the application. The communication protocol is also an application specific protocol;
  • After the application has finished and all related work is complete, the ApplicationMaster deregisters from the ResourceManager, shuts down, and returns all of the Containers it used to the system.
  • Brief summary:
    • 1. The user submits the application to the resource manager;
    • 2. The ResourceManager applies for resources for the application ApplicationMaster, communicates with a NodeManager, and starts the first Container to start the ApplicationMaster;
    • 3. The ApplicationMaster registers and communicates with the ResourceManager to apply for resources for its internal tasks; once resources are obtained, it communicates with the NodeManager to start the corresponding Task;
    • 4. After all tasks are completed, the ApplicationMaster deregisters from the ResourceManager and shuts down, and the whole application finishes (a client-side submission sketch follows this summary).
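  • The same submission flow, sketched with the YarnClient API (a hypothetical, minimal example; the application name, AM command and resource figures are placeholders, not values from these notes):
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitAppSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 1 of the flow above: the client asks the RM for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("demo-app");   // illustrative name
        appContext.setQueue("default");

        // The AM container spec: the command the NodeManager should run to start the ApplicationMaster.
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                Collections.emptyMap(), Collections.emptyMap(),
                Collections.singletonList("echo start-my-application-master"), // placeholder command
                null, null, null);
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1)); // AM container: 1 GB, 1 vcore (illustrative)

        // Step 2: the RM finds a NodeManager that can run a Container and launches the AM there.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}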

2. MapReduce on YARN

  • flow chart:

Submit job
  • ① Print the program into a jar package, run the hadoop jar command on the client, and submit the job to the cluster for operation:
    • In job.waitForCompletion(true), Job's submit() is called, which in turn calls JobSubmitter's submitJobInternal() method (see the minimal driver sketch after this subsection).
  • ② submitClient.getNewJobID() requests an MR job id from resourcemanager:
    • Check output directory: if no output directory is specified or the directory already exists, an error is reported;
    • Calculate job slice: if slice cannot be calculated, an error will also be reported;
  • ③ The resources needed to run the job, such as the job's jar package, configuration file and input splits, are uploaded to a directory on HDFS named after the job ID (the jar's replication factor is 10 by default, so that when the job's tasks, such as map tasks and reduce tasks, run, they can read the jar from any of these 10 copies);
  • ④ Call submitApplication() of resourcemanager to submit the job;
  • The client queries the job progress every second (map 50% reduce 0%). If the progress changes, print the progress report on the console;
  • If the job is successfully executed, print the relevant counter;
  • However, if the job fails, print the cause of the job failure on the console (learn to view the log, locate the problem, analyze the problem and solve the problem);
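  • A minimal MapReduce driver sketch illustrating the submission path described above (a generic example, not the MapJoinMain job from these notes; no mapper/reducer classes are set, so Hadoop's identity Mapper/Reducer would run):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverSketch {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "driver-sketch");
        job.setJarByClass(DriverSketch.class);
        // In a real job, add job.setMapperClass(...), job.setReducerClass(...) and the output key/value classes.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        // The output directory must not already exist, otherwise submission fails (step ② above).
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion(true) calls Job.submit(), which delegates to JobSubmitter.submitJobInternal():
        // request a job id from the RM, check the output dir, compute splits, upload jar/config/splits to HDFS,
        // then call submitApplication(); passing 'true' makes the client poll and print progress every second.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}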
Initialize job
  • When the resource manager (hereinafter referred to as RM) receives the call notification of the submitApplication() method, the request is passed to the scheduler of RM, which allocates a container;
  • ⑤ RM communicates with the specified NodeManager and notifies NodeManager to start the container;
    • After receiving the notification, the NodeManager creates a container occupying specific resources;
    • Then run the MRAppMaster process in the container;
  • ⑥ MRAppMaster needs to accept progress and completion reports from the tasks (map tasks and reduce tasks), so the appMaster creates multiple bookkeeping objects to record this information;
  • ⑦ Obtain the input slice split calculated by the client from HDFS
    • Create a map task for each fragment split;
    • Determine how many reduce tasks the current MR job should create from the mapreduce.job.reduces property value (set with job.setNumReduceTasks() when programming);
    • Each task (map, reduce) has a task id;
Task assignment
  • If the job is small, the appMaster may run it in uberized mode, deciding to execute the MR job's tasks sequentially in its own JVM:
    • The reasoning: if every task ran in a separate JVM, each JVM would have to be started and given resources (memory and CPU), which takes time, and for a small job the parallelism gained by running tasks in their own JVMs is not worth that overhead;
    • If executing all tasks sequentially in the appMaster's JVM is more efficient, the appMaster will do so, and the tasks run as uber tasks;
    • Before running any task, the appMaster creates the job's OutputCommitter and calls its setupJob() method, which creates the job's final output directory (usually on HDFS) and the temporary directories for task output (such as the intermediate output directories of map tasks);
    • Criteria for a small job: fewer than 10 map tasks, only one reduce task, and an MR input smaller than one HDFS block
    • How to enable uber mode? Set the property mapreduce.job.ubertask.enable to true:
configuration.set("mapreduce.job.ubertask.enable", "true");
  • ⑧ If the job does not run in the uber task mode, the appMaster will request the RM for a container for each task (map task, reduce task) in the job:
    • Since all map tasks must complete before the reduce tasks can enter their sort phase, container requests for map tasks take precedence over those for reduce tasks;
    • Container requests for reduce tasks begin once 5% of the map tasks have completed;
    • When requesting a container for a map task, data locality is followed: the scheduler tries to place the container on the node holding the map task's input split (move the computation, not the data);
    • The reduce task can run on any computing node in the cluster;
    • By default, each map task and reduce task is allocated 1 GB of memory and 1 virtual core, controlled by the properties mapreduce.map.memory.mb, mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores (a configuration sketch follows this list)
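  • A configuration sketch collecting the properties mentioned in this subsection, plus the related uber-task threshold properties (mapreduce.job.ubertask.maxmaps / maxreduces, standard MapReduce settings not spelled out in the notes); the values simply restate the defaults discussed above and would normally only be set when you want to change them:
import org.apache.hadoop.conf.Configuration;

public class TaskResourceConfigSketch {
    public static Configuration build() {
        Configuration configuration = new Configuration();

        // Uber mode: run a small job's tasks sequentially inside the AM's JVM.
        configuration.set("mapreduce.job.ubertask.enable", "true");
        configuration.set("mapreduce.job.ubertask.maxmaps", "9");    // "small" = fewer than 10 map tasks
        configuration.set("mapreduce.job.ubertask.maxreduces", "1"); // at most one reduce task

        // Start requesting reduce containers once 5% of the map tasks have completed.
        configuration.set("mapreduce.job.reduce.slowstart.completedmaps", "0.05");

        // Per-task container size: 1 GB of memory and 1 virtual core by default.
        configuration.set("mapreduce.map.memory.mb", "1024");
        configuration.set("mapreduce.reduce.memory.mb", "1024");
        configuration.set("mapreduce.map.cpu.vcores", "1");
        configuration.set("mapreduce.reduce.cpu.vcores", "1");
        return configuration;
    }
}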
Task execution
  • When the scheduler assigns a container on a NodeManager (call it NM01) to the current task, it passes this information to the appMaster; the appMaster communicates with NM01 and tells it to start a container occupying a specific amount of resources (memory and CPU);
  • After receiving the message, NM01 starts the container, which occupies the specified amount of resources;
  • Run YarnChild in the container, and YarnChild runs the current task (map, reduce);
  • ⑩ Before running the task in the container, pull the resources required to run the task locally, such as job JAR files, configuration files and files in the distributed cache;
Job task progress and status update
  • The job and each task have status (running, successfully completed, failed), the running progress of the current task and the job counter;
  • During the operation of the task, report the execution progress and status (including counters) to the appMaster every 3 seconds;
  • appMaster summarizes the reported results of all currently running tasks;
  • Every 1 second, the client polls and accesses the appMaster to obtain the latest status of job execution. If there is any change, it will be printed on the console;
Job completion
  • After receiving the report on the completion of the last task, the appMaster sets the job status to success;
  • When the client polls the appMaster for progress and finds that the job has completed successfully, the program returns from waitForCompletion();
  • All statistical information of the job is printed on the console;
  • The appMaster and the containers that ran the tasks clean up their intermediate output and release their resources;
  • The job information is saved by the history server for future user query.

3. YARN application life cycle

RM: ResourceManager; AM: ApplicationMaster; NM: NodeManager

  • ① Client submits application to RM, including AM program and command to start AM.
  • ② RM allocates the first container for the AM and communicates with the corresponding NM to start the application's AM in that container.
  • ③ When AM starts, it registers with RM, allowing the Client to obtain AM information from RM and then communicate directly with AM.
  • ④ AM negotiates container resources for applications through resource request protocol.
  • ⑤ If the container allocation is successful, AM requires NM to start the application in the container. After the application is started, it can communicate with AM independently.
  • ⑥ The application executes in the container and reports to the AM.
  • ⑦ During application execution, Client and AM communicate to obtain application status.
  • ⑧ After the application finishes, the AM deregisters from the RM and shuts down, freeing its resources.
  • Summary: apply for resources ==> start the ApplicationMaster ==> apply for containers to run tasks ==> distribute tasks ==> run tasks ==> finish tasks ==> reclaim containers.

YARN scheduler

1. Functions of resource scheduler

  • Resource scheduler is one of the core components of YARN. It is a plug-in service component, which is responsible for the management and allocation of resources in the whole cluster. YARN provides three available resource schedulers: FIFO, Capacity Scheduler and Fair Scheduler.

2. Introduction of three schedulers

First in first out scheduler (FIFO)
  • FIFO Scheduler arranges applications into a queue in the order of submission, which is a first in first out queue
    • When allocating resources, first allocate resources to the top application in the queue
    • After the top application requirements are met, it will be allocated to the next one, and so on.
  • FIFO Scheduler is the simplest and easiest to understand scheduler and does not require any configuration, but it is not suitable for shared clusters.
    • Large applications may occupy all cluster resources, which will cause other applications to be blocked.
    • In shared clusters, the Capacity Scheduler or Fair Scheduler is more suitable; both allow long-running jobs and small jobs submitted at the same time to each obtain a share of system resources.
  • It can be seen that in FIFO scheduler, small tasks will be blocked by large tasks.

Capacity scheduler

Fair scheduler

3. Customize the queue to achieve different queues for task submission

It is recommended to take snapshots of the cluster virtual machines before doing anything uncertain on the cluster.

  • As we have seen, Hadoop offers several resource scheduling schemes. By default, Hadoop tasks are submitted to the default queue, so every task shares it. In real work we can divide resources among different users by defining queues, so that different users' tasks are submitted to different queues and their resources are isolated.
  • At present, there are two kinds of resource isolation, static isolation and dynamic isolation:
    • Static isolation means isolating services through cgroups (Linux control groups). For example, the Hadoop services include HDFS, HBase, YARN, etc., and we fix the split at, say, HDFS: 20%, HBase: 40%, YARN: 40%. The system then divides the cluster's CPU, memory and IO according to these ratios. IO cannot really be partitioned, so under IO pressure it comes down to who grabs the resources first; CPU and memory are pre-allocated.
    • This kind of fixed-ratio split is static partitioning, and on reflection it has too many drawbacks. Suppose I split resources in advance by some ratio, but I mainly run MapReduce at night and HBase during the day: static partitioning cannot support that well.
    • Dynamic isolation mainly targets YARN and Impala; "dynamic" here is only relative to static, not truly dynamic. Consider YARN first: what are the main services in a Hadoop environment? MapReduce (note again that MapReduce is an application and YARN is a framework), HBase, Hive, Spark, HDFS and Impala. Some may mention Oozie, ES, Storm, Kylin, Flink and so on, but those are far removed from YARN, do not rely on YARN's resources and services, and are deployed separately, so they are not very relevant here. What matters for YARN is mainly Hive, Spark and MapReduce, which are also the most heavily used services today (HBase is heavily used too, but it has nothing to do with YARN).
  • From the description above you should see why so-called dynamic isolation mainly targets YARN. Since YARN accounts for so much, isolating YARN's resources is worthwhile. If three departments need to use Hadoop, I can set resource priorities per department, again by ratio: create three queue names, with 30% for the development department, 50% for the data analysis department and 20% for the operations department.
  • After setting the ratios, set mapreduce.job.queuename when submitting a JOB and the JOB will enter the specified queue. By default, when a JOB is submitted to YARN the placement rule is root.users.username; if the queue does not exist, a queue with that name is generated automatically. Once the queues are set up, ACLs control who may submit or kill jobs in them.
  • Judging from these two resource isolation approaches, neither works perfectly. If I had to choose one, I would choose the latter and isolate YARN's resources; the first approach of fixed service partitioning really cannot support current business needs.
  • Requirement: a cluster is often used by multiple users. For example, developers need to submit tasks, testers need to submit tasks, and colleagues in other departments also need to submit tasks to the cluster. To support multiple users submitting tasks at the same time, we can configure YARN's multi-user resource isolation.
View default Submission Scheme

Step 1: node01 edit yarn-site.xml
$ pwd
/bigdata/install/hadoop-3.1.4/etc/hadoop
vim yarn-site.xml
  • Add the following:
<!-- Use the FairScheduler as the task scheduler; Apache Hadoop defaults to the Capacity Scheduler, CDH defaults to the Fair Scheduler -->
<property>
	<name>yarn.resourcemanager.scheduler.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

<!--  Specify the configuration file path for our task scheduling  -->
<property>
	<name>yarn.scheduler.fair.allocation.file</name>
	<value>/bigdata/install/hadoop-3.1.4/etc/hadoop/fair-scheduler.xml</value>
</property>

<!-- Whether to enable resource preemption. If enabled, preemption is allowed once overall cluster resource utilization
exceeds yarn.scheduler.fair.preemption.cluster-utilization-threshold; this includes calculating how many resources
to preempt and then starting the preemption -->
<property>
	<name>yarn.scheduler.fair.preemption</name>
	<value>true</value>
</property>
<!-- The cluster utilization ratio above which preemption may start -->
<property>
	<name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
	<value>0.8f</value>
</property>

<!-- If set to true and no queue name is specified, the application is submitted to a queue named after the submitting user; if set to false or left unset, it is submitted to the default queue. If a queue placement policy is specified in the allocation file, this property is ignored -->
<property>
	<name>yarn.scheduler.fair.user-as-default-queue</name>
	<value>true</value>
	<description>default is True</description>
</property>

<!-- Whether queues may be created when an application is submitted. If true, creation is allowed; if false and the requested queue is not defined in the allocation file, the application is submitted to the default queue. Defaults to true; if a queue placement policy is defined in the allocation file, this property is ignored -->
<property>
	<name>yarn.scheduler.fair.allow-undeclared-pools</name>
	<value>false</value>
	<description>default is True</description>
</property>
Step 2: node01 add the fair-scheduler.xml configuration file
vim fair-scheduler.xml
  • The contents are as follows:
<?xml version="1.0"?>
<allocations>
	<!-- The default scheduling policy for apps within each queue is fair -->
	<defaultQueueSchedulingPolicy>fair</defaultQueueSchedulingPolicy>

	<user name="hadoop">
		<!-- Maximum number of apps the user hadoop may run concurrently -->
		<maxRunningApps>30</maxRunningApps>
	</user>
	<!-- If a user has no explicit maximum, userMaxAppsDefault limits the number of apps that user may run by default -->
	<userMaxAppsDefault>10</userMaxAppsDefault>

	<!-- Define our queues -->
	<!--
	weight
	The queue's weight in the resource pool; the default weight is 1.

	aclSubmitApps
	The users and groups allowed to submit tasks to this queue.
	The format is "user1,user2 group1,group2" or " group1,group2" -> if only groups are listed, put a space before the group names.
	A user may submit applications to this queue if the user or one of the user's groups appears in this queue's ACLs or in the ACLs of one of its parent queues.
	In the example below:
		xiaoli may submit applications to queue a; xiaofan is in the ACLs of a's parent queue dev, so xiaofan may also submit applications to queue a.
		For instance, with the queue hierarchy root.dev.a defined as:
		<queue name="dev">
			...
			<aclSubmitApps>xiaofan</aclSubmitApps>
			...
			<queue name="a">
				...
				<aclSubmitApps>xiaoli</aclSubmitApps>
			</queue>
		</queue>

	aclAdministerApps
	The users and groups allowed to administer tasks in this queue; currently the only administrative action is killing an application.
	The format is the same as above.
	 -->
	<queue name="root">
		<minResources>512 mb,4 vcores</minResources>
		<maxResources>102400 mb,100 vcores</maxResources>
		<maxRunningApps>100</maxRunningApps>
		<weight>1.0</weight>
		<schedulingMode>fair</schedulingMode>
		<aclSubmitApps> </aclSubmitApps>
		<aclAdministerApps> </aclAdministerApps>

		<queue name="default">
			<minResources>512 mb,4 vcores</minResources>
			<maxResources>30720 mb,30 vcores</maxResources>
			<maxRunningApps>100</maxRunningApps>
			<schedulingMode>fair</schedulingMode>
			<weight>1.0</weight>
			<!-- If no queue is specified for a task, it is submitted to this default queue -->
			<aclSubmitApps>*</aclSubmitApps>
		</queue>

		<queue name="hadoop">
			<minResources>512 mb,4 vcores</minResources>
			<maxResources>20480 mb,20 vcores</maxResources>
			<maxRunningApps>100</maxRunningApps>
			<schedulingMode>fair</schedulingMode>
			<weight>2.0</weight>
			<aclSubmitApps>hadoop hadoop</aclSubmitApps>
			<aclAdministerApps>hadoop hadoop</aclAdministerApps>
		</queue>

		<queue name="develop">
			<minResources>512 mb,4 vcores</minResources>
			<maxResources>20480 mb,20 vcores</maxResources>
			<maxRunningApps>100</maxRunningApps>
			<schedulingMode>fair</schedulingMode>
			<weight>1</weight>
			<aclSubmitApps>develop develop</aclSubmitApps>
			<aclAdministerApps>develop develop</aclAdministerApps>
		</queue>

		<queue name="test1">
			<minResources>512 mb,4 vcores</minResources>
			<maxResources>20480 mb,20 vcores</maxResources>
			<maxRunningApps>100</maxRunningApps>
			<schedulingMode>fair</schedulingMode>
			<weight>1.5</weight>
			<aclSubmitApps>test1,hadoop,develop test1</aclSubmitApps>
			<aclAdministerApps>test1 group_businessC,supergroup</aclAdministerApps>
		</queue>
	</queue>
	<!--
		Contains a series of rule elements; these rules tell the scheduler which queue an incoming app should be placed in.
		If there are multiple rules, they are matched from top to bottom.
		A rule may have arguments; all rules accept a create argument that says whether the rule may create a new queue (default true).
		If a rule's create is false and the queue is not configured in the allocation file, the next rule is tried.
	-->
	<queuePlacementPolicy>
		<!-- The app is submitted to the queue it specifies; if that queue is not defined in this file and create is true, it is created; if create is false, a missing queue is not created and the next rule is matched -->
		<rule name="specified" create="false"/>
		<!-- The app is submitted to the queue named after the group of the user submitting the app; if that queue does not exist it is not created, and matching continues with the next rule -->
		<rule name="primaryGroup" create="false" />
		<!-- If none of the rules above match, the app is submitted to the queue given by the queue attribute; if no queue attribute is specified, the default is root.default -->
		<rule name="default" queue="root.default"/>
	</queuePlacementPolicy>
</allocations>
Step 3: copy the modified configuration file to other machines
scp yarn-site.xml fair-scheduler.xml node02:$PWD
scp yarn-site.xml fair-scheduler.xml node03:$PWD
Step 4: restart the yarn cluster
cd /bigdata/install/hadoop-3.1.4/
sbin/stop-yarn.sh
sbin/start-yarn.sh
Step 5: modify the task submission queue
  • Modify the code to specify the queue our MapReduce task should be submitted to:
Configuration configuration = new Configuration();

//Case 1
//Do not set a queue name (leave configuration.set("mapreduce.job.queuename", "hadoop") commented out)
//Matching rule: <rule name="primaryGroup" create="false"/>

//Case 2
configuration.set("mapreduce.job.queuename", "hadoop");
//Matching rule: <rule name="specified" create="false"/>

//Case 3
configuration.set("mapreduce.job.queuename", "hadoopv1");
//In the allocation file, comment out <rule name="primaryGroup" create="false"/>, then refresh the configuration: yarn rmadmin -refreshQueues
//Matching rule: <rule name="default" queue="root.default"/>
  • To have Hive tasks submitted to a specific queue, add the following to hive-site.xml:
<property>
    <name>mapreduce.job.queuename</name>
    <value>test1</value>
</property>
  • To have Spark tasks submitted to a specific queue:
1- Script mode
--queue hadoop

2- Code mode
sparkConf.set("spark.yarn.queue", "develop")

YARN basic use

1. Configuration file

<!-- $HADOOP_HOME/etc/hadoop/mapred-site.xml -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
<!-- $HADOOP_HOME/etc/hadoop/yarn-site.xml -->
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>

2. YARN start and stop

  • Start ResourceManager and NodeManager (hereinafter referred to as RM and NM respectively)
# Master node run command
$HADOOP_HOME/sbin/start-yarn.sh
  • Stop RM and NM
#Master node run command
$HADOOP_HOME/sbin/stop-yarn.sh
  • If RM is not started, it can be started separately
#If RM is not started, run the command on the master node
#Obsolete: $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
yarn --daemon start resourcemanager

#It can likewise be stopped separately
#Obsolete: $HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager
yarn --daemon stop resourcemanager
  • If NM is not started, it can be started separately
#If NM is not started, run the command at the corresponding node
#Obsolete: $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
yarn --daemon start nodemanager
#It can likewise be stopped separately
#Obsolete: $HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
yarn --daemon stop nodemanager

3. Common commands of yarn

  • To view the list of YARN commands:

  • To view the yarn application command:

# 1. View running tasks
yarn application -list
# 2. Kill running tasks
yarn application -kill application_1617095172572_0003	
# 3. View node list
yarn node -list
# 4. View node status; use the node address and port reported by `yarn node -list` above (the port is randomly assigned)
yarn node -status node03:38122
# 5. View the environment variables that yarn depends on
yarn classpath

  • View yarn logs:
# 1. View the application log
yarn logs -applicationId application_1638460497520_0001
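
  • A programmatic counterpart of the commands above, sketched with the YarnClient API (illustrative only); it lists applications and RUNNING nodes, much like `yarn application -list` and `yarn node -list`:
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfoSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Roughly equivalent to `yarn application -list`: id, name, queue and state.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "\t" + report.getName()
                    + "\t" + report.getQueue() + "\t" + report.getYarnApplicationState());
        }

        // Roughly equivalent to `yarn node -list`: the RUNNING NodeManagers and their web addresses.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + "\t" + node.getHttpAddress());
        }

        yarnClient.stop();
    }
}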

