Hadoop cluster operation and maintenance (continuously updated)

Errors during job runs

Unable to close file because the last block BP-1820686335-10.201.48.27-144816918

java.io.IOException: Unable to close file because the last block BP-1820686335-10.201.48.27-1448169181587:blk_1850383542_781036567 does not have enough number of replicas.
        at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2705)
        at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:2667)
        at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2621)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:248)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:380)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:1060)
        at org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:67)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:83)
        at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:937)
        at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2299)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2388)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2119)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2081)
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1972)
        at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:1898)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:514)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:475)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:263)
        at java.lang.Thread.run(Thread.java:745)

Reference: a Hive task reports the HDFS exception "last block does not have enough number of replicas". It is known to be caused by excessive load on the Hadoop servers; re-executing the Hive SQL script usually gets past it. To solve the problem completely, it is suggested to reduce task concurrency or cap CPU utilization so that network traffic drops and the DataNodes can report their blocks to the NameNode in time.
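Besides lowering the load, a common client-side mitigation for this exact error is to let the writer retry longer before completeFile() gives up. A minimal sketch, assuming the stock HDFS client property and its default value of 5; verify the name and default against your Hadoop version before applying it:

<!-- hdfs-site.xml on the writing client (here the HBase RegionServer) -->
<property>
  <name>dfs.client.block.write.locateFollowingBlock.retries</name>
  <!-- default 5; a higher value gives DataNodes more time to report the last block to the NameNode -->
  <value>10</value>
</property>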

Problem conclusion:
Reduce the system load. When the problem occurs the cluster load is very heavy: all 32 CPU cores are at 100% and allocated to MapReduce, while at least 20% of the CPU should be kept in reserve.
The other main cause is that there are too many blocks. Consider scanning the large directories first and cleaning up directories that hold too many small files before processing.
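To locate the small-file hot spots before cleaning them up, a per-directory file count is usually enough. A minimal sketch using the standard HDFS CLI; /user/hive/warehouse is only an example path:

# columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME; a huge FILE_COUNT with a small CONTENT_SIZE marks a small-file directory
hdfs dfs -count /user/hive/warehouse/*
# per-directory sizes, sorted, for a quick overview
hdfs dfs -du -h /user/hive/warehouse | sort -h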

 java.lang.IllegalArgumentException: java.net.UnknownHostException

Resolution path: check the ResourceManager. It turns out one node's hostname cannot be resolved; after removing that node, the problem is eliminated.

But this does not fully explain the behaviour yet. In the YARN logs I only saw that the hostname could not be resolved and the container was not allocated; in other cases the container was allocated and the corresponding application still executed successfully.
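To find which NodeManager host cannot be resolved, compare the node list YARN reports with name resolution on the ResourceManager. A minimal sketch; BGhadoop08 is the hostname taken from the log below:

# list every NodeManager the ResourceManager knows about, in any state
yarn node -list -all
# on the ResourceManager host, check whether the suspect hostname actually resolves
getent hosts BGhadoop08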

2017-12-21 13:34:36,732 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hive     OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS  APPID=application_1513834407876_0012    CONTAINERID=container_e91_1513834407876_0012_01_000086
2017-12-21 13:34:36,732 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_e91_1513834407876_0012_01_000086 of capacity <memory:4096, vCores:1> on host slave19.bl.bigdata:8041, which has 6 containers, <memory:27648, vCores:12> used and <memory:54272, vCores:36> available after allocation
2017-12-21 13:34:36,748 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Error trying to assign container token and NM token to an allocated container container_e91_1513694506641_4872_01_000001
java.lang.IllegalArgumentException: java.net.UnknownHostException: BGhadoop08
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:406)
        at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:256)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:220)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:455)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.allocate(FairScheduler.java:823)
        at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:532)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
        at org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2220)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2214)
Caused by: java.net.UnknownHostException: BGhadoop08

**Analysis**

  1. Are all the stuck tasks running on the server whose hostname is not configured?
  2. How is Hadoop's speculative execution triggered?
  3. Why can some tasks be assigned to the node without a resolvable hostname, while others cannot?

In fact the root cause is clear from the UnknownHostException itself: the ResourceManager cannot resolve the NodeManager hostname BGhadoop08, so it fails while building the container and NM tokens for containers allocated on that node.
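If the node is meant to stay in the cluster, the alternative to deleting it is simply to make the hostname resolvable on the ResourceManager (ideally on every host). A minimal sketch assuming static /etc/hosts management rather than DNS; the IP address below is a made-up placeholder:

# append to /etc/hosts on the ResourceManager (replace with the node's real address)
10.201.48.99    BGhadoop08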

Cluster service

The NTP service for the host could not be found, or the service did not respond to a clock skew request

Scenario

The CDH cluster starts successfully, but some hosts show the warning "The NTP service for the host could not be found, or the service did not respond to a clock skew request".

Possible causes

  1. The NTP service is not running properly (a quick check is shown below)
  2. An exception in the CDH (cloudera-scm-agent) daemon
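Before restarting anything, it is worth confirming whether ntpd is actually running and synchronized on the flagged hosts. A minimal check, assuming an ntpd-based setup (hosts running chronyd would use chronyc tracking instead):

systemctl status ntpd
# peer list: a line starting with '*' marks the server the clock is currently synchronized to
ntpq -p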

Solution steps

1. First stop the CDH services: shut the cluster services down from the Cloudera Manager interface
2. Restart the NTP service on each host

systemctl restart ntpd

3. Restart the Cloudera Manager agent (cloudera-scm-agent) on each host

systemctl restart cloudera-scm-agent

Wait about five minutes, then check the result in the CDH console; the warning should be cleared.
