HBase downtime troubleshooting
Background: while using Spark to write database data into HBase, three Spark tasks that had been running normally suddenly hung. Checking the logs showed that the program was stuck at the following point:
2021-10-18 18:23:58,158 INFO jdbc.Utils: Supplied authorities: 192.168.xx.xx:10000
2021-10-18 18:23:58,158 INFO jdbc.Utils: Resolved authority: 192.168.xx.xx:10000
2021-10-18 18:24:24,234 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on node25:33917 in memory (size: 4.8 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,253 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on node25:42263 in memory (size: 4.8 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,264 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on node25:33917 in memory (size: 7.3 KiB, free: 366.3 MiB)
2021-10-18 18:24:24,268 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on node25:42263 in memory (size: 7.3 KiB, free: 366.3 MiB)
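When a job hangs like this, it can help to pull the full container logs from YARN to confirm where the executors are actually stuck; depending on the cluster's log-aggregation settings this may only show completed containers, and the application id below is just a placeholder:

yarn logs -applicationId application_xxxxxxxxxxxx_xxxx | less    # dump all container logs for the hung application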
At first I suspected a data problem: data skew might have left the process in a "fake death" state (hung but not failed). So I killed the current task with yarn application -kill ${application_id} and re-ran the job against an empty table for testing. The same problem still occurred, so the initial suspicion that data skew caused the hang did not hold up, and the investigation continued.
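For reference, the ${application_id} used above can be looked up with the YARN CLI (or from the ResourceManager web UI):

yarn application -list -appStates RUNNING    # lists running applications and their application ids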
By the way, here is an article on dealing with data skew that I read while troubleshooting: spark performance optimization.
Continuing the troubleshooting, the log then suddenly reported the following error:
2021-10-18 18:44:35,265 ERROR yarn.ApplicationMaster: User class threw exception: java.sql.SQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1219044: callTimeout=1200000, callDuration=1238974: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true row 'xxxx' on table 'hbase:meta' at null
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:148)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3078)
at org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3070)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:445)
at org.apache.hadoop.hive.hbase.HBaseMetaHook.preCreateTable(HBaseMetaHook.java:135)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:832)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.createTable(HiveMetaStoreClient.java:822)
at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:212)
at com.sun.proxy.$Proxy34.createTable(Unknown Source)
at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2773)
at com.sun.proxy.$Proxy34.createTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:921)
at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:937)
at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4954)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:428)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2664)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2335)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2011)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1709)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1703)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:224)
at org.apache.hive.service.cli.operation.SQLOperation.access$700(SQLOperation.java:87)
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:316)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1845)
at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: callTimeout=1200000, callDuration=1238974: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true row 'xxxxx' on table 'hbase:meta' at null
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:148)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:395)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:369)
at org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1108)
at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:427)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:448)
at org.apache.hadoop.hbase.client.HBaseAdmin$6.rpcCall(HBaseAdmin.java:445)
at org.apache.hadoop.hbase.client.RpcRetryingCallable.call(RpcRetryingCallable.java:58)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:106)
... 40 more
Caused by: java.io.IOException: Failed to find location, tableName=hbase:meta, row=xxxxx, reload=true
at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:222)
at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:104)
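The trace shows Hive's HBaseMetaHook checking whether the target HBase table exists, and that check times out because the client cannot even locate hbase:meta. A quick way to see whether HBase is answering at all, assuming the hbase client is installed on the node, is the shell's status command:

hbase shell    # assumes the hbase client is on the PATH
status         # typed inside the shell; reports live/dead region servers, or errors out if the cluster is unreachable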
At this point it turned out that the HBase service had stopped abnormally. When trying to restart HBase, unexpectedly the service would not come back up: the HMaster process kept failing to start. Checking the HMaster startup log revealed the following problem:
2021-10-18 20:17:12,663 WARN [main-SendThread(192.168.17.23:2181)] zookeeper.ClientCnxn: Session 0x0 for server 192.168.17.23/192.168.17.23:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:378)
at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2021-10-18 20:17:26,533 INFO [main] zookeeper.ZooKeeper: Session: 0x0 closed
2021-10-18 20:17:26,533 ERROR [main] regionserver.HRegionServer: Failed construction RegionServer
org.apache.hadoop.hbase.ZooKeeperConnectionException: master:160000x0, quorum=192.168.17.22:2181,192.168.17.23:2181,192.168.17.24:2181, baseZNode=/hbase Unexpected KeeperException creating base node
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.createBaseZNodes(ZKWatcher.java:192)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.<init>(ZKWatcher.java:167)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.<init>(ZKWatcher.java:119)
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:652)
at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:494)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2937)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:236)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2955)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:547)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:526)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1197)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.createWithParents(ZKUtil.java:1175)
at org.apache.hadoop.hbase.zookeeper.ZKWatcher.createBaseZNodes(ZKWatcher.java:183)
... 14 more
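For reference, the output above is from the HMaster's own log; on a typical tarball install it sits under $HBASE_HOME/logs, with a file name along the lines of the pattern below (the exact path is an assumption about the install layout):

tail -f $HBASE_HOME/logs/hbase-*-master-*.log    # follow the master log while retrying the start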
So HBase was failing to start because it could not connect to ZooKeeper. However, ps -ef|grep zookeeper showed that the ZooKeeper processes were still running, so the failure was not caused by ZooKeeper itself being down. I then noticed the java.io.IOException: Connection reset by peer, which in this situation points to too many socket connections, so I checked the number of connections to ZooKeeper:
netstat -anl|grep 2181|grep -i 'zk_ip'|grep ESTABLISHED|wc -l
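The ZooKeeper server itself can also report its open connections through the four-letter-word commands (on ZooKeeper 3.5+ these must first be whitelisted via 4lw.commands.whitelist, which is an assumption about this cluster's configuration):

echo cons | nc 192.168.17.23 2181    # one line per open client connection on this server
echo stat | nc 192.168.17.23 2181    # server summary, including the total connection count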
The count came back as 240 connections, while ZooKeeper's default connection limit (maxClientCnxns) is 60. So the maximum number of connections ZooKeeper allows was raised accordingly.
In ZooKeeper's zoo.cfg, add the following parameter:
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=2000
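Once the cluster has been restarted (next step), the effective limit can be double-checked with the conf four-letter word, again assuming four-letter words are enabled:

echo conf | nc 192.168.17.23 2181 | grep -i maxclientcnxns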
Then restart the ZooKeeper cluster, execute rmr /hbase in the ZooKeeper shell to clear HBase's znode, and restart HBase. After that, the service started normally.
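Concretely, the recovery steps looked roughly like this; the script paths assume a standard tarball install of ZooKeeper and HBase, and note that deleting the /hbase znode wipes HBase's transient state in ZooKeeper, which is rebuilt on the next startup:

# on each ZooKeeper node, after editing zoo.cfg
$ZOOKEEPER_HOME/bin/zkServer.sh restart

# remove HBase's znode so the master can register cleanly
$ZOOKEEPER_HOME/bin/zkCli.sh -server 192.168.17.23:2181
rmr /hbase    # typed inside zkCli; on ZooKeeper 3.5+ the equivalent command is deleteall /hbase

# bring HBase back up
$HBASE_HOME/bin/start-hbase.sh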