[Hadoop enterprise development experience] 01 HDFS

1. HDFS multi-directory storage

(1) Add a new hard disk to the Linux system
Reference: https://www.cnblogs.com/yujianadu/p/10750698.html
(2) Disk layout of the production environment servers

(3) Configure multiple data directories in hdfs-site.xml, and pay attention to the access permissions of the newly mounted disks (see the permission sketch at the end of this section)
The directory in which an HDFS DataNode stores its data is determined by the dfs.datanode.data.dir parameter, whose default value is file://${hadoop.tmp.dir}/dfs/data. If the server has multiple disks, this parameter must be modified. For the disk layout shown above, the parameter should be set to the following value.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>

Note: because the disk layout of each server node may differ, this configuration does not need to be distributed to the other nodes; configure it per node.
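
A minimal sketch of the permission check mentioned above, assuming the new disks are mounted at /hd2, /hd3 and /hd4 and the DataNode runs as the atguigu user (adjust paths and user to your environment):

# confirm the new disks are mounted where expected (mount points are assumptions)
df -h

# create the data directories and give them to the user that runs the DataNode
sudo mkdir -p /hd2/dfs/data2 /hd3/dfs/data3 /hd4/dfs/data4
sudo chown -R atguigu:atguigu /hd2/dfs /hd3/dfs /hd4/dfs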

2. Cluster data balancing

1) Data balancing between nodes

(1) Start the data balancer

start-balancer.sh -threshold 10

The parameter 10 means that each DataNode's disk usage may deviate from the overall cluster usage by at most 10 percentage points; adjust it according to the actual situation. For example, if the cluster as a whole is 50% full, balancing continues until every DataNode's usage is between 40% and 60%.
(2) Stop the data balancer

stop-balancer.sh

Note: the balancer starts a separate Rebalance Server process to perform the rebalance, so try not to execute start-balancer.sh on the NameNode; run it on a relatively idle machine instead.
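
Before kicking off the balancer, it helps to see how uneven the current usage is. hdfs dfsadmin -report prints each DataNode's DFS usage percentage; a small sketch that filters the relevant lines:

[atguigu@hadoop102 hadoop-3.1.3]$ hdfs dfsadmin -report | grep -E "Name:|DFS Used%"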

2) Data balancing between disks

Balancing between disks essentially copies block files from one disk to another.
(1) Generate a balancing plan (with only one disk, no plan will be generated)

hdfs diskbalancer -plan hadoop103

(2) Execute the balancing plan

hdfs diskbalancer -execute hadoop103.plan.json

(3) Check the progress of the current balancing task

hdfs diskbalancer -query hadoop103

(4) Cancel the balancing task

hdfs diskbalancer -cancel hadoop103.plan.json
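
Note that the disk balancer is controlled by the dfs.disk.balancer.enabled property in hdfs-site.xml; it defaults to true in Hadoop 3.x, but if it has been turned off, a sketch of re-enabling it:

<property>
    <name>dfs.disk.balancer.enabled</name>
    <value>true</value>
</property>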

3. Project experience: LZO compression configuration

Hadoop itself does not support LZO compression.
There are two ways to add LZO support to Hadoop:
1. Recompile the Hadoop source code
2. Compile the hadoop-lzo plug-in

The steps for the hadoop-lzo approach are as follows:
1) Compile hadoop-lzo
Hadoop itself does not support LZO compression, so the open-source hadoop-lzo component provided by Twitter is used. hadoop-lzo must be compiled against both Hadoop and the native lzo library; the detailed compilation steps are omitted here, but a rough sketch follows.
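
For reference only, a rough sketch of the build, assuming maven, gcc and the lzo development headers (lzo-devel) are already installed; the paths below are assumptions, adjust them to where the native lzo library actually lives:

# fetch and build Twitter's hadoop-lzo
git clone https://github.com/twitter/hadoop-lzo.git
cd hadoop-lzo
# tell gcc where the native lzo headers/libraries are, if non-standard
export C_INCLUDE_PATH=/usr/include
export LIBRARY_PATH=/usr/lib64
mvn clean package -Dmaven.test.skip=true
# the resulting jar appears under target/, e.g. target/hadoop-lzo-0.4.20.jar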

2) Put the compiled hadoop-lzo-0.4.20.jar into Hadoop's share/hadoop/common directory

[atguigu@hadoop102 common]$ pwd
/opt/module/hadoop-3.1.3/share/hadoop/common
[atguigu@hadoop102 common]$ ls
hadoop-lzo-0.4.20.jar

3) Synchronize hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104

[atguigu@hadoop102 common]$ xsync hadoop-lzo-0.4.20.jar

4) Add configuration to core-site.xml to enable LZO compression

<configuration>
    <property>
        <name>io.compression.codecs</name>
        <value>
            org.apache.hadoop.io.compress.GzipCodec,
            org.apache.hadoop.io.compress.DefaultCodec,
            org.apache.hadoop.io.compress.BZip2Codec,
            org.apache.hadoop.io.compress.SnappyCodec,
            com.hadoop.compression.lzo.LzoCodec,
            com.hadoop.compression.lzo.LzopCodec
        </value>
    </property>

    <property>
        <name>io.compression.codec.lzo.class</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
</configuration>

5) Synchronize core-site.xml to hadoop103 and hadoop104

[atguigu@hadoop102 hadoop]$ xsync core-site.xml

6) Start and view cluster

[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh

7) Test - data preparation

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir /input
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -put README.txt /input

8) Test - Compression

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop jar \
share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount \
-Dmapreduce.output.fileoutputformat.compress=true \
-Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
/input \
/output

The LZO-compressed output files have the .lzo suffix, and their content looks garbled when viewed directly because it is binary compressed data.
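
To inspect the result without decompressing by hand, hadoop fs -text should decode the file through the codecs registered in io.compression.codecs (the part file name below is the usual MapReduce naming plus the .lzo suffix):

[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -ls /output
[atguigu@hadoop102 hadoop-3.1.3]$ hadoop fs -text /output/part-r-00000.lzo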

4. Creating an LZO index to support splits

The steps above show how to configure MapReduce job output to be LZO compressed. However, as MapReduce input, a plain LZO file cannot be split, so this section shows how to make MapReduce split LZO files.

1. How to create an index for an LZO file
Whether an LZO-compressed file can be split depends on its index, so the index must be created manually. Without an index, the entire LZO file is processed as a single split.

hadoop jar \
/path/to/your/hadoop-lzo.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
big_file.lzo

2. LZO split test
(1) Test without an index
1. Upload a 200 MB LZO file, bigtable.lzo, to the /input directory of the cluster

[atguigu@hadoop102 module]$ hadoop fs -mkdir /input
[atguigu@hadoop102 module]$ hadoop fs -put bigtable.lzo /input

2. Run the wordcount program to test

[atguigu@hadoop102 module]$ hadoop jar \
/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/input \
/output1

Note: the input format class is specified here; LzoTextInputFormat provided by hadoop-lzo is used.

Without an index, the LZO file still produces only one split (and therefore one map task).

(2) Test with an LZO index
1. Create an index for the uploaded LZO file:

[atguigu@hadoop102 module]$ hadoop jar \
/opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar  \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/input/bigtable.lzo


The indexer itself runs as a MapReduce job.
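
When the indexer finishes, a .index file should sit next to the original file; a quick check:

[atguigu@hadoop102 module]$ hadoop fs -ls /input
# expect both /input/bigtable.lzo and /input/bigtable.lzo.index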

2. Run the WordCount program again for testing

[atguigu@hadoop102 module]$ hadoop jar \
/opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar wordcount \
-Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat \
/input \
/output2


Note: when running the above jobs, the following exception may be reported:

Container [pid=8468,containerID=container_1594198338753_0001_01_000002] is running 318740992B beyond the 'VIRTUAL' memory limit. Current usage: 111.5 MB of 1 GB physical memory used; 2.4 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1594198338753_0001_01_000002 :

Solution: add the following configuration to the /opt/module/hadoop-3.1.3/etc/hadoop/yarn-site.xml file on hadoop102, distribute it to hadoop103 and hadoop104, and restart the cluster (a distribution sketch follows the snippet).

<!--Whether to start a thread to check the amount of virtual memory being used by each task. If the task exceeds the allocated value, it will be killed directly. The default is true -->
<property>
   <name>yarn.nodemanager.vmem-check-enabled</name>
   <value>false</value>
</property>
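
A sketch of the distribution and restart, following the same xsync pattern used earlier (YARN is started from hadoop103 in this cluster):

[atguigu@hadoop102 hadoop]$ xsync yarn-site.xml
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/stop-yarn.sh
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh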
