Big data HDFS application development

1. HDFS Shell operation (development focus)

Through the previous study we have gained a basic understanding of HDFS. Now let's practice to deepen that understanding.

HDFS can be operated from the shell command line, much like the file system in Linux, but the format of some specific commands differs slightly

The format is as follows:
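
In general the command looks like this (a rough sketch; bigdata01:9000 is the NameNode address used in this tutorial's cluster, substitute your own):

bin/hdfs dfs -xxx scheme://authority/path

For example:

bin/hdfs dfs -ls hdfs://bigdata01:9000/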

Use the hdfs command from Hadoop's bin directory and follow it with dfs, indicating that we are operating on the distributed file system; these parts of the format are fixed.

If Hadoop's bin directory has been added to the PATH, hdfs can be invoked directly here

xxx here is a placeholder: specify whichever command corresponds to what you want to do with HDFS

Most hdfs commands are similar to the corresponding Linux commands

The scheme of HDFS is hdfs, the authority is the IP address and port of the node where the NameNode runs in the cluster (using the hostname instead of the IP works just as well), and the path is the file path we want to operate on

In fact, this long scheme://authority prefix is the value of the fs.defaultFS property in the core-site.xml configuration file, which represents the address of HDFS.
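
For reference, a minimal sketch of that property as it would appear in core-site.xml for this tutorial's cluster (your NameNode host and port may differ):

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://bigdata01:9000</value>
</property>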

2. Common Shell operations of HDFS

Let's learn some common shell operations in HDFS

In fact, HDFS supports many parameters, but quite a few of them are rarely used. Here we will go through the commonly used ones; if you have special needs later on, you can consult the HDFS help documentation.
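
For example, detailed usage for a single command can be printed with the built-in -help option (ls is used here just as an illustration):

hdfs dfs -help ls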

Entering hdfs dfs by itself on the command line lists all the parameters that dfs accepts:

Note: [] means optional and < > means required

[root@bigdata01 hadoop-3.2.0]# hdfs dfs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
        [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
        [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] [-v] [-x] <path> ...]
        [-expunge]
        [-find <path> ... <expression> ...]
        [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
        [-head <file>]
        [-help [cmd ...]]
        [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
        [-touchz <path> ...]
        [-truncate [-w] <length> <path> ...]
        [-usage [cmd ...]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]

2.1 ls: query specified path information

Let's look at the ls command first

Check the contents of the HDFS root directory. Nothing is displayed because HDFS is empty by default:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls hdfs://bigdata01:9000/
[root@bigdata01 hadoop-3.2.0]#

In fact, the hdfs:// URL prefix can be omitted, because the hdfs command automatically picks up the fs.defaultFS property from the configuration files under HADOOP_HOME

So this abbreviation is also possible

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls /
[root@bigdata01 hadoop-3.2.0]#

2.2 put: upload local files

Next, let's upload a file to HDFS: we upload the README.txt that ships with Hadoop directly to the HDFS root directory

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -put README.txt  /

After a successful upload there is no output; with these commands, no output is the best result

Confirm the file just uploaded

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls /
Found 1 items
-rw-r--r--   2 root supergroup       1361 2020-04-08 15:34 /README.txt

Notice that the information shown by ls in HDFS is similar to what ll shows in Linux.

Seeing the file here confirms that the upload just now succeeded.

2.3 cat: View HDFS file content

After uploading the file, we also want to check the contents of the file in HDFS. It's very simple. Just use cat

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -cat /README.txt
For the latest information about Hadoop, please visit our website at:

   http://hadoop.apache.org/

and our wiki, at:

   http://wiki.apache.org/hadoop/
...........

2.4 get: download files to the local file system

What if we want to download a file from HDFS to the local Linux file system? This can be achieved using get

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -get /README.txt .
get: `README.txt': File exists

Note: an error is reported here because the file already exists. The command asks to download README.txt from HDFS into the current directory, but a file with that name is already there. Either download into another directory or rename the target file:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -get /README.txt README.txt.bak
[root@bigdata01 hadoop-3.2.0]# ll
total 188
drwxr-xr-x. 2 1001 1002    203 Jan  8  2019 bin
drwxr-xr-x. 3 1001 1002     20 Jan  8  2019 etc
drwxr-xr-x. 2 1001 1002    106 Jan  8  2019 include
drwxr-xr-x. 3 1001 1002     20 Jan  8  2019 lib
drwxr-xr-x. 4 1001 1002   4096 Jan  8  2019 libexec
-rw-rw-r--. 1 1001 1002 150569 Oct 19  2018 LICENSE.txt
-rw-rw-r--. 1 1001 1002  22125 Oct 19  2018 NOTICE.txt
-rw-rw-r--. 1 1001 1002   1361 Oct 19  2018 README.txt
-rw-r--r--. 1 root root   1361 Apr  8 15:41 README.txt.bak
drwxr-xr-x. 3 1001 1002   4096 Apr  7 22:08 sbin
drwxr-xr-x. 4 1001 1002     31 Jan  8  2019 share

2.5 mkdir [-p]: create folder

Later we will need to maintain many files in HDFS, so we create folders to organize and manage them.

Let's create a folder using the mkdir command in HDFS:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -mkdir /test
[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls /
Found 2 items
-rw-r--r--   2 root supergroup       1361 2020-04-08 15:34 /README.txt
drwxr-xr-x   - root supergroup          0 2020-04-08 15:43 /test

To create a multi-level directory recursively, you also need to specify the -p parameter:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -mkdir -p /abc/xyz
[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   2 root supergroup       1361 2020-04-08 15:34 /README.txt
drwxr-xr-x   - root supergroup          0 2020-04-08 15:44 /abc
drwxr-xr-x   - root supergroup          0 2020-04-08 15:43 /test

To recursively display all directories, add the -R parameter after ls:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls -R /
-rw-r--r--   2 root supergroup       1361 2020-04-08 15:34 /README.txt
drwxr-xr-x   - root supergroup          0 2020-04-08 15:44 /abc
drwxr-xr-x   - root supergroup          0 2020-04-08 15:44 /abc/xyz
drwxr-xr-x   - root supergroup          0 2020-04-08 15:43 /test

2.6 rm [-r]: delete files / folders

If you want to delete directories or files in hdfs, you can use rm

Delete a file:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -rm /README.txt
Deleted /README.txt

To delete a directory, note that you need to specify the -r parameter:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -rm /test
rm: `/test': Is a directory
[root@bigdata01 hadoop-3.2.0]# hdfs dfs -rm -r /test
Deleted /test

If it is a multi-level directory, can it be deleted recursively? Sure:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -rm -r /abc
Deleted /abc

Error resolution:
No write permission when uploading a file to HDFS: put: Permission denied: user=root, access=WRITE

Problem:

For example, an upload fails because the user does not have permission to write:

Command:

 hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1

Error message:

put: Permission denied: user=root, access=WRITE, inode="/user/impala/data/logs/year=2013/month=07/day=28/host=host1":hdfs:impala:drwxr-xr-x

Solution:

1. Check the permissions on the target directory

[root@hadoop09-test1-rgtj1-tj1 test_pro]# hdfs dfs -ls  /user/impala/data/logs/year=2013/month=07/day=28
Found 1 items
drwxr-xr-x   - hdfs impala          0 2020-02-17 22:32 /user/impala/data/logs/year=2013/month=07/day=28/host=host1

2. Switch to a user that has write permission

Use sudo to run the command as the user that owns the target directory (hdfs here), so that the upload has permission to write:

sudo -uhdfs hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1

3. HDFS case practice

Requirement: count the number of files in HDFS and the size of each file

First, let's upload several files to HDFS; we use the txt files from the Hadoop directory:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -put LICENSE.txt /
[root@bigdata01 hadoop-3.2.0]# hdfs dfs -put NOTICE.txt /
[root@bigdata01 hadoop-3.2.0]# hdfs dfs -put README.txt /

1: Count the number of files in the root directory

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls / |grep /| wc -l    
3

2: Count the size of each file in the root directory, and finally print out the file name and size

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls / |grep / |  awk '{print $8,$5}'
/LICENSE.txt 150569
/NOTICE.txt 22125
/README.txt 1361
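
As an aside, the built-in -du option can produce similar per-file size information in one step (a sketch; the exact output columns vary between Hadoop versions):

hdfs dfs -du /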

4. Operating HDFS with Java code

4.1 Configuring the Hadoop environment on Windows

A Hadoop runtime environment needs to be configured on Windows, otherwise the following problems will occur when running the code directly:

winutils.exe is missing
hadoop.dll is missing

Steps:

Step 1: copy the Hadoop 2.7.5 folder to a path that contains no Chinese characters or spaces

Step 2: configure the Hadoop environment variables on Windows: set HADOOP_HOME and add %HADOOP_HOME%\bin to PATH

Step 3: copy the hadoop.dll file from the bin directory of the Hadoop 2.7.5 folder into C:\Windows\System32

Step 4: restart Windows
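
As a quick sanity check (assuming JAVA_HOME is also set and hadoop.cmd is present in the bin directory), open a new cmd window and run the following; it should print the Hadoop version information:

hadoop version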

Previously we learned how to operate HDFS from the shell command line, which is the most common way of working with it, but at work we will also run into requirements that call for operating HDFS from Java code. Before getting into the specifics we need to set up the development environment: the code editor used here is IDEA, although Eclipse works just as well.

When creating the project we will create a Maven project, since it is much more convenient to use Maven to manage dependencies. The installation packages for IDEA and Maven will be provided. Considering that some students may not have used Maven before, let's walk through installing and configuring Maven on Windows.

We use apache-maven-3.0.5-bin.zip here; other versions work the same way, there is no essential difference. Extract the archive to a directory, in this case D:\Program Files (x86)\apache-maven-3.0.5.

After unzipping, I suggest modifying Maven's configuration so that the local repository lives on another disk such as D, because by default it sits in the user directory on drive C. Edit the settings.xml file under D:\Program Files (x86)\apache-maven-3.0.5\conf, move the localRepository tag out of the comment block, and set its value to D:\.m2. The effect is as follows:

The directory name here can be chosen at will, as long as it is easy to identify

<localRepository>D:\.m2</localRepository>

After this modification, the dependent jar packages managed by Maven will be saved to the D:\.m2 directory.

Next, you need to configure Maven's environment variables in Windows, in the same way as the JAVA_HOME environment variable.

First set M2_HOME=D:\Program Files (x86)\apache-maven-3.0.5 in the environment variables

Then add %M2_HOME%\bin to the PATH environment variable

After the environment variables are configured, open a cmd window and enter the mvn command. As long as it executes normally, the local Maven environment on Windows is configured correctly.
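
For example, a common check is the following, which should print the installed Maven version:

mvn -version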

This is not over yet. We also need to point IDEA at this local Maven installation.

Click File -> Settings in the upper left corner of IDEA to open the settings window, search for maven, and point it at the local Maven installation.

Let's create a maven project

The project name is db_hadoop

Note: after the project is created, click Enable Auto Import in the pop-up in the lower-right corner of the newly opened window, so that dependencies added to the pom are imported automatically. Otherwise a dependency may be declared yet still not be recognized in the code, and you would have to import it manually, which is more troublesome.

OK, after the project is created we need to add the Hadoop dependency

Specifically, we need the Hadoop client dependency; find it in the Maven repository and add it to the pom.xml file

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.0</version>
</dependency>

Then create the code

  • Upload file
package com.imooc.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.net.URI;

/**
 * Java Code operation HDFS
 * File operation: upload file, download file and delete file
 * Created by xuwei
 */
public class HdfsOp {
    public static void main(String[] args) throws Exception{
        //Create a configuration object
        Configuration conf = new Configuration();
        //Specifies the address of the HDFS
        conf.set("fs.defaultFS","hdfs://bigdata01:9000");
        //Gets the object that operates on HDFS
        FileSystem fileSystem = FileSystem.get(conf);


        //Gets the output stream of the HDFS file system
        FSDataOutputStream fos = fileSystem.create(new Path("/user.txt"));
        //Gets the input stream of the local file
        FileInputStream fis = new FileInputStream("D:\\user.txt");

        //Upload file: copy the input stream to the output stream through the tool class to upload the local file to HDFS
        IOUtils.copyBytes(fis,fos,1024,true);

    }

}

Executing the code produces an error, Permission denied, indicating that the current Windows user does not have permission to write data to HDFS:

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=yehua, access=WRITE, inode="/":root:supergroup:drwxr-xr-x

There are two solutions

The first is to disable HDFS's permission check by setting dfs.permissions.enabled to false in hdfs-site.xml

The second is to package the code and execute it on Linux

Here, for the convenience of local testing, we first use the first method

1: Stop Hadoop cluster

[root@bigdata01 ~]# cd /data/soft/hadoop-3.2.0
[root@bigdata01 hadoop-3.2.0]# sbin/stop-all.sh
Stopping namenodes on [bigdata01]
Last login: Wed Apr  8 20:25:17 CST 2020 from 192.168.182.1 on pts/1
Stopping datanodes
Last login: Wed Apr  8 20:25:40 CST 2020 on pts/1
Stopping secondary namenodes [bigdata01]
Last login: Wed Apr  8 20:25:41 CST 2020 on pts/1
Stopping nodemanagers
Last login: Wed Apr  8 20:25:44 CST 2020 on pts/1
Stopping resourcemanager
Last login: Wed Apr  8 20:25:47 CST 2020 on pts/1

2: Modify the hdfs-site.xml configuration file

Note: the configuration files of all nodes in the cluster need to be modified. First modify them on bigdata01 node, and then synchronize to the other two nodes

Operate on bigdata01

[root@bigdata01 hadoop-3.2.0]# vi etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>bigdata01:50090</value>
    </property>
    <property>
        <name>dfs.permissions.enabled</name>
        <value>false</value>
    </property>
</configuration>

Synchronize to the other two nodes

[root@bigdata01 hadoop-3.2.0]# scp -rq etc/hadoop/hdfs-site.xml  bigdata02:/data/soft/hadoop-3.2.0/etc/hadoop/
[root@bigdata01 hadoop-3.2.0]# scp -rq etc/hadoop/hdfs-site.xml  bigdata03:/data/soft/hadoop-3.2.0/etc/hadoop/

3: Start Hadoop cluster

[root@bigdata01 hadoop-3.2.0]# sbin/start-all.sh 
Starting namenodes on [bigdata01]
Last login: Wed Apr  8 20:25:49 CST 2020 on pts/1
Starting datanodes
Last login: Wed Apr  8 20:29:57 CST 2020 on pts/1
Starting secondary namenodes [bigdata01]
Last login: Wed Apr  8 20:29:59 CST 2020 on pts/1
Starting resourcemanager
Last login: Wed Apr  8 20:30:04 CST 2020 on pts/1
Starting nodemanagers
Last login: Wed Apr  8 20:30:10 CST 2020 on pts/1

Re-execute the code; this time there is no error. Check the data on HDFS:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls /
Found 4 items
-rw-r--r--   2 root  supergroup     150569 2020-04-08 15:55 /LICENSE.txt
-rw-r--r--   2 root  supergroup      22125 2020-04-08 15:55 /NOTICE.txt
-rw-r--r--   2 root  supergroup       1361 2020-04-08 15:55 /README.txt
-rw-r--r--   3 yehua supergroup         17 2020-04-08 20:31 /user.txt
[root@bigdata01 hadoop-3.2.0]# hdfs dfs -cat /user.txt
jack
tom
jessic

Since we still need to implement the other operations, let's first refactor the code and extract the upload logic into its own method:

package com.imooc.hdfs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;

/**
 * Java Code operation HDFS
 * File operation: upload file, download file and delete file
 * Created by xuwei
 */
public class HdfsOp {
    public static void main(String[] args) throws Exception{
        //Create a configuration object
        Configuration conf = new Configuration();
        //Specifies the address of the HDFS
        conf.set("fs.defaultFS","hdfs://bigdata01:9000");
        //Gets the object that operates on HDFS
        FileSystem fileSystem = FileSystem.get(conf);
        put(fileSystem);
    }

    /**
     * File upload
     * @param fileSystem
     * @throws IOException
     */
    private static void put(FileSystem fileSystem) throws IOException {
        //Gets the output stream of the HDFS file system
        FSDataOutputStream fos = fileSystem.create(new Path("/user.txt"));
        //Gets the input stream of the local file
        FileInputStream fis = new FileInputStream("D:\\user.txt");

        //Upload file: copy the input stream to the output stream through the tool class to upload the local file to HDFS
        IOUtils.copyBytes(fis,fos,1024,true);
    }

}
  • Download File

After executing the code, go to drive D on Windows and verify that the file was generated; if it is there, the download succeeded

/**
 * Download File
 * @param fileSystem
 * @throws IOException
 */
private static void get(FileSystem fileSystem) throws IOException{
    //Gets the input stream of the HDFS file system
    FSDataInputStream fis = fileSystem.open(new Path("/README.txt"));
    //Gets the output stream of the local file
    FileOutputStream fos = new FileOutputStream("D:\\README.txt");
    //Download File
    IOUtils.copyBytes(fis,fos,1024,true);
}
  • Delete file

Then execute the delete operation code:

/**
 * Delete file
 * @param fileSystem
 * @throws IOException
 */
private static void delete(FileSystem fileSystem) throws IOException{
    //Delete files and directories
    //If you want to delete a directory recursively, the second parameter needs to be set to true
    //If you delete a file or an empty directory, the second parameter is ignored
    boolean flag = fileSystem.delete(new Path("/LICENSE.txt"),true);
    if(flag){
        System.out.println("Delete succeeded!");
    }else{
        System.out.println("Deletion failed!");
    }
}
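
To run the download or delete operation, call the corresponding method from main instead of put (a sketch based on the code above; note that the get method also needs org.apache.hadoop.fs.FSDataInputStream and java.io.FileOutputStream added to the imports):

public static void main(String[] args) throws Exception{
    //Create a configuration object and specify the address of the HDFS
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS","hdfs://bigdata01:9000");
    //Gets the object that operates on HDFS
    FileSystem fileSystem = FileSystem.get(conf);
    //Switch between the operations here
    //put(fileSystem);
    //get(fileSystem);
    delete(fileSystem);
}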

Then check HDFS to verify that the file was deleted. The listing shows that /LICENSE.txt is gone:

[root@bigdata01 hadoop-3.2.0]# hdfs dfs -ls /
Found 3 items
-rw-r--r--   2 root  supergroup      22125 2020-04-08 15:55 /NOTICE.txt
-rw-r--r--   2 root  supergroup       1361 2020-04-08 15:55 /README.txt
-rw-r--r--   3 yehua supergroup         17 2020-04-08 20:31 /user.txt

When we execute the code, a lot of red warning messages are printed. They do not affect execution, but they are distracting:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
log4j:WARN No appenders could be found for logger (org.apache.htrace.core.Tracer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

How to solve this problem?

Analyzing the messages shows two issues: the first warning means the SLF4J/log4j binding (implementation class) is missing, and the second means there is no log4j configuration file.

1: Add the slf4j/log4j dependencies in pom.xml

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.10</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.10</version>
</dependency>

2: Add the log4j.properties file in the resources directory

Add log4j.properties in the src\main\resources directory of the project

The contents of log4j.properties file are as follows:

log4j.rootLogger=info,stdout

log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout 
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n

Execute the code again and the red warning messages are gone.

5. Principle and process of data upload/download

