Flume deployment and introduction case

1, Flume installation and deployment

1.1 installation address

1. Flume official website address
http://flume.apache.org/
2. Download address
http://archive.apache.org/dist/flume/
3. Document address
http://flume.apache.org/FlumeUserGuide.html

1.2 installation and deployment

1. Upload apache-flume-1.7.0-bin.tar.gz to the /opt/software directory of the server

2. Unzip apache-flume-1.7.0-bin.tar.gz to the directory /opt/module/

[test@hadoop151 software]$ tar -zxvf apache-flume-1.7.0-bin.tar.gz -C /opt/module/

3. Change the name of apache-flume-1.7.0-bin to flume

[test@hadoop151 module]$ mv apache-flume-1.7.0-bin/ flume

4. Rename the file flume-env.sh.template under flume/conf to flume-env.sh, and configure flume-env.sh with the JAVA_HOME setting:

export JAVA_HOME=/opt/module/jdk1.8.0_144
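For reference, a minimal sketch of this step (assuming the JDK is installed at /opt/module/jdk1.8.0_144, as in the line above):

[test@hadoop151 conf]$ mv flume-env.sh.template flume-env.sh
[test@hadoop151 conf]$ vim flume-env.sh
# add the export JAVA_HOME line shown above, then save and exit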

2, Getting started case - official case of monitoring port data

2.1 case introduction

1. Case requirement
Use Flume to monitor a port, collect the data sent to that port, and print it to the console.

2. Requirement analysis

2.2 case steps

1. Install the netcat tool

[test@hadoop151 conf]$ sudo yum install -y nc

2. Check whether port 44444 is already in use

[test@hadoop151 conf]$ sudo netstat -tunlp | grep 44444
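If the port is already in use, the listening process can be identified from the netstat output and stopped before starting the agent; a hedged sketch (the PID is hypothetical):

[test@hadoop151 conf]$ sudo kill -9 <PID shown by netstat for port 44444>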

3. Create Flume Agent configuration file flume-netcat-logger.conf
(1) Create the job folder in the flume directory and enter the job folder

[test@hadoop151 flume]$ mkdir job
[test@hadoop151 flume]$ cd job/

(2) Create the Flume Agent configuration file flume-netcat-logger.conf in the job folder

[test@hadoop151 job]$ vim flume-netcat-logger.conf

(3) Add the following to the flume-netcat-logger.conf file

# Name the components on this agent 
a1.sources = r1 
a1.sinks = k1 
a1.channels = c1 
 
# Describe/configure the source 
a1.sources.r1.type = netcat 
a1.sources.r1.bind = localhost 
a1.sources.r1.port = 44444 
 
# Describe the sink 
a1.sinks.k1.type = logger 
 
# Use a channel which buffers events in memory 
a1.channels.c1.type = memory 
a1.channels.c1.capacity = 1000 
a1.channels.c1.transactionCapacity = 100       

# Bind the source and sink to the channel 
a1.sources.r1.channels = c1 
a1.sinks.k1.channel = c1 

4. Configuration file explanation

Note: the configuration file is from the official manual http://flume.apache.org/FlumeUserGuide.html

5. Start Flume to listen on the port
The first way of writing:

bin/flume-ng agent --conf conf/ --name a1 --conf-file job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

The second way of writing:

bin/flume-ng agent -c conf/ -n a1 -f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Parameter description:
--conf / -c: the directory where the Flume configuration files are stored (here conf/).

--name / -n: the name of the agent (here a1).

--conf-file / -f: the configuration file Flume reads for this run, here the flume-netcat-logger.conf file under the job folder.

-Dflume.root.logger=INFO,console: -D dynamically overrides the flume.root.logger property while Flume is running and sets the console log level to INFO. Available log levels include debug, info, warn, and error.

6. Use the netcat tool to send content to port 44444 of this computer

[test@hadoop151 jdk1.8.0_144]$ nc localhost 44444
hello
OK
123456
OK

7. Observe the received data in the Flume listening terminal

3, Getting started case - real-time monitoring of a single appended file

3.1 case introduction

1. Case requirement
Monitor the Hive log in real time and upload it to HDFS.

2. Requirement analysis

3.2 case steps

1. To write data to HDFS, Flume needs the Hadoop-related jar packages

Copy the following jar packages

commons-configuration-1.6.jar
hadoop-auth-2.7.2.jar
hadoop-common-2.7.2.jar
hadoop-hdfs-2.7.2.jar
commons-io-2.4.jar
htrace-core-3.1.0-incubating.jar

into the /opt/module/flume/lib folder.
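A minimal sketch of the copy, assuming the jar files have already been gathered into a staging directory such as /opt/software (a hypothetical location; they normally come from a Hadoop 2.7.2 installation):

[test@hadoop151 software]$ cp commons-configuration-1.6.jar hadoop-auth-2.7.2.jar hadoop-common-2.7.2.jar hadoop-hdfs-2.7.2.jar commons-io-2.4.jar htrace-core-3.1.0-incubating.jar /opt/module/flume/lib/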

2. Create the flume-file-hdfs.conf file
Create the file:

[test@hadoop151 job]$ vim flume-file-hdfs.conf

Note: to read a file on a Linux system, the command is executed according to normal Linux command rules. Because the Hive log lives on the Linux file system, the source type chosen for reading the file is exec, meaning execute: the source runs a Linux command and reads the file through that command's output.

Add the following:

# Name the components on this agent 
a2.sources = r2 
a2.sinks = k2 
a2.channels = c2 
 
# Describe/configure the source 
a2.sources.r2.type = exec 
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log 
a2.sources.r2.shell = /bin/bash -c 
 
# Describe the sink 
a2.sinks.k2.type = hdfs 
a2.sinks.k2.hdfs.path = hdfs://hadoop151:9000/flume/%Y%m%d/%H 
# Prefix of uploaded file 
a2.sinks.k2.hdfs.filePrefix = logs- 
# Scroll folders by time 
a2.sinks.k2.hdfs.round = true 
# How many time units to create a new folder 
a2.sinks.k2.hdfs.roundValue = 1 
# Redefining time units 
a2.sinks.k2.hdfs.roundUnit = hour 
# Use local time stamp or not 
a2.sinks.k2.hdfs.useLocalTimeStamp = true 
# How many events are accumulated before flush to HDFS once 
a2.sinks.k2.hdfs.batchSize = 1000 
# Set file type to support compression 
a2.sinks.k2.hdfs.fileType = DataStream 
# How often to generate a new file 
a2.sinks.k2.hdfs.rollInterval = 30 
# Set the scroll size for each file 
a2.sinks.k2.hdfs.rollSize = 134217700 
# Scrolling of files is independent of the number of events 
a2.sinks.k2.hdfs.rollCount = 0 
 
# Use a channel which buffers events in memory 
a2.channels.c2.type = memory 
a2.channels.c2.capacity = 1000 
a2.channels.c2.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
a2.sources.r2.channels = c2 
a2.sinks.k2.channel = c2 

Note:
For all time-related escape sequences, a key named "timestamp" must exist in the event headers (unless hdfs.useLocalTimeStamp is set to true; that option automatically adds a timestamp using the TimestampInterceptor).

a2.sinks.k2.hdfs.useLocalTimeStamp = true

3. Run Flume

[test@hadoop151 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf

4. Start Hadoop and Hive, and run Hive operations to generate logs

[test@hadoop151 jdk1.8.0_144]$ start-dfs.sh
[test@hadoop152 ~]$ start-yarn.sh 
hive (default)> 
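Any Hive operation writes to hive.log; a minimal example once the Hive CLI is open (the query itself is arbitrary):

hive (default)> show databases;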

5. Viewing files on HDFS
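Besides the NameNode web UI, a quick command-line check is possible (a sketch; the path matches the hdfs.path configured above):

[test@hadoop151 flume]$ hadoop fs -ls -R /flume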

4, Getting started case - real-time monitoring of new files in a directory

4.1 case introduction

1. Case requirement
Use Flume to monitor new files in an entire directory and upload them to HDFS.

2. Requirement analysis

4.2 case steps

1. Create the configuration file

[test@hadoop151 job]$ vim flume-dir-hdfs.conf

Add the following:

a3.sources = r3 
a3.sinks = k3 
a3.channels = c3 
 
#Describe/configure the source 
a3.sources.r3.type = spooldir 
a3.sources.r3.spoolDir = /opt/module/flume/upload 
a3.sources.r3.fileSuffix = .COMPLETED 
a3.sources.r3.fileHeader = true 
#Ignore all files ending in .tmp; do not upload them 
a3.sources.r3.ignorePattern = ([^ ]*\.tmp) 
 
# Describe the sink 
a3.sinks.k3.type = hdfs 
a3.sinks.k3.hdfs.path = hdfs://hadoop151:9000/flume/upload/%Y%m%d/%H 
#Prefix of uploaded file 
a3.sinks.k3.hdfs.filePrefix = upload- 
#Scroll folders by time 
a3.sinks.k3.hdfs.round = true 
#How many time units to create a new folder 
a3.sinks.k3.hdfs.roundValue = 1 
#Redefining time units 
a3.sinks.k3.hdfs.roundUnit = hour 
#Use local time stamp or not 
a3.sinks.k3.hdfs.useLocalTimeStamp = true 
#How many events are accumulated before flush to HDFS once 
a3.sinks.k3.hdfs.batchSize = 100 
#Set file type to support compression 
a3.sinks.k3.hdfs.fileType = DataStream 
#How often to generate a new file 
a3.sinks.k3.hdfs.rollInterval = 60
#Set the scrolling size of each file to about 128M 
a3.sinks.k3.hdfs.rollSize = 134217700 
#Scrolling of files is independent of the number of events 
a3.sinks.k3.hdfs.rollCount = 0 
 

# Use a channel which buffers events in memory 
a3.channels.c3.type = memory 
a3.channels.c3.capacity = 1000 
a3.channels.c3.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
a3.sources.r3.channels = c3 
a3.sinks.k3.channel = c3 

2. Parameter analysis

3. Start the folder-monitoring command

[test@hadoop151 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

Note: when using the Spooling Directory Source, do not create or continuously modify files inside the monitored directory. Uploaded files are renamed with the .COMPLETED suffix. The monitored folder is scanned for file changes every 500 milliseconds.

4. Add files to the folder to be monitored

[test@hadoop151 flume]$ mkdir upload
[test@hadoop151 flume]$ cd upload/
[test@hadoop151 upload]$ touch 1.txt
[test@hadoop151 upload]$ touch 2.tmp
[test@hadoop151 upload]$ touch 3.log
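After a short while, listing the monitored folder should show the ingested files renamed with the .COMPLETED suffix, while 2.tmp is left alone because of the ignorePattern above (a hedged check):

[test@hadoop151 upload]$ ls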

5. Viewing data on HDFS

5, Getting started case - real-time monitoring of multiple appended files in a directory

5.1 introduction to each Source

Exec Source is suitable for monitoring a single file that is appended to in real time, but it cannot guarantee that no data is lost. Spooling Directory Source can guarantee that no data is lost and supports resuming from a breakpoint, but its latency is high and it cannot monitor files in real time. Taildir Source supports breakpoint resume, guarantees that no data is lost, and can also monitor files in real time.

5.2 case analysis

1. Case requirement
Use Flume to monitor files that are appended to in real time across an entire directory and upload them to HDFS.

2. Requirement analysis

5.3 case steps

1. Create the configuration file flume-taildir-hdfs.conf
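Following the same pattern as the earlier cases, the file can be created in the job folder:

[test@hadoop151 job]$ vim flume-taildir-hdfs.conf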
Add the following to the file:

a3.sources = r3 
a3.sinks = k3 
a3.channels = c3 
 
# Describe/configure the source 
a3.sources.r3.type = TAILDIR 
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json 
a3.sources.r3.filegroups = f1 
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/file.* 
 
# Describe the sink 
a3.sinks.k3.type = hdfs 
a3.sinks.k3.hdfs.path = hdfs://hadoop151:9000/flume/upload/%Y%m%d/%H 
#Prefix of uploaded file 
a3.sinks.k3.hdfs.filePrefix = upload- 
#Scroll folders by time 
a3.sinks.k3.hdfs.round = true 
#How many time units to create a new folder 
a3.sinks.k3.hdfs.roundValue = 1 
#Redefining time units 
a3.sinks.k3.hdfs.roundUnit = hour 
#Use local time stamp or not 
a3.sinks.k3.hdfs.useLocalTimeStamp = true 
#How many events are accumulated before flush to HDFS once 
a3.sinks.k3.hdfs.batchSize = 100 
#Set file type to support compression 
a3.sinks.k3.hdfs.fileType = DataStream 
#How often to generate a new file 
a3.sinks.k3.hdfs.rollInterval = 60 
#Set the scrolling size of each file to about 128M 
a3.sinks.k3.hdfs.rollSize = 134217700 
#Scrolling of files is independent of the number of events 
a3.sinks.k3.hdfs.rollCount = 0 
 
# Use a channel which buffers events in memory 
a3.channels.c3.type = memory 
a3.channels.c3.capacity = 1000 
a3.channels.c3.transactionCapacity = 100 
 
# Bind the source and sink to the channel 
a3.sources.r3.channels = c3 
a3.sinks.k3.channel = c3

2. Parameter interpretation

3. Start the folder-monitoring command

[test@hadoop151 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-taildir-hdfs.conf
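To exercise the Taildir Source, append content to files matching the configured /opt/module/flume/files/file.* pattern; a hedged sketch:

[test@hadoop151 flume]$ mkdir files
[test@hadoop151 flume]$ cd files/
[test@hadoop151 files]$ echo hello >> file1.txt
[test@hadoop151 files]$ echo flume >> file2.txt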

4. View HDFS file system

Taildir Source description:
Taildir Source maintains a position file in JSON format and periodically updates in it the latest position read for each tracked file, which is how breakpoint resume is achieved. The format of the position file is as follows:

{"inode": 2496272, "pos": 12, "file": "/opt/module/flume/files/file1.txt"}
{"inode": 2496275, "pos": 12, "file": "/opt/module/flume/files/file2.txt"}

Note: the area where file metadata is stored in Linux is called the inode. Each inode has a number, and the operating system uses inode numbers to identify different files; Unix/Linux systems identify files internally by inode number rather than by file name.
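A quick way to see a file's inode number on Linux (a hedged example):

[test@hadoop151 files]$ ls -i file1.txt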

6, Software installation package

The Flume installation package and the required jar packages are available via Baidu cloud disk:
Links: https://pan.baidu.com/s/1irJ38usIHp6C0IkFjBIWxg
Extraction code: 5xor
