Data acquisition framework Flume

Flume official website

(1) Flume official website address: http://flume.apache.org/
(2) Document viewing address: http://flume.apache.org/FlumeUserGuide.html
(3) Download address: http://archive.apache.org/dist/flume/

Flume overview

(1) Flume was originally developed by Cloudera and is now a top-level Apache project
(2) A framework for massive log collection, aggregation and transmission
(3) Highly available, highly reliable, distributed

  • High availability: if one Flume agent goes down, other Flume agents can take over its work
  • High reliability: data is transmitted reliably, without loss
  • Distributed here means that Flume can be deployed on many log servers to collect data and then aggregate it; a single Flume agent itself is just one application, not a distributed cluster

(4) Collects data in (near) real time, moving it in batches of events

Flume's use in big data scenarios

Flume architecture

1. Agent

(1) An Agent is a JVM process; starting Flume means starting an agent process, which shows up as an Application process on Linux
(2) It encapsulates data in the form of events and moves it from the source to the destination
(3) An Agent is mainly composed of three parts: Source, Channel and Sink

2. Source

(1) Source is responsible for collecting data
(2) The source component can handle various types and formats of log data, including avro, thrift, exec, jms, spooling directory, netcat, taildir, sequence generator, syslog, http, legacy, and custom

3. Sink

The Sink does two things:
(1) It constantly polls the Channel for events and removes them in batches (pull)
(2) It writes these events in batches to a storage system or to another Flume Agent.
Sink component destinations include hdfs, logger, avro, thrift, ipc, file, HBase, solr, and custom.

4. Channel

(1) The Channel is the buffer between the Source and the Sink.
Its main function is to provide buffering, so the Channel allows the Source and the Sink to operate at different rates.
(2) The Channel is thread safe and can handle write operations from several Sources and read operations from several Sinks at the same time.
(3) Flume comes with two channels: Memory Channel and File Channel.
The Memory Channel is a queue in memory and is suitable when data loss is acceptable: if the process dies or the machine goes down or restarts, data is lost.
The File Channel writes all events to disk, so no data is lost in those cases (a configuration sketch follows below).
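
A minimal File Channel configuration might look like the following sketch; the checkpoint and data directory paths are illustrative:

# File Channel: events are persisted to disk, so they survive an agent restart
a1.channels.c1.type = file
# Directory where the channel keeps its checkpoint (illustrative path)
a1.channels.c1.checkpointDir = /opt/module/flume/checkpoint
# Comma-separated list of directories where event data is written (illustrative path)
a1.channels.c1.dataDirs = /opt/module/flume/data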

5. Event


(1) The event is the unit of data transmission in the Flume framework.
(2) An event consists of a Header and a Body.

  • The Header stores attributes of the event as a K-V map; it is empty by default.
  • The Body stores the data as a byte array.

(3) In the end only the Body data is stored on HDFS; the Header is not written.
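
A minimal Java sketch of an Event, using the EventBuilder class from flume-ng-core (the same dependency used for the custom interceptor later in this article); the header key/value here are just for illustration:

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

import java.nio.charset.StandardCharsets;

public class EventDemo {
    public static void main(String[] args) {
        // Body: the payload, stored as a byte array
        Event event = EventBuilder.withBody("hello", StandardCharsets.UTF_8);
        // Header: optional K-V attributes, empty by default
        event.getHeaders().put("type", "letter");
        System.out.println(event.getHeaders() + " "
                + new String(event.getBody(), StandardCharsets.UTF_8));
    }
}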

6. Channel Selector

(1) Application scenario: one source followed by multiple channels

(2) Flume comes with two selectors
1. Replicating Channel Selector: replication selector, the default;
the event is copied from the source to every channel
2. Multiplexing Channel Selector: multiplexer;
events are assigned to different channels according to the event header

(3) Working point: between the Source and the Channel

7. Interceptor

An interceptor can set a key-value pair in the event header, which determines which channel the event goes to, so it is typically used together with the Multiplexing Channel Selector

It runs after the source reads the data and before the channel selector

8.SinkProcessor

It is used when one channel is followed by multiple sinks (a sink group) and determines which sink gets which events from the channel

Flume installation

1. Installation address

(1) Flume official website address: http://flume.apache.org/
(2) Document viewing address: http://flume.apache.org/FlumeUserGuide.html
(3) Download address: http://archive.apache.org/dist/flume/

2. Installation and deployment

(1) Upload apache-flume-1.9.0-bin.tar.gz to the /opt/software directory on Linux
(2) Unzip apache-flume-1.9.0-bin.tar.gz to /opt/module/

[atguigu@hadoop102 software]$ tar -zxf /opt/software/apache-flume-1.9.0-bin.tar.gz -C /opt/module/

(3) Change the name of apache-flume-1.9.0-bin to flume

[atguigu@hadoop102 module]$ mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume

(4) Delete guava-11.0.2.jar under lib folder to be compatible with Hadoop 3.1.3

[atguigu@hadoop102 lib]$  rm /opt/module/flume/lib/guava-11.0.2.jar

Flume case

1. Monitor port data in real time

Requirements: use Flume to listen to a port, collect the port data, and print it to the console
(1) Write the agent configuration file flume-netcat-logger.conf
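
A minimal sketch of this configuration, following the netcat example in the Flume user guide (netcat source on localhost:44444, logger sink, memory channel):

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source: listen on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink: print events to the log/console
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1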

(2) Start flume agent process
The first way to write:

[atguigu@hadoop102 flume]$ bin/flume-ng agent \
--conf conf/ --name a1 --conf-file job/flume-netcat-logger.conf \
-Dflume.root.logger=INFO,console

The second way to write:

[atguigu@hadoop102 flume]$ bin/flume-ng agent -c conf/ -n a1 \
-f job/flume-netcat-logger.conf -Dflume.root.logger=INFO,console

Parameter description:
--conf/-c: the directory where Flume's configuration files are stored, here conf/
--name/-n: the name of the agent defined in the configuration file, here a1
--conf-file/-f: the path of the agent configuration file Flume reads for this run
-Dflume.root.logger=INFO,console: -D dynamically overrides the flume.root.logger property at runtime, printing Flume's logs to the console at INFO level.
Log levels include debug, info, warn and error.
(3) Testing
Use the netcat tool in Linux to send content to port 44444 on this machine

[atguigu@hadoop102 ~]$ nc localhost 44444
hello 
atguigu

Data read from the console shows that the message is encapsulated as an Event: the header is an empty map and the body is a byte array

Supplementary knowledge: localhost, the host name (e.g. hadoop102) and 0.0.0.0 are different

  • localhost on each host refers to the loopback address; it cannot be reached from outside, only by processes on that host.

  • hadoop102 resolves to the host's network address, which can be reached both by local processes and by processes on other hosts.

  • A local process reaching localhost goes over the loopback interface and uses no network bandwidth; going through hadoop102 uses the network interface.

  • Listening with nc -l 0.0.0.0 44444 can be reached both via nc hadoop102 44444 and via nc localhost 44444, as sketched below.
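
A quick sketch of the difference (using the same port as this case):

# Listen on all interfaces (0.0.0.0); both clients below can reach this listener
[atguigu@hadoop102 ~]$ nc -l 0.0.0.0 44444

# From the same host, over the loopback interface (no network bandwidth used)
[atguigu@hadoop102 ~]$ nc localhost 44444
# From this or another host, through the hadoop102 address (uses the network interface)
[atguigu@hadoop103 ~]$ nc hadoop102 44444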

2. Real time monitor a single appended file to HDFS

Requirements: monitor Hive logs in real time and upload them to HDFS
(1) Flume relies on Hadoop related jar packages to output data to HDFS

  • Check the /etc/profile.d/my_env.sh file and confirm that the Hadoop and Java environment variables are configured correctly
JAVA_HOME=/opt/module/jdk1.8.0_212
HADOOP_HOME=/opt/module/ha/hadoop-3.1.3
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH JAVA_HOME HADOOP_HOME

2.1 exec source and HDFS sink

(2) Write the agent configuration file flume-file-hdfs.conf

[atguigu@hadoop102 job]$ vim flume-file-hdfs.conf

Note: Hive writes its logs to a file in the Linux file system, so we read the file by executing a Linux command; the source type is exec, short for "execute", meaning the source runs a Linux command and reads its output

The tail -F command by default starts from the last 10 lines of the file, and the exec source does not record a read offset, so breakpoint resume is not supported;

Add the following:

# Name the components on this agent
a2.sources = r2
a2.sinks = k2
a2.channels = c2

# Describe/configure the source
a2.sources.r2.type = exec
a2.sources.r2.command = tail -F /opt/module/hive/logs/hive.log

# Describe the sink
a2.sinks.k2.type = hdfs
# Set the directory where the collected data is stored on hdfs
a2.sinks.k2.hdfs.path = hdfs://hadoop102:8020/flume/%Y%m%d/%H
#File name prefix for storing collected data in HDFS
a2.sinks.k2.hdfs.filePrefix = logs-

#Time-based folder rolling: whether to round down the timestamp
a2.sinks.k2.hdfs.round = true
#Create a new folder every this many time units
a2.sinks.k2.hdfs.roundValue = 1
#The time unit used for rounding
a2.sinks.k2.hdfs.roundUnit = hour

#Use the agent's local timestamp; must be true here, because the %Y%m%d/%H escape sequences in the path need a timestamp in the event header
a2.sinks.k2.hdfs.useLocalTimeStamp = true

#How many events to accumulate before flushing them to HDFS in one batch
a2.sinks.k2.hdfs.batchSize = 100

#Set the file type. DataStream does not support compression
a2.sinks.k2.hdfs.fileType = DataStream

# How often to roll a new file, in seconds
a2.sinks.k2.hdfs.rollInterval = 60
#Roll the file when it reaches this size, in bytes (just under 128 MB)
a2.sinks.k2.hdfs.rollSize = 134217700
#Roll the file after this many events; 0 disables count-based rolling, since event sizes vary
a2.sinks.k2.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a2.channels.c2.type = memory
a2.channels.c2.capacity = 1000
a2.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r2.channels = c2
a2.sinks.k2.channel = c2

  • hdfs sink description

(1) hdfs.useLocalTimeStamp
In the HDFS sink configuration, every time-related escape sequence requires a "timestamp" key in the Event Header; by default the header of an event read by the source is empty, so an error is reported. The fix is to set hdfs.useLocalTimeStamp = true, which makes the sink use the agent's local time; the alternative is to add a timestamp to the header, for example with the Timestamp Interceptor.

In this example the HDFS path contains time escape sequences, so this must be configured.
(2) hdfs.fileType
There are also compression-related settings; the official documentation describes them

(3) HDFS file rolling
(1) With this configuration, files roll by time and by file size, not by event count (rollCount = 0).
(2) The file HDFS is currently writing to carries a .tmp suffix; the suffix is removed once the file rolls.
(3) A .tmp file appears when new data is collected to HDFS; the roll only completes when the time or size threshold is reached; if no data ever arrives, no .tmp file is created.

(3) Run Flume

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/flume-file-hdfs.conf

(4) Start Hadoop and Hive and operate Hive to generate logs

[atguigu@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[atguigu@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh

[atguigu@hadoop102 hive]$ bin/hive
hive (default)>

3. Real time monitor multiple new files in the directory Spooldir source

Requirement: use Flume to listen to the files in the whole directory and upload them to HDFS

  • spooldir source
    Function: monitors a whole directory and collects the contents of files in it that have not yet been collected
    How it works: the Spooldir Source watches a target directory; files that do not yet carry the completion suffix are collected.
    1. A collected file gets the .COMPLETED suffix to mark that it has been collected
    2. A new file that already ends in .COMPLETED will not be collected, and data appended to a .COMPLETED file will not be collected either
    3. If 1.txt has been collected and renamed to 1.txt.COMPLETED, creating a new 1.txt again will make the source fail and the task hang up
    4. A new file whose name matches the ignore pattern will not be collected

This source suits collecting the data of many newly created files in a directory; files must not be collected repeatedly, and they must not be modified after being placed in the directory.

(1) Create the configuration file flume-dir-hdfs.conf

[atguigu@hadoop102 job]$ vim flume-dir-hdfs.conf

Add the following

a3.sources = r3
a3.sinks = k3
a3.channels = c3

# Describe/configure the source
a3.sources.r3.type = spooldir
#Directory to monitor
a3.sources.r3.spoolDir = /opt/module/flume/upload
#Suffix appended to a file once it has been collected
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true
#Ignore (do not collect) files ending in .tmp
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:8020/flume/upload/%Y%m%d/%H
#Prefix of uploaded file
a3.sinks.k3.hdfs.filePrefix = upload-
#Scroll folders by time
a3.sinks.k3.hdfs.round = true
#How many time units to create a new folder
a3.sinks.k3.hdfs.roundValue = 1
#Redefine time units
a3.sinks.k3.hdfs.roundUnit = hour
#Use local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#How many events are accumulated to flush to HDFS once
a3.sinks.k3.hdfs.batchSize = 100
#Set the file type; DataStream means plain, uncompressed output
a3.sinks.k3.hdfs.fileType = DataStream
#How often do I generate a new file
a3.sinks.k3.hdfs.rollInterval = 60
#Set the scroll size of each file to about 128M
a3.sinks.k3.hdfs.rollSize = 134217700
#File scrolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

(2) Start monitor folder command

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-dir-hdfs.conf

Note: when using the Spooling Directory Source, do not create files in the monitored directory and keep modifying them; collected files are renamed with the .COMPLETED suffix; the monitored folder is scanned for new files every 500 milliseconds.
(3) Create the upload directory in the / opt/module/flume directory

[atguigu@hadoop102 flume]$ mkdir upload

Add files to the upload folder

[atguigu@hadoop102 upload]$ touch atguigu.txt
[atguigu@hadoop102 upload]$ touch atguigu.tmp
[atguigu@hadoop102 upload]$ touch atguigu.log

(4) View data on HDFS
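
For example, list the target directory configured above; the date/hour subdirectories depend on when the test was run:

[atguigu@hadoop102 flume]$ hdfs dfs -ls -R /flume/upload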

4. Real time monitoring of multiple appended files in multiple directories

Taildir Source

Function: collects content appended to all files under the monitored directories; newly created files in those directories are also picked up
Features: multi-file monitoring + breakpoint resume.
Breakpoint resume: a JSON file stores the read position of each file, so if Flume crashes and restarts it continues collecting from where it left off.

Comparison:
Exec Source: monitors appends to a single file; no breakpoint resume
Taildir Source: monitors appends to, and creation of, multiple files in multiple directories, with breakpoint resume
Spooldir Source: synchronizes new files in a directory; not suitable for files to which logs are appended in real time

Taildir details:
(1) Taildir Source maintains a position file in JSON format and regularly updates the latest position read in each file, which is what makes breakpoint resume possible.

(2) The format of the position file is as follows:

{"inode":2496272,"pos":12,"file":"/opt/module/flume/files/file1.txt"}
{"inode":2496275,"pos":12,"file":"/opt/module/flume/files/file2.txt"}

The position file records three pieces of information:
1. inode: in UNIX/Linux the area where file metadata is stored is called the inode; every inode has a number, and the operating system identifies files by inode number rather than by file name, so renaming a file does not change its inode number (see the example below)
2. pos: the number of bytes already read from the file
3. file: the absolute path of the file
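
The inode of a file can be checked with stat; the value shown below mirrors the file1.txt entry in the position file example above:

[atguigu@hadoop102 flume]$ stat -c '%i %n' /opt/module/flume/files/file1.txt
2496272 /opt/module/flume/files/file1.txt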

Important Taildir caveat: if the inode or the file path changes, the file's content is collected again.
Log files produced with log4j are renamed with a date: for example, at midnight today's hive.log is renamed to hive.log-2020-10-31 and a new hive.log is created to record the new day's logs; this causes logs to be collected repeatedly [absolute path changed, inode unchanged].
Reference: https://blog.csdn.net/maoyuanming0806/article/details/79391010

Taildir source code modification

Flume source package (download address: http://mirror.bit.edu.cn/apache/flume/1.7.0/apache-flume-1.7.0-src.tar.gz )

Modification 1: change how the position file is written

Modification 2: change how the log file is matched when reading

In both places, identify files only by their inode value, ignoring the file name

After the modification, package the module and upload the jar to Flume's lib directory, replacing the original jar package

Practical operation

Case requirement: use Flume to monitor files that are appended to in real time under a whole directory, and upload them to HDFS

(1) Create the configuration file flume-taildir-hdfs.conf

[atguigu@hadoop102 job]$ vim flume-taildir-hdfs.conf

Add the following

a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = TAILDIR
# Specify position_file location
a3.sources.r3.positionFile = /opt/module/flume/tail_dir.json
a3.sources.r3.filegroups = f1 f2
# Define monitored files
a3.sources.r3.filegroups.f1 = /opt/module/flume/files/.*file.*
a3.sources.r3.filegroups.f2 = /opt/module/flume/files2/.*log.*

# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop102:8020/flume/upload2/%Y%m%d/%H
#Prefix of uploaded file
a3.sinks.k3.hdfs.filePrefix = upload-
#Scroll folders by time
a3.sinks.k3.hdfs.round = true
#How many time units to create a new folder
a3.sinks.k3.hdfs.roundValue = 1
#Redefine time units
a3.sinks.k3.hdfs.roundUnit = hour
#Use local timestamp
a3.sinks.k3.hdfs.useLocalTimeStamp = true
#How many events are accumulated to flush to HDFS once
a3.sinks.k3.hdfs.batchSize = 100
#Set the file type; DataStream means plain, uncompressed output
a3.sinks.k3.hdfs.fileType = DataStream
#How often do I generate a new file
a3.sinks.k3.hdfs.rollInterval = 60
#Set the scroll size of each file to about 128M
a3.sinks.k3.hdfs.rollSize = 134217700
#File scrolling is independent of the number of events
a3.sinks.k3.hdfs.rollCount = 0

# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 1000
a3.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3

(2) Start monitor folder command

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/flume-taildir-hdfs.conf

(3) Append content to the files folder
Create the files folder in the /opt/module/flume directory

[atguigu@hadoop102 flume]$ mkdir files
#Append data to files in the files folder
[atguigu@hadoop102 files]$ echo hello >> file1.txt
[atguigu@hadoop102 files]$ echo atguigu >> file2.txt

(4) View data on HDFS

Flume advanced

Flume transaction

(1) put transaction

(1) The put transaction happens when the source puts data into the channel. The source puts data into the channel in batches: events collected by the source first go into a batch buffer called the putList; when enough events have accumulated, doCommit is executed to try to commit. If the channel has enough buffer space, the events are moved into the channel queue; if not, the transaction is rolled back!
(2) The event capacity of the putList is the transactionCapacity in the Flume configuration file
(3) The consequence of a rollback is that all data in the putList is discarded, and an exception is reported to the source, so the source knows that this batch has not been committed successfully
Note: in general a put transaction rollback in Flume simply clears the putList. Whether the data can be collected again depends on the Source used: a Taildir Source records the read position in the file, so rolled-back data can be collected again, while a Source that records no offset loses the data

(2) take transaction

(1) The sink takes events from the channel and pulls them into the takeList. When the takeList holds a full batch, doCommit is executed: if the sink writes all the data out successfully, the takeList is cleared and the corresponding events are also removed from the channel; if the sink fails to write the data out, the transaction is rolled back (the rollback returns the events in the takeList to the channel, so the channel must keep enough spare capacity to take the takeList back)

(2) The event capacity of the takeList is also the transactionCapacity in the Flume configuration file (see the sketch below)

(3) Consequence of a take transaction rollback: duplicated output is possible. If only part of a batch has been written out when the rollback happens, the whole batch is taken from the channel again, and some of those events have already been written out
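
A hedged sketch of how these capacities relate in the configuration files used in this article (values illustrative): the putList/takeList size is the channel's transactionCapacity, which must not exceed the channel capacity, and a sink's batch size should generally not exceed the transactionCapacity.

# Total number of events the memory channel queue can hold
a1.channels.c1.capacity = 1000
# Size of one put/take batch (the putList / takeList); must be <= capacity
a1.channels.c1.transactionCapacity = 100
# HDFS sink batch size; generally kept <= transactionCapacity
a1.sinks.k1.hdfs.batchSize = 100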

Flume internal data processing flow


(1) Data flow in Flume
a) The source collects data and encapsulates it into events
b) Interceptors process the events
c) The channel selector chooses a channel for each event
d) The event enters the channel
e) The sink processor decides which sink takes the event and how
f) The sink writes the event out
(2) Important components
1. The ChannelSelector decides which Channel an Event is sent to.
ReplicatingSelector: the same Event is sent to all channels
Multiplexing: events are sent to different channels according to the header.
Multiplexing needs to be used together with an interceptor and multiple channels; the header can be set in the interceptor

2.SinkProcessor
(1) DefaultSinkProcessor: a channel can only be bound to one sink
(2) LoadBalancingSinkProcessor and FailoverSinkProcessor work on a Sink Group and can bind multiple sinks
(3) LoadBalancingSinkProcessor provides load balancing
Sinks pull events from the channel according to a rule; two rules are available: (1) random (2) round_robin, which in theory pulls a -> b -> c -> a -> b -> c ...; round_robin is the default

Because events are not pushed out by the channel but taken by the sinks in batches, the round-robin effect may not be visible in an actual test. (Note: when sink a's turn comes in the round robin, a goes to the channel to fetch data, but the channel may happen to be empty at that moment, so a gets nothing this time.)
(4) FailoverSinkProcessor provides high availability / failover: when sink a goes down, sink b takes over
Only one sink works at a time; if several sinks are available, the one with the highest priority is used.
(5) Only the LoadBalancingSinkProcessor has multiple sinks working at the same time; the FailoverSinkProcessor keeps several sinks ready, but only one works at any moment.

Flume topology

1. Simple series connection


Chaining too many Flume agents not only lowers the transmission rate; if any agent in the chain goes down during transmission, the whole pipeline is affected

2. Replication and multiplexing (one source, multiple channels)


Replication and multiplexing: multiple channels after one source
Application scenario: multiple destinations
(1) Replication: send the same event to multiple destinations.
(2) Multiplexing: distribute different data to different destinations.

3. Load balancing and failover (one channel, multiple sinks)


Background: multiple sinks (a sink group) behind one channel
Application scenario: load balancing and high availability of sinks
Note: load balancing means the sinks write out in turn, and each sink reads different data.

In practical use of load balancing and failover, the sinks' destinations are equivalent;

4. Aggregation (many upstream flumes, one downstream flume; the most common topology)


Application scenario: web applications are usually spread over hundreds, thousands or even tens of thousands of servers, and the logs they produce are troublesome to handle individually. Each server runs a Flume agent to collect its logs and sends them to one Flume agent that aggregates them; that agent then uploads the data to HDFS, Hive, HBase, etc. for log analysis

Flume case 2

1. Copy

1) Case requirements
Flume-1 monitors file changes and passes the changes to Flume-2 and Flume-3
Flume-2 is responsible for storing to HDFS
Flume-3 is responsible for outputting to the local file system

2) Requirement analysis:
One source with two channels and two sinks: both flows carry the same content but go to different destinations, so the Replicating channel selector must be used

Flume1: Exec Source,Replicating ChannelSelector, two memory channels, two Avro Sink

Flume2: Avro Source,Memory Channel,HDFS Sink

Flume3: Avro Source,Memory Channel,File_roll Sink

(1) Configuration file flume-file-flume.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2  
a1.channels = c1 c2

# Copy the data flow to all channel channelselector configurations
a1.sources.r1.selector.type = replicating

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/hive/logs/hive.log
a1.sources.r1.shell = /bin/bash -c

# Describe the sink
# avro on sink side is a data sender
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102 
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

(2) Configuration file flume-flume-hdfs.conf
Avro source

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
# avro source is the passive party and needs to bind the hostname and port specified by the upstream avro sink  
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop102:8020/flume2/%Y%m%d/%H
#Prefix of uploaded file
a2.sinks.k1.hdfs.filePrefix = flume2-
#Scroll folders by time
a2.sinks.k1.hdfs.round = true
#How many time units to create a new folder
a2.sinks.k1.hdfs.roundValue = 1
#Redefine time units
a2.sinks.k1.hdfs.roundUnit = hour
#Use local timestamp
a2.sinks.k1.hdfs.useLocalTimeStamp = true
#How many events are accumulated to flush to HDFS once
a2.sinks.k1.hdfs.batchSize = 100
#Set the file type; DataStream means plain, uncompressed output
a2.sinks.k1.hdfs.fileType = DataStream
#How often do I generate a new file
a2.sinks.k1.hdfs.rollInterval = 600
#Set the scroll size of each file to about 128M
a2.sinks.k1.hdfs.rollSize = 134217700
#File scrolling is independent of the number of events
a2.sinks.k1.hdfs.rollCount = 0

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(3) Configuration file flume-flume-dir.conf
Avro Source, file_roll Sink.

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = file_roll
a3.sinks.k1.sink.directory = /opt/module/data/flume3

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

Tip: the local output directory used by the file_roll sink must already exist; if the directory does not exist, the sink will not create it. The three agents can then be started as sketched below.
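
Before starting, create the local output directory, then start the three agents, downstream first. The job/group1 directory below is an assumption about where the three configuration files above were saved:

[atguigu@hadoop102 flume]$ mkdir -p /opt/module/data/flume3
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group1/flume-flume-dir.conf
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group1/flume-flume-hdfs.conf
[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group1/flume-file-flume.conf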

2. Failover

1) Case requirements
Flume1 monitors a port. The sinks in its sink group connect to Flume2 and Flume3 respectively; the FailoverSinkProcessor provides failover

(1) Upstream flume: flume-netcat-flume.conf
1 netcat source, 1 channel, 1 sink group (2 sinks)

# Name the components on this agent
a1.sources = r1
a1.channels = c1
#Add sink group
a1.sinkgroups = g1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

#Set the group processor to failover
a1.sinkgroups.g1.processor.type = failover

#Set a priority for each sink; the sink with the highest priority is chosen (if no priority is specified, the configuration order of the sinks decides)
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10

a1.sinkgroups.g1.processor.maxpenalty = 10000

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop102
a1.sinks.k2.port = 4142

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
#Set the members of sink group!
a1.sinkgroups.g1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

(2) Downstream flume1:flume-flume-console1.conf

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop102
a2.sources.r1.port = 4141

# Describe the sink
a2.sinks.k1.type = logger

# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(3) Downstream flume2:flume-flume-console2.conf

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop102
a3.sources.r1.port = 4142

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 1000
a3.channels.c2.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

(4) Run the configuration files

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group2/flume-flume-console2.conf -Dflume.root.logger=INFO,console

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group2/flume-flume-console1.conf -Dflume.root.logger=INFO,console

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group2/flume-netcat-flume.conf

Note: use jps -ml to view the Flume processes.
In this case Flume3 (sink k2, priority 10) has the higher priority. If Flume3 goes down, Flume2 takes over the writing. If Flume3 is restarted, then once the sink in Flume1 that connects to Flume3 has passed its backoff penalty, Flume3 takes the work back because of its higher priority

3. Load balancing


Related configurations of sink group:

backoff: back off failed sinks. When a sink fails, it is not retried in turn right away; it is put on a backoff list for a period of time that grows exponentially, and maxTimeOut caps that backoff period so it stops growing. "Sink failure" here means there is a problem at the downstream end of the sink, so the sink cannot write data out normally

(1) Upstream configuration flume1

#name
a1.sources = r1
a1.channels = c1 
a1.sinks = k1 k2

# source
a1.sources.r1.type = netcat
a1.sources.r1.bind = hadoop102
a1.sources.r1.port = 44444

# channel 
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# sink processor loadbalancing
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
#The load balancing selector can be round_robin or random
a1.sinkgroups.g1.processor.selector = round_robin
#Whether to back off failed sinks
a1.sinkgroups.g1.processor.backoff = true

# sink 
a1.sinks.k1.type = avro 
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4444

a1.sinks.k2.type = avro 
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 5555

#bind
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

(2) Downstream configuration flume2

#name
a2.sources = r1
a2.channels = c1 
a2.sinks = k1 

# source
a2.sources.r1.type = avro
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 4444

# channel 
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# sink
a2.sinks.k1.type = logger

# bind 
a2.sources.r1.channels = c1 
a2.sinks.k1.channel = c1

(3)flume3

#name
a3.sources = r1
a3.channels = c1 
a3.sinks = k1 

# source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 5555

# channel 
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# sink
a3.sinks.k1.type = logger

# bind 
a3.sources.r1.channels = c1 
a3.sinks.k1.channel = c1

Load balancing and failover summary:
Common ground:
1. Both use a sink group: a1.sinkgroups = g1 and a1.sinkgroups.g1.sinks = k1 k2
2. Both back off failed sinks: when a sink fails to pull or deliver data (the channel has no data, or the downstream is down), the sink backs off for a period of time
Differences:
failover: sinks are given priorities; if none are specified, the order of the sinks in the configuration file decides
load balance: two selection rules, round_robin and random

4. Aggregation


(1) Upstream: flume1-exec-flume.conf, on hadoop102

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/group.log

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop104
a1.sinks.k1.port = 4141

# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

(2) Upstream: flume2-netcat-flume.conf, on hadoop103

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1

# Describe/configure the source
a2.sources.r1.type = netcat
a2.sources.r1.bind = hadoop103
a2.sources.r1.port = 44444

# Describe the sink
a2.sinks.k1.type = avro
a2.sinks.k1.hostname = hadoop104
a2.sinks.k1.port = 4141

# Use a channel which buffers events in memory
a2.channels.c1.type = memory
a2.channels.c1.capacity = 1000
a2.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1

(3) Downstream: flume3-flume-logger.conf, on hadoop104

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c1

# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = hadoop104
a3.sources.r1.port = 4141

# Describe the sink
a3.sinks.k1.type = logger

# Describe the channel
a3.channels.c1.type = memory
a3.channels.c1.capacity = 1000
a3.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a3.sources.r1.channels = c1
a3.sinks.k1.channel = c1

The avro sinks of flume1 and flume2 use the same hostname and port.
The host that an avro source binds to is its own host name/IP;
the hostname and port of an avro sink must match the host and port that the downstream avro source binds to
(the avro source is the listening side; each upstream sink connects to the downstream source)

(4) Start the agents with their configuration files
Start the downstream flume first, then the upstream flumes

[atguigu@hadoop104 flume]$ bin/flume-ng agent --conf conf/ --name a3 --conf-file job/group3/flume3-flume-logger.conf -Dflume.root.logger=INFO,console

[atguigu@hadoop102 flume]$ bin/flume-ng agent --conf conf/ --name a1 --conf-file job/group3/flume1-exec-flume.conf

[atguigu@hadoop103 flume]$ bin/flume-ng agent --conf conf/ --name a2 --conf-file job/group3/flume2-netcat-flume.conf

5. Customize the Interceptor

(1) It must be used with Multiplexing channel selector
(2) Principle of multiplexing: different events are sent to different channels according to the value of a key in the event Header. Therefore, we need to customize an Interceptor to assign different values to the keys in the headers of different types of events.

requirement analysis
When Flume is used to collect the local logs of the server, different types of logs need to be sent to different analysis systems according to different log types.

In this case we use port data to simulate logs, and lines starting with a digit or a letter to simulate different log types. We need a custom interceptor that distinguishes digits from letters and sends them to different analysis systems (channels).

Implementation steps
(1) In the Maven project, add the flume-ng-core dependency

<dependency>
    <groupId>org.apache.flume</groupId>
    <artifactId>flume-ng-core</artifactId>
    <version>1.9.0</version>
</dependency>

(2) Customize the CustomInterceptor class and implement the Interceptor interface

package com.atguigu.flume.interceptor;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.util.List;
//(0) implement Interceptor interface
public class CustomInterceptor implements Interceptor {

    // Nothing to initialize here
    @Override
    public void initialize() {

    }
    // (1) Per-event intercept: read the body, decide by its first byte, and put a "type" key into the header
    @Override
    public Event intercept(Event event) {

        byte[] body = event.getBody();
        if (body.length > 0 && body[0] >= 'a' && body[0] <= 'z') {
            event.getHeaders().put("type", "letter");
        } else if (body.length > 0 && body[0] >= '0' && body[0] <= '9') {
            event.getHeaders().put("type", "number");
        }
        return event;

    }
    // (2) Batch intercept: iterate over the events and call intercept on each one
    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    // Nothing to clean up
    public void close() {

    }
    // (3) Static inner Builder class: required, Flume uses it to instantiate the interceptor
    public static class Builder implements Interceptor.Builder {
		
        @Override
		//(4) Get interceptor object
        public Interceptor build() {
            return new CustomInterceptor();
        }

        @Override
        // Reads the interceptor configuration from the Flume config file; nothing needed here
        public void configure(Context context) {
        }
    }
}

(3) Package and put it into the lib directory of flume
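
For example, assuming a standard Maven layout (the jar name depends on your project's artifactId and version and is only a placeholder here):

# Build the jar from the maven project directory
$ mvn clean package
# Copy the built jar (placeholder name) into Flume's lib directory
$ cp target/flume-interceptor-1.0-SNAPSHOT.jar /opt/module/flume/lib/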
(4) Edit the Flume configuration file on hadoop102

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Configure the interceptor and the multiplexing channel selector
a1.sources.r1.interceptors = i1
# Fully qualified name of the Builder inner class
a1.sources.r1.interceptors.i1.type = com.atguigu.flume.interceptor.CustomInterceptor$Builder
# Multiplexing selector
a1.sources.r1.selector.type = multiplexing
# The header key to inspect is "type"
a1.sources.r1.selector.header = type
# Events whose value is letter go to channel c1
a1.sources.r1.selector.mapping.letter = c1
# Events whose value is number go to channel c2
a1.sources.r1.selector.mapping.number = c2

# Describe the sink
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop103
a1.sinks.k1.port = 4141

a1.sinks.k2.type=avro
a1.sinks.k2.hostname = hadoop104
a1.sinks.k2.port = 4242

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Use a channel which buffers events in memory
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100


# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2

Configure an avro source and a logger sink for Flume2 on hadoop103.

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop103
a1.sources.r1.port = 4141

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

Configure an avro source and a logger sink for Flume3 on hadoop104.

a1.sources = r1
a1.sinks = k1
a1.channels = c1

a1.sources.r1.type = avro
a1.sources.r1.bind = hadoop104
a1.sources.r1.port = 4242

a1.sinks.k1.type = logger

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.channel = c1
a1.sources.r1.channels = c1

(4) Start the Flume processes on hadoop102, hadoop103 and hadoop104 respectively; mind the order (start the downstream agents on hadoop103 and hadoop104 before the upstream agent on hadoop102).
(5) On hadoop102, use netcat to send letters and digits to localhost:44444.
(6) Observe the logs printed by hadoop103 and hadoop104, as in the example below.
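
For example, sending the following lines from hadoop102 should, with the configuration above, print the letter line on the hadoop103 agent's console and the digit line on the hadoop104 agent's console:

[atguigu@hadoop102 ~]$ nc localhost 44444
hello
123456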
