Flume simple case

Collecting a directory to HDFS

**Collection requirements:** new files are generated in a specific directory on the server; whenever a new file appears, it needs to be collected into HDFS.
According to the requirements, first define the following three elements:
Collection source, i.e. source - a monitor on the file directory: spooldir
Sink target, i.e. sink - the HDFS file system: hdfs sink
Transfer channel between source and sink, i.e. channel - either a file channel or a memory channel can be used
Configuration file preparation:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
## Note: a file with the same name must not be dropped into the monitored directory more than once
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs
a1.sources.r1.fileHeader = true
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# The generated file type is SequenceFile by default; if set to DataStream, the output is plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
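
To run this agent, the configuration can be saved to a file (the name conf/spooldir-hdfs.conf below is just an assumption) and started with the standard flume-ng command; the agent name passed with --name must match the prefix used in the file (a1):

# conf/spooldir-hdfs.conf is an assumed file name/location
bin/flume-ng agent --conf conf --conf-file conf/spooldir-hdfs.conf --name a1 -Dflume.root.logger=INFO,console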

Channel parameter explanation:
capacity: the maximum number of events that can be stored in the channel
transactionCapacity: the maximum number of events the channel takes from a source or gives to a sink in a single transaction
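
As noted above, a file channel can be used instead of the memory channel. A minimal sketch of that alternative (the checkpoint and data directories are assumptions) would replace the channel section with:

# checkpointDir and dataDirs below are example paths (assumed)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/flume/checkpoint
a1.channels.c1.dataDirs = /root/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000

The file channel persists buffered events to disk, so they survive an agent restart, at the cost of lower throughput than the memory channel.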

Collecting a file to HDFS

Collection requirements: for example, the business system generates logs with log4j; the log content keeps growing, and the data appended to the log file needs to be collected to HDFS in real time.
According to the requirements, first define the following three elements:
Collection source, i.e. source - a monitor on the file's content updates: exec "tail -F file"
Sink target, i.e. sink - the HDFS file system: hdfs sink
Transfer channel between source and sink, i.e. channel - either a file channel or a memory channel can be used
Configuration file preparation:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/test.log
a1.sources.r1.channels = c1
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# The generated file type is SequenceFile by default; if set to DataStream, the output is plain text
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
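
With the agent running (it can be started the same way as in the previous case), appending lines to the monitored file is enough to verify real-time collection; a simple shell loop such as the one below keeps the file growing:

# /root/logs/test.log matches the path in the exec source above; message text and interval are illustrative
while true; do echo "access log line $(date)" >> /root/logs/test.log; sleep 0.5; done

Each appended line should then appear in HDFS under /flume/tailout/ shortly afterwards.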

Parameter resolution:
- rollInterval
Default: 30
How often (in seconds) the hdfs sink rolls the temporary file into the final target file;
If set to 0, the file is not rolled based on time;
Note: "rolling" means the hdfs sink renames the current temporary file to the final target file and opens a new temporary file to write data;
- rollSize
Default: 1024
When the temporary file reaches this size (in bytes), it is rolled into the target file;
If set to 0, the file is not rolled based on its size;
- rollCount
Default: 10
When the number of events reaches this value, the temporary file is rolled into the target file;
If set to 0, the file is not rolled based on the number of events;
- round
Default: false
Whether to round down the timestamp used in the time-based escape sequences of hdfs.path;
- roundValue
Default: 1
The timestamp is rounded down to the highest multiple of this value (in the unit given by roundUnit);
- roundUnit
Default: second
The unit used for rounding down the timestamp: second, minute, or hour
Example:
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
When the timestamp is 17:38:59 on October 16, 2015, hdfs.path is still resolved as:
/flume/events/15-10-16/1730/00
Because the timestamp is rounded down to the nearest 10 minutes, a new directory is generated every 10 minutes.
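
Building on the roll parameters explained above, a common sketch (the values below are assumptions, not from the original config) is to disable time- and count-based rolling and roll only when the file approaches an HDFS block size:

# example values: roll only by size (~128 MB); 0 disables the other two triggers
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 134217728

Setting rollInterval and rollCount to 0 turns those triggers off, so only rollSize decides when the temporary file is renamed to the final target file.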
