Collecting a directory into HDFS
**Collection requirement:** new files are continuously generated in a specific directory on the server. Whenever new files appear, they need to be collected into HDFS.
Based on this requirement, first define the following three elements:
- Collection source (source): a directory monitor, i.e. the spooldir source
- Sink target (sink): the HDFS file system, i.e. the hdfs sink
- Transfer channel between source and sink (channel): either a file channel or a memory channel can be used
Configuration file preparation:
```properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
# Note: files with the same name must not be dropped into the monitored directory more than once
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/logs
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# The generated file type defaults to SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
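Assuming the configuration above is saved as a file such as spooldir-hdfs.conf under Flume's conf directory (the file name and paths here are illustrative, not mandated), the agent can be started with a command along these lines:

```shell
# Start agent a1 with the spooldir-to-HDFS configuration.
# The config file name spooldir-hdfs.conf is an assumption; use your own path.
bin/flume-ng agent \
  --conf conf \
  --conf-file conf/spooldir-hdfs.conf \
  --name a1 \
  -Dflume.root.logger=INFO,console
```

Copying a new file into /root/logs should then cause its contents to be delivered to the configured HDFS path.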
Channel parameter explanation:
capacity: the maximum number of events the channel can hold at one time
transactionCapacity: the maximum number of events the channel accepts from a source or delivers to a sink in a single transaction
Collecting a file into HDFS
**Collection requirement:** a business system writes its log with log4j; the log file keeps growing, and the data appended to it needs to be collected into HDFS in real time.
Based on this requirement, first define the following three elements:
- Collection source (source): monitors appends to a file, i.e. exec with `tail -F file`
- Sink target (sink): the HDFS file system, i.e. the hdfs sink
- Transfer channel between source and sink (channel): either a file channel or a memory channel can be used
Configuration file preparation:
```properties
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/logs/test.log
a1.sources.r1.channels = c1

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/tailout/%y-%m-%d/%H%M/
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollInterval = 3
a1.sinks.k1.hdfs.rollSize = 20
a1.sinks.k1.hdfs.rollCount = 5
a1.sinks.k1.hdfs.batchSize = 1
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# The generated file type defaults to SequenceFile; DataStream writes plain text
a1.sinks.k1.hdfs.fileType = DataStream

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```
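To exercise this configuration (assuming the agent is already running with it and /root/logs/test.log is writable), a simple way to simulate a growing log is to keep appending lines to the monitored file:

```shell
# Append a timestamped line to the monitored log every half second,
# giving the exec source (tail -F) new data to ship to HDFS.
while true; do
  echo "log entry $(date '+%Y-%m-%d %H:%M:%S')" >> /root/logs/test.log
  sleep 0.5
done
```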
Parameter explanation:
rollInterval
Default: 30
How long (in seconds) the HDFS sink waits before rolling the current temporary file into the final target file;
If set to 0, files are not rolled based on time;
Note: "roll" means the HDFS sink renames the temporary file to its final target name and opens a new temporary file to write data into;
rollSize
Default: 1024
When the temporary file reaches this size (in bytes), it is rolled into the target file;
If set to 0, files are not rolled based on size;
rollCount
Default: 10
When this many events have been written, the temporary file is rolled into the target file;
If set to 0, files are not rolled based on the number of events;
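The three roll* settings work together: whichever threshold is reached first triggers a roll, so small values (like those in the example configs above) produce many small files on HDFS. A minimal sketch of a size-only rolling policy, where the 128 MB threshold is just an illustrative assumption, could look like this:

```properties
# Roll only by size: disable time-based and count-based rolling,
# and roll when the temporary file reaches about 128 MB (illustrative value).
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
```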
round
Default: false
Whether to round the event timestamp down before substituting it into the directory path; "rounding" here means truncating the time down to a multiple of roundValue.
roundValue
Default: 1
The multiple to which the timestamp is rounded down;
roundUnit
Default: second
The unit used when rounding the timestamp down: second, minute, or hour
Example:
```properties
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
```
When the event time is 17:38:59 on October 16, 2015, hdfs.path is resolved to:
/flume/events/15-10-16/1730/00
Because the timestamp is rounded down to the nearest 10 minutes, a new directory is generated every 10 minutes.
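Once events are flowing, the rounded time-bucket directories produced by the sink can be inspected on HDFS, for example:

```shell
# Recursively list the directories and files created under the configured HDFS path
hdfs dfs -ls -R /flume/events
```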