Business data collection: a zero-drift (midnight drift) processing method (Flume + Kafka + HDFS)

I am not a university graduate, but I found a big data job after graduation. Although I majored in big data, I am basically self-taught, you know~

If anything here is wrong, corrections and criticism are welcome.

Recently I built a small hands-on project, a user behavior collection platform. The overall architecture is:

Flume-taildir source + kafka channel =>

Kafka =>

Flume Kafka source + memory channel + HDFS sink

Requirement: collect user behavior data into HDFS and split it into files by day.

Analysis: transmitting data takes time, so a log generated at 23:59:58 may not reach HDFS until the next day, and it would then be filed under the wrong date. If you are not careful, the data drifts across midnight.

If zero drift is not a concern, a Kafka channel alone can replace the combination of Kafka source + memory channel.
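To make the drift concrete, here is a minimal sketch in plain Java (no Flume involved). The helper `dayDir`, the UTC zone, and the 5-second delay are all assumptions for illustration. It shows that the same log is filed under different days depending on whether you bucket by event time or by arrival time.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

class DriftDemo {
    // Hypothetical helper: the day (yyyy-MM-dd) that an epoch-millis timestamp falls on.
    static String dayDir(long epochMillis, ZoneId zone) {
        return Instant.ofEpochMilli(epochMillis).atZone(zone).toLocalDate().toString();
    }

    public static void main(String[] args) {
        ZoneId zone = ZoneOffset.UTC; // assumed zone, use your cluster's zone in practice
        long eventTime = ZonedDateTime.of(2021, 9, 30, 23, 59, 58, 0, zone)
                .toInstant().toEpochMilli();
        long arrivalTime = eventTime + 5_000; // assumed 5 s transmission delay

        System.out.println(dayDir(eventTime, zone));   // prints 2021-09-30
        System.out.println(dayDir(arrivalTime, zone)); // prints 2021-10-01
    }
}
```

Bucketing by arrival time puts the 23:59:58 log into the next day's file, which is exactly the drift we want to avoid.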

However, using the Kafka channel this way has a drawback: you cannot attach an interceptor!

In that setup the Kafka channel reads from Kafka itself and needs no source, but interceptors are bound to a source. Without a source, there is nowhere to configure an interceptor.

So here I use an interceptor to add a timestamp to the header of each event.

Is there any basis for this? Is it reasonable?

Time for the secret weapon again: check flume.apache.org directly!

There is a note in the HDFS sink section of Flume's official documentation: for all time-related escape sequences, a header with the key "timestamp" must exist among the event headers (unless hdfs.useLocalTimeStamp is set to true). So you do need to add a timestamp to the header!!!

OK, with official backing I have nothing to fear. Open IDEA and get started!!!

Let's start with the key points of writing a Flume interceptor.

First implement the Interceptor interface and override its methods. Don't forget to write a static nested class Builder!!
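Here is a rough sketch of why that header matters. This is not Flume's actual code, only a simplified stand-in: the class `PathSketch`, the method `resolve`, and the path `/origin_data/log/%Y-%m-%d` are made up for illustration, and only the `%Y`, `%m` and `%d` escapes are handled. It assumes the "timestamp" header holds epoch milliseconds.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.util.Map;

class PathSketch {
    // Simplified stand-in for time-escape resolution: fills %Y, %m and %d
    // from the "timestamp" header (epoch milliseconds, stored as a string).
    static String resolve(String pattern, Map<String, String> headers, ZoneId zone) {
        String ts = headers.get("timestamp");
        if (ts == null) {
            // Without the header there is no event time to build the path from.
            throw new IllegalStateException("missing 'timestamp' header");
        }
        ZonedDateTime t = Instant.ofEpochMilli(Long.parseLong(ts)).atZone(zone);
        return pattern
                .replace("%Y", String.format("%04d", t.getYear()))
                .replace("%m", String.format("%02d", t.getMonthValue()))
                .replace("%d", String.format("%02d", t.getDayOfMonth()));
    }
}
```

For example, with a header value of "1633046398000" (2021-09-30 23:59:58 UTC) and a UTC zone, the hypothetical pattern above resolves to /origin_data/log/2021-09-30 no matter when the event actually arrives.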

    @Override
    public Event intercept(Event event) {
        Map<String, String> headers = event.getHeaders();
        String log = new String(event.getBody(), StandardCharsets.UTF_8);
        JSONObject jsonObject = JSONObject.parseObject(log);
        String ts = jsonObject.getString("ts");
        headers.put("timestamp", ts);
        return event;
    }

The important one is the intercept method that handles a single Event. After all, the other method, which takes a List<Event>, just calls this one in a loop.

First get the event headers, then parse the JSON log line in the event body, and put the value of its "ts" (timestamp) field into the headers under the key "timestamp"!

Now for the complete interceptor code~~
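One thing the step above glosses over is what happens when a log line has no usable "ts". Below is a dependency-free sketch of that check. It is an assumption-laden simplification: it uses a regex instead of fastjson, a plain Map instead of a Flume Event, and the names `TsHeaderSketch` and `addTimestamp` are hypothetical. It only sets the header when a numeric "ts" is found.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TsHeaderSketch {
    // Simplification: matches "ts": 1633046398000 or "ts": "1633046398000".
    // A real interceptor would use a JSON parser, as in the article.
    private static final Pattern TS = Pattern.compile("\"ts\"\\s*:\\s*\"?(\\d+)\"?");

    // Copies the numeric "ts" value into the "timestamp" header.
    // Returns false (and leaves the headers untouched) if no numeric "ts" is found.
    static boolean addTimestamp(Map<String, String> headers, String body) {
        Matcher m = TS.matcher(body);
        if (!m.find()) {
            return false;
        }
        headers.put("timestamp", m.group(1));
        return true;
    }
}
```

In a real interceptor the caller could use the false result to decide how to treat the bad line, for example by dropping it, rather than letting a missing or malformed timestamp reach the HDFS sink.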

    import com.alibaba.fastjson.JSONObject;
    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.interceptor.Interceptor;

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.Map;

    public class TimeStampInterceptor implements Interceptor {

        @Override
        public void initialize() {
        }

        // Copy the "ts" field of the JSON log body into the "timestamp" header.
        @Override
        public Event intercept(Event event) {
            Map<String, String> headers = event.getHeaders();
            String log = new String(event.getBody(), StandardCharsets.UTF_8);
            JSONObject jsonObject = JSONObject.parseObject(log);
            String ts = jsonObject.getString("ts");
            headers.put("timestamp", ts);
            return event;
        }

        // The batch version simply applies the single-event method to each event.
        @Override
        public List<Event> intercept(List<Event> events) {
            for (Event event : events) {
                intercept(event);
            }
            return events;
        }

        @Override
        public void close() {
        }

        // Flume creates the interceptor through this Builder.
        public static class Builder implements Interceptor.Builder {

            @Override
            public Interceptor build() {
                return new TimeStampInterceptor();
            }

            @Override
            public void configure(Context context) {
            }
        }
    }

OK! That solves the zero drift problem!

1 October 2021, 20:24 | Views: 9819