Find Common Friends - Data Mining - Scala Edition

Hello! There are implementations of the "find common friends" algorithm in many languages on the Internet. I had some time today, so I worked out a Scala version myself. For the complete code, see the Git address: https://github.com/benben7466/SparkDemo/blob/master/spark-test/src/main/scala/testCommendFriend.s ...

Posted on Sat, 04 Jul 2020 10:58:46 -0400 by EGNJohn
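
For readers who want the gist of the "find common friends" entry above before opening the post, here is a minimal sketch of one common formulation of the algorithm. It is written in plain Python rather than the Scala/Spark code the post links to, and it is not taken from that repository. The function name `common_friends` and the sample data are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def common_friends(friends):
    """Map each unordered pair of people to the set of friends they share."""
    # Step 1 (invert): for every friend, collect the people who list that friend.
    listed_by = defaultdict(set)
    for person, person_friends in friends.items():
        for friend in person_friends:
            listed_by[friend].add(person)

    # Step 2 (pair up): any two people who list the same friend share that friend.
    shared = defaultdict(set)
    for friend, people in listed_by.items():
        for a, b in combinations(sorted(people), 2):
            shared[(a, b)].add(friend)
    return dict(shared)


# Hypothetical sample data: person -> list of that person's friends.
sample = {
    "A": ["B", "C", "D"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["A", "C"],
}
result = common_friends(sample)
```

The two steps correspond roughly to a map phase (invert the friend lists) and a reduce phase (group by pair), which is why this problem is a popular exercise for MapReduce and Spark.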

Building an analytical data lake with Apache Spark and Apache Hudi

Welcome to follow the WeChat official account: ApacheHudi. 1. Introduction: Most modern data lakes are built on some kind of distributed file system (DFS), such as HDFS, or on cloud-based storage, such as AWS S3. One of the basic principles to follow is the "write once, read many" access model for files. This ...

Posted on Sun, 14 Jun 2020 22:39:40 -0400 by tozanni

Spark: Correct use of checkpoint in Spark and its difference from cache

1. Spark performance tuning: use of checkpoint https://blog.csdn.net/leen0304/article/details/78718346 Summary: Creating a checkpoint is similar to taking a snapshot. For example, in a Spark computation, the DAG of the computing process is very long, and the server needs to complete the wh ...

Posted on Sun, 14 Jun 2020 00:48:31 -0400 by brokeDUstudent

Machine learning - overview, feature extraction of data (notes)

1. The relationship among artificial intelligence, machine learning, and deep learning. What machine learning can do. Recommended books for learning. Learning objectives. 2. What is machine learning? Machine learning automatically analyzes data to obtain patterns (models), and uses those patterns to predict u ...

Posted on Fri, 12 Jun 2020 00:23:35 -0400 by Goose

Spark Streaming reads the database data extracted from Flume by Kafka and saves it in HBase; Hive maps HBase for queries

Recently, the company has been working on real-time stream processing. The specific requirements are: import the relevant tables from relational databases (MySQL, Oracle) into HBase in real time, and query the data through Hive tables mapped onto HBase. The company uses a big data cluster built with CDH 6.3.1. 1. Configure ...

Posted on Wed, 10 Jun 2020 00:55:16 -0400 by jcleary

Flexible use of Spark window functions lead and lag for online time length statistics

Brief introduction: In data statistics, it is often necessary to measure durations, such as online time. Some of these are easy to compute, and some are a bit more troublesome. One example is computing each user's online time from login and logout records. We can use the window functions lead and lag to do this, which ...

Posted on Tue, 09 Jun 2020 23:56:54 -0400 by dpiland
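
To make the lead/lag idea in the entry above concrete, here is a minimal sketch in plain Python, not the post's Spark code. It imitates what a `lead` window function does: within each user's events ordered by time, every row is paired with the row that follows it. The function name `online_seconds`, the tuple layout, and the sample log are assumptions for illustration only.

```python
from collections import defaultdict


def online_seconds(events):
    """Sum login-to-logout durations per user.

    events: iterable of (user, timestamp_in_seconds, action) tuples, where
    action is "login" or "logout" (a hypothetical log layout).
    """
    # Partition by user, the way a window's partition clause would.
    by_user = defaultdict(list)
    for user, ts, action in events:
        by_user[user].append((ts, action))

    totals = {}
    for user, rows in by_user.items():
        rows.sort()  # order each partition by timestamp
        total = 0
        # zip(rows, rows[1:]) pairs each row with its successor: the "lead" row.
        for (ts, action), (next_ts, next_action) in zip(rows, rows[1:]):
            if action == "login" and next_action == "logout":
                total += next_ts - ts
        totals[user] = total
    return totals


# Hypothetical, unordered log records.
log = [
    ("u1", 100, "login"), ("u1", 160, "logout"),
    ("u2", 120, "login"),
    ("u1", 300, "login"), ("u1", 330, "logout"),
    ("u2", 200, "logout"),
]
totals = online_seconds(log)
```

A login with no following logout contributes nothing in this sketch; how to treat such unmatched records (for example, sessions still open at the end of the log) is a design choice the real job would need to make.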

Submit Spark tasks remotely to a YARN cluster

Reference article: How to submit Spark tasks to a YARN cluster remotely in IDEA. Several modes of running Spark tasks: 1. Local mode: write the code in IDEA and run it directly. 2. Standalone mode: package the program as a jar, upload it to the cluster, and submit it with spark-submit. 3. YARN mode (local, client, cluster): as above, also requires jar packa ...

Posted on Thu, 21 May 2020 20:09:40 -0400 by twilightnights

Spark notes: common RDD operators

Hello everyone! Here are the Spark operator notes I took during the epidemic holiday. I spent the whole afternoon sorting them out to share with you! Writing them up was not easy, so if they help you, remember to like the post! Article catalog: 1. Spark action operators 2. Spark single-value types 3. Spark double-value types 4. spa ...

Posted on Mon, 18 May 2020 04:02:17 -0400 by Mattyspatty

Machine Learning Model Training Scheme in Mass Data Scenarios

When solving real-world machine learning engineering problems, it is often very difficult to train a model on a single machine. Such scenarios include online recommendation, CTR estimation, lookalike marketing, and so on. When there are hundreds of millions of records and tens of thousands of feature dimensions ...

Posted on Mon, 11 May 2020 23:50:09 -0400 by les48

Analysis of Hadoop YARN ResourceManager crash caused by data limit of ZooKeeper node

We have run into this problem again. It happens infrequently, but when it does, it causes the ResourceManager service to crash, too many ZK watch registrations, and other problems. Not having solved it completely has always been a nagging concern, so based on the two previous analyses and a reading of the latest Hadoop 3.2.1 c ...

Posted on Sun, 10 May 2020 10:38:53 -0400 by FireWhizzle