Spark Core Knowledge Points Review-1

Day1111: Spark task scheduling; several key components of Spark; Spark Core; concepts and characteristics of RDD; two types of RDD generation; two types of RDD operators; operator practice; partition; RDD dependency; DAG (Directed Acyclic Graph); task submission; cache; checkPoint; custom sorting; Custom Partiti ...

Posted on Mon, 11 Nov 2019 22:46:13 -0500 by 2oMst

VI. Installation and Configuration of Hive (the sixth operation)

MySQL installation. Download the MySQL server release package from the official website (yum installation): wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm (if wget is not available, install it first: yum -y install wget). Install the release package: rpm -ivh mysql-community-release-el7-5.noarch.rpm Then install: yum install m ...

Posted on Mon, 04 Nov 2019 13:46:40 -0500 by Bike Racer

Hive Built-in Functions

Hive built-in functions. My first blog post, hahaha! This is a learning note, so if there are mistakes in the explanation or better methods, I hope you can give me some advice in the comment area! Query Hive built-in functions. Enter the Hive client: [user1@master ~]$ hive Query Hive built-in functions: hive ...

Posted on Tue, 22 Oct 2019 15:16:03 -0400 by BrandonKahre

Optimize ORC and Parquet files to improve Big SQL read performance

This article is compiled from the IBM developer community. It describes the problem of small ORC and Parquet files in HDFS and how these small files affect the read performance of Big SQL. It also explores possible solutions that use existing tools to compact small files into larger ones in order to improve read performance. brief intr ...

Posted on Tue, 08 Oct 2019 19:41:06 -0400 by kincet7

83 Site Click Stream Data Analysis Case (Module Development - ETL)

The data analysis for this project runs on a Hadoop cluster, mainly using the Hive data warehouse. The collected and preprocessed data therefore needs to be loaded into the Hive data warehouse for subsequent mining and analysis. 1. Create raw data tables Establishment of Post Sour ...

Posted on Sun, 06 Oct 2019 01:40:02 -0400 by inni

LZO removal scheme (file format conversion)

Execution and testing process: 1. Create LZO-related tables (for validation only; can be skipped): create external table test_lzo( id int ) partitioned by (`date_par` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzo ...

Posted on Sat, 05 Oct 2019 04:13:14 -0400 by nestorvaldez

Flume - Common Errors

Flume - Common Errors. 1. Closing file failed. Will retry again in 120 seconds. 1.1 Symptom and solution. The detailed error message is as follows: 09 Aug 2019 17:00:31,787 WARN [SinkRunner-PollingRunner-DefaultSinkProcessor] (org.apache.flume.sink.hdfs.BucketWriter$Close ...

Posted on Thu, 03 Oct 2019 07:32:02 -0400 by LazyJones

Use Hive to analyze website access logs: daily active users, daily new users, and script writing

demo. Article directory: demo; requirements; data; create table; load data; daily active users; daily new users; shell script. Requirements: create a partitioned table to store each day's new data. Count daily active users (daily activity) (need user's ip, user's account, user access time of the earliest url ...

Posted on Tue, 01 Oct 2019 11:25:31 -0400 by dino345
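The daily-active-user requirement in the entry above (one row per user per day, keeping the earliest access) can be sketched in plain Python. This is only an illustration under assumptions: the post itself uses Hive, and the record layout `(ip, account, access_time, url)` and the function name `daily_active_users` are hypothetical, not taken from the post.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (ip, account, access_time, url).
# access_time is an ISO-8601 string, so plain string comparison orders it.
Record = Tuple[str, str, str, str]


def daily_active_users(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group records by day, keeping each (ip, account) pair's earliest access."""
    earliest: Dict[Tuple[str, str, str], Record] = {}
    for rec in records:
        ip, account, access_time, _url = rec
        day = access_time[:10]  # "YYYY-MM-DD" prefix of the timestamp
        key = (day, ip, account)
        if key not in earliest or access_time < earliest[key][2]:
            earliest[key] = rec

    # Emit one list per day, ordered by (ip, account) for stable output.
    result: Dict[str, List[Record]] = {}
    for (day, _ip, _account), rec in sorted(earliest.items()):
        result.setdefault(day, []).append(rec)
    return result


logs = [
    ("10.0.0.1", "alice", "2019-10-01T09:30:00", "/cart"),
    ("10.0.0.1", "alice", "2019-10-01T08:00:00", "/home"),
    ("10.0.0.2", "bob", "2019-10-01T10:00:00", "/home"),
    ("10.0.0.1", "alice", "2019-10-02T07:00:00", "/news"),
]
dau = daily_active_users(logs)
```

The number of daily active users for a day is then `len(dau[day])`. In Hive the same idea would typically be a grouped or windowed query over a day partition.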

Flink writing data to HDFS, partitioned by event time

0x1 Summary. For convenient querying and analysis, almost all tables in a Hive offline warehouse are partitioned, most commonly by day. Flink writes data to HDFS with the following configuration: BucketingSink<Object> sink = new BucketingSink<>(path); // This allows the data to be partitioned across days. sink.setBucketer(n ...

Posted on Sun, 29 Sep 2019 09:31:41 -0400 by silversinner
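The idea behind event-time partitioning in the Flink entry above is that the output directory is derived from the timestamp carried by each record rather than from the time it happens to be processed. A minimal Python sketch of that mapping follows; it is not the post's Flink code. The function name `bucket_path`, the `dt=YYYY-MM-DD` directory layout, and the choice of UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def bucket_path(base_path: str, event_time_ms: int) -> str:
    """Map an event timestamp (epoch milliseconds) to a day-partition directory.

    Assumptions: days are cut in UTC and directories are named dt=YYYY-MM-DD.
    """
    # Integer division avoids float rounding near day boundaries.
    event_time = datetime.fromtimestamp(event_time_ms // 1000, tz=timezone.utc)
    day = event_time.strftime("%Y-%m-%d")
    return f"{base_path.rstrip('/')}/dt={day}"
```

A record that arrives late still lands in the directory for the day it occurred, which is what keeps a day-partitioned Hive table consistent with the events it describes.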

8. Hive partitioning and bucketing in detail

First, create a database: create database if not exists myhive1; Switch to this database: use myhive1; Drop the table student if it exists: drop table if exists student; Create the table student with a partition: create table student(id int, name string, sex string, age int, department string) row format delimited fi ...

Posted on Mon, 23 Sep 2019 21:54:42 -0400 by flhtc
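To make the difference between the two concepts in the last entry concrete: a Hive partition is a subdirectory named `column=value` under the table directory, while a bucket is a file chosen by hashing a column value modulo the number of buckets. The Python sketch below illustrates both under stated assumptions. The partition column `city` and the warehouse path are hypothetical, since the excerpt is cut off before the partition clause. The bucket rule mirrors Hive's legacy behaviour for int columns as I understand it (the hash of an int is the value itself), and newer Hive versions may use a different hash.

```python
def partition_dir(table_path: str, column: str, value: str) -> str:
    """Directory that holds one partition of a table (column=value layout)."""
    return f"{table_path.rstrip('/')}/{column}={value}"


def bucket_for_int(value: int, num_buckets: int) -> int:
    """Bucket index for a 32-bit int key: clear the sign bit, then take the modulo.

    Assumption: legacy Hive bucketing, where an int hashes to itself.
    """
    return (value & 0x7FFFFFFF) % num_buckets


# Hypothetical example: table `student` partitioned by an assumed column `city`.
student_dir = partition_dir("/user/hive/warehouse/myhive1.db/student", "city", "beijing")
```

Partitioning prunes whole directories when a query filters on the partition column; bucketing spreads rows within a partition across a fixed number of files, which helps with sampling and joins on the bucketed column.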