Technology sharing and quick handling of MySQL duplicate data

Author: Yang Taotao

It's just that I'm helping customers migrate from the database of daydream to MySQL recently. Let me also briefly talk about the processing of duplicate data.

There are three types of data stored in the database:

  1. One is data filtered by strict meaning. For example, the program side filters the data source, the database side sets the check mark on the table field to filter the data source, sets the trigger filter, calls the stored procedure filter and so on;
  2. The other is raw data without any processing. For example, the abnormal code on the program side causes abnormal data, the database side does not set any filtering rules for data retention, and so on. In this way, a series of garbage data will be generated, including the repeated data I want to talk about today;
  3. The last is the repeated data that SQL statement may produce during execution, such as two tables' external connection, which always produces a series of NULL. The duplicate data I want to talk about today does not include the duplicate data generated in the execution of SQL statements, but only includes the processing of the original duplicate data. Next, in a few classic scenarios,
First, the recording is completely repeated, which is actually the simplest de duplication scene.

For example, table d1 without primary key

mysql-(ytt/3305)->show create table d1\G
*************************** 1. row ***************************
       Table: d1
Create Table: CREATE TABLE `d1` (
  `r1` int(11) DEFAULT NULL,
  `r2` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

The total number of records is four million.

mysql-(ytt/3305)->select count(*) from d1 limit 2;

+----------+
| count(*) |
+----------+
|  4000000 |
+----------+
1 row in set (0.18 sec)

It can be seen that three quarters of the records are repeated.

mysql-(ytt/3305)->select count(distinct r1,r2) from d1 ;
+-----------------------+
| count(distinct r1,r2) |
+-----------------------+
|               1000000 |
+-----------------------+
1 row in set (2.68 sec)

For example, there are four records (1, 1).

mysql-(ytt/3305)-> select * from db1 order by r1,r2 limit 5;
+------+------+
| r1   | r2   |
+------+------+
|    1 |    1 |
|    1 |    1 |
|    1 |    1 |
|    1 |    1 |
|    2 |    2 |
+------+------+
5 rows in set (1.65 sec)

This kind of de duplication is very simple, either in the database layer, or export the data and filter it to the database.

To do this in the database is just to create a new clone table, filter out the normal data, rename it, and delete the original table. The steps are not very complicated, as follows:

mysql-(ytt/3305)->create table d2 like d1;
Query OK, 0 rows affected (0.01 sec)

Time is mainly spent in de duplication and inserting new tables here

mysql-(ytt/3305)->insert into d2 select distinct r1,r2 from d1;
Query OK, 1000000 rows affected (19.40 sec)
Records: 1000000  Duplicates: 0  Warnings: 0

mysql-(ytt/3305)->alter table d1 rename to d1_bak;
Query OK, 0 rows affected (0.00 sec)

mysql-(ytt/3305)->alter table d2 rename to d1;
Query OK, 0 rows affected (0.00 sec)

mysql-(ytt/3305)->drop table d1_bak;
Query OK, 0 rows affected (0.00 sec)

It took about 20 seconds. Let's look at the system level. Export the data first,

mysql-(ytt/3305)->select * from db1 into outfile '/var/lib/mysql-files/d1.txt';
Query OK, 4000000 rows affected (1.84 sec)

System level de duplication, using the OS tools sort and uniq.

root@ytt-pc:/var/lib/mysql-files# time cat d1.txt |sort -g  |uniq > d1_uniq.txt

real	0m7.345s
user	0m7.528s
sys	0m0.272s

Import to the original table,

mysql-(ytt/3305)->truncate table d1;
Query OK, 0 rows affected (0.05 sec)

root@ytt-pc:/var/lib/mysql-files# mv d1_uniq.txt  d1.txt

Import the processed data directly into the database

root@ytt-pc:/home/ytt/scripts# time mysqlimport -uytt -pytt -P3305 -h 127.0.0.1   --use-threads=2 -vvv ytt /var/lib/mysql-files/d1.txt 
mysqlimport: [Warning] Using a password on the command line interface can be insecure.
Connecting to 127.0.0.1
Selecting database ytt
Loading data from SERVER file: /var/lib/mysql-files/d1.txt into d1
ytt.d1: Records: 1000000  Deleted: 0  Skipped: 0  Warnings: 0
Disconnecting from 127.0.0.1

real	0m3.272s
user	0m0.012s
sys	0m0.008s

Look at the processed records,

mysql-(ytt/3305)->select * from d1 where 1 order by r1,r2 limit 2;
+------+------+
| r1   | r2   |
+------+------+
|    1 |    1 |
|    2 |    2 |
+------+------+
2 rows in set (0.40 sec)

The OS layer is slightly more efficient, including data export, data De duplication and data import, which is almost half of the database layer time.

The second is similar to the first. The difference is that the table has a primary key, but other field record values are duplicate.

For example, in addition to the primary key in table d4, other records are exactly the same as before. The records are as follows:

mysql-(ytt/3305)->select * from d4 order by r1,r2 limit 5;
+---------+------+------+
| id      | r1   | r2   |
+---------+------+------+
|       1 |    1 |    1 |
| 3000001 |    1 |    1 |
| 2000001 |    1 |    1 |
| 1000001 |    1 |    1 |
|       2 |    2 |    2 |
+---------+------+------+
5 rows in set (1.08 sec)

But this kind of business needs to be discussed with specific businesses. For example, I need to leave the maximum primary key value of duplicate records, such as the above one, with the maximum id of 3000001. In this way, you can duplicate one sql,

mysql-(ytt/3305)->delete a from d4 a left join (select max(id) id from d4 group by r1, r2) b using(id) where b.id is null;

Query OK, 3000000 rows affected (23.29 sec)

300W rows of duplicate records are removed, leaving a quarter of normal data.

mysql-(ytt/3305)->select count(*) from d4;
+----------+
| count(*) |
+----------+
|  1000000 |
+----------+
1 row in set (0.06 sec)

Let's see the effect. The maximum value is reserved and others are deleted.

mysql-(ytt/3305)->select * from d4 order by r1,r2 limit 5;
+---------+------+------+
| id      | r1   | r2   |
+---------+------+------+
| 3000001 |    1 |    1 |
| 3000002 |    2 |    2 |
| 3000003 |    3 |    3 |
| 3000004 |    4 |    4 |
| 3000005 |    5 |    5 |
+---------+------+------+
5 rows in set (0.25 sec)
The third is different from the previous two. This is reflected in redundant characters in field values, such as spaces, redundant line breaks, etc. Here are a few examples:

1. It is the simplest in this category to remove the blank characters before and after the field value. This kind of MySQL has ready-made functions, just a basic SQL.  

surface y11 There are 500 W Row sample data
mysql-(ytt/3305)->select count(*) from y11;
+----------+
| count(*) |
+----------+
|  5242880 |
+----------+
1 row in set (0.30 sec)

Using trim function.

mysql-(ytt/3305)->update y11 set r1 = trim(r1), r2 = trim(r2);
Query OK, 5242880 rows affected (2 min 1.56 sec)
Rows matched: 5242880  Changed: 5242880  Warnings: 0

mysql-(ytt/3305)->select * from y11 limit 5;
+----+------------------------+------------------------+
| id | r1                     | r2                     |
+----+------------------------+------------------------+
|  1 | sql server             | sql server             |
|  2 | sql server             | sql server             |
|  3 | sql server             | sql server             |
|  6 | db2 mysql oracle mysql | db2 mysql oracle mysql |
|  7 | db2 mysql oracle mysql | db2 mysql oracle mysql |
+----+------------------------+------------------------+
5 rows in set (0.00 sec)

2. Remove all kinds of blank characters in the middle of the word (spaces, line breaks, tabs, etc.); before and after the word, there are spaces in the middle.

It is still table y11. From the result, all kinds of line breaks and spaces have made the result unable to display normally.

mysql-(ytt/3305)->select * from y11 limit 5;
+----+-----------------------------------------------------+------------------------------------------------------+
| id | r1                                                  | r2                                                   |
+----+-----------------------------------------------------+------------------------------------------------------+
|  1 | sql server                                          | sql server                                           |
|  2 | sql       server                                    | sql            server                                |
server                                         | sql 	 server                                         |
 |           mysql | db2         mysql         oracle
 | 7 | db2         mysql         oracle	             mysql | db2         mysql         oracle	             mysql
+----+-----------------------------------------------------+------------------------------------------------------+
5 rows in set (0.00 sec)

Perhaps the first way to think about it is to export the data to a text file, and then import it after processing with various tools on linux, such as:

mysql-(ytt/3305)->select * from y11 into outfile '/var/lib/mysql-files/y11.txt'  fields terminated by ',' enclosed by '"';
Query OK, 5242880 rows affected (3.54 sec)

mysql-(ytt/3305)->truncate y11;
Query OK, 0 rows affected (0.23 sec)

Replace all blank characters with sed.

root@ytt-pc:/var/lib/mysql-files# time sed -i 's/\s\+/ /g' y11.txt

real	0m27.476s
user	0m20.105s
sys	0m7.233s

Import to table y11

mysql-(ytt/3305)->load data infile '/var/lib/mysql-files/y11.txt' into table y11 fields terminated by ',' enclosed by '"';
Query OK, 5242880 rows affected (30.25 sec)
Records: 5242880  Deleted: 0  Skipped: 0  Warnings: 0

Although the above goal has been achieved, the process is too cumbersome. If the MySQL layer really can't solve it, it will be considered again.

You can use MySQL's regular replacement function to directly replace redundant characters into a space, which is also a simple SQL.

mysql-(ytt/3305)->update y11 set r1 = regexp_replace(r1,'[[:space:]]+',' '), r2 = regexp_replace(r2,'[[:space:]]+',' ');
Query OK, 4194304 rows affected (1 min 32.05 sec)
Rows matched: 5242880  Changed: 4194304  Warnings: 0

It's just a little longer, but it doesn't have a big impact.

mysql-(ytt/3305)->select * from y11 limit 5;
+----+------------------------+-------------------------+
| id | r1                     | r2                      |
+----+------------------------+-------------------------+
|  1 | sql server             | sql server              |
|  2 | sql server             | sql server              |
|  3 | sql server             | sql server              |
|  6 | db2 mysql oracle mysql | db2 mysql oracle mysql  |
|  7 | db2 mysql oracle mysql | db2 mysql oracle mysql  |
+----+------------------------+-------------------------+
5 rows in set (0.00 sec)

I think it is inevitable that there will be data De duplication scenarios in daily data processing. I hope this part of the content will be helpful to you.

Tags: Database MySQL SQL Oracle

Posted on Mon, 10 Feb 2020 02:41:41 -0500 by dlkinsey85