Notes on a production incident: disk full

Preface

Today our production environment on Alibaba Cloud ECS suddenly started misbehaving: requests failed and the API returned errors. The company has no dedicated operations staff, so I had to troubleshoot it myself.

Troubleshooting

The first visible symptom was an error from the database:

Caused by: org.postgresql.util.PSQLException: ERROR: could not extend file "base/16385/16587_fsm": No space left on device
  Hint: Check free disk space.

Taken literally, the device has no free space left: the disk is full.

Log in to the server and run:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                  7.9G     0  7.9G   0% /dev
tmpfs                 1.6G  3.5M  1.6G   1% /run
/dev/vda1              59G   56G     0 100% /
tmpfs                 7.9G  4.0K  7.9G   1% /dev/shm
tmpfs                 5.0M  4.0K  5.0M   1% /run/lock
tmpfs                 7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vg0-vol0 1000G   14G  937G   2% /data
tmpfs                 1.6G     0  1.6G   0% /run/user/0

The disk is not just nearly full but completely full. The system disk is at 100%, so it is likely that no service can run:

/dev/vda1 59G 56G 0 100% /

The data disk, on the other hand, is almost empty. Only 2% of its 1000G is in use:

/dev/mapper/vg0-vol0 1000G 14G 937G 2% /data

Alibaba Cloud ECS separates storage into a system disk and data disks. The 1000G volume is the data disk.

My first thought was that the data of our self-hosted PostgreSQL database had never been moved to the data disk.
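Reading the df table by eye works for one server, but the same check can be scripted. The following is a minimal sketch, not part of the original troubleshooting: the function name full_mounts and the 90% threshold are assumptions chosen for illustration. It reads `df -P` output and prints every mount point at or above the threshold.

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold (in percent).
# Reads `df -P` style output on stdin. The name full_mounts is hypothetical.
# Assumes mount points contain no spaces.
full_mounts() {
    threshold=$1
    awk -v t="$threshold" 'NR > 1 {
        use = $5              # the Capacity column, e.g. "100%"
        sub(/%/, "", use)     # strip the percent sign
        if (use + 0 >= t) print $6, $5
    }'
}

# Usage: flag anything at 90% or more on this machine.
df -P | full_mounts 90
```

Against the output shown above, this would flag only the system disk mounted on /.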

Moving the Postgres data directory to the data disk

Reference: How to move the PostgreSQL data directory to a new location on Ubuntu 16.04

# View the current data directory, then quit psql
$ sudo -u postgres psql
postgres=# SHOW data_directory;
        data_directory
------------------------------
 /var/lib/postgresql/9.5/main
(1 row)
postgres=# \q

# To protect data integrity, stop PostgreSQL before changing the data directory
$ sudo systemctl stop postgresql

# Confirm that it has stopped
$ sudo systemctl status postgresql
. . .
Jul 22 16:22:44 ubuntu-512mb-nyc1-01 systemd[1]: Stopped PostgreSQL RDBMS.

# Copy the data to /data, the new location
$ sudo rsync -av /var/lib/postgresql /data
$ cd /data
$ ls
...
postgresql

# Delete the original data directory
$ sudo rm -rf /var/lib/postgresql

# Symlink the original path to the new data directory
$ sudo ln -s /data/postgresql /var/lib/postgresql

# Restart PostgreSQL
$ sudo systemctl start postgresql
$ sudo systemctl status postgresql

With these steps, the PostgreSQL data directory now lives on the Alibaba Cloud data disk.

I thought that would fix it:
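The steps above delete the original directory straight after copying. A more cautious variant compares the copy with the original and renames the original instead of removing it. This is only a sketch under stated assumptions: the function name relocate_dir is hypothetical, and it uses `cp -a` and `diff -r` in place of rsync so that it runs without extra packages. The service must still be stopped first, and root privileges are needed for a real data directory.

```shell
#!/bin/sh
# Hypothetical helper: move a directory to another disk and leave a symlink
# at the old path. Intended use: relocate_dir /var/lib/postgresql /data
relocate_dir() {
    src=$1            # directory to move
    dest_parent=$2    # directory on the larger disk
    dest="$dest_parent/$(basename "$src")"

    # Refuse to overwrite an existing destination.
    if [ -e "$dest" ]; then
        echo "destination already exists: $dest" >&2
        return 1
    fi

    # Copy, preserving ownership, permissions and timestamps.
    cp -a "$src" "$dest" || return 1

    # Compare both trees; stop before touching the original on any mismatch.
    if ! diff -r "$src" "$dest" >/dev/null; then
        echo "copy differs from original, aborting" >&2
        return 1
    fi

    # Keep the original as a backup instead of deleting it.
    mv "$src" "$src.bak" || return 1

    # Point the old path at the new location.
    ln -s "$dest" "$src"
}
```

The renamed backup still occupies the system disk, so it should be removed once the database has been started and checked.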

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                  7.9G     0  7.9G   0% /dev
tmpfs                 1.6G  3.5M  1.6G   1% /run
/dev/vda1              59G   56G   51M 100% /
tmpfs                 7.9G  4.0K  7.9G   1% /dev/shm
tmpfs                 5.0M  4.0K  5.0M   1% /run/lock
tmpfs                 7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vg0-vol0 1000G   14G  937G   2% /data
tmpfs                 1.6G     0  1.6G   0% /run/user/0

It barely budged...

Finding large files on Ubuntu

My next guess was that a few very large files were filling the disk.

$ cd /
$ find . -type f -size +800M -print0 | xargs -0 du -h
5.6G ./var/log/syslog.1
6.7G ./var/log/syslog
...
$ rm ...

A huge log file can be deleted without hesitation. An unfamiliar file should not be deleted until you have found out what it is for; delete it only if you are sure it is safe. Don't end up as the person who deleted the database by accident...

After deleting, check again:
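The search above starts at / and therefore also descends into /data and pseudo-filesystems such as /proc. The following sketch restricts the search to a single filesystem and sorts the result by size. It assumes GNU find and GNU sort (for `-size +800M` and `sort -h`); the function name large_files is hypothetical.

```shell
#!/bin/sh
# List files larger than a given size under a directory, largest first,
# without crossing into other mounted filesystems.
# The name large_files and its arguments are hypothetical.
large_files() {
    dir=$1    # where to start, e.g. /
    min=$2    # an argument for find -size, e.g. +800M
    # -xdev keeps find on the filesystem that contains $dir.
    find "$dir" -xdev -type f -size "$min" -exec du -h {} + 2>/dev/null | sort -rh
}

# Usage (run as root to see every file):
# large_files / +800M
```

Staying on one filesystem keeps the results limited to files that actually occupy the full system disk.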

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
udev                  7.9G     0  7.9G   0% /dev
tmpfs                 1.6G  3.4M  1.6G   1% /run
/dev/vda1              59G   45G   12G  80% /
tmpfs                 7.9G  4.0K  7.9G   1% /dev/shm
tmpfs                 5.0M  4.0K  5.0M   1% /run/lock
tmpfs                 7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/mapper/vg0-vol0 1000G   14G  936G   2% /data
tmpfs                 1.6G     0  1.6G   0% /run/user/0

12G is available again.

Finding processes that hold deleted files open

At this point the services should recover. But you may soon find the disk full again, and this time the log files are not large.

Check whether the space held by deleted files has actually been released. If not, kill the process that holds them.

When a file is deleted with rm while another process still has it open, the file disappears from the directory, but its space is not released until that process closes it.

$ sudo lsof -n | grep deleted
java 17866 root 237r REG 253,1 163541 1709285 /tmp/tomcat.8250394289784312179.8080/work/Tomcat/localhost/ROOT/upload_c6db0c17_6e6a_4141_bfb6_ac1b2d8a3b0b_00000000.tmp (deleted)
...
$ sudo kill -9 17866

Running df -h again shows that disk usage has dropped considerably.
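If lsof is not installed, the same information is visible on Linux under /proc, where the target of each file-descriptor link ends in " (deleted)" once the file has been removed. The following is a minimal sketch assuming a Linux /proc filesystem; the function name deleted_open_files is hypothetical.

```shell
#!/bin/sh
# List the deleted files that a process still holds open.
# Linux only: relies on /proc/<pid>/fd. The function name is hypothetical.
deleted_open_files() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # Each entry is a symlink to the open file; skip any we cannot read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *" (deleted)") printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}

# Usage, with the PID from the lsof output above (needs root for other users' processes):
# deleted_open_files 17866
```

Restarting the owning service normally releases the space as well, which is gentler than kill -9 when the process can shut down cleanly.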

Summary

A full system disk is a nasty failure. All services become unavailable and the business system shows all sorts of strange errors. Whoever runs the servers therefore needs to check disk usage regularly. Alibaba Cloud ECS users can enable alerts so that problems are found and fixed in time.
Alibaba Cloud ECS provides a system disk and data disks. For services that tend to consume a lot of disk, such as PostgreSQL, Redis and Cassandra, always put the data directory on an ECS data disk.
/var/log is the system log directory. Keep an eye on it and clean up old logs early.
The best way to free the space used by a log file is to empty it in place while the service keeps running, for example:

[root@localhost ~]# echo "" >/var/log/syslog

This releases the disk space immediately and lets the process keep writing to the same log file. The method is often used to clean up logs produced by Apache, Tomcat, Nginx and other web services without stopping them.

Finally: having professional operations staff really matters!
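The effect can be tried safely on a temporary file. The sketch below is an illustration, not the author's procedure: the function name truncate_in_place is hypothetical, and it uses `: >` rather than `echo ""`, which leaves a truly empty file instead of a single newline. It assumes the writer opened the log in append mode, as most loggers do; a writer that keeps its own file offset would leave a sparse file behind.

```shell
#!/bin/sh
# Empty a log file in place. The name truncate_in_place is hypothetical.
truncate_in_place() {
    : > "$1"    # truncates to 0 bytes; the inode and open handles are kept
}

# Demonstration on a temporary file standing in for a real log.
log=$(mktemp)
exec 8>>"$log"            # a "service" holds the log open in append mode
echo "old entry" >&8
truncate_in_place "$log"  # the space is released here
echo "new entry" >&8      # the writer carries on with the same descriptor
exec 8>&-
cat "$log"
rm -f "$log"
```

After the run, the file contains only the line written after truncation, which shows that the writer did not need to be restarted.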


10 February 2020, 07:52
