Reference material:
Ceph's RADOS Design Principle and Implementation
https://docs.ceph.com/en/latest/rados/operations/crush-map/
http://strugglesquirrel.com/2019/02/02/ceph%E8%BF%90%E7%BB%B4%E5%A4%A7%E5%AE%9D%E5%89%91%E4%B9%8B%E9%9B%86%E7%BE%A4osd%E4%B8%BAfull/
https://docs.ceph.com/en/latest/rados/operations/balancer/#modes
CRUSH
The CRUSH algorithm computes which OSDs an object is placed on. The mapping consists of two steps:
-
Calculate the mapping from object to PG with a hash function. As long as the number of PGs stays the same, this result stays the same.
Hash(oid) = pgid
-
Calculate the mapping from PG to concrete OSDs, usually with the straw2 algorithm (described in detail in the CRUSH chapter of Ceph's RADOS Design Principle and Implementation). This result can be changed by adjusting OSD weights.
CRUSH(pgid) = OSDid
Objects are therefore distributed across PGs, and the mapping stays stable as long as the number of PGs is unchanged. Managing the distribution of PGs is thus equivalent to managing the distribution of objects across the whole cluster. The first chapter of Ceph's RADOS Design Principle and Implementation covers PG splitting and expansion in detail and explains why this scheme is efficient and simple.
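To make the two-step mapping concrete, here is a toy sketch in Python (an illustration only, not Ceph's real code, which lives in the CRUSH library): step 1 hashes the object name onto a PG, step 2 performs a straw2-style weighted draw for each replica position. The object name, weights and pg_num below are made-up values.

# Toy illustration of the two-step mapping (not Ceph's actual implementation).
import hashlib
import math

PG_NUM = 128
OSD_WEIGHTS = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0}  # e.g. weight 1.0 per TiB

def stable_hash(*parts) -> float:
    """Deterministic hash of the inputs mapped into (0, 1)."""
    h = hashlib.sha256("/".join(str(p) for p in parts).encode()).hexdigest()
    return (int(h, 16) % 10**9 + 1) / (10**9 + 1)

def object_to_pg(oid: str) -> int:
    # Hash(oid) = pgid: stable as long as PG_NUM does not change.
    return int(stable_hash(oid) * PG_NUM)

def pg_to_osds(pgid: int, replicas: int = 3) -> list:
    # straw2 idea: each candidate draws ln(u)/weight and the largest draw wins,
    # so a larger weight raises the probability of being selected.
    chosen = []
    for r in range(replicas):
        best, best_straw = None, None
        for osd, weight in OSD_WEIGHTS.items():
            if osd in chosen:
                continue
            straw = math.log(stable_hash(pgid, r, osd)) / weight
            if best_straw is None or straw > best_straw:
                best, best_straw = osd, straw
        chosen.append(best)
    return chosen

pgid = object_to_pg("rbd_data.1234")
print(pgid, pg_to_osds(pgid))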
Edit CRUSH map
The CRUSH map consists of two main parts: the cluster map and the placement rules. The former describes how the cluster's devices are organized; the latter specifies the steps and rules used to select OSDs.
Get CRUSH map
The CRUSH map obtained by this command is compiled binary; it must be decompiled before it can be edited as text.
ceph osd getcrushmap -o {compiled-crushmap-filename}

[root@node-1 ~]# ceph osd getcrushmap -o crushmap
7
[root@node-1 ~]# ls -al crushmap
-rw-r--r-- 1 root root 845 6 October 22:31 crushmap
Decompile CRUSH map
Convert the CRUSH map obtained by getcrushmap into readable, editable text.
crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-file}

[root@node-1 ~]# crushtool -d crushmap -o crushmap.txt
[root@node-1 ~]# cat crushmap.txt
# begin crush map
# Tunables, generally left unchanged
tunable choose_local_tries 0          # Obsolete, set to 0 for backward compatibility
tunable choose_local_fallback_tries 0 # Obsolete, set to 0 for backward compatibility
tunable choose_total_tries 50         # Maximum number of bucket selection attempts; default 50
tunable chooseleaf_descend_once 1     # Obsolete, set to 1 for backward compatibility
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1           # Avoid unnecessary PG migration
tunable straw_calc_version 1          # straw algorithm version, backward compatible, set to 1
tunable allowed_bucket_algs 54        # Allowed bucket selection algorithms; 54 means straw2

# devices
# The leaf physical devices (i.e. OSDs); these normally do not need to be edited by hand.
# Device ids are >= 0 and are distinct from the bucket ids below.
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd

# types
# Bucket types; they can be customized (types added or removed), and the number must be a non-negative integer.
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
# All intermediate nodes are called buckets. A bucket can be a collection of devices or of lower-level buckets.
# The root bucket is the entry point of the whole cluster. Bucket ids must be negative and unique;
# a bucket's actual storage location in the CRUSH map is buckets[-1 - (bucket id)].
host node-1 {
    id -3            # do not change unnecessarily
    id -4 class hdd  # do not change unnecessarily
    # weight 0.010   # The bucket weight equals the sum of its item weights
    alg straw2       # Use the straw2 algorithm
    hash 0           # rjenkins1
    item osd.0 weight 0.010  # The OSDs contained in this bucket and their weights; weight usually follows capacity, e.g. 1 TB = weight 1
}
host node-2 {
    id -5            # do not change unnecessarily
    id -6 class hdd  # do not change unnecessarily
    # weight 0.010
    alg straw2
    hash 0           # rjenkins1
    item osd.1 weight 0.010
}
host node-3 {
    id -7            # do not change unnecessarily
    id -8 class hdd  # do not change unnecessarily
    # weight 0.010
    alg straw2
    hash 0           # rjenkins1
    item osd.2 weight 0.010
}
# The root bucket (at least one is required) is the entry point used by the placement rules.
root default {
    id -1            # do not change unnecessarily
    id -2 class hdd  # do not change unnecessarily
    # weight 0.029
    alg straw2
    hash 0           # rjenkins1
    item node-1 weight 0.010  # This bucket contains three sub-buckets, each with weight 0.010
    item node-2 weight 0.010
    item node-3 weight 0.010
}

# rules
# placement rules. Note: there is only one CRUSH map, but it can define multiple rules
rule replicated_rule {
    id 0                           # rule id
    type replicated                # Type: [replicated|erasure]
    min_size 1                     # The rule is not applied if the pool has fewer replicas than this
    max_size 10                    # The rule is not applied if the pool has more replicas than this
    step take default              # Entry point of the rule, usually a bucket of type root
    step choose firstn 0 type osd  # choose or chooseleaf; num is how many to select, type is the desired bucket type
    step emit                      # Output the result
}
# end crush map
Compile CRUSH map
crushtool -c {decompiled-crush-map-filename} -o {compiled-crush-map-filename}
Simulation test
You can simulate and test the compiled CRUSH map.
--min-x: minimum input value. The input x stands in for the object name and ranges over [min-x, max-x].
--max-x: maximum input value.
--num-rep: number of replicas. The number of OSDs in each output equals the number of replicas.
--ruleset: rule id; selects which placement rule to test.
crushtool -i {compiled-crush-map-filename} --test --min-x 0 --max-x 9 --num-rep 3 --ruleset 0 --show_mappings

[root@node-1 ~]# crushtool -i crushmap --test --min-x 0 --max-x 9 --num-rep 3 --ruleset 0 --show_mappings
CRUSH rule 0 x 0 [1,2,0]
CRUSH rule 0 x 1 [2,0,1]
CRUSH rule 0 x 2 [2,1,0]
CRUSH rule 0 x 3 [0,1,2]
CRUSH rule 0 x 4 [1,2,0]
CRUSH rule 0 x 5 [0,1,2]
CRUSH rule 0 x 6 [2,0,1]
CRUSH rule 0 x 7 [1,2,0]
CRUSH rule 0 x 8 [2,0,1]
CRUSH rule 0 x 9 [1,2,0]
Or show only the statistical distribution of the results.
crushtool -i {compiled-crush-map-filename} --test --min-x 0 --max-x 10000 --num-rep 3 --ruleset 0 --show_utilization

[root@node-1 ~]# crushtool -i crushmap --test --min-x 0 --max-x 10000 --num-rep 3 --ruleset 0 --show_utilization
rule 0 (replicated_rule), x = 0..10000, numrep = 3..3
rule 0 (replicated_rule) num_rep 3 result size == 3:    10001/10001
  device 0:             stored : 10001  expected : 10001
  device 1:             stored : 10001  expected : 10001
  device 2:             stored : 10001  expected : 10001
Inject into the cluster
The CRUSH map does not take effect until it is injected into the cluster.
ceph osd setcrushmap -i {compiled-crushmap-filename}

[root@node-1 ~]# ceph osd setcrushmap -i crushmap
8
Write a CRUSH map that places the primary replica on SSD and the other replicas on HDD
How can the primary replica always be assigned to SSD devices and the secondary replicas to HDD devices?
First, note that the primary replica is simply the first device produced by the CRUSH calculation, and the secondary replicas are the devices produced after it. As long as the first emit always outputs an SSD device, the primary replica will always land on an SSD.
Here is an example.
The device layout is shown below: three hosts, each with one SSD OSD and two HDD OSDs.
                          root
         _ _ _ _ _ _ _ _ _ _|_ _ _ _ _ _ _ _ _ _
        |                   |                   |
      node-1              node-2              node-3
    _ _ |_ _            _ _ |_ _            _ _ |_ _
   |    |    |         |    |    |         |    |    |
 osd.0 osd.1 osd.2   osd.3 osd.4 osd.5   osd.6 osd.7 osd.8
  SSD   HDD   HDD     SSD   HDD   HDD     SSD   HDD   HDD
Writing the map
First, modify the devices.
# Modify the devices
device 0 osd.0 class SSD
device 1 osd.1 class HDD
device 2 osd.2 class HDD
device 3 osd.3 class SSD
device 4 osd.4 class HDD
device 5 osd.5 class HDD
device 6 osd.6 class SSD
device 7 osd.7 class HDD
device 8 osd.8 class HDD
Then modify the cluster map: put the SSDs into one group and, taking the physical failure domain into account, split the HDDs into three groups by host.
# buckets
# Put the SSDs into one group
host node-SSD {
    id -1       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.0 weight 0.010  # The weight can be adjusted according to disk capacity
    item osd.3 weight 0.010
    item osd.6 weight 0.010
}
# The HDDs are split into three groups, one per host
host node-1-HDD {
    id -2       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.1 weight 0.1
    item osd.2 weight 0.1
}
host node-2-HDD {
    id -3       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.4 weight 0.1
    item osd.5 weight 0.1
}
host node-3-HDD {
    id -4       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.7 weight 0.1
    item osd.8 weight 0.1
}
Next, define the entry points. There are two: an SSD root and an HDD root.
Note: there is no essential difference between root and host; both are just bucket types, and a host bucket could equally serve as an entry point.
# root buckets
root root-SSD {
    id -5       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item node-SSD weight 0.030  # Note that the weight equals the sum of the three item weights inside node-SSD
}
root root-HDD {
    id -6       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item node-1-HDD weight 0.2  # Note that the weight equals the sum of the two item weights inside each node-x-HDD
    item node-2-HDD weight 0.2
    item node-3-HDD weight 0.2
}
Finally, the placement rule. The rule must output the SSD device first and the HDD devices afterwards.
# rules
rule ssd-primary {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take root-SSD
    step chooseleaf firstn 1 type host
    step emit
    step take root-HDD
    step chooseleaf firstn -1 type host  # -1 means select (number of replicas - 1) hosts; with 3 replicas, 2 hosts are selected here
    step emit
}
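To make the firstn numbers concrete, here is a small helper (an illustration of the documented semantics, not Ceph code): 0 means "as many as the pool's replica count", a positive value is taken literally, and a negative value means "replica count minus |num|".

def resolve_firstn(num: int, pool_size: int) -> int:
    """Resolve the count requested by 'step choose/chooseleaf firstn <num>'.

    0   -> select pool_size items
    >0  -> select exactly num items
    <0  -> select pool_size - |num| items
    (Illustrative only; the real logic lives inside CRUSH itself.)
    """
    if num == 0:
        return pool_size
    if num > 0:
        return num
    return pool_size + num  # num is negative here

# For the ssd-primary rule above with a 3-replica pool:
print(resolve_firstn(1, 3))   # 1 host from root-SSD  -> the primary
print(resolve_firstn(-1, 3))  # 2 hosts from root-HDD -> the secondaries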
Test
[root@node-1 ~]# vi mycrushmap.txt
[root@node-1 ~]# crushtool -c mycrushmap.txt -o mycrushmap
[root@node-1 ~]# crushtool -i mycrushmap --test --min-x 0 --max-x 9 --num-rep 3 --ruleset 1 --show_mappings
CRUSH rule 1 x 0 [0,7,1]
CRUSH rule 1 x 1 [3,2,5]
CRUSH rule 1 x 2 [0,8,4]
CRUSH rule 1 x 3 [0,8,4]
CRUSH rule 1 x 4 [3,1,8]
CRUSH rule 1 x 5 [3,7,4]
CRUSH rule 1 x 6 [6,7,4]
CRUSH rule 1 x 7 [6,1,8]
CRUSH rule 1 x 8 [6,2,5]
CRUSH rule 1 x 9 [6,8,1]
The test results show that the primary OSD, i.e. the first OSD in every result, is always an SSD (osd.0, osd.3 or osd.6). The CRUSH map therefore behaves as intended.
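For longer mapping lists, a small script can check the result instead of eyeballing it. A sketch that parses crushtool --show_mappings output like the above and verifies that the first OSD of every result is one of the SSD OSDs (osd.0, osd.3 and osd.6 in this example); the mappings.txt file name is an assumption.

import re

SSD_OSDS = {0, 3, 6}  # the SSD OSDs in the example map above

def check_primary_is_ssd(crushtool_output: str) -> bool:
    """Parse lines like 'CRUSH rule 1 x 0 [0,7,1]' and check the first OSD."""
    ok = True
    for line in crushtool_output.splitlines():
        m = re.search(r"CRUSH rule \d+ x (\d+) \[([\d,]+)\]", line)
        if not m:
            continue
        primary = int(m.group(2).split(",")[0])
        if primary not in SSD_OSDS:
            print(f"x {m.group(1)}: primary osd.{primary} is not an SSD")
            ok = False
    return ok

# Usage (assuming the crushtool output was saved to mappings.txt):
# with open("mappings.txt") as f:
#     print("all primaries on SSD:", check_primary_is_ssd(f.read()))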
Appendix
Here is the complete CRUSH map text used above.
[root@node-1 ~]# vi mycrushmap.txt

device 0 osd.0 class SSD
device 1 osd.1 class HDD
device 2 osd.2 class HDD
device 3 osd.3 class SSD
device 4 osd.4 class HDD
device 5 osd.5 class HDD
device 6 osd.6 class SSD
device 7 osd.7 class HDD
device 8 osd.8 class HDD

type 0 osd
type 1 host
type 11 root

host node-SSD {
    id -1       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.0 weight 0.010
    item osd.3 weight 0.010
    item osd.6 weight 0.010
}
host node-1-HDD {
    id -2       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.1 weight 0.1
    item osd.2 weight 0.1
}
host node-2-HDD {
    id -3       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.4 weight 0.1
    item osd.5 weight 0.1
}
host node-3-HDD {
    id -4       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item osd.7 weight 0.1
    item osd.8 weight 0.1
}
root root-SSD {
    id -5       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item node-SSD weight 0.030
}
root root-HDD {
    id -6       # do not change unnecessarily
    alg straw2
    hash 0      # rjenkins1
    item node-1-HDD weight 0.2
    item node-2-HDD weight 0.2
    item node-3-HDD weight 0.2
}
rule ssd-primary {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take root-SSD
    step chooseleaf firstn 1 type host
    step emit
    step take root-HDD
    step chooseleaf firstn -1 type host
    step emit
}
Other CRUSH commands
View device map
Print the device map as a tree, in depth-first order.
ceph osd tree

[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       0.03918 root default
-3       0.01959     host node-1
 0   hdd 0.00980         osd.0       up  1.00000 1.00000
 3   hdd 0.00980         osd.3       up  1.00000 1.00000
-5       0.00980     host node-2
 1   hdd 0.00980         osd.1       up  1.00000 1.00000
-7       0.00980     host node-3
 2   hdd 0.00980         osd.2       up  1.00000 1.00000
View placement rule
Check which rules exist in the CRUSH map and the specific steps of each rule.
# List all rules
ceph osd crush rule ls
# Show the specific steps of a rule
ceph osd crush rule dump

[root@node-1 ~]# ceph osd crush rule ls
replicated_rule
[root@node-1 ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
Data rebalancing
Although the CRUSH algorithm is designed to distribute data as evenly as possible, in real production this balance is gradually broken as the amount of stored data grows and the cluster's devices change.
To address this, Ceph provides a set of tools for redistributing data and bringing the cluster back to a reasonably balanced state.
View cluster space usage
ceph df gives a rough view of the space usage of the cluster's devices and storage pools.
Under RAW STORAGE, USED is smaller than RAW USED: USED is the space occupied by user data, while RAW USED is the total consumption, including metadata generated by the cluster itself.
Under POOLS, STORED is the usage as seen by the user and USED is the actual space consumed in the cluster, roughly three times as much. The pool keeps three replicas, so every byte of data occupies three bytes of raw space.
[root@node-1 ceph-deploy]# ceph df
RAW STORAGE:
    # Columns: class / size / free space / used space / total raw space used / percentage of total used
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED
    hdd       40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.17
    TOTAL     40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.17

POOLS:
    # Columns: pool name / id / data volume / number of objects / ...
    POOL         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    pool-1        1     376 MiB          95     1.1 GiB      3.30        11 GiB
    rbd-pool      2      22 MiB          17      67 MiB      0.20        11 GiB
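As a quick sanity check of the roughly 3x relationship, using the pool-1 figures from the output above:

# STORED vs USED for a 3-replica pool, with the pool-1 numbers from ceph df above.
stored_mib = 376          # STORED: data as seen by the client
replicas = 3              # pool size (3-way replication)
used_mib = stored_mib * replicas
print(f"expected USED ~ {used_mib} MiB ~ {used_mib / 1024:.1f} GiB")  # ~1.1 GiB, matching USED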
ceph osd df tree shows the space usage of every device in detail.
VAR = current OSD utilization / average cluster utilization.
The "MIN/MAX VAR: 0.88/1.07" in the last line gives the minimum and maximum VAR, from which you can judge how well balanced the whole cluster is.
# Depth-first traversal: output the details of all buckets and devices in the CRUSH map
[root@node-1 ceph-deploy]# ceph osd df tree
# Columns: weight / reweight / size / total used / data / omap / metadata / available / usage % / VAR / PG count
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-1       0.03918        - 40 GiB 5.3 GiB 1.3 GiB 112 KiB 4.0 GiB  35 GiB  13.17 1.00   -        root default
-3       0.01959        - 20 GiB 2.4 GiB 449 MiB  32 KiB 2.0 GiB  18 GiB  12.20 0.93   -            host node-1
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 282 MiB  32 KiB 1024 MiB 8.7 GiB 12.76 0.97  92     up         osd.0
 3   hdd 0.00980  1.00000 10 GiB 1.2 GiB 167 MiB     0 B 1 GiB    8.8 GiB 11.64 0.88 100     up         osd.3
-5       0.00980        - 10 GiB 1.4 GiB 424 MiB  48 KiB 1024 MiB 8.6 GiB 14.14 1.07   -            host node-2
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 424 MiB  48 KiB 1024 MiB 8.6 GiB 14.14 1.07 192     up         osd.1
-7       0.00980        - 10 GiB 1.4 GiB 424 MiB  32 KiB 1024 MiB 8.6 GiB 14.14 1.07   -            host node-3
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 424 MiB  32 KiB 1024 MiB 8.6 GiB 14.14 1.07 192     up         osd.2
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 113 KiB 4.0 GiB  35 GiB  13.17
MIN/MAX VAR: 0.88/1.07  STDDEV: 1.05

# Output only the OSD details
[root@node-1 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS TYPE NAME
-1       0.03918        - 40 GiB 5.3 GiB 1.3 GiB  96 KiB 4.0 GiB  35 GiB  13.18 1.00   -        root default
-3       0.01959        - 20 GiB 2.4 GiB 452 MiB  64 KiB 2.0 GiB  18 GiB  12.21 0.93   -            host node-1
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 283 MiB  48 KiB 1024 MiB 8.7 GiB 12.77 0.97  92     up         osd.0
 3   hdd 0.00980  1.00000 10 GiB 1.2 GiB 169 MiB  16 KiB 1024 MiB 8.8 GiB 11.65 0.88 100     up         osd.3
-5       0.00980        - 10 GiB 1.4 GiB 425 MiB  16 KiB 1024 MiB 8.6 GiB 14.16 1.07   -            host node-2
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB  16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192     up         osd.1
-7       0.00980        - 10 GiB 1.4 GiB 425 MiB  16 KiB 1024 MiB 8.6 GiB 14.16 1.07   -            host node-3
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB  16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192     up         osd.2
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB  97 KiB 4.0 GiB  35 GiB  13.18
MIN/MAX VAR: 0.88/1.07  STDDEV: 1.05
In practice the usable space of a three-replica Ceph cluster is even less than 33% of the raw capacity: some space has to be reserved for the cluster's own operation, so the average usable ratio ends up around 23% [Ceph's RADOS Design Principle and Implementation]. Replication is expensive: you eat one bowl of noodles and pay for three.
The space thresholds are controlled by mon_osd_full_ratio and mon_osd_nearfull_ratio. mon_osd_full_ratio defaults to 0.95: once an OSD is 95% full, it stops serving I/O (note: both reads and writes are blocked). mon_osd_nearfull_ratio defaults to 0.85: once utilization reaches 85%, a warning is raised.
Note that these two settings only take effect from the configuration file when the Ceph cluster is first created.
[global]
mon_osd_full_ratio = .80
mon_osd_backfillfull_ratio = .75
mon_osd_nearfull_ratio = .70
At runtime, these ratios can be changed with the ceph osd set-nearfull-ratio and ceph osd set-full-ratio commands.
[root@node-1 ceph-deploy]# ceph osd set-full-ratio 0.98
osd set-full-ratio 0.98
[root@node-1 ceph-deploy]# ceph osd dump | grep full
full_ratio 0.98
backfillfull_ratio 0.9
nearfull_ratio 0.85
[root@node-1 ceph-deploy]# ceph osd set-nearfull-ratio 0.95
osd set-nearfull-ratio 0.95
[root@node-1 ceph-deploy]# ceph osd dump | grep full
full_ratio 0.98
backfillfull_ratio 0.9
nearfull_ratio 0.95

# Note: when adjusting these parameters, mind the ordering constraints relative to the other ratios.
[root@node-1 ~]# ceph osd dump | grep full
full_ratio 0.98
backfillfull_ratio 0.9
nearfull_ratio 0.95
[root@node-1 ~]# ceph health detail
HEALTH_ERR full ratio(s) out of order
OSD_OUT_OF_ORDER_FULL full ratio(s) out of order
    backfillfull_ratio (0.9) < nearfull_ratio (0.95), increased
    osd_failsafe_full_ratio (0.97) < full_ratio (0.98), increased
These commands are usually used to temporarily raise the thresholds after the cluster has filled up, so that the OSDs can serve I/O again; the cluster is then brought back to health by expanding capacity, deleting data, rebalancing, and so on. A related operations article:
http://strugglesquirrel.com/2019/02/02/ceph%E8%BF%90%E7%BB%B4%E5%A4%A7%E5%AE%9D%E5%89%91%E4%B9%8B%E9%9B%86%E7%BE%A4osd%E4%B8%BAfull/
reweight
In theory the CRUSH algorithm balances data well, but real production environments are complex and changeable, so the actual balance is often unsatisfactory. reweight is a compensation mechanism designed for CRUSH: after the straw2 calculation picks an OSD, an overload test is run, and only an OSD that passes the test is really selected. The larger the reweight, the higher the probability of passing; the maximum value is 1, which is also the default. Already-allocated PGs are recalculated as well: if a PG's mapping changes, its data is migrated to the new OSD dynamically, so rebalancing happens while the cluster is running.
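The overload test itself is simple. Roughly (a simplified sketch of the idea, not the exact code in CRUSH's mapper): the reweight is stored as a 16-bit fixed-point value (0x10000 corresponds to 1.0), and an already-selected OSD is kept only if a per-input hash value falls below it, so reweight 1 always passes and reweight 0 never does.

import hashlib

def overload_test(pgid: int, osd_id: int, reweight: float) -> bool:
    """Simplified overload test: keep the OSD with probability roughly equal to reweight.

    reweight is scaled to 16 bits (0x10000 == 1.0), mirroring how Ceph stores it;
    the hash here is just a stand-in for CRUSH's rjenkins hash.
    """
    scaled = int(reweight * 0x10000)
    if scaled >= 0x10000:          # reweight == 1 -> always accepted
        return True
    if scaled == 0:                # reweight == 0 -> always rejected
        return False
    h = hashlib.sha256(f"{pgid}:{osd_id}".encode()).digest()
    draw = int.from_bytes(h[:2], "big")   # a pseudo-random 16-bit value
    return draw < scaled                  # passes with probability ~reweight

# With reweight 0.5, roughly half of the PGs mapped to this OSD get re-placed:
kept = sum(overload_test(pg, osd_id=3, reweight=0.5) for pg in range(1000))
print(kept, "of 1000 draws kept")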
Another benefit of the overload test, mentioned in Ceph's RADOS Design Principle and Implementation, is that it distinguishes a temporarily failed OSD from a permanently removed one. The distinction matters: for a temporary failure you can set the OSD's reweight to 0 so that the overload test removes it from the candidates and its data moves to other OSDs; when the OSD comes back and its reweight is restored to the original value, the data that originally belonged to it moves back. Deleting the OSD instead changes its unique numbering in the cluster map, so it may end up carrying different data and the amount of data migration is larger.
Use the following command to adjust the reweight of a single OSD. Internally the reweight is stored as a fixed-point integer with 0x10000 representing 1.0; the value echoed in parentheses by the command is this scaled value in hexadecimal (for example, 0.1 is echoed as 1999).
ceph osd reweight <osd_num|osd id> <reweight [0.0-1.0]>

# Set osd.3's reweight to 0 so that all of its PGs migrate to other OSDs.
[root@node-1 ~]# ceph osd reweight osd.3 0
reweighted osd.3 to 0 (0)
[root@node-1 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 319 MiB 48 KiB 1024 MiB 8.7 GiB 13.12 0.95 192 up
 3   hdd 0.00980        0    0 B     0 B     0 B    0 B      0 B     0 B     0    0   0 up
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.03 192 up
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.03 192 up
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB 4.0 GiB  35 GiB  13.81
MIN/MAX VAR: 0.95/1.03  STDDEV: 0.49

# Restore osd.3's reweight to the original value and the PGs migrate back.
[root@node-1 ~]# ceph osd reweight osd.3 1
reweighted osd.3 to 1 (10000)
[root@node-1 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 283 MiB 48 KiB 1024 MiB 8.7 GiB 12.80 0.97  92 up
 3   hdd 0.00980  1.00000 10 GiB 1.2 GiB 169 MiB 16 KiB 1024 MiB 8.8 GiB 11.66 0.88 100 up
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192 up
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192 up
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB 4.0 GiB  35 GiB  13.19
MIN/MAX VAR: 0.88/1.07  STDDEV: 1.05
reweight can also be adjusted in batch. Two modes are currently available: by the current space utilization of the OSDs, or by the distribution of PGs across the OSDs. Prefix either command with "test-" to preview the result without applying it; a rough sketch of the underlying idea follows the example below.
# Parameter description
# overload: adjust the reweight only for OSDs whose utilization is >= overload/100 of the cluster's average utilization. Range: > 100, default 120
# max_change: maximum reweight change per adjustment. Range: [0, 1], default 0.05
# max_osds: maximum number of OSDs adjusted at a time. Default 4
# --no-increasing: only allow lowering the reweight, never raising it. Remember the reweight range is [0, 1]
ceph osd <reweight-by-utilization|reweight-by-pg|test-reweight-by-utilization|test-reweight-by-pg> {overload} {max_change} {max_osds} {--no-increasing}

[root@node-1 ~]# ceph osd test-reweight-by-utilization 101
no change
moved 22 / 576 (3.81944%)
avg 144
stddev 48.0833 -> 42.9535 (expected baseline 10.3923)
min osd.0 with 92 -> 92 pgs (0.638889 -> 0.638889 * mean)
max osd.1 with 192 -> 182 pgs (1.33333 -> 1.26389 * mean)

oload 101
max_change 0.05
max_change_osds 4
average_utilization 0.1319
overload_utilization 0.1333
osd.2 weight 1.0000 -> 0.9500
osd.1 weight 1.0000 -> 0.9500
[root@node-1 ~]# ceph osd test-reweight-by-pg 101
no change
moved 22 / 576 (3.81944%)
avg 144
stddev 48.0833 -> 42.9535 (expected baseline 10.3923)
min osd.0 with 92 -> 92 pgs (0.638889 -> 0.638889 * mean)
max osd.1 with 192 -> 182 pgs (1.33333 -> 1.26389 * mean)

oload 101
max_change 0.05
max_change_osds 4
average_utilization 14699.6636
overload_utilization 14846.6602
osd.2 weight 1.0000 -> 0.9500
osd.1 weight 1.0000 -> 0.9500
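As promised above, here is a rough sketch of what reweight-by-utilization does (a simplification for illustration, not the monitor's exact code): OSDs whose utilization exceeds the cluster average by the overload factor get their reweight nudged toward average/utilization, bounded by max_change and max_osds. The utilization values below are the %USE figures from the earlier ceph osd df output.

# Simplified sketch of reweight-by-utilization (illustration only).
def reweight_by_utilization(util, reweight, overload=120, max_change=0.05,
                            max_osds=4, no_increasing=False):
    avg = sum(util.values()) / len(util)
    threshold = avg * overload / 100.0
    changes = {}
    # Consider the most over-utilized OSDs first
    for osd in sorted(util, key=util.get, reverse=True):
        if len(changes) >= max_osds:
            break
        u, w = util[osd], reweight[osd]
        if u >= threshold:                          # too full: lower the reweight
            new_w = max(w * avg / u, w - max_change)
            changes[osd] = round(min(new_w, 1.0), 4)
        elif u < avg and w < 1.0 and not no_increasing:  # under-used and previously lowered
            changes[osd] = round(min(w * avg / u, w + max_change, 1.0), 4)
    return changes

print(reweight_by_utilization(
    util={"osd.0": 0.1277, "osd.3": 0.1165, "osd.1": 0.1416, "osd.2": 0.1416},
    reweight={"osd.0": 1.0, "osd.3": 1.0, "osd.1": 1.0, "osd.2": 1.0},
    overload=101))
# -> {'osd.1': 0.95, 'osd.2': 0.95}, much like the test-reweight output above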
Note that once an OSD's reweight is already 1 it cannot be raised any further; the only way to favor it is to lower the reweight of every other OSD, which is inconvenient and causes frequent migrations while adjusting. It is therefore worth setting the reweight slightly below 1 when the cluster is first built, to leave headroom for later operations.
weight set
A key input to the straw2 algorithm is the weight of each candidate item. Each CRUSH calculation produces an ordered set of OSDs (3 OSDs when there are 3 replicas), so when an OSD is a candidate for different positions in that set, its selection probability can usefully differ. A weight set lets each OSD present a different weight for each position.
Two modes are supported:
- Compatibility mode: like the original weight, a single number per OSD represents its weight.
- Per-pool (non-compat) mode: the weight set is bound to a specific storage pool, and a weight can be set for each replica position.
# Create a weight set in compatibility mode
ceph osd crush weight-set create-compat
# Adjust the weight of each OSD separately
ceph osd crush weight-set reweight-compat {name} {weight}
# Delete the compatibility-mode weight set
ceph osd crush weight-set rm-compat
# Create a weight set in per-pool mode
# flat: same effect as compatibility mode, a single value per OSD
# positional: a set of weights per OSD, one for each replica position
ceph osd crush weight-set create <poolname> flat|positional
# Delete
ceph osd crush weight-set rm <poolname>

[root@node-1 ~]# ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous
[root@node-1 ~]# ceph osd crush weight-set create pool-1 positional
[root@node-1 ~]# ceph osd crush weight-set reweight pool-1 osd.0 0.3 0.4 0.5
[root@node-1 ~]# ceph osd crush dump
...
        {
            "bucket_id": -4,
            "weight_set": [
                [
                    0.29998779296875,
                    0.009796142578125
                ],
                [
                    0.399993896484375,
                    0.009796142578125
                ],
                [
                    0.5,
                    0.009796142578125
                ]
            ]
        },
...
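To illustrate what the weight_set array above means (a plain data-layout illustration with a hypothetical helper, not a Ceph API): bucket -4 here appears to be the host bucket holding osd.0 and osd.3, and the weight set stores one row per replica position with one weight per item, which is why the 0.3/0.4/0.5 values set above show up as the first column.

# Layout of a positional weight set (values taken from the dump above, helper is hypothetical).
WEIGHT_SET = [
    # [osd.0,  osd.3]
    [0.300, 0.0098],   # weights used when choosing the 1st replica
    [0.400, 0.0098],   # weights used when choosing the 2nd replica
    [0.500, 0.0098],   # weights used when choosing the 3rd replica
]
ITEMS = ["osd.0", "osd.3"]

def weight_for(osd: str, position: int) -> float:
    """Weight of `osd` when it is a candidate for replica number `position`."""
    return WEIGHT_SET[position][ITEMS.index(osd)]

print(weight_for("osd.0", 0), weight_for("osd.0", 2))   # 0.3 for the primary draw, 0.5 for the third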
upmap
reweight and weight set can only raise or lower the probability of an OSD being selected; upmap can force a specific OSD directly into the result. By default no upmap entries exist, so the up set is simply the CRUSH-calculated result.
There are two ways to override a mapping:
-
Specify the complete mapping result for a PG
ceph osd pg-upmap <pgid> <osdname (id|osd.id)> [<osdname (id|osd.id)>...]

# View the PG mappings
[root@node-1 ~]# ceph pg dump | awk '{print $1, $17}'
...
PG_STAT UP_PRIMARY
1.7f [2,3,1]
1.7e [1,0,2]
1.7d [3,2,1]
1.7c [0,2,1]
1.7b [1,0,2]
...
# The number of OSDs specified must be greater than or equal to the pool's min_size
[root@node-1 ~]# ceph osd pg-upmap 1.7f 1
Error EINVAL: num of osds (1) < pool min size (2)
[root@node-1 ~]# ceph osd pg-upmap 1.7f 0 0 0
osd.0 already exists, osd.0 already exists,
set 1.7f pg_upmap mapping to [0]
[root@node-1 ~]# ceph pg dump | awk '{print $1,$17}' | grep 1.7f
dumped all
1.7f [0]
-
Replace a single OSD in a PG's mapping result; both the source OSD and the target OSD must be specified
ceph osd pg-upmap-items <pgid> <osdname (id|osd.id)> [<osdname (id|osd.id)>...]

# In practice the primary OSD could not be replaced this way: the upmap soon reverts to the original set.
[root@node-1 ~]# ceph osd pg-upmap-items 1.7f 3 0
set 1.7f pg_upmap_items mapping to [3->0]
[root@node-1 ~]# ceph pg dump | awk '{print $1,$17}' | grep 1.7f
dumped all
1.7f [2,0,1]
# Checking again a little later, the mapping has changed back to the original set; the balancer adjusted it automatically.
[root@node-1 ~]# ceph pg dump | awk '{print $1,$17}' | grep 1.7f
dumped all
1.7f [2,3,1]
Delete upmap
ceph osd rm-pg-upmap <pgid>
ceph osd rm-pg-upmap-items <pgid>
balancer
The balancer is Ceph's automatic rebalancing tool; it relies mainly on the reweight, weight set and upmap mechanisms described above.
Check status
ceph balancer status

# Field description
# last_optimize_duration: duration of the last optimization run
# plans: optimization plans
# mode: the tool/strategy used when executing a plan; crush-compat, upmap and none are currently supported
# active: whether the balancer is enabled
# optimize_result: result of the last optimization
# last_optimize_started: when the last optimization started
[root@node-1 ~]# ceph balancer status
{
    "last_optimize_duration": "0:00:00.000220",
    "plans": [],
    "mode": "none",
    "active": true,
    "optimize_result": "Please do \"ceph balancer mode\" to choose a valid mode first",
    "last_optimize_started": "Wed Jun 23 15:58:21 2021"
}
On | off
ceph balancer on
ceph balancer off
The target_max_misplaced_ratio parameter limits the fraction of PGs that may be moved in each balancing step.
ceph config set mgr target_max_misplaced_ratio .07 # 7%
Set the interval at which the balancer runs automatically (default 60 s). Note that the slashes in the option name must not be omitted; it is a single option, not three separate commands.
ceph config set mgr mgr/balancer/sleep_interval 60
Set the start and end time of day between which the balancer may run, to avoid business peak hours. By default it runs all day; the time format is HHMM.
ceph config set mgr mgr/balancer/begin_time 0000
ceph config set mgr mgr/balancer/end_time 2400
Set the start and end weekdays for the balancer (default: the whole week). Note that 0 or 7 is Sunday, 1 is Monday, and so on.
ceph config set mgr mgr/balancer/begin_weekday 0
ceph config set mgr mgr/balancer/end_weekday 7
Restrict the balancer to specific storage pools.
ceph config set mgr mgr/balancer/pool_ids 1,2,3
Adjust mode
ceph balancer mode crush-compat
ceph balancer mode upmap
ceph balancer mode none # Equivalent to turning the balancer off
Generate a plan. Note: without a plan, the balancer performs no manual optimization.
ceph balancer optimize <plan> {<pools> [<pools>...]}
Execute a plan
ceph balancer execute <plan-name>
List all plans
ceph balancer ls
Delete a plan
ceph balancer rm <plan>
Evaluate the balance status of the current cluster. The smaller the number, the better
ceph balancer eval {pool}
View assessment details
ceph balancer eval-verbose {pool}
Evaluate a plan
ceph balancer eval <plan-name>
How to create a balancer plan for a cluster
Normally there is no need to create a plan manually: once the balancer is enabled and a mode is selected, it performs the optimization periodically on its own.
# First turn the balancer off
[root@node-1 ~]# ceph balancer off
# Select a mode for the balancer
[root@node-1 ~]# ceph balancer mode upmap
# The test cluster is already well balanced, so it simply refuses to optimize any further
[root@node-1 ~]# ceph balancer optimize myplan
Error EALREADY: Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect
# Manually adjust a reweight to create an unbalanced cluster. Note that 0 means the OSD is taken out entirely; with reweight 0 the cluster would still be perfectly balanced.
[root@node-1 ~]# ceph osd reweight osd.3 0.1
reweighted osd.3 to 0.1 (1999)
[root@node-1 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.00980  1.00000 10 GiB 1.4 GiB 427 MiB 48 KiB 1024 MiB 8.6 GiB 14.18 1.07 179 up
 3   hdd 0.00980  0.09999 10 GiB 1.0 GiB  33 MiB 16 KiB 1024 MiB 9.0 GiB 10.35 0.78  13 up
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 431 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192 up
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 431 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192 up
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB 4.0 GiB  35 GiB  13.24
MIN/MAX VAR: 0.78/1.07  STDDEV: 1.08
# Create a plan
[root@node-1 ~]# ceph balancer optimize myplan
# Evaluate the plan
[root@node-1 ~]# ceph balancer eval myplan
plan myplan final score 0.017154 (lower is better)
# Check the cluster's current score; the optimization helps a little, so execute the plan
[root@node-1 ~]# ceph balancer eval
current cluster score 0.017850 (lower is better)
# Execute the plan
[root@node-1 ~]# ceph balancer execute myplan
[root@node-1 ~]# ceph balancer eval
current cluster score 0.017154 (lower is better)
[root@node-1 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS
 0   hdd 0.00980  1.00000 10 GiB 1.4 GiB 408 MiB 48 KiB 1024 MiB 8.6 GiB 13.99 1.06 180 up
 3   hdd 0.00980  0.09999 10 GiB 1.1 GiB  57 MiB 16 KiB 1024 MiB 8.9 GiB 10.59 0.80  12 up
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 432 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192 up
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 432 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192 up
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB 4.0 GiB  35 GiB  13.25
MIN/MAX VAR: 0.80/1.07  STDDEV: 1.00
PG related commands
Get PG details
[root@node-1 ~]# ceph pg dump_json sum
dumped all
version 406
stamp 2021-06-24 09:24:46.705278
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
1.7f 0 0 0 0 0 0       0 0 0 0 active+clean 2021-06-24 09:11:24.543912  0'0 275:687 [2,1,0] 2 [2,1,0] 2  0'0 2021-06-23 14:51:08.406526  0'0 2021-06-22 09:31:39.906375 0
1.7e 1 0 0 0 0 4194304 0 0 2 2 active+clean 2021-06-24 09:11:24.318962 50'2 275:606 [1,0,2] 1 [1,0,2] 1 50'2 2021-06-23 14:37:05.824343 50'2 2021-06-23 14:37:05.824343 0
1.7d 0 0 0 0 0 0       0 0 0 0 active+clean 2021-06-24 09:11:22.895867  0'0 275:36  [0,2,1] 0 [0,2,1] 0  0'0 2021-06-23 12:21:07.406368  0'0 2021-06-22 09:31:09.128962 0
...
Query the total number of PGs on each OSD
[root@node-1 ~]# ceph pg dump osds
dumped osds
OSD_STAT USED    AVAIL   USED_RAW TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
3         59 MiB 8.9 GiB  1.1 GiB 10 GiB [0,1,2]      12              4
2        433 MiB 8.6 GiB  1.4 GiB 10 GiB [0,1,3]     192             75
1        433 MiB 8.6 GiB  1.4 GiB 10 GiB [0,2,3]     192             62
0        409 MiB 8.6 GiB  1.4 GiB 10 GiB [1,2,3]     180             51
sum      1.3 GiB  35 GiB  5.3 GiB 40 GiB
Query the details of the PGs on a specified OSD
[root@node-1 ~]# ceph pg ls-by-osd 0
PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES   OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP        ACTING    SCRUB_STAMP                DEEP_SCRUB_STAMP
1.1       0        0         0       0       0           0          0   0 active+clean   19m     0'0  275:573 [2,0,1]p2 [2,0,1]p2 2021-06-23 14:37:36.380990 2021-06-22 09:28:39.643688
1.2       1        0         0       0 4194304           0          0   2 active+clean   19m    50'2  275:606 [1,0,2]p1 [1,0,2]p1 2021-06-23 14:44:23.268353 2021-06-23 14:44:23.268353
1.3       0        0         0       0       0           0          0   0 active+clean   19m     0'0  275:472 [1,2,0]p1 [1,2,0]p1 2021-06-23 10:20:09.588889 2021-06-23 10:20:09.588889
...
Query the details of the PGs in a specified pool
[root@node-1 ~]# ceph pg ls-by-pool pool-1
PG  OBJECTS DEGRADED MISPLACED UNFOUND BYTES   OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP        ACTING    SCRUB_STAMP                DEEP_SCRUB_STAMP
1.0       1        0         0       0 4194304           0          0   2 active+clean   22m    50'2  275:715 [1,2,3]p1 [1,2,3]p1 2021-06-23 12:56:25.914554 2021-06-22 09:29:38.155739
1.1       0        0         0       0       0           0          0   0 active+clean   22m     0'0  275:573 [2,0,1]p2 [2,0,1]p2 2021-06-23 14:37:36.380990 2021-06-22 09:28:39.643688
1.2       1        0         0       0 4194304           0          0   2 active+clean   22m    50'2  275:606 [1,0,2]p1 [1,0,2]p1 2021-06-23 14:44:23.268353 2021-06-23 14:44:23.268353
1.3       0        0         0       0       0           0          0   0 active+clean   22m     0'0  275:472 [1,2,0]p1 [1,2,0]p1 2021-06-23 10:20:09.588889 2021-06-23 10:20:09.588889
...
Query the number of objects in each PG
[root@node-1 ~]# ceph pg dump | awk '{print $1, $2}'
dumped all
version 760
stamp 2021-06-24
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS
1.7f 0
1.7e 1
1.7d 0
1.7c 1
1.7b 2
1.7a 1
1.79 1
1.78 2
1.77 1
Query the number of objects on the specified OSD
[root@node-1 ~]# ceph pg ls-by-osd 0 | awk 'BEGIN{sum = 0;}{sum+=$2}END {print "objects: " sum}'
objects: 106
Query the number of objects on the specified pool
Note that this count includes replica objects, so the number of actual objects may be only a third of it or less.
[root@node-1 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED
    hdd       40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.26
    TOTAL     40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.26

POOLS:
    POOL         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    pool-1        1     376 MiB          95     1.1 GiB      3.30        11 GiB
    rbd-pool      2      22 MiB          17      67 MiB      0.20        11 GiB
Appendix
Monitor commands:
=================
pg cancel-force-backfill <pgid> [<pgid>...]                  restore normal backfill priority of <pgid>
pg cancel-force-recovery <pgid> [<pgid>...]                  restore normal recovery priority of <pgid>
pg debug unfound_objects_exist|degraded_pgs_exist            show debug info about pgs
pg deep-scrub <pgid>                                         start deep-scrub on <pgid>
pg dump {all|summary|sum|delta|pools|osds|pgs|pgs_brief [all|summary|sum|delta|pools|osds|pgs|pgs_brief...]}   show human-readable versions of pg map (only 'all' valid with plain)
pg dump_json {all|summary|sum|pools|osds|pgs [all|summary|sum|pools|osds|pgs...]}                              show human-readable version of pg map in json only
pg dump_pools_json                                           show pg pools info in json only
pg dump_stuck {inactive|unclean|stale|undersized|degraded [inactive|unclean|stale|undersized|degraded...]} {<int>}   show information about stuck pgs
pg force-backfill <pgid> [<pgid>...]                         force backfill of <pgid> first
pg force-recovery <pgid> [<pgid>...]                         force recovery of <pgid> first
pg getmap                                                    get binary pg map to -o/stdout
pg ls {<int>} {<states> [<states>...]}                       list pg with specific pool, osd, state
pg ls-by-osd <osdname (id|osd.id)> {<int>} {<states> [<states>...]}        list pg on osd [osd]
pg ls-by-pool <poolstr> {<states> [<states>...]}             list pg with pool = [poolname]
pg ls-by-primary <osdname (id|osd.id)> {<int>} {<states> [<states>...]}    list pg with primary = [osd]
pg map <pgid>                                                show mapping of pg to osds
pg repair <pgid>                                             start repair on <pgid>
pg repeer <pgid>                                             force a PG to repeer
pg scrub <pgid>                                              start scrub on <pgid>
pg stat                                                      show placement group status