CRUSH and PG distribution

Reference material:
Ceph: RADOS Design Principles and Implementation (book)
https://docs.ceph.com/en/latest/rados/operations/crush-map/
http://strugglesquirrel.com/2019/02/02/ceph%E8%BF%90%E7%BB%B4%E5%A4%A7%E5%AE%9D%E5%89%91%E4%B9%8B%E9%9B%86%E7%BE%A4osd%E4%B8%BAfull/
https://docs.ceph.com/en/latest/rados/operations/balancer/#modes

CRUSH

The CRUSH algorithm computes which OSDs an object is placed on. It works in two steps:

  1. Map the object to a PG with a hash function. As long as the number of PGs stays the same, the result stays the same.

    Hash(oid) = pgid

  2. Map the PG to specific OSDs, usually with the straw2 bucket algorithm (covered in detail in the CRUSH chapter of Ceph: RADOS Design Principles and Implementation). This result can be changed by adjusting OSD weights.

    CRUSH(pgid) = OSDid

Objects are therefore distributed among PGs, and the object-to-PG mapping stays stable as long as the number of PGs does not change; managing the distribution of PGs is equivalent to managing the distribution of objects across the whole cluster. The first chapter of Ceph: RADOS Design Principles and Implementation describes PG splitting and expansion in detail and explains why this scheme is efficient and simple.
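
A minimal sketch of the two-step mapping in Python, assuming a simple stable hash and treating the CRUSH step as a black box (the real implementation uses the rjenkins hash and straw2 bucket selection):

import hashlib

def object_to_pg(oid: str, pg_num: int, pool_id: int = 1) -> str:
    # Step 1: hash the object name; the PG only changes if pg_num changes.
    h = int(hashlib.md5(oid.encode()).hexdigest(), 16)   # stand-in for Ceph's hash
    return f"{pool_id}.{h % pg_num:x}"

def pg_to_osds(pgid: str, cluster_map, num_rep: int = 3):
    # Step 2: CRUSH(pgid) -> an ordered list of OSDs. The result depends on the
    # cluster map, the placement rule and the OSD weights, so adjusting weights
    # changes this mapping (the straw2 selection itself is omitted here).
    raise NotImplementedError

print(object_to_pg("rbd_data.1234abcd", pg_num=128))   # stable as long as pg_num stays 128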

Edit CRUSH map

A CRUSH map consists of two main parts: the cluster map and the placement rules. The former describes the layout of the cluster's devices; the latter specifies the steps and rules for selecting OSDs.

Get CRUSH map

The CRUSH map obtained with this command is in compiled (binary) form and must be decompiled before it can be edited as text.

ceph osd getcrushmap -o {compiled-crushmap-filename}

[root@node-1 ~]# ceph osd getcrushmap -o crushmap
7
[root@node-1 ~]# ls -al crushmap 
-rw-r--r-- 1 root root 845 Oct  6 22:31 crushmap

Decompile CRUSH map

Convert the CRUSH map obtained by getcrushmap into readable, editable text.

crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-file}

[root@node-1 ~]# crushtool -d crushmap -o crushmap.txt
[root@node-1 ~]# cat crushmap.txt 
# begin crush map
# Generally not changed
tunable choose_local_tries 0              # Obsolete, set to 0 for backward compatibility
tunable choose_local_fallback_tries 0     # Obsolete, set to 0 for backward compatibility
tunable choose_total_tries 50             # Maximum number of attempts when selecting buckets; default 50
tunable chooseleaf_descend_once 1         # Obsolete, set to 1 for backward compatibility
tunable chooseleaf_vary_r 1               # Recursive chooseleaf retries vary r, giving better results when devices are marked out
tunable chooseleaf_stable 1               # Avoid unnecessary pg migration
tunable straw_calc_version 1              # straw algorithm version; kept at 1 for backward compatibility
tunable allowed_bucket_algs 54            # The bucket selection algorithm allowed, 54 represents straw2 algorithm

# devices
# Each leaf physical device (i.e. OSD); these are the leaf nodes and generally do not need to be edited manually.
# Device ids are greater than or equal to 0, unlike the bucket ids below.
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd

# types
# Bucket types can be customized (types can be added or removed); the type numbers must be non-negative integers.
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
# All intermediate nodes are called buckets. A bucket can be a collection of devices or of lower-level buckets. The root node, called root, is the entry point of the whole cluster. Bucket ids must be negative and unique. A bucket with id x is actually stored in the CRUSH map at buckets[-1-x].
host node-1 {
	id -3		                 # do not change unnecessarily
	id -4 class hdd		         # do not change unnecessarily
	# weight 0.010               # This bucket weight is equal to the sum of item weights
	alg straw2                   # Using straw2 algorithm
	hash 0	                     # rjenkins1
	item osd.0 weight 0.010      # OSDs contained in this bucket and their weights; the weight usually reflects capacity, e.g. 1 TB = weight 1
}
host node-2 {
	id -5		# do not change unnecessarily
	id -6 class hdd		# do not change unnecessarily
	# weight 0.010
	alg straw2
	hash 0	# rjenkins1
	item osd.1 weight 0.010
}
host node-3 {
	id -7		# do not change unnecessarily
	id -8 class hdd		# do not change unnecessarily
	# weight 0.010
	alg straw2
	hash 0	# rjenkins1
	item osd.2 weight 0.010
}
# The root bucket (there must be at least one) is the entry point for placement rules.
root default {
	id -1		# do not change unnecessarily
	id -2 class hdd		# do not change unnecessarily
	# weight 0.029
	alg straw2
	hash 0	# rjenkins1
	item node-1 weight 0.010      # This bucket contains three sub buckets, each with a weight of 0.01
	item node-2 weight 0.010
	item node-3 weight 0.010
}

# rules
# placement rules. Note: there is only one CRUSH map, but multiple rules can be defined in it
rule replicated_rule {
	id 0                                   # id
	type replicated                        # Type [replicated|erasure]
	min_size 1                             # If the number of pool replicas is less than this value, this rule will not be applied
	max_size 10                            # If the number of pool replicas is greater than this value, this rule will not be applied
	step take default                      # The entry of crush rules is generally a bucket of type root
	step choose firstn 0 type osd          # Step types are choose and chooseleaf; the number after firstn is how many buckets to pick (0 means the pool's replica count), and type is the bucket type to select.
	step emit                              # Output results
}

# end crush map

Compile CRUSH map

crushtool -c {decomplied-crush-map-filename} -o {complied-crush-map-filename}

Simulation test

You can run a simulation test against a compiled CRUSH map.
--min-x: minimum input value; each x in [min, max] stands in for an object name.
--max-x: maximum input value.
--num-rep: number of replicas; the number of OSDs output equals this value.
--ruleset: rule id; selects which placement rule to use.

crushtool -i {complied-crush-map-filename} --test --min-x 0 --max-x 9 --num-rep 3 --ruleset 0 --show_mappings

[root@node-1 ~]# crushtool -i crushmap --test --min-x 0 --max-x 9 --num-rep 3 --ruleset 0 --show_mappings
CRUSH rule 0 x 0 [1,2,0]
CRUSH rule 0 x 1 [2,0,1]
CRUSH rule 0 x 2 [2,1,0]
CRUSH rule 0 x 3 [0,1,2]
CRUSH rule 0 x 4 [1,2,0]
CRUSH rule 0 x 5 [0,1,2]
CRUSH rule 0 x 6 [2,0,1]
CRUSH rule 0 x 7 [1,2,0]
CRUSH rule 0 x 8 [2,0,1]
CRUSH rule 0 x 9 [1,2,0]

Or show only the statistical distribution of the results.

crushtool -i {complied-crush-map-filename} --test --min-x 0 --max-x 10000 --num-rep 3 --ruleset 0 --show_utilization

[root@node-1 ~]# crushtool -i crushmap --test --min-x 0 --max-x 10000 --num-rep 3 --ruleset 0 --show_utilization
rule 0 (replicated_rule), x = 0..10000, numrep = 3..3
rule 0 (replicated_rule) num_rep 3 result size == 3:	10001/10001
  device 0:		 stored : 10001	 expected : 10001
  device 1:		 stored : 10001	 expected : 10001
  device 2:		 stored : 10001	 expected : 10001

Inject into the cluster

The CRUSH map does not take effect until it is injected into the cluster.

ceph osd setcrushmap -i {complied-crushmap-filename}

[root@node-1 ~]# ceph osd setcrushmap -i crushmap
8

A CRUSH map that writes the primary replica to SSDs and the secondary replicas to HDDs

How can the primary replica always be placed on SSD devices and the secondary replicas on HDD devices?

First, understand that the primary replica is simply the first device calculated by CRUSH, and the secondary replicas are the devices calculated after it. As long as the first emit always outputs an SSD-class device, the primary replica will always land on an SSD.

Next, take an example.

The device layout is shown below. There are three hosts, each with one SSD OSD and two HDD OSDs.

                                root
           _ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _ _ _ _ _ _ _ _ 
          |                       |                       |
        node-1                  node-2                  node-3
   _ _ _ _|_ _ _ _         _ _ _ _|_ _ _ _         _ _ _ _|_ _ _ _ 
  |       |       |       |       |       |       |       |       |
osd.0   osd.1   osd.2   osd.3   osd.4   osd.5   osd.6   osd.7   osd.8 
 SSD     HDD     HDD     SSD     HDD     HDD     SSD     HDD     HDD     

Write the CRUSH map

First modify the device.

# Modify device
device 0 osd.0 class SSD
device 1 osd.1 class HDD
device 2 osd.2 class HDD
device 3 osd.3 class SSD
device 4 osd.4 class HDD
device 5 osd.5 class HDD
device 6 osd.6 class SSD
device 7 osd.7 class HDD
device 8 osd.8 class HDD

Then modify the cluster map to put the SSDs into one group. To keep a physical failure domain, the HDDs are divided into three groups by host.

# buckets
# SSD into a group
host node-SSD {
	id -1		# do not change unnecessarily
	alg straw2
	hash 0	# rjenkins1
	item osd.0 weight 0.010         # Weights can be adjusted according to disk capacity; here they are treated as equal
	item osd.3 weight 0.010
	item osd.6 weight 0.010
}
# HDDs are grouped by host into three buckets
host node-1-HDD {
	id -2		# do not change unnecessarily
	alg straw2
	hash 0	# rjenkins1
	item osd.1 weight 0.1
	item osd.2 weight 0.1
}
host node-2-HDD {
	id -3		# do not change unnecessarily
	alg straw2
	hash 0	# rjenkins1
	item osd.4 weight 0.1
	item osd.5 weight 0.1
}
host node-3-HDD {
	id -4		# do not change unnecessarily
	alg straw2
	hash 0	# rjenkins1
	item osd.7 weight 0.1
	item osd.8 weight 0.1
}

Next, define the entry points. There are two: an SSD root and an HDD root.
Note: there is no essential difference between root and host; both are just bucket types, and a host bucket could also serve as an entry point.

# root bucket
root root-SSD {
	id -5		# do not change unnecessarily
	alg straw2
	hash 0	# rjenkins1
	item node-SSD weight 0.030      # Note that the weight equals the sum of the three item weights in node-SSD
}
root root-HDD {
	id -6		# do not change unnecessarily
	alg straw2
	hash 0	# rjenkins1
	item node-1-HDD weight 0.2      # Note that each weight equals the sum of the two item weights in the corresponding node-x-HDD bucket
	item node-2-HDD weight 0.2 
	item node-3-HDD weight 0.2 
}

Finally the placement rule; note that the rule must emit the SSD choice before the HDD choices.

# rules
rule ssd-primary {
	id 1
	type replicated
	min_size 1
	max_size 10
	step take root-SSD
	step chooseleaf firstn 1 type host
	step emit
	step take root-HDD
	step chooseleaf firstn -1 type host         # -1 means select (replica count - 1) hosts; with 3 replicas, 2 hosts are chosen here
	step emit
}

Test

[root@node-1 ~]# vi mycrushmap.txt
[root@node-1 ~]# crushtool -c mycrushmap.txt -o mycrushmap
[root@node-1 ~]# crushtool -i mycrushmap --test --min-x 0 --max-x 9 --num-rep 3 --ruleset 1 --show_mappings
CRUSH rule 1 x 0 [0,7,1]
CRUSH rule 1 x 1 [3,2,5]
CRUSH rule 1 x 2 [0,8,4]
CRUSH rule 1 x 3 [0,8,4]
CRUSH rule 1 x 4 [3,1,8]
CRUSH rule 1 x 5 [3,7,4]
CRUSH rule 1 x 6 [6,7,4]
CRUSH rule 1 x 7 [6,1,8]
CRUSH rule 1 x 8 [6,2,5]
CRUSH rule 1 x 9 [6,8,1]

The test results show that the primary OSD, that is the first OSD in every mapping, is always an SSD (osd.0, osd.3 or osd.6), so the CRUSH map behaves as intended.
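
For a larger test run it is easier to check this mechanically. Below is a small helper sketch (assuming the --show_mappings output format shown above; the script name and the SSD id set are just for this example):

import re
import sys

SSD_OSDS = {0, 3, 6}   # the OSDs given class SSD in this example map

# Usage: crushtool -i mycrushmap --test --min-x 0 --max-x 9999 --num-rep 3 --ruleset 1 --show_mappings 2>&1 | python3 check_primary.py
bad = 0
for line in sys.stdin:
    m = re.search(r'\[([0-9,]+)\]', line)
    if not m:
        continue                                 # skip lines without a mapping
    primary = int(m.group(1).split(',')[0])      # the first OSD in the mapping is the primary
    if primary not in SSD_OSDS:
        bad += 1
        print('primary not on SSD:', line.strip())
print('mappings with a non-SSD primary:', bad)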

Appendix

The complete decompiled CRUSH map used above is given here for reference.

[root@node-1 ~]# vi mycrushmap.txt
device 0 osd.0 class SSD
device 1 osd.1 class HDD
device 2 osd.2 class HDD
device 3 osd.3 class SSD
device 4 osd.4 class HDD
device 5 osd.5 class HDD
device 6 osd.6 class SSD
device 7 osd.7 class HDD
device 8 osd.8 class HDD

type 0 osd
type 1 host
type 11 root

host node-SSD {
        id -1           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item osd.0 weight 0.010
        item osd.3 weight 0.010
        item osd.6 weight 0.010
}

host node-1-HDD {
        id -2           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 0.1
        item osd.2 weight 0.1
}
host node-2-HDD {
        id -3           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item osd.4 weight 0.1
        item osd.5 weight 0.1
}
host node-3-HDD {
        id -4           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item osd.7 weight 0.1
        item osd.8 weight 0.1
}

root root-SSD {
        id -5           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item node-SSD weight 0.030
}
root root-HDD {
        id -6           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item node-1-HDD weight 0.2
        item node-2-HDD weight 0.2
        item node-3-HDD weight 0.2
}
rule ssd-primary {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take root-SSD
        step chooseleaf firstn 1 type host
        step emit
        step take root-HDD
        step chooseleaf firstn -1 type host     
        step emit
}

Other CRUSH commands

View device map

Print the device map as a tree, in depth-first order.

ceph osd tree

[root@node-1 ~]# ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF 
-1       0.03918 root default                            
-3       0.01959     host node-1                         
 0   hdd 0.00980         osd.0       up  1.00000 1.00000 
 3   hdd 0.00980         osd.3       up  1.00000 1.00000 
-5       0.00980     host node-2                         
 1   hdd 0.00980         osd.1       up  1.00000 1.00000 
-7       0.00980     host node-3                         
 2   hdd 0.00980         osd.2       up  1.00000 1.00000 

View placement rule

Check which rules exist in the CRUSH map and the specific steps of each rule.

# List all rules
ceph osd crush rule ls

# Dump the detailed steps of each rule
ceph osd crush rule dump

[root@node-1 ~]# ceph osd crush rule ls
replicated_rule
[root@node-1 ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

Data rebalancing

Although the CRUSH algorithm is designed to distribute data as evenly as possible, in production this balance is gradually broken as more data is stored and as the cluster's devices change.
To deal with this, Ceph provides a set of tools for redistributing data and bringing the cluster back to a reasonably balanced state.

View cluster space usage

ceph df gives a rough view of the space usage of the cluster's devices and storage pools.
Under RAW STORAGE, USED is smaller than RAW USED: USED is the space actually consumed by user data, while RAW USED is the total consumption including metadata generated by the cluster itself.
Under POOLS, STORED is the usage as seen by the user, while USED is the space actually consumed in the cluster, roughly three times larger: Ceph stores three replicas by default, so each piece of data occupies three times its size.

[root@node-1 ceph-deploy]# ceph df
RAW STORAGE:
    # Columns: CLASS = device class, SIZE = total size, AVAIL = free space, USED = used space, RAW USED = total space used, %RAW USED = percentage of total usage
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED 
    hdd       40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.17 
    TOTAL     40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.17 
 
POOLS:
    # Columns: POOL = pool name, ID, STORED = data volume, OBJECTS = number of objects
    POOL         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
    pool-1        1     376 MiB          95     1.1 GiB      3.30        11 GiB 
    rbd-pool      2      22 MiB          17      67 MiB      0.20        11 GiB 

ceph osd df tree shows the space usage of each device in detail.

VAR = an OSD's space utilization divided by the cluster's average utilization. For example, osd.0 below has %USE 12.76 against a cluster average of 13.17, so its VAR ≈ 12.76 / 13.17 ≈ 0.97.
The "MIN/MAX VAR: 0.88/1.07" in the last line gives the smallest and largest VAR, which tells you at a glance how balanced the whole cluster is.

# Depth-first traversal: prints the details of every bucket and device in the CRUSH map
[root@node-1 ceph-deploy]# ceph osd df tree
          # Columns: WEIGHT, REWEIGHT, SIZE, RAW USE = total usage, DATA, OMAP/META = metadata, AVAIL = remaining space, %USE = proportion used, PGS = PG count
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS TYPE NAME       
-1       0.03918        - 40 GiB 5.3 GiB 1.3 GiB 112 KiB  4.0 GiB  35 GiB 13.17 1.00   -        root default    
-3       0.01959        - 20 GiB 2.4 GiB 449 MiB  32 KiB  2.0 GiB  18 GiB 12.20 0.93   -            host node-1 
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 282 MiB  32 KiB 1024 MiB 8.7 GiB 12.76 0.97  92     up         osd.0   
 3   hdd 0.00980  1.00000 10 GiB 1.2 GiB 167 MiB     0 B    1 GiB 8.8 GiB 11.64 0.88 100     up         osd.3   
-5       0.00980        - 10 GiB 1.4 GiB 424 MiB  48 KiB 1024 MiB 8.6 GiB 14.14 1.07   -            host node-2 
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 424 MiB  48 KiB 1024 MiB 8.6 GiB 14.14 1.07 192     up         osd.1   
-7       0.00980        - 10 GiB 1.4 GiB 424 MiB  32 KiB 1024 MiB 8.6 GiB 14.14 1.07   -            host node-3 
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 424 MiB  32 KiB 1024 MiB 8.6 GiB 14.14 1.07 192     up         osd.2   
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 113 KiB  4.0 GiB  35 GiB 13.17                                 
MIN/MAX VAR: 0.88/1.07  STDDEV: 1.05

# Output only the OSD details
[root@node-1 ~]# ceph osd df 
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS TYPE NAME       
-1       0.03918        - 40 GiB 5.3 GiB 1.3 GiB 96 KiB  4.0 GiB  35 GiB 13.18 1.00   -        root default    
-3       0.01959        - 20 GiB 2.4 GiB 452 MiB 64 KiB  2.0 GiB  18 GiB 12.21 0.93   -            host node-1 
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 283 MiB 48 KiB 1024 MiB 8.7 GiB 12.77 0.97  92     up         osd.0   
 3   hdd 0.00980  1.00000 10 GiB 1.2 GiB 169 MiB 16 KiB 1024 MiB 8.8 GiB 11.65 0.88 100     up         osd.3   
-5       0.00980        - 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07   -            host node-2 
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192     up         osd.1   
-7       0.00980        - 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07   -            host node-3 
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192     up         osd.2   
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB  4.0 GiB  35 GiB 13.18                                 
MIN/MAX VAR: 0.88/1.07  STDDEV: 1.05

In practice, the usable ratio with three replicas is even lower than 33%: some space has to be reserved for the cluster to keep operating normally, so the average usable ratio is only about 23% [Ceph: RADOS Design Principles and Implementation]. Replication is an expensive strategy: you eat one bowl of rice noodles and pay for three.

Cluster fullness thresholds are controlled with mon_osd_full_ratio and mon_osd_nearfull_ratio. mon_osd_full_ratio defaults to 0.95: once an OSD is 95% full, it stops serving I/O (note that both reads and writes are blocked). mon_osd_nearfull_ratio defaults to 0.85: when utilization reaches 85%, a warning is raised.
Note that these settings only take effect when written into the configuration file at the time the cluster is created.

[global]
        mon_osd_full_ratio = .80
        mon_osd_backfillfull_ratio = .75
        mon_osd_nearfull_ratio = .70

These ratios can also be changed at runtime with the ceph osd set-nearfull-ratio and ceph osd set-full-ratio commands.

[root@node-1 ceph-deploy]# ceph osd set-full-ratio 0.98
osd set-full-ratio 0.98
[root@node-1 ceph-deploy]# ceph osd dump | grep full
full_ratio 0.98
backfillfull_ratio 0.9
nearfull_ratio 0.85

[root@node-1 ceph-deploy]# ceph osd set-nearfull-ratio 0.95
osd set-nearfull-ratio 0.95
[root@node-1 ceph-deploy]# ceph osd dump | grep full
full_ratio 0.98
backfillfull_ratio 0.9
nearfull_ratio 0.95
 
# Note: when adjusting these parameters, mind the ordering constraints with the other ratios.
[root@node-1 ~]# ceph osd dump | grep full
full_ratio 0.98
backfillfull_ratio 0.9
nearfull_ratio 0.95
[root@node-1 ~]# ceph health detail
HEALTH_ERR full ratio(s) out of order
OSD_OUT_OF_ORDER_FULL full ratio(s) out of order
    backfillfull_ratio (0.9) < nearfull_ratio (0.95), increased
    osd_failsafe_full_ratio (0.97) < full_ratio (0.98), increased

These commands are usually used to temporarily raise the thresholds after the cluster has become full, so that the OSDs can serve reads and writes again; the cluster is then brought back to a healthy state by adding capacity, deleting data, rebalancing and so on. A related operations article is linked here:

http://strugglesquirrel.com/2019/02/02/ceph%E8%BF%90%E7%BB%B4%E5%A4%A7%E5%AE%9D%E5%89%91%E4%B9%8B%E9%9B%86%E7%BE%A4osd%E4%B8%BAfull/

reweight

In theory CRUSH balances data well, but real production environments are complex and changeable, so the actual balance is often unsatisfactory. reweight is a compensation mechanism designed specifically for CRUSH: after straw2 has picked an OSD, an overload test is applied, and the OSD is only truly selected if it passes the test. The larger the reweight, the higher the probability of passing; the maximum value is 1, which is also the default. Changing a reweight also triggers recalculation for PGs that are already mapped: if a PG's mapping turns out to have changed, its data is migrated to the new OSDs, which is how data is rebalanced while the cluster is running.
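
The overload test itself is simple. Below is a rough sketch of the idea (a simplification, not Ceph's exact code: internally the reweight is a fixed-point value where 0x10000 means 1.0, and the real test hashes the input x and the OSD id with rjenkins; Python's hash stands in here):

# Sketch of the overload test: an OSD chosen by straw2 is kept only if it
# passes this test, otherwise CRUSH retries with another candidate.
def passes_overload_test(reweight: float, x: int, osd_id: int) -> bool:
    w = int(reweight * 0x10000)       # reweight as fixed point, 0x10000 == 1.0
    if w >= 0x10000:                  # reweight 1.0 always passes
        return True
    if w == 0:                        # reweight 0 never passes
        return False
    h = hash((x, osd_id)) & 0xffff    # stand-in for crush_hash32_2(rjenkins1, x, osd)
    return h < w                      # passes with probability roughly equal to the reweight

print(passes_overload_test(0.5, x=0x7f, osd_id=3))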

Another benefit of the overload test is mentioned in Ceph: RADOS Design Principles and Implementation: it lets you distinguish an OSD that has failed temporarily from one that has been permanently removed. For a temporary failure, setting the OSD's reweight to 0 makes it fail the overload test, so it is excluded from the candidates and the data it carried moves to other OSDs; when the OSD comes back and its reweight is restored to the original value, the data that originally belonged to it moves back. Deleting the OSD instead changes its unique number in the cluster map, so it may end up carrying different data and the amount of data migration is larger.

You can adjust the reweight of an individual OSD with the following command. Internally the value is stored as a fixed-point integer scaled by 0x10000 (65536); the number shown in parentheses in the command output is this internal value in hexadecimal (so 1.0 is shown as 10000 and 0.1 as 1999).

ceph osd reweight <osd_num|osd id> <reweight[0.0-1.0]>

# By adjusting the reweight of osd.3, all its PGs are migrated to other OSDs.
[root@node-1 ~]# ceph osd reweight osd.3 0
reweighted osd.3 to 0 (0)
[root@node-1 ~]# ceph osd df 
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS 
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 319 MiB 48 KiB 1024 MiB 8.7 GiB 13.12 0.95 192     up 
 3   hdd 0.00980        0    0 B     0 B     0 B    0 B      0 B     0 B     0    0   0     up 
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.03 192     up 
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.03 192     up 
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB  4.0 GiB  35 GiB 13.81                 
MIN/MAX VAR: 0.95/1.03  STDDEV: 0.49

# Restoring osd.3's reweight to its original value migrates the PGs back
[root@node-1 ~]# ceph osd reweight osd.3 1
reweighted osd.3 to 1 (10000)
[root@node-1 ~]# ceph osd df 
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS 
 0   hdd 0.00980  1.00000 10 GiB 1.3 GiB 283 MiB 48 KiB 1024 MiB 8.7 GiB 12.80 0.97  92     up 
 3   hdd 0.00980  1.00000 10 GiB 1.2 GiB 169 MiB 16 KiB 1024 MiB 8.8 GiB 11.66 0.88 100     up 
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192     up 
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 425 MiB 16 KiB 1024 MiB 8.6 GiB 14.16 1.07 192     up 
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB  4.0 GiB  35 GiB 13.19                 
MIN/MAX VAR: 0.88/1.07  STDDEV: 1.05

Reweights can also be adjusted in batch. Two modes are currently available: by the current space utilization of each OSD, or by the distribution of PGs across OSDs. Prefixing either command with "test-" performs a dry run that only reports what would be changed.

# Parameter description
# overload: an OSD's reweight is adjusted only if its space utilization is at least overload/100 times the cluster's average utilization. Range > 100, default 120
# max_change: the maximum change applied to a reweight in one run. Range [0, 1], default 0.05
# max_osds: the maximum number of OSDs adjusted in one run. Default 4
# --no-increasing: never raise a reweight, only lower it. Remember that reweight is in [0, 1]
ceph osd <reweight-by-utilization|reweight-by-pg|test-reweight-by-utilization|test-reweight-by-pg> {overload} {max_change} {max_osds} {--no-increasing}

[root@node-1 ~]# ceph osd test-reweight-by-utilization 101
no change
moved 22 / 576 (3.81944%)
avg 144
stddev 48.0833 -> 42.9535 (expected baseline 10.3923)
min osd.0 with 92 -> 92 pgs (0.638889 -> 0.638889 * mean)
max osd.1 with 192 -> 182 pgs (1.33333 -> 1.26389 * mean)

oload 101
max_change 0.05
max_change_osds 4
average_utilization 0.1319
overload_utilization 0.1333
osd.2 weight 1.0000 -> 0.9500
osd.1 weight 1.0000 -> 0.9500


[root@node-1 ~]# ceph osd test-reweight-by-pg 101
no change
moved 22 / 576 (3.81944%)
avg 144
stddev 48.0833 -> 42.9535 (expected baseline 10.3923)
min osd.0 with 92 -> 92 pgs (0.638889 -> 0.638889 * mean)
max osd.1 with 192 -> 182 pgs (1.33333 -> 1.26389 * mean)

oload 101
max_change 0.05
max_change_osds 4
average_utilization 14699.6636
overload_utilization 14846.6602
osd.2 weight 1.0000 -> 0.9500
osd.1 weight 1.0000 -> 0.9500
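
The selection logic behind these dry runs can be sketched roughly as follows (a simplification, not Ceph's exact code; with the utilization figures from the ceph osd df output above and overload=101 it proposes the same osd.1/osd.2 changes shown in the output):

def propose_reweights(utilization, reweight, overload=120, max_change=0.05, max_osds=4):
    # utilization: {osd: fraction of space used}, reweight: {osd: current reweight 0..1}
    avg = sum(utilization.values()) / len(utilization)
    threshold = avg * overload / 100.0            # only OSDs above avg * overload/100 are touched
    proposals = {}
    # Consider the most over-utilized OSDs first, at most max_osds of them.
    for osd, util in sorted(utilization.items(), key=lambda kv: kv[1], reverse=True):
        if len(proposals) >= max_osds:
            break
        if util < threshold:
            continue
        # Scale the reweight toward the average, but never move it by more than max_change.
        target = reweight[osd] * avg / util
        proposals[osd] = round(max(target, reweight[osd] - max_change), 4)
    return proposals

print(propose_reweights(
    {"osd.0": 0.1277, "osd.3": 0.1165, "osd.1": 0.1416, "osd.2": 0.1416},
    {"osd.0": 1.0, "osd.3": 1.0, "osd.1": 1.0, "osd.2": 1.0},
    overload=101))                                # -> {'osd.1': 0.95, 'osd.2': 0.95}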

Note that if an OSD's reweight is already 1 it cannot be raised any further; the only way to direct more data to it is to lower the reweight of every other OSD, which is inconvenient and causes frequent migration while the adjustments are made. It is therefore worth setting reweights slightly below 1 when the cluster is first built, to leave some headroom for later operations.

weight set

A key input to the straw2 algorithm is the weight of each candidate. Every CRUSH calculation produces an ordered set of OSDs (three OSDs if the pool has three replicas), and the OSDs are chosen position by position. It can therefore be useful for an OSD to have a different selection probability depending on the position being filled; a weight set lets each OSD carry a separate weight for each position.

Two modes are supported:

  1. Compatibility mode: like the original weight, a single number per OSD representing its own weight.
  2. Per-pool (non-compat) mode: the weight set is bound to a specific storage pool, and a weight can be set for each replica position.
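
Before the commands, here is a toy illustration of what "one weight per position" means. This is NOT the straw2 algorithm, just a weighted draw without replacement, and the per-position weights are made-up values:

import random

# Hypothetical per-position weights: the same OSD can be more or less likely
# to be picked depending on which replica position is being filled.
weight_set = {
    "osd.0": [0.3, 0.4, 0.5],   # favoured for later positions
    "osd.1": [0.4, 0.4, 0.4],   # same weight in every position
    "osd.2": [0.3, 0.2, 0.1],   # favoured as primary, avoided as a later replica
}

def place(num_rep=3):
    # Pick one OSD per replica position, proportional to its weight in that position.
    chosen = []
    for pos in range(num_rep):
        names = [o for o in weight_set if o not in chosen]
        weights = [weight_set[o][pos] for o in names]
        chosen.append(random.choices(names, weights=weights, k=1)[0])
    return chosen

print(place())   # e.g. ['osd.2', 'osd.1', 'osd.0']
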
# Create a weight set in compatibility mode
ceph osd crush weight-set create-compat

# Adjust the weight of each OSD separately
ceph osd crush weight-set reweight-compat {name} {weight}

# Delete weight set for compatibility mode
ceph osd crush weight-set rm-compat

# Create a weight set in per-pool (non-compat) mode
# flat: same effect as compatibility mode, a single weight per OSD
# positional: a set of weights per OSD, one for each replica position
ceph osd crush weight-set create <poolname> flat|positional

# delete
ceph osd crush weight-set rm <poolname>

[root@node-1 ~]# ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous
[root@node-1 ~]# ceph osd crush weight-set create pool-1 positional
[root@node-1 ~]# ceph osd crush weight-set reweight pool-1 osd.0 0.3 0.4 0.5
[root@node-1 ~]# ceph osd crush dump
...
			{
                "bucket_id": -4,
                "weight_set": [
                    [
                        0.29998779296875,
                        0.009796142578125
                    ],
                    [
                        0.399993896484375,
                        0.009796142578125
                    ],
                    [
                        0.5,
                        0.009796142578125
                    ]
                ]
            },

...

In the dump above, the three inner arrays correspond to the three replica positions. In each array, the first entry is osd.0's weight for that position (the 0.3, 0.4, 0.5 just set), and the second entry is the other OSD in the same bucket, which keeps its default weight.

upmap

The reweight and weight set mechanisms described above can only raise or lower the probability that an OSD is selected; upmap lets you specify the output OSDs directly. By default the up set is simply the result calculated by CRUSH, and pg-upmap entries override it.

There are two ways to override a mapping:

  1. Directly specify the mapping result of a PG

    ceph osd pg-upmap <pgid> <osdname (id|osd.id)> [<osdname (id|osd.id)>...]
    
    # View pg mapping
    [root@node-1 ~]# ceph pg dump | awk '{print $1, $17}'
    ...
    PG_STAT UP_PRIMARY
    1.7f [2,3,1]
    1.7e [1,0,2]
    1.7d [3,2,1]
    1.7c [0,2,1]
    1.7b [1,0,2]
    ...
    
    # The number of copies set must be greater than or equal to pool min size
    [root@node-1 ~]# ceph osd pg-upmap 1.7f 1
    Error EINVAL: num of osds (1) < pool min size (2)
    [root@node-1 ~]# ceph osd pg-upmap 1.7f 0 0 0
    osd.0 already exists, osd.0 already exists, set 1.7f pg_upmap mapping to [0]
    [root@node-1 ~]# ceph pg dump | awk '{print $1,$17}' | grep 1.7f
    dumped all
    1.7f [0]
    
  2. To replace an OSD in the PG calculation result, you need to specify both the source OSD and the target OSD

    ceph osd pg-upmap-items <pgid> <osdname (id|osd.id)> [<osdname (id|osd.id)>...]
    
    # In practice it was found that the primary OSD cannot be replaced this way, and the upmap is automatically reverted to the original mapping.
    [root@node-1 ~]# ceph osd pg-upmap-items 1.7f  3 0
    set 1.7f pg_upmap_items mapping to [3->0]
    [root@node-1 ~]# ceph pg dump | awk '{print $1,$17}' | grep 1.7f
    dumped all
    1.7f [2,0,1]
    # Checking again a little later, the mapping has reverted to the original set; the balancer adjusted it back automatically
    [root@node-1 ~]# ceph pg dump | awk '{print $1,$17}' | grep 1.7f
    dumped all
    1.7f [2,3,1]
    

Delete upmap

ceph osd rm-pg-upmap <pgid>
ceph osd rm-pg-upmap-items <pgid>

balancer

The balancer is Ceph's automatic rebalancing module; it builds on the mechanisms described above (primarily the crush-compat weight set and upmap).

Check status

ceph balancer status

# Field description
# last_optimize_duration: duration of the last optimization run
# plans: queued optimization plans
# mode: the mechanism used when executing plans; crush-compat, upmap and none are currently supported
# active: whether the balancer is enabled
# optimize_result: result of the last optimization
# last_optimize_started: when the last optimization started
[root@node-1 ~]# ceph balancer status
{
    "last_optimize_duration": "0:00:00.000220", 
    "plans": [], 
    "mode": "none", 
    "active": true, 
    "optimize_result": "Please do \"ceph balancer mode\" to choose a valid mode first", 
    "last_optimize_started": "Wed Jun 23 15:58:21 2021"
}

On | off

ceph balancer on
ceph balancer off

The target_max_misplaced_ratio parameter limits the fraction of PGs that may be moved during each balancing step.

ceph config set mgr target_max_misplaced_ratio .07   # 7%

Set the interval at which the balancer runs automatically; the default is 60 s. Note that the slashes in the option name must not be omitted, and the name must not be split apart.

ceph config set mgr mgr/balancer/sleep_interval 60

Set the times of day between which the balancer may run, to avoid business peak hours. By default it runs all day; the time format is HHMM.

ceph config set mgr mgr/balancer/begin_time 0000
ceph config set mgr mgr/balancer/end_time 2400

Set the days of the week on which the balancer may run. The default is the whole week. Note that 0 or 7 means Sunday, 1 means Monday, and so on.

ceph config set mgr mgr/balancer/begin_weekday 0
ceph config set mgr mgr/balancer/end_weekday 7

Restrict the balancer to specific storage pools, by pool id.

ceph config set mgr mgr/balancer/pool_ids 1,2,3

Adjust mode

ceph balancer mode crush-compat
ceph balancer mode upmap
ceph balancer mode none       # This effectively disables the balancer

Generate a plan. Note: when driving the balancer manually, it performs no operations without a plan.

ceph balancer optimize <plan> {<pools> [<pools>...]}

Execute a plan

ceph balancer execute <plan-name>

List all plans

ceph balancer ls

Delete plan

ceph balancer rm <plan>

Evaluate how balanced the current cluster (or a pool) is; the lower the score, the better.

ceph balancer eval {pool}

View assessment details

ceph balancer eval-verbose {pool}

Evaluate a plan

ceph balancer eval <plan-name>

How to create a balancer plan for a cluster

Normally there is no need to create a plan manually: once the balancer is enabled and a mode is chosen, it performs the optimization automatically at regular intervals. The manual workflow looks like this:

# Close the balancer first
[root@node-1 ~]# ceph balancer off

# Select a mode for the balancer
[root@node-1 ~]# ceph balancer mode upmap

# The test cluster is already well balanced, so the balancer says it cannot optimize any further
[root@node-1 ~]# ceph balancer optimize myplan
Error EALREADY: Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect

# Manually adjust a reweight to create an imbalanced cluster. Note that reweight 0 means the OSD is taken out of placement entirely, in which case the remaining cluster would still be perfectly balanced.
[root@node-1 ~]# ceph osd reweight osd.3 0.1
reweighted osd.3 to 0.1 (1999)
[root@node-1 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS 
 0   hdd 0.00980  1.00000 10 GiB 1.4 GiB 427 MiB 48 KiB 1024 MiB 8.6 GiB 14.18 1.07 179     up 
 3   hdd 0.00980  0.09999 10 GiB 1.0 GiB  33 MiB 16 KiB 1024 MiB 9.0 GiB 10.35 0.78  13     up 
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 431 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192     up 
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 431 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192     up 
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB  4.0 GiB  35 GiB 13.24                 
MIN/MAX VAR: 0.78/1.07  STDDEV: 1.08

# Create a new plan
[root@node-1 ~]# ceph balancer optimize myplan

# Evaluate the plan
[root@node-1 ~]# ceph balancer eval myplan
plan myplan final score 0.017154 (lower is better)

# Check the cluster's current score; the plan improves it slightly, so we go ahead and execute it
[root@node-1 ~]# ceph balancer eval 
current cluster score 0.017850 (lower is better)

# Execute the plan
[root@node-1 ~]# ceph balancer execute myplan
[root@node-1 ~]# ceph balancer eval 
current cluster score 0.017154 (lower is better)
[root@node-1 ~]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA    OMAP   META     AVAIL   %USE  VAR  PGS STATUS 
 0   hdd 0.00980  1.00000 10 GiB 1.4 GiB 408 MiB 48 KiB 1024 MiB 8.6 GiB 13.99 1.06 180     up 
 3   hdd 0.00980  0.09999 10 GiB 1.1 GiB  57 MiB 16 KiB 1024 MiB 8.9 GiB 10.59 0.80  12     up 
 1   hdd 0.00980  1.00000 10 GiB 1.4 GiB 432 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192     up 
 2   hdd 0.00980  1.00000 10 GiB 1.4 GiB 432 MiB 16 KiB 1024 MiB 8.6 GiB 14.22 1.07 192     up 
                    TOTAL 40 GiB 5.3 GiB 1.3 GiB 97 KiB  4.0 GiB  35 GiB 13.25                 
MIN/MAX VAR: 0.80/1.07  STDDEV: 1.00

PG-related commands

Get PG details

[root@node-1 ~]# ceph pg dump_json sum
dumped all
version 406
stamp 2021-06-24 09:24:46.705278
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES    OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE        STATE_STAMP                VERSION REPORTED UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN 
1.7f          0                  0        0         0       0        0           0          0   0        0 active+clean 2021-06-24 09:11:24.543912     0'0  275:687 [2,1,0]          2 [2,1,0]              2        0'0 2021-06-23 14:51:08.406526             0'0 2021-06-22 09:31:39.906375             0 
1.7e          1                  0        0         0       0  4194304           0          0   2        2 active+clean 2021-06-24 09:11:24.318962    50'2  275:606 [1,0,2]          1 [1,0,2]              1       50'2 2021-06-23 14:37:05.824343            50'2 2021-06-23 14:37:05.824343             0 
1.7d          0                  0        0         0       0        0           0          0   0        0 active+clean 2021-06-24 09:11:22.895867     0'0   275:36 [0,2,1]          0 [0,2,1]              0        0'0 2021-06-23 12:21:07.406368             0'0 2021-06-22 09:31:09.128962             0 
...

Query the total number of PGs on each OSD

[root@node-1 ~]# ceph pg dump osds
dumped osds
OSD_STAT USED    AVAIL   USED_RAW TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM 
3         59 MiB 8.9 GiB  1.1 GiB 10 GiB  [0,1,2]     12              4 
2        433 MiB 8.6 GiB  1.4 GiB 10 GiB  [0,1,3]    192             75 
1        433 MiB 8.6 GiB  1.4 GiB 10 GiB  [0,2,3]    192             62 
0        409 MiB 8.6 GiB  1.4 GiB 10 GiB  [1,2,3]    180             51 
sum      1.3 GiB  35 GiB  5.3 GiB 40 GiB   

Query the details of the PGs on a specified OSD

[root@node-1 ~]# ceph pg ls-by-osd 0
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES    OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP        ACTING    SCRUB_STAMP                DEEP_SCRUB_STAMP           
1.1        0        0         0       0        0           0          0   0 active+clean   19m     0'0  275:573 [2,0,1]p2 [2,0,1]p2 2021-06-23 14:37:36.380990 2021-06-22 09:28:39.643688 
1.2        1        0         0       0  4194304           0          0   2 active+clean   19m    50'2  275:606 [1,0,2]p1 [1,0,2]p1 2021-06-23 14:44:23.268353 2021-06-23 14:44:23.268353 
1.3        0        0         0       0        0           0          0   0 active+clean   19m     0'0  275:472 [1,2,0]p1 [1,2,0]p1 2021-06-23 10:20:09.588889 2021-06-23 10:20:09.588889 
...

Query the details of the PGs in a specified pool

[root@node-1 ~]# ceph pg ls-by-pool pool-1
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES    OMAP_BYTES* OMAP_KEYS* LOG STATE        SINCE VERSION REPORTED UP        ACTING    SCRUB_STAMP                DEEP_SCRUB_STAMP           
1.0        1        0         0       0  4194304           0          0   2 active+clean   22m    50'2  275:715 [1,2,3]p1 [1,2,3]p1 2021-06-23 12:56:25.914554 2021-06-22 09:29:38.155739 
1.1        0        0         0       0        0           0          0   0 active+clean   22m     0'0  275:573 [2,0,1]p2 [2,0,1]p2 2021-06-23 14:37:36.380990 2021-06-22 09:28:39.643688 
1.2        1        0         0       0  4194304           0          0   2 active+clean   22m    50'2  275:606 [1,0,2]p1 [1,0,2]p1 2021-06-23 14:44:23.268353 2021-06-23 14:44:23.268353 
1.3        0        0         0       0        0           0          0   0 active+clean   22m     0'0  275:472 [1,2,0]p1 [1,2,0]p1 2021-06-23 10:20:09.588889 2021-06-23 10:20:09.588889 
...

Query the number of objects in each PG

[root@node-1 ~]# ceph pg dump | awk '{print $1, $2}'
dumped all
version 760
stamp 2021-06-24
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS
1.7f 0
1.7e 1
1.7d 0
1.7c 1
1.7b 2
1.7a 1
1.79 1
1.78 2
1.77 1

Query the number of objects on the specified OSD

[root@node-1 ~]# ceph pg ls-by-osd 0 | awk 'BEGIN{sum = 0;}{sum+=$2}END {print "objects: " sum}'
objects: 106

Query the number of objects on the specified pool

Note: STORED and OBJECTS are per-object (logical) figures, while USED includes all replica copies and is therefore roughly three times STORED.

[root@node-1 ~]# ceph df
RAW STORAGE:
    CLASS     SIZE       AVAIL      USED        RAW USED     %RAW USED 
    hdd       40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.26 
    TOTAL     40 GiB     35 GiB     1.3 GiB      5.3 GiB         13.26 
 
POOLS:
    POOL         ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
    pool-1        1     376 MiB          95     1.1 GiB      3.30        11 GiB 
    rbd-pool      2      22 MiB          17      67 MiB      0.20        11 GiB 

Appendix

 Monitor commands: 
 =================
pg cancel-force-backfill <pgid> [<pgid>...]             restore normal backfill priority of <pgid>
pg cancel-force-recovery <pgid> [<pgid>...]             restore normal recovery priority of <pgid>
pg debug unfound_objects_exist|degraded_pgs_exist       show debug info about pgs
pg deep-scrub <pgid>                                    start deep-scrub on <pgid>
pg dump {all|summary|sum|delta|pools|osds|pgs|pgs_      show human-readable versions of pg map (only 
 brief [all|summary|sum|delta|pools|osds|pgs|pgs_brief.  valid with plain)
 ..]}                                                   
pg dump_json {all|summary|sum|pools|osds|pgs [all|      show human-readable version of pg map in json only
 summary|sum|pools|osds|pgs...]}                        
pg dump_pools_json                                      show pg pools info in json only
pg dump_stuck {inactive|unclean|stale|undersized|       show information about stuck pgs
 degraded [inactive|unclean|stale|undersized|degraded.. 
 .]} {<int>}                                            
pg force-backfill <pgid> [<pgid>...]                    force backfill of <pgid> first
pg force-recovery <pgid> [<pgid>...]                    force recovery of <pgid> first
pg getmap                                               get binary pg map to -o/stdout
pg ls {<int>} {<states> [<states>...]}                  list pg with specific pool, osd, state
pg ls-by-osd <osdname (id|osd.id)> {<int>} {<states>    list pg on osd [osd]
 [<states>...]}                                         
pg ls-by-pool <poolstr> {<states> [<states>...]}        list pg with pool = [poolname]
pg ls-by-primary <osdname (id|osd.id)> {<int>}          list pg with primary = [osd]
 {<states> [<states>...]}                               
pg map <pgid>                                           show mapping of pg to osds
pg repair <pgid>                                        start repair on <pgid>
pg repeer <pgid>                                        force a PG to repeer
pg scrub <pgid>                                         start scrub on <pgid>
pg stat                                                 show placement group status.
