[technique] Analysis of a consistency error detected by ceph

François Laperruque flaperruque at viviers-fibre.net
Sun 13 Aug 17:32:16 CEST 2017


Hello everyone,

I now understand why the mailing list is named "technique" ;-)

Thanks Laurent for these explanations...

On 13/08/2017 11:03, Laurent GUERBY via technique wrote:
> Hello,
> 
> In the tetaneutral.net ceph cluster we use the three-replica
> mode. We leave the "deep-scrub" mode running, which periodically
> verifies the consistency of the data against its initially stored
> hash; this takes about 8 to 10 days to go through the 53 TB on
> the current cluster.
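> 
> In essence, a deep scrub re-reads each replica and compares its checksum
> against the digest recorded in the object info. A toy sketch of the idea in
> Python (using zlib.crc32 as a stand-in for Ceph's crc32c, with made-up data;
> not the actual ceph code path):
> 
> import zlib
> 
> def deep_scrub(shards, expected_digest):
>     """shards: dict osd_id -> bytes. Return the OSDs whose data disagrees."""
>     return [osd for osd, data in shards.items()
>             if zlib.crc32(data) != expected_digest]
> 
> good = b"x" * 4194304                    # a 4 MB rbd object, made-up content
> bad = good[:1000] + b"y" + good[1001:]   # same size, one corrupted byte
> print(deep_scrub({9: good, 47: good, 37: bad}, zlib.crc32(good)))  # -> [37]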
> 
> This morning the ceph cluster detected an inconsistency:
> 
> root at g1:~# ceph -s
>     cluster 1fe74663-8dfa-486c-bb80-3bd94c90c967
>      health HEALTH_ERR
>             1 pgs inconsistent <===== here
>             2 scrub errors
>      monmap e8: 3 mons at {g1=192.168.99.251:6789/0,g2=192.168.99.252:6789/0,g3=192.168.99.253:6789/0}
>             election epoch 1630, quorum 0,1,2 g1,g2,g3
>      osdmap e269110: 40 osds: 40 up, 40 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v87923221: 2816 pgs, 2 pools, 17745 GB data, 4725 kobjects
>             53921 GB used, 16910 GB / 70832 GB avail
>                 2813 active+clean
>                    2 active+clean+scrubbing+deep
>                    1 active+clean+inconsistent
>   client io 63867 kB/s rd, 6571 kB/s wr, 1008 op/s rd, 552 op/s wr
> 
> We ask which pg is "inconsistent":
> 
> root at g1:~# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
> pg 58.22d is active+clean+inconsistent, acting [9,47,37]
> 2 scrub errors
> 
> We ask for the inconsistent object and its metadata:
> 
> root at g1:~# rados list-inconsistent-obj 58.22d  --format=json-pretty
> {
>     "epoch": 269000,
>     "inconsistents": [
>         {
>             "object": {
>                 "name": "rbd_data.11f20f75aac8266.00000000000f79f9",
>                 "nspace": "",
>                 "locator": "",
>                 "snap": "head",
>                 "version": 9894452
>             },
>             "errors": [
>                 "data_digest_mismatch" 
>             ],
>             "union_shard_errors": [
>                 "data_digest_mismatch_oi" 
>             ],
>             "selected_object_info": "58:b453643a:::rbd_data.11f20f75aac8266.00000000000f79f9:head(261163'9281748 osd.9.0:6221608 dirty|data_digest|omap_digest s 4194304 uv 9894452 dd 2193d055 od ffffffff alloc_hint [0 0])",
>             "shards": [
>                 {
>                     "osd": 9,
>                     "errors": [],
>                     "size": 4194304,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x2193d055" <== majority version
>                 },
>                 {
>                     "osd": 37,
>                     "errors": [
>                         "data_digest_mismatch_oi" 
>                     ],
>                     "size": 4194304,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x05891fb4" <== minority version
>                 },
>                 {
>                     "osd": 47,
>                     "errors": [],
>                     "size": 4194304,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x2193d055" <== majority version
>                 }
>             ]
>         }
>     ]
> }
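> 
> A minimal sketch of how to pull the faulty shard out of that report in
> Python (assuming the JSON above has been saved to a file named pg.json,
> a name chosen here just for the example):
> 
> import json
> 
> with open("pg.json") as f:               # dump of rados list-inconsistent-obj
>     report = json.load(f)
> 
> for item in report["inconsistents"]:
>     name = item["object"]["name"]
>     for shard in item["shards"]:
>         if shard["errors"]:              # e.g. ["data_digest_mismatch_oi"]
>             print("%s: osd.%d has digest %s, disagreeing with the object info"
>                   % (name, shard["osd"], shard["data_digest"]))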
> 
> We ask where this object is stored:
> 
> root at g1:~# ceph osd map disks rbd_data.11f20f75aac8266.00000000000f79f9
> osdmap e269110 pool 'disks' (58) object 'rbd_data.11f20f75aac8266.00000000000f79f9' -> pg 58.5c26ca2d (58.22d) -> up ([9,47,37], p9) acting ([9,47,37], p9)
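> 
> Aside: the pg id 58.22d can be recovered from the hash 5c26ca2d by masking
> with pg_num - 1 (this assumes pg_num = 2048 for the 'disks' pool; for a
> power-of-two pg_num, ceph's stable-mod reduces to a plain mask):
> 
> >>> hex(0x5c26ca2d & (2048 - 1))
> '0x22d'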
> 
> We go take a look on the OSDs:
> 
> root at g8:/var/lib/ceph/osd/ceph-9/current/58.22d_head# find . -name '*11f20f75aac8266.00000000000f79f9*'
> ./DIR_D/DIR_2/DIR_A/DIR_C/rbd\udata.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a
> 
> root at g10:/var/lib/ceph/osd/ceph-37/current/58.22d_head# find . -name '*11f20f75aac8266.00000000000f79f9*'
> ./DIR_D/DIR_2/DIR_A/DIR_C/rbd\udata.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a
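> 
> (If I read the FileStore layout correctly, the path and file name encode the
> placement: the DIR_D/DIR_2/DIR_A/DIR_C levels are the hash 5C26CA2D taken
> from its last hex digit backwards, and the __3a suffix is pool 58 in hex.)
> 
> >>> "5C26CA2D"[::-1][:4]
> 'D2AC'
> >>> hex(58)
> '0x3a'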
> 
> We retrieve the two objects:
> 
> $ scp g8:/var/lib/ceph/osd/ceph-9/current/58.22d_head/DIR_D/DIR_2/DIR_A/DIR_C/rbd*data.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a g8data
> $ scp g10:/var/lib/ceph/osd/ceph-37/current/58.22d_head/DIR_D/DIR_2/DIR_A/DIR_C/rbd*data.11f20f75aac8266.00000000000f79f9__head_5C26CA2D__3a g10data
> $ md5sum *
> bd85c0ef1f30829ce07e5f9152ac2d2f  g10data
> 4297d0bc373e6603e0ad842702e0ecaa  g8data
> 
> They are indeed different; let's check the detail of the differences:
> 
> $ diff -u <(od -x g10data) <(od -x g8data)
> --- /dev/fd/63    2017-08-13 10:43:52.837097740 +0200
> +++ /dev/fd/62    2017-08-13 10:43:52.833097808 +0200
> @@ -2617,7 +2617,7 @@
>  0121600 439b 14f4 bb4c 5f14 6ff7 4393 9ff8 a9a9
>  0121620 29a8 56a4 1133 b6a8 2206 4821 2f42 4b2c
>  0121640 3d86 41a2 785f 9785 8b48 4243 e7b9 f0aa
> -0121660 29b6 be0c 0455 bf97 1c0d 49e5 75dd e1ed
> +0121660 29a6 be0c 0455 bf97 1c0d 49e5 75dd e1ed
>  0121700 2519 d6ac 1047 1111 0344 38be 27a1 db07
>  0121720 dff6 c002 75d8 4396 6154 eba9 3abd 5d20
>  0121740 8ae4 e63a 298b d754 0208 9705 1bb8 3685
> 
> So a single bit flip: 29b6 vs 29a6
> 
> >>> bin(0xa)
> '0b1010'
> >>> bin(0xb)
> '0b1011'
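> 
> XORing the two words confirms that exactly one bit differs:
> 
> >>> bin(0x29b6 ^ 0x29a6)
> '0b10000'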
> 
> The commands are archived here:
> 
> https://chiliproject.tetaneutral.net/projects/tetaneutral/wiki/Management_Cluster_Ceph#inconsistent-pg
> 
> To repair:
> 
> root at g1:~# ceph pg repair 58.22d
> instructing pg 58.22d on osd.9 to repair
> 
> ceph has an automatic repair mode that we are not enabling for the
> moment; we have a repair to run every few days.
> 
> This is where the triple replica comes in handy :).
> 
> The disk behind OSD 37 is a WDC WD30EFRX-68EUZN0 3 TB; smartctl
> says it has no reallocated sectors. The datasheet:
> https://www.wdc.com/content/dam/wdc/website/downloadable_assets/eng/spec_data_sheet/2879-800002.pdf
> quotes "Non-recoverable read errors per bits read" "<1 in 10^14"
> 
> Over a single read of our 53 TB that comes to roughly 4 bits read
> with a non-recoverable error.
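> 
> The back-of-the-envelope figure (taking TB as 10^12 bytes):
> 
> >>> 53e12 * 8 / 1e14
> 4.24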
> 
> Sincerely,
> 
> Laurent
> 
> _______________________________________________
> technique mailing list
> technique at lists.tetaneutral.net
> http://lists.tetaneutral.net/listinfo/technique

-- 
Community ISP [Viviers Fibre]
http://viviers-fibre.net
https://yunohost.viviers-fibre.net


