The Ceph Rados Gateway lets you access Ceph via the Swift and S3 APIs. It translates those APIs into librados requests. Librados is a wonderful object store but wasn’t designed to list objects efficiently. The Rados Gateway maintains its own indexes to speed up listing responses and to hold some additional metadata. There isn’t a lot of documentation on how these indexes work, so I’ve written this blog post to shed some light on that.
First, let’s examine an existing bucket
# radosgw-admin bucket stats --bucket=mybucket
{
    "bucket": "mybucket",
    "pool": ".rgw.buckets",
    "index_pool": ".rgw.buckets.index",
    "id": "default.14113.1",
    "marker": "default.14113.1",
    "owner": "testuser",
    "ver": "0#3",
    "master_ver": "0#0",
    "mtime": "2016-01-29 04:21:47.000000",
    "max_marker": "0#",
    "usage": {
        "rgw.main": {
            "size_kb": 1,
            "size_kb_actual": 4,
            "num_objects": 1
        }
    },
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}
The list of objects in this bucket is stored in a separate rados object. The name of that object is the bucket id with .dir. prepended to it. The index objects are kept in a separate pool called .rgw.buckets.index. So in this case the bucket index for mybucket should be .dir.default.14113.1.
Let’s find the bucket index
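The naming rule above can be sketched in a few lines of Python. This is only an illustration: `index_object_name` is a hypothetical helper, the JSON is a trimmed copy of the bucket stats output, and it assumes a single (unsharded) index object per bucket, as in this cluster.

```python
import json

# Trimmed copy of the `radosgw-admin bucket stats` output for mybucket.
BUCKET_STATS = """
{
    "bucket": "mybucket",
    "pool": ".rgw.buckets",
    "index_pool": ".rgw.buckets.index",
    "id": "default.14113.1"
}
"""


def index_object_name(stats_json):
    """Return (index pool, index object name) for a bucket stats document.

    Hypothetical helper: it simply prepends ".dir." to the bucket id.
    """
    stats = json.loads(stats_json)
    return stats["index_pool"], ".dir." + stats["id"]


pool, name = index_object_name(BUCKET_STATS)
print(pool, name)
```

The pool and object name it prints are the two arguments the rados commands in the rest of this post need.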
# rados -p .rgw.buckets.index ls - | grep "default.14113.1"
.dir.default.14113.1
So here you see the index object was returned in the .rgw.buckets.index pool.
Now let’s look at what’s inside the index object
# rados -p .rgw.buckets.index get .dir.default.14113.1 indexfile
# wc -c indexfile
0 indexfile
So the object is 0 bytes … hmm … The secret here is that the index information is actually kept in Ceph’s key/value store (the object’s omap). Each OSD has a colocated LevelDB key/value store, so the object is really just acting as a placeholder that lets Ceph find which OSD’s key/value store contains the index.
Let’s look at the contents of the key/value store
First let’s look at the key
# rados -p .rgw.buckets.index listomapkeys .dir.default.14113.1
myobject
So the key is just the name of the object (makes sense).
Now let’s see the value
# rados -p .rgw.buckets.index listomapvals .dir.default.14113.1
myobject
value: (175 bytes) :
0000 : 08 03 a9 00 00 00 08 00 00 00 6d 79 6f 62 6a 65 : ..........myobje
0010 : 63 74 01 00 00 00 00 00 00 00 01 04 03 5b 00 00 : ct...........[..
0020 : 00 01 d6 00 00 00 00 00 00 00 eb e9 aa 56 00 00 : .............V..
0030 : 00 00 20 00 00 00 61 34 61 38 64 30 65 64 61 33 : .. ...a4a8d0eda3
0040 : 31 63 66 39 31 34 38 36 63 38 31 35 36 65 37 64 : 1cf91486c8156e7d
0050 : 64 65 65 61 31 63 08 00 00 00 74 65 73 74 75 73 : deea1c....testus
0060 : 65 72 0a 00 00 00 46 69 72 73 74 20 55 73 65 72 : er....First User
0070 : 00 00 00 00 d6 00 00 00 00 00 00 00 00 00 00 00 : ................
0080 : 00 00 00 00 01 01 02 00 00 00 0c 01 02 10 00 00 : ................
0090 : 00 64 65 66 61 75 6c 74 2e 31 34 31 31 33 2e 32 : .default.14113.2
00a0 : 34 00 00 00 00 00 00 00 00 00 00 00 00 00 00 : 4..............
Ah, now that’s more like it. The index entry in this case is 175 bytes, and in the hex dump you can see several pieces of information. If you compare the dump against what radosgw-admin tells us about the object, you can see what is being stored in the index.
Here is the dump of the object metadata
# radosgw-admin bucket list --bucket=mybucket
[
    {
        "name": "myobject",
        "instance": "",
        "namespace": "",
        "owner": "testuser",
        "owner_display_name": "First User",
        "size": 214,
        "mtime": "2016-01-29 04:26:19.000000Z",
        "etag": "a4a8d0eda31cf91486c8156e7ddeea1c",
        "content_type": "",
        "tag": "default.14113.24",
        "flags": 0
    }
]
So we can confirm that the index contains:
- The object name
- owner
- owner_display_name
- etag
- tag
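The readable fields listed above can also be pulled out of the dump programmatically. The sketch below assumes, based only on what the dump looks like, that each string is stored as a 4-byte little-endian length followed by the ASCII bytes; `length_prefixed_strings` is a hypothetical helper, not part of any Ceph tool, and the scan is a heuristic rather than a real decoder for the index entry format.

```python
import struct

# Hex bytes copied from the listomapvals dump of "myobject" (175 bytes).
DUMP_HEX = """
08 03 a9 00 00 00 08 00 00 00 6d 79 6f 62 6a 65
63 74 01 00 00 00 00 00 00 00 01 04 03 5b 00 00
00 01 d6 00 00 00 00 00 00 00 eb e9 aa 56 00 00
00 00 20 00 00 00 61 34 61 38 64 30 65 64 61 33
31 63 66 39 31 34 38 36 63 38 31 35 36 65 37 64
64 65 65 61 31 63 08 00 00 00 74 65 73 74 75 73
65 72 0a 00 00 00 46 69 72 73 74 20 55 73 65 72
00 00 00 00 d6 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 01 01 02 00 00 00 0c 01 02 10 00 00
00 64 65 66 61 75 6c 74 2e 31 34 31 31 33 2e 32
34 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""


def parse_dump(hex_text):
    """Turn whitespace-separated hex bytes into a bytes object."""
    return bytes.fromhex("".join(hex_text.split()))


def length_prefixed_strings(data, min_len=2):
    """Heuristically find <u32 little-endian length><printable ASCII> runs."""
    found = []
    i = 0
    while i + 4 <= len(data):
        (n,) = struct.unpack_from("<I", data, i)
        chunk = data[i + 4 : i + 4 + n]
        printable = all(0x20 <= b < 0x7F for b in chunk)
        if n >= min_len and len(chunk) == n and printable:
            found.append(chunk.decode("ascii"))
            i += 4 + n  # jump past the string we just consumed
        else:
            i += 1  # not a string here, slide forward one byte
    return found


data = parse_dump(DUMP_HEX)
for text in length_prefixed_strings(data):
    print(text)
```

Run against this dump, the scan yields the object name, etag, owner, owner display name and tag, in that order.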
Notice the object name is stored in the value as well as being the key. I’m assuming that this was done just in case of corruption so that the keys could be recovered by scanning the values.
The owner_display_name is stored there for S3 compatibility, presumably so that a bucket listing can return it without a separate user lookup. That is a compromise that favors reads over writes.
The etag (Entity Tag) is an MD5 sum of the object and is used for S3 compatibility. That’s a shame, because I’m sure it hurts write performance if an MD5 sum has to be calculated for each object when it’s created.
I suspect the rest of the metadata reported by radosgw-admin is there as well (either empty or not visible in the hex dump).
Now let’s actually find where this key/value store lives
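One way to test that suspicion is to search the dump for the binary forms of the size and mtime that radosgw-admin reported. The sketch below guesses at the encodings (size as a 64-bit little-endian integer, mtime as 32-bit little-endian seconds since the Unix epoch); those widths are assumptions made for illustration, not taken from the Ceph source.

```python
import struct
from datetime import datetime, timezone

# Hex bytes copied from the listomapvals dump of "myobject" (175 bytes).
DUMP_HEX = """
08 03 a9 00 00 00 08 00 00 00 6d 79 6f 62 6a 65
63 74 01 00 00 00 00 00 00 00 01 04 03 5b 00 00
00 01 d6 00 00 00 00 00 00 00 eb e9 aa 56 00 00
00 00 20 00 00 00 61 34 61 38 64 30 65 64 61 33
31 63 66 39 31 34 38 36 63 38 31 35 36 65 37 64
64 65 65 61 31 63 08 00 00 00 74 65 73 74 75 73
65 72 0a 00 00 00 46 69 72 73 74 20 55 73 65 72
00 00 00 00 d6 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 01 01 02 00 00 00 0c 01 02 10 00 00
00 64 65 66 61 75 6c 74 2e 31 34 31 31 33 2e 32
34 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""

data = bytes.fromhex("".join(DUMP_HEX.split()))

# Values reported by `radosgw-admin bucket list` for myobject.
size = 214
mtime = datetime(2016, 1, 29, 4, 26, 19, tzinfo=timezone.utc)

# Assumed encodings: u64 little-endian size, u32 little-endian epoch seconds.
size_bytes = struct.pack("<Q", size)
mtime_bytes = struct.pack("<I", int(mtime.timestamp()))

# bytes.find returns -1 when the pattern is absent.
size_offset = data.find(size_bytes)
mtime_offset = data.find(mtime_bytes)

print("size found at offset", size_offset)
print("mtime found at offset", mtime_offset)
```

For this dump both searches succeed, which suggests the size and mtime are stored in the index entry too, just not in a form that shows up in the ASCII column.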
Compute which OSD is holding our index object
# ceph osd map .rgw.buckets.index .rgw.buckets.index .dir.default.14113.24
osdmap e60 pool '.rgw.buckets.index' (11) object '.dir.default.14113.24/.rgw.buckets.index' -> pg 11.e6c72a3f (11.3f) -> up ([3,5], p3) acting ([3,5], p3)
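So here we can see that the key/value store lives on OSDs 3 and 5, where 3 is the primary (it comes first and is marked p3). Strictly speaking, the invocation above repeats the pool name and uses the .24 tag suffix, so Ceph reads it as an object named .rgw.buckets.index in the namespace .dir.default.14113.24. To map the index object itself the form is ceph osd map .rgw.buckets.index .dir.default.14113.1, and the resulting placement may differ.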
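If you want to script this step, the acting set and primary can be pulled out of the ceph osd map line with a regular expression. This is a minimal sketch that assumes the text format shown in this post; `acting_set` is a hypothetical helper, and the sample line is copied verbatim from the output above.

```python
import re

# Output line copied from the `ceph osd map` command in this post.
OSD_MAP_LINE = (
    "osdmap e60 pool '.rgw.buckets.index' (11) object "
    "'.dir.default.14113.24/.rgw.buckets.index' -> pg 11.e6c72a3f (11.3f) "
    "-> up ([3,5], p3) acting ([3,5], p3)"
)


def acting_set(line):
    """Return (list of acting OSD ids, primary OSD id) from an osd map line."""
    match = re.search(r"acting \(\[([0-9,]+)\], p(\d+)\)", line)
    if match is None:
        raise ValueError("no acting set found in: " + line)
    osds = [int(osd) for osd in match.group(1).split(",")]
    return osds, int(match.group(2))


osds, primary = acting_set(OSD_MAP_LINE)
print("acting OSDs:", osds, "primary:", primary)
```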
Find the key/value store on OSD 3
# ceph osd tree
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.06235 root default
-2 0.02078     host ceph-osd1
 0 0.01039         osd.0           up  1.00000          1.00000
 3 0.01039         osd.3           up  1.00000          1.00000
-3 0.02078     host ceph-osd0
 1 0.01039         osd.1           up  1.00000          1.00000
 5 0.01039         osd.5           up  1.00000          1.00000
-4 0.02078     host ceph-osd2
 2 0.01039         osd.2           up  1.00000          1.00000
 4 0.01039         osd.4           up  1.00000          1.00000
Here we see that osd.3 lives on host ceph-osd1.
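The same lookup can be done by walking the ceph osd tree listing: remember the most recent host line and return it when the wanted OSD shows up. The sketch below assumes the plain-text layout used in this post (type in the third column for hosts, OSD name in the third column for OSDs); `host_of_osd` is a hypothetical helper.

```python
# Text of the `ceph osd tree` output from this post, one entry per line.
OSD_TREE = """\
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.06235 root default
-2 0.02078     host ceph-osd1
 0 0.01039         osd.0           up  1.00000          1.00000
 3 0.01039         osd.3           up  1.00000          1.00000
-3 0.02078     host ceph-osd0
 1 0.01039         osd.1           up  1.00000          1.00000
 5 0.01039         osd.5           up  1.00000          1.00000
-4 0.02078     host ceph-osd2
 2 0.01039         osd.2           up  1.00000          1.00000
 4 0.01039         osd.4           up  1.00000          1.00000
"""


def host_of_osd(tree_text, osd_name):
    """Return the host under which osd_name is listed, or None if absent."""
    current_host = None
    for line in tree_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "host":
            current_host = fields[3]  # OSD lines that follow belong here
        elif len(fields) >= 3 and fields[2] == osd_name:
            return current_host
    return None


print(host_of_osd(OSD_TREE, "osd.3"))
```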
root@ceph-osd1# cd /var/lib/ceph/osd/ceph-3/
root@ceph-osd1:/var/lib/ceph/osd/ceph-3# ls
activate.monmap  current  journal_uuid  ready          upstart
active           fsid     keyring       store_version  whoami
ceph_fsid        journal  magic         superblock
root@ceph-osd1:/var/lib/ceph/osd/ceph-3# cd current/omap/
root@ceph-osd1:/var/lib/ceph/osd/ceph-3/current/omap# ls
000007.ldb  000011.log  CURRENT  LOG      MANIFEST-000006
000010.ldb  000012.ldb  LOCK     LOG.old
root@ceph-osd1:/var/lib/ceph/osd/ceph-3/current/omap# ls -l
total 9128
-rw-r--r-- 1 ceph ceph     163 Jan 11 05:11 000007.ldb
-rw-r--r-- 1 ceph ceph 1207818 Jan 20 02:36 000010.ldb
-rw-r--r-- 1 ceph ceph 4947942 Jan 29 05:36 000011.log
-rw-r--r-- 1 ceph ceph 1235101 Jan 29 03:57 000012.ldb
-rw-r--r-- 1 ceph ceph      16 Jan 11 05:11 CURRENT
-rw-r--r-- 1 ceph ceph       0 Jan 11 05:11 LOCK
-rw-r--r-- 1 ceph ceph     709 Jan 29 03:57 LOG
-rw-r--r-- 1 ceph ceph     172 Jan 11 05:11 LOG.old
-rw-r--r-- 1 ceph ceph     331 Jan 29 03:57 MANIFEST-000006
And there is the LevelDB database, which is the key/value store holding our index.
So that’s the rados gateway indexes explained. I hope you find this helpful and enlightening.