How Indexes Work In Ceph Rados Gateway

The Ceph Rados Gateway lets you access Ceph via the Swift and S3 APIs, translating those API calls into librados requests. Librados is a wonderful object store, but it wasn't designed to list objects efficiently. To speed up listings and to track some additional metadata, the Rados Gateway maintains its own indexes. There isn't a lot of documentation on how these indexes work, so I've written this blog post to shed some light on that.

First let's examine an existing bucket

# radosgw-admin bucket stats --bucket=mybucket
{
    "bucket": "mybucket",
    "pool": ".rgw.buckets",
    "index_pool": ".rgw.buckets.index",
    "id": "default.14113.1",
    "marker": "default.14113.1",
    "owner": "testuser",
    "ver": "0#3",
    "master_ver": "0#0",
    "mtime": "2016-01-29 04:21:47.000000",
    "max_marker": "0#",
    "usage": {
        "rgw.main": {
            "size_kb": 1,
            "size_kb_actual": 4,
            "num_objects": 1
        }
    },
    "bucket_quota": {
        "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1
    }
}

The list of objects in this bucket is stored in a separate rados object. The name of that object is the bucket id with .dir. prepended to it, and the index objects are kept in a separate pool called .rgw.buckets.index. So in this case the bucket index for mybucket should be .dir.default.14113.1.
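If you want to derive the index object name programmatically, the bucket stats output is JSON, so something like this works (a minimal sketch; it assumes you have jq installed):

# radosgw-admin bucket stats --bucket=mybucket | jq -r '".dir." + .id'
.dir.default.14113.1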

Let's find the bucket index

# rados -p .rgw.buckets.index ls - | grep "default.14113.1"
.dir.default.14113.1

So here you see the index object was returned in the .rgw.buckets.index pool.

Now let's look at what's inside the index object

# rados -p .rgw.buckets.index get .dir.default.14113.1 indexfile
# wc -c indexfile
0 indexfile

So the object is 0 bytes... hmm. The secret here is that the index information is actually kept in Ceph's key/value store. Each OSD has a colocated leveldb key/value store, so the object is really just acting as a placeholder that tells Ceph which OSD's key/value store contains the index.
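You can play with this key/value (omap) machinery yourself on any plain rados object. Here's a quick sketch; the pool name test is an assumption, so use any pool you can safely scribble in:

# rados -p test create demo-obj
# rados -p test setomapval demo-obj mykey myvalue
# rados -p test listomapkeys demo-obj
mykey
# rados -p test stat demo-obj
# rados -p test rm demo-obj

Just like our index object, stat will report demo-obj as 0 bytes, because the omap data lives in the OSD's key/value store rather than in the object itself.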

Let's look at the contents of the key/value store

First let's look at the key

# rados -p .rgw.buckets.index listomapkeys .dir.default.14113.1
myobject

So the key is just the name of the object, which makes sense.

Now let's see the value

# rados -p .rgw.buckets.index listomapvals .dir.default.14113.1
myobject
value: (175 bytes) :
0000 : 08 03 a9 00 00 00 08 00 00 00 6d 79 6f 62 6a 65 : ..........myobje
0010 : 63 74 01 00 00 00 00 00 00 00 01 04 03 5b 00 00 : ct...........[..
0020 : 00 01 d6 00 00 00 00 00 00 00 eb e9 aa 56 00 00 : .............V..
0030 : 00 00 20 00 00 00 61 34 61 38 64 30 65 64 61 33 : .. ...a4a8d0eda3
0040 : 31 63 66 39 31 34 38 36 63 38 31 35 36 65 37 64 : 1cf91486c8156e7d
0050 : 64 65 65 61 31 63 08 00 00 00 74 65 73 74 75 73 : deea1c....testus
0060 : 65 72 0a 00 00 00 46 69 72 73 74 20 55 73 65 72 : er....First User
0070 : 00 00 00 00 d6 00 00 00 00 00 00 00 00 00 00 00 : ................
0080 : 00 00 00 00 01 01 02 00 00 00 0c 01 02 10 00 00 : ................
0090 : 00 64 65 66 61 75 6c 74 2e 31 34 31 31 33 2e 32 : .default.14113.2
00a0 : 34 00 00 00 00 00 00 00 00 00 00 00 00 00 00    : 4..............

Ah, now that's more like it. The index entry in this case is 175 bytes, and in the hex dump you can pick out several pieces of information. If you compare the dump against what radosgw-admin tells us about the object, we can see exactly what it's storing in the index.
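(As an aside: if you want the raw value on disk for closer inspection, getomapval can write a single key's value out to a file. The file name valfile here is arbitrary.)

# rados -p .rgw.buckets.index getomapval .dir.default.14113.1 myobject valfile
# wc -c valfile
175 valfile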

Here is the dump of the object metadata

# radosgw-admin bucket list --bucket=mybucket
[
    {
        "name": "myobject",
        "instance": "",
        "namespace": "",
        "owner": "testuser",
        "owner_display_name": "First User",
        "size": 214,
        "mtime": "2016-01-29 04:26:19.000000Z",
        "etag": "a4a8d0eda31cf91486c8156e7ddeea1c",
        "content_type": "",
        "tag": "default.14113.24",
        "flags": 0
    }

]

So we can confirm that the index contains:

  • The object name
  • owner
  • owner_display_name
  • etag
  • tag

Notice the object name is stored in the value as well as being the key. I'm assuming this was done in case of corruption, so that the keys could be recovered by scanning the values.

The owner_display_name is there for S3 compatibility: S3 bucket listings include each object's owner display name, so keeping a copy in every index entry lets a listing be served straight from the index without extra lookups. It's an obvious compromise favoring reads over writes.

The etag (entity tag) is an MD5 sum of the object's contents and is also there for S3 compatibility. That's a shame, because I'm sure calculating an MD5 sum for every object as it's written hurts write performance.
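You can check this yourself. In this version of RGW the actual object data lives in the .rgw.buckets pool (see the bucket stats above) under a name built from the bucket marker and the object name; assuming that naming holds on your cluster, pulling the data out and hashing it should reproduce the etag:

# rados -p .rgw.buckets get default.14113.1_myobject /tmp/myobject
# md5sum /tmp/myobject

The sum should come out as a4a8d0eda31cf91486c8156e7ddeea1c, matching the etag above. (This only holds for simple uploads; S3 multipart etags are computed differently.)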

I suspect the rest of the metadata reported by radosgw-admin is in there as well (either empty or just not obvious in the hex dump).

Now let's actually find where this key/value store lives

Compute which OSD is holding our index object

# ceph osd map .rgw.buckets.index .dir.default.14113.1
osdmap e60 pool '.rgw.buckets.index' (11) object '.dir.default.14113.1' -> pg 11.e6c72a3f (11.3f) -> up ([3,5], p3) acting ([3,5], p3)

So here we can see that the key/value store lives on OSDs 3 and 5, where 3 is the primary (it's listed first).
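If you're scripting this lookup, ceph can emit the same mapping as JSON, which is much easier to parse. A small sketch, assuming jq is installed and that your Ceph version includes the acting_primary field in the JSON output (mine does; older releases may not):

# ceph osd map .rgw.buckets.index .dir.default.14113.1 -f json | jq .acting_primary
3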

Find the key/value store on OSD 3

# ceph osd tree
ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.06235 root default
-2 0.02078     host ceph-osd1
 0 0.01039         osd.0           up  1.00000          1.00000
 3 0.01039         osd.3           up  1.00000          1.00000
-3 0.02078     host ceph-osd0
 1 0.01039         osd.1           up  1.00000          1.00000
 5 0.01039         osd.5           up  1.00000          1.00000
-4 0.02078     host ceph-osd2
 2 0.01039         osd.2           up  1.00000          1.00000
 4 0.01039         osd.4           up  1.00000          1.00000

Here we see that osd.3 lives on host ceph-osd1
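(A quicker way to locate an OSD than reading the tree by eye is ceph osd find, which prints the OSD's address and crush location, including its host, as JSON.)

# ceph osd find 3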

root@ceph-osd1# cd /var/lib/ceph/osd/ceph-3/
root@ceph-osd1:/var/lib/ceph/osd/ceph-3# ls
activate.monmap  current  journal_uuid  ready          upstart
active           fsid     keyring       store_version  whoami
ceph_fsid        journal  magic         superblock
root@ceph-osd1:/var/lib/ceph/osd/ceph-3# cd current/omap/
root@ceph-osd1:/var/lib/ceph/osd/ceph-3/current/omap# ls
000007.ldb  000011.log  CURRENT  LOG      MANIFEST-000006
000010.ldb  000012.ldb  LOCK     LOG.old
root@ceph-osd1:/var/lib/ceph/osd/ceph-3/current/omap# ls -l
total 9128
-rw-r--r-- 1 ceph ceph     163 Jan 11 05:11 000007.ldb
-rw-r--r-- 1 ceph ceph 1207818 Jan 20 02:36 000010.ldb
-rw-r--r-- 1 ceph ceph 4947942 Jan 29 05:36 000011.log
-rw-r--r-- 1 ceph ceph 1235101 Jan 29 03:57 000012.ldb
-rw-r--r-- 1 ceph ceph      16 Jan 11 05:11 CURRENT
-rw-r--r-- 1 ceph ceph       0 Jan 11 05:11 LOCK
-rw-r--r-- 1 ceph ceph     709 Jan 29 03:57 LOG
-rw-r--r-- 1 ceph ceph     172 Jan 11 05:11 LOG.old
-rw-r--r-- 1 ceph ceph     331 Jan 29 03:57 MANIFEST-000006

And there it is: the leveldb key/value store that holds our index.
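If you want to peek inside that leveldb directly, ceph ships a tool for it: ceph-kvstore-tool (packaged in ceph-test on Debian/Ubuntu). The store can only be opened by one process at a time, so stop the OSD first. Roughly (the exact invocation varies by Ceph version; newer releases want the store type as an extra first argument, i.e. ceph-kvstore-tool leveldb <path> list):

root@ceph-osd1# stop ceph-osd id=3
root@ceph-osd1# ceph-kvstore-tool /var/lib/ceph/osd/ceph-3/current/omap list | grep 14113
root@ceph-osd1# start ceph-osd id=3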

So that’s the rados gateway indexes explained. How you find this helpful/enlightening.