Ceph – full blown NVMe cluster

Ceph is open source software designed to provide highly scalable object-, block- and file-based storage under a unified system.

Ceph storage clusters are designed to run on commodity hardware, using an algorithm called CRUSH (Controlled Replication Under Scalable Hashing) to ensure data is evenly distributed across the cluster and that all cluster nodes can retrieve data quickly without any centralized bottlenecks.

I am using this page as my personal notebook covering Ceph guidelines specifically for all flash deployments.

NVMe drives are getting cheaper, but server models that can handle them are still quite expensive. In my search for the ideal configuration I found that the ASUS Hyper M.2 x16 PCI-E card was one way of installing more NVMe drives in one barebone. Other vendors are limited to 2 M.2 drives per PCI-E slot.


The next challenge was finding a server whose BIOS supports PCI-E bifurcation, so that an x16 slot can be split into x4x4x4x4 lanes (as required by this card). Not many servers support this feature at this time; those that do are expensive or only have one full-length PCI-E slot.

We finally ended up with the following configuration for our ceph OSD nodes:

Supermicro GPU Server: 1018GR-T
Intel Xeon 6C 1.7 GHz
16 GB DDR4
2x Intel 120 GB SSD (boot)
2x ASUS Hyper M.2 x16
8x Samsung 960 PRO NVMe
1x Intel X520 Dualport SFP+

This Supermicro model has two PCI-E x16 slots, one on the left side and one on the right. That leaves room for our Intel 10G SFP+ card in the center x8 slot.


Installation of Ceph

After the base install of Ubuntu we need to do some checks.
– Double check the NTP installation/configuration
– Check your hosts file and add all Ceph nodes
– Generate SSH keys and make sure you can log on to all nodes.
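
The hosts-file and key checks can be scripted. A minimal sketch, assuming the node names ceph01 to ceph03 used later on this page; missing_hosts is a hypothetical helper of my own and not part of Ceph:

```shell
#!/bin/sh
# Print every node name that has no entry in the given hosts file.
# Usage: missing_hosts /etc/hosts ceph01 ceph02 ceph03
missing_hosts() {
    hosts_file=$1
    shift
    for node in "$@"; do
        # Strip comments, then look for the name as a whole field after the IP.
        if ! sed 's/#.*//' "$hosts_file" | awk -v n="$node" '
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit !found }'; then
            echo "$node"
        fi
    done
}

# Once the names resolve, create a key and copy it to each node
# (uncomment to run; ssh-copy-id asks for each node's password once):
# ssh-keygen -t rsa -b 4096
# for node in ceph01 ceph02 ceph03; do ssh-copy-id "$node"; done
```

If missing_hosts prints nothing, every node has an entry and ceph-deploy should be able to reach it by name.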

Ceph has several components that can either be installed on separate nodes or combined on the same host:

Ceph OSDs:
Handle data storage, data replication, and recovery. A Ceph cluster needs at least two Ceph OSDs.

Ceph Monitor:
Monitors the cluster state and maintains the OSD map and CRUSH map.

Ceph Metadata Server:
If you use CephFS this is a required component.

NTP installation on Ubuntu:

sudo apt-get install -y ntp ntpdate ntp-doc
sudo hwclock --systohc
sudo systemctl enable ntp
sudo systemctl start ntp

Install Python

sudo apt-get install -y python python-pip parted

Installing ceph-deploy

We will be using targetcli and rbd in order to publish disks to our Windows 2016 Hyper-V servers, so I install both ceph-deploy and targetcli:
sudo apt-get install ceph-deploy targetcli

If everything goes as planned we can now start the actual ceph install.

Install ceph and configure cluster

I create a directory (under my home directory) for the next few steps, since several files will be created. After that I define the first node in the cluster, install Ceph onto all nodes, and define which nodes will run the monitors.

mkdir ceph-cluster
cd ceph-cluster
ceph-deploy new ceph01

Now open the file ceph.conf that was created and add this to it:
public network = {public-network/netmask}
cluster network = {cluster-network/netmask}

Cluster network will be used for internal communication such as replication of storage. Public network is (in my case) for Hyper-V communication.
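
For illustration, this is how the two settings could be appended from the shell. The subnets in the usage line are made-up examples, add_networks is a hypothetical helper, and the sketch assumes the freshly generated ceph.conf contains only a [global] section, so appended lines land in it:

```shell
#!/bin/sh
# Append public/cluster network settings to a ceph.conf file.
# Usage: add_networks ceph.conf 192.168.10.0/24 192.168.20.0/24
add_networks() {
    conf=$1
    public_net=$2
    cluster_net=$3
    # Refuse to add the settings twice.
    if grep -q '^public network' "$conf"; then
        echo "public network already set in $conf" >&2
        return 1
    fi
    printf 'public network = %s\ncluster network = %s\n' \
        "$public_net" "$cluster_net" >> "$conf"
}
```

Keeping the cluster network on its own subnet (and ideally its own port of the dual-port card) keeps replication traffic away from the Hyper-V hosts.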

Let's go on:
ceph-deploy install ceph01 ceph02 ceph03
ceph-deploy mon create ceph01 ceph02 ceph03
ceph-deploy gatherkeys ceph01

In this scenario I configured all 3 nodes to run a monitor; this, however, is not required. Once this has completed we have 3 fully installed nodes and are ready to add the storage (disks) to the cluster.

In order to see the disks that are available, run the disk list command:
ceph-deploy disk list ceph01

This however will trigger a bug in ceph-deploy resulting in this error:
[ceph_deploy][ERROR ] Traceback (most recent call last):
[ceph_deploy][ERROR ] File "/usr/local/lib/python2.7/dist-packages/ceph_deploy/util/", line 69, in newfunc
[ceph_deploy][ERROR ] return f(*a, **kw)
[ceph_deploy][ERROR ] File "/usr/local/lib/python2.7/dist-packages/ceph_deploy/", line 164, in _main
[ceph_deploy][ERROR ] return args.func(args)
[ceph_deploy][ERROR ] File "/usr/local/lib/python2.7/dist-packages/ceph_deploy/", line 434, in disk
[ceph_deploy][ERROR ] disk_list(args, cfg)
[ceph_deploy][ERROR ] File "/usr/local/lib/python2.7/dist-packages/ceph_deploy/", line 376, in disk_list
[ceph_deploy][ERROR ] distro.conn.logger(line)
[ceph_deploy][ERROR ] TypeError: 'Logger' object is not callable

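You can fix this by editing the ceph_deploy file that defines disk_list (the last file in the traceback above).

Go to line 374 and you will find:
for line in out:
    if line.startswith('Disk /'):
        distro.conn.logger(line)

Replace this with:
for line in out:
    if line.startswith(b'Disk /'):
        LOG.info(line.decode('utf-8'))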

Run the disk list command again and output will look like this:

[ceph01][DEBUG ] find the location of an executable
[ceph01][INFO ] Running command: sudo /usr/sbin/ceph-disk list
[ceph01][DEBUG ] /dev/dm-0 other, ext4, mounted on /
[ceph01][DEBUG ] /dev/dm-1 swap, swap
[ceph01][DEBUG ] /dev/loop0 other, unknown
[ceph01][DEBUG ] /dev/loop1 other, unknown
[ceph01][DEBUG ] /dev/loop2 other, unknown
[ceph01][DEBUG ] /dev/loop3 other, unknown
[ceph01][DEBUG ] /dev/loop4 other, unknown
[ceph01][DEBUG ] /dev/loop5 other, unknown
[ceph01][DEBUG ] /dev/loop6 other, unknown
[ceph01][DEBUG ] /dev/loop7 other, unknown
[ceph01][DEBUG ] /dev/nvme0n1
[ceph01][DEBUG ] /dev/nvme1n1
[ceph01][DEBUG ] /dev/nvme2n1
[ceph01][DEBUG ] /dev/nvme3n1
[ceph01][DEBUG ] /dev/sda :
[ceph01][DEBUG ] /dev/sda2 other, 0x5
[ceph01][DEBUG ] /dev/sda5 other, LVM2_member
[ceph01][DEBUG ] /dev/sda1 other, ext2, mounted on /boot

In my situation /dev/sda is the boot drive. I have plugged in one of the ASUS Hyper cards, which, as you can see, provides 4 NVMe drives that we can use.
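
Before handing the drives to Ceph it is worth confirming that every M.2 slot on the card shows up, since a missing bifurcation setting can leave drives undetected. A minimal sketch; list_nvme_disks is a hypothetical helper that only looks at device names:

```shell
#!/bin/sh
# List whole-disk NVMe namespaces (nvme0n1, nvme1n1, ...) under a device
# directory, skipping controllers (nvme0) and partitions (nvme0n1p1).
# Usage: list_nvme_disks [/dev]
list_nvme_disks() {
    dev_dir=${1:-/dev}
    for path in "$dev_dir"/nvme*n[0-9]*; do
        [ -e "$path" ] || continue
        case ${path##*/} in
            *p[0-9]*) ;;            # partition, skip
            *) echo "$path" ;;
        esac
    done
}
```

With one Hyper card installed, list_nvme_disks | wc -l should report 4; with both cards, 8.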

In order to add our disks to the cluster we first need to “zap” them. You can do this disk by disk, or with one command per step:

ceph-deploy disk zap ceph01:/dev/nvme0n1 ceph01:/dev/nvme1n1 ceph01:/dev/nvme2n1 ceph01:/dev/nvme3n1
ceph-deploy osd prepare ceph01:/dev/nvme0n1 ceph01:/dev/nvme1n1 ceph01:/dev/nvme2n1 ceph01:/dev/nvme3n1
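
With eight drives per node the argument list gets long. A small sketch of building it instead of typing it; osd_targets is a hypothetical helper, and the commented commands are the same zap and prepare calls shown above:

```shell
#!/bin/sh
# Build "host:device" arguments for ceph-deploy from a host and device list.
# Usage: osd_targets ceph01 /dev/nvme0n1 /dev/nvme1n1
osd_targets() {
    host=$1
    shift
    targets=""
    for dev in "$@"; do
        # Add a separating space only when the list is not empty yet.
        targets="$targets${targets:+ }$host:$dev"
    done
    echo "$targets"
}

# Example (uncomment on the admin node; zapping destroys all data on the disks):
# targets=$(osd_targets ceph01 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1)
# ceph-deploy disk zap $targets
# ceph-deploy osd prepare $targets
```

The variable is left unquoted in the ceph-deploy calls on purpose, so the shell splits it back into one argument per disk.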

Ceph will handle all partitioning and create the OSD. Now we need to push our admin key to all nodes:
ceph-deploy admin ceph01 ceph02 ceph03

And make sure I can actually read the file:
sudo chmod +r /etc/ceph/ceph.client.admin.keyring

Let’s check if ceph actually did what we expected:
ceph-deploy disk list CEPH-01
[CEPH-01][DEBUG ] find the location of an executable
[CEPH-01][INFO ] Running command: sudo /usr/sbin/ceph-disk list
[CEPH-01][DEBUG ] /dev/dm-0 other, ext4, mounted on /
[CEPH-01][DEBUG ] /dev/dm-1 swap, swap
[CEPH-01][DEBUG ] /dev/loop0 other, unknown
[CEPH-01][DEBUG ] /dev/loop1 other, unknown
[CEPH-01][DEBUG ] /dev/loop2 other, unknown
[CEPH-01][DEBUG ] /dev/loop3 other, unknown
[CEPH-01][DEBUG ] /dev/loop4 other, unknown
[CEPH-01][DEBUG ] /dev/loop5 other, unknown
[CEPH-01][DEBUG ] /dev/loop6 other, unknown
[CEPH-01][DEBUG ] /dev/loop7 other, unknown
[CEPH-01][DEBUG ] /dev/nvme0n1 :
[CEPH-01][DEBUG ] /dev/nvme0n1p2 ceph journal, for /dev/nvme0n1p1
[CEPH-01][DEBUG ] /dev/nvme0n1p1 ceph data, prepared, cluster ceph, osd.0, journal /dev/nvme0n1p2
[CEPH-01][DEBUG ] /dev/nvme1n1 :
[CEPH-01][DEBUG ] /dev/nvme1n1p2 ceph journal, for /dev/nvme1n1p1
[CEPH-01][DEBUG ] /dev/nvme1n1p1 ceph data, active, cluster ceph, osd.1, journal /dev/nvme1n1p2
[CEPH-01][DEBUG ] /dev/nvme2n1 :
[CEPH-01][DEBUG ] /dev/nvme2n1p2 ceph journal, for /dev/nvme2n1p1
[CEPH-01][DEBUG ] /dev/nvme2n1p1 ceph data, active, cluster ceph, osd.2, journal /dev/nvme2n1p2
[CEPH-01][DEBUG ] /dev/nvme3n1 :
[CEPH-01][DEBUG ] /dev/nvme3n1p2 ceph journal, for /dev/nvme3n1p1
[CEPH-01][DEBUG ] /dev/nvme3n1p1 ceph data, active, cluster ceph, osd.3, journal /dev/nvme3n1p2
[CEPH-01][DEBUG ] /dev/sda :
[CEPH-01][DEBUG ] /dev/sda2 other, 0x5
[CEPH-01][DEBUG ] /dev/sda5 other, LVM2_member
[CEPH-01][DEBUG ] /dev/sda1 other, ext2, mounted on /boot

Looks good! All NVMe drives now have partitions created by Ceph. Let’s check the cluster health:

$ sudo ceph health

All good!

The basics

Now that ceph is installed we have some basic commands to use:

Check cluster health
ceph health

If you see that something is wrong, this will give you much more detail:
sudo ceph health detail
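
The two health commands combine nicely in a cron job or monitoring check. A minimal sketch, assuming the status keyword (HEALTH_OK, HEALTH_WARN, HEALTH_ERR) is the first word of the output; health_code is a hypothetical helper:

```shell
#!/bin/sh
# Map the first word of `ceph health` output to an exit code:
# 0 for HEALTH_OK, 1 for HEALTH_WARN, 2 for anything else.
health_code() {
    case $1 in
        HEALTH_OK*)   return 0 ;;
        HEALTH_WARN*) return 1 ;;
        *)            return 2 ;;
    esac
}

# Example (uncomment on a node with the admin keyring):
# health_code "$(sudo ceph health)" || sudo ceph health detail
```

The detail output is only requested when the cluster is not healthy, which keeps the normal case quiet.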

Check cluster status
ceph -s

Check cluster status (watch)
ceph -w

Check cluster usage stats
ceph df

Check cluster OSD usage stats
ceph osd df

Check the CRUSH map
ceph osd tree

Repair an OSD
ceph osd repair {osd-id}
Ceph is a self-repairing cluster. Tell Ceph to attempt repair of an OSD by calling ceph osd repair with the OSD identifier.

Benchmark an OSD
ceph tell osd.* bench

Create pool and RBD image

To list your cluster’s pools, execute:
ceph osd lspools

Create a new pool:
ceph osd pool create vstore1 128 128

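As we have a small cluster (fewer than 5 OSDs at this time), we set both pg_num (the number of placement groups) and pgp_num (the number of placement groups used for data placement) to 128. When adding more OSDs we will raise these values. More documentation can be found in the Placement Groups section of the Ceph documentation.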
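
For larger clusters, the Ceph placement group documentation gives a rule of thumb: roughly 100 placement groups per OSD, divided by the replica count, rounded up to the next power of two. A minimal sketch of that arithmetic; suggest_pg_num is a hypothetical helper, and the replica count of 3 is an assumption (check your pool's size setting):

```shell
#!/bin/sh
# Suggest a pg_num: (OSDs * 100) / replicas, rounded up to a power of two.
# Usage: suggest_pg_num <osd-count> <replica-count>
suggest_pg_num() {
    osds=$1
    replicas=$2
    # Integer ceiling of osds * 100 / replicas.
    target=$(( (osds * 100 + replicas - 1) / replicas ))
    pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

suggest_pg_num 24 3   # prints 1024
```

24 OSDs corresponds to this build fully populated: 3 nodes with 8 NVMe drives each. Treat the result as a starting point for the whole cluster, not a per-pool value to apply blindly.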

Let’s create our first RBD disk that we will publish to our Hyper-V hosts.
rbd create vdisk001 --pool vstore1 -s 102400 --image-format 2 --image-feature exclusive-lock

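This created a 100 GB disk in pool “vstore1”: the -s value is given in megabytes, so 102400 is 100 GB, and a 1 TB image would need -s 1048576. More information about the image format and features can be found in the rbd documentation.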
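
Because the size is a plain megabyte count, it is easy to be off by a factor of ten. A small sketch of the conversion; gib_to_rbd_size is a hypothetical helper:

```shell
#!/bin/sh
# Convert a size in GiB to the megabyte count passed to `rbd create -s`.
gib_to_rbd_size() {
    echo $(( $1 * 1024 ))
}

gib_to_rbd_size 100    # prints 102400
gib_to_rbd_size 1024   # prints 1048576 (1 TiB)
```

It can be used inline, for example: rbd create vdisk002 --pool vstore1 -s "$(gib_to_rbd_size 1024)" with the remaining options as above (vdisk002 is just an example name).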

RBD images are striped over many objects, which are then stored by the Ceph distributed object store (RADOS). As a result, read and write requests for the image are distributed across many nodes in the cluster, generally preventing any single node from becoming a bottleneck when individual images get large or busy.



Performance testing




