Table of Contents
Introduction
General information
...
The table below summarizes SPC-1 and SPC-2 specifics.
| | SPC-1 | SPC-2 |
|---|---|---|
| Typical applications | ... | ... |
| Workload | Random I/O | Sequential I/O (1+ streams) |
| Workload variations | ... | ... |
| Reported metrics | ... | ... |
Good practices and tips for benchmarking
...
Because iperf is a client-server application, you have to install iperf on both machines involved in the tests. Make sure that you use iperf 2.0.2 (built with pthreads) or newer, because older versions have multi-threading issues. You can check the version of the installed tool with the following command:
```
[user@hostname ~]$ iperf -v
iperf version 2.0.2 (03 May 2005) pthreads
```
Because in many cases low network performance is caused by high CPU load, you should measure CPU usage at both ends of the link during every test round. In this HOWTO we use the open-source vmstat tool, which you probably already have installed on your machines.
...
To enable MTU 9000 on the machine's network interfaces you can use the ifconfig command:
```
[root@hostname ~]# ifconfig eth1 mtu 9000
```
Alternatively, you can put these settings into the interface configuration scripts, e.g. /etc/sysconfig/network-scripts/ifcfg-eth1 (on RHEL, CentOS, Fedora, etc.).
If jumbo frames are working properly, you should be able to ping one host from the other using a large packet size:
```
[root@hostname ~]$ ping 10.0.0.1 -s 8960
```
In the example above we use 8960 instead of 9000 because ping's -s option takes the frame size minus the headers, whose length is 40 bytes. If you cannot use jumbo frames, set the MTU to the default value of 1500.
To tune your link you should measure the average Round Trip Time (RTT) between the machines; the RTT is reported directly in the output of the ping command. When you have the RTT measured, you can set the TCP read and write buffer sizes. There are three values you can set: minimum, initial and maximum buffer size. The theoretical value (in bytes) for the initial buffer size is BPS / 8 * RTT, where BPS is the link bandwidth in bits/second and RTT is in seconds. Example commands that set these values for the whole operating system are:
```
[root@hostname ~]# sysctl -w net.ipv4.tcp_rmem="4096 500000 1000000"
[root@hostname ~]# sysctl -w net.ipv4.tcp_wmem="4096 500000 1000000"
```
It is probably best to start with values computed using the formula above and then tune them according to the test results.
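As a quick sanity check, the formula above can be evaluated in the shell. The bandwidth and RTT below are made-up example values (a 1 Gbit/s link with a 2 ms RTT), not measurements from this HOWTO:

```shell
# Hypothetical example values - substitute your own measurements.
bps=1000000000                        # link bandwidth in bits/second
rtt_ms=2                              # measured average RTT in milliseconds
# initial buffer size = BPS / 8 * RTT (bytes)
buffer=$(( bps / 8 * rtt_ms / 1000 ))
echo "$buffer"
```

For this example the result, 250000 bytes, would be used as the middle (initial) value in tcp_rmem and tcp_wmem.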
You can also experiment with maximum socket buffer sizes:
```
[root@hostname ~]# sysctl -w net.core.rmem_max=1000000
[root@hostname ~]# sysctl -w net.core.wmem_max=1000000
```
Other options that may boost performance are:
```
[root@hostname ~]# sysctl -w net.ipv4.tcp_no_metrics_save=1
[root@hostname ~]# sysctl -w net.ipv4.tcp_moderate_rcvbuf=1
[root@hostname ~]# sysctl -w net.ipv4.tcp_window_scaling=1
[root@hostname ~]# sysctl -w net.ipv4.tcp_sack=1
[root@hostname ~]# sysctl -w net.ipv4.tcp_fack=1
[root@hostname ~]# sysctl -w net.ipv4.tcp_dsack=1
```
Note: the meaning of these parameters is explained in the Linux kernel documentation (see the sysctl documentation).
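Settings made with sysctl -w are lost at reboot. To make them persistent, the same keys can also be placed in /etc/sysctl.conf; a sketch with the same example values as above (adjust to your own tuned numbers):

```
# /etc/sysctl.conf (fragment)
net.ipv4.tcp_rmem = 4096 500000 1000000
net.ipv4.tcp_wmem = 4096 500000 1000000
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
```

Run `sysctl -p` to apply the file without rebooting.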
...
To perform the test, we run iperf in server mode on one host:
```
[root@hostname ~]# iperf -s -M $mss
```
On the other host we run a command like this:
```
[root@hostname ~]# iperf -c $serwer -M $mss -P $threads -w ${window} -i $interval -t $test_time
```
Here is a description of the names and symbols used on the command line:
...
To generate a key pair we use the following command:
```
[root@hostname ~]# ssh-keygen -t dsa
```
Then we copy the public key to the remote server and add it to authorized keys file:
```
[root@hostname ~]# cat identity.pub >> /home/sarevok/.ssh/authorized_keys
```
Now we can log in to the remote server without a password; we can also run programs remotely, e.g. from a bash script.
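For example, once the key is installed, a single command can be executed on the remote machine without any password prompt (the user and host here are placeholders):

```
ssh root@10.0.1.1 uptime
```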
...
Here is a simple shell script that runs the iperf test:
```
#!/bin/sh
file_size=41
dst_path=/home/stas/iperf_results
script_path=/root
curr_date=`date +%m-%d-%y-%H-%M-%S`
serwer="10.0.1.1"
user="root"
test_time=60
interval=1
mss=1460
window=1000000
min_threads=1
max_threads=128

for threads in 1 2 4 8 16 32 64 80 96 112 128 ; do
    ssh $user@$serwer $script_path/run_iperf.sh -s -w ${window} -M $mss &
    ssh $user@$serwer $script_path/run_vmstat.sh 1 vmstat-$window-$threads-$mss-$curr_date &
    vmstat 1 > $dst_path/vmstat-$window-$threads-$mss-$curr_date &
    iperf -c $serwer -M $mss -P $threads -w ${window} -i $interval -t $test_time >> $dst_path/iperf-$window-$threads-$mss-$curr_date
    ps ax | grep vmstat | awk '{print $1}' | xargs -i kill {} 2>/dev/null
    ssh $user@$serwer $script_path/kill_iperf_vmstat.sh
done
```
Script run_iperf.sh can look like this:
```
#!/bin/sh
# run iperf in the background, passing all arguments through
iperf "$@" &
```
The run_vmstat.sh script can contain:
```
#!/bin/sh
vmstat $1 > $2 &
```
kill_iperf_vmstat.sh may look like this:
```
#!/bin/sh
ps -elf | egrep "iperf" | egrep -v "egrep" | awk '{print $4}' | xargs -i kill -9 {}
ps -elf | egrep "vmstat" | egrep -v "egrep" | awk '{print $4}' | xargs -i kill -9 {}
```
To start the test script so that it ignores hangup signals, you can use the nohup command.
```
[stas@worm ~]$ nohup script.sh &
```
This command keeps the test running when you close the session with the server.
...
The table below contains a RAID level comparison. The scores range from 0 (worst) to 5 (best). They are based on http://www.pcguide.com/ref/hdd/perf/raid/levels/comp.htm and modified according to our experience and the fact that we consider the same number of drives in every RAID structure. To compare performance we assume that each RAID structure is built from the same number of drives and that a single thread reads or writes data from or to the RAID structure.
...
Here we present how to create a software RAID structure using the Linux md tool. To create a simple RAID level from devices sda1, sda2, sda3 and sda4, use the following command:
```
mdadm --create --verbose /dev/md1 --spare-devices=0 --level=0 --raid-devices=4 /dev/sda1 /dev/sda2 /dev/sda3 /dev/sda4
```
Where:
- /dev/md1 - the name of the created RAID group,
- --spare-devices - the number of drives to be used as spares,
- --level - the RAID level you want to create (currently, Linux supports LINEAR (disk concatenation), RAID0 (striping), RAID1 (mirroring), RAID4, RAID5, RAID6 and RAID10 md devices),
- --raid-devices - the number of devices you want to use to build the RAID structure.
When you have created a RAID structure, you should be able to see RAID details similar to the information shown below:
```
[root@sarevok bin]# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Apr  6 17:41:43 2009
     Raid Level : raid0
     Array Size : 6837337472 (6520.59 GiB 7001.43 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Mon Apr  6 17:41:43 2009
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 64K

 Rebuild Status : 10% complete

           UUID : 19450624:f6490625:aa77982e:0d41d013
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0      65       16        0      active sync   /dev/sda1
       1      65       32        1      active sync   /dev/sda2
       2      65       48        2      active sync   /dev/sda3
       3      65       64        3      active sync   /dev/sda4
```
When performance is considered, the chunk size of the md device may be an important parameter to tune. The -c option of the mdadm command can be used to specify the chunk size in kilobytes. The default is 64 kB, but it should be set according to factors such as:
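For example, the RAID0 from the previous section could be re-created with a non-default 256 kB chunk using the -c/--chunk option (the value here is only an illustration, not a recommendation):

```
mdadm --create --verbose /dev/md1 --chunk=256 --level=0 --raid-devices=4 /dev/sda1 /dev/sda2 /dev/sda3 /dev/sda4
```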
...
To create a file system on the md device and then mount it in some directory, we use commands like these:
```
mkfs.ext3 /dev/md1
mount /dev/md1 /mnt/md1
```
Again, there are some filesystem parameters that are interesting from the performance point of view. One of them is the block size. It should be set taking into account the application characteristics and the underlying storage components. One rule of thumb is to use a block size equal to the size of the RAID stripe. You can set the block size using mkfs's -b parameter. It is also possible to influence the filesystem behaviour with mount command options.
...
The idea of dd is to copy a file from the 'if' location to the 'of' location. Using this tool to measure disk devices requires a trick: to measure write speed, you read data from /dev/zero and write it to a file on the tested device; to measure read performance, you read the data from a file on the tested device and write it to /dev/zero.
In this way we avoid measuring more than one storage system at a time. To measure the time of reading or writing the file we use the time tool. Example commands to test writing and reading 32 GB of data are:
for write performance (note the use of the sync command before the benchmark and after the dd run, so that you are not measuring your operating system's cache performance):
```
[root@sarevok ~]# sync; time (dd if=/dev/zero of=/mnt/md1/test_file.txt bs=1024M count=32; sync)
```
and for reading performance:
```
[root@sarevok ~]# time dd if=/mnt/md1/test_file.txt of=/dev/zero bs=1024M count=32
```
where:
- if - input file/device path
- of - output file/device path
- bs - size of a chunk of data to copy
- count - how many times a chunk defined by bs is copied
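From the elapsed time reported by the time tool, the throughput is simply the amount of data divided by the time. The elapsed time below is an invented example, not a measured result:

```shell
# Hypothetical example: 32 chunks of 1024 MB copied in 128 seconds.
bs_mb=1024
count=32
elapsed_s=128
throughput=$(( bs_mb * count / elapsed_s ))   # MB/s
echo "$throughput"
```

For these made-up numbers the result is 256 MB/s.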
...
To perform one round of the test we can use the command:
```
iozone -T -t $threads -r ${blocksize}k -s ${file_size}G -i 0 -i 1 -i 2 -c -e
```
where:
- -T - Use POSIX pthreads for throughput tests
- -t - how many threads to use for the test
- -r - record (chunk) size used in the test
- -s - test file size. Important: this is the file size PER THREAD, because each thread writes to or reads from its own file.
- -i - test modes: we choose 0 - write/rewrite, 1 - read/reread and 2 - random write/read
- -c - Include close() in the timing calculations
- -e - Include flush (fsync,fflush) in the timing calculations
...
To automate the testing we can write a simple shell script like this:
```
#!/bin/sh
dst_path=/home/sarevok/wyniki_test_iozone
curr_date=`date +%m-%d-%y-%H-%M-%S`

file_size=128
min_blocksize=1
max_blocksize=32

min_queuedepth=1
max_queuedepth=16

mkdir $dst_path
cd /mnt/sdaw/

blocksize=$min_blocksize
while [ $blocksize -le $max_blocksize ]; do
    queuedepth=$min_queuedepth
    while [ $queuedepth -le $max_queuedepth ]; do
        vmstat 1 > $dst_path/vmstat-$blocksize-$queuedepth-$curr_date &
        /root/iozone -T -t $queuedepth -r ${blocksize}k -s ${file_size}G -i 0 -i 1 -c -e > $dst_path/iozone-$blocksize-$queuedepth-$curr_date
        ps ax | grep vmstat | awk '{print $1}' | xargs -i kill {} 2>/dev/null
        queuedepth=`expr $queuedepth \* 2`
        file_size=`expr $file_size / 2`
    done
    blocksize=`expr $blocksize \* 2`
done
```
To start the test script so that it ignores hangup signals, you can use the nohup command:
```
root@sarevok$ nohup script.sh &
```
...
Links:
Practical information:
Real life benchmark requirements in RFPs:
One of the most common uses of storage benchmarking is making sure that the storage system you buy meets your requirements. As always, there are practical limits to how complex a benchmark can be. This section lists benchmark procedures actually used in tenders.
CESNET - ~400TB disk array for HSM system using both FC and SATA disks (2011)
Brief disk array requirements:
- Two types of disk in one disk array, no automatic tiering within the array required (there was an HSM system for doing this on a file level)
- Tier 1 - FC, SAS or SCSI drives, min. 15k RPM, totally min. 50TB consisting of 120x 600GB drives + 3 hot spares
- Tier 2 - SATA drives, min. 7.2k RPM, totally min. 300TB, min. 375x1TB + 10 hot spares OR 188x2TB + 5 hot spares
Performance requirements:
- Sequential: there will be a 10TB cluster filesystem on the disk array using RAID5 or RAID6; this filesystem will be part of the HSM system. The filesystem will be connected to one of the front-end servers (the technical solution of the connection is up to the candidates, e.g. MPIO, number of FC channels, etc., but it must be identical to what is used in the proposal). The following benchmark will be run using iozone v3.347:
iozone -Mce -t200 -s15g -r512k -i0 -i1 -F path_to_files
The result of the test is the average value over three runs of the above command, reported as "Children see throughput for 200 initial writers" and "Children see throughput for 200 readers".
Minimum read speed 1600MB/s, minimum write speed 1200MB/s.
Random:
Same setup of the volume as in the sequential test, but for this test it will be connected without any filesystem (at the block level). The following test will be run on the connected LUN using fio v1.4.1 with this test definition:
```
[global]
description=CESNET_test

[cesnet]
# change it to the name of the block device used
filename=XXXX
rw=randrw
# 70% rand read, 30% rand write
rwmixread=70
size=10000g
ioengine=libaio
bs=2k
runtime=8h
time_based
numjobs=32
group_reporting
# --- end ---
```
The result of the test is the sum of write and read IO operations divided by the total elapsed time of the test in seconds.
Minimum required performance 9000 IOPs.
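The IOPs figure is therefore simple arithmetic over the fio totals. With invented example totals for the 8-hour run (not real tender results), it could be computed like this:

```shell
# Hypothetical totals read from the fio summary - not real results.
read_ops=200000000
write_ops=85000000
elapsed_s=28800                # 8 hours
iops=$(( (read_ops + write_ops) / elapsed_s ))
echo "$iops"
```

In this made-up example the result, 9895 IOPs, would meet the 9000 IOPs requirement.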
Results of the tests required as a part of proposal: YES
Notes after evaluation: the tests themselves were OK, but the test architectures could have been defined a bit better: the tests actually measured only the performance of the FC disks (candidates obviously configured the volume so that it was faster), while the performance of the SATA volumes was not evaluated at all. Also, the winner used RAID5 as required, but placed a big RAID0 volume on top of the 20 individual RAID5s (thus creating RAID50), which was allowed but was not used in production afterwards.
File system benchmarking examples:
...