I haven’t had a chance yet to play with asynchronous I/O APIs in any production
application, so to fill that gap I’ve decided to commit some time to
experimenting with POSIX AIO and liburing. These two are (I think?),
at the time of writing, the two most popular APIs for performing
asynchronous operations on Linux.
This post isn’t a benchmark nor a comparison between the two. It’s just a
summary of the basics behind using these APIs and what I’ve learned
along the way.
POSIX aio
Warning
The aio APIs should not be used for production purposes. There is no native
kernel implementation of these APIs - as a result, glibc implements aio using
thread pools. (The separate kernel-native AIO interface works reliably only in
O_DIRECT mode.) The APIs don’t scale well and their usage may lead to
disappointing performance.
aio comprises a set of APIs very similar to the traditional synchronous
APIs we all know and use all the time. The most prominent ones:
aio_read
aio_write
aio_error
aio_return
aio_suspend
You work with these APIs through a control block (struct aiocb) - a structure
which describes an I/O operation to perform. The general workflow looks the
following way:
- Set up the control block,
- Call aio_read or aio_write with the control block as an argument (this enqueues the I/O operation),
- Wait for the operation to complete (by polling aio_error, suspending the calling thread with aio_suspend, or setting up a signal handler),
- Get the I/O operation’s result using aio_return - this returns exactly the same data a synchronous write or read would return,
- The operation is done. To enqueue more operations, go back to step 1.
Here’s a very simple program showing the basic usage of the aio APIs:
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

const char* statusAsString(int s) {
    switch (s) {
    case EINPROGRESS:
        return "in progress";
    case ECANCELED:
        return "cancelled";
    case 0:
        return "completed";
    }
    return "error";
}

int main() {
    char buffer[256] = {0x00};
    int infd = open("/etc/hostname", O_RDONLY);
    // declare and set up the control block
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = infd;
    cb.aio_buf = buffer;
    cb.aio_nbytes = sizeof(buffer);
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;
    // schedule an asynchronous read
    aio_read(&cb);
    // synchronously wait for the operation to complete
    const struct aiocb *const aiocb_list[] = { &cb };
    aio_suspend(aiocb_list, 1, nullptr);
    close(infd);
    // check the operation status; aio_error returns 0 on success
    // or a positive errno value describing the failure
    int status = aio_error(&cb);
    if (0 != status) {
        fprintf(stderr, "aio_read: %s\n", strerror(status));
        return EXIT_FAILURE;
    }
    // get the operation's result
    int nbytes = aio_return(&cb);
    printf("Aio status: %s, transferred: %d bytes\n",
           statusAsString(status),
           nbytes);
    printf("Contents: [%s]\n", buffer);
    return EXIT_SUCCESS;
}
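As an aside, the polling variant mentioned in the workflow above can be sketched like this. This is a minimal, illustrative helper (readByPolling is my own name, not part of the aio API) that spins on aio_error until the operation leaves the EINPROGRESS state:

```cpp
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Read the first bytes of `path`, polling aio_error until the operation
// is no longer in progress. Returns the data read, empty on failure.
std::string readByPolling(const char *path) {
    char buffer[256] = {0};
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = open(path, O_RDONLY);
    if (cb.aio_fildes < 0) {
        return {};
    }
    cb.aio_buf = buffer;
    cb.aio_nbytes = sizeof(buffer) - 1;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;
    if (aio_read(&cb) != 0) {
        close(cb.aio_fildes);
        return {};
    }
    int status;
    while ((status = aio_error(&cb)) == EINPROGRESS) {
        usleep(1000); // back off instead of spinning hot
    }
    ssize_t n = aio_return(&cb); // consume the result exactly once
    close(cb.aio_fildes);
    return (status == 0 && n > 0) ? std::string(buffer, n) : std::string{};
}
```

In a real application you’d rather not burn a thread on polling, of course.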
Of course, you wouldn’t want to use asynchronous APIs like that - in a
synchronous fashion. This is where completion notifications come into play.
The completion notification is configured via the sigev_notify field within
the control block. There are three values that sigev_notify can be set to:
SIGEV_NONE
- no notification issued on completion,
SIGEV_SIGNAL
- a signal will be raised on completion,
SIGEV_THREAD
- a new thread will be spun up running the function given in sigev_notify_function
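For illustration, here’s roughly what wiring up SIGEV_THREAD would look like. A sketch only - onCompletion and readWithThreadNotification are names I’ve made up for this example:

```cpp
#include <aio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <atomic>

std::atomic<bool> done{false};

// Runs on a thread spawned by the implementation once the I/O completes;
// sigev_value smuggles the control block pointer in.
void onCompletion(sigval sv) {
    auto *cb = static_cast<struct aiocb *>(sv.sival_ptr);
    (void)aio_return(cb); // consume the result
    close(cb->aio_fildes);
    done = true;
}

void readWithThreadNotification(const char *path, char *buf, size_t n) {
    static struct aiocb cb; // must outlive the asynchronous operation
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = open(path, O_RDONLY);
    cb.aio_buf = buf;
    cb.aio_nbytes = n;
    cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb.aio_sigevent.sigev_notify_function = onCompletion;
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;
    aio_read(&cb);
}
```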
The third one sounds most convenient, but spinning up a thread on every I/O
completion is expensive and will kill performance, so the only usable option
is signal notification. This is a bit problematic, as only a limited set of
functions is safe to call from within a signal handler. Looking at man 7
signal-safety, only these from the aio family are allowed:
aio_error(3)
aio_return(3)
aio_suspend(3)
This is a bit too limiting as, most often, on I/O completion you want to
schedule another operation (like scheduling an aio_write with the data
returned by a prior aio_read). There’s a clean way out - signalfd. Have a
look at the example:
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/signalfd.h>
#include <signal.h>
#include <thread>

const char* statusAsString(int s) {
    switch (s) {
    case EINPROGRESS:
        return "in progress";
    case ECANCELED:
        return "cancelled";
    case 0:
        return "completed";
    }
    return "error";
}

int main() {
    char buffer[256] = {0x00};
    int infd = open("/etc/hostname", O_RDONLY);

    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGRTMIN);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGQUIT);
    sigprocmask(SIG_BLOCK, &mask, NULL);
    int sfd = signalfd(-1, &mask, 0);

    std::jthread t{[&](std::stop_token st){
        while (!st.stop_requested()) {
            struct signalfd_siginfo si;
            if (int s = read(sfd, &si, sizeof(si));
                s != sizeof(si)) {
                fprintf(stderr, "error reading signal\n");
                return;
            }
            if (si.ssi_signo == SIGINT || si.ssi_signo == SIGQUIT) {
                return;
            }
            if (static_cast<int>(si.ssi_signo) == SIGRTMIN) {
                struct aiocb* cb =
                    reinterpret_cast<struct aiocb*>(si.ssi_ptr);
                // aio_error returns 0 on success or a positive errno value
                int status = aio_error(cb);
                if (0 != status) {
                    fprintf(stderr, "aio_read: %s\n", strerror(status));
                    return;
                }
                int nbytes = aio_return(cb);
                printf("Aio status: %s, transferred: %d bytes\n",
                       statusAsString(status),
                       nbytes);
                printf("Contents: [%s]\n", buffer);
            }
        }
    }};

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = infd;
    cb.aio_buf = buffer;
    cb.aio_nbytes = sizeof(buffer);
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo = SIGRTMIN;
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;
    aio_read(&cb);

    const struct aiocb *const aiocb_list[] = { &cb };
    aio_suspend(aiocb_list, 1, nullptr);
    close(infd);
    return EXIT_SUCCESS;
}
In the thread reading from signalfd I can call any function I want.
Additionally, it’s worth mentioning that I’m using a real-time signal as the
notification signal for one specific reason - real-time signals are queued.
It’s becoming obvious why aio is not the preferred choice of asynchronous
I/O API. This simple example performs just a single request, yet it requires
significant boilerplate. Additionally, for brevity, I’ve kept error handling
to a minimum.
More complete example with AIO and signalfd
You might remember that I’ve already covered signal handling with signalfd in
one of my previous posts. Back then, I created a simple library - libsignals -
to simplify working with signals and file descriptors. I’m going to use this
library now to implement an example program which is closer in its
completeness to a real application and will allow for some rudimentary
performance testing.
To kick things off, I’ve prepared the following program skeleton:
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string_view>
#include <thread>

#include <signals/factory.h>
#include <signals/scoped_sigset.h>
#include <signals/signal_handler_callbacks.h>
#include <signals/sigset_util.h>

namespace {
const int IoSignal = SIGRTMIN;
} // namespace

void usage(const char *progname) {
    std::cout << progname << ": <pool-size> <bufsize> <input> <output> <mb>"
              << std::endl;
}

class Transfer : public signals::SignalHandlerCallbacks {
public:
    Transfer(std::size_t poolSize, std::size_t bufSize, std::string_view input,
             std::string_view output, std::size_t mb)
        : bufSize{bufSize}, readSchedPos{0}, writeCompPos{0},
          total{mb * 1024 * 1024},
          ops{std::make_shared<signals::RealFdOps>()},
          inFd{open(input.data(), O_RDONLY), ops},
          outFd{open(output.data(), O_WRONLY | O_CREAT, 0644), ops}
    {
    }

    bool onSignal(const signals::SigInfo &si) override {
        return false;
    }

    void initiate() {
        // TODO: to be implemented
    }

    void wait() {
        // TODO: to be implemented
    }

private:
    const std::size_t bufSize;
    std::size_t readSchedPos;
    std::size_t writeCompPos;
    const std::size_t total;
    std::shared_ptr<signals::FdOps> ops;
    signals::ScopedFd inFd;
    signals::ScopedFd outFd;
};

int main(int argc, const char *argv[]) try {
    if (6 != argc) {
        usage(argv[0]);
        return EXIT_SUCCESS;
    }
    std::size_t mbs = std::strtoul(argv[5], nullptr, 10);
    std::size_t poolSize = std::strtoul(argv[1], nullptr, 10);
    std::size_t bufSize = std::strtoul(argv[2], nullptr, 10);
    signals::ScopedSigSet sss{signals::createFullSet()};
    auto transfer =
        std::make_unique<Transfer>(poolSize, bufSize, argv[3], argv[4], mbs);
    auto transferRaw = transfer.get();
    auto factory = std::make_unique<signals::SignalHandlerFactory>();
    auto sigSet = signals::createSet(IoSignal, SIGTERM, SIGINT);
    auto sigHandler = factory->createSignalFdHandler(sigSet, std::move(transfer));
    std::jthread sigThread{[&]() { sigHandler->run(); }};
    transferRaw->initiate();
    transferRaw->wait();
    return EXIT_SUCCESS;
} catch (const std::exception &e) {
    std::cerr << e.what() << std::endl;
} catch (...) {
    std::cerr << "Unknown exception caught" << std::endl;
}
This program takes the following CLI arguments:
- pool-size - the number of pre-allocated control blocks used to perform simultaneous asynchronous I/O operations; in other words, it defines the number of asynchronous operations in flight at the same time,
- bufsize - each control block is associated with its own buffer; this argument defines the buffer size,
- input/output - these two are self-explanatory - paths to the files acting as the data source and destination,
- mbs - the number of megabytes that will be transferred from input to output. I want to define it separately to be able to experiment with /dev/zero and similar pseudo-devices.
I’ve paired the control blocks with buffers to form a descriptor:
class Desc {
public:
    Desc(std::size_t bufSize)
        : buffer{std::make_unique<uint8_t[]>(bufSize)}, bufSize{bufSize},
          isLastRead{false} {
        ::memset(&aiocb, 0, sizeof(aiocb));
        aiocb.aio_buf = buffer.get();
        aiocb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
        aiocb.aio_sigevent.sigev_signo = IoSignal;
        aiocb.aio_sigevent.sigev_value.sival_ptr = this;
    }

    int aioError() const { return aio_error(&aiocb); }
    ssize_t aioReturn() const { return aio_return(&aiocb); }
    bool isComplete() const { return aioError() == 0; }
    bool wasRead() const noexcept { return isLastRead; }

    int aioWrite(int fd, std::size_t n, std::size_t offset) {
        aiocb.aio_fildes = fd;
        aiocb.aio_nbytes = n;
        aiocb.aio_offset = offset;
        isLastRead = false;
        return aio_write(&aiocb);
    }

    int aioRead(int fd, std::size_t offset, std::size_t n) {
        aiocb.aio_fildes = fd;
        aiocb.aio_nbytes = n;
        aiocb.aio_offset = offset;
        isLastRead = true;
        return aio_read(&aiocb);
    }

    std::size_t getSize() const { return aiocb.aio_nbytes; }
    std::size_t getOffset() const { return aiocb.aio_offset; }

    static Desc *fromUserData(void *data) {
        return reinterpret_cast<Desc *>(data);
    }

private:
    mutable struct aiocb aiocb;
    std::unique_ptr<uint8_t[]> buffer;
    const std::size_t bufSize;
    bool isLastRead;
};
Within Transfer, I need a pool of descriptors:
using Descs = std::list<std::unique_ptr<Desc>>;
...
Descs pool;
Now, within initiate, I can start by scheduling an initial set of aio_reads:
void scheduleRead(Desc &desc, std::size_t offset, std::size_t toTransfer) {
    if (0 == toTransfer) {
        return;
    }
    if (auto r = desc.aioRead(inFd.get(), offset, toTransfer); r != 0) {
        throw std::runtime_error("Failed to schedule read transfer");
    }
}

void scheduleRead(Desc &desc) {
    const auto toTransfer = calcTransferSize();
    scheduleRead(desc, readSchedPos, toTransfer);
    readSchedPos += toTransfer;
}

void initiate() {
    std::lock_guard l{m};
    readSchedPos = 0;
    for (auto it = pool.begin(); it != pool.end(); ++it) {
        scheduleRead(*(*it));
    }
}
From this point onwards, the whole transfer is self-sustaining. Once an
aio_read completion signal arrives, I can proceed with an aio_write using the
data in the read’s descriptor. Similarly, an aio_write completion can trigger
another aio_read to maintain the data flow. This spins until the requested
amount of data has been transferred. Here’s how the signal handler looks:
bool onSignal(const signals::SigInfo &si) override {
    bool isComplete = false;
    if (IoSignal == si.sigNo) {
        auto *desc = Desc::fromUserData(si.sentPtr);
        if (!desc) {
            std::cerr << "received signal with no context" << std::endl;
            return false;
        }
        if (!desc->isComplete()) {
            throw std::runtime_error(
                "completion signal received but operation is not complete");
        }
        const auto nBytes = static_cast<std::size_t>(desc->aioReturn());
        if (nBytes == 0) {
            std::cerr << "operation didn't perform any io" << std::endl;
            return false;
        }
        if (desc->wasRead()) {
            // read operation completed, perform a write with the obtained data
            if (nBytes < desc->getSize()) {
                std::cout << "Short read, scheduling partial read" << std::endl;
                auto &newDesc = pool.emplace_back(std::make_unique<Desc>(bufSize));
                auto offset = desc->getOffset() + nBytes;
                auto size = desc->getSize() - nBytes;
                scheduleRead(*newDesc, offset, size);
            }
            scheduleWrite(*desc, nBytes);
        } else {
            // write operation completed, schedule another read
            std::lock_guard l{m};
            writeCompPos += nBytes;
            isComplete = (writeCompPos == total);
            if (!isComplete) {
                if (nBytes < desc->getSize()) {
                    std::cout << "Short write, scheduling partial write" << std::endl;
                    auto offset = desc->getOffset() + nBytes;
                    auto size = desc->getSize() - nBytes;
                    scheduleWrite(*desc, offset, size);
                    auto &newDesc = pool.emplace_back(std::make_unique<Desc>(bufSize));
                    desc = newDesc.get();
                }
                scheduleRead(*desc);
            }
        }
    }
    if (isComplete) {
        cv.notify_one();
    }
    return (SIGTERM == si.sigNo || SIGINT == si.sigNo || isComplete);
}
This code in its entirety can be found here.
io_uring
io_uring was designed with performance in mind. It is the de facto go-to
asynchronous I/O solution for Linux at the time of writing. io_uring is built
around single-producer, single-consumer ring buffers acting as communication
channels between the user-space program and the kernel itself.
There are two ring buffers: one for SQEs (submission queue entries) and one
for CQEs (completion queue entries). In simple terms, you schedule I/O
requests on the submission queue and wait for their completion on the
completion queue - that’s it. No signals, no extra threads spun up on
completion, and so on. You define your threading model yourself.
io_uring’s APIs model the standard synchronous APIs. Before an I/O request is
submitted, it has to be prepared. The io_uring equivalents of the traditional
synchronous read/write APIs are io_uring_prep_read and io_uring_prep_write.
There are APIs for vectorised scatter/gather I/O (readv/writev and
preadv/pwritev) and equivalents for socket operations as well. The API is
quite extensive and I’m only scratching the surface here.
I’m going to focus on the most basic use cases only - just to show the basic
principles behind the APIs. Here’s how to perform a single read:
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main() {
    struct io_uring ring;
    std::size_t size = 4;
    int flags = 0;
    io_uring_queue_init(size, &ring, flags);

    int infd = open("/etc/hostname", O_RDONLY);
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    char buffer[64] = {0x00};
    std::size_t offset = 0;
    io_uring_prep_read(sqe, infd, buffer, sizeof(buffer), offset);
    // You can prepare a batch of requests here and use a single
    // `io_uring_submit` to submit them all at once.
    io_uring_submit(&ring);

    struct io_uring_cqe* cqe = nullptr;
    io_uring_wait_cqe(&ring, &cqe);
    int nBytes = cqe->res;
    if (nBytes < 0) {
        // a negative res carries the negated errno of the failed operation
        fprintf(stderr, "read: %s\n", strerror(-nBytes));
        return -1;
    }
    // mark the cqe as processed
    io_uring_cqe_seen(&ring, cqe);

    printf("Transferred: %d bytes\n", nBytes);
    printf("Contents: %s\n", buffer);
    close(infd);
    io_uring_queue_exit(&ring);
    return 0;
}
For the sake of comparison, I’ve implemented a program similar to the aio one
using io_uring - transferring data between two files. I’ve created a small
RAII wrapper to manage the io_uring queue:
class Uring {
public:
    explicit Uring(int queueSize) : queueSize{queueSize} {
        unsigned flags = 0;
        if (auto r = io_uring_queue_init(queueSize, &ring, flags); r != 0) {
            throw std::runtime_error("Failed to setup queue");
        }
    }
    ~Uring() { io_uring_queue_exit(&ring); }

    struct io_uring *get() noexcept { return &ring; }
    struct io_uring_sqe *getSqe() { return io_uring_get_sqe(&ring); }
    int submit() { return io_uring_submit(&ring); }

private:
    int queueSize;
    struct io_uring ring;
};
In a similar vein, I’ve created a descriptor class which abstracts the I/O operation and its associated buffer:
class Desc {
public:
    explicit Desc(std::size_t bufSize)
        : isRead_{false}, offset{0}, nBytes{0},
          buffer{std::make_unique<uint8_t[]>(bufSize)} {}

    void setIsRead(bool isRead) { this->isRead_ = isRead; }
    void setOffset(std::size_t o) { offset = o; }
    bool isRead() const { return isRead_; }
    std::size_t getOffset() const { return offset; }
    uint8_t *get() { return buffer.get(); }
    std::size_t getSize() const { return nBytes; }
    void setSize(std::size_t s) { nBytes = s; }

private:
    bool isRead_;
    std::size_t offset;
    std::size_t nBytes;
    std::unique_ptr<uint8_t[]> buffer;
};
Within the Transfer class, as previously, I’m kicking off a set of async operations from the get-go:
void initiate() {
    for (auto &desc : descs) {
        scheduleRead(desc);
    }
    if (ur.submit() < 0) {
        throw std::runtime_error("Failed submitting read events");
    }
}

bool scheduleRead(Desc &desc, std::size_t offset, std::size_t toTransfer) {
    if (0 == toTransfer) {
        return false;
    }
    if (offset > total) {
        std::stringstream ss;
        ss << "Read offset (" << offset << ") outside of transfer size (" << total
           << ")";
        throw std::runtime_error(ss.str());
    }
    struct io_uring_sqe *sqe = ur.getSqe();
    if (nullptr == sqe) {
        throw std::runtime_error(
            "Failed to obtain submit queue event for reading");
    }
    desc.setIsRead(true);
    desc.setOffset(offset);
    desc.setSize(toTransfer);
    io_uring_prep_read(sqe, inFd.get(), desc.get(), toTransfer, offset);
    io_uring_sqe_set_data(sqe, &desc);
    return true;
}

bool scheduleRead(Desc &desc) {
    const auto toTransfer = calcTransferSize();
    auto r = scheduleRead(desc, readPos, toTransfer);
    if (readPos < total) {
        readPos += toTransfer;
    }
    return r;
}

std::size_t calcTransferSize() const {
    const auto remainingTotal = total - readPos;
    return std::min(remainingTotal, bufSize);
}
Completion handling and follow-up requests are now done on the same thread, in wait:
void wait() {
    while (writtenPos < total) {
        struct io_uring_cqe *cqe;
        if (auto r = io_uring_wait_cqe(ur.get(), &cqe); r != 0) {
            throw std::runtime_error("Failed waiting for cqe");
        }
        Desc *desc = reinterpret_cast<Desc *>(io_uring_cqe_get_data(cqe));
        if (nullptr == desc) {
            throw std::runtime_error("CQE has no user data");
        }
        if (cqe->res < 0) {
            std::stringstream ss;
            ss << "Read transfer failed: " << ::strerror(std::abs(cqe->res));
            throw std::runtime_error(ss.str());
        }
        const auto nBytes = static_cast<std::size_t>(cqe->res);
        if (0 == nBytes) {
            io_uring_cqe_seen(ur.get(), cqe);
            continue;
        }
        if (desc->isRead()) {
            if (nBytes < desc->getSize()) {
                // short read
                Desc &newDesc = descs.emplace_back(bufSize);
                const auto offset = desc->getOffset() + nBytes;
                const auto remaining = desc->getSize() - nBytes;
                scheduleRead(newDesc, offset, remaining);
            }
            scheduleWrite(*desc, nBytes);
            if (ur.submit() < 0) {
                throw std::runtime_error("Failed submitting write event");
            }
        } else {
            // Write operation completed.
            writtenPos += nBytes;
            if (nBytes < desc->getSize()) {
                // short write
                Desc &newDesc = descs.emplace_back(bufSize);
                const auto offset = desc->getOffset() + nBytes;
                const auto remaining = desc->getSize() - nBytes;
                scheduleWrite(newDesc, offset, remaining);
            }
            if (scheduleRead(*desc)) {
                if (ur.submit() < 0) {
                    throw std::runtime_error("Failed submitting read event");
                }
            }
        }
        io_uring_cqe_seen(ur.get(), cqe);
    } // while
}

void scheduleWrite(Desc &desc, std::size_t offset, std::size_t toTransfer) {
    struct io_uring_sqe *sqe = ur.getSqe();
    if (nullptr == sqe) {
        throw std::runtime_error(
            "Failed to obtain submit queue event for writing");
    }
    if (offset > total) {
        std::stringstream ss;
        ss << "Write offset (" << offset << ") outside of transfer size ("
           << total << ")";
        throw std::runtime_error(ss.str());
    }
    desc.setIsRead(false);
    desc.setSize(toTransfer);
    io_uring_prep_write(sqe, outFd.get(), desc.get(), toTransfer, offset);
    io_uring_sqe_set_data(sqe, &desc);
}

void scheduleWrite(Desc &desc, std::size_t toTransfer) {
    scheduleWrite(desc, desc.getOffset(), toTransfer);
}
For comparison, the entire program is available on gitlab as
well.
Splicing
The test programs I’ve written using the aio and io_uring APIs could be
optimised to avoid copying data buffers between kernel and user space. That’s
what splice(2) is for.
splice transfers data between two file descriptors but requires one of them
to be a pipe(2) file descriptor. A synchronous version of my test program
could be written like so (error handling omitted for brevity):
void transfer(std::string_view input, std::string_view output,
              std::size_t chunkSize, std::size_t mb) {
    int pipefd[2];
    int infd = open(input.data(), O_RDONLY);
    int outfd = open(output.data(), O_WRONLY);
    off_t readPos = 0;
    off_t writePos = 0;
    const size_t total = mb * 1024 * 1024;
    pipe2(pipefd, O_CLOEXEC);
    while (static_cast<std::size_t>(writePos) < total) {
        // splice advances readPos/writePos itself when given non-NULL
        // offset pointers
        splice(infd, &readPos, pipefd[1], nullptr, chunkSize, 0);
        splice(pipefd[0], nullptr, outfd, &writePos, chunkSize, 0);
    }
    // close the pipe
    close(pipefd[0]);
    close(pipefd[1]);
    // close the file descriptors
    close(infd);
    close(outfd);
}
There’s also copy_file_range(2), which transfers data between file
descriptors and doesn’t need a pipe, but I’m not going to focus on it, as
there’s no asynchronous equivalent available in io_uring (at least as far as
I know).
io_uring provides an implementation of asynchronous splice. Writing an async
version using splice is a bit tricky, though. In essence, splice reads data
from the file and then shares the underlying kernel buffer with the pipe. In
other words, there’s now a single buffer, which is append-only. Reading
concurrently from multiple offsets at once isn’t possible, as the ordering of
the data in the pipe might not be preserved. This means the asynchronous
version only decouples the act of starting the I/O from its completion but
functionally remains sequential to maintain proper ordering.
That being said, the differences between my first io_uring example and the
one using splice are rather limited. Within initiate I’m initiating just one
transfer:
void initiate() {
    scheduleRead(descs.front());
    if (ur.submit() < 0) {
        throw std::runtime_error("Failed submitting read events");
    }
}
The code within scheduleRead now uses splice:
bool scheduleRead(Desc &desc, std::size_t offset, std::size_t toTransfer) {
    if (0 == toTransfer) {
        return false;
    }
    if (offset > total) {
        std::stringstream ss;
        ss << "Read offset (" << offset << ") outside of transfer size (" << total
           << ")";
        throw std::runtime_error(ss.str());
    }
    struct io_uring_sqe *sqe = ur.getSqe();
    if (nullptr == sqe) {
        throw std::runtime_error(
            "Failed to obtain submit queue event for reading");
    }
    desc.setIsRead(true);
    desc.setOffset(offset);
    desc.setSize(toTransfer);
    io_uring_prep_splice(sqe, inFd.get(), offset, pipe.getWriteEnd().get(), -1,
                         toTransfer, 0);
    io_uring_sqe_set_data(sqe, &desc);
    return true;
}

bool scheduleRead(Desc &desc) {
    const auto toTransfer = calcTransferSize();
    auto r = scheduleRead(desc, readPos, toTransfer);
    return r;
}
There are similar changes within scheduleWrite, but I advise you to refer to
the full source for more details.
Basic benchmarks
I need to start with a short disclaimer. I’m running these tests on a
relatively old machine (i5-6300U) with a SATA3 SSD, so these results may look
underwhelming compared to modern hardware. Additionally, the drive is
LUKS-encrypted which certainly impacts write performance noticeably on top
of that.
Let’s start with a baseline. I’ve chosen a block size of 64k to correspond
with the CPU’s L1 cache size. Here are the results I’m going to compare my
tests against.
$ dd if=/dev/zero of=/dev/null bs=$((64*1024)) count=$((4*16*1024)) status=progress
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 0.366778 s, 11.7 GB/s
Write performance:
dd if=/dev/zero of=data4g bs=$((64*1024)) count=$((4*16*1024)) status=progress
4065067008 bytes (4.1 GB, 3.8 GiB) copied, 6 s, 723 MB/s
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 6.18258 s, 695 MB/s
Read performance:
dd if=data4g of=/dev/null bs=$((64*1024)) count=$((4*16*1024)) status=progress
4151771136 bytes (4.2 GB, 3.9 GiB) copied, 8 s, 519 MB/s
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 8.31217 s, 517 MB/s
Read/write performance:
dd if=data4g of=data4g-2 bs=$((64*1024)) status=progress
4290052096 bytes (4.3 GB, 4.0 GiB) copied, 19 s, 226 MB/s
65536+0 records in
65536+0 records out
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 20.9207 s, 205 MB/s
I’m dropping the caches between every test!
echo 3 | sudo tee /proc/sys/vm/drop_caches
Expected results
The underlying performance should remain the same regardless of the API used,
as the workload should remain I/O-device-bound (bound by the SSD, in the case
of pseudo-device-to-SSD transfers or transfers between two files on the SSD).
I guess what I’m trying to verify is whether using one API introduces any
difference in CPU overhead compared to the others.
Reading from /dev/zero and writing to /dev/null throughput
AIO
Test command (this transfers 16GB between /dev/zero and /dev/null in chunks
of 64KB with at least 32 I/O operations scheduled at once - I’ve fine-tuned
the parameters to reach the best results):
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
/usr/bin/time -v ./bld/aio_example 32 $((64*1024)) /dev/zero /dev/null $((16*1024))
Results:
Transfer rate: 1820.44 MB/s
Command being timed: "./bld/aio_example 32 65536 /dev/zero /dev/null 16384"
User time (seconds): 6.61
System time (seconds): 9.51
Percent of CPU this job got: 170%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.44
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6104
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 8
Minor (reclaiming a frame) page faults: 687
Voluntary context switches: 921456
Involuntary context switches: 1192
Swaps: 0
File system inputs: 1528
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This is a bit disappointing. Just shy of 2 GB/s, compared to a similar test
done with dd which yielded almost 12 GB/s.
io_uring
Test command (this transfers 32GB between /dev/zero and /dev/null in chunks
of 64KB with at least 8 requests prepared and submitted to the submission
queue at once - I’ve fine-tuned the parameters to reach the best results):
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
/usr/bin/time -v ./bld/io_uring_example 8 $((64*1024)) /dev/zero /dev/null $((32*1024))
Results:
Transfer rate: 10922.7 MB/s
Command being timed: "./bld/io_uring_example 8 65536 /dev/zero /dev/null 32768"
User time (seconds): 0.85
System time (seconds): 2.69
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.56
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5856
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 11
Minor (reclaiming a frame) page faults: 635
Voluntary context switches: 22
Involuntary context switches: 40
Swaps: 0
File system inputs: 2312
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This is much better and comparable to the baseline in terms of throughput.
The test programs were not written with extra care for optimisation, so some
performance degradation is completely expected.
The time results are interesting as well when compared to the aio test:
significantly fewer context switches - most likely due to the lack of a
signal handler and an extra thread, and overall the fact that the app is
much simpler.
io_uring splice
splice doesn’t work with /dev/zero and yields EINVAL.
Reading from /dev/zero and writing to SSD throughput
AIO
Test command (this transfers 4GB from /dev/zero to the SSD in chunks of 64KB
with at least 32 I/O operations scheduled at once - I’ve fine-tuned the
parameters to reach the best results):
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
truncate -s 0 data4g && \
/usr/bin/time -v ./bld/aio_example 32 $((64*1024)) /dev/zero data4g $((4*1024))
Results:
Transfer rate: 455.111 MB/s
Command being timed: "./bld/aio_example 32 65536 /dev/zero data4g 4096"
User time (seconds): 1.34
System time (seconds): 5.07
Percent of CPU this job got: 69%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:09.30
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7868
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 8
Minor (reclaiming a frame) page faults: 685
Voluntary context switches: 181605
Involuntary context switches: 5091
Swaps: 0
File system inputs: 1512
File system outputs: 8388608
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
io_uring
Test command (this transfers 8GB from /dev/zero to data8g on the SSD in
chunks of 64KB with at least 8 requests prepared and submitted to the
submission queue at once - I’ve fine-tuned the parameters to reach the best
results):
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
truncate -s 0 data8g && \
/usr/bin/time -v ./bld/io_uring_example 8 $((64*1024)) /dev/zero data8g $((8*1024))
Results:
Transfer rate: 431.158 MB/s
Command being timed: "./bld/io_uring_example 8 65536 /dev/zero data8g 8192"
User time (seconds): 1.24
System time (seconds): 10.75
Percent of CPU this job got: 62%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:19.19
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 738912
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 8
Minor (reclaiming a frame) page faults: 183917
Voluntary context switches: 155569
Involuntary context switches: 14985
Swaps: 0
File system inputs: 2176
File system outputs: 16777216
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
This is a bit slower than the baseline but it basically saturates the SSD
bandwidth, which, as expected, is the bottleneck here. What stands out is the
maximum resident set size (RSS), which is almost 100x bigger than for the aio
test program (I'm really not sure why).
io_uring splice
splice doesn't work with /dev/zero and yields EINVAL.
Reading from SSD and writing to /dev/null throughput
AIO
Test command (this transfers 16GB from an SSD file to /dev/null in 64KB
chunks, keeping at least 32 I/O operations scheduled at once - I've fine-tuned
the parameters to reach the best results).
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
/usr/bin/time -v ./bld/aio_example 32 $((64*1024)) data16g /dev/null $((16*1024))
Results:
Transfer rate: 496.485 MB/s
Command being timed: "./bld/aio_example 32 65536 data16g /dev/null 16384"
User time (seconds): 5.99
System time (seconds): 16.15
Percent of CPU this job got: 66%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:33.48
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 8
Minor (reclaiming a frame) page faults: 679
Voluntary context switches: 861580
Involuntary context switches: 720
Swaps: 0
File system inputs: 33552112
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
The results are a bit lower than the baseline. It's a similar story as before
with regard to the rest of the measurements - high CPU utilisation, lots of
context switches, etc.
io_uring
Test command (this copies 16GB from the SSD to /dev/null in 64KB chunks,
keeping at least 8 requests prepared and submitted to the submission queue at
once - I've fine-tuned the parameters to reach the best results).
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
/usr/bin/time -v ./bld/io_uring_example 8 $((64*1024)) data16g /dev/null $((16*1024))
Results:
Transfer rate: 528.516 MB/s
Command being timed: "./bld/io_uring_example 8 65536 data16g /dev/null 16384"
User time (seconds): 0.76
System time (seconds): 7.91
Percent of CPU this job got: 27%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:31.14
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4440
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 11
Minor (reclaiming a frame) page faults: 282
Voluntary context switches: 129397
Involuntary context switches: 362
Swaps: 0
File system inputs: 33557016
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
As in the converse case (writing from /dev/zero to the SSD), this is
SSD-bound, which is confirmed by the low CPU utilisation. This time the RSS is
comparable to aio's.
io_uring splice
Test command (this copies 16GB from the SSD to /dev/null in 64KB chunks using
spliced requests - I've fine-tuned the parameters to reach the best results).
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
/usr/bin/time -v ./bld/io_uring_splice_example $((64*1024)) data16g /dev/null $((16*1024))
Results:
Transfer rate: 481.882 MB/s
Command being timed: "./bld/io_uring_splice_example 65536 data16g /dev/null 16384"
User time (seconds): 1.44
System time (seconds): 11.17
Percent of CPU this job got: 36%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:34.52
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3812
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 8
Minor (reclaiming a frame) page faults: 160
Voluntary context switches: 1169536
Involuntary context switches: 9612
Swaps: 0
File system inputs: 33556488
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
It's worth noting that the majority of the CPU time was spent in the kernel
(system time).
Reading from SSD and writing to SSD throughput
In all cases the throughput is, as expected, halved - each copied byte has to
be both read from and written to the same device - and again the tests are
fully limited by the SSD itself.
AIO
Test command (this copies 4GB between two SSD files in 64KB chunks, keeping at
least 32 I/O operations scheduled at once - I've fine-tuned the parameters to
reach the best results).
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
truncate -s 0 data4g && \
/usr/bin/time -v ./bld/aio_example 32 $((64*1024)) data8g data4g $((4*1024))
Results:
Transfer rate: 240.941 MB/s
Command being timed: "./bld/aio_example 32 65536 data8g data4g 4096"
User time (seconds): 1.77
System time (seconds): 8.79
Percent of CPU this job got: 60%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.54
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 6092
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 8
Minor (reclaiming a frame) page faults: 678
Voluntary context switches: 202819
Involuntary context switches: 1035
Swaps: 0
File system inputs: 8390648
File system outputs: 8388608
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
io_uring
Test command (this copies 4GB between two files on the SSD in 64KB chunks,
keeping at least 8 requests prepared and submitted to the submission queue at
once - I've fine-tuned the parameters to reach the best results).
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
truncate -s 0 data4g && \
/usr/bin/time -v ./bld/io_uring_example 8 $((64*1024)) data8g data4g $((4*1024))
Results:
Transfer rate: 256 MB/s
Command being timed: "./bld/io_uring_example 8 65536 data8g data4g 4096"
User time (seconds): 0.38
System time (seconds): 8.78
Percent of CPU this job got: 54%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.90
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4448
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 11
Minor (reclaiming a frame) page faults: 282
Voluntary context switches: 90801
Involuntary context switches: 1102
Swaps: 0
File system inputs: 8391464
File system outputs: 8388608
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
io_uring splice
Test command (this copies 4GB between two files on the SSD in 64KB chunks
using spliced requests - I've fine-tuned the parameters to reach the best
results).
echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null && \
truncate -s 0 data4g && \
/usr/bin/time -v ./bld/io_uring_splice_example $((64*1024)) data8g data4g $((4*1024))
Results:
Transfer rate: 240.941 MB/s
Command being timed: "./bld/io_uring_splice_example 65536 data8g data4g 4096"
User time (seconds): 0.38
System time (seconds): 6.74
Percent of CPU this job got: 40%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.67
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4064
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 11
Minor (reclaiming a frame) page faults: 155
Voluntary context switches: 293094
Involuntary context switches: 352
Swaps: 0
File system inputs: 8391400
File system outputs: 8388608
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Conclusion
As initially stated, I haven't yet had a practical need to use any of these
APIs in a production solution. It's good to know, though, how each of these
APIs works and how to use them, as well as to have some results from hands-on
experiments.
I think I've managed to confirm that AIO is probably not the best candidate
API for asynchronous I/O outside the realm of experimentation, and that
io_uring is something I'd go for. The io_uring APIs are rich, easy to use and
indeed yield good performance.
As the main use-case for these APIs is really high-performance network
applications, it would be useful to gain some practical experience in that
area through experimentation as well, but that's a topic for another day :).
As always, example programs are available on gitlab.
References
[1] https://unixism.net/loti/index.html