
io_uring is slower than epoll #189

Open
ghost opened this issue Aug 30, 2020 · 155 comments

@ghost

ghost commented Aug 30, 2020

EDIT: I have made available a detailed benchmark against epoll that shows this in a reliable way: https://github.com/alexhultman/io_uring_epoll_benchmark


Don't get me wrong: since I heard about io_uring I've been all about trying to scientifically reproduce any claim that it improves performance (theoretically it should, since you can have fewer syscalls, and I like that idea). I would love to add support for io_uring in my projects, but I won't touch it until someone can scientifically show me proof of it significantly outperforming epoll.

I've tested three different examples claiming to be faster than epoll, tested on Linux 5.7 without Spectre mitigations and on Linux 5.8 with Spectre mitigations. Tested Clear Linux, tested Fedora. All tests point towards epoll being faster for my kind of test.

What I have is a simple TCP echo server echoing small chunks of data for 40-400 clients. In all tests epoll is performing better by a measurable amount. In no single test have I seen io_uring beat epoll in this kind of test.

Some io_uring examples are horribly slow, while others are quite close to epoll performance. The closest I have seen it get is this one: https://github.com/frevib/io_uring-echo-server compared to this one: https://github.com/frevib/epoll-echo-server

I have tested on two separate and different machines, both with the same outcome: epoll wins.

Can someone enlighten me on how I can get io_uring to outperform epoll in this case?
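
For reference, the io_uring side of such an echo server is roughly shaped like this. This is a minimal single-connection sketch, not the exact code of any of the linked servers (those multiplex many fds on one ring); names and sizes are illustrative:

#include <liburing.h>

#define BUF_SZ 512

/* Echo small chunks back on one connected socket until it closes.
 * Error handling is trimmed for brevity. */
static void echo_loop(struct io_uring *ring, int client_fd)
{
    static char buf[BUF_SZ];
    struct io_uring_cqe *cqe;

    for (;;) {
        /* queue a recv and wait for its completion */
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, client_fd, buf, BUF_SZ, 0);
        io_uring_submit_and_wait(ring, 1);
        io_uring_wait_cqe(ring, &cqe);
        int n = cqe->res;
        io_uring_cqe_seen(ring, cqe);
        if (n <= 0)
            break;

        /* echo the chunk back */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_send(sqe, client_fd, buf, n, 0);
        io_uring_submit_and_wait(ring, 1);
        io_uring_wait_cqe(ring, &cqe);
        io_uring_cqe_seen(ring, cqe);
    }
}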

@ghost
Author

ghost commented Aug 30, 2020

@isilence
Collaborator

isilence commented Sep 2, 2020

That's definitely a great thing to do. I don't know whether anybody has time for that at the moment, though.
io_uring-echo-server looks much bulkier than the last time I saw it.

@InternalHigh

Maybe your benchmark is full of errors

@ghost
Author

ghost commented Sep 13, 2020

Maybe your benchmark is full of errors

It's not my benchmark, and the benchmark I reference is the one @axboe himself has referenced on Twitter, which suggests he considers it valid. That's why I would like @axboe himself to write a benchmark without issues so that we can actually prove this thing works better.

@markpapadakis
Contributor

markpapadakis commented Sep 13, 2020 via email

@Qix-

Qix- commented Sep 30, 2020

@markpapadakis Have you taken a look at the benchmarks that are quoted above? They use many connections at once, thus many FDs at once. uring still consistently performs worse, in conditions that mimic production usage well enough without actually being production.

Part of the scientific process is being able to reproduce claims like those being made (60%+ increase in performance over epoll; apparently that was 99% at one point, but bugs were found). These are metrics that @axboe has not refuted and has seemingly even confirmed, especially by promoting it on Twitter and urging others to buy in (e.g. Netty).

Even if this were a 5% increase, I'd be for it. However, I simply do not understand where these results are coming from, and everyone on Twitter seems more interested in patting themselves on the back than in addressing criticism (sorry if that sounds harsh).

Outrageous claims require outrageous evidence, especially when it comes to claims that could completely renovate the space.

@ghost

ghost commented Sep 30, 2020

@markpapadakis We all know the theory. It is not that hard to understand, quite basic actually.

But theory means nothing if reality doesn't match the theorized conclusions (any scientist ever).

I would be happy and excited if io_uring could improve performance, but so far no benchmark can show this.

All I'm asking for is scientific proof. I am a scientist, not a believer.

@ghost

ghost commented Sep 30, 2020

I think it is time to take this to the Linux kernel mailing list since @axboe has ignored this criticism entirely.

@Qix-

Qix- commented Sep 30, 2020

@alexhultman Check netty/netty#10622 - I won't have any bandwidth for the next day or two, but maybe that will pique your interest.

@InternalHigh

I think that @axboe knows that uring is slower. Otherwise he would answer.

@axboe
Owner

axboe commented Sep 30, 2020

I have not ignored any of this, but I've been way too busy on other items. And frankly, the way the tone has shifted in here, it's not really providing much impetus to engage. I did run the echo server benchmarks back then, and it did reproduce for me. Haven't looked at it since, it's not like I daily run echo server benchmarks.

So in short, I'm of course interested in cases where io_uring isn't performing to its full potential, it most certainly should. I have a feeling that #215 might explain some of these. Once I get myself out from under the pressure I'm at now, I'll re-run the benchmarks myself.

@romange
Contributor

romange commented Sep 30, 2020

I can testify that io_uring is much faster than epoll. Please use kernel 5.7.15 for your benchmarks.
This server https://github.com/romange/gaia/tree/master/examples/pingserver reaches 3M qps on a single instance for redis-benchmark (ping_inline API) on c5n ec2 instances.

@ghost
Author

ghost commented Dec 7, 2020

I have made available a new, detailed benchmark that shows io_uring is reliably slower than epoll:

https://github.com/alexhultman/io_uring_epoll_benchmark

I can testify that io_uring is much faster than epoll. Please use kernel 5.7.15 for your benchmarks.
This server https://github.com/romange/gaia/tree/master/examples/pingserver reaches 3M qps on a single instance for redis-benchmark (ping_inline API) on c5n ec2 instances.

And where is your 1-to-1 comparison with epoll? Or do you just go by a feeling of "high numbers"? See my benchmark for a 1-to-1 comparison.

@ghost
Author

ghost commented Dec 7, 2020

For what it's worth, there is a whole lot more to it than a trivial one-to-one comparison:
- You save many syscalls that would otherwise be required to modify the epoll kernel state (adding and removing monitored FDs, updating per-FD state, etc.). This becomes extremely important as the number of managed FDs increases - exponentially so.
- You can register FDs (e.g. listener FDs and long-lived connection FDs), which means you don't have to incur the cost of the kernel looking up the file struct and checking for access every time.
- And, obviously, you are not limited to monitoring just network sockets. You can do so much more.

And where is your benchmark for this claim? I know the theory very well, but you're just assuming the theory holds here because it must. See my posted benchmark - it shows the complete opposite of what you claim.

The benchmark I have posted performs ZERO syscalls and does ZERO copies, yet epoll wins reliably despite doing millions of syscalls and performing copies in every syscall.
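
For readers following along, the "register FDs" point in the quoted text refers to io_uring's fixed files; a minimal liburing sketch (the ring, fds and buffer are placeholders declared elsewhere, and the indices are illustrative):

/* Register fds once; later SQEs refer to them by index with
 * IOSQE_FIXED_FILE, skipping the per-call file-table lookup. */
int fds[2] = { listen_fd, client_fd };   /* illustrative fds */
io_uring_register_files(&ring, fds, 2);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, 1 /* index of client_fd */, buf, sizeof(buf), 0);
io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);
io_uring_submit(&ring);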

@ghost
Author

ghost commented Dec 7, 2020

@axboe

So in short, I'm of course interested in cases where io_uring isn't performing to its full potential, it most certainly should.

Would you look at my new benchmark? I have eliminated everything but epoll and io_uring, and on both my machines epoll wins despite io_uring being SQ-polled with 0 syscalls and using pre-registered file descriptors and buffers. I'm not involving any networking at all.

strace shows the epoll case making millions of syscalls while the io_uring case is entirely silent in the syscalling department.

What am I doing wrong / why is io_uring not performing?
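
For reference, the zero-syscall setup being described is roughly SQPOLL plus registered files and buffers; a minimal sketch, not the benchmark's exact code (fds, iovecs, nr_pipes, the queue depth and the idle timeout are illustrative placeholders):

#include <liburing.h>
#include <string.h>

/* Error handling trimmed for brevity. */
static void setup_sqpoll_ring(struct io_uring *ring, int *fds,
                              struct iovec *iovecs, unsigned nr_pipes)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;   /* a kernel thread polls the SQ */
    p.sq_thread_idle = 2000;         /* ms of idle before the poller sleeps */
    io_uring_queue_init_params(256, ring, &p);

    /* Fixed files and fixed buffers: no fd lookup and no per-op buffer
     * mapping during submission. */
    io_uring_register_files(ring, fds, nr_pipes * 2);
    io_uring_register_buffers(ring, iovecs, nr_pipes);

    /* While the SQ poller thread is awake, io_uring_submit() only writes
     * to the shared ring memory; no io_uring_enter() syscall is made. */
}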

@ghost
Author

ghost commented Dec 7, 2020

@alexhultman Check netty/netty#10622 - I won't have any bandwidth for the next day or two, but maybe that will pique your interest.

The fact that a virtualized, garbage-collected, JIT-stuttery Java project can see performance improvements when swapping from epoll to io_uring is not viable proof of io_uring itself (as a kernel feature) being more performant than epoll itself (as a kernel feature). It really just proves that writing systems in non-systems programming languages is going to get you poor results.

io_uring does more things in the kernel, meaning a swap from epoll to io_uring leads to less happening in Java. As a general rule of thumb: the less you do in high-level garbage-collected virtualized code, the better.

@romange
Contributor

romange commented Dec 7, 2020

@alexhultman which kernel version did you use?

@ghost
Author

ghost commented Dec 7, 2020

It is clearly stated in the posted text: 5.9.9.

@santigimeno

@alexhultman Running your tests locally on a 5.10-rc5 kernel, it seems I'm seeing io_uring behave better, or am I reading it wrong?:

$  ./epoll 1000
Pipes: 1000
Time: 16.059609

$ sudo ./io_uring 1000
Pipes: 1000
Time: 10.984051

$ ./epoll 1500
Pipes: 1500
Time: 24.501726

$ sudo ./io_uring 1500
Pipes: 1500
Time: 18.112729

$ ./epoll 2000
Pipes: 2000
Time: 37.705230

$ sudo ./io_uring 2000
Pipes: 2000
Time: 26.174995

@romange
Contributor

romange commented Dec 7, 2020

First of all, there is a bug that hurts io_uring performance; its fix will be merged into 5.11.

Secondly, I've looked at your benchmark code, and you test something that is not necessarily relevant nor optimal for the networking use-case.

  1. You test IORING_SETUP_SQPOLL mode - I did not succeed in getting any performance gain there with sockets. In fact, it was consistently worse than using non-polling mode.

  2. You test a single epoll/io_uring loop, which does not trigger contention edge-cases inside the kernel. When you have N cores running N epoll loops doing reads/writes over sockets and you put 100% load on your machine, you will see how io_uring performs better.

Finally, I've never tried using pipes in my tests. io_uring essentially delegates requests to their corresponding APIs. So if, for example, the pipe kernel code takes most of the CPU, you may not see much difference between io_uring and epoll.

@Qix-

Qix- commented Dec 7, 2020

Also, just to add (note that I was a skeptic above as well): Hypervisors tend to level out epoll vs io_uring performance benchmarks in my findings - if you're within a VM, expect io_uring to have less than ideal performance increases against epoll. Testing on bare metal seems to make quite a bit of difference depending on how you're using io_uring.

First of all, there is a bug that hurts io_uring performance; its fix will be merged into 5.11.

@romange Is that already on the io_uring branch of mainline? Just curious.

@martin-g

martin-g commented Dec 7, 2020

Here are the results for @alexhultman 's benchmark on my machine (MyTuxedo laptop, Intel i7-7700HQ (8) @ 3.800GHz, 32GB RAM, Ubuntu 20.10 x86_64, Kernel: 5.10.0-051000rc6-generic)

 make runs
rm -f epoll_runs
rm -f io_uring_runs
for i in `seq 100 100 1000`; do ./io_uring $i; done
Pipes: 100
Time: 0.908056
Pipes: 200
Time: 2.063551
Pipes: 300
Time: 3.183146
Pipes: 400
Time: 4.810344
Pipes: 500
Time: 5.609743
Pipes: 600
Time: 8.197645
Pipes: 700
Time: 10.275732
Pipes: 800
Time: 11.889881
Pipes: 900
Time: 15.030963
Pipes: 1000
Time: 15.421023

for i in `seq 100 100 1000`; do ./epoll $i; done
Pipes: 100
Time: 1.575792
Pipes: 200
Time: 3.173769
Pipes: 300
Time: 5.173567
Pipes: 400
Time: 7.255583
Pipes: 500
Time: 10.283918
Pipes: 600
Time: 12.986523
Pipes: 700
Time: 14.560208
Pipes: 800
Time: 17.426127
Pipes: 900
Time: 19.796715
Pipes: 1000
Time: 23.262279

io_uring gives better results!

@romange
Contributor

romange commented Dec 7, 2020

Also, just to add (note that I was a skeptic above as well): Hypervisors tend to level out epoll vs io_uring performance benchmarks in my findings - if you're within a VM, expect io_uring to have less than ideal performance increases against epoll. Testing on bare metal seems to make quite a bit of difference depending on how you're using io_uring.

First of all, there is a bug that hurts io_uring performance; its fix will be merged into 5.11.

@romange Is that already on the io_uring branch of mainline? Just curious.

I do not think it's on the io_uring branch, because the fix does not reside in the io_uring code.
Here is the relevant article: https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.11-Task-Work-Opt

@YoSTEALTH
Contributor

$ make
gcc -O3 epoll.c -o epoll
gcc -O3 io_uring.c /usr/lib/liburing.a -o io_uring
gcc: error: /usr/lib/liburing.a: No such file or directory
make: *** [Makefile:3: default] Error 1

@romange
Contributor

romange commented Dec 7, 2020

$ make
gcc -O3 epoll.c -o epoll
gcc -O3 io_uring.c /usr/lib/liburing.a -o io_uring
gcc: error: /usr/lib/liburing.a: No such file or directory
make: *** [Makefile:3: default] Error 1
git submodule update --init --recursive 
cd liburing/
./configure && make
sudo make install

@ghost
Author

ghost commented Dec 7, 2020

@martin-g @santigimeno That's very interesting - thanks for reporting! I will do some more testing on newer kernels and see if I can finally get to see this supposed io_uring wonder myself.

@ghost
Author

ghost commented Dec 7, 2020

@romange You do a lot of confident talking but you still refuse to follow up with any actual testing of your claims.

  1. I have tested all modes; the mode without SQ polling had insignificant differences in performance, and because it caused syscalls to appear, I wanted to use SQPOLL, since that is where all the fuss about io_uring is.

  2. You haven't tested this, you just assume. I did an actual test of this and got results that significantly contradict your confident assumption.

Please test before making assumptions. This entire thread is about this exact behavior - show actual numbers like @santigimeno and @martin-g did.

@axboe
Owner

axboe commented Dec 7, 2020

@alexhultman Please tone down the snark. FWIW, @romange has done plenty of testing in the past, and was instrumental in finding the task work related signal slowdown for threads. Questioning the validity of a test is perfectly valid, in fact it's the very first thing that should be done before potentially wasting any time on examining results from said test. Nobody is in a position to be making any demands in here, in my experience you get a lot further by being welcoming and courteous.

@ghost
Author

ghost commented Dec 7, 2020

When you have N cores running N epoll loops doing reads/writes over sockets and you put 100% load on your machine, you will see how io_uring performs better.

The above is not a question; it is a confident statement without any backing proof other than "it will be the case". This is the issue here - blindly making claims without any backing proof other than "listen to my assumption".

@isilence
Collaborator

It's not spending a lot of time in copying

@alexhultman, what is that judgement based on? Unfortunately, the traces are useless because they don't show the kernel side. Probably the kernel is not configured for that - missing debug info or something else. The call first ends up in io_uring, but then it calls into the pipe code, which does the memcpy. So even if the cumulative time for the io_uring syscall is large, it doesn't mean a lot of time is spent in io_uring itself.

@isilence
Collaborator

FWIW, the case tested executes everything during submission, so by the time io_uring_enter returns, everything should be ready in the normal case; that's why peek was working OK for benchmarking purposes. Should be good enough.
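
In liburing terms the pattern being described is roughly the following (illustrative, not the benchmark's exact code; handle_completion is a hypothetical handler):

/* The requests complete inline during submission here, so a
 * non-blocking peek normally finds the CQEs right away. */
io_uring_submit(&ring);

struct io_uring_cqe *cqe;
while (io_uring_peek_cqe(&ring, &cqe) == 0) {
    handle_completion(cqe);        /* hypothetical handler */
    io_uring_cqe_seen(&ring, cqe);
}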

@lnicola, there is something funky going on; I'd suggest opening an issue. It would be great if you could paste the dmesg output there after the problems start happening.

@ghost
Author

ghost commented Oct 14, 2021

I can recompile with debug symbols. These are very small copies, as we have already discussed - 32 bytes per pipe - so the overhead of copying is very likely not the bottleneck. And again, if you look in this thread you can see that many people with AMD CPUs see big wins with io_uring, and my own Raspberry Pi is much faster with io_uring. They all do the same copying, yet my shitty Intel machines don't perform any better with io_uring.

@ghost
Author

ghost commented Oct 14, 2021

At this point the bug report is more about Intel machines not seeing any gains while pretty much all other ones see big gains.

@isilence
Collaborator

I can recompile with debug symbols. These are very small copies, as we have already discussed - 32 bytes per pipe - so the overhead of copying is very likely not the bottleneck. And again, if you look in this thread you can see that many people with AMD CPUs see big wins with io_uring, and my own Raspberry Pi is much faster with io_uring. They all do the same copying, yet my shitty Intel machines don't perform any better with io_uring.

Ok, still curious to see relative overheads. If you're willing to recompile, there is a list of kernel options to check
(from https://www.brendangregg.com/perf.html)

# for perf_events:
CONFIG_PERF_EVENTS=y
# for stack traces:
CONFIG_FRAME_POINTER=y
# kernel symbols:
CONFIG_KALLSYMS=y
# tracepoints:
CONFIG_TRACEPOINTS=y
# kernel function trace:
CONFIG_FTRACE=y
# kernel-level dynamic tracing:
CONFIG_KPROBES=y
CONFIG_KPROBE_EVENTS=y
# user-level dynamic tracing:
CONFIG_UPROBES=y
CONFIG_UPROBE_EVENTS=y
# full kernel debug info:
CONFIG_DEBUG_INFO=y
# kernel lock tracing:
CONFIG_LOCKDEP=y
# kernel lock tracing:
CONFIG_LOCK_STAT=y
# kernel dynamic tracepoint variables:
CONFIG_DEBUG_INFO=y

@ghost
Author

ghost commented Oct 14, 2021

Yep, I'll come back in a few days; I have other things to do as well. It will be interesting to see.

@vcaputo

vcaputo commented Oct 18, 2021

@alexhultman The io_uring variant of your benchmark completes consistently faster in my tests when I change the continue; busy-loop style CQ polling to instead wait for a CQE to arrive via io_uring_wait_cqe() when the CQ isn't ready:

             if (completions == 0) {
-                continue;
+
+               if (io_uring_wait_cqe(&ring, cqes) < 0) {
+                       printf("error waiting for any completions\n");
+                       return 0;
+               }
+               completions = 1;
             }

@axboe Is this style of brute-force CQ polling supposed to work well? In my tests, especially when I add taskset -c 0, the io_uring test takes much longer without the io_uring_wait_cqe()-when-unready. At a glance, at the very least there's the io_uring-created sqp sibling thread that also requires CPU time, which the spinning is contending with. If there's supposed to be some magic in how the sqp thread is created to make it complement the process-wide taskset -c 0 nicely, and not fight with this style of polling, it doesn't seem to be working correctly here (5.14.5-arch1-1, i7-3520M).

I'm on a somewhat older Intel CPU, and adding the wait as noted above makes the io_uring test consistently faster than epoll. From where I'm sitting right now, the io_uring benchmark seems to simply be incorrectly implemented. And judging from the code quality in general, that would not surprise me one bit.

@ghost
Author

ghost commented Oct 19, 2021

@vcaputo The benchmark started like that; io_uring_submit_and_wait and io_uring_wait_cqe were how the original version of the benchmark ran. At that point (Linux 5.8) it was still not faster than epoll, so the benchmark changed to make use of fixed files, fixed buffers and polling. That was faster, so that solution remained. I can re-check with io_uring_wait_cqe on my Linux 5.16. Thanks for testing and reporting.

@vcaputo

vcaputo commented Oct 19, 2021

@vcaputo The benchmark started like that; io_uring_submit_and_wait and io_uring_wait_cqe were how the original version of the benchmark ran. At that point (Linux 5.8) it was still not faster than epoll, so the benchmark changed to make use of fixed files, fixed buffers and polling. That was faster, so that solution remained. I can re-check with io_uring_wait_cqe on my Linux 5.16. Thanks for testing and reporting.

Well, you still keep the opportunistic batched consumption and only resort to waiting when unlucky. It's a sort of hybrid, akin to how mutexes are often tweaked to spin a little before going to sleep (involving the kernel) when contended. But what you have currently is written more like a spinlock in userspace, which is generally A Bad Idea unless you're exerting very fine control over which threads are running on which cores.

If you don't even attempt the batched wait-less consumption you'll definitely go slower as the wait interface just gets a single cqe IIRC.
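
A sketch of that hybrid consumption, batching with a peek and only falling back to a blocking wait when the CQ is empty (batch size and the handle_completion handler are illustrative):

struct io_uring_cqe *cqes[64];
unsigned n = io_uring_peek_batch_cqe(&ring, cqes, 64);
if (n == 0) {
    /* Nothing ready: sleep in the kernel instead of spinning. */
    if (io_uring_wait_cqe(&ring, &cqes[0]) < 0)
        return;                    /* error handling trimmed */
    n = 1;
}
for (unsigned i = 0; i < n; i++)
    handle_completion(cqes[i]);    /* hypothetical handler */
io_uring_cq_advance(&ring, n);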

@ghost
Author

ghost commented Oct 19, 2021

I'm definitely open to whatever solution is the best as of Linux 5.16. My goal is really just to have io_uring beat epoll on all my machines including the shitty Intel ones. I've tested 10-or-so solutions by now. Will test yours when I have more time for this.

@ghost
Author

ghost commented Oct 19, 2021

@vcaputo That change doesn't make any difference for me. It never even runs that path.

@isilence
Collaborator

PS: there's something funny about the test, it sometimes fails to retrieve a SQE, even though the ring seems to be appropriately-sized.

@lnicola, I just noticed that it enables SQPOLL, and the program doesn't handle it right, though it should work fine with normal rings. Anyone benchmarking it should disable SQPOLL unless they are willing to tinker appropriately and invest more time in the comparison. It's not magic; it can degrade performance and add variability to the results.

@rbernon

rbernon commented Dec 5, 2021

Hi, I've also been playing with io_uring lately, trying to match epoll performance in an IPC use case with pipes: one server and n clients, and a simple request-reply protocol.

I modified the benchmark mentioned here and elsewhere for my tests, see the *ipc*.c sources in https://github.com/rbernon/io_uring-echo-server

What I could see is that epoll currently always beats io_uring, with any number of clients. My use of io_uring is pretty simple: I used the same base as the existing code with fixed fds and buffers (although that did not make a difference), and I'm submitting write-read linked sqes, with the cqe_skip_success flag set on the write.
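
For reference, the linked write-read submission described here looks roughly like this (a sketch, not the actual server code; the fd and buffer names are illustrative, and IOSQE_CQE_SKIP_SUCCESS needs a recent kernel):

/* Link a write and the following read so the read only runs after the
 * write completes; suppress the write's CQE when it succeeds. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_write(sqe, pipe_out_fd, reply, reply_len, 0);
io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK | IOSQE_CQE_SKIP_SUCCESS);

sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, pipe_in_fd, request, sizeof(request), 0);

io_uring_submit_and_wait(&ring, 1);   /* only the read posts a CQE */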

I've made some gpuvis traces, which I think are interesting for understanding the reasons. I used a build of the Linux kernel from git (79a72162048e42a677bc7336a9f5d86fc3ff9558), with the patches from [1] on top, and I verified that there's no cqe returned for the write completion. (I'm not sure if this is the latest version of the patches, and I'm happy to try something else if there's any.)

With epoll, with 1 client thread, each thread wakes the other one when it has completed its write, and they alternate very quickly.

[gpuvis trace: epoll-1]

With 4 client threads, the server is fully busy (in red) and never gets interrupted, serving requests one after another, and the processing time is still very small and roughly the same as with one client.

[gpuvis trace: epoll-4]

With io_uring, however, things are quite different, even with 1 client. For a start, there's a kernel worker involved, so a third thread. Then, there seems to be at least one spurious wakeup of the server.

[gpuvis trace: io_uring-1]

I would expect that the client write would only wake up the worker, copying the data and waking up the server, which would process it, submit the next sqes, waking the worker and starting to wait for the next one. I verified that the only cqe that the server receives is for the read completion, but maybe it resumes from io_uring_submit_and_wait for nothing once.

The number of spurious wakeups doesn't seem to depend on the number of clients, as can be seen here with 2 clients. However, there's an additional worker for each client; it doesn't really seem necessary?

[gpuvis trace: io_uring-2]

About the io_uring workers, I think there's something fishy: whenever I stopped measuring, everything went crazy for a bit, as can be seen here (note that the scale is much larger, and every small block is also an io_uring worker working).

[gpuvis trace: io_uring-crazy]

Edit: That last part may just be because the code doesn't handle the errors and just started submitting a lot of invalid sqes when interrupted.

Edit 2: It's actually not; I checked again and I don't see any completion with an error status, or any error from the client read or write when I interrupt the process - it just shuts down immediately. The iou_worker frenzy is still there as soon as there's more than 1 client, though, and it is already there on 5.15 without the SKIP_SUCCESS flag. It's possible there's something wrong with my code, but I don't see where?

[1] https://lore.kernel.org/all/cover.1636559119.git.asml.silence@gmail.com

@Qix-

Qix- commented Dec 5, 2021

@rbernon just out of curiosity, which tool is that?

@rbernon

rbernon commented Dec 5, 2021

https://github.com/mikesart/gpuvis

@axboe
Owner

axboe commented Dec 6, 2021

If you see any iou-wrk workers, then something isn't working right. That's the slower async path, and should not get hit unless IOSQE_ASYNC is being explicitly set. What code is being run?

@axboe
Owner

axboe commented Dec 6, 2021

Took a quick look at that server, and it looks like it has two modes:

  1. Use IOSQE_ASYNC, or
  2. Use an explicit poll

Neither are great solutions, the most performant will generally be to just issue the request and have the internal poll handle it. Otherwise you're just trading an epoll readiness based model for the ditto on io_uring, which doesn't make a lot of sense. It can be useful for converting existing code, but for new code it isn't really advisable.
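
A sketch of the suggested shape (illustrative names, not the echo server's actual code):

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, client_fd, buf, sizeof(buf), 0);
/* No IOSQE_ASYNC and no separate IORING_OP_POLL_ADD: if the socket has
 * no data yet, io_uring arms poll internally and completes the recv
 * when data arrives, without bouncing through an io-wq worker. */
io_uring_submit(&ring);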

@rbernon

rbernon commented Dec 6, 2021

If you see any iou-wrk workers, then something isn't working right. That's the slower async path, and should not get hit unless IOSQE_ASYNC is being explicitly set. What code is being run?

Interesting, I added the flag on the read sqe because it's not expected to succeed immediately after a write from the server (the clients are not generally spamming requests). It didn't seem to make a difference at first, but I'll try without it.

@rbernon

rbernon commented Dec 6, 2021

Took a quick look at that server, and it looks like it has two modes:

1. Use IOSQE_ASYNC, or

2. Use an explicit poll

Neither are great solutions, the most performant will generally be to just issue the request and have the internal poll handle it. Otherwise you're just trading an epoll readiness based model for the ditto on io_uring, which doesn't make a lot of sense. It can be useful for converting existing code, but for new code it isn't really advisable.

FWIW the server code is https://github.com/rbernon/io_uring-echo-server/blob/master/io_uring_ipc_server.c, and it only used the IOSQE_ASYNC flag on the read sqe.

I tried again without this flag, and with IOSQE_CQE_SKIP_SUCCESS, and now the spurious wakeup is gone, but there are still iou workers involved, as well as the worker frenzy when exiting the test.

Here with 1 client:
[gpuvis trace: io_uring-1]

And with 2:
[gpuvis trace: io_uring-2]

With 4 clients, something weird started to happen, with periods of "normal" latency and periods of very high latency. Looking at gpuvis I can see that the "normal" latency periods involve server, clients, and workers talking to each other:

[gpuvis trace: io_uring-4-normal]

But after a while there's only the server left, seemingly doing nothing?
[gpuvis trace: io_uring-4-weird]

Plus a huge frenzy at the end, but kind of expected.

Unrelated note: by default gpuvis uses millisecond resolution; I modified it to round the timestamps to microseconds because I wanted to be able to see the latency more accurately.

@isilence
Collaborator

@rbernon, pipes, right? The pipe internals don't support any sane nowait behaviour, so we have to force any I/O against them to io-wq (the slow path). It's just one line on the io_uring side to change that, but then submission may end up waiting for I/O, potentially unboundedly. Hopefully one day we'll push hard enough for a change in the pipe code.

FWIW, the benchmark up in the thread doesn't go through io-wq only because of O_NONBLOCK, and there is a special case in io_uring for that.
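
The O_NONBLOCK referred to here is set when the benchmark creates its pipes, presumably along these lines (a sketch, not the benchmark's exact code):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int fds[2];
/* A non-blocking pipe lets reads submitted to io_uring take the
 * inline/poll path instead of being punted to io-wq. */
if (pipe2(fds, O_NONBLOCK) < 0)
    perror("pipe2");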

Don't know what that "frenzy" with lots of rescheduling exactly is, though that's interesting.

@avikivity

Can't we register a poll on the pipe, and issue the write from the poll callback?

The same holds for other socket-like files that support poll() but aren't sockets. They can be handled in the same way as recv.

@avikivity

Now I'm not sure that recv completes without a workqueue.

Looking at the code, it ends up in io_apoll_task_func, which punts via io_req_task_submit. Is recv indeed using workqueue?

@avikivity

I'm probably lost in the maze.

@axboe
Owner

axboe commented Apr 14, 2022

We do use poll for any file type that supports it, but since we can't issue a nonblocking read attempt on a pipe, it still has to be done from a worker. What needs to happen here is just converting pipe from using struct file_operations->read to ->read_iter. The solution is known, and patches do exist.

This isn't specific to pipes, but obviously they are one of the more important file types. Thankfully most files use ->read_iter and ->write_iter these days.

@axboe
Owner

axboe commented Apr 14, 2022

Looking at the code, it ends up in io_apoll_task_func, which punts via io_req_task_submit. Is recv indeed using workqueue?

It is not, this is task_work. That's different from thread offload. For the latter, look for io_queue_async_work().

@avikivity

Looking at the code, it ends up in io_apoll_task_func, which punts via io_req_task_submit. Is recv indeed using workqueue?

It is not, this is task_work. That's different from thread offload. For the latter, look for io_queue_async_work().

Yes, sorry.

@Cagoh

Cagoh commented Jul 2, 2022

I'm trying to fork() the io_uring simulation into a parent and a child process, each with its own ring: the parent writes to the pipe and the child reads from it. However, I stumbled upon an error when the reader side reads an empty pipe: instead of waiting for content in the pipe (when there's no error, the simulation indeed runs faster), the CQE reports that the async task failed. To overcome the problem, I didn't set O_NONBLOCK in pipe2(), which makes the reader wait until there's content in the pipe before unblocking. But on the downside, it ends up calling fcntl() (another syscall), and the performance of the simulation is slower than the original simulation file. I would like to know how to make the reader side of the pipe wait for readiness without syscalls.

@ywave620

ywave620 commented Sep 22, 2023

This is a general question. Does replacing epoll with io_uring reduce the CPU usage for a given workload, say an nginx-like HTTP gateway? Reducing CPU usage in our production environment is the primary goal, rather than achieving higher throughput or RPS in a benchmark environment.

This question comes to me because, IMO, io_uring makes socket IO purely asynchronous, but the data copying between user and kernel is preserved, assuming zero copy (IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC) is not used.
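
For reference, the zero-copy send opcodes mentioned are exposed in liburing as io_uring_prep_send_zc() (kernel 6.0+); a minimal sketch, with sockfd, buf and len as placeholders:

/* Zero-copy send: the kernel pins the user buffer rather than copying
 * it; the first CQE carries IORING_CQE_F_MORE and a second notification
 * CQE signals when the buffer may be reused. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
io_uring_submit(&ring);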
