performance degradation starting from 5.7.16 #215

Closed
romange opened this issue Sep 27, 2020 · 26 comments

@romange (Contributor) commented Sep 27, 2020

Hi Jens,

I noticed that there is a serious performance regression using sockets with io_uring, starting from v5.7.16.
Unfortunately, it continues into 5.8 and 5.9; I have not yet checked exactly which patch versions are affected.

perf top shows exactly where the contention happens, but I cannot identify the root cause.
I am not sure whether it's directly related to io_uring or some other change caused the regression.

5.7.15 and earlier do not exhibit such contention on the same machine (18 physical cores). I am testing on Ubuntu 20.04 patched with prebuilt kernels from https://kernel.ubuntu.com/~kernel-ppa/mainline/

For 5.7.15 my benchmark achieves 1.6M qps and system CPU is at ~80%.
For 5.7.16 or later it achieves only 1M qps and the system CPU is at ~100%.

   57.66%  [kernel]       [k] native_queued_spin_lock_slowpath
     1.09%  [kernel]       [k] _raw_spin_lock_irqsave
     1.05%  [kernel]       [k] _raw_spin_lock
     0.83%  [kernel]       [k] tcp_recvmsg
     0.74%  [kernel]       [k] skb_release_data
     0.69%  [kernel]       [k] ena_start_xmit
     0.69%  [kernel]       [k] memset_erms
     0.69%  [kernel]       [k] __slab_free
     0.63%  [kernel]       [k] mwait_idle_with_hints.constprop.0
     0.56%  [kernel]       [k] copy_user_generic_unrolled
     0.49%  [kernel]       [k] ena_clean_tx_irq
     0.45%  [kernel]       [k] read_tsc

@axboe (Owner) commented Sep 27, 2020

It's probably this one:

commit 175852503b41b3b74ffd99789240fa21ea1b8584
Author: Jens Axboe axboe@kernel.dk
Date: Thu Aug 6 19:41:50 2020 -0600

io_uring: use TWA_SIGNAL for task_work uncondtionally

which I did suspect might cause some perf issues... What's your benchmark? Would love to be able to run it and compare. There are ways we can improve this.

@romange (Contributor, Author) commented Sep 27, 2020

The benchmark programs are attached here:
https://drive.google.com/file/d/1a8f8Za_f9Z6ejC3nlQf_0NRxV61m5cc5/view?usp=sharing

It's just a ping/pong setup simulating a high-throughput scenario.

You will need to install the libunwind library for the server.
Some other shared libraries are bundled with ping_iouring_server.

On your server machine run:
LD_LIBRARY_PATH=... /ping_iouring_server --logtostderr
It should print "AcceptServer - listening on port 6380".

On the client machine run:
./redis-benchmark -p 6380 -h <server_ip> -n 10000000 -c 400 -t ping_inline --threads 24

You can play with -c (number of connections) and --threads on the client side to decrease or increase the load.

@romange (Contributor, Author) commented Sep 28, 2020

I can confirm that 175852503b41b3b74ffd99789240fa21ea1b8584 causes the regression.

@axboe (Owner) commented Sep 28, 2020

I ran your test case this morning, and I agree it does cause a slowdown. It's somewhat annoying, as 99.9% of use cases don't need this, and the reasons are pretty complicated. Basically you only need this for some eventfd related setups, where the same application that does the IO also ends up blocking in the kernel waiting on the eventfd. The eventfd will trigger when completions are run, but some completions are not run until the application exits that loop and processes completions.

One idea would be to pass in a flag for io_uring setup that says "I do weird stuff with eventfd, use the safe/slower path" and default to NOT doing this. Don't necessarily like that approach, but right now I can't think of a better way to handle this.
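
For illustration, here is a rough sketch (not taken from this thread) of the kind of eventfd setup being described: the task that owns the ring registers an eventfd for completion notifications via liburing's io_uring_register_eventfd(), then blocks in the kernel reading that same eventfd. The helper name wait_via_eventfd is made up.

#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>
#include <liburing.h>

/* Sketch only: block on the completion eventfd from the ring-owning task. */
static int wait_via_eventfd(struct io_uring *ring)
{
	uint64_t val;
	int efd;

	efd = eventfd(0, 0);
	if (efd < 0)
		return -1;
	/* Ask the kernel to signal efd whenever a CQE is posted to this ring. */
	if (io_uring_register_eventfd(ring, efd) < 0) {
		close(efd);
		return -1;
	}
	/*
	 * The same task now blocks in the kernel. The CQE that would signal
	 * efd is only posted once this task runs its pending task_work,
	 * which is what TWA_SIGNAL guarantees by interrupting this read.
	 */
	if (read(efd, &val, sizeof(val)) != sizeof(val)) {
		close(efd);
		return -1;
	}
	io_uring_unregister_eventfd(ring);
	close(efd);
	return 0;
}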

@romange (Contributor, Author) commented Sep 28, 2020 via email

@axboe (Owner) commented Sep 28, 2020

Note that this is a problem mainly for threaded cases, as threads share the signal group. To process the completion work, we need to lock the signal structure, which obviously turns into a mess if you have a lot of threads in the same group.

@romange (Contributor, Author) commented Sep 28, 2020 via email

@isilence (Collaborator)

> One idea would be to pass in a flag for io_uring setup that says "I do weird stuff with eventfd, use the safe/slower path" and default to NOT doing this. Don't necessarily like that approach, but right now I can't think of a better way to handle this.

Can we detect it and set the flag in-kernel?

@axboe (Owner) commented Sep 28, 2020

> What's this signal group? Can I allocate a signal group per thread?

You can't currently break that sharing; clone(2) could potentially work, but doesn't support that.

@axboe (Owner) commented Sep 28, 2020

> Can you just document the exact flow as invalid/unsupported? It sounds like a deadlock, what you are describing.

Here's an example:

#include <sys/eventfd.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <liburing.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/time.h>

struct thread_data {
	struct io_uring *ring;
	int read_fd, write_fd;
};

static void error_exit(char *message)
{
	perror(message);
	exit(1);
}

static void *listener_thread(void *data)
{
	struct thread_data *td = data;
	struct io_uring_cqe *cqe;
	int ret;

	ret = io_uring_wait_cqe(td->ring, &cqe);
	if (ret < 0) {
		fprintf(stderr, "Error waiting for completion: %s\n",
			strerror(-ret));
		goto err;
	}
	if (cqe->res < 0) {
		fprintf(stderr, "Error in async operation: %s\n",
			strerror(-cqe->res));
		goto err;
	}
	io_uring_cqe_seen(td->ring, cqe);
	return NULL;
err:
	return (void *) 1;
}

static void *wakeup_io_uring(void *data)
{
	struct thread_data *td = data;
	char buf = 0x40;
	int res;

	res = write(td->write_fd, &buf, sizeof(buf));
	if (res < 0) {
		perror("write");
		return (void *) 1;
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	struct io_uring_sqe *sqe;
	struct thread_data td;
	struct io_uring ring;
	pthread_t t1, t2;
	int ret, fds[2];
	void *pret;

	if (pipe(fds) < 0)
		error_exit("pipe");

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "Unable to setup io_uring: %s\n", strerror(-ret));
		return 1;
	}

	td.read_fd = fds[0];
	td.write_fd = fds[1];
	td.ring = &ring;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_add(sqe, fds[0], POLLIN);
	sqe->user_data = 2;
	ret = io_uring_submit(&ring);
	if (ret != 1) {
		fprintf(stderr, "ring_submit=%d\n", ret);
		return 1;
	}

	pthread_create(&t1, NULL, listener_thread, &td);

	sleep(1);

	pthread_create(&t2, NULL, wakeup_io_uring, &td);
	pthread_join(t1, &pret);

	io_uring_queue_exit(&ring);
	return pret != NULL;
}

Basically we have a main task, M, that issues a poll request. It then spawns two threads, one (T1) which waits for this poll request to complete, and one (T2) which issues a write that triggers this condition. This write queues a task_work item for M, and this work needs to be processed for the completion CQE to be filled in. Task M then waits for T1 to exit. M ends up waiting in futex, and hence doesn't process the task_work required to have T1 receive the completion event.

Hence we won't make any progress here unless we can interrupt the futex wait and process the completion.

@axboe (Owner) commented Sep 28, 2020

> Can we detect it and set the flag in-kernel?

I'd love for that to be the case; do let me know if you have any good ideas here. Continuing with the above example, we can't detect that task M is in the kernel and waiting. And even if we could, we could race with it being just on its way to do exactly that.

Would love to improve this; it's very annoying to have to slow down the general case for some odd corner cases, so it would be very important to fix.

@axboe (Owner) commented Sep 28, 2020

For this exact case, we could make it work by just doing:

diff --git a/kernel/futex.c b/kernel/futex.c
index a5876694a60e..f2c244a3954f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2579,6 +2579,9 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 	 * has tried to wake us, and we can skip the call to schedule().
 	 */
 	if (likely(!plist_node_empty(&q->list))) {
+		if (current->task_works)
+			task_work_run();
+
 		/*
 		 * If the timer has already expired, current will already be
 		 * flagged for rescheduling. Only call schedule if there

but the problem is that this fixes just the one case that could trigger it. Maybe M would be waiting somewhere else. The canonical case is:

while (!condition_to_wait_on)
        schedule();

in essence, where we then never run the task_work because we're not exiting the kernel, yet we need to process the task_work for condition_to_wait_on to become true. TWA_SIGNAL forces this, but the side effects are a bit grim, particularly in threaded cases.
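
For completeness, here is the canonical interruptible wait written out in full (a standard kernel idiom, not a quote from this thread; condition_to_wait_on is a placeholder). It shows why TWA_SIGNAL helps: once signal_pending() reports true, the wait bails out and the task heads back toward userspace, which is where the pending task_work gets run.

	int ret = 0;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (condition_to_wait_on)
			break;
		if (signal_pending(current)) {
			/* With TWA_SIGNAL-style pending task_work, this now fires. */
			ret = -ERESTARTSYS;
			break;
		}
		schedule();
	}
	__set_current_state(TASK_RUNNING);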

@romange (Contributor, Author) commented Sep 28, 2020

I understand the flow, but I am probably missing something about the mechanics of io_uring behind the scenes.

  1. Why do we need M to process something? I thought io_uring completions are processed by magical kernel threads :) that is why we have a ring buffer - to synchronize completions while the user thread does some other work...
  2. Even if we need M to do something instead of being stuck on join, what would it be? Why is calling io_uring_wait_cqe in the listener (T1) thread not enough?

As a general note, I find the io_uring data structure not very thread-safe from a userland perspective: you cannot issue io_uring_get_sqe from multiple threads anyway, because you may not know who might trigger io_uring_submit in parallel. Therefore, I consider io_uring a single-threaded creature. For example, for the inter-thread communication case, even though IORING_OP_NOP could be more efficient, I did not find a way to issue it safely, so I ended up using eventfd for the sole purpose of waking up other io_uring loops.
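
For context, a minimal sketch of the per-thread-ring wakeup pattern described above (this is not code from the benchmark; loop_ctx, arm_wakeup and wake_loop are made-up names): each loop owns its ring plus an eventfd, keeps a poll request armed on that eventfd, and any other thread wakes the loop simply by writing to it.

#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>
#include <poll.h>
#include <liburing.h>

struct loop_ctx {
	struct io_uring ring;	/* owned and touched only by this thread */
	int wake_fd;		/* eventfd; other threads write here to wake us */
};

/* Arm a one-shot poll on the wakeup eventfd; drain and re-arm after it fires. */
static int arm_wakeup(struct loop_ctx *lc)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(&lc->ring);

	if (!sqe)
		return -1;
	io_uring_prep_poll_add(sqe, lc->wake_fd, POLLIN);
	io_uring_sqe_set_data(sqe, lc);		/* marks this as the wakeup request */
	return io_uring_submit(&lc->ring) == 1 ? 0 : -1;
}

/* Called from any other thread to nudge lc's loop out of io_uring_wait_cqe(). */
static void wake_loop(struct loop_ctx *lc)
{
	uint64_t one = 1;

	if (write(lc->wake_fd, &one, sizeof(one)) < 0)
		perror("write wake_fd");
}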

@axboe (Owner) commented Sep 28, 2020

There are only magical threads if we need them; in general, any sort of deferred try/retry is done by something called task_work (let's shorten that to tw). tw is run by the task itself, which makes it a lot less expensive than punting to a different thread. The work is run on the transition between kernel and user mode.
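
For readers unfamiliar with the mechanism, a rough sketch of how kernel code queues tw to a specific task, using the task_work API as it looked around 5.9 (the signature matches the diffs later in this thread). This is illustrative only, not io_uring's actual code; deferred_work and queue_deferred are made-up names.

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/task_work.h>

struct deferred_work {
	struct callback_head cb;
	/* ... per-request state ... */
};

/* Runs in the target task's own context, on its next kernel/user transition. */
static void deferred_work_func(struct callback_head *cb)
{
	struct deferred_work *dw = container_of(cb, struct deferred_work, cb);

	/* e.g. post the CQE for the request, then free the state */
	kfree(dw);
}

static int queue_deferred(struct task_struct *tsk, struct deferred_work *dw)
{
	init_task_work(&dw->cb, deferred_work_func);
	/* TWA_SIGNAL forces tsk out of in-kernel waits so the work actually runs. */
	return task_work_add(tsk, &dw->cb, TWA_SIGNAL);
}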

So for your question 1+2, M has to process the poll request completion in the example above. Once T2 triggers the poll request, tw gets added to M. Once M runs this tw item, then the completion shows up in the ring and T1 gets happy and receives the completion event. That's why we need M to do something. For this poll example, we could potentially have T1 process the tw, but that's not necessarily the case for other work items, say if the work item needs to access the mm of M. This would work for T1 since it's a thread, but you could also have the above be M, P1, and P2 where P are forked processes.

And yes, the rings are not thread safe; generally a thread should have its own ring(s). You can, however, get by with isolating the completion and submission sides, since those are independent in the raw API (though not in liburing).

@romange (Contributor, Author) commented Sep 29, 2020

Thank you for the explanation; I think it would be useful to add it to io_uring.pdf or maybe to the man pages.
In particular, I find this sentence in the man pages a bit confusing:

> Note that, for interrupt-driven I/O (where IORING_SETUP_IOPOLL was not specified in the call to io_uring_setup(2)), an application may check the completion queue for event completions without entering the kernel at all.

I understand, based on your explanation, that in the regular case (without IORING_SETUP_IOPOLL or IORING_SETUP_SQPOLL), I need to delegate control to the kernel so it can process TWs and fill the CQ ring.

  1. Do I have to call io_uring_enter, or will any kernel function do the job?
  2. Can I efficiently know from userland that there are TWs that need to be processed? If not, is it possible to add a flag to cq.kflags that says TWs are available?

The reason I am asking is that I designed my io_uring loop under slightly different assumptions - see below.
It still works, but I thought that, as long as TWs become available, we won't reach the io_uring_wait_cqe step and io_uring_peek_batch_cqe is going to harvest CQEs without any system calls. Now I understand that calling io_uring_wait_cqe is needed to process TWs, since io_uring_submit does not call io_uring_enter if no SQEs were added. So I am looking for the cheapest way to know that I need to yield to the kernel, i.e. that there are TWs pending and I should call io_uring_enter(ring->ring_fd, 0, 0, IORING_ENTER_GETEVENTS, NULL).

while (true) {
  num_requests = io_uring_submit(&ring);
  num_msgs = handle_message_passing();

  io_uring_peek_batch_cqe(&ring, &completions);
  if (!completions.empty()) {
    dispatch_completions();
  }
  if (num_requests || num_msgs || !completions.empty()) {
    continue;
  }

  io_uring_wait_cqe(&ring);   // like epoll_wait(timeout=-1)
}
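
For reference, a rough sketch (not from the thread) of the yield described above: an io_uring_enter(2) call with nothing to submit and min_complete set to 0 returns immediately, but the kernel/user round trip is enough to run pending task_work. yield_to_kernel is a made-up helper; the raw syscall is used because older liburing does not expose a public wrapper for an "enter with nothing to do" call, and the syscall-number fallback assumes x86_64.

#include <unistd.h>
#include <sys/syscall.h>
#include <liburing.h>

#ifndef __NR_io_uring_enter
#define __NR_io_uring_enter 426		/* x86_64 */
#endif

/* Force a kernel/user transition so pending task_work gets run. */
static void yield_to_kernel(struct io_uring *ring)
{
	syscall(__NR_io_uring_enter, ring->ring_fd, 0, 0,
		IORING_ENTER_GETEVENTS, NULL, 0);
}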

@axboe (Owner) commented Sep 30, 2020

Something like this might be viable. If you have a chance to test this, that'd be great...

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..79057cecf121 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_32
 	select MODULES_USE_ELF_REL
 	select OLD_SIGACTION
 	select GENERIC_VDSO_32
+	select ARCH_HAS_TIF_TW
 
 config X86_64
 	def_bool y
@@ -30,6 +31,7 @@ config X86_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
 	select SWIOTLB
+	select ARCH_HAS_TIF_TW
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..79fe7db3208c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_TASKWORK		19	/* task_work pending */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_TASKWORK		(1 << TIF_TASKWORK)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 159c7476b11b..a8e3dbdfad63 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -69,7 +69,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_TASKWORK|	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..470a1da5166b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -355,7 +355,8 @@ static inline int restart_syscall(void)
 
 static inline int signal_pending(struct task_struct *p)
 {
-	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
+			test_tsk_thread_flag(p, TIF_TASKWORK));
 }
 
 static inline int __fatal_signal_pending(struct task_struct *p)
@@ -502,7 +503,8 @@ extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize);
 static inline void restore_saved_sigmask_unless(bool interrupted)
 {
 	if (interrupted)
-		WARN_ON(!test_thread_flag(TIF_SIGPENDING));
+		WARN_ON(!test_thread_flag(TIF_SIGPENDING) &&
+			!test_thread_flag(TIF_TASKWORK));
 	else
 		restore_saved_sigmask();
 }
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 7bbc0e9cf084..4bf470d66791 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -6,6 +6,11 @@
 #include <linux/signal_types.h>
 #include <linux/string.h>
 
+#ifndef CONFIG_ARCH_HAS_TIF_TW
+#define TIF_TASK_WORK	0
+#define _TIF_TASK_WORK	0
+#endif
+
 struct task_struct;
 
 /* for sysctl */
diff --git a/init/Kconfig b/init/Kconfig
index d6a0b31b13dc..3f9c9dd69f84 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2333,6 +2333,9 @@ config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	bool
 
+config ARCH_HAS_TIF_TW
+	bool
+
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fdb6105e6d6..755e0aac7e2c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -160,6 +160,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & _TIF_SIGPENDING)
 			arch_do_signal(regs);
 
+		if (ti_work & _TIF_TASKWORK)
+			task_work_run();
+
 		if (ti_work & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
diff --git a/kernel/signal.c b/kernel/signal.c
index a38b3edc6851..3ed090798811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2529,6 +2529,9 @@ bool get_signal(struct ksignal *ksig)
 	struct signal_struct *signal = current->signal;
 	int signr;
 
+	if (unlikely(current->task_works))
+		task_work_run();
+
 	if (unlikely(uprobe_deny_signal()))
 		return false;
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 613b2d634af8..de85a4c3f899 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -28,7 +28,6 @@ int
 task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 {
 	struct callback_head *head;
-	unsigned long flags;
 
 	do {
 		head = READ_ONCE(task->task_works);
@@ -41,7 +40,10 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 	case TWA_RESUME:
 		set_notify_resume(task);
 		break;
-	case TWA_SIGNAL:
+	case TWA_SIGNAL: {
+#ifndef CONFIG_ARCH_HAS_TIF_TW
+		unsigned long flags;
+
 		/*
 		 * Only grab the sighand lock if we don't already have some
 		 * task_work pending. This pairs with the smp_store_mb()
@@ -53,7 +55,11 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 			signal_wake_up(task, 0);
 			unlock_task_sighand(task, &flags);
 		}
+#else
+		set_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
 		break;
+		}
 	}
 
 	return 0;
@@ -110,6 +116,11 @@ void task_work_run(void)
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
 
+#ifdef CONFIG_ARCH_HAS_TIF_TW
+	if (test_tsk_thread_flag(task, TIF_TASKWORK))
+		clear_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
+
 	for (;;) {
 		/*
 		 * work->func() can do task_work_add(), do not set

@romange (Contributor, Author) commented Sep 30, 2020 via email

@axboe (Owner) commented Sep 30, 2020

Thanks! I'm still fiddling around with this; it's x86-only right now since a) it's arch dependent, and b) that's the only arch I can test on. There's nothing preventing similar changes on other archs.

@axboe (Owner) commented Sep 30, 2020

Here's a cleaned up version:

commit df2bf7418769dfe879087bf3fbe85b2f0736d2f8
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Sep 30 15:36:02 2020 -0600

    kernel: decouple TASK_WORK TWA_SIGNAL handling from signals
    
    Users of TWA_SIGNAL need to have system calls interrupted and go through
    a kernel/user transition to ensure they are run. This currently works
    well from a functional standpoint, but it is heavy handed on a
    multithreaded application where sighand is shared between the threads and
    main process. Adding TWA_SIGNAL task_work on such setups need to grab
    the sighand->lock, which creates a hot spot for otherwise unrelated
    task_work.
    
    This adds TIF_TASKWORK for x86, which if set, will return true on
    checking for pending signals. That in turn causes tasks to restart the
    system call, which will run the added task_work. If TIF_TASKWORK is
    available, we'll use that for notification when TWA_SIGNAL is specified.
    If it isn't available, the existing TIF_SIGPENDING path is used.
    
    Once all archs have added support for TIF_TASKWORK, we can kill the
    old code completely. That will also allow removal of JOBCTL_TASK_WORK
    and related code.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..79fe7db3208c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_TASKWORK		19	/* task_work pending */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_TASKWORK		(1 << TIF_TASKWORK)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 159c7476b11b..0a3702b4774c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -37,6 +37,10 @@
 # define _TIF_UPROBE			(0)
 #endif
 
+#ifndef _TIF_TASKWORK
+# define _TIF_TASKWORK			(0)
+#endif
+
 /*
  * TIF flags handled in syscall_enter_from_usermode()
  */
@@ -69,7 +73,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_TASKWORK|	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..d1a02749df74 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -355,7 +355,17 @@ static inline int restart_syscall(void)
 
 static inline int signal_pending(struct task_struct *p)
 {
-	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
+#ifdef TIF_TASKWORK
+	/*
+	 * TIF_TASKWORK isn't really a signal, but it requires the same
+	 * behavior of restarting the system call to force a kernel/user
+	 * transition.
+	 */
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
+			test_tsk_thread_flag(p, TIF_TASKWORK));
+#else
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
+#endif
 }
 
 static inline int __fatal_signal_pending(struct task_struct *p)
@@ -501,10 +511,16 @@ extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize);
 
 static inline void restore_saved_sigmask_unless(bool interrupted)
 {
-	if (interrupted)
+	if (interrupted) {
+#ifdef TIF_TASKWORK
+		WARN_ON(!test_thread_flag(TIF_SIGPENDING) &&
+			!test_thread_flag(TIF_TASKWORK));
+#else
 		WARN_ON(!test_thread_flag(TIF_SIGPENDING));
-	else
+#endif
+	} else {
 		restore_saved_sigmask();
+	}
 }
 
 static inline sigset_t *sigmask_to_save(void)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fdb6105e6d6..755e0aac7e2c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -160,6 +160,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & _TIF_SIGPENDING)
 			arch_do_signal(regs);
 
+		if (ti_work & _TIF_TASKWORK)
+			task_work_run();
+
 		if (ti_work & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 613b2d634af8..3cd7c6ab9985 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -28,7 +28,6 @@ int
 task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 {
 	struct callback_head *head;
-	unsigned long flags;
 
 	do {
 		head = READ_ONCE(task->task_works);
@@ -41,7 +40,10 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 	case TWA_RESUME:
 		set_notify_resume(task);
 		break;
-	case TWA_SIGNAL:
+	case TWA_SIGNAL: {
+#ifndef TIF_TASKWORK
+		unsigned long flags;
+
 		/*
 		 * Only grab the sighand lock if we don't already have some
 		 * task_work pending. This pairs with the smp_store_mb()
@@ -53,7 +55,11 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 			signal_wake_up(task, 0);
 			unlock_task_sighand(task, &flags);
 		}
+#else
+		set_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
 		break;
+		}
 	}
 
 	return 0;
@@ -110,6 +116,11 @@ void task_work_run(void)
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
 
+#ifdef TIF_TASKWORK
+	if (test_tsk_thread_flag(task, TIF_TASKWORK))
+		clear_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
+
 	for (;;) {
 		/*
 		 * work->func() can do task_work_add(), do not set

@axboe (Owner) commented Sep 30, 2020

Committed it here for testing:

https://git.kernel.dk/cgit/linux-block/log/?h=tif-task_work

and I also added an arm64 patch for you. It also shows how trivial it is to add arch support for this.

@romange (Contributor, Author) commented Sep 30, 2020 via email

@romange (Contributor, Author) commented Oct 1, 2020

Ok, I checked the latest fix c6a44403af72ba7788b82ee83bd53ff972923117 and I confirm that the performance issue is solved!
The ping server reaches 2.7M qps running on 72 vCPUs and saturates networking capacity well before overloading the CPU.

Great job!

P.S. Please backport the fix to 5.8 for the upcoming Ubuntu release.

[Screenshot from 2020-10-01 10-31-47]

[Screenshot from 2020-10-01 10-30-47]

@axboe (Owner) commented Oct 1, 2020

Awesome, thanks for testing! I've cleaned it up a bit more, and posted it for comments/review:

https://lore.kernel.org/io-uring/0b5336a7-c975-a8f8-e988-e983e2340d99@kernel.dk/T/#u

I forgot to add your Reported-by; I'll get that done for the next round. I'm sure we're not done yet...

@romange (Contributor, Author) commented Oct 1, 2020

np, I've recently learned that my company would appreciate it if my work email appeared in the log. Can you please use romger@amazon.com from now on?

@axboe (Owner) commented Oct 1, 2020

You bet, I'll use that email. Once we have a final version and you've vetted it, it would be great if I could also add your Tested-by in there.

@axboe (Owner) commented Nov 24, 2020

It's queued up for 5.11 and will go into 5.10-stable once that's done. Closing this one out.

axboe closed this as completed Nov 24, 2020
torvalds added a commit to torvalds/linux that referenced this issue Dec 16, 2020
…x-block

Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe:
 "This sits on top of of the core entry/exit and x86 entry branch from
  the tip tree, which contains the generic and x86 parts of this work.

  Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL.

  With that done, we can get rid of JOBCTL_TASK_WORK from task_work and
  signal.c, and also remove a deadlock work-around in io_uring around
  knowing that signal based task_work waking is invoked with the sighand
  wait queue head lock.

  The motivation for this work is to decouple signal notify based
  task_work, of which io_uring is a heavy user of, from sighand. The
  sighand lock becomes a huge contention point, particularly for
  threaded workloads where it's shared between threads. Even outside of
  threaded applications it's slower than it needs to be.

  Roman Gershman <romger@amazon.com> reported that his networked
  workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU
  after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all
  spent hammering on the sighand lock, showing 57% of the CPU time there
  [1].

  There are further cleanups possible on top of this. One example is
  TIF_PATCH_PENDING, where a patch already exists to use
  TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more
  consolidation, but the work stands on its own as well"

[1] axboe/liburing#215

* tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits)
  io_uring: remove 'twa_signal_ok' deadlock work-around
  kernel: remove checking for TIF_NOTIFY_SIGNAL
  signal: kill JOBCTL_TASK_WORK
  io_uring: JOBCTL_TASK_WORK is no longer used by task_work
  task_work: remove legacy TWA_SIGNAL path
  sparc: add support for TIF_NOTIFY_SIGNAL
  riscv: add support for TIF_NOTIFY_SIGNAL
  nds32: add support for TIF_NOTIFY_SIGNAL
  ia64: add support for TIF_NOTIFY_SIGNAL
  h8300: add support for TIF_NOTIFY_SIGNAL
  c6x: add support for TIF_NOTIFY_SIGNAL
  alpha: add support for TIF_NOTIFY_SIGNAL
  xtensa: add support for TIF_NOTIFY_SIGNAL
  arm: add support for TIF_NOTIFY_SIGNAL
  microblaze: add support for TIF_NOTIFY_SIGNAL
  hexagon: add support for TIF_NOTIFY_SIGNAL
  csky: add support for TIF_NOTIFY_SIGNAL
  openrisc: add support for TIF_NOTIFY_SIGNAL
  sh: add support for TIF_NOTIFY_SIGNAL
  um: add support for TIF_NOTIFY_SIGNAL
  ...