Looking for Insights on Tock's Context Switch Latency
Hello Tock developers,

I have recently been testing Tock's context switch speed on an STM32F412G-Discovery board, and I noticed that the latency is longer than expected. I hope to confirm that my settings are correct and to gain some insight into the data.

To test the context switch speed, I implemented two apps called "server" and "client." The server sets up an IPC service that simply notifies completion back to the client. The client invokes the IPC service 10,000 times in a loop and then calculates the average ping-pong time.

On my board, the 10,000 ping-pongs take 10,748 ms. That is more than 1 ms per ping-pong.

I believe the CPU clock is set to 96 MHz by the default kernel configuration for the board. At a similar clock speed, a ping-pong implemented with task notifications on FreeRTOS takes around 13 μs, which is almost two orders of magnitude faster than Tock.

Professor Campbell pointed out to me that there are multiple context switches for each ping-pong:

On the "client" side:
- calls command() -> context switch to the kernel
- command returns -> context switch back to the client app
- client app calls yield -> context switch back to the kernel

Next, on the "server" side:
- upcall scheduled -> context switch to the server app
- calls ipc_notify_client -> context switch to the kernel to schedule the upcall
- command returns -> context switch back to the server app
- yield -> context switch to the kernel

Finally, on the "client" side again:
- upcall scheduled -> context switch to the client app

However, even assuming 8 context switches per ping-pong, each context switch still takes around 135 μs on average, which is roughly 13k instructions per context switch on a 96 MHz CPU. That number still seems high.

Below is the source code for the "server" and "client" apps.

Server:

```c
#include <libtock/kernel/ipc.h>
#include <libtock/tock.h>

static void ipc_callback(int pid,
                         __attribute__((unused)) int len,
                         __attribute__((unused)) int buf,
                         __attribute__((unused)) void* ud) {
  ipc_notify_client(pid);
}

int main(void) {
  ipc_register_service_callback("server", ipc_callback, NULL);

  while (1) {
    yield();
  }
}
```

Client:

```c
#include <stdio.h>

#include <libtock-sync/services/alarm.h>
#include <libtock/kernel/ipc.h>

size_t service_id = -1;
char buf[64] __attribute__((aligned(64)));
bool done = false;

static void ipc_callback(__attribute__((unused)) int pid,
                         __attribute__((unused)) int len,
                         __attribute__((unused)) int arg2,
                         __attribute__((unused)) void* ud) {
  done = true;
}

static uint32_t get_time_ms(void) {
  struct timeval tv;
  libtock_alarm_gettimeasticks(&tv, NULL);
  return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
}

int main(void) {
  // Retry discovery until the server app has registered its IPC service.
  while (ipc_discover("server", &service_id) != RETURNCODE_SUCCESS) {
    printf("No server, retry in 1 second.\n");
    libtocksync_alarm_delay_ms(1000);
  }
  printf("Server discovered.\n");

  ipc_register_client_callback(service_id, ipc_callback, NULL);
  ipc_share(service_id, buf, 64);

  uint32_t start_time = get_time_ms();
  for (int i = 0; i < 10000; ++i) {
    done = false;
    ipc_notify_service(service_id);
    yield_for(&done);
  }
  uint32_t end_time = get_time_ms();

  printf("Ping-pong 10000 times take %lu ms.\n", end_time - start_time);

  return 0;
}
```
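For context, the FreeRTOS number above comes from a direct task-notification ping-pong. This is not the exact code I ran, but the shape of that benchmark is roughly the following sketch (task names, priorities, stack sizes, and printf retargeting are illustrative; the server task runs at a higher priority than the client so that each notification hands control over immediately):

```c
#include <stdio.h>

#include "FreeRTOS.h"
#include "task.h"

static TaskHandle_t server_handle;
static TaskHandle_t client_handle;

static void server_task(void *arg) {
  (void) arg;
  for (;;) {
    // Wait for a "ping" from the client, then immediately "pong" back.
    ulTaskNotifyTake(pdTRUE, portMAX_DELAY);
    xTaskNotifyGive(client_handle);
  }
}

static void client_task(void *arg) {
  (void) arg;
  TickType_t start = xTaskGetTickCount();
  for (int i = 0; i < 10000; ++i) {
    xTaskNotifyGive(server_handle);            // ping
    ulTaskNotifyTake(pdTRUE, portMAX_DELAY);   // wait for pong
  }
  TickType_t elapsed = xTaskGetTickCount() - start;
  // With a 1 kHz tick, elapsed is roughly the total time in milliseconds.
  printf("10000 ping-pongs took %lu ticks\n", (unsigned long) elapsed);
  vTaskDelete(NULL);
}

int main(void) {
  xTaskCreate(server_task, "server", configMINIMAL_STACK_SIZE, NULL, 2, &server_handle);
  xTaskCreate(client_task, "client", configMINIMAL_STACK_SIZE, NULL, 1, &client_handle);
  vTaskStartScheduler();
  for (;;) {}
}
```

With these priorities, the client's notify immediately hands control to the higher-priority server, and the server's notify wakes the client, which runs as soon as the server blocks again, so each ping-pong is essentially two direct task switches.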
Hi Zhiyao,

Thank you for including code. I haven't done so just yet, but I will replicate this soon to both validate and try to debug/explain in more detail. Meanwhile, there are a few things going on here, I think.

First, to confirm some general numbers. At 64 MHz (on the NRF52840DK), an almost no-op system call (a context switch back and forth that reads the current RTC counter) takes about 42 uS (10,000 in ~2.1 seconds), so about 21 uS each way, or 1680 instructions at 64 MHz. That's about 13x off of what you're seeing _if_ no work is being done between those context switches, though of course there is. So on average, there are something like 12k instructions left to explain, which is indeed a lot.

Second, it's worth noting that IPC in Tock is so unoptimized it's not even funny. I added IPC in 2016, originally in service of a paper deadline, when there was no real use case for IPC for any Tock users, but as a proof of concept that, yes, this kind of interaction could also be supported. Since then it's hardly been touched. Recently there are some actual use cases that use and/or require IPC, so it's a good time to design a half-way decent system...

Finally, IPC in Tock is quite different from FreeRTOS task notifications (as I understand them, anyway), and as a result does quite a bit more. FreeRTOS task notifications directly pass control from the sender to the receiver, rather than queue a notification to be handled asynchronously, which is what Tock IPC does. So there are a bunch of subsystems that get traversed in Tock (process scheduler, systick reset, hardware event loop, MPU reconfiguration, etc.) that shouldn't need to be traversed in FreeRTOS. MPU reconfiguration is relatively expensive (though not _that_ expensive), but should only be happening twice here. Because the IPC is asynchronous, other things _could_ be happening between yielding in the client and receiving the notification in the server (or vice versa), such as handling hardware interrupts, but that seems unlikely.

In summary, while I don't have an off-the-cuff explanation for the high latency, I'm not _that_ surprised that IPC is very slow. To be continued...

-Amit

"zhiyao.ma.98--- via Devel" <devel@lists.tockos.org> writes:
Hello Tock developers,
I have recently been testing Tock's context switch speed on an STM32F412G-Discovery board, and I noticed that the latency is longer than expected. I hope to confirm that my settings are correct and to gain some insight into the data.
To test the context switch speed, I implemented two apps called "server" and "client." The server sets up an IPC service that simply notifies completion back to the client. The client invokes the IPC service 10,000 times in a loop and then calculates the average ping-pong time.
On my board, the 10,000 ping-pongs take 10,748 ms. That is more than 1 ms per ping-pong.
I believe the CPU clock is set to 96 MHz by the default kernel configuration for the board. At a similar clock speed, a ping-pong implemented with task notifications on FreeRTOS takes around 13 μs, which is almost two orders of magnitude faster than Tock.
Professor Campbell pointed out to me that there are multiple context switches for each ping-pong:
On the "client" side: - calls command() -> context switch to the kernel - command returns -> context switch back to the client app - client app calls yield -> context switch back to the kernel
Next, on the “server” side: - upcall scheduled -> context switch to the server app - calls ipc_notify_client -> context switch to the kernel to schedule the upcall - command returns -> context switch back to the server app - yield -> context switch to the kernel
Finally, on the “client” side again: - upcall scheduled -> context switch to the client app
However, even assuming 8 context switches per ping-pong, each context switch still takes around 135 μs on average, which is roughly 13k instructions per context switch on a 96 MHz CPU. That number still seems high.
Below is the source code for the "server" and "client" apps.
Server:
```c
#include <libtock/kernel/ipc.h>
#include <libtock/tock.h>

static void ipc_callback(int pid,
                         __attribute__((unused)) int len,
                         __attribute__((unused)) int buf,
                         __attribute__((unused)) void* ud) {
  ipc_notify_client(pid);
}

int main(void) {
  ipc_register_service_callback("server", ipc_callback, NULL);

  while (1) {
    yield();
  }
}
```
Client:
```c
#include <stdio.h>

#include <libtock-sync/services/alarm.h>
#include <libtock/kernel/ipc.h>

size_t service_id = -1;
char buf[64] __attribute__((aligned(64)));
bool done = false;

static void ipc_callback(__attribute__((unused)) int pid,
                         __attribute__((unused)) int len,
                         __attribute__((unused)) int arg2,
                         __attribute__((unused)) void* ud) {
  done = true;
}

static uint32_t get_time_ms(void) {
  struct timeval tv;
  libtock_alarm_gettimeasticks(&tv, NULL);
  return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
}

int main(void) {
  // Retry discovery until the server app has registered its IPC service.
  while (ipc_discover("server", &service_id) != RETURNCODE_SUCCESS) {
    printf("No server, retry in 1 second.\n");
    libtocksync_alarm_delay_ms(1000);
  }
  printf("Server discovered.\n");

  ipc_register_client_callback(service_id, ipc_callback, NULL);
  ipc_share(service_id, buf, 64);

  uint32_t start_time = get_time_ms();
  for (int i = 0; i < 10000; ++i) {
    done = false;
    ipc_notify_service(service_id);
    yield_for(&done);
  }
  uint32_t end_time = get_time_ms();

  printf("Ping-pong 10000 times take %lu ms.\n", end_time - start_time);

  return 0;
}
```
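P.S. If you want to separate the raw user/kernel round-trip cost from the IPC machinery on your board, one low-effort option is to time a near-no-op syscall in a tight loop from userspace. A rough sketch, reusing your get_time_ms() helper and assuming yield_no_wait() returns immediately when no upcall is pending (run it as the only app so the scheduler has nothing else to switch to):

```c
#include <stdio.h>

#include <libtock-sync/services/alarm.h>
#include <libtock/tock.h>

static uint32_t get_time_ms(void) {
  struct timeval tv;
  libtock_alarm_gettimeasticks(&tv, NULL);
  return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
}

int main(void) {
  uint32_t start = get_time_ms();

  for (int i = 0; i < 10000; ++i) {
    // No upcall is ever pending here, so each call is (roughly) one trip
    // into the kernel and straight back, plus whatever bookkeeping the
    // kernel does around every entry.
    yield_no_wait();
  }

  uint32_t end = get_time_ms();
  printf("10000 near-no-op round trips took %lu ms.\n",
         (unsigned long) (end - start));
  return 0;
}
```

Comparing that against the per-switch estimate from the IPC benchmark should tell you how much of the 135 uS is the bare switch itself versus the IPC and scheduling work on top of it.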
Hi Professor Levy,

Thank you so much for looking into the problem. Please let me know if there is anything I can help with on my side.

Best regards,
Zhiyao
Hello Zhiyao,

Unfortunately, I also do not have a great explanation for you. At 64 MHz (on the NRF52840DK), your benchmark takes 4846 ms.

I would like to point out a few things. Your context switch diagram underestimates the number of context switches. The sequence you present is not quite right: if a syscall does not immediately error (which IPC does not in this case), the kernel triggers a context switch (ContextSwitchReason::SyscallFired). To be more precise, the kernel's process loop marks that a syscall has fired and asks the scheduler whether it should handle the syscall immediately. The default round robin scheduler will always choose to switch to the other process.

To revise your diagram, on the "client" side:
- calls command() -> context switch to the kernel
- (NEW) kernel defers processing the command() -> kernel round robin scheduler context switches to server's main()
- (NEW) server app main loop immediately yields -> server context switches to kernel
- kernel processes the client's command -> kernel context switches to the client app
- client immediately runs yield_for -> context switch to kernel

I empirically verified that the `while (1) { yield(); }` loop runs 10,000 times in the server.

It is possible there is more overhead from the server calling `ipc_notify_client(pid);`, but I did not trace further into your timeline. I think the best way to get more information about this would be to do some debugging inside the kernel scheduler (`round_robin.rs`).

Let me know if I got anything wrong, Professor Levy.

Samir
Fingerprint: DE65 F61B 7AD6 69C8 3972 1530 A81C A0CA 8957 AC94

On 2024-10-11 18:23, zhiyao.ma.98--- via Devel <devel@lists.tockos.org> wrote:
Hi Professor Levy,
Thank you so much for looking into the problem. Please let me know if there is anything I can help with on my side.
Best regards,
Zhiyao
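P.S. As a userspace-only sanity check (instead of instrumenting the kernel), the server app itself can count how often it is woken up. A small sketch that adds a counter to the server from the original post:

```c
#include <stdint.h>
#include <stdio.h>

#include <libtock/kernel/ipc.h>
#include <libtock/tock.h>

static void ipc_callback(int pid,
                         __attribute__((unused)) int len,
                         __attribute__((unused)) int buf,
                         __attribute__((unused)) void* ud) {
  ipc_notify_client(pid);
}

int main(void) {
  uint32_t wakeups = 0;

  ipc_register_service_callback("server", ipc_callback, NULL);

  while (1) {
    yield();
    // Each return from yield() means at least one upcall was delivered,
    // so this should advance by roughly 10,000 per client run.
    wakeups++;
    if (wakeups % 10000 == 0) {
      printf("server woken up %lu times\n", (unsigned long) wakeups);
    }
  }
}
```

The printf goes through the console syscalls, but only once per 10,000 iterations, so it should not noticeably skew the measurement.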
Hello Zhiyao and Amit,

Sorry for the misinformation; please ignore my previous message. The kernel indeed returns immediately to the process after a command().

I measured that, from board startup, the kernel regains control 100,014 times for 10,000 IPC ping-pongs (measurement methodology: I counted how many times do_process reached its match statement). Doesn't this mean that the user/kernel barrier is crossed 20 times per IPC loop iteration (10 round trips)?

@Amit: What is the benchmarking story for Tock? There are no code speed benchmarks in the Tock repo. I found tock/TockBenchmarking, but it never got upstreamed to Tock. I will be benchmarking Tock this quarter as part of the same class as Gabe (Pat's student, who made the aforementioned repo). It will be interesting to compare my results to his. What should I do if I want to get some benchmarks upstreamed? I can wrap them with an Action like the code size check (https://github.com/tock/tock/blob/fb33e5f672411318ce9ca4700d9fb3c4a08babe0/.github/workflows/benchmarks.yml).

Secondly, I am having a hard time explaining the cause of IPC's performance. I am not sure how to go about profiling within the kernel to figure out where all the overhead is going. Any tips?

Thanks,
Samir
Fingerprint: DE65 F61B 7AD6 69C8 3972 1530 A81C A0CA 8957 AC94

On 2024-10-14 07:12, Samir Rashid via Devel <devel@lists.tockos.org> wrote:
> Hello Zhiyao,
>
> Unfortunately, I also do not have a great explanation for you. At 64 MHz (on the NRF52840DK), your benchmark takes 4846 ms.
>
> I would like to point out a few things. Your context switch diagram underestimates the number of context switches. The sequence you present is not quite right: if a syscall does not immediately error (which IPC does not in this case), the kernel triggers a context switch (ContextSwitchReason::SyscallFired). To be more precise, the kernel's process loop marks that a syscall has fired and asks the scheduler whether it should handle the syscall immediately. The default round robin scheduler will always choose to switch to the other process.
>
> To revise your diagram, on the "client" side:
> - calls command() -> context switch to the kernel
> - (NEW) kernel defers processing the command() -> kernel round robin scheduler context switches to server's main()
> - (NEW) server app main loop immediately yields -> server context switches to kernel
> - kernel processes the client's command -> kernel context switches to the client app
> - client immediately runs yield_for -> context switch to kernel
>
> I empirically verified that the `while (1) { yield(); }` loop runs 10,000 times in the server.
>
> It is possible there is more overhead from the server calling `ipc_notify_client(pid);`, but I did not trace further into your timeline. I think the best way to get more information about this would be to do some debugging inside the kernel scheduler (`round_robin.rs`). Let me know if I got anything wrong, Professor Levy.
>
> Samir
> Fingerprint: DE65 F61B 7AD6 69C8 3972 1530 A81C A0CA 8957 AC94
>
> On 2024-10-11 18:23, zhiyao.ma.98--- via Devel <devel@lists.tockos.org> wrote:
> > Hi Professor Levy,
> >
> > Thank you so much for looking into the problem. Please let me know if there is anything I can help with on my side.
> >
> > Best regards,
> > Zhiyao
Update to include some additional observations from my end: compiler optimization level significantly affects the speed. Currently Tock's kernel is built with opt-level "z", which is the slowest of the levels I measured.

On the STM32F412G-Discovery board @ 96 MHz, performing 10,000 ping-pongs, changing the kernel's optimization flag gives the measured times and code sizes shown below:

| Opt level | Time (ms) | text (bytes) | data (bytes) | bss (bytes) |
|-----------|-----------|--------------|--------------|-------------|
| O3        | 6831      | 151042       | 32           | 76320       |
| O2        | 6954      | 135682       | 32           | 76324       |
| O1        | 8528      | 139778       | 32           | 76352       |
| Os        | 8528      | 106498       | 32           | 76324       |
| Oz        | 10748     | 100866       | 32           | 76320       |
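For anyone who wants to reproduce this: the exact mechanism I used to change the flag isn't shown here, but as a sketch, assuming the kernel build picks up the workspace Cargo release profile (where Tock sets opt-level = "z" by default), the edit looks roughly like:

```toml
# Sketch: workspace Cargo.toml release profile for the Tock kernel.
# Change opt-level and rebuild/flash the kernel to reproduce the rows above.
[profile.release]
opt-level = 3   # also tried: 2, 1, "s", "z"
```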
Hi Zhiyao, "zhiyao.ma.98--- via Devel" <devel@lists.tockos.org> writes:
Update to include some additional observations from my end: compiler optimization level significantly affects the speed. Currently Tock's kernel is built with opt-level "z", which is the slowest of the levels I measured.

On the STM32F412G-Discovery board @ 96 MHz, performing 10,000 ping-pongs, changing the kernel's optimization flag gives the measured times and code sizes shown below:
Thanks for posting this; it is really quite interesting. I did not expect there to be such a drastic difference between optimization level and overall performance. Especially on the simpler microcontroller platforms, conventional wisdom would suggest that fewer and smaller instructions should correlate with better performance. I'd guess that the compiler may choose more efficient algorithms for certain primitives and perform more aggressive inlining at higher optimization levels, which may serve as a gateway to enabling other optimizations.

I'm curious whether we could pinpoint these performance gains to a few subroutines. I did a similar exploration a while back for Tock on RISC-V, which I posted to the old mailing list: https://groups.google.com/g/tock-dev/c/FPTmNe4BAq0

That exploration highlighted the `memcpy` intrinsic as being particularly inefficient at the `-Oz` optimization level. Supposedly this has been fixed in upstream Rust / LLVM for a couple of months or years now, but it might be good to verify that.

For RISC-V we can use the LiteX Sim target (compiling its HDL to a Verilated simulation) to generate an instruction trace with cycle-accurate information on the instructions executed by the CPU. I have a set of patches for this which I can dig up if you're curious. I don't know whether we have an equivalent for ARM Cortex-M (yet). I'm Cc'ing @Alex, who had Tock running on Renode (emulating an STM ARM chip) once, which may also be able to generate such a trace.

-Leon
participants (4):
- Amit Levy
- Leon Schuermann
- Samir Rashid
- zhiyao.ma.98@gmail.com