While working on idle power management, I used Vaidy’s klog-based patches to profile an idle system and obtain stats such as:
* The time when a cpu enters tickless idle mode.
* The various interrupts that bring the cpu out of the idle state.
* The timers that expire in this interval and cause a wakeup.
* The tasks that demand to be, or are made to be, woken up on the idle cpu.
* The time when the cpu comes out of tickless idle mode and starts executing the tasks.
While observing the task wakeup instrumentation data, I noticed that the wakeup statistics for kondemand appeared pretty strange. For the uninitiated, kondemand is a kernel thread that belongs to the ondemand governor of the cpufreq subsystem, which changes the p-states of the system based on utilization statistics. Thus it’s something that helps in power management.
[root@llm43 tests]# ps aux | grep kondemand | head -4
root 1143 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/0]
root 1145 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/1]
root 1146 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/2]
root 1147 0.0 0.0 0 0 ? S< 09:21 0:00 [kondemand/3]
From the file wakeups.txt, an output of my profiling experiment:
pid cpu nr_wakeups
1143 0 468
1145 1 279
1146 2 78
1147 3 68
A couple of things bothered me here.
- The unusually high number of wakeups on CPU0 and CPU1. Over an idle observation period of 120 seconds, kondemand was waking up at a rate of roughly 4 times and 2 times per second, respectively, on these CPUs.
- The difference in the number of kondemand wakeups across the different CPUs.
Bewildered, I fired a mail to Venki asking for possible explanations.
And I started looking at the code. How often the kondemand thread is supposed to check for a change in the frequency is determined by a sysfs tunable called sampling_rate. It was set to 256000us on my system, which works out to roughly 4 checks per second and accounted for the unusually high number of wakeups on the CPUs.
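The arithmetic behind that observation can be checked with a small sketch. It only uses the figures quoted in this post (the 256000us sampling_rate, the 120 second window and the counts from wakeups.txt); the function and variable names are made up for illustration.

```python
# Figures taken from the post: sampling_rate and the 120 s idle observation window.
SAMPLING_RATE_US = 256_000
OBSERVATION_S = 120

# nr_wakeups per CPU, copied from wakeups.txt above.
observed = {0: 468, 1: 279, 2: 78, 3: 68}


def max_wakeups(sampling_rate_us, observation_s):
    """Upper bound on expiries if the timer fired once every sampling period."""
    return (observation_s * 1_000_000) // sampling_rate_us


def wakeups_per_second(nr_wakeups, observation_s):
    """Average wakeup rate over the observation window."""
    return nr_wakeups / observation_s


expected_wakeups = max_wakeups(SAMPLING_RATE_US, OBSERVATION_S)
rates = {cpu: wakeups_per_second(n, OBSERVATION_S) for cpu, n in observed.items()}

if __name__ == "__main__":
    print("upper bound on wakeups:", expected_wakeups)
    for cpu, rate in rates.items():
        print(f"cpu{cpu}: {rate:.2f} wakeups/s")
```

The upper bound matches the count seen on CPU0, so that CPU was hitting every sampling period, while the other CPUs fall well short of it.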
But I was still confused. The sampling_rate is a global tunable which maps to the variable dbs_tuners_ins.sampling_rate, which is common to all the kondemand threads. Then why the different wakeup rates on different CPUs?
Venki replied to my original query reminding me that kondemand uses deferrable timers! That explained everything.
Deferrable timers behave normally on a busy system. But on an idle system, when we are about to decide when we should wake up next in order to service the next timer in the list, we skip any deferrable timer we encounter. Thus, a deferrable timer on an idle cpu would expire only when the next nearest *hard* timer expires.
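To make the rule concrete, here is a toy model of it in Python. It is not the kernel's code: the Timer class, both helper functions and the expiry values are assumptions made up for illustration. It only captures the idea that the idle wakeup is chosen from the hard timers, and that any deferrable timer already past its expiry gets serviced at that point.

```python
from dataclasses import dataclass


@dataclass
class Timer:
    name: str
    expires_us: int
    deferrable: bool = False


def next_idle_wakeup(timers):
    """Earliest expiry among non-deferrable (hard) timers, or None if there are none."""
    hard = [t.expires_us for t in timers if not t.deferrable]
    return min(hard) if hard else None


def serviced_at(timers, wakeup_us):
    """Names of all timers, deferrable or not, that have expired by the wakeup time."""
    return [t.name for t in timers if t.expires_us <= wakeup_us]


# Hypothetical timer list for one idle cpu.
timers = [
    Timer("kondemand", expires_us=256_000, deferrable=True),
    Timer("ehci_watchdog", expires_us=400_000),
]

wakeup = next_idle_wakeup(timers)
fired = serviced_at(timers, wakeup)

if __name__ == "__main__":
    print("idle cpu wakes at", wakeup, "us and services", fired)
```

In this made-up example the kondemand timer was due at 256000us, but the cpu sleeps until the hard timer at 400000us and services both then.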
So, the high number of wakeups on CPU0 and CPU1 on an idle system can be attributed to the fact that the expiry of some other timer, like the ehci_watchdog, would trigger the expiry of the kondemand timer, and along with it the wakeup of the kondemand thread! And depending on the number of timers queued on the different CPUs, we get a corresponding number of wakeups of the kondemand thread!
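A rough simulation shows how the same sampling_rate can produce very different kondemand wakeup counts per CPU. This is a simplified sketch, not a reproduction of my measurements: it assumes each idle cpu is woken only by a single periodic hard timer, that the deferrable timer is re-armed relative to the time it actually ran, and the hard-timer periods used below are hypothetical.

```python
def count_deferred_wakeups(sampling_us, hard_period_us, duration_us):
    """Count deferrable-timer expiries on an idle cpu whose only wakeups
    come from a periodic hard timer with the given period."""
    count = 0
    deadline = sampling_us      # first expiry requested by the deferrable timer
    now = hard_period_us        # first hard-timer wakeup
    while now <= duration_us:
        if deadline <= now:
            # The deferrable timer piggybacks on this wakeup and is re-armed.
            count += 1
            deadline = now + sampling_us
        now += hard_period_us
    return count


SAMPLING_US = 256_000
DURATION_US = 120 * 1_000_000

# Hypothetical hard-timer periods: a busy timer list versus sparse ones.
counts = {
    period_us: count_deferred_wakeups(SAMPLING_US, period_us, DURATION_US)
    for period_us in (1_000, 100_000, 1_000_000, 2_000_000)
}

if __name__ == "__main__":
    for period_us, n in counts.items():
        print(f"hard timer every {period_us} us -> {n} kondemand wakeups")
```

With frequent hard timers the count approaches the limit set by sampling_rate, and with sparse hard timers it simply tracks the number of hard-timer expiries, which is the pattern seen across the four CPUs.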
So, what I had thought to be the major cause of wakeups in the kernel turned out to be an effect of the expiring timers queued by a totally unrelated subsystem, thus confirming the old wisdom of mathematical logic: “If two events occur one after the other, it doesn’t necessarily imply that one is the cause of the other.”