{
  "commit": "b4ecc126991b30fe5f9a59dfacda046aeac124b2",
  "tree": "ecde1569068bbe6e941658e385a1e44671752a7b",
  "parents": [
    "44408ad7368906c84000e87a99c14a16dbb867fd"
  ],
  "author": {
    "name": "Jeremy Fitzhardinge",
    "email": "jeremy@goop.org",
    "time": "Wed May 13 17:16:55 2009 -0700"
  },
  "committer": {
    "name": "Ingo Molnar",
    "email": "mingo@elte.hu",
    "time": "Fri May 15 20:07:42 2009 +0200"
  },
  "message": "x86: Fix performance regression caused by paravirt_ops on native kernels\n\nXiaohui Xin and some other folks at Intel have been looking into what\u0027s\nbehind the performance hit of paravirt_ops when running native.\n\nIt appears that the hit is entirely due to the paravirtualized\nspinlocks introduced by:\n\n | commit 8efcbab674de2bee45a2e4cdf97de16b8e609ac8\n | Date:   Mon Jul 7 12:07:51 2008 -0700\n |\n |     paravirt: introduce a \"lock-byte\" spinlock implementation\n\nThe extra call/return in the spinlock path is somehow\ncausing an increase in the cycles/instruction of somewhere around 2-7%\n(seems to vary quite a lot from test to test).  The working theory is\nthat the CPU\u0027s pipeline is getting upset about the\ncall-\u003ecall-\u003elocked-op-\u003ereturn-\u003ereturn, and seems to be failing to\nspeculate (though I haven\u0027t seen anything definitive about the precise\nreasons).  This doesn\u0027t entirely make sense, because the performance\nhit is also visible on unlock and other operations which don\u0027t involve\nlocked instructions.  
But spinlock operations clearly swamp all the\nother pvops operations, even though I can\u0027t imagine that they\u0027re\nnearly as common (there\u0027s only a .05% increase in instructions\nexecuted).\n\nIf I disable just the pv-spinlock calls, my tests show that pvops is\nidentical to non-pvops performance on native (my measurements show that\nit is actually about .1% faster, but Xiaohui shows a .05% slowdown).\n\nSummary of results, averaging 10 runs of the \"mmperf\" test, using a\nno-pvops build as baseline:\n\n\t\tnopv\t\tPv-nospin\tPv-spin\nCPU cycles\t100.00%\t\t99.89%\t\t102.18%\ninstructions\t100.00%\t\t100.10%\t\t100.15%\nCPI\t\t100.00%\t\t99.79%\t\t102.03%\ncache ref\t100.00%\t\t100.84%\t\t100.28%\ncache miss\t100.00%\t\t90.47%\t\t88.56%\ncache miss rate\t100.00%\t\t89.72%\t\t88.31%\nbranches\t100.00%\t\t99.93%\t\t100.04%\nbranch miss\t100.00%\t\t103.66%\t\t107.72%\nbranch miss rt\t100.00%\t\t103.73%\t\t107.67%\nwallclock\t100.00%\t\t99.90%\t\t102.20%\n\nThe clear effect here is that the 2% increase in CPI is\ndirectly reflected in the final wallclock time.\n\n(The other interesting effect is that the more ops are\nout of line calls via pvops, the lower the cache access\nand miss rates.  Not too surprising, but it suggests that\nthe non-pvops kernel is over-inlined.  On the flipside,\nthe branch misses go up correspondingly...)\n\nSo, what\u0027s the fix?\n\nParavirt patching turns all the pvops calls into direct calls, so\n_spin_lock etc do end up having direct calls.  
For example, the compiler\ngenerated code for paravirtualized _spin_lock is:\n\n\u003c_spin_lock+0\u003e:\t\tmov    %gs:0xb4c8,%rax\n\u003c_spin_lock+9\u003e:\t\tincl   0xffffffffffffe044(%rax)\n\u003c_spin_lock+15\u003e:\tcallq  *0xffffffff805a5b30\n\u003c_spin_lock+22\u003e:\tretq\n\nThe indirect call will get patched to:\n\u003c_spin_lock+0\u003e:\t\tmov    %gs:0xb4c8,%rax\n\u003c_spin_lock+9\u003e:\t\tincl   0xffffffffffffe044(%rax)\n\u003c_spin_lock+15\u003e:\tcallq \u003c__ticket_spin_lock\u003e\n\u003c_spin_lock+20\u003e:\tnop; nop\t\t/* or whatever 2-byte nop */\n\u003c_spin_lock+22\u003e:\tretq\n\nOne possibility is to inline _spin_lock, etc, when building an\noptimised kernel (ie, when there\u0027s no spinlock/preempt\ninstrumentation/debugging enabled).  That will remove the outer\ncall/return pair, returning the instruction stream to a single\ncall/return, which will presumably execute the same as the non-pvops\ncase.  The downsides are: 1) it will replicate the\npreempt_disable/enable code at each lock/unlock callsite; this code is\nfairly small, but not nothing; and 2) the spinlock definitions are\nalready a very heavily tangled mass of #ifdefs and other preprocessor\nmagic, and making any changes will be non-trivial.\n\nThe other obvious answer is to disable pv-spinlocks.  Making them a\nseparate config option is fairly easy, and it would be trivial to\nenable them only when Xen is enabled (as the only non-default user).\nBut it doesn\u0027t really address the common case of a distro build which\nis going to have Xen support enabled, and leaves the open question of\nwhether the native performance cost of pv-spinlocks is worth the\nperformance improvement on a loaded Xen system (10% saving of overall\nsystem CPU when guests block rather than spin).  
Still it is a\nreasonable short-term workaround.\n\n[ Impact: fix pvops performance regression when running native ]\n\nAnalysed-by: \"Xin Xiaohui\" \u003cxiaohui.xin@intel.com\u003e\nAnalysed-by: \"Li Xin\" \u003cxin.li@intel.com\u003e\nAnalysed-by: \"Nakajima Jun\" \u003cjun.nakajima@intel.com\u003e\nSigned-off-by: Jeremy Fitzhardinge \u003cjeremy.fitzhardinge@citrix.com\u003e\nAcked-by: H. Peter Anvin \u003chpa@zytor.com\u003e\nCc: Nick Piggin \u003cnpiggin@suse.de\u003e\nCc: Xen-devel \u003cxen-devel@lists.xensource.com\u003e\nLKML-Reference: \u003c4A0B62F7.5030802@goop.org\u003e\n[ fixed the help text ]\nSigned-off-by: Ingo Molnar \u003cmingo@elte.hu\u003e\n",
  "tree_diff": [
    {
      "type": "modify",
      "old_id": "df9e885eee143ba5434a39461821a14e3fc102df",
      "old_mode": 33188,
      "old_path": "arch/x86/Kconfig",
      "new_id": "a6efe0a2e9ae613a81bedec5e4772698d16541cc",
      "new_mode": 33188,
      "new_path": "arch/x86/Kconfig"
    },
    {
      "type": "modify",
      "old_id": "378e3691c08c54dd76e060eb468a00e8b61c59f9",
      "old_mode": 33188,
      "old_path": "arch/x86/include/asm/paravirt.h",
      "new_id": "a53da004e08ed8903dbf135fead7e3e09664089a",
      "new_mode": 33188,
      "new_path": "arch/x86/include/asm/paravirt.h"
    },
    {
      "type": "modify",
      "old_id": "e5e6caffec87ab61063cd8c356ebd8f59523b00b",
      "old_mode": 33188,
      "old_path": "arch/x86/include/asm/spinlock.h",
      "new_id": "b7e5db8763994cf3164ecf8273f7ec725be9215c",
      "new_mode": 33188,
      "new_path": "arch/x86/include/asm/spinlock.h"
    },
    {
      "type": "modify",
      "old_id": "145cce75cda70dcc5f90560902eff37cc6ddd3fc",
      "old_mode": 33188,
      "old_path": "arch/x86/kernel/Makefile",
      "new_id": "88d1bfc847d30fc6b87007648fedc9970856e1af",
      "new_mode": 33188,
      "new_path": "arch/x86/kernel/Makefile"
    },
    {
      "type": "modify",
      "old_id": "8e45f4464880ccdba671ea5248c5e2bea4f9b297",
      "old_mode": 33188,
      "old_path": "arch/x86/kernel/paravirt.c",
      "new_id": "9faf43bea3361cf178f53942e8f37a8842737afc",
      "new_mode": 33188,
      "new_path": "arch/x86/kernel/paravirt.c"
    },
    {
      "type": "modify",
      "old_id": "3b767d03fd6add7d6b12ae208e0f5e33ba25db4d",
      "old_mode": 33188,
      "old_path": "arch/x86/xen/Makefile",
      "new_id": "172438f86a02aaf01b3855e36318cf0bb8aaeba1",
      "new_mode": 33188,
      "new_path": "arch/x86/xen/Makefile"
    },
    {
      "type": "modify",
      "old_id": "20139464943c7fa6ee3e516c6ead3a6e7c15ef33",
      "old_mode": 33188,
      "old_path": "arch/x86/xen/xen-ops.h",
      "new_id": "ca6596b05d533c25f56e242409e88471a816ba9c",
      "new_mode": 33188,
      "new_path": "arch/x86/xen/xen-ops.h"
    }
  ]
}