Sunday 9 August 2015

BFS 464, linux-4.1-ck2

Here's an updated BFS/CK which includes the one test patch I put on this blog after 463 and another trivial fix for the previous release. The patch fixed a lot of regressions including hangs with BTRFS and panics on shutdown.

BFS by itself:

4.1-sched-bfs-464.patch

-ck branded linux-4.1-ck2 patches:

4.1-ck2 patches

Enjoy!
Please enjoy

32 comments:

  1. Thank you. I am using the patch with 4.1.4 in PCLinuxOS and so far, everything is working well.

    Galen

  2. Hi Con,

    many, many THANKS. It seems that with BFS version 0.464 the stability is back. Until now, no crashes during heavy IO on my server machine. Will test it on my laptop asap.

    CU Mike

    Replies
    1. And no problems with ZEN kernel 4.1.5 and BFS on my laptop under heavy IO.
      So again, thanks Con.

      CU Mike

  3. I don't know if this would bother Con too much?! Anyways...

    Could someone who had trouble with BFS 463 / 4.1-ck1 try Alfred Chen's -gc branch for comparison, to see whether the issues persist with it?

    The only patches you'd need are these two:
    No.1: https://bitbucket.org/alfredchen/linux-gc/downloads/bfs_enhancement_v4.1_0463_1.patch
    No.2: https://bitbucket.org/alfredchen/linux-gc/downloads/4.1_0463_1_rcu_stall_fix.patch

    I would be glad, if we'd find some testers for this. Thank you in advance for reporting back, and

    best regards,
    Manuel Krause

    Replies
    1. Bare -gc branch fails at least for me under I/O load, so I'm testing latest BFS update from Con.

  4. Thanks for this Con.
    For me BFS works noticeably better for QuakeLive under wine (latency really matters for this game), but only if I disable the low-power C-states of the CPU. Something like this:

    echo 1 |sudo tee /sys/devices/system/cpu/cpu*/cpuidle/state4/disable
    echo 1 |sudo tee /sys/devices/system/cpu/cpu*/cpuidle/state3/disable
    echo 1 |sudo tee /sys/devices/system/cpu/cpu*/cpuidle/state2/disable

    Maybe this is also useful for benchmarking the scheduler...
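
The commands above address idle states by index, and which state sits behind state2, state3 or state4 depends on the CPU and the idle driver. A minimal sketch, assuming the usual Linux cpuidle sysfs layout with `name`, `latency` and `disable` files in each `stateN` directory, that lists what each index means before anything is disabled. The function name `list_cpuidle_states` and its optional directory argument are made up for illustration.

```shell
#!/bin/sh
# Print index, name, exit latency (microseconds) and disable flag
# for every idle state found in one cpuidle directory.
# $1: cpuidle directory to inspect (defaults to cpu0's)
list_cpuidle_states() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        # If the glob matched nothing, cpuidle is not available here.
        [ -d "$state" ] || continue
        printf '%s %s %s %s\n' \
            "${state##*/state}" \
            "$(cat "$state/name")" \
            "$(cat "$state/latency")" \
            "$(cat "$state/disable")"
    done
}
```

Running `list_cpuidle_states` with no argument prints one line per state of cpu0; a `1` in the last column means that state is currently disabled.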

  5. Thanks a lot Con! -ck2 is stable for me with 4.1.5 and 4.1.6, it's the first stable version since 3.16.7-ck2.

  6. I think there is a bug in bfs since I can't reproduce this in cfs.

    After some time one of my cores stops entering low-power c3/c6 states.
    Here is an example output of the 'turbostat --debug' command:
    Core CPU Avg_MHz %Busy Bzy_MHz TSC_MHz SMI CPU%c1 CPU%c3 CPU%c6 CoreTmp Pkg%pc3 Pkg%pc6
    -    -   29      1.96  1489    2394    0   50.47  0.23   47.35  56      0.00    0.00
    0    0   29      1.30  2213    2394    6   3.56   0.46   94.69  55      0.00    0.00
    0    2   46      3.48  1331    2394    6   1.37
    2    1   31      2.45  1287    2394    6   97.55  0.00   0.00   56
    2    3   10      0.61  1656    2394    6   99.38

    As you can see, core 2 never enters the c3/c6 states. This almost doubles the power consumption of my laptop. Before this happens, the power consumption is roughly equal to that under cfs. I can trigger this almost reliably by touching a file in a kernel tree and doing a make -j4 while firefox + emacs are open. Can someone reproduce this? Any suggestions on how to debug this?
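
One way to cross-check the turbostat figures is the per-state `time` counter in the cpuidle sysfs interface, which accumulates the microseconds a logical CPU has spent in that state. The sketch below samples the counter twice and prints the difference per CPU; a CPU that has stopped entering the state should show a value near zero while the others keep growing. The helper name `idle_residency` is hypothetical, and which state index corresponds to C3 or C6 on a given machine is an assumption that has to be checked against the `name` file first.

```shell
#!/bin/sh
# Report microseconds each logical CPU spent in one idle state over an interval.
# $1: state index (e.g. 3), $2: seconds to wait (default 5),
# $3: sysfs CPU root (default /sys/devices/system/cpu)
idle_residency() {
    idx="$1"
    secs="${2:-5}"
    root="${3:-/sys/devices/system/cpu}"

    # First pass: remember "path counter" for every CPU that has this state.
    before=$(
        for f in "$root"/cpu[0-9]*/cpuidle/state"$idx"/time; do
            if [ -r "$f" ]; then
                printf '%s %s\n' "$f" "$(cat "$f")"
            fi
        done
    )

    sleep "$secs"

    # Second pass: re-read each counter and print the difference.
    printf '%s\n' "$before" | while read -r f t0; do
        [ -n "$f" ] || continue
        t1=$(cat "$f")
        cpu=${f#"$root"/}
        printf '%s %s\n' "${cpu%%/*}" "$((t1 - t0))"
    done
}
```

For example, `idle_residency 3 10` would print one `cpuN microseconds` line per logical CPU after ten seconds. Note that these are logical CPUs, not the core numbers turbostat shows in its first column.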

    Replies
    1. One more observation.
      If I start doing I/O, e.g. 'cat /dev/zero >tmp/zzzzzzzz', the core starts to enter the c3/c6 states.

    2. more precise...
      The core starts entering the low-power states only during the I/O. If the I/O stops, the core stops entering the low-power states again.

    3. So now I can trigger this reliably.
      After a reboot I just need to 'cat /dev/zero >tmp/zzzzzzzz' and one of the cores stops entering the low-power states.

  7. it doesn't happen if I revert bfs463-revert-unplugged.patch

    Replies
    1. I am currently working on the unplugged_io issue and have come up with a trial patch for testing. @pf and @kernelOfTruth have tested it and reported positive results.
      You can try it and see if it helps with your C3/C6 state issue and the unplugged_io issue (if you have it). The patch is at https://bitbucket.org/alfredchen/linux-gc/downloads/sched_submit_work_02.patch

    2. Thx,
      after reverting bfs463-revert-unplugged.patch on BFS464 the C3/C6 problem disappeared for me. As far as I can see, your patch, in addition to reverting bfs463-revert-unplugged.patch, adds some more checks to sched_submit_work. I guess this is meant to solve the freezes which some people had during I/O on btrfs. I'm using ext4 and I haven't had any freezes so far. It seems that only people on btrfs have issues, so it could be a btrfs bug.

    3. @Anonymous:
      Then maybe you wanna read this:
      http://cchalpha.blogspot.de/2015/08/the-bfs-unpluged-io-issue.html

      Best regards,
      Manuel Krause

  8. Hi all,

    I am cautiously asking if anyone has experienced problems with the nvidia blob. I have been running into freezes with 4.1.6 + bfs 464 and nvidia 352.30. I don't know why they suddenly happened - I've had temporary freezes in the past but they went away. This time round I upgraded a lot to try and fix it (Xorg is now 1.17.2 and gcc is 4.9.3) and rebuilt the kernel and modules. Afterwards the freezes were worse and ended in a complete UI freeze. Via SSH I could still work on the system. I noticed the system log had the following entries corresponding to the freeze events:

    [ 1898.883851] NVRM: Xid (PCI:0000:01:00): 8, Channel 00000010
    [ 1900.889007] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

    Now I'm pretty sure the nvidia blob is to blame - however, it is scheduler related and has so far not occurred with the vanilla kernel (I am still testing).

    Any ideas welcome.

    Martin

  9. My port of bfs to 4.2, with other enhancements on top of it, is done; see http://cchalpha.blogspot.com/2015/09/42-sync-up-completed-for-gc-branch.html

    You can try it to have fun with bfs in 4.2 before ck's next update.

    BR Alfred

  10. Thx.
    Runs flawlessly.
    4.1.9

  11. Multiple people report system freezes when running -ck patched 4.1.9 and 4.1.10 kernels. There is little information to provide because nothing related to the freezes is written to system logs. See here: https://bbs.archlinux.org/viewtopic.php?pid=1566879#p1566879 (linked post and others after it in the thread)

  12. Hi,
    I got RCU stalls on 4.1.8, 4.1.9.
    Remote server not responding anymore.
    Even with no task running.
    Happens randomly.

    [35381.296473] INFO: rcu_preempt self-detected stall on CPU
    [35381.296490] 2: (1 GPs behind) idle=5a3/2/0 softirq=211619/211621 fqs=5480121
    [35381.296518] (t=16440365 jiffies g=10239 c=10238 q=86010)
    [35381.296535] Task dump for CPU 2:
    [35381.296536] BFS/2 R running task 0 0 1 0x00000008
    [35381.296537] 0000000000000003 ffffffff816244c0 ffffffff8106919f 00000000000027ff
    [35381.296538] ffff88041fb14500 ffffffff816244c0 ffffffff816244c0 ffffffff8165b520
    [35381.296539] ffffffff8106c668 ffff88041fb03bf8 ffff88041fb14500 ffff88041fb03c08
    [35381.296541] Call Trace:
    [35381.296541] [] ? rcu_dump_cpu_stacks+0x7f/0xc0
    [35381.296544] [] ? rcu_check_callbacks+0x488/0x870
    [35381.296545] [] ? rcu_check_callbacks+0x174/0x870
    [35381.296546] [] ? tick_init_highres+0x10/0x10
    [35381.296548] [] ? update_process_times+0x31/0x60
    [35381.296549] [] ? tick_sched_timer+0x41/0x160
    [35381.296550] [] ? tick_init_highres+0x10/0x10
    [35381.296551] [] ? __run_hrtimer.isra.37+0x44/0xf0
    [35381.296552] [] ? hrtimer_interrupt+0xd5/0x210
    [35381.296554] [] ? smp_apic_timer_interrupt+0x35/0x50
    [35381.296555] [] ? apic_timer_interrupt+0x68/0x70
    [35381.296556] [] ? _raw_spin_unlock_irqrestore+0x6/0x20
    [35381.296557] [] ? try_to_del_timer_sync+0x3f/0x60
    [35381.296558] [] ? del_timer_sync+0x3a/0x50
    [35381.296559] [] ? del_timer_sync+0x42/0x50
    [35381.296561] [] ? inet_csk_reqsk_queue_drop+0x71/0x1e0
    [35381.296562] [] ? reqsk_timer_handler+0x11b/0x270
    [35381.296564] [] ? inet_csk_reqsk_queue_drop+0x1e0/0x1e0
    [35381.296565] [] ? call_timer_fn.isra.28+0x15/0x70
    [35381.296566] [] ? inet_csk_reqsk_queue_drop+0x1e0/0x1e0
    [35381.296567] [] ? run_timer_softirq+0x1b0/0x240
    [35381.296569] [] ? __do_softirq+0xd4/0x1f0
    [35381.296570] [] ? irq_exit+0x55/0x60
    [35381.296572] [] ? smp_apic_timer_interrupt+0x3a/0x50
    [35381.296573] [] ? apic_timer_interrupt+0x68/0x70
    [35381.296574] [] ? cpuidle_enter_state+0x93/0x140
    [35381.296576] [] ? cpuidle_enter_state+0x8c/0x140
    [35381.296577] [] ? cpu_startup_entry+0x201/0x280

    Replies
    1. EDIT: ^^^4.1.9 only, sorry.

    2. This is a problem in 4.1.9 & 4.1.10 and has *nothing* to do with BFS. See http://www.spinics.net/lists/kernel/msg2087851.html for a fix.

    3. Ok.
      My bad.
      Thanks.

    4. ^^ changed this one line in the source.
      Running stable so far.
      Thanks again.
      ^ Nothing said.

  13. Just a heads up that 4.1.12 will break compilation of BFS because of a change in the common scheduler API. I already sent a patch to Con to fix it, so please don't panic. :)

    Replies
    1. @holger:
      It would be more than fair of you to also post/upload your patch here for other users and give an address to download it from.

      Thx.
      I've posted a version for/in Alfred Chen's branch several days ago.

      BR Manuel Krause

  14. So 4.1.12 is finally out. You can find the patch at:

    https://raw.githubusercontent.com/hhoffstaette/kernel-patches/master/4.1/bfs-009-add-preempt_offset-argument-to-should_resched%28%29.patch

    Simply apply this on top.
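
Applying a patch "on top" can be sketched as follows, assuming GNU `patch` and a patch written relative to the top of the kernel tree (so `-p1` strips the leading `a/` and `b/` path components). The helper name `apply_checked` is hypothetical; it does a dry run first so that a patch which does not fit leaves the tree untouched.

```shell
#!/bin/sh
# Apply a patch to a source tree only if a dry run succeeds first.
# $1: source tree directory, $2: absolute path to the patch file
apply_checked() {
    tree="$1"
    patchfile="$2"

    # Dry run: reports problems without modifying any file.
    ( cd "$tree" && patch -p1 --dry-run < "$patchfile" >/dev/null ) || {
        echo "patch does not apply cleanly; tree left untouched" >&2
        return 1
    }

    # The dry run passed, so apply for real.
    ( cd "$tree" && patch -p1 < "$patchfile" )
}
```

It would be called with the already BFS-patched tree and the downloaded file, for example `apply_checked /path/to/linux-4.1.12 /path/to/downloaded.patch` (both paths are placeholders). The patch path has to be absolute because the function changes into the tree before reading it.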

    Replies
    1. Thank you for your needless double work!

      BR Manuel Krause

  15. Looks like it's working just fine:

    https://github.com/sirlucjan/aur/tree/master/linux-bfs

  16. Of course, it's working, it's just a back-copy of mainline changes that had been tested days ago.
    BR Manuel Krause
