Blag
He's not dead, he's resting
Yay for git
May 18, 2008
Linux 2.6.26-rc2 wouldn’t boot on my desktop. Linux 2.6.25 worked. In the good old days, tracking down why would be a major pain in the ass. But now, a quick git bisect and fifteen reboots later, I have the exact commit: 3def3d6ddf43dbe20c00c3cbc38dfacc8586998f, also known as:
Author: Yinghai Lu
Date: Fri Feb 22 17:07:16 2008 -0800

    x86: clean up e820_reserve_resources on 64-bit

    e820_resource_resources could use insert_resource instead of request_resource
    also move code_resource, data_resource, bss_resource, and crashk_res
    out of e820_reserve_resources.

    Signed-off-by: Yinghai Lu
    Signed-off-by: Ingo Molnar
Verifying that this really is the offender is equally easy — a quick ‘git revert’ on head and another reboot and the kernel’s working again. Now, I know nothing about what an e820 is beyond what Google tells me, but hopefully someone else will.
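For a sense of why fifteen reboots is enough: bisection halves the range of suspect commits with every test, so the number of tests grows only with the logarithm of the number of commits. Below is a minimal Python sketch of that idea. It assumes a simple linear history, which real kernel history is not (git bisect handles merges itself), and the commit count and the position of the bad commit are made-up numbers for illustration, not the real figures between 2.6.25 and 2.6.26-rc2. The function name is hypothetical too.

```python
from typing import Callable, Tuple


def bisect_first_bad(n_commits: int, is_bad: Callable[[int], bool]) -> Tuple[int, int]:
    """Return (index of the first bad commit, number of tests run).

    Assumes a linear history where commit 0 is known good, commit
    n_commits - 1 is known bad, and every commit after the first bad
    one is also bad.
    """
    good, bad = 0, n_commits - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1  # one "build, install, reboot" cycle
        if is_bad(mid):
            bad = mid   # the breakage is at mid or earlier
        else:
            good = mid  # the breakage is after mid
    return bad, tests


# Hypothetical history: 20000 commits, breakage introduced at index 12345.
first_bad, tests = bisect_first_bad(20000, lambda i: i >= 12345)
print(first_bad, tests)
```

With roughly twenty thousand commits in the range, the loop needs at most fifteen tests, which is in the same ballpark as what I saw.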
As much as I hate to say it, if this were Subversion I’d still be tracking down the bug. And if it were CVS, I wouldn’t’ve bothered.
Obviously, you’re not familiar with svn-bisect: http://search.cpan.org/perldoc?svn-bisect
Actually, I am. It’s just really really really really slow…
Are you still experiencing this regression? I am at LKML working on a fix for a regression just like this (identical, I think) and was wondering how your situation has gone since May. I use Debian, and stock kernels will boot if I add “hpet=disable”. Does this work for you? Also, are you willing to recompile kernels as we try to triage this thing to identify the root cause of the problem?
Yes, it’s still broken. I’ve spoken to Yinghai Lu about the problem, but neither of us managed to get anywhere towards solving it, so I’ve taken to just reverting the commit locally.
I’ve no objection to trying further patches etc if you think you can get anywhere. I do often need to keep the box up for several days at a time, though, so I can’t always guarantee particularly quick testing. I’ll let you know about hpet=disable when I can next reboot.
Well, thank you for offering to help. I believe we have the same problem, even though we have very different hardware. On 2 of my 3 systems, both 3def3d6d and the very next commit have to be reverted in order to prevent the hang at boot.
I was hoping to inject some printk()’s into your code to find out where the hang occurs; the kernel team probably would like some /proc/iomem data, most likely from a kernel that works without lockups. (You could provide the latter without a reboot.)
I have a very long thread going at LKML since 8/4. I thought I was getting close to finally pinning down the problem today, but that is now in doubt.
If you could post that /proc/iomem data at linux-kernel (AT) vger.kernel.org with the subject line:
HPET regression in 2.6.26 versus 2.6.25
I would be most grateful. Also, briefly remind everyone of the facts — you were there in May, but left having to revert the problem commit yourself in future versions of the kernel. That may or may not bring a tear to their eye, but it will underline the fact that they will be facing a major problem when 2.6.26 hits the major distros, and all sorts of complaints about 3def3d6d start flooding in.
Thanks,
DW
Oops, never mind about /proc/iomem! I see that you already submitted that info back in May.
A short post reminding them of the facts, using the subject line I mentioned, would be nice.
THX
Oh, and one more thing… ;)
I am not subscribed there, so it would be a real help if you could add my email address to the CC line, and then I can reply — putting all of the others that I have on my CC line onto yours, in case you want to make future replies.
Alright, I’ve looked over the thread on marc. I’ll give hpet=disable a try at the weekend and then post details to lkml.
I have some good news for me, and some bad news for you. These kernel hackers are really good, and Yinghai was finally able to use a “quirk” approach to solve the problem with 2 systems that were experiencing the regression.
Once I discovered your May posts to LKML, I had hoped to get you involved quickly enough to prevent a quirks-based approach, because I suspect that the changes in the original problematic commit will affect more people than just the two of us. No way is the kernel team going to want separate quirks for each person who reports a hanging kernel!
My advice is for you to get involved again while this is still fresh. I actually have to leave LKML ASAP because of an embarrassing problem with my email client (I was temporarily forced to use my ISP’s webmail system, but its broken header support messes up threading in their inboxes), but I will make a last argument that they should consider the possibility that they will be facing dozens of quirks when 2.6.26 hits the major distros. (I hope I’m wrong, of course.)
My hardware is radically different from yours, so I’m very sure that the patch which solves my troubles will not help you at all. Sorry I didn’t find you sooner!
One last update before I leave you alone.
Ingo Molnar and Yinghai Lu continued to work on this after my last comment here. They provided a more generic solution — which works on my hardware, and *should* work on other hardware as well — so hopefully you are covered now.
The fix should make it into 2.6.27-rc5, and I am hoping to get it into stable 2.6.26.X as well (if possible).
Oh good. Looks like git head works just fine now for me, so whatever it was appears to be fixed.