Dynamic Disk Problems in Windows Server 2003
A freshly provisioned server needed to implement Windows dynamic disks for software RAID1.
Trial 1 - Graphical disk management tool does not work.
For whatever reason, the standard admin plugin to create dynamic disks for a software RAID mirror was not working. It would create a dynamic disk for any unallocated drive okay, but doing the same for the current boot/system drive would not work. It would give errors, take the disk offline, and not allow it to return.
A reboot of the server did not bring it back online. Of course this meant an OS problem, and without console access, the hosting company had to do a Windows reload.
They were actually pretty jovial about loading the OS 3 times in one week. Good on them!
After 2 miserable failures with the graphical tool, using the command line diskpart worked like a champ. No problem in converting the boot volume into a dynamic drive. I rebooted and a minute later it was back up. Nice.
Trial 2 - Adding the mirror works… kind of.
After diskpart actually worked, using graphical disk manager to add the mirror worked too. For the most part. It gave the error “Logical disk manager could not update the boot file for any boot partitions on the target disk. Verify your arcpath listings in file boot.ini or through the bootcfg.exe utility.” after creating the mirror.
Ok. It looks like something didn’t update boot.ini properly to let it boot off the mirror I guess. I don’t feel like blindly rebooting and hoping it works, so it needs to be updated too. So while the disks resync let’s proceed with manually editing boot.ini. I guess bootcfg.exe would work too?
Trial 3 - boot.ini is not there?
This is a common problem I found online - you know boot.ini exists, but it’s not visible. Even enabling ‘show hidden files’ does not make it visible. The fix is to just type
C:\boot.ini
from the command line and it opens in Notepad. That’s intuitive!
Trial 4 - Editing boot.ini with Cookie Monster
What exactly does boot.ini want to know to boot off the mirror if it needs to? This was a whole new area for me. Luckily, an existing Windows server had already been set up with a mirrored drive. So to cheat, here is its boot.ini:
[boot loader]
timeout=10
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows Server 2003, Standard" /noexecute=optout /fastdetect
C:\="Microsoft Windows"
multi(0)disk(0)rdisk(1)partition(1)\WINNT="Boot Mirror C: - secondary plex"
And the stale boot.ini for the new server:
[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\WINNT
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows Server 2003, Web" /noexecute=optout /fastdetect
C:\="Microsoft Windows"
Alright, let’s sing along…..One of these things is not like the other things….
I believe the extra line is the arcpath entry for x86 computers. “secondary plex” is Redmond slang for “allow this to boot too if a software RAID volume is gone.”
So, add that to the newly provisioned boot.ini. If your partition scheme is different I cannot help - luckily, both our servers have the same drive layout: one big partition on the first disk, with the second disk being a straight mirror. YMMV.
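The same edit can be sketched in Python. This is a minimal sketch, assuming the layout described above (one partition on the first disk, a straight mirror on the second, and `[operating systems]` as the last section of the file). The function name `add_mirror_entry` and its parameters are made up for illustration, and the label text is simply copied from the working server's boot.ini.

```python
def add_mirror_entry(boot_ini, rdisk=1,
                     label="Boot Mirror C: - secondary plex"):
    """Return boot.ini text with an ARC path entry for the mirror disk.

    Assumes [operating systems] is the last section, so the new entry
    can simply be appended to the end of the file.
    """
    lines = boot_ini.splitlines()

    # Find the ARC path on the default= line, e.g.
    # multi(0)disk(0)rdisk(0)partition(1)\WINNT
    default = None
    for line in lines:
        if line.strip().lower().startswith("default="):
            default = line.split("=", 1)[1].strip()
            break
    if default is None:
        raise ValueError("no default= line found")
    if "rdisk(0)" not in default:
        raise ValueError("default entry does not boot from rdisk(0)")

    # The mirror entry is the same path with rdisk(0) swapped for the
    # second disk (assumption: identical partition layout on both).
    mirror_path = default.replace("rdisk(0)", "rdisk(%d)" % rdisk)

    # Do nothing if an entry for the mirror is already present.
    if any(line.strip().startswith(mirror_path) for line in lines):
        return boot_ini

    lines.append('%s="%s"' % (mirror_path, label))
    return "\n".join(lines) + "\n"
```

Fed the stale boot.ini from the new server, this appends one line that matches the extra entry on the working server. Running it a second time leaves the text unchanged.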
Trial 5 - The longest reboot ever.
So now, with a fresh boot.ini and the resync process 92% done, I’m at a crossroads. I can quickly log out and hope the boot.ini changes are fine, and that all is good in Windows land, or let the resync process finish, and make sure it reboots okay.
Better to find out now…
Resync is done, my disks are “Healthy” so I restart. Visions of getting a fourth OS install in one week dance in my head. Well, not so much visions, more of ‘soul crushing thoughts of wasting 3 more hours finding out why boot.ini hates me.’ Patiently I wait for ping to start returning echoes.
Success!
What’s Next?
In a perfect world, I’d have had hardware level access to now yank the primary drive and test booting off the secondary. This is not the case, nor something the hosting company would do. Should either drive have a crash and go offline, the server will continue running on the second one. Should something really bad happen, the drives would need to be placed in a new computer, boot records rebuilt, soy-chickens sacrificed, and maybe it would work. Maybe not. At that point you need offsite backups.
What did we learn?
- Get hardware RAID if it’s offered and you can afford it. It’s much more robust than software RAID and faster to implement!
- Software RAID does not replace your backup scheme. At best, this setup buys you enough time to fix a single drive failure.
- Windows lies when it says boot.ini is hidden.
- Graphical disk management sucks.
August 24, 2008 No Comments
Mongrel Won’t Load After Server Reset
After having to perform a hard reboot on a server, I was quite upset to see no mongrels running. The server had been very stable up to this point, and mongrel_cluster was linked into /etc/rc*.d.
Turns out the version of mongrel_cluster was too old to deal with the stale PID files left behind after the hard reset.
sudo gem update mongrel_cluster
Got the new init.d script from /usr/lib/ruby/gems/1.8/gems/mongrel_cluster-1.0.5/resources/mongrel_cluster (to support a global PID_DIR)
Then add the --clean flag to the restart and start commands
mongrel_cluster_ctl start --clean -c $CONF_DIR
mongrel_cluster_ctl restart --clean -c $CONF_DIR
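For anyone wondering what cleaning up stale PID files amounts to, here is a rough sketch in Python. It is not mongrel_cluster's actual code, just the general idea: a PID file is stale when the process it names no longer exists. The function names are made up for illustration, and the `os.kill(pid, 0)` probe is Unix-only (do not run it on Windows, where `os.kill` behaves differently).

```python
import os
from pathlib import Path


def pid_is_running(pid):
    """Return True if a process with this PID exists (Unix only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False     # no such process: the PID file is stale
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def clean_stale_pid_files(pid_dir):
    """Delete *.pid files whose process is gone; return removed names."""
    removed = []
    for pid_file in sorted(Path(pid_dir).glob("*.pid")):
        try:
            pid = int(pid_file.read_text().strip())
        except ValueError:
            pid = None   # unreadable contents count as stale
        if pid is None or not pid_is_running(pid):
            pid_file.unlink()
            removed.append(pid_file.name)
    return removed
```

After a hard reset every PID file left behind points at a process that died with the machine, so a check like this would clear them all and let the servers start normally.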
Finally, I wanted to check the status.
mongrel_cluster_ctl status
was giving the error “ERROR RUNNING ‘cluster::status’: Plugin /cluster::status does not exist in category /commands”
This happens when you have multiple older versions of mongrel_cluster.
sudo gem uninstall mongrel_cluster
After removing the ancient 0.2.1 it worked, so I just left 1.0.5.
Good help from paul goscicki and the mongrel mailing list.
July 30, 2008 No Comments
Windows mysql/MDB2/pear
Had to upgrade MDB2 with pear since the older stable version that worked with a Linux install of MySQL 4 did not work with Windows running MySQL 5. getlastinsertID() was not working, among others.
pear config-set preferred_state stable
pear upgrade MDB2_Driver_mysql
It installed a new MDB2 and some other packages.
May 31, 2008 No Comments
Bitten by PHP’s Lack of Namespace (or: Runkit, How Great We Could Have Been)
The client’s install of a closed source CMS was taking over a minute to render some category archive views. Every other view was okay though, so that ruled out a general database problem.
Without having access to the source it became a lot of guessing.
I enabled MySQL’s slow log, then the general log, but nothing was taking more than 1 second to execute. I verified by running the relevant archive view queries from the log by hand - still no problem.
Next step was to start hacking away at some of the views to make sure nothing was wonky. There wasn’t. As long as its main() was running, it would cause the slowdown. Disabling main() loaded the page, just with no content beside the template. Okay.. that’s useless. All other hacks I wrote for this software just regex the hell out of what main() returns - they don’t actually touch it. Without access to what’s going on in main() though, I was still lost.
So after much searching, I found a de-zended version of the code online. Normally that would explain it all, but this code is a mix of HTML, PHP, and SQL, with tragic variable names, no error handling, etc. 3000 lines or so of spaghetti for main() - Fun stuff!
After making a valiant effort to clean up the de-zender’s formatting I ended up with something passable enough to look into. Turns out main() contains 30-40 function declarations inside of it, which then get put into the global namespace when main runs, and can then call the appropriate function based on the view.
I could have used Runkit to override the specific view that gives troubles, but PHP can’t handle overwriting functions. Even if it did, the function was ghetto-namespaced inside another function which would override mine, basically negating it unless I could override and rewrite the whole main() function.
Oh well!
In the end, I dug around the view’s code, and it was doing the standard:
@opendir
readdir
strstr each file name to see if a certain file exists
That works okay up to a point, but there were 11,000 files in the directory it searched, and it ran the loop up to 100 times per archive entry, with up to 100 entries per page. That is up to 110,000,000 comparisons (11,000 × 100 × 100) per page view, all because the programmers couldn’t be bothered to either regex the directory listing as a whole, or do up to 100 file existence tests.
So how did it get solved? I had to remove half the files because they were old and would never get seen in an archive anyway, and blog this so in 2 years when the problem hits again I’ll remember how to solve it.
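For the record, the cheaper approach looks something like this. It is a sketch in Python rather than the CMS's PHP, and it assumes each archive entry maps to one predictable file name. The `<id>.jpg` naming and the function names are invented for illustration, not taken from the CMS.

```python
import os

# Worst case from the post: every file compared on every loop.
FILES_IN_DIR = 11_000
LOOPS_PER_ENTRY = 100
ENTRIES_PER_PAGE = 100
WORST_CASE_COMPARISONS = FILES_IN_DIR * LOOPS_PER_ENTRY * ENTRIES_PER_PAGE


def build_file_index(directory):
    """Read the directory listing once and keep it as a set."""
    return frozenset(os.listdir(directory))


def entries_with_file(entry_ids, file_index, suffix=".jpg"):
    """Return the entry ids whose file exists, one set lookup each."""
    return [eid for eid in entry_ids
            if "%s%s" % (eid, suffix) in file_index]
```

With the listing held in a set, a page of 100 entries costs one directory read plus 100 constant-time lookups instead of a fresh scan of all 11,000 names for every check.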
May 31, 2008 No Comments
IIS not working
Had a server where IIS was not serving the websites out consistently, or at all. Trying to restart it via the GUI didn’t work, nor did iisreset. iisreset at least gave an error:
“IIS admin service or a service dependent on IIS Admin is not active.”
Trying to start the IIS Admin Service via the service manager didn’t work either. I ended up killing inetinfo.exe with the task manager, then running iisreset and starting the admin service. That worked.
May 30, 2008 No Comments
Flickr API
I started out yesterday writing a Ruby library for Flickr’s API - wasn’t too impressed with the existing libraries I found out there, and several people had said it’s an easy one to deal with.
Ended up with a general purpose Flickr lib that creates the method signatures on the fly, and uses the newer authentication. It also uses method_missing to catch all calls to the REST API, and lets the caller send whatever params it wants.
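The catch-all trick is not specific to Ruby. Below is a minimal sketch of the same idea in Python, where `__getattr__` plays the role of method_missing. It is not the library described above: the class name is made up, it only builds the request URL instead of sending it, and it leaves out authentication and request signing entirely. The endpoint URL is an assumption.

```python
from urllib.parse import urlencode

# Assumed REST endpoint; check Flickr's API docs before relying on it.
REST_ENDPOINT = "https://api.flickr.com/services/rest/"


class FlickrNamespace:
    """Turns attribute access into dotted API method names."""

    def __init__(self, api_key, prefix="flickr"):
        self._api_key = api_key
        self._prefix = prefix

    def __getattr__(self, name):
        # Only called when normal lookup fails, like method_missing.
        if name.startswith("_"):
            raise AttributeError(name)
        return FlickrNamespace(self._api_key,
                               "%s.%s" % (self._prefix, name))

    def __call__(self, **params):
        # The caller may pass whatever parameters it wants.
        query = {"method": self._prefix, "api_key": self._api_key}
        query.update(params)
        return REST_ENDPOINT + "?" + urlencode(sorted(query.items()))
```

Calling `FlickrNamespace("KEY").photos.search(tags="cat")` builds a URL for the method `flickr.photos.search` without that method being declared anywhere, which is why a library like this keeps working when new API methods appear.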
All said, I’m pretty happy with it, and Flickr really is one of the best APIs I’ve used.
April 9, 2008 No Comments
Where Do I Want To Go?
“Be mindful of the link between present action and desired future outcome. Ask yourself: if I repeat today’s actions 365 times, will I be where I want to be in a year?”
March 17, 2008 No Comments
TechStars for a Day was Amazing
Wow. I had a blast.
Met lots of cool people and learned about the program. More importantly being around so many driven people was very inspiring - I am all fired up about working on Chickpea!
March 12, 2008 No Comments
TechStars
Alright, I am going to TechStars for a Day on March 5th. Booked my flight tonight - this will be great.
February 28, 2008 No Comments
The Strange Story of the Bandwidth Overage Charge - part 2
To combat the 1TB of overage caused by MySQL, I was going to move a subset of the databases over to the webserver.
Should have been pretty uneventful. An earlier version of MySQL 5 had been installed around 2 years ago, so I upgraded it and set about importing the data.
The dump came from mysqldump and ended up being 173MB in total. Its format did not work with mysqlimport though, so I used the GUI admin browser. It took 30 minutes to import 800K! Yeah right. The command line client took 4 seconds. Sweet.
Moving on, the CMS (thankfully) only kept its database config in one file. Updated that, set up a backup task for the database, and turned it on. Load the CMS site and…
Error: Incorrect integer value: '' for column 'id' at row 1
Okay….
After much googling on this one, we can blame the poorly written, closed source CMS for this. It appears that MySQL 5 (sanely) will not let you set an integer column to a blank string. MySQL 4 will autoconvert, 5 won’t. Since the original db server ran MySQL 4 I had to enable a different SQL mode on the new host.
Removing STRICT_TRANS_TABLES from my.ini did the trick, and the CMS site loaded….like a glacier. Ouch. So, I enabled query_cache, tweaked some variables, and restarted. Much better.
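The sql-mode setting in my.ini is just a comma-separated list, so dropping one mode is a small string edit. A sketch in Python, with a made-up helper name and an example mode list that is only an assumption about what a typical Windows install ships with:

```python
def remove_sql_mode(sql_mode, unwanted="STRICT_TRANS_TABLES"):
    """Return the comma-separated mode list without one mode."""
    modes = [m.strip() for m in sql_mode.split(",") if m.strip()]
    kept = [m for m in modes if m.upper() != unwanted.upper()]
    return ",".join(kept)


# Example value only; check what your own my.ini actually contains.
example_mode = "STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"
relaxed_mode = remove_sql_mode(example_mode)
```

The other modes stay in place. Only the strict check that rejects a blank string for an integer column goes away, and the server still needs a restart to pick up the edited my.ini.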
With that yak shaved I could focus on the bandwidth issues. I’ll let ntop speak here:
Before:
After:
I know it’s a small sample time, but wow - from a peak of 28Mbps to 20Kbps. Intense!
To wrap this up, some more supporting applications needed their databases moved to the new host, and their configs updated, but that wasn’t a big deal. I am concerned about the Windows host becoming overloaded now, but short of getting an entire new server provisioned for it (ughhhh…) it will have to do.
So, what did we learn?
- Put your database server on an unmetered subnet, or use something like Amazon EC2 (they don’t bill inter-cloud bandwidth)
- Monitor bandwidth to avoid a surprise.
- Query cache is good as long as your SQL supports it. Our CMS puts some weird timestamp limits in some queries, but for the most part it’s a SELECT monster which caches nicely.
- The CMS was more than just a subset of the traffic - it was the traffic.
February 18, 2008 No Comments

