{"id":1791,"date":"2013-09-10T01:28:04","date_gmt":"2013-09-10T01:28:04","guid":{"rendered":"http:\/\/www.atumvirt.com\/?p=1791"},"modified":"2013-09-10T01:28:04","modified_gmt":"2013-09-10T01:28:04","slug":"a-bumpy-road-leads-to-a-rocky-start","status":"publish","type":"post","link":"https:\/\/avtempwp.azurewebsites.net\/2013\/09\/a-bumpy-road-leads-to-a-rocky-start\/","title":{"rendered":"A Bumpy Road Leads to A Rocky Start"},"content":{"rendered":"

To say that I’ve been busy the past few weeks is a bit of an understatement. \u00a0I am relieved and proud to say that as of today we have 193 of our 861 desktops in use. \u00a0We survived a simultaneous log on \/ log off of 50 users, and things have gone pretty smoothly so far. \u00a0I finally have some breathing room to troubleshoot the minor one-off issues affecting individual users and to do some cosmetic cleanup. \u00a0Here’s a rundown of what happened over the past few weeks.<\/p>\n

 <\/p>\n

In early August our target was set at 770 desktops for the start of school on September 4th. \u00a0As part of this expansion we were planning to switch the storage to Atlantis ILIO Diskless VDI (for non-persistent desktops). \u00a0The figure of 77 VMs per host was derived from a specification provided by Atlantis for a 64 GB ILIO VM on a 256 GB host at 2 GB per VM, leaving room for approximately 79 VMs. \u00a0We opted for 77, since it fit our model of 770 desktops.<\/p>\n

Around August 20th, we were told we had another budgetary boost to provide substantial, noticeable improvements in how we deliver services to students. \u00a0As part of that, we opted to expand our infrastructure by 308 licenses on the assumption that we’d be working with 77 VMs per host. \u00a0We purchased the 4 additional hosts and licenses. \u00a0As other hardware arrived, I began installing and configuring the hosts. \u00a0The setup time for each Hyper-V host was substantial.<\/p>\n
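
The host counts in this plan follow directly from the 77-VMs-per-host density. As a quick sanity check, here is a minimal Python sketch of that arithmetic. The function and variable names are purely illustrative (they are not part of any Citrix or Atlantis tooling), and the sketch assumes every host runs at exactly the planned density.<\/p>\n

```python
VMS_PER_HOST = 77  # planned density per host, taken from the post


def hosts_needed(desktops, vms_per_host=VMS_PER_HOST):
    # Integer ceiling division: a partially filled host is still a whole host.
    return (desktops + vms_per_host - 1) // vms_per_host


original_hosts = hosts_needed(770)   # initial target for the start of school
expansion_hosts = hosts_needed(308)  # licenses added around August 20th
total_hosts = original_hosts + expansion_hosts

print(original_hosts, expansion_hosts, total_hosts)  # prints: 10 4 14
```

With these inputs the sketch gives 10 hosts for the original 770 desktops and 4 more for the 308 additional licenses, which lines up with the 4 additional hosts that were purchased.<\/p>\n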

On Friday, August 23rd, the HP Universal Print Driver was updated on the print server, causing various printer installation issues on machines with corrupted registry keys. \u00a0This included the existing XenDesktop 5.6 image, which did not have a fix applied to it. \u00a0Users (apparently) reported it, but the reports were not routed to the appropriate queue in a timely fashion.<\/p>\n

With the new XenDesktop 7 infrastructure in place and functionally tested, we were ready to schedule a cutover. \u00a0However, one of our biggest risks was our reliance on the EMC Celerra CIFS server, which had previously suffered significant CPU issues for no apparent reason. \u00a0EMC was unable to resolve the case, so we decided to abandon the Celerra platform and instead opt for a Windows Server 2012 file server. \u00a0On Wednesday, August 28th, I made the final cutover from the Celerra to the Windows file server (~25-30 million files, approximately 18TB before de-duplication).<\/p>\n

On Thursday, August 29th, I was made aware of persistent (and rather angry) reports that users “could not print anything since last Friday” (I won’t go into the fact that users have 5-40 printers to choose from at their site from a variety of manufacturers). \u00a0As the day was halfway done and the cutover was scheduled for that evening, we held out for Friday. \u00a0The cutover on Thursday evening went fairly smoothly, but I noticed very late (~1:30 AM) that Office wasn’t activated. \u00a0I had to put the disk into private mode to fix it. \u00a0After doing so, I\u00a0accidentally neglected to re-select \u00a0Cache on Device Hard Drive<\/strong>, leaving the default of cache on server.<\/p>\n

On Friday, we had reports of horrific performance. \u00a0Given that we were using ILIO, we were quite surprised. \u00a0As it turned out, however, the devices were caching on the PVS server. \u00a0After a rather troubling day, we got through it and I changed the vDisk over the weekend, working tirelessly to provision additional VMs on the Hyper-V hosts (a follow-up post will be available about Hyper-V\/SCVMM). \u00a0On Tuesday, with 1 day to go, it was clear I would not be able to provision VMs fast enough to come anywhere near our expected roll-out. \u00a0To complicate matters, early Tuesday morning a server became completely unresponsive, knocking off 12 users. \u00a0I immediately contacted Atlantis support, who got back to me, and we discovered that the ramdisk had filled to capacity. \u00a0While I was on the phone, I discovered that Hyper-V allocates BIN files regardless of any user preferences for memory swapping and that there is no way to turn this off, which drastically reduced our storage capacity in the ILIO ramdrive (~1-2GB per VM before deduplication). \u00a0Around noon, the remaining hosts began to fill to capacity, and as the 90 or so users bounced around they filled up the additional servers and everything went down.<\/p>\n

We immediately began trying to find out\u00a0why<\/strong> the write cache was filling so quickly with so few users. \u00a0However, we were hamstrung by Hyper-V’s hyper-slow provisioning through the XenDesktop Setup Wizard in Provisioning Services, so even getting basic service restored was extremely difficult. \u00a0By Wednesday at 2:30 AM, I had\u00a0brought 20 VMs per host online and was working to reconfigure the Hyper-V hosts to attach via iSCSI to storage for the configuration files as well as the .BIN swap files. \u00a0On Wednesday it became clear that the slow provisioning in Hyper-V and SCVMM, as well as general failings with SCVMM, were going to be our death knell, so I began requesting quotes on ESXi standard pricing.<\/p>\n

On Thursday, September 5th, I was working with Atlantis support on optimizations for our image to reduce write cache usage. \u00a0Atlantis also suggested using SDELETE to write zeros to the free space of the virtual desktops in order to free space on the RAM drive. \u00a0We configured this to run as a shutdown script due to our high log on and log off rate. \u00a0On Thursday evening, after running semi-stable for a day, I provisioned as many desktops as I could, increasing the number per host from 20. \u00a0I enabled some hosts with additional VMs (some with 30, some with 40), which brought us to a total of 180. \u00a0Additionally, I set up one host as ESXi and installed ILIO Center to monitor the ILIO instances, but did not have any online before Friday morning. \u00a0I was scheduled to be out of the office on Friday for my son’s birthday. \u00a0I went to sleep (again) at about 3 AM for the 7th or 8th time in 2 weeks after verifying we had the 180 VMs online and ready.<\/p>\n

Friday morning I woke up at about 6:45 to check the status of things. \u00a0To my horror, I discovered all desktops were down. \u00a0PVS was not booting devices properly. \u00a0A quick CDF trace revealed there was a date stamp difference between the versions (why, I don’t know – I used robocopy to copy it). \u00a0Recopying didn’t work, so I just deleted the version, re-made the changes in a maintenance version, and recopied as quickly as possible. \u00a0By 8:15, I had about 60 VMs online with more booting every few minutes. \u00a0I continued working to get the environment stabilized and to monitor the write cache before departing at about 11:30 to hang out with my son. \u00a0By 3pm, they were off to grandma’s house and I helped out with the (now daily) VDI status meeting. \u00a0The plan was to expand the VMware platform and get some Hyper-V hosts closer to capacity.<\/p>\n

I worked Friday evening to cover my hours and into overtime setting up ESXi and reverse imaging our Windows 7 image for import into VMware. \u00a0On Saturday, I worked hard to clean up the image and stress-test the VMware deployment with ILIO. \u00a0The results were so fantastic that I called my supervisor to ask whether, instead of doing only 1-2 hosts on VMware as planned, I could proceed and do the rest as well. \u00a0We agreed to keep 1 Hyper-V host and the remaining 13 would be VMware. \u00a0It truly was pleasing to have the OS deployed rapidly without cumbersome imaging and annoying configuration requirements (NIC teaming on 2008 R2…looking at you). \u00a0The ILIO VMs were provisioned with 90GB of RAM and in very short order I had 13 hosts ready to go…then the moment of truth. \u00a0I had 12 hosts remaining to provision. \u00a0At about 1:00 AM on Sunday, I opened 12 instances of the PVS Console and was able to start\u00a0all<\/strong> of them, each for 77 VMs per host. \u00a0It took about 37 minutes. \u00a0In SCVMM, even with the “fast” workaround of enabling dynamic disks, that would have taken about 26 hours (again, assuming random SCVMM failures didn’t happen!). \u00a0On Sunday, I successfully booted (and rebooted) 861 VMs a few times and did further image cleanup. \u00a0Antivirus was removed because its policies weren’t applying properly, and the SCCM client was removed because various “idle” tasks were eating away at our CPU. \u00a0Finally, late Sunday I was content that the VMs were stable.<\/p>\n
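
To put the provisioning gap in perspective, here is a minimal Python sketch comparing the two timings quoted above. The 37 minutes is what I measured with the PVS Console on VMware; the 26 hours is only my estimate for SCVMM, so the resulting ratio is a rough indication and not a benchmark. The function and variable names are illustrative.<\/p>\n

```python
def speedup(slow_minutes, fast_minutes):
    # How many times faster the quicker run is than the slower one.
    return slow_minutes / fast_minutes


pvs_minutes = 37          # measured: 12 PVS Console instances on VMware
scvmm_minutes = 26 * 60   # estimated: same workload through SCVMM
vm_count = 12 * 77        # 12 hosts at 77 VMs per host

print(round(speedup(scvmm_minutes, pvs_minutes), 1))  # prints: 42.2
print(round(vm_count / pvs_minutes, 1))               # VMs per minute, prints: 25.0
```

Under those assumptions the VMware run comes out roughly 42 times faster, at about 25 VMs provisioned per minute across the 12 hosts.<\/p>\n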

Monday morning I woke up and immediately checked my phone…AND NO UNEXPECTED ALERTS! \u00a0I was thrilled. \u00a0Throughout the day I was monitoring the write cache sizes as well as the ILIO datastores. \u00a0Peak usage in the ILIO datastore was about 15GB, while some hosts barely scratched 6GB. \u00a0Now that the ILIO instances are on VMware we don’t have a 64 GB limit and can allocate 90GB per ILIO, leaving us with a 110GB NFS store.<\/p>\n

In the VDI status meeting it was evident that VMware was the clear choice moving forward. \u00a0Hyper-V’s TCO was simply blown out of the water by the stability and speed of the tools at hand. \u00a0On Tuesday, 9\/10\/2013, we will begin pushing our limits and trying to get a host to capacity, that is, 77 VMs in use. \u00a0We’re not out of the woods yet, but I’m sitting back enjoying a beer after what I consider to be a successful day (albeit a bit late!) and I am looking forward to making sure that our VDI environment is absolutely\u00a0awesome<\/strong> for the users.<\/p>\n

 <\/p>\n","protected":false},"excerpt":{"rendered":"

To say that I’ve been busy the past few weeks is a bit of an understatement. \u00a0I am relieved and proud to say that as of today we had 193 desktops of 861 in use. \u00a0We survived a log on \/ log off of 50 users at once and things went pretty smoothly so far. […]<\/p>\n","protected":false},"author":1,"featured_media":1801,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[10,82,33,43,51,62,77],"tags":[86,95,119,120,122],"_links":{"self":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/posts\/1791"}],"collection":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/comments?post=1791"}],"version-history":[{"count":0,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/posts\/1791\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/media\/1801"}],"wp:attachment":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/media?parent=1791"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/categories?post=1791"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/tags?post=1791"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}