{"id":3141,"date":"2014-10-25T22:27:21","date_gmt":"2014-10-26T06:27:21","guid":{"rendered":"http:\/\/www.atumvirt.com\/?p=3141"},"modified":"2014-10-25T22:27:21","modified_gmt":"2014-10-26T06:27:21","slug":"the-strange-case-of-no-desktops-available","status":"publish","type":"post","link":"https:\/\/avtempwp.azurewebsites.net\/2014\/10\/the-strange-case-of-no-desktops-available\/","title":{"rendered":"The Strange Case of \u201cNo Desktops Available\u201d"},"content":{"rendered":"<p>I recently spoke with someone and realized I hadn\u2019t created a post for this yet \u2013 much to my disappointment, as it was a major accomplishment to have the issue solved.&nbsp; At my previous employer (K-12 education) the environment was set up as follows:<\/p>\n<ul>\n<li>Approximately 1100 pooled desktops from a single PVS image.<\/li>\n<li>600 powered on desktops from 6am to 5pm<\/li>\n<li>50% PeakBufferPercent<\/li>\n<li>The majority of users were in library or lab environments where 10-30 users would log on and off in a fairly short time<\/li>\n<li>Average concurrent utilization analyzed over a period of two months was between 350 and 450 users during peak hours (~10am to noon Monday thru Friday)<\/li>\n<\/ul>\n<p>On February 28th, there was a brief \u201coutage\u201d period where users were reporting \u201cNo Desktops Available.\u201d&nbsp; I frantically scrambled to find out why \u2013 there were only about 400 users logged in so we should have had plenty of reserve available and powered on or at the very least more machines powering on.&nbsp; It wasn\u2019t every login that was failing, either.&nbsp; During my troubleshooting, school ended and I was eventually unable to duplicate the issue, despite having changed nothing.&nbsp; Since I couldn\u2019t duplicate it, I looked a bit longer for a cause, then chalked it up to a ghost I\u2019d have to watch out for in the future.<\/p>\n<p>The same thing happened on March 5th, at which time my alarms went into overdrive.&nbsp; We didn\u2019t 
have support, so it was definitely time to call in reinforcements.&nbsp; A quick check in #Citrix proved unfruitful at the time and it seemed we were on our own.&nbsp; We had previously had difficulties getting satisfactory results out of support so we were reluctant to pay for per-hour support from a partner.&nbsp; As before, while I was troubleshooting the issue, it went away seemingly on its own.&nbsp; It was at an earlier time of day, too, which proved strange.<\/p>\n<p>On March 12th and 13th the same symptoms were experienced but reports didn\u2019t trickle in through the helpdesk to the right channels, despite our being extra vigilant.&nbsp; On the 21st, however, it happened again and being on XenDesktop 7 (which was being serviced only through private, non-publicized hotfixes), we decided an \u201cemergency\u201d change request to XenDesktop 7.1 was in order.&nbsp; This update proved unfruitful as the issue re-occurred on the 26th as well; however, between the 21st and 26th we had arranged for a Citrix case to be opened through a partner.<\/p>\n<p>On the 26th, with support on the line, we captured the standard stuff \u2013 Scout, CDF traces, event logs, etc. 
of every conceivable component that support could muster \u2013 including powering on all 1100 VMs and collecting their broker agent logs \u2013 but again, the issue went away on its own.&nbsp; Since support seemed pretty lost (which turns out to be somewhat common, unfortunately), I decided to give #Citrix another whirl.&nbsp; A Citrix Support staff member who participates in the channel reviewed my CDF trace and uploads.&nbsp; The next day I received a nice PM, paraphrased below:<\/p>\n<p>\u201cWe looked over your logs.&nbsp; An incredibly smart guy on my team has a hunch.&nbsp; You\u2019re using the default power settings, yes?\u201d<\/p>\n<p>Me: \u201cYes\u201d<\/p>\n<p>\u201cCheck MaxPowerActionsPerMinute\u201d.<\/p>\n<p>Since I had a support case that was being escalated at this point I passed this tidbit along to support, who shrugged it off, despite my emphasizing that this was a recommendation from another unit within support.&nbsp; Given the visibility this case had within the organization, we needed this resolved and an <strong><em>official <\/em><\/strong>answer.&nbsp; There were weeks of inactivity because the issue didn\u2019t re-occur.&nbsp; I had set the power settings to keep all 1100 VMs powered on in order to mitigate the risk during a critical testing time.&nbsp; However, for the better part of a month I harped on support about that setting but was ignored again and again.&nbsp; Finally, reflecting on my earlier work (and post) digging around in the XenDesktop database, I decided to go find out where the Power Actions were stored.&nbsp; I initially didn\u2019t find the Get-BrokerHostingPowerAction cmdlet (I didn\u2019t look very hard, because support told me the answer from development was that there was \u201cNo way to find this information, we don\u2019t expose it\u201d).&nbsp; So instead, I turned to the VirtualCenter database, which logs the tasks.&nbsp; My fears were immediately confirmed when I saw the 
pattern in the tasks submitted from svc-xendesktop.<\/p>\n<p>A large number of commands were being sent, then they slowed to 10 per minute, with 10 new commands arriving every 60 seconds \u2013 indicating that the commands were queuing, then executing.&nbsp; When I submitted this database dump, along with a reminder that MaxPowerActionsPerMinute had been suggested by support and a note that we <strong>were not going to renew SA<\/strong> because we weren\u2019t getting satisfactory results from support, I got an e-mail within 6 hours that had the \u201cUse Get-BrokerHostingPowerAction\u201d answer (eye roll, good job there guys).<\/p>\n<p>Get-BrokerHostingPowerAction did exactly what I wanted \u2013 it listed the pending actions, which immediately revealed that during peak hours I was seeing 250-350 queued commands.&nbsp; Over a period of logons and logoffs, that backlog meant not enough VDAs were registered and more couldn\u2019t be powered on in a timely fashion, thus resulting in \u201cNo desktops available\u201d.&nbsp; Finally, after nearly 2 months, I was almost free. 
I had the answer, I had proven to support and escalation support that the issue was indeed power management and the advanced setting \u201cMaxPowerActionsPerMinute\u201d had to be changed in our environment, so we asked for the official recommendation.&nbsp; After a week or two of waiting, they finally came back with (paraphrased), \u201cwe have no official answer, you will need to test it in your environment.\u201d&nbsp; That\u2019s exactly what I did.&nbsp; I found that with our storage configuration, the CPU and RAM of my VDI hosts, the hypervisor (vSphere Standard) and solid PVS servers I could safely execute about 60 power actions per minute without causing the host CPUs to go hog-wild and degrade the experience while a large number of VMs booted (the real ceiling in that environment was closer to 100, but I intentionally set it much lower to provide a cushion during the daytime).<\/p>\n<h2>So why was this an issue?<\/h2>\n<p>In a pooled VDI environment, delivery groups are set, by default, to power off machines after use.&nbsp; This behavior can be configured with the following PowerShell commands:<\/p>\n<blockquote>\n<p>Set-BrokerDesktopGroup -Name &quot;Desktop Group Name&quot; -ShutdownDesktopsAfterUse $True<\/p>\n<p>Set-BrokerDesktopGroup -Name &quot;Desktop Group Name&quot; -ShutdownDesktopsAfterUse $False<\/p>\n<\/blockquote>\n<p>When a logoff occurs, XenDesktop will send the Hypervisor connection associated with that VM a graceful shutdown command (i.e. 
using VMware Tools).&nbsp; This task is limited by the default setting of 10 new actions per minute on the Hypervisor resource in the site configuration, shown below:<\/p>\n<p><img src=\"http:\/\/cdn.ws.citrix.com\/wp-content\/uploads\/2014\/09\/XD-7.5-Throttling1.png\"><\/p>\n<p>In 7.6, the screen changes a bit.&nbsp; Read about the changes <a href=\"http:\/\/blogs.citrix.com\/2014\/09\/30\/personal-vdisk-pvd-and-hypervisor-throttling-in-xendesktop-7-6\/\">here<\/a>.<\/p>\n<p><img src=\"http:\/\/cdn.ws.citrix.com\/wp-content\/uploads\/2014\/09\/XD-7.6-Throttling1.png\"><\/p>\n<p>The \u201cmaximum new actions per minute\u201d is what was adjusted, but why was it needed?&nbsp; Well, as it turns out the primary use case was library and lab scenarios with approximately 1800 endpoints (thin clients or repurposed machines), which involved either a large number of students logging off at once (and thus generating approximately 30 power actions all at once) or a high volume of users logging in and out, say when looking up a library book.&nbsp; Given that these labs were spread among approximately 30 sites, the queue could build up, and once it was deep enough it couldn\u2019t recover until after the school day was out.<\/p>\n<h2>Could this have been avoided by not using power management?<\/h2>\n<p>No.&nbsp; Power management wasn\u2019t the cause of the issue.&nbsp; Although we didn\u2019t see the issue with 1100 powered on VMs, this was only because our reserve was able to meet the demand while our queue for power actions was very large.&nbsp; If we had a higher concurrent usage at the time (say, 600 or 700 users) we\u2019d have seen the same problem and likely faster as well given that there\u2019d be a greater number of people logging in and out.<\/p>\n<h2>What lessons can be learned from this?<\/h2>\n<p>The first lesson, I\u2019d say, is to be sure to use all resources available to you.&nbsp; It\u2019s no secret I\u2019m a fan of #Citrix IRC.&nbsp; Since joining in 
2011 the knowledge I\u2019ve gained from simply listening and watching or participating in conversations has accelerated my understanding of Citrix technologies tremendously.&nbsp; The Citrix discussion forums can also be helpful, as well as other user communities.<\/p>\n<p>Secondly, although having support is a nice CYA, don\u2019t take what they say as gospel, even if it is from an escalation engineer supposedly speaking with development.&nbsp; I could have had my \u201csmoking gun\u201d for my issue weeks earlier had I found the Get-BrokerHostingPowerAction cmdlet on my own, but I took the escalation engineer\u2019s word that the functionality, after checking with development, was simply not exposed.&nbsp; I knew something was off with that statement since I had seen the advanced connection properties dialog and knew it must be related.<\/p>\n<p>Third, test.&nbsp; One might be tempted to set the max actions per minute to a high value like 200, but keep in mind what happens to your infrastructure and hosts if you boot 200 VMs at once.&nbsp; Even though my PVS and storage could handle it, the CPUs in the host (2 x 6 core) simply maxed out much lower.&nbsp; Additionally, many PVS servers are substantially \u201csmaller\u201d in resources than the ones I was using because I had tuned them for the needs of the environment.<\/p>\n<p>Finally, realize that you might be a corner case.&nbsp; While Citrix marketing makes it seem as though everyone is rolling XenDesktop with tens of thousands of VMs and therefore it must be a common scenario, the exact factors that go into making this particular issue occur actually might be quite rare.&nbsp; For example, using session hosts, this would never be an issue as they wouldn\u2019t have power actions on logoffs.&nbsp; Likewise, in a non-lab or otherwise less rotational setup these bursts might have been handled normally with the power-on buffer.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently spoke with 
someone and realized I hadn\u2019t created a post for this yet \u2013 much to my disappointment, as it was a major accomplishment to have the issue solved.&nbsp; At my previous employer (K-12 education) the environment was set up as follows: Approximately 1100 pooled desktops from a single PVS image. 600 powered on [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[10,62,77],"tags":[102,115,119,122],"_links":{"self":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/posts\/3141"}],"collection":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/comments?post=3141"}],"version-history":[{"count":0,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/posts\/3141\/revisions"}],"wp:attachment":[{"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/media?parent=3141"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/categories?post=3141"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/avtempwp.azurewebsites.net\/wp-json\/wp\/v2\/tags?post=3141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}