Up until recently the commonly understood and accepted ideal number for page-life-expectancy (PLE) for a SQL Server instance was 300 seconds (5 minutes) or greater. Recently, the Microsoft rule-of-thumb from the days of SQL 2000 and 4GB RAM is being rethought, some might say debunked.
Jonathan Kehayias offers a good explanation which says in summary, in these days of 16GB, 32GB, 64GB, and above RAM and correspondingly more available memory for the buffer pool, a PLE of only 300 seconds might indicate some significant resource usage and be a cause of concern.
In fact, even back when servers didn’t typically have as much RAM and instances had limited buffer pool sizes compared with today, the 300 second rule was of limited value as a stand-alone metric. Like many counters – from perfmon as well as metrics acquired from DMV queries – its meaning comes from its relation with other, associated metrics.
In the case of PLE, this performance counter value can be compared to the current buffer pool size to give an approximate rate of buffer pool churn (i.e. a measure of buffer pool data written to disk per second). For example, consider a 64bit SQL 2008 EE instance with 128GB RAM. Maximum memory for the instance is configured at 120GB and we can see that in the Target Server Memory performance counter. So to get MB of buffer pool writes per second we can use this equation:
BPool_Churn_in_MB_per_sec = (Target Server Memory KB/1024)/PLE
Here is a quick sample query to collect that data:
select Instancename, TargetServerMemoryKB/1024 'TargetServerMemoryMB', TotalServerMemoryKB/1024 as 'TotalServerMemoryMB', PLE, (TargetServerMemoryKB/1024)/PLE as 'ChurnMB/sec' from( select @@SERVERNAME as 'Instancename', max(case when counter_name = 'Target Server Memory (KB)' then cntr_value end) as 'TargetServerMemoryKB', max(case when counter_name = 'Total Server Memory (KB)' then cntr_value end) as 'TotalServerMemoryKB', max(case when counter_name = 'Page life expectancy' then cntr_value end) as 'PLE' from sys.dm_os_performance_counters) as p
(*note: PLE per NUMA node is not considered here)
So, given this example we would discover a Churn rate of ~400MB/sec which –depending, of course, on other factors – could be considered a fairly high rate of sustained disk activity due to disk writes from the buffer pool.
128GB is Real
Yes, servers with 128GB of RAM do exist. However, in my experience with the day-to-day workhorses in the SQL wild, 16GB, or the very common 8GB VM, or even 64GB of RAM is much more common than machines with 100GB+ RAM. So, in fact the 300 rule might still be considered a pretty good ballpark memory pressure indicator for servers in the 8GB-64GB range. But the point is not that a good portion of real world servers are 8-16GB VMs and so the 300 second rule is good enough, the point is that we can look deeper and with a churn rate metric we have a meaningful number on which to base configuration decisions. For one potential use scenario for this metric read on.
In the Wild
Most would agree that, in general, stacking SQL instance on the same server is not ideal. But due to poor design, consolidation efforts, security policies or a number of other reasons this design is out there and a number of SQL instances may be sharing the same server resources, including RAM.
To understand how the instances are sharing the server RAM, we can run the query above as a multi-server query so as to compare buffer pool churn. Here, utilizing the churn metric, we are beginning to develop a method to tune each instance’s minimum and maximum memory configurations for the best performance possible given the shared resource situation. Referring to the churn metric, we want to maximize available memory for each instance without compromising – too much – the memory available for the other instance(s).
To flesh out the query we can use instance name as a join condition and include a few other metrics of interest. Also, let’s drop the Churn MB units to KB (so we are more likely to get a non-zero number for qualitative comparison). Lastly, note we are using ‘Total Server Memory’ to calculate the Churn since – especially considering potential memory pressure from the other instances – this is the truer (though not perfect) representation of the size of the buffer pool.
;with cte as ( select @@SERVERNAME as 'Instancename', max(case when name = 'min server memory (MB)' then value_in_use end) as 'MinServerMemoryMB', max(case when name = 'max server memory (MB)' then value_in_use end) as 'MaxServerMemoryMB' from sys.configurations) select p.Instancename, os.PhysicalCPUCount, os.PhysicalMemoryMB, c.MinServerMemoryMB,c.MaxServerMemoryMB, p.TargetServerMemoryKB/1024 as 'TargetServerMemoryMB', p.TotalServerMemoryKB/1024 as 'TotalServerMemoryMB', p.PLE , (p.TotalServerMemoryKB)/p.PLE as 'ChurnKB/sec' from( select @@SERVERNAME as 'Instancename', max(case when counter_name = 'Target Server Memory (KB)' then cntr_value end) as 'TargetServerMemoryKB', max(case when counter_name = 'Total Server Memory (KB)' then cntr_value end) as 'TotalServerMemoryKB', max(case when counter_name = 'Page life expectancy' then cntr_value end) as 'PLE' from sys.dm_os_performance_counters) as p join cte c on p.instancename = c.instancename join (SELECT @@SERVERNAME as 'Instancename', cpu_count AS 'LogicalCPUCount', hyperthread_ratio AS 'HyperthreadRatio', cpu_count/hyperthread_ratio AS 'PhysicalCPUCount', physical_memory_in_bytes/1048576 AS 'PhysicalMemoryMB' FROM sys.dm_os_sys_info ) as os on c.instancename=os.instancename
Here are a couple example results from servers hosting 3 instances:
Just quickly scanning these results one might consider reconfiguring the even split of memory on Server1’s instances. InstanceA max memory can perhaps be reduced and the more heavily churned Instance3 max memory increased. On the other hand, on Server2 one might argue that even though Instances A and C have a max memory setting that is tuned quite low and InstanceB is getting the greater slice, InstanceB is still reporting a high relative churn (note the PLE is approaching the 300 mark). Perhaps the server needs more RAM or InstanceB should have a dedicated server.
Granted these are only snapshot values. These metrics can change often and may vary widely over a day, a week, or a month. A better, more interesting evaluation would include results from these counters, as well as a few others, over some range of time and as the configurations are adjusted – but I’ll save that for another post.
To conclude, the 300 rule turns out to be of some use, but just looking at PLE on these instances we might not be able to come to a reasonable judgment on how to distribute the RAM resource. It is only possible by looking in-depth beyond the 300 second rule to consider the churn rate and, more generally, all the pertinent info; not just PLE, but including (most importantly) buffer pool size and (most specifically) the current buffer pool size in use, as well as minmax server memory and available RAM and CPU count. We need this whole picture presented before us so we can compare, contrast and decide.
EPM applications help measure the business performance. This post will help you choose the best EPM solutions for your organization’s needs and objectives.