Max Server Memory and the 300 Second Rule
Author: Andy McDermid | December 11, 2013
Until recently, the commonly understood and accepted ideal for page life expectancy (PLE) on a SQL Server instance was 300 seconds (5 minutes) or greater. That Microsoft rule of thumb, dating from the days of SQL Server 2000 and 4GB of RAM, is now being rethought, some might say debunked.
Jonathan Kehayias offers a good explanation. In summary: in these days of 16GB, 32GB, 64GB, and more RAM, with correspondingly more memory available for the buffer pool, a PLE of only 300 seconds might indicate significant resource usage and be cause for concern.
Churning
In fact, even back when servers didn’t typically have as much RAM and instances had limited buffer pool sizes compared with today, the 300 second rule was of limited value as a stand-alone metric. Like many counters – from perfmon as well as metrics acquired from DMV queries – its meaning comes from its relation with other, associated metrics.
In the case of PLE, this performance counter value can be compared to the current buffer pool size to give an approximate rate of buffer pool churn (i.e. a measure of how much buffer pool data is flushed and replaced per second). For example, consider a 64-bit SQL Server 2008 Enterprise Edition instance with 128GB RAM. Maximum memory for the instance is configured at 120GB, and we can see that in the Target Server Memory performance counter. So to get MB of buffer pool churn per second we can use this equation:
BPool_Churn_in_MB_per_sec = (Target Server Memory KB/1024)/PLE
Here is a quick sample query to collect that data:
select Instancename,
       TargetServerMemoryKB/1024 as 'TargetServerMemoryMB',
       TotalServerMemoryKB/1024 as 'TotalServerMemoryMB',
       PLE,
       (TargetServerMemoryKB/1024)/nullif(PLE, 0) as 'ChurnMB/sec'
from (
    select @@SERVERNAME as 'Instancename',
           max(case when counter_name = 'Target Server Memory (KB)' then cntr_value end) as 'TargetServerMemoryKB',
           max(case when counter_name = 'Total Server Memory (KB)' then cntr_value end) as 'TotalServerMemoryKB',
           max(case when counter_name = 'Page life expectancy' then cntr_value end) as 'PLE'
    from sys.dm_os_performance_counters
) as p

(The nullif guards against a divide-by-zero error on an instance whose PLE momentarily reads zero, e.g. just after startup.)
(*note: PLE per NUMA node is not considered here)
So, given this example, we would discover a churn rate of ~400MB/sec, which – depending, of course, on other factors – could be considered a fairly high rate of sustained disk activity driven by buffer pool turnover.
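The arithmetic behind that ~400MB/sec figure is easy to sanity check. Here is a minimal Python sketch of the same equation, plugging in the example's 120GB of target memory and the classic 300-second PLE (these are the example's numbers, not live counter readings):

```python
# Approximate buffer pool churn from Target Server Memory and PLE.
# BPool_Churn_in_MB_per_sec = (Target Server Memory KB / 1024) / PLE

def churn_mb_per_sec(target_server_memory_kb: int, ple_seconds: int) -> float:
    """Rate at which the buffer pool is being turned over, in MB/sec."""
    return (target_server_memory_kb / 1024) / ple_seconds

target_kb = 120 * 1024 * 1024   # 120GB max server memory, expressed in KB
ple = 300                        # the classic 300-second threshold

print(churn_mb_per_sec(target_kb, ple))  # prints 409.6 -- i.e. roughly 400MB/sec
```

In other words, on a large-memory instance the old 300-second floor already corresponds to hundreds of megabytes of buffer pool turnover every second.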
128GB is Real
Yes, servers with 128GB of RAM do exist. However, in my experience the day-to-day workhorses in the SQL Server wild – 16GB boxes, the very common 8GB VM, even 64GB machines – far outnumber servers with 100GB+ of RAM. So the 300-second rule might still be considered a pretty good ballpark memory pressure indicator for servers in the 8GB-64GB range. But the point is not that a good portion of real-world servers are 8-16GB VMs and so the 300-second rule is good enough; the point is that we can look deeper, and with a churn rate metric we have a meaningful number on which to base configuration decisions. For one potential use of this metric, read on.
In the Wild
Most would agree that, in general, stacking SQL Server instances on the same server is not ideal. But due to poor design, consolidation efforts, security policies, or a number of other reasons, this design is out there, and a number of SQL Server instances may be sharing the same server resources, including RAM.
To understand how the instances are sharing the server RAM, we can run the query above as a multi-server query so as to compare buffer pool churn. Here, utilizing the churn metric, we are beginning to develop a method to tune each instance’s minimum and maximum memory configurations for the best performance possible given the shared resource situation. Referring to the churn metric, we want to maximize available memory for each instance without compromising – too much – the memory available for the other instance(s).
To flesh out the query we can use instance name as a join condition and include a few other metrics of interest. Also, let's drop the churn units from MB to KB (so we are more likely to get a non-zero number for qualitative comparison). Lastly, note we are using 'Total Server Memory' to calculate the churn since – especially considering potential memory pressure from the other instances – it is the truer (though not perfect) representation of the size of the buffer pool.
;with cte as (
    select @@SERVERNAME as 'Instancename',
           max(case when name = 'min server memory (MB)' then value_in_use end) as 'MinServerMemoryMB',
           max(case when name = 'max server memory (MB)' then value_in_use end) as 'MaxServerMemoryMB'
    from sys.configurations
)
select p.Instancename,
       os.PhysicalCPUCount,
       os.PhysicalMemoryMB,
       c.MinServerMemoryMB,
       c.MaxServerMemoryMB,
       p.TargetServerMemoryKB/1024 as 'TargetServerMemoryMB',
       p.TotalServerMemoryKB/1024 as 'TotalServerMemoryMB',
       p.PLE,
       p.TotalServerMemoryKB/nullif(p.PLE, 0) as 'ChurnKB/sec'
from (
    select @@SERVERNAME as 'Instancename',
           max(case when counter_name = 'Target Server Memory (KB)' then cntr_value end) as 'TargetServerMemoryKB',
           max(case when counter_name = 'Total Server Memory (KB)' then cntr_value end) as 'TotalServerMemoryKB',
           max(case when counter_name = 'Page life expectancy' then cntr_value end) as 'PLE'
    from sys.dm_os_performance_counters
) as p
join cte c
    on p.Instancename = c.Instancename
join (
    select @@SERVERNAME as 'Instancename',
           cpu_count as 'LogicalCPUCount',
           hyperthread_ratio as 'HyperthreadRatio',
           cpu_count/hyperthread_ratio as 'PhysicalCPUCount',
           physical_memory_in_bytes/1048576 as 'PhysicalMemoryMB'
    from sys.dm_os_sys_info
) as os
    on c.Instancename = os.Instancename
Here are a couple example results from servers hosting 3 instances:
Server 1:
Instance | CPUs | MemMB | MinMemMB | MaxMemMB | TargetMemMB | TotalMemMB | PLE | ChurnKB/sec |
Instance1 | 8 | 16383 | 4600 | 4600 | 4600 | 4600 | 398486 | 11 |
Instance2 | 8 | 16383 | 4600 | 4600 | 4600 | 4600 | 25544 | 184 |
Instance3 | 8 | 16383 | 4600 | 4600 | 4600 | 4600 | 2672 | 1762 |
Server 2:
Instance | CPUs | MemMB | MinMemMB | MaxMemMB | TargetMemMB | TotalMemMB | PLE | ChurnKB/sec |
InstanceA | 4 | 16383 | 1024 | 2048 | 2048 | 2048 | 86002 | 24 |
InstanceB | 4 | 16383 | 2048 | 8192 | 8192 | 8192 | 379 | 22133 |
InstanceC | 4 | 16383 | 1024 | 1024 | 1024 | 1024 | 71931 | 14 |
Just quickly scanning these results, one might reconsider the even split of memory across Server 1's instances: Instance1's max memory can perhaps be reduced and the more heavily churned Instance3's increased. On Server 2, on the other hand, one might argue that even though Instances A and C have max memory tuned quite low and InstanceB is getting the greater slice, InstanceB is still reporting high relative churn (note its PLE is approaching the 300 mark). Perhaps the server needs more RAM, or InstanceB should have a dedicated server.
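To make that eyeballing repeatable, the same churn arithmetic can be applied to the snapshot rows in a few lines of script. This sketch uses the Server 2 figures from the table above; the 600-second cutoff (twice the old 300-second mark) is an arbitrary illustrative threshold, not something from the original query:

```python
# Recompute churn for each instance from its Total Server Memory and PLE,
# and flag any instance whose PLE is drifting toward the 300-second mark.
# Rows are the Server 2 snapshot from the text; the 600s cutoff is illustrative.

rows = [
    # (instance, max_memory_mb, total_server_memory_mb, ple_seconds)
    ("InstanceA", 2048, 2048, 86002),
    ("InstanceB", 8192, 8192, 379),
    ("InstanceC", 1024, 1024, 71931),
]

for name, max_mb, total_mb, ple in rows:
    churn_kb_sec = total_mb * 1024 / ple  # Total Server Memory KB / PLE
    warn = " <- PLE nearing 300s despite the largest memory grant" if ple < 600 else ""
    print(f"{name}: {churn_kb_sec:,.0f} KB/sec (PLE {ple}){warn}")
```

Run against these rows, only InstanceB trips the warning: it holds the largest slice of RAM yet still churns ~22,000 KB/sec, which is exactly the "more RAM or a dedicated server" conversation the snapshot suggests.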
Decisions, Decisions
Granted, these are only snapshot values. These metrics can change often and may vary widely over a day, a week, or a month. A better, more interesting evaluation would include results from these counters, as well as a few others, over some range of time and as the configurations are adjusted – but I'll save that for another post.
To conclude: the 300-second rule turns out to be of some use, but looking at PLE alone on these instances we might not be able to come to a reasonable judgment on how to distribute the RAM. That judgment only becomes possible by looking beyond the 300-second rule to the churn rate and, more generally, to all the pertinent information: not just PLE, but (most importantly) the buffer pool size, (most specifically) the portion of the buffer pool currently in use, plus the min/max server memory settings, available RAM, and CPU count. We need this whole picture in front of us so we can compare, contrast, and decide.