Sunday, June 6, 2010

RDTSC, RDTSCP and QueryPerformanceCounter

During a recent project to implement the ICorProfilerCallback2 interface, supplied by Microsoft to profile .NET applications, I hit a small snag. I wanted a fine grained time stamp. Fine grained enough to see all distinct function call times to say something simple like the Fibonacci algorithm.

The Fibonacci algorithm is quite small and done recursively using C# with the Environment.TickCount property the times overlap and according to the time generated it would appear that many operations occur at the same time. Of course this is nonsense. It is also obvious that collecting this time through the Environment class requires going through many layers of the .NET runtime to actually get to the underlying mechanism that collects that time. This is where the previously mentioned interface comes in.

Implementing the ICorProfilerCallback2 interface, the most recent profiler interface is ICorProfilerCallback3, allows the user to stay in native and gather a lot of data about the runtime needed by a profiler. Being in native then ensures that I will not have to go through the layers of the runtime and can access more precise data. This would of course mean making a call to the WINAPI GetTickCount(). Unfortunately this also did not offer the fine grained resolution desired.

Enter assembly. Regardless of what my final resolution was, I highly recommend anyone who has any interest in assembly to goto the Intel site and either download and\or request the Intel Architecture manual. They are quite helpful and although did not get me to the best solution for this problem have answer many questions since. Reading through the vast instruction lists, I stumbled upon rdtsc and rdtscp.

The first instruction, rdtsc, queries the on chip counter for an ever increasing QWORD. The previously mentioned functions all returned a DWORD. Using rdtsc however requires doing some inline assembly or with Visual C++ the intrinsic call __rdtsc can be used. The assembly code though is quite simple and can be done with some help from your favorite search engine or with the previously mentioned manuals. There are some caveats with using this instruction and they are based heavily on power management settings and with CPU frequency throttling. I am not going to discuss them here, but be wary of this.

After changing over my code to use the rdtsc instruction, resolution was granted and the world good, until I decided to try out the profiler on a multi-processor machine. Then the world was not so good and time did not seem to always be going forward. Timestamps would show the system leaving a function before it entered it. I assumed this was my own error but after some lengthy research realized that the counters being used were not synchronized across all CPUs. This was a terrible discovery and one that has a long history, that for another time. My solution was to use the rdtscp instruction that also returned the processor ID. This information is contained in those manuals mentioned above.

The rdtscp instruction proved to be an issue on another front however, not all CPUs come with this instruction. It is rather new and after looking at all of my machines and several of my colleagues it was clear none of us had the instruction present. Regardless of all this, the solution would still not solve the issue because there was no way to query a specific CPU only to know which CPU it came from. Which means I could not tell how out of sync the processors were and would just have to throw away the data anyways, which is evil by anyones standards. What the problem reduced to was I needed a way to tell the program to run on a specific CPU, which can be done with SetThreadAffinityMask() and SetProcessAffinityMask(), but then the application cannot be properly profiled because the application cannot run naturally on the target machine. Therefore rdtscp was also going to causes issues in the multi-processor case.

The problem was becoming obnoxious. All was not lost however, I knew that Windows must have some solution and resolved to find it. It came in the form of QueryPerformanceCounter(). This function abstracts away CPUs and provides the needed synchronization of CPUs or in many cases uses another counter. What this other counter is, I will not go into here. This function does have its limitations since it is a system call instead of a simple instruction, but does ensure that on multi-processor applications that values returned are increasing. It also returns the count in the LARGE_INTEGER datatype which happens to be a QWORD.

I found that this function does not have the resolution of rdtsc but it is close and we can assume it will only get better. In many instances it worked just fine, even though it was a system call. At this point I decided to settle on the supplied WINAPI call and have been happy. I have however written the following code to use the rdtscp instruction in instances when it exists because I still have not had a chance to use it. That code is below.

__declspec(naked) BOOL WINAPI
MyProfiler::GetCurrentTick(PLARGE_INTEGER time, PDWORD procId)
    mov eax,FALSE
    cmp [esp + 0x4],NULL;; Check for NULL pointer - time
    jz DONE
    mov eax,0x80000001  ;; Argument for CPUID
    bt edx,0x1B         ;; Check for RDTSCP in bit 27 of EDX
    rdtscp  ;; ReaD Time Stamp Counter [EDX:EAX] and Processor id [ECX]
    mov ebx,[esp + 0x4] ;; - time
    mov [ebx],eax       ;; Set lower order bits - time
    mov [ebx + 0x4],edx ;; Set higher order bits - time
    mov eax,TRUE        ;; Return TRUE
    jmp PROCID
    mov eax,[esp + 0x4] ;; - time
    push eax            ;; Push time on stack for QueryPerformanceCounter()
    call dword ptr QueryPerformanceCounter ;; EAX set in function,TRUE
    mov ecx,0x0         ;; Set ECX to 0 for Processor id
    cmp [esp + 0x8],NULL;; Check for NULL pointer - procId
    jz DONE
    mov ebx,[esp + 0x8] ;; - procId
    mov [ebx],ecx       ;; Set - procId
    ret 8 ;; Clean up stack

1 comment: