Monday, August 1, 2011

Pound cake

Over the past year or so, I have discovered that wheat or rather gluten bothers me.  This is not some personal crusade that I decided to embark on nor is it a product of my locale, Seattle WA.  A few years ago my father was diagnosed with celiac disease and of course recommended that all of us kids get tested for it.  

It turned out that although I don't suffer from my fathers intense sensitivity to gluten, it does effect me.  I feel much better when I don't have products containing gluten and notice subtle changes when I do.  This has caused me to considerably reduce the amount of gluten products I consume and I basically limit myself to something small occasionally, once or twice a week.

The point here is that I missed breads.  From dense pumpernickel to a light croissant, I adore bread and all that it encompasses.  I have tried several 'gluten-free' bread and no dice.  Just not the same, don't let anyone tell you otherwise. They are close and I appreciate the effort made, but it is by definition not the same, period.

On top of my fondness for bread, I have tea every morning and with that there is nothing like a biscuit or some pastry to make a solid breakfast. In an effort to solve this biscuit\pastry dilemma, I tried something last night that has turned out wonderful.  I made a traditional pound cake.  Not just traditional, 1:1:1:1 ratio mind you, I made it all with a wooden spoon, no mixer for me.   The result of this was a wonderfully dense, sweet, and gluten free pound cake that accompanied my morning tea.  I once again state that it is still not the same without the wheat flour, but it is close enough.

Pound cake (gluten free)
  1. 1 lbs unsalted butter
  2. 1 lbs eggs (8)
  3. 1 lbs sugar
  4. 14 oz (by weight) brown rice flour
  5. 2 oz (by weight) tapioca flour
  1. Leave butter out to get soft for about 30 minutes
  2. Mix butter and sugar in large bowl till the consistency of a smooth milkshake
  3. Mix in eggs (1 at a time)
  4. Preheat oven to 350F
  5. Combine brown rice and tapioca flour
  6. Mix in flour mixture (1/3 at a time)
  7. Butter and flour 2 loaf pans
  8. Pour 2 lbs of batter into 1 of the pans
  9. Pour the remaining batter into the other pan
  10. Bake for 1 hr. Internal temperature should reach no more than 210F
  11. Remove from oven and let cool for 10 minutes
  12. Remove from pans and let cool for another 10 minutes
The resulting pound cake is sinful in the amount of fat and sugar, but it is a wonderful treat and allows me to have a decent warm breakfast. 

Saturday, January 22, 2011

MS Assembler does the right thing!

While working on OS373, something of note that I discovered during the interrupt work is that the MS Assembler is just 'following orders'.  Take for instance the following naked function written for CL:

void __declspec(naked) ISR()
       __asm pushad

       // Handle interrupt

       __asm popad
       __asm iretd

This is the basically the simplest Interrupt Service Routine (ISR) that can be constructed.  Now we know that the CL and LINK output 32 bit images by default.   What this means is that all opcodes generated by the compiler will be 32-bit.  However that may not always be the case.  Take the GCC inline assembler for instance.

void ISR()
       // Ignore Prolog

       __asm__ ("pushad");

       // Handle interrupt

       __asm__ ("popad");
       __asm__ ("iret");

       // Ignore Epilog

Both code examples are the same.  In fact both will output the same opcodes, minus the prologue and epilogue in the GCC case (GCC does not support naked functions for the x86 architecture).  If you are looking closely you will notice that the GCC has an "iret" instruction instead of "iretd".  So how can they generated the same opcodes?  Well many assemblers consider them to be the same thing, even though "iret" is the 16-bit instruction and "iretd" the 32-bit, because the compilers infers which one to use depending on the bit-ness of the resulting binary.

If you look at the Intel Architecture guide vol 2a you read that even the underlying opcodes are the same, 0xCF.  However if in the CL example I replace the "iretd" with "iret" something very fishy happens.  MS takes the perspective that if you are writing in assembly you know what you want and it respects that, GCC makes some assumptions.  The resulting opcode for CL using the "iret" instruction will actually be, 0x66 0xCF, this is very different from 0xCF.  The 0x66 prefix overrides the default operand size (Intel Architecture guide vol 2a: chapter 2.1.1); in this case the input to the opcode is not a 32-bit argument but 16-bit.  The same can be done for 16-bit opcodes.  If you write in 16-bit assembler, you can prefix an opcode with 0x66 and the CPU will supply a 32-bit argument to the 16-bit opcode.  I had originally written the "iret" instruction because I ‘assumed’ that the MS Assembler was doing what the software guide suggested.  This cause all my interrupts to blow up.  The only way to track down this issue was to actually step through the kernel and examine the assembly generated for the ISR.

Sunday, January 9, 2011

GDT Expand Down entries

Currently I am working on the development of an x86 operating system, OS373. It is something that I have thought about for many years and now feel that I have developed the skills and basic knowledge to begin the endeavorer. The experience has been enlightening and many criticism I have developed over the years for various operating systems have now relaxed due to my understanding of why certain decisions had been made. I may not always agree with those decisions, but now I can at least appreciate the difficulty of making those trade offs.

While reading various sources on OS development an aspect of a protected mode operating systems was not really explained very well. Many sites mention it and a few attempt to explain the mechanism but all seem to fall short of actually conveying how it works. See here for an example of an explanation that is correct, but over complicates the logic.

The mechanism being alluded to is the Expand Down option for data segment entries in the Global Descriptor Table (GDT). At this point unless you have done some reading on the GDT, data segments, and\or looked at an OS implementation of the GDT some background is probably needed.

The mechanism that puts the 'protected' in protected mode is the GDT. This table contains entries that break up the memory accessible by an OS into specific segments with certain permissions. Each entry contains an address, BASE, a length, LIMIT, and various flags that define a segment of memory and how as well as who can access that segment.

In its simplest form a segment register is populated with an index into the GDT. All general purpose registers are associated with a segment register. When an address, OFFSET, in a general purpose register is referenced the CPU converts the OFFSET to a linear address (LA) by using the BASE and LIMIT in the associated GDT entry contained in the segment register. The equation is quite simple:


If the OFFSET is greater than the LIMIT defined, a General Protection fault is triggered by the CPU. The permissions in the entry are also verified, but that is outside the scope of this post. Using real numbers will help to clarify. The following is defined:

BASE = 0x1000
LIMIT = 0x100
OFFSET = 0x10

The OFFSET is less than the LIMIT so no issue there.

0x1000 : BASE
0x10 : OFFSET
0x1010 : LA

The actual memory location to be accessed by the OFFSET is 0x1010. This example is how the Expand Up data segments work and the expected LA is obvious now. However in the 'various flags' mentioned above, there is a way to set the data segment to Expand Down.

Setting the Expand Down flag makes the LA less than the BASE, hence Expand Down; however, this setting does not change the above equation. The OFFSET will still be added to the BASE and the OFFSET will still be compared against the LIMIT. There is a subtle trick that the CPU is using to allow addition yet yield a decreasing LA, this trick is arithmetic overflow.

Consider the following. Lets us assume a series of registers exist that are only 4 bits in width. This means the registers can contain a maximum value of 15 and a minimum of 0. For example:
B3 B2 B1 B0
0 0 0 0 : 0
1 0 0 1 : 9
1 1 1 1 : 15

Let us consider addition using these registers.

B3 B2 B1 B0
1 : Carry
1 0 0 1 : 9
0 1 0 1 : 5
1 1 1 0 : 14

Using binary arithmetic, the sum of two values is as simple as 1 + 1. Looking at this example, a solution presents itself if we take the second operand to its furthest extreme.

B3 B2 B1 B0
1 1 1 1 : Carry
1 0 0 1 : 9
1 1 1 1 : 15
1 0 0 1 : 9

Notice B3 has an bit that needs to be carried over. In the overflow case, the CPU simply carries that overflow back to B0, this addition of the maximum value yields the same value. Thus decreasing the second operand by 1 in turn decreases the first also by 1.

B3 B2 B1 B0
1 1 1 1 : Carry
1 0 0 1 : 9
1 1 1 0 : 14
1 0 0 0 : 8

Using arithmetic overflow we have now decreased the value of the first operand using addition, the value is expanding down. Turning back to the Expand Down data segment the mechanism works the same way.

Lets use the same BASE as the previous example but compute the LA as if the data segment were of the Expand Down variety. The following is defined:

BASE = 0x1000
LIMIT = 0xfeff
OFFSET = 0xffef

Since we are decreasing the BASE we need to start at the other extreme. Where the LIMIT is 0 + the length for upward expansion, the LIMIT is now max register value - length. The OFFSET is greater than the LIMIT, expanding down, so no issue there.

0x1000 : BASE
0xffef : OFFSET
0x0fef : LA

Now like the initial example the LA is 0x10 from the BASE, but the LA is expanding down instead of up.

Friday, October 8, 2010

Where's my bits?

Recently on StackOverflow a question was asked and a colleague of mine answered it with a mostly correct solution. You can read about the question and multiple solutions on the link above, but what got me thinking was the real solution, which was to use prime numbers instead of normalized values from 0 or 1 to n. The reason here is because in my colleague's solution addition and multiplication operations were used and unless you are working with primes you are going to have some collision with factors.

Primes however were not going to work. Not because of any mathematically flaw but because of bits. There were not enough bits to represent a unique hash using primes for the range the OP (original poster) was asking about (At this point you should read the proposed solution to get a gist of what is going on). Why is this, you may ask. Well for some reason Microsoft treats the long double datatype as a synonym for a 64-bit double. This did not help because my colleague needed at least 67 bits, which if Microsoft did like most compilers, long double would have been 80 bits or more. Dismayed at this, pseudo-code was put up instead of an actual working solution.

Ironically I had been complaining about this very problem about 3 weeks prior and the same colleague asked 'Why would you ever need a datatype that wide?' Before anyone jumps all over him, he is mostly right. You have the 128-bit SSE registers and the up and coming 256-bit AVX register so you can do all the work you want with intrinsics, hopefully, and of course how often do you need a value that big? All that aside, I was determined to figure out some way to perform a multiplication with 2 64-bit operands with a result that would need more than 64-bits. I took to C++ and wrote a function template for that very purpose. The below function template takes two operands and returns the result in upper and lower referenced arguments if they will not fit in one of the result arguments. This is not optimized so that the logic can be followed but because C++ templates are so amazing, you would not believe how reduced this code becomes after compilation.

template<typename T, typename R>
void LargeIntegerMultiply(T operand1, T operand2, R& upperResult, R& lowerResult)
unsigned int bitShift = sizeof(T) * 4;
T mask = ~(((T) -1) << bitShift);

T x1 = operand1 & mask;
T y1 = operand2 & mask;
T x2 = (operand1 >> bitShift) & mask;
T y2 = (operand2 >> bitShift) & mask;

T x1y1 = x1 * y1;
T x1y2 = x1 * y2;
T x2y1 = x2 * y1;
T x2y2 = x2 * y2;

lowerResult = x1y1 & mask;

R temp = (x1y1 >> bitShift) + x1y2 + x2y1;

T upperTemp = temp >> bitShift;
T lowerTemp = temp & mask;

lowerResult += (lowerTemp << bitShift);

upperResult = x2y2 + upperTemp;

Saturday, August 7, 2010

Don't be scared...

Reading is an empowering thing. It allows the individual to speak almost directly with a person from another time and place. Words on paper or rather words that are read, offer a way to set down an idea and allows others to take from it what they will at their leisure. It can however be a scary thing.

It was not too long ago when people were banned from reading the Bible and in some places in the world some are still banned from reading the Qur'an. The reasons for this are that if you read someone's ideas then you have the opportunity to interpret them for yourself and make a decision regarding your agreement or not. In the past, religious leaders did not want their congregations to be making those decisions. As an aside, this is also why much of the Catholic mass was spoken in Latin until the Vatican II. For religion however this has largely changed and we are free to read as we like and interpret without fear of being strung up or burned at the stake for our viewpoints. This is mostly true.

Thankfully, Government bills and laws are different in that they are in the public domain for viewing, yet similar to the Catholic mass being done in Latin. Government documents are heavy on legal procedure and rhetorical flourish, might as well be Latin. This has forced many people to seek commentators to give a brief rundown of what is in bills. We turn to news channels or newspapers for the gist of the bill and base our opinions on that information. This is a good way of doing it actually. It allows us to be in the law making process without having to speak in a legal dialect, gives commentators something to do, and lets the Government be as precise as possible in the laws they write.

The downside to this is that you get a different viewpoint and interpretation based on who explains the bill to you. See the beginning of the entry for why. Many times it is helpful just to be able to see what was written and reference that against what is being explained. In my high school we had a decent civic's class, an endangered species. We discussed the role of Government, how it works, and what we as citizens are expected to do. The internet was just coming up at this time and I was lucky to have an instructor that utilized the fledgling technology. I was directed to what has become my goto URL for all things related to United States law, Thomas. It is named after Thomas Jefferson for his pivotal role in the legal foundation of our country. This site allows the user to search sessions of Congress as well as look at actual bills being currently debated and look at the precise wording, layout, implications, and all that resides in bill XYZ.

It may take some time to understand bill layout and parse verbiage, don't be scared, but when you hear somewhere that bill XYZ says that "The Government will kill your dog." Now you can go check, if they really will.

Sunday, June 27, 2010

History: Cum grano salis

Recently I found this article on the BBC and was quite intrigued because biased historical perspective is still apparently an interesting topic. This seems to be a topic that gets a great deal of publicity whenever a political group begins to gain momentum.

For example in reading the article the historical archetype for who does this is the Soviet Union, more specifically Joseph Stalin. It appears as a general rallying cry by whatever party is against historical perspective to compare that other party to the Soviet Union or Stalin specifically. This type of comparison is the precise issue I have a problem with. It almost seems that political parties refuse to accept that history is inherently based on perspective.

At this point I am sure many people will be up in arms and preparing scathing rebuttals but it is true. Let us examine any conflict throughout history. For the sake of ruffling as few feathers as possible consider Genghis Khan, or for those on the Asian continent, Chinggis Khaan. This person is highly revered in Mongolia and in many indirect ways by those in Europe. He started what would become the great trade route from China to Europe during his conquest. Who can argue this is a bad thing, besides the most staunch anti-globalisationists? Obviously I am looking at one aspect of him... aha! There is the precise issue. Perspective. There are those in India or modern day Persia (Iran) who argue that because of Genghis Khan we are hundreds of years behind in technology because of the destruction of knowledge brought on by his campaign. Many others will bring up the senseless torture and debauchery inflicted upon their respective peoples. One country's hero is another's dictator.

This argument is well known and nothing new. I guess my point is that complaining about what goes into history books is a constant battle and one that will never truly be done to everyone's liking. The fact that Texas and possibly other states are engaging in this enterprise is not too surprising, considering the recent upheaval of political perspectives in the United States. I submit to all that fighting for unbiased historical accuracy is a battle that no one will ever win, outside of their own political group anyways and we should take anyone's account of history, cum grano salis.

Sunday, June 6, 2010

RDTSC, RDTSCP and QueryPerformanceCounter

During a recent project to implement the ICorProfilerCallback2 interface, supplied by Microsoft to profile .NET applications, I hit a small snag. I wanted a fine grained time stamp. Fine grained enough to see all distinct function call times to say something simple like the Fibonacci algorithm.

The Fibonacci algorithm is quite small and done recursively using C# with the Environment.TickCount property the times overlap and according to the time generated it would appear that many operations occur at the same time. Of course this is nonsense. It is also obvious that collecting this time through the Environment class requires going through many layers of the .NET runtime to actually get to the underlying mechanism that collects that time. This is where the previously mentioned interface comes in.

Implementing the ICorProfilerCallback2 interface, the most recent profiler interface is ICorProfilerCallback3, allows the user to stay in native and gather a lot of data about the runtime needed by a profiler. Being in native then ensures that I will not have to go through the layers of the runtime and can access more precise data. This would of course mean making a call to the WINAPI GetTickCount(). Unfortunately this also did not offer the fine grained resolution desired.

Enter assembly. Regardless of what my final resolution was, I highly recommend anyone who has any interest in assembly to goto the Intel site and either download and\or request the Intel Architecture manual. They are quite helpful and although did not get me to the best solution for this problem have answer many questions since. Reading through the vast instruction lists, I stumbled upon rdtsc and rdtscp.

The first instruction, rdtsc, queries the on chip counter for an ever increasing QWORD. The previously mentioned functions all returned a DWORD. Using rdtsc however requires doing some inline assembly or with Visual C++ the intrinsic call __rdtsc can be used. The assembly code though is quite simple and can be done with some help from your favorite search engine or with the previously mentioned manuals. There are some caveats with using this instruction and they are based heavily on power management settings and with CPU frequency throttling. I am not going to discuss them here, but be wary of this.

After changing over my code to use the rdtsc instruction, resolution was granted and the world good, until I decided to try out the profiler on a multi-processor machine. Then the world was not so good and time did not seem to always be going forward. Timestamps would show the system leaving a function before it entered it. I assumed this was my own error but after some lengthy research realized that the counters being used were not synchronized across all CPUs. This was a terrible discovery and one that has a long history, that for another time. My solution was to use the rdtscp instruction that also returned the processor ID. This information is contained in those manuals mentioned above.

The rdtscp instruction proved to be an issue on another front however, not all CPUs come with this instruction. It is rather new and after looking at all of my machines and several of my colleagues it was clear none of us had the instruction present. Regardless of all this, the solution would still not solve the issue because there was no way to query a specific CPU only to know which CPU it came from. Which means I could not tell how out of sync the processors were and would just have to throw away the data anyways, which is evil by anyones standards. What the problem reduced to was I needed a way to tell the program to run on a specific CPU, which can be done with SetThreadAffinityMask() and SetProcessAffinityMask(), but then the application cannot be properly profiled because the application cannot run naturally on the target machine. Therefore rdtscp was also going to causes issues in the multi-processor case.

The problem was becoming obnoxious. All was not lost however, I knew that Windows must have some solution and resolved to find it. It came in the form of QueryPerformanceCounter(). This function abstracts away CPUs and provides the needed synchronization of CPUs or in many cases uses another counter. What this other counter is, I will not go into here. This function does have its limitations since it is a system call instead of a simple instruction, but does ensure that on multi-processor applications that values returned are increasing. It also returns the count in the LARGE_INTEGER datatype which happens to be a QWORD.

I found that this function does not have the resolution of rdtsc but it is close and we can assume it will only get better. In many instances it worked just fine, even though it was a system call. At this point I decided to settle on the supplied WINAPI call and have been happy. I have however written the following code to use the rdtscp instruction in instances when it exists because I still have not had a chance to use it. That code is below.

__declspec(naked) BOOL WINAPI
MyProfiler::GetCurrentTick(PLARGE_INTEGER time, PDWORD procId)
    mov eax,FALSE
    cmp [esp + 0x4],NULL;; Check for NULL pointer - time
    jz DONE
    mov eax,0x80000001  ;; Argument for CPUID
    bt edx,0x1B         ;; Check for RDTSCP in bit 27 of EDX
    rdtscp  ;; ReaD Time Stamp Counter [EDX:EAX] and Processor id [ECX]
    mov ebx,[esp + 0x4] ;; - time
    mov [ebx],eax       ;; Set lower order bits - time
    mov [ebx + 0x4],edx ;; Set higher order bits - time
    mov eax,TRUE        ;; Return TRUE
    jmp PROCID
    mov eax,[esp + 0x4] ;; - time
    push eax            ;; Push time on stack for QueryPerformanceCounter()
    call dword ptr QueryPerformanceCounter ;; EAX set in function,TRUE
    mov ecx,0x0         ;; Set ECX to 0 for Processor id
    cmp [esp + 0x8],NULL;; Check for NULL pointer - procId
    jz DONE
    mov ebx,[esp + 0x8] ;; - procId
    mov [ebx],ecx       ;; Set - procId
    ret 8 ;; Clean up stack