Monday, August 24, 2009

Trace System Calls

Sometimes, it's helpful to trace system api calls during debugging so that we can determine if the incorrect behavior is caused by passing wrong argument to a function or not. Or we can try to identify the performance bottleneck with it. There are several tools to use with this purpose.
For demonstration purpose, we'll use code below to try these tools
 1 #include    "Windows.h" / "unistd.h"
 2 #include    "fstream"
 3 using namespace std;
 4 
 5 void foo()
 6 {
 7     Sleep(2000); // windows
 8     sleep(2); // linux
 9 }
10 
11 void bar()
12 {
13     ofstream of("foo.dat");
14     char buf[] = {12345678910111213141516};
15     of << buf;
16     of.close();
17 
18     foo();
19 }
20 
21 int main()
22 {
23     bar();
24     foo();
25     return 0;
26 }
1. Logger This tool can be used as a standalone application or used as a debugger extension in windbg. It's capable of keeps records of all system api calls and corresponding parameters, return value, time spent, calling module and thread. To run as a application, simply call "logger.exe application to be traced". Then you can specify some option and filter in the window popped up. But standalone logger application isn't suitable for tracing windows service. In this case, we can attach windbg to a target process and load logexts extension to work against a service.
The result of logger is saved in a binary file placed in LogExts folder, and the file need to be opened in LogViewer. The file only records all APIs being called based on their orders, so we can't identify calling relationships between them. The figure belows shows the result of tracing ProcessAndThreads and ioFunctions module: From the row #42 which is expanded, we can see the parameter passed to Sleep function is 0x000007d0 which is 2000 in decimal.

2. wt command in windbg windbg has another powerful command wt. Compared with logger extension, it has more controls over which apis shall be traced. Actually, it can also trace user's function call. And the wonderful thing with it is it can show the calling relationship with a tree. It should be a preferable way. To use it, we set a break point in the place of interest. Then issue wt command. The debugee shall continue executing until this function returns. We perform wt on the sample code, the output is:
0:000> wt -l 4
Tracing 1!main to return address 004096a1
3     0 [  0] 1!main
1     0 [  1]   1!ILT+540(?barYAXXZ)
12     0 [  1]   1!bar
...
37  1069 [  1]   1!bar
1     0 [  2]     1!ILT+735(?fooYAXXZ)
4     0 [  2]     1!foo
6     0 [  3]       kernel32!Sleep
37     0 [  4]         kernel32!SleepEx
8    37 [  3]       kernel32!Sleep
6    45 [  2]     1!foo
39  1121 [  1]   1!bar
...
42  1209 [  1]   1!bar
3     0 [  2]     1!__security_check_cookie
45  1212 [  1]   1!bar
4  1258 [  0] 1!main
1     0 [  1]   1!ILT+735(?fooYAXXZ)
4     0 [  1]   1!foo
6     0 [  2]     kernel32!Sleep
3     0 [  3]       kernel32!SleepEx
19     0 [  4]         kernel32!_SEH_prolog
15    19 [  3]       kernel32!SleepEx
20     0 [  4]         ntdll!RtlActivateActivationContextUnsafeFast
20    39 [  3]       kernel32!SleepEx
19     0 [  4]         kernel32!BaseFormatTimeOut
26    58 [  3]       kernel32!SleepEx
1     0 [  4]         ntdll!ZwDelayExecution
3     0 [  4]         ntdll!NtDelayExecution
31    62 [  3]       kernel32!SleepEx
4     0 [  4]         kernel32!SleepEx
36    66 [  3]       kernel32!SleepEx
9     0 [  4]         kernel32!_SEH_epilog
37    75 [  3]       kernel32!SleepEx
8   112 [  2]     kernel32!Sleep
6   120 [  1]   1!foo
7  1385 [  0] 1!main

1392 instructions were executed in 1391 events (0 from other threads)

Function Name                               Invocations MinInst MaxInst AvgInst
1!ILT+1010(??0?$basic_ofstreamDU?$char_traitsDs       1       1       1       1
1!ILT+1040(?sputn?$basic_streambufDU?$char_trai       1       1       1       1
1!ILT+1060(?close?$basic_ofstreamDU?$char_trait       1       1       1       1
1!ILT+1090(??0sentry?$basic_ostreamDU?$char_tra       1       1       1       1
1!ILT+1125(?flagsios_basestdQBEHXZ)                   1       1       1       1
1!ILT+1155(?getloc?$basic_streambufDU?$char_tra       1       1       1       1
1!ILT+1185(??0_Sentry_base?$basic_ostreamDU?$ch       1       1       1       1
1!ILT+1210(?_Osfx?$basic_ostreamDU?$char_traits       1       1       1       1
1!ILT+130(??1localestdQAEXZ)                          1       1       1       1
...
1!ILT+950(??0?$basic_streambufDU?$char_traitsDs       1       1       1       1
1!ILT+965(??1?$basic_ofstreamDU?$char_traitsDst       1       1       1       1
1!__security_check_cookie                             1       3       3       3
1!__uncaught_exception                                1       6       6       6
1!bar                                                 1      45      45      45
1!fclose                                              1      27      27      27
1!foo                                                 2       6       6       6
1!main                                                1       7       7       7
1!std::_Fiopen                                        1      29      29      29
...
1!strlen                                              1      52      52      52
kernel32!BaseFormatTimeOut                            1      19      19      19
kernel32!Sleep                                        2       8       8       8
kernel32!SleepEx                                      3       4      37      26
kernel32!_SEH_epilog                                  1       9       9       9
kernel32!_SEH_prolog                                  1      19      19      19
ntdll!NtDelayExecution                                1       3       3       3
ntdll!RtlActivateActivationContextUnsafeFast          1      20      20      20
ntdll!ZwDelayExecution                                1       1       1       1

0 system calls were executed

eax=00000000 ebx=7ffd6000 ecx=7c80240f edx=7c90e514 esi=00dcf766 edi=00dcf6f2
eip=004096a1 esp=0012ff80 ebp=0012ffc0 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246
1!__tmainCRTStartup+0xfb:
0

3. strace on linux strace is similar to logger on windows in that it only trace api calls in a flat structure.

Wednesday, August 19, 2009

Design for debugging

After we release a product to market, it's not uncommon that our customer report bugs to us, even we gained a lot of confidence from "thorough" tests by ourselves and QA guys. If it occurs, we usually don't have the luxury to perform debugging in production environment with a debugger. So, it's important we design in advance to help us debugging in this situation.
I read from Darin Dillon's Debugging Strategies for .NET Developers that they always have several testing boxes to run released products. And it's strictly prohibited to install any development tools on them. It seems to be an extreme stern rule and hurt debugging efficiency. But I think it's a beneficial rule for producing a great product. This way, we developers are forced to think about how to make debugging easier in this situation which may occur frequently on customers' sites.
Take microsoft's windows for example, there are plenty of ways to assist debugging in released version. Such as dr.watson for collecting dump file, error reporting mechanism, system event log, etc. Engineers at MS are doing their best to collect as much information as possible when unexpected behavior occurs. You can hardly imagine is one of your application crashes and the tech support guy from MS asked you to assist him attaching a remote debugger to your machine. Neither will we be able to.

Means:
1. Logging
It's no doubt logging will be the first one jumping into our head. Logging is a good way on customer's site since it doesn't require the customer to install any additional software to enable it. While using this feature, we must always be cautious not to simply use printf or Debug.WriteLine otherwise we'll easily be overwhelmed by a mess of log flood.
A not-bad logging system should at least be able to or have: toggle on/off, consistent style and proper layout. It's better to have the ability to set logging level to avoid log flood when possible. Preferably, we can choose a mature logging library such as log4net.
An important requirement is that the logging system could be configured without recompilation. The best way is it can even be toggled at runtime without restarting the application. DbgView is a great tool can achieve so. (Ken.Zhang explained its work mechanism in this article.)

2. Dump file
Log file has the advantage of keeping record of what's happening as time goes by, but you can only see part of information that you explicitly output. Dump file is a copy of a process on customer's site that you can examine on your developing machine to view other information missing in log. But it's a static image of the process at a fixed time. Usually, it also doesn't require install a specific software to generate dump file. Core dump on linux and Dr. Watson on windows are supported by a naked OS installation.
If we need to perform some customizations on these tools, we'd better provide a friendly GUI tool to the customer so that a average user can use.

3. Source control tag
After we released a product, it's important we keep a record of the source control system version tag correlates to the release. It's mandatory that we examine the correct code to hunt for a bug. It's not possible that we commit new code to repository after the product has been released and the most recent code doesn't match the product release. How disappointed will we be if we seem to get some clue for a bug but are not able to retrieve the correct code. So, it's important for us to find out the source control system version tag for any release.
BTW, on windows, it's also important to keep a copy of related symbol file for each release. Refer to: PDB Files: What Every Developer Must Know by John Robbins

4. Information collector
The product should have at least passed testing in the development department. So we usually need to focus on the difference between the customer's environment and our own. Those difference are very likely to be the culprit. msinfo32.exe is a good candidate for this goal.

Friday, August 14, 2009

Use trace point in windbg

In visual studio, we can add trace point while debugging. It's a helpful feature to analyze the behavior of a program without breaking the execution of the target process. Isn't it useful if you debug a multi-thread process?
At first glance, it seems to be just an alternative for System.Diagnostics.Debug.WriteLine. But we don't have to write these code anymore. Say if we want to examine the behavior at some point that we didn't use WriteLine, use trace point, we don't need to stop the process, modify code, compile and run again. Just attach the visual studio to the process, set a break point and change it to trace point. It's done.
What about on-site debugging? It's common that we use windbg in this scenario. We can also take advantage of similar feature in windbg. When set a break point in windbg, you can also specify a command string that will be executed automatically while the break point is hit. The command string can be composed of several statements that will be executed in sequence.

Here is a sample code, suppose we want to trace when and where foo is invoked:
#include "Windows.h"

void foo()
{
Sleep(2000);
}

int main()
{
while(true)
foo();
return 0;
}

Then we compile and execute the code. Now attach the windbg to the target and add a break point with:
bp foo ".echo enter foo;k;g;"

Note that we need at least public symbol file to debug this.
Every time foo is invoked, following output will be printed in the debugger. And the process continues.
enter foo
ChildEBP RetAddr
0012ff54 004010d3 1!foo [d:\documents and settings\raymond\desktop\dbg\1.cpp @ 35]
0012ff78 0040138c 1!main+0x53 [d:\documents and settings\raymond\desktop\dbg\1.cpp @ 45]
0012ffc0 7c816fe7 1!__tmainCRTStartup+0xfb [f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c @ 266]
0012fff0 00000000 kernel32!BaseProcessStart+0x23

Let's explain the preceding command string. It composes of three statements. First, echo a string saying we enter the foo function. Then, print the call stack with k command. Finally, use g command to continue the process.

Monday, August 3, 2009

View memory usage

I came across some questions about how to evaluate the memory usage of a process. It seems to be an easy task since most operating systems have provided a set of powerful tools to investigate memory usage. But what confuses the users is the power and complexity of these tools. And different tools may use different term to refer the same concept which shall aggravate the confusion. In this post, I'd try to make things more clear.

Basic concepts:
From a process's point of view, the physical memory isn't directly visible. What is visible to it is the virtual memory space. In other words, each process is given an illusion that there is a memory space (usually 4GB on 32bit system) for it to use exclusively. The OS's memory manager will be responsible for translating the virtual address to physical address. The active process's virtual space doesn't have to be resident in physical memory as a whole. It's managed in pages with a platform dependent granularity, which is 4kb on 32bit windows. Only those pages are currently being used need to be in physical memory. Other pages can be placed in paging file.
When the process attempts to use a virtual address that isn't in physical memory yet, a page fault is triggered. Page fault handler will swap the missing page from paging file to physical memory. Then the faulting instruction which references the missing page will be executed again to visit the page that is now available. The process even doesn't know this happens, it feels that it can use that virtual address directly.
It's common that some standard components are shared by different processes, e.g., the .net framework code can be shared by different .net application. It's a waste of memory if there are several copies in physical memory. The OS allows virtual pages of different process being mapped to the same physical pages to save memory usage.
Let's look at the image below.

There are two processes and their virtual memory layout, organized in pages. Virtual pages in blue color have corresponding page mapped in physical memory, e.g., VP1, VP5, etc. Pink page(VP7) isn't mapped in memory. Yellow page(Page2) in physical memory is a shared page that has several pages(VP1 and VP5) mapped to it.

Understand data in memory monitor:
So, we know that the memory of a process may exist in physical memory or paging file on file system. The utilities we used to monitoring memory usage can show us these data separately.
Microsoft uses different performance counter to monitor them.
Working set is the part of pages that have been swapped in physical memory, so it's usually the one for users who want to know how many memory does an application uses. But it's not very accurate, because the shared pages are counted for every process uses it. It corresponds to memory usage column in Task Manager.
Private bytes is the size of committed memory excluding shared pages. This value indicates the footprint of a process. It corresponds to VM size in Task Manager. Confusing naming convention, right?
Virtual size is the size of committed memory plus the size of reserved memory. It doesn't hurt performance if there is a huge reserved memory because allocating reversed memory doesn't have a significant cost. The memory manager can simply mark a region in virtual space as being reserved without doing actual allocation. This information isn't available in Task Manager.
It's better to use Process Explorer in sysinternal suit to monitor memory because it supplies more information and the naming convention is consistent with Performance Monitor.

On linux platform, command "ps aux" or "ps" can be used to view memory usage. VSZ (virtual set size) and RSS (resident set size) corresponds to virtual size and working set on windows respectively. But it also has the problem of not being accurate for reporting shared pages several times.

Code Sample:
Here is a sample application (windows specific) demonstrates the impacts of different type of allocations of memory.



1 #include
2 #include
3 using namespace std;
4
5 const int ALLOCATION_SIZE = 1024*1024;
6
7 int main()
8 {
9 int cmd;
10 void* wildptr;
11 char* buf = new char[ALLOCATION_SIZE];
12 for(int i = 0; i < ALLOCATION_SIZE; ++i)
13 buf[i] = static_cast<char>(i % 128);
14 while(true)
15 {
16 cout << "allocation type:\n\
17 1: comitted memory\n\
18 2: comitted memory and use\n\
19 3: reversed memory\n\
20 4: new obj on default heap\n\
21 5: new obj on default heap and use\n";
22 cmd = 0;
23 cin >> cmd;
24 switch(cmd)
25 {
26 case 1:
27 wildptr = VirtualAlloc(NULL, ALLOCATION_SIZE, MEM_COMMIT,
PAGE_EXECUTE_READWRITE);
29 break;
30 case 2:
31 wildptr = VirtualAlloc(NULL, ALLOCATION_SIZE, MEM_COMMIT,
PAGE_EXECUTE_READWRITE);
33 memcpy(wildptr, buf, ALLOCATION_SIZE);
34 break;
35 case 3:
36 wildptr = VirtualAlloc(NULL, ALLOCATION_SIZE, MEM_RESERVE,
PAGE_EXECUTE_READWRITE);
39 break;
40 case 4:
41 wildptr = new char[ALLOCATION_SIZE];
43 break;
44 case 5:
45 wildptr = new char[ALLOCATION_SIZE];
47 memcpy(wildptr, buf, ALLOCATION_SIZE);
48 break;
49 default:
50 exit(0);
51 }
52 }
53 return 0;
54 }


The image below shows the result for each allocation type in performance monitor.
For type 1, both private bytes and virtual bytes increases. Working set doesn't change because the memory isn't used.
For type 2, private bytes, virtual bytes and working set all increase because the memory is allocated and used.
For type 3, only virtual bytes increases because the memory isn't committed. Subsequently, memory manager doesn't allocate memory in paging file. This block of memory can't be used until it's committed. Otherwise, access violation is reported.
For type 4 and 5, the result is the same as type 1 and 2 since the heap manger is built on top of virtual memory allocation.

Here are some excerpts from microsoft's perfmon utility:
Page File Bytes:
Page File Bytes is the current amount of virtual memory, in bytes, that this process has reserved for use in the paging file(s). Paging files are used to store pages of memory used by the process that are not contained in other files. Paging files are shared by all processes, and the lack of space in paging files can prevent other processes from allocating memory. If there is no paging file, this counter reflects the current amount of virtual memory that the process has reserved for use in physical memory.

Private Bytes:
Private Bytes is the current size, in bytes, of memory that this process has allocated that cannot be shared with other processes.

Virtual Bytes:
Virtual Bytes is the current size, in bytes, of the virtual address space the process is using. Use of virtual address space does not necessarily imply corresponding use of either disk or main memory pages. Virtual space is finite, and the process can limit its ability to load libraries.

Working Set:
Working Set is the current size, in bytes, of the Working Set of this process. The Working Set is the set of memory pages touched recently by the threads in the process. If free memory in the computer is above a threshold, pages are left in the Working Set of a process even if they are not in use. When free memory falls below a threshold, pages are trimmed from Working Sets. If they are needed they will then be soft-faulted back into the Working Set before leaving main memory.

Get memory usage in code:
Sometimes, it's useful to get memory information in code. There are system apis for this purpose.
GetProcessMemoryInfo and GlobalMemoryStatusEx funtions on windows and getrusage function on linux. It's trivial to provide sample here since preceding links have done.