-1-
Introduction to Microsoft’s IL
The code that we write in a programming language like C#, ASP+ or any other
.NET compatible language is finally converted to Intermediate Language (IL).
Because every language ends up as the same IL, code written in the COBOL
programming language can be extended in C# and subsequently used in ASP+.
Therefore, the best way to deepen our understanding of the .NET technologies
is to understand IL.
Once you are conversant with IL, you will have no difficulty in understanding
the .NET technologies, since all .NET languages finally compile to it. IL was
designed first and it is programming language neutral. It was then followed by
programming languages like C# and Visual Basic.NET, and by technologies like
ASP.NET.
We shall raise the curtains on IL with a very small program. We will also
commence with the assumption that you are familiar with at least one .NET
programming language.
a.il
.method void vijay()
{
}
We have written
a very small non-working IL program in the il subdirectory and named it as
a.il. How do we assemble it into an executable program? There is no need to
fret over this problem. Microsoft has provided a program called ilasm whose
sole task is to create an executable file from an IL file.
Before you run
this command make sure that your path variable is set to the bin sub directory
in the framework. If not, give the command as
set path=c:\progra~1\microsoft.net\frameworksdk\bin;%PATH%
Now we use the
command as follows:
c:\il>ilasm /nologo /quiet a.il
On doing so, the
following error is generated:
Source file is ANSI
Error: No entry point declared for executable
***** FAILURE *****
In future, we
shall not display the first and the last lines of the output generated by
ilasm. We shall also remove the blank lines between non-blank lines.
In IL, we are
permitted to commence a line with or without a dot '.'. Anything that begins
with a dot is a directive to the assembler, asking it to perform some function,
such as creating a function or class etc. Anything that does not start with a
'.' is an actual assembler instruction.
The
significance of .method is that a function or method called vijay is created
and this function returns void i.e. it does not return any value. The function
has been named vijay arbitrarily for want of any other superior nomenclature.
The assembler
was obviously not impressed with this program and thus brandished the message
'no entry point'. This error message is generated because the IL file can
contain numerous functions, and the assembler has no way of distinguishing as
to which of them is to be executed first.
In IL, the
first function to be executed is called the entrypoint function. In C#, the
function is Main. The syntax for a function is the name followed by the
familiar pair of round () brackets. The start point and the end point of the
function's code is signified by the curly braces {}.
a.il
.method void vijay()
{
.entrypoint
}
c:\il>ilasm /nologo /quiet a.il
Source file is ANSI
Creating PE file
Emitting members:
Global Methods: 1;
Writing PE file
Operation completed successfully
Now no error is generated. The directive entrypoint signifies that the program
execution has to begin from this function. We have to use this directive even
though this program has only one function. On giving the dir command at the
DOS prompt, we see three files created. a.exe is an executable file which can
now be executed to see the output of the program.
C:\il>a
Exception occurred:
System.BadImageFormatException: Exception from HRESULT: 0x8007000B. Failed to
load C:\IL\A.EXE.
Our luck seems
to run out when we try to execute the above program because the above run-time
error is generated. One probable reason for this could be the poor formation of
the function. Every function should have the instruction 'end of function'
incorporated in it. We obviously overlooked this fact in our haste.
a.il
.method void vijay()
{
.entrypoint
ret
}
The 'end of
function' instruction is called ret. All well formed functions have to end with
this instruction.
Output
Exception occurred:
System.BadImageFormatException: Exception from HRESULT: 0x8007000B. Failed to
load C:\IL\A.EXE.
On executing
the function, we get the same error again. Where could we have faltered this
time?
a.il
.assembly mukhi {}
.method void vijay()
{
.entrypoint
ret
}
The blunder was that we forgot to use the mandatory directive called assembly
followed by a name. We have incorporated it in the code above, and have used
the name mukhi followed by a pair of empty curly braces {}. The assembly
directive is used to give a name to the assembly that the program becomes part
of. An assembly is also called a deployment unit.
The code above
is the smallest program that can be assembled without any errors, though it
does not perform anything useful when executed. It does not have any function
called Main. It only has a function called vijay with the entrypoint directive.
The program now assembles and runs with no errors at all.
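For readers assembling with a released version of ilasm rather than the
pre-release build used in this chapter, a minimal sketch of the same smallest
program is given below. It assumes the later syntax, in which the code type is
written as cil managed and the reference to mscorlib is declared explicitly.
The assembly name hello and the function name main are our own arbitrary
choices and carry no special meaning.

```il
// Sketch only: assumes release-era ilasm syntax, not the beta syntax
// used in the listings of this chapter.
.assembly extern mscorlib {}      // reference to the core library
.assembly hello {}                // hypothetical assembly name

.method static void main() cil managed
{
    .entrypoint                   // execution starts in this function
    ret                           // every function must end with ret
}
```

As with the program above, this one should assemble and run without
displaying anything.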
The concept of
assembly is extremely crucial in the .NET world and should be thoroughly
understood. We will explore this directive in the latter half of the chapter.
a.il
.assembly mukhi {}
.method void vijay()
{
.entrypoint
ret
}
.method void vijay1()
{
.entrypoint
ret
}
Error
***** FAILURE *****
The cause for
the above failure message is that the above program has two functions, vijay
and vijay1, with each containing the .entrypoint directive. As mentioned
earlier, this directive specifies as to which function is to be executed first.
Thus, in
functionality, it is akin to the Main function in C#. When C# code gets
converted into IL code, the code contained in the function Main gets converted
into a function in IL and contains the directive .entrypoint. For example, if
the first function to be executed in a COBOL program is called abc, the code
generated in IL inserts the .entrypoint directive in this function.
In conventional
programming languages, the function to be executed first has to have a specific
name, e.g. Main, but in IL, only the .entrypoint directive is required.
Therefore, since a program can have only one starting point, only one function
in the IL code is allowed to contain the .entrypoint directive.
It is pertinent
to note that no error message number or explanation is generated, making it
difficult to debug this error.
a.il
.assembly mukhi {}
.method void vijay()
{
ret
.entrypoint
}
The .entrypoint
directive need not be positioned as the first or last directive in the
function. It has to merely be present in the body of the function, to herald
its status as the first function to be executed. Directives are not assembly
instructions and can even be placed after the ret instruction. To remind you,
ret signifies the end of the function code.
a.il
.assembly mukhi {}
.method void vijay()
{
.entrypoint
call void
System.Console::WriteLine()
ret
}
We may have a
function written in C#, ASP+ or COBOL, but the mechanism for executing this
function in IL is the same. It is as follows:
We have to use
the assembler instruction call. The call instruction is to be followed by the
following details in the given sequence:
• return type of the function (void).
• the namespace (System).
• the class (Console).
• the function name (WriteLine()).
The function
gets called but does not produce any output. So, we pass a parameter to the
WriteLine function.
a.il
.assembly mukhi {}
.method void vijay()
{
.entrypoint
call void
System.Console::WriteLine(class System.String)
ret
}
The above code still has a glaring omission. When a function is called in IL,
in addition to its return type, the data types of the parameters that are being
passed to the function also have to be specified. Here we have stated that the
WriteLine function expects a parameter of the class named System.String, but
since no string is actually passed to the function, it generates a runtime
error.
Thus, there is a significant difference between IL and other programming
languages when it comes to calling a function. In IL, when we call a function,
we have to specify everything we know about the function, including its return
type and the data types of its parameters. This full signature lets the runtime
pick out the exact function being called and conduct the appropriate checks on
the call when the program is run.
We shall now
see how to facilitate passing of parameters to a function.
a.il
.assembly mukhi {}
.method void vijay()
{
.entrypoint
ldstr "hell"
call void
System.Console::WriteLine(class System.String)
ret
}
Output
hell
The assembler
instruction ldstr places a string on the stack. The name ldstr is an
abbreviated version of the text "load a string on the stack". A stack
is an area of memory that facilitates passing of parameters to a function. All
functions receive their parameters from the stack. Thus, instructions like
ldstr are indispensable.
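The way parameters travel over the stack can be sketched as follows. The
listing is an illustration under assumptions: it uses the syntax of a released
ilasm (string and int32 as type names, the [mscorlib] prefix, cil managed),
which differs slightly from the pre-release listings in this chapter, and the
assembly name stackdemo is hypothetical. Each value is pushed in the order in
which the parameters are declared, and the signature written after call decides
which WriteLine function is meant.

```il
// Sketch only: release-era ilasm syntax is assumed.
.assembly extern mscorlib {}
.assembly stackdemo {}            // hypothetical assembly name

.method static void main() cil managed
{
    .entrypoint
    .maxstack 2                   // at most two values are on the stack at once

    // One string parameter: push it, then call.
    ldstr "hell"
    call void [mscorlib]System.Console::WriteLine(string)

    // One int32 parameter: a different WriteLine, selected by its signature.
    ldc.i4 42
    call void [mscorlib]System.Console::WriteLine(int32)

    // Two parameters: pushed left to right, format string first.
    ldstr "{0} again"
    ldstr "hell"
    call void [mscorlib]System.Console::WriteLine(string, object)

    ret
}
```

When assembled and run, this sketch should display hell, 42 and hell again on
three separate lines.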
a.il
.assembly mukhi {}
.method public hidebysig static
void vijay()il managed
{
.entrypoint
ldstr "hell"
call void
System.Console::WriteLine(class System.String)
ret
}
Output
hell
We have added
some attributes to the method vijay. We shall explain them one by one below.
public: This is called an accessibility attribute as it decides who can access
a method. Public means that this method is accessible to every other part of
the program.
hidebysig: A class can be derived from another class. The attribute hidebysig
states that the function hides any function in a parent class that has the same
name and signature, rather than every function that merely shares its name. In
this example, if a function vijay with the same signature is present in the
base class, the vijay in our class hides it. The attribute is meant for
compilers and tools; the runtime ignores it.
static: Methods can either be static or non-static. A static method belongs to
a class and not to an instance. Thus, it is called without creating an object
of the class. There are no restrictions on where a static method can be
created. The function with the entrypoint directive must be static. Static
functions must have a body or source code associated with them and they are
referenced using the type name and not the instance name.
il managed: Due
to its complex nature, we shall postpone the explanation of this attribute.
When the time is appropriate, its functionality will be clearly explained.
The
abovementioned attributes do not modify the output of the function. In a short
while, it will become apparent to you as to why we have provided the
explanation of these attributes.
Whenever we
write a program in the C# programming language, we first specify the keyword
class, followed by the name of the class and then, we enclose the source code
within a pair of curly braces {}. This is demonstrated in a.cs
a.cs
class zzz
{
}
Let us now
introduce the IL directive called class.
a.il
.assembly mukhi {}
.class zzz
{
.method public hidebysig static
void vijay()il managed
{
.entrypoint
ldstr "hell"
call void
System.Console::WriteLine(class System.String)
ret
}
}
Notice the
change in assembler output : Class 1 Methods: 1;
Output
hell
The directive .class is followed by the name of the class. A class is optional
in IL, which is why the earlier programs could do without one. Let us enhance
the functionality of the class by adding a few class attributes.
a.il
.assembly mukhi {}
.class private auto ansi zzz
{
.method public hidebysig static
void vijay()il managed
{
.entrypoint
ldstr "hell"
call void
System.Console::WriteLine(class System.String)
ret
}
}
Output
hell
We have added
three attributes to our class directive:
• private: This signifies that the class is visible only within its own
assembly and cannot be used by code in other assemblies.
• auto: This means that the layout of the class in memory will be decided only
at runtime, and not by our program.
• ansi: The source code is generally divided into two main categories:
- Managed Code
- Unmanaged Code
Code written in
languages like C is called unmanaged code or untrustworthy code. We need an
attribute that handles interoperability between unmanaged code and managed
code. For example, this attribute can be put to use when we want to transfer
strings between managed and unmanaged code.
If we cross the
bounds of managed code and vault into the realm of unmanaged code, a string,
which is an array of 2-byte Unicode characters, will be converted into an ANSI
string, which is an array of 1-byte ANSI characters and vice versa. The
modifier ansi is used for smooth transition between managed and unmanaged code.
a.il
.assembly mukhi {}
.class private auto ansi zzz
extends System.Object
{
.method public hidebysig static
void vijay()il managed
{
.entrypoint
ldstr "hell"
call void
System.Console::WriteLine(class System.String)
ret
}
}
Output
hell
The class zzz
has been derived from the class System.Object. In the .NET world, in order to
maintain type consistency, all types are ultimately derived from System.Object.
Thus, all objects have a common base class of Object. In IL, classes are
derived from other classes in the same manner as incorporated in programming
languages like C++, C# and Java.
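As a cross-check for readers who have a released version of ilasm, the class
above might be written as in the sketch below. It assumes the later syntax:
the base class carries the [mscorlib] prefix, the parameter type is written as
string, and the code type is cil managed. The names mukhi, zzz and vijay are
carried over from the listings in this chapter.

```il
// Sketch only: release-era ilasm syntax is assumed.
.assembly extern mscorlib {}
.assembly mukhi {}

.class private auto ansi zzz
       extends [mscorlib]System.Object
{
    .method public hidebysig static void vijay() cil managed
    {
        .entrypoint
        .maxstack 1
        ldstr "hell"
        call void [mscorlib]System.Console::WriteLine(string)
        ret
    }
}
```

The output should again be hell, since only the spelling of the directives has
changed and not the instructions.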
a.il
.module aa.exe
.subsystem 3
.corflags 1
.assembly extern mscorlib
{
.originator = (03 68 91 16
D3 A4 AE 33 )
.hash = (52 44 F8 C9 55 1F 54 3F 97 D7 AB AD E2 DF 1D E0
F2 9D 4F BC )
.ver 1:0:2204:21
}
.assembly a as "a"
{
.hash algorithm 0x00008004
.ver 0:0:0:0
}
.class private auto ansi zzz
extends System.Object
{
.method public hidebysig static
void vijay() il managed
{
.entrypoint
ldstr "hell"
call void
System.Console::WriteLine(class System.String)
ret
}
.method public hidebysig
specialname rtspecialname instance void .ctor() il managed
{
.maxstack 8
ldstr "hell1"
call void
System.Console::WriteLine(class System.String)
ldarg.0
call instance void
[mscorlib]System.Object::.ctor()
ret
}
}
Output
hell
You are bound
to wonder as to why we have written such an ungainly program. You need to
exercise a little patience before the mist clears and it all starts to make
sense. We shall explain the newly introduced functions and attributes one by
one:
.ctor: We have
introduced a new function called .ctor which calls the WriteLine function to
display hell1, but it does not get called. .ctor refers to the constructor.
rtspecialname:
This attribute signifies to the runtime that the name of the function is
special and it is to be treated in a special manner.
specialname:
This attribute alerts the compilers and tools that the function is special. The
runtime may choose to ignore this attribute.
instance: A
normal function is called an instance function. Such a function is associated
with an object, unlike a static method, which is associated with a class.
The reason for
choosing the specified name for the function will become apparent in due
course.
ldarg.0: This is an assembler instruction which loads the ZEROth argument of
the function on the execution stack. In an instance function, the ZEROth
argument is the this pointer; in a static function, it is the first parameter.
We shall explain ldarg.0 in detail subsequently.
mscorlib: In
the program above, the function .ctor is being called from the base class
System.Object. The name of the function is normally prefixed with the name of
the library that contains the code. This library name is placed within square
brackets. In this case, it is optional because mscorlib.dll is the default
library and it contains most of the classes that .NET requires.
.maxstack: This
directive specifies the maximum number of elements that can be present on the
evaluation stack when a method is being executed.
.module: All IL
files must be part and parcel of a logical entity called a module. The file is
added to a module using the .module directive. The name of the module may be
stated as aa.exe, but the name of the executable file remains the same as
before, i.e. a.exe.
.subsystem: This directive is used to specify the subsystem of the operating
system on which the executable will run. This is another way of specifying the
kind of executable the assembly is representing. Some of the numeric values and
their corresponding subsystems are as follows:
2 - A Windows GUI Subsystem.
3 - A Windows Character Subsystem.
5 - An older operating system called OS/2.
.corflags: This directive is used to specify the flags that are stored in the
runtime header of the executable. A value of 1 indicates that it is an
executable created from il and a value of 4 signifies a library.
.assembly: We
very briefly touched upon a directive called .assembly a couple of pages
earlier. Let's delve a little deeper now.
Whatever we
create is part of an entity called a manifest. The .assembly directive marks
the beginning of a manifest. In the hierarchy, the module is the next smaller
entity to a manifest. The .assembly directive specifies the assembly to which
this module belongs. A module can only contain a single .assembly directive.
The presence of
this directive is mandatory for exe files but is optional for modules in a
.dll. This is because this directive is needed to create an assembly for us. It
is a basic requirement of the .NET world. An assembly directive contains other
directives.
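.hash: Hashing is a common technique used in the computer world and there are a
large number of hashing methods or algorithms in use. Within an extern assembly
block, the .hash directive holds the hash of the file being referred to, while
.hash algorithm names the algorithm used to compute such hashes. The value
0x00008004 stands for SHA1.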
.ver: The .ver directive consists of 4 numbers separated by colons. They
represent the following information in the order given below:
• major version number
• minor version number
• build number
• revision number
extern: If
there is a requirement to refer to other assemblies, the extern directive is
used. The code of the core .NET classes is in mscorlib.dll. Besides this dll,
when our program needs to refer to code from a large number of other dlls, the
extern directive comes into play.
originator:
This is the last directive that we shall explore before we move on to explain
the essence and significance of the above example. This directive discloses the
identity of the creator of the dll. It contains eight bytes of the public key
of the owner of the dll. It is obviously a hash value.
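Two of the entries above, ldarg.0 and .maxstack, can be tried out with a short
sketch. It is only an illustration under assumptions: it uses the syntax of a
released ilasm (int32 as a type name, the [mscorlib] prefix, cil managed), and
the assembly name argdemo and the function name show are hypothetical. Because
show is static, ldarg.0 loads its first parameter and not a this pointer.

```il
// Sketch only: release-era ilasm syntax is assumed.
.assembly extern mscorlib {}
.assembly argdemo {}              // hypothetical assembly name

// A static function with two parameters; 'show' is a made-up name.
.method static void show(int32 a, int32 b) cil managed
{
    .maxstack 2                   // a and b are both on the stack before add
    ldarg.0                       // push the ZEROth argument (a)
    ldarg.1                       // push the next argument (b)
    add                           // pop both, push their sum
    call void [mscorlib]System.Console::WriteLine(int32)
    ret
}

.method static void main() cil managed
{
    .entrypoint
    .maxstack 2                   // the two values passed to show
    ldc.i4.2
    ldc.i4.3
    call void show(int32, int32)
    ret
}
```

When run, the sketch should display 5. The stack never holds more than two
values in either function, which is why a .maxstack of 2 suffices here.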
Let us revise
what we have done so far, step by step via a different approach:
(a) We started
with the smallest C# program that we could write. This program was called a.cs
and contained the following code:
a.cs
class zzz
{
public static void Main()
{
System.Console.WriteLine("hi");
}
}
(b) Then we ran
the C# compiler using the following command:
>csc a.cs
As a result, the exe file called a.exe got created.
(c) On the
executable, we ran a program called ildasm, provided by Microsoft:
>ildasm /out=a.txt a.exe
This created a
text file a.txt with the following contents:
a.txt
// Microsoft (R) .NET Framework IL Disassembler. Version 1.0.2204.21
// Copyright (C) Microsoft Corp. 1998-2000
// VTableFixup Directory:
// No data.
.subsystem 0x00000003
.corflags 0x00000001
.assembly extern mscorlib
{
.originator = (03 68 91 16 D3 A4 AE 33 ) // .h.....3
.hash = (52 44 F8 C9 55 1F 54 3F 97 D7 AB AD E2 DF 1D E0
F2 9D 4F BC ) // RD..U.T?.........O.
.ver 1:0:2204:21
}
.assembly a as "a"
{
.hash algorithm 0x00008004
.ver 0:0:0:0
}
.module aa.exe
// MVID: {89CFAD60-F5BD-11D4-A55A-96B5C7D61E7B}
.class private auto ansi zzz
       extends System.Object
{
  .method public hidebysig static void vijay() il managed
  {
    .entrypoint
    // Code size 11 (0xb)
    .maxstack 8
    IL_0000: ldstr "hell"
    IL_0005: call void System.Console::WriteLine(class System.String)
    IL_000a: ret
  } // end of method zzz::vijay
  .method public hidebysig specialname rtspecialname
          instance void .ctor() il managed
  {
    // Code size 17 (0x11)
    .maxstack 8
    IL_0000: ldstr "hell"
    IL_0005: call void System.Console::WriteLine(class System.String)
    IL_000a: ldarg.0
    IL_000b: call instance void [mscorlib]System.Object::.ctor()
    IL_0010: ret
  } // end of method zzz::.ctor
} // end of class zzz
//*********** DISASSEMBLY COMPLETE ***********************
When you read
the above file, you will realize that all of it has been explained earlier. We
started out with a simple C# program and then compiled it into an executable
file. Under normal circumstances, it would have got converted into machine
language or the assembler of the computer/microprocessor that the program is
running on. Once the executable is created, we disassemble it using ildasm. The
disassembled output is saved in a new file a.txt. This file could be named as
a.il and we could have then reversed gear by running ilasm on it to create the
executable again.
Let us take a
look at the smallest VB.NET program. We have named it as one.vb and its source
code is as follows:
one.vb
Public Module modmain
Sub Main()
System.Console.WriteLine("hell")
End Sub
End Module
After writing the above code, we run the Visual Basic.NET compiler, vbc, as:
>vbc one.vb
This produces
the file one.exe.
Next we execute
ildasm as follows:
>ildasm /out=a.txt one.exe
This produces
the following file a.txt:
a.txt
// Microsoft (R) .NET Framework IL Disassembler. Version 1.0.2204.21
// Copyright (C) Microsoft Corp. 1998-2000
// VTableFixup Directory:
// No data.
.subsystem 0x00000003
.corflags 0x00000001
.assembly extern mscorlib
{
.originator = (03 68 91 16 D3 A4 AE 33 ) // .h.....3
.hash = (52 44 F8 C9 55 1F 54 3F 97 D7 AB AD E2 DF 1D E0
F2 9D 4F BC ) // RD..U.T?..........O.
.ver 1:0:2204:21
}
.assembly extern Microsoft.VisualBasic
{
.originator = (03 68 91 16 D3 A4 AE 33 ) // .h.....3
.hash = (5B 42 1F D2 5E 1A 42 83 F5 90 B2 29 9F 35 A1 BE
E5 5E 0D E4 ) // [B..^.B....).5....
.ver 1:0:0:0
}
.assembly one as "one"
{
.hash algorithm 0x00008004
.ver 1:0:0:0
}
.module one.exe
// MVID: {1ED19820-F5C2-11D4-A55A-96B5C7D61E7B}
.class public auto ansi modmain
       extends [mscorlib]System.Object
{
  .custom instance void [Microsoft.VisualBasic]Microsoft.VisualBasic.Globals/Globals$StandardModuleAttribute::.ctor() = ( 01 00 00 00 )
  .method public static void Main() il managed
  {
    // Code size 11 (0xb)
    .maxstack 1
    .locals init (class System.Object[] V_0)
    IL_0000: ldstr "hell"
    IL_0005: call void [mscorlib]System.Console::WriteLine(class System.String)
    IL_000a: ret
  } // end of method modmain::Main
} // end of class modmain
.class private auto ansi _vbProject
       extends [mscorlib]System.Object
{
  .custom instance void [Microsoft.VisualBasic]Microsoft.VisualBasic.Globals/Globals$StandardModuleAttribute::.ctor() = ( 01 00 00 00 )
  .method public static void _main(class System.String[] _s) il managed
  {
    .entrypoint
    // Code size 6 (0x6)
    .maxstack 8
    IL_0000: call void modmain::Main()
    IL_0005: ret
  } // end of method _vbProject::_main
} // end of class _vbProject
//*********** DISASSEMBLY COMPLETE ***********************
You would be
amazed to see that the outputs produced by two different compilers are almost
identical. We have shown you this example to demonstrate that, irrespective of
the language you use, ultimately, the source code will get converted to IL
code. Whether we use VB.NET or C#, the same WriteLine function gets called.
Thus, the differences between programming languages have now become a
superficial issue.
The endless debate over which language is superior has finally been put to
rest. Thus, IL has created a situation where programmers are free to use the
programming language of their choice.
Let us now
demystify the code given above.
Every VB.NET program needs to be included in a module. We’ve called it modmain.
Every module in Visual Basic has to be closed with the keyword End, hence we
see End Module. This is where the syntax of VB differs from that of C#, which
does not understand modules.
In VB.NET,
functions are known as sub-routines. We need a sub-routine to mark the starting
point of program execution. This sub-routine is called Main.
The VB.NET code not only refers to mscorlib.dll, but also uses the assembly
Microsoft.VisualBasic.
A class called _vbProject is created in IL, as a class name is not mandatory
in VB.
The function
called _main is the starting sub-routine to be called as it has the entrypoint
directive. Its name is preceded by a leading underscore. These names are chosen
by the VB compiler that generates the IL code.
This function
is passed an array of strings as a parameter. It has a custom directive that
deals with the concept of metadata.
Next, we have
the full prototype of the function, ending with an optional series of bytes.
These bytes are part of the metadata specifications.
The module
modmain gets converted into a class having the same name. This class also has
the same directive .custom as before and a function called Main. The function
uses a directive called .locals to create a variable on the stack that can only
be used within the method. This variable exists only for the duration of the
execution of the method and dies when the method stops running.
Fields are also stored in memory, but it takes a longer time to allocate
memory for them. The word init signifies that on creation, these variables
should be initialized to their default values. The default values depend upon
the type of the variable. Numbers are always initialized to the value ZERO. The
word init is followed by the data type of the variable and finally by its name.
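The effect of init can be observed with a small sketch. It is written under
assumptions: the syntax is that of a released ilasm (int32, the [mscorlib]
prefix, cil managed), and the assembly name localdemo and the variable name
count are hypothetical. The local variable is never assigned a value, so
whatever is displayed is the default that init gave it.

```il
// Sketch only: release-era ilasm syntax is assumed.
.assembly extern mscorlib {}
.assembly localdemo {}            // hypothetical assembly name

.method static void main() cil managed
{
    .entrypoint
    .maxstack 1
    .locals init (int32 count)    // one local variable, set to its default
    ldloc.0                       // push local variable number zero (count)
    call void [mscorlib]System.Console::WriteLine(int32)
    ret
}
```

When run, the sketch should display 0, the default value for a number.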