Introduction to Microsoft’s IL

-1-

The code that we write in C#, ASP+ or any other .NET-compatible language is finally converted to Intermediate Language (IL). Because every language compiles to the same IL, code written in the COBOL Programming Language can be extended in C# and subsequently used in ASP+. Therefore, the best way to deepen our understanding of the .NET technologies is to understand IL.

Once you are conversant with IL, you will have no difficulty in understanding the .NET technologies, since all .NET languages finally compile to it. IL was invented first and it is programming language neutral. It was then followed by other programming languages like C#, Visual Basic.NET, ASP.NET, etc.

We shall raise the curtains on IL with a very small program. We will also assume that you are familiar with at least one .NET programming language.

a.il

.method void vijay()

{

}

We have written a very small non-working IL program in the il subdirectory and named it a.il. How do we assemble it into an executable program? There is no need to fret over this problem. Microsoft has provided a program called ilasm whose sole task is to create an executable file from an IL file.

Before you run this command, make sure that your path variable includes the bin subdirectory of the framework. If it does not, give the command as

set path=c:\progra~1\microsoft.net\frameworksdk\bin;%PATH%

Now we use the command as follows:

c:\il>ilasm /nologo /quiet a.il

On doing so, the following error is generated:

Source file is ANSI

Error: No entry point declared for executable

***** FAILURE *****

In future, we shall not display the first and the last lines of the output generated by ilasm. We shall also remove the blank lines between non-blank lines.

In IL, we are permitted to commence a line with or without a dot '.'. Anything that begins with a dot is a directive to the assembler, asking it to perform some function, such as creating a function or class etc. Anything that does not start with a '.' is an actual assembler instruction.
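The rule above can be tried out with a minimal Python sketch. The function name classify is our own invention and is not part of ilasm; the sketch looks only at the first character of each line, which is a simplification of what a real assembler does.

```python
def classify(line):
    """Label one line of IL source using the leading-dot rule.

    Hypothetical helper for illustration only; ilasm itself is far
    more thorough than this first-character check.
    """
    text = line.strip()
    # Blank lines, braces and comments are neither directives nor instructions.
    if not text or text in ("{", "}") or text.startswith("//"):
        return "other"
    # A leading dot marks a directive to the assembler.
    if text.startswith("."):
        return "directive"
    # Everything else is treated as an assembler instruction.
    return "instruction"


for source_line in [".method void vijay()", "{", ".entrypoint", "ret", "}"]:
    print(source_line, "->", classify(source_line))
```

Run on the lines of our small program, the sketch labels .method and .entrypoint as directives and ret as an instruction.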

The significance of .method is that a function or method called vijay is created and this function returns void i.e. it does not return any value. The function has been named vijay arbitrarily for want of any other superior nomenclature.

The assembler was obviously not impressed with this program and thus brandished the message 'no entry point'. This error message is generated because the IL file can contain numerous functions, and the assembler has no way of distinguishing as to which of them is to be executed first.

In IL, the first function to be executed is called the entrypoint function. In C#, the function is Main. The syntax for a function is the name followed by the familiar pair of round () brackets. The start point and the end point of the function's code is signified by the curly braces {}.

a.il

.method void vijay()

{

.entrypoint

}

c:\il>ilasm /nologo /quiet a.il

Source file is ANSI

Creating PE file

Emitting members:

Global Methods: 1;

Writing PE file

Operation completed successfully

Now no error is generated. The directive .entrypoint signifies that program execution has to begin from this function. We have to use this directive even though this program has only one function. On giving the dir command at the DOS prompt, we see three files created. a.exe is the executable file, which can now be run to see the output of the program.

C:\il>a

Exception occurred: System.BadImageFormatException: Exception from HRESULT: 0x8007000B. Failed to load C:\IL\A.EXE.

Our luck seems to run out when we try to execute the above program because the above run-time error is generated. One probable reason for this could be the poor formation of the function. Every function should have the instruction 'end of function' incorporated in it. We obviously overlooked this fact in our haste.

a.il

.method void vijay()

{

.entrypoint

ret

}

The 'end of function' instruction is called ret. All well formed functions have to end with this instruction.

Output

Exception occurred: System.BadImageFormatException: Exception from HRESULT: 0x8007000B. Failed to load C:\IL\A.EXE.

On executing the function, we get the same error again. Where could we have faltered this time?

a.il

.assembly mukhi {}

.method void vijay()

{

.entrypoint

ret

}

The blunder was that we forgot to use the mandatory directive called assembly followed by a name. We have incorporated it in the code above, and have used the name mukhi followed by a pair of empty curly braces {}. The assembly directive is used to give a name to the program. The assembly that it names is also called a deployment unit.

The code above is the smallest program that can be assembled without any errors, though it does not perform anything useful when executed. It does not have any function called Main. It only has a function called vijay with the entrypoint directive. The program now assembles and runs with no errors at all.

The concept of assembly is extremely crucial in the .NET world and should be thoroughly understood. We will explore this directive in the latter half of the chapter.

a.il

.assembly mukhi {}

.method void vijay()

{

.entrypoint

ret

}

.method void vijay1()

{

.entrypoint

ret

}

Error

***** FAILURE *****

The cause for the above failure message is that the above program has two functions, vijay and vijay1, with each containing the .entrypoint directive. As mentioned earlier, this directive specifies as to which function is to be executed first.

Thus, in functionality, it is akin to the Main function in C#. When C# code gets converted into IL code, the code contained in the function Main gets converted into a function in IL and contains the directive .entrypoint. For example, if the first function to be executed in a COBOL program is called abc, the code generated in IL inserts the .entrypoint directive in this function.

In conventional programming languages, the function to be executed first has to have a specific name, e.g. Main, but in IL, only the .entrypoint directive is required. Since a program can have only one starting point, only one function in the IL code is allowed to contain the .entrypoint directive.

It is pertinent to note that no error message number or explanation is generated, making it difficult to debug this error.
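Since the assembler gives so little help here, a quick check of our own can be useful. The following is a minimal Python sketch, not a feature of ilasm: the function names entrypoint_methods and check are hypothetical, and the sketch assumes that each .method header and each .entrypoint directive sits on its own line, as in our listings.

```python
import re


def entrypoint_methods(il_source):
    """Return the names of the methods whose body contains .entrypoint."""
    found = []
    current = None  # name of the method we are currently inside
    for raw in il_source.splitlines():
        line = raw.strip()
        if line.startswith(".method"):
            # The method name is the identifier just before the '('.
            match = re.search(r"([\w.]+)\s*\(", line)
            current = match.group(1) if match else "?"
        elif line == ".entrypoint" and current is not None:
            found.append(current)
    return found


def check(il_source):
    """Describe whether the source has exactly one entry point."""
    names = entrypoint_methods(il_source)
    if not names:
        return "no entry point declared"
    if len(names) > 1:
        return "more than one entry point: " + ", ".join(names)
    return "entry point is " + names[0]


two_entry_points = """
.assembly mukhi {}
.method void vijay()
{
.entrypoint
ret
}
.method void vijay1()
{
.entrypoint
ret
}
"""
print(check(two_entry_points))
```

For the failing program above, the sketch names both vijay and vijay1, which is exactly the information that the assembler output leaves out.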

a.il

.assembly mukhi {}

.method void vijay()

{

ret

.entrypoint

}

The .entrypoint directive need not be positioned as the first or last directive in the function. It has to merely be present in the body of the function, to herald its status as the first function to be executed. Directives are not assembly instructions and can even be placed after the ret instruction. To remind you, ret signifies the end of the function code.

a.il

.assembly mukhi {}

.method void vijay()

{

.entrypoint

call void System.Console::WriteLine()

ret

}

We may have a function written in C#, ASP+ or COBOL, but the mechanism for executing this function in IL is the same. It is as follows:

We have to use the assembler instruction call. The call instruction is to be followed by the following details in the given sequence:

• return type of the function (void).

• the namespace (System).

• the class (Console).

• the function name (WriteLine()).

The function gets called but does not produce any output. So, we pass a parameter to the WriteLine function.

a.il

.assembly mukhi {}

.method void vijay()

{

.entrypoint

call void System.Console::WriteLine(class System.String)

ret

}

The above code has a glaring omission. When a function is called in IL, the data types of the parameters being passed have to be specified in addition to its return type. We have stated that the WriteLine function expects a parameter of the class named System.String, but since no string is actually passed to the function, a runtime error is generated.

Thus, there is a significant difference between IL and other programming languages when it comes to calling a function. In IL, when we call a function, we have to specify everything we know about the function, including its return type and the data types of its parameters. This lets the assembler identify exactly which function is meant, and lets the runtime check that the call matches that function.
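To see how the pieces of a call fit together, here is a minimal Python sketch that builds the text of a call instruction from its parts. The helper name call_instruction is our own and purely illustrative; it only joins strings in the order used by the listings in this chapter.

```python
def call_instruction(return_type, namespace, class_name, method, param_types=()):
    """Build the text of an IL call instruction from its parts.

    Hypothetical helper: the namespace and class are joined by a dot,
    the class and the function by '::', and the parameter types are
    listed inside the round brackets.
    """
    params = ", ".join(param_types)
    return f"call {return_type} {namespace}.{class_name}::{method}({params})"


# The call without parameters, then the call that takes a string.
print(call_instruction("void", "System", "Console", "WriteLine"))
print(call_instruction("void", "System", "Console", "WriteLine",
                       ["class System.String"]))
```

The two lines printed are the two call instructions used in the programs above.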

We shall now see how to facilitate passing of parameters to a function.

a.il

.assembly mukhi {}

.method void vijay()

{

.entrypoint

ldstr "hell"

call void System.Console::WriteLine(class System.String)

ret

}

Output

hell

The assembler instruction ldstr places a string on the stack. The name ldstr is an abbreviated version of the text "load a string on the stack". A stack is an area of memory that facilitates passing of parameters to a function. All functions receive their parameters from the stack. Thus, instructions like ldstr are indispensable.
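The role of the stack can be imitated with a few lines of Python. This is only a toy model under our own assumptions: the function run and the way a call is encoded (as the number of parameters it takes) are invented for illustration, and the list named output stands in for Console::WriteLine.

```python
def run(program):
    """Execute (opcode, operand) pairs on a toy evaluation stack."""
    stack = []
    output = []  # stand-in for what WriteLine would display
    for opcode, operand in program:
        if opcode == "ldstr":
            # Load a string on the stack.
            stack.append(operand)
        elif opcode == "call":
            # Hypothetical encoding: operand is the number of parameters.
            # The callee takes its parameters off the stack.
            args = [stack.pop() for _ in range(operand)][::-1]
            output.append(" ".join(args))
        elif opcode == "ret":
            # End of the function code.
            break
        else:
            raise ValueError("unknown opcode: " + opcode)
    return output


program = [("ldstr", "hell"), ("call", 1), ("ret", None)]
print(run(program))
```

If the ldstr line is removed from the toy program, the call finds nothing on the stack and the sketch fails, much as the earlier program failed at run time when no string was passed to WriteLine.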

a.il

.assembly mukhi {}

.method public hidebysig static void vijay()il managed

{

.entrypoint

ldstr "hell"

call void System.Console::WriteLine(class System.String)

ret

}

Output

hell

We have added some attributes to the method vijay. We shall explain them one by one below.

public: This is called an accessibility attribute, as it decides who can access a method. Public means that this method is accessible to every other part of the program.

hidebysig: A class can be derived from another class, which in turn may be derived from yet another. The attribute hidebysig states that this function hides any function in a base class that has the same name and signature. In this example, it makes sure that if a function vijay with the same signature is present in a base class, it is hidden by the one in the derived class.

static: Methods can either be static or non-static. A static method belongs to a class and not to an instance. Thus, as we have only a single class, we cannot have more than one copy of a static function. There are no restrictions on where a static method can be created. The function with the entrypoint directive must be static. Static functions must have a body or source code associated with them and they are referenced using the type name and not the instance name.

il managed: Due to its complex nature, we shall postpone the explanation of this attribute. When the time is appropriate, its functionality will be clearly explained.

The abovementioned attributes do not modify the output of the function. In a short while, it will become apparent to you as to why we have provided the explanation of these attributes.

Whenever we write a program in the C# programming language, we first specify the keyword class, followed by the name of the class and then, we enclose the source code within a pair of curly braces {}. This is demonstrated in a.cs

a.cs

class zzz

{

}

Let us now introduce the IL directive called class.

a.il

.assembly mukhi {}

.class zzz

{

.method public hidebysig static void vijay()il managed

{

.entrypoint

ldstr "hell"

call void System.Console::WriteLine(class System.String)

ret

}

}

Notice the change in assembler output : Class 1 Methods: 1;

Output

hell

The directive .class is followed by the name of the class. As the earlier programs show, placing a function inside a class is optional in IL. Let us enhance the functionality of the class by adding a few class attributes.

a.il

.assembly mukhi {}

.class private auto ansi zzz

{

.method public hidebysig static void vijay()il managed

{

.entrypoint

ldstr "hell"

call void System.Console::WriteLine(class System.String)

ret

}

}

Output

hell

We have added three attributes to our class directive:

• private: This signifies that the class is visible only within the current assembly and cannot be accessed from outside it.

• auto: This means that the layout of the class in memory will be decided only at runtime, and not by our program.

• ansi: The source code is generally divided into two main categories:

- Managed Code

- Unmanaged Code

Code written in languages like C is called unmanaged code or untrustworthy code. We need an attribute that handles interoperability between unmanaged code and managed code. For example, this attribute can be put to use when we want to transfer strings between managed and unmanaged code.

If we cross the bounds of managed code and vault into the realm of unmanaged code, a string, which is an array of 2-byte Unicode characters, will be converted into an ANSI string, which is an array of 1-byte ANSI characters and vice versa. The modifier ansi is used for smooth transition between managed and unmanaged code.
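The size difference between the two kinds of strings is easy to observe. The Python sketch below is only an illustration of the idea: the function names are our own, and the code page cp1252 is an assumption standing in for ANSI, since the actual ANSI code page depends on the system.

```python
def managed_bytes(text):
    """Bytes of a string as 2-byte Unicode characters (UTF-16)."""
    return text.encode("utf-16-le")


def to_unmanaged(text, code_page="cp1252"):
    """Convert to 1-byte ANSI characters (code page is an assumption)."""
    return text.encode(code_page)


def to_managed(data, code_page="cp1252"):
    """Convert ANSI bytes back into a string, the 'vice versa' direction."""
    return data.decode(code_page)


text = "hell"
print(len(managed_bytes(text)))   # number of bytes as Unicode
print(len(to_unmanaged(text)))    # number of bytes as ANSI
print(to_managed(to_unmanaged(text)) == text)
```

A four-character string occupies eight bytes as Unicode and four bytes as ANSI, and for plain characters such as these the round trip gives back the original string.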

a.il

.assembly mukhi {}

.class private auto ansi zzz extends System.Object

{

.method public hidebysig static void vijay()il managed

{

.entrypoint

ldstr "hell"

call void System.Console::WriteLine(class System.String)

ret

}

}

Output

hell

The class zzz has been derived from the class System.Object. In the .NET world, in order to maintain type consistency, all types are ultimately derived from System.Object. Thus, all objects have a common base class of Object. In IL, classes are derived from other classes in the same manner as in programming languages like C++, C# and Java.

a.il

.module aa.exe

.subsystem 3

.corflags 1

.assembly extern mscorlib

{

.originator = (03 68 91 16 D3 A4 AE 33 )

.hash = (52 44 F8 C9 55 1F 54 3F 97 D7 AB AD E2 DF 1D E0

F2 9D 4F BC )

.ver 1:0:2204:21

}

.assembly a as "a"

{

.hash algorithm 0x00008004

.ver 0:0:0:0

}

.class private auto ansi zzz extends System.Object

{

.method public hidebysig static void vijay() il managed

{

.entrypoint

ldstr "hell"

call void System.Console::WriteLine(class System.String)

ret

}

.method public hidebysig specialname rtspecialname instance void .ctor() il managed

{

.maxstack 8

ldstr "hell1"

call void System.Console::WriteLine(class System.String)

ldarg.0

call instance void [mscorlib]System.Object::.ctor()

ret

}

}

Output

hell

You are bound to wonder as to why we have written such an ungainly program. You need to exercise a little patience before the mist clears and it all starts to make sense. We shall explain the newly introduced functions and attributes one by one:

.ctor: We have introduced a new function called .ctor which calls the WriteLine function to display hell1, but it does not get called. .ctor refers to the constructor.

rtspecialname: This attribute signifies to the runtime that the name of the function is special and it is to be treated in a special manner.

specialname: This attribute alerts the compilers and tools that the function is special. The runtime may choose to ignore this attribute.

instance: A normal function is called an instance function. Such a function is associated with an object, unlike a static method, which is associated with a class.

The reason for choosing the specified name for the function will become apparent in due course.

ldarg.0: This is an assembler instruction which loads the ZEROth argument onto the execution stack. In an instance function such as .ctor, the ZEROth argument is the this pointer. We shall explain ldarg.0 in detail subsequently.

mscorlib: In the program above, the function .ctor is being called from the base class System.Object. The name of the function is normally prefixed with the name of the library that contains the code. This library name is placed within square brackets. In this case, it is optional because mscorlib.dll is the default library and it contains most of the classes that .NET requires.

.maxstack: This directive specifies the maximum number of elements that can be present on the evaluation stack when a method is being executed.
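To get a feel for what this number measures, the following Python sketch tracks the depth of the stack through the .ctor function shown above. It is a rough model under our own assumptions: the function name max_stack_depth is invented, and each instruction is described by hand as a pair giving the number of elements it removes from the stack and the number it places on it.

```python
def max_stack_depth(effects):
    """Return the largest number of elements on the stack at any time.

    effects is a list of (pops, pushes) pairs, one per instruction.
    """
    depth = 0
    peak = 0
    for pops, pushes in effects:
        depth -= pops
        if depth < 0:
            raise ValueError("instruction needs more elements than are present")
        depth += pushes
        peak = max(peak, depth)
    return peak


# The .ctor function from the listing, described by hand:
ctor = [
    (0, 1),  # ldstr "hell1"         places a string on the stack
    (1, 0),  # call WriteLine        takes the string, returns void
    (0, 1),  # ldarg.0               places the ZEROth argument on the stack
    (1, 0),  # call Object::.ctor    takes that argument, returns void
    (0, 0),  # ret
]
print(max_stack_depth(ctor))
```

For this function the sketch reports a depth of 1, so the value 8 given in the listing is simply a generous upper limit.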

.module: All IL files must be part and parcel of a logical entity called a module. The file is added to a module using the .module directive. The name of the module may be stated as aa.exe, but the name of the executable file remains the same as before, i.e. a.exe.

.subsystem: This directive is used to specify the environment, or subsystem, in which the executable will run. This is another way of specifying the kind of executable the assembly is representing. Some of the numeric values and their corresponding subsystems are as follows:

2 - The Windows GUI Subsystem.

3 - The Windows Character (console) Subsystem.

5 - The Character Subsystem of an older operating system called OS/2.

.corflags: This directive is used to specify flags that are stored in the runtime header of the executable file. A value of 1 indicates that the file contains only IL code and a value of 4 signifies a library.

.assembly: We very briefly touched upon a directive called .assembly a couple of pages earlier. Let's delve a little deeper now.

Whatever we create is part of an entity called a manifest. The .assembly directive marks the beginning of a manifest. In the hierarchy, the module is the next smaller entity to a manifest. The .assembly directive specifies the assembly to which this module belongs. A module can only contain a single .assembly directive.

The presence of this directive is mandatory for exe files but is optional for modules in a .dll. This is because this directive is needed to create an assembly for us. It is a basic requirement of the .NET world. An assembly directive contains other directives.

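.hash: Hashing is a common technique used in the computer world and there are a large number of hashing methods or algorithms in use. Inside an .assembly extern block, the .hash directive stores the hash value of the assembly being referred to, so that a changed file can be detected. The .hash algorithm directive names the algorithm used to compute such hash values; the value 0x00008004 stands for SHA1.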

.ver: The .ver directive consists of 4 numbers separated by colons. They represent the following information in the order given below:

• major version number

• minor version number

• build

• revision number
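The four parts can be pulled apart with a small Python sketch. The names Version and parse_ver are our own and purely illustrative; the sketch assumes the text is exactly four whole numbers separated by colons, as in the listing.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """The four parts of a .ver directive, in the order listed above."""
    major: int
    minor: int
    build: int
    revision: int


def parse_ver(text):
    """Split text such as '1:0:2204:21' into its four numbers."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError("expected 4 numbers separated by colons")
    return Version(*(int(part) for part in parts))


mscorlib_version = parse_ver("1:0:2204:21")
print(mscorlib_version.major, mscorlib_version.build)
```

Since the result is an ordinary tuple, two versions can be compared directly, and the comparison proceeds from the major number down to the revision number.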

extern: If there is a requirement to refer to other assemblies, the extern directive is used. The code of the core .NET classes is in mscorlib.dll. Besides this dll, when our program needs to refer to code from a large number of other dlls, the extern directive comes into play.

originator: This is the last directive that we shall explore before we move on to explain the essence and significance of the above example. This directive discloses the identity of the creator of the dll. It contains eight bytes derived from the public key of the owner of the dll, in the form of a hash value.

Let us revise what we have done so far, step by step via a different approach:

(a) We started with the smallest C# program that we could write. This program was called a.cs and contained the following code:

a.cs

class zzz

{

public static void Main()

{

System.Console.WriteLine("hi");

}

}

(b) Then we ran the C# compiler using the following command:

>csc a.cs

As a result, the exe file called a.exe got created.

(c) Finally, we ran the disassembler ildasm on a.exe:

>ildasm /out=a.txt a.exe

This created a text file a.txt with the following contents:

a.txt

// Microsoft (R) .NET Framework IL Disassembler. Version 1.0.2204.21

// Copyright (C) Microsoft Corp. 1998-2000

// VTableFixup Directory:

// No data.

.subsystem 0x00000003

.corflags 0x00000001

.assembly extern mscorlib

{

.originator = (03 68 91 16 D3 A4 AE 33 ) // .h.....3

.hash = (52 44 F8 C9 55 1F 54 3F 97 D7 AB AD E2 DF 1D E0

F2 9D 4F BC ) // RD..U.T?.........O.

.ver 1:0:2204:21

}

.assembly a as "a"

{

.hash algorithm 0x00008004

.ver 0:0:0:0

}

.module aa.exe

// MVID: {89CFAD60-F5BD-11D4-A55A-96B5C7D61E7B}

.class private auto ansi zzz

extends System.Object

{

.method public hidebysig static void vijay() il managed

{

.entrypoint

// Code size 11 (0xb)

.maxstack 8

IL_0000: ldstr "hell"

IL_0005: call void System.Console::WriteLine(class System.String)

IL_000a: ret

} // end of method zzz::vijay

.method public hidebysig specialname rtspecialname

instance void .ctor() il managed

{

// Code size 17 (0x11)

.maxstack 8

IL_0000: ldstr "hell"

IL_0005: call void System.Console::WriteLine(class System.String)

IL_000a: ldarg.0

IL_000b: call instance void [mscorlib]System.Object::.ctor()

IL_0010: ret

} // end of method zzz::.ctor

} // end of class zzz

//*********** DISASSEMBLY COMPLETE ***********************

When you read the above file, you will realize that all of it has been explained earlier. We started out with a simple C# program and then compiled it into an executable file. Under normal circumstances, it would have got converted into machine language or the assembler of the computer/microprocessor that the program is running on. Once the executable is created, we disassemble it using ildasm. The disassembled output is saved in a new file a.txt. This file could be named as a.il and we could have then reversed gear by running ilasm on it to create the executable again.

Let us take a look at the smallest VB.NET program. We have named it as one.vb and its source code is as follows:

one.vb

Public Module modmain

Sub Main()

System.Console.WriteLine("hell")

End Sub

End Module

After writing the above code, we run the Visual Basic.NET compiler, vbc, as:

>vbc one.vb

This produces the file one.exe.

Next we execute ildasm as follows:

>ildasm /out=a.txt one.exe

This produces the following file a.txt:

a.txt

// Microsoft (R) .NET Framework IL Disassembler. Version 1.0.2204.21

// VTableFixup Directory:

// No data.

.subsystem 0x00000003

.corflags 0x00000001

.assembly extern mscorlib

{

.originator = (03 68 91 16 D3 A4 AE 33 ) // .h.....3

.hash = (52 44 F8 C9 55 1F 54 3F 97 D7 AB AD E2 DF 1D E0

F2 9D 4F BC ) // RD..U.T?..........O.

.ver 1:0:2204:21

}

.assembly extern Microsoft.VisualBasic

{

.originator = (03 68 91 16 D3 A4 AE 33 ) // .h.....3

.hash = (5B 42 1F D2 5E 1A 42 83 F5 90 B2 29 9F 35 A1 BE

E5 5E 0D E4 ) // [B..^.B....).5....

.ver 1:0:0:0

}

.assembly one as "one"

{

.hash algorithm 0x00008004

.ver 1:0:0:0

}

.module one.exe

// MVID: {1ED19820-F5C2-11D4-A55A-96B5C7D61E7B}

.class public auto ansi modmain

extends [mscorlib]System.Object

{

.custom instance void [Microsoft.VisualBasic]Microsoft.VisualBasic.Globals/Globals$StandardModuleAttribute::.ctor() = ( 01 00 00 00 )

.method public static void Main() il managed

{

// Code size 11 (0xb)

.maxstack 1

.locals init (class System.Object[] V_0)

IL_0000: ldstr "hell"

IL_0005: call void [mscorlib]System.Console::WriteLine(class System.String)

IL_000a: ret

} // end of method modmain::Main

} // end of class modmain

.class private auto ansi _vbProject

extends [mscorlib]System.Object

{

.custom instance void [Microsoft.VisualBasic]Microsoft.VisualBasic.Globals/Globals$StandardModuleAttribute::.ctor() = ( 01 00 00 00 )

.method public static void _main(class System.String[] _s) il managed

{

.entrypoint

// Code size 6 (0x6)

.maxstack 8

IL_0000: call void modmain::Main()

IL_0005: ret

} // end of method _vbProject::_main

} // end of class _vbProject

//*********** DISASSEMBLY COMPLETE ***********************

You would be amazed to see that the outputs produced by two different compilers are almost identical. We have shown you this example to demonstrate that, irrespective of the language you use, ultimately, the source code will get converted to IL code. Whether we use VB.NET or C#, the same WriteLine function gets called.

Thus, the differences between programming languages have now become a superficial issue. The endless debate over which language is superior has finally been put to rest, since IL has created a situation where programmers are free to use the programming language of their choice.

Let us now demystify the code given above.

Every VB.NET program needs to be placed in a module. We’ve called it modmain. Every module in Visual Basic has to be closed with the keyword End, hence we see End Module. This is where the syntax of VB differs from that of C#, which does not understand modules.

In VB.NET, functions are known as sub-routines. We need a sub-routine to mark the starting point of program execution. This sub-routine is called Main.

The VB.NET code not only refers to mscorlib.dll, but also uses the file Microsoft.VisualBasic.

A class called _vbProject is created in IL, as a class name is not mandatory in VB.

The function called _main is the starting sub-routine to be called as it has the entrypoint directive. Its name is preceded by a leading underscore. These names are chosen by the VB compiler that generates the IL code.

This function is passed an array of strings as a parameter. It has a custom directive that deals with the concept of metadata.

Next, we have the full prototype of the function, ending with an optional series of bytes. These bytes are part of the metadata specifications.

The module modmain gets converted into a class having the same name. This class also has the same directive .custom as before and a function called Main. The function uses a directive called .locals to create a variable on the stack that can only be used within the method. This variable exists only for the duration of the execution of the method and dies when the method stops running.

Fields are also stored in memory, but it takes a longer time to allocate memory for them. The word init signifies that on creation, these variables should be initialized to their default values. The default values depend upon the type of the variable. Numbers are always initialized to the value ZERO. The word init is followed by the data type of the variable and finally by its name.
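The layout of the .locals directive can be made concrete with a short Python sketch that splits it into data types and names. The function name parse_locals is our own; the sketch assumes that the declarations sit inside one pair of round brackets, are separated by commas, and that the name is the last word of each declaration.

```python
def parse_locals(directive):
    """Return (uses_init, [(data_type, name), ...]) for a .locals directive."""
    header, _, rest = directive.partition("(")
    uses_init = "init" in header.split()
    inside = rest[:rest.rindex(")")]
    variables = []
    for declaration in inside.split(","):
        # The data type comes first and the name is the final word.
        data_type, _, name = declaration.strip().rpartition(" ")
        variables.append((data_type, name))
    return uses_init, variables


print(parse_locals(".locals init (class System.Object[] V_0)"))
```

For the directive in the listing above, the sketch reports that init is present and that there is one variable, named V_0, of the data type class System.Object[].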