8086 Assembly Language
A generic assembly language was presented earlier to create a few simple programs and to show how the CPU executes code. In this chapter, the assembly language of the Intel 80x86 processor family is introduced along with the typical syntax for writing 80x86 assembly language programs. This information is then used to write a sample program for the 80x86 processor.
Assemblers versus Compilers
For a high-level programming language such as C, there is a two-step process to produce an application from source code. To begin with, a program called a compiler takes the source code and converts it into machine language instructions. This is a complex task that requires a detailed understanding of the architecture of the processor.
The compiler outputs the resulting sequence of machine code instructions to a file called an object file. The second step takes one or more object files and combines them by merging addressing information and generating necessary support code to make the final unit operate as an application. The program that does this is called a linker.
In order for the linker to operate properly, the object files must follow certain rules for format and addressing to clearly show how one object file interrelates with the others.
A similar two-step process is used to convert assembly language source code into an application. It begins with a program called an assembler. The assembler takes an assembly language program, and using a one-to-one conversion process, converts each line of assembly language to a single machine code instruction.
Because of this one-to-one relation between assembly language instructions and machine code instructions, the assembly language programmer must have a clear understanding of how the processor will execute the machine code.
In other words, the programmer must take the place of the compiler by converting abstract processes to the step-by-step processor instructions.
As with the compiler, the output of the assembler is an object file. The format and addressing information of the assembler’s object file should mimic that of the compiler, making it possible for the same linker to be used to generate the final application.
This means that as long as the assembly language programmer follows certain rules when identifying shared addressing, the object file from an assembler should be capable of being linked to the object files of a high-level language compiler.
The format of an assembly language program depends on the assembler being used. There are, however, some general formatting patterns that are typically followed. This section presents some of those standards.
Components of a Line of Assembly Language
As shown in the figure below, a line of assembly language code has four fields: a label, an opcode, a set of operands, and comments. Each of these fields must be separated by horizontal white space, i.e., spaces or tabs. No carriage returns are allowed within a line, as they identify the beginning of a new line of code. Depending on the function of a particular line, one or more of the fields may be omitted.
The first field of a line is an optional label field. A label is used to identify a specific line of code or the memory location of a piece of data so that it may be referenced by other lines of assembly language. The assembler will translate the label into an address for use in the object file.
As far as the programmer is concerned, however, the label may be used any time an address reference is needed to that particular line. It is not necessary to label all lines of assembly language code, only the ones that are referred to by other lines of code.
A label is a text string much like a variable name in a high-level language. There are some rules to be obeyed when defining a label.
• Labels must begin in the first column with an alphabetic character. Subsequent characters may be numeric.
• A label must not be a reserved string, i.e., it cannot be an assembly language instruction, nor can it be a command to the assembler.
• Although a label may be referenced by other lines of assembly language, it cannot be reused to identify a second line of code within the same file.
• In some cases, a special format for a label may be required if the label’s function goes beyond identification of a line within a file. A special format may be needed, for example, if a high-level programming language will be referencing one of the assembly language program’s functions.
The next field is the instruction or opcode field. The instruction field contains the assembly language command that the processor is supposed to execute for this line of code. An instruction must be either an assembly language instruction (an opcode) or an instruction to the assembler (an assembler directive).
The third field is the operand field. The operand field contains the data or operands that the assembly language instruction needs for its execution. This includes items such as memory addresses, constants, or register names. Depending on the instruction, there may be zero, one, two, or three operands, the syntax and organization of which also depends on the instruction.
The last field in a line of assembly language is the comment field. As was mentioned earlier, assembly language has no structure in the syntax to represent blocks of code. Although the specific operation of a line of assembly language should be clear to a programmer, its purpose within the program usually is not. It is therefore imperative to comment assembly language programs.
In addition to the standard use of comments, comments in assembly language can be used to:
• show where functions or blocks of code begin and end;
• explain the order or selection of commands (e.g., where a shift left has replaced a multiplication by a power of two);
• identify obscure values (e.g., that address 0378₁₆ represents the data registers of the parallel port).
A comment is identified with a preceding semicolon, ‘;’. All text from the semicolon to the end of the line is ignored. This is much like the double-slash, “//”, used in C++ or the single quote used in Visual Basic to comment out the remaining text of a line. A comment may be alone in a line or it may follow the last necessary field of a line of code.
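The four fields can be seen in a short fragment. This is only a sketch in MASM-style syntax, which is an assumption since the format depends on the assembler; in that syntax a label on a line of code is followed by a colon. The label names START and DONE are invented for illustration, and the lines are meant to sit inside a code segment.

```asm
; label   opcode  operands    comment
START:    MOV     AX, 5       ; all four fields are present
          ADD     AX, 3       ; label omitted: no other line refers to this one
          NOP                 ; an opcode that takes no operands
DONE:     JMP     DONE        ; a label used as an operand (here, by its own line)
; A comment may also occupy a line by itself.
```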
Assembly Language Directives
There are exceptions in an assembly language program to the opcode/operand lines described in the previous section. One of the primary exceptions is the assembler directive. Assembler directives are instructions to the assembler or the linker indicating how the program should be created.
One of the most important directives with respect to the final addressing and organization of the application is SEGMENT. This directive is used to define the characteristics and/or contents of a segment.
There are three main segments:
1. the code segment,
2. the data segment, and
3. the stack segment.
To define these segments, the assembly language file is divided into areas using the SEGMENT directive. The beginning of the segment is defined with the keyword SEGMENT while its end is defined using the keyword ENDS. The figure below presents the format and parameters used to define a segment.
The label uniquely identifies the segment. The SEGMENT directive label must match the corresponding ENDS directive label. The alignment attribute indicates the “multiple” of the starting address for the segment.
For a number of reasons, either the processor or the operating system may require that a segment begin on an address that is divisible by a certain power of two. The align attribute tells the assembler which power of two the starting address must be a multiple of. The following is a list of the available settings for alignment.
• BYTE – There is no restriction on the starting address.
• WORD – The starting address must be even, i.e., the binary address must end in a zero.
• DWORD – The starting address must be divisible by four, i.e., the binary address must end in two zeros.
• PARA – The starting address must be divisible by 16, i.e., the binary address must end in four zeros.
• PAGE – The starting address must be divisible by 256, i.e., the binary address must end in eight zeros.
The combine attribute is used to tell the linker whether a segment can be combined with other segments.
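As an illustration of the label, align, and combine fields, the following sketch defines a data segment and a stack segment in MASM-style syntax. The segment names DSEG and SSEG and the variable TOTAL are hypothetical. PUBLIC and STACK are two of the combine settings MASM accepts, and DW, DB, and DUP are used here only to give each segment some contents.

```asm
; label SEGMENT align combine
DSEG    SEGMENT WORD PUBLIC     ; data segment; starting address must be even
TOTAL   DW      0               ; one 16-bit word stored in this segment
DSEG    ENDS                    ; label matches the SEGMENT line above

SSEG    SEGMENT PARA STACK      ; stack segment; starting address divisible by 16
        DB      256 DUP (?)     ; reserve 256 uninitialized bytes
SSEG    ENDS                    ; label matches its SEGMENT line

        END                     ; end of the source file
```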
.MODEL, .STACK, .DATA, and .CODE Directives
Instead of going to the trouble of defining the segments with the SEGMENT directive, a programmer may select a memory model. By defining the memory model for the program, a basic set of segment definitions is assumed. The directive .MODEL can do this. The figure below presents the format of the .MODEL directive.
The table below presents the different types of memory models that can be used with the directive. The memory models LARGE and HUGE are the same except that HUGE may contain single variables that use more than 64K of memory.
There are three more directives that can be used to simplify the definition of the segments. They are .STACK, .DATA, and .CODE. When the assembler encounters one of these directives, it assumes that it is the beginning of a new segment, the type being defined by the specific directive used (stack, data, or code). It includes everything that follows the directive in the same segment until a different segment directive is encountered.
The .STACK directive takes an integer as its operand, allowing the programmer to define the size of the segment reserved for the stack. The .CODE directive takes a label as its operand indicating the segment’s name.
The next directive, PROC, is used to define the beginning of a block of code within a code segment. It is paired with the directive ENDP which defines the end of the block. The code defined between PROC and ENDP should be treated like a procedure or a function of a high-level language. This means that jumping from one block of code to another is done by calling it like a procedure.
Another directive, END, is used to tell the assembler when it has reached the end of all of the code. Unlike the directive pairs SEGMENT and ENDS and PROC and ENDP, there is no corresponding directive to indicate the beginning of the code.
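The directives above can be combined into a program skeleton. The following is a minimal sketch in MASM-style syntax, not a definitive template. It assumes the SMALL memory model and a program run under MS-DOS, so the @DATA symbol and the INT 21H exit call are assumptions that go beyond the text. The names MAIN, SETONE, and RESULT are hypothetical.

```asm
        .MODEL  SMALL           ; memory model: a basic set of segments is assumed
        .STACK  100H            ; stack segment of 100H (256) bytes
        .DATA                   ; start of the data segment
RESULT  DW      0               ; a word variable (hypothetical name)

        .CODE                   ; start of the code segment
MAIN    PROC                    ; beginning of the block named MAIN
        MOV     AX, @DATA       ; assumption: DOS program, so DS is loaded by hand
        MOV     DS, AX
        CALL    SETONE          ; move to another block by calling it
        MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP                    ; end of MAIN

SETONE  PROC                    ; a second block, used like a function
        MOV     RESULT, 1       ; store 1 in the word variable
        RET                     ; return to the caller
SETONE  ENDP

        END     MAIN            ; end of all code; MAIN is the entry point
```

Note that each PROC label is repeated on its ENDP line, while END appears once, at the very end of the file.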
Data Definition Directives
The previous directives are used to tell the assembler how to organize the code and data. The next class of directives is used to define entities that the assembler will convert directly to components to be used by the code.
They do not represent code; rather they are used to define data or constants on which the application will operate. Many of these directives use integers as their operands. As an aid to programmers, the assembler allows these integers to be defined in binary, decimal, or hexadecimal.
Without some indication as to their base, however, some values could be interpreted as hex, decimal, or binary (e.g., 100). Hexadecimal values have an ‘H’ appended to the end of the number, binary values have a ‘B’ appended to the end, and decimal values are left without any suffix.
Note also that the first character of any number must be a numeric digit. Any value beginning with a letter will be interpreted by the assembler as a label instead of a number. This means that when using hexadecimal values, a leading zero must be placed in front of any number that begins with A, B, C, D, E, or F.
The first of the defining directives is actually a set of directives used for reserving and initializing memory. These directives are used to reserve memory space to hold elements of data that will be used by the application. These memory spaces may either be initialized or left undefined, but their size will always be specified.
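A short sketch may help here. In MASM-style syntax, the directives that reserve memory include DB (define byte), DW (define word), and DD (define doubleword); a question mark leaves a location undefined and DUP repeats a value. All variable names below are hypothetical, and the SMALL memory model is an assumption.

```asm
        .MODEL  SMALL
        .DATA
DECVAL  DB      100             ; one byte holding decimal 100
HEXVAL  DW      100H            ; one word holding hexadecimal 100 (256 decimal)
BINVAL  DB      100B            ; one byte holding binary 100 (4 decimal)
TOPVAL  DB      0FFH            ; leading zero required; FFH would be read as a label
LONGVAL DD      12345678H       ; one doubleword (four bytes)
SCORES  DW      1, 2, 3         ; three consecutive initialized words
BUFFER  DB      16 DUP (?)      ; sixteen bytes reserved but left undefined
        END
```

In every case the directive fixes the size of each element, whether or not an initial value is supplied.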
The next directive, EQU, is in the same class as the define directives. It is like the #define directive used in C, and like #define, it is used to define strings or constants to be used during assembly. The format of the EQU directive is shown in the figure below.
Both the label and the expression are required fields with the EQU directive. The label, which also is to follow the formatting guidelines of the label field, is made equivalent to the expression. This means that whenever the assembler comes across the label later in the file, the expression is substituted for it.
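The substitution can be sketched as follows, again assuming MASM-style syntax and the SMALL memory model. The names COUNT, LPTDATA, TWICE, BUFFER, and SETUP are invented for the example.

```asm
COUNT   EQU     10              ; COUNT stands for the constant 10
LPTDATA EQU     0378H           ; a readable name for an otherwise obscure address
TWICE   EQU     COUNT * 2       ; an expression built from another constant

        .MODEL  SMALL
        .DATA
BUFFER  DB      COUNT DUP (0)   ; assembled as DB 10 DUP (0)
        .CODE
SETUP   PROC
        MOV     CX, COUNT       ; assembled as MOV CX, 10
        MOV     DX, LPTDATA     ; assembled as MOV DX, 0378H
        MOV     BX, TWICE       ; assembled as MOV BX, 20
        RET
SETUP   ENDP
        END
```

An EQU label occupies no memory in the finished program; it exists only while the file is being assembled.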
Assembly language instructions can be categorized into four groups: data transfer, data manipulation, program control, and special operations. The next four sections introduce some of the Intel 80x86 instructions by describing their function.
Data Transfer
The principal Intel 80x86 opcode used to move data is MOV. As shown in the figure below, the MOV opcode takes two operands, dest and src. MOV copies the value specified by the src operand to the memory location or register specified by dest.
Both dest and src may refer to registers or memory locations, although a single MOV cannot use memory locations for both. The operand src may also specify a constant. These operands may be of either byte or word length, but regardless of what they are specifying, the sizes of src and dest must match for a single MOV opcode. The assembler will generate an error if they do not.
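The operand combinations can be illustrated with a small program. This is a sketch in MASM-style syntax that assumes the SMALL memory model and an MS-DOS environment (the @DATA symbol and the INT 21H exit are assumptions); FIRST, SAVED, and ABYTE are hypothetical variable names. The two commented-out lines show forms the assembler rejects.

```asm
        .MODEL  SMALL
        .STACK  100H
        .DATA
FIRST   DW      1234H           ; initialized word
SAVED   DW      ?               ; uninitialized word
ABYTE   DB      ?               ; uninitialized byte
        .CODE
MAIN    PROC
        MOV     AX, @DATA       ; assumption: DOS program, so DS is loaded by hand
        MOV     DS, AX
        MOV     AX, FIRST       ; memory to register, word sized
        MOV     BX, AX          ; register to register
        MOV     SAVED, BX       ; register to memory
        MOV     CL, 5           ; constant to 8-bit register
        MOV     ABYTE, CL       ; byte-sized src to byte-sized dest
;       MOV     AL, BX          ; rejected: 8-bit dest with 16-bit src
;       MOV     SAVED, FIRST    ; rejected: both operands are memory locations
        MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP
        END     MAIN
```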
Data Manipulation
Intel designed the 80x86 family of processors with plenty of instructions to manipulate data. Most of these instructions have two operands, dest and src, and just like the MOV instruction, they read from src and store in dest. The difference is that the src and dest values are combined in some way before the result is stored in dest. Another difference is that the data manipulation opcodes typically affect the flags.
Take for example the ADD opcode shown in Figure 17-11. It reads the data identified by src, adds it to the data identified by dest, then replaces the original contents of dest with the result.
The ADD opcode modifies the processor’s flags including the carry flag (CF), the overflow flag (OF), the sign flag (SF), and the zero flag (ZF). This means that any of the Intel 80x86 conditional jumps can be used after an ADD opcode for program flow control.
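The effect of ADD on the flags can be traced with 8-bit values. The sketch below uses MASM-style syntax and assumes the SMALL memory model and an MS-DOS exit call; the labels TOOBIG and FINISH are hypothetical. JC (jump if carry) is one of the standard 80x86 conditional jumps.

```asm
        .MODEL  SMALL
        .STACK  100H
        .CODE
MAIN    PROC
        MOV     AL, 100
        ADD     AL, 100         ; AL = 200 (0C8H): CF = 0, ZF = 0, SF = 1, OF = 1
                                ; (as signed bytes, 100 + 100 does not fit)
        MOV     AL, 200
        ADD     AL, 100         ; 300 needs 9 bits: AL = 44 (2CH), CF = 1, ZF = 0
        JC      TOOBIG          ; jump taken because CF = 1
        MOV     BL, 0           ; skipped in this run
        JMP     FINISH
TOOBIG: MOV     BL, 1           ; BL = 1 records that the unsigned sum overflowed
FINISH: MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP
        END     MAIN
```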
Many of the other data manipulation opcodes operate the same way. These include logic operations such as AND, OR, and XOR and mathematical operations such as SUB (subtraction) and ADC (add with carry). MUL (multiplication) and DIV (division) are different in that they each use a single operand, but since two pieces of data are needed to perform these operations, the AX or AL registers are implied.
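A few of these opcodes are sketched below with the register contents each one produces. MASM-style syntax, the SMALL memory model, and the MS-DOS exit call are assumptions. The 8-bit forms of MUL and DIV are shown, where AL and AX are the implied operands.

```asm
        .MODEL  SMALL
        .STACK  100H
        .CODE
MAIN    PROC
        MOV     AL, 0F0H
        AND     AL, 3CH         ; AL = 30H (bits set in both operands)
        OR      AL, 03H         ; AL = 33H
        XOR     AL, 0FFH        ; AL = 0CCH (every bit inverted)

        MOV     AX, 50
        SUB     AX, 8           ; AX = 42

        MOV     AX, 0FFFFH      ; low word of a 32-bit value held in DX:AX
        MOV     DX, 0           ; high word
        ADD     AX, 1           ; AX = 0000H, CF = 1
        ADC     DX, 0           ; DX = 0 + 0 + CF = 1, so DX:AX = 00010000H

        MOV     AL, 12
        MOV     BL, 10
        MUL     BL              ; AX = AL * BL = 120
        MOV     AX, 100
        MOV     BL, 7
        DIV     BL              ; AX / BL: AL = 14 (quotient), AH = 2 (remainder)

        MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP
        END     MAIN
```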
Program Control
As with the generic processor described in Chapter 15, the 80x86 uses both unconditional and conditional jumps to alter the sequence of instruction execution. When the processor encounters an unconditional jump or “jump always” instruction (JMP), it loads the instruction pointer with the address that serves as the JMP’s operand. As a result, the next instruction to be executed is the one at the newly loaded address. The figure below presents an example of the JMP instruction.
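A minimal sketch of an unconditional jump follows, in MASM-style syntax with the SMALL memory model and an MS-DOS exit call assumed. The label SKIP is hypothetical.

```asm
        .MODEL  SMALL
        .STACK  100H
        .CODE
MAIN    PROC
        MOV     AX, 1
        JMP     SKIP            ; instruction pointer is loaded with the address of SKIP
        MOV     AX, 2           ; never executed
SKIP:   ADD     AX, 10          ; execution resumes here, so AX = 11
        MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP
        END     MAIN
```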
The 80x86 has a full set of conditional jumps to provide program control based on the results of execution. Each conditional jump examines the flags before determining whether to load the jump opcode’s operand into the instruction pointer or simply move to the next sequential instruction. The table below presents a summary of most of the 80x86 conditional jumps along with the flag settings that force a jump. (Note that “!=” means “is not equal to”.)
Typically, these conditional jumps come immediately after a compare. In the Intel 80x86 instruction set, the compare function is CMP. It uses two operands, setting the flags by subtracting the second operand from the first. Note that the result is not stored.
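The pairing of CMP with a conditional jump is how loops are built. The sketch below adds the integers 1 through 5, using MASM-style syntax with the SMALL memory model and an MS-DOS exit call assumed. JBE (jump if below or equal, an unsigned comparison) is one of the standard 80x86 conditional jumps, and the label AGAIN is hypothetical.

```asm
        .MODEL  SMALL
        .STACK  100H
        .CODE
MAIN    PROC
        MOV     AX, 0           ; running total
        MOV     CX, 1           ; counter
AGAIN:  ADD     AX, CX          ; total = total + counter
        ADD     CX, 1           ; counter = counter + 1
        CMP     CX, 5           ; flags set from CX - 5; CX itself is unchanged
        JBE     AGAIN           ; repeat while CX <= 5
                                ; on exit, AX = 1 + 2 + 3 + 4 + 5 = 15
        MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP
        END     MAIN
```

Because CMP discards the result of its subtraction, the counter in CX can be tested on every pass without being disturbed.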
Special Operations
The special operations category is for opcodes that do not fit into any of the first three categories, but are necessary to fully utilize the processor’s resources. They provide functionality ranging from controlling the processor flags to supporting the 80x86 interrupt system.
To begin with, there are seven instructions that allow the user to manually alter the flags. These are presented in the table below.
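In the standard 80x86 instruction set, the seven flag instructions are CLC, STC, CMC, CLD, STD, CLI, and STI. The sketch below exercises each one and then shows a set carry flag feeding into ADC. MASM-style syntax, the SMALL memory model, and the MS-DOS exit call are assumptions.

```asm
        .MODEL  SMALL
        .STACK  100H
        .CODE
MAIN    PROC
        STC                     ; set carry flag: CF = 1
        CMC                     ; complement carry flag: CF = 0
        CLC                     ; clear carry flag: CF = 0
        STD                     ; set direction flag: DF = 1
        CLD                     ; clear direction flag: DF = 0
        CLI                     ; clear interrupt flag: maskable interrupts disabled
        STI                     ; set interrupt flag: maskable interrupts enabled

        MOV     AX, 5
        STC                     ; CF = 1
        ADC     AX, 0           ; AX = 5 + 0 + CF = 6

        MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP
        END     MAIN
```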
The next two special instructions are PUSH and POP. The Intel 80x86 processor’s stack is referred to as a post-increment/pre-decrement stack. This means that the address in the stack pointer is decremented before data is stored to the stack and incremented after data is retrieved from the stack.
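The stack behavior can be traced with two word-sized registers. On the 80x86, the instruction that retrieves data from the stack is POP. The sketch below swaps AX and BX through the stack, using MASM-style syntax with the SMALL memory model and an MS-DOS exit call assumed.

```asm
        .MODEL  SMALL
        .STACK  100H
        .CODE
MAIN    PROC
        MOV     AX, 1111H
        MOV     BX, 2222H
        PUSH    AX              ; SP decremented by 2, then 1111H stored at SS:SP
        PUSH    BX              ; SP decremented by 2 again, then 2222H stored
        POP     AX              ; AX = 2222H (last value stored), then SP incremented by 2
        POP     BX              ; BX = 1111H; SP is back where it started
        MOV     AX, 4C00H       ; assumption: return to MS-DOS (INT 21H, function 4CH)
        INT     21H
MAIN    ENDP
        END     MAIN
```

Values come off the stack in the reverse of the order in which they were stored, which is why the two registers end up exchanged.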