What I Dislike About GAS

vid, 2007-09-29

Revision: 1.0

I often get into arguments with various Linux guys about AT&T versus Intel syntax. There are many things I dislike about AT&T syntax, so I decided to write them all down in this article.

This article is specifically about the most widely used AT&T-syntax assembler, the GNU Assembler (GAS). GAS implements "AT&T syntax" and adds its own extensions as technology evolves. I will discuss the GAS-specific extensions in this article too.

Testing for this article was done with MinGW GAS version 2.17.50 20060824.

Order of arguments

The order of instruction arguments is the first point both sides bring up in a discussion. GAS uses the "source, dest" order of arguments, while Intel uses "dest, source".

GAS proponents argue that "source, dest" is more readable and more logical.

On "Readable" Argument

The first thing you will always hear is that movl %eax, %ebx in GAS reads nicely as "move register eax to register ebx". The Intel-syntax equivalent, mov ebx, eax, is trickier to read. Okay, I agree that in this one case, when translated to human language, GAS's mov instruction reads better. But there are much more important problems with reading assembly than mov.

A very basic counterexample to mov is instructions like xor, cmp, and, test, etc. These read better in Intel syntax:

 xor eax, 10       ;bitwise xor register eax with 10
 and [var], 0xFF   ;bitwise and var with 0xFF
 cmp eax, 10       ;compare eax to 10

GAS equivalents are not so readable:

 xor $10, %eax
 and $0xFF, var
 cmp $10, %eax

This alone is enough for me to outweigh mov, add, and the other instructions that read well in GAS. Also, for me readability doesn't mean how easily an instruction translates to human language; it means how easily the purpose of the instruction is understood. That makes a difference.

A more serious problem with GAS syntax is the mnemonics of conditional jumps. Intel designed the processors and chose the instruction mnemonics, and it chose them for Intel syntax, not for AT&T syntax. This causes serious readability problems in GAS code with conditional instructions:

 cmp eax, 10     ;compare eax to 10
 jg greater      ;jump if greater (eax > 10)
 cmovl eax, ebx  ;move if less (eax < 10); cmov takes no immediate operand

In GAS:

 cmp $10, %eax
 jg greater
 cmovl %ebx, %eax
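
To make the mismatch concrete, here is a minimal Python sketch. It is not part of GAS, and the function name jg_taken_att is made up for illustration. It models cmp followed by jg with the operands written in AT&T order, assuming plain signed integers and ignoring the details of the flags register.

```python
def jg_taken_att(first_operand: int, second_operand: int) -> bool:
    """Model `cmp first, second` followed by `jg` in AT&T operand order.

    The CPU compares the destination (written second in AT&T syntax)
    against the source (written first), and jg is taken when the
    destination is the greater value in the signed sense.
    """
    return second_operand > first_operand


# cmp $10, %eax ; jg greater     with eax = 11
print(jg_taken_att(10, 11))
# cmp $10, %eax ; jg greater     with eax = 9
print(jg_taken_att(10, 9))
```

Read aloud in AT&T order, the first call looks like "compare 10 to 11, jump if greater", yet the jump depends on the second operand being the greater one.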

Another problem with GAS, one we will see numerous times in this article, is that it keeps compatibility with ancient standards, which were often bad decisions. Keeping backwards compatibility is of course important for it, otherwise a lot of old code would have to be rewritten. But programmers can decide to use newer and better tools.

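Regarding the order of arguments, this also causes a problem with FPU instructions. Some FPU instructions in GAS do not follow the usual order consistently: for the non-commutative arithmetic instructions (such as fsub and fdiv) with two register operands, where the source is %st and the destination is %st(i), the operands end up effectively reversed. Even many proponents of GAS consider this a bug; see the "AT&T Syntax bugs" section of the GAS documentation as an example.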

On "Logical" Argument

The discussion about which of "dest, source" or "source, dest" is more logical extends beyond the area of assemblers. There are languages using both ways, but "dest, source" languages strongly prevail. Consider C:

 a = 5;
 a += 10;

There are some rare languages using the other way:

 5 = a
 10 += a

But I can't find the latter more logical in any way, especially when there is more than one source operand:

 a = (x + y) / 2
 imul eax, ebx, 10   ;eax = ebx*10

Versus:

 (x + y) / 2 = a
 imull $10, %ebx, %eax   ;eax = ebx*10

The way I see it, the first piece of information you get is "what am I working with", and only then do you get to the more complex "what am I doing with it". Otherwise, you must first "parse" through all the things you are doing to the value, and only then do you find out what you were doing all those things with. It is not as big a problem in assembly, but still…

Addressing Memory

Memory addressing is, in my opinion, one of the worst problems of GAS. Here is how it works:

If an instruction argument has no special marker (such as % for a register or $ for a numeric constant), it is a memory access. So the following:

 movl 10, %eax
 movl foo, %eax

This corresponds to the following Intel syntax:

 mov eax, [10]
 mov eax, [foo]

To use a numeric constant or the address of a label, there is the $ operator:

 movl $10, %eax
 movl $foo, %eax

In Intel syntax:

 mov eax, 10
 mov eax, offset foo

Note that in NASM-style syntax, the last instruction (getting the offset of foo) is

 mov eax, foo

This is my preferred style; it makes the syntax the clearest and least ambiguous.
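
The marker rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name classify_att_operand is hypothetical, and the sketch ignores forms such as indirect jump targets written with a leading asterisk.

```python
def classify_att_operand(operand: str) -> str:
    """Classify an AT&T-syntax operand by its leading marker.

    Simplified rule from the text: % marks a register, $ marks an
    immediate, and anything without a marker is a memory access.
    An operand like %fs:0 starts with a segment register but is
    still a memory reference, so a colon rules out "register".
    """
    text = operand.strip()
    if text.startswith("$"):
        return "immediate"
    if text.startswith("%") and ":" not in text:
        return "register"
    return "memory"


for example in ["%eax", "$10", "$foo", "10", "foo", "%fs:0"]:
    print(example, "->", classify_att_operand(example))
```

The point to notice is the last branch: a bare 10 or a bare foo falls through to "memory", which is exactly the case that surprises people coming from Intel syntax.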

x86-16 Addressing

One more minor GAS problem with 16-bit addressing was that I wasn't able to force 32-bit addressing with an immediate address in 16-bit code.

Another minor thing is that you have to keep the order (%bx, %si) or (%bx, %di); you cannot write the registers in the other order. This makes sense for AT&T syntax, but may be annoying, especially if you use the registers in the opposite roles (e.g., SI is the base and BX is the index).

x86-32 Addressing

In Intel syntax, full 32-bit addressing looks like this:

 push segment:[base + scale*index + displacement]

For example:

 push fs:[table + 8*ecx + ebx]
 push gs:[8*eax + 4]

Note. Some Intel-syntax-derived assemblers use this form:

 push [segment : base + scale*index + displacement]
 push [fs:0]

Note. Some assemblers can automatically create "base + scale*index" from a single "value*register", as in the following. I am not a fan of this feature, but some people do like it.

 lea eax, [9*eax]
 lea eax, [eax + 8*eax]

In AT&T syntax, full x86-32 addressing is written as:

 segment:displacement(base, index, scale)

For example:

 pushl %fs:table(%ebx, %ecx, 8)
 pushl %gs:4(,%eax,8)
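
As a rough illustration of how the same components map onto the two notations, here is a small Python sketch. The function names intel_operand and att_operand are made up for this example, the term order inside the Intel brackets is just one of several equivalent orders, and no validation of register names or scale values is attempted.

```python
def intel_operand(segment=None, base=None, index=None, scale=1, disp=None):
    """Format a memory operand as segment:[base + scale*index + disp]."""
    terms = []
    if base:
        terms.append(base)
    if index:
        terms.append(f"{scale}*{index}")
    if disp is not None:
        terms.append(str(disp))
    prefix = f"{segment}:" if segment else ""
    return f"{prefix}[{' + '.join(terms)}]"


def att_operand(segment=None, base=None, index=None, scale=1, disp=None):
    """Format a memory operand as %segment:disp(%base,%index,scale)."""
    prefix = f"%{segment}:" if segment else ""
    disp_text = "" if disp is None else str(disp)
    if base is None and index is None:
        return f"{prefix}{disp_text}"          # displacement only
    base_text = f"%{base}" if base else ""
    if index:
        inner = f"{base_text},%{index},{scale}"
    else:
        inner = base_text
    return f"{prefix}{disp_text}({inner})"


print(intel_operand("fs", "ebx", "ecx", 8, "table"))
print(att_operand("fs", "ebx", "ecx", 8, "table"))
print(intel_operand("gs", None, "eax", 8, 4))
print(att_operand("gs", None, "eax", 8, 4))
```

In the AT&T form every component has a fixed position, which is why a missing base still leaves its comma behind.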

A more complicated example follows. table is an array of ITEM structures. sizeof.ITEM holds the size of the ITEM structure, and ITEM.foo holds the offset of member foo within the structure. The following code gets member foo of the ITEM at index EBX in array table (i.e., ax = table[ebx].foo in C).

 sizeof.ITEM = 8
 ITEM.foo = 2
 mov ax, [table + sizeof.ITEM*ebx + ITEM.foo]

In GAS:

 mov table+ITEM.foo(,%ebx,sizeof.ITEM), %ax

I think Intel-style syntax clearly wins in readability here.
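
Whichever notation is used, the processor computes the same effective address. A minimal Python sketch of that arithmetic follows; the address chosen for table (0x1000) and the index value 3 are assumptions made purely for illustration, and the function name effective_address is hypothetical.

```python
def effective_address(disp=0, base=0, index=0, scale=1):
    """Compute disp + base + index*scale, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (disp + base + index * scale) & 0xFFFFFFFF


SIZEOF_ITEM = 8    # sizeof.ITEM from the example
ITEM_FOO = 2       # ITEM.foo from the example
TABLE = 0x1000     # hypothetical address of table
EBX = 3            # hypothetical index

# Intel: [table + sizeof.ITEM*ebx + ITEM.foo]
# AT&T:  table+ITEM.foo(,%ebx,sizeof.ITEM)
address = effective_address(disp=TABLE + ITEM_FOO, index=EBX, scale=SIZEOF_ITEM)
print(hex(address))
```

Both spellings of the operand feed the same three numbers into this formula; only the order in which you read them differs.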

The same problem as in 16-bit mode also applies here: I wasn't able to force 16-bit addressing in 32-bit code.

x86-64 Addressing

Adding support for the AMD64 architecture caused great trouble in all assemblers. AMD64 provides RIP-relative addressing, which makes position-independent code very easy to write. This addressing is what you want 99.9% of the time when writing 64-bit code. Absolute addresses are still possible, but they are pretty much limited to a 4 GB address space (what a sign-extended 32-bit displacement can reach), and they require relocations. The only exceptions are the mov instructions with opcodes A0h-A3h, which accept a full 64-bit absolute address.

Some assemblers (YASM, GAS) still use absolute addresses by default in 64-bit mode and require explicit notation for RIP-relative addressing. This makes writing position-independent code a pain.

 mov variable(%rip), %eax

Other assemblers (FASM, MASM) use RIP-relative addressing by default. MASM doesn't provide any way to use absolute addressing, while FASM does offer a way to use it in the rare cases when it is needed.
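
For readers unfamiliar with RIP-relative addressing, the following Python sketch shows the arithmetic behind it. The addresses are invented for illustration and the function name rip_relative_disp is hypothetical; the only hardware facts assumed are that the displacement is a signed 32-bit value and that it is measured from the end of the instruction.

```python
def rip_relative_disp(instr_addr, instr_len, target):
    """Return the 32-bit displacement stored in a RIP-relative operand.

    RIP points at the next instruction while the current one executes,
    so the displacement is measured from instr_addr + instr_len.
    """
    disp = target - (instr_addr + instr_len)
    if not -2**31 <= disp < 2**31:
        raise ValueError("target is out of range for a 32-bit displacement")
    return disp


# mov variable(%rip), %eax is 6 bytes long: opcode, ModRM and a
# 4-byte displacement. The addresses below are hypothetical.
disp = rip_relative_disp(instr_addr=0x401000, instr_len=6, target=0x404010)
print(hex(disp))

# Moving code and data together by the same amount leaves the
# displacement unchanged, which is why no relocation is needed.
moved = rip_relative_disp(0x401000 + 0x7000000, 6, 0x404010 + 0x7000000)
print(moved == disp)
```

An absolute address, by contrast, would have to be patched whenever the image is loaded somewhere else.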

As a minor problem, GAS doesn't provide some exotic 64-bit addressing modes. These are not really needed, but would be nice for the sake of completeness of the assembler.

Lazy Syntax

One more thing that I dislike about GAS is its lazy (easy to parse) syntax. I understand that it was useful a few decades ago, when parsing a language had to be as simple as possible to save expensive memory. Nowadays this isn't an issue, especially for an assembler. Unfortunately, GAS inherited this syntax.

Having % before every register name is reasonable, to distinguish it from a label with the same name as a register. This is a good approach for GAS as a back end for the gcc compilers. But for hand-written assembly code, having to type the % character before every register is a little annoying. Registers are used much more often than variables in assembly. Some other assemblers solve this issue in a different way: MASM (and all MS tools) decorate every name with an underscore, and FASM allows assigning a symbol different "global" and private names in the object file. As far as I know, NASM doesn't solve this issue. Using special notation for memory addressing makes more sense in handwritten assembly code than using special notation for registers. And for compilers, it doesn't matter.

Using $ is similar to the offset operator in MASM/TASM-style assemblers. However, GAS has less ambiguity here than MASM, because even constant values are treated as memory addresses unless prefixed with $. Still, NASM/FASM syntax is in my opinion easier to understand and easier to write.

One thing that is no longer an issue is specifying the operand size in mnemonics, like movl, cmpw, etc. GAS can now deduce the operand size from the registers used, and if there is neither a register nor an explicit size suffix in the mnemonic, it throws an error.

A feature that I miss in GAS is the ability to assign a size to a label. In many other assemblers you can do this, and if that label is used as an address and no explicit size is given for the instruction, the assembler uses the size associated with the label:

 var1 dd 10
 label var2 dword
 mov [var1], 10
 cmp [var2], 0

Object format limitations

GAS can only output object files. It is unable to create a pure binary file directly. It is still possible to produce pure binary output using linker scripts, but doing it this way has several limitations. These limitations are imposed by the object format in which the code produced by GAS is stored.

All addresses in an object file must be relocatable, even though in the resulting binary they will be constant. GAS doesn't know about this and has to treat all addresses as relocatable. That makes it impossible to do things like:

 lea eax, [variable and (not 0xFFF)]   ;get address of page where a variable is

because there is no relocation type for and (not 0xFFF) in ELF objects.

This is a minor limitation, but it is still a little annoying, and the problem is solved in other assemblers with pure binary output (NASM, YASM, FASM).
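
The reason a mask cannot be deferred to the linker can be shown numerically. The Python sketch below assumes a simplified model in which a relocation can only add an amount to a value stored in the object file, which is roughly how the common x86 ELF relocation types work. The numbers and the names page_of and relocate are invented for illustration.

```python
PAGE_MASK = ~0xFFF


def page_of(address):
    """Clear the low 12 bits, i.e. round down to a 4 KB page boundary."""
    return address & PAGE_MASK


def relocate(stored_value, delta):
    """Simplified additive relocation: final value = stored value + delta."""
    return stored_value + delta


symbol_offset = 0x0FF8   # hypothetical offset known to the assembler
delta = 0x10             # hypothetical shift applied later by the linker

# What the programmer wants: mask applied to the final address.
wanted = page_of(symbol_offset + delta)

# What an additive relocation can deliver if the assembler masks early.
masked_too_early = relocate(page_of(symbol_offset), delta)

print(hex(wanted), hex(masked_too_early))
```

The two results differ because masking and adding do not commute. An assembler that produces the final binary itself already knows the final address and can simply evaluate the expression.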

External Symbols

A very annoying thing in GAS is that it treats all undefined symbols as external dependencies. That means that if you mistype a name, your program assembles fine, and you find the error only later during linking, having to go back to the assembly source.

If you are unlucky, a symbol with that name may happen to be defined in another module; linking will then succeed too, and you have a nasty bug to look for. If GAS didn't behave this way, you could catch such a bug immediately.

As for the other assemblers I know, only FASM and MASM solve this issue properly. See my External Dependencies in Assemblers article for more details.
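
The difference between the two policies can be sketched in Python. This is a toy model, not how any real assembler is implemented; the function name undefined_symbol_errors and the symbol names are made up.

```python
def undefined_symbol_errors(defined, used, declared_extern, implicit_extern):
    """Return the errors an assembler would report for undefined symbols.

    implicit_extern=True models the GAS behaviour described above:
    every unknown name silently becomes an external reference.
    implicit_extern=False models assemblers that require externals
    to be declared explicitly.
    """
    errors = []
    for name in sorted(used):
        if name in defined or name in declared_extern:
            continue
        if implicit_extern:
            continue  # deferred to the linker, nothing reported here
        errors.append(f"undefined symbol: {name}")
    return errors


defined = {"counter"}
used = {"counter", "countre", "printf"}   # "countre" is a typo
declared_extern = {"printf"}

print(undefined_symbol_errors(defined, used, declared_extern, implicit_extern=True))
print(undefined_symbol_errors(defined, used, declared_extern, implicit_extern=False))
```

With explicit declarations the typo is reported at assembly time, in the file where it was made.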

Conclusion

In my view, GAS is too low-level an assembler. It is fine as a back end for compilers, but nowadays it doesn't matter to a compiler at all how low-level its back end is. For a human, an assembler can behave more nicely than GAS does, without losing any control or simplicity.

Another problem of GAS is that it still keeps very old standards, not all of which are the best choice. There are newer and better assemblers to use, even though they are less standard.

Note. This article is only a first version, and I am no GAS expert. I believe GAS proponents will reply, explain how to do things I wasn't able to do with GAS, and correct possible mistakes.

Links

GAS is part of the binutils package.

http://www.x86-64.org/, resources and discussion on 64-bit programming with GAS.

You can contact the author using e-mail vid@x86asm.net.

Revisions

2007-09-29, revision 1.0, vid: First public version

(date format corresponds to ISO 8601)