Wednesday, January 3, 2018

Assembling the Tokens

Assemblers are easier to write than a compiler because the structure of the code is a lot simpler. Still, there are enough similarities between the two that some of the common techniques used to write a compiler are also useful for writing an assembler. A compiler first tokenizes the source code (lexical analysis), parses the tokens into something meaningful (syntax and semantic analysis), and generates some intermediate code from which it finally generates the machine code. With the assembler, I am taking a similar approach: tokenizing the source code, parsing it into intermediate machine language, then generating the final machine language.

Java does have some tokenizer classes that will do part of what I want. While it is possible to use Java classes in Kotlin, I am not sure how well those will work if I try to compile my Kotlin code to JavaScript, which is something I do want to do eventually. For this reason, and with my NIH (not-invented-here) syndrome kicking in, I opted to write my own tokenizer for the assembler. I created a simple enumeration to hold the different types of tokens that my assembler will use, though I suspect that I will be adding some new ones later.

enum class AssemblerTokenTypes {
    DIRECTIVE, ERROR, IMMEDIATE,
    INDEX_X, INDEX_Y,
    INDIRECT_START, INDIRECT_END,
    LABEL_DECLARATION, LABEL_LINK,
    NUMBER, OPCODE, WHITESPACE
}

A DIRECTIVE is a command for the assembler. There are a number of different ways that these can be handled, but I am going to start my directives with a dot. I have not worked out all the directives that I plan on supporting, but at a minimum I will need .ORG, .BANK, .BYTE, and .DATA, with .INCLUDE and .CONST being nice to have. More on these when I actually get to the directives portion of my assembler.

ERROR indicates a tokenization error, which stops the file from being assembled. Invalid characters would be the likely culprit.

Some of the 6502 instructions have an immediate mode that lets the programmer specify a constant value, which is stored in the byte following the opcode. This is indicated by prefacing the constant with the hash (#) symbol. The tokenizer simply emits an IMMEDIATE token to indicate that an immediate mode value comes next.

The 6502 supports offsets of an address using “,X” or “,Y”, so the INDEX_X and INDEX_Y tokens indicate that such an index is being used. These indexes are used for zero page indexing, indirect indexing, and your normal indexing, which is officially called absolute indexing. Which indexed addressing mode is actually used will be determined by the parser, which will be covered later.

Indirect addressing modes prefix the address with an open bracket and postfix the address with a closed bracket. To indicate this, the INDIRECT_START and INDIRECT_END tokens are used.
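
To make the operand tokens concrete, here is a minimal sketch of how the punctuation described above might be scanned. It is not the tokenizer from this project: the names OperandToken and scanOperand are made up for illustration, the enum is a cut-down copy of the token types so the snippet stands alone, and it assumes the conventional 6502 syntax in which the indirect brackets are parentheses.

```kotlin
// Hypothetical, cut-down copy of the token types so this sketch stands alone.
enum class OperandToken {
    IMMEDIATE, INDEX_X, INDEX_Y, INDIRECT_START, INDIRECT_END, NUMBER, ERROR
}

// Scans a single operand such as "($20),Y" and returns its token types in order.
// Assumption: the indirect "brackets" are parentheses, as in common 6502 syntax.
fun scanOperand(operand: String): List<OperandToken> {
    val tokens = mutableListOf<OperandToken>()
    var i = 0
    while (i < operand.length) {
        val c = operand[i]
        when {
            c.isWhitespace() -> i++ // blanks inside the operand are skipped here
            c == '#' -> { tokens += OperandToken.IMMEDIATE; i++ }
            c == '(' -> { tokens += OperandToken.INDIRECT_START; i++ }
            c == ')' -> { tokens += OperandToken.INDIRECT_END; i++ }
            c == ',' -> {
                // A comma only makes sense here when followed by X or Y.
                tokens += when (operand.getOrNull(i + 1)) {
                    'X', 'x' -> OperandToken.INDEX_X
                    'Y', 'y' -> OperandToken.INDEX_Y
                    else -> OperandToken.ERROR
                }
                i += 2
            }
            c == '$' || c == '%' || c.isDigit() -> {
                // Consume the prefix (or first digit) and every following
                // letter or digit as one NUMBER; the digits are not validated.
                i++
                while (i < operand.length && operand[i].isLetterOrDigit()) i++
                tokens += OperandToken.NUMBER
            }
            else -> { tokens += OperandToken.ERROR; i++ }
        }
    }
    return tokens
}
```

Under those assumptions, the operand ($20),Y gives INDIRECT_START, NUMBER, INDIRECT_END, INDEX_Y, while ($20,X) gives INDIRECT_START, NUMBER, INDEX_X, INDIRECT_END. The order is all that gets recorded at this stage; deciding which addressing mode that order means is left to the parser.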

It is certainly possible to write an assembler that does not track the addresses of locations for you and instead requires you to know every address that you are using, but one of the reasons that assemblers were invented was to remove this busywork. This means that we need some type of support for labels in our assembler. Most 6502 assemblers indicate a location within the code with a label followed by a colon at the beginning of the line. This is indicated by the LABEL_DECLARATION token, with LABEL_LINK tokens being used for references to a label within the code.

As assembly language revolves around numbers, we obviously need a NUMBER token. This token needs special processing as I am supporting binary, decimal, and hexadecimal formats for numbers. My Machine Architecture teacher will probably be upset that I am not including support for octal numbers, but I never use that format in code so I didn’t see the point in adding it. I am using the pretty standard 6502 convention of prefixing hex numbers with a $ symbol and binary numbers with a % symbol. Supporting binary is not vital but is very handy to have, especially for a machine like the 2600 where you are doing a lot of bit manipulation.
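
The prefix convention is easy to show in code. What follows is only a sketch of one way to convert the text of a NUMBER token into a value; parseNumber is a hypothetical helper name, not something from the assembler itself, and it leans on the standard library's toIntOrNull to do the base conversion.

```kotlin
// Hypothetical helper: converts the text of a NUMBER token to its value.
// "$" marks hexadecimal, "%" marks binary, anything else is read as decimal.
// Returns null when the digits are not valid for the chosen base.
// Caveat: toIntOrNull also accepts a leading minus sign, which a stricter
// tokenizer would probably want to reject.
fun parseNumber(text: String): Int? = when {
    text.startsWith('$') -> text.substring(1).toIntOrNull(16)
    text.startsWith('%') -> text.substring(1).toIntOrNull(2)
    else -> text.toIntOrNull(10)
}
```

With this sketch, $FF, %11111111, and 255 all come out as the same value, and something like %102 comes back as null, which is the sort of input that would end up as an ERROR token.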

While I probably should have used the term MNEMONIC instead of OPCODE for the enumeration, I often call the mnemonic an op code even though technically the op code is the actual numeric value that the assembler ultimately converts the mnemonic into. Should I change this in my code? Probably. Will I?
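
The word-like tokens (directives, label declarations, label links, and mnemonics) can be told apart with a few simple checks. This is a rough sketch rather than the real classifier: WordToken, sampleMnemonics, and classifyWord are invented names, the mnemonic list is a tiny sample, and treating mnemonics as case-insensitive is my assumption here.

```kotlin
// Hypothetical subset of the token types, redeclared so the sketch stands alone.
enum class WordToken { DIRECTIVE, LABEL_DECLARATION, LABEL_LINK, OPCODE }

// Tiny illustrative sample; a real table would list every 6502 mnemonic.
val sampleMnemonics = setOf("LDA", "STA", "JMP", "RTS")

// Classifies one word that has already been split out of the source line.
fun classifyWord(word: String): WordToken = when {
    // Directives start with a dot, e.g. ".ORG".
    word.startsWith('.') -> WordToken.DIRECTIVE
    // A trailing colon marks the place where a label is declared.
    word.endsWith(':') -> WordToken.LABEL_DECLARATION
    // Assumption: mnemonics are matched without regard to case.
    sampleMnemonics.any { it.equals(word, ignoreCase = true) } -> WordToken.OPCODE
    // Any other word is treated as a reference to a label.
    else -> WordToken.LABEL_LINK
}
```

So "loop:" would be a LABEL_DECLARATION while a later "JMP loop" would produce an OPCODE followed by a LABEL_LINK. The order of the checks matters in this sketch, since a label that happens to share a name with a mnemonic is still recognized by its colon.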

Finally, WHITESPACE covers spaces, tabs, and comments. In most assemblers comments are designated with a ; so that works fine for me. Most of the time the whitespace tokens will be ignored, so I could arguably not have a token for whitespace at all and simply skip it.
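
Keeping the token costs very little, because a later stage can drop it in one line. The sketch below shows the idea with a made-up Token class (a kind plus the matched text) and a made-up withoutWhitespace helper; neither is taken from the assembler, and the enum is again a cut-down stand-in.

```kotlin
// Hypothetical stand-ins so the sketch is self-contained.
enum class TokenKind { OPCODE, IMMEDIATE, NUMBER, WHITESPACE }

// Assumed token shape: the kind of token plus the source text it matched.
data class Token(val kind: TokenKind, val text: String)

// Returns the tokens a parser would care about, dropping spaces, tabs,
// and comments that were tokenized as WHITESPACE.
fun withoutWhitespace(tokens: List<Token>): List<Token> =
    tokens.filter { it.kind != TokenKind.WHITESPACE }
```

The advantage of filtering late rather than never producing the token is that the full token list still mirrors the source line, which could be handy for error messages or listings.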

Now that we have the tokens out of the way, we need to write the tokenizer.
