Wednesday, January 10, 2018

Tokenizing

I finally got around to putting my source up on GitHub. The code is in a very rough state, as it is more prototype than production, and it is a bit ahead of what I am posting, but anybody who is interested can find it at github.com/BillySpelchan/VM2600. My original plan was to self-host it on 2600Dragons, but that just seemed like too much work. Once I get the emulator running in JavaScript I may post it to 2600Dragons, but that may be a few months away.

As explained last week, tokens are broken down into a number of types, which are stored in the AssemblerTokenTypes enumeration. A token is effectively just a simple data structure, so I take advantage of Kotlin’s data class support, which means the following one-line declaration is all that is needed.

data class AssemblerToken(val type:AssemblerTokenTypes, val contents:String, val num:Int)

The contents property holds the actual string that forms the token, while num holds the numeric value of the token if it has one, or 0 otherwise. In the case of label declarations, contents does not include the trailing colon.
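
As a rough sketch of what the data class gives for free, the snippet below builds two tokens and uses the generated equals, toString, and copy. The enumeration here is only a stand-in: its member names (LABEL_DECLARATION and NUMBER) are assumptions made for this example, not the names from last week's post.

```kotlin
// Stand-in enumeration. The member names are assumptions for this sketch only.
enum class AssemblerTokenTypes { LABEL_DECLARATION, NUMBER }

data class AssemblerToken(val type: AssemblerTokenTypes, val contents: String, val num: Int)

fun main() {
    // A label declaration written as "loop:" keeps only "loop" and has no numeric value.
    val label = AssemblerToken(AssemblerTokenTypes.LABEL_DECLARATION, "loop", 0)
    // A decimal number keeps its text in contents and its value in num.
    val number = AssemblerToken(AssemblerTokenTypes.NUMBER, "42", 42)

    // The generated equals compares every property, which keeps test assertions short.
    println(label == AssemblerToken(AssemblerTokenTypes.LABEL_DECLARATION, "loop", 0))
    // The generated toString lists each property by name.
    println(number)
    // copy builds a new token with selected properties changed.
    println(number.copy(num = 7))
}
```

Because equality is structural, a list of expected tokens can be compared directly against the list a tokenizer returns.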

The tokenize function is quite long but simple. It loops through each character of the string being processed and decides what to do based on the character encountered. Spaces and tabs get converted into whitespace, as does a semicolon along with everything after it. A comma is matched together with the x or y that follows it. Brackets form the indirect start and indirect end tokens. The hash becomes an immediate token. The period becomes a directive token. The percent sign indicates the start of a binary number, so the ones and zeros after it are converted into the appropriate base 2 number. Likewise, the dollar sign signifies a hexadecimal number, so the hex characters after it are used to form a base 16 number. Plain digits get converted into a decimal number. I could have treated a number with a leading 0 as octal, but I don’t really use octal so never bothered.
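
The number handling described above can be sketched roughly as follows. This is not the code from the repository: scanNumber is a hypothetical helper invented for this example, and it assumes the caller has already seen a percent sign, a dollar sign, or a digit at the starting index.

```kotlin
// Hypothetical helper, not taken from the VM2600 source. Reads the number that
// begins at `start` and returns its value together with the index just past it.
fun scanNumber(source: String, start: Int): Pair<Int, Int> {
    // Choose the radix from the prefix character and skip the prefix if present.
    val (radix, firstDigit) = when (source[start]) {
        '%' -> 2 to (start + 1)    // binary
        '$' -> 16 to (start + 1)   // hexadecimal
        else -> 10 to start        // decimal has no prefix to skip
    }
    var value = 0
    var index = firstDigit
    while (index < source.length) {
        // digitToIntOrNull returns null for any character that is not a digit in this radix.
        val digit = source[index].digitToIntOrNull(radix) ?: break
        value = value * radix + digit
        index++
    }
    return value to index
}

fun main() {
    println(scanNumber("%1010", 0))     // binary literal
    println(scanNumber("\$FF", 0))      // hexadecimal literal
    println(scanNumber("LDA #42", 5))   // decimal number inside a longer line
}
```

Returning the next index alongside the value lets the main loop carry on from the first character that is not part of the number.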

A string of letters forms a label, with a trailing colon indicating that the label is a link label (a label declaration). If it is not a link label, the text is checked against the list of mnemonics, and if it matches it becomes a mnemonic token instead.
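
The decision for a run of letters can be sketched like this. Everything named here is hypothetical: WordKind, classifyWord, and the four sample mnemonics are invented for the example, and matching mnemonics without regard to case is an assumption rather than something the real tokenizer is known to do.

```kotlin
// Hypothetical names for this sketch. The real token types live in AssemblerTokenTypes.
enum class WordKind { LINK_LABEL, MNEMONIC, LABEL }

// A small sample only. A full 6502 assembler needs the complete mnemonic list.
val SAMPLE_MNEMONICS = setOf("LDA", "STA", "JMP", "BNE")

fun classifyWord(word: String, followedByColon: Boolean): WordKind = when {
    // The colon check comes first, so a declaration is never mistaken for a mnemonic.
    followedByColon -> WordKind.LINK_LABEL
    // Assumption: mnemonics are matched regardless of case.
    word.uppercase() in SAMPLE_MNEMONICS -> WordKind.MNEMONIC
    // Anything else is a reference to a label defined elsewhere.
    else -> WordKind.LABEL
}

fun main() {
    println(classifyWord("loop", followedByColon = true))
    println(classifyWord("lda", followedByColon = false))
    println(classifyWord("loop", followedByColon = false))
}
```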

When I was planning my tokenizer, I created the following diagram to explain it. It was probably not necessary, as the logic is very straightforward, but it may make the process easier to understand.



The tokens are stored in an array list of assembler tokens. Whitespace tokens are not added to the list unless the function is told to include them by passing false to the ignoreWhiteSpace parameter. Testing is simply a matter of verifying that the strings passed in produce the appropriate list of tokens.
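
To illustrate the whitespace switch and the style of test, here is a deliberately tiny sketch that only separates whitespace from everything else. TokenKind, Token, and tokenizeWords are hypothetical stand-ins rather than the real classes, and defaulting ignoreWhiteSpace to true is an assumption.

```kotlin
// Hypothetical stand-ins for this sketch. The real tokenizer has many more token types.
enum class TokenKind { WHITESPACE, WORD }

data class Token(val kind: TokenKind, val contents: String)

fun isBlank(c: Char): Boolean = c == ' ' || c == '\t'

// Assumption: whitespace is ignored unless the caller passes false.
fun tokenizeWords(source: String, ignoreWhiteSpace: Boolean = true): ArrayList<Token> {
    val tokens = ArrayList<Token>()
    var index = 0
    while (index < source.length) {
        val start = index
        val blankRun = isBlank(source[index])
        // Consume the whole run of characters of the same kind.
        while (index < source.length && isBlank(source[index]) == blankRun) index++
        val kind = if (blankRun) TokenKind.WHITESPACE else TokenKind.WORD
        // Whitespace tokens are only kept when the caller asks for them.
        if (kind != TokenKind.WHITESPACE || !ignoreWhiteSpace) {
            tokens.add(Token(kind, source.substring(start, index)))
        }
    }
    return tokens
}

fun main() {
    // A test is just a comparison between the result and the expected list.
    val expected = listOf(Token(TokenKind.WORD, "LDA"), Token(TokenKind.WORD, "#1"))
    check(tokenizeWords("LDA  #1") == expected)
    check(tokenizeWords("LDA  #1", ignoreWhiteSpace = false).size == 3)
    println("tokenizer checks passed")
}
```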


Next we need to take the tokens and convert them into 6502 assembly, which I will start covering next week.
