Files
Toy/.notes/bytecode-format.txt
Kayne Ruse 002651f95d WIP, adjusting architecture, read more
The 'source' directory compiles, but the repl and tests are almost
untouched so far. There's no guarantee that the code in 'source' is
correct, so I'm branching this for a short time, until I'm confident the
whole project passes the CI again.

I'm adjusting the concepts of routines and bytecode to make them more
consistent, and tweaking the VM so it loads from an instance of
'Toy_Module'.

* 'Toy_ModuleBuilder' (formally 'Toy_Routine')

This is where the AST is compiled, producing a chunk of memory that can
be read by the VM. This will eventually operate on individual
user-defined functions as well.

* 'Toy_ModuleBundle' (formally 'Toy_Bytecode')

This collects one or more otherwise unrelated modules into one chunk of
memory, stored in sequence. It is also preprended with the version data for
Toy's reference implementation:

For each byte in the bytecode:

    0th: TOY_VERSION_MAJOR
    1st: TOY_VERSION_MINOR
    2nd: TOY_VERSION_PATCH
    3rd: (the number of modules in the bundle)
    4th and onwards: TOY_VERSION_BUILD

TOY_VERSION_BUILD has always been a null terminated C-string, but from
here on, it begins at the word-alignment, and continues until the first
word-alignment after the null terminator.

As for the 3rd byte listed, since having more than 256 modules in one
bundle seems unlikely, I'm storing the count here, as it was otherwise
unused. This is a bit janky, but it works for now.

* 'Toy_Module'

This new structure represents a single complete unit of operation, such
as a single source file, or a user-defined function. It is divided into
three main sections, with various sub-sections.

    HEADER (all members are unsigned ints):
        total module size in bytes
        jumps count
        param count
        data count
        subs count
        code addr
        jumps addr (if jumps count > 0)
        param addr (if param count > 0)
        data addr (if data count > 0)
        subs addr (if subs count > 0)
    BODY:
        <raw opcodes, etc.>
    DATA:
        jumps table
            uint array, pointing to addresses in 'data' or 'subs'
        param table
            uint array, pointing to addresses in 'data'
        data
            heterogeneous data, including strings
        subs
            an array of modules, using recursive logic

The reference implementation as a whole uses a lot of recursion, so this
makes sense.

The goal of this rework is so 'Toy_Module' can be added as a member of
'Toy_Value', as a simple and logical way to handle functions. I'll
probably use the union pattern, similarly to Toy_String, so functions
can be written in C and Toy, and used without needing to worry which is
which.
2025-01-29 10:21:33 +11:00

74 lines
2.9 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
The bytecode format
===
There are four components in the bytecode header:
TOY_VERSION_MAJOR
TOY_VERSION_MINOR
TOY_VERSION_PATCH
TOY_VERSION_BUILD
The first three are each one unsigned byte, and the fourth is a null terminated C-string.
* Under no circumstance, should you ever run bytecode whose major version is different
* Under no circumstance, should you ever run bytecode whose minor version is above the interpreters minor version
* You may, at your own risk, attempt to run bytecode whose patch version is different from the interpreters patch version
* You may, at your own risk, attempt to run bytecode whose build version is different from the interpreters build version
An additional note: The contents of the build string may be anything, such as:
* the compilation date and time of the interpreter
* a marker identifying the current fork and/or branch
* identification information, such as the developer's copyright
* a link to Risk Astley's "Never Gonna Give You Up" on YouTube
Please note that in the final bytecode, if the null terminator of TOY_VERSION_BUILD is not 4-byte aligned, extra space will be allocated to round out the header's size to a multiple of 4. The contents of the extra bytes are undefined.
===
At this time, a 'module' consists of a single 'routine', which acts as its global scope.
Additional information may be added later, or multiple 'modules' listed sequentially may be a possibility.
===
# the routine structure, which is potentially recursive
# symbol shorthand : 'module::identifier'
# where 'module' can be omitted if it's local to this module ('identifier' within the symbols is calculated at the module level, it's always unique)
.header:
N total size # size of this routine, including all data and subroutines
N .jumps count # the number of entries in the jump table (should be data count + routine count)
N .param count # the number of parameter fields expected
N .data count # the number of data fields expected
N .routine count # the number of routines present
.code start # absolute address of .code; mandatory
.param start # absolute addess of .param; omitted if not needed
.datatable start # absolute address of .datatable; omitted if not needed
.data start # absolute address of .data; omitted if not needed
.routine start # absolute address of .routine; omitted if not needed
# additional metadata fields can be added later
.code:
# instructions read and 'executed' by the interpreter
READ 0
LOAD 0
ASSERT
.param:
# a list of symbols to be used as keys in the environment
.jumptable:
# a 'symbol -> pointer' jumptable for quickly looking up values in .data and .routines
0 -> {string, 0x00}
1 -> {fn, 0xFF}
.data:
# data that can't really be embedded into .code
<STRING>,"Hello world"
.routines:
# inner routines, each of which conforms to this spec