Documented the literal cache

2023-08-03 17:20:33 +10:00
parent 68a5a4eb56
commit 8bcab9fc34
1 changed files with 108 additions and 22 deletions
@@ -2,11 +2,22 @@

 Here you'll find some of the implementation details.

+# Bytecode
+
+The output of Toy's compiler, and the input of the interpreter, is known as "bytecode". Here, I've attempted to fully document the layout of the canonical bytecode, but since this was written after most of it was implemented, there may be small discrepancies.
+
+There are four main sections of the bytecode:
+
+* Header
+* Literal Cache
+* Function Definitions
+* Program Definition
+
 ## Bytecode Header Format

-The bytecode header format must not change.
+Note: The bytecode header format must not change.

-Every instance of Toy bytecode will be divided up into several sections, by necessity - however the first one to be read is the header. This section is used to define what version of Toy is currently running, as well as to prevent any future version/fork clashes.
+This section is used to define what version of Toy is currently running, as well as to prevent any version/fork clashes.

 The header consists of four values:

@@ -15,7 +26,7 @@ The header consists of four values:
 * TOY_VERSION_PATCH
 * TOY_VERSION_BUILD

-The first three are single unsigned bytes, embedded at the beginning of the bytecode in sequence. These represent the major, minor and patch versions of the language. The fourth value is a null-terminated string of unspecified data, which is *intended* but not required to specify the time that the langauge's compiler was itself compiled. The build string can hold arbitrary data, such as the current maintainer's name, current fork of the language, or other versioning info.
+The first three are single unsigned bytes, embedded at the beginning of the bytecode in sequence. These represent the major, minor and patch versions of the language. The fourth value is a null-terminated C string of unspecified data, which is *intended* but not required to specify the time that the language's compiler was itself compiled. The build string can hold arbitrary data, such as the current maintainer's name, current fork of the language, or other versioning info.

 There are some strict rules when interpreting these values (mimicking, but not conforming to [semver.org](https://semver.org/)):

@@ -28,6 +39,95 @@ All interpreter implementations retain the right to reject any bytecode whose he

 The latest version information can be found in [toy_common.h](https://github.com/Ratstail91/Toy/blob/main/source/toy_common.h)
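
As a rough illustration of the header layout described above, here is a minimal sketch that pulls the three version bytes and the build string out of a buffer. `HeaderSketch` and `readHeaderSketch` are hypothetical names made up for this example, not part of Toy's API, and the sketch does not apply any of the version-matching rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the four header values; not part of Toy's API. */
typedef struct {
    unsigned char major;
    unsigned char minor;
    unsigned char patch;
    const char* build; /* points into the bytecode buffer, nothing is copied */
} HeaderSketch;

/* Reads the header from `bytecode` and returns the number of bytes consumed,
 * or 0 if the buffer is too short or the build string is unterminated. */
size_t readHeaderSketch(const unsigned char* bytecode, size_t length, HeaderSketch* out) {
    assert(bytecode != NULL && out != NULL);

    if (length < 4) {
        return 0; /* three version bytes plus at least a terminator */
    }

    out->major = bytecode[0];
    out->minor = bytecode[1];
    out->patch = bytecode[2];

    /* the build string must be null-terminated inside the buffer */
    const void* terminator = memchr(bytecode + 3, '\0', length - 3);
    if (terminator == NULL) {
        return 0;
    }

    out->build = (const char*)(bytecode + 3);
    return (size_t)((const unsigned char*)terminator - bytecode) + 1;
}
```

The returned byte count is where the next section would begin, which is why the sketch reports it rather than a plain success flag.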

+## Literal Cache
+
+In Toy, a "Literal" is a value of some kind, be it an integer, or a dictionary, or even a variable name. Rather than embedding the same literal (potentially) many times within the bytecode, the "Literal Cache" was devised to act as an immutable, indexable repository of any literals needed. When bytecode is first loaded into the interpreter, the first thing that happens (after the header is parsed) is the reconstruction of the literal cache. The internal function `readInterpreterSections()` is responsible for this step.
+
+The first `unsigned short` to be read from this section is `literalCount`, which defines the number of literals which are to be read. Once all literals have been read out of this section, the opcode `TOY_OP_SECTION_END` is expected to be consumed. Some preprocessor macros can also enable or disable debug printing functionality within the repl.
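
Reading `literalCount` boils down to copying one `unsigned short` out of the buffer and advancing a position. The sketch below shows one way to do that with a bounds check. `CursorSketch`, `readBytesSketch` and `readShortSketch` are hypothetical helpers invented for this example, and the sketch assumes values are stored in the host's byte order, which this document does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over a bytecode buffer; not part of Toy's API. */
typedef struct {
    const unsigned char* data;
    size_t length;
    size_t position; /* always <= length */
} CursorSketch;

/* Copies `size` bytes out of the buffer and advances the cursor.
 * Returns 1 on success, 0 if the read would run past the end. */
int readBytesSketch(CursorSketch* cursor, void* out, size_t size) {
    assert(cursor != NULL && out != NULL);

    if (size > cursor->length - cursor->position) {
        return 0;
    }

    memcpy(out, cursor->data + cursor->position, size);
    cursor->position += size;
    return 1;
}

/* Reads one unsigned short in host byte order (an assumption of this sketch). */
int readShortSketch(CursorSketch* cursor, unsigned short* out) {
    return readBytesSketch(cursor, out, sizeof(unsigned short));
}
```

Using `memcpy` instead of a pointer cast avoids unaligned reads, since nothing guarantees that a value inside the bytecode sits on an aligned address.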
+
+The valid literal types are:
+
+### TOY_LITERAL_NULL
+
+This literal is simply inserted into the literal cache when encountered.
+
+### TOY_LITERAL_BOOLEAN
+
+This literal specifies that the next byte is its value, either true or false.
+
+### TOY_LITERAL_INTEGER
+
+This literal specifies that the next 4 bytes are its value, interpreted as a 32-bit integer.
+
+### TOY_LITERAL_FLOAT
+
+This literal specifies that the next 4 bytes are its value, interpreted as a 32-bit floating-point number.
+
+### TOY_LITERAL_STRING
+
+This literal specifies that the following bytes, up to and including the null terminator, are its value, interpreted as a null-terminated string.
+
+### TOY_LITERAL_ARRAY_INTERMEDIATE
+
+`TOY_LITERAL_ARRAY_INTERMEDIATE` specifies that the literal to be read is a flattened `LiteralArray`. A "flattened" compound literal does not actually store its contents, only references to its contents' positions within the literal cache.
+
+To read this array, you must first read an `unsigned short` which specifies the size, then read that many additional `unsigned short`s, which are indices. Finally, the original `LiteralArray` can be reconstructed using those indices, in order.
+
+As the final step, the newly reconstructed `LiteralArray` is added to the literal cache.
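
The reconstruction step can be sketched as follows. To keep the example self-contained, plain `int`s stand in for literals and a plain array stands in for the literal cache. `unflattenArraySketch` is a hypothetical name, and host byte order is assumed for the `unsigned short` values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rebuilds a flattened array by resolving each stored index against `cache`.
 * Plain ints stand in for literals here; this is not Toy's actual representation.
 * Returns the number of elements written to `out`, or -1 on malformed input. */
int unflattenArraySketch(const unsigned char* bytes, size_t length,
                         const int* cache, size_t cacheCount,
                         int* out, size_t outCapacity) {
    assert(bytes != NULL && cache != NULL && out != NULL);

    /* first, the size of the array */
    unsigned short size = 0;
    if (length < sizeof(size)) {
        return -1;
    }
    memcpy(&size, bytes, sizeof(size));

    if (size > outCapacity || length - sizeof(size) < (size_t)size * sizeof(unsigned short)) {
        return -1;
    }

    /* then, that many indices into the cache, resolved in order */
    for (unsigned short i = 0; i < size; i++) {
        unsigned short index = 0;
        memcpy(&index, bytes + sizeof(size) + i * sizeof(index), sizeof(index));

        if (index >= cacheCount) {
            return -1; /* refers to a literal that has not been read yet */
        }

        out[i] = cache[index];
    }

    return (int)size;
}
```

Because each index must already be present in the cache, the elements of a compound literal have to appear earlier in the section than the compound itself.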
+
+### TOY_LITERAL_DICTIONARY_INTERMEDIATE
+
+`TOY_LITERAL_DICTIONARY_INTERMEDIATE` specifies that the literal to be read is a flattened `LiteralDictionary`. A "flattened" compound literal does not actually store its contents, only references to its contents' positions within the literal cache.
+
+To read this dictionary, you must first read an `unsigned short` which specifies the size (counting both keys and values), then read that many additional `unsigned short`s, which are the indices of the keys and values. Finally, the original `LiteralDictionary` can be reconstructed using those key and value indices.
+
+As the final step, the newly reconstructed `LiteralDictionary` is added to the literal cache.
+
+### TOY_LITERAL_FUNCTION
+
+When a `TOY_LITERAL_FUNCTION` is encountered, the next `unsigned short` to be read (the function index) should be converted into an integer literal, before having its type manually changed to `TOY_LITERAL_FUNCTION_INTERMEDIATE` for storage within the literal cache.
+
+Functions will be processed properly in a later step - so this literal is added to the cache as a placeholder until that point.
+
+### TOY_LITERAL_IDENTIFIER
+
+This literal specifies that the following bytes, up to and including the null terminator, are its value, interpreted as a null-terminated string.
+
+### TOY_LITERAL_TYPE
+
+This literal specifies that the next byte is the type of a literal, and the following byte is a boolean specifying const-ness.
+
+(This literal type may be integrated with `TOY_LITERAL_TYPE_INTERMEDIATE` at some point.)
+
+### TOY_LITERAL_TYPE_INTERMEDIATE
+
+This literal specifies that the next byte is the type of a literal, and the following byte is a boolean specifying const-ness.
+
+Then if the type is `TOY_LITERAL_ARRAY`, the following `unsigned short` is an index within the cache, representing the type of the contents.
+
+Otherwise, if the type is `TOY_LITERAL_DICTIONARY`, the following two `unsigned short`s are indices within the cache, representing the types of the keys and values.
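
The branching above can be sketched like this. The sketch starts at the type byte, after the `TOY_LITERAL_TYPE_INTERMEDIATE` tag has already been consumed. `SKETCH_ARRAY` and `SKETCH_DICTIONARY` are placeholder values standing in for `TOY_LITERAL_ARRAY` and `TOY_LITERAL_DICTIONARY` (the real enum values are not given here), and `TypeSketch` and `readTypeSketch` are hypothetical names. Host byte order is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder tags; the real TOY_LITERAL_* values are not assumed here. */
enum { SKETCH_ARRAY = 5, SKETCH_DICTIONARY = 6 };

/* Hypothetical result of reading one type literal. */
typedef struct {
    unsigned char type;
    unsigned char constant;     /* boolean const-ness flag */
    unsigned short subtypes[2]; /* cache indices of the subtype literals */
    size_t subtypeCount;        /* 0, 1 (array) or 2 (dictionary) */
} TypeSketch;

/* Returns the number of bytes consumed, or 0 if the buffer is too short. */
size_t readTypeSketch(const unsigned char* bytes, size_t length, TypeSketch* out) {
    assert(bytes != NULL && out != NULL);

    if (length < 2) {
        return 0;
    }

    out->type = bytes[0];
    out->constant = bytes[1];
    out->subtypeCount = 0;

    if (out->type == SKETCH_ARRAY) {
        out->subtypeCount = 1; /* type of the contents */
    } else if (out->type == SKETCH_DICTIONARY) {
        out->subtypeCount = 2; /* types of the keys and values */
    }

    size_t needed = 2 + out->subtypeCount * sizeof(unsigned short);
    if (length < needed) {
        return 0;
    }

    memcpy(out->subtypes, bytes + 2, out->subtypeCount * sizeof(unsigned short));
    return needed;
}
```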
+
+### TOY_LITERAL_INDEX_BLANK
+
+This literal is simply inserted into the literal cache when encountered.
+
+## Function Definitions
+
+The second stage of `readInterpreterSections()` is used to read the third section of the given bytecode - the function definitions.
+
+The first `unsigned short` is the number of functions present within this section. The second `unsigned short` is the length of this entire section (this one is not necessarily needed, and may be removed at some point).
+
+For each `TOY_LITERAL_FUNCTION_INTERMEDIATE` within the cache, you must read an `unsigned short` as the size. Then, the following `size` bytes of bytecode are to be copied, wholesale, into the specified cached literal, before setting that literal's type to `TOY_LITERAL_FUNCTION`. While the function is not operational yet, it will be further processed when needed.
+
+Once all function literals have been read out of this section, the opcode `TOY_OP_SECTION_END` is expected to be consumed.
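
A minimal sketch of walking this section is shown below. It is not the body of `readInterpreterSections()`. `FunctionSketch`, `takeShort` and `readFunctionsSketch` are hypothetical names, the value of the section-end opcode is passed in as a parameter because it is not specified here, and host byte order is assumed. To stay short, the sketch records where each function body sits in the buffer instead of copying it into a literal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one function body inside the bytecode buffer. */
typedef struct {
    const unsigned char* body; /* points into the buffer; a real reader would copy */
    unsigned short size;
} FunctionSketch;

/* Reads one unsigned short at *position and advances it; returns 0 when out of bytes. */
static int takeShort(const unsigned char* bytes, size_t length, size_t* position, unsigned short* out) {
    if (length - *position < sizeof(*out)) {
        return 0;
    }
    memcpy(out, bytes + *position, sizeof(*out));
    *position += sizeof(*out);
    return 1;
}

/* Returns the number of functions found, or -1 on malformed input. */
int readFunctionsSketch(const unsigned char* bytes, size_t length, unsigned char sectionEnd,
                        FunctionSketch* out, size_t outCapacity) {
    assert(bytes != NULL && out != NULL);

    size_t position = 0;
    unsigned short count = 0;
    unsigned short sectionLength = 0; /* read but otherwise unused in this sketch */

    if (!takeShort(bytes, length, &position, &count) ||
        !takeShort(bytes, length, &position, &sectionLength) ||
        count > outCapacity) {
        return -1;
    }

    for (unsigned short i = 0; i < count; i++) {
        unsigned short size = 0;
        if (!takeShort(bytes, length, &position, &size) || length - position < size) {
            return -1;
        }
        out[i].body = bytes + position;
        out[i].size = size;
        position += size;
    }

    /* the section must close with the section-end opcode */
    if (position >= length || bytes[position] != sectionEnd) {
        return -1;
    }

    return (int)count;
}
```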
+
+## Program Definition
+
+TODO
+
+### Opcodes
+
+TODO

 # Parser Structure and Operations

@@ -35,13 +135,13 @@ TODO

 # Compiler Structure and Operations

-No.
+TODO

 # Interpreter Structure and Operations

 The Toy interpreter is, at its core, just a big loop that reads bytes from memory and acts on them. Here, I'll break down exactly how it works, from a top-down perspective.

-### Running the Interpreter
+## Running the Interpreter

 There are four main functions for running the interpreter:

@@ -62,21 +162,7 @@ Next, `run` will pass to a function called `execInterpreter()`, which contains t

 Finally, `run` will automatically free the bytecode and associated literalCache (this may change at some point).

-### Bytecode Layout
-
-I don't know.
-
-To put it bluntly, the layout of the compressed bytecode was very adhoc, and as such it was not documented at the time. This was partially because I (wrongly) believed that the layout didn't matter much, only the final execution.
-
-I can say a few things about it though -
-
-* Literal compounds are stored as arrays of integers which reference previously declared literals
-* Functions are stored *after* the literal cache, in their own section and are referenced in the literal cache by index
-* Functions are structured very similarly to the program as a whole, and store their argument and return arrays within their own literalCaches
-
-I will document this one day, but not any time soon.
-
-### Executing the Interpreter
+## Executing the Interpreter

 Opcodes within the bytecode are 1 byte in length, and specify a single action to take. Each possible action is defined within the interpreter in a function that begins with `exec`, and these functions are called from within a big looping switch statement. If any of these `exec` functions encounters an error, it can simply return false to break the loop.

@@ -114,7 +200,7 @@ static bool execPrint(Toy_Interpreter* interpreter) {
 }
 ```

-### Identity Crisis
+## Identity Crisis

 As in most programming languages, variables can be represented by names specified by the programmer; in Toy, these are called "identifiers". These identifiers can be passed around in place of their actual values, but can't be used directly. To retrieve a value, you must first "parse" it, like so:

@@ -127,7 +213,7 @@ if (TOY_IS_IDENTIFIER(literal) && Toy_parseIdentifierToValue(interpreter, &liter

 You will often see this pattern throughout the codebase.

-### Other Utility Functions
+## Other Utility Functions

 Other functions are available at the top of the interpreter source file: