# Intro

People are probably aware but I have been bullish on AI for a good while now (lord knows we could use more intelligence in our modern society). As time has gone by I have also noticed a decay in scepticism. It is hard to deny that current frontier models, in various forms, can, and do, generate valuable intellectual outputs.

Education, in my opinion, is such an important aspect of AI progress. I decided to play with this education aspect in this blogpost by teaching myself some new skills. Below we will try to learn about `Reinforcement Learning (rl)` by working with [o3-mini](https://openai.com/index/openai-o3-mini/) to develop a ~~`snake`~~ block game with some additional constraints and train a neural network through `rl` to play the game. Along the way we learn many things about the technology stack, state representation trade-offs, the difficulties involved with reward shaping and strategies to elicit desirable behaviour. Overall there was a lot of iterative design work, trial-and-error combined with questioning `o3`.

Generally I've been on the AI API consumer side of things. We can't learn everything about the world we live in so it is hard to operate at all layers. It's not even desirable to try to do that. However, I have been wanting to look at some part of the AI implementation layer for a while now, and `rl` is a good target because it is independently useful. Recently `DeepSeek` has caused some turbulence in the AI space and as I was reading some of their [papers](https://arxiv.org/pdf/2501.12948), I came across this:

![[snake-01.png]]

I've been fascinated by `rl` as a concept for a while. This all started because of the amazing work done by `Google DeepMind` in 2016 with their project to tackle the game of Go. You should check out their excellent documentary below.

![AlphaGo](https://www.youtube.com/watch?v=WXuK6gekU1Y)

With `DeepSeek` publishing some really innovative research, including using large-scale reinforcement learning, it seemed like a good time to focus a `side-quest` on this.

There it is, that's how we got here, what now, what's going on? No one knows..

```
Our argument shows that the power and capacity of learning exists in the soul already;
and that just as the eye was unable to turn from darkness to light without the whole body,
so too the instrument of knowledge can only by the movement of the whole soul be turned
from the world of becoming into that of being, and learn by degrees to endure the sight
of being.

-Plato (The Republic: Book 7)
```

### GitHub

The entire code base for the project is available on [GitHub](https://github.com/FuzzySecurity/SolidBlock-RL). The code lets you `train` agents and `auto play` with pre-existing models. I have included a sample, pre-trained, neural network you can use to test.
### Resources

- ConvNetJS Deep Q Learning Demo (Karpathy) - [here](https://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html)
- Playing Atari with Deep Reinforcement Learning (DeepMind) - [here](https://arxiv.org/pdf/1312.5602v1)
- AlphaGo (DeepMind) - [here](https://deepmind.google/research/breakthroughs/alphago/)
- Value-Based Deep RL Scales Predictably - [here](https://arxiv.org/pdf/2502.04327)
- Optuna - [here](https://github.com/optuna/optuna)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning - [here](https://arxiv.org/pdf/2501.12948)

# It's only game

### Tech Stack

Ok, first I had some discussion with `o3` about suitable languages for the project and frameworks we could use (I'm so tired of writing python tbh). We settled on `NodeJS` which has support for [tfjs-node](https://github.com/tensorflow/tfjs). I'll note here that you have both `tfjs-node` and `tfjs-node-gpu` which run on the `CPU` and `GPU` respectively. If you want to use the GPU version you will need to install some dependencies.

- CUDA Toolkit - I used `v11.0` for compatibility with `tfjs-node-gpu` `v4.22.0` from [here](https://developer.nvidia.com/cuda-11.0-download-archive)
- cuDNN Library - Annoyingly you need `cuDNN` separately and your version needs to be compatible with the version of the `CUDA Toolkit`. In my case `v8.9.7`. You have to register for a developer account to get this [here](https://developer.nvidia.com/cudnn).
- On Windows you may also run into [this]() issue, just look at the comments and move the node module to the correct folder. Some hero should make a `pr` but I'm no hero.

Before you rush to become an `NVIDIA` developer I want to make some observations based on my local hardware. I have an `i9-10900k` and an `RTX 3090`, and when I was benchmarking training speed I found that my CPU was training faster than my GPU. I think what's going on here is that the model and batch size aren't large enough, so the overhead of transferring data to and from the GPU isn't worth it.

### Solid ~~Snake~~ Block

We need a good simple game, keeping in mind we are trying to absorb a lot of fresh knowledge here:

- Build a terminal game we can play
- Develop a system for auto play
- Generate state data and scoring criteria
- Train a neural network of some configuration to play the game
- QOL around training: snapshots, pausing/resuming and such
- A play mode where we can watch a pre-trained model play

Not a lot of work as you can see 😅. Initially `o3` recommended a slightly unhinged game with hostage rescue, traps and safe zones! I didn't really like it and I think there was too much complexity involved. Thinking a little bit about classic games with low complexity constraints, snake immediately came to mind. I decided that was a good target and I also added some custom rules:

- Snake eats food block, grows in length, gets reward
- Every 20 game ticks, a pair of red and green blocks spawn at random unoccupied positions on the board (a small code sketch of this spawn rule follows below)
- Snake eats green block, makes a random red block disappear, gets reward
- Snake eats red block, shrinks in length, gets penalty

Of course these are not the only rewards and penalties, we will talk more about that later, but I will say that I feel like reward design and shaping is the most important part of `rl`. You can see some of my manual gameplay below.

![[snake-02.mp4]]

`~Solid Snake`

It's actually a very cool version of snake (my tribute to MGS).
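To make the tick-based spawn rule above concrete, here is a minimal sketch of what it might look like in the game loop. This is only illustrative: `maybeSpawnBlocks`, `SPAWN_INTERVAL` and the `tick` property are names I'm making up here, though `getRandomFreePosition()` mirrors the helper you will see in the reward code later on.

```js
// Illustrative sketch (not the exact repo code): every 20 game ticks,
// spawn a paired red and green block on random unoccupied cells.
const SPAWN_INTERVAL = 20;

function maybeSpawnBlocks(game) {
    // game.tick is assumed to increment once per game step.
    if (game.tick > 0 && game.tick % SPAWN_INTERVAL === 0) {
        // getRandomFreePosition() returns an unoccupied { x, y } cell.
        game.redBlocks.push(game.getRandomFreePosition());
        game.greenBlocks.push(game.getRandomFreePosition());
    }
}
```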
I took the concept pretty far but there is a big non-obvious issue here (to a layperson like me). When we train a model we create a `state vector` that `describes things` about the turn that are `relevant to decision making`. Here, as the snake grows, it has to be able to know the length and position of all its body parts. I also apply some sensory perception where the `state vector` contains information about `blocks in a radius` of the snake's body. That's kind of hard to do when the body grows: it means we have a growing state we have to be aware of, and many edge cases. Complex rewards also introduce noise or unclear reward signals into training itself.

One solution would be to use a convolutional approach where we represent the whole board in the state vector. However, I wanted to do some behavioural work where the agent doesn't have perfect knowledge. Also, I'm not sure if the model would be too slow to train because of its size (I'm a novice here). Eventually I decided to implement a simpler game:

- We still have the red/green/food block mechanic
- The snake is now just a block and doesn't grow
- Hitting an obstacle doesn't end the game and we add a central obstacle to the board in order to encourage more terrain navigation skills

Again, just some of my manual play below so you can visualize the concept.

![[snake-03.mp4]]

`~Solid Block` 🫑

# Theory Of Design

Good, we have some idea now about the thought process and the game, next we will go through the interesting parts of the `rl environment` so we can gradually come to understand everything a bit better.

### State Vectors: All Shapes and Sizes

A state vector is essentially a representation of the environment that contains all the information the model needs to make a prediction. It's also important that we don't include elements that don't directly correlate to what the next best action could be. For example, we don't add score or step count into the vector because these metrics are the result of many decisions over time, they are not indicative of `"what is the best action now"`. That being said, this is how I have my state vector laid out:

- I get `perceptual` data for a `9x9 grid` centred on the agent
    - This provides `immediate localized information` about what is nearby (empty, red, green, food, obstacle)
- I also calculate a `normalized relative position` for the current `food` block with respect to the `agent's` own `position`.

The way we store `grid data` is a bit strange, let me try to explain. For each cell in the `9x9 square` we have these possible values:

- `0`: Obstacle (or `oob`)
- `1`: Red
- `2`: Green
- `3`: Food
- `4`: Player
- `5`: Empty

Each of these values can be converted into a one-hot vector of length 6 (I know it sounds made up, I feel the same way). These are just arrays/matrices where only one flag is set, very much like an `enum`:

```
0: Obstacle (or oob)  [1, 0, 0, 0, 0, 0]
1: Red                [0, 1, 0, 0, 0, 0]
...
5: Empty              [0, 0, 0, 0, 0, 1]
```

Each cell is now represented by such a vector and the full grid has a shape of `[9, 9, 6]`:

- 9 rows
- 9 columns
- 6 channels

This `3d tensor` fully captures the spatial environment around the agent, it captures `9x9x6 = 486` features. Additionally, as I mentioned, we also store the relative offset to the food block, which is much simpler: `[dx, dy]`, and it of course has a shape of `[2]`.
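Since the layout above can be hard to picture, here is a rough sketch of how the grid and offset parts of the state might be assembled. The `cellTypeAt(x, y)` helper (returning the `0-5` codes above) and the function names are mine for illustration; the repo's actual implementation may differ.

```js
// Illustrative sketch: build the [9, 9, 6] one-hot grid centred on the agent
// plus the normalized [dx, dy] food offset.
const VIEW = 9;          // 9x9 perceptual window
const NUM_CLASSES = 6;   // obstacle/oob, red, green, food, player, empty
const HALF = Math.floor(VIEW / 2);

function buildGridState(game) {
    const grid = [];
    for (let row = 0; row < VIEW; row++) {
        const cells = [];
        for (let col = 0; col < VIEW; col++) {
            // Board coordinates of this cell, relative to the player.
            const x = game.player.x + (col - HALF);
            const y = game.player.y + (row - HALF);
            // cellTypeAt() is assumed to return 0..5 (0 = obstacle or out of bounds).
            const type = game.cellTypeAt(x, y);
            const oneHot = new Array(NUM_CLASSES).fill(0);
            oneHot[type] = 1;
            cells.push(oneHot);
        }
        grid.push(cells);
    }
    return grid; // nested array with shape [9, 9, 6]
}

function buildOffsetState(game, boardWidth, boardHeight) {
    // Relative food offset, normalized by the board dimensions.
    const dx = (game.food.x - game.player.x) / boardWidth;
    const dy = (game.food.y - game.player.y) / boardHeight;
    return [dx, dy]; // shape [2]
}
```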
We can compute distance like this:

```js
// Manhattan distance to food
const d_new = Math.abs(this.player.x - this.food.x) + Math.abs(this.player.y - this.food.y);
```

In total we capture `488` features from the game state.

```
+-----------------------------------------+
| Feature Vector                          |
+-----------------------------------------+
| [1] Grid: 9 x 9 x 6 tensor              |
|                                         |
| Row 0: [ Cell(0,0): [1,0,0,0,0,0]       |
|          Cell(0,1): [0,1,0,0,0,0]       |
|          ...                            |
|          Cell(0,8): [0,0,0,1,0,0] ]     |
|                                         |
| Row 1: [ Cell(1,0): [0,0,1,0,0,0]       |
|          ...                            |
|          Cell(1,8): [0,0,0,0,1,0] ]     |
| ...                                     |
| Row 8: [ ... ]                          |
|                                         |
| Shape: [9, 9, 6]                        |
+-----------------------------------------+
| [2] Offset: 2-dimensional vector        |
|     [dx, dy]                            |
|     Shape: [2]                          |
+-----------------------------------------+
```

### Model Architecture

I went through a few iterations as I was tinkering (my spatial data increased over time `5 -> 7 -> 9`). Here, again, I'm letting `o3` teach me about good configurations of networks that suit the types of data I have and are `appropriately sized` based on the size of the `state vector`. Also, I took on broad recommendations around `convolutional layers`, `max-pooling`, `dense layers` and `dropout`. `Informed vibes` network design `and experimentation`.

```js
const tf = require('@tensorflow/tfjs-node');

/**
 * Creates a convolutional neural network model that accepts two inputs:
 *  - A grid input of shape [gridSize, gridSize, numClasses] (the visual field).
 *    For a 9x9 view, set gridSize = 9.
 *  - An offset input of shape [2] (the normalized relative food offset).
 *
 * The model processes the grid input through two convolutional layers followed
 * by a max-pooling layer to reduce spatial dimensions. The resulting features are
 * then flattened and concatenated with the offset input. Three dense layers follow
 * before outputting Q-values (one per action).
 *
 * This architecture is designed to capture a broader field of view (9x9)
 * and provide increased capacity for long-term planning and obstacle avoidance.
 *
 * @param {number} gridSize - The height/width of the grid (e.g., 9 for a 9x9 view).
 * @param {number} numClasses - The number of classes for one-hot encoding (e.g., 6).
 * @param {number} outputDim - The number of actions (output neurons).
 * @returns {tf.Model} A compiled TensorFlow.js model.
 */
function createModel(gridSize, numClasses, outputDim) {
    const gridInput = tf.input({ shape: [gridSize, gridSize, numClasses] });
    const offsetInput = tf.input({ shape: [2] });

    // Convolutional layers for processing the grid
    let x = tf.layers.conv2d({
        filters: 32,
        kernelSize: 3,
        activation: 'relu',
        padding: 'same'
    }).apply(gridInput);

    x = tf.layers.conv2d({
        filters: 64,
        kernelSize: 3,
        activation: 'relu',
        padding: 'same'
    }).apply(x);

    // Max-pooling to reduce spatial dimensions
    // |_ for a 9x9 input, this produces roughly a 5x5 feature map
    x = tf.layers.maxPooling2d({ poolSize: [2, 2], strides: [2, 2], padding: 'same' }).apply(x);

    // Flatten the convolutional output
    x = tf.layers.flatten().apply(x);

    // Concatenate with the offset input
    const concatenated = tf.layers.concatenate().apply([x, offsetInput]);

    // Dense layers
    let dense = tf.layers.dense({ units: 256, activation: 'relu' }).apply(concatenated);
    dense = tf.layers.dropout({ rate: 0.2 }).apply(dense);
    dense = tf.layers.dense({ units: 128, activation: 'relu' }).apply(dense);
    dense = tf.layers.dense({ units: 64, activation: 'relu' }).apply(dense);

    // Output layer, one neuron per action
    const output = tf.layers.dense({ units: outputDim, activation: 'linear' }).apply(dense);

    const model = tf.model({ inputs: [gridInput, offsetInput], outputs: output });
    model.compile({
        optimizer: tf.train.adam(0.001),
        loss: 'meanSquaredError'
    });

    return model;
}

module.exports = { createModel };
```

Two main points here, about `size` and about the `convolutional layers`.

`Size`: Of course I rely on my `o3` expert assistant here. As I adjusted my design goals, I shared code snippets, got feedback, consulted online resources for background and, most importantly, `I asked good questions` (e.g. *are there any techniques we can use to capture more spatial data while not dramatically increasing the size of our dense layers?*). In the screenshot below you can see some advice `o3` is giving me where we were discussing the previous generation of the network (`7x7` grid) and what the trade-offs would be moving to a larger perceptual field.

![[snake-04.png]]

I applied a combination of these recommendations in my model architecture.

`Convolutional layers`: As I learned, convolutional layers (the building blocks of `cnn's`) are specifically designed to handle spatial data (like pixels, board games, terrain maps). Fully connected (dense) layers treat each input as an independent feature without considering its relationship to adjacent inputs. This means that they are computationally expensive and, because they don't capture spatial relationships, less effective here. Convolutional layers on the other hand use `small sliding filters` to extract local patterns from small pieces of data (remember our `one-hot` encoded cells). Using these layers also means that the overall size of our network can be much smaller. Finally, concerning `max-pooling` (the details are not important), this is a technique we can apply to further `compress` the spatial data. We `lose some resolution` but the most `important information` is `retained`.

### Model Rewards

You will probably spend a lot of your time here. The neural network has no idea what we want it to do, it only knows about the `state vector` and that it has to generate Q-values (scores) for our possible outputs (`up`, `left`, `down`, `right`). As the agent plays, randomly at first (we will address hyperparameters later), we give it `rewards` and `punishments`. Over time it will learn that performing actions in certain states will result in a reward or a punishment.
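To make that loop a bit more tangible, here is a minimal ε-greedy action selection sketch against the two-input model above. It assumes the `buildGridState`/`buildOffsetState` style helpers sketched earlier; `chooseAction` and `ACTIONS` are illustrative names, not the repo's exact training loop.

```js
const tf = require('@tensorflow/tfjs-node');

const ACTIONS = ['up', 'left', 'down', 'right'];

// Illustrative sketch: with probability epsilon explore randomly,
// otherwise pick the action with the highest predicted Q-value.
function chooseAction(model, gridState, offsetState, epsilon) {
    if (Math.random() < epsilon) {
        return Math.floor(Math.random() * ACTIONS.length);
    }
    return tf.tidy(() => {
        const gridTensor = tf.tensor4d([gridState]);     // shape [1, 9, 9, 6]
        const offsetTensor = tf.tensor2d([offsetState]); // shape [1, 2]
        const qValues = model.predict([gridTensor, offsetTensor]);
        return qValues.argMax(1).dataSync()[0];          // index into ACTIONS
    });
}
```

The caller would then map the returned index back to a move, e.g. `ACTIONS[chooseAction(...)]`, apply it to the game, and record the resulting transition for the replay buffer discussed later.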
The more the predictions align with the rewards we defined, the better it has understood the problem space. In our example, the network very quickly learns to predict that `red blocks` result in a punishment. This is layered, of course: knowing `red blocks` result in a punishment doesn't mean it knows how to get around them. At a later stage, the network may learn to make more complex predictions and develop strategies to maximize rewards. For example, if a `red block` is to my `left` and an `obstacle` is `below`, then I might learn that I'm more likely to get rewarded by going `up` to get around than by going `right`.

It all sounds very anthropomorphic but the good part is we don't have to care about this at all. Our only task is to design `appropriate rewards and punishments` that describe the `desired outcome` of our scenario. That's what we are doing here, so let's look at some examples from the code.

```js
// Food good, need food, more food
if (this.player.x === this.food.x && this.player.y === this.food.y) {
    this.score += 40;
    this.food = this.getRandomFreePosition();
}

// Green good, need green, more green
for (let i = 0; i < this.greenBlocks.length; i++) {
    let gb = this.greenBlocks[i];
    if (this.player.x === gb.x && this.player.y === gb.y) {
        this.score += 25;
        this.greenBlocks.splice(i, 1);
        i--;

        if (this.redBlocks.length > 0) {
            const randIndex = Math.floor(Math.random() * this.redBlocks.length);
            this.redBlocks.splice(randIndex, 1);
        }
    }
}

// Red bad, not red, never red
for (let i = 0; i < this.redBlocks.length; i++) {
    let rb = this.redBlocks[i];
    if (this.player.x === rb.x && this.player.y === rb.y) {
        this.score -= 20;
        this.redBlocks.splice(i, 1);
        i--;
    }
}
```

These are pretty straightforward, they are mostly like how we would explain the game to someone else if they wanted to play. But, remember that our agent has no concept of its environment, it can only detect objects in a `9x9 square`. How can it then predict the best move? This is where the `relative normalized distance` to the food block comes into play. Remember we encode this into the state vector (`[dx, dy]`). In code I apply a reward if the agent reduces its distance to the food.

```js
// Manhattan distance change as multiplier for reward
const d_old = Math.abs(oldX - this.food.x) + Math.abs(oldY - this.food.y);
const d_new = Math.abs(this.player.x - this.food.x) + Math.abs(this.player.y - this.food.y);
const shapingReward = 1.0 * (d_old - d_new);
this.score += shapingReward;
```

Of course the agent has no idea what these coordinates mean but as it plays it gradually learns, `reducing number good, need reduce, small number`. We may also need to punish it so it learns this lesson more aggressively.

```js
// If the decrease in distance is less than a threshold, apply punishment
const progressThreshold = 0.5;
if ((d_old - d_new) < progressThreshold) {
    this.score -= 0.5; // increase number bad, must reduce, small number
}
```

It's very important that you `balance rewards and punishments`, networks can and do learn that it is safer not to move anywhere at all if not calibrated properly 😅. That brings us to the next set of reward categories, `mitigating pathological behaviour`. We, of course, want the agent to explore the board, move around and encounter new terrain. This is easier said than done, I actually had to apply a lot of small but significant punishments for undesirable behaviour.
```js
// Moving in a square is bad (haha)
if (this.moveHistory.length >= 8) {
    const last8 = this.moveHistory.slice(-8);
    const firstHalf = last8.slice(0, 4).join(',');
    const secondHalf = last8.slice(4).join(',');
    if (firstHalf === secondHalf) {
        this.score -= 1.5;
        this.cyclePenaltyCount++;
        if (this.debugPatternDetection) {
            console.log("[DEBUG] 4-move cycle repeated. Penalty applied.");
        }
    }
}

if (this.moveHistory.length >= 10) {
    // Only one direction is really bad
    const last10 = this.moveHistory.slice(-10);
    const uniqueMoves = [...new Set(last10)];
    if (uniqueMoves.length === 1) {
        this.score -= 2;
    } else if (uniqueMoves.length === 2) {
        let patternAlternates = true;

        // I love oscillating but it is bad, I don't know why, it's so cool
        for (let i = 0; i < last10.length - 1; i++) {
            if (last10[i] === last10[i + 1]) {
                patternAlternates = false;
                break;
            }
        }

        if (patternAlternates) {
            this.score -= 2;
        }
    } else {
        // Doing this thing I love most of the time, little bit bad
        const candidate = last10[0];
        const countSame = last10.reduce((count, move) => count + (move === candidate ? 1 : 0), 0);
        if (countSame >= 8) {
            this.score -= 0.5;
            if (this.debugPatternDetection) {
                console.log("[DEBUG] 80% identical moves detected. Penalty applied.");
            }
        }
    }
}

// I love moving into the obstacle over and over but it is bad
if (this.player.x === oldX && this.player.y === oldY) {
    this.score -= 3;
}
```

Ok, I want to make a humorous observation. Agents, in my setup, love oscillating (`left, right, left, right`, `up, down, up, down`), it's like their favourite thing in the whole world. Punishment helps but the behaviour remains even at higher levels of training. I suspect this happens in all such systems; I guess that in some states predicted outputs loop on each other. For example, the agent sees a specific `state vector` with elements in it, and through training it has learned that the best prediction says `down` is the correct move. Now it has moved down, it is observing a new `state vector`, and this one has a best prediction of `up`. Thus the show goes on, `ad infinitum`.

I think this is a side-effect of limiting the `view` of the `agent` to a `9x9 square`. A strategy here could be to adjust what we are capturing:

```
// Current field-of-view
█ █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
█ █ █ █ A █ █ █ █   -> No concept of direction (9x9=81)
█ █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █
█ █ █ █ █ █ █ █ █

// Possible alternative field-of-view
█ █ █ █ █
█ █ █ █ █
█ █ A █ █   -> Localized knowledge (5x5=25)
█ █ █ █ █   -> 56 tiles remaining, for directional vision
█ █ █ █ █
```

Of course we would have to add direction to the game but it would let the agent have an idea of things happening in the distance and develop predictions based on those as well. Or some kind of radial pattern maybe, but I don't know if it would be harder to train like that. Finally, we really want to encourage moving around.
```js
// If the agent moves to a new cell in this episode, reward exploration
const EXPLORATION_BONUS = 0.6;
const posKey = `${this.player.x},${this.player.y}`;
if (!this.visitedCells.has(posKey)) {
    this.visitedCells.add(posKey);
    this.score += EXPLORATION_BONUS;
}
```

I feel like it's a bit complicated to get this right; practice probably makes it easier to intuit what types of incentives produce the outcomes you want. I'm also not sure these rewards are designed correctly but they are ok for our small scenario.

### Hyperparameters

Ah yes, the things that cause us to have to retrain models. These are basically settings you define before you train the network. We already discussed model architecture, which is a hyperparameter as well. There are a few we could talk about but I'll limit it to two: `ε` (`epsilon`) and `replay`.

`ε`: This is basically randomness. When we start training the agent we set `ε=1.0`, meaning all actions are random. In the beginning the agent doesn't know anything so it has to act randomly to acquire experiences. However, as the network learns, we need to gradually decay `ε` so the agent relies more and more on its own experiences. It's really tricky to get this decay right: if the agent becomes too `greedy` before it has learned good strategies it will get stuck in `local minima` and behave poorly. In my code I implement decay like this:

```js
// Define epsilon parameters for training
let epsilon = 1.0;
const epsilonDecay = 0.999;  // The factor by which epsilon decays each update.
const epsilonMin = 0.1;      // The minimum threshold for epsilon.
const epsilonBoost = 0.2;    // The amount to boost epsilon when it decays below epsilonMin.

// Periodic exploration boost constants
const explorationCycleLength = 1200;  // Every 1200 episodes, boost epsilon instead of decaying normally.
const epsilonResetCycle = 0.2;        // Maximum reset boost for epsilon (we can experiment)

if (episode % explorationCycleLength === 0) {
    // Every explorationCycleLength, boost epsilon by a random amount
    // between epsilonMin (0.1) and epsilonResetCycle (0.2).
    const boostAmount = epsilonMin + Math.random() * (epsilonResetCycle - epsilonMin);
    epsilon = epsilon + boostAmount;
} else {
    // Otherwise, decay epsilon multiplicatively.
    epsilon = epsilon * epsilonDecay;

    // Soft reset: if epsilon falls below the minimum threshold, reset it to epsilonBoost (0.2).
    if (epsilon < epsilonMin) {
        epsilon = epsilonBoost;
    }
}
```

This is not really great, it's serviceable but the technique is very crude. Something that would probably be better is implementing an `adaptive boost`, like this:

```js
// Set epsilon parameters for training.
let epsilon = 1.0;
const epsilonDecay = 0.999;  // The factor by which epsilon decays each update.
const epsilonMin = 0.1;      // The minimum threshold for epsilon.
const epsilonBoost = 0.2;    // The amount to boost epsilon when it decays below epsilonMin.

// Adaptive epsilon update based on moving average reward.
const movingAvgWindow = 20;       // window size for moving average
const performanceThreshold = 50;  // performance threshold (adjust as needed)

if (episodeStats.length >= movingAvgWindow) {
    const recentRewards = episodeStats.slice(-movingAvgWindow).map(s => s.totalReward);
    const movingAvg = recentRewards.reduce((a, b) => a + b, 0) / recentRewards.length;
    if (movingAvg < performanceThreshold) {
        // If performance is poor, boost exploration (ensure epsilon is at least 0.3).
        epsilon = Math.max(epsilon, 0.3);
    } else {
        // Otherwise, decay epsilon normally.
        epsilon *= epsilonDecay;
        if (epsilon < epsilonMin) epsilon = epsilonBoost;
    }
} else {
    // Not enough episodes yet; decay epsilon normally.
    epsilon *= epsilonDecay;
    if (epsilon < epsilonMin) epsilon = epsilonBoost;
}
```

Bottom line, we need a booster mechanic to push `ε` up if needed so we can gain more novel experiences. I tried some other things as well, this was just the latest version. Apparently there is good [science](https://arxiv.org/pdf/2502.04327) on this, but I haven't applied it! In fact, I'm pretty sure there is an issue with how I implement decay (and replay) based on training results, where agent intelligence plateaus around 1200-1400 episodes and even decays afterwards (even though `ε` looks ok). I could try and figure it out but this has been eating up a lot of my evenings already!

`Replay`: This is the final thing we should talk about. As the agent plays it generates `experiences`, and these experiences are sent through the neural network, one forward and backward pass. We store these experiences in a `replay buffer` that fills up with transitions as the agent plays. We don't send everything to the network, we batch them up in chunks of `32`. This seems straightforward and it might be (`?`). Again, I went through some iterations, you can see I have implemented a `sampleEnhancedBatch` function. This gets a mix of purely `random experiences` and `also` uses some tricks to get samples that are `more recent` and have `higher priority`.

```js
function sampleEnhancedBatch(buffer, batchSize) {
    if (buffer.length < batchSize) {
        return buffer.slice();
    }

    // Split the batch into uniform and prioritized samples.
    const halfBatch = Math.floor(batchSize / 2);
    const uniformSamples = [];

    // Uniform Sampling: randomly select 'halfBatch' transitions.
    for (let i = 0; i < halfBatch; i++) {
        const index = Math.floor(Math.random() * buffer.length);
        uniformSamples.push(buffer[index]);
    }

    // Compute effective priorities using an exponential recency bias.
    const alpha = 2.0; // We can experiment
    const effectivePriorities = [];
    for (let i = 0; i < buffer.length; i++) {
        const transition = buffer[i];
        const originalPriority = (transition.priority !== undefined) ? transition.priority : Math.abs(transition.reward);

        // Recency weight: newest transitions get weight 1, oldest get weight ~ (1 / buffer.length)
        const recencyWeight = ((i + 1) / buffer.length);

        // Effective priority is the product of the original priority and the recency weight raised to alpha.
        const effectivePriority = originalPriority * Math.pow(recencyWeight, alpha);
        effectivePriorities.push(effectivePriority);
    }

    // The prioritized half of the batch is then drawn using these effective
    // priorities and combined with uniformSamples (snippet truncated here,
    // the full implementation is in the repo).
```

It is also worth mentioning that we `clip` rewards in the replay buffer so there aren't outliers that can destabilize training (but the details aren't important). Our replay buffer has a maximum size and we cycle elements out as needed.

```js
const REPLAY_BUFFER_SIZE = 10000;

// Limit the replay buffer to the maximum allowed size.
if (replayBuffer.length > REPLAY_BUFFER_SIZE) {
    replayBuffer.splice(0, replayBuffer.length - REPLAY_BUFFER_SIZE);
}
```

There is a pitfall we have to be vigilant about here. As the agent learns and fills up the buffer with `experiences` it develops a more sophisticated understanding of the environment. If, at a later stage, the buffer is filled with a lot of stale early transitions that no longer make sense, then that can also destabilize training (causing learning to plateau or even decay). I handle this by pruning the buffer at fixed episode counts.
```js
if (episode % 500 === 0 && replayBuffer.length > 0) {
    console.log("Purging oldest 20% of replay buffer to refresh data...");
    const numToRemove = Math.floor(replayBuffer.length * 0.2);
    replayBuffer.splice(0, numToRemove);
}
```

This seems ok but I haven't done extensive testing. We also have to be careful that we don't discard too much because the agent will lose valuable insights and may eventually forget important lessons.

# Model Training

Great, now let's have a look at the fruits of our labour and train the model (I trained like 6 models, I'm just saying..). Here we can see the first time the model manages to achieve an average positive score over a `20 Episode` sample size. Keep in mind though that this will fluctuate for a while and `ε` is of course still pretty high.

```
Episodes 161-180 summary:
    Avg Reward = 1.99
    Avg Mean Loss = 0.07496
    Avg Median Loss = 0.04736
    Avg Max Q Value = 2.30198
    Epsilon = 0.83603

Episode 181 -> Steps = 500, Reward = 67.70
Episode 182 -> Steps = 500, Reward = 30.85
Episode 183 -> Steps = 500, Reward = -56.00
Episode 184 -> Steps = 500, Reward = -150.10
Episode 185 -> Steps = 500, Reward = -104.55
Episode 186 -> Steps = 500, Reward = -103.90
Episode 187 -> Steps = 500, Reward = -92.85
Episode 188 -> Steps = 500, Reward = 26.55
Episode 189 -> Steps = 500, Reward = 149.35
Episode 190 -> Steps = 500, Reward = 107.55
Episode 191 -> Steps = 500, Reward = 67.85
Episode 192 -> Steps = 500, Reward = 39.25
Episode 193 -> Steps = 500, Reward = -36.55
Episode 194 -> Steps = 500, Reward = -12.15
...
```

When you look at the trends across these episodes, they appear promising. In fact, the initial run produced an average of `-160` so learning is happening gradually. Just a short while later we can see that our trends continue at pace.

```
Episodes 281-300 summary:
    Avg Reward = 86.07
    Avg Mean Loss = 0.06964
    Avg Median Loss = 0.04986
    Avg Max Q Value = 2.86062
    Epsilon = 0.74145

Episode 301 -> Steps = 500, Reward = 328.60
Episode 302 -> Steps = 500, Reward = -225.05
Episode 303 -> Steps = 500, Reward = 128.15
Episode 304 -> Steps = 500, Reward = 129.40
Episode 305 -> Steps = 500, Reward = 298.40
Episode 306 -> Steps = 500, Reward = 167.35
Episode 307 -> Steps = 500, Reward = 67.15
Episode 308 -> Steps = 500, Reward = 235.05
Episode 309 -> Steps = 500, Reward = 328.90
Episode 310 -> Steps = 500, Reward = 4.40
Episode 311 -> Steps = 500, Reward = 243.00
Episode 312 -> Steps = 500, Reward = 316.65
Episode 313 -> Steps = 500, Reward = -57.70
...
```

Finally, let's look at late-stage statistics; this is right around the time that the performance of the `neural network` starts declining. It fluctuates a bit around these episode counts (`629` being the highest recorded average) and goes steadily down afterwards.

```
Episodes 1481-1500 summary:
    Avg Reward = 629.73
    Avg Mean Loss = 0.35360
    Avg Median Loss = 0.23227
    Avg Max Q Value = 10.62163
    Epsilon = 0.34747

Episode 1501 -> Steps = 500, Reward = 342.80
Episode 1502 -> Steps = 500, Reward = 554.45
Episode 1503 -> Steps = 500, Reward = 535.55
Episode 1504 -> Steps = 500, Reward = 713.85
Episode 1505 -> Steps = 500, Reward = 759.85
Episode 1506 -> Steps = 500, Reward = 645.70
Episode 1507 -> Steps = 500, Reward = 902.30
Episode 1508 -> Steps = 500, Reward = 626.05
Episode 1509 -> Steps = 500, Reward = 538.10
Episode 1510 -> Steps = 500, Reward = 637.10
...
```

Again, I'm not totally sure why there is a decline. It could be `ε` (34% looks ok tbh) or the `replay buffer` (size/sampling), or maybe the network isn't big enough to capture rich strategic actions over time.

### Demos

Let's have a look at a few demos using snapshot data (btw you really need snapshots as you train and you need a mechanism to resume training from snapshots, just saying). In the project I have integrated an `auto play` mode with some settings you can specify. You can manually set `ε`, which is useful for two reasons:

- The model has never trained with an `ε` of `0` so its strategies are likely not adapted to this.
- The model is `still pathological` at higher levels of training and `ε` can help it escape `local minima`.

You can also adjust the `obstacle` in the middle of the screen. Training happens with the default `cross obstacle` but since the model is only reasoning over a `9x9` square it generalizes. For `auto play` there are three obstacle settings:

- Default, has the cross obstacle
- `diamond`, has a tapered diamond structure in the middle, easier to navigate
- `none`, has no obstacle in the middle

You can play with `ε` and obstacle configurations. Generally, the larger the protrusions on the obstacle the harder it is to navigate. The `cross`, where each arm is three times the length of the agent, is the hardest.

The clip below is from a `training snapshot` at `1000 episodes` navigating the default environment with the cross. We apply an `ε` of `0.2` to help it escape pathological states.

![[snake-05.mp4]]

Because of `ε` it can escape some states where it would otherwise get stuck (it also causes it to make some small mistakes). Overall it actually does very well, it can escape the sharp corners of the cross and consistently prioritizes `food` and `green` blocks. I noticed also that it will prioritize `green` blocks over `food` (despite `food` giving a larger reward). I guess the network has learned that `green` is an opportunistic reward: when it leaves the field of vision it is gone forever, whereas it knows the location of `food` at all times.

Here is an example using the `diamond` pattern with a lower `ε`.

![[snake-06.mp4]]

The tapered shape makes it a lot easier for the agent to get around obstacles and so we get really good performance even if we halve `ε`. Finally, let's look at `no obstacle` with an `ε` of `0.05`.

![[snake-07.mp4]]

The performance is really excellent without obstacles. Because `ε` is `0.05` the agent is executing its learned policy in `95%` of cases here.

# Conclusions

It was a really cool `side-quest` and I learned a ton of stuff about `rl`. As I mentioned in the beginning, these types of low complexity environments exist in abundance in the real world. Learning some `rl` to `solve` these problems is generally very useful I think and I'm sure I will use these skills in the future to work on real problems.

Learning about this primarily through `o3` was a huge boost in productivity. I think collecting all this knowledge from the internet manually would have easily tripled my timeline for completing the work. Not to mention that, through `cursor`, it was extremely easy to rapidly iterate on code as I grokked new concepts (I was the neural net all along). I'm sure I made many mistakes and there are probably structural issues with my setup but overall I think it bodes well for `AI assisted` learning.

You made it down here, amazing!
You have access to the [GitHub repository](https://github.com/FuzzySecurity/SolidBlock-RL), so I suggest you have a look at the code. I would be `extremely interested in pull requests` that result in `better performance`. If you make a `pr`, please attach a screenshot of your scores and explain your changes so I, and others, can learn from them!