# Intro

Recently I was working on some memory scanning functionality. If we try to reduce this problem to its most basic form, what is memory scanning really? The core feature of memory scanning is traversing large arrays and looking for some kind of pattern. Being a naïve creature, unbound in code by time constraints most of the time, I never gave much thought to optimization. However, by its nature, memory scanning should be fast; we don't want to wait around for our results.

This post is really about two things:

- Explaining search optimization using Boyer-Moore.
- Perhaps more importantly, showing how `ChatGPT` can help us work more efficiently.

There has been a lot of discourse about the advances in public-facing AI services which goes in a number of different directions: *Omg, AI is going to take my joooobbb!* or *Think of the children, AI is teaching them malwarez* 💀 or *AI is now our lord and saviour* or *I'm fuming, AI is going to monetize our ideas* or *Respect, maaa privacy, how dare you have this public code in your corpus*. The truth is different. The advent of more numerous, public, and increasingly proficient AI models results in an outlook which is neither `utopian` nor `dystopian`. The reality is that we are just dealing with new constraints and parameters that have their own advantages and disadvantages. We can use these models to do more but the level of BS remains relatively the same (different BS, different day).

Personally, I'm really excited by these advances and I can see how transformative they (already) are for my work. It will be increasingly important to leverage these advances to augment existing workflows. This post is really meant to be an example of that. Below you will find an `interview` I conducted with `ChatGPT` where we learn about array search optimization using `Boyer-Moore`. For each answer I have also provided an audio version generated using [Google Cloud Text-To-Speech](https://cloud.google.com/text-to-speech) (this seemed suitably appropriate).

# The Markov Thinking Sand Interview

#### Question

So ChatGPT, I want you to consider the function below and tell me what it does and how it works.

```cs
internal static Int32 NaiveFindPattern(Byte[] haystack, Byte[] needle)
{
    // Check if the needle is larger than the haystack.
    if (needle.Length > haystack.Length)
    {
        return -1;
    }

    // Iterate through the haystack and try to match the needle.
    for (Int32 i = 0; i < haystack.Length - needle.Length + 1; i++)
    {
        Boolean found = true;
        for (Int32 j = 0; j < needle.Length; j++)
        {
            if (haystack[i + j] != needle[j])
            {
                found = false;
                break;
            }
        }

        if (found)
        {
            return i;
        }
    }

    // No match found.
    return -1;
}
```

#### Answer

![[GPT-question1.webm]]

This is a function written in C# that implements a naïve algorithm for finding the starting index of a pattern (the "needle") within a larger array of data (the "haystack"). The function takes two arguments: a byte array called "haystack" and a byte array called "needle".

The function first checks if the length of the needle is larger than the length of the haystack, and if so, it immediately returns -1, indicating that the needle cannot be found in the haystack.

Then, the function uses nested for loops to iterate through the haystack and try to match the needle. The outer loop iterates through the haystack starting at index 0 and ending at the last index of the haystack minus the length of the needle plus 1.
The inner loop iterates through the needle and compares each byte of the needle to the corresponding byte of the haystack at the current index of the outer loop. If the needle does not match the corresponding slice of the haystack, the inner loop will set the found Boolean to false and break. If the needle matches the slice, then the outer loop will return the current index "i" as the result of the function.

Finally, if no match is found, the function will return -1, indicating that the needle was not found in the haystack.

It's worth noting that this algorithm has a time complexity of O(nm) where n is the size of the haystack and m is the size of the needle. This is relatively inefficient when compared to more sophisticated algorithms like KMP or Boyer-Moore.

#### Question

Ok, my code is inefficient, but explain a bit more about what would be efficient. Without going into specific, known algorithms, provide a general approach for how the search speed can be improved.

#### Answer

![[GPT-question2.webm]]

Sure! One way to improve the search speed would be to preprocess the needle in a way that allows you to skip over certain parts of the haystack during the search. This can be done by using an algorithm that takes advantage of the properties of the needle and haystack to narrow down the possible locations of a match.

Another way to improve the search speed is to use a data structure that allows for faster lookups, such as a hash table or trie. This can be useful if the needle or haystack contains repeating patterns, as it allows you to quickly skip over sections of the haystack that are unlikely to contain a match.

Additionally, you can implement parallelism and multithreading in the search, this will take advantage of multiple cores of the CPU to perform the search in chunks of data simultaneously.

Finally, an approach that might be useful, is to use a combination of these techniques. For instance, you can use a preprocessing algorithm and a data structure in conjunction with parallelism to achieve the most efficient search possible. Of course, it's important to consider the specific properties of your data and use-case when choosing an algorithm or data structure to optimize the search.

#### Question

Ok, consider that the haystack would typically be large and the needle typically small. The needle and haystack can contain byte sequences with all possible valid values from MinVal to MaxVal. Does this change or inform your assessment?

#### Answer

![[GPT-question3.webm]]

Yes, this information does change my assessment. If the haystack is typically large and the needle is typically small, then it makes sense to use an algorithm that can quickly scan through the haystack and skip over sections that are unlikely to contain a match. The algorithm I described earlier, called Boyer-Moore, can do this very efficiently, by utilizing a preprocessing step to construct a "bad character" and "good suffix" heuristic that allow for efficient skipping.

Also, when all possible byte values can be found in both the needle and haystack, you can use algorithms that don't rely on the specific properties of the data, such as the Rabin-Karp algorithm that uses hash values to quickly identify matches, these algorithms can also be very efficient.

It's important to note that, as the needle and haystack have a wide range of possible values, the preprocessing step of the Boyer-Moore algorithm, or the hash table/trie data structure, may take longer.
However, the runtime for each search iteration will be shorter, which means that the overall search will still be faster. As always, it's best to evaluate the specific properties of the data and use-case when choosing an algorithm, rather than relying on generalities.

#### Question

Great, now that we have some background on this problem, can you show me an alternate version of this function which uses Boyer-Moore to optimize the search?

#### Answer

![[GPT-question4-1.webm]]

Sure! Here is an example of how the Boyer-Moore algorithm could be implemented to find the starting index of a pattern (the "needle") within a larger array of data (the "haystack") in C#:

```cs
internal static int BoyerMoore(byte[] haystack, byte[] needle)
{
    // Preprocessing step:
    // create the bad character and good suffix heuristics
    int[] badChar = new int[256];
    int[] goodSuffix = new int[needle.Length];

    // Initialize the bad character heuristic
    for (int i = 0; i < badChar.Length; i++)
    {
        badChar[i] = needle.Length;
    }

    for (int i = 0; i < needle.Length - 1; i++)
    {
        badChar[needle[i]] = needle.Length - i - 1;
    }

    // Initialize the good suffix heuristic
    int lastPrefixIndex = needle.Length - 1;
    for (int i = needle.Length - 1; i >= 0; i--)
    {
        if (IsPrefix(needle, i + 1))
        {
            lastPrefixIndex = i + 1;
        }

        goodSuffix[i] = lastPrefixIndex + (needle.Length - i - 1);
    }

    for (int i = 0; i < needle.Length - 1; i++)
    {
        int suffixLen = SuffixLength(needle, i);
        goodSuffix[needle.Length - 1 - suffixLen] = needle.Length - 1 - i + suffixLen;
    }

    // Search step:
    int i = 0;
    while (i <= haystack.Length - needle.Length)
    {
        int j;
        for (j = needle.Length - 1; j >= 0 && needle[j] == haystack[i + j]; j--)
            ;

        if (j < 0)
        {
            return i;
        }

        i += Math.Max(badChar[haystack[i + j]] - needle.Length + j + 1, goodSuffix[j]);
    }

    // No match found.
    return -1;
}

private static bool IsPrefix(byte[] needle, int p)
{
    for (int i = p, j = 0; i < needle.Length; i++, j++)
    {
        if (needle[i] != needle[j])
        {
            return false;
        }
    }

    return true;
}

private static int SuffixLength(byte[] needle, int p)
{
    int len = 0;
    for (int i = p, j = needle.Length - 1; i >= 0 && needle[i] == needle[j]; i--, j--)
    {
        len++;
    }

    return len;
}
```

![[GPT-question4-2.webm]]

The Boyer-Moore algorithm consists of two main steps: a preprocessing step, where the bad character and good suffix heuristics are created, and a search step, where the needle is matched against the haystack.

In the preprocessing step, two arrays are created: the bad character array and the good suffix array. The bad character array is used to skip characters in the haystack that don't match the needle, based on the last occurrence of each character in the needle. The good suffix array is used to align the needle with a suffix of the haystack that matches a suffix of the needle.

In the search step, the function starts at the first character of the haystack and compares it to the last character of the needle. If it matches, it compares the next character, and so on, until a mismatch is found or the entire needle has been matched. If there is a mismatch, the function uses the bad character and good suffix arrays to determine how many characters to skip in the haystack before trying again. The algorithm repeats this process of comparing and skipping until the needle is found or the end of the haystack is reached.

It's worth noting that Boyer-Moore has a best-case time complexity of O(n/m) where n is the size of the haystack and m is the size of the needle, which is much faster than the previously mentioned naïve algorithm.
With all possible bytes, Boyer-Moore has a worst-case time complexity of O(nm) as the preprocessing step takes O(m+R) where R is the range of possible byte values, this is still much faster than the O(nm) of the naïve algorithm.

Please keep in mind that this is just an example and it may not work in all cases, it could also be optimized depending on the properties of the data.

#### Question

As a layperson, it seems to me that the efficiency here comes from the ability to skip large parts of the haystack while searching. Can you give a step-by-step explanation of how the skipping works? You should use some dummy data to illustrate the process.

#### Answer

![[GPT-question5-1.webm]]

Sure! Let's say we have the following haystack and needle:

```cs
haystack: "thequickbrownfoxjumpsovrthelazydogthequickbrownfoxjumpsovrthelazydog"
needle: "jumpsovrthelazydog"
```

![[GPT-question5-2.webm]]

The Boyer-Moore algorithm begins by comparing the last characters of the needle and haystack. Since they don't match, the algorithm uses the bad character heuristic to determine how many characters to skip in the haystack before trying again.

The bad character heuristic is implemented as an array, where each index represents a character in the alphabet and the value at that index represents the distance between the last occurrence of that character in the needle and the end of the needle. For example, the last occurrence of 'a' in the needle is at position 18, the last occurrence of 'g' is at position 20, so the bad character array for this example would look like this:

```cs
badChar = [20, 18, ... ]
```

![[GPT-question5-3.webm]]

Now, the algorithm looks up the value of the bad character array for the mismatched character, in this case the character 'z' of the haystack, and finds that it's 17. Since the last occurrence of 'z' in the needle is at the position 17, the algorithm will skip 17 characters in the haystack before trying again.

The algorithm continues by comparing the remaining characters of the needle and haystack, and will find again a mismatch at the next 'e' at the needle, this time the bad character array will tell us that we can skip 7 characters in the haystack before trying again.

At some point, the algorithm will find a match for the needle, it will return the index where the needle matches in the haystack. If no match is found, the function returns -1.

As you can see in this example, using the Boyer-Moore algorithm allows you to skip large sections of the haystack, because the algorithm can use the mismatch information to quickly find the next potential match location.

#### Question

Ok ChatGPT, that all makes a lot of sense now, thank you for your help!

#### Answer

![[GPT-question6.webm]]

You're welcome! I'm glad I could help. If you have any other questions or need further information on the Boyer-Moore algorithm or string search in general, don't hesitate to ask.

# Interview Retrospective

This is very impressive as a learning aid. A close review of the transcript will show some errors in the responses (I intentionally left the responses 💯 intact, can you see what I mean?). I found this to be typical in certain scenarios. ChatGPT can be a bit pathological sometimes but overall that doesn't detract from the insights you can get in a wide variety of situations. One assumes that there will also be a gradual reduction of AI pathology over time.
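One cheap way to do that close review, beyond eyeballing the prose, is to let the code check itself: cross-reference the suggested `BoyerMoore` against the known-good `NaiveFindPattern` on random inputs and see whether they ever disagree. Below is a minimal sketch of such a differential test; the `CrossCheck` name, the seed, the iteration count and the input sizes are arbitrary illustrations, and it assumes both functions shown above are accessible from the same class.

```cs
// A minimal differential test (sketch): compare the suggested BoyerMoore against
// NaiveFindPattern on random inputs. Assumes both functions above are in scope.
internal static void CrossCheck()
{
    Random rng = new Random(1337);

    for (Int32 iteration = 0; iteration < 100000; iteration++)
    {
        // Small random haystacks and needles exercise the edge cases quickly.
        Byte[] haystack = new Byte[rng.Next(1, 64)];
        Byte[] needle = new Byte[rng.Next(1, 8)];
        rng.NextBytes(haystack);
        rng.NextBytes(needle);

        Int32 expected = NaiveFindPattern(haystack, needle);
        Int32 actual = BoyerMoore(haystack, needle);

        if (expected != actual)
        {
            Console.WriteLine($"Mismatch at iteration {iteration}: naive={expected}, boyer-moore={actual}");
            return;
        }
    }

    Console.WriteLine("No mismatches found.");
}
```

A test like this won't prove the suggestion correct, but it surfaces disagreements quickly and costs almost nothing to write, which is very much in the spirit of the supervision point further below.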
Consider also how much effort is expended here to learn about search optimization versus using more traditional methods: buy a book, read a blog, spend time on stack overflow, watch a video, etc.

### Speed Comparison

I think it pays to actually compare the efficiency of the `NaiveFindPattern` function and the `BoyerMoore` function suggested by `ChatGPT`. What I do here is take a relatively large file (1.2 GB) and search for a `Byte[]` sequence towards the end of the buffer.

```cs
Byte[] bPattern = { 0x74, 0x68, 0x5F, 0x70, 0x61, 0x73, 0x73, 0x77, 0x6F, 0x72, 0x64, 0x20, 0x3D, 0x20, 0x48, 0x72 };
```

You can see the results in the screenshot below:

![[Pasted image 20230111135410.png]]

The pattern, in both cases, was matched at `0x44660B08` (an offset of `1147538184 Bytes` or `1.0687 GB`). Just doing some arbitrary napkin math, in this instance, the speed of both functions is as follows:

- `NaiveFindPattern` : 0.7957 GB/s
- `BoyerMoore` : 3.5863 GB/s (4.5 times faster)

![[facts-straight.gif]]

### Supervised Learning

Finally, I want to say, you should have a healthy scepticism when you evaluate outputs from tools like `ChatGPT` and `Github Copilot`. It is important that `you supervise` the outputs, test proposals, validate assertions and ask questions.

```
I wou'd have you Observe what was told to us aboute tak'g Care whom to calle upp, for you are Sensible what Mr. Mather writ in ye Magnalia of —, and can judge how truely that Horrendous thing is reported. I say to you againe, doe not call up Any that you can not put downe; by the Which I meane, Any that can in Turne call up Somewhat against you, whereby your Powerfullest Devices may not be of use. Ask of the Lesser, lest the Greater shal not wish to Answer, and shal commande more than you.
```