# Intro
Recently I was looking at the ASAR file format. I'm no expert, but I thought I would explain a little about the structure it uses.
The official ASAR format is laid out [here](https://github.com/electron/asar); it's a core part of the Electron ecosystem.
ASARs are a sort of `virtual file system (VFS)` and they have an interesting structure. The format is as follows:
- A well-ordered `JSON header object` describing a directory structure (with nested files and folders), listed in a specific search order.
- A concatenated set of files appended after the JSON header.
The one other thing I should mention is that ASARs use [chromium-pickle-js](https://github.com/electron/node-chromium-pickle-js) to *"serialize"* some data. I honestly don't know what the purpose of using it is; I think there is no need for it and they could probably just get rid of it entirely.
# Pickle me this?
So pickling values isn't very interesting, but it needs some explanation. `Chromium-pickle-js` supports only a few simple data types, like `String` and `Int` (and variations of these). Let's take a look at an example.
Imagine we have a string, `AABB`:
```js
AABB --> 0x41414242
|__ Size 4-bytes
```
The pickling process essentially writes a `DWORD` size value for the object being pickled, followed by the object itself. Using this logic, the pickled string would appear like so.
```js
// Values
[0x00000004][0x41414242]
// Raw bytes (Sizes are in LE)
04 00 00 00 41 41 42 42
```
That seems straightforward enough, but there is one more caveat. Pickled objects have to be `DWORD` aligned and are NULL padded if they are not. Consider the following example.
```js
AABBCC --> 0x414142424343
|__ Size 6-bytes
// Values
[0x00000006][0x414142424343][0x0000]
// Raw bytes (Sizes are in LE)
06 00 00 00 41 41 42 42 43 43 00 00
```
You can calculate padding for an arbitrarily sized buffer length like so:
```csharp
// Rounds the length up to the next DWORD (4-byte) boundary and
// returns the number of NULL padding bytes required
private static UInt64 calculatePicklePadding(UInt64 uLength)
{
    UInt64 uNewLen = (uLength + 3) & ~(UInt64)3;
    return uNewLen - uLength;
}
```
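Putting the size prefix and the padding together, here's a minimal sketch of the byte layout described above. Note this is *not* the `chromium-pickle-js` API (the helper name is my own); it just reproduces the `[length][data][padding]` layout from the examples.

```js
// Sketch of the [length][data][NULL padding] layout described above.
// Not the chromium-pickle-js API; helper name is my own invention.
function pickleString(str) {
  const data = Buffer.from(str, 'ascii');
  const padded = (data.length + 3) & ~3; // round up to DWORD alignment
  const buf = Buffer.alloc(4 + padded);  // zero-filled, so padding bytes are NULLs
  buf.writeUInt32LE(data.length, 0);     // little-endian size prefix
  data.copy(buf, 4);
  return buf;
}

console.log(pickleString('AABB').toString('hex'));   // 0400000041414242
console.log(pickleString('AABBCC').toString('hex')); // 060000004141424243430000
```

The `AABBCC` output shows the two trailing NULL padding bytes from the example above.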
Good, you're an expert now, just like me!
# Header Struct
This is sort of convoluted, but here it goes. Imagine there is a `JSON Header` of some arbitrarily long length (`0x06bb10`). The pickle for this would look like:
```js
// Values
[0x0006bb10][ ..... ]
// Raw bytes (Sizes are in LE)
10 bb 06 00 .....
```
Note that `0x06bb10` is `DWORD` aligned; if we consider a different size like `0x06bb09` (which would need 3 NULL bytes of padding) then we have the following.
```js
// Values
[0x0006bb09][ ..... ][0x000000]
// Raw bytes (Sizes are in LE)
09 bb 06 00 ..... 00 00 00
```
That would seem like it would be enough to pack the header, but ASAR does something a bit more complicated. If we take a header of size `0xe8dd` then the header is packed like so:
```js
// Step 1 - Pickle the header
[0x0000e8dd][ ..... ][0x000000]
// Step 2 - Pickle the pickle (but why?)
[0x0000e8e4][Step 1]
// Step 3 - Pickle the pickle the pickle (really though?)
[0x0000e8e8][Step 2]
```
This is prefixed by a `DWORD` of `0x00000004`. My assumption is that this is the payload size of an outer pickle holding a single `DWORD` (the total header size), though it could also just be a magic value. We could confirm by looking at the [asar](https://github.com/electron/asar) repo and reading the implementation `¯\_(ツ)_/¯`.
The final representation of a `JSON Header` of size `0xe8dd` can be seen below:
```js
// We represent the data here as DWORD's
0x00000004 0x0000e8e8 0x0000e8e4 0x0000e8dd // 16 bytes
0x................... 0x000000 // Raw JSON Header 0xe8dd
```
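The arithmetic above can be sketched in code. Here's a small function (name and structure are my own, not from the asar source) that computes the three nested sizes for a JSON header of a given length and emits the 16-byte prefix shown above:

```js
// Sketch: build the 16-byte ASAR header prefix for a JSON header of a given
// length, reproducing the nested-pickle sizes above. Names are my own.
function asarHeaderPrefix(jsonLength) {
  const padded = (jsonLength + 3) & ~3; // JSON padded to DWORD alignment
  const payload = 4 + padded;           // string pickle payload: length field + padded data
  const total = 4 + payload;            // string pickle total: payload-size field + payload
  const buf = Buffer.alloc(16);
  buf.writeUInt32LE(4, 0);              // payload size of the outer "size" pickle (one DWORD)
  buf.writeUInt32LE(total, 4);
  buf.writeUInt32LE(payload, 8);
  buf.writeUInt32LE(jsonLength, 12);
  return buf;
}
```

For `jsonLength = 0xe8dd` this yields `0x00000004 0x0000e8e8 0x0000e8e4 0x0000e8dd`, matching the worked example.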
# JSON Header
So the `JSON Header` is just a nested list of objects that describe a directory. Fairly easy to understand. Let's look at some sample data.
Here we can see what a directory looks like, or how it is represented in JSON.
```json
"test": {
"files": {
"file1.txt": {
"size": 2224,
"offset": "13950963",
"integrity": {
"algorithm": "SHA256",
"hash": "fb0505567d018af7bf21ba566960d814ee0b70fb653cb49f421bba18ff14d90b",
"blockSize": 4194304,
"blocks": [
"fb0505567d018af7bf21ba566960d814ee0b70fb653cb49f421bba18ff14d90b"
]
}
},
"file2.txt": {
"size": 12223,
"offset": "13953187",
"integrity": {
"algorithm": "SHA256",
"hash": "c1b35b7ac6187b587685d55f0bc586b3bcbfbb9bbbe646f8a5fd9e971f5f2167",
"blockSize": 4194304,
"blocks": [
"c1b35b7ac6187b587685d55f0bc586b3bcbfbb9bbbe646f8a5fd9e971f5f2167"
]
}
}
}
}
```
In this case, there is a folder called `test`. It's a small folder with only two files (`file1.txt` and `file2.txt`). This gives you the important information you need to access these files in the ASAR archive.
- `size`: The size of the file in the archive
- `offset`: The offset of the file in the archive (note that it's stored as a string)
- `integrity`: A nested object with SHA-256 validation hashes which can be used to check the file (not necessarily actually checked)
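Walking this structure is simple. Here's a sketch (the helper name and the exact top-level shape are my assumptions, not part of the asar API) that resolves a path against a header fragment like the one above:

```js
// Sketch: walk a JSON header fragment to find a file's metadata.
// Assumes each directory node has a "files" object; helper name is my own.
function findEntry(node, entryPath) {
  let current = node;
  for (const part of entryPath.split('/')) {
    if (!current.files || !current.files[part]) return null;
    current = current.files[part];
  }
  return current;
}

// Trimmed-down version of the sample header above
const header = {
  files: {
    test: {
      files: {
        'file1.txt': { size: 2224, offset: '13950963' },
        'file2.txt': { size: 12223, offset: '13953187' }
      }
    }
  }
};

const entry = findEntry(header, 'test/file1.txt'); // { size: 2224, offset: '13950963' }
```

With `size` and `offset` in hand you can read the file's bytes out of the archive's data section.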
If the test folder had a subfolder you can imagine it's just a nested object, something like this:
```json
"test": {
"test2": {
"files": {
......
}
}
}
```
One more thing to mention is that files can be in the ASAR (naturally, that makes sense), but they can also be `paged out` of the archive, where they exist on disk and only maintain their reference in the `JSON Header`. You can see an example of this below.
```json
"WellNow": {
"files": {
"jumanji.txt": {
"size": 1165120,
"unpacked": true,
"integrity": {
"algorithm": "SHA256",
"hash": "1774fe73119cf3d5ae92d416ca8ce04199993ad7d753d249f6ce5cb12ec26b8c",
"blockSize": 4194304,
"blocks": [
"1774fe73119cf3d5ae92d416ca8ce04199993ad7d753d249f6ce5cb12ec26b8c"
]
}
}
}
}
```
You can see that this is very similar to what we saw before, except that the `jumanji.txt` file object has a property called `unpacked` and no `offset` property.
There is an implicit limitation here on unpacking and repacking ASAR archives. If you unpack the archive you lose some context that exists in the `JSON Header`. That means that, unless you use the original `JSON Header` as input, you can't repack an ASAR to its original state.
# General Layout
Just so you have an idea of the general layout visually (assuming a size of `0xe8dd`).
```
------------------------------------------
| 16-byte file header | Size | --------
------------------------------------------ | Size of JSON Header
| | <-------
| |
| JSON Header | --------------------------
| ---------| File 4 - JSON Object |
| |0x000000| |
------------------------------------------ |
| File 1 | |
------------------------------------------ |
| File 2 | |
------------------------------------------ |
| File 3 | |
------------------------------------------ |
| File 4 | <-------------------------
------------------------------------------
| File 5 |
------------------------------------------
| File 6 |
------------------------------------------
| ...... |
------------------------------------------
| File n |
------------------------------------------
```
# Thoughts?
So I'm a bit conflicted about ASAR archives. I think they have some good properties:
- Simple format
- `JSON Header`, this is good because everything speaks or can be made to speak JSON. Especially considering the Electron context where JSON is supported out-of-the-box
- The lack of compression means that files can be used in place; more of a VFS than an archive, if that distinction makes sense
- No need for cryptography beyond SHA hashing
Some design decisions make no sense though and could (/should) be abandoned I think:
- Why do you need `Chromium-pickle-js`? You don't, stop it!
- Padding for `DWORD` alignment is also not necessary; `Byte[]` readers don't have a `DWORD` alignment requirement, so what is the point exactly?
- Look, you really want to use `Chromium-pickle-js`, I get it. But why do you need three nested pickles on the `JSON Header`. You don't, stop it!
# Custom .NET archive format
So I had this idea to create a horrible mutant version of ASAR called `GNAR`. The idea would be to have some simple magic bytes, followed by an array of serialized .NET objects describing the archive contents, followed by an array of file objects (gzip compressed). These could also be encrypted if that is desirable.
```
------------------------------------------
| GNAR | Boolean Crypto | Header Count | --> 12 bytes
------------------------------------------
| |
| |
| .NET Serialized Array | --------------------------
| Of File Objects | File 4 - .NET Object |
| | |
------------------------------------------ |
| File 1 | |
------------------------------------------ |
| File 2 | |
------------------------------------------ |
| File 3 | |
------------------------------------------ |
| File 4 | <-------------------------
------------------------------------------ |
| File 5 | |
------------------------------------------ |__ AES decrypt?
| File 6 | |__ gzip decompress
------------------------------------------
| ...... |
------------------------------------------
| File n |
------------------------------------------
```
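To be clear, `GNAR` is entirely hypothetical; nothing here exists. But the 12-byte header from the diagram could be built like so (field widths of 4 bytes each are my assumption, and the helper name is invented):

```js
// Purely hypothetical: the 12-byte GNAR header sketched above.
// GNAR is not a real format; 4-byte field widths are my assumption.
function gnarHeader(encrypted, headerCount) {
  const buf = Buffer.alloc(12);
  buf.write('GNAR', 0, 'ascii');           // magic bytes
  buf.writeUInt32LE(encrypted ? 1 : 0, 4); // crypto flag
  buf.writeUInt32LE(headerCount, 8);       // number of file entries
  return buf;
}
```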
Listen, it would be great, trust the format! If someone promises me that they will write a deserialization exploit for the archiving utility (`BinaryFormatter`) then I'll create the spec. But I want to see `calc` pop or the deal is off!