ZIP bombs: how they work, or how a small archive can easily crash your computer


If you've ever hosted a website or administered a server, you're probably well aware of the bad people who try to do all sorts of bad things to your property.

When I first hosted my little Linux box with SSH access at the age of 13, I looked at the logs and every day I saw IP addresses (mostly from China and Russia) trying to connect to my sweet little box (which was actually an old ThinkPad T21 laptop with a broken display, buzzing away under the bed). I reported these IPs to their providers.

In fact, if you have a Linux server with open SSH, you can see for yourself how many connection attempts occur daily:

Grep "authentication failures" /var/log/auth.log


Hundreds of unsuccessful authorization attempts, although the server has password authentication disabled and is running on a non-standard port

WordPress is to blame

Okay, let's face it: web vulnerability scanners existed before WordPress, but after the platform became so popular, most scanners started checking for misconfigured wp-admin folders and unpatched plugins.

So if a budding little hacker gang wants to grab some fresh accounts, they'll download one of these scanner tools and unleash it on a bunch of websites, hoping to gain access to some site and deface it.


Sample logs when scanning with Nikto tool

This is why all servers and website administrators deal with gigabytes of logs full of scan attempts. So I thought...

Is it possible to strike back?

After experimenting with the potential use of IDS or Fail2ban, I was reminded of the good old ZIP bombs from the past.

What exactly is a ZIP bomb?

As it turns out, ZIP compression is great at dealing with repetitive data, so if you have a giant text file filled with repetitive data like all zeros, it will compress very well. And I mean VERY well.

As 42.zip showed, it is possible to compress 4.5 petabytes (4,500,000 gigabytes) into 42 kilobytes. When you try to view the contents of the archive (extract or unzip it), you will probably run out of disk space or RAM.

How to drop a ZIP bomb on a vulnerability scanner?

Unfortunately, web browsers don't understand ZIP, but they do understand GZIP.

So the first thing we'll do is create a 10 GB GZIP file filled with zeros. We could nest multiple rounds of compression, but let's start simple.

dd if=/dev/zero bs=1M count=10240 | gzip > 10G.gzip


Making a bomb and checking its size

As you can see, its size is 10 MB. It could have been compressed better, but that's enough for now.

Now let's install a PHP script that will deliver it to the client.
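The original post implements this step as a short PHP script, which is not reproduced here. As an illustration of the same idea, here is a minimal sketch in Python using only the standard library (the file name 10G.gzip and the port are assumptions): the server sends the pre-compressed file with a Content-Encoding: gzip header, so the client inflates the 10 GB of zeros on its own side.

import http.server

BOMB_FILE = "10G.gzip"   # the file created with the dd | gzip command above

class BombHandler(http.server.BaseHTTPRequestHandler):
    def send_bomb(self):
        # Declare the body as gzip-encoded; the client will try to inflate
        # all 10 GB while reading the 10 MB response.
        with open(BOMB_FILE, "rb") as f:
            payload = f.read()
        self.send_response(200)
        self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        # For now, every request gets the bomb; a user-agent gate is sketched below.
        self.send_bomb()

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), BombHandler).serve_forever()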

Ready!

Now we can use it as a simple defense:
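Continuing the sketch above, the "defense" is just a check on the request before deciding whether to serve the bomb or the normal page. The signature list here is an assumption for illustration, not an exhaustive one:

SCANNER_SIGNATURES = ("nikto", "sqlmap", "dirbuster", "masscan", "nessus")

def looks_like_scanner(user_agent: str) -> bool:
    # Very naive check: match known scanner names in the User-Agent header.
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in SCANNER_SIGNATURES)

# Inside BombHandler.do_GET:
#     if looks_like_scanner(self.headers.get("User-Agent", "")):
#         self.send_bomb()
#     else:
#         ...serve the regular page...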

Obviously, this script is not the epitome of elegance, but it can protect us from the script kids mentioned earlier, who have no idea that the user-agent can be changed in scanners.

So... What happens if you run this script?


(if you tested the bomb on other devices/browsers/scripts, please let me know)

The article mentions 9 layers of zip files, so this is not a simple case of sticking together a bunch of zeros. Why 9, why 10 files in each?

Firstly, the Wikipedia article currently talks about 5 layers of 16 files. I don't know where the discrepancy comes from, but it isn't what really matters. The real question is why use nesting in the first place.

DEFLATE, the only supported compression method for zip files*, has a maximum compression ratio of 1032. This can be achieved asymptotically for any repeating sequence of 1-3 bytes. No matter what you do with the zip file, as long as it is only used with DEFLATE, the size of the decompressed file will be no more than 1032 times the size of the original zip file.

Therefore, to achieve truly outrageous compression ratios, you must use nested zip files. With 2 layers of compression, the maximum ratio is 1032^2 = 1,065,024. With 3 it's 1,099,104,768, and so on. For the 5 layers used in 42.zip, the theoretical maximum compression ratio is 1,170,572,956,434,432. As you can see, the actual expansion of 42.zip is nowhere near that level. Part of this is the overhead of the zip format, and part of it is that its creators just didn't care.
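Those figures are easy to verify; a quick sketch of the arithmetic:

# Maximum theoretical expansion for n layers of nested DEFLATE (zip-in-zip),
# ignoring container overhead: each layer multiplies the size by at most 1032.
for layers in (1, 2, 3, 5):
    print(layers, 1032 ** layers)
# 5 layers -> 1,170,572,956,434,432, so a 42 KB outer file could in principle
# cover far more than the 4.5 PB that 42.zip actually expands to.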

If I had to guess, I would say that 42.zip was made by creating a large empty file, then zipping and copying it over and over. There is no attempt to push format limits or maximize compression or anything like that; they just arbitrarily chose 16 copies per layer. The idea was to create a large payload without much effort.

Note. Other compression formats, such as bzip2, offer much, much, much higher maximum compression ratios. However, most zip parsers do not accept them.

P.S. You can create a zip file that extracts a copy of itself (a quine). You can also make one that unpacks into multiple copies of itself. So if you recursively decompress the file forever, the maximum possible size is infinite. The only limitation is that it can grow by a factor of at most 1032 per iteration.

P.P.S. The value of 1032 assumes that the file data in the zip does not overlap. One quirk of the zip file format is that it has a central directory listing the files in the archive along with offsets to their file data. If you create multiple file entries pointing to the same data, you can achieve much higher compression ratios even without nesting, but such a zip file will likely be rejected by parsers.

Decompression bomb testing

A decompression bomb is a file designed to crash, or render useless, the program or system that reads it, i.e. a denial of service. The files in this project can be used to test how vulnerable an application is to this type of attack.

Download Bombs

A zip bomb, also known as a zip of death or decompression bomb, is a malicious archive file designed to crash, or render useless, the program or system reading it. It is often used to disable antivirus software in order to create an opening for more traditional viruses. Instead of hijacking the normal operation of a program, a zip bomb lets the program run as expected, but the archive is carefully crafted so that unpacking it (for example, by a virus scanner looking for malware) requires excessive amounts of time, disk space, or memory.

A zip bomb is usually a small file, to make it easy to transfer and to avoid suspicion. However, when the file is unpacked, its contents expand to more than the system can handle. A well-known example is the file 42.zip, a zip file containing 42 kilobytes of compressed data, with five levels of nested zip files in sets of 16, each bottom-level archive containing a 4.3-gigabyte file (4,294,967,295 bytes; ~3.99 GiB), for a total of 4.5 petabytes (4,503,599,626,321,920 bytes; ~3.99 PiB) of uncompressed data. Such files can still be downloaded from various websites on the Internet. Many virus scanners perform only a few levels of recursion on archives to prevent attacks that would cause a buffer overflow, an out-of-memory condition, or exceed an acceptable program execution time. Zip bombs often (if not always) rely on repetition of identical files to achieve their extreme compression ratios. Dynamic programming techniques can be used to limit the traversal of such files, so that only one file is followed recursively at each level, effectively converting their exponential growth into linear growth.
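As an illustration of the recursion-limiting defense described above (not the code of any particular scanner), here is a minimal sketch that walks nested zip archives with Python's zipfile module and refuses to go past an assumed depth limit or total-size budget:

import io
import zipfile

MAX_DEPTH = 3                         # assumed limits for the sketch
MAX_TOTAL_BYTES = 100 * 1024 * 1024

def scan(data: bytes, depth: int = 0, budget=None):
    # Walk a zip archive, descending into nested zips, but abort once the
    # recursion depth or the declared uncompressed size exceeds the limits.
    if budget is None:
        budget = [MAX_TOTAL_BYTES]
    if depth > MAX_DEPTH:
        raise RuntimeError("recursion depth exceeded - possible zip bomb")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            budget[0] -= info.file_size          # size claimed by the metadata
            if budget[0] < 0:
                raise RuntimeError("size budget exceeded - possible zip bomb")
            if info.filename.lower().endswith(".zip"):
                scan(zf.read(info), depth + 1, budget)

with open("suspect.zip", "rb") as f:   # "suspect.zip" is a placeholder name
    scan(f.read())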

When testing, it's always better to start small and work your way up. Starting to work with the largest file can seriously harm the application or system - use these bombs with great caution.

All files have been bzipped to bypass GitHub's 50 MB file upload limit. Groups of files were zipped and then bzipped again. Remove these extra layers of encoding before testing.

Additional sources

  • HTTP/2: In-depth analysis of the top four flaws of the next generation web protocol
  • You're not looking at the big picture
  • In the Compression Hornet's Nest: A Security Study of Data Compression in Network Services
  • Devilish HTTP compression: compression bombs

*There are two versions of 42.zip: the old one is 42,374 bytes, and the newer is 42,838 bytes. The difference is that the new one requires a password before unpacking. We compare only with the old version. Here's a copy of the file if you need it: 42.zip.

Zip bombs must overcome the fact that DEFLATE, the compression algorithm most commonly supported by zip parsers, cannot achieve a compression ratio greater than 1032 to 1. For this reason, zip bombs typically rely on recursive decompression, nesting zip files within zip files to gain an extra factor of 1032 with each layer. But the trick only works on implementations that unpack recursively, and most don't. The best-known bomb, 42.zip, expands to a formidable 4.5 PB if all six layers are recursively decompressed, but to a paltry 0.6 MB at the top layer. Zip quines, like those of Cox and Ellingsen, produce a copy of themselves and thus expand infinitely when recursively unzipped, but they, too, are perfectly safe to unzip once.

This article shows how to create a non-recursive zip bomb whose compression ratio exceeds the DEFLATE limit of 1032. It works by overlapping files within the zip container, so that a "kernel" of highly compressed data is referenced by many files without making multiple copies of it. The zip bomb's output size grows quadratically with its input size; that is, the compression ratio improves as the bomb gets bigger. The construction depends on features of both zip and DEFLATE: it does not transfer directly to other file formats or compression algorithms. The bomb is compatible with most zip parsers, except "streaming" parsers that process the file in one pass without first consulting the central directory. We try to balance two conflicting goals:

  • Maximize the compression ratio. We define the compression ratio as the sum of the sizes of all files in the archive divided by the size of the zip file itself. File names and other file system metadata are not counted, only the contents.
  • Maintain compatibility. Zip is a complicated format and parsers differ, especially around edge cases and optional features. We avoid techniques that only work with certain parsers, but will point out some ways to make the zip bomb more effective at some loss of compatibility.

Zip file structure

A zip file consists of a central directory that refers to files.

The central directory is at the end of the zip file. It is a list of central directory headers. Each central directory header contains metadata for a single file, such as its file name and CRC-32 checksum, and a backwards pointer to a local file header. A central directory header is 46 bytes long, plus the length of the file name.

A file consists of a local file header followed by the compressed file data. The local file header is 30 bytes long, plus the length of the file name. It contains a redundant copy of the metadata from the central directory header, along with the compressed and uncompressed sizes of the file data that follows it. Zip is a container format, not a compression algorithm. Each file's data is compressed using an algorithm specified in the metadata - usually DEFLATE.
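Python's zipfile module exposes part of this structure: every entry it reads from the central directory carries the offset of the local file header it points to, along with the recorded sizes. A small sketch (42.zip is used only as an example file name):

import zipfile

# For each central directory entry, print the offset of the local file header
# it points back to, and the compressed/uncompressed sizes recorded for it.
with zipfile.ZipFile("42.zip") as zf:
    for info in zf.infolist():
        print(f"{info.filename!r:12} LFH at {info.header_offset:8d}  "
              f"compressed={info.compress_size}  uncompressed={info.file_size}")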

This description of the zip format omits many details that are not needed to understand the zip bomb. For complete information, see section 4.3 of APPNOTE.TXT or "PKZip File Structure" by Florian Buchholz.

The considerable redundancy and many ambiguities in the zip format open up opportunities for mischief of all kinds. The zip bomb is just the tip of the iceberg. Links for further reading:

$ python3 -m zipfile -e overlap.zip .
Traceback (most recent call last):
  ...
__main__.BadZipFile: File name in directory 'B' and header b'A' differ.

Next, we'll look at how to change the design for filename consistency while still retaining most of the benefits of overlapping files.

Second insight: quoting local file headers

We need to give each file its own local file header, while still reusing a single kernel. Simply concatenating all the local file headers doesn't work, because the zip parser will find a local file header where it expects the beginning of a DEFLATE stream. But the idea can be made to work, with a small change. We'll use a DEFLATE feature, uncompressed blocks, to "quote" local file headers so that they appear to be part of the same DEFLATE stream that terminates in the kernel. Every local file header (except the first) will then be interpreted in two ways: as code (part of the zip file structure) and as data (part of a file's contents).

A DEFLATE stream is a sequence of blocks, where each block may be compressed or uncompressed. We usually think only of compressed blocks; for example, the kernel is one big compressed block. But there are also uncompressed blocks, which start with a 5-byte header containing a length field, and which simply mean: "output the next n bytes verbatim." Decompressing an uncompressed block means only stripping the 5-byte header. Compressed and uncompressed blocks can be freely mixed in a DEFLATE stream. The output is the concatenation of the results of decompressing all the blocks in order. The term "uncompressed" only has meaning at the DEFLATE level; the file data still counts as "compressed" at the zip level, no matter what kinds of blocks are used.
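A stored (uncompressed) block really is just a 5-byte header followed by the literal bytes. A minimal sketch that builds one by hand and feeds it to a raw DEFLATE decompressor:

import struct
import zlib

def stored_block(data: bytes, final: bool = True) -> bytes:
    # First byte: BFINAL flag plus block type 00 (stored). Then LEN and its
    # one's complement NLEN as little-endian 16-bit fields, then the raw bytes.
    header = struct.pack("<BHH", 1 if final else 0, len(data), len(data) ^ 0xFFFF)
    return header + data

raw = stored_block(b"quoted local file header would go here")
print(zlib.decompress(raw, -15))   # wbits=-15 means a raw DEFLATE stream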

The easiest way to think of this construction is from the inside out, from the last file to the first. We start by inserting the kernel, which will form the end of the file data for every file. Prepend a local file header LFH_N and append a central directory header CDH_N that points to it. Set the "compressed size" metadata field in LFH_N and CDH_N to the compressed size of the kernel. Now prepend a 5-byte uncompressed block header (green in the diagram) whose length field equals the size of LFH_N. Prepend a second local file header LFH_N−1 and append a central directory header CDH_N−1 that points to it. Set the "compressed size" field in LFH_N−1 and CDH_N−1 to the compressed size of the kernel plus the size of the uncompressed block header (5 bytes) plus the size of LFH_N.

At this point the zip file contains two files, named Y and Z. Let's walk through what a parser sees when processing it. Suppose the compressed size of the kernel is 1000 bytes and the size of LFH_N is 31 bytes. We start at CDH_N−1 and follow its pointer to LFH_N−1. The first file's name is Y and the compressed size of its file data is 1036 bytes. Interpreting the next 1036 bytes as a DEFLATE stream, we first encounter the 5-byte header of an uncompressed block that says to copy the next 31 bytes. We write out the next 31 bytes, which are LFH_N, decompressed and appended to file Y. Continuing in the DEFLATE stream, we find a compressed block (the kernel), which we decompress into file Y. We have now reached the end of the compressed data and are done with file Y.

Moving on to the next file, we follow the pointer from CDH_N to LFH_N and find a file named Z whose compressed size is 1000 bytes. Interpreting those 1000 bytes as a DEFLATE stream, we immediately encounter a compressed block (the kernel again) and decompress it into file Z. We have now reached the end of the final file and are done. The output file Z contains the decompressed kernel; the output file Y is the same, but additionally prefixed with the 31 bytes of LFH_N.

We complete the construction by repeating the quoting procedure until the zip archive contains the desired number of files. Each new file adds a central directory header, a local file header, and an uncompressed block to quote the immediately following local file header. Each file's compressed data is thus a chain of quoted local file headers (uncompressed DEFLATE blocks) followed by the compressed kernel. Each byte in the kernel contributes about 1032 × N bytes to the output, because it is part of all N files. The output files are of different sizes, too: earlier files are larger than later ones because they quote more local file headers. The contents of the output files don't make much sense, but nobody said they had to.

This quoted-overlap construction has better compatibility than the full-overlap construction from the previous section, but the compatibility comes at the cost of compression. There, each added file cost only a central directory header; here it costs a central directory header, a local file header, and another 5 bytes for the quoting block header.

Optimization

Now that we have the basic zip bomb construction, we will try to make it as efficient as possible. We want to answer two questions:
  • What is the maximum compression ratio for a given zip file size?
  • What is the maximum compression ratio given the limitations of the zip format?

Kernel compression

It is beneficial for us to compress the kernel as much as possible, because each unpacked byte is multiplied by N. For this purpose, we use a custom DEFLATE compressor called bulk_deflate, specialized in compressing a string of repeating bytes.

Any decent DEFLATE compressor will approach a compression ratio of 1032 on an infinite stream of repeating bytes, but we care about specific sizes. For an archive of our size, bulk_deflate fits in more data than the general-purpose compressors: about 26 KB more than zlib and Info-ZIP, and about 15 KB more than Zopfli, which trades speed for compression quality.

The price of bulk_deflate's high compression is a lack of generality. It can only compress strings of repeating bytes, and only of certain lengths, namely 517 + 258·k bytes for integer k ≥ 0. Besides compressing well, bulk_deflate is fast, doing its job in essentially constant time regardless of the size of the input, not counting the work of actually writing out the compressed string.

File names

For our purposes, file names are practically dead weight. Although they contribute to the output size by being part of the quoted headers of local files, a byte in the filename contributes much less than a byte in the kernel. We want file names to be as short as possible, but different, without forgetting compatibility.

Every byte spent on a file name is two bytes not spent on the kernel (two, because each file name appears twice: once in the central directory header and once in the local file header). A file name byte contributes, on average, only (N + 1) / 4 bytes to the output, while a byte in the kernel counts for 1032 × N.

The first compatibility consideration is character encoding. The zip format specification says that file names are to be interpreted as CP 437, or as UTF-8 if a certain flag bit is set (APPNOTE.TXT, Appendix D). This is a major point of incompatibility between zip parsers, which may interpret file names in some fixed or locale-specific encoding. So for compatibility it is best to restrict ourselves to characters that have the same encoding in both CP 437 and UTF-8: namely, the 95 printable US-ASCII characters.

We are also bound by file system naming restrictions. Some file systems are case-insensitive, so "a" and "A" are not considered different names. Common file systems such as FAT32 disallow certain characters such as "*" and "?".

As a safe, but not necessarily optimal, compromise, our zip bomb will use filenames from the 36-character alphabet, which does not include special or mixed-case characters:

0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
File names are generated in the obvious way: cycling each position through the alphabet in order, and adding a new position when the existing ones roll over:

"0", "1", "2", ..., "Z", "00", "01", "02", ..., "0Z", ..., "Z0", "Z1", "Z2", ..., "ZZ", "000", "001", "002", ...
There are 36 filenames that are one character long, 36² filenames that are two characters long, and so on. Four bytes is enough for 1,727,604 different file names.
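The naming scheme can be written directly with itertools; a sketch:

import itertools

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def filenames():
    # Yields "0", "1", ..., "Z", "00", "01", ..., "ZZ", "000", ... in order.
    for length in itertools.count(1):
        for chars in itertools.product(ALPHABET, repeat=length):
            yield "".join(chars)

print(list(itertools.islice(filenames(), 40)))
print(36 + 36**2 + 36**3 + 36**4)   # 1727604 names of at most four characters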

Given that file names in the archive will have different lengths, which order is best: shortest to longest, or the reverse? A little thought shows that it is better to put the longest names last. This ordering adds over 900 MB of output to zblg.zip, compared with ordering longest-first. It is a minor optimization, though, since 900 MB is only 0.0003% of the total output size.

Kernel size

The quoted-overlap construction allows us to store a compressed kernel of data and then cheaply copy it many times. For a given zip file size X, how much space is best devoted to storing the kernel, and how much to making copies?

To find the optimal balance, we only have to optimize a single variable, N, the number of files in the zip archive. Every value of N requires a certain amount of overhead for central directory headers, local file headers, quoting block headers, and file names. All the remaining space can be taken up by the kernel. Because N must be an integer, and you can only fit so many files before the kernel size drops to zero, it suffices to check every possible value of N and select the one that yields the most output.

Applying the optimization procedure to X = 42,374, the size of 42.zip, finds the maximum at N = 250. Those 250 files require 21,195 bytes of overhead, leaving 21,179 bytes for the kernel. A kernel of that size decompresses to 21,841,249 bytes (a ratio of 1031.3 to 1). The 250 copies of the decompressed kernel, plus a little extra from the quoted local file headers, give a total decompressed output of 5,461,307,620 bytes and a compression ratio of 129,000.
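Those numbers can be reproduced with a simplified model of the overhead (this is a sketch, not the author's actual generator): each file costs a 46-byte central directory header plus a 30-byte local file header plus two copies of its name, every file after the first costs a 5-byte quoting block header, the end-of-central-directory record costs 22 bytes, and the kernel's ratio is approximated as 1032.

def name_len(i: int) -> int:
    # Length of the i-th file name (1-based) over a 36-character alphabet.
    n, length = i, 1
    while n > 36 ** length:
        n -= 36 ** length
        length += 1
    return length

def plan(x: int, n: int):
    # Overhead: headers and two name copies per file, one 5-byte quoting block
    # header per quoted LFH (n - 1 of them), and a 22-byte end-of-central-directory.
    overhead = sum(76 + 2 * name_len(i) for i in range(1, n + 1)) + 5 * (n - 1) + 22
    kernel = x - overhead
    return overhead, kernel, n * kernel * 1032     # rough estimate of the output

best_n = max(range(1, 400), key=lambda n: plan(42374, n)[2] if plan(42374, n)[1] > 0 else 0)
print(best_n, plan(42374, best_n))   # 250 -> overhead 21195, kernel 21179, as above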

CRC-32 can be modeled as a state machine that updates a 32-bit state register for every input bit. The basic update operations for a 0 bit and a 1 bit are:

uint32 crc32_update_0(uint32 state) {
    // Shift out the least significant bit.
    bit b = state & 1;
    state = state >> 1;
    // If the shifted-out bit was 1, XOR with the CRC-32 constant.
    if (b == 1)
        state = state ^ 0xedb88320;
    return state;
}

uint32 crc32_update_1(uint32 state) {
    // Do as for a 0 bit, then XOR with the CRC-32 constant.
    return crc32_update_0(state) ^ 0xedb88320;
}
If you represent the state register as a binary vector of 32 elements, and use XOR for addition and AND for multiplication, then crc32_update_0 is a linear map; that is, it can be represented as multiplication by a 32×32 binary transition matrix. To see why, observe that multiplying a matrix by a vector is just summing the columns of the matrix, after multiplying each column by the corresponding element of the vector. The shift operation state >> 1 simply takes each bit i of the state vector and multiplies it by a vector that is zero everywhere except at bit i − 1 (numbering the bits from right to left). And the conditional final XOR, state ^ 0xedb88320, happens only when bit b is one; that can be represented as first multiplying b by 0xedb88320 and then XORing it into the state.

Furthermore, crc32_update_1 is just crc32_update_0 followed by an XOR with a constant. That makes crc32_update_1 an affine transformation: a matrix multiplication followed by a translation (i.e., a vector addition). We can represent both the matrix multiplication and the translation in a single step by enlarging the transformation matrix to 33×33 and appending an extra element to the state vector that is always 1 (this representation is called homogeneous coordinates).


The 33×33 transformation matrices M_0 and M_1 that compute the CRC-32 state change caused by a 0 bit and a 1 bit, respectively. Column vectors are stored with the most significant bit at the bottom: reading the first column from bottom to top gives the CRC-32 polynomial constant edb88320 in hexadecimal, 11101101101110001000001100100000 in binary. The two matrices differ only in the final column, which represents the translation vector in homogeneous coordinates. In M_0 the translation is zero, and in M_1 it is edb88320, the CRC-32 polynomial constant. The ones just above the diagonal represent the shift operation state >> 1.

Both operations, crc32_update_0 and crc32_update_1, can be represented by 33×33 transition matrices; the matrices M_0 and M_1 are shown above. The benefit of the matrix representation is that matrices can be multiplied. Say we want the state change produced by processing the ASCII character "a", whose binary representation is 01100001. We can represent the cumulative CRC-32 state change of those eight bits as a single transformation matrix:


And we can represent the state change produced by a string of repeated "a"s by multiplying many copies of M_a together - that is, by raising the matrix to a power. We can do that quickly using the fast exponentiation algorithm, which computes M_a^n in only about log2(n) steps. For example, here is the state change matrix for a string of 9 "a" characters:

The fast exponentiation algorithm is useful for computing M_kernel, the matrix for the uncompressed kernel, because the kernel is a string of repeated bytes. To extract a CRC-32 checksum from a matrix, multiply the matrix by the zero vector (the zero vector in homogeneous coordinates, that is: 32 zeros followed by a 1; here we omit the minor complication of pre- and post-conditioning the checksum). To compute the checksum for every file, we work backwards. Start by initializing M := M_kernel. The kernel's checksum is also the checksum of the final file N, so we multiply M by the zero vector and store the resulting checksum in CDH_N and LFH_N. The data of file N−1 is the same as the data of file N, but with LFH_N prepended. So we compute the state change matrix for LFH_N and update M with it. Now M represents the cumulative state change from processing LFH_N followed by the kernel. We compute the checksum for file N−1 by again multiplying M by the zero vector. We continue this procedure, accumulating state change matrices into M, until all files have been processed.
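To make this concrete, here is a small sketch of the same idea (it is not the author's code). Instead of explicit 33×33 homogeneous matrices, it stores the equivalent affine map as a pair (A, c), where A is a 32×32 GF(2) matrix kept as 32 column integers and c is the translation vector; fast exponentiation then yields the state change for n identical bytes, and the result is checked against zlib.crc32:

import zlib

POLY = 0xEDB88320

def crc_byte(state: int, byte: int) -> int:
    # Plain bitwise CRC-32 update for one input byte (LSB first, zlib polynomial).
    state ^= byte
    for _ in range(8):
        state = (state >> 1) ^ (POLY if state & 1 else 0)
    return state

def mat_apply(cols, v):
    # Multiply a GF(2) matrix (32 column vectors stored as ints) by the vector v.
    out = 0
    for j in range(32):
        if (v >> j) & 1:
            out ^= cols[j]
    return out

def affine_for_byte(byte):
    # The per-byte update is affine in the state: v -> A*v ^ c, with c = f(0).
    c = crc_byte(0, byte)
    A = [crc_byte(1 << j, byte) ^ c for j in range(32)]
    return A, c

def compose(g, f):
    # g after f: v -> Ag*(Af*v ^ cf) ^ cg
    (Ag, cg), (Af, cf) = g, f
    return [mat_apply(Ag, col) for col in Af], mat_apply(Ag, cf) ^ cg

def affine_pow(f, n):
    # Square-and-multiply: the map for n identical bytes in O(log n) compositions.
    result = ([1 << j for j in range(32)], 0)    # identity map
    while n:
        if n & 1:
            result = compose(f, result)
        f = compose(f, f)
        n >>= 1
    return result

n = 1_000_000
A, c = affine_pow(affine_for_byte(ord("a")), n)
crc = mat_apply(A, 0xFFFFFFFF) ^ c ^ 0xFFFFFFFF   # pre- and post-conditioning
assert crc == zlib.crc32(b"a" * n)
print(hex(crc))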

Extension: Zip64

Earlier we ran into an expansion limit imposed by the zip format: no matter how cleverly the zip file is packed, it cannot produce more than 281 TB of output. Those limits can be exceeded using Zip64, an extension to the zip format that increases the size of certain header fields to 64 bits. Zip64 support is far from universal, but it is one of the most commonly implemented extensions. As far as the compression ratio is concerned, the effect of Zip64 is to increase the central directory header size from 46 to 58 bytes, and the local file header size from 30 to 50 bytes. Looking at the formula for the optimal expansion in the simplified model, we see that a Zip64 zip bomb still grows quadratically, but more slowly because of the larger denominator - as seen in the diagram below. In exchange for reduced compatibility and slower growth, we remove practically all limits on the file size.

Let's say we want a zip bomb that expands to 4.5 PB like 42.zip. How big should the archive be? Using binary search, we find that the minimum size of such a file is 46 MB.

  • zbxl.zip 46 MB → 4.5 PB (Zip64, less compatible)
zipbomb --mode=quoted_overlap --num-files=190023 --compressed-size=22982788 --zip64 > zbxl.zip
4.5 petabytes - that's about the amount of data the Event Horizon Telescope recorded for the first image of a black hole: racks and racks of hard drives in a data center.

With Zip64 it's almost no longer interesting to ask for the maximum compression ratio, because we can just keep increasing the zip file size, and the compression ratio along with it, until even the compressed zip file becomes prohibitively large. An interesting threshold, though, is 2^64 bytes (18 EB or 16 EiB) - more data than will fit on most file systems. Binary search finds the smallest zip bomb that produces at least that much output: it contains 12 million files and a 1.5 GB compressed kernel. The total size of the zip file is 2.9 GB and it decompresses to 2^64 + 11,727,895,877 bytes, a compression ratio of over 6.2 billion to one. I haven't made this file available for download, but you can generate it yourself with the command below. Its files are so large that they exposed a bug in Info-ZIP UnZip 6.0.

zipbomb --mode=quoted_overlap --num-files=12056313 --compressed-size=1482284040 --zip64 > zbxxl.zip

Extension: bzip2

DEFLATE is the most common compression algorithm used in the zip format, but it is only one of many options. Probably the second most common is bzip2, although it is not nearly as compatible as DEFLATE. Theoretically, bzip2 has a maximum compression ratio of about 1.4 million to one, which allows for denser packing of the kernel.

bzip2 first applies run-length encoding, which shrinks a string of repeating bytes by a factor of 51. The data is then split into 900 KB blocks, and each block is compressed separately. In theory, one block can compress down to 32 bytes. 900,000 × 51 / 32 = 1,434,375.
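You can get a feel for that ratio with Python's standard bz2 module: compressing one block's worth of input (900,000 × 51 bytes of zeros) should produce a ratio within an order of magnitude of the theoretical figure.

import bz2

block = b"\x00" * (900_000 * 51)   # one block's worth of input, before bzip2's RLE stage
compressed = bz2.compress(block, compresslevel=9)
print(len(block), len(compressed), len(block) // len(compressed))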

Ignoring the compatibility losses, does bzip2 make a more effective bomb?

Yes - but only for small files. The problem is that bzip2 has nothing like the uncompressed blocks of DEFLATE that we used to quote local file headers. It is therefore impossible to overlap files and reuse the kernel - each file must carry its own copy, so the overall compression ratio is no better than the ratio of any single file. In the graph below we see that, without overlap, bzip2 outperforms DEFLATE only for files in the megabyte range.

The only remaining hope is an alternative means of quoting headers in bzip2, discussed in the next section. Separately, if you know that a particular zip parser supports bzip2 and tolerates mismatched file names, you can use the full-overlap construction, which needs no quoting at all.

Comparison of compression ratios of different zip bomb constructions. Note the logarithmic axes. Each construction is shown with and without Zip64. Constructions without overlap have a linear growth rate, visible as a constant slope. The vertical offset of the bzip2 graph means that bzip2's compression ratio is about a thousand times greater than DEFLATE's. The quoted DEFLATE constructions have a quadratic growth rate, visible as a 2:1 slope. The Zip64 variants are slightly less efficient, but allow output beyond 281 TB. The graphs for bzip2 with quoting via the extra field change from quadratic to linear when either the maximum file size (2^32 − 2 bytes) or the maximum permitted number of files is reached.

Extension: quoting via the extra field

So far we have used a DEFLATE feature, uncompressed blocks, to quote local file headers, and we have just seen that the trick does not work in bzip2. There is, however, an alternative quoting method, somewhat more limited, that uses only features of the zip format and is independent of the compression algorithm.

At the end of the local file header structure there is a variable-length extra field for storing information that does not fit into the regular header fields (APPNOTE.TXT, section 4.3.7). The extra information may include, for example, a timestamp or uid/gid from Unix. Zip64 information is also stored in an extra field. The extra field is a length-value structure; if you increase the length without adding a value, the extra field swallows whatever comes after it in the zip file, namely the next local file header. Using this method, each local file header can "quote" the headers that follow it by enclosing them in its own extra field. Compared with DEFLATE quoting, there are three advantages:

  1. Quoting via the extra field only requires 4 bytes rather than 5, leaving more space for the kernel.
  2. It does not increase the size of the files, which means a larger kernel, given the limits of the zip format.
  3. It provides quoting in bzip2.
Despite these advantages, extra-field quoting is less flexible. It does not form a chain as in DEFLATE: each local file header must quote not only the immediately following header, but all of the headers that follow it. The extra fields therefore grow as you approach the beginning of the zip file. Since the maximum length of the field is 2^16 − 1 bytes, it is only possible to quote up to 1808 local file headers (or 1170 with Zip64), assuming file names are assigned as described above. (In the case of DEFLATE you can use the extra field to quote the first (shortest) local file headers, and then switch to DEFLATE quoting for the rest.) There is another problem: to conform to the internal data structure of the extra field, you must choose a 16-bit type tag (APPNOTE.TXT, section 4.5.2) to precede the quoted data. We want a type tag that makes parsers ignore the quoted data rather than try to interpret it as meaningful metadata. Zip parsers are supposed to ignore tags of unknown type, so we could choose a tag at random, but there is a risk that some future assignment of that tag will break the compatibility of the construction.

The earlier diagram illustrates the use of the extra field in bzip2, with and without Zip64. Both graphs have an inflection point where growth changes from quadratic to linear. Without Zip64, that happens where the maximum uncompressed file size (2^32 − 2 bytes) is reached; beyond it, only the number of files can be increased, not their size. The graph ends entirely when the number of files reaches 1809, where we run out of room in the extra field to quote more headers. With Zip64, the inflection point comes at 1171 files, after which only the size of the files can grow, not their number. The extra field also helps in the DEFLATE case, but the gain is so small that it is not visible in the graph. It increases the compression ratio of zbsm.zip by 1.2%, zblg.zip by 0.019%, and zbxl.zip by 0.0025%.

Discussion

In their work on this subject, Ploetz and colleagues use overlapping files to create a nearly self-replicating zip archive. Overlapping files themselves were suggested earlier (slide 47) by Gynvael Coldwind.

We designed the quoted-overlap zip bomb with compatibility in mind, accounting for a number of differences between implementations, some of which are shown in the table below. The resulting construction is compatible with zip parsers that work in the usual way: first consulting the central directory and using it as an index of files. Among them is the example zip parser from Nail, which is automatically generated from a formal grammar. The construction is, however, incompatible with "streaming" parsers, which parse a zip file from start to finish in a single pass without first reading the central directory. By their nature, streaming parsers do not permit file overlap of any kind. Most likely they will extract only the first file. They may even raise an error, as is the case with
