Questions & Answers

gzip -l returning incorrect values for uncompressed file size

I am trying to quickly estimate the number of lines in gzipped files. I do this by checking the uncompressed size of the file, sampling lines from the beginning of the file with zcat filename | head -n 100 (for instance), and dividing the uncompressed size by the average line length of this 100-line sample.

The problem is that the data I'm getting from gzip -l is wrong. Mostly the reported uncompressed size is too small, in some cases producing negative compression ratios. For example, in one case the compressed file is 1.8 GB and gzip -l lists the uncompressed size as 0.7 GB, when it is actually 9 GB once decompressed. I tried decompressing and recompressing the file, but gzip -l still reports the same uncompressed size.

gzip 1.6 on Ubuntu 18.04.3

Comments:
2023-01-17 23:52:17
I wasn't parsing the output, just looking at the print on the console. Anyway, the answer below explains.
Answers (1):

Below is the part of the gzip specification (RFC 1952) that defines how the uncompressed size is stored in a gzip file.

ISIZE (Input SIZE)
    This contains the size of the original (uncompressed) input
    data modulo 2^32.

You are working with a gzip file whose uncompressed size is larger than 2^32 bytes (4 GiB). The stored value has wrapped around, so the uncompressed size reported by gzip -l will always be too small for this file.

Note that this design limitation in the gzip file format doesn't cause any problems when decompressing the file. The only impact is on the size reported by gzip -l or gunzip -l.
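
To see the stored value for yourself, you can read the ISIZE field directly: per RFC 1952 it is the last four bytes of the gzip member, stored little-endian. Here is a minimal Python sketch, assuming the file contains a single gzip member (the function name read_isize is just made up for illustration):

```python
import struct

def read_isize(path):
    """Return the ISIZE trailer field: the uncompressed size modulo 2**32.

    Assumes a single-member gzip file; for concatenated members this
    only reflects the last member.
    """
    with open(path, "rb") as f:
        f.seek(-4, 2)  # 2 == os.SEEK_END: position at the last 4 bytes
        (isize,) = struct.unpack("<I", f.read(4))  # little-endian unsigned 32-bit
    return isize
```

For a file that decompresses to less than 2^32 bytes this returns the true size; for anything larger it returns only the remainder after dividing by 2^32.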

Comments:
2023-01-17 23:52:17
So technically, if the compression ratio is consistent across these files (the character usage/distribution is essentially similar), I can get a rough estimate by multiplying the compressed size by that ratio, and then make it exact by choosing the nearest value of the form n*2^32 + (size reported by gzip -l).
2023-01-17 23:52:17
That's worth investigating with a sample of compressed files. It might work.
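
The idea from the comment above could be sketched roughly like this in Python. The function name and the ratio parameter are assumptions for illustration; the ratio (uncompressed size / compressed size) would have to be measured on a few similar files that you decompress fully.

```python
def estimate_uncompressed_size(isize, compressed_size, ratio):
    """Pick the value n * 2**32 + isize closest to compressed_size * ratio.

    isize           -- size reported by gzip -l (true size modulo 2**32)
    compressed_size -- size of the .gz file in bytes
    ratio           -- assumed uncompressed/compressed ratio (hypothetical input)
    """
    modulus = 2 ** 32
    rough = compressed_size * ratio  # rough guess from the assumed ratio
    # Number of 2**32 wraparounds that brings the candidate closest to the guess
    n = max(0, round((rough - isize) / modulus))
    return n * modulus + isize
```

This only lands on the right value when the rough guess is off by less than 2^31 bytes (2 GiB) from the true size. If the compression ratio varies more than that between files, the result will be wrong by a multiple of 2^32.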