Do not store duplicate files

2019-10-22
2021-07-29
  • Rick C. Hodgin

    Rick C. Hodgin - 2019-10-22

    I have several situations where I have to do a backup of something like:
    c:\path\to\mycode_current\
    c:\path\to\mycode_v1.1\
    c:\path\to\mycode_v1.2\
    c:\path\to\mycode_v1.3\
    etc.

    Many of the files in those directories are identical from version to version; many others do change. Backing up c:\path\to\ produces a zip archive containing tens of megabytes of redundant data, which wastes space and takes extra processing time to compress.

    Could 7-Zip add an option to store one master copy of each set of duplicate files and insert links for every other reference, rather than storing each duplicate in full?

    [x] Allow 7-Zip to store duplicate files only once

    If checked, identical files would be detected and stored once in the .7z archive as master files, with only links to each master stored for the other copies. In directory listings, or when anything in the archive is extracted, 7-Zip would pull the data from the correct location as if each file had been stored in full. The master file handling would be entirely internal to 7-Zip.

    It would greatly reduce the size of backups, while not losing any data on identical files.

    You could use SHA-1 on all files of equal length to determine whether they are the same: one SHA-1 pass from start to finish, plus several partial SHA-1 values computed on sections (perhaps every 1/16th of a file, with a minimum of 4096 bytes per section). If all computed SHA-1 values match, declare the files identical, store the one master, and store links for each reference.
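A minimal sketch of the size-then-hash check described above, written in Python purely for illustration. It is not part of 7-Zip, and the names `sha1_of_file` and `find_duplicate_groups` are hypothetical. It assumes a single full-file SHA-1 per candidate and leaves out the partial per-section hashes.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root):
    """Group files under root that share both size and SHA-1 digest."""
    # Pass 1: bucket by size, so only same-length files are ever hashed.
    by_size = defaultdict(list)
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: within each size bucket, bucket again by digest.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[sha1_of_file(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

Each returned group lists paths with identical content; the first path could serve as the "master" and the rest as references.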

     
  • Igor Pavlov

    Igor Pavlov - 2019-10-23

    You can use the qs option in the Parameters field for a 7z archive.
    7-Zip will sort files by name, so they can use the same dictionary and the compression ratio will be good.
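As a rough illustration of why a shared dictionary helps with repeated content, the sketch below uses Python's standard `lzma` module, not 7-Zip itself. The helper names are hypothetical and the "files" are synthetic byte strings. It compares compressing two identical inputs separately against compressing them as one stream.

```python
import lzma
import random


def pseudo_random_bytes(size, seed):
    """Deterministic, poorly compressible bytes standing in for file content."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def compare_solid_vs_separate(file_a, file_b):
    """Return (separate_total, solid_total) compressed sizes in bytes."""
    separate = len(lzma.compress(file_a)) + len(lzma.compress(file_b))
    solid = len(lzma.compress(file_a + file_b))  # one stream, shared dictionary
    return separate, solid


if __name__ == "__main__":
    original = pseudo_random_bytes(200_000, seed=1)
    separate, solid = compare_solid_vs_separate(original, original)
    print(f"separate: {separate} bytes, single stream: {solid} bytes")
```

In the single-stream case the second copy can be encoded as back-references to the first, provided both copies fall within the dictionary window. That is presumably why placing similar files next to each other matters.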

     
  • Rick C. Hodgin

    Rick C. Hodgin - 2019-10-23

    Wow! Using qs makes a tremendous difference. 18 MB compared to 29 MB on a project's source code history:

    The project had 19 versions (in 19 separate folders): about 750 MB of source files and non-executable binary files (images, databases, etc.), in 25K files across 640 folders.

    The various archive backups were:

    ZIP default Normal  --> 232 MB
    ZIP default Ultra   --> 229 MB
    7z default Ultra    -->  29 MB
    7z qs Ultra         -->  18 MB
    

    How can I use the qs option with 7za on the command line? I don't see the q or qs option.
    7za a -r -qs myarchive.7z .\folder\*.*

     

    Last edit: Rick C. Hodgin 2019-10-23
    • Shell

      Shell - 2019-10-23

      -mqs. You can find more details in the help under the description of the -m switch.

       
      • Rick C. Hodgin

        Rick C. Hodgin - 2019-10-23

        Thank you, Shell.

         
  • Rick C. Hodgin

    Rick C. Hodgin - 2019-10-23

    I did notice one file (a file.zip included in each folder) that has not changed in any version. It was 10,810,124 bytes raw on disk, yet it is shown as packed to 16,208,917 bytes in the 7z archive.

    What explains that increase in packed size?

     
    • Shell

      Shell - 2019-10-23

      Compression algorithms don't perform well on already compressed files. A size increase is normal in that case, but it usually does not exceed several percent. Maybe you are comparing the compressed size of a whole solid block with the uncompressed size of a single file?

       
      • Rick C. Hodgin

        Rick C. Hodgin - 2019-10-23

        I would think the 7-Zip algorithm would attempt compression and, if the result were no smaller than the raw form, just store the file uncompressed. Might be a bug.

         
        • Shell

          Shell - 2019-10-23

          7-Zip is smart enough to revert to no compression in some cases. However, that is very difficult to implement when solid compression is on (the default case for 7-Zip), so in solid mode files can be expanded somewhat. For LZMA2 and Deflate/Deflate64, the expansion is usually up to several bytes, while for LZMA it can rise to several KB.

          I still don't believe that a single file can be "compressed" from 10M to 16M. Are you sure 16M is the compressed size of one file?

           
          • Rick C. Hodgin

            Rick C. Hodgin - 2019-10-23
             
            • Shell

              Shell - 2019-10-24

              7-Zip does not show compressed sizes of individual files in solid mode. 16 million bytes is the size of block 0, which can contain several files. Look in the folder which is 39 million bytes in size: are there any files that belong to that block? I am sure there are.

               
              • Rick C. Hodgin

                Rick C. Hodgin - 2019-10-24

                I realized this last night when I was making another archive with the qs parameter. A 4 KB file was shown as storing 260+ MB of data. I put two and two together and realized what was going on.

                Thank you for the info. That qs feature is pretty slick. It's made notable differences in compression on some of my archives. The example above, from 29 MB down to 18 MB, was a reduction of about 38%. Amazing.
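For reference, the reduction figures can be checked with a few lines of Python. The sizes are the ones reported earlier in the thread; the function name `percent_reduction` is just illustrative.

```python
def percent_reduction(before_mb, after_mb):
    """Percentage by which the size shrank going from before_mb to after_mb."""
    return (before_mb - after_mb) / before_mb * 100.0


if __name__ == "__main__":
    # 7z default Ultra (29 MB) versus 7z qs Ultra (18 MB)
    print(f"{percent_reduction(29, 18):.1f}%")   # 37.9%
    # ZIP default Normal (232 MB) versus 7z qs Ultra (18 MB)
    print(f"{percent_reduction(232, 18):.1f}%")  # 92.2%
```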

                 
  • Igor Pavlov

    Igor Pavlov - 2019-10-24

    The command for your archive:

    7za a -mqs -mx myarchive.7z .\folder\*
    
     
  • Igor Pavlov

    Igor Pavlov - 2019-10-24

    qs is not the default because of problems with slow HDDs.
    qs changes the order of files, and on an HDD the unpacked files can be slow for some operations such as search or copy.
    That can be a big problem if you have millions of files.
    SSDs are probably OK for qs.

     

    Last edit: Igor Pavlov 2019-10-24
    • Rick C. Hodgin

      Rick C. Hodgin - 2019-10-24

      I have an SSD on an older dev computer with a quad-core 2.4 GHz CPU and 8 GB of RAM. I used the qs option on one of my backups and it took it down from 32 MB to 21 MB, another huge reduction. However, I did have to go in and delete two big folders after I archived it the first time. And because I used qs it was slow even on the SSD. The first folder I deleted was about 40 MB, the second folder was about 160 MB (uncompressed sizes).

      I was wondering if you could add an option to mark folders with checkboxes and then run an operation on everything checked? It would allow us to select files here, files there, and files in other places, and then delete or extract them all in one operation rather than several.

       

      Last edit: Rick C. Hodgin 2019-10-24
  • Edward Hall

    Edward Hall - 2019-10-24

    Please pardon my question, since it might already have been answered...

    I want to create a zip that contains multiple versions of the same file, with the date/time the file was updated as the only difference, e.g.
    abcd_10242019_14_12.doc
    abcd_10252019_15_12.doc
    abcd_10252019_15_13.doc
    abcd_10262019_16_12.doc

    As I read the zip documentation, it seems that zip wants to create a separate zip file for each.

    Is there a parameter that can use the original zip file, e.g. abcd.zip, and add the updates to it?

    An example of the parameter string would be appreciated.

    Thanks!

     

    Last edit: Edward Hall 2019-10-24
    • Shell

      Shell - 2019-10-25

      7-Zip (like many other archivers) doesn't support adding several files with the same name to an archive. If, however, the file names contain the date/time, then it is simply
      7z a abcd.zip abcd*.doc

       
  • nightson

    nightson - 2021-03-07

    After some tests, it seems that qs doesn't help with identical files that have different file names. Could you consider adding an option to store identical files as references, as WinRAR does? Thanks!

     
  • tumagonx

    tumagonx - 2021-07-29

    @nightson +1
    Please support references... I have a busybox archive with 250 instances of the exe, each 800 KB. So far I have made a 500 KB 7z SFX that includes a bat script to hardlink them after extraction; otherwise it would swell to 6 MB.

    I've tried wim.7z and tar.xz, but they felt inconvenient.
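The post-extraction hardlinking step can be sketched roughly as follows. This is a Python stand-in for the bat script mentioned above, not the poster's actual script. The function name `link_applets` and the file names in the usage note are hypothetical, and it assumes the master file and the links live on the same volume of a filesystem that supports hard links (NTFS, for example).

```python
import os


def link_applets(master_path, applet_names, target_dir):
    """Create one hard link to master_path per applet name in target_dir."""
    created = []
    for name in applet_names:
        link_path = os.path.join(target_dir, name)
        if os.path.exists(link_path):
            continue  # leave existing files untouched
        os.link(master_path, link_path)  # new directory entry, same file data
        created.append(link_path)
    return created
```

For example, `link_applets("bin/busybox.exe", ["ls.exe", "cat.exe"], "bin")` would make both names point at the single extracted executable, so the archive only needs to carry one copy.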

     
    • Igor Pavlov

      Igor Pavlov - 2021-07-29

      The 7z format doesn't support hard links at present.

       
  • tumagonx

    tumagonx - 2021-07-29

    Does it need 7z format changes? It seems odd that it hasn't been supported for so long.

     
