Simple Deduplication (Implementation)
Structure
I’ve split the implementation into a total of 4 .java files (a minimal sketch of how they fit together follows the list):
- Deduplication: the main process; all of the chunking, fingerprinting, and indexing is done here
- Chunk: handles a single chunk
- Index: to persist the index with Java serialization, the Serializable interface has to be implemented, which is dealt with here
- Storage: handles storage-related procedures (i.e. create, get, etc.)
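Here is a rough skeleton of how these four classes might relate. Only the four class names and the Serializable detail come from the description above; every field and method name below is an assumption, shown in one listing for brevity:

```java
import java.io.Serializable;

class Chunk implements Serializable {          // single chunk handler
    byte[] data;
    String fingerprint;                        // SHA-256 digest of data
}

interface Index extends Serializable {         // Serializable index interface
    boolean contains(String fingerprint);
    void put(String fingerprint, Chunk chunk);
}

class Storage {                                // storage-related procedures
    void create(Chunk chunk) { /* write chunk bytes into data/ */ }
    Chunk get(String fingerprint) { /* read chunk bytes back */ return null; }
}

public class Deduplication {                   // chunking, fingerprinting, indexing
    public static void main(String[] args) { /* upload | download | delete */ }
}
```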
The program can execute 3 commands:
- Upload
- Download
- Delete
For Upload, the program requires more than just the file:
- min = Minimum chunk size (in bytes)
- avg = Average chunk size (in bytes)
- max = Maximum chunk size (in bytes)
- d = Base parameter
- dir = File location
For Download, it only requires the original file name and a location to save to. For Delete, it only requires the original file name. Note that the original (pre-deduplication) filename is what is used to reference the deduplicated data.
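To make the commands concrete, a hypothetical invocation of each might look like the following (the argument names and order here are assumptions, not the program’s actual interface):

```
java Deduplication upload <min> <avg> <max> <d> <dir>
java Deduplication download <original-filename> <save-location>
java Deduplication delete <original-filename>
```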
The index file is stored in the same directory the program runs in, and the data is stored in a data folder that is created automatically on the first deduplication.
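As a rough illustration of that layout, the index could be persisted with Java serialization while the data folder is created on demand. This is only a sketch; the file names and the helper class are assumptions:

```java
import java.io.*;
import java.nio.file.*;

public class IndexIO {
    static final Path DATA_DIR = Paths.get("data");    // chunk store, auto-created
    static final Path INDEX_FILE = Paths.get("index"); // lives next to the program

    // Write the Serializable index to disk, creating data/ on first use.
    static void saveIndex(Serializable index) throws IOException {
        Files.createDirectories(DATA_DIR);
        try (ObjectOutputStream out = new ObjectOutputStream(
                Files.newOutputStream(INDEX_FILE))) {
            out.writeObject(index);
        }
    }

    // Read the index back; callers cast it to their concrete index type.
    static Object loadIndex() throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                Files.newInputStream(INDEX_FILE))) {
            return in.readObject();
        }
    }
}
```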
Code Snippets
Because the whole program is a bit too big for a blog post, I decided to include only the main chunking, fingerprinting, and indexing portion. Sorry for the messy variable names:
- m = minimum size
- q = average size
- x = maximum size
- d = base parameter
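The original chunking snippet isn’t reproduced here, but a content-defined chunking loop using these parameters might look roughly like the sketch below. It uses a simple polynomial hash with base d and cuts a chunk when the hash hits zero (giving chunks of about q bytes on average); this is a stand-in for whatever rolling-hash condition the real code uses:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Chunker {
    // m = minimum size, q = average size, x = maximum size, d = base parameter
    static List<byte[]> chunk(byte[] file, int m, int q, int x, int d) {
        List<byte[]> chunks = new ArrayList<>();
        int start = 0;
        long hash = 0;
        for (int i = 0; i < file.length; i++) {
            hash = (hash * d + (file[i] & 0xFF)) % q;  // roll the hash forward
            int len = i - start + 1;
            // Cut on a hash match once past the minimum, or when forced by
            // the maximum size or the end of the file.
            if ((len >= m && hash == 0) || len >= x || i == file.length - 1) {
                chunks.add(Arrays.copyOfRange(file, start, i + 1));
                start = i + 1;
                hash = 0;                              // restart for the next chunk
            }
        }
        return chunks;
    }
}
```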
Checking if the Chunk is New (i.e. Unique): notice that the fingerprinting is done here.
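The fingerprint itself is a SHA-256 digest of the chunk bytes (the post confirms SHA-256 just below; the helper shown here is only a plausible shape for it):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Fingerprint {
    // Hash the chunk and hex-encode the 32-byte digest as the lookup key.
    static String of(byte[] chunk) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(chunk)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```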
If the Chunk does not exist in the Index (note that SHA-256 was used to hash the data, not to encrypt it)
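In that case the chunk’s bytes are written to storage and its fingerprint is recorded; otherwise only a reference to the existing chunk is kept. A map-based sketch of that check (the reference-counting detail is my assumption, not something shown in the post):

```java
import java.util.HashMap;
import java.util.Map;

public class DedupIndex {
    // fingerprint -> number of files currently referencing the chunk
    static Map<String, Integer> refCounts = new HashMap<>();

    // Returns true if the chunk was new and its bytes had to be stored.
    static boolean addChunk(String fingerprint, byte[] chunkBytes) {
        if (!refCounts.containsKey(fingerprint)) {
            refCounts.put(fingerprint, 1);
            // storage.create(chunkBytes) would write the bytes into data/
            return true;   // unique chunk
        }
        refCounts.merge(fingerprint, 1, Integer::sum);
        return false;      // duplicate: reuse the stored bytes
    }
}
```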
Tests and Results
For testing, I downloaded the plain-text file of “A Tale of Two Cities” from gutenberg.org.
The first upload is the original file.
As seen in the output above, all of the chunks are unique since this is the first file ever uploaded. For the second upload, I deleted a few of the paragraphs in the text file so that the file is not exactly the same.
Now, since most of the file is exactly the same, lots of its chunks are duplicates of chunks from the first upload. I have just uploaded 2 files, but used less storage than storing both full files would have taken!
Time to move on to downloading the modified file.
Reconstructed bytes are the total bytes of the chunks that are not unique to this file; in other words, these bytes come from shared chunks.
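Reconstruction itself can be as simple as concatenating the file’s chunks in order. A sketch, assuming each chunk is stored in data/ under its fingerprint (the storage layout is an assumption):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class Download {
    // Rebuild the original file by streaming its chunks back in order.
    static void reconstruct(List<String> fingerprints, Path saveTo) throws IOException {
        try (OutputStream out = Files.newOutputStream(saveTo)) {
            for (String fp : fingerprints) {
                out.write(Files.readAllBytes(Paths.get("data", fp)));
            }
        }
    }
}
```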
Now let’s try deleting the modified file. The numbers are very similar to those from when I uploaded the modified file. The other chunks are still needed by the original file, so they remain stored. But what if, instead of deleting the modified file, I deleted the original file?
Because only 1 chunk is not shared with the modified file, only 1 chunk, 89 bytes in size, is deleted. Note that because the minimum chunk size was set to 128, it reports 128 bytes, but the actual unique data is only 89 bytes (the rest is padded with zeroes).
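This behavior falls out naturally if chunks are reference-counted, as in the upload sketch earlier: deleting a file decrements each of its chunks’ counts and removes only the bytes whose count reaches zero. Again a sketch, not the post’s actual bookkeeping:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;

public class Delete {
    // Drop one reference per chunk; remove the stored bytes only when the
    // last referencing file is gone.
    static void deleteFile(List<String> fingerprints,
                           Map<String, Integer> refCounts) throws IOException {
        for (String fp : fingerprints) {
            int remaining = refCounts.merge(fp, -1, Integer::sum);
            if (remaining <= 0) {
                refCounts.remove(fp);
                Files.deleteIfExists(Paths.get("data", fp));
            }
        }
    }
}
```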
External Libraries Used
- commons-lang3-3.1.jar (from Apache Commons)