In the modern era, the cloud is a widely used platform for storing data of any
type. Because demand for cloud storage is high, duplicate data is likely to
accumulate on the cloud: everyone wants to store data, and users sometimes
store the same files unknowingly. Users also want to secure their files against
third parties. For personal or government data, security and efficiency are the
most important concerns. To check for duplication we use History Aware In-line
Deduplication, which checks whether a file is a duplicate before it is uploaded
to the cloud. To secure the file we use data destruction: the user provides a
threshold (life span) for the file at upload time, and after that period the
file is automatically removed from the server. We use an erasure technique to
create a backup of every file before it is uploaded to the server, so that a
corrupted file can easily be recovered from the backup.
terms: Erasure Technique, History Aware Inline Deduplication, Data Destruction, MD5.
Cloud services are among the most widely used services, and the cloud benefits
our day-to-day life in many ways. The cloud is generally used for storing data,
which may be present in different formats. The cloud can be called a
data-centric system, and it plays an important role in blog sharing, news
broadcasting, social media and content sharing. Sometimes the same data is
stored repeatedly. The cloud holds a huge amount of data, and handling this
much data is a big challenge. Redundant data increases the load on the cloud
and makes it harder for users to access and operate on their data. Many
duplicate files are generated every day, so searching for a particular file
takes more time. Data also faces the dangers of data residue, complete
destruction, and illegal data recovery. To overcome these problems, we propose
a system.
In the proposed system we use several algorithms: MD5 for creating the hash
value of the file, a content-defined chunking algorithm, erasure coding for
creating the backup of the file, and a data destruction algorithm for deleting
the file after a fixed period of time. Erasure coding has been widely used to
provide high availability, efficiency and reliability of data while introducing
low storage overhead in storage systems. Erasure coding protects data from
being corrupted or lost during processing in different applications. In this
method, data is fragmented into segments, extended and encoded with redundant
data bits, and stored at different locations or storage media. The goal of
erasure coding is to reconstruct corrupted data by using information about the
data (metadata, i.e. data about data) stored elsewhere in the array of the disk
storage system.
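The idea of rebuilding a lost fragment from redundant information can be illustrated with a minimal Python sketch. It assumes the simplest possible erasure code, a single XOR parity fragment, which is not necessarily the code used in the proposed system; the function names encode, recover and decode are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-sized fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\x00")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def recover(fragments: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing fragment (marked None) by XOR-ing the rest."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment can repair only one loss")
    if missing:
        survivors = [frag for frag in fragments if frag is not None]
        fragments[missing[0]] = reduce(xor_bytes, survivors)
    return fragments


def decode(fragments: List[Optional[bytes]], original_len: int) -> bytes:
    """Repair if needed, drop the parity fragment and strip the padding."""
    repaired = recover(list(fragments))
    return b"".join(repaired[:-1])[:original_len]


data = b"cloud backup data"
stored = encode(data, k=4)   # 4 data fragments + 1 parity fragment
stored[2] = None             # simulate a corrupted or lost fragment
restored = decode(stored, len(data))
```

Each fragment would be placed on a different server or disk, so losing any one of them still leaves enough information to restore the file. Codes that tolerate several simultaneous losses, such as Reed-Solomon, follow the same encode and recover pattern with more parity fragments.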
2. EXISTING SYSTEM
A file could be regarded as a long string (one byte, one character), but it
could not be treated like a string during file diffing. There were some
problems associated with the existing systems.
iDedup [7] is an inline deduplication solution for primary workloads. The
proposed algorithm is based on two key insights from real-world workloads:
first, spatial locality occurs in duplicated primary data; second, temporal
locality occurs in the access patterns of duplicated backup data. The proposed
deduplication algorithm minimizes extra input/output operations and seeks.
indicator for dedupe scheme provides a two-fold approach: first, a novel
indicator for dedupe systems called the cache-aware Chunk Fragmentation Level
(CFL) monitor, and second, selective duplication to improve read performance.
The CFL consists of an optimal chunk fragmentation parameter and a cache-aware
current chunk fragmentation parameter. To boost read performance, the selective
duplication technique is activated if the current CFL becomes worse than the
required one.
Fig 1: Existing System
and Won [5] developed an original system for content-based file chunking that
consists of two subsystems: a CPU chunking subsystem and a GPGPU subsystem. The
system decides which subsystem performs the chunking. Manogar and Abirami
analysed different de-duplication methods, compared them, and concluded that
variable-size data de-duplication is more efficient than the other techniques.
Lin et al. [5] developed a data reorganization method, Re-De-dup, which
addresses the data fragmentation problem by reallocating files and their
placement on disk.
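Variable-size (content-defined) chunking, mentioned above, can be sketched in Python as follows. This is a minimal illustration that assumes a simple additive rolling hash over a small window; production systems typically use Rabin fingerprints or similar, and the function name cdc_chunks and all parameter values are hypothetical.

```python
import random
from typing import List


def cdc_chunks(data: bytes, window: int = 16, divisor: int = 64,
               min_size: int = 32, max_size: int = 256) -> List[bytes]:
    """Cut data where a rolling hash of the last `window` bytes hits a target value."""
    chunks = []
    start = 0      # index where the current chunk begins
    rolling = 0    # sum of the most recent bytes of the current chunk
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]   # slide the window forward
        length = i - start + 1
        at_boundary = length >= min_size and rolling % divisor == divisor - 1
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])       # trailing partial chunk
    return chunks


random.seed(0)
sample = bytes(random.randrange(256) for _ in range(5000))
pieces = cdc_chunks(sample)
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, an insertion near the start of a file tends to change only nearby chunks, which is why variable-size chunking usually finds more duplicates than fixed-size blocks.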
3. PROPOSED SYSTEM
The input to the system is the user's login credentials and, mainly, a file.
For the selected file, the system identifies whether the file is already
present on the server. To detect whether the file is on the server, History
Aware In-Line deduplication is used. The MD5 algorithm is used to create the
hash value. The system responds on the basis of the generated hash value: if it
matches one of the stored hash values, the file is discarded; otherwise the
file is uploaded to the server.
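The upload decision described above can be sketched in Python. The class DedupServer and its in-memory dictionaries are hypothetical stand-ins for the server and its hash index; only hashlib.md5 is a real library call.

```python
import hashlib
from typing import Dict


class DedupServer:
    """Toy server that refuses files whose MD5 hash is already stored."""

    def __init__(self) -> None:
        self.hash_index: Dict[str, str] = {}   # MD5 hex digest -> stored file name
        self.files: Dict[str, bytes] = {}      # file name -> content

    def upload(self, name: str, content: bytes) -> bool:
        """Return True if the file was stored, False if it was a duplicate."""
        digest = hashlib.md5(content).hexdigest()
        if digest in self.hash_index:
            return False                        # hash matches: discard the file
        self.hash_index[digest] = name
        self.files[name] = content
        return True


server = DedupServer()
first = server.upload("report.txt", b"quarterly numbers")
second = server.upload("report-copy.txt", b"quarterly numbers")
```

Here the second call returns False because the content hashes to the same value as the first upload, so only one copy is kept.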
In this section the proposed algorithms are presented and explained along with
the system architecture. They are as follows.
A) MD5 Algorithm.
Take the input file.
Divide the input file into blocks.
Some bits are inserted at the end of the last block: if the last block is
smaller than the other blocks, extra bits are added.
Four rounds are used to process the blocks.
After performing all rounds, the MD5 digest is obtained.
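The block division and padding steps can be illustrated in Python. The sketch follows the padding rule of the MD5 specification (a single 1 bit, zero bits, then the 64-bit message length) and leaves the four processing rounds to the standard hashlib implementation; the helper names md5_pad and md5_blocks are hypothetical.

```python
import hashlib
import struct
from typing import List


def md5_pad(message: bytes) -> bytes:
    """Pad a message so that its length is a multiple of 64 bytes (512 bits)."""
    bit_length = (len(message) * 8) & 0xFFFFFFFFFFFFFFFF
    padded = message + b"\x80"                          # a single 1 bit, then zeros
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)  # leave 8 bytes for the length
    padded += struct.pack("<Q", bit_length)             # length as 64-bit little-endian
    return padded


def md5_blocks(message: bytes) -> List[bytes]:
    """Return the 64-byte blocks that the MD5 rounds would process."""
    padded = md5_pad(message)
    return [padded[i:i + 64] for i in range(0, len(padded), 64)]


blocks = md5_blocks(b"abc")                 # short input fits in one padded block
digest = hashlib.md5(b"abc").hexdigest()    # rounds performed by the library
```

In practice only the hashlib call is needed; the padding helpers are shown to make the "extra bits" step concrete.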
B) Data Destruction Algorithm.
For each document in the database:
Calculate the life span from the current date and time.
If the life span has expired:
. Check the dependency of the document.
. If the document is associated with other owners:
  . Delete the links of the owner whose life span expired.
  . Notify the owner about the expiry.
. Otherwise:
  . Delete the document from the main space of the server and transfer it
    into trash for N days.
  . Delete the links of the owner whose life span expired.
  . Notify the owner about the expiry.
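The data destruction steps can be sketched in Python as a periodic sweep. The Document class, the sweep function, the notify callback and the value chosen for N (TRASH_DAYS) are all hypothetical; the sketch assumes each owner's link carries its own expiry time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict

TRASH_DAYS = 7  # the "N days" of the algorithm; the value is an assumption


@dataclass
class Document:
    name: str
    owners: Dict[str, datetime]  # owner -> time at which that owner's life span ends


def sweep(main_space: Dict[str, Document], trash: Dict[str, datetime],
          now: datetime, notify: Callable[[str, str], None]) -> None:
    """Remove expired owner links and move ownerless documents to trash."""
    for name in list(main_space):
        doc = main_space[name]
        expired = [o for o, expiry in doc.owners.items() if expiry <= now]
        for owner in expired:
            del doc.owners[owner]      # delete the link of the expired owner
            notify(owner, name)        # notify the owner about the expiry
        if expired and not doc.owners:
            # No other owner depends on the document: keep it in trash for N days.
            trash[name] = now + timedelta(days=TRASH_DAYS)
            del main_space[name]


now = datetime(2024, 1, 10)
main_space = {
    "shared.pdf": Document("shared.pdf", {"alice": datetime(2024, 1, 1),
                                          "bob": datetime(2024, 2, 1)}),
    "solo.txt": Document("solo.txt", {"carol": datetime(2024, 1, 5)}),
}
trash: Dict[str, datetime] = {}
notices = []
sweep(main_space, trash, now, lambda owner, doc: notices.append((owner, doc)))
```

After the sweep, shared.pdf stays on the server with only bob's link, while solo.txt is moved to trash with a purge date N days later.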
C) History Aware In-line Deduplication (HIDC) Algorithm.
While (backup is in process)
Read a block of the file from cache and search for its hash value in the
Hash_Index.
If (data block is already present) //loop1
If ([...] (values) == C_Historical(values)) then //loop2
Write the block of file into the new container.
Else
Discard the block of file.
Else
Store the block of file in a new container.
Update the progress backup record in CD_CurrentDataFile.
Compare progress backup record with threshold.
Eliminate all exceeding progress backup records from C_CurrentDataFile.
Compute the rewrite fraction for the succeeding backup.
While (rewrite fraction > rewrite [...])
Eliminate largest progress backup records from CD_CurrentDataFile.
Update the computed rewrite fraction.
Return CD_CurrentDataFile.
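One possible reading of the steps above can be sketched in Python. Several words of the listing are missing, so this is an interpretation under stated assumptions rather than the exact algorithm: C_Historical is taken to be the set of sparsely used containers recorded by the previous backup, the progress backup record is taken to be the number of bytes of each container reused by the current backup, and the second loop is assumed to stop at a rewrite limit. The names backup_pass, CONTAINER_SIZE, utilization_threshold and rewrite_limit are hypothetical, as is the simplification of writing all new blocks to one container.

```python
from typing import Dict, Iterable, List, Set, Tuple

CONTAINER_SIZE = 4096  # bytes per container (assumed value for illustration)


def backup_pass(
    blocks: Iterable[Tuple[str, int]],   # (hash value, size in bytes) per block
    hash_index: Dict[str, int],          # hash value -> container id
    historical_sparse: Set[int],         # sparse containers from the previous backup
    new_container: int,                  # container that receives written blocks
    utilization_threshold: float = 0.5,
    rewrite_limit: float = 0.05,
) -> Tuple[Set[int], List[str]]:
    referenced: Dict[int, int] = {}      # progress record: container id -> bytes reused
    written: List[str] = []
    total = 0
    for digest, size in blocks:
        total += size
        container = hash_index.get(digest)
        if container is not None and container not in historical_sparse:
            # Duplicate stored in a well-used container: discard the block.
            referenced[container] = referenced.get(container, 0) + size
            continue
        # New block, or duplicate held in a sparse container: write it again.
        hash_index[digest] = new_container
        written.append(digest)

    # Eliminate records whose utilization reaches the threshold.
    sparse = {c: used / CONTAINER_SIZE for c, used in referenced.items()
              if used / CONTAINER_SIZE < utilization_threshold}

    def rewrite_fraction() -> float:
        return sum(referenced[c] for c in sparse) / total if total else 0.0

    # Drop the best-used sparse containers until the rewrite cost is acceptable.
    while sparse and rewrite_fraction() > rewrite_limit:
        del sparse[max(sparse, key=sparse.get)]
    return set(sparse), written


index = {"a": 1, "b": 1, "c": 2}
blocks = [("a", 2048), ("b", 2048), ("c", 512), ("d", 1024)]
sparse_1, written_1 = backup_pass(blocks, index, set(), 3, rewrite_limit=0.2)
sparse_2, written_2 = backup_pass(blocks, index, sparse_1, 4, rewrite_limit=0.2)
```

In this run the first backup writes only the new block d and reports container 2 as sparse; the second backup therefore rewrites block c into a fresh container instead of referencing the sparse one, which is how the history of one backup reduces fragmentation in the next.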
Description of Proposed Work.
Fig 2: Flow Diagram of the Proposed System
Data deduplication is a technique in which duplicate data is automatically
removed from the storage system. Deduplication results in data fragmentation
because the data is continuously spread across many disk locations.
Fragmentation is mainly caused by duplicate data from previous backups; such
duplicates are frequent because full backups contain a lot of data that does
not change.
We use the History Aware In-line Duplication Check (HIDC) algorithm for
duplicate detection and erasure coding for encoding and decoding. Erasure
coding encodes the data; the encoded data is compressed and then stored on the
server. We use one backup server on which the encoded files are stored, and the
user can download a file by decoding it from that backup server when the file
is corrupted or any other problem occurs with it. The Self-Data Destruction
algorithm is used to delete the file after a specific time interval, which the
user specifies at the time of data storage.
In the proposed algorithm, the user selects a file to upload to the server. The
MD5 algorithm is used to generate the hash values of the blocks of the file. In
erasure coding, the data blocks are encoded. The HIDC algorithm is used to
check whether a file being stored on the server is a duplicate. That is, when
the de-duplication module receives a file to store on the server, it looks up
the file's hash value in the cache. Only if the hash value is not found in the
cache is the file stored on the servers; otherwise the file is discarded, which
reduces memory wastage. This process takes place before the file is uploaded to
the servers, which is why it is called inline de-duplication. After this
process the client can upload the file to the server. During upload, the file
is first divided and stored on different servers, which maintains the security
of the files. The client can also download the file in the same manner, as
required.
The Data Destruction algorithm works as follows. At the time of uploading a
file to the server, the user specifies the life span for file storage. The life
span is a time interval expressed in minutes, hours, days, months, years, etc.
The system constantly checks whether the life span of any file has expired.
Once the life span of a file has expired, the dependency of the file is
checked. If no dependency is found, the file is deleted from the server
directly. Otherwise, only the record relating the file to that particular user
is deleted.
In cloud storage the volume of data is very large, and storing it efficiently
is a difficult problem. Backup storage is challenging in terms of storage space
consumption, recovery and effectiveness. With changing technology, users have
started to keep backups of their data on cloud servers for flexibility and
mobility. The data stored by users may also contain duplicates. When data is
backed up, the data blocks are distributed across multiple cloud servers, which
reduces the chance of data loss from exploitation but at the same time uses
more space. To address this problem, our proposed system implements erasure
coding for recovery and inline de-duplication for cloud backup storage. Erasure
coding encodes the fragmented data. At the time of storing data on the cloud, a
life span is provided for each data item. The self data destruction algorithm
is used to automatically delete data on the cloud whose life span has finished.