(Msg. 25) Posted: Mon Oct 27, 2008 3:09 am
Post subject: Re: Is there a hard drive file organizer that will ... Archived from groups: microsoft>public>win98>gen_discussion
"FromTheRafters" <erratic.DeleteThis@nomail.afraid.org> wrote in
news:#mS$aA8NJHA.3876@TK2MSFTNGP04.phx.gbl:
>> Your understanding of networks is not nearly as impeccable as
>> your logic of finding duplicates on one drive on which I
>> commented in my previous reply.
>
> Okay, so as this thread reaches its EOL, it may interest
> someone that all might not be as it seems.
>
> I'm not sure about modern disk operating systems, but
> some older ones would not actually make a copy when
> asked to do so. Rather, they would make another full
> path to the same data on disk (why waste space with
> redundant data). Copying to another disk, or partition
> on the same disk, would actually necessitate a copy
> and would take longer as a result. When access was
> made to the file, and it was modified, then the path used
> to access that file would point to a newly created file
> while the *original* would still be accessed from the
> other paths.
>
> So, deleting duplicate files on a single drive in this case
> would only clean up the file system without freeing up
> any hard drive space.
OR deleting duplicates, it would seem (don't want to read it
again, see below).
Thanks for the headache. What a nightmare.
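
For anyone who wants to check whether two "duplicates" are really one set of data under two names, here is a minimal sketch. It is in Python (my choice, nothing in the thread uses it) and the function names are made up for illustration. It assumes a file system with hard links, such as NTFS or most Unix file systems; the FAT file system used by Win98 has none. Note also that hard links are only the closest common analogue of the behaviour quoted above: a write through one link shows up through all of them, rather than splitting off a new file.

```python
import os


def same_underlying_file(path_a, path_b):
    """Return True if both paths name the same on-disk file (e.g. hard links)."""
    return os.path.samefile(path_a, path_b)


def link_count(path):
    """Number of directory entries that point at this file's data."""
    return os.stat(path).st_nlink


def space_freed_by_deleting(path):
    """Bytes of file data released if this one path is deleted.

    While other links to the same data remain, deleting one name only
    tidies the directory and frees no file data.
    """
    info = os.stat(path)
    return info.st_size if info.st_nlink == 1 else 0
```

If space_freed_by_deleting() reports 0 for a file, removing it cleans up the listing but leaves the drive just as full.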
--
Those who cast the votes decide nothing. Those who count the
votes decide everything.
- Josef Stalin
NB: Not only is my KF over 4 KB and growing, I am also filtering
everything from discussions.microsoft and google groups, so no
offense if you don't get a reply/comment unless I see you quoted
in another post.
(Msg. 26) Posted: Mon Oct 27, 2008 3:09 am
FromTheRafters wrote:
>> AFAICS, a fundamental flaw in duplicate finder software is that it
>> relies on direct binary comparisons. With programs like FindDup, if we
>> have 3 files of equal size, then we would need to compare file1 with
>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>> For n equally sized files, the number of reads is n(n-1).
>>
>> Alternatively, if we relied on MD5 checksums, then each file would
>> only need to be read once.
>
> So...once it is found to be the same checksum, what should the
> program do next? How important are these files?
> A fundamental
> flaw would be to trust MD5 checksums as an indication that the
> files are indeed duplicates.
Since when? What is the statistical likelihood of that being true?
> You can mostly trust MD5 checksums
> to indicate two files are different, but the other way around?
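
To make the "each file read once" idea concrete, here is a rough sketch in Python (an assumption on my part; FindDup's internals are not known to me and this is not a description of it). It groups files by size first, then hashes only the files that share a size, so every file is read at most once instead of the n(n-1) reads of pairwise comparison. The names md5_of_file and candidate_duplicates are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Read the file once, in chunks, and return its MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def candidate_duplicates(paths):
    """Return lists of paths that share both size and MD5 digest.

    These are candidates only: equal digests do not prove equal contents.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip the read
        by_digest = defaultdict(list)
        for path in same_size:
            by_digest[md5_of_file(path)].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```

What to do with each group afterwards (trust it, or compare the files exactly) is the question being argued in this thread.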
(Msg. 27) Posted: Mon Oct 27, 2008 3:09 am
On Sun, 26 Oct 2008 16:23:16 -0400, "FromTheRafters"
<erratic.RemoveThis@nomail.afraid.org> put finger to keyboard and composed:
>> AFAICS, a fundamental flaw in duplicate finder software is that it
>> relies on direct binary comparisons. With programs like FindDup, if we
>> have 3 files of equal size, then we would need to compare file1 with
>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>> For n equally sized files, the number of reads is n(n-1).
>>
>> Alternatively, if we relied on MD5 checksums, then each file would
>> only need to be read once.
>
>So...once it is found to be the same checksum, what should the
>program do next? How important are these files? A fundamental
>flaw would be to trust MD5 checksums as an indication that the
>files are indeed duplicates. You can mostly trust MD5 checksums
>to indicate two files are different, but the other way around?
OK, I retract my ill-informed comment, but it still seems to me that
the benefits far outweigh the risks. FindDup has been running for the
past 18 hours or so as I write this, so I'm happy to accept a 30
minute alternative. In any case, all programs appear to require that
the user decides whether or not a file can be safely deleted. To this
end the programmer could allow for a binary comparison in those cases
where there is any doubt.
- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
(Msg. 28) Posted: Mon Oct 27, 2008 3:09 am
"Bill in Co." wrote:
> > A fundamental flaw would be to trust MD5 checksums as an
> > indication that the files are indeed duplicates.
>
> Since when? What is the statistical likelihood of that being
> true?
If there was no malicious intent or source involved, I'd say the odds
are pretty low. But even if you had 2 identical hashes, it's simple
enough to just see if the files are the same length, and if they were,
then you do a byte-by-byte comparison.
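
The length check followed by a byte-by-byte comparison could look something like this. It is a sketch in Python (my assumption, not taken from any tool named here), and files_identical is a made-up name.

```python
import os


def files_identical(path_a, path_b, chunk_size=1 << 16):
    """Return True only if both files have exactly the same bytes."""
    # Cheap test first: files of different length cannot be duplicates.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # stop at the first difference
            if not chunk_a:
                return True  # both files ended with no difference found
```

Python's standard library offers filecmp.cmp(path_a, path_b, shallow=False) for the same job; the loop is written out here only to show that the comparison can stop early at the first mismatch.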
(Msg. 29) Posted: Mon Oct 27, 2008 3:09 am
"Franc Zabkar" <fzabkar.RemoveThis@iinternode.on.net> wrote in message
news:rpn9g4h6d10d3kv20ud3j02e2phuq66ucg@4ax.com...
> On Sun, 26 Oct 2008 16:23:16 -0400, "FromTheRafters"
> <erratic.RemoveThis@nomail.afraid.org> put finger to keyboard and composed:
>
>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>> relies on direct binary comparisons. With programs like FindDup, if we
>>> have 3 files of equal size, then we would need to compare file1 with
>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>> For n equally sized files, the number of reads is n(n-1).
>>>
>>> Alternatively, if we relied on MD5 checksums, then each file would
>>> only need to be read once.
>>
>>So...once it is found to be the same checksum, what should the
>>program do next? How important are these files? A fundamental
>>flaw would be to trust MD5 checksums as an indication that the
>>files are indeed duplicates. You can mostly trust MD5 checksums
>>to indicate two files are different, but the other way around?
>
> OK, I retract my ill-informed comment, but it still seems to me that
> the benefits far outweigh the risks. FindDup has been running for the
> past 18 hours or so as I write this, so I'm happy to accept a 30
> minute alternative. In any case, all programs appear to require that
> the user decides whether or not a file can be safely deleted. To this
> end the programmer could allow for a binary comparison in those cases
> where there is any doubt.
It all depends on the risk you are willing to assume. It would be nice
to have a hybrid case where you could switch between the MD5
mode and the byte by byte mode depending on such factors as type
or location of files etc.
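
A hybrid along those lines might be sketched as follows, in Python (an assumption; no existing program is being described). The extension list, the careful_dirs parameter and the function names are all hypothetical placeholders for whatever policy the user would pick.

```python
import filecmp
import hashlib
import os

# Hypothetical policy: file types considered too valuable to trust to a hash.
CAREFUL_EXTENSIONS = {".doc", ".xls", ".mdb"}


def md5_of_file(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_exact_compare(path, careful_dirs=()):
    """Decide by file type or location whether to use byte-by-byte mode."""
    if os.path.splitext(path)[1].lower() in CAREFUL_EXTENSIONS:
        return True
    absolute = os.path.abspath(path)
    for directory in careful_dirs:
        # join(..., "") adds a trailing separator so "/data" does not match "/database"
        prefix = os.path.join(os.path.abspath(directory), "")
        if absolute.startswith(prefix):
            return True
    return False


def is_duplicate(path_a, path_b, careful_dirs=()):
    """Use exact comparison for 'careful' files, MD5 digests for the rest."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    if needs_exact_compare(path_a, careful_dirs) or needs_exact_compare(path_b, careful_dirs):
        return filecmp.cmp(path_a, path_b, shallow=False)
    return md5_of_file(path_a) == md5_of_file(path_b)
```

The risk then sits where the user put it: fast mode for things like downloaded installers, exact mode for documents or anything under a directory named in careful_dirs.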
(Msg. 30) Posted: Mon Oct 27, 2008 3:09 am
"Bill in Co." <not_really_here.DeleteThis@earthlink.net> wrote in message
news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
> FromTheRafters wrote:
>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>> relies on direct binary comparisons. With programs like FindDup, if we
>>> have 3 files of equal size, then we would need to compare file1 with
>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>> For n equally sized files, the number of reads is n(n-1).
>>>
>>> Alternatively, if we relied on MD5 checksums, then each file would
>>> only need to be read once.
>>
>> So...once it is found to be the same checksum, what should the
>> program do next? How important are these files?
>
>> A fundamental
>> flaw would be to trust MD5 checksums as an indication that the
>> files are indeed duplicates.
>
> Since when?
Forever.
Checksums are often smaller than the file they are derived from
(that's kinda the point, eh?).
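
The size argument is the pigeonhole principle: a fixed-size checksum has fewer possible values than there are possible files, so some different files must share one. A toy illustration in Python (my assumption; tiny_checksum is a deliberately crippled, made-up checksum that keeps only the first byte of an MD5 digest, so it has just 256 possible values):

```python
import hashlib


def tiny_checksum(data):
    """Toy 8-bit checksum: the first byte of the MD5 digest (0..255)."""
    return hashlib.md5(data).digest()[0]


def first_collision(inputs):
    """Return the first pair of different inputs with equal tiny checksums."""
    seen = {}
    for data in inputs:
        value = tiny_checksum(data)
        if value in seen and seen[value] != data:
            return seen[value], data
        seen[value] = data
    return None


# 257 distinct inputs but only 256 possible checksum values,
# so at least two inputs are forced to collide.
pair = first_collision(str(i).encode() for i in range(257))
```

Full MD5 has 2**128 possible values rather than 256, so the same thing is guaranteed to exist but is vastly harder to stumble upon by accident.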
> What is the statistical likelihood of that being true?
(Msg. 31) Posted: Mon Oct 27, 2008 3:09 am
FromTheRafters wrote:
> "Bill in Co." <not_really_here.DeleteThis@earthlink.net> wrote in message
> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>> FromTheRafters wrote:
>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>> have 3 files of equal size, then we would need to compare file1 with
>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>> For n equally sized files, the number of reads is n(n-1).
>>>>
>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>> only need to be read once.
>>>
>>> So...once it is found to be the same checksum, what should the
>>> program do next? How important are these files?
>>
>>> A fundamental
>>> flaw would be to trust MD5 checksums as an indication that the
>>> files are indeed duplicates.
>>
>> Since when?
>
> Forever.
>
> Checksums are often smaller than the file they are derived from
> (that's kinda the point, eh?).
No, that's not the point. Your statement was that the checksums did not
assure the integrity of the file, whatsoever - i.e., that two files could
have the same hash values and yet be different, which I still say is *highly*
unlikely. A statistically insignificant probability, so that using hash
values is often prudent and is much more expedient, of course.
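
To put a rough number on "statistically insignificant": if MD5 is treated as a uniformly random 128-bit function (an idealisation, and one that only holds when nobody is crafting files on purpose, since deliberately constructed MD5 collisions are known), the chance that any two of n different files share a digest is at most the number of pairs divided by 2**128. A small Python sketch of that bound (the language and the function name are my assumptions):

```python
from fractions import Fraction


def accidental_collision_bound(file_count, hash_bits=128):
    """Upper bound on P(at least two different files share a digest).

    Union bound over all pairs, assuming an ideal uniformly random hash.
    """
    pairs = file_count * (file_count - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# Example: a drive holding one million distinct files.
bound = float(accidental_collision_bound(1_000_000))
```

For a million files the bound comes out on the order of 1e-27, which supports the "highly unlikely" reading for accidental matches while saying nothing about maliciously prepared files.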
(Msg. 32) Posted: Mon Oct 27, 2008 3:09 am
"Bill in Co." <not_really_here.TakeThisOut@earthlink.net> wrote in message
news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
> FromTheRafters wrote:
>> "Bill in Co." <not_really_here.TakeThisOut@earthlink.net> wrote in message
>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>> FromTheRafters wrote:
>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>
>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>> only need to be read once.
>>>>
>>>> So...once it is found to be the same checksum, what should the
>>>> program do next? How important are these files?
>>>
>>>> A fundamental
>>>> flaw would be to trust MD5 checksums as an indication that the
>>>> files are indeed duplicates.
>>>
>>> Since when?
>>
>> Forever.
>>
>> Checksums are often smaller than the file they are derived from
>> (that's kinda the point, eh?).
>
> No, that's not the point. Your statement was that the checksums did not
> assure the integrity of the file, whatsoever
I didn't say anything about the integrity of a file, and I also didn't
say 'whatsoever'. You can still read what I said above.
If you want to ensure they are duplicates - compare the files exactly.
If you only need to be reasonably sure they are duplicates, checksums
are adequate.
> - i.e., that two files could have the same hash values and yet be
> different, which I still say is *highly* unlikely.
Highly unlikely, yes. But files can be highly valuable too. Just
how fast does such a program need to be? How much speed
is worth how much accuracy?
> A statistically insignificant probability, so that using hash values is
> often prudent and is much more expedient, of course.
True, but to aim toward accuracy instead of speed is not a flaw.
All times are: Eastern Time (US & Canada)