(Msg. 33) Posted: Mon Oct 27, 2008 3:09 am
Post subject: Re: Is there a hard drive file organizer that will ... Archived from groups: microsoft>public>win98>gen_discussion
FromTheRafters wrote:
> "Bill in Co." <not_really_here.RemoveThis@earthlink.net> wrote in message
> news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
>> FromTheRafters wrote:
>>> "Bill in Co." <not_really_here.RemoveThis@earthlink.net> wrote in message
>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>>> FromTheRafters wrote:
>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>>
>>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>>> only need to be read once.
>>>>>
>>>>> So...once it is found to be the same checksum, what should the
>>>>> program do next? How important are these files?
>>>>
>>>>> A fundamental
>>>>> flaw would be to trust MD5 checksums as an indication that the
>>>>> files are indeed duplicates.
>>>>
>>>> Since when?
>>>
>>> Forever.
>>>
>>> Checksums are often smaller than the file they are derived from
>>> (that's kinda the point, eh?).
>>
>> No, that's not the point. Your statement was that the checksums did not
>> assure the integrity of the file, whatsoever
>
> I didn't say anything about the integrity of a file, and I also didn't
> say 'whatsoever'. You can still read what I said above.
>
> If you want to ensure they are duplicates - compare the files exactly.
> If you only need to be reasonably sure they are duplicates, checksums
> are adequate.
"Very reasonably sure" is correct.
>> - i.e., that two files could have the same hash value and yet be
>> different, which I still say is *highly* unlikely.
>
> Highly unlikely - yes. But files can be highly valuable too. Just
> how fast does such a program need to be? How much speed
> is worth how much accuracy?
That's the question, isn't it? Considering the difference in speed, and
for most of our applications, I'd say the hash checksum approach does just
fine.
>> A statistically insignificant probability, so that using hash values is
>> often prudent and is much more expedient, of course.
>
> True, but to aim toward accuracy instead of speed is not a flaw.
And there is a point of diminishing returns. Prudence comes in here; i.e.,
using the appropriate technique for the case at hand.
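
To make the read counts quoted above concrete, here is a minimal Python sketch, not how FindDup or FastSum actually work. The function names `pairwise_reads`, `md5_of`, and `group_by_md5` are hypothetical. It shows the n(n-1) reads for pairwise comparison next to a checksum pass that reads each file once.

```python
import hashlib
from collections import defaultdict


def pairwise_reads(n):
    """File reads when every pair of n same-sized files is compared directly."""
    # n(n-1)/2 pairs, and each comparison reads two files.
    return n * (n - 1)


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it once in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_md5(paths):
    """Group paths by digest; each file is read exactly once."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5_of(path)].append(path)
    # Only groups with more than one member are candidate duplicates.
    return [group for group in groups.values() if len(group) > 1]
```

Matching digests mark the files as candidate duplicates only; whether that is good enough is the speed versus accuracy question argued in this thread.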
(Msg. 34) Posted: Mon Oct 27, 2008 3:09 am
Post subject: Re: Is there a hard drive file organizer that will ... Archived from groups: per prev. post
On Sun, 26 Oct 2008 21:46:24 -0400, "FromTheRafters"
<erratic.RemoveThis@nomail.afraid.org> put finger to keyboard and composed:
>
>"Bill in Co." <not_really_here.RemoveThis@earthlink.net> wrote in message
>news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
>> FromTheRafters wrote:
>>> "Bill in Co." <not_really_here.RemoveThis@earthlink.net> wrote in message
>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>>> FromTheRafters wrote:
>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>>
>>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>>> only need to be read once.
>>>>>
>>>>> So...once it is found to be the same checksum, what should the
>>>>> program do next? How important are these files?
>>>>
>>>>> A fundamental
>>>>> flaw would be to trust MD5 checksums as an indication that the
>>>>> files are indeed duplicates.
>>>>
>>>> Since when?
>>>
>>> Forever.
>>>
>>> Checksums are often smaller than the file they are derived from
>>> (that's kinda the point, eh?).
>>
>> No, that's not the point. Your statement was that the checksums did not
>> assure the integrity of the file, whatsoever
>
>I didn't say anything about the integrity of a file, and I also didn't
>say 'whatsoever'. You can still read what I said above.
>
>If you want to ensure they are duplicates - compare the files exactly.
>If you only need to be reasonably sure they are duplicates, checksums
>are adequate.
>
>> - i.e., that two files could have the same hash value and yet be
>> different, which I still say is *highly* unlikely.
>
>Highly unlikely - yes. But files can be highly valuable too. Just
>how fast does such a program need to be? How much speed
>is worth how much accuracy?
>
>> A statistically insignificant probability, so that using hash values is
>> often prudent and is much more expedient, of course.
>
>True, but to aim toward accuracy instead of speed is not a flaw.
Sorry, bad choice of word on my part. However, speed and accuracy, or
speed and safety, are legitimate compromises that we make on a daily
basis. For example, our residential speed limit has been reduced from
60kph to 50kph in the interests of public safety, but we could easily
have a zero road toll if we reduced the limit all the way to 1kph.

Similarly, I could have left FindDup running for several more hours (I
killed it after about 24), but the inconvenience finally got to me.
I'd rather go for speed with something like FastSum, and safeguard
against unlikely losses with a total backup.

In fact, I wonder why it is that no antivirus product seems to be able
to reliably detect *all known* viruses. Is this an intentional
compromise of speed versus security? For example, I used to download
Trend Micro's pattern file updates manually for some time, and noticed
that the ZIP files grew to as much as 23MB until about a year (?) ago
when they suddenly shrank to only 15MB. Have Trend Micro decided to
exclude extinct or rare viruses from their database, or have they
really found a more efficient way to do things?
- Franc Zabkar
--
Please remove one 'i' from my address when replying by email.
(Msg. 35) Posted: Mon Oct 27, 2008 9:47 pm
Post subject: Re: Is there a hard drive file organizer that will ... Archived from groups: per prev. post
"Franc Zabkar" <fzabkar.DeleteThis@iinternode.on.net> wrote in message
news:upjag4hvtmrrrnldacv22l39huud20jt74@4ax.com...
[snip]
> Sorry, bad choice of word on my part. However, speed and accuracy, or
> speed and safety, are legitimate compromises that we make on a daily
> basis. For example, our residential speed limit has been reduced from
> 60kph to 50kph in the interests of public safety, but we could easily
> have a zero road toll if we reduced the limit all the way to 1kph.
> Similarly, I could have left FindDup running for several more hours (I
> killed it after about 24), but the inconvenience finally got to me.
> I'd rather go for speed with something like FastSum, and safeguard
> against unlikely losses with a total backup.
Having read some of your excellent posts, I was sure you would
know where I was coming from with those comments.
> In fact, I wonder why it
> is that no antivirus product seems to be able to reliably detect *all
> known* viruses.
Add to that the many methods applied by viruses to make the task
more difficult for the detector.
Heuristics is a less accurate but faster approach, and reminds me of the
current topic (only more markedly). It seems a shame to treat a near
100% accurate byte-by-byte method as equivalent to an MD5 hash when,
in the virus detection world, 100% is a pipe dream and heuristics must be
dampened to keep false positives from getting out of hand.
Many of the better AV programs use a mixture of methods, including
but not limited to those above.
> Is this an intentional compromise of speed versus
> security? For example, I used to download Trend Micro's pattern file
> updates manually for some time, and noticed that the ZIP files grew to
> as much as 23MB until about a year (?) ago when they suddenly shrank to
> only 15MB. Have Trend Micro decided to exclude extinct or rare viruses
> from their database, or have they really found a more efficient way to
> do things?
(Msg. 36) Posted: Mon Oct 27, 2008 9:47 pm
Post subject: Re: Is there a hard drive file organizer that will ... Archived from groups: per prev. post
"Bill in Co." <not_really_here.TakeThisOut@earthlink.net> wrote in message
news:%23c5PZw%23NJHA.1144@TK2MSFTNGP05.phx.gbl...
> FromTheRafters wrote:
>> "Bill in Co." <not_really_here.TakeThisOut@earthlink.net> wrote in message
>> news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...
>>> FromTheRafters wrote:
>>>> "Bill in Co." <not_really_here.TakeThisOut@earthlink.net> wrote in message
>>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...
>>>>> FromTheRafters wrote:
>>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it
>>>>>>> relies on direct binary comparisons. With programs like FindDup, if we
>>>>>>> have 3 files of equal size, then we would need to compare file1 with
>>>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.
>>>>>>> For n equally sized files, the number of reads is n(n-1).
>>>>>>>
>>>>>>> Alternatively, if we relied on MD5 checksums, then each file would
>>>>>>> only need to be read once.
>>>>>>
>>>>>> So...once it is found to be the same checksum, what should the
>>>>>> program do next? How important are these files?
>>>>>
>>>>>> A fundamental
>>>>>> flaw would be to trust MD5 checksums as an indication that the
>>>>>> files are indeed duplicates.
>>>>>
>>>>> Since when?
>>>>
>>>> Forever.
>>>>
>>>> Checksums are often smaller than the file they are derived from
>>>> (that's kinda the point, eh?).
>>>
>>> No, that's not the point. Your statement was that the checksums did not
>>> assure the integrity of the file, whatsoever
>>
>> I didn't say anything about the integrity of a file, and I also didn't
>> say 'whatsoever'. You can still read what I said above.
>>
>> If you want to ensure they are duplicates - compare the files exactly.
>> If you only need to be reasonably sure they are duplicates, checksums
>> are adequate.
>
> "Very reasonably sure" is correct.
>
>>> - i.e., that two files could have the same hash value and yet be
>>> different, which I still say is *highly* unlikely.
>>
>> Highly unlikely - yes. But files can be highly valuable too. Just
>> how fast does such a program need to be? How much speed
>> is worth how much accuracy?
>
> That's the question, isn't it? Considering the difference in speed, and
> for most of our applications, I'd say the hash checksum approach does just
> fine.
>
>>> A statistically insignificant probability, so that using hash values is
>>> often prudent and is much more expedient, of course.
>>
>> True, but to aim toward accuracy instead of speed is not a flaw.
>
> And there is a point of diminishing returns. Prudence comes in here;
> i.e., using the appropriate technique for the case at hand.
We agree then! :)
I think a hybrid approach would be best. For instance, file types like JPEG
are rather large, and I value them much lower than I do PDF, DOC, and
even some JPEGs, depending on their location. Plus, that puts the
responsibility on the user, who made the informed decision to use a pretty
nearly flawless approach instead of a most nearly flawless approach, in the
event of a disaster.
Writing such a program and selling it as just as good as, but faster than,
byte-by-byte comparisons could leave one open to a lawsuit.
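
A minimal Python sketch of one way such a hybrid check might look; this is an illustration, not any existing tool's method. The `STRICT_EXTENSIONS` policy and the function names `md5_of` and `is_duplicate` are hypothetical. Files are screened by size and MD5 first, and only the file types the user values highly are confirmed byte by byte.

```python
import filecmp
import hashlib
import os

# Hypothetical policy: file types valued highly enough to warrant
# a byte-by-byte confirmation after the checksums match.
STRICT_EXTENSIONS = {".pdf", ".doc"}


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it once in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path_a, path_b):
    """Screen by size and MD5; confirm byte by byte only for strict types."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    if md5_of(path_a) != md5_of(path_b):
        return False
    extension = os.path.splitext(path_a)[1].lower()
    if extension in STRICT_EXTENSIONS:
        # shallow=False forces a content comparison instead of
        # relying on the os.stat() signatures alone.
        return filecmp.cmp(path_a, path_b, shallow=False)
    # For lower-valued types, matching size and digest is accepted as-is.
    return True
```

The user chooses which extensions go in the strict set, which keeps the speed versus accuracy decision with the user as suggested above.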