Duplicate files on Your disk.

Q&A's, tips, howto's
Locked
alex
Posts: 100
Joined: Thu Mar 10, 2005 2:27 pm
Location: Russia

Duplicate files on Your disk.

Post by alex »

I wrote function "dup-files" for searching duplicate file names in directory,
including subdirectories.

Code: Select all

;; dup-files
;;
;; Searching duplicate file names in directory, including subdirectories.
;;
;; syntax: (dup-files dir)
;; syntax: (dup-files dir str-pattern)
;;
;; In the first form returns a list of ALL pairs (filename directory)
;; where filename has duplicated file names in directory "dir"
;;
;; In the second form returns a list of pairs (filename directory)
;; where filename satisfy regular expression "str-pattern" and
;; has duplicated file names in directory "dir" 
;;
;; ATTETION! Works only with newlisp-8.7.2 and later 
;;
;; Example1:
;;  (load "dup-files.lsp")
;;  (println (dup-files "c:"))
;;
;; Example2:
;;  (load "dup-files.lsp")
;;  (setq li (dup-files "c:" "\.jpg$"))
;;  (println (format "%20s  %20s  %10s  %s" "Filename" "Modification time" "Size" "Directory")) 
;;  (dolist (x li)
;;    (setq tv (select (file-info (append (nth 1 x) (nth 0 x))) 0 6))
;;    (println 
;;      (format "%20s  %20s  %10d  %s" 
;;          (nth 0 x) (date (nth 1 tv) 0 "%Y-%m-%d %H:%M:%S") (nth 0 tv) (nth 1 x) ) ) )
;;
(context 'dup-files)
(setq file-mask nil)
(setq list1 '())

(define (dup-files:dup-files d f-mask)
  (if f-mask (setq file-mask f-mask) (setq file-mask ".*"))
  (replace "\\" d "/")
  (unless (ends-with d "/") (setq d (append d "/" )))
  (file-tree d)
  (let (li (map (lambda (x) (nth 0 x)) list1))
    (setq li (map (lambda (x y) (if (!= y 1) x 0)) list1 (count li li) ))
    (sort (replace 0 li)) ) )

(define (file-tree d)
  (dolist (f (replace "." (replace ".." (directory d)))) 
    (if (directory? (append d f)) 
      (file-tree (append d f "/"))
      (if (regex file-mask f 1) (push (list f d) list1)) ) ) )

(context 'MAIN)
Have anybody remarks for purposes of improve the function?

Excuse me for my bad English.

Dmi
Posts: 408
Joined: Sat Jun 04, 2005 4:16 pm
Location: Russia
Contact:

Post by Dmi »

Code: Select all

(define (dup-files d file-mask)
  (replace "\\" d "/")
  (unless (ends-with d "/") (setq d (append d "/" )))
  (let (lo '() grp-list '())
    (dolist (l (sort (file-tree d file-mask)))
      (if (= (lo 0) (l 0))
        (push (l 1) lo 1 -1)
        (begin
          (push lo grp-list -1)
          (set 'lo (list (l 0) (list (l 1)))))))
    (push lo grp-list -1)
    grp-list))

;; file-tree not using global variables and returns list of results
(define (file-tree d file-mask)
  (unless file-mask (setq file-mask ".*"))
  (let (list1 '())
    (dolist (f (replace "." (replace ".." (directory d))))
      (if (directory? (append d f))
        (set 'list1 (append list1 (file-tree (append d f "/"))))
        (if (regex file-mask f 1) (push (list f d) list1))))
    list1))

;; return name and list of places for duplicated entries
;(filter (fn (x) (> (length (x 1)) 1)) (dup-files "/home/dmi"))
;
;; return name and count for duplicate entries
;(filter (fn (x) (> (x 1) 1))
;  (map (fn (x) (list (x 0) (length (x 1)))) (dup-files "/home/dmi")))
Some notes.
- 'file-tree' doesn't really need in global variables.
- 'file-tree' returns a list of results, so we can later include it in nested functional sequence.
- 'file-tree' has no side effects. It's good because we will not care about them now.
- there is no really reason to have distinct 'f-mask' and 'file-mask'.
- I slightly change duping algorithm, so it's not using 8.7.2 features and also using only one instance of a 'file-tree' result.
- 'dup-files' returns more complex information. newlisp has a perfect engine for easy transforming resulting data to any form we want. Examples are at the bottom of code.

Next improvements? :-)
WBR, Dmi

newdep
Posts: 2038
Joined: Mon Feb 23, 2004 7:40 pm
Location: Netherlands

Post by newdep »

Nice program!

1 remark in general, dont use resursive directory search on a Unix machine
with 'directory? because it will seek into Symlinks and seek and seek and seek...
I did not test that with your tool (i dont have windows) ...

Regards, Norman.
-- (define? (Cornflakes))

Dmi
Posts: 408
Joined: Sat Jun 04, 2005 4:16 pm
Location: Russia
Contact:

Post by Dmi »

Indeed... just realize how 'count' works here... (I haven't 8.7.2 yet).
Nice! Better than mine in overall! :-))
WBR, Dmi

Lutz
Posts: 5289
Joined: Thu Sep 26, 2002 4:45 pm
Location: Pasadena, California
Contact:

Post by Lutz »

Instead of: (count li li) you just do: (count (unique li) li) on pre 8.7.2 and this way also save the (replace 0 li).

Lutz

alex
Posts: 100
Joined: Thu Mar 10, 2005 2:27 pm
Location: Russia

Post by alex »

Thanks all!

For some reason I have on my disk d: now 172756 files
and 30494 of them has 129349 duplicate names.
So d: is very good space for testing function dup-files :)
But not all is very good...

Preliminary testing:

My code (adapted)

Code: Select all

(context 'dup-files)
(setq file-mask nil)
(setq list1 '())

(define (dup-files:dup-files d f-mask)
  (if f-mask (setq file-mask f-mask) (setq file-mask ".*"))
  (replace "\\" d "/")
  (unless (ends-with d "/") (setq d (append d "/" )))

(println "time1="(time
  (file-tree d)
))

  (let (li nil) 
(println "time2="(time  (begin
    (setq li (map (lambda (x) (nth 0 x)) list1))
    (setq li (map (lambda (x y) (if (!= y 1) x 0)) list1 (count li li) ))
    (setq li (sort (replace 0 li))) 
)))
    li
  ) 
)

(define (file-tree d)
  (dolist (f (replace "." (replace ".." (directory d)))) 
    (if (directory? (append d f)) 
      (file-tree (append d f "/"))
      (if (regex file-mask f 1) (push (list f d) list1)) ) ) )

(context 'MAIN)
DMI code (adapted)

Code: Select all

(define (dup-files d file-mask) 
  (replace "\\" d "/") 
  (unless (ends-with d "/") (setq d (append d "/" ))) 
  (let (lo '() grp-list '() li nil) 

(println "time1="(time
    (setq li (file-tree d file-mask))
))

(println "time2=" (time (begin
    (dolist (l (sort li)) 
      (if (= (lo 0) (l 0)) 
        (push (l 1) lo 1 -1) 
        (begin 
          (push lo grp-list -1) 
          (set 'lo (list (l 0) (list (l 1)) )))))
    (push lo grp-list -1) 
)))
    grp-list
  )
) 

;; file-tree not using global variables and returns list of results 
(define (file-tree d file-mask) 
  (unless file-mask (setq file-mask ".*")) 
  (let (list1 '()) 
    (dolist (f (replace "." (replace ".." (directory d)))) 
      (if (directory? (append d f)) 
        (set 'list1 (append list1 (file-tree (append d f "/") file-mask))) 
        (if (regex file-mask f 1) (push (list f d) list1)))) 
    list1)) 

Test program

Code: Select all

;; You can switch tests
(load "variant1.lsp")   ;; my first variant
#(load "variant2.lsp")   ;; DMI variant
#(load "variant3.lsp")   ;; new variant

(setq li (dup-files "d:"))
(exit)
My variant output:

Code: Select all

time1=14375
time2=12000
DMI variant output:

Code: Select all

time1=146109
time2=9000
We can see, that one part of DMI algorithm - "file-tree not using
global variables" - is very slowly, but another part is more common and fast
and, moreover, can be improved (see below). So I decide to stop on
"middle" variant now:

Code: Select all

(context 'dup-files)
(setq file-list '())
(setq file-mask nil)

(define (dup-files:dup-files dir file-mask) 
  (unless file-mask (setq file-mask ".*"))
  (replace "\\" dir "/") 
  (unless (ends-with dir "/") (setq dir (append dir "/" ))) 

(println "time1="(time
  (file-tree dir)
))

  (let (result '() tv nil) 

(println "time2=" (time
    (dolist (x (sort file-list))
      (if (= tv (x 0)) 
        (push (x 1) result 0 1 0)
        (begin
          (setq tv (x 0))
          (push (list tv (list (x 1)) ) result) ) ) )
))
    (setq file-list '())   ;; restore original values for the 
    (setq file-mask nil)   ;; purpose of "reduce" side effects :)
    result) ) 

(define (file-tree d)
  (dolist (f (replace "." (replace ".." (directory d)))) 
    (if (directory? (append d f)) 
      (file-tree (append d f "/"))
      (if (regex file-mask f 1) (push (list f d) file-list)) ) ) )

(context 'MAIN)
"middle" variant output:

Code: Select all

time1=14469
time2=1906
More remarks?

PS. Final variant will be later.

alex
Posts: 100
Joined: Thu Mar 10, 2005 2:27 pm
Location: Russia

Post by alex »

Linux users can replace file-tree by something near
(ATTENTION! I hav no Linux, and can't test the code,
but idea must be clear)

Code: Select all

(define (file-tree d file-mask) 
  (let (list1 '() tv nil tv1 nil) 
    (if file-mask 
      (setq tv (append " -iname " file-mask) )
      (setq tv "") ) 
    (setq tv (exec (append "find.exe " d " -type f " tv))) 
    (dolist (f tv) 
      (setq tv1 (parse f "/"))
      (push (list (last tv1) (join (chop tv1) "/")) list1)) 
    list1) ) 

Dmi
Posts: 408
Joined: Sat Jun 04, 2005 4:16 pm
Location: Russia
Contact:

Post by Dmi »

Nice timings. Nice improvements!

Some notes:
1. (setq file-mask nil) is completely useless because it is dynamically redefined as a 'dup-files function argument. Subsequent 'file-tree call will either use internal value of 'dup-files - not global one!

2. (setq list1 '()) can better be placed directly into 'dup-files function:
(define (dup-files:dup-files dir file-mask . list1)
(setq list1 '())
So, no side effects either and this will automatically destroy unneeded data after end of executing.
And then the final part you marked ";; restore original values" can be omited.

3. About "UNIX" variant: I don't think that 'find' in such wrap will be much faster. Just more unreadable ;-) Your 'native' 'file-tree code works fine with UNIX too.
Note:
In pure *NIX I will do something like:
$ find /dir -type f -name pattern >file-list
$ awk -F/ '{print $NF}' file-list|sort|uniq -cd >dup-filenames
... and then will grep or awk "file-list" file against entries in "dup-filenames" file by many available ways.
Pure standard *nix shell - no newlisp/basic/java/C at all.
*nix methods are wonderful ;-)

4. IMHO, it will be cool to have separate 'file-tree function under the hand in the MAIN context. I'll include one in my "funlib.lsp" till community will have establish a common library standards.
WBR, Dmi

alex
Posts: 100
Joined: Thu Mar 10, 2005 2:27 pm
Location: Russia

Post by alex »

Problem about dup-files transforms to two functions: list-assoc and file-tree

Code: Select all

(define (list-assoc li) 
;; version 2005.11.28
;; Transform list of lists to association list 
;; with unique key (more precise - see code and examles)
;;
;; syntax: (dup-files list-of-lists)
;;
  (let (result '() tv nil);; syntax: (dup-files dir)
    (dolist (x (sort li))
      (if (= tv (x 0)) 
        (dolist (y (rest x)) (push y result 0 1))
        (begin
          (setq tv (x 0))
          (push x result)
        )
      )
    )
    result
  )
) 

Code: Select all

(define (file-tree dir mask)
;; version 2005.11.28
;; Returns a list of pairs (filename directory)
;; where filename satisfy regular expression "str-pattern"
;;(more precise - see code and examles)
;;
;; syntax: (file-tree dir [str-pattern])
;; syntax: (file-tree dir-list [str-pattern])
;;
  (unless mask (setq mask ".*"))
  (unless (list? dir) (setq dir (list dir)))
  (letn 
    (
      result '()
      file-tree-utility (lambda (d)
       (dolist (f (replace "." (replace ".." (directory d)))) 
         (if (directory? (append d f)) 
          (file-tree-utility (append d f "/"))
          (if (regex mask f 1) (push (list f d) result)) ) ) )
    )   
    (dolist (d dir)
      (replace "\\" d "/") 
      (unless (ends-with d "/") (setq d (append d "/" ))) 
      (file-tree-utility d)    
    )
    result 
  )
)
list-assoc and file-tree enough common, fast and useful (IMHO)
Now I can easy get list of files with duplicate names

Code: Select all

(setq dupfiles (filter (lambda (x) (> (length x) 2)) (list-assoc (file-tree (list "c:" "d:" "e:")))))

alex
Posts: 100
Joined: Thu Mar 10, 2005 2:27 pm
Location: Russia

Post by alex »

It is interesting, that test-code

Code: Select all

(define (main)
(println "time1=" (time
  (setq li (file-tree "d:"))
))
(println "time2=" (time
  (setq li (list-assoc li))
))

(println "time3=" (time
  (setq li1 (list-assoc (file-tree "d:")))
))
)

(setq t (time (main)))
(println "full-time=" t)
(exit)
produce output:

time1=13421
time2=5391
time3=31577
full-time=50405

Why time3 is so big?
Where are time3 - time1 - time2 = 12.745 s ?

Dmi
Posts: 408
Joined: Sat Jun 04, 2005 4:16 pm
Location: Russia
Contact:

Post by Dmi »

You said, you have more than 170000 files, so about 350000 string cells for file-tree's result plus some memory for copying lists, plus overhead for unfreed-unused memory - say, big lists can leave big memory holes after freeing (newlisp can reuse them later, but can't free under windows, afaik).

On my linux,
(define (q1) (dup '("qwerty") 350000))
(set 'qq1 (q1))
used about 53m of vram and 49m of resident ram.
so, after time1 and time2 are passed, I think, more than 70 - 100m of ram is used. Such size may impact the OS, the filesystem cache, the fragmentation of something or so...

Try to check time3 separately from time1&time2

Just a guess... ;-)
WBR, Dmi

alex
Posts: 100
Joined: Thu Mar 10, 2005 2:27 pm
Location: Russia

Post by alex »

To DMI:
yes, indeed.

Thanks all!

alex
Posts: 100
Joined: Thu Mar 10, 2005 2:27 pm
Location: Russia

Post by alex »

More powerful version of file-tree

Code: Select all

(define-macro (file-tree _dir _filter)
;; by alex
;; version 2005.12.02
;; Returns a list of pairs (filename directory)
;; where filename satisfy to filter-function
;;(more precise - see code and examles)
;;
;; syntax: (file-tree dir [filter-function])
;; syntax: (file-tree dir-list [filter-function])
;;
;; Examples:
;;
;;  (setq list1 (file-tree "." (lambda (x) (starts-with x "b" nil))))
;;
;;  (setq list2 (file-tree "." ))
;;
;;  (define (filt x) (ends-with x ".htm" nil))
;;  (setq list3 (file-tree (list "c:" "d:") filt))
;;
  (if (symbol? _filter) (setq _filter (eval _filter))) 
  (letn 
    ( _result '()
      _file-tree-utility 
        (lambda (_d)
          (dolist (_f (replace "." (replace ".." (directory _d)))) 
            (if (directory? (append _d _f)) 
              (_file-tree-utility (append _d _f "/"))
              (if (or (not _filter) (and _filter (_filter _f))) 
                (push (list _f _d) _result)
              )  
            ) 
          ) 
        )
    )   

    (unless (list? _dir) (setq _dir (list _dir)))
    (dolist (_d _dir)
      (replace "\\" _d "/") 
      (unless (ends-with _d "/") (setq _d (append _d "/" ))) 
      (_file-tree-utility _d)    
    )
    _result 
  )
)

Locked