File Deduplicator

Introduction

# File-Deduplicator - Find and Remove Duplicates

**Vernox Utility Skill - Clean up your digital hoard.**

## Overview

File-Deduplicator is an intelligent file duplicate finder and remover. Uses content hashing to identify identical files across directories, then provides options to remove duplicates safely.

## Features

### ✅ Duplicate Detection - Content-based hashing (MD5) for fast comparison - Size-based detection (exact match, near match) - Name-based detection (similar filenames) - Directory scanning (recursive) - Exclude patterns (.git, node_modules, etc.)

### ✅ Removal Options - Auto-delete duplicates (keep newest/oldest) - Interactive review before deletion - Move to archive instead of delete - Preserve permissions and metadata - Dry-run mode (preview changes)

### ✅ Analysis Tools - Duplicate count summary - Space savings estimation - Largest duplicate files - Most common duplicate patterns - Detailed report generation

### ✅ Safety Features - Confirmation prompts before deletion - Backup to archive folder - Size threshold (don't remove huge files by mistake) - Whitelist important directories - Undo functionality (log for recovery)

## Installation

```bash clawhub install file-deduplicator ```

## Quick Start

### Find Duplicates in Directory

```javascript const result = await findDuplicates({ directories: ['./documents', './downloads', './projects'], options: { method: 'content', // content-based comparison includeSubdirs: true } });

console.log(`Found ${result.duplicateCount} duplicate groups`); console.log(`Potential space savings: ${result.spaceSaved}`); ```

### Remove Duplicates Automatically

```javascript const result = await removeDuplicates({ directories: ['./documents', './downloads'], options: { method: 'content', keep: 'newest', // keep newest, delete oldest action: 'delete', // or 'move' to archive autoConfirm: false // show confirmation for each } });

console.log(`Removed ${result.filesRemoved} duplicates`); console.log(`Space saved: ${result.spaceSaved}`); ```

### Dry-Run Preview

```javascript const result = await removeDuplicates({ directories: ['./documents', './downloads'], options: { method: 'content', keep: 'newest', action: 'delete', dryRun: true // Preview without actual deletion } });

console.log('Would remove:'); result.duplicates.forEach((dup, i) => { console.log(`${i+1}. ${dup.file}`); }); ```

## Tool Functions

### `findDuplicates` Find duplicate files across directories.

**Parameters:** - `directories` (array|string, required): Directory paths to scan - `options` (object, optional): - `method` (string): 'content' | 'size' | 'name' - comparison method - `includeSubdirs` (boolean): Scan recursively (default: true) - `minSize` (number): Minimum size in bytes (default: 0) - `maxSize` (number): Maximum size in bytes (default: 0) - `excludePatterns` (array): Glob patterns to exclude (default: ['.git', 'node_modules']) - `whitelist` (array): Directories to never scan (default: [])

**Returns:** - `duplicates` (array): Array of duplicate groups - `duplicateCount` (number): Number of duplicate groups found - `totalFiles` (number): Total files scanned - `scanDuration` (number): Time taken to scan (ms) - `spaceWasted` (number): Total bytes wasted by duplicates - `spaceSaved` (number): Potential savings if duplicates removed

### `removeDuplicates` Remove duplicate files based on findings.

**Parameters:** - `directories` (array|string, required): Same as findDuplicates - `options` (object, optional): - `keep` (string): 'newest' | 'oldest' | 'smallest' | 'largest' - which to keep - `action` (string): 'delete' | 'move' | 'archive' - `archivePath` (string): Where to move files when action='move' - `dryRun` (boolean): Preview without actual action - `autoConfirm` (boolean): Auto-confirm deletions - `sizeThreshold` (number): Don't remove files larger than this

**Returns:** - `filesRemoved` (number): Number of files removed/moved - `spaceSaved` (number): Bytes saved - `groupsProcessed` (number): Number of duplicate groups handled - `logPath` (string): Path to action log - `errors` (array): Any errors encountered

### `analyzeDirectory` Analyze a single directory for duplicates.

**Parameters:** - `directory` (string, required): Path to directory - `options` (object, optional): Same as findDuplicates options

**Returns:** - `fileCount` (number): Total files in directory - `totalSize` (number): Total bytes in directory - `duplicateSize` (number): Bytes in duplicate files - `duplicateRatio` (number): Percentage of files that are duplicates

## Use Cases

### Digital Hoarder Cleanup - Find duplicate photos/videos - Identify wasted storage space - Remove old duplicates, keep newest - Clean up download folders

### Document Management - Find duplicate PDFs, docs, reports - Keep latest version, archive old versions - Prevent version confusion - Reduce backup bloat

### Project Cleanup - Find duplicate source files - Remove duplicate build artifacts - Clean up node_modules duplicates - Save storage on SSD/HDD

### Backup Optimization - Find duplicate backup files - Remove redundant backups - Identify what's actually duplicated - Save space on backup drives

## Configuration

### Edit `config.json`: ```json { "detection": { "defaultMethod": "content", "sizeTolerancePercent": 0, // exact match only "nameSimilarity": 0.7, // 0-1, lower = more similar "includeSubdirs": true }, "removal": { "defaultAction": "delete", "defaultKeep": "newest", "archivePath": "./archive", "sizeThreshold": 10485760, // 10MB threshold "autoConfirm": false, "dryRunDefault": false }, "exclude": { "patterns": [".git", "node_modules", ".vscode", ".idea"], "whitelist": ["important", "work", "projects"] } } ```

## Methods

### Content-Based (Recommended) - Fast MD5 hashing - Detects exact duplicates regardless of filename - Works across renamed files - Perfect for documents, code, archives

### Size-Based - Compares file sizes - Faster than content hashing - Good for media files where content hashing is slow - Finds near-duplicates (similar but not exact)

### Name-Based - Compares filenames - Detects similar named files - Good for finding version duplicates (file_v1, file_v2)

## Examples

### Find Duplicates in Documents ```javascript const result = await findDuplicates({ directories: '~/Documents', options: { method: 'content', includeSubdirs: true } });

console.log(`Found ${result.duplicateCount} duplicate sets`); result.duplicates.slice(0, 5).forEach((set, i) => { console.log(`Set ${i+1}: ${set.files.length} files`); console.log(` Total size: ${set.totalSize} bytes`); }); ```

### Remove Duplicates, Keep Newest ```javascript const result = await removeDuplicates({ directories: '~/Documents', options: { keep: 'newest', action: 'delete' } });

console.log(`Removed ${result.filesRemoved} files`); console.log(`Saved ${result.spaceSaved} bytes`); ```

### Move to Archive Instead of Delete ```javascript const result = await removeDuplicates({ directories: '~/Downloads', options: { keep: 'newest', action: 'move', archivePath: '~/Documents/Archive' } });

console.log(`Archived ${result.filesRemoved} files`); console.log(`Safe in: ~/Documents/Archive`); ```

### Dry-Run Preview Changes ```javascript const result = await removeDuplicates({ directories: '~/Documents', options: { dryRun: true // Just show what would happen } });

console.log('=== Dry Run Preview ==='); result.duplicates.forEach((set, i) => { console.log(`Would delete: ${set.toDelete.join(', ')}`); }); ```

## Performance

### Scanning Speed - **Small directories** (<1000 files): <1s - **Medium directories** (1000-10000 files): 1-5s - **Large directories** (10000+ files): 5-20s

### Detection Accuracy - **Content-based:** 100% (exact duplicates) - **Size-based:** Fast but may miss renamed files - **Name-based:** Detects naming patterns only

### Memory Usage - **Hash cache:** ~1MB per 100,000 files - **Batch processing:** Processes 1000 files at a time - **Peak memory:** ~200MB for 1M files

## Safety Features

### Size Thresholding Won't remove files larger than configurable threshold (default: 10MB). Prevents accidental deletion of important large files.

### Archive Mode Move files to archive directory instead of deleting. No data loss, full recoverability.

### Action Logging All deletions/moves are logged to file for recovery and audit.

### Undo Functionality Log file can be used to restore accidentally deleted files (limited undo window).

## Error Handling

### Permission Errors - Clear error message - Suggest running with sudo - Skip files that can't be accessed

### File Lock Errors - Detect locked files - Skip and report - Suggest closing applications using files

### Space Errors - Check available disk space before deletion - Warn if space is critically low - Prevent disk-full scenarios

## Troubleshooting

### Not Finding Expected Duplicates - Check detection method (content vs size vs name) - Verify exclude patterns aren't too broad - Check if files are in whitelisted directories - Try with includeSubdirs: false

### Deletion Not Working - Check write permissions on directories - Verify action isn't 'delete' with autoConfirm: true - Check size threshold isn't blocking all deletions - Check file locks (is another program using files?)

### Slow Scanning - Reduce includeSubdirs scope - Use size-based detection (faster) - Exclude large directories (node_modules, .git) - Process directories individually instead of batch

## Tips

### Best Results - Use content-based detection for documents (100% accurate) - Run dry-run first to preview changes - Archive instead of delete for important files - Check logs if anything unexpected deleted

### Performance Optimization - Process frequently used directories first - Use size threshold to skip large media files - Exclude hidden directories from scan - Process directories in parallel when possible

### Space Management - Regular duplicate cleanup prevents storage bloat - Delete temp directories regularly - Clear download folders of installers - Empty trash before large scans

## Roadmap

- [ ] Duplicate detection by image similarity - [ ] Near-duplicate detection (similar but not exact) - [ ] Duplicate detection across network drives - [ ] Cloud storage integration (S3, Google Drive) - [ ] Automatic scheduling of scans - [ ] Heuristic duplicate detection (ML-based) - [ ] Recover deleted files from backup - [ ] Duplicate detection by file content similarity (not just hash)

## License

MIT

---

**Find duplicates. Save space. Keep your system clean.** 🔮

Back

Introduction

More Products

Summarize

Ontology

Nano Pdf