Recently, I was tasked with comparing the directory structures of two versions of a project to identify 'new files' and 'changed files' between the current version and the prior version. The goal was to produce a list of files that exist in the current version but not in the prior version, along with any files in the current version that have changed since the prior version.
The Script
$folder1 = "c:\PathToFolder1"
$folder2 = "c:\PathToFolder2"
# Get file lists
$folder1files = Get-ChildItem -Recurse -File $folder1
$folder2files = Get-ChildItem -Recurse -File $folder2
# Get relative paths for comparison
$relativePathFolder1 = @{ }
$relativePathFolder2 = @{ }
foreach ($file in $folder1files)
{
    $relativePathFolder1[$file.FullName.Substring($folder1.Length + 1)] = $file.FullName
}
foreach ($file in $folder2files)
{
    $relativePathFolder2[$file.FullName.Substring($folder2.Length + 1)] = $file.FullName
}
# Find new files
Write-Host "`nFinding new files..."
$newFiles = $relativePathFolder1.Keys | Where-Object { -not $relativePathFolder2.ContainsKey($_) }
# Find changed files
Write-Host "`nFinding changed files..."
$changedFiles = @()
foreach ($file in $relativePathFolder1.Keys)
{
    if ($relativePathFolder2.ContainsKey($file))
    {
        # Calculate hashes for both files
        $hashfolder1 = Get-FileHash $relativePathFolder1[$file] -Algorithm MD5 | Select-Object -ExpandProperty Hash
        $hashfolder2 = Get-FileHash $relativePathFolder2[$file] -Algorithm MD5 | Select-Object -ExpandProperty Hash
        # Compare hashes
        if ($hashfolder1 -ne $hashfolder2)
        {
            Write-Host "Changed file: $file"
            $changedFiles += $file
        }
    }
}
# Output results
Write-Host "`nNew Files:"
if ($newFiles.Count -eq 0)
{
    Write-Host "No new files found."
}
else
{
    $newFiles | ForEach-Object { Write-Host $_ }
}
Write-Host "`nChanged Files:"
if ($changedFiles.Count -eq 0)
{
    Write-Host "No changed files found."
}
else
{
    $changedFiles | ForEach-Object { Write-Host $_ }
}
Steps Performed by the Script
1 Define Folder Paths:
- The paths to the 'PathToFolder1' (current version) and 'PathToFolder2' (prior version) folders are stored in $folder1 and $folder2.
2 Retrieve File Lists:
- Get-ChildItem is used to recursively retrieve all files (including those in subdirectories) from each folder.
3 Generate Relative Path Dictionaries:
- Two dictionaries (hashtables), $relativePathFolder1 and $relativePathFolder2, are created:
- Keys are file paths relative to the root folder.
- Values are the full file paths.
- This enables direct comparison of files regardless of their root folder paths.
4 Identify New Files:
- Compares the keys (relative paths) in $relativePathFolder1 against $relativePathFolder2.
- Files in 'PathToFolder1' that are not present in 'PathToFolder2' are added to $newFiles.
5 Identify Changed Files:
- Iterates through the relative paths in $relativePathFolder1 that also exist in $relativePathFolder2.
- Computes the hash of each file using Get-FileHash with the MD5 algorithm.
- Compares the hashes of the corresponding files from both folders.
- Files with mismatched hashes are added to $changedFiles.
6 Output Results:
- Lists new files by displaying each entry in $newFiles. If no new files are found, it outputs "No new files found."
- Lists changed files by displaying each entry in $changedFiles. If no changed files are found, it outputs "No changed files found."
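The steps above can also be packaged as a reusable function. What follows is a minimal sketch, not the script from this post: the names Compare-FolderContent and Get-RelativeFileMap and their parameters are hypothetical, and it assumes PowerShell 4.0 or later (for Get-FileHash). It differs from the script in two ways: it trims any trailing path separator from the root before computing relative paths, and it emits objects instead of calling Write-Host so the results can be filtered or exported.

```powershell
# Hypothetical helper; the function and parameter names are illustrative only.
function Compare-FolderContent {
    param(
        [Parameter(Mandatory)] [string] $CurrentPath,
        [Parameter(Mandatory)] [string] $PriorPath
    )

    # Build a hashtable of relative path -> full path for one root folder.
    function Get-RelativeFileMap([string] $Root) {
        # Resolve to a full path and drop any trailing separator so Substring is consistent.
        $rootFull = (Resolve-Path -LiteralPath $Root).ProviderPath.TrimEnd([char[]]'\/')
        $map = @{}
        foreach ($file in Get-ChildItem -LiteralPath $rootFull -Recurse -File) {
            $map[$file.FullName.Substring($rootFull.Length + 1)] = $file.FullName
        }
        return $map
    }

    $current = Get-RelativeFileMap $CurrentPath
    $prior   = Get-RelativeFileMap $PriorPath

    foreach ($relative in ($current.Keys | Sort-Object)) {
        if (-not $prior.ContainsKey($relative)) {
            # Present in the current version only.
            [pscustomobject]@{ Status = 'New'; RelativePath = $relative }
        }
        else {
            # Present in both versions: compare content hashes.
            $currentHash = (Get-FileHash -LiteralPath $current[$relative] -Algorithm MD5).Hash
            $priorHash   = (Get-FileHash -LiteralPath $prior[$relative] -Algorithm MD5).Hash
            if ($currentHash -ne $priorHash) {
                [pscustomobject]@{ Status = 'Changed'; RelativePath = $relative }
            }
        }
    }
}
```

With the placeholder paths from the script, a call would look like: Compare-FolderContent -CurrentPath 'c:\PathToFolder1' -PriorPath 'c:\PathToFolder2' | Format-Table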
Purpose of Using File Hashes
1 Content-Based Comparison:
- Why: Comparing files based on content rather than metadata ensures you accurately detect differences. Files may have the same name, size, and timestamps but still differ in content.
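- How: The MD5 hash acts as a fingerprint of the file's content. If two files have different hashes, their content is definitely not identical; if the hashes match, the content is almost certainly identical, since accidental collisions are extremely unlikely.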
2 Efficient and Reliable:
- File hashes allow for a quick and reliable comparison of file content without manually inspecting the files or relying on less reliable methods like file size.
- MD5 is widely used for this type of integrity checking in scenarios where cryptographic security isn't a concern (as in this use case).
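To illustrate the point about content versus metadata, here is a small sketch that writes two files of identical size but different content and compares both their sizes and their MD5 hashes. The temp folder and the file names a.txt and b.txt are made up for the demonstration.

```powershell
# Work in a throwaway temp folder (hypothetical file names).
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ([System.Guid]::NewGuid().ToString())
New-Item -ItemType Directory -Path $demo | Out-Null
$fileA = Join-Path $demo 'a.txt'
$fileB = Join-Path $demo 'b.txt'

# Same number of bytes, different content.
[System.IO.File]::WriteAllText($fileA, 'version=1')
[System.IO.File]::WriteAllText($fileB, 'version=2')

$sameSize = (Get-Item -LiteralPath $fileA).Length -eq (Get-Item -LiteralPath $fileB).Length
$sameHash = (Get-FileHash -LiteralPath $fileA -Algorithm MD5).Hash -eq
            (Get-FileHash -LiteralPath $fileB -Algorithm MD5).Hash

"Same size: $sameSize"   # Same size: True
"Same hash: $sameHash"   # Same hash: False

# Clean up the demo folder.
Remove-Item -LiteralPath $demo -Recurse -Force
```

A size-only check would report these two files as unchanged, while the hash comparison catches the difference.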
Breaking Down the Hash Command
- Get-FileHash:
  - Calculates a hash of the file's content, here using the MD5 algorithm.
- -Algorithm MD5:
  - MD5 is used for its simplicity and speed. While it is not suitable for cryptographic purposes, it works well for detecting changes between trusted copies of a file.
  - I could have used the SHA-256 algorithm for greater collision resistance; however, given the speed trade-off, it was not required for this comparison.
- Select-Object -ExpandProperty Hash:
  - Extracts only the Hash property of the Get-FileHash result, making it easier to compare directly.
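To make the breakdown concrete, the sketch below hashes a small sample file and pulls out the hash string in two equivalent ways. The file name hash-demo.txt is hypothetical, and the SHA256 line is only there to show that switching algorithms means changing a single parameter value.

```powershell
# Hypothetical sample file in the temp folder.
$sample = Join-Path ([System.IO.Path]::GetTempPath()) 'hash-demo.txt'
[System.IO.File]::WriteAllText($sample, 'hello')

# Get-FileHash returns an object with Algorithm, Hash, and Path properties.
$result = Get-FileHash -LiteralPath $sample -Algorithm MD5

# Two equivalent ways to get just the hash string.
$viaSelect   = $result | Select-Object -ExpandProperty Hash
$viaProperty = $result.Hash

# Swapping to SHA-256 only changes the -Algorithm value.
$sha256 = (Get-FileHash -LiteralPath $sample -Algorithm SHA256).Hash

"MD5 (Select-Object): $viaSelect"
"MD5 (property)     : $viaProperty"
"SHA256             : $sha256"

# Clean up the sample file.
Remove-Item -LiteralPath $sample
```

Either form yields a plain string, which is what makes the -ne comparison in the script work directly.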
Why Not Use Metadata?
- Using file properties like size or timestamps could lead to false positives or negatives:
- A file may change content without changing its size.
- Timestamps might not reliably reflect changes, especially when copied or modified under certain conditions.
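One asymmetry is worth noting: a size difference always means the content differs, whereas matching sizes prove nothing. If hashing every file becomes slow, size could therefore serve as a cheap pre-check before hashing. This is a sketch of that idea under my own assumptions, not something the script above does; the function name Test-FileChanged and its parameters are hypothetical.

```powershell
# Hypothetical helper: returns $true when two files differ in content.
function Test-FileChanged {
    param(
        [Parameter(Mandatory)] [string] $PathA,
        [Parameter(Mandatory)] [string] $PathB
    )

    # Different sizes always mean different content, so no hashing is needed.
    if ((Get-Item -LiteralPath $PathA).Length -ne (Get-Item -LiteralPath $PathB).Length) {
        return $true
    }

    # Same size proves nothing, so fall back to comparing content hashes.
    $hashA = (Get-FileHash -LiteralPath $PathA -Algorithm MD5).Hash
    $hashB = (Get-FileHash -LiteralPath $PathB -Algorithm MD5).Hash
    return $hashA -ne $hashB
}
```

The result is the same as hashing both files every time; the only gain is skipping the hash when the sizes already differ.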
Key Features:
- Recursive Comparison: Handles files in subdirectories by comparing relative paths.
- Content Comparison: Ensures changes are detected based on content, not just metadata, using hash comparison.
- Informative Output: Provides clear feedback on newly added or changed files.
Conclusion
This script provides a simple solution for comparing files between two directories. By identifying new files and detecting changed files based on their content, the script ensures that you have a clear understanding of differences between folder versions.
Its use of relative paths and hash comparisons ensures accuracy, while its straightforward implementation makes it easy to adapt for other scenarios and maintain better control over your directory structures.