DistributionChart
! Command Line Distribution Chart

Scenario: you have a large set of files that are mostly identical, and you want to know how many are identical and how many differ. How do you do that on the Unix command line?

Here's my solution: first, run <code>md5sum</code> on all the files to get a hash of every file. Identical files will have the same hash. Then sort the hashes and count how many times each distinct hash occurs.

Here's the bash fragment to do this:

<verbatim>
for i in *.dat
do
    md5sum "$i" | awk '{ print $1 }'
done | sort | uniq -c
</verbatim>

and here's what the output looks like:

<verbatim>
      1 0f1c9426c5959d478d49f49063016563
     31 2846bde822c8d77c752fbb88e2d77997
      1 4be0e00d2cc87929e08b69d5e20700df
      1 5d3a104d7e3b5587791bc392c699736c
      3 9faa92c5423fc00e2ad1e47000e43cd4
      1 ccf2fb7b5278d8ceb48ce66bc141178f
</verbatim>

The first column is the number of files that share the hash in the second column. This shows clearly that the majority of the files (31 of 38) have the same content and there are just a few outliers.

Improving this to tell you which files are the same is left as an exercise for the reader.

-----
CategoryGeekStuff CategoryBlog
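For readers who want a starting point on the exercise of listing which files are the same, here is one possible sketch, not the only way to do it. It keeps the file names that the fragment above discards, sorts by hash so identical files end up adjacent, and prints each hash with its count followed by the files that share it. The function name <code>group_by_hash</code> is made up for this example, and the sketch assumes GNU <code>md5sum</code> and ordinary file names (no newlines or backslashes).

```shell
#!/usr/bin/env bash
# group_by_hash is a hypothetical helper name, not part of the original post.
# Assumes GNU md5sum output: 32 hex digits, a 2-character separator, then the
# file name, so the name starts at column 35.
group_by_hash() {
    md5sum -- "$@" |
        LC_ALL=C sort |
        awk '
            # Print the finished group: count, hash, then the indented names.
            function emit() { if (n) printf "%7d %s\n%s", n, prev, names }
            # A new hash starts a new group.
            $1 != prev { emit(); prev = $1; n = 0; names = "" }
            { n++; names = names "        " substr($0, 35) "\n" }
            END { emit() }
        '
}
```

Called as <code>group_by_hash *.dat</code>, it prints the same count and hash lines as the original fragment, each followed by the matching file names indented beneath it.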