or, "Monolithic Mashup Wordlists Considered Harmful"
or, "Sort once, deduplicate often"
Hi!
Someone may have sent you here in response to a question like "How do I sort and dedupe this 300GB wordlist?"
Don't.
Do this instead.
Spend a little time to understand why now ... instead of mopping up later.
The motivation
A recurring workflow in password cracking concerns the management of new wordlist sources (human wordlist corpora, leaks, etc.). This usually means collection, followed at times by deduplication.
The reason for the collecting phase of this wordlist management should be clear. Password cracking needs to match the human process of password creation. If someone uses their kid's name in a password, then a list of many first names as a base wordlist (a list first assembled by Ron Bowes of Skull Security) can get many cracks.
The deduplicating phase of wordlist management is driven by the need for efficiency. We don't want to use the same word twice. Even for very fast hashes, attacks that use these lists as a base, and then amplify that work with rules, etc. can take significantly less time if duplicates are removed.
Many people learn how to manage the deduplication part of the process iteratively, in a way that scales poorly. This eventually compels them to seek help in the cracking forums or chats. (No shame -- this progress corresponds with their own knowledge journey in a very understandable way.)
My suggested approach is very different from the one that usually emerges organically, and is superior for larger-scale use cases (without having to resort to a database).
This approach is good for password cracking, but can also be used for any other problem space with similar "dedupe a recurring stream of new sets of strings over time" use cases.
The key benefit of my approach is that you only need enough RAM to hold a factor (double or quadruple, depending on speed tradeoffs) of each new file you merge in, instead of enough RAM/storage/tempfiles to handle the entire thing.
The tl;dr
If you're just looking for a mini-HOWTO:
#!/bin/bash
# Create a repository for deduplicated wordlists.
wordlist_dir=/wordlists
cd "$wordlist_dir" || exit
if [ ! -d ./new/ ]; then
  mkdir ./new/
fi
# Initialize repo with the Hashmob full founds file.
# Only do this manually once, prior to first run.
# wget -c \
# https://hashmob.net/api/v2/archive/hashmob.net_[YEAR].found
# cp [hashmob-freq-founds-file] ./new/00-[filename]
# Consider splitting large files first.
# Dedupe and add any wordlists missing from the repo.
for file in ./*.dict; do
  if [ ! -f "./new/${file##*/}.new" ]; then
    # Remove all dupes from $file found in ./new/*
    # and write the output to $file.new.
    # https://github.com/Cynosureprime/rlite
    rlite -m "$file" -o "$file.new" ./new/*
    # Move the new file into your repository.
    mv "$file.new" ./new/
  fi
done
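Recurring use is then just drop-and-rerun. Assuming you've saved the script above as (say) /wordlists/merge-new.sh -- both the script name and the new wordlist name here are hypothetical:
# Drop a new raw wordlist into the wordlist directory ...
$ cp ~/incoming/leak9.dict /wordlists/
# ... and re-run the script; only .dict files not yet merged are processed.
$ bash /wordlists/merge-new.sh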
How we got here: the knowledge journey
If we can understand how people iterate when learning the traditional approaches of recurring wordlist deduplication, we can learn when (and why) it's worth transcending them.
Knowledge Phase One: Collect all the (wordlist) things
When new password crackers search online for advice about how to deduplicate multiple large files, the common answers usually say something like this:
$ cat *.dict | sort -u > superdict
... which does this:
1. Concatenates all the files into one big input pipeline
2. Runs sort -u on the input (-u enforces uniqueness by removing duplicates)
3. Writes the output to disk as a single file.
This works fine ... for a while.
Knowledge Phase Two: Optimize all the things
Our new password cracker will quickly hit a wall: local resource size constraints (RAM and storage). Deduplicating a 100GB file with 8GB of RAM is painfully slow, and gets even more so as the total amount of data increases. To help, the Internet will next tell our hero to do this:
$ cat *.dict | LC_ALL=C sort -u -S 4G -T /path/to/ssd/ > superdict
- The LC_ALL=C variable tells sort to use a simpler and faster way to compare strings for sorting - one that is based on single bytes, and not on complex multibyte / multilanguage rules for sorting.
- The -S flag says how much RAM to use. Tune this to taste. Somewhere close to your amount of physical RAM, less everyday overhead, is usually the most efficient.
- The -T flag says to write the tempfiles to a specific directory -- often a faster SSD, or a drive with enough room to store all of the tempfiles (roughly the same size as the input). These temp files are used in the final sorting phase (which performs an N-way merge sort).
This works better ... for a while. It's definitely faster, and keeps you from accidentally filling up your wordlist drive. But it's also going to drive your recurring writes through the roof, reducing media lifetimes.
Knowledge Phase Three: Parallel all the things
Next, our hero will hit another wall: local compute constraints (sort is CPU-bound). Ole Tange's excellent parsort will quickly surface in searches as a nice solution. The invocation usually ends up looking something like this:
$ LC_ALL=C parsort -u -S 500M \
-T /path/to/sacrificial/ssd/ \
--parallel=[thread-count] \
*.dict \
> superdict
The -u and -T flags are the same as sort's. The -S option is similar to sort's, but it's the memory allocated per thread, so may need to be tuned carefully. The --parallel option allows parsort to operate on [thread-count] files in parallel (to the point of I/O saturation). This means that your optimal thread count may be less than your actual available thread count.
This works even better ... for a while. This approach does use CPU more efficiently ... but is still bound by the sheer volume of both compute and storage writes necessary.
Our hero has now gotten very good ... at treating a symptom.
To the pain¹
But all of these approaches eventually hit the same wall. As each new wordlist is added, the storage and sorting resources needed increase with the size of all previous unique strings. This eventually melts even a reasonably hefty cracking system.
It's at this point that our hero usually shows up in the cracking forums or chats, asking "How can I sort my 200GB wordlist?". And then a month later, "How can I sort my 500GB wordlist?"
And as I've been building up to ... the correct answer (IMO) is ... you don't. Sometimes, the correct answer to "how do I make it hurt less when I hit myself in the face?" is ... "stop hitting yourself in the face". Especially when you're hitting yourself in the face over and over. And every time you do it, you're hitting harder and harder. And you intend to keep doing it indefinitely.
Instead ... unlearn all of the above, and think about the problem in an entirely different way.
You need to ...
Flip the script
The key insight: you don't need to store everything in a single file to deduplicate it. Instead, you can just dedupe each new file, one by one, removing everything that's already in your repository of unique words. And then add that new file to the repository.
Once you're not mashing it all into a single file, you also don't have to constantly re-sort the entire repository. Instead, you only have to sort and index the new file in memory, remove everything that's already in the repository, and write the delta to disk.
But how do we do that?
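Conceptually, if every file involved were kept in LC_ALL=C sorted order, you could sketch the whole loop with nothing but coreutils. The filenames here are hypothetical, and comm -23 prints only the lines that appear in its first input and not its second:
# Sort the new list once (this also dedupes it).
$ LC_ALL=C sort -u leak5.dict > leak5.sorted
# Keep only the lines not already present anywhere in the repository.
$ LC_ALL=C sort -m ./new/* | LC_ALL=C comm -23 leak5.sorted - > leak5.dict.new
# Commit the delta to the repository.
$ mv leak5.dict.new ./new/
In practice, though, we'd rather not force the whole repository into sorted order, and purpose-built tools do the same job in memory, in parallel, and much faster.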
hashcat's solution: rli
To support this approach -- starting with a "base" file, and removing matching records from other input files -- the hashcat project published the rli and rli2 utilities.
These utilities have the following trade-offs:
- rli can take multiple "remove" files, but is limited by the memory size of the "base" file:
$ rli [basefile] [removefiles...]
[basefile] will be altered, removing all records also appearing in any [removefiles].
- rli2 can only take one "remove" file, and both the input file and the "remove" file must already be in LC_ALL=C sort order on disk (for a two-way merge sort):
$ rli2 [basefile-sorted] [one-removefile-sorted]
- Both rli and rli2 are single-threaded. This was less concerning when they were created, but both thread counts and I/O speeds have increased substantially since.
I nerdsnipe Cynosure Prime: rling
To address some of these trade-offs, and in part due to me nerd-sniping him (which lasted for six weeks, and remains one of the highlights of my cracking career), Waffle of the Cynosure Prime cracking team created rling ("rli next gen") to cover the same use case.
Key enhancements include strong parallelism, informative verbosity, memory usage estimates early in invocation, and a -b (binary tree) flag, which uses half as much memory in exchange for some reduction in speed.
$ rling [basefile] [outfile] [removefiles...]
The rling repository expands on the use cases, and also includes some extremely handy utilities for cracking workflow (I use its getpass daily).
The rling repo also shows bake-off statistics among rli, rling, and rling -b.
Cynosure Prime then promptly nerdsnipe themselves: rlite
Later, CsP's blazer created rlite, a "lite" version of rling, improving speed and simplifying the interface.
$ rlite [basefile] -o [outfile] [removefiles...]
For performance, rlite often has the edge (examples later), but not always -- see their respective GitHub READMEs for discussion.
My recommended approach
So now that we know which utilities can serve this use case, here's how I manage my deduplicated wordlist repositories over time in practice.
1. Create a new directory to store your deduped repository, such as ./new/
2. Pick the highest-quality, frequency-sorted, deduped file that you have as a "starter". I recommend the latest full Hashmob founds list.
3. Copy your "starter" file directly to ./new/. I recommend naming this file 00-[filename], so that it will usually be processed first when sourcing the directory for an attack.
4. Pick the next file you want to add to your repository -- say, leak5.dict. This is the file from which we will be removing all strings that are already in the repository.
5. Use rling or rlite to remove all lines from your new file that are already in your ./new/ repository:
$ rling leak5.dict leak5.dict.new ./new/*
5a. (If the memory estimates provided by these tools show that the file is too big to process in memory, use the -b flag to reduce memory usage in exchange for some reduction in speed. If the file is too big to fit into RAM to dedupe locally, use split to break it into N memory-sized chunks (split -n l/2 for two files, split -n l/3 for three files, etc.), and then run the dedupe loop for each chunk -- see the sketch after this list.)
6. Move this newly deduped file to your repository:
$ mv leak5.dict.new ./new/
7. Pick another file and repeat, processing all the files that you want to commit to the repository.
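For step 5a, here's a minimal sketch of the split-then-dedupe loop. The filename hugefile.dict and the two-way split are hypothetical -- pick a chunk count that fits your RAM:
# Split a too-large wordlist into two line-aligned chunks: hugefile.dict.aa and hugefile.dict.ab
split -n l/2 hugefile.dict hugefile.dict.
# Dedupe each chunk against the repository, then fold it in immediately,
# so that the second chunk is also deduped against the first.
for chunk in hugefile.dict.a?; do
  rling "$chunk" "$chunk.new" ./new/*
  mv "$chunk.new" ./new/
done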
You can (and should) write a wrapper script to make this easy. The tl;dr example above is a good starting point for bash.
The payoff
This method has the following benefits:
- You no longer have to sort arbitrarily large input. You're just sorting -- temporarily or permanently, depending on your use case -- only the new wordlist file. This means that total compute time required over time increases much more slowly. Adding another wordlist to your repository is no longer daunting.
- Required memory is reduced to a linear factor of the size of each new wordlist file, instead of the ever-growing size of all records you've ever ingested.
- Your ongoing need for disk space for larger inputs is effectively cut in half, because you don't have to pay the "need as much scratch space as the size of all my inputs" tax anymore.
- The total number of disk writes is dramatically reduced, improving storage media lifetimes.
- Because each new wordlist's unique contribution is preserved as its own file, you can see trends more clearly during an attack. More hits from a specific source means that you can return to that specific source and use it as a base for amplified attacks.
- Because the utilities I suggest sort and index the incoming file internally as part of the deduplication process, but can still write it out in its original order, you get to control the final sort order. You can choose to sort each file (great for recurring merge sorts), or preserve the existing (often frequency) order.
To process an entire backlog of wordlists (and to re-run when you add new ones), you could use something roughly like the tl;dr at the top of this post, which only dedupes and merges files that have not yet been merged. Drop a new wordlist into your raw wordlist directory, rerun the script, and you're done.
As your repository grows, instead of sorting a larger and larger single file, you're always removing a larger and larger set of sorted files from your latest acquisition (hence "sort once, deduplicate often").
Putting it all together -- full walkthrough
Here's the manual / unscripted walkthrough of the initial setup of a repository.
# Set up our repository of unique wordlists.
$ pwd
/wordlists
$ mkdir ./new/
# Get the latest full HashMob founds.
$ wget --quiet \
https://hashmob.net/api/v2/archive/hashmob.net_2025.found
# ... or from their CDN (but note that the filename will change over time):
$ wget --quiet \
https://cdn.hashmob.net/combined_founds/hashmob.net_2025-04-13.found.7z
$ 7z x hashmob.net_2025-04-13.found.7z
$ mv hashmob.net_2025-04-13.found hashmob.net_2025.found
$ wc -l hashmob.net_2025.found
1765746406
# Move the file to our new repo, prefixing the filename with '00-'.
$ mv hashmob.net_2025.found ./new/00-hashmob.net_2025.found
# Pick our first wordlist to dedupe and "merge".
$ wc -l /wordlists/hashes.org/found.2017.txt
324025174 /wordlists/hashes.org/found.2017.txt
$ ls -lah /wordlists/hashes.org/found.2017.txt
-rw-r----- 1 royce royce 3.4G Dec 31 2017 /wordlists/hashes.org/found.2017.txt
Dedupe our first file using rling:
$ rling /wordlists/hashes.org/found.2017.txt \
hashes-org-found.2017.txt.new ./new/*
Reading "/wordlists/hashes.org/found.2017.txt"...3545600666 bytes total in 1.8418 seconds
Counting lines...Found 324025174 lines in 1.5302 seconds
Optimal HashPrime is 805306457
Estimated memory required: 15,277,312,738 (14.23Gbytes)
Processing input list... 324025174 unique (0 duplicate lines) in 5.1602 seconds
Occupancy is 266748535/805306457 33.1239%, Maxdepth=7
Removing from "./new/00-hashmob.net_2025.found"... 319686833 removed
319,686,833 total lines removed in 42.8395 seconds
Writing to "hashes-org-found.2017.txt.new"
Wrote 4,338,345 lines in 0.3309 seconds
Total runtime 51.7027 seconds
... or rling with the -b option:
$ rling -b /wordlists/hashes.org/found.2017.txt \
hashes-org-found.2017.txt.new ./new/*
Reading "/wordlists/hashes.org/found.2017.txt"...3545600666 bytes total in 1.8453 seconds
Counting lines...Found 324025174 lines in 1.4881 seconds
Estimated memory required: 6,242,659,690 (5.81Gbytes)
Sorting... took 0.0000 seconds
De-duplicating: 324025174 unique (0 duplicate lines) in 0.2197 seconds
Removing from "./new/00-hashmob.net_2025.found"... 319686830 removed
319,686,830 total lines removed in 60.6416 seconds
Final sort in 2.2109 seconds
Writing to "hashes-org-found.2017.txt.new"
Wrote 4,338,345 lines in 0.3095 seconds
Total runtime 66.4956 seconds
... or rlite:
$ rlite /wordlists/hashes-org/found.2017.txt \
-o hashes-org-found.2017.txt.new ./new/*
Reading input: /wordlists/hashes-org/found.2017.txt
Total number of lines 324,025,174 Memory required (~5.72GBs)
Reading took 6.101 seconds
Sorting took 6.159 seconds
De-duplicating 324,025,174 unique (0 duplicate lines) took 0.570 seconds
Indexing took 0.967 seconds
Reading file ./new/00-hashmob.net_2025.found
Adjusting workload
Adjusting workload
Adjusting workload
Finished reading ./new/00-hashmob.net_2025.found E:0
Searching took 15.231 seconds
./new/00-hashmob.net_2025.found 320,336,275
Writing took 0.362 seconds
Unique matches: 319,686,829 Wrote 4,338,345 lines
Total time took 28.820 seconds
$ mv hashes-org-found.2017.txt.new ./new/
Merge the next file (already split into two pieces).
Let's stick with rlite for the rest of our example.
$ rlite /wordlists/hashes-org/found.2018.txt.aa \
-o hashes-org-found.2018.txt.aa.new ./new/*
Reading input: /wordlists/hashes-org/found.2018.txt.aa
Total number of lines 302,930,583 Memory required (~5.25GBs)
Reading took 4.317 seconds
Sorting took 5.806 seconds
De-duplicating 302,930,583 unique (0 duplicate lines) took 0.506 seconds
Indexing took 0.860 seconds
Reading file ./new/00-hashmob.net_2025.found
Adjusting workload
Adjusting workload
Adjusting workload
Finished reading ./new/00-hashmob.net_2025.found E:0
Reading file ./new/hashes-org-found.2017.txt.new
Finished reading ./new/hashes-org-found.2017.txt.new E:0
Searching took 14.225 seconds
./new/00-hashmob.net_2025.found 295,918,465
./new/hashes-org-found.2017.txt.new 9,333
Writing took 0.343 seconds
Unique matches: 295,428,831 Wrote 7,501,752 lines
Total time took 25.552 seconds
$ mv hashes-org-found.2018.txt.aa.new ./new/
$ rlite /wordlists/hashes-org/found.2018.txt.ab \
-o hashes-org-found.2018.txt.ab.new ./new/*
Reading input: /wordlists/hashes-org/found.2018.txt.ab
Total number of lines 310,447,312 Memory required (~5.31GBs)
Reading took 5.583 seconds
Sorting took 5.987 seconds
De-duplicating 310,447,312 unique (0 duplicate lines) took 0.516 seconds
Indexing took 0.852 seconds
Reading file ./new/00-hashmob.net_2025.found
Adjusting workload
Adjusting workload
Adjusting workload
Adjusting workload
Finished reading ./new/00-hashmob.net_2025.found E:0
Reading file ./new/hashes-org-found.2017.txt.new
Finished reading ./new/hashes-org-found.2017.txt.new E:0
Reading file ./new/hashes-org-found.2018.txt.aa.new
Finished reading ./new/hashes-org-found.2018.txt.aa.new E:0
Searching took 12.792 seconds
./new/00-hashmob.net_2025.found 306,364,679
./new/hashes-org-found.2017.txt.new 8,468
./new/hashes-org-found.2018.txt.aa.new 0
Writing took 0.247 seconds
Unique matches: 305,770,408 Wrote 4,676,904 lines
Total time took 25.461 seconds
$ mv hashes-org-found.2018.txt.ab.new ./new/
Here's what our repository looks like now:
$ ls -lA ./new/
total 22156996
-rw-r----- 1 royce royce 22443023546 Apr 18 02:55 00-hashmob.net_2025.found
-rw-r----- 1 royce royce 99176159 Apr 18 12:44 hashes-org-found.2017.txt.new
-rw-r----- 1 royce royce 94366744 Apr 18 12:45 hashes-org-found.2018.txt.aa.new
-rw-r----- 1 royce royce 52121267 Apr 18 12:46 hashes-org-found.2018.txt.ab.new
Wash, rinse, repeat.
Performance art
In a more real-world example, for one of my own repositories on NVMe that is 89GB in total size, using rlite and the same storage speeds etc. as in the full walkthrough above, on a system with 48 cores, removing that 89GB from hashes-org-found.2018.txt.ab (a 3.4GB file) takes 440 seconds (about 7.33 minutes), while consuming 6.6GB of RAM.
Using rlite's -m flag (enable lookup map), the memory usage rises to 10GB, but runtime drops to 420 seconds.
I should put a table here. 😅
Adjust to taste
This approach can be tweaked for a variety of use cases:
- Keep more than one repository. Want to keep a giant bucket of all possible words, but also dedicated sublists for things like unique usernames, email addresses, hostnames, etc.? Just maintain one ./new/ directory for each subtype.
- Make metadata easy. Want to keep simple metadata about each wordlist (line count, longest string, etc.)? Keep a ./new/.stats/ cache of this data on a per-file basis, refreshing an entry only when it doesn't exist or is older than its wordlist (a sketch follows this list). I use this method to keep a running count of all lines in my repositories, which I can summarize at a speed of more than 10 billion lines per second using simple shell-based arithmetic.
- Mash it up faster if you have to. Still want to occasionally regenerate a single sorted file? Modify your ingestion to sort each new file as it's added to the repository. Then you can use sort's -m / --merge option ("merge already sorted files") to merge them all without having to re-sort them (one-liner after this list). This necessarily destroys the original sort order, which may matter for some attacks, so YMMV.
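For the metadata idea, here's a minimal sketch of a per-file line-count cache. The ./new/.stats/ layout and the .count suffix are just my conventions, not a fixed format:
# Cache a line count for each repository file, refreshing only missing or stale entries.
mkdir -p ./new/.stats/
for f in ./new/*; do
  [ -f "$f" ] || continue
  stats="./new/.stats/${f##*/}.count"
  if [ ! -f "$stats" ] || [ "$f" -nt "$stats" ]; then
    wc -l < "$f" > "$stats"
  fi
done
# Summing the cached counts is then nearly free, no matter how large the repository gets.
total=0
for c in ./new/.stats/*.count; do
  total=$(( total + $(cat "$c") ))
done
echo "$total total lines"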
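And for the occasional monolithic regeneration: if every repository file has been stored pre-sorted, the merge-only step is a one-liner (superdict is a hypothetical output name):
$ LC_ALL=C sort -m ./new/* > superdict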
That's it!
Now you have more ways to hit yourself in the face. Gently.
Sort once, deduplicate often.
Footnotes
1. Yes, this is a Princess Bride reference. ↩︎
Created: Fri Apr 18 21:19:59 UTC 2025