Did Sqlite break my dupdb program?

Estimated reading time 15 minutes

Remember how a couple weeks ago I added SQLite to my duplicate file detection tool? Well, a day or so after, I ended up reverting my changes because I was getting too many false positives, even though nothing had truly changed beyond where I was storing the data. So. Here's a blogpost to chronicle the suffering of such things, what I did about it, and hopefully 1, the solution.

Step 1: Revert

Well, the first thing I did after repeatedly getting notifications for files that were most certainly brand new to the pictures folder was the following:

commit 2c5f349954eb76ef8c23b1c16392c4e1b35f191d
Date:   Tue Jul 15 18:42:53 2025 -0400

    Revert "Use SQLite for the database for dupdb"

    This reverts commit 864234afed31b80be3038c55c29bd1189fd20717.

    I'm noticing some strange behavior with the new SQLIte stuff, reverting
    for now

The simplest thing to do when a new "upgrade" has broken things is to just toss it out like a baby with the bathwater. 2 And so, out ye get. Five days later, I decided to actually start investing some time into the problem and figured I'd cover my bases and then reintroduce the flaky software.

Step 2: Unit tests!

By covering my bases, I of course mean adding some form of repeatable test. When I was blogging about the upgrade last time, I just built up a small test method over the course of the journey and called it good enough because things were seemingly working just fine as I plodded away. When there's not a lot of code and things are all working first try, unit tests don't always feel like they need to exist yet.

But when the software starts misbehaving a day or two after? Yeah, I want to add some tests in place so that I can refactor, prod, break, and slap the code around a little bit without worrying too much. And thus:

commit 989ac83b11db77dbc1194eed43d4635c11e28040
Date:   Sun Jul 20 11:53:47 2025 -0400

    Add unit tests against sqlite code

    They're all passing, so the issues I observed that led the previous
    revert are likely in the other methods and not a direct result of the
    sqlite db's queries and such

The unit tests themselves aren't particularly worth noting. Well, except for maybe one thing I did that made the rust compiler somewhat mad at me:

#[cfg(test)]
mod test {
    use super::*;
    use std::fs;
 
    static mut TEST_DB_NO: u32 = 0;

    fn open_test_database() -> Connection {
        let connection;
        unsafe {
            TEST_DB_NO += 1;
            let filename = format!("test_{TEST_DB_NO}.sqlite.db");
            let _ = fs::remove_file(&filename);
            connection = Connection::open(filename).expect("Cannot open database for test");
        };
        initialize(&connection);
        connection
    }

As you can see, I've got a mutable static. And so the compiler underlines this and then proceeds to toss up a warning whenever I run cargo test, telling me:

shared references to mutable statics are dangerous; it's undefined behavior if the static is mutated or if a mutable reference is created for it while the shared reference lives

Yes, yes, very nice. Very true. Thank you Crab Compiler. But I don't think I'm overly worried about a single global incrementing number opened up by tests run in serial on the same machine. Granted, this did give me a reason to use the unsafe keyword for the first time ever. I have yet to see this cause a problem in my tests, and the list of .sqlite.db files in my folder from running the tests remains constant and without explosions.
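
If the warning ever starts to really bug me, the same one-global-counter trick works without unsafe at all by reaching for an atomic. A quick sketch of the shape of it (this isn't what's in dupdb right now, just an alternative):

use std::sync::atomic::{AtomicU32, Ordering};

static TEST_DB_NO: AtomicU32 = AtomicU32::new(0);

fn next_test_db_filename() -> String {
    // fetch_add returns the value *before* the add, so add one to mirror the
    // increment-then-use behavior of the original helper.
    let n = TEST_DB_NO.fetch_add(1, Ordering::SeqCst) + 1;
    format!("test_{n}.sqlite.db")
}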

I use my little helper from each of my tests so that they have a clean slate each time:

#[test]
fn count_of_non_existing_hash_should_be_0() {
    let connection = open_test_database();
    let count_of_nothingness = count_of_same_hash(&connection, 196248234750);
    let marking_time_waiting_for_death = 0;
    assert_eq!(count_of_nothingness, marking_time_waiting_for_death);
}

And while I won't copy over every test to show you, their function names alone can show that I've got pretty decent coverage of the core functionality of my little dupe database:

fn count_of_non_existing_hash_should_be_0() {
fn count_of_existing_hash_should_be_n() {
fn select_dupes_based_on_filepath_hash() {
fn can_delete_from_database_for_matches() {

So, if it's not the SQL that's the problem, then it must be the methods surrounding the usage of the SQL. And so, I turned to the harder task of writing unit tests for the functions I had written into the main file. First, I moved things over to a dupdb.rs module, and then I went ahead and wrote these:

fn adding_a_dupe_can_be_detected () {
fn removing_a_file_removes_detected_dupes () {
fn does_not_detect_dupes_in_dir_if_not_there () {
fn does_detect_dupes_in_dir_if_not_there () {

The first two tests verified that the wrappers around the SQL, the methods in the DuplicateDatabase struct, weren't doing anything strange or odd. Like the SQL tests, I created a standalone and fresh database for each test and then did the testing in one shot.
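
The dupdb tests below lean on a get_test_dupdb() helper. I won't paste the real thing, but it amounts to something like this sketch (the sqlite-backed DuplicateDatabase's fields aren't shown in this post, so treat the struct literal as a guess at its shape):

// Hypothetical sketch: every dupdb test gets its own freshly initialized
// database, the same trick as open_test_database in the sql tests.
fn get_test_dupdb() -> DuplicateDatabase {
    let connection = open_test_database();
    DuplicateDatabase { connection }
}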

The second set of tests focused on the dupdb_update_hashes_for method, as it was mostly disconnected from difficult-to-test things such as actually moving files around on the OS. To avoid that bother, I instead made a test folder with duplicate txt files in it, as well as a folder without any duplicate files in it. Then it was simple to write some verification tests for the dupdb_update_hashes_for code:

#[test]
fn does_detect_dupes_in_dir_if_not_there () {
    let mut dupdb = get_test_dupdb();

    let path: PathBuf = [".", "test", "dupes"].iter().collect();
    let entries = RecursiveDirIterator::new(&path)
                      .expect("Could not load path to reindex database");
    let paths: Vec<PathBuf> = entries
        // Remove directories, keep files only.
        .filter(|dir_entry| dir_entry.path().extension().is_some())
        .map(|file| file.path())
        .collect();

    let hashes: Vec<u64> = paths
        .iter()
        .map(|path| {
            let bytes = fs::read(path).expect("Test files are not set up correctly");
            seahash::hash(&bytes)
        }).collect();
    
    dupdb_update_hashes_for(paths, &mut dupdb);
    for hash in hashes {
        let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
        assert_eq!(db_has_dupe, true);
    }
}

As before, warning about my static mutable counter aside, the results of cargo test were showing that everything was nominal. And yet, the mysterious behavior persisted. In fact, it was happening on nearly every single save to my picture folders. Why? If the logic is sound for how these pieces are handling things, then why would the simple act of using SQLite cause false positives?

Event deduplication

After spending some time on those tests and not seeing anything from that, I started doing what most people I know do at this point:

let mut paths_and_seconds: Vec<(PathBuf, u64)> = debounced_events.into_iter().filter_map(|event|{
    println!("{:?}", event);
    let maybe_paths: Option<Vec<PathBuf>> = match event.kind {
        EventKind::Remove(_) => Some(event.paths.clone()),
        ...

Putting in printlns and staring at logs trying to understand things. I thought about firing up a debugger, but the inner voice of the computer spoke to me through the whines of my fans: "timing issue" they sang. And thus did I resolve to print, and not to hook a fishing line into the throngs of electrons jittering to the machinations of my mind's desires.

Ahem. Sorry, sometimes when working on a problem it is easier to let autopilot turn on and let the stream of consciousness go where it may. In my case, it's often into poetic nonsense. Anyway, if we've got a timing bug driven by external events from the OS, then stopping the world to inspect values is probably going to impact the timing of everything happening, and so it doesn't feel like a good idea.

Thankfully, this paid off pretty quickly because I saw logs like this come through (formatted for reading clarity):

DebouncedEvent { 
    event: Event { kind: Create(Any), paths: ["path\\to\\file\\filename.jpg"], 
    attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }, 
    time: Instant { t: 1731067.3857255s } 
}
DebouncedEvent { 
    event: Event { kind: Modify(Any), paths: ["path\\to\\folder\\file\\is\\in"], 
    attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }, 
    time: Instant { t: 1731067.4719451s } 
}
DebouncedEvent { 
    event: Event { kind: Modify(Any), paths: ["path\\to\\file\\filename.jpg"],
    attr:tracker: None, attr:flag: None, attr:info: None, attr:source: None }, 
    time: Instant { t: 1731067.5530981s } 
}

Immediately I saw two things:

  1. Downloading a new file created a Create and a Modify event a few moments apart
  2. Between the two was an event from the parent folder

All I do with these paths is pass them along for de-dup checking against the database. Ironically, I never bothered to de-dupe the events themselves. If I get a create event, and then the OS (in all its wisdom) proceeds to do something immediately and generate a modification event, then perhaps that could explain the false positives? 3 And so, I went ahead and tweaked it:

for result in rx {
    match result {
        Ok(debounced_events) => {
            let mut paths: Vec<PathBuf> = debounced_events.into_iter().filter_map(|event| {
                match event.kind {
                    EventKind::Remove(_) => Some(event.paths.clone()),
                    ...
                    EventKind::Other => None,
                }
            }).flatten().collect();
            paths.sort();
            paths.dedup();
            dupdb_update_hashes_for(paths, duplicate_database);
        },
        Err(error) => eprintln!("Watch error: {:?}", error),
    }
}

Fun fact: the combination of .sort() and .dedup() on something like a Vec in Rust reduces every run of consecutive equal values down to a single one, which effectively deduplicates the list, provided it's sorted first.
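
Here's a tiny illustration of that, using nothing but the standard library (this isn't dupdb code, just the behavior in isolation):

let mut paths = vec!["b.png", "a.png", "b.png", "a.png"];
paths.dedup();                             // does nothing yet: no equal values are adjacent
assert_eq!(paths.len(), 4);
paths.sort();                              // ["a.png", "a.png", "b.png", "b.png"]
paths.dedup();                             // ["a.png", "b.png"]
assert_eq!(paths, vec!["a.png", "b.png"]);

So, in went the fix: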

commit 5d476af0ef39abd65efdc6d9ce89f721986cde62
Date:   Sun Jul 20 15:15:08 2025 -0400

    Fix bug resulting in multiple alerts

    Duh, cuz I didn't de-dup the paths so an access + delete or a create +
    access or similar would result in multiple hits in the debounced list of
    events.

This seemed to help for a bit. But then, while saving, deleting, and re-saving an image multiple times to my folders, I was able to trigger the bug over and over again. So, while the change logically made sense, it felt like I wasn't done yet.

Taking the other observation into account, I decided that the program should work just fine if we were to ignore events that are "close enough" to each other. The debounce window is set to a timeout of 1 second, with a "tick" interval at the default of ¼ of that, so I expected that if I'm getting events with more than a second between them, they're probably real; and if not, then I only really need the path of one of the duplicates to pass down to the function that adds things to the database and whatnot.

I had to argue with the rust compiler a bit, but eventually, I got something working using Instant and Duration:

let right_now = Instant::now();
let mut path_and_seconds: Vec<(PathBuf, u64)> = debounced_events.into_iter().filter_map(|event| {
    let timestamp = event.time;
    let maybe_paths: Option<Vec<PathBuf>> = match event.kind {
        EventKind::Remove(_) => Some(event.paths.clone()),
        EventKind::Create(_) => Some(event.paths.clone()),
        EventKind::Modify(_) => Some(event.paths.clone()),
        EventKind::Any => Some(event.paths.clone()),
        EventKind::Access(_) => None,
        EventKind::Other => None,
    };

    if maybe_paths.is_none() {
        return None;
    }

    let paths = maybe_paths.unwrap();

    // No time travel please.
    assert!(timestamp < right_now);
    let rounded_seconds = right_now.saturating_duration_since(timestamp).as_secs();

    Some(paths.into_iter().map(|p| (p, rounded_seconds)).collect::<Vec<(PathBuf, u64)>>())
}).flatten().collect();

And, did it fix the problem?

A backed up channel?

So the problem persisted, and I returned to looking at the log file again and thinking about where we were, where we'd gone, and what really changed. Obviously, we had swapped in SQLite, replacing an in-memory mapping with something that writes out to SQLite.

But was that really it? Was the fact that we're now waiting on file IO from SQLite's journal writing and similar such things truly the cause? It feels kind of difficult to blame everything on that, because while the dupdb was in memory for the lookups and insertions, that didn't change the fact that I was still doing this after processing each event batch:

if db_dirty {
    dupdb_save_to_file(duplicate_database)
}

fn dupdb_save_to_file(duplicate_database: &DuplicateDatabase) {
    let folder = dupdb_database_path();
    let mut index_file = folder.clone();
    index_file.push(NAME_OF_HASH_FILE);

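    // truncate(true) wipes the previous contents; every save rewrites the whole index file from scratch.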
    let mut file = File::options()
        .read(true)
        .write(true)
        .truncate(true)
        .open(index_file)
        .expect("Could not open index file");
    let mut buf = Vec::new();
    duplicate_database.serialize(&mut Serializer::new(&mut buf))
        .expect("Could not serialize empty DuplicateDatabase");
    file.write_all(&buf)
        .expect("Did not write bytes to file");
}

Granted, one write to a file is maybe less to do than running:

INSERT INTO dupdb_filehashes (hash, file_path) VALUES (?1, ?2)
SELECT COUNT(*) FROM dupdb_filehashes WHERE hash = ?1
SELECT hash, file_path FROM dupdb_filehashes
WHERE hash IN (
    SELECT hash FROM dupdb_filehashes WHERE file_path = ?1
)

But really? So, again, I was staring at the logs after adding a println to my dupdb_update_hashes_for method, which is called after de-duplicating the events as described above. And another pattern emerged:

DebouncedEvent { 
    event: Event { 
        kind: Create(Any), paths: ["file.png"], ... 
    }, 
    time: Instant { t: 1738619.8547049s }
}
DebouncedEvent { 
    event: Event { 
        kind: Modify(Any), paths: ["file.png"], ... 
    }, 
    time: Instant { t: 1738619.8549331s } 
}
DebouncedEvent { 
    event: Event { 
        kind: Modify(Any), paths: ["folder/"], ... 4
    }, 
    time: Instant { t: 1738619.9174044s } 
}
dupdb_update_hashes_for 2 paths
Adding 6593874484457310113 "file.png" to db
DebouncedEvent { 
    event: Event { 
        kind: Modify(Any), paths: ["file.png"], ... 
    }, 
    time: Instant { t: 1738620.0140303s } 
}
dupdb_update_hashes_for 1 paths
Adding 6593874484457310113 "file.png" to db
Duplicate detected "file.png" 6593874484457310113
Value: Hash: "6593874484457310113" Path: "file.png"
Value: Hash: "6593874484457310113" Path: "file.png"

The boundaries look a bit... off, don't they? All these logs were generated by me saving 1 file that was brand new to the folder. And if I look at the timestamps here, they're all very, very close. Does this happen if I swap back to my in-memory based setup rather than the one using SQLite?

DebouncedEvent { 
    event: Event { 
        kind: Create(Any), paths: ["file.webp"], ... }, 
    time: Instant { t: 2060991.1648232s } 
}
DebouncedEvent { 
    event: Event { kind: Modify(Any), paths: ["file.webp"], ...}, 
    time: Instant { t: 2060991.1650927s } 
}
dupdb_update_hashes_for 2 paths
Adding 17953420598969731156 "file.webp" to db
Adding 17953420598969731156 "file.webp" to db
DebouncedEvent { 
    event: Event { kind: Modify(Any), paths: ["file.webp"], }, 
    time: Instant { t: 2060991.308183s } 
}
dupdb_update_hashes_for 1 paths
Adding 17953420598969731156 "file.webp" to db

Well. There goes that theory. Even with the in-memory struct, the debounced events come through in two chunks just like the sqlite version. Then, why does one trigger a notification and not the other? As you can see in the second set of logs from the in-memory based struct, no duplicate was detected. The code that's testing for that is:

match fs::read(path) {
    Ok(bytes) => {
        let hash = seahash::hash(&bytes);
        println!("Adding {:?} {:?} to db", hash, path);
        duplicate_database.add(hash, absolute_path.clone());
        if duplicate_database.contains_duplicate_for_hash(hash) {
            println!("Duplicate detected {:?} {:?}", absolute_path, hash);

Where we add the hash and path to the database and then find out if it's made the total number of things with that hash increase. In the in-memory code that's doing this:

struct DuplicateDatabase {
    hash_to_files: HashMap<u64, Vec<String>>,
    files_to_hash: HashMap<String, u64>
}

fn add(&mut self, hash: u64, full_file_path: String) {
    let entry = self.hash_to_files.entry(hash);
    let existing_files = entry.or_default();
    if !existing_files.contains(&full_file_path) {
        existing_files.push(full_file_path.clone());
    }

    self.files_to_hash.entry(full_file_path).insert_entry(hash);
}

fn contains_duplicate_for_hash(&self, hash: u64) -> bool {
    let contains_hash = self.hash_to_files.contains_key(&hash);
    if !contains_hash {
        return false;
    }

    if let Some(files) = self.hash_to_files.get(&hash) {
        return files.iter().count() > 1
    }
    
    false
}

And the SQLite code is executing the two queries:

INSERT INTO dupdb_filehashes (hash, file_path) VALUES (?1, ?2)
SELECT COUNT(*) FROM dupdb_filehashes WHERE hash = ?1

I suppose within the two HashMaps we only ever had one pairing of path to hash, while in the sqlite database there's no key to stop us from having multiple copies. But the first hash map maps a hash to a list of paths, which fulfills the same purpose as the table in SQL, so that doesn't really seem like that big of a difference here. Thinking about our logs and the impact they have, I'll use {} for the hash maps and [] for the table, and then walk through it:

In-memory logs:
{
    hash_to_files: {}
    files_to_hash: {}
}

Batch 1 of events
Adding 17953420598969731156 "file.webp" to db
{
    hash_to_files: { 17953420598969731156: ["file.webp"]}
    files_to_hash: { "file.webp" : 17953420598969731156}
}
Adding 17953420598969731156 "file.webp" to db
{
    hash_to_files: { 17953420598969731156: ["file.webp"]} // No change because .contains check
    files_to_hash: { "file.webp" : 17953420598969731156}
}
Batch 2 of events
Adding 17953420598969731156 "file.webp" to db
{
    hash_to_files: { 17953420598969731156: ["file.webp"]} // No change because .contains check
    files_to_hash: { "file.webp" : 17953420598969731156}
}

SQLite logs
[]

Batch 1 of events
Adding 6593874484457310113 "file.png" to db
[
    "6593874484457310113", file.png
]
Batch 2 of events
Adding 6593874484457310113 "file.png" to db
[
    "6593874484457310113", file.png,
    "6593874484457310113", file.png
]

Hm... Hmmm....

What if I were to swap

SELECT COUNT(*) FROM dupdb_filehashes WHERE hash = ?1

for

SELECT COUNT(distinct file_path) FROM dupdb_filehashes WHERE hash = ?1

And try again? Doing this results in my tests breaking, but putting those aside for a moment, I tried to reproduce the bug while watching my logs and notifications:

DebouncedEvent { 
    event: Event { kind: Create(Any), paths: ["file.png"], ... }, 
    time: Instant { t: 2064822.8182736s }
}
DebouncedEvent { 
    event: Event { kind: Modify(Any), paths: ["file.png"], ... }, 
    time: Instant { t: 2064822.8186554s }
}
DebouncedEvent { 
    event: Event { kind: Modify(Any), paths: ["folder/"], ... }, 
    time: Instant { t: 2064822.8863262s }
}
dupdb_update_hashes_for 2 paths
Adding 14149869348940171455 "file.png" to db
DebouncedEvent { 
    event: Event { kind: Modify(Any), paths: ["file.png"], ... }, 
    time: Instant { t: 2064822.9832525s }
}
dupdb_update_hashes_for 1 paths
Adding 14149869348940171455 "file.png" to db

No false dup detection!

So, that all makes sense. Now we're more in line with the actual behavior of the old way of storing the data, which specifically disallowed the same file path being stored twice (good job past me, smarter than current me). So that seems like what the actual bug was, not some weird timing or batching issue with the events. I'm sorry for doubting you, SQLite! It was a classic PEBCAK problem all along!
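
As an aside, another way to get the same guarantee would have been to make the schema itself refuse duplicate (hash, file_path) rows instead of filtering them out at query time. I didn't go that route, and the real table definition isn't in this post, so the column types below are guesses, but the shape of it would be something like:

// Sketch only: enforce "one row per (hash, path) pair" in the schema so the
// insert can't create duplicate rows in the first place. Table and column
// names match the queries above; the column types are assumptions.
fn initialize_with_unique_pairs(connection: &Connection) {
    connection.execute(
        "CREATE TABLE IF NOT EXISTS dupdb_filehashes (
            hash      INTEGER NOT NULL,
            file_path TEXT NOT NULL,
            UNIQUE (hash, file_path)
        )",
        [],
    ).expect("Could not create table");
}

fn insert_file_hash_ignoring_dupes(connection: &Connection, hash: u64, file_path: &str) {
    // INSERT OR IGNORE quietly skips any row that would violate the UNIQUE
    // constraint, mirroring the .contains guard the in-memory version had.
    // The cast is there because SQLite integers are signed 64-bit.
    connection.execute(
        "INSERT OR IGNORE INTO dupdb_filehashes (hash, file_path) VALUES (?1, ?2)",
        (hash as i64, file_path),
    ).expect("Could not insert file hash");
}

With that constraint in place, the original COUNT(*) query would have worked as-is, since there'd never be two identical rows to over-count.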

Buttoning things up

So it's awesome that my program is now behaving itself and not spamming me with false positives on file saves anymore. But I still need to fix up this:

failures:
    dupdb::test::adding_a_dupe_can_be_detected
    dupdb::test::removing_a_file_removes_detected_dupes
    sql::test::count_of_existing_hash_should_be_n

test result: FAILED. 5 passed; 3 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.12s

The last test was expecting the count to return the total number of rows with that given hash (without distinct file_paths), and so its assertions were targeting that. That's an easy fix:

#[test]
fn count_of_existing_hash_should_be_n() {
    let connection = open_test_database();
    let hash = 123456789;
    let path = "12345689";
    // Insert something other than the one we're testing too
    insert_file_hash(&connection, 9876543211, "987654321");
    insert_file_hash(&connection, hash, path);
    let begin_instrumentality = count_of_same_hash(&connection, hash);
    let hall_of_goff = 1;
    assert_eq!(begin_instrumentality, hall_of_goff);
    insert_file_hash(&connection, hash, path);
    let rejoicing_of_the_masses = count_of_same_hash(&connection, hash);
    assert_eq!(rejoicing_of_the_masses, 1); // 1 was 2 here
}

And that's easy. Looking at the test for the dupe detection:

#[test]
fn adding_a_dupe_can_be_detected () {
    let mut dupdb = get_test_dupdb();
    let hash = 12456;
    let fake_path = "the_file_path";
    let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
    assert_eq!(db_has_dupe, false);

    dupdb.add(hash, fake_path.to_string());

    let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
    assert_eq!(db_has_dupe, false);

    dupdb.add(hash, fake_path.to_string());
    let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
    assert_eq!(db_has_dupe, true);
}

It's easy to see why this is now breaking. Our "dupe" has the same file path as the original, and so the check against the distinct file_path is doing what we needed it to do all along: it doesn't count as a dupe. That's an easy fix, just use a different path for the dup in the test (there's a sketch of that a little further down). The other failing test:

#[test]
fn removing_a_file_removes_detected_dupes () {
    let mut dupdb = get_test_dupdb();
    let hash = 12456;
    let fake_path = "the_file_path";
    let dup_path = "the_dup_path";

    // somebody set us up the bomb
    dupdb.add(hash, fake_path.to_string());
    dupdb.add(hash, fake_path.to_string());
    let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
    assert_eq!(db_has_dupe, true);

    // Main screen turn on
    dupdb.remove(fake_path.to_string());
    let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
    assert_eq!(db_has_dupe, false);
}

Same situation: relying on the internals of the sqlite database not caring that the dupe shares the same path as the original. Fix those two tests up and we're back to a happy:

test result: ok. 8 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.11s
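
For completeness, the fixed-up dupe detection test ends up looking roughly like this (a sketch rather than a verbatim copy; the only change that matters is that the dupe now gets its own path, the way a real duplicate file would):

#[test]
fn adding_a_dupe_can_be_detected () {
    let mut dupdb = get_test_dupdb();
    let hash = 12456;
    let fake_path = "the_file_path";
    let dup_path = "the_dup_path";

    dupdb.add(hash, fake_path.to_string());
    let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
    assert_eq!(db_has_dupe, false);

    // Same hash, different path: now it's a real duplicate.
    dupdb.add(hash, dup_path.to_string());
    let db_has_dupe = dupdb.contains_duplicate_for_hash(hash);
    assert_eq!(db_has_dupe, true);
}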

And so. What have we learned from this experience?

  1. I should write tests that actually mimic real situations, and not leak abstractions
  2. I shouldn't be so quick to assume timing issues
  3. I probably should have written tests for the dupdb BEFORE refactoring to use SQLite
  4. Windows file events are really weird

That last one maybe isn't reflected so much in what I covered here. But while poking around on a few occasions, I ended up reading the ReadDirectoryChangesW page to try to figure out if there could be something there that would explain what I was seeing in the logs. Ultimately, there wasn't really anything (beyond noting that Microsoft's timestamps for events are in microseconds), but I did enjoy poking around and reading the file watcher source code for notify-rust. So I just wanted to give a shout out to callback functions; they seem to be how the world gets around inside of those things.

Anyway. With the bug fixed and the path cleared, next time I return to this project I can move on to the next phase… Adding a little frontend to it! Whether I do that in rust or something else I suppose I'll figure out later, but hey, it can't be that hard to make a simple UI to view and remove dups where I want to, right?