Snapgene Version Control

In the early days at Mozza we quickly ran into a problem: a lot of Snapgene files were floating around between emails, flash drives and chat logs. These files were supposed to be the official record of real plasmids being constructed in the lab. Each person was making changes to their own copy of the files, which eventually devolved into chaos. Since our company had already settled on the Google suite, our initial solution was to host the files on a Google Drive. Reading and writing was a little slow but it solved the problem in the moment. The Google Drive was mounted on everyone's Windows laptops and acted just like a hard drive, but slower.

This worked for about 6 months, while there were around 4 people working with Snapgene files regularly. As we added more employees we started running into problems where changes weren't successfully propagating to everyone else's Google Drives. We eventually came to learn that (1) Snapgene was maintaining a lock on open files and (2) Snapgene was frequently writing small amounts of data to files while they were open, even when the user was not trying to save any changes. If you had a file open, small changes were being written every 30 seconds or so. This led to situations where one computer would write changes to a file, and those changes would propagate through Google and make their way to another laptop which had the same file open. Since Snapgene had a lock on that file, Google would fail to push the changes. Then that laptop would generate its own changes, which would try to propagate in the opposite direction. This quickly led to errors and corrupted files.

In retrospect we must have been lucky early on and the probability of two people opening the same file at the same time was just too low with so few people. Once we hit a critical mass of employees this problem quickly forced us to find an alternative.

For a brief moment we tried hosting the files on a NAS but ran into a lot of the same problems. Performance was much better, but the lock and the propagation of Snapgene's micro-changes still plagued us.

Version Control to the Rescue

The next logical step was to use version control. I have been a long-time user of Mercurial, and CVS before that, mostly for my own small software doodles, so the concepts weren't foreign to me. Our founder was very familiar with Git, and so it came to be. We hosted Git on a local Synology NAS instead of GitHub or GitLab, mostly because of paranoia. Snapgene was performant since the files were stored locally on each laptop, and there was no immediate propagation of the micro-changes to cause conflicts. A lot of the conveniences of modern version control systems don't work here since all the files are binary. But the core functionality was there for tracking changes and distributing them to everyone's laptops without corruption.

The largest problem we ran into was employee education. Mozza was and still is 90% scientists, lab techs and other people with primarily biological training. Aside from me and the founder, not a single employee had ever heard of a version control system before. Computer systems in general are not these people's strong suit. There have been multiple training sessions with whiteboards full of arrows depicting branches, pull requests, merges and more. We have become extensive users of Git's many features, to the extent that one can with binary files. And to this day we still have people committing directly to the master branch. 🤦

Other Technical Problems

Aside from the matter of educating biologists to use version control, Git is still not the perfect system for sharing Snapgene files.

(1) The micro-changes made by Snapgene to open files still cause problems because we often have many files open at once (my Snapgene and Chrome compete for the most tabs open) and Git sees all of those micro-changes. If one is not careful while making a commit, tens of "changed" files will get committed along with the one change they were trying to make.

[picture]
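
One way to catch these stray files before a commit is to compare what Git reports as changed against the short list of files you actually meant to touch. The sketch below is only an illustration, not something we run in production: it assumes the output of `git status --porcelain` is passed in as text, that Snapgene files carry the `.dna` extension, and that paths are simple enough not to need Git's full quoting rules. The function names and the example plasmid paths are made up.

```python
from __future__ import annotations


def changed_paths(porcelain: str) -> list[str]:
    """Extract paths from `git status --porcelain` (v1) output.

    Each line is two status characters, a space, then the path.
    Simplification: surrounding quotes are stripped, escapes are not decoded.
    """
    paths = []
    for line in porcelain.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        if " -> " in path:
            # Renames are shown as "old -> new"; keep the new name.
            path = path.split(" -> ", 1)[1]
        paths.append(path.strip('"'))
    return paths


def unintended_dna_changes(porcelain: str, intended: set[str]) -> list[str]:
    """Return changed .dna files that are NOT in the set the user meant to edit."""
    return [
        p
        for p in changed_paths(porcelain)
        if p.lower().endswith(".dna") and p not in intended
    ]


if __name__ == "__main__":
    # Hypothetical status output: one real edit plus one stray micro-change.
    sample = " M plasmids/pMZ001.dna\n M plasmids/pMZ002.dna\n?? notes.txt\n"
    for path in unintended_dna_changes(sample, {"plasmids/pMZ001.dna"}):
        print("probably a micro-change, review before committing:", path)
```

Anything this reports is a candidate for `git restore` (or simply leaving unstaged) rather than going into the commit. It cannot tell a real edit from a micro-change; it only flags files nobody claimed to have edited.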

We do have one person deemed the "Snapgene tzar" who reviews all changes and has official permission to merge into the master branch. But his job is unnecessarily tedious due to this "feature" of Snapgene. When inadvertent changes make their way into master, it becomes difficult to distinguish real changes from these insignificant changes in the Git history. The only way to know for sure is to do sequence alignments between versions and manually scan the annotations for any changes. Tedious.

(2) The other technical problem I find myself stumbling over frequently is that Snapgene maintains locks on open files. When switching between branches, Git is unable to swap a locked file out for the new branch's version, so the switch produces errors, or Git shows lots of files as "modified" afterwards. This leads to a lot of confusion and, if you're not careful about issue (1), there is a tendency for older versions of files to get re-committed as new changes. Tracking these bugs becomes very cumbersome, and sometimes the best option is to forget about "fixing" it and just have the user re-make the Snapgene files from scratch.

The solution is to close all Snapgene files before doing any branch operations, and to carefully review all changes before committing. But again - biologists don't think about these things, and the least careful ones can make a tangled mess real quick.
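
Part of that discipline could be wrapped in a small helper that refuses to switch branches while tracked Snapgene files still show as modified. This is a hedged sketch rather than a tool we actually use: `switch_branch` and `dirty_dna_files` are hypothetical names, it assumes `git` is on the PATH and that Snapgene files end in `.dna`, and it cannot detect whether Snapgene still has a file open, so closing Snapgene first remains a manual step.

```python
from __future__ import annotations

import subprocess


def dirty_dna_files(porcelain: str) -> list[str]:
    """Return tracked .dna files with uncommitted changes.

    Input is the text printed by `git status --porcelain` (v1 format:
    two status characters, a space, then the path).
    """
    dirty = []
    for line in porcelain.splitlines():
        if len(line) < 4 or line.startswith("??"):
            continue  # skip blank lines and untracked files
        path = line[3:].strip('"')
        if path.lower().endswith(".dna"):
            dirty.append(path)
    return dirty


def switch_branch(branch: str, repo: str = ".") -> list[str]:
    """Check out `branch` only if no tracked .dna file is modified.

    Returns the list of offending files; an empty list means the
    checkout was attempted.
    """
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    dirty = dirty_dna_files(status)
    if dirty:
        return dirty  # caller should review or discard these first
    subprocess.run(["git", "checkout", branch], cwd=repo, check=True)
    return []
```

If the returned list is non-empty, the user reviews or discards those files (and closes Snapgene) before trying again, which is the same habit described above, just enforced by a script instead of memory.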

Overall it's the least bad system we have come up with. Small changes to the Snapgene and Git software could make a very slick system. But we are a biology company and shouldn't spend time on such endeavors. The tool is 90% good and fixing the last 10% would be a distraction from the real goals.
