A few weekends ago I started a side project that has been rattling around in my head since fellowship. This is the long version of it.
The first year of practice in Colorado has been a mix of complex spine surgeries and endoscopic spine cases. The endoscopic side just means working through very small incisions and a tiny tube with a fiberoptic camera at the end of it. It is a relatively new approach for the degenerative spine problems I see most, and I have been recording footage from cadaver-lab sessions to study my own technique. After enough hours of that footage I started wondering what a computer-vision model could actually pick out of it. So I trained one over a few weekends on a ten-year-old GTX 1060 sitting in my home office.
The first thing I tried was MediaPipe, which is Google’s off-the-shelf hand-tracking library. It works really well on bare hands in normal lighting, which is what it was built for. It does not work well on surgical gloves. There is a clean published comparison on this (Müller et al., Int J CARS 2022) showing detection drops to under 10% on blue gloves and 0% on green. My own footage tracked the same pattern: on the cadaver-lab clips, MediaPipe found the scope-side hand only about 12.8% of the time. An earlier run had looked closer to 75%, but that turned out to be a left/right handedness mislabeling artifact rather than a real signal, which I caught and fixed before it polluted anything downstream. After that, MediaPipe was out as the primary model, although I kept it around as a fallback for the right (instrument) hand on a few clips.
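To make the handedness artifact concrete, here is a minimal sketch of how a per-hand detection rate can swing when the left/right labels are flipped. Everything in it is illustrative: `hand_detection_rate` is a hypothetical helper, and the toy frames are made up to mirror the scale of the numbers above, not taken from the real footage. MediaPipe’s legacy Hands documentation notes that handedness is determined assuming a mirrored (selfie-style) image, which is one plausible source of a flip, but the sketch does not claim to reproduce the actual cause here.

```python
def hand_detection_rate(frames, hand, swap_labels=False):
    """Fraction of frames in which `hand` ("Left" or "Right") was reported.

    `frames` is a list where each item is the list of handedness labels
    the tracker returned for one frame (an empty list means no hands found).
    """
    flip = {"Left": "Right", "Right": "Left"}
    hits = 0
    for labels in frames:
        if swap_labels:
            # Undo a systematic left/right flip before counting.
            labels = [flip[label] for label in labels]
        if hand in labels:
            hits += 1
    return hits / len(frames) if frames else 0.0


# Toy data (made up): 8 frames, most of which report a single "Left" hand
# that is really the right (instrument) hand with a flipped label.
toy_frames = [["Left"]] * 6 + [["Right"]] + [[]]

raw_rate = hand_detection_rate(toy_frames, "Left")
fixed_rate = hand_detection_rate(toy_frames, "Left", swap_labels=True)
print(f"scope-side hand, labels as reported: {raw_rate:.1%}")
print(f"scope-side hand, labels corrected:   {fixed_rate:.1%}")
```

The point of the toy is only that a label flip can make a hand the tracker rarely finds look like one it finds most of the time.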
The replacement was a YOLO pose model, specifically YOLOv8n-pose. YOLO is a family of small, fast vision models that you can train end-to-end on whatever objects and keypoints you care about, and the “pose” variant predicts skeleton-style keypoints (like fingertip and knuckle positions) on top of bounding boxes. The “n” is for nano, the smallest variant, which is what fits comfortably on an old GPU. I set up the schema with five classes (left hand, right hand, endoscope, cannula, instrument) and three keypoints per class. The catch is that the model needs labeled examples to learn from, and labeled examples for endoscopic spine surgery do not exist anywhere. So I had to make them.
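For a sense of what one labeled example looks like under that schema, here is a rough sketch that turns a pixel-space box and three keypoints into a single line of the YOLO pose label format (class index, normalized box center and size, then an x, y, visibility triple per keypoint). The class identifiers, their order, and the helper name `to_yolo_pose_line` are assumptions for illustration, and the sketch does not say what each of the three keypoints means anatomically.

```python
# Assumed class identifiers and order; the post only names the five classes.
CLASSES = ["left_hand", "right_hand", "endoscope", "cannula", "instrument"]
KPTS_PER_OBJECT = 3  # three keypoints per class, as in the schema above


def to_yolo_pose_line(cls_name, box_xyxy, keypoints, img_w, img_h):
    """Format one object as a YOLO pose label line.

    box_xyxy:  (x1, y1, x2, y2) in pixels
    keypoints: list of (x, y, visibility) in pixels, with visibility
               0 = not labeled, 1 = labeled but occluded, 2 = visible
    """
    if len(keypoints) != KPTS_PER_OBJECT:
        raise ValueError(f"expected {KPTS_PER_OBJECT} keypoints")
    x1, y1, x2, y2 = box_xyxy
    # Box is stored as normalized center x/y plus width/height.
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    fields = [str(CLASSES.index(cls_name))]
    fields += [f"{v:.6f}" for v in (cx, cy, w, h)]
    for kx, ky, vis in keypoints:
        fields += [f"{kx / img_w:.6f}", f"{ky / img_h:.6f}", str(int(vis))]
    return " ".join(fields)


# Made-up example: an endoscope in a 1000x800 frame.
line = to_yolo_pose_line(
    "endoscope",
    (100, 200, 300, 400),
    [(100, 200, 2), (200, 300, 2), (300, 400, 1)],
    img_w=1000,
    img_h=800,
)
print(line)
```

With five classes and three keypoints each, every label line has 14 fields, and the matching dataset config would declare a keypoint shape of 3 keypoints by 3 values.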
That is where most of the actual work lived. I labeled 20 frames by hand in CVAT, trained a small version of the model on those 20, used that small model to pre-label the next batch of frames, and then refined the predictions instead of placing every keypoint from scratch. Each pass through the loop the predictions got better and the corrections got smaller. Six rounds got me from 20 hand-labeled frames to 140 verified frames at roughly three to five times the speed of doing it cold. Total training compute across all six rounds came out to about an hour on the old GPU.
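The loop described above can be sketched in a few lines. This is only an outline under stated assumptions: `train`, `predict`, and `correct` are hypothetical callables standing in for a training run, model inference, and the manual refinement step in the annotation tool, and the batch size of 20 per round is an assumption that happens to be consistent with going from 20 to 140 frames in six rounds.

```python
def bootstrap_labels(frames, seed_labels, train, predict, correct,
                     batch_size=20, rounds=6):
    """Model-assisted labeling: train, pre-label a batch, correct, repeat."""
    verified = dict(seed_labels)               # frame -> verified label
    todo = [f for f in frames if f not in verified]
    for _ in range(rounds):
        if not todo:
            break
        model = train(verified)                # retrain on everything verified so far
        batch, todo = todo[:batch_size], todo[batch_size:]
        for frame in batch:
            draft = predict(model, frame)      # machine pre-label
            verified[frame] = correct(frame, draft)  # human fixes the draft
    return verified


# Stand-in callables (hypothetical) so the control flow can be run end to end.
train_sizes = []

def fake_train(verified):
    train_sizes.append(len(verified))
    return {"trained_on": len(verified)}

def fake_predict(model, frame):
    return f"draft-{frame}"

def fake_correct(frame, draft):
    return draft.replace("draft", "verified")

frames = list(range(200))
seed = {f: "manual" for f in frames[:20]}      # 20 frames labeled by hand
labels = bootstrap_labels(frames, seed, fake_train, fake_predict, fake_correct)
print(len(labels), train_sizes)
```

The design choice that matters is retraining on the full verified set each round, so that every batch of corrections feeds the next round of pre-labels.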
The detection side (the bounding boxes around hands, scope, cannula, and instruments) was annotated separately using CVAT’s track feature, which interpolates boxes between keyframes and is itself roughly a tenfold speedup over drawing boxes frame by frame. Of 151 source frames, I annotated 149 using 15 tracks plus 7 standalone shapes, totaling 681 keyframes.
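The idea behind keyframe tracks is simple enough to sketch: draw a box on a few frames and fill in the frames between them. The snippet below assumes plain linear interpolation of box corners, which is a simplification and not a description of CVAT’s internals; `interpolate_box` and `expand_track` are hypothetical names.

```python
def interpolate_box(kf_a, kf_b, frame):
    """Linearly interpolate a box between two keyframes.

    Each keyframe is (frame_index, (x1, y1, x2, y2)).
    """
    fa, box_a = kf_a
    fb, box_b = kf_b
    if not fa <= frame <= fb:
        raise ValueError("frame lies outside the keyframe pair")
    if fa == fb:
        return tuple(float(v) for v in box_a)
    t = (frame - fa) / (fb - fa)               # 0.0 at kf_a, 1.0 at kf_b
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))


def expand_track(keyframes):
    """Turn a sorted list of keyframes into one box per frame."""
    if len(keyframes) == 1:
        frame, box = keyframes[0]
        return {frame: tuple(float(v) for v in box)}
    boxes = {}
    for kf_a, kf_b in zip(keyframes, keyframes[1:]):
        for frame in range(kf_a[0], kf_b[0] + 1):
            boxes[frame] = interpolate_box(kf_a, kf_b, frame)
    return boxes


# Made-up track: two hand-drawn keyframes yield boxes for 11 frames.
track = [(0, (0, 0, 10, 10)), (10, (10, 20, 30, 40))]
per_frame = expand_track(track)
print(len(per_frame), per_frame[5])
```

The savings come from only having to add a keyframe when the interpolated box drifts off the object.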
For evaluation, I used a clean train/val split on the labeled frames (112 train / 28 held out for validation, fixed random seed) and then ran the trained model against an entirely separate cadaver-lab session it had never seen during training. On that held-out session (92 filtered frames), the model still finds the scope about 81% of the time and the cannula about 67%. Hands transferred at about 50%, well below the scope rate, which is what you would expect from a small model trained on a single session.
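A reproducible split of this kind takes only a few lines. The sketch below is one way to get 112 / 28 out of 140 frames; the seed value, the 20% validation fraction, and the frame naming are assumptions, since the post only says the seed was fixed.

```python
import random


def split_frames(frame_ids, val_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out a fraction for validation."""
    ids = sorted(frame_ids)                    # sort first so input order does not matter
    random.Random(seed).shuffle(ids)           # private RNG, same result on every run
    n_val = round(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]            # (train, val)


# Hypothetical frame names standing in for the 140 verified frames.
frame_ids = [f"frame_{i:04d}" for i in range(140)]
train_ids, val_ids = split_frames(frame_ids)
print(len(train_ids), len(val_ids))
```

Frames sampled from the same video tend to look alike, so a random split within one session is a fairly gentle test; running against a separate session is the more demanding check.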
The biggest surprise across the whole project was how lopsided the time was. Almost all of it went to labeling. The training itself was the easy part, and even on a ten-year-old GPU it didn’t take long. The reason this kind of work is not already everywhere in surgery has more to do with that labeling burden than with the math.
Nothing earth-shattering. I wanted to actually build one of these myself before having strong opinions about the surgical AI demos that get shown at conferences. It was a fun way to spend a few weekends, and I am already thinking about what to try next. The plan is more frames, more sessions, and seeing how cleanly any of this generalizes when the lighting and the technique drift away from what the model was trained on.