One of the many reasons I have been quiet recently is I am working on a personal project – an amateur stop-motion video. The final product will appear to be a large group of my friends playing music in an orchestra. The raw material I have is dozens of sets of still photos of my friends pretending to play instruments in front of a green-screen. Each set has about 10 photos of the person playing the instrument.
I am working on an interesting challenge, and I thought that I might benefit if I wrote it out as a blog article to force me to think it through.
For the time being, I am just going to consider wind and brass instruments (e.g. flute, oboe, clarinet, trumpet, trombone, etc.)
Plan A
The original, original plan was to take a photo of every different legal fingering position for the instrument, and map that across to the particular notes being played in the music. Whenever a B♭ was played, the photo corresponding to the B♭ fingering would be substituted in.
This original, original plan is almost feasible for an instrument such as a trumpet. If you assume that the lip-position is not discernible, then there are only 8 fingerings – several notes correspond to the same fingering. It would take some research, but is doable.
However, I looked up the fingering charts for an oboe and found there were (from memory) 44 different positions. That would take too much studio time, too much categorisation time and be too boring for my volunteers.
Plan B
So, the original, original plan was replaced with Plan B before the photo shoot. Plan B involved taking approximately 10 photos of each instrument being played. Where convenient, these would be associated with their correct notes. Where a note didn’t have a corresponding photo available, the software would assign a random fingering to that position, and then maintain that association throughout the piece. A musician who played that instrument might notice the discrepancy, but hopefully the casual viewer wouldn’t notice the “anachronism”. Perhaps some smarts might be added to prevent two sequentially-played but differently-pitched notes from having the same fingering, so that some movement occurred between notes, but that would probably not be required.
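Here is a minimal sketch of how the Plan B assignment might look in Python. The function name, the photo file names and the seeded random generator are all my own assumptions for illustration; the only thing taken from the plan itself is that a note without a real photo gets a random one, and then keeps it for the whole piece.

```python
import random

def assign_photos(notes, known, photo_pool, seed=0):
    """Map each note in the piece to a photo, keeping the mapping stable.

    notes      -- the piece as a sequence of note names
    known      -- dict of note -> photo for fingerings that were really shot
    photo_pool -- every photo available for this player
    seed       -- hypothetical knob so a render can be reproduced
    """
    rng = random.Random(seed)
    mapping = dict(known)  # start from the correct associations
    for note in notes:
        if note not in mapping:
            # No real photo for this note: pick one at random, then keep it.
            mapping[note] = rng.choice(photo_pool)
    frames = [mapping[note] for note in notes]
    return frames, mapping

frames, mapping = assign_photos(
    ["Bb", "C", "Bb", "D", "C"],
    known={"Bb": "trumpet_01.jpg"},
    photo_pool=["trumpet_01.jpg", "trumpet_02.jpg", "trumpet_03.jpg"],
)
```

This version does not include the optional "smarts": two different notes played in a row can still land on the same photo.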
The Shoot
Associated with the shoot was my birthday party.
You can view the causality either way: in order to get volunteers to come to the photo shoot, I invited them to a party in a pub near the studio, or in order to get my friends to come to my party, I invited them to a photo-shoot in a studio near the pub.
I relied on my friends to take up roles such as Assistant Director, Photographer, Wranglers and Tour Guides. They were terrific, and I am ever so thankful for the work that they did. They made the day run very smoothly, and I was – several times – banished from the studio and sent to the pub because they had everything under control.
But a side effect of having many short shifts of volunteers, and volunteers training other volunteers (while the director lazed in the pub), is that there was a certain element of Chinese Whispers.
In particular, the “franticness” of the playing slowly increased over the day. At the beginning, I have photos of flautists keeping their heads and bodies still, while just moving their fingers. By the end of the shoot, I have saxophonists wildly boogeying left and right as they play.
Now, this isn’t quite in keeping with the sedate orchestral image I was originally planning, but I can adapt. More importantly, it offers great challenges for the stop-motion – I can’t have the player swinging their hips 90 degrees with each note played, if they are playing, say, 200 notes per minute – they will look like they are holding an out-of-control jack-hammer to their lips.
I want to be quite clear that I am not blaming the volunteers involved. I could have made my needs clearer, I could have kept a closer eye on the proceedings and, really, this problem is a pretty minor one compared with what could have gone wrong. I am overall still pleased and thankful.
(Some of the production people claim that they tried to get some serious playing photos before some more fun ones. That is certainly true for some instruments. For other ones, I really only have a couple of “sedate” playing images, making the animation rather dull if I don’t incorporate some of the later images.)
Plan C
So today I came up with another idea.
Suppose we ignore the fingering – no-one will notice the fingering positions of a dancing saxophonist. Instead, just focus on the instrument/body position of the player.
For each set of photos of a single player, we can compare each pair of photos and rate them as say “tiny movement”, “moderate movement”, “large movement”.
You know what – this will be easier to explain if I can degenerate to Comp Sci geek…
Let each of the 10-20 photos be a node in a fully-connected bi-directional graph, and each edge be weighted by the size of the delta between the positions.
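As a rough sketch, the graph could be held as a dictionary of dictionaries. The numeric weights (1, 2, 3) and the `rate_pair` callback are assumptions of mine: the callback stands in for whatever ends up supplying the rating, whether that is a person or a program.

```python
from itertools import combinations

# Hypothetical numeric weights for the three manual ratings.
RATINGS = {"tiny": 1, "moderate": 2, "large": 3}

def build_graph(photos, rate_pair):
    """Return {photo: {other_photo: weight}} covering every pair of photos.

    rate_pair(a, b) must return "tiny", "moderate" or "large".
    """
    graph = {photo: {} for photo in photos}
    for a, b in combinations(photos, 2):
        weight = RATINGS[rate_pair(a, b)]
        # The graph is bi-directional, so store the edge both ways.
        graph[a][b] = weight
        graph[b][a] = weight
    return graph
```

With 10 to 20 photos per player that is between 45 and 190 pairs to rate for each player, which is where the categorisation effort comes from.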
Idea #1: Discard all the edges which are rated higher than “moderate”. Then, each time a different note is played, one of the connections is randomly chosen to allow the player to continually make small, but random movements as they move through the space. Hopefully the random big jumps will be turned into more gradual hip swinging.
Idea #2: Reduce the probability of leaving a node along the same edge it entered through, so the player doesn’t “vibrate” as much – jumping back and forth between the same two nodes. Note, some nodes will only have one other node similar to them, making vibration inevitable.
Idea #3: Increase the probability of leaving a node along the same edge it entered through if the previous note is repeated. e.g. let the player vibrate if they are playing A-B-A-B-A.
Idea #4: Tag each edge with the next note played during each transition, and then re-use that transition if you find yourself doing the same note transition, so the player is likely to (but not definitely) repeat their movements when they play the same refrain a second time.
Idea #5: Don’t discard the large weightings. Instead just make them stochastically less common. This will reduce the chance that a player will become randomly stuck in some small set of nodes, vibrating until the piece is over. It will increase the chance that they do get stuck in such a situation for short periods.
Idea #6: Occasional frantic movements may be justified and natural. Could there be some trigger that makes larger movements more likely for some transitions? Big jumps in tone? Jumps in tempo, key or volume? Semibreves?
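To make Ideas #2, #3 and #5 a little more concrete, here is a sketch of a weighted random walk over the photo graph. Every number in it (the bias per rating, `backtrack_factor`, `repeat_factor`) is a made-up starting value, not something I have tuned, and the graph is assumed to be a dict of the form `{photo: {neighbour: weight}}` with weights 1 (tiny), 2 (moderate) and 3 (large). Setting `large_penalty` to 0 gives Idea #1 instead of Idea #5. Ideas #4 and #6 are not covered.

```python
import random

def edge_bias(weight, large_penalty=0.1):
    """Idea #5: keep the big jumps, but make them rare."""
    return {1: 1.0, 2: 0.5}.get(weight, large_penalty)

def next_photo(graph, current, previous, note_repeats, rng,
               backtrack_factor=0.25, repeat_factor=4.0):
    """Choose the photo to move to from `current`.

    previous     -- the photo we arrived from (None at the start)
    note_repeats -- True when the new note equals the one before last (A-B-A)
    """
    candidates = list(graph[current])
    weights = []
    for photo in candidates:
        w = edge_bias(graph[current][photo])
        if photo == previous:
            # Idea #2 damps vibration; Idea #3 encourages it on A-B-A.
            w *= repeat_factor if note_repeats else backtrack_factor
        weights.append(w)
    return rng.choices(candidates, weights=weights, k=1)[0]

def walk(graph, notes, start, seed=0):
    """Return one photo per note, moving only when the note changes."""
    rng = random.Random(seed)
    frames = [start]
    previous = None
    for i in range(1, len(notes)):
        if notes[i] == notes[i - 1]:
            frames.append(frames[-1])  # same note again: hold the pose
            continue
        repeats = i >= 2 and notes[i] == notes[i - 2]
        chosen = next_photo(graph, frames[-1], previous, repeats, rng)
        previous = frames[-1]
        frames.append(chosen)
    return frames
```

This assumes every photo has at least one neighbour, which holds for a fully-connected graph of two or more photos.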
Categorisation
With a large number of instruments and a large number of photos of each, coming up with a rating for each pair is likely to be laborious.
Richard A. suggested the other day that I farm out some of the categorisation work. I guess I could make a web-site that displays two merged photos and asks for a personal rating of the differences, and then ask some volunteers to walk through them. Would writing the web-site (Oh gawd! PHP?) be worth it? How much consistency will I get through different people rating them?
Could I do it automatically, by taking a delta between the two images (at the pixel-level) and then somehow summing the result to get a metric for how much the picture had changed? Would that metric be reasonably correlated to a manual rating?
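The simplest version of that metric I can think of is the mean absolute difference per pixel. The sketch below works on greyscale images given as plain lists of rows, so it stays self-contained; loading real photos into that shape is left out. The two thresholds are pure guesses, and whether the score correlates with a manual rating is exactly the open question.

```python
def mean_abs_difference(img_a, img_b):
    """Average per-pixel absolute difference between two greyscale images.

    Each image is a list of rows of 0-255 integers, both the same size.
    """
    if len(img_a) != len(img_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(img_a, img_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count

def rate_movement(score, tiny_below=5.0, moderate_below=25.0):
    """Turn a score into a rating; both thresholds are hypothetical."""
    if score < tiny_below:
        return "tiny"
    if score < moderate_below:
        return "moderate"
    return "large"
```

One likely weakness: with a green-screen background, a small sideways shift of the player changes a lot of pixels along their outline, so the score may reflect silhouette edges more than how far the body actually moved.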
Conclusion
Writing this down has clarified it in my mind, and triggered a few more ideas.
It has also revealed it to be a lot of work – and possibly the result will be unwatchable. Hmmmm….
On the other hand, the solution (at least a manually-rated one) is likely to be useful for string instruments too (violin, cello, double-bass, etc.) – ensuring the bow positions transition rather than jump.
Suggestions very welcome.
Comment by Mr Rohan on December 24, 2010
Make sure you remember that the bowing instruments bow in both directions, but occasionally bow in just one or reset the bowing – i.e. take the bow off, reset and then start bowing again.
Comment by Julian on December 24, 2010
Mr Rohan,
Yeah, I didn’t touch too much on the string instruments in the post, but I did start to give it some thought.
You can imagine each bowing position being a node with two neighbours – the neighbour with the bow further to the left and the neighbour with the bow further to the right. Then you either continue in the same direction to extend the bow further or reverse the direction to “vibrate”. Lifting the bow to reset corresponds to certain rest notes, and can be handled separately. Perhaps some sense of “velocity” could be added to have the bow tend to extend to the extremes.
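A minimal sketch of that chain, assuming the bow positions are simply numbered from 0 upwards and that a reset puts the bow back at position 0. The `keep_going` probability is my stand-in for "velocity": the higher it is, the more the bow tends to run out to the extremes before turning round.

```python
import random

def bow_walk(num_positions, events, keep_going=0.8, seed=0):
    """Walk the bow along a line of positions 0 .. num_positions - 1.

    events     -- sequence of "note" or "reset" (a rest where the bow lifts)
    keep_going -- chance of continuing in the same direction (hypothetical)
    """
    if num_positions < 2:
        raise ValueError("need at least two bow positions")
    rng = random.Random(seed)
    position, direction = 0, +1
    frames = []
    for event in events:
        if event == "reset":
            position, direction = 0, +1
        else:
            can_continue = 0 <= position + direction < num_positions
            can_reverse = 0 <= position - direction < num_positions
            # Reverse when forced to at an extreme, or occasionally by choice.
            if not can_continue or (can_reverse and rng.random() > keep_going):
                direction = -direction
            position += direction
        frames.append(position)
    return frames
```

With `keep_going=1.0` the bow sweeps end to end and only reverses at the extremes.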
At least, that could work if people were playing regularly and not frantically. With body movement and fingering, it becomes trickier.
Watching some real violinists play in orchestras before the shoot, I observed that the fingering wasn’t the most visible aspect – arguably even bow position was secondary to the head movement counting out the measure! Tying head movement and bow movement together was already going to be hard, without the violinists jumping around like they had been possessed by the devil once in Georgia.
Actually, I have to admit the string players didn’t jump around nearly as much during the photo shoot as some others.
Unlike wind and brass, there is still some movement associated with long notes on strings. Presumably they should tend to be played in one consistent direction. Looking ahead to avoid reaching the end of the bow’s extent during the middle of a long note may be tricky.
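One way the long-note case might be handled, as a sketch: when a long note starts, pick whichever direction has more room, then spread the available bow travel evenly over the frames the note lasts. It assumes bow positions numbered from 0 and a frame count already worked out from the tempo. It does not look any further ahead than the note itself, so it says nothing about steering the earlier short notes towards an extreme.

```python
def long_note_frames(position, num_positions, num_frames):
    """Bow positions for a long note lasting `num_frames` frames.

    The bow moves in one consistent direction and never passes either
    extreme; if there are more frames than room, positions are held.
    """
    room_up = num_positions - 1 - position
    room_down = position
    direction = +1 if room_up >= room_down else -1
    # Never travel further than the room available or the frames allow.
    travel = min(max(room_up, room_down), num_frames)
    return [
        position + direction * (i * travel // num_frames)
        for i in range(1, num_frames + 1)
    ]

print(long_note_frames(1, 5, 6))  # prints [1, 2, 2, 3, 3, 4]
```

With only 3 bow positions and a note lasting many frames, most frames would be spent holding still, which is the same problem in another form.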
Comment by John Y. on December 28, 2010
Sorry I don’t have much constructive to say here; more like a differently worded “me too”: I have found that people who don’t understand bowed string instruments but are tasked with pretending to play them (such as in a high school play), sometimes treat them as though they were wind instruments (or perhaps plucked string instruments), in that they make some short, fixed movement to initiate a note change, then more or less stay still while the note is sounding. I don’t personally play a string instrument, but this looks ludicrously unnatural to me, particularly if there are long, slow notes. It’s not as bad if there are lots of short notes.
From a pure physics point of view, the salient feature of a bowed string instrument is that there must be (bow) movement during the sounding of a note. I don’t know whether this means that bowed instruments have two main problems (i.e. the note-position problem shared with all the other instruments, as well as the note-duration problem), or just one, but a different one (i.e. the note-duration problem, and basically don’t worry too much about note position).
I know the latter would be a gross oversimplification, but of course this whole exercise is in finding what oversimplifications you can get away with.
Comment by Julian on December 28, 2010
You are touching on a sore point, John.
Yesterday, I watched some videos of violinists. The problem is even harder than I expected.
The factors I was ready for:
The factors I didn’t consider:
So, I sat down and came up with an algorithm. I would categorise the photographs of each player into several sequences of 4 or more bow positions. Each sequence represented one fingering/posture. A simple long note would proceed along the sequence, at a determined rate. Short notes would proceed – not as far, but much faster, to get velocity changes. I had some simple rules that would deterministically identify direction reversals, resets, changes of speed and changes of sequence. By applying the same rules to each player, they would appear somewhat synchronised (even though bow positions wouldn’t exactly match).
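To show the synchronisation part, here is a sketch with stand-in rules. These are not the rules I actually wrote down; the ones below (reverse direction on every note, long notes cross the whole bow, short notes travel a fixed fraction) are invented just to show the shape. The shared track is a bow fraction between 0.0 and 1.0 worked out from the score alone, and each player then maps that fraction onto however many bow positions their own sequence has.

```python
def bow_fractions(durations, long_beats=1.0, short_travel=0.4):
    """Bow fraction (0.0 to 1.0) at the end of each note.

    durations    -- note lengths in beats
    long_beats   -- notes at least this long count as long (hypothetical)
    short_travel -- fraction of the bow a short note covers (hypothetical)
    """
    fraction, direction = 0.0, +1
    track = []
    for beats in durations:
        travel = 1.0 if beats >= long_beats else short_travel
        fraction = min(1.0, max(0.0, fraction + direction * travel))
        track.append(fraction)
        direction = -direction  # stand-in rule: reverse on every note
    return track

def to_frames(track, sequence):
    """Map the shared track onto one player's own bow-position photos."""
    last = len(sequence) - 1
    return [sequence[int(fraction * last + 0.5)] for fraction in track]

track = bow_fractions([2, 0.5, 0.5, 2])
cellist_a = to_frames(track, ["a0", "a1", "a2", "a3"])
cellist_b = to_frames(track, ["b0", "b1", "b2"])
```

Both players reach their extremes on the same notes even though one has four positions and the other three. Resets and changes of posture are left out.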
Then I went to categorise the photos so I could build a prototype and see how it looked. I started with the four cellists.
One had 4 postures (great!), with 3 bow positions in each posture. Only 3 bow positions? I would need to tweak my algorithm.
The next only had one posture, with 3 bow positions. The other positions were too frantic and random to actually string a sequence together. Oh well, that person is going to be uninteresting to look at. He won’t be assigned any solos!
The next only had one posture with only 2 bow positions – but lots of photos of her bouncing all over the place. I fear I shall have to drop her entirely. If I animate with only two positions, she will look like a sewing machine, not an elite player synchronised with the others.
By that stage, I was too depressed to even look at the last cellist.
So much for having them all playing in time…
(Aside: It is the nature of these sorts of projects that each problem you encounter seems like a show-stopper at the time, until you come up with a work-around, or accept a change to your original vision and move on. Chances are, by the end of the project, you prefer the look of most of the work-arounds to the original idea – except one niggling one which always sticks out to you when no-one else even notices. My point is: I am not truly despondent here, just a little frustrated, which shows I am proceeding according to plan.)
Comment by Sunny Kalsi on December 28, 2010
Add me to your patent of automatically generating Rock-Band songs out of your algorithm.
Comment by Alastair on December 30, 2010
Interesting problem(s), I don’t really have a lot to add except:
The problem of determining the difference between two images is a pretty well-understood one. You could possibly harness some open source video compression software (eg x264) to help. For a given pair of possible transitions you want to apply a block-based motion compensation scheme, and look at the aggregate (median probably) motion vectors. Larger motion vectors == more movement, obviously.
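For illustration only, here is a toy version of block-based motion estimation in plain Python. It does not use x264 or any of its internals; it is an exhaustive block search with a sum-of-absolute-differences cost, which is the general idea behind motion compensation. Images are greyscale lists of rows, and the block size and search radius are arbitrary small values chosen to keep it readable.

```python
import math
from statistics import median

def block_sad(img_a, img_b, y, x, dy, dx, size):
    """Sum of absolute differences between a block and its shifted match."""
    return sum(
        abs(img_a[y + j][x + i] - img_b[y + dy + j][x + dx + i])
        for j in range(size)
        for i in range(size)
    )

def median_motion(img_a, img_b, size=4, radius=2):
    """Median length of the best-match motion vector over all blocks."""
    height, width = len(img_a), len(img_a[0])
    magnitudes = []
    for y in range(0, height - size + 1, size):
        for x in range(0, width - size + 1, size):
            best = None
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    inside = (0 <= y + dy <= height - size
                              and 0 <= x + dx <= width - size)
                    if not inside:
                        continue  # shifted block would leave the image
                    cost = block_sad(img_a, img_b, y, x, dy, dx, size)
                    # On equal cost prefer the shorter vector, so flat
                    # areas such as the backdrop report no motion.
                    key = (cost, dy * dy + dx * dx)
                    if best is None or key < best[0]:
                        best = (key, math.hypot(dy, dx))
            magnitudes.append(best[1])
    return median(magnitudes)
```

The images need to be at least one block in each dimension. On real photos the search radius would have to be far larger than 2 pixels, and a pure-Python loop like this would be painfully slow, which is a good argument for borrowing an existing encoder.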
Also, crowdsourcing could be used to solve more than just the image-difference problem. Maybe use it so that the crowd can seed the correct sets of images for the different passages in the piece (eg more movement for louder passages), and let the algorithm take over from there.
As you say though, the web doesn’t exactly present an ideal environment for this sort of thing. Still, maybe people would be willing to download a set of images and the audio, and assemble the video themselves using a readily-available video editing app (iMovie, Windows Movie Maker, others?). If you could just get the timeline from the resulting edits, you might be able to combine/assemble them?