Being the human behind an AI

Ayush Mangal
3 min read · Nov 23, 2024


Over the last 8 months or so, in my new (now slightly oldish) job as an AI Engineer at a startup, I have been one of the humans behind some of the AI products we offer (mostly text to speech and other speech stuff), and I have also spent a lot of time over the past 5 years trying to get something meaningful out of these models. And it's fun, mostly.

Andrej Karpathy famously spent a lot of time manually labelling the ImageNet dataset, looking at just a shit ton of images and trying to figure out what they represent, and that effort led to the recent AI boom almost as much as anything else. But how did it feel for him, and for the other researchers at Stanford and other academic labs, to be the ones putting in the manual grind to painfully, slowly label these datasets and get these models to work, when they just fucking won't? Probably fun, mostly.

I am in no way even remotely close to the level of work done by these now almost mythical forefathers of our field, but, hell, I have spent a LOT of time manually labelling datasets: looking at way too many images of travel destinations I had no hope of actually travelling to during my stint on the travel team at MSFT, and an inhuman amount of time listening to different pronunciations of names and different tonalities and accents of humans in my current job. And I think I have become a lot more sensitive to human voices and sounds because of it, mostly.

A few of my colleagues work on lip-sync, where you have to create accurate AI-generated lip motions to accompany AI-generated audio for a human speaker, and I can safely say that I suck at identifying what makes a good lip-sync. My teammates attribute this to my habit of consuming a plethora of anime, that too in sub, so I am more used to looking down at the subtitles than at the faces of the characters, to the point where I didn't even realise that most animes have minimal lip motion. At one point I almost swore to someone that animes have diverse lip movements, only to be shown multiple videos of characters having just simple vertical or horizontal movements lol. My teammates working on lip-sync are, however, a different case, and have an almost superhuman ability to detect even a minor offset between lip movements and audio, while I am just sitting there finding no difference at all, mostly.

I wrote about occupational hazards in a previous post, which was mostly a negative take on how the tech industry is affecting my personality, but there is also something really transformative about being the human behind an AI. It has made me more sensitive to things we often take for granted as humans, and has probably changed me fundamentally in how I reason about things. A simple example of this is text to speech: after working on it for 8 months now, I mentally play out any text or any name I see, and think about how easy or hard it would be for an AI to read it out, and what mistakes it might make. Also, since I have seen how hard it is for an AI to speak properly, even when trained on hundreds of thousands of hours of speech, it is hard not to appreciate the ease with which we are able to just speak and communicate with each other so well, mostly.

Another interesting consequence of becoming more cognisant of all this is the importance of choosing important problems to work on, ones I am actually interested in (and quite frankly, I wasn't interested in text-to-speech at all previously; now I do find it quite interesting), since those are the areas of the human experience I will become more knowledgeable about and probably develop an unfair advantage in, well…. mostly :P
