In her commentary on speech recognition software (“Speech Recognition Tech Is Yet Another Example of Bias,” Scientific American, Oct 2020), author Claudia Lopez-Lloreda criticizes the technology’s limits, citing her need to alter her speech to an unaccented version of her own voice in order to be recognized reliably. In her words, “[changing] such an integral part of an identity to be able to be recognized is inherently cruel.”
The technology community is a scientific one, subject to the same limitations as other sciences: those of resources, knowledge, and capability. It is also subject to the limitations of economics: funding, and supply and demand. To imply that technology companies ignore smaller demographic groups or populations for reasons other than these limitations is short-sighted; it overlooks the complexity of the problem and how available resources are allocated to solve it.
Speech recognition has become mainstream, and in the past ten years we have seen solutions delivered to market that would likely have been considered science fiction 20 years ago. It is still far from a complete solution, as the continued rapid advances in the field show. Apple, Google, Amazon, and others have brought speech recognition to dozens of languages in just a few years, delivering an imperfect solution to a complex problem that consumers today expect to work without fail, much like we expect our cars to start when we turn the key.
The difference, of course, is that cars are all the same; people are all different. Even though many of us speak similarly, many of us do not, because we use dialects of the same root language, and that is where the economics of science come into play. When you implement language support, you start with the baseline language, to capture the widest population of potential users. As the technology improves, so does the ability to support smaller and smaller populations of users who speak dialects. These are not limitations built into a solution; rather, they are limitations imposed by technological capability and the resources available to apply it.
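To make the baseline-first idea concrete, here is a minimal sketch in Python of how a recognizer might fall back from a dialect-specific model to a broader one. Everything in it is hypothetical: the locale tags, the model catalog, and the fallback order are illustrative assumptions, not any vendor’s actual API.

```python
# Hypothetical model catalog: only baseline languages are trained so far.
AVAILABLE_MODELS = {
    "en-US": "baseline American English model",
    "es-ES": "baseline Castilian Spanish model",
}

def pick_model(locale: str) -> str:
    """Walk from the most specific locale tag toward its base language.

    e.g. "en-US-nyc" -> "en-US" -> "en". The first tag with a trained
    model wins; a dialect is served directly only once a model exists
    for it, and everyone else gets the widest-coverage baseline.
    """
    parts = locale.split("-")
    while parts:
        tag = "-".join(parts)
        if tag in AVAILABLE_MODELS:
            return AVAILABLE_MODELS[tag]
        parts.pop()  # drop the most specific subtag and retry
    raise LookupError(f"no model covers {locale!r}")

print(pick_model("en-US-nyc"))  # falls back to the baseline en-US model
```

Under this (assumed) scheme, adding a dialect is purely additive: training and registering an "en-US-nyc" model would start serving that population without disturbing anyone on the baseline.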
When companies decide to do this, it is not to be exclusive; rather, it is to be inclusive of as many people as possible. Turn on the television in the US and watch the news, and you will largely see newscasters speaking a standard form of American English. This is by design: they are speaking in a way that people with nearly any dialect or understanding of spoken English can follow, thereby including the greatest number of people by resorting to a common baseline. Technology companies do the same. This decision does not take away anyone’s cultural identity, nor should it be seen as “inherently cruel”.
Over the next ten years, Ms. Lopez-Lloreda will undoubtedly see incredible advances in speech recognition. As technological capabilities grow, and as company resources are freed up by the completion of work for larger populations of users, she will see improvements in language support, including dialects, and eventually speech recognition that can adapt to an individual’s speech (picking up the l in salmon, say, for the one person I have ever met who pronounces it that way). The inherent cruelty will go away, not because anyone felt it was cruel, but simply because technology will have caught up.
In the meantime, I will continue, at times, to mask my New York dialect in conversation, not because I am trying to hide my cultural identity. Rather, perhaps I am trying to be inclusive of the listener, who may be unfamiliar with such a dialect; or perhaps I don’t want to come across as a paisano. After all, if you heard me ordering a cup of cawfee (milk, no sugah), you would probably form a quick opinion of who I am. That opinion may be right or wrong, you are entitled to it, and I don’t take it as an insult. You are merely taking in speech and making a decision based on the limited data and information you have to process it, which is, ironically, the same thing Siri is doing.