DEBBIE GINGRICH SAW THINGS turn for the worse in 2016, when Cincinnati schools experienced an unexplained surge in youth suicides. Suicide is the second most common cause of death for teens and young adults in the country, but its prevalence in Cincinnati had mostly held steady for the previous 15 years. Now the area was seeing an alarming spike. The trend continued into January 2017, when an eight-year-old boy killed himself after reportedly being bullied at school, and another six students took their lives soon after. Parents and school officials were desperate with worry, and the local medical community hunted for a way to identify the children most at risk. “In the mental health world, we don’t have the equivalent of an X-ray to detect a broken bone,” says Gingrich, director of behavioral health at The Children’s Home, which provides mental health support to troubled minors. “Everyone wants to know, ‘What can we do to save a life?’”

One answer, the Cincinnati schools decided, was to try an experimental artificial intelligence technology. It promised to detect telltale signs of suicidal intent hidden in human speech. Developed by John Pestian, a professor in the divisions of biomedical informatics and psychiatry at Cincinnati Children’s Hospital Medical Center, the machine-learning algorithm sifts through recordings of a patient’s voice to analyze a combination of signals, some of which no human could detect: minute changes in inflection or millisecond-scale delays between words and syllables.

Pestian’s algorithm had been trained by scanning suicide notes and recordings of patients who had recently survived a suicide attempt. In one study from 2016, his team tested the algorithm on recordings of 379 patients. Some of them had attempted suicide in the previous 24 hours; some had mental illness, according to psychiatric assessments, but had not attempted suicide; and a third cohort fit neither category. By screening the content of the tapes alone, Pestian’s algorithm was able to assign patients to the correct category 85% of the time.

Pestian’s algorithm made its appearance in a few Cincinnati schools this spring. In the first phase, counselors made mental health assessments of the students using the usual tools but also recorded them on a custom mobile phone app. Researchers looked to see whether the voice analysis matched up with the psychiatric surveys and opinions of the professionals. It performed well enough that the technology rolled out at about 20 schools this fall, recording interviews with thousands of students. The researchers hope it will help direct the right students to psychiatrists for further evaluation and head off tragedies for at least a few.

Speech analysis is a promising frontier in the emerging field of computational psychiatry, which applies the tools of artificial intelligence to mental health. Using high-powered machines to sort through piles of data, researchers try to spot patterns in cognition, behavior or brain function that can help them understand and detect mental illness. On the speech front, these programs automate detection of linguistic and vocal patterns that only a highly trained psychiatrist might pick up, as well as some acoustic clues the human ear can’t perceive. Algorithms created by scientists at Harvard, MIT, Columbia and Stanford, among others, have so far been able to use as little as a minute of speech, collected with consent, to identify people with post-traumatic stress disorder, depression, schizophrenia, psychosis and bipolar disorder. These automated assessments have been found to align with the opinions of trained psychiatrists between 70% and 100% of the time.

As a mental health crisis unfolds in the United States and suicide rates hit their highest levels since World War II, many people are pinning their hopes on AI to help at a time when the psychiatric field is severely understaffed. The U.S. Department of Defense is funding ongoing research to develop AI tools that can detect PTSD—to determine whether a soldier back from war is psychologically suited for redeployment, for instance. Silicon Valley is investing heavily, too. Earlier this year, for example, Google launched a partnership with The Trevor Project, a nonprofit that works in suicide prevention for LGBTQ youth. The project will use proprietary technology from Google that can detect and analyze human emotions in voice and text to help alert counselors to a patient’s possible suicide risk.

Plentiful real-world data, collected from smartphones and social media—and, perhaps one day, voice-activated assistants such as Amazon’s Alexa or Google Home—are helping scientists develop clinical tools that promise a way to scan for mental illness cheaply, remotely and noninvasively. “You don’t have to biopsy someone, you don’t even have to draw their blood,” says Charles R. Marmar, chair of the department of psychiatry at New York University School of Medicine who specializes in PTSD. “All you have to do is record them.”

But with that ease comes a round of questions, both clinical and ethical. Who should collect this data, and who should analyze it? How confident can researchers be about AI diagnoses? And if a machine delivers an incorrect assessment about a person’s mental health, what can be done to head off dangerous consequences?


EACH YEAR THE UNITED STATES spends more than $201 billion on mental health services, making it the most expensive category of illness to treat. Yet there is a shortfall of providers. Over half of U.S. counties don’t have a single social worker, psychologist or psychiatrist. Unlike in other medical fields, there’s no blood test or biomarker to speed diagnosis. Uncovering a mental illness still largely relies on a single expert going through the time-consuming process of conversation and observation.

Even then, the science is far from exact. Serious mental illnesses are categorized based on symptoms set forth in the Diagnostic and Statistical Manual of Mental Disorders, or DSM. Yet there is considerable diagnostic overlap among these conditions. Anxiety, difficulty with concentration and changes in energy level, for instance, could indicate bipolar disorder, PTSD or depression. At least half of patients receive more than one psychiatric diagnosis, according to a 2018 study published in JAMA Psychiatry. And settling on the right one sometimes takes years.

In 2013, just prior to publication of the fifth edition of the DSM, Thomas Insel, then head of the National Institute of Mental Health, became so frustrated with the reference book that he publicly denounced it in his director’s blog on the NIMH website. He wrote that it lacked scientific “validity” and that “patients with mental disorders deserve better.” Insel championed moving research away from DSM categories, instead focusing less on symptoms and more on the causes of these conditions, a shift that he called a first step toward “precision medicine” in mental health. A research group at NIMH began to define criteria for a new classification system for mental health disorders. One of those criteria is language.

Insel believes that natural language processing, a marriage of data science and linguistics, could be a game-changing biomarker for mental health, offering objective measures of how the mind is working. He now serves as president of Mindstrong Health, a technology company that measures mental health via mobile phone use data, and is optimistic about the potential of digital technology to usher in a new age in mental health diagnosis and treatment. “Over the next decade, the use of AI tools to classify language may transform the field, giving community health workers and emergency room physicians the tools of a master clinician,” he says.

Using language for diagnosis is as old as the field of psychiatry itself. Sigmund Freud was famously inspired by slips of the tongue, which he believed could reveal unconscious urges. In the early 1900s, Swiss psychologist Eugen Bleuler and his then-assistant Carl Jung pioneered the use of word association, one of the first observational, empirical tests used in psychoanalysis. A delayed response time or jarring word associations could indicate psychological conflicts and help point toward a diagnosis.

After World War II, researchers began looking beyond the linguistic content of speech toward acoustic content, or meanings hidden in the sounds of speech itself. For example, NASA began taking recorded language samples from astronauts to analyze their stress levels, among other metrics, and in the 1990s, the Department of Defense started testing voice analysis for lie detection to replace the much-maligned polygraph.

Today, psychiatrists are trained to look for speech traits in interviews with patients: Unusual talkativeness can indicate a hypomanic episode in bipolar disorder; reduced pitch and a slower speaking rate can indicate severe depression; and jarring breaks in meaning or coherence from one sentence to the next might suggest schizophrenia.

The first attempts to measure the language of mental illness quantitatively began in the 1980s, when a University of Maryland psychiatrist named Walter Weintraub began hand-counting words in speeches and medical interviews. Weintraub noticed that higher ratios of “I” and “me” in a person’s speech were reliably linked to depression. In the next decade, American social psychologist James Pennebaker created software that counted individual words and classified them into more than 80 linguistic categories—words that expressed insight or negative emotion, for example. Language that favored some of these categories correlated with mental health issues. Analysis of the auditory features of mental illness kicked off around 2000, when a team from Vanderbilt and Yale found that fluctuations in voice “power,” among other features, could serve as an indicator of depression and suicidality.
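The flavor of Weintraub’s hand counts can be conveyed in a few lines of code. This is a toy sketch for illustration only, not his actual method and not Pennebaker’s category-counting software: it simply computes the share of first-person singular pronouns in a text sample, the kind of ratio he linked to depression. The pronoun list and the `first_person_ratio` function are hypothetical choices, not drawn from either researcher’s work.

```python
from collections import Counter
import re

# Illustrative first-person singular pronoun list (a hypothetical choice;
# real category dictionaries are far larger and professionally curated).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def first_person_ratio(text: str) -> float:
    """Fraction of words in the sample that are first-person singular."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[w] for w in FIRST_PERSON) / len(words)

sample = "I feel like nothing I do matters to me anymore"
print(round(first_person_ratio(sample), 2))
```

In this sample, 3 of 10 words are first-person singular, so the ratio is 0.3; a researcher would compare such ratios across many speakers rather than read anything into a single sentence.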

More recently, advances in AI have transformed this approach to understanding speech. Machines can now sort through vast troves of data, looking for patterns humans might miss. Improvements in mobile phone recording technology as well as the advent of automated transcription over the past decade have also been critical to the field, making rigorous large-scale studies possible for the first time, according to Jim Schwoebel, CEO and founder of NeuroLex Diagnostics, which is working to build a speech analysis tool for primary care physicians to screen for schizophrenia. In the past several years, scientists have continued to refine their analytical tools, in some cases devising studies with larger sample sizes by extracting data from social media posts instead of working only with small cohorts in the lab.

Researchers with the University of Pennsylvania’s World Well-Being Project and Stony Brook University on Long Island, New York, for instance, have been collecting written language samples from social media. They recently published a study showing how one of their AI programs was trained to scour the Facebook posts of 683 consenting users—114 of whom had a depression diagnosis in their medical records—and could predict the condition up to three months earlier than clinicians could. With a vast database of people sharing thoughts and feelings in public, and the computing power to sift through it and look for patterns, the internet becomes a laboratory of speech.


BUT IT IS WITH THE SPOKEN voice that AI has really been able to break new ground, as computers learn to detect changes in sound that even a highly skilled psychiatrist would never pick up. In work supported by the Department of Defense, for instance, a team of researchers from New York University’s Langone Medical Center is collaborating with SRI International, the nonprofit research institute responsible for creating Apple’s voice assistant Siri. This past spring they published results showing that their program had identified imperceptible features of the voice that can be used to diagnose PTSD with 89% accuracy.

The production of speech uses more motor fibers—the nerves that carry messages to muscles and glands—than any other human activity. Speech involves more than 100 laryngeal, orofacial and respiratory muscles, creating a neurologically complex behavior that produces subtle variations in sound. The engineers at SRI International isolated 40,526 features of the human voice and asked their program to listen to half-hour speech samples taken from 129 male veterans who had been to war in Iraq and Afghanistan.

The team, led by NYU psychiatrist Charles Marmar, was able to identify 18 voice features that were present in all speakers but had a different pattern in PTSD cases. These included a narrower tonal range (fewer highs and lows), less careful enunciation, a more monotonous cadence and vocal changes caused by tension in throat muscles or by the tongue touching the lips.
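To give a sense of what one such acoustic feature might look like, here is a minimal sketch, assuming a pitch (fundamental-frequency) contour has already been extracted from a recording. It computes a crude tonal range and pitch variability, rough stand-ins for the “fewer highs and lows” described above. Both function names and the zero-means-unvoiced convention are assumptions for illustration; the actual NYU/SRI pipeline draws on tens of thousands of features, not these two.

```python
import statistics

def tonal_range(f0_hz):
    """Spread of a pitch contour in Hz, voiced frames only.

    A narrower range means fewer highs and lows -- closer to the
    'monotonous' speech pattern described in the PTSD findings.
    Frames with value 0 are treated as unvoiced and ignored.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return 0.0
    return max(voiced) - min(voiced)

def pitch_variability(f0_hz):
    """Standard deviation of the voiced pitch contour, a second
    crude measure of how animated or flat the voice is."""
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    return statistics.stdev(voiced)

# Toy contours (Hz per frame): an animated speaker vs. a monotone one.
animated = [110, 180, 95, 0, 210, 130, 160]
monotone = [120, 122, 0, 121, 119, 120, 118]
print(tonal_range(animated), tonal_range(monotone))
```

On these toy contours the animated voice spans 115 Hz while the monotone one spans only 4 Hz; a real system would feed many such numbers into a statistical classifier rather than judge any one of them in isolation.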

“We thought the 18 features would reflect high levels of anxious arousal,” says Marmar, “but they didn’t. They reflected monotonous speech, slowed speech, less bursty speech, flatter speech, less animated speech. In other words, low energy, atonal and unemotional.” Marmar thinks that this may result from studying soldiers five to eight years after they served in a war zone, and that this long window after the event may have led to a numbing of emotions as a defense mechanism against long-term stress mixed with alcohol and other problems.

Marmar’s team now wants to repeat this analysis using a sample that includes both male and female veterans and nonveterans. If the AI continues to show high marks, the team plans to use the program to test the effectiveness of a new drug for PTSD, studying the voice quality of a group of veterans before and after they take the treatment.

Another complicated but critical task for AI is to predict a future mental health event, such as an episode of psychosis, which can take the form of delusions and incoherent speech. Evidence suggests that the earlier mental illness is caught and treated, the better the outcome, so predictive powers would be particularly valuable.

One lab making headway in this regard is run by Guillermo Cecchi, a computational biologist at IBM Research in New York. Cecchi and his team are building an automated speech analysis application for a mobile device. In a study published in 2018, his algorithm was able to use a few minutes of speech collected during interviews to identify those who would develop psychosis over a two-and-a-half-year period. It accomplished the task with 79% accuracy—a rate validated in two additional studies. The computer model was also found to outperform other advanced screening technologies such as neuroimaging and electroencephalograms, which record brain activity.
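A rough sense of how a program can score the “breaks in meaning or coherence” mentioned earlier comes from comparing each sentence with the next and averaging the similarities. The sketch below is a deliberately simplified proxy: published psychosis-prediction work measures semantic coherence with vector representations of sentences, not shared words, so the word-overlap measure here is an assumption made for brevity, though the shape of the computation is similar.

```python
def coherence(sentences):
    """Mean similarity between consecutive sentences.

    Uses word overlap (Jaccard similarity) as a crude stand-in for
    the semantic-similarity measures used in research. Lower scores
    mean more abrupt jumps in meaning from one sentence to the next.
    """
    def words(s):
        return set(s.lower().split())

    sims = []
    for a, b in zip(sentences, sentences[1:]):
        wa, wb = words(a), words(b)
        if not wa or not wb:
            sims.append(0.0)
            continue
        sims.append(len(wa & wb) / len(wa | wb))
    return sum(sims) / len(sims) if sims else 0.0

coherent = ["the dog ran to the park", "the dog played in the park"]
scattered = ["the dog ran to the park", "copper wires sing at night"]
print(coherence(coherent) > coherence(scattered))  # prints True
```

The coherent pair shares several words and scores well above zero, while the disconnected pair scores zero; an embedding-based version would also catch jumps between sentences that share no words but do share meaning.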

“Language is one of the best windows into mental processes that we have,” says Cecchi. “Now we are using machine learning techniques and AI techniques to quantify what was mostly based on the particular experience of a well-trained psychiatrist or neurologist.” He envisions such tools serving as “stethoscopes of the mind,” available in the office of every psychiatrist, neurologist and social worker—and in every patient’s pocket.


A NUMBER OF BARRIERS stand between these early efforts and their wider adoption. One of them is the scarcity of good training data, as the number of voice samples teaching the current generation of AI is still relatively small. Even the most rigorously tested models learn from, at most, a few hundred professionally diagnosed psychiatric patients. And larger samples can be difficult to collect and share among researchers because of medical privacy concerns—a problem that affects medical AI projects in every field.

“It’s encouraging that these pilot projects are showing us what’s possible with voice analysis, but it’s kind of just the beginning,” says John Torous, director of the division of digital psychiatry at Beth Israel Deaconess Medical Center in Boston. “No one has found a way to capture clinically actionable and useful data at the population level.” Most researchers agree that sample sizes need to be in the tens of thousands before these projects can have full confidence that the algorithms work.

One of the biggest problems with a small sample size is that the AI can falter when it encounters a speech pattern it hasn’t been adequately trained on, such as a linguistic subculture. Apple’s voice assistant Siri, for instance, still struggles to handle questions and commands from Scottish users. IBM’s Cecchi notes that research participants, who mostly belong to similar socioeconomic and linguistic groups, likely have trained existing AI algorithms to recognize vocal cues that will not be relevant for other populations. “We study the temporal structure of your voice, its cadence. That can vary by culture,” says Cecchi.

But such problems may be easier to solve than larger, ethical questions. One well-known concern is that AI can propagate bias. When a model makes a diagnosis, it is only as good as the human psychiatrists it has learned from, but racial biases are well known to exist in current mental health care settings. An African American patient with the same symptoms as a white patient is more likely, for instance, to be diagnosed with schizophrenia and less likely to be diagnosed with a mood disorder. So AI may simply carry those errors forward, and to a wider population.

One response is to increase the “explainability” of AI models. Machine learning algorithms are generally considered “black box” models that present results without offering researchers any sense of how the machine arrived at the final answer. But the Navy Center for Applied Research in AI (funded in part by DARPA, the Department of Defense’s research arm) and IBM are working to create AI that can explain how it came to its conclusions. Other teams are laboring to develop AI programs that can effectively communicate how much uncertainty is involved in a prediction. That information would help practitioners understand how much weight to give AI in making clinical decisions. “It’s very important that the AI be explainable, so that we can fiddle with the knobs and address where these AI formulations are coming from,” says Cecchi.

Another major concern is who should have access to these diagnostic tools. Facebook already has a function that scans posts by members and flags those who might be at risk of suicide. Facebook users can’t opt out, and since last fall, the tool has been involved in sending emergency responders to check on 3,500 users who were thought to be in danger. Yet despite criticism of the intrusiveness of the function, Facebook has declined to release data or publish findings related to the interventions.

As voice recordings become a routine part of daily technology use—Amazon’s voice-controlled Alexa device apparently keeps its voice data and transcripts forever, for example—many people worry about police, employers or private companies snooping into the mental health of those who use the devices. “We need regulation,” says Jim Schwoebel of NeuroLex Diagnostics, “because right now, at least in some states, you can capture and reproduce someone’s voice without their consent.” And there are currently no laws to prevent discrimination based on speech.

BEHIND ALL THESE CONCERNS is a nagging question: What happens when an AI-derived conclusion is wrong? In mental health care, small errors can be catastrophic, and false positives—in which someone might wrongly be flagged as being bipolar, for example—can do significant damage. “Just having that diagnosis can make people feel sick and change their view of themselves,” says Steve Steinhubl, director of digital medicine at Scripps Research Translational Institute in San Diego. “That’s something we need to be really cautious about, especially if it’s just coming from a digital interface with no face-to-face conversation.”

Even as these and other concerns are raised, companies working in the field of computational speech analysis are forging ahead. Some are looking for ways to collect population-size samples of data. Schwoebel is building something he calls the Voiceome, a gigantic online repository of speech and voice data contributed by volunteers. Others, like the project in Cincinnati schools and the phone screening with The Trevor Project, are looking to bring diagnostic and prognostic tools into real-world applications.

Sonde Health, based in Boston, is doing both. Sonde is building a mobile phone platform that uses vocal analytics, licensed from MIT, with the potential to monitor and screen patients for depression from speech samples as short as six seconds. The Sonde app is already being used in India for research purposes through partnerships with hospitals and rural clinics. The company has audio files from 15,000 people that it is analyzing for signs of a range of mental and physical health conditions.

In Sonde’s grand plans, the platform will be available to patients everywhere, and it will be able to diagnose dementia, Parkinson’s and other conditions that go beyond its initial scope. CEO and cofounder Jim Harper says the company intends the platform to be used by both patients and health care providers.

Harper imagines a future in which people could choose to have a voice screening device set up in their homes, passively monitoring speech for clues to changes in mental and physical health. The app he’s imagining would work much like the recently released Alexa Guard, which tunes devices to listen for breaking glass or a smoke alarm to alert people who are away from home.

But he’s cautious, too. He can see how a tool that suggests a mental health diagnosis can too easily be misused, employed for harm rather than good. “That’s not a world any of us wants to live in,” he says.