Speech & Language Markers in Psychiatry - The NLP ...
Video Transcription
Hi everyone, welcome. I have the great honor of speaking right after lunch, so my main challenge will be to try and keep everyone awake. At the outset, I have two presentations coming up. In the first presentation, which is this one, we'll talk about speech and language markers in psychiatry. So we'll be thinking about how we can use speech and language, and the variations in speech and language in psychiatric disorders, as biomarkers. And in the next presentation, we will go ahead and talk about how we can use these same AI models and large language models to try and develop AI-based therapy chatbots. To introduce myself, I'm Manu Sharma. I'm the Assistant Medical Director for Research and Innovation at the Institute of Living in Hartford, Connecticut. I'm also an Assistant Clinical Professor of Psychiatry at Yale School of Medicine. Language and psychosis are my area of interest, but today we'll be talking about language and speech in general. Language is a very unique human capability, and it evolved as our cognitive processes evolved. Nowhere in the animal kingdom are language abilities so advanced. The way we humans can symbolically represent abstract thought processes or abstract concepts is fascinating, and it is perhaps one of the main drivers of the success we have seen as a species. Again, other animals in the animal kingdom oftentimes do have communication abilities, but most of them are limited to functions like food and mating. It's only humans that can represent complex abstract concepts like quantum physics or the structure of a molecule in the form of language. So historically, we've been using language as an important tool during our day-to-day psychiatric practice, as we are making diagnoses, as we are providing therapy. And oftentimes we use language and speech markers in our mental status exam and in our cognitive exams, be it the MoCA or the MMSE. There's always a component of speech and language in there.
Now, for example, oftentimes when we have patients with depression, we talk about slow speech rates and decreased verbal output, right? We talk about the speech content in the form of negative themes or themes around depression. In schizophrenia, obviously we have disorganized thought process, which is reflected in tangentiality and loose associations. And we can see it in the language they use and the speech they're producing. And of course, in mania, we see pressured speech. So historically, we've always had a component of speech and language as a part of our assessment of psychiatric disorders. And I think the idea is that we can use these characteristics in a quantified way to give us insights, diagnostic insights, or even prognostic insights into how psychiatric disorders might progress, right? And these attempts are not very recent. There has been a lot of research going back to the 1950s and 60s where linguists used to transcribe patient interviews or patient-reported symptoms, and they would manually code the speech sample for various linguistic properties. And they would see differences. Of course, that was a very old way of doing things. It was resource-intensive. You would have to have an individual, an RA or a postdoc researcher, sit down, look at the entire transcript, and then code it manually to come up with the themes or with specific linguistic markers. That advanced a little. In early stages, we developed computer programs that could use preset rules. These are oftentimes referred to as rule-based algorithms, where linguists would manually code rules and say, hey, if there's a word that ends in S and is following a "the," most likely it's a noun, right? And they would create similar rules. So for example, a very common technique we use in natural language processing is called part-of-speech tagging, right?
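To make the rule-based idea concrete, here is a minimal sketch of a hand-written tagger of the kind described above. The rules and tag names are invented purely for illustration; real rule-based systems used hundreds of such hand-crafted rules.

```python
# Toy rule-based part-of-speech tagger. The rules below are invented
# stand-ins for the hand-written linguistic rules the speaker describes.

def rule_based_tag(words):
    tags = []
    for i, w in enumerate(words):
        prev = words[i - 1].lower() if i > 0 else ""
        if prev in ("the", "a", "an"):
            tags.append("NOUN")      # word right after a determiner -> noun
        elif w.endswith("ing") or w.endswith("ed"):
            tags.append("VERB")      # crude suffix rule
        elif w.lower() in ("the", "a", "an"):
            tags.append("DET")
        else:
            tags.append("OTHER")     # no rule fires: the tagger gives up
    return list(zip(words, tags))

sentence = "the crane is flying over the construction site".split()
print(rule_based_tag(sentence))
```

Note that "site" falls through to OTHER because no rule covers it, and nothing here can tell a construction crane from the bird: this is exactly the coverage and context problem the lecture raises.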
So we divide the sentences up into individual words, and we assign a part of speech to each. Is it an adjective, an adverb, a noun? And as we'll see later on, there are some differences in these measures in various psychiatric disorders. One way of doing it historically was to actually go through the entire transcript and mark each and every word, saying this is a noun, this is an adjective, this is a verb. Then came the rule-based systems, where the linguists sat back and said, okay, fine, I'm going to predefine rules, and I'm going to feed this transcript into this computer program and it's going to tag it for me, right? But again, there are only so many rules you can code for, right? Because language by its nature is generative. There are unlimited permutations and combinations in which we can produce language. So as it gets more complex, it's difficult for these rule-based models to be accurate or to work with large datasets, right? And of course, they lack all context, right? So for example, take "the crane is flying over the construction site." You and I, and most of us, know that this is about a construction crane and not the bird that is also called a crane. But if we use rule-based systems like this, we will not be able to tell the difference. So they're difficult to scale. And of course, they're not sensitive to context. So the computer systems evolved a little bit more, right? And what happened was we used statistical methods, where we said, okay, fine, I'm going to train the computer program on large bodies of text data. And the model will identify certain patterns, and these are statistical probabilities of specific words being present in specific parts of sentences, and then use these probabilities to have a best estimate of what the word might be.
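The statistical approach described above can be sketched in miniature: instead of hand-written rules, tag probabilities are estimated from a labeled corpus. This is a deliberately simplified unigram version (the training pairs are invented); real statistical taggers such as hidden Markov models also condition on neighboring tags, as the speaker goes on to describe.

```python
from collections import Counter, defaultdict

# Tiny labeled "training corpus" (invented for illustration).
training = [
    ("the", "DET"), ("crane", "NOUN"), ("flew", "VERB"),
    ("the", "DET"), ("crane", "NOUN"), ("lifted", "VERB"),
    ("a", "DET"), ("bird", "NOUN"), ("flew", "VERB"),
]

# Count how often each word was seen with each tag.
counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1

def statistical_tag(word):
    """Pick the most probable tag observed for this word in training."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NOUN"  # back-off for unseen words: guess the most common tag

print([(w, statistical_tag(w)) for w in "the crane flew".split()])
```

This already generalizes better than fixed rules, but it still has no context: "crane" gets whatever tag was most frequent in training, regardless of whether the sentence is about construction sites or birds.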
Again, I'm continuing to use the part-of-speech tagging analogy, where instead of just having a rule that says, oh, if it's something that ends with S following a "the," it's probably a noun, the model looks at the word before and the word after, and then estimates the probability of the word in question being a particular part of speech. Better than rule-based methods, but going back to the example of the crane flying over a construction site, it might still run into issues. And again, it's not very scalable, and as the sentences get complex, it struggles to maintain its accuracy. So that leads us to today, right? The current frontier, as we call it, is deep learning. And I'm pretty sure you've had conversations and presentations already about GPT and BERT. Basically, these are automated systems which leverage artificial intelligence to learn patterns in language, and which are very sensitive to context, right? So in this case, the model recognizes "jumps" as a verb, not only because of the suffix, but also from its role in connecting the subject fox and the object dog. So in the previous example that I gave, the crane flying over a construction site, it will use the context of a construction site to rightly label the crane as a mechanical crane, which is used in construction sites and projects, rather than confusing it with the crane that's a bird. So it's very sensitive to context. It can analyze whole sentences together. And there is more nuance in its understanding of what language is. So just to recap, we started by manually coding and tagging individual words and picking up themes, moved to statistical methods, and the current frontier is deep learning and artificial intelligence. So that leads us to computational analysis of language. This is a growing field and a very exciting field.
And with every passing year, we have more and more papers being published. We have more and more companies coming in, leveraging natural language processing and computational analysis of language to help us make diagnostic and prognostic decisions. So over the next few slides, we will try and get a feel for what the various applications can be. Some of the very commonly used measures in computational analysis of language are things like speech rate: how fast or how slow one is speaking. As we remember from the mental status exam, in depression we talk about a slow rate of speech, and in mania we talk about a fast rate of speech. It can look at pitch variations. So, for instance, I have intonation; I'm going up and down based on the point I'm trying to make. It can pick up on those things. It can not only look at how we are saying things and what the structure of the sentence is; we can also use computational analysis of language to try and figure out the content of the sentence. So when we talk about, oh, there were predominant negative themes or themes related to depression or themes related to grandiosity, we can pick up on those things and quantify them using computational analysis. And this will become more clear as we use some examples to explain it. But also, we can talk about how coherent a speech sample is, how organized it is. And we can also detect emotions using computational analysis of language. So as I said before, let's go through some examples and some research that has already happened, which will help us understand how we can use natural language processing to develop biomarkers we can use in clinical psychiatry. The first example uses vocal acoustic markers. And this was done in patients with depression.
So in this paper, what they did was record speech samples from patients who were receiving treatment for depression in a clinical trial. It was placebo versus Zoloft. They were able to recruit about 105 adults with major depression. It was a four-week randomized double-blind controlled trial, and the drug in question was sertraline. This is a paper all the way back from 2012. What they did was conduct interviews with the patients, with several tasks used to elicit the speech sample. First, they asked a question around how the patients were doing and what their symptoms had been like; those were open-ended questions with free narrative speech. Then they asked the patients to read a paragraph, a passage, to look at their reading abilities and their speech rates when reading. Then they were asked to recite the whole alphabet, from A to Z, and then numbers from 1 to 20. That helped them figure out some of the vocal and acoustic markers, which we'll talk about a little in the next slide. And lastly, they were also asked to describe pictures. So the idea is they had multiple tasks with which they collected speech samples. And they did that at baseline, at weeks one and three, and after the trial had ended. And of course, they were collecting other measures related to severity of depression and other clinical markers as well. Now, they recorded a bunch of measures. They talked about total recording time; how much of that recording time the patient actually spent speaking; how many pauses they took; what the total time of the pauses was; and what the variation in pause length was, meaning some pauses are very long and some are very short, so what is the spread of that variation, along with the total number of pauses.
And then they also talked about the speech-to-pause ratio: how much time the patient spent speaking versus how much time was spent without speaking. And they also measured something known as F0, F1, and F2, with means and standard deviations. I'll spend some time trying to explain this. F0 captures pitch. When you record speech samples, you can represent the voice as a frequency, and when you talk about mean F0, it is basically the mean pitch. It will be the mean of all the pitch frequencies of what I'm speaking right now. It averages out, and that's the mean; they also took the standard deviation and the coefficient of variation of that frequency. So that's F0. F1 and F2 are slightly different measures of the acoustic features of speech. They generally have to do with how we pronounce vowels. F1 generally depends on how high or low your tongue is as you are enunciating some of these vowels, and F2 is more of a measure of the shape of your mouth. Of course, these are approximations of what F1 and F2 are, but they are other characteristics that vary as acoustic markers of speech. So they measured all of these things. And what did they find? What they found was that as the depression severity changed, they were able to pick up on changes in acoustic measures. For instance, more severe depression was associated with longer pause times, greater pause variability, and a slower speaking rate. This makes sense to us. Anybody who sees patients with depression every day understands that when you notice somebody who's not speaking as loudly, is speaking slowly, and has a history of depression, maybe they're experiencing more depression now than they were before. So it makes sense. But they found a way of quantifying that and tracking it over a period of time.
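The pause measures described above are straightforward to compute once speech frames have been separated from silence. Here is a minimal sketch on a toy "energy envelope" rather than real audio; the frame values and threshold are invented, and a real pipeline would extract them from recordings with tools such as openSMILE or Praat.

```python
import statistics

FRAME_SEC = 0.1  # assume each analysis frame covers 100 ms (illustrative)
energy = [0.8, 0.9, 0.0, 0.0, 0.0, 0.7, 0.6, 0.0, 0.0, 0.9]

speaking = [e > 0.1 for e in energy]  # frames above threshold count as speech

speech_time = sum(speaking) * FRAME_SEC
pause_time = speaking.count(False) * FRAME_SEC
speech_to_pause = speech_time / pause_time  # the speech-to-pause ratio

# Pause lengths: runs of consecutive non-speech frames.
pauses, run = [], 0
for s in speaking:
    if not s:
        run += 1
    elif run:
        pauses.append(run * FRAME_SEC)
        run = 0
if run:
    pauses.append(run * FRAME_SEC)

print(f"speech/pause ratio: {speech_to_pause:.2f}")
print(f"pauses: {len(pauses)}, mean {statistics.mean(pauses):.2f}s, "
      f"sd {statistics.pstdev(pauses):.2f}s")
```

Total pause time, pause count, mean pause length, and pause variability (the standard deviation) are exactly the kinds of quantities the study tracked against depression severity.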
And that quantification varied with the depression severity. So not only do we have some kind of statistical agreement, but it makes clinical sense as well. And then of course, people who responded to treatment also showed differences: responders showed a reduction in pause times and an increase in speaking rate as the treatment progressed. And these kinds of studies have been replicated several times. So how you say things, how your pitch varies, and the amount of time you spend in pauses have been shown to be related to depression severity in several studies since this was published in 2012. Now, in another study, moving away from acoustic analysis, we can look at the content of speech, right? In this study, what they did was look at the speech content of patients who were at clinical high risk for psychosis, and they tried to predict who would transition from clinical high risk to full-fledged psychosis. They initially had 34 patients who were identified as clinical high risk, and they followed them for 2.5 years, and five of the patients transitioned to psychosis. At baseline, they had clinical interviews, which they had recorded. They analyzed these clinical interviews to look for markers that would help predict psychosis. The technique they used is called latent semantic analysis; they looked at semantics. When we think about syntax, it's the grammar of speech, or the structure of a sentence. When we think about semantics, it's the content or the meaning of the words, right? So what we do in latent semantic analysis is, initially, we divide the transcript into individual sentences.
Now, as you remember, most of these AI models are trained on large bodies of text. And with that training, they're able to predict the probability of a specific word being present in a specific part of a sentence. So take this example: "I can't think of them all offhand. They were the ones I always considered my best songs. They were the ones I really wrote from experience." First, the algorithm divides the transcript up into individual sentences. Then each word in a sentence gets assigned a vector. To explain it simply, each word is given a number between 0 and 1, which corresponds to the probability of that word being present at that part of the sentence. And this is based on the algorithm being trained on millions and millions of sentences of text; for example, it can be New York Times articles, medical literature, and other things. Now, once you have a vector or a numerical value for each word, you can simply average them out and create a sentence-level coherence value, right? The higher the average coherence, the more commonly that combination of words is used together, right? So for example, in this sample, I'm guessing the patient was talking about songs; you would expect words like "music" or "lyrics" in the sentence, because they are related to the theme of songs and are more likely to appear with each other. So you will have a higher coherence value, right? But if somebody suddenly starts talking about random things, the coherence values will go down. So how organized the speech is in terms of content can be measured using semantic analysis of the speech sample. This is what they did. In simple words, every word in a sentence got assigned a value based on the probability of that word existing in that place.
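The mechanics of the coherence measure described above can be sketched with toy word vectors: average the vectors of the words in each sentence, then compare consecutive sentence vectors with cosine similarity. The two-dimensional vectors and word lists here are invented purely to show the arithmetic; real latent semantic analysis learns high-dimensional vectors from millions of documents.

```python
import math

# Invented 2-D "word vectors"; dimension 1 loosely = music-related,
# dimension 2 loosely = unrelated topics. Purely illustrative.
word_vectors = {
    "songs":  [0.9, 0.1], "music":  [0.8, 0.2], "wrote": [0.7, 0.3],
    "lyrics": [0.9, 0.2], "lizard": [0.1, 0.9], "fence": [0.2, 0.8],
}

def sentence_vector(sentence):
    """Average the vectors of known words (one common simplification)."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Coherence between consecutive "sentences": on-topic vs off-topic.
coherent = cosine(sentence_vector("songs music"), sentence_vector("wrote lyrics"))
incoherent = cosine(sentence_vector("songs music"), sentence_vector("lizard fence"))
print(coherent, incoherent)  # the on-topic pair scores higher
```

A transcript that stays on theme yields consistently high sentence-to-sentence similarity, while tangential or disorganized speech drives the average coherence value down, which is the quantity the study fed into its classifier.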
And they averaged those values at the level of a sentence, and that was called the semantic coherence value. Now, when they used that semantic coherence value, in addition to the use of determiners, to create a classifier, a machine learning algorithm to predict who converts to psychosis from clinical high risk, they actually had 100% accuracy. Whereas standardized scales such as the SIPS and SOPS had very poor accuracy, with sensitivity as low as 40% and specificity in the 80s. But remember, this was only 34 participants. And I will not bury the lede: they tried to replicate the same model on a different speech dataset, and the model failed miserably. So we have to understand this value of 100% accuracy with some humility, and use some caution as we interpret what this result might mean. But again, as a proof of concept, yes, you can use the content or coherence of a speech sample to try and predict who might convert to psychosis from clinical high risk. This paper did it in a very small sample. Another way of looking at risk for psychosis, or psychotic disorder relapse, is to look at patients' social media feeds, which is quite interesting. So in this study, what they did was recruit 51 individuals. These were individuals who were receiving care in first-episode psychosis programs in New York, Michigan, and one more first-episode psychosis clinic. And what they did was, at their visit, once the patients consented for the study, they would just download their entire Facebook history, from the time they joined Facebook until the time of the visit. They downloaded all of that, and they ran a machine learning classifier using language models in the background. And then they basically compared that timeline to clinical data, that is, hospitalizations for psychosis and other events.
So what they found was that as a patient gets closer to a hospitalization, their Facebook posts show an increased frequency of certain words: they're swearing more, there are more words in the anger and even death categories, they use more first- and second-person pronouns, and they use fewer words related to work, friends, and health. They also saw some behavioral changes in the way the patients used Facebook: they were increasingly tagging more friends, and they were posting a lot during the early morning hours. Right? So it makes sense. They generally have sleep disturbances, and that can lead to worsening of psychosis and can eventually lead to hospitalization. But the researchers were able to detect all these things in the social media feed using natural language processing algorithms. So when they tried to use the Facebook feed and all the data they collected from there to create a machine learning classifier to predict psychosis admissions, it performed quite well: it had a specificity of 71% in predicting relapse. So it's not only samples that are collected during interviews. We can also collect samples from a social media feed, from text messaging, and other sources, which have oftentimes been used in digital phenotyping to make such predictions. Now, another aspect, right? When we think about how natural language processing can help us become better clinicians, it's not only by diagnosing better, but also by giving us feedback and recommendations in therapy, right? Individuals who do a lot of therapy will know that we give patients homework to complete. And in this modern world, you might see a patient once a month, you might not have mentioned in your notes that you gave them homework, and so you forget to follow up on it, right?
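The word-category counting described above is conceptually simple: score each post by how many of its words fall into predefined category dictionaries. The category lists below are invented stand-ins for the LIWC-style dictionaries typically used in this kind of study.

```python
# Invented mini-dictionaries; real studies use validated word lists
# with hundreds of entries per category.
CATEGORIES = {
    "anger":        {"hate", "angry", "furious"},
    "first_person": {"i", "me", "my", "mine"},
    "work":         {"job", "work", "office"},
}

def category_rates(post):
    """Fraction of a post's words falling in each category."""
    words = post.lower().split()
    return {cat: sum(w in vocab for w in words) / len(words)
            for cat, vocab in CATEGORIES.items()}

print(category_rates("i hate my job and i am furious"))
```

Tracking these per-post rates over a patient's timeline, alongside behavioral signals like posting times, is what gives the relapse classifier its features.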
But we know that following up on these things, and having patients engage in doing those homework exercises, is actually very beneficial for therapeutic outcomes, right? So what they did in this paper was collect the transcripts from close to 2,500 therapy sessions. And then they did something very interesting: they created a machine learning model that could identify when homework was being given, and whether there was follow-up in the next session, right? It was able to pick up on these themes from the therapy sessions and predict them fairly accurately, with accuracy around 76%. And then, using the same information, it was able to show that in the sessions where homework was given and then followed up on, the improvement in depression was also much higher, which makes sense. But imagine training therapy students or residents by actively recording their therapy sessions. Of course, supervisors give them feedback, but you could quantify the techniques or skills being used during therapy using these algorithms and give them quantified feedback: okay, you used active listening here, you used reflective statements here, just as they were able to train a model to pick up people giving homework and following up on it. So that's another application, right? I like talking about psychosis because that's the world I live in. So in another example, they used natural language processing on regular data collected from electronic medical records, right? Remember, we started off by thinking about how we can use natural language processing and the measures we develop from speech samples collected from patients. But it's not just limited to that.
We can pick up similar things from patients' social media posts, right? We can pick them up, of course, during sessions. And then there's the language we use to describe things in the EMR; we can use that as well. This was a study done in South Korea. They included basically anybody who was hospitalized with a psychotic spectrum disorder. They went back and looked at the electronic health records. And they not only analyzed the structured data, like labs and other fields; they also looked at clinical notes. They analyzed psychological tests, admission notes, and nursing notes. They used NLP to extract topics, and then they used those topics to create a machine learning model which could help predict who would be hospitalized within the next one year. Now, in this graph, you can see there's an initial model. For the initial model, they did not use any of the notes; they just used tabular data. They might have used diagnoses and sociodemographic data, symptoms, medication status, whether patients were taking their medication or not, all of those things. And they created a model, and that reached an accuracy level of only 78%. But then, when they started adding notes from nurses and clinicians into that model, the accuracy improved all the way up to 94%. So again, it's a very powerful tool, where you can go back and not only look at structured data, which we've used historically. We've taken drug tests and medication use and the number of diagnoses, the number of outpatient visits, the number of emergency room visits, and created models to predict the risk of hospitalization. But we haven't been very successful. And anybody who does clinical work knows that we don't use that in clinical practice at all, despite EMRs being in existence for the past 20 years.
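The step of adding note-derived features to a tabular model, as described above, amounts to concatenating one feature vector from structured fields with another extracted from free text. Here is a minimal sketch using invented field names and a tiny bag-of-words vocabulary; the study itself used NLP topic extraction rather than raw word counts.

```python
# Hypothetical EMR record; field names and vocabulary are invented
# for illustration, not taken from the study.

def featurize(record, vocab):
    """Concatenate structured EMR features with bag-of-words note features."""
    tabular = [record["num_er_visits"], record["num_admissions"]]
    words = record["nursing_note"].lower().split()
    text = [words.count(w) for w in vocab]  # counts per vocabulary word
    return tabular + text

vocab = ["agitated", "refused", "stable"]
record = {
    "num_er_visits": 3,
    "num_admissions": 1,
    "nursing_note": "Patient agitated and refused medication",
}
print(featurize(record, vocab))  # [3, 1, 1, 1, 0]
```

The combined vector then feeds whatever classifier predicts one-year readmission; the study's accuracy gain (78% to 94%) came from exactly this kind of enrichment of the tabular features with information mined from notes.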
There have been multiple studies showing that we can use these markers, but the algorithms haven't reached an accuracy level high enough to make them clinically meaningful. With the power of natural language processing, when we start analyzing notes and making them a part of these prediction models, we actually end up with much more accurate models, which might eventually, one day, hopefully be very clinically useful. So imagine a scenario where you're seeing a patient, and you open your EMR, and it says, hey, the risk of this person being hospitalized in the next three or six months is this much. And imagine tailoring your interventions based on insights like this. So again, a very powerful method. But we have a long way to go. So in conclusion, when we use natural language processing, we can collect speech and language data from varied sources. It can be clinical interviews. It can be free speech. There have even been studies that looked at patients describing Rorschach tests, and then used NLP to pick up features from the speech that have some diagnostic predictability. We can use social media posts and electronic health records. And from a clinical use standpoint, we can use it not only for diagnostic purposes. What excites me the most is that we can use it for severity. How severe is a depression? There are companies out there right now that can predict PHQ scores based on a two-minute voice sample. So you can get a binary outcome, like, oh, the PHQ score is above a cutoff or below it. Now, of course, the PHQ takes about two minutes to administer as well. So we need more advancements. But imagine developing a model that might be able to predict PANSS scores or MADRS scores or HAM-D scores, which would be much more clinically relevant.
And not only predicting these scores in a binary way, where you say higher than a cutoff or lower than it; you could categorize severity as mild, moderate, or severe, or, in the ideal scenario, get an actual number. And that would make measurement-based care so easy. It would not be burdensome or resource-intensive for the clinician to sit and do a clinician-administered scale that might take close to 35 to 40 minutes, right? So there are lots of applications. And we can use it not only for severity assessments; we can monitor treatment response. Imagine sending a text message to a patient, and them answering some questions by speaking to their phone, and using those speech markers to gauge how severe the symptoms are and predict standardized scores, and you getting that feedback while the patient is sitting at home. And you can actually track those things over a period of time. And then we can use it to predict relapse and, of course, for therapy training and enhancing therapy outcomes. There are lots of limitations. We have to be very careful as we think through some of these applications. I can't say this loudly enough: we are in very early phases. This technology is in its infancy when we think about it being ready for clinical applications, right? It's nowhere near ready to be present in any kind of clinical setting. It's generally limited to research settings. Unfortunately, that's the case for most biomarker research in psychiatry. We don't use any biomarkers clinically despite having 30, 40, 50 years' worth of research in imaging and genetics and other areas. So there are lots of limitations to language samples. But the opportunity is that it's very scalable, right? I don't need a $1,000 or $1,500 MRI scan or an $800 EEG. I can collect speech samples during my regular clinical interview.
This makes it very scalable. That would allow for large amounts of data to be collected, and that would hopefully result in models that are very clinically useful. But there are limitations like context sensitivity. I don't know how much progress we'll make in terms of how sensitive these models are to context. Human speech is generative. We use sarcasm. We use nuance. There are cultural references, right? There are differences in the way we express ourselves based on where we grew up; even within the same country, it depends on which part of the country you're in. Imagine running the same algorithm on someone from Boston, with their peculiar way of saying things, and then running it against someone with a deep Southern accent, right? So there are challenges there. There are a lot of cultural differences. And there will be culturally specific idioms which are present in one part of the world and missing from another part of the world. Another very important limitation, and something we have to consider very carefully: if you follow the news, you know there have been instances where companies that have done text-based therapy or provided therapy to patients have sold anonymized data to advertising companies. And they've gotten in trouble for that. So again, data, privacy, and consent matter, especially when we think about recording somebody's speech sample away from the clinic, say in a home setting, right? You might end up recording things which they never consented to have recorded. We have to protect this data a lot, because there's a lot of sensitive information in these samples, which can be personal and can have severe ramifications down the line. Then I talked about bias and generalizability, right? Of course that exists.
So again, I get excited about speech and language samples, because I know for a fact that, comparing how much data we can get at such a low cost, I just feel that all of us should be recording our clinical interviews and, in an anonymized way, donating them to research, so that we can run analyses and develop algorithms, and have large, scalable applications for predicting treatment outcomes or developing a measurement-based care system. But what ends up happening is that when we train these models to analyze this data, there's always a question of bias, right? For example, in the study I talked about in schizophrenia, the model they used was trained on 50 years of New York Times articles, right? And New York Times articles are not a good representation of what a day-to-day person speaks like, right? So when you think about measuring syntactic structure or grammar or content or coherence levels, and the person you're comparing against is a journalist who works at the New York Times, of course there's a risk of bias. There's a risk of saying, okay, this is faulty speech, or these are markers that are deficient, when there might not be anything wrong at all. And like anything else, we will have algorithmic bias. When we think about therapeutic approaches, we have to remember that these models do not have access to a lot of medical data or therapy data. So we have to be very careful when we try to generalize. Lastly, this is a big one, and it's true for all machine learning algorithms, right? Every day, you get up, you open the news, or you open Medscape or something else, and you see, oh, brain scans can predict schizophrenia, or brain scans can predict dementia. Like this morning: oh, you can use one speech sample to predict dementia, right? Most of these algorithms are black boxes, right?
They can't tell you what is wrong with the speech sample and why they said this person might have schizophrenia and that person might not. Going back to the classifiers in the schizophrenia study, which were used to predict outcomes, they can't go back and say which aspect of semantic coherence was responsible for the prediction that this person might convert to psychosis or not. These are black boxes: the input goes in, an output comes out, and what happens inside, nobody knows. Compare that with HbA1c. When I use HbA1c to tell somebody their risk of being pre-diabetic or diabetic, I can say, hey, we know that if it's greater than 6.5 we need to bring it down; let's keep it below 6. There's some explanatory power there. But if somebody asked me, hey, what was it about my speech sample that made you think I might have schizophrenia, and I'm using these black-box algorithms, the answer would be: I don't know. That should make a lot of us nervous, especially when there's no explanatory power and it's difficult to understand why the model is suggesting what it's suggesting. So with that, I would like to end my presentation and open up for questions. Let me see; I think I have one. Are there any AI apps currently that can be integrated into Zoom telepsychiatry sessions to assess speech and language? That's the dream, Divya. There are lots of apps out there that include speech samples; these are mostly phone-based apps that can be used to predict depression. There used to be a company called Quantify that did work with social media posts. There's a company called Ellipsis out of California that does depression prediction. There have been other companies; Mindstrong used to use speech samples, if I'm not wrong.
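[Editor's note] The HbA1c analogy can be made concrete: a threshold rule carries its own explanation, which is exactly what a black-box speech classifier lacks. A minimal sketch; the thresholds follow common clinical cutoffs (the 6.5 figure is from the talk, the 5.7 pre-diabetic boundary is the standard ADA cutoff), and the function is purely illustrative:

```python
def hba1c_risk(value):
    # Transparent rule-based classifier: the decision and the
    # reason for the decision are the same thing.
    if value >= 6.5:
        return "diabetic range", f"HbA1c {value} >= 6.5"
    if value >= 5.7:
        return "pre-diabetic range", f"5.7 <= HbA1c {value} < 6.5"
    return "normal range", f"HbA1c {value} < 5.7"

label, reason = hba1c_risk(7.1)
print(label, "-", reason)
```

A deep speech model has no equivalent of the `reason` string, which is the explanatory gap the speaker is pointing at.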
So NLP has been included in these apps, which can reportedly be used for disease monitoring. They are largely what we call digital phenotyping applications: using data collected from the phone, both passive data and speech data, you can come up with predictive insights and develop a phenotype. So yes, there are apps like that, but none that I'm aware of that can be used with Zoom. I do know of a lot of new telepsychiatry companies coming up which provide telepsychiatry almost as SaaS, software as a service, including real-time transcription and real-time insights. I know one exists for providing feedback on therapy, and I'm blanking on the name, but if you Google it, I'm pretty sure you'll be able to find it. Fair enough. We have a comment pointing out an ambiguity: a construction crane would swing over construction sites, while the bird crane would fly over construction sites. Fair enough. I think the point you're making is that grammar can be complex. Of course, you can have rules-based analysis of these things, and you can do a pretty good job. But as nuance increases, as we have more differences between speakers of different dialects, and English is not my first language, right, the way I would say things, the way a Britisher, an Australian, or somebody else might speak English will vary. That's why you need more powerful models which can keep context in mind and understand that nuance better. That's where the deep learning models will far outperform any rules-based model. But point taken: a crane might swing, or it might fly. And it actually helps me make my point that language is complex and varies between speakers. Nancy Martin asks: interesting to consider current apps that are integrated into everyday life, like Alexa, collecting speech data for these purposes. Yes, they are collecting speech samples. Alexa, Siri; all of us have had that experience, right?
You're saying something, and then suddenly you get an advertisement for vacation planning, because you were discussing a vacation trip with your wife. So yes, they constantly record. I have "Hey Siri" and "Hey Alexa" turned off on all my devices, because they constantly record. The way that works is the device keeps the last few seconds of audio so it can recognize the vocal pattern associated with the wake words "Hey Siri" or "Hey Alexa," and that's why it's constantly recording. These techniques have been used, and they are very advanced in the field of advertising; we just need to find ways of using them in clinical psychiatry. Next: what might explain the preference for text messaging over speech? Are there age or personality factors? I'm not aware of any specific research that looks into this. But there are companies out there that do both synchronous and asynchronous text therapy with patients. There's a company out of the UK that does asynchronous text therapy, perhaps one of the largest providers of teletherapy services for the NHS, and they pioneered text-based therapy. They now have about ten years' worth of data from these text-therapy messaging apps, and they have used the power of NLP to develop automated chatbots and apps that can help someone with depression and anxiety. So I'm not aware of specific research measuring text versus vocal preference. Of course, the kinds of analysis we could run on text versus spoken words will be different. With texting, we can think about texting speed, how many emojis they're using, and things like that, whereas with vocal speech you can always go back to acoustic markers, rate of speech, and other things.
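[Editor's note] The contrast drawn here, typed-text markers like emoji counts versus acoustic markers like rate of speech, can be sketched as two small feature extractors. The feature names and numbers are illustrative, not taken from any particular app:

```python
import re

def text_features(messages):
    # Simple markers from typed messages: volume, length, emoji use.
    joined = " ".join(messages)
    emoji_count = len(re.findall(r"[\U0001F300-\U0001FAFF]", joined))
    return {
        "n_messages": len(messages),
        "avg_words": sum(len(m.split()) for m in messages) / len(messages),
        "emoji_count": emoji_count,
    }

def speech_features(n_words, duration_sec, pause_secs):
    # Simple temporal/acoustic markers: rate of speech and pausing.
    return {
        "words_per_min": 60.0 * n_words / duration_sec,
        "total_pause_sec": sum(pause_secs),
        "mean_pause_sec": sum(pause_secs) / len(pause_secs) if pause_secs else 0.0,
    }

print(text_features(["feeling ok today 🙂", "slept badly"]))
print(speech_features(n_words=150, duration_sec=90, pause_secs=[0.8, 1.2, 2.0]))
```

The two extractors never share features, which is the speaker's point: the modality dictates what you can measure.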
So the kinds of analysis we might end up doing when we collect data from speech versus when we collect data from elsewhere will vary, but I'm not sure there's a preference for one over the other. All right, some more questions. Are NLP models best used only on the population of speakers for which the model was developed, that is, American-English speakers? Yes, ideally. You will have customization based on language, ideally. That's why, again, we run into the problem of scale, and that has been the problem with biomarker research in psychiatry for a very long time: we've had large-scale imaging studies and the like, but they've never had enough participants to be generalizable. When we think about NLP models, you might not even have a single model that you use both for patients with depression and for patients with psychosis; you might have different models that are good at predicting different things. I don't think there will be one overarching model that is good at everything. So you will have specialization not only by the kind of language speakers, but also by application: one NLP model might be great at predicting whether a person is depressed, another at whether the patient is having cognitive issues, another at whether there's psychosis. Even that level of specialization. And of course, American English versus British English versus Australian English will have very different speech characteristics. If you collect large enough samples, you might be able to account for those differences and come up with one model per language, because different languages are structurally and grammatically different.
But that would require unique collections of hundreds of thousands of samples, and it would have to be cross-sectional. Ideally, though, models would be more specialized than that. Then I have a question from Jennifer: I'm really trying to hear these lectures with an open mind; do you think an app analyzing speech patterns for psychiatric purposes adds a lot to assessment by an experienced clinician? Yes, and I'll tell you how. This is how I see the applications. I don't think there will be an app that tells me whether a patient has schizophrenia or not; that's never going to be the way we approach these algorithms. But think about the example I used when I was talking about measurement-based care. I deal with patients with schizophrenia every day, and I work in a schizophrenia clinic, but it's very rare that we have the time, ability, or resources to administer standardized scales like the PANSS regularly. Then there's this whole concept of white space, right? The patient comes and sees me once a month, and I ask them, okay, tell me exactly everything that happened in the past month. How difficult it is for them to express everything in a 30- or 45-minute visit! There will be recall bias. If they've had a couple of bad days in a row, they'll say, okay, fine, I'm having a very bad time, which might not reflect the entire time they spent away from the clinic. Using these speech measures, you can augment the amount of information you get, which helps with decision-making. They will never replace clinical acumen. They will never replace clinical judgment. What they can do is give you quantitative insights, and that will be helpful.
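[Editor's note] The "quantitative insights" idea, comparing each new visit against the patient's own earlier visits rather than against a population norm, can be sketched with a simple z-score flag. The metric (words per minute), the threshold, and the numbers are all illustrative assumptions, not from the talk:

```python
from statistics import mean, stdev

def flag_change(baseline_visits, new_value, z_threshold=2.0):
    # Compare a new measurement (e.g., words per minute) against the
    # patient's own earlier visits; flag large deviations for review.
    mu, sd = mean(baseline_visits), stdev(baseline_visits)
    z = (new_value - mu) / sd if sd > 0 else 0.0
    return {"z": round(z, 2), "flag": abs(z) >= z_threshold}

# Five prior visits of speech rate (words/min), then a markedly slower visit.
history = [132, 128, 135, 130, 129]
print(flag_change(history, 96))
```

A flag like this does not diagnose anything; it fills the "white space" between visits with a data point the clinician can choose to act on.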
The way I see it, using digital phenotyping and speech and language markers, we can have a seamless measurement-based care system that relies on objective data and objective metrics, not only when the patient is in the clinic but between sessions as well. That is where the value proposition of such techniques lies. And of course, we can always say we rely on judgment, but we know there are issues, because judgment varies. I never compare these algorithms with an excellent psychiatrist. Let's compare them with an average psychiatrist, right? Let's compare them with someone who, through no fault of their own, through whatever training they've had, has seen five patients with schizophrenia, versus someone who had access to a clozapine clinic in residency. Judgment varies; there's no objectivity there. So we need to move away from purely clinical acumen toward clinical acumen augmented by objective measurement. The rest of medicine does this, and I think we should be comfortable doing it as well. So I don't see it replacing things; I see it augmenting things, creating more data points from when patients are away, helping us be on par with everybody else. That's where the applications are. It would be interesting how these models can be applied to populations who speak English as a second language; can you comment on that? Yes. There are lots of studies happening now which actually collect what a person's first language is, to try to figure out how that might affect these speech samples. The way I speak English is very different from the way a person born and raised in America might speak English. That's where the bias comes into these models: if I use something like word2vec, I'm competing against a New York Times journalist, which I don't think is a fair competition, because they'll be better at English any day.
But again, two solutions to that: collecting large amounts of data and hoping the algorithm picks up the variations, and actually designing research studies that look at English as a second language. Do you have specific EHR recommendations for easy data collection to add to researchers' databases? None. Most EHRs make it extremely difficult to be interoperable, because, I don't know, interoperability might not be a good business practice for them; so I do not have any specific EHR recommendations. That said, once you know how, it's easy to pull data from notes at the backend of almost any EHR. If you have someone in your organization who already does that, it will be super easy, regardless of which EHR you use. I'm curious to know if cultural context is considered. Not as of now. Most of these models are very general; these NLP models are not trained on cultural context, not yet. We do not have that kind of data or that kind of nuance, and that's why we have to be very aware when we actually apply these methods. Can speech-analyzing AI be used by employers to screen prospective employees? Yes. If you can think of doing a personality test and converting that into a speech algorithm, maybe, but I'm not aware of that happening. There are some interesting applications, though, around the way your intonation works. You could have a lie detector test based on NLP, of course, but we know lie detector tests can be faked, and so can speech analysis. So I'm not quite sure how we would use it for prospective employees, but I don't think there's a technological barrier that would prevent you from doing so. Would it work well as-is when used to compare change for an individual patient, appointment to appointment? Are individual profiles available to compare change over time? I think yes. So Jennifer, exactly.
That's the point, right? I feel, and this is just based on my understanding of the literature, which might be biased toward the psychosis literature because that's the world I live in, that these methods are not great at making group-level distinctions: differentiating a group of patients with schizophrenia from a group without schizophrenia. Because when we think about psychiatric disorders, we're thinking of heterogeneous presentations that meet certain predefined DSM-based diagnostic criteria. The way people present with, say, schizophrenia is very heterogeneous. One person might have predominantly negative and cognitive symptoms, which might be associated with an increased number of pauses, a slowed rate of speech, reduced syntactic complexity, and reduced semantic coherence, whereas another with predominantly positive symptoms might show disorganization of speech as the big component, with no issues in the rate at which they speak or in their pause lengths. But when you have somebody's baseline, you can compare them against it over subsequent visits and use that to derive insights. Just as you would imagine somebody doing a PHQ-9 every week, and somebody aggregating that data into a graph for you as you follow up three months later for a depression visit, so you can say, hey, that week in February was pretty rough for you, your PHQ-9 scores were very bad, let's talk about that. We can do the same with speech samples. That will be the application. I don't think such tools are presently available, but don't quote me on that, because new companies pop up every day and there might actually be one doing it. If anybody knows of one, please feel free to comment. This topic raises a question.
Are certain qualities of speech positively or inversely associated with specific mental illnesses regardless of culture or ethnicity? For example, reticence may be a common quality of indigenous American speech and not a sign of depression. Yes, absolutely. That's why context matters, and the more nuanced the context, the more it matters. I'm of the opinion that speech samples by themselves will never be enough as a diagnostic marker. You'll have to have some kind of context, and you might need to ask very specific questions. That's where we have to collect data from a very representative, large sample. I keep coming back to this point: when it comes to biomarker research in psychiatry with imaging and EEG, these methods are very resource-intensive. You can't take an MRI machine anywhere outside big academic centers for this kind of research. But you can take a voice recorder to reservations and rural areas and keep recording. So the hope is that eventually we will be able to collect enough data to represent those populations. There's no technological barrier, and there's no barrier in terms of the amount of money you need to do this. It's very simple, it's easy to use, anybody can do it; you can do it with your phone. So yes, reduced rate of speech is a characteristic; it might not be diagnostic of anything by itself, but if you know there's a context of depression, you might be able to derive insights and develop a scoring mechanism for measurement-based care. Next comment: it sounds as if we are waiting for a big, life-changing declaration. Well, it depends on the system you are in. Fortunately, in the system I'm in, we are doing lots of work with AI and NLP; we have strong partnerships with various academic institutions, and we're actively working on this.
And if you ever want to reach out and collaborate, I'm pretty sure we'd be happy to figure out a way. But I think we're almost towards the end of the session. Thank you so much for these engaging questions. Thank you.
Video Summary
In this presentation, Manu Sharma discusses the potential of using speech and language as biomarkers in psychiatry, focusing on diagnostics and therapy through AI-driven methods. Historically, language has been vital in psychiatric assessments, like identifying speech patterns in disorders such as depression or schizophrenia. Sharma outlines the evolution from manual coding of speech samples to modern AI models, highlighting the use of deep learning systems like GPT and BERT to enhance context understanding in language processing.

He reviews studies showcasing speech analysis applications, like predicting depression severity through vocal markers or identifying risk of psychosis via social media analysis. The session explores how AI could aid clinicians by automating symptom measurement, monitoring treatment responses, and even enhancing therapy through feedback mechanisms. Despite the promise, Sharma notes critical challenges, such as cultural and contextual variability, privacy concerns, the need for diverse datasets, and the complexity of speech, urging careful integration of AI in clinical settings. Concluding, he emphasizes AI's role in supplementing, not replacing, human clinical judgment, advocating for scalable, objective measures in psychiatric care.
Keywords
speech biomarkers
psychiatry AI
language processing
deep learning
depression diagnosis
psychosis risk
AI in therapy
privacy concerns
clinical AI integration