How to un-normalize unquantized predictions
Hello! Thanks for all your help so far:)
How do I un-normalize the quantized predictions / what function did you use to normalize them? I want to make sure I can properly compare the predictions to the true scores in a regression context.
Thanks!
Rahul
Hi Rahul,
You're welcome!
By normalize do you mean quantize? If you want the raw, un-quantized scores, just pass quantize=False to one of the Pipeline.run_on_* functions.
The quantization uses a sorted list of thresholds for each task, which are configurable at the Pipeline level; for the defaults see inference_thresholds in config.py. The quantization happens in the last function in model.py by calling torch.searchsorted to figure out where in the sorted list of thresholds a particular score lands.
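To illustrate that lookup, here is a minimal sketch with made-up thresholds (the real defaults live in inference_thresholds in config.py):

```python
import torch

# Hypothetical thresholds for illustration only; the actual defaults
# are defined in inference_thresholds in config.py.
thresholds = torch.tensor([-1.5, -0.5, 0.0])

raw_scores = torch.tensor([-2.0, -1.0, 0.3])
# searchsorted returns, for each score, the index of the bin it falls into
# within the sorted threshold list: scores below all thresholds get 0,
# scores above all thresholds get len(thresholds).
bins = torch.searchsorted(thresholds, raw_scores)
print(bins)  # tensor([0, 1, 3])
```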
Noah
Yeah, fair question, to clarify - I don't quantize. The output I get is, as the model card says, "raw float values which correlate monotonically with PHQ-9 and GAD-7." The values I get out are in a range of roughly -2.5 to 0.5, so I wasn't sure how to convert the true scores (0-27) to this normalized range or vice versa.
Oh, ok. How best to map raw scores to the most likely corresponding PHQ-9 sums depends on what metric you use for "best". The tuning folder has tools for this; see the "Tuning thresholds" section of the DAM model card, specifically the "Optimal Tuning for Multi-class Tasks" subsection, for some examples. Let me know if you have questions about how this applies to your use case.
Hi everyone. I'd like to ask about the model that predicts the emotions output. Are there any plans to release this model?
Thanks @NDStein ! Maybe I'm misunderstanding; from a regression standpoint (e.g. with quantize=False), I was expecting the model to output a continuous prediction between 0-27 for PHQ-9 and 0-21 for GAD-7. The output I get instead is continuous but between roughly -2.5 and 0.5. My interpretation was that these scores map to an actual PHQ-9 or GAD-7 sum score, but that some type of conversion or normalization is happening in the DAM model (e.g. z-scoring). If so, I'm curious what that is. I didn't think threshold tuning was what I wanted, since I'm not building any type of classifier but treating this as a continuous regression problem.
@rfbrito Thanks for the question. I think the confusion is coming from how the model outputs were structured.
The model itself does not directly regress to the raw PHQ-9 (0–27) or GAD-7 (0–21) totals. Instead, the underlying model produces a continuous latent score that reflects the model’s estimate of symptom severity from vocal features. Those raw outputs can fall in ranges like the one you’re seeing (e.g., roughly −2.5 to 0.5 depending on the checkpoint and calibration).
In the production API (https://www.kintsugihealth.com/api/voice-api#predict-results), we apply an additional mapping layer on top of that latent score to produce the clinically interpretable outputs. Specifically:
The latent score is used internally as the model’s continuous signal.
We then apply calibration and threshold mapping to translate that signal into:
- binary screening outputs (e.g., depression present/absent)
- severity bands (e.g., no_to_mild, mild_to_moderate, moderate_to_severe)
The API therefore returns clinically meaningful categories, rather than the raw latent regression value.
So if you’re using the open model outputs directly, what you’re seeing is the pre-calibration latent score, not a normalized PHQ-9 or GAD-7 regression target.
If someone wanted to approximate PHQ-9 / GAD-7 totals from that signal, you would typically apply a calibration function (e.g., isotonic/logistic mapping) trained on labeled validation data rather than simple threshold tuning.
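As a rough sketch of what such a calibration could look like, here is an isotonic fit using scikit-learn, on synthetic stand-in data (in practice you would substitute real latent scores and labeled PHQ-9 sums from a validation set; nothing here reflects the actual production calibration):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic stand-in for labeled validation data: latent scores in the
# observed range and noisy "true" PHQ-9 sums. Replace with real data.
rng = np.random.default_rng(0)
latent = rng.uniform(-2.5, 0.5, size=200)
phq9 = np.clip(np.round((latent + 2.5) * 9 + rng.normal(0, 2, 200)), 0, 27)

# Fit a monotone mapping from latent score to PHQ-9 total, clipped to the
# valid score range.
calib = IsotonicRegression(y_min=0, y_max=27, out_of_bounds="clip")
calib.fit(latent, phq9)

# Apply to new latent scores to get approximate PHQ-9 totals.
approx_phq9 = calib.predict([-2.0, -1.0, 0.0])
```

Isotonic regression is a natural fit here because the model card only promises a monotone relationship between the latent score and the label, not a linear one.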
The model is trained with ordinal regression rather than traditional regression. Ordinal regression is designed for problems with categorical labels which have a natural ordering.
The raw score is on an arbitrary numeric scale, not based on a pre-defined transformation of the labels. The ordinal regression objective teaches the model to output higher raw depression / anxiety scores for higher PHQ-9 / GAD-7 labels. There is nothing encouraging the relationship to be linear, but the ordinal regression objective tries to make it as close to monotone as possible.
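Purely as an illustration of the idea (not the model's actual training code), a cumulative-link-style ordinal loss for a single raw score can be sketched as K-1 binary "is the label above cutpoint k?" problems sharing one score:

```python
import torch
import torch.nn.functional as F

def ordinal_loss(score, cutpoints, label):
    # For K ordered classes there are K-1 cutpoints; target k is 1 iff
    # the true label lies above cutpoint k.
    targets = (torch.arange(len(cutpoints)) < label).float()
    logits = score - cutpoints  # one binary subproblem per cutpoint
    return F.binary_cross_entropy_with_logits(logits, targets)

cutpoints = torch.tensor([-1.5, -0.5, 0.0])  # hypothetical, 4 ordered severity levels
loss_good = ordinal_loss(torch.tensor(0.3), cutpoints, label=3)   # high score, high label
loss_bad = ordinal_loss(torch.tensor(0.3), cutpoints, label=0)    # high score, low label
```

This objective pushes the raw score up for higher labels and down for lower ones, which yields the monotone-but-arbitrary-scale behavior described above.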
I've added the figures below showing the cumulative score distribution for each label value to the "Output" section of the model card. Since predicting PHQ-9 or GAD-7 sum exactly from a 30-second voice clip on an arbitrary topic is hard, there is substantial overlap between the distributions for nearby PHQ-9 and GAD-7 values. But the trend of higher raw scores for higher label values is visible.
The reason I mentioned threshold tuning is that if you want to explicitly map ranges of scores to predicted PHQ-9 or GAD-7 sums, the threshold tuning code gives you an optimal way to do so for many definitions of "optimal".
These were plotted with the data https://huggingface.co/datasets/KintsugiHealth/dam-dataset/blob/main/data/test-00000-of-00001.parquet using the following code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_parquet("test-00000-of-00001.parquet")
df['phq'] = df['phq'].astype(int)  # All integers, but stored as floats
df['gad'] = df['gad'].astype(int)

# Depression: ECDF of raw scores, one curve per PHQ-9 sum
plt.figure()
sns.ecdfplot(data=df, x="scores_depression", hue="phq")
plt.title("CDF of unquantized depression model scores on test set grouped by PHQ-9 sum")
plt.grid()

# Anxiety: start a new figure so the two ECDFs don't overlap on one axes
plt.figure()
sns.ecdfplot(data=df, x="scores_anxiety", hue="gad")
plt.title("CDF of unquantized anxiety model scores on test set grouped by GAD-7 sum")
plt.grid()
plt.show()

