ssws3 bhaskarbuilds committed on
Commit
265940a
·
0 Parent(s):

Duplicate from JoshTalksAI/Human-1

Co-authored-by: Bhaskar Singh <bhaskarbuilds@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,204 @@
1
+ ---
2
+ license: cc-by-4.0
3
+ language:
4
+ - hi
5
+ tags:
6
+ - moshi
7
+ - speech-to-speech
8
+ - hindi
9
+ - conversational-ai
10
+ - audio
11
+ - full-duplex
12
+ - duplex-dialogue
13
+ - indian-languages
14
+ base_model: kyutai/moshiko-pytorch-bf16
15
+ pipeline_tag: audio-to-audio
16
+ ---
17
+
18
+ # Human-1: A Full-Duplex Conversational Model for Hindi
19
+ **🎙️ [Try the live demo →](https://ai.joshtalks.com/research/human-1)** | **📄 [Paper →](https://arxiv.org/pdf/2604.23295v1)**
20
+
21
+ Human-1 by Josh Talks is the first full-duplex spoken dialogue model for Hindi, built by adapting [Kyutai's Moshi](https://github.com/kyutai-labs/moshi) architecture. It enables real-time, natural Hindi conversation with support for interruptions, overlaps, backchannels, and natural turn-taking. It was trained on 26,000 hours of real spontaneous Hindi conversations from 14,695 speakers.
22
+
23
+ <p align="center">
24
+ <img src="hindi_moshi_architecture.svg" alt="Hindi-Moshi Architecture" width="480"/>
25
+ </p>
26
+
27
+ ## Model Details
28
+
29
+ | | |
30
+ |---|---|
31
+ | **Developed by** | Bhaskar Singh, Shobhit Banga, Pranav Sharma ([JoshTalks](https://joshtalks.com)) |
32
+ | **Base model** | [kyutai/moshiko-pytorch-bf16](https://huggingface.co/kyutai/moshiko-pytorch-bf16) |
33
+ | **Language** | Hindi (hi) |
34
+ | **Model type** | Full-duplex speech-to-speech dialogue |
35
+ | **Format** | SafeTensors (fp32) |
36
+ | **Tokenizer** | Custom Hindi SentencePiece (32,000 vocabulary) |
37
+ | **Audio codec** | Mimi (frozen, 12.5 Hz, 1.1 kbps) |
38
+ | **License** | CC-BY-4.0 |
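
The codec figures in the table (12.5 Hz, 1.1 kbps) can be cross-checked with a short calculation. The sketch below assumes Mimi emits 8 codebooks of 2,048 entries per frame, as described in the Moshi paper; those two numbers are not stated in this model card, and the function name is illustrative.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a residual vector-quantised codec in bits per second."""
    bits_per_code = math.log2(codebook_size)  # 2048 entries -> 11 bits
    return frame_rate_hz * num_codebooks * bits_per_code


MIMI_FRAME_RATE_HZ = 12.5   # from the Model Details table
MIMI_NUM_CODEBOOKS = 8      # assumption: value reported in the Moshi paper
MIMI_CODEBOOK_SIZE = 2048   # assumption: value reported in the Moshi paper

bitrate = codec_bitrate_bps(MIMI_FRAME_RATE_HZ, MIMI_NUM_CODEBOOKS, MIMI_CODEBOOK_SIZE)
print(f"{bitrate / 1000:.1f} kbps")
```

Under these assumptions the result matches the 1.1 kbps listed for the audio codec.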
39
+
40
+ ## What was changed from base Moshi
41
+
42
+ The original English SentencePiece tokenizer was replaced with a Hindi SentencePiece model (32,000 vocabulary) trained on a large Hindi text corpus. This required reinitialisation of three vocabulary-dependent parameter groups:
43
+
44
+ - `text_emb`: text token embedding in the Temporal Transformer
45
+ - `depformer.emb.0`: text token embedding in the Depth Transformer
46
+ - `text_linear`: text output projection layer
47
+
48
+ All audio processing components (Mimi codec) and remaining transformer weights retain their pre-trained values. Mimi generalises to Hindi without retraining (STOI: 0.878, PESQ: 2.55).
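
As a rough illustration of the tokenizer swap, the sketch below splits parameter names into those that would be reinitialised and those that keep their pre-trained values. The three prefixes come from the list above; the `.weight` suffixes and the other example names are hypothetical, and this is not the authors' conversion script.

```python
# Prefixes of the vocabulary-dependent parameter groups named in this README.
REINIT_PREFIXES = ("text_emb", "depformer.emb.0", "text_linear")


def needs_reinit(name: str) -> bool:
    """True if a parameter belongs to one of the vocabulary-dependent groups."""
    return any(name == p or name.startswith(p + ".") for p in REINIT_PREFIXES)


def split_parameters(names):
    """Partition parameter names into (reinitialise, keep pre-trained)."""
    reinit = [n for n in names if needs_reinit(n)]
    keep = [n for n in names if not needs_reinit(n)]
    return reinit, keep


# Hypothetical parameter names, for illustration only.
example_names = [
    "text_emb.weight",
    "depformer.emb.0.weight",
    "depformer.emb.1.weight",
    "text_linear.weight",
    "transformer.layers.0.norm1.alpha",
]

reinit, keep = split_parameters(example_names)
print("reinitialise:", reinit)
print("keep:", keep)
```

Matching on `prefix + "."` rather than on the bare prefix avoids accidentally selecting a sibling such as `depformer.emb.1`.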
49
+
50
+ For full architecture details, see the [Moshi paper](https://arxiv.org/abs/2410.00037).
51
+
52
+ ## Training
53
+
54
+ ### Data
55
+
56
+ The model was trained on a purpose-built corpus of **26,000 hours** of real Hindi spontaneous conversations; to our knowledge, this is the largest conversational speech corpus for any Indian language.
57
+
58
+ | Characteristic | Value |
59
+ |---|---|
60
+ | Total duration | 26,000 hours |
61
+ | Unique speakers | 14,695 |
62
+ | Recording type | Spontaneous, unscripted conversations |
63
+ | Channels | Stereo (separate per speaker) |
64
+ | Quality control | Trained annotators + manual checks |
65
+
66
+ The stereo recording format with separate speaker channels enables direct learning of turn-taking, overlaps, and backchannels from natural interactions, without requiring artificial speaker diarisation.
67
+
68
+ ### Two-stage training recipe
69
+
70
+ **Stage 1: Pre-training** on the full 26,000-hour corpus. Learning rate of 3×10⁻⁵ (matching original Moshi pre-training). AdamW with β₁=0.9, β₂=0.95, weight decay 0.1. Effective batch size of 64 (\~2.9 hours of audio per update). Trained for 1 epoch (\~10,000 steps) in approximately 13 hours on 8× NVIDIA H100 80GB GPUs.
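
The Stage 1 figures can be sanity-checked against each other with a back-of-the-envelope calculation. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the per-sequence duration is a derived quantity rather than one the authors report.

```python
CORPUS_HOURS = 26_000     # full pre-training corpus
HOURS_PER_UPDATE = 2.9    # approximate audio per optimiser step
EFFECTIVE_BATCH = 64      # sequences per optimiser step

# One pass over the corpus at ~2.9 hours per update.
steps_per_epoch = CORPUS_HOURS / HOURS_PER_UPDATE

# Implied average audio length of a single training sequence, in seconds.
seconds_per_sequence = HOURS_PER_UPDATE * 3600 / EFFECTIVE_BATCH

print(f"steps per epoch:      ~{steps_per_epoch:,.0f}")
print(f"seconds per sequence: ~{seconds_per_sequence:.0f}")
```

The implied step count (about 9,000) is in the same range as the quoted ~10,000 steps; both inputs are rounded, so an exact match is not expected.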
71
+
72
+ **Stage 2: Fine-tuning** on ~990 hours of curated high-quality conversational data. Split learning rates: 2×10⁻⁶ for the Temporal Transformer, 4×10⁻⁶ for the Depth Transformer. Optimal checkpoint selected at step 4,812 based on minimum total validation loss (3.370).
73
+
74
+ ### Training infrastructure
75
+
76
+ 8× NVIDIA H100 80GB GPUs with bf16 mixed precision.
77
+
78
+ ## Evaluation
79
+
80
+ ### Perplexity
81
+
82
+ Measured using Sarvam-1 (2B) on Whisper-v3 transcriptions of generated speech.
83
+
84
+ | System (temperature) | PPL ↓ |
85
+ |---|---|
86
+ | Ground-truth | 237.1 |
87
+ | Human-1 (τ=0.8) | 356.9 |
88
+ | Human-1 (τ=0.9) | 467.1 |
89
+ | Human-1 (τ=1.0) | 640.6 |
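
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that computation on a list of per-token log-probabilities. It is a generic illustration, not the authors' evaluation script, and it assumes the log-probabilities have already been obtained from the scoring language model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(mean negative log-likelihood)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Toy example: four tokens, each assigned probability 0.25 by the scoring model.
toy_logprobs = [math.log(0.25)] * 4
print(perplexity(toy_logprobs))
```

A model that assigns every token probability 0.25 is as uncertain as a uniform choice among four options, which is why lower values in the table indicate more predictable text.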
90
+
91
+ ### Human Evaluation
92
+
93
+ 130 evaluators completed 2,125 rating tasks comparing human speech with model responses. Each instance contained two audio samples (Voice A: Human, Voice B: Model) rated on 5-point Likert scales for naturalness and clarity.
94
+
95
+ **Perceptual quality:**
96
+
97
+ | Metric | Human Score | Model Score | Human Preferred | Model Preferred | Tie |
98
+ |---|---|---|---|---|---|
99
+ | Naturalness | 4.55 | 4.10 | 30.0% | 3.1% | 66.9% |
100
+ | Clarity | 4.05 | 3.04 | - | - | - |
101
+
102
+ Generated speech achieves high perceptual quality, with naturalness scores approaching human speech and most pairwise comparisons resulting in ties.
103
+
104
+ **Conversational rubric evaluation:**
105
+
106
+ Evaluators also assessed conversational quality using three binary rubric questions measuring whether generated responses behave like natural conversational speech.
107
+
108
+ | Rubric | Pass Rate |
109
+ |---|---|
110
+ | Human-like interaction | ≈85% |
111
+ | Appropriateness (response follows prompt) | ≈53% |
112
+ | Completion (response forms a complete reply) | ≈42% |
113
+
114
+ While the model frequently produces speech that sounds human-like, maintaining contextual relevance and producing fully complete conversational responses remain ongoing challenges.
115
+
116
+ ### Turn-Taking Analysis
117
+
118
+ Temperature τ=0.9 produces turn-taking dynamics closest to ground-truth.
119
+
120
+ | Model | τ | IPU/min | Pause | Gap | Overlap |
121
+ |---|---|---|---|---|---|
122
+ | Ground-truth | - | 35.30 | 10.49 | 8.51 | 3.03 |
123
+ | Human-1 | 0.8 | 23.12 | 9.16 | 6.77 | 1.67 |
124
+ | Human-1 | 0.9 | 29.14 | 9.24 | 8.54 | 4.30 |
125
+ | Human-1 | 1.0 | 38.90 | 11.67 | 8.10 | 9.68 |
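
One way to read the table is to measure how far each temperature setting sits from the ground-truth row. The sketch below sums the absolute differences across the four columns. This distance is an assumption made for illustration; the README does not say which criterion the authors used, and the unnormalised sum lets IPU/min weigh most heavily.

```python
# Values copied from the turn-taking table above.
GROUND_TRUTH = {"ipu_per_min": 35.30, "pause": 10.49, "gap": 8.51, "overlap": 3.03}
HUMAN_1 = {
    0.8: {"ipu_per_min": 23.12, "pause": 9.16, "gap": 6.77, "overlap": 1.67},
    0.9: {"ipu_per_min": 29.14, "pause": 9.24, "gap": 8.54, "overlap": 4.30},
    1.0: {"ipu_per_min": 38.90, "pause": 11.67, "gap": 8.10, "overlap": 9.68},
}


def l1_distance(stats, reference):
    """Sum of absolute per-metric differences (illustrative criterion)."""
    return sum(abs(stats[key] - reference[key]) for key in reference)


distances = {tau: l1_distance(stats, GROUND_TRUTH) for tau, stats in HUMAN_1.items()}
closest_tau = min(distances, key=distances.get)

for tau, dist in sorted(distances.items()):
    print(f"tau={tau}: distance={dist:.2f}")
print("closest to ground-truth:", closest_tau)
```

Under this particular distance, τ=0.9 comes out closest, which agrees with the statement above the table.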
126
+
127
+ ## Conversation Style
128
+
129
+ Human-1 is trained on **topic-driven conversations** - real dialogues where two speakers discuss a subject naturally, with backchannels, interruptions, and organic turn-taking.
130
+
131
+ After an initial introduction, the model will typically **propose a topic and steer the conversation toward it**, preferring structured discussion over open-ended chitchat. Users can also **introduce their own topic** - the model will pick it up and engage in a focused discussion around it. This is an intentional design choice - the training data consists of real conversations where speakers engage in focused, in-depth discussions on assigned topics.
132
+
133
+ This makes the model particularly well-suited for **domain-specific conversational applications**. Our key finding is that the model's ability to stay on-topic emerges naturally from the structure of the training data alone - without any explicit prompting, reward shaping, or guardrails. This suggests that with sufficient hours of domain-specific conversational data, this approach can produce models that learn the conversational norms of virtually any domain - customer support, healthcare consultations, language tutoring, sales, therapy, and more - opening a direct path from curated conversations to deployable, real-world voice agents. Exploring this is an active direction of our future work.
134
+
135
+ ## Files
136
+
137
+ ```
138
+ ├── model.safetensors # Human-1 LM weights
139
+ ├── tokenizer-e351c8d8-checkpoint125.safetensors # Mimi audio codec (frozen, from Moshi)
140
+ ├── tokenizer_hindi.model # Hindi SentencePiece tokenizer
141
+ ├── tokenizer_hindi.vocab # Vocabulary reference
142
+ ├── hindi_moshi_architecture.svg # Architecture diagram
143
+ └── README.md
144
+ ```
145
+
146
+ ## Quick Start
147
+
148
+ ### 1. Install uv
149
+
150
+ ```bash
151
+ curl -LsSf https://astral.sh/uv/install.sh | sh
152
+ source $HOME/.local/bin/env
153
+ ```
154
+
155
+ ### 2. Create project and install dependencies
156
+
157
+ ```bash
158
+ uv init human-1 && cd human-1
159
+ uv python install 3.12
160
+ uv python pin 3.12
161
+ uv add moshi huggingface_hub
162
+ ```
163
+
164
+ ### 3. Download the model
165
+
166
+ ```bash
167
+ uv run huggingface-cli download JoshTalksAI/Human-1 --local-dir ./weights
168
+ ```
169
+
170
+ ### 4. Run the server
171
+
172
+ ```bash
173
+ uv run -m moshi.server \
174
+ --moshi-weight ./weights/model.safetensors \
175
+ --mimi-weight ./weights/tokenizer-e351c8d8-checkpoint125.safetensors \
176
+ --tokenizer ./weights/tokenizer_hindi.model
177
+ ```
178
+
179
+ ## Intended Use
180
+
181
+ The model is intended for research in full-duplex spoken dialogue systems for Hindi and Indian languages. It can be used as a conversational agent for casual Hindi conversations.
182
+
183
+ ## Limitations
184
+
185
+ - Trained primarily on Hindi conversational speech. Performance on other languages or domains is not guaranteed.
186
+ - Inherits limitations from the base Moshi architecture regarding audio quality at 1.1 kbps bitrate.
187
+ - Hindi text tokens are sparser relative to audio (~75% PAD ratio vs. 65% in English) due to Devanagari encoding more phonemic content per token.
188
+ - Not intended for impersonation or any malicious use.
189
+ - This model is for research purposes. We do not recommend it for providing advice or performing any professional duty.
190
+
191
+ ## Citation
192
+
193
+ ```bibtex
194
+ @article{singh2026human1,
195
+ title = {Human-1 by Josh Talks : A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations},
196
+ author = {Bhaskar Singh and Shobhit Banga and Pranav Sharma},
197
+ year = {2026},
198
+ institution = {JoshTalks}
199
+ }
200
+ ```
201
+
202
+ ## Acknowledgments
203
+
204
+ Built on [Moshi](https://github.com/kyutai-labs/moshi) by [Kyutai](https://kyutai.org/). We thank the 14,695 speakers who contributed to the Hindi conversational corpus.
hindi_moshi_architecture.svg ADDED
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:66e73545d7e54dc8ed39bcab1b62ae336c82d97b141dbc0622a07acbf2a5ea2d
3
+ size 30750958336
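
The three lines added for each large file are a Git LFS pointer, not the file itself: they record the spec version, a SHA-256 digest, and the byte size of the real object. The sketch below parses such a pointer. The function name and returned keys are illustrative, and the pointer text is copied from the `model.safetensors` entry above.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # each line is "<key> <value>"
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:66e73545d7e54dc8ed39bcab1b62ae336c82d97b141dbc0622a07acbf2a5ea2d
size 30750958336
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["digest"][:12], f"{info['size_bytes'] / 1e9:.2f} GB")
```

After downloading, the digest can be compared with a SHA-256 hash of the local file (for example via `hashlib.sha256`) to confirm that the transfer completed intact.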
tokenizer-e351c8d8-checkpoint125.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:09b782f0629851a271227fb9d36db65c041790365f11bbe5d3d59369cf863f50
3
+ size 384644900
tokenizer_hindi.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a05524805f181f74520be13b407cb00bcea3872398bcd0058d75e40c6bfc13c2
3
+ size 1080022
tokenizer_hindi.vocab ADDED
The diff for this file is too large to render. See raw diff