Commit 497809f by mrfakename (verified) · 1 Parent(s): a8f46ef

Upload folder using huggingface_hub

Files changed (4)
  1. LICENSE +61 -0
  2. README.md +205 -0
  3. checkpoint.pt +3 -0
  4. config.json +92 -0
LICENSE ADDED
@@ -0,0 +1,61 @@
+ SAM License
+ Last Updated: November 19, 2025
+
+ “Agreement” means the terms and conditions for use, reproduction, distribution and modification of the SAM Materials set forth herein.
+
+
+ “SAM Materials” means, collectively, Documentation and the models, software and algorithms, including machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code, and other elements of the foregoing distributed by Meta and made available under this Agreement.
+
+ “Documentation” means the specifications, manuals and documentation accompanying
+ SAM Materials distributed by Meta.
+
+
+ “Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or regulations to provide legal consent and that has legal authority to bind your employer or such other person or entity if you are entering in this Agreement on their behalf.
+
+
+ “Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your principal place of business is in the EEA or Switzerland) or Meta Platforms, Inc. (if you are located outside of the EEA or Switzerland).
+
+
+ “Sanctions” means any economic or trade sanctions or restrictions administered or enforced by the United States (including the Office of Foreign Assets Control of the U.S. Department of the Treasury (“OFAC”), the U.S. Department of State and the U.S. Department of Commerce), the United Nations, the European Union, or the United Kingdom.
+
+
+ “Trade Controls” means any of the following: Sanctions and applicable export and import controls.
+
+ By using or distributing any portion or element of the SAM Materials, you agree to be bound by this Agreement.
+
+
+ 1. License Rights and Redistribution.
+
+
+ a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property or other rights owned by Meta embodied in the SAM Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the SAM Materials.
+
+ b. Redistribution and Use.
+ i. Distribution of SAM Materials, and any derivative works thereof, are subject to the terms of this Agreement. If you distribute or make the SAM Materials, or any derivative works thereof, available to a third party, you may only do so under the terms of this Agreement and you shall provide a copy of this Agreement with any such SAM Materials.
+
+
+ ii. If you submit for publication the results of research you perform on, using, or otherwise in connection with SAM Materials, you must acknowledge the use of SAM Materials in your publication.
+
+
+ iii. Your use of the SAM Materials must comply with applicable laws and regulations, including Trade Control Laws and applicable privacy and data protection laws.
+ iv. Your use of the SAM Materials will not involve or encourage others to reverse engineer, decompile or discover the underlying components of the SAM Materials.
+ v. You are not the target of Trade Controls and your use of SAM Materials must comply with Trade Controls. You agree not to use, or permit others to use, SAM Materials for any activities subject to the International Traffic in Arms Regulations (ITAR) or end uses prohibited by Trade Controls, including those related to military or warfare purposes, nuclear industries or applications, espionage, or the development or use of guns or illegal weapons.
+ 2. User Support. Your use of the SAM Materials is done at your own discretion; Meta does not process any information nor provide any service in relation to such use. Meta is under no obligation to provide any support services for the SAM Materials. Any support provided is “as is”, “with all faults”, and without warranty of any kind.
+
+
+ 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE SAM MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF ANY KIND, AND META DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE SAM MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE SAM MATERIALS AND ANY OUTPUT AND RESULTS.
+
+ 4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT OR INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+
+ 5. Intellectual Property.
+
+
+ a. Subject to Meta’s ownership of SAM Materials and derivatives made by or for Meta, with respect to any derivative works and modifications of the SAM Materials that are made by you, as between you and Meta, you are and will be the owner of such derivative works and modifications.
+
+ b. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the SAM Materials, outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the SAM Materials.
+
+ 6. Term and Termination. The term of this Agreement will commence upon your acceptance of this Agreement or access to the SAM Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete and cease use of the SAM Materials. Sections 3, 4 and 7 shall survive the termination of this Agreement.
+
+ 7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of the State of California without regard to choice of law principles, and the UN Convention on Contracts for the International Sale of Goods does not apply to this Agreement. The courts of California shall have exclusive jurisdiction of any dispute arising out of this Agreement.
+
+
+ 8. Modifications and Amendments. Meta may modify this Agreement from time to time; provided that they are similar in spirit to the current version of the Agreement, but may differ in detail to address new problems or concerns. All such changes will be effective immediately. Your continued use of the SAM Materials after any modification to this Agreement constitutes your agreement to such modification. Except as provided in this Agreement, no modification or addition to any provision of this Agreement will be binding unless it is in writing and signed by an authorized representative of both you and Meta.
README.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ license: other
+ license_name: sam-license
+ license_link: LICENSE
+ extra_gated_fields:
+   First Name: text
+   Last Name: text
+   Date of birth: date_picker
+   Country: country
+   Affiliation: text
+   Job title:
+     type: select
+     options:
+       - Student
+       - Research Graduate
+       - AI researcher
+       - AI developer/engineer
+       - Reporter
+       - Other
+   geo: ip_location
+   By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Meta Privacy Policy: checkbox
+ extra_gated_description: >-
+   The information you provide will be collected, stored, processed and shared in
+   accordance with the [Meta Privacy
+   Policy](https://www.facebook.com/privacy/policy/).
+ extra_gated_button_content: Submit
+ language:
+   - en
+ ---
+
+ # SAM-Audio: Segment Anything Model for Audio
+
+ SAM-Audio is a model for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans.
+
+ ## Authentication
+
+ Before using SAM-Audio, you need to:
+ 1. Request access to the checkpoints on the [SAM-Audio Hugging Face repo](https://huggingface.co/facebook/sam-audio-large)
+ 2. Authenticate with Hugging Face: `huggingface-cli login`
+
+ ## Usage
+
+ SAM-Audio supports three types of prompting: text, visual, and span. Each method allows you to specify which sounds to isolate in different ways.
+
+ ### 1. Text Prompting
+
+ Use natural language descriptions to isolate sounds.
+
+ ```python
+ import torch
+ import torchaudio
+ from sam_audio import SAMAudio, SAMAudioProcessor
+
+ # Load model and processor
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
+ processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
+
+ # Load audio file
+ audio_file = "path/to/audio.wav"
+
+ # Describe the sound you want to isolate
+ description = "A man speaking"
+
+ # Process and separate
+ inputs = processor(audios=[audio_file], descriptions=[description]).to(device)
+ with torch.inference_mode():
+     result = model.separate(inputs)
+
+ # Save results
+ torchaudio.save("target.wav", result.target[0].unsqueeze(0).cpu(), processor.audio_sampling_rate)
+ torchaudio.save("residual.wav", result.residual[0].unsqueeze(0).cpu(), processor.audio_sampling_rate)
+ ```
+
+ **Examples of text descriptions:**
+ - "A person coughing"
+ - "Raindrops are falling heavily, splashing on the ground"
+ - "A dog barking"
+ - "Piano playing a melody"
+ - "Car engine revving"
+
+ ### 2. Visual Prompting
+
+ Isolate sounds associated with specific visual objects in a video using masked video frames.
+
+ ```python
+ import torch
+ import numpy as np
+ from sam_audio import SAMAudio, SAMAudioProcessor
+ from torchcodec.decoders import VideoDecoder
+
+ # NOTE: Requires SAM3 for creating masks
+ # pip install git+https://github.com/facebookresearch/sam3.git
+ from sam3.model_builder import build_sam3_video_predictor
+
+ # Load SAM-Audio model
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
+ processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
+
+ # Load video
+ video_file = "path/to/video.mp4"
+ decoder = VideoDecoder(video_file)
+ frames = decoder[:]
+
+ # Create mask using SAM3 (example with text prompt)
+ video_predictor = build_sam3_video_predictor()
+ response = video_predictor.handle_request({
+     "type": "start_session",
+     "resource_path": video_file,
+ })
+ session_id = response["session_id"]
+
+ masks = []
+ for frame_index in range(len(decoder)):
+     response = video_predictor.handle_request({
+         "type": "add_prompt",
+         "session_id": session_id,
+         "frame_index": frame_index,
+         "text": "The person on the left",  # Visual object to isolate
+     })
+     mask = response["outputs"]["out_binary_masks"]
+     if mask.shape[0] == 0:
+         mask = np.zeros_like(frames[0, [0]], dtype=bool)
+     masks.append(mask[:1])
+
+ mask = torch.from_numpy(np.concatenate(masks)).unsqueeze(1)
+
+ # Process with visual prompting
+ inputs = processor(
+     audios=[video_file],
+     descriptions=[""],
+     masked_videos=processor.mask_videos([frames], [mask]),
+ ).to(device)
+
+ with torch.inference_mode():
+     result = model.separate(inputs)
+ ```
+
+ ### 3. Span Prompting (Temporal Anchors)
+
+ Specify time ranges where the target sound occurs or doesn't occur. This gives the model a specific example of what to isolate.
+
+ ```python
+ import torch
+ import torchaudio
+ from sam_audio import SAMAudio, SAMAudioProcessor
+
+ # Load model and processor
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = SAMAudio.from_pretrained("facebook/sam-audio-large").to(device).eval()
+ processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
+
+ # Load audio file (same placeholder path as in the text prompting example)
+ audio_file = "path/to/audio.wav"
+
+ # Define anchors: [type, start_time, end_time]
+ # "+" means the sound IS present in this time range
+ # "-" means the sound is NOT present in this time range
+ anchors = [
+     ["+", 6.3, 7.0],  # Sound occurs between 6.3 and 7.0 seconds
+ ]
+
+ # Process with span prompting
+ inputs = processor(
+     audios=[audio_file],
+     descriptions=["A horn honking"],
+     anchors=[anchors],
+ ).to(device)
+
+ with torch.inference_mode():
+     result = model.separate(inputs)
+ ```
+
+ **Example with multiple anchors:**
+ ```python
+ anchors = [
+     ["+", 2.0, 3.5],  # Sound present from 2.0 to 3.5 seconds
+     ["+", 8.0, 9.0],  # Sound present from 8.0 to 9.0 seconds
+     ["-", 0.0, 1.0],  # Sound NOT present from 0.0 to 1.0 seconds
+ ]
+ ```
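
The `[type, start_time, end_time]` anchor format in the README can be sanity-checked before it is handed to the processor. The following is a minimal sketch, not part of the commit or of `sam_audio`: `validate_anchors` is a hypothetical helper, and it assumes times are in seconds, that every span should satisfy `0 <= start < end`, and that sorting by start time is harmless for the processor.

```python
from typing import List, Optional, Sequence


def validate_anchors(
    anchors: Sequence[Sequence], duration: Optional[float] = None
) -> List[list]:
    """Hypothetical helper (not part of sam_audio): check anchor spans.

    Each anchor is [type, start_time, end_time], where type is "+"
    (sound present) or "-" (sound absent). Returns the anchors as
    lists, sorted by start time.
    """
    cleaned = []
    for anchor in anchors:
        if len(anchor) != 3:
            raise ValueError(f"expected [type, start, end], got {anchor!r}")
        kind, start, end = anchor
        if kind not in ("+", "-"):
            raise ValueError(f"anchor type must be '+' or '-', got {kind!r}")
        start, end = float(start), float(end)
        if start < 0 or end <= start:
            raise ValueError(f"invalid span {start}..{end}")
        # Assumption: spans must not run past the end of the clip.
        if duration is not None and end > duration:
            raise ValueError(f"span ends at {end}s, clip is {duration}s long")
        cleaned.append([kind, start, end])
    return sorted(cleaned, key=lambda a: a[1])


anchors = [
    ["+", 2.0, 3.5],
    ["+", 8.0, 9.0],
    ["-", 0.0, 1.0],
]
checked = validate_anchors(anchors, duration=10.0)
```

The checked list keeps the same shape as the README's `anchors`, so under these assumptions it could be passed on as `anchors=[checked]`.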
+
+ ## Output Format
+
+ The `model.separate()` method returns a result object with:
+ - `result.target`: The isolated sound (what you asked for)
+ - `result.residual`: Everything else (the remainder)
+
+
+ Both are `list[torch.Tensor]`, where each tensor is a 1D waveform.
+
+ ## Citation
+
+ If you use SAM-Audio in your research, please cite:
+
+ ```bibtex
+ @article{sam-audio,
+   title={SAM-Audio: Segment Anything in Audio},
+   author={Bowen Shi and Andros Tjandra and John Hoffman and Helin Wang and Yi-Chiao Wu and Luya Gao and Julius Richter and Matt Le and Apoorv Vyas and Sanyuan Chen and Christoph Feichtenhofer and Piotr Dollár and Wei-Ning Hsu and Ann Lee},
+   year={2025},
+   url={arxiv link coming soon}
+ }
+ ```
+
+ ## License
+
+ This project is licensed under the SAM License. See the [LICENSE](LICENSE) file for details.
checkpoint.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca55418b1d23e8c8a4dcc55f259d9801c8f79da0131a66e525d862c1289e3c4f
+ size 14861356211
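
The three lines added for `checkpoint.pt` are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. As a rough illustration, the sketch below parses that pointer and converts the size to GiB; `parse_lfs_pointer` is a hypothetical helper written for this note, not a function of any Git LFS or Hugging Face library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Hypothetical helper: split a Git LFS pointer into its fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash-algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:ca55418b1d23e8c8a4dcc55f259d9801c8f79da0131a66e525d862c1289e3c4f
size 14861356211
"""

info = parse_lfs_pointer(pointer)
size_gib = info["size_bytes"] / 2**30  # bytes -> GiB
print(f"{info['hash_algo']} digest, {size_gib:.2f} GiB checkpoint")
```

By this arithmetic the checkpoint is a little under 14 GiB, which is worth knowing before downloading it.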
config.json ADDED
@@ -0,0 +1,92 @@
+ {
+   "in_channels": 768,
+   "audio_codec": {
+     "encoder_dim": 64,
+     "encoder_rates": [
+       2,
+       8,
+       10,
+       12
+     ],
+     "latent_dim": 1024,
+     "decoder_dim": 1536,
+     "decoder_rates": [
+       12,
+       10,
+       8,
+       2
+     ],
+     "n_codebooks": 16,
+     "codebook_size": 1024,
+     "codebook_dim": 128,
+     "quantizer_dropout": false,
+     "sample_rate": 48000,
+     "mean": 0.0,
+     "std": 1.0
+   },
+   "text_encoder": {
+     "dim": 768,
+     "name": "t5-base",
+     "max_length": 512,
+     "pad_mode": "longest"
+   },
+   "vision_encoder": {
+     "dim": 1024,
+     "batch_size": 300,
+     "name": "PE-Core-L14-336",
+     "normalize_feature": true,
+     "interpolation_mode": "BICUBIC",
+     "image_size": 336
+   },
+   "transformer": {
+     "dim": 2816,
+     "n_heads": 22,
+     "n_layers": 22,
+     "dropout": 0.1,
+     "norm_eps": 1e-05,
+     "qk_norm": true,
+     "fc_bias": false,
+     "ffn_exp": 4,
+     "ffn_dim_multiplier": 1,
+     "multiple_of": 64,
+     "non_linearity": "swiglu",
+     "use_rope": true,
+     "max_positions": 10000,
+     "frequency_embedding_dim": 256,
+     "timestep_non_linearity": "swiglu",
+     "t_block_non_linearity": "silu",
+     "t_block_bias": true,
+     "context_dim": 2816,
+     "context_non_linearity": "swiglu",
+     "context_embedder_dropout": 0.0,
+     "context_norm": false,
+     "out_channels": 256,
+     "in_channels": null
+   },
+   "num_anchors": 3,
+   "anchor_embedding_dim": 128,
+   "visual_ranker": {
+     "checkpoint": null,
+     "kind": "imagebind"
+   },
+   "text_ranker": {
+     "rankers": {
+       "clap": [
+         {
+           "checkpoint": null,
+           "kind": "clap"
+         },
+         5.0
+       ],
+       "judge": [
+         {
+           "checkpoint_or_model_id": "facebook/sam-audio-judge",
+           "kind": "judge"
+         },
+         1.0
+       ]
+     },
+     "kind": "ensemble"
+   },
+   "span_predictor": "pe-a-frame-large"
+ }
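
A few derived quantities follow from the values in `config.json` by plain arithmetic. The sketch below is an interpretation, not something the commit states: it assumes `encoder_rates` are per-stage downsampling strides (as in DAC-style codecs), so that their product is the number of audio samples per latent frame, and it assumes the transformer splits `dim` evenly across `n_heads`.

```python
import math

# Values copied from the config.json added in this commit.
config = {
    "audio_codec": {
        "encoder_rates": [2, 8, 10, 12],
        "decoder_rates": [12, 10, 8, 2],
        "sample_rate": 48000,
    },
    "transformer": {"dim": 2816, "n_heads": 22},
}

codec = config["audio_codec"]

# Assumption: total downsampling factor is the product of the stage rates.
hop_length = math.prod(codec["encoder_rates"])

# Latent frames per second of audio under that assumption.
frame_rate = codec["sample_rate"] / hop_length

# The decoder rates mirror the encoder rates, so upsampling undoes the hop.
decoder_upsampling = math.prod(codec["decoder_rates"])

# Assumption: attention heads share the model width equally.
head_dim, remainder = divmod(
    config["transformer"]["dim"], config["transformer"]["n_heads"]
)

print(f"hop length: {hop_length} samples, frame rate: {frame_rate} Hz")
print(f"per-head dimension: {head_dim} (remainder {remainder})")
```

If the stride assumption holds, one second of 48 kHz audio maps to 25 latent frames, and the per-head width divides evenly.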