avelezarce committed (verified) · Commit 5c31c72 · Parent(s): aaab90f

Upload 11 files
MANIFEST.in ADDED
@@ -0,0 +1,4 @@
+ include geneformer/gene_median_dictionary_gc95M.pkl
+ include geneformer/gene_name_id_dict_gc95M.pkl
+ include geneformer/ensembl_mapping_dict_gc95M.pkl
+ include geneformer/token_dictionary_gc95M.pkl
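These `include` directives ship the tokenizer's pickled dictionaries as package data (together with `include_package_data=True` in setup.py below), so they are available after installation. A minimal sketch of loading one of them, assuming the package was installed with `pip install .`; the use of `importlib.resources` here is illustrative, not the project's documented API:

```python
import pickle
from importlib import resources

# Load the packaged token dictionary shipped via MANIFEST.in.
with resources.files("geneformer").joinpath("token_dictionary_gc95M.pkl").open("rb") as f:
    token_dict = pickle.load(f)

# The number of entries should line up with config.json's vocab_size (20275).
print(len(token_dict))
```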
README.md CHANGED
@@ -1,3 +1,70 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - single-cell
+ - genomics
+ base_model:
+ - ctheodoris/Geneformer
+ ---
+ # Geneformer
+ Geneformer is a foundational transformer model pretrained on a large-scale corpus of single-cell transcriptomes to enable context-aware predictions in network biology settings with limited data.
+
+ # Abstract
+ Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.
+
+ # Code
+ ```python
+ from tdc.model_server.tokenizers.geneformer import GeneformerTokenizer
+ from tdc import tdc_hf_interface
+ import torch
+
+ # Retrieve an anndata object (assumed already loaded as `adata`), then tokenize.
+ tokenizer = GeneformerTokenizer()
+ cells, _ = tokenizer.tokenize_cell_vectors(adata,
+                                            ensembl_id="feature_id",
+                                            ncounts="n_measured_vars")
+ # Note: you may need to pad or perform other custom data processing
+ # so that all cell vectors share one length before stacking.
+ input_tensor = torch.tensor(cells)
+
+ # Retrieve the model.
+ geneformer = tdc_hf_interface("Geneformer")
+ model = geneformer.load()
+
+ # Run inference. Assuming 0 was used as the padding token, the attention
+ # mask is simply "token is non-zero" at every position.
+ attention_mask = (input_tensor != 0).long()
+ outputs = model(input_tensor,
+                 attention_mask=attention_mask,
+                 output_hidden_states=True)
+
+ # Geneformer's second-to-last layer is the most generalized. quant_layers
+ # (a helper in recent Geneformer releases) returns the number of hidden layers.
+ from geneformer.perturber_utils import quant_layers
+ layer_to_quant = quant_layers(model) - 1
+ embs = outputs.hidden_states[layer_to_quant]
+ # The model yields "cls", "cell", and "gene" embeddings. Here we keep the
+ # per-gene embeddings, which are cell-type specific; for a "cell" embedding,
+ # average over the unmasked gene embeddings of each cell, as sketched below.
+ ```
+
+ # TDC Citation
+ ```
+ @inproceedings{velez-arce2024signals,
+ title={Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics},
+ author={Alejandro Velez-Arce and Xiang Lin and Kexin Huang and Michelle M Li and Wenhao Gao and Bradley Pentelute and Tianfan Fu and Manolis Kellis and Marinka Zitnik},
+ booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
+ year={2024},
+ url={https://openreview.net/forum?id=kL8dlYp6IM}
+ }
+ ```
+
+ # Additional Citations
+ - C V Theodoris#, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor#. Transfer learning enables predictions in network biology. _**Nature**_, 31 May 2023. (#co-corresponding authors)
+ - H Chen*, M S Venkatesh*, J Gomez Ortega, S V Mahesh, T Nandi, R Madduri, K Pelka†, C V Theodoris†#. Quantized multi-task learning for context-specific representations of gene network dynamics. _**bioRxiv**_, 19 Aug 2024. (*co-first authors, †co-senior authors, #corresponding author)
+
+ # Model HF Homepage
+ https://huggingface.co/ctheodoris/Geneformer
+
+ # Notes
+ We use the 20L-95M-i4096 release of Geneformer on TDC. This model was trained on the 95M version of Genecorpus.
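The final comment in the code example above mentions deriving a "cell" embedding by averaging the unmasked gene embeddings of each cell. A minimal sketch of that pooling step, assuming `embs` of shape `(batch, seq_len, hidden)` and the 0-padding attention mask from the example; the helper name `mean_pool_cell_embeddings` is hypothetical:

```python
import torch

def mean_pool_cell_embeddings(hidden_states: torch.Tensor,
                              attention_mask: torch.Tensor) -> torch.Tensor:
    """Average gene-level embeddings over unmasked positions per cell.

    hidden_states: (batch, seq_len, hidden) from one transformer layer.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)   # zero out padding, then sum over genes
    counts = mask.sum(dim=1).clamp(min=1)        # unmasked positions; avoid divide-by-zero
    return summed / counts                       # (batch, hidden)

# Dummy shapes matching the 20L-95M-i4096 config (hidden_size=896, max length 4096):
embs = torch.randn(2, 4096, 896)
attn = torch.ones(2, 4096, dtype=torch.long)
cell_embs = mean_pool_cell_embeddings(embs, attn)  # shape: (2, 896)
```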
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.02,
+   "classifier_dropout": null,
+   "hidden_act": "relu",
+   "hidden_dropout_prob": 0.02,
+   "hidden_size": 896,
+   "initializer_range": 0.02,
+   "intermediate_size": 1792,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 4096,
+   "model_type": "bert",
+   "num_attention_heads": 14,
+   "num_hidden_layers": 20,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.37.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 20275
+ }
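The config describes a 20-layer BERT-style encoder (hidden size 896, 14 attention heads, inputs up to 4096 ranked-gene tokens). A minimal sketch of loading it with Hugging Face transformers; the local path `"."` is an assumption standing in for a checkout of this repository:

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Hypothetical local checkout of this repository; a hub ID would also work.
path = "."

config = AutoConfig.from_pretrained(path)
print(config.num_hidden_layers, config.hidden_size, config.max_position_embeddings)
# -> 20 896 4096

# Loads the weights from model.safetensors into the BertForMaskedLM architecture.
model = AutoModelForMaskedLM.from_pretrained(path)
```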
generation_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "_from_model_config": true,
+   "pad_token_id": 0,
+   "transformers_version": "4.37.1"
+ }
gitattributes ADDED
@@ -0,0 +1,38 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
gitignore ADDED
@@ -0,0 +1,160 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # pdm
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+ #pdm.lock
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+ # in version control.
+ # https://pdm.fming.dev/#use-with-ide
+ .pdm.toml
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # PyCharm
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
+ #.idea/
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:db85c081a6d392448955c7d0185e26aba74507518df991ca8c69ee9108ce8bbf
+ size 605292732
pre-commit-config.yaml ADDED
@@ -0,0 +1,26 @@
+ # See https://pre-commit.com for more information
+ # See https://pre-commit.com/hooks.html for more hooks
+ repos:
+ - repo: https://github.com/pre-commit/pre-commit-hooks
+   rev: v3.2.0
+   hooks:
+   - id: trailing-whitespace
+   - id: end-of-file-fixer
+   - id: check-yaml
+   - id: check-added-large-files
+   - id: check-merge-conflict
+   - id: mixed-line-ending
+   - id: check-docstring-first
+ - repo: https://github.com/pycqa/isort
+   rev: 5.12.0
+   hooks:
+   - id: isort
+     args: ["--profile", "black"]
+ - repo: https://github.com/astral-sh/ruff-pre-commit
+   # Ruff version.
+   rev: v0.1.4
+   hooks:
+   # Run the Ruff linter.
+   - id: ruff
+   # Run the Ruff formatter.
+   - id: ruff-format
requirements.txt ADDED
@@ -0,0 +1,25 @@
+ anndata>=0.9
+ datasets>=2.12
+ hyperopt>=0.2
+ loompy>=3.0
+ matplotlib>=3.7
+ numpy>=1.23
+ optuna>=3.6
+ optuna-integration>=3.6
+ packaging>=23.0
+ pandas>=2.0
+ peft>=0.11.1
+ pyarrow>=12.0
+ pytz>=2023.0
+ ray>=2.6
+ scanpy>=1.9
+ scikit_learn>=1.2
+ scipy>=1.10
+ seaborn>=0.12
+ setuptools>=65.6
+ statsmodels>=0.14
+ tdigest>=0.5.2
+ tensorboard>=2.15
+ torch>=2.0.1
+ tqdm>=4.65
+ transformers>=4.40
setup.py ADDED
@@ -0,0 +1,42 @@
+ from setuptools import setup, find_packages
+
+ setup(
+     name="geneformer",
+     version="0.1.0",
+     author="Christina Theodoris",
+     author_email="christina.theodoris@gladstone.ucsf.edu",
+     description="Geneformer is a transformer model pretrained "
+     "on a large-scale corpus of single cell "
+     "transcriptomes to enable context-aware "
+     "predictions in settings with limited data in "
+     "network biology.",
+     packages=find_packages(),
+     python_requires=">=3.10",
+     include_package_data=True,
+     install_requires=[
+         "anndata",
+         "datasets",
+         "loompy",
+         "matplotlib",
+         "numpy",
+         "optuna",
+         "optuna-integration",
+         "packaging",
+         "pandas",
+         "peft",
+         "pyarrow",
+         "pytz",
+         "ray",
+         "scanpy",
+         "scikit-learn",
+         "scipy",
+         "seaborn",
+         "setuptools",
+         "statsmodels",
+         "tdigest",
+         "tensorboard",
+         "torch",
+         "tqdm",
+         "transformers",
+     ],
+ )
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5afed602918d6f0c4916c1b9335bcdb619bca2c6fd6c7e0dd2a86d195264b8cc
+ size 5048