Update README.md

### Training Data

- **Preliminary:** The model was trained on the [Pre-Training Dataset](https://huggingface.co/datasets/aoxo/latent_diffusion_super_sampling); the decoder layers were then frozen to fine-tune it on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v). The former includes over 400,000 frames drawn from over 9 video games (including Watch Dogs 2, Grand Theft Auto V, and Cyberpunk), several Hollywood films, and high-definition photos. The latter comprises ~25,000 high-definition semantic segmentation map / rendered frame pairs, captured in-game from Grand Theft Auto V with a UNet-based semantic segmentation model.
- **Latest:** The latest model was trained purely on the [Calibration Dataset for Grand Theft Auto V](https://huggingface.co/datasets/aoxo/photorealism-style-adapter-gta-v), a composition of over 1.24 billion real-world images and over 117 million in-game captured frames.

### Training Procedure

Images and their corresponding style semantic maps were resized to **512 x 512**.

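The resize step can be sketched as follows. This is a minimal illustration that assumes frames are H × W × 3 `uint8` arrays and semantic maps are H × W integer label arrays; nearest-neighbor keeps segmentation labels discrete, while a real pipeline would likely use bilinear filtering for the RGB frames.

```python
import numpy as np

def resize_nearest(arr: np.ndarray, size: int = 512) -> np.ndarray:
    """Nearest-neighbor resize to size x size; safe for integer label maps."""
    h, w = arr.shape[:2]
    rows = (np.arange(size) * h) // size  # source row for each output row
    cols = (np.arange(size) * w) // size  # source column for each output column
    return arr[rows][:, cols]

# Hypothetical frame / semantic-map pair (random stand-ins, not dataset samples).
frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
seg_map = np.random.randint(0, 30, (1080, 1920), dtype=np.int64)

frame_512 = resize_nearest(frame)   # a real pipeline might use bilinear here
seg_512 = resize_nearest(seg_map)   # labels must stay discrete, so nearest-neighbor

print(frame_512.shape, seg_512.shape)  # (512, 512, 3) (512, 512)
```
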
#### Training Hyperparameters

**v1**

- Precision: FP32
- Embedded dimensions: 768
- Hidden dimensions: 3072
- Attention Type: Linear Attention

…

- Style Transfer Module: AdaIN (Adaptive Instance Normalization)

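AdaIN aligns the per-channel statistics of content features to those of the style features. A minimal numpy sketch (the feature shapes are illustrative assumptions, not the model's actual tensors):

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive Instance Normalization: give the content features the
    per-channel mean/std of the style features."""
    c_mu, c_std = content.mean(axis=0), content.std(axis=0) + eps
    s_mu, s_std = style.mean(axis=0), style.std(axis=0) + eps
    return s_std * (content - c_mu) / c_std + s_mu

rng = np.random.default_rng(0)
content = rng.standard_normal((64, 768))            # 64 positions, 768 channels (dims from above)
style = 2.0 * rng.standard_normal((64, 768)) + 1.0  # style features with shifted statistics
stylized = adain(content, style)

# stylized now matches the style statistics channel-wise
print(np.allclose(stylized.mean(axis=0), style.mean(axis=0), atol=1e-4))  # True
```
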
**v2**

- Precision: FP32
- Embedded dimensions: 768
- Hidden dimensions: 3072
- Attention Type: Location-Based Multi-Head Attention (Linear Attention)

…

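Both v1 and v2 list linear-attention variants. The following toy sketch shows the kernelized linear-attention idea with φ(x) = elu(x) + 1 — one common formulation, and an assumption on my part about the exact module used here:

```python
import numpy as np

def elu_plus_one(x: np.ndarray) -> np.ndarray:
    # phi(x) = elu(x) + 1 keeps features positive, a common linear-attention kernel
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention: phi(Q) (phi(K)^T V) instead of softmax(Q K^T) V."""
    q, k = elu_plus_one(q), elu_plus_one(k)
    kv = k.T @ v                                   # (d, d_v) summary, independent of sequence length
    z = q @ k.sum(axis=0, keepdims=True).T + eps   # per-row normalizer
    return (q @ kv) / z

rng = np.random.default_rng(0)
n, d = 16, 768  # 768 matches the embedded dimensions listed above
q, k, v = (rng.standard_normal((n, d)) * 0.02 for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (16, 768)
```

The summary matrix `kv` is what makes the cost linear in sequence length, versus the quadratic score matrix of softmax attention.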
- Precision: FP32, FP16, BF16, INT8
- Embedding Dimensions: 768
- Hidden Dimensions: 3072
- Attention Type: Location-Based Multi-Head Attention (Linear Attention) and Cross-Attention (Pretrained Attention-Guided)
- Number of Attention Heads: 32
- Number of Attention Layers: 16
- Number of Transformer Encoder Layers (Feed-Forward): 16

…

- Swin Window Size: 7
- Swin Shift Size: 2
- Style Transfer Module: Style Adaptive Layer Normalization (SALN)
- Style Encoder: Custom MultiScale Style Encoder

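The precision list above (FP32, FP16, BF16, INT8) implies the weights can be cast or quantized for inference. As an illustration, here is a generic symmetric per-tensor INT8 quantization sketch — not necessarily the scheme this model actually ships with:

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# One hypothetical FFN weight matrix, using the 768 / 3072 dimensions listed above.
w = (rng.standard_normal((768, 3072)) * 0.02).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)  # reconstruction error is bounded by scale / 2
print(q.dtype, w_hat.dtype)   # int8 float32
```

Storing `q` plus one scale cuts the weight footprint to roughly a quarter of FP32, at the cost of the rounding error noted in the comment.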
#### Speeds, Sizes, Times

**Model size:** There are currently four definitive versions of the model:

- v1_1: 224M params
- v1_2: 200M params
- v1_3: 93M params