The x34 crossover produced excessive noise; the continuity is very difficult to rationalize or to form a cohesive anchor for.
The problem is predominantly the boundaries. I ran multiple scans on the COCO data and found huge deviations in composite differentiation, which means the overall manifold would be massive.
Something like trying to represent 800B params in a single 84M-param space; even the best geometric alignment would require considerably more refined mathematics just to channel it. The full process would require a multi-stage relational conjecture, interpolation between positive and negative alignment, and a symbolic relational architecture on top, just to make sense of the noise itself.
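As a back-of-envelope check of the capacity gap described above (the 800B and 84M figures are from the text; the ratio itself is just arithmetic, not a claim about feasibility):

```python
# Capacity gap: representing an 800B-parameter space inside an
# 84M-parameter space implies roughly a ~9500x compression factor.
source_params = 800e9   # 800B, from the text
target_params = 84e6    # 84M, from the text
ratio = source_params / target_params
print(round(ratio))  # 9524
```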
It could be done; of that I am certain. It's just a matter of what the finished product of such a complex relational structure would look like, and how much data would be required to actually train it. The math lines up and the architecture lines up, but there is no reasonable horizon for the manifold.
Simply put, because of the boundaries, even if constrained, training 34 experts at runtime is beyond my scope. It would require full patch extraction, not just attenuation as in CaptionBERT. CaptionBERT uses a different form of differentiation that allows rapid pooled learning between multiple models from the same family, while the x34 would require pooled learning across adjacent families. Each family requires its own patch size, its own alignment formulas derived from that, and its own specific attenuation policy based on the adjacent differentiation.
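To make concrete what "34 experts at runtime" entails, here is a minimal sketch of top-k expert routing in the mixture-of-experts style. This is a generic illustration, not the author's architecture; the gate matrix, dimensions, and top-2 selection are all assumed for the example:

```python
import numpy as np

def route_topk(x, w_gate, k=2):
    """Route each token to its k highest-scoring experts.
    x: (tokens, dim), w_gate: (dim, n_experts). Hypothetical
    setup with 34 experts, matching the x34 discussion."""
    logits = x @ w_gate                        # (tokens, 34)
    top = np.argsort(logits, axis=1)[:, -k:]   # indices of the k best experts
    sel = np.take_along_axis(logits, top, axis=1)
    sel = sel - sel.max(axis=1, keepdims=True) # stable softmax over selected
    w = np.exp(sel)
    w /= w.sum(axis=1, keepdims=True)          # mixing weights sum to 1
    return top, w

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # 4 tokens, dim 8 (toy sizes)
w_gate = rng.standard_normal((8, 34))  # 34 experts
experts, weights = route_topk(x, w_gate)
print(experts.shape, weights.shape)  # (4, 2) (4, 2)
```

The routing itself is cheap; the difficulty the text points to is upstream of this step, in making 34 experts from adjacent families share one patch representation at all.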
Possible? Yes, very possible. It would be a true challenge and a real test for the architecture with a team of experts behind it, but I am only one researcher. I would be here for days figuring out how to attenuate the patch14-to-patch16 differences, only to yield little information given the hypersphere alignment.
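For reference, the standard way to bridge a patch14/patch16 mismatch in ViT-style encoders is to resample the positional-embedding grid (at a fixed 224px input, patch14 yields a 16x16 grid and patch16 a 14x14 grid). A minimal bilinear-resampling sketch, with toy dimensions assumed for the example:

```python
import numpy as np

def resize_pos_embed(pos, old_grid, new_grid):
    """Bilinearly resample a (old_grid**2, dim) positional-embedding
    table to (new_grid**2, dim), the usual fix when changing ViT
    patch size (e.g. patch14 -> patch16 at a fixed 224px input)."""
    dim = pos.shape[1]
    grid = pos.reshape(old_grid, old_grid, dim)
    # target sample coordinates expressed in the old grid's frame
    ys = np.linspace(0, old_grid - 1, new_grid)
    xs = np.linspace(0, old_grid - 1, new_grid)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, old_grid - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, old_grid - 1)
    wy = (ys - y0)[:, None, None]              # vertical blend weights
    wx = (xs - x0)[None, :, None]              # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bot = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    out = top * (1 - wy) + bot * wy
    return out.reshape(new_grid * new_grid, dim)

old = np.random.randn(16 * 16, 8)       # patch14 grid at 224px, toy dim 8
new = resize_pos_embed(old, 16, 14)     # resampled to the patch16 grid
print(new.shape)  # (196, 8)
```

This handles only the positional geometry; it says nothing about aligning the feature spaces themselves, which is the part the hypersphere-alignment concern above is really about.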