Even though 2D pixel rectangles are extremely useful for how we interface with electronics and with each other, if you really want to push the future, it’ll be in some form of 3D AR/VR: people pointing at virtual objects, graphs fluttering in mid-air, zooming map interfaces, actually meeting and talking across half a planet, and so on.
So let’s just let loose and completely explore the extremes. Do a little science fictioning before condensing it back to a real project.
So where does the current ecosystem break down? Let’s say we want to build one of three apps, each with a fully interactive group of people, some joining via AR at the physical location and some via VR:
Jenga – Two or more users sit around a virtual tower of blocks. Each in turn can pull out a block from the tower. When the tower falls, that player loses the game.
Conference Room – Two or more users sit together in a conference room. People can talk and react to each other like in a normal room.
Engineering Teacher – An engineering teacher shows students around a 3D engineering model. Students point at things and ask questions.
The first example is a typical experiment to see how fast the network communication can go. The second is what almost everyone is doing in VR nowadays, and the third is a slightly more applied idea that is also very popular in modern engineering. There are various startups built around all three examples.
So why bring this up if it already exists? Well, WYSIWYG editors had also existed for 20 years on various platforms before the Web, in the early 2000s, became mature enough to let developers build Web-based WYSIWYG editors. This very post is made with WordPress, a pretty successful one.
All three ideas are very time-consuming to build, and some of that work is already rather tedious. All three also suffer from various forms of commercial protectionism. There are many top teams working to solve these issues, but each within its own universe (the Oculus Quest 2 belongs to Facebook, Apple keeps its own ecosystem tightly closed, etc.). It’s not trivial to just get your (i)phone out and join a conversation for anything more than a web-based video call. It’s not trivial to just sit together and manipulate a 3D object in space together.
And that’s all great. But if the opportunity to architect something is there, what should a new ecosystem really look like to make these collaborative AR/VR apps easy?
One important thing in AR is understanding reality. By understanding reality, I mean figuring out the orientation and position of the AR device, understanding where faces and hands are, and understanding what the surroundings look like. This is essentially what ARKit, ARCore, WebXR, etc. do on their respective platforms.
However, it’s possible to improve models of reality by using multiple cameras and combining their information. The same goes for sound. So why not do exactly that? Allow the collaborative app to access all hardware of everyone involved. For the three examples, it would be necessary to at least synchronize each device’s understanding of the surroundings, or use some sort of shared anchors. That way, the users actually experience sitting around the Jenga tower or around the engineering model, whether they are in reality or in VR.
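To make the shared-anchor idea a bit more tangible, here is a minimal Rust sketch. All type names and the naive averaging scheme are my own assumptions for illustration, not an existing API; a real system would fuse pose estimates far more carefully, weighting by tracking confidence.

```rust
// Hypothetical sketch: a spatial anchor shared between devices, where
// each device contributes its own estimate of the anchor's pose.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Pose {
    position: [f32; 3],
    orientation: [f32; 4], // unit quaternion (x, y, z, w)
}

struct SharedAnchor {
    id: u64,
    // Each participating device's estimate of the anchor's pose,
    // already transformed into a common reference frame.
    estimates: Vec<Pose>,
}

impl SharedAnchor {
    // Naive consensus: average the position estimates. Good enough to
    // show the idea of agreeing on a common point in space.
    fn consensus_position(&self) -> [f32; 3] {
        let n = self.estimates.len() as f32;
        let mut sum = [0.0f32; 3];
        for p in &self.estimates {
            for i in 0..3 {
                sum[i] += p.position[i];
            }
        }
        [sum[0] / n, sum[1] / n, sum[2] / n]
    }
}
```

With something like this, the Jenga tower sits at the same consensus position for every participant, whether their device is a phone, a headset, or a purely virtual viewpoint.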
Now, can we push this further into the future?
Well, let’s start by not only maintaining physical anchors and depth information, but also going beyond the mouse-click, finger-pinch, and key-press events that apps can currently access.
Let’s add the full spectrum of human social interaction to this understanding, as flawlessly as the available shared hardware allows. The app should know where everything is, where everyone is, what their facial expressions are, what their hands and limbs are doing, where they are pointing and gesturing, and where they are looking.
And this is where we also briefly talk about speech recognition, except it needs to work really well…
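As a hedged sketch of what such an extended event vocabulary could look like, here is a Rust enum placing today’s events side by side with new social-signal events. Every type and variant name here is invented for illustration; nothing like this API exists yet.

```rust
// Hypothetical sketch: today's input events plus events derived from
// human social signals (gaze, pointing, facial expression, speech).

#[derive(Debug, Clone, PartialEq)]
enum Hand { Left, Right }

#[derive(Debug, Clone, PartialEq)]
enum Expression { Neutral, Smile, Frown, Surprised }

#[derive(Debug, Clone, PartialEq)]
enum InputEvent {
    // What apps can already access today.
    Click { x: f32, y: f32 },
    KeyPress(char),
    Pinch { strength: f32 },
    // New events based on human social signals.
    GazeAt { target: u64 },
    PointAt { target: u64, hand: Hand },
    FacialExpression { user: u64, expression: Expression },
    Speech { user: u64, text: String },
}

// An app could then react to social signals as easily as to clicks.
fn describe(event: &InputEvent) -> String {
    match event {
        InputEvent::Click { x, y } => format!("click at ({x}, {y})"),
        InputEvent::PointAt { target, .. } => format!("pointing at object {target}"),
        InputEvent::Speech { user, text } => format!("user {user} said: {text}"),
        _ => "other event".to_string(),
    }
}
```

The point of a single event stream is that, to the app, a student pointing at a beam in the engineering model is no more exotic than a mouse click is today.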
So, to recap in five overly ambitious sentences:
Greatly simplify the infrastructure and access needed to build sophisticated collaborative AR/VR apps for group activities.
Make it accessible on every possible platform in a uniform way (the role the Web plays for 2D pixel surfaces).
Share sensors and AR capabilities from all involved hardware.
Maintain a very accurate model of reality.
Extract a rich variety of new input events based on human social signals.
So how would one go about designing an architecture for the underlying browser ecosystem that can do this?
Easy… Let’s start with:
Open source.
Rust (robust).
A WASM microkernel (perhaps this is https://orbitalweb.github.io/ or https://github.com/faasm/faasm ?).
Distributed shared memory.
Lots of implementations of communication standards (maybe like ROS from robotics?).
Leveraging the AR capabilities available on different devices (ARKit has awesome face reading).
Synchronization approaches between the devices (what is available for this?).
Various implementations of human social signals (by leveraging existing device capabilities, and later also deep learning).
And a really good, long discussion on the various APIs that we could offer developers of new apps.
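To make that API discussion slightly less abstract, here is a purely speculative Rust sketch of one slice of a developer-facing API: a queryable shared world model. The trait and type names are invented for this sketch, and a toy in-memory struct stands in for the real distributed shared memory.

```rust
// Hypothetical sketch: the contract an app could program against,
// independent of which devices (phone, headset, browser) are
// contributing sensor data to the shared world model.

#[derive(Debug, Clone, Copy, PartialEq)]
struct AnchorId(u64);

#[derive(Debug, Clone, Copy, PartialEq)]
struct ParticipantId(u64);

trait WorldModel {
    fn anchors(&self) -> Vec<AnchorId>;
    fn participants(&self) -> Vec<ParticipantId>;
    fn is_looking_at(&self, who: ParticipantId, what: AnchorId) -> bool;
}

// Toy in-memory implementation, standing in for the distributed
// shared memory that would back such a model in practice.
struct LocalWorld {
    anchors: Vec<AnchorId>,
    participants: Vec<ParticipantId>,
    gaze: Vec<(ParticipantId, AnchorId)>,
}

impl WorldModel for LocalWorld {
    fn anchors(&self) -> Vec<AnchorId> { self.anchors.clone() }
    fn participants(&self) -> Vec<ParticipantId> { self.participants.clone() }
    fn is_looking_at(&self, who: ParticipantId, what: AnchorId) -> bool {
        self.gaze.contains(&(who, what))
    }
}
```

The design choice worth debating is exactly this separation: apps query one shared model, while the ecosystem underneath decides how sensor fusion, synchronization, and social-signal extraction fill it in.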
This is the current state of the idea. I’ll post more as the idea matures and becomes more realistic. I’m sure I haven’t seen all possibilities yet, and I probably forgot important aspects, or am not aware of already existing solutions…
If you have anything to add or comment, please do!