A Personal Take On Computer Vision Literature Trends in 2024


I have been constantly following the computer vision (CV) and image synthesis research scene at Arxiv and elsewhere for around five years, so trends become evident over time, and they shift in new directions every year.

Therefore, as 2024 draws to a close, I thought it appropriate to take a look at some new or evolving characteristics in Arxiv submissions in the Computer Vision and Pattern Recognition section. These observations, though informed by hundreds of hours studying the scene, are strictly anecdata.

The Ongoing Rise of East Asia

By the end of 2023, I had noticed that the vast majority of the literature in the ‘voice synthesis’ category was coming out of China and other regions in east Asia. At the end of 2024, I have to observe (anecdotally) that this now applies also to the image and video synthesis research scene.

This doesn’t mean that China and adjoining countries are necessarily always outputting the best work (indeed, there is some evidence to the contrary); nor does it take account of the high likelihood in China (as in the west) that some of the most interesting and powerful new systems in development are proprietary, and excluded from the research literature.

But it does suggest that east Asia is beating the west by volume, in this regard. What that is worth depends on the extent to which you believe in the viability of Edison-style persistence, which usually proves ineffective in the face of intractable obstacles.

There are many such roadblocks in generative AI, and it is not easy to know which can be solved by addressing existing architectures, and which will need to be reconsidered from zero.

Though researchers from east Asia appear to be producing a greater number of computer vision papers, I have noticed a rise in the frequency of ‘Frankenstein’-style projects – initiatives that constitute a melding of prior works, while adding limited architectural novelty (or perhaps just a different type of data).

This year a far greater number of east Asian entries (primarily Chinese, or Chinese-involved collaborations) appeared to be quota-driven rather than merit-driven, significantly lowering the signal-to-noise ratio in an already over-subscribed field.

At the same time, a greater number of east Asian papers have also engaged my attention and admiration in 2024. So if this is all a numbers game, it is not failing – but neither is it cheap.

Increasing Volume of Submissions

The volume of papers, across all originating countries, has evidently increased in 2024.

The most popular publication day shifts throughout the year; at the moment it is Tuesday, when the number of submissions to the Computer Vision and Pattern Recognition section is often around 300-350 in a single day, in the ‘peak’ periods (May-August and October-December, i.e., conference season and ‘annual quota deadline’ season, respectively).

Beyond my own experience, Arxiv itself reports a record number of submissions in October of 2024, with 6000 total new submissions, and the Computer Vision section the second-most submitted section after Machine Learning.

However, since the Machine Learning section at Arxiv is often used as an ‘additional’ or aggregated super-category, this argues for Computer Vision and Pattern Recognition actually being the most-submitted Arxiv category.

Arxiv’s own statistics certainly depict computer science as the clear leader in submissions:

Source: https://info.arxiv.org/about/reports/submission_category_by_year.html

Stanford University’s 2024 AI Index, though not able to report on the most recent statistics yet, also emphasizes the notable rise in submissions of academic papers around machine learning in recent years:

With figures not available for 2024, Stanford's report nonetheless dramatically shows the rise of submission volumes for machine learning papers. Source: https://aiindex.stanford.edu/wp-content/uploads/2024/04/HAI_AI-Index-Report-2024_Chapter1.pdf


Diffusion>Mesh Frameworks Proliferate

Another clear trend that emerged for me was a significant upswing in papers that deal with leveraging Latent Diffusion Models (LDMs) as generators of mesh-based, ‘traditional’ CGI models.

Projects of this type include Tencent’s InstantMesh3D, 3Dtopia, Diffusion2, V3D, MVEdit, and GIMDiffusion, among a plenitude of similar offerings.

Mesh generation and refinement via a  Diffusion-based process in 3Dtopia. Source: https://arxiv.org/pdf/2403.02234


This emergent research strand could be taken as a tacit concession to the ongoing intractability of generative systems such as diffusion models, which only two years ago were being touted as a potential substitute for all the systems that diffusion>mesh models are now seeking to populate; relegating diffusion to the role of a tool in technologies and workflows that date back thirty or more years.

Stability.ai, originators of the open source Stable Diffusion model, have just released Stable Zero123, which can, among other things, use a Neural Radiance Fields (NeRF) interpretation of an AI-generated image as a bridge to create an explicit, mesh-based CGI model that can be used in CGI arenas such as Unity, in video-games, augmented reality, and in other platforms that require explicit 3D coordinates, as opposed to the implicit (hidden) coordinates of continuous functions.

Source: https://www.youtube.com/watch?v=RxsssDD48Xc
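To make the implicit/explicit distinction above concrete, here is a minimal sketch – not Stable Zero123's actual pipeline – of how a coordinate-based implicit representation can be turned into an explicit mesh: sample the field on a voxel grid and run marching cubes over it. The density_fn below is a stand-in sphere, purely for illustration.

# Minimal sketch: converting an implicit (coordinate-based) representation
# into an explicit mesh via marching cubes. Illustrative only; not the
# actual Stable Zero123 / NeRF-to-mesh pipeline.
import numpy as np
from skimage import measure  # pip install scikit-image

def density_fn(pts: np.ndarray) -> np.ndarray:
    """Hypothetical implicit field: a sphere of radius 0.5, standing in for
    a trained model's density output at the queried coordinates."""
    return 0.5 - np.linalg.norm(pts, axis=-1)

# Sample the implicit function on a regular grid of explicit coordinates.
res = 64
lin = np.linspace(-1.0, 1.0, res)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1)
values = density_fn(grid.reshape(-1, 3)).reshape(res, res, res)

# Marching cubes turns the sampled field into explicit vertices and faces –
# a mesh that a CGI package or game engine can actually address.
verts, faces, normals, _ = measure.marching_cubes(values, level=0.0)
print(f"mesh: {verts.shape[0]} vertices, {faces.shape[0]} triangles")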

3D Semantics

The generative AI space makes a distinction between 2D and 3D implementations of vision and generative systems. For instance, facial landmarking frameworks, though they deal with 3D objects (faces) in all cases, do not all necessarily calculate addressable 3D coordinates.

The popular FANAlign system, widely used in 2017-era deepfake architectures (among others), can accommodate both of these approaches:

Above, 2D landmarks are generated based solely on recognized face lineaments and features. Below, they are rationalized into 3D X/Y/Z space. Source: https://github.com/1adrianb/face-alignment

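For anyone wanting to see the 2D/3D distinction in practice, the snippet below is a brief sketch using the face-alignment library linked above; the enum member names vary slightly between library versions, and 'face.jpg' is a placeholder path rather than anything from the original project.

# Sketch: requesting 2D vs. 3D landmarks from the face-alignment library
# (https://github.com/1adrianb/face-alignment). Recent releases expose
# LandmarksType.TWO_D / THREE_D; older versions used _2D / _3D instead.
import face_alignment
from skimage import io

image = io.imread('face.jpg')  # placeholder input image

# 2D mode: 68 landmarks in image-space X/Y only.
fa_2d = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device='cpu')
# 3D mode: the same 68 points, rationalized into X/Y/Z space.
fa_3d = face_alignment.FaceAlignment(face_alignment.LandmarksType.THREE_D, device='cpu')

landmarks_2d = fa_2d.get_landmarks(image)
landmarks_3d = fa_3d.get_landmarks(image)

if landmarks_2d and landmarks_3d:
    print(landmarks_2d[0].shape)  # expected: (68, 2)
    print(landmarks_3d[0].shape)  # expected: (68, 3)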

So, just as ‘deepfake’ has become an ambiguous and hijacked term, ‘3D’ has likewise become a confusing term in computer vision research.

For consumers, it has typically signified stereo-enabled media (such as movies where the viewer has to wear special glasses); for visual effects practitioners and modelers, it provides the distinction between 2D artwork (such as conceptual sketches) and mesh-based models that can be manipulated in a ‘3D program’ like Maya or Cinema4D.

But in computer vision, it simply means that a Cartesian coordinate system exists somewhere in the latent space of the model – not that it can necessarily be addressed or directly manipulated by a user; at least, not without third-party interpretative CGI-based systems such as 3DMM or FLAME.

Therefore the notion of ‘image-to-3D’ is inexact; not only can any kind of image (including a real photo) be used as input to produce a generative CGI model, but the less ambiguous term ‘mesh’ is more appropriate.

However, to compound the ambiguity, diffusion is needed to interpret the source photo into a mesh, in the majority of emerging projects. So a better description might be ‘image-to-mesh’, while ‘image>diffusion>mesh’ is an even more accurate description.

But that is a hard sell at a board meeting, or in a publicity release designed to engage investors.

Evidence of Architectural Stalemates

Even in comparison with 2023, the last year’s crop of papers exhibits a growing desperation around removing the hard practical limits on diffusion-based generation.

The key stumbling block remains the generation of narratively and temporally consistent video, and maintaining a consistent appearance of characters and objects – not only across different video clips, but even across the short runtime of a single generated video clip.

The last epochal innovation in diffusion-based synthesis was the advent of Low Rank Adaptation (LoRA) in 2022. While newer systems such as Flux have improved on some of the outlier problems, such as Stable Diffusion’s former inability to reproduce text content inside a generated image, and overall image quality has improved, the majority of papers I studied in 2024 were essentially just moving the food around on the plate.
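As a reminder of why LoRA mattered, the toy sketch below shows the core low-rank idea in generic PyTorch terms – a frozen pretrained weight plus a small trainable update – rather than any particular diffusion codebase's implementation; the class name and dimensions are illustrative assumptions.

# Toy illustration of the LoRA idea: freeze the original weight and learn a
# low-rank update B @ A on top of it. Not tied to any specific repository.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # pretrained weights stay frozen
            p.requires_grad = False
        # Two small trainable matrices replace full fine-tuning.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank correction.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), rank=4)
out = layer(torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 768])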

These stalemates have occurred before, with Generative Adversarial Networks (GANs) and with Neural Radiance Fields (NeRF), both of which failed to live up to their apparent initial potential – and both of which are increasingly being leveraged in more conventional systems (such as the use of NeRF in Stable Zero123, see above). This also appears to be happening with diffusion models.

Gaussian Splatting Research Pivots

It seemed at the end of 2023 that the rasterization method 3D Gaussian Splatting (3DGS), which debuted as a medical imaging technique in the early 1990s, was set to suddenly overtake autoencoder-based systems in human image synthesis challenges (such as facial simulation and recreation, as well as identity transfer).

The 2023 ASH paper promised full-body 3DGS humans, while Gaussian Avatars offered massively improved detail (in comparison with autoencoder and other competing methods), along with impressive cross-reenactment.

This year, however, has been relatively short on any such breakthrough moments for 3DGS human synthesis; most of the papers that tackled the problem were either derivative of the above works, or failed to exceed their capabilities.

Instead, the emphasis in 3DGS research has been on improving its fundamental architectural feasibility, leading to a rash of papers that offer improved 3DGS exterior environments. Particular attention has been paid to Simultaneous Localization and Mapping (SLAM) 3DGS approaches, in projects such as Gaussian Splatting SLAM, Splat-SLAM, Gaussian-SLAM, and DROID-Splat, among many others.

Those projects that did attempt to continue or extend splat-based human synthesis included MIGS, GEM, EVA, OccFusion, FAGhead, HumanSplat, GGHead, HGM, and Topo4D. Though there are others besides, none of these outings matched the initial impact of the papers that emerged in late 2023.

The ‘Weinstein Era’ of Test Samples Is in (Slow) Decline

Research from south east Asia in general (and China in particular) often features test examples that are problematic to republish in a review article, because they feature material that is a little ‘spicy’.

Whether this is because research scientists in that part of the world are seeking to garner attention for their output is up for debate; but for the last 18 months, an increasing number of papers around generative AI (image and/or video) have defaulted to using young and scantily-clad women and girls in project examples. Borderline NSFW examples of this include UniAnimate, ControlNext, and even very ‘dry’ papers such as Evaluating Motion Consistency by Fréchet Video Motion Distance (FVMD).

This follows the general trend of subreddits and other communities that have gathered around Latent Diffusion Models (LDMs), where Rule 34 remains very much in evidence.

Celebrity Face-Off

This kind of inappropriate example overlaps with the growing recognition that AI processes should not arbitrarily exploit celebrity likenesses – particularly in studies that uncritically use examples featuring attractive celebrities, often female, and place them in questionable contexts.

One example is AnyDressing, which, besides featuring very young anime-style female characters, also liberally uses the identities of classic celebrities such as Marilyn Monroe, and current ones such as Anne Hathaway (who has denounced this kind of usage quite vocally).

Arbitrary use of current and 'classic' celebrities is still fairly common in papers from south east Asia, though the practice is slightly on the decline. Source: https://crayon-shinchan.github.io/AnyDressing/


In papers, this particular practice has been notably in decline throughout 2024, led by the larger releases from FAANG and other high-level research bodies such as OpenAI. Critically aware of the potential for future litigation, these major corporate players seem increasingly unwilling to represent even photorealistic people.

Though the systems they are creating (such as Imagen and Veo2) are clearly capable of such output, examples from western generative AI projects now trend towards ‘cute’, Disneyfied and very ‘safe’ images and videos.

Despite vaunting Imagen's capacity to create 'photorealistic' output, the samples promoted by Google Research are typically fantastical, 'family' fare –  photorealistic humans are carefully avoided, or minimal examples provided. Source: https://imagen.research.google/


Face-Washing

In the western CV literature, this disingenuous approach is particularly in evidence for customization systems – methods that are capable of creating consistent likenesses of a particular person across multiple examples (i.e., like LoRA and the older DreamBooth).

Examples include orthogonal visual embedding, LoRA-Composer, Google’s InstructBooth, and a multitude more.

Google's InstructBooth turns the cuteness factor up to 11, even though history suggests that users are more interested in creating photoreal humans than furry or fluffy characters. Source: https://sites.google.com/view/instructbooth


However, the rise of the ‘cute example’ is also seen in other CV and synthesis research strands, in projects such as Comp4D, V3D, DesignEdit, UniEdit, FaceChain (which concedes to more realistic user expectations on its GitHub page), and DPG-T2I, among many others.

The ease with which such systems (such as LoRAs) can be created by home users with relatively modest hardware has led to an explosion of freely-downloadable celebrity models at the civit.ai domain and community. Such illicit usage remains possible through the open sourcing of architectures such as Stable Diffusion and Flux.

Though it is often possible to punch through the safety features of generative text-to-image (T2I) and text-to-video (T2V) systems to produce material banned by a platform’s terms of use, the gap between the restricted capabilities of the best systems (such as RunwayML and Sora), and the unlimited capabilities of the merely performant systems (such as Stable Video Diffusion, CogVideo and local deployments of Hunyuan), is not really closing, as many believe.

Rather, these proprietary and open-source systems, respectively, threaten to become equally useless: expensive and hyperscale T2V systems may become excessively hamstrung due to fears of litigation, while the lack of licensing infrastructure and dataset oversight in open-source systems could lock them entirely out of the market as more stringent regulations take hold.

 
