AI Papers to Read in 2025


I'm back with my series of AI paper recommendations. My long-term followers might recall the four previous editions ([1], [2], [3], and [4]). I've been away from writing for quite a while, and I couldn't think of a better way to return than by resuming my most successful series, and the one I probably enjoyed writing the most.

For the uninitiated, this is a very opinionated list, filled with perspectives and tangents, meant to keep you updated on AI as a whole. This is not a list of state-of-the-art models but real insight into what to look for in the coming years and what you may have missed from the past. The goal is to help you think critically about the state of AI.

In total, there are ten paper suggestions, each with a brief description of the paper's contribution and explicit reasons why it is worth reading. Furthermore, each has a dedicated further-reading section with a few tangents to explore.

Before we move on: back in my 2022 article, I kicked off by saying I was pretty sure I would end up repeating myself, that a new GPT model would just be a bigger and marginally better model, far from groundbreaking. I was wrong. Since its release, ChatGPT has sparked many new solutions and is definitely a turning point in all of computer science.

Last but not least, as a small disclaimer: most of my AI work centers around Computer Vision, so there are likely many excellent papers out there on topics such as Reinforcement Learning, Graphs, and Audio that are simply not on my radar. If there's any paper you believe I should know about, please let me know ❤.

Let’s go!


#1 DataPerf: A Benchmark for Data-Centric AI

Mazumder, Mark, et al. "DataPerf: Benchmarks for data-centric AI development." (2022).

From 2021 to 2023, Andrew Ng was very vocal about data-centric AI: shifting our focus from evolving models over static datasets towards evolving the datasets themselves, while keeping the models static or mostly unchanged. In their own words, our current model-centric research philosophy neglects the fundamental importance of data.

In practical terms, it is often the case that increasing the dataset size, correcting mislabeled entries, and removing bogus inputs is far more effective at improving a model's output than increasing its size, number of layers, or training time.
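To make this concrete, here is a minimal sketch of one such data-centric step: flagging likely mislabeled samples by how little confidence an out-of-fold model puts on their given labels. The file names, the logistic-regression probe, and the budget of 50 samples to review are placeholders of mine, not something prescribed by DataPerf.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Placeholder features/labels; labels assumed to be integers 0..K-1.
X, y = np.load("features.npy"), np.load("labels.npy")

# Out-of-fold probabilities: every sample is scored by a model that never saw it.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Confidence the probe assigns to the *given* label; the lowest ones deserve a manual look.
given_label_confidence = proba[np.arange(len(y)), y]
review_first = np.argsort(given_label_confidence)[:50]
print("Indices to review first:", review_first)
```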

In 2022, the authors proposed DataPerf, a benchmark for data-centric AI development, including tasks on speech, vision, debugging, acquisition, and adversarial problems, alongside the DataPerf working group. The initiative aims to foster data-aware methods and seeks to close the gap between the data departments of many companies and academia.

Reason 1: Most, if not all, companies working on niche topics end up developing internal datasets. It's wild how little research exists on how to do that properly/better.

Reason 2: A reflection: how many papers provide a solid 2% improvement over the State-of-the-Art (SOTA) nowadays? How much additional data would you need to boost your accuracy by 2%?

Reason 3: For the rest of your career, you will wonder: what if, instead of doing the proposed X, we had just collected more data?

Reason 4: If you're in academia, stuck with some X or Y dataset, trying to figure out how to get a 0.1% improvement over SOTA, know that life can be much more than that.

Further Reading: In 2021, it all started with Deeplearning.AI hosting a data-centric AI competition. You can read about the winner's approach here. Since then, there has been plenty of work dedicated to the topic by other authors, for instance, 2023's Data-centric Artificial Intelligence: A Survey. Finally, if you are a Talks kind of person, there are many by Andrew Ng on YouTube championing the subject.


#2 GPT-3 / LLMs are Few-Shot Learners

Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877–1901.

This NeurIPS paper presented GPT-3 to the world. OpenAI's third-generation model was, in almost every way, just a much bigger GPT-2, with 116 times more parameters, trained on 50 times more data. Their biggest finding wasn't that it was just "better" but that how you prompted it could drastically improve its performance on many tasks.

Machine Learning models are often expressed as predictable functions: given the same input, they will always yield the same output. With current Large Language Models (LLMs), however, the same question can be posed and answered in many different ways: wording matters.
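To make the point concrete, here is the few-shot setup expressed as plain prompt strings (the sea otter / peppermint pairs are taken from the paper's own illustration of the task); the model is identical in both cases, only the input wording changes.

```python
# Zero-shot: the task is described, but no examples are given.
zero_shot = "Translate English to French:\ncheese =>"

# Few-shot: the same task, now preceded by a handful of worked examples.
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# Both strings go to the exact same frozen model; the few-shot prompt typically
# performs far better, which is the paper's central observation.
```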

Reason 1: Previously, we discussed keeping models static while we evolve the dataset. With LLMs, we can evolve the questions we ask.

Reason 2: GPT-3 sparked the field of prompt engineering. After it, we began seeing authors propose techniques like Chain-of-Thought (CoT) and Retrieval-Augmented Generation (RAG).

Reason 3: Prompting well is far more important than knowing how to train or fine-tune LLMs. Some people say prompting is dead, but I don't see that ever happening. Ask yourself: do you word requests the same way when addressing your boss, your mom, or your friends?

Reason 4: When transformers came out, most research targeted their training/inference speed and size. Prompting is a genuinely fresh topic in natural language processing.

Reason 5: It's funny when you realize that the paper doesn't really propose anything; it just makes an observation. It has 60k citations, though.

Further Reading: Prompting reminds me of ensemble models. Instead of repeatedly prompting a single model, we could train several smaller models and aggregate their outputs. Now nearly three decades old, the AdaBoost paper is a classic on the subject and a read that will take you back to way before even word embeddings were a thing. Fast forward to 2016, and a modern classic is XGBoost, which is now on its v3 release.


#3 Flash Attention

Dao, Tri, et al. "FlashAttention: Fast and memory-efficient exact attention with IO-awareness." Advances in Neural Information Processing Systems 35 (2022): 16344–16359.

Ever since the groundbreaking 2017 paper "Attention Is All You Need" introduced the Transformer architecture and the attention mechanism, several research groups have dedicated themselves to finding a faster and more scalable alternative to the original quadratic formulation. While many approaches have been devised, none has really emerged as a clear successor to the original work.

The original Attention formulation. The softmax term represents how important each token is to each query (so for N tokens, we have N² attention scores). The "transform" (in the name Transformer) is the multiplication between this N² attention map and the N-row V matrix (much like a rotation matrix "transforms" a 3D vector).
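For reference, this is the formulation the caption describes, written out:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension.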

In this work, the authors don't propose a new formulation or a clever approximation of the original formula. Instead, they present a fast GPU implementation that makes better use of the (complicated) GPU memory hierarchy. The proposed method is significantly faster while having little to no drawbacks over the original.
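As an illustration of the "same math, better implementation" idea, here is a minimal PyTorch sketch contrasting the naive formulation with the fused kernel exposed by scaled_dot_product_attention, which can dispatch to a FlashAttention-style backend on supported GPUs. The tensor sizes are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

# (batch, heads, tokens, head_dim); CUDA + half precision to enable the fused backends.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Naive attention: materializes the full 1024 x 1024 attention matrix per head.
scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
naive = torch.softmax(scores, dim=-1) @ v

# Fused attention: same math, computed in tiles so the N^2 matrix never
# has to sit in slow GPU memory all at once.
fused = F.scaled_dot_product_attention(q, k, v)

print((naive - fused).abs().max())  # only a small numerical difference in FP16
```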

Reason 1: Many research papers get rejected because they are just new implementations or not "novel enough". Sometimes, that's exactly what we need.

Reason 2: Research labs crave the attention of being the new Attention, to the point that it's hard for any new Attention to ever get enough attention. In this instance, the authors only improve what already works.

Reason 3: Looking back, ResNet was groundbreaking for CNNs back in the day, proposing the residual block. In the following years, many proposed enhancements to it, varying the residual block idea. Despite all that effort, most people just stuck with the original idea. In a research field as crowded as AI, you should stay cautious about all things that have many proposed successors.

Further Reading: From time to time, I consult Sik-Ho Tsang's list of papers he reviews here on Medium. Each section reveals the leading ideas for each area over time. It's a bit sad how many of these papers might have seemed groundbreaking and are now completely forgotten. Back to Attention: as of 2025, the hottest attention-replacement candidate is the Sparse Attention by the DeepSeek team.


#4 Training NNs with Posits

Raposo, Gonçalo, Pedro Tomás, and Nuno Roma. "PositNN: Training deep neural networks with mixed low-precision posit." 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.

Taking a turn into the world of hardware and low-level optimization, some of the most important (but least sexy) advancements in AI training are related to floating-point formats. We went from boring 32-bit floats to halves, then 8-bit and even 4-bit floats (FP4). The horsepower driving LLMs today comes from eight-bit ponies.

The future of number formats goes hand-in-hand with matrix-matrix multiplication hardware. Nonetheless, there can be far more to this topic than just halving the bit-depth. This paper, for instance, explores an entirely new number format (posits) as a possible substitute for good old IEEE-754 floats. Can you imagine a future without floats?
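For contrast with today's mainstream practice, here is a minimal sketch of standard mixed-precision training in PyTorch (FP16 compute, FP32 master weights). The tiny model and random data are placeholders, and posits themselves would require dedicated hardware or software emulation rather than this API.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # keeps FP16 gradients from underflowing

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(100):
    optimizer.zero_grad()
    # Matmuls run in half precision where safe; the master weights stay in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```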

Reason 1: While new algorithms take time to find widespread adoption, hardware improves consistently every year. All ships rise with the hardware tide.

Reason 2: It's worth wondering how far along we would be today if we hadn't had as many GPU improvements over the past ten years. For reference, the AlexNet authors broke all ImageNet records in 2012 using two high-end GTX 580 GPUs, a total of three TFLOPs. Nowadays, a mid-range GPU, such as an RTX 5060, boasts ~19 TFLOPs, about six times more.

Reason 3: Some technologies are so common that we take them for granted. All things can and should be improved; we don't owe anything to floats (or even Neural Networks, for that matter).

Further Reading: Since we're mentioning hardware, it's also time to talk about programming languages. If you haven't been keeping up with the news, the Python team (especially Python's creator) is focused on optimizing Python. However, "optimization" nowadays seems to be slang for rebuilding stuff in Rust. Last but not least, some hype was dedicated to Mojo, an AI/speed-focused superset of Python; however, I barely see anyone talking about it today.


#5 AdderNet

Chen, Hanting, et al. "AdderNet: Do we really need multiplications in deep learning?" Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

What if we didn't do matrix multiplication at all? This paper goes a completely different route, showing it is possible to build effective neural networks without matrix multiplication. The basic idea is to replace convolutions with computing the L1 difference between the input and the sliding filters.
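Here is a minimal sketch of that core idea, a naive re-implementation of my own rather than the authors' optimized code: a "convolution" whose response is the negative L1 distance between each sliding patch and each filter, so the similarity itself uses only additions, subtractions, and absolute values.

```python
import torch
import torch.nn.functional as F

def adder_conv2d(x, weight, stride=1, padding=0):
    """Adder 'convolution': similarity = negative L1 distance to each filter."""
    B, C_in, H, W = x.shape
    C_out, _, K, _ = weight.shape
    # Unfold the input into sliding patches: (B, C_in*K*K, L), with L output positions.
    patches = F.unfold(x, kernel_size=K, stride=stride, padding=padding)
    w = weight.view(1, C_out, -1, 1)                    # (1, C_out, C_in*K*K, 1)
    # Negative L1 distance between every patch and every filter
    # (no dot products anywhere in the similarity).
    out = -(patches.unsqueeze(1) - w).abs().sum(dim=2)  # (B, C_out, L)
    H_out = (H + 2 * padding - K) // stride + 1
    W_out = (W + 2 * padding - K) // stride + 1
    return out.view(B, C_out, H_out, W_out)

# Quick shape check with random data.
x = torch.randn(2, 3, 32, 32)
w = torch.randn(16, 3, 3, 3)
print(adder_conv2d(x, w, padding=1).shape)  # torch.Size([2, 16, 32, 32])
```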

I like to think of this paper as the "alternate world" neural network. In some parallel universe, NNs evolved based on addition, and amidst it all, someone proposed a multiplication-based model; however, it never got traction since all the tooling and hardware were neck-deep in optimizing massive matrix addition and subtraction operators.

Reason 1: We easily forget there are still other algorithms out there we've yet to find, besides CNNs and Transformers. This paper shows that an addition-based neural network is possible; how cool is that?

Reason 2: A lot of our hardware and cloud infrastructure is tuned for matrix multiplication and neural networks. Can new kinds of models still compete? Can non-neural approaches still make a comeback?

Further Reading: Many of you might not be familiar with what existed before NNs took over most fields. Most people know staples like Linear Regression, Decision Trees, and XGBoost. Before NNs became popular, Support Vector Machines were all the rage. It's been a while since I last saw one. In this regard, a cool paper to read is Deep Learning is Not All You Need.

Support Vector Machines learn to separate two groups of points with the best separating line possible. By using the Kernel Trick, these points are cast into a higher-dimensional space, in which a better separating plane can be found, achieving a non-linear decision boundary while maintaining the linear formulation. It's an elegant solution worth learning about. Source.
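If you want to see the kernel trick in action, here is a minimal scikit-learn sketch on a toy dataset of concentric circles, where a linear SVM struggles and an RBF-kernel SVM separates the classes almost perfectly.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line in 2D.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

print("Linear kernel accuracy:", linear.score(X, y))  # poor, near chance
print("RBF kernel accuracy:", rbf.score(X, y))        # close to 1.0
```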

#6 Interpolation vs Extrapolation

Balestriero, Randall, Jerome Pesenti, and Yann LeCun. "Learning in high dimension always amounts to extrapolation." (2021).

Some time ago, I used to think the big names in AI were visionaries, or at least had very good educated guesses about the future of the field. That changed with this paper and all the debate that followed.

Back in 2021, Yann LeCun pushed this discussion about interpolation versus extrapolation, claiming that in high-dimensional spaces, such as those of all neural networks, what we call "learning" is actually data extrapolation. Right after publication, many renowned names joined in, some claiming this was nonsense, some that it was still interpolation, and some taking the extrapolation side.

If you have never heard about this discussion… that only shows how pointless it really was. As far as I could see (and please write me if you think otherwise), no company changed course, no new extrapolation-aware model was devised, nor did it spark relevant new training techniques. It came and it went.

Reason 1: To be honest, you can just skip this one. I just needed to rant about it for my own peace of mind.

Reason 2: From a purely academic perspective, I consider this an interesting take on learning theory, which is indeed a cool topic.

Further Reading: Yoshua Bengio, Geoffrey Hinton, and Yann LeCun were awarded the 2018 Turing Award for their pioneering work on Deep Learning foundations. Back in 2023 or so, LeCun was focused on self-supervised learning, Hinton was concerned with Capsule Networks, and Bengio was working on Generative Flow Networks. By late 2025, LeCun had moved towards world models while Hinton and Bengio had moved towards AI Safety. If you are second-guessing your academic choices, keep in mind that even the so-called godfathers switch gears.


#7 DINOv3 / Foundation Vision Models

Siméoni, Oriane, et al. “DINOv3.” (2025).

While the world of language processing has evolved to use big universal models that work for every task (aka foundation models), the field of image processing is still working its way up to that. In this paper, we see the current iteration of the DINO model, a self-supervised image model designed to be the foundation model for Vision.
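If you want to see what "foundation model for Vision" means in practice, here is a minimal feature-extraction sketch. It uses the DINOv2 checkpoints published on torch.hub, since those entry points are well established; the DINOv3 release follows the same spirit, but check the official repository for its exact loading code.

```python
import torch

# Load a small self-supervised ViT backbone; no labels were used to train it.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# A single fake 224x224 RGB image (in real use, normalize with the model's stats).
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    embedding = model(image)  # one global feature vector per image

# These frozen features can back classification, retrieval, segmentation heads, etc.
print(embedding.shape)
```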

Reason 1: Self-supervised pretraining is still evolving in other problem areas compared to text, especially when done entirely within the problem domain (as opposed to adding text descriptions to help it).

Reason 2: Don't read only language papers, even if your job is working with LLMs. Variety is important.

Reason 3: Language models can only go so far towards AGI. Vision is paramount for human-like intelligence.

Further Reading: Continuing on the Vision topic, it's worth knowing about YOLO and the Segment Anything Model. The former is a staple for object detection (but also boasts versions for other problems) while the latter is for image segmentation. Regarding image generation, I find it funny that just a few years back we would all talk about GANs (generative adversarial networks), and nowadays it's likely that many of you have never heard of one. I even wrote a list like this for GAN papers a couple of years ago.


#8 Small Language Models are the Future

Belcak, Peter, et al. "Small language models are the future of agentic AI." (2025).

The field of "Generative AI" is quickly being rebranded to "Agentic AI". As people try to figure out how to make money with it, they bleed VC money running behemoth models. In this paper, the authors argue that Small Language Models (< 10B params, by their definition) are the future of Agentic AI development.

In more detail, they argue that most subtasks executed in agentic solutions are repetitive, well-defined, and non-conversational. Therefore, LLMs are somewhat of an overkill. If you add fine-tuning to the mix, SLMs can easily become specialized agents, whereas LLMs thrive on open-ended tasks.

Reason 1: What we call "large" language models today might just as well be the "small" of tomorrow. Learning about SLMs is future-proofing.

Reason 2: Many people claim AI today is heavily subsidized by VC money. In the near future, we might see a huge increase in AI costs. Using SLMs might be the only option for many businesses.

Reason 3: This one is super easy to read. In fact, I think it's the first time I have read a paper that so explicitly defends a thesis.

Further Reading: Smaller models are the only option for edge AI / low-latency execution. When applying AI to video streams, the model plus post-processing must execute in less than 33 ms for a 30fps stream. You can't round-trip to a cloud or batch frames. Nowadays, there are a number of tools like Intel's OpenVINO, NVIDIA's TensorRT, or TensorFlow Lite for fast inference on limited hardware.


#9 The Lottery Ticket Hypothesis (2019)

As a follow-up to small models, some authors have shown that we most likely aren't training our networks' parameters to their fullest potential. This is "humans only use 10% of their brains" applied to neural networks. In this literature, the Lottery Ticket Hypothesis is easily one of the most intriguing papers I've seen.

Frankle found that if you (1) train a big network, (2) prune all low-valued weights, (3) roll the pruned network back to its untrained state, and (4) retrain it, you'll get a better-performing network. Putting it differently, what training does is uncover a subnetwork whose initial random parameters are aligned with solving the problem; everything else is noise. By leveraging this subnetwork alone, we can surpass the original network's performance. Unlike basic network pruning, this actually improves the result.
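Here is a minimal sketch of that four-step recipe in plain PyTorch. The model, the stubbed-out training loop, and the 80%-keep pruning rate are placeholders of mine; the paper's full procedure is applied iteratively.

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
init_state = copy.deepcopy(model.state_dict())   # (0) remember the random init

def train(model):
    ...  # placeholder for an ordinary training loop

train(model)                                     # (1) train the big network

# (2) prune: build a mask keeping only the largest-magnitude 80% of each weight matrix.
masks = {}
for name, param in model.named_parameters():
    if "weight" in name:
        k = int(0.2 * param.numel())             # number of weights to drop
        threshold = param.abs().flatten().kthvalue(k).values
        masks[name] = (param.abs() > threshold).float()

model.load_state_dict(init_state)                # (3) rewind to the untrained weights

def apply_masks(model):
    # (4) during retraining, call this after every optimizer step so pruned weights stay zero.
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])

apply_masks(model)
train(model)  # retrain only the surviving subnetwork, the "winning ticket"
```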

Reason #1: We're accustomed to "larger models are better but slower" whereas "small models are dumb but fast". Perhaps we're the dumb ones who insist on big models all the time.

Reason #2: An open question is how underutilized our parameters are. Likewise, how can we use our weights to their fullest? Or even, is it possible to measure a NN's learning potential at all?

Reason #3: How many times have you cared about how your model parameters were initialized before training?

Further Reading: While this paper is from 2018, there is a 2024 survey on the hypothesis. On a contrasting note, "The Role of Over-Parameterization in Machine Learning — the Good, the Bad, the Ugly" (2024) discusses how over-parameterization is what really powers NNs. On the more practical side, this survey covers the topic of Knowledge Distillation: using a big network to train a smaller one to perform as closely to it as possible.


#10 AlexNet (2012)

Can you believe that all this Neural Network content we see today really started just 13 years ago? Before that, NNs were somewhere between a joke and a failed promise. If you wanted a working model, you'd use SVMs or a bunch of hand-engineered tricks.

In 2012, the authors proposed the use of GPUs to train a large Convolutional Neural Network (CNN) for the ImageNet challenge. To everyone's surprise, they won first place, with a ~15% Top-5 error rate, against ~26% for the second place, which used state-of-the-art image processing techniques.
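As a quick aside, the original architecture still ships with torchvision, so you can verify the parameter count mentioned below yourself:

```python
import torchvision

# The classic 2012 architecture, untrained (weights=None skips the pretrained download).
model = torchvision.models.alexnet(weights=None)
n_params = sum(p.numel() for p in model.parameters())
print(f"AlexNet parameters: {n_params / 1e6:.1f}M")  # ~61M, in line with the paper's ~60M figure
```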

Reason #1: While most of us know AlexNet's historical importance, not everyone knows which of the techniques we use today were already present before the boom. You might be surprised by how familiar many of the concepts discussed in the paper are, such as dropout and ReLU.

Reason #2: The proposed network had 60 million weights, complete insanity by 2012 standards. Nowadays, trillion-parameter LLMs are around the corner. Reading the AlexNet paper gives us a great deal of insight into how things have developed since then.

Further Reading: Following the history of ImageNet champions, you can read the ZFNet, VGG, Inception-v1, and ResNet papers. This last one achieved super-human performance, effectively solving the challenge. After it, other competitions took over the researchers' attention. Nowadays, ImageNet is mainly used to validate radically new architectures.

The original depiction of the AlexNet architecture. The top and bottom halves are processed by GPU 1 and GPU 2, respectively. An early form of model parallelism. Source: The AlexNet Paper

That's all for now. Feel free to comment or connect with me if you have any questions about this article or the papers. Writing lists like this is A LOT OF WORK. If this was a rewarding read for you, please be kind and share it with your peers. Thanks!
