Introduction
Since their introduction in 2017-2019, physics-informed neural networks (PINNs) have been a highly active area of research in the scientific machine learning (SciML) community [1,2]. PINNs are used to solve ordinary and partial differential equations (PDEs) by representing the unknown solution field with a neural network, and finding the weights and biases (parameters) of the network by minimizing a loss function based on the governing differential equation. For instance, the original PINNs approach penalizes the sum of pointwise errors of the governing PDE, whereas the Deep Ritz method minimizes an “energy” functional whose minimum enforces the governing equation [3]. Another alternative is to discretize the solution with a neural network, then construct the weak form of the governing equation using adversarial networks [4] or polynomial test functions [5,6]. Whatever the choice of physics loss, neural network discretizations have been successfully used to investigate a wide range of systems governed by PDEs. From the Navier-Stokes equations [7] to conjugate heat transfer [8] and elasticity [9], PINNs and their variants have proven themselves a worthy addition to the computational scientist’s toolkit.
As is the case with all machine learning problems, a vital ingredient in obtaining robust and accurate solutions with PINNs is hyperparameter tuning. The analyst has considerable freedom in constructing a solution method: as discussed above, the choice of physics loss function is not unique, nor is the optimizer, the strategy for boundary condition enforcement, or the neural network architecture. For instance, while ADAM has historically been the go-to optimizer for machine learning problems, there has been a surge of interest in second-order Newton-type methods for physics-informed problems [10,11]. Other studies have compared techniques for enforcing boundary conditions on the neural network discretization [12]. Best practices for the PINN architecture have primarily been investigated through the choice of activation functions. To combat the spectral bias of neural networks [13], sinusoidal activation functions have been used to better represent high-frequency solution fields [14,15]. In [16], a number of standard activation functions were compared on compressible fluid flow problems. Activation functions with partially learnable features were shown to improve solution accuracy in [17]. While most PINNs rely on multi-layer perceptron (MLP) networks, convolutional networks were investigated in [18] and recurrent networks in [19].
The studies referenced above are far from an exhaustive list of works investigating the choice of hyperparameters for physics-informed training. Nevertheless, these studies show that the loss function, optimizer, activation function, and basic class of network architecture (MLP, convolutional, recurrent, etc.) have all received attention in the literature as interesting and important components of the PINN solution framework. One hyperparameter that has seen comparatively little scrutiny is the size of the neural network discretizing the solution field. In other words, to the best of our knowledge, there are no published works that ask the following question: how many parameters should the physics-informed network consist of? While this question is in some sense obvious, the community’s lack of interest in it is not surprising: there is no price to pay in solution accuracy for an overparameterized network. In fact, overparameterized networks can provide useful regularization of the solution field, as is seen with the phenomenon of double descent [20]. Moreover, in the context of data-driven classification problems, overparameterized networks have been shown to lead to smoother loss landscapes [21]. Because solution accuracy provides no incentive to drive the size of the network down, and because the optimization problem may actually favor overparameterization, many authors use very large networks to represent PDE solutions.
While the accuracy of the solution only stands to gain by increasing the parameter count, the computational cost of the solution does scale with the network size. In this study, we take three examples from the PINNs literature and show that networks with orders of magnitude fewer parameters are capable of satisfactorily reproducing the results of the larger networks. The conclusion from these three examples is that, in the case of low-frequency solution fields, small networks can obtain accurate solutions at reduced computational cost. We then provide a counterexample, where regression to a complex oscillatory function consistently benefits from increasing the network size. Thus, our suggestion is as follows: treat the network size as a hyperparameter worth tuning, because small networks often suffice for smooth, non-oscillatory solution fields, whereas oscillatory fields may genuinely require large networks.
Examples
The first three examples are inspired by problems taken from the PINNs literature. In these works, large networks are used to obtain the PDE solution, where the size of the network is measured by the number of parameters. While different network architectures may perform differently at the same parameter count, we use this metric as a proxy for network complexity, independent of the architecture. In our examples, we incrementally decrease the parameter count of a multilayer perceptron network until the error with respect to a reference solution begins to increase. This point represents a lower limit on the network size for the particular problem, and we compare the number of parameters at this point to the number of parameters used in the original paper. In each case, we find that the networks from the literature are overparameterized by at least an order of magnitude. In the fourth example, we solve a regression problem to show how small networks can fail to represent oscillatory fields, which serves as a caveat to our findings.
Phase field fracture
The phase field model of fracture is a variational approach to fracture mechanics, which simultaneously finds the displacement and damage fields by minimizing a suitably defined energy functional [22]. Our study is based on the one-dimensional example problem given in [23], which uses the Deep Ritz method to determine the displacement and damage fields that minimize the fracture energy functional. This energy functional is given by \( \Pi = \Pi^u + \Pi^{\alpha} \),
where \( x \) is the spatial coordinate, \( u(x) \) is the displacement, \( \alpha(x)\in[0,1] \) is the crack density, and \( \ell \) is a length scale determining the width of smoothed cracks. The two components \( \Pi^u \) and \( \Pi^{\alpha} \) are the elastic and fracture energies, respectively. As in the cited work, we take \(\ell=0.05\). The displacement and phase field are discretized with a single neural network \( N: \mathbb{R} \rightarrow \mathbb{R}^2 \) with parameters \(\boldsymbol{\theta}\). The problem is driven by an applied tensile displacement at the right end, which we denote \( U \). Boundary conditions are built into the two fields with
\[ \begin{bmatrix} u(x; \boldsymbol{\theta}) \\ \alpha(x; \boldsymbol{\theta}) \end{bmatrix} = \begin{bmatrix} x(1-x) N_1(x;\boldsymbol{\theta}) + Ux \\ x(1-x) N_2(x; \boldsymbol{\theta}) \end{bmatrix}, \]
where \(N_i\) refers to the \(i\)-th output of the network, and the Dirichlet boundary conditions on the crack density are used to suppress cracking at the ends of the domain. In [23], a four hidden-layer MLP network with a width of 50 is used to represent the two solution fields. If we neglect the bias at the final layer, this corresponds to \(7850\) trainable parameters. For all of our studies, we use a two hidden-layer network with hyperbolic tangent activation functions and no bias on the output layer, as, in our experience, such networks suffice to represent any solution field of interest. If both hidden layers have width \(M\), the total number of parameters in this network is \(M^2+5M\). When \(M=86\), we obtain \(7826\) trainable parameters. In the absence of an analytical solution, we use this as the large-network reference solution to which the smaller networks are compared.
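As a concrete illustration of this architecture and the boundary-condition ansatz above, the following sketch builds the two hidden-layer network and checks the \(M^2+5M\) parameter count. It assumes PyTorch; the class and function names are ours and are not taken from [23].

```python
import torch

class PhaseFieldNet(torch.nn.Module):
    """Two hidden-layer tanh MLP N: R -> R^2 with no bias on the output layer."""
    def __init__(self, width: int):
        super().__init__()
        self.hidden1 = torch.nn.Linear(1, width)
        self.hidden2 = torch.nn.Linear(width, width)
        self.output = torch.nn.Linear(width, 2, bias=False)

    def forward(self, x):
        h = torch.tanh(self.hidden1(x))
        h = torch.tanh(self.hidden2(h))
        return self.output(h)

def fields(net, x, U):
    """Boundary-condition ansatz: u = x(1-x) N_1 + U x, alpha = x(1-x) N_2."""
    N = net(x)
    u = x * (1.0 - x) * N[:, 0:1] + U * x
    alpha = x * (1.0 - x) * N[:, 1:2]
    return u, alpha

M = 86
net = PhaseFieldNet(M)
assert sum(p.numel() for p in net.parameters()) == M**2 + 5 * M  # 7826 parameters
```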
To generate the solution fields, we minimize the total potential energy using ADAM optimization with a learning rate of \(1 \times 10^{-3}\). Total fracture of the bar is observed around \(U=0.6\). As in the paper, we compute the elastic and fracture energies over a range of applied displacements, where ADAM is run for \(3500\) epochs at each displacement increment to obtain the solution fields. These “loading curves” are used to compare the performance of networks of different sizes. Our experiment is conducted with \(8\) different network sizes, each using \(20\) increments of the applied displacement to construct the loading curves. See Figure 1 for the loading curves computed with different network sizes. Only when there are \(|\boldsymbol{\theta}|=14\) parameters, which corresponds to a network of width \(2\), do we see a divergence from the loading curves of the large-network reference solution. The smallest network that performs well has \(50\) parameters, which is \(157\times\) smaller than the network used in the paper. Figure 2 confirms that this small network is capable of approximating the discontinuous displacement field, as well as the localized damage field.
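The incremental loading procedure can be sketched as follows. The callable `total_energy(net, U)` is an assumed placeholder for a Monte Carlo estimate of the energy functional at applied displacement \(U\); the loop itself (20 increments, 3500 ADAM epochs each, learning rate \(10^{-3}\), warm-starting from the previous increment) mirrors the description above but is not our verbatim implementation.

```python
import torch

def loading_curve(net, total_energy, U_max=0.6, n_steps=20, epochs=3500, lr=1e-3):
    """Sweep the applied displacement U, minimizing the energy at each increment
    and warm-starting the network parameters from the previous increment."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    history = []
    for U in torch.linspace(0.0, U_max, n_steps):
        for _ in range(epochs):
            opt.zero_grad()
            energy = total_energy(net, U)   # assumed: returns a scalar tensor
            energy.backward()
            opt.step()
        history.append((U.item(), energy.item()))
    return history
```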


Burgers’ equation
We now study the effect of network size on a common model problem from fluid mechanics. Burgers’ equation is often used to test numerical solution methods due to its nonlinearity and tendency to form sharp features. The viscous Burgers’ equation with homogeneous Dirichlet boundaries is given by
\[\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}, \quad u(x,0) = u_0(x), \quad u(-1,t)=u(1,t) = 0,\]
where \(x\in[-1,1]\) is the spatial domain, \(t\in[0,T]\) is the time coordinate, \(u(x,t)\) is the velocity field, \(\nu\) is the viscosity, and \(u_0(x)\) is the initial velocity profile. In [24], a neural network discretization of the velocity field is used to obtain a solution to the governing differential equation. Their network comprises \(3\) hidden layers with \(64\) neurons in each layer, corresponding to \(8576\) trainable parameters. Again, we use a two hidden-layer network that has \(M^2+5M\) trainable parameters, where \(M\) is the width of each layer. If we take \(M=90\), we obtain \(8550\) trainable parameters in our network. We take the solution from this network to be the reference solution \(u_{\text{ref}}(x,t)\), and compute the discrepancy between it and the velocity fields from smaller networks. We do this with an error function given by
\[ E\Big( u(x,t)\Big) = \frac{\int_{\Omega}| u(x,t) - u_{\text{ref}}(x,t)| \, d\Omega}{\int_{\Omega}| u_{\text{ref}}(x,t)| \, d\Omega},\]
where \(\Omega = [-1,1] \times [0,T]\) is the computational domain. To solve Burgers’ equation, we adopt the standard PINNs approach and minimize the squared error of the governing equation:
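A minimal sketch of this error metric, approximating both integrals by averages of absolute values over a tensor-product grid on \(\Omega\), is given below (PyTorch assumed; `u` and `u_ref` are callables evaluating the two velocity fields at \((x,t)\) points, and the grid resolution is an illustrative choice).

```python
import torch

def relative_error(u, u_ref, n_x=256, n_t=256, T=2.0):
    """E(u): relative L1 discrepancy between u and u_ref over [-1,1] x [0,T]."""
    x = torch.linspace(-1.0, 1.0, n_x)
    t = torch.linspace(0.0, T, n_t)
    X, Tgrid = torch.meshgrid(x, t, indexing="ij")
    pts = torch.stack([X.flatten(), Tgrid.flatten()], dim=1)
    num = (u(pts) - u_ref(pts)).abs().mean()
    den = u_ref(pts).abs().mean()
    return (num / den).item()
```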
\[ \underset{\boldsymbol{\theta}}{\text{argmin }} L(\boldsymbol{\theta}), \quad L(\boldsymbol{\theta}) = \frac{1}{2} \int_{\Omega} \Big(\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} - \nu \frac{\partial^2 u}{\partial x^2}\Big)^2 d\Omega.\]
The velocity field is discretized with an MLP network \(N(x,t;\boldsymbol{\theta})\), and the boundary and initial conditions are built in with a distance function-type approach [25]:
\[ u(x,t;\boldsymbol{\theta}) = (1+x)(1-x)\Big(t N(x,t; \boldsymbol{\theta}) + u_0(x)(1-t/T)\Big). \]
In this problem, we take the viscosity to be \(\nu=0.01\) and the final time to be \(T=2\). The initial condition is given by \(u_0(x) = -\sin(\pi x)\), which leads to the well-known shock pattern at \(x=0\). We run ADAM optimization for \(1.5 \times 10^{4}\) epochs with a learning rate of \(1.5 \times 10^{-3}\) to solve the optimization problem at each network size. By sweeping over \(8\) network sizes, we again look for the parameter count at which the solution departs from the reference solution. Note that we verify our reference solution against a spectral solver to ensure the accuracy of our implementation. See Figure 3 for the results. All networks with \(|\boldsymbol{\theta}|\geq 150\) parameters show roughly equal performance. As such, the original network is overparameterized by a factor of \(57\).
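The residual loss and the hard enforcement of the boundary and initial conditions can be sketched as below, with derivatives obtained by automatic differentiation. This assumes a PyTorch MLP `net` mapping \((x,t)\) to a scalar; the random collocation sampling is illustrative rather than the scheme of [24] or of our implementation.

```python
import torch

def burgers_loss(net, n_points=2048, nu=0.01, T=2.0):
    """Monte Carlo estimate of L(theta) = 1/2 int (u_t + u u_x - nu u_xx)^2 dOmega
    for u = (1+x)(1-x) * ( t*N(x,t) + u0(x)*(1 - t/T) ), with u0(x) = -sin(pi x)."""
    x = (2.0 * torch.rand(n_points, 1) - 1.0).requires_grad_(True)  # x in [-1, 1]
    t = (T * torch.rand(n_points, 1)).requires_grad_(True)          # t in [0, T]
    N = net(torch.cat([x, t], dim=1))
    u0 = -torch.sin(torch.pi * x)
    u = (1.0 + x) * (1.0 - x) * (t * N + u0 * (1.0 - t / T))

    grad = lambda f, v: torch.autograd.grad(f, v, torch.ones_like(f), create_graph=True)[0]
    u_t, u_x = grad(u, t), grad(u, x)
    u_xx = grad(u_x, x)
    residual = u_t + u * u_x - nu * u_xx
    return 0.5 * (residual ** 2).mean()
```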

Neohookean hyperelasticity
In this example, we consider the nonlinearly elastic deformation of a cube under a prescribed displacement. The strain energy density of a 3D hyperelastic solid is given by the compressible Neohookean model [26] as
\[\Psi\Big( \mathbf{u}(\mathbf{X}) \Big) = \frac{\ell_1}{2}\Big( I_1 - 3 \Big) - \ell_1 \ln J + \frac{\ell_2}{2} \Big( \ln J \Big)^2 ,\]
where \(\ell_1\) and \(\ell_2\) are material properties, which we take to be constants. The strain energy makes use of the following definitions:
\[ \mathbf{F} = \mathbf{I} + \frac{\partial \mathbf{u}}{\partial \mathbf{X}}, \quad I_1 = \mathbf{F} : \mathbf{F}, \quad J = \det(\mathbf{F}),\]
where \(\mathbf{u}\) is the displacement field, \(\mathbf{X}\) is the position in the reference configuration, and \(\mathbf{F}\) is the deformation gradient tensor. The displacement field is obtained by minimizing the total potential energy, given by
\[ \Pi\Big( \mathbf{u}(\mathbf{X}) \Big) = \int_{\Omega} \Big( \Psi\Big( \mathbf{u}(\mathbf{X}) \Big) - \mathbf{b} \cdot \mathbf{u} \Big) \, d\Omega - \int_{\partial \Omega} \mathbf{t} \cdot \mathbf{u} \, dS,\]
where \(\Omega\) is the undeformed configuration of the body, \(\mathbf{b}\) is a volumetric force, and \(\mathbf{t}\) is an applied surface traction. Our investigation into the network size is inspired by [27], in which the Deep Ritz method is used to obtain a minimum of the hyperelastic total potential energy functional. However, we opt to use the Neohookean model of the strain energy, as opposed to the Lopez-Pamies model they employ. As in the cited work, we take the undeformed configuration to be the unit cube \(\Omega=[0,1]^3\) and subject the cube to a uniaxial strain state. To implement this strain state, we apply a displacement \(U\) in the \(X_3\) direction on the top surface of the cube. Roller supports, which zero only the \(X_3\) component of the displacement, are applied on the bottom surface. All other surfaces are traction-free, which is enforced weakly by the chosen energy functional. The boundary conditions are satisfied automatically by discretizing the displacement as
\[ \begin{bmatrix} u_1(\mathbf{X}; \boldsymbol{\theta}) \\ u_2(\mathbf{X}; \boldsymbol{\theta}) \\ u_3(\mathbf{X}; \boldsymbol{\theta}) \end{bmatrix} = \begin{bmatrix} X_3 N_1(\mathbf{X}; \boldsymbol{\theta}) \\ X_3 N_2(\mathbf{X}; \boldsymbol{\theta}) \\ \sin(\pi X_3) N_3(\mathbf{X}; \boldsymbol{\theta}) + UX_3 \end{bmatrix},\]
where \(N_i\) is the \(i\)-th component of the network output. In the cited work, a six hidden-layer network of width \(40\) is used to discretize the three components of the displacement field. This corresponds to \(8480\) trainable parameters. Given that the network is a map \(N: \mathbb{R}^3 \rightarrow \mathbb{R}^3\), a two hidden-layer network of width \(M\) has \(M^2+8M\) trainable parameters when no bias is applied at the output layer. Thus, if we take \(M=88\), our network has \(8448\) trainable parameters. We take this network architecture to be the large-network reference.
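A sketch of the Deep Ritz energy evaluation for this problem is given below, assuming PyTorch. The displacement ansatz follows the equation above; the deformation gradient is assembled row by row with automatic differentiation, and the integral over the unit cube is approximated by an average over random sample points. Function names are ours, not those of [27]. A load-stepping loop like the one sketched in the phase field example can then drive this energy over the applied displacements.

```python
import torch

def displacement(net, X, U):
    """Hard-BC ansatz: u1 = X3*N1, u2 = X3*N2, u3 = sin(pi*X3)*N3 + U*X3."""
    N = net(X)                              # net: R^3 -> R^3
    X3 = X[:, 2:3]
    u3 = torch.sin(torch.pi * X3) * N[:, 2:3] + U * X3
    return torch.cat([X3 * N[:, 0:1], X3 * N[:, 1:2], u3], dim=1)

def total_energy(net, U, ell1=1.0, ell2=0.25, n_points=4096):
    """Monte Carlo estimate of Pi = int Psi dOmega over the unit cube (b = 0, t = 0)."""
    X = torch.rand(n_points, 3).requires_grad_(True)
    u = displacement(net, X, U)
    rows = [torch.autograd.grad(u[:, i], X, torch.ones_like(u[:, i]), create_graph=True)[0]
            for i in range(3)]              # rows[i][:, j] = du_i / dX_j
    F = torch.stack(rows, dim=1) + torch.eye(3).unsqueeze(0)
    J = torch.det(F)
    I1 = (F * F).sum(dim=(1, 2))
    psi = 0.5 * ell1 * (I1 - 3.0) - ell1 * torch.log(J) + 0.5 * ell2 * torch.log(J) ** 2
    return psi.mean()                       # the unit cube has volume 1
```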
In [27], the relationship between the normal component of the first Piola-Kirchhoff stress tensor \(\mathbf{P}\) in the direction of the applied displacement and the corresponding component of the deformation gradient \(\mathbf{F}\) was computed to verify their Deep Ritz implementation. Here, we study the relationship between this tensile stress and the applied displacement \(U\). The first Piola-Kirchhoff stress tensor is obtained from the strain energy density as
\[ \mathbf{P} = \frac{\partial \Psi}{\partial \mathbf{F}} = \ell_1( \mathbf{F} - \mathbf{F}^{-T} ) + \ell_2 \mathbf{F}^{-T}\ln J.\]
Given the unit cube geometry and the uniaxial stress/strain state, the deformation gradient is given by
\[ \mathbf{F} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1+U \end{bmatrix}.\]
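For this diagonal deformation gradient, the inverse transpose, the determinant, and the first invariant follow directly:
\[ \mathbf{F}^{-T} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \frac{1}{1+U} \end{bmatrix}, \quad J = \det(\mathbf{F}) = 1+U, \quad I_1 = \mathbf{F}:\mathbf{F} = 2 + (1+U)^2. \]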
Combining the expression for \(\mathbf{P}\) with this deformation gradient, we compute the tensile stress \(P_{33}\) and the strain energy as functions of the applied displacement to be
\[ \begin{aligned} P_{33} &= \ell_1\Big( 1 + U - \frac{1}{1+U}\Big) + \ell_2\frac{\ln(1+U)}{1+U}, \\ \Pi &= \frac{\ell_1}{2}\big(2+(1+U)^2-3\big) - \ell_1 \ln(1+U) + \frac{\ell_2}{2}\big(\ln(1+U)\big)^2. \end{aligned}\]
These analytical solutions can be used to verify our implementation of the hyperelastic model, as well as to gauge the performance of networks of different sizes. Using the neural network model, the tensile stress and the strain energy are computed at each applied displacement with
\[ P_{33} = \int_{\Omega} \Big[ \ell_1( \mathbf{F} - \mathbf{F}^{-T} ) + \ell_2 \mathbf{F}^{-T}\ln J \Big]_{33} \, d\Omega, \quad \Pi = \int_{\Omega} \Psi\Big( \mathbf{u}(\mathbf{X})\Big) \, d\Omega,\]
where the displacement field is constructed from the parameters obtained with the Deep Ritz method. To compute the stress, we average over the entire domain, given that we expect a constant stress state. In this example, the material parameters are set to \(\ell_1=1\) and \(\ell_2=0.25\). We iterate over \(8\) network sizes and take \(10\) load steps at each size to obtain the stress and strain energy as functions of the applied displacement. See Figure 4 for the results. All networks exactly reproduce the strain energy and stress loading curves, including even the network of width \(2\), with only \(20\) trainable parameters. Thus, the original network has \(424\times\) more parameters than needed to reproduce the results of the tensile test.

A counterexample
In the fourth and final example, we solve a regression problem to show the failure of small networks to fit high-frequency functions. The one-dimensional regression problem is given by
\[ \underset{\boldsymbol{\theta}}{\text{argmin }} L(\boldsymbol{\theta}), \quad L(\boldsymbol{\theta}) = \frac{1}{2}\int_0^1\Big( v(x) - N(x;\boldsymbol{\theta}) \Big)^2 dx,\]
where \(N\) is a two hidden-layer MLP network and \(v(x)\) is the target function. In this example, we take \(v(x)=\sin^5(20\pi x)\). We iterate over \(5\) different network sizes and report the converged loss value \(L\) as an error measure. We train using ADAM optimization for \(5 \times 10^4\) epochs with a learning rate of \(5 \times 10^{-3}\). See Figure 5 for the results. Unlike the previous three examples, the target function is sufficiently complex that large networks are required to represent it. The converged error decreases monotonically with the parameter count. We also time the training procedure at each network size, and note the dependence of the run time (in seconds) on the parameter count. This example illustrates that representing oscillatory functions requires larger networks, and that the parameter count drives up the cost of training.
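The regression experiment can be sketched as follows, again assuming PyTorch; the fixed set of sample points used to approximate the integral in \(L(\boldsymbol{\theta})\) and the particular widths swept are illustrative choices rather than our exact setup.

```python
import torch

def target(x):
    return torch.sin(20.0 * torch.pi * x) ** 5

def make_mlp(width):
    """Two hidden-layer tanh MLP N: R -> R with no bias on the output layer."""
    return torch.nn.Sequential(
        torch.nn.Linear(1, width), torch.nn.Tanh(),
        torch.nn.Linear(width, width), torch.nn.Tanh(),
        torch.nn.Linear(width, 1, bias=False))

def fit(width, epochs=50_000, lr=5e-3, n_points=2048):
    net = make_mlp(width)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    x = torch.rand(n_points, 1)                     # sample points in [0, 1]
    v = target(x)
    for _ in range(epochs):
        opt.zero_grad()
        loss = 0.5 * ((net(x) - v) ** 2).mean()     # Monte Carlo estimate of L(theta)
        loss.backward()
        opt.step()
    return loss.item()

for width in (8, 16, 32, 64, 128):
    n_params = sum(p.numel() for p in make_mlp(width).parameters())
    print(f"width={width}, parameters={n_params}, converged loss={fit(width):.3e}")
```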

Conclusion
While tuning hyperparameters governing the loss function, optimization process, and activation function is common in the PINNs community, it is less common to tune the network size. With three example problems taken from the literature, we have shown that very small networks often suffice to represent PDE solutions, even when there are discontinuities and/or other localized features. See Table 1 for a summary of our results on the potential of using small networks. To qualify our findings, we then presented the case of regression to a high-frequency target function, which required a large number of parameters to fit accurately. Thus, our conclusions are as follows: solution fields which do not oscillate can often be represented by small networks, even when they contain localized features such as cracks and shocks. Because the cost of training scales with the number of parameters, smaller networks can expedite training for physics-informed problems with non-oscillatory solution fields. In our experience, such solution fields appear regularly in practical problems from heat conduction and static solid mechanics. By shrinking the size of the network, these problems and others represent opportunities to render PINN solutions more computationally efficient, and thus more competitive with traditional approaches such as the finite element method.
| Problem | Overparameterization |
| --- | --- |
| Phase field fracture [23] | \(157 \times\) |
| Burgers’ equation [24] | \(57 \times\) |
| Neohookean hyperelasticity [27] | \(424 \times\) |
References
[1] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, December 2018. arXiv:1708.07469 [q-fin].
[2] M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, February 2019.
[3] Weinan E and Bing Yu. The Deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, September 2017. arXiv:1710.00211 [cs].
[4] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak Adversarial Networks for High-dimensional Partial Differential Equations. Journal of Computational Physics, 411:109409, June 2020. arXiv:1907.08272 [math].
[5] Reza Khodayi-Mehr and Michael M. Zavlanos. VarNet: Variational Neural Networks for the Solution of Partial Differential Equations, December 2019. arXiv:1912.07443 [cs].
[6] E. Kharazmi, Z. Zhang, and G. E. Karniadakis. Variational Physics-Informed Neural Networks For Solving Partial Differential Equations, November 2019. arXiv:1912.00873 [cs].
[7] Xiaowei Jin, Shengze Cai, Hui Li, and George Em Karniadakis. NSFnets (Navier-Stokes Flow nets): Physics-informed neural networks for the incompressible Navier-Stokes equations. Journal of Computational Physics, 426:109951, February 2021.
[8] Shengze Cai, Zhicheng Wang, Sifan Wang, Paris Perdikaris, and George Em Karniadakis. Physics-Informed Neural Networks for Heat Transfer Problems. Journal of Heat Transfer, 143(060801), April 2021.
[9] Min Liu, Zhiqiang Cai, and Karthik Ramani. Deep Ritz method with adaptive quadrature for linear elasticity. Computer Methods in Applied Mechanics and Engineering, 415:116229, October 2023.
[10] Sifan Wang, Ananyae Kumar Bhartari, Bowen Li, and Paris Perdikaris. Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective, September 2025. arXiv:2502.00604 [cs].
[11] Jorge F. Urban, Petros Stefanou, and Jose A. Pons. Unveiling the optimization process of physics informed neural networks: How accurate and competitive can PINNs be? Journal of Computational Physics, 523:113656, February 2025.
[12] Conor Rowan, Kai Hampleman, Kurt Maute, and Alireza Doostan. Boundary condition enforcement with PINNs: a comparative study and verification on 3D geometries, December 2025. arXiv:2512.14941 [math].
[13] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A. Hamprecht, Yoshua Bengio, and Aaron Courville. On the Spectral Bias of Neural Networks, May 2019. arXiv:1806.08734 [stat].
[14] Mirco Pezzoli, Fabio Antonacci, and Augusto Sarti. Implicit neural representation with physics-informed neural networks for the reconstruction of the early part of room impulse responses. In Proceedings of the 10th Convention of the European Acoustics Association Forum Acusticum 2023, pages 2177–2184, January 2024.
[15] Shaghayegh Fazliani, Zachary Frangella, and Madeleine Udell. Enhancing Physics-Informed Neural Networks Through Feature Engineering, June 2025. arXiv:2502.07209 [cs].
[16] Duong V. Dung, Nguyen D. Song, Pramudita S. Palar, and Lavi R. Zuhal. On the Choice of Activation Functions in Physics-Informed Neural Network for Solving Incompressible Fluid Flows. In AIAA SCITECH 2023 Forum. American Institute of Aeronautics and Astronautics. Eprint: https://arc.aiaa.org/doi/pdf/10.2514/6.2023-1803.
[17] Honghui Wang, Lu Lu, Shiji Song, and Gao Huang. Learning Specialized Activation Functions for Physics-informed Neural Networks. Communications in Computational Physics, 34(4):869–906, June 2023. arXiv:2308.04073 [cs].
[18] Zhao Zhang, Xia Yan, Piyang Liu, Kai Zhang, Renmin Han, and Sheng Wang. A physics-informed convolutional neural network for the simulation and prediction of two-phase Darcy flows in heterogeneous porous media. Journal of Computational Physics, 477:111919, March 2023.
[19] Pu Ren, Chengping Rao, Yang Liu, Jianxun Wang, and Hao Sun. PhyCRNet: Physics-informed Convolutional-Recurrent Network for Solving Spatiotemporal PDEs. Computer Methods in Applied Mechanics and Engineering, 389:114399, February 2022. arXiv:2106.14103 [cs].
[20] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, August 2019. Publisher: Proceedings of the National Academy of Sciences.
[21] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks, February 2020. arXiv:1806.07572 [cs].
[22] B. Bourdin, G. A. Francfort, and J-J. Marigo. Numerical experiments in revisited brittle fracture. Journal of the Mechanics and Physics of Solids, 48(4):797–826, April 2000.
[23] M. Manav, R. Molinaro, S. Mishra, and L. De Lorenzis. Phase-field modeling of fracture with physics-informed deep learning. Computer Methods in Applied Mechanics and Engineering, 429:117104, September 2024.
[24] Xianke Wang, Shichao Yi, Huangliang Gu, Jing Xu, and Wenjie Xu. WF-PINNs: solving forward and inverse problems of Burgers equation with steep gradients using weak-form physics-informed neural networks. Scientific Reports, 15(1):40555, November 2025. Publisher: Nature Publishing Group.
[25] N. Sukumar and Ankit Srivastava. Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks. Computer Methods in Applied Mechanics and Engineering, 389:114333, February 2022.
[26] Javier Bonet and Richard D. Wood. Nonlinear Continuum Mechanics for Finite Element Analysis. Cambridge University Press, Cambridge, 2 edition, 2008.
[27] Diab W. Abueidda, Seid Koric, Rashid Abu Al-Rub, Corey M. Parrott, Kai A. James, and Nahil A. Sobh. A deep learning energy method for hyperelasticity and viscoelasticity. European Journal of Mechanics – A/Solids, 95:104639, September 2022. arXiv:2201.08690 [cs].
