Data Science as Engineering: Foundations, Education, and Professional Identity

Data science is having an identity crisis.

Indications of this crisis have been around for years. For example, the inaugural issue of Harvard Data Science Review found it easier to define what data science is not rather than what it is (Meng, 2019). This confusion hasn’t cleared up. In fact, a case could be made that it has gotten worse. As Meng noted years ago (2019), most of us know something about other kinds of scientists. But what is a data scientist, and what exactly do they do?

The history of data science is deeply rooted in statistics. As far back as 1962, one of the most influential statisticians of the twentieth century, John Tukey, was calling for recognition of a new science focused on learning from data. Subsequent work by the statistics community, particularly Jeff Wu (Donoho, 2017) and William Cleveland (2001), formally proposed the name “data science” and suggested that academic statistics expand its boundaries (Donoho, 2017). Yet the following years have seen a large influence from computer science, calls for data science to be recognized as a new discipline distinct from statistics, and a fundamental reckoning with whether data science is a science at all.

The combination of the probabilistic and inferential traditions of statistics with the algorithmic, programming, and system-design concerns of computer science has led to a contemporary view of data science as an interdisciplinary field, which Blei and Smyth (2017) affectionately refer to as ‘the child of statistics and computer science’. Wing and colleagues (2018) see the defining characteristic as being that data science is not only about methods, but also about the use of those methods in the context of a domain. This interplay between domain and methods makes data science not merely the sum of its parts, but a distinct field with its own focus.

Yet there is the fundamental question of the name itself. Wing’s probing question (2020), “Is there a problem unique to data science that one can convincingly argue would not be addressed or asked by any of its constituent disciplines, e.g., computer science and statistics?” is an important litmus test for whether data science should be considered a science. Some questions emerging from data science may feel novel (Wing, 2020); however, even these often reduce to applications of existing disciplines (statistics, computer science, optimization theory) rather than indicate a fundamentally new science.

Contributions from different disciplines can make data science richer. Yet there is mounting evidence (Wilkerson, 2025) that they are also causing confusion for students, educators, and employers. There is evidence of important differences across undergraduate data science education, between data science education efforts for majors versus nonmajors, and between K–12 data science initiatives emerging from different groups and disciplines.

Contributions from multiple disciplines do not easily circulate in the absence of a centralized community (Dogucu et al., 2025), resulting in fragmentation. The interdisciplinary nature of data science is becoming multidisciplinary. Numerous professional societies now have explicit data science, or closely related, subgroups and focus areas. Domain-specific data science journals are excellent outlets for research; yet we may be losing the interactive and holistic aspect of an interdisciplinary field. Navigating the entire data science landscape is a challenge. This further manifests itself in the many distinct roles that appear across “Data Scientist” job advertisements (Saltz and Grady, 2017) and culminates in the “unicorn problem,” where employers have the unrealistic expectation that one person can master all the skills of what is considered data science (Saltz and Grady, 2017).

An Engineering Perspective

Wing’s questions (2020) reveal that data science has a fundamentally different relationship with domain context than mathematics, statistics, or computer science. This different relationship, where domain is integral rather than inspirational, is precisely what distinguishes engineering from science.

Domains inspire questions in the sciences, but the domains are not fundamental. Mathematics studies abstract structures, and we can do group theory without any application in mind. Statistics studies inference from data in general, and we can develop a statistical theory without a specific domain. Computer science studies computation abstractly, and we can develop algorithms, complexity theory, and programming languages without applications in mind. These fields are inspired by domains but exist independently of those domains.

Engineering, on the other hand, cannot exist without application context. Civil engineering literally cannot be studied without considering what you are building (bridges, dams, buildings). The domain is not just inspirational; it is constitutive. We can’t teach mechanical engineering as pure abstraction and then “add” applications later. Trade-offs (e.g., algorithmic, efficiency, cost) only make sense within the engineer’s domain constraints. Data science fits this model.

A data scientist’s job is more analogous to a civil engineer designing a bridge than a physicist studying fundamental forces. The bridge must work given the materials available, the budget, the terrain, and safety requirements, even if that means using approximations rather than perfect solutions. Yet engineering disciplines can also generate foundational insights as byproducts without that being their purpose. Thermodynamics emerged partly from engineers trying to build better steam engines. Information theory came from engineers working on telecommunications. But the field’s telos is building systems that work, not advancing foundational theory. A data scientist who develops a model that improves customer retention by 5% has succeeded, even if they used off-the-shelf methods and generated zero novel insights.

Data science is fundamentally about building systems that work. Like other engineering disciplines, it involves:

Making pragmatic trade-offs (accuracy vs. interpretability vs. computational cost)
Working within constraints (limited data, computational resources, business requirements)
Integrating multiple techniques to solve practical problems
Focusing on deployment, maintenance, and iteration

Perhaps data science is best understood, and taught, using an engineering framework. Perhaps data science needs specializations analogous to mechanical, civil, and electrical engineering. This engineering framing is about epistemology and practice, not necessarily organizational structure. Engineering is fundamentally about how you approach problems (building systems that work under constraints), not about departmental affiliation. Biomedical engineering is engineering whether it is housed with mechanical engineering or in a medical school. What matters is that data science programs adopt engineering principles: rigorous foundations, specialized tracks, a focus on building rather than pure discovery, and professional standards. This can happen in statistics departments, computer science departments, engineering schools, or standalone data science departments. The key is the educational philosophy and standards, not the name of the department.

Existing Engineering Foundations

We are not the first to view data science as engineering. Steuer’s essay (2020) expertly noted that while data science was becoming the engineering of the twenty-first century, it was being taught using two very distinct approaches. In the first, the goal is to make reliable statements about the world. In the second, data are seen as examples, and the goal is to learn a general concept. Steuer notes (2020) that there is no common epistemological foundation in which data scientists are trained. We are expanding upon those initial calls for common foundations and present thoughts on what this might look like for data science as an academic discipline and a profession.

Hoerl and Snee (2015) have argued for a new discipline, called statistical engineering, for dealing with large, unstructured, complex problems by combining multiple statistical tools with other disciplines. Statistical engineering is the application of statistical thinking to large, unstructured, real-world problems. This call for a new discipline has led to the formation of the International Statistical Engineering Association (ISEA). It would appear that ISEA views statistical engineering as the science of integrating and applying methods rigorously, with data science being the practice of using those methods.

Pan and colleagues (2021) have suggested that engineering fields introduce data science concepts such as machine learning and a focus on statistics. They note that it is important to refine the university curriculum and train engineers to use data science and be data literate from the outset (Pan et al., 2021). We believe data science should adopt the reciprocal philosophy. Gerald Friedland has taken this to heart by introducing a novel textbook (Friedland, 2024) presenting machine learning from an engineering perspective. It is worth noting that engineering perspectives are appearing in related domains as well. Rebecca Willett (2019), for example, has called for an engineering approach to artificial intelligence.

Although the idea of data science as engineering is not new, there are still a number of open questions. How should curricula change if we accept that data science is engineering? What competencies should we emphasize? How can we teach more than just accuracy? Should data scientists have codes of practice like engineers do? Our goal is to continue the discussion of data science as engineering while suggesting pedagogical, professional, and ethical perspectives on these questions.

Implications for Education

Traditional engineering disciplines require deep foundational knowledge precisely because engineers need to recognize when they are at the boundaries of established theory. A civil engineer needs to understand materials science and structural mechanics well enough to know when a design problem requires new research versus when it is a straightforward application of known principles.

Similarly, a data scientist working on, say, a new architecture for time series prediction should ideally recognize: “This convergence behavior is weird; this might be touching on something fundamental about optimization landscapes” versus “This is just a hyperparameter tuning issue.”

We want to avoid education that produces practitioners who can use tools but cannot recognize when they are observing something that violates theoretical expectations, which is precisely when foundational insights emerge. A lack of specialization creates both a signal problem (how do you assess practitioners?) and a training problem (one curriculum can’t serve all needs).

Here are a few suggestions to support the ongoing discussions on the data science curriculum.

Core sequence in linear algebra and probability theory.
Physics for insight: some exposure to statistical mechanics and information theory, framed around their connections to learning systems, could be extremely valuable.
“Foundations for practitioners” courses: courses explicitly designed to give practitioners enough theoretical grounding to recognize anomalies and foundational questions. Not a course in tool X; rather, “Here’s what should happen according to theory, and here’s what it looks like when you’re outside the theory.”
Teach reliability, testing, and explainability as first-class concepts.
Case studies of foundational discoveries: teaching through examples like “how dropout was discovered” or “why the Adam optimizer converges differently than theory predicted” to train the skill of recognizing foundational questions.
Introduce capstone “design labs” modeled after engineering senior design.
A focus on data ethics and fairness.

What changes in the classroom is a shift from a scientific framing to an engineering framing. Students must now consider pipelines, versioning, monitoring, and ethics, not only mean absolute error. Engineering students learn that systems fail and that design is iterative. Data science students should too.

Ethics can be taught as a design constraint. Rather than being tacked on as a discussion topic, ethics is treated as a requirement the system must satisfy. If our systems must not produce disparate outcomes by gender or race, then ethics becomes a technical design requirement, not a moral afterthought.
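As a rough illustration of what treating such a requirement as a testable design constraint might look like, the sketch below checks whether a model’s positive-prediction rates differ too much across groups. The metric (a demographic parity gap), the function names, and the 0.1 threshold are all assumptions chosen for illustration; a real project would select metrics and tolerances appropriate to its domain.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return the share of positive (1) predictions for each group label."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = positive_rate_by_group(predictions, groups)
    if not rates:
        return 0.0  # no data, so no measurable gap
    return max(rates.values()) - min(rates.values())


def meets_fairness_requirement(predictions, groups, max_gap=0.1):
    """Hypothetical design requirement: the gap must not exceed max_gap."""
    return demographic_parity_gap(predictions, groups) <= max_gap


if __name__ == "__main__":
    # Toy binary predictions for two illustrative groups, "a" and "b".
    preds = [1, 0, 1, 1, 0, 1, 0, 0]
    grps = ["a", "a", "a", "a", "b", "b", "b", "b"]
    print(demographic_parity_gap(preds, grps))
    print(meets_fairness_requirement(preds, grps))
```

A check like this could run alongside ordinary unit tests, so that a violation blocks a release in the same way a failing test would.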

In an engineering-style data science, tools are not optional extras. Choosing the right tools for reproducibility, monitoring and deployment, automation, and documentation becomes the equivalent of safety codes and standards in traditional engineering.

Our assessment of students also shifts. Instead of grading only accuracy or mathematical derivations, we evaluate robustness, clarity of design, interpretability, and fairness metrics. Students should be rewarded for building systems that last.

These shifts in pedagogy would give practitioners the ability to:

Read theoretical papers and understand what they’re claiming
Recognize when empirical results contradict theoretical expectations
Have theoretical and physical intuitions about algorithms
Know when to consult deeper theory
Communicate with researchers in adjacent fields
Learn from system failure

To be clear, we are not saying “reorganize all colleges and universities.” Rather, “recognize data science as an engineering practice and structure education accordingly.” Engineering is a mode of practice, not just an organizational category. The engineering framing is about professional identity and educational standards, not departmental location.

Proposed Specializations and Changes to Professional Societies

If data science is engineering, we must shift from the scientific model (focused on research dissemination and academic credentialing) to the engineering model (focused on professional standards, public responsibility, and practice competence). This includes specializations, enforceable ethics codes, technical standards with regulatory implications, and academic accreditation. What might data science specializations look like? Here is one possible breakdown to move the conversation forward.

Statistical/Experimental Data Scientist

Educational requirements: causal inference, experimental design, survey methodology
Applications: A/B testing, policy evaluation, clinical trials
Math core: Real analysis, probability, statistics
Limited exposure to: Distributed systems, deep learning

AI/Machine Learning Data Scientist

Educational requirements: algorithms, distributed systems, optimization
Applications: Recommendation systems, search, large-scale prediction
Math core: Linear algebra, optimization, some statistical mechanics
Heavy exposure to: Software engineering, MLOps, scalability

Scientific/Research Data Scientist

Educational requirements: domain science + statistics
Applications: Genomics, climate, physics, social science
Math/Science core: physics, statistics, linear algebra, scientific computing
Focus on: Interpretability, uncertainty quantification, causal models

Business Intelligence Data Scientist

Educational requirements: business/economics, some statistics and calculus
Heavy on: SQL, visualization, communication, domain knowledge
Applications: Dashboards, reports, exploratory analysis

Data science programs and professional societies with an engineering focus would have data standards analogous to engineering building codes. These would serve not the regulatory function of building codes but rather the certification of tools and approaches for industry. They would consist of data documentation standards (what constitutes adequate documentation), model validation protocols (when is a model ready for deployment?), reproducibility standards (minimum requirements for computational reproducibility), fairness and bias testing protocols, and security and privacy standards for data handling. These shouldn’t be academic papers; they should be living standards co-developed and adopted by industry.
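To make the idea of a model validation protocol concrete, here is a minimal sketch of a release gate that mirrors the five kinds of standards listed above. Every field name and threshold is hypothetical and chosen only for illustration; an actual standard would define its own criteria and evidence requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseCandidate:
    """Evidence collected about a model before deployment (all fields hypothetical)."""
    has_data_documentation: bool
    holdout_accuracy: float
    reproduced_from_clean_checkout: bool
    fairness_gap: float
    passed_privacy_review: bool


def failed_checks(candidate, min_accuracy=0.8, max_fairness_gap=0.1):
    """Return the names of the standards this candidate does not yet meet."""
    checks = {
        "data documentation": candidate.has_data_documentation,
        "model validation": candidate.holdout_accuracy >= min_accuracy,
        "reproducibility": candidate.reproduced_from_clean_checkout,
        "fairness and bias testing": candidate.fairness_gap <= max_fairness_gap,
        "security and privacy": candidate.passed_privacy_review,
    }
    return [name for name, passed in checks.items() if not passed]


def ready_for_deployment(candidate):
    """A candidate is ready only when every check passes."""
    return not failed_checks(candidate)


if __name__ == "__main__":
    candidate = ReleaseCandidate(
        has_data_documentation=True,
        holdout_accuracy=0.9,
        reproduced_from_clean_checkout=True,
        fairness_gap=0.25,
        passed_privacy_review=True,
    )
    print(failed_checks(candidate))
    print(ready_for_deployment(candidate))
```

Returning the list of failed checks, rather than a single pass/fail flag, is a deliberate choice here: it tells the practitioner which standard needs attention before the model can ship.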

Membership and focus would also shift within data science professional societies. There would be equal space for practitioners, not only academic researchers. Engineers learn from failures (e.g., bridge collapses). Data science needs failure case studies as well. Ethics, centered on consequences, would dominate teaching and publication. Public welfare (when should a data scientist refuse to build something?), downstream harms (responsibility for how models are deployed), and enforceable standards (not just aspirational ones) would take center stage. Engineering ethics asks: “What could go wrong and who could be harmed?” Data science ethics should do the same.

Teaching data science as engineering redefines success from “model accuracy” to “system reliability and responsibility.” As our data systems shape the world, we must train data scientists not only as analysts of data but as engineers of data system consequences.

Avoiding a False Dichotomy

The “science discovers, engineering applies” narrative is overly simplistic. Reality is far richer. History shows that engineering and science intertwine, with many foundational scientific insights having emerged from engineering practice. The boundary is permeable and productive. Data science will generate new scientific insights, and data scientists who make scientific discoveries are doing exceptional engineering, not abandoning engineering for science. In this regard, the name is really of secondary concern because an engineering framing values both kinds of contributions. While its pedagogy and professionalism recognize that most work is synthesis and application, we should still create space for discovery. This is a much healthier model than pretending all data scientists are doing fundamental science, or that those who build systems are somehow lesser. Viewing data science as…

The engineering discipline that applies statistical, computational, and domain knowledge to design data-driven systems that operate effectively and ethically in practice

…clarifies why data scientists value pipelines and scalability, why reproducibility and maintainability matter, and why data science doesn’t need to invent new math to be a real field. When we see data science as engineering, we stop asking “Which model is best?” and start asking “Which system design solves this problem responsibly and sustainably?” That shift produces practitioners who can think end-to-end, balancing theory, computation, and ethics, much like civil engineers balance physics, materials, and safety.

Acknowledgements

The author would like to thank Dr. Bill Harder (Director of Faculty Development and Teaching Excellence) and Dr. Rodney Yoder (Associate Professor of Physics and Engineering Science) for helpful discussions and feedback on this article.

References

Blei, D. M. and Smyth, P. (2017). Science and data science. Proceedings of the National Academy of Sciences, 114(33), 8689–8692.

Cleveland, W. S. (2001). Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. International Statistical Review, 69(1), 21–26.

Dogucu, M., Demirci, S., Bendekgey, H., Ricci, F. Z., and Medina, C. M. (2025). A Systematic Literature Review of Undergraduate Data Science Education Research. Journal of Statistics and Data Science Education, 33(4), 459–471.

Donoho, D. (2017). 50 Years of Data Science. Journal of Computational and Graphical Statistics, 26(4), 745-766.

Friedland, G. (2024), Information-Driven Machine Learning, Springer Cham, https://doi.org/10.1007/978-3-031-39477-5

Hoerl, R. W. and Snee, R. D. (2015), Statistical Engineering: An Idea Whose Time Has Come?, arXiv preprint, https://arxiv.org/abs/1511.06013

Meng, X.-L. (2019). Data Science: An Artificial Ecosystem. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.ba20f892

Pan, I., Mason, L., and Matar, M. (2021), Data-Centric Engineering: integrating simulation, machine learning and statistics. Challenges and Opportunities, arXiv preprint, https://arxiv.org/abs/2111.06223

Saltz, J. S. and Grady, N. W. (2017). The ambiguity of data science team roles and the need for a data science workforce framework. 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, pp. 2355–2361, doi: 10.1109/BigData.2017.8258190.

Steuer, D. (2020), Time for Data Science to Professionalise, Significance, Volume 17, Issue 4, August 2020, Pages 44–45, https://doi.org/10.1111/1740-9713.01430

Wilkerson, M. H. (2025). Mapping the Conceptual Foundation(s) of ‘Data Science Education.’ Harvard Data Science Review, 7(3). https://doi.org/10.1162/99608f92.9ac68105

Willett, R. (2019). Engineering Perspectives on AI. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.98280d4a

Wing, J. M., Janeja, V. P., Kloefkorn, T., and Erickson, L. C. (2018). Data Science Leadership Summit, Workshop Report, National Science Foundation. Retrieved from https://dl.acm.org/citation.cfm?id=3293458

Wing, J. M. (2020). Ten Research Challenge Areas in Data Science. Harvard Data Science Review, 2(3). https://doi.org/10.1162/99608f92.c6577b1f
