Dr. Stavros Papadopoulos, Founder and CEO, TileDB – Interview Series


TileDB is the modern database that integrates all data modalities, code, and compute in a single product. TileDB was spun out of MIT and Intel Labs in May 2017.

Prior to founding TileDB, Inc. in February 2017, Dr. Stavros Papadopoulos was a Senior Research Scientist at the Intel Parallel Computing Lab and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for 3 years. He also spent about two years as a Visiting Assistant Professor in the Department of Computer Science and Engineering at the Hong Kong University of Science and Technology (HKUST). Stavros received his PhD degree in Computer Science at HKUST under the supervision of Prof. Dimitris Papadias, and held a postdoctoral fellowship at the Chinese University of Hong Kong with Prof. Yufei Tao.

You were previously a Senior Research Scientist at the Intel Parallel Computing Lab and a member of the Intel Science and Technology Center (ISTC) for Big Data at MIT CSAIL for 3 years. Can you share with us some key highlights from this era in your life?

During my time at Intel Labs and MIT, I had the unique opportunity to collaborate with luminaries in two different scientific sectors: high-performance computing (at Intel) and databases (at MIT). The knowledge and expertise I acquired became key in shaping my vision to create a brand-new kind of database system, which I ultimately built as a research project within the ISTC and spun out into what became TileDB.

Can you explain the vision behind TileDB and how it aims to revolutionize the modern database landscape?

Over the past few years, there has been an enormous uptake in machine learning and generative AI applications that help organizations make better decisions. Every day, organizations are discovering new patterns in their data, and then using this information to gain a competitive edge. These patterns emerge from an ever-growing spectrum of data modalities that must be housed and managed in order to be harnessed. From traditional tabular data to more complex data sources such as social posts, email, images, video, and sensor data, the ability to derive meaning from data requires analysis in aggregate. As data types multiply, this task is becoming far more arduous, demanding a new kind of database. This is exactly why TileDB was created.

Why is it crucial for organizations to prioritize their data infrastructure before developing advanced analytics and machine learning capabilities?

Amid the fervor to adopt AI is a critical and often overlooked truth: the success of any AI initiative is intrinsically tied to the quality and performance of the underlying data infrastructure.

The problem is that complex data that is not naturally represented as tables is considered "unstructured," and is often either stored as flat files in bespoke data formats or managed by disparate, purpose-built databases. Data scientists end up spending huge amounts of time wrangling data in order to consolidate it. It is estimated that 80-90 percent of data scientists' time is spent cleaning their data and preparing it for merging. That slows time to training AI algorithms and achieving predictive capabilities. Moreover, it means that only 10-20 percent of data scientists' time is spent creating insights.

What are the common pitfalls organizations face when they focus on AI and ML applications at the expense of a robust database infrastructure?

Organizations tend to focus on shiny new things. Large language models, vector databases, and generative AI apps built on top of a data infrastructure are current examples, pursued at the expense of addressing the underlying data infrastructure that is crucial to analytical success. Simply put, if your organization does this, you will be left spending an inordinate amount of time cobbling together your data infrastructure, and you will delay or altogether miss opportunities to glean insights.

Could you elaborate on what makes a database "adaptive" and why this adaptability is essential for modern data analytics?

An adaptive database is one that can shape-shift to accommodate all data, regardless of its modality, and store it together in a unified manner. An adaptive database brings structure to data that is otherwise considered "unstructured." It is estimated that 80 percent or more of the world's data is non-tabular, or unstructured, and most AI/ML models (including LLMs) are trained on this type of data.

TileDB structures data in multi-dimensional arrays. How does this format improve performance and cost-efficiency compared to traditional databases?

The foundational strength of a multidimensional array database is that it can morph to accommodate practically any data modality and application. A vector, for instance, is simply a one-dimensional array. By bringing structure to this "unstructured" data, you can consolidate your data infrastructure, significantly reduce costs, eliminate silos, increase productivity, and enhance security. Going a step further, when compute infrastructure is coupled with the data management infrastructure, you can quickly extract value from your data.
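To make the "everything is an array" idea concrete, here is a minimal conceptual sketch in plain NumPy. It only illustrates how different data modalities map onto arrays of different dimensionality; it does not show TileDB's actual storage engine or API, and the shapes and names are hypothetical examples:

```python
import numpy as np

# A vector (e.g., an ML embedding) is a one-dimensional array.
embedding = np.random.rand(384).astype(np.float32)

# An RGB image is a three-dimensional array: height x width x channels.
image = np.zeros((512, 512, 3), dtype=np.uint8)

# A genomic variant matrix can be viewed as a 2D array of samples x positions.
genotypes = np.zeros((1_000, 250_000), dtype=np.int8)

# Each modality is "just" an array with a different number of dimensions,
# which is why a single multidimensional array engine can model them all.
for name, arr in [("embedding", embedding), ("image", image), ("genotypes", genotypes)]:
    print(name, arr.ndim, arr.shape)
```

In an array database, each of these would additionally carry named dimensions and attributes, and could be stored dense or sparse depending on the modality.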

What are some notable use cases where TileDB has significantly improved data management and analytics performance?

The first TileDB use case was the storage, management, and analysis of vast genomic data, which is very difficult and expensive to model and store in a traditional, tabular database. We observed phenomenal performance gains (on the order of 100x faster in many cases over other databases and bespoke solutions). However, our multidimensional array model is universal and can efficiently capture other data modalities, too. For example, TileDB excels at handling biomedical imaging, satellite imaging, single-cell transcriptomics, and point cloud data like LiDAR and SONAR.

TileDB offers open-source tools for interoperability. How does an open source approach benefit the scientific and data science communities?

We're big proponents of open source at TileDB. The core library and data format specification are both open source. In addition, our life sciences offerings, built on top of the core array library, are also open source. This includes TileDB-SOMA, a package for efficient and scalable single-cell data management, which was built in collaboration with the Chan Zuckerberg Foundation and powers the CELLxGENE Discover Census, the world's largest fully curated single-cell dataset. This too is open source and is used by academic institutions and major pharmaceutical companies across the globe.

What do you see as the future trends in data management?

As data becomes richer, AI applications become smarter. Large language models are becoming more and more powerful, leveraging multiple data modalities, and the integration of these LLMs with diverse data sets is opening up a new frontier in AI known as multimodal AI.

Practically speaking, multimodal AI means that users are not limited to one input and one output type, and can prompt a model with virtually any input to generate virtually any content type. We see TileDB as the ideal database for supporting multimodal AI, built to support whatever new and varied types of data may emerge.
