Anais Dotis-Georgiou is a Developer Advocate for InfluxData with a passion for making data beautiful using data analytics, AI, and machine learning. She takes the data that she collects and applies a mix of research, exploration, and engineering to translate it into something of function, value, and wonder. When she is not behind a screen, you can find her outside drawing, stretching, boarding, or chasing after a soccer ball.
InfluxData is the company building InfluxDB, the open source time series database used by more than a million developers around the world. Its mission is to help developers build intelligent, real-time systems with their time series data.
Can you share a bit about your journey from being a Research Assistant to becoming a Lead Developer Advocate at InfluxData? How has your background in data analytics and machine learning shaped your current role?
I earned my undergraduate degree in chemical engineering with a focus on biomedical engineering and eventually worked in labs performing vaccine development and prenatal autism detection. From there, I started programming liquid-handling robots and helping data scientists understand the parameters for anomaly detection, which got me more involved in programming.
I then became a sales development representative at Oracle and realized that I really wanted to focus on coding. I took a coding boot camp at the University of Texas in data analytics and was able to break into tech, specifically developer relations.
I came from a technical background, so that helped shape my current role. Though I didn’t have development experience, I could relate to and empathize with people who had an engineering background and mindset but were also trying to learn software. So, when I created content or technical tutorials, I was able to help new users overcome technical challenges while placing the conversation in a context that was relevant and interesting to them.
Your work seems to blend creativity with technical expertise. How do you incorporate your passion for making data ‘beautiful’ into your daily work at InfluxData?
These days, I’ve been more focused on data engineering than data analytics. While I don’t work with data analytics as much as I used to, I still really enjoy math; I think math is beautiful, and I will jump at any opportunity to explain the math behind an algorithm.
InfluxDB has been a cornerstone in the time series data space. How do you see the open source community influencing the development and evolution of InfluxDB?
InfluxData is very committed to the open data architecture and the Apache ecosystem. Last year we announced InfluxDB 3.0, the new core for InfluxDB written in Rust and built with Apache Flight, DataFusion, Arrow, and Parquet, what we call the FDAP stack. As the engineers at InfluxData continue to contribute to those upstream projects, the community continues to grow, and the Apache Arrow set of projects becomes easier to use, with more features, more functionality, and wider interoperability.
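To make that interoperability concrete, here is a minimal sketch (using pyarrow, with made-up metric and column names) of the kind of round-trip the Arrow and Parquet pieces of the FDAP stack enable: columnar data written to Parquet, read back as an Arrow table, filtered with a compute kernel, and handed to pandas without any custom conversion code.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Toy metrics table (hypothetical column names) built in Arrow's columnar format.
metrics = pa.table({
    "host": ["a", "a", "b", "b"],
    "usage_percent": [42.0, 93.5, 17.2, 98.1],
})

# Round-trip through Parquet, the on-disk format the FDAP stack is built around.
pq.write_table(metrics, "cpu_metrics.parquet")
table = pq.read_table("cpu_metrics.parquet")

# Filter with an Arrow compute kernel instead of a row-by-row Python loop.
busy = table.filter(pc.greater(table["usage_percent"], 90.0))

# The same columnar data flows into pandas (or DataFusion, Spark, etc.)
# without bespoke conversion code.
print(busy.to_pandas())
```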
What are some of the most exciting open-source projects or contributions you have seen recently in the context of time series data and AI?
It’s been cool to see LLMs being repurposed or applied to time series for zero-shot forecasting. AutoLab has a group of open time series language models, and TimeGPT is another great example.
Additionally, various open source stream processing libraries, including Bytewax and Mage.ai, that allow users to leverage and incorporate models from Hugging Face are pretty exciting.
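As a rough illustration of that pattern, below is a minimal Bytewax dataflow sketch. It assumes bytewax 0.18 or newer (the operator API has changed across versions), uses an in-memory test source in place of a real stream, and a simple threshold check stands in for where a Hugging Face or time series model call would go.

```python
import bytewax.operators as op
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource
from bytewax.connectors.stdio import StdOutSink

# Toy readings standing in for a real-time source such as Kafka or MQTT.
readings = [("sensor-1", 21.4), ("sensor-1", 22.0), ("sensor-1", 31.7)]

flow = Dataflow("anomaly_flagging")
stream = op.input("readings", flow, TestingSource(readings))

def flag(item):
    sensor, value = item
    # In practice, this map step is where you might run inference with a
    # Hugging Face model instead of a fixed threshold.
    return (sensor, value, "anomaly" if value > 30.0 else "ok")

flagged = op.map("flag", stream, flag)
op.output("print", flagged, StdOutSink())
# Run with: python -m bytewax.run <module_name>:flow
```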
How does InfluxData ensure its open source initiatives stay relevant and useful to the developer community, particularly with the rapid advancements in AI and machine learning?
InfluxData initiatives remain relevant and useful by focusing on contributing to open source projects that AI-specific companies also leverage. For instance, every time InfluxDB contributes to Apache Arrow, Parquet, or DataFusion, it benefits every other AI tech and company that leverages them, including Apache Spark, Databricks, Rapids.ai, Snowflake, BigQuery, Hugging Face, and more.
Time series language models are becoming increasingly important in predictive analytics. Can you elaborate on how these models are transforming time series forecasting and anomaly detection?
Time series LMs outperform linear and statistical models while also providing zero-shot forecasting. This means you don’t have to train the model on your data before using it. There’s also no need to tune a statistical model, which requires deep expertise in time series statistics.
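To make the contrast concrete, here is a hedged sketch: a classical ARIMA model has to be configured and fit on your own series, while a pretrained time series LM is meant to forecast it zero-shot. The `load_pretrained_ts_model` call is a placeholder for whichever checkpoint and library you use, not a real API.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# A synthetic series standing in for your own data.
y = np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.1, 200)

# Classical approach: choose (p, d, q), fit on *your* series, and retune
# whenever the data drifts.
arima = ARIMA(y, order=(2, 0, 1)).fit()
arima_forecast = arima.forecast(steps=24)

# Zero-shot approach (pseudocode): a pretrained time series LM forecasts
# without fitting on this series. The loader below is a placeholder,
# not an actual library call.
# model = load_pretrained_ts_model("some/checkpoint")
# lm_forecast = model.forecast(y, horizon=24)
```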
However, unlike natural language processing, the time series field lacks publicly accessible large-scale datasets. Most existing pre-trained models for time series are trained on small sample sizes, which contain only a few thousand, or maybe even just hundreds, of samples. Although these benchmark datasets have been instrumental in the time series community’s progress, their limited sample sizes and lack of generality pose challenges for pre-training deep learning models.
That said, this is what I believe makes open source time series LMs hard to come by. Google’s TimesFM and IBM’s Tiny Time Mixers have been trained on massive datasets with hundreds of billions of data points. With TimesFM, for instance, the pre-training process is done using Google Cloud TPU v3-256, which consists of 256 TPU cores with a total of two terabytes of memory. The pre-training process takes roughly ten days and results in a model with 1.2 billion parameters. The pre-trained model is then fine-tuned on specific downstream tasks and datasets using a lower learning rate and fewer epochs.
Hopefully, this shift means that more people can make accurate predictions without deep domain knowledge. However, it takes a lot of work to weigh the pros and cons of leveraging computationally expensive models like time series LMs from both a financial and environmental cost perspective.
This Hugging Face blog post details another great example of time series forecasting.
What are the key benefits of using time series LMs over traditional methods, especially in terms of handling complex patterns and zero-shot performance?
The critical advantage is not having to train and retrain a model on your time series data. This hopefully eliminates the online machine learning problem of monitoring your model’s drift and triggering retraining, ideally removing complexity from your forecasting pipeline.
You also don’t have to struggle to estimate the cross-series correlations or relationships required by multivariate statistical models. The additional variance introduced by those estimates often harms the resulting forecasts and can cause the model to learn spurious correlations.
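A small illustration of that point, assuming statsmodels and purely synthetic data: even a modest vector autoregression over ten series estimates hundreds of cross-series coefficients, each of which is an opportunity to fit noise and pick up a spurious relationship.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# Ten unrelated random walks, so any cross-series "signal" found is spurious.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(500, 10)).cumsum(axis=0),
    columns=[f"series_{i}" for i in range(10)],
)

# A VAR with 5 lags on 10 series estimates 10 * 10 * 5 = 500 cross-series
# coefficients (plus intercepts), all from the same 500 observations.
results = VAR(data).fit(maxlags=5)
print(results.params.shape)
```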
Could you provide some practical examples of how models like Google’s TimesFM, IBM’s TinyTimeMixer, and AutoLab’s MOMENT have been implemented in real-world scenarios?
This is difficult to answer; since these models are in their relative infancy, little is known about how companies use them in real-world scenarios.
In your experience, what challenges do organizations typically face when integrating time series LMs into their existing data infrastructure, and how can they overcome them?
Time series LMs are so new that I don’t know the precise challenges organizations face. However, I imagine they’ll confront the same challenges faced when incorporating any GenAI model into a data pipeline. These challenges include:
- Data compatibility and integration issues: Time series LMs often require specific data formats, consistent timestamping, and regular intervals, but existing data infrastructure might include unstructured or inconsistent time series data spread across different systems, such as legacy databases, cloud storage, or real-time streams. To handle this, teams should implement robust ETL (extract, transform, load) pipelines to preprocess, clean, and align time series data (a minimal preprocessing sketch follows this list).
- Model scalability and performance: Time series LMs, especially deep learning models like transformers, can be resource-intensive, requiring significant compute and memory resources to process large volumes of time series data in real time or near-real time. This can require teams to deploy models on scalable platforms like Kubernetes or cloud-managed ML services, leverage GPU acceleration when needed, and use distributed processing frameworks like Dask or Ray to parallelize model inference.
- Interpretability and trustworthiness: Time series models, particularly complex LMs, can be seen as “black boxes,” making it hard to interpret predictions. This can be particularly problematic in regulated industries like finance or healthcare.
- Data privacy and security: Handling time series data often involves sensitive information, such as IoT sensor data or financial transaction data, so ensuring data security and compliance is critical when integrating LMs. Organizations must ensure data pipelines and models comply with security best practices, including encryption and access control, and deploy models within secure, isolated environments.
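Here is the preprocessing sketch referenced in the first bullet, using pandas; the column names, the one-minute interval, and the toy readings are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw readings: unordered, irregular timestamps with a duplicate.
raw = pd.DataFrame({
    "time": ["2024-06-01 00:01:15", "2024-06-01 00:00:07",
             "2024-06-01 00:04:41", "2024-06-01 00:04:41"],
    "value": [10.9, 10.2, 12.4, 12.4],
})

raw["time"] = pd.to_datetime(raw["time"], utc=True)
clean = (
    raw.drop_duplicates(subset="time")   # remove duplicate readings
       .set_index("time")
       .sort_index()
       .resample("1min").mean()          # enforce a regular interval
       .interpolate(method="time")       # fill the gaps resampling exposes
)
print(clean)
```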
Looking ahead, how do you envision the role of time series LMs evolving in the field of predictive analytics and AI? Are there any emerging trends or technologies that particularly excite you?
A potential next step in the evolution of time series LMs could be introducing tools that enable users to deploy, access, and use them more easily. Many of the time series LMs I’ve used require very specific environments and lack a breadth of tutorials and documentation. Ultimately, these projects are in their early stages, but it will be exciting to see how they evolve in the coming months and years.