MODEL VALIDATION & OPTIMIZATION
You know those cross-validation diagrams in every data science tutorial? The ones showing boxes in different colours moving around to explain how we split data for training and testing? Like this one:
I’ve seen them too — one too many times. These diagrams are everywhere — they’ve become the go-to way to explain cross-validation. But here’s something interesting I noticed while looking at them as both a designer and a data scientist.
When we look at a yellow box moving to different spots, our brain automatically sees it as one box moving around.
It’s just how our brains work — when we see something similar move to a new spot, we assume it’s the same thing. (This is actually why cartoons and animations work!)
But here’s the thing: in these diagrams, each box in a new position is supposed to represent a different chunk of data. So while our brain naturally wants to track the boxes, we have to tell it, “No, no, that’s not one box moving — those are different boxes!” It’s like fighting against how our brain naturally works, just to understand what the diagram means.
As someone who works with both design and data, I started thinking: maybe there’s a better way? What if we could show cross-validation in a way that actually works with how our brain processes information?
Cross-validation is about making sure machine learning models work well in the real world. Instead of testing a model once, we test it multiple times using different parts of our data. This helps us understand how the model will perform on new, unseen data.
Here’s what happens:
- We take our data
- Divide it into groups
- Use some groups for training, others for testing
- Repeat this process with different groupings
The goal is to get a reliable understanding of our model’s performance. That’s the core idea — simple and practical.
(Note: We’ll discuss different validation techniques and their applications in another article. For now, let’s focus on understanding the basic concept and why current visualization methods need improvement.)
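To ground the idea, here’s a minimal sketch of that process, assuming scikit-learn and a small made-up dataset (the model, fold count, and data below are illustrative choices rather than anything prescribed above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# A small synthetic dataset standing in for "our data".
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000)

# Divide the data into groups (folds), train on some groups, test on the rest,
# and repeat with different groupings.
folds = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=folds)

print(scores)          # one score per fold
print(scores.mean())   # a more reliable picture of overall performance
```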
Open up any machine learning tutorial, and you’ll probably see these kinds of diagrams:
- Long boxes split into different sections
- Arrows showing parts moving around
- Different colours showing training and testing data
- Multiple versions of the same diagram side by side
Here are the problems with such diagrams:
Not Everyone Sees Colours the Same Way
Colours create practical problems when showing data splits. Some people can’t differentiate certain colours, while others may not see colours at all. The visualization fails when printed in black and white or viewed on different screens where colours vary. Using colour as the primary way to distinguish data parts means some people miss important information because of how they perceive colour.
Colours Make Things Harder to Remember
Another thing about colours is that they might seem to help explain things, but they actually create extra work for our brain. When we use different colours for different parts of the data, we have to actively remember what each colour represents. This becomes a memory task instead of helping us understand the actual concept. The connection between colours and data splits isn’t natural or obvious — it’s something we have to learn and keep track of while trying to understand cross-validation itself.
Our brain doesn’t naturally connect colours with data splits.
Too Much Information at Once
The current diagrams also suffer from information overload. They try to display the complete cross-validation process in a single visualization, which creates unnecessary complexity. Multiple arrows, extensive labelling, all competing for attention. When we try to show every aspect of the process at the same time, we make it harder to focus on understanding each individual part. Instead of clarifying the concept, this approach adds an extra layer of complexity that we need to decode first.
Movement That Misleads
Movement in these diagrams creates a fundamental misunderstanding of how cross-validation actually works. When we show arrows and flowing elements, we’re suggesting a sequential process that doesn’t exist in reality. Cross-validation splits don’t have to happen in any particular order — the order of the splits doesn’t affect the results at all.
These diagrams also give the false impression that data physically moves during cross-validation. In reality, we’re simply selecting different rows from our original dataset each time. The data stays exactly where it is, and we just change which rows we use for testing in each split. When diagrams show data flowing between splits, they add unnecessary complexity to what should be a straightforward process.
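To make that concrete, here’s a small sketch, assuming scikit-learn’s `KFold` and a made-up ten-row array (neither comes from the diagrams themselves). Each split is nothing more than a pair of index arrays pointing into the same, untouched dataset:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(10, 1)   # ten rows that never move or get copied

# Each split is just two index arrays: which rows to train on, which to test on.
splits = list(KFold(n_splits=3).split(X))

for split_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Split {split_number}: test rows {test_idx}, train rows {train_idx}")

# Running the splits in reverse order would produce the same three evaluations,
# because each split is defined purely by which rows it selects, not by when it runs.
```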
What We Need Instead
We need diagrams that:
- Don’t rely on colours alone to explain things
- Show information in clear, separate chunks
- Make it obvious that different test groups are independent
- Don’t use unnecessary arrows and movement
Let’s fix this. Instead of trying to make our brains work differently, why don’t we create something that feels natural to look at?
Let’s try something different. First, this is how data looks to most people — rows and columns of numbers with an index.
Inspired by that structure, here’s a diagram that makes more sense.
Here’s why this design makes more sense logically:
- True Data Structure: It matches how data actually works in cross-validation. In practice, we’re selecting different portions of our dataset — not moving data around. Each column shows exactly which portion we’re using for testing in each split.
- Independent Splits: Each split explicitly shows that it uses different data. Unlike moving boxes that might make you think “it’s the same test set moving around,” this shows that Split 2 uses completely different data from Split 1. This matches what’s actually happening in your code.
- Data Conservation: By keeping the column height the same across all folds, we’re showing an important rule of cross-validation: you always use your entire dataset. Some portion for testing, the rest for training. Every piece of data gets used, nothing is ignored.
- Complete Coverage: Looking left to right, you can easily verify an important cross-validation principle: every portion of your dataset will be used as test data exactly once.
- Three-Fold Simplicity: We specifically use 3-fold cross-validation here because:
a. It clearly demonstrates the key concepts without overwhelming detail
b. The pattern is easy to follow: three distinct folds, three test sets. Simple enough to mentally track which portions are being used for training vs testing in each fold
c. Perfect for educational purposes — adding more folds (like 5 or 10) would make the visualization more cluttered without adding conceptual value
(Note: While 5-fold or 10-fold cross-validation may be more common in practice, 3-fold works perfectly well to illustrate the core concepts of the technique.)
Adding Indices for Clarity
While the concept above is correct, thinking in terms of actual row indices makes it even clearer:
Here are a few ways this visual improves on the previous one:
- Instead of just “different portions,” we can see that Fold 1 tests on rows 1–4, Fold 2 on rows 5–7, and Fold 3 on rows 8–10
- “Complete coverage” becomes more concrete: rows 1–10 each appear exactly once in the test sets
- Training sets are explicit: when testing on rows 1–4, we’re training on rows 5–10
- Data independence is clear: the test sets use disjoint row ranges (1–4, 5–7, 8–10)
This index-based view doesn’t change the concepts — it just makes them more concrete and easier to implement in code. Whether you think of it as portions or specific row numbers, the key principles remain the same: independent folds, complete coverage, and using all of your data.
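As a rough sketch of what that looks like in practice (assuming scikit-learn’s `KFold`, which is one way to produce these splits, not something the diagram requires), 10 rows and 3 folds give exactly the test groups described above:

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(1, 11)   # row numbers 1..10, as in the diagram

for fold_number, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(rows), start=1):
    print(f"Fold {fold_number}: test on rows {rows[test_idx]}, train on rows {rows[train_idx]}")

# Prints, fold by fold:
#   Fold 1 tests on rows 1-4 and trains on rows 5-10
#   Fold 2 tests on rows 5-7 and trains on rows 1-4 and 8-10
#   Fold 3 tests on rows 8-10 and trains on rows 1-7
```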
Adding Some Colours
If you feel the black-and-white version is too plain, this is also another acceptable option:
While using colours in this version may seem problematic given the issues with colour blindness and memory load mentioned before, it can still work as a helpful teaching tool alongside the simpler version.
The main reason is that it doesn’t rely on colours alone to convey the information — the row numbers (1–10) and fold numbers tell you everything you need to know, with colours just being a nice extra touch.
This means that even if someone can’t see the colours properly or prints the diagram in black and white, they can still understand everything through the numbers. And while having to remember what each colour means can make things harder to learn, in this case you don’t need to remember the colours — they’re just there as extra help for people who find them useful, and you can fully understand the diagram without them.
Just like the previous version, the row numbers also help by showing exactly how the data is being split up, making it easier to understand how cross-validation works in practice whether you pay attention to the colours or not.
The visualization stays fully functional and comprehensible even if you ignore the colours completely.
Let’s look at why our new design makes sense not only from a UX perspective, but also from a data science perspective.
Matching Mental Models: Think about how you explain cross-validation to someone. You probably say, “we take these rows for testing, then these rows, then these rows.” Our visualization now matches exactly how we think and talk about the process. We’re not just making it pretty, we’re making it match reality.
Data Structure Clarity: By showing data as columns with indices, we’re revealing the actual structure of our dataset. Each row has a number, and each number appears in exactly one test set. This isn’t just good design, it’s accurate to how our data is organized in code.
Focus on What Matters: The old way of showing cross-validation had us thinking about moving parts. But that’s not what matters in cross-validation. What matters is:
- Which rows are we testing on?
- Are we using all our data?
- Is each row used for testing exactly once?
Our new design answers these questions at a glance, and the short check below confirms them in code.
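Here’s one way to verify those three properties programmatically, a sketch assuming scikit-learn’s `KFold` and the same ten-row example used earlier:

```python
import numpy as np
from sklearn.model_selection import KFold

n_rows = 10
data = np.zeros((n_rows, 1))  # placeholder dataset; only the row count matters here

# Which rows are we testing on? Each fold tells us directly.
test_sets = [test_idx for _, test_idx in KFold(n_splits=3).split(data)]
for fold_number, test_idx in enumerate(test_sets, start=1):
    print(f"Fold {fold_number} tests on rows {test_idx + 1}")  # +1 for 1-based row numbers

# Are we using all our data, and is each row tested exactly once?
all_tested = np.sort(np.concatenate(test_sets))
assert np.array_equal(all_tested, np.arange(n_rows))  # every row appears exactly once
```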
Index-Based Understanding: Instead of abstract coloured boxes, we’re showing actual row indices. When you write cross-validation code, you’re working with these indices. Now the visualization matches your code — Fold 1 uses rows 1–4, Fold 2 uses rows 5–7, and so on.
Clear Data Flow: The layout reads from left to right: here’s your dataset, here’s how it’s split, here’s what each split looks like. It matches the logical steps of cross-validation, and it’s also easier to look at.
Here’s what we’ve learned from redrawing the cross-validation diagram:
Match Your Code, Not Conventions: We often stick with traditional ways of showing things simply because that’s how everyone does it. But cross-validation is really about selecting different rows of data for testing, so why not show exactly that? When your visualization matches your code, understanding follows naturally.
Data Structure Matters: By showing indices and actual data splits, we’re revealing how cross-validation really works while also painting a clearer picture. Each row has its place, each split has its purpose, and you can trace exactly what’s happening at each step.
Simplicity Has Its Purpose: It turns out that showing less can actually explain more. By focusing on the essential parts — which rows are being used for testing, and when — we’re not just simplifying the visualization, we’re also highlighting what actually matters in cross-validation.
Looking ahead, this thinking can apply to many data science concepts. Before making another visualization, ask yourself:
- Does this show what’s actually happening in the code?
- Can someone trace the data flow?
- Are we showing structure, or just following tradition?
Good visualization isn’t about following rules — it’s about showing the truth. And sometimes, the clearest truth is also the simplest.