How Indexing Works (Part 1): What Is an Index?
Why indexing?
- Improve query performance
- Make reading faster…
That’s about it!
But why, really?
Slow queries are one of the most (if not the most) common causes of poor application performance. A slow query can be the difference between a site being super snappy and being literally unusable. Sometimes the difference between these two extremes is just one good index away.
Indexing is a lot more than just throwing an index at every column in our WHERE clause and hoping that one of them sticks. If we don’t know what we are doing, a bad index can actually make performance worse.
Let’s find out how much there actually is to this topic.
What is an index?
Let’s take an example of a phone book. For simplicity’s sake, let’s say it has three pieces of information: First name, last name and the phone number. A phone book is almost always going to be ordered by last name. So if we’re given someone’s last name, we can find that person pretty quickly because the phone book is ordered for exactly that type of search query.
If we’re only given someone’s first name, then we have to look through every single entry and we will probably end up with a whole bunch of results. We could say that this phone book is having an index on last name. An index is an ordered representation of the indexed data.
Question: Why is order so important?
What we’re doing when we’re querying our data is we’re searching our data for a subset of that data. It turns out that searching an ordered input is a lot more efficient than searching an unordered input. So what is the index actually? That’s where B-Tree structure comes.
Read more: https://stackoverflow.com/questions/1108/how-does-database-indexing-work/1130#1130
B-Tree
From the example above, the root node is 23. We have the left subtree and the right subtree. All the values in the left subtree are less than 23 and all the values in the right subtree are greater than or equal to 23. This is true for every node in the tree. This is how the tree is ordered. The nodes at the very bottom are called the leaf nodes, they don’t have any further subtrees and they are all in the same level or the same depth. That’s why we call it B-Tree, B stands for Balance.
This is important because this way we can guarantee that it takes the same number of steps to find any node/value in the tree. If we have one set of the tree that is way deeper and unbalanced then searching for a value in that part of the tree will take longer, which is not what we want.
The leaf nodes will always be at the same level. They also form a doubly linked list (indicated by 2 arrows pointing back and forth). So each set of leaf nodes has a pointer to the next set of leaf nodes and also to the previous set.
Question: Why is that important?
When we’re searching for data in an index, we are going to spend a lot of our time down in those leaf nodes scanning through them. If we don’t have a doubly linked list then as soon as we hit the end of one of those leaf nodes, we would have to go back up the tree, find the next leaf node, start scanning, go up again.. That’s not very efficient. By having a doubly linked list, we can just scan through them sequentially without having to back up the tree every time.
Question: What data is actually stored on the index ?
Let’s say we have a simple table with employee_id and name. If we put an index on the name column, the index would end up looking like this
The employee_id is nowhere to be found on this index, it only contains the name. In other words, an index only contains the values of the column we actually put the index on.
But…
Question: If we need data that is not on the index, how do we get them? In this case, how could we also get the employee_id ?
The above image is actually incomplete. The fact is that the database stores one additional piece of information on the index for every value. It’s called Row ID
It’s a database internal identifier that identifies a particular row in a table (it’s not our primary key). That way we can go back to the table and find that corresponding row to which this value in the index belongs to.
Question: What are the consequences of choosing that B-tree data structure?
- Searching for a value if very fast because of binary search algorithm
- Logarithmic scalability. It doesn’t really matter how large our input gets, it can still keep performing.
Question: If index makes our queries faster then why don’t we put an index on every single column?
Using an index improves read time but it makes writing slower. With every insert, update and delete we perform, we also have to insert, update and delete the index which would mean the index will have to be potentially rebalanced (which will take time). If we have a whole bunch of indexes, our Writes are going to be very slow. So the rule is, we want to have as many indexes as necessary but as few as possible.