Finding an optimal way to index data

by wispi   Last Updated October 31, 2018 21:19

I have a text file consisting of several related data points, similar to the following:

# Last, First, Age

Start

Appleseed, John, 20,
Doe, John, 45,
Smith, Jane, 38,
...

End

If I want to read this file in a program, I open it and read it character by character. To find a value in the file, for example John Doe's age, I have to keep reading until I reach the line "Doe, John, 45," assuming I have no prior knowledge of the file's contents. For a file consisting of millions of entries this is very inefficient: on average, every lookup reads half the file.
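
For concreteness, the naive scan I have in mind looks something like this sketch in Python ("people.txt" and the exact parsing are placeholders based on the example above):

def find_age(path, last, first):
    # Scan every record from the top of the file until the name matches.
    with open(path) as f:
        for line in f:
            text = line.strip()
            if text.startswith("#") or text.count(",") < 2:
                continue                  # header, blank, Start/End, "..."
            l, fst, age = [x.strip() for x in text.rstrip(",").split(",")]
            if l == last and fst == first:
                return int(age)
    return None

print(find_age("people.txt", "Doe", "John"))  # 45, after reading every record above it

Every lookup starts over from the top of the file, so the cost grows linearly with the number of records.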

Increasing efficiency

Another approach is to add a sort of "index" to the file. For example:

# Last, First, Age

index { # By letter
    A, line 12,
    D, line 13,
    S, line 14,
    ...
}

Start

Appleseed, John, 20,
Doe, John, 45,
Smith, Jane, 38,
...

End

Now the entire index can be loaded into memory (or other fast storage). When a piece of data needs to be accessed, we consult the index first and then read the entry directly, without having to read through the file until we happen upon it.
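
A sketch of that idea, assuming the records are sorted by last name and using byte offsets rather than line numbers (a byte offset is what a file seek actually takes; "people.txt" is again a placeholder):

def build_index(path):
    # One pass: remember the byte offset of the first record for each
    # starting letter. The resulting dict is small and stays in memory.
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            raw = f.readline()
            if not raw:                       # end of file
                break
            text = raw.decode().strip()
            if text.startswith("#") or text.count(",") < 2:
                continue                      # header, blank, Start/End, "..."
            index.setdefault(text[0], offset)
    return index

def find_age(path, index, last, first):
    if last[0] not in index:
        return None
    with open(path, "rb") as f:
        f.seek(index[last[0]])                # jump straight to the right block
        for raw in f:
            text = raw.decode().strip()
            if text.startswith("#") or text.count(",") < 2:
                continue
            l, fst, age = [x.strip() for x in text.rstrip(",").split(",")]
            if l[0] != last[0]:               # left the block: name not present
                return None
            if l == last and fst == first:
                return int(age)
    return None

index = build_index("people.txt")
print(find_age("people.txt", index, "Doe", "John"))   # one seek, short scan

A lookup now touches the small in-memory index, performs one seek, and scans only a single letter's worth of records instead of the whole file.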

Indexing the index

But another problem appears. Even if we make each index entry as small as possible (by sacrificing human-readability), the index itself will eventually grow large enough that reading it has a significant cost. It seems like adding an "index of the index" to the file could help. For example:

# Last, First, Age

indexA { # Points to indexB
    A-R, line 10,
    S-V, line 12,
    ...
}

indexB { # By letter
    A, line 18,
    D, line 19,
    S, line 20,
    ...
}

Start

Appleseed, John, 20,
Doe, John, 45,
Smith, Jane, 38,
...

End

As an aside, I realize that using line numbers doesn't work very well, and that this is not a very efficient way of storing data in general. But if we sort out all the kinks and use this method of storing data at an extremely large scale, indexing the contents of the file should, in principle, increase efficiency.
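
For instance, with byte offsets standing in for line numbers, the lookup path through the two levels could look like the following sketch. The page boundaries and offsets are made up for illustration; in a real layout only indexA would stay in memory, and each indexB page would be a block on disk fetched on demand:

import bisect

# indexA: first letter of each range -> id of the indexB page covering it.
index_a = [("A", "page0"), ("S", "page1")]

# indexB pages: starting letter -> byte offset of the first matching record.
# These offsets are illustrative placeholders.
index_b = {
    "page0": {"A": 118, "D": 140},
    "page1": {"S": 158},
}

def data_offset(last):
    starts = [s for s, _ in index_a]
    i = bisect.bisect_right(starts, last[0]) - 1   # read 1: indexA, in memory
    if i < 0:
        return None
    page = index_b[index_a[i][1]]                  # read 2: one indexB page
    return page.get(last[0])                       # then one seek into the data

print(data_offset("Doe"))   # 140: seek there and scan just a few records

Whatever the file size, a lookup then costs one in-memory probe, one indexB page read, and one seek into the data.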

Given the principle of storing data with an index, is there an optimal number of indexes I should have? For example, if I added several indexes to a small file, reading all the indexes would be less efficient than going straight to the data. With a very large file, on the other hand, multiple indexes appear to greatly improve efficiency, since the program doesn't need to read every single data entry. Is my thinking about adding indexes completely wrong? Is this a problem that has been conceived of before?
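
To put rough numbers on that intuition: if each index node holds f entries (its fanout) and there are k levels of index over n records, a lookup reads about k index nodes plus a final block of roughly n / f^k records. A back-of-envelope sketch of that model (the constants are illustrative, not measurements):

def lookup_reads(n, f, k):
    # k index-node reads, then a scan of the remaining data block
    # (never less than one read).
    return k + max(1, n // f ** k)

for n in (50, 10_000_000):
    print(f"n = {n}:")
    for k in range(5):
        print(f"  {k} index levels -> about {lookup_reads(n, f=100, k=k)} reads")

Once n / f^k shrinks to a single block, which happens around k = log base f of n, every additional level only adds a read, which matches the small-file intuition above.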

Tags: algorithms

