I am currently building a web application using the Twitter streaming API and need some advice on my database schema.
I basically have a list of hashtags that I track in real time. All tweets containing one of those hashtags are stored in a single table.
The thing is, the table quickly gets huge, and querying it afterwards (to get the list of tweets for a given user, for example) is far from efficient.
Just for information, I got something like 700,000 tweets crawled in less than 24 hours. That was for 5 hashtags, but I would like to scale up to at least a hundred.
I store them all, then populate a second table of unique users from it, which is turned into a leaderboard.
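To make the setup concrete, the tweets-to-leaderboard step might look like the sketch below. It uses SQLite via Python purely for illustration; the table and column names (`tweets`, `hashtag`, `user`) are assumptions, and the equivalent MySQL query would be essentially the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed schema: one row per crawled tweet.
cur.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, hashtag TEXT, user TEXT, body TEXT)"
)
cur.executemany(
    "INSERT INTO tweets (hashtag, user, body) VALUES (?, ?, ?)",
    [("#foo", "alice", "..."), ("#foo", "bob", "..."), ("#foo", "alice", "...")],
)

# Leaderboard: unique users for a hashtag, ranked by tweet count.
leaderboard = cur.execute(
    "SELECT user, COUNT(*) AS n FROM tweets WHERE hashtag = ? "
    "GROUP BY user ORDER BY n DESC",
    ("#foo",),
).fetchall()
print(leaderboard)  # → [('alice', 2), ('bob', 1)]
```

With this shape, the "unique users" table can be refreshed with a single `GROUP BY` query instead of being maintained by hand.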
So I was considering adding indexes, using the hashtag as a key. But I was also considering splitting the tweets across several tables, one per hashtag, in order to have a smaller amount of data to scan.
The table would be created at runtime when a new hashtag is added.
Am I going the right way with this, or is that a bad practice?
Or is MySQL just a bad idea, and should I go for another technology?
This is in general not best practice. It depends on the DB you are using, but a relational DB will usually handle varying sets of rows more efficiently if they are all in one table than if you break each set out into its own table. If you let the DB manage the data, it will do a better job of optimizing I/O than if you try to simulate partitioning with a bunch of tables in the file system.
Indeed, you should index the hashtag column if you are commonly going to retrieve all the rows for a given hashtag.
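In MySQL that is a one-liner along the lines of `CREATE INDEX idx_tweets_hashtag ON tweets (hashtag)`. Here is a runnable sketch using SQLite via Python that shows the effect: with the index in place, an equality filter on `hashtag` becomes an index search rather than a full table scan (table and index names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE tweets (id INTEGER PRIMARY KEY, hashtag TEXT, user TEXT, body TEXT)"
)

# Index the column used in WHERE clauses so per-hashtag lookups stay cheap
# even as the table grows to millions of rows.
cur.execute("CREATE INDEX idx_tweets_hashtag ON tweets (hashtag)")

# Ask the query planner how it would execute a per-hashtag query.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tweets WHERE hashtag = ?", ("#foo",)
).fetchall()
print(plan)  # the plan mentions idx_tweets_hashtag, i.e. an index search
```

MySQL has the same diagnostic in the form of `EXPLAIN SELECT ...`, which is worth running on your real queries before and after adding the index.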
There is one caveat: if you know you will only ever process one hashtag value at a time, and will never need data from two hashtags at once, then per-hashtag tables as you suggest might perform well. However, I doubt you are in that situation, and it is better to let the DB do what it is designed to do.