uk.ac.cam.juliet.twitter.database
Class Database

java.lang.Object
  extended by uk.ac.cam.juliet.twitter.database.Database
All Implemented Interfaces:
IDatabase

public class Database
extends java.lang.Object
implements IDatabase

A Java interface to the MySQL database.

Author:
Mansour Ahmed

Nested Class Summary
private static class Database.AnalysedTweets_table
           
private static class Database.BadWords_table
           
private static class Database.ChiSquaredFeatureSelection_table
           
private static class Database.Configurations_table
           
private static class Database.RawTweets_table
           
private static class Database.StopWords_table
           
private static class Database.TrainingData_table
           
private static class Database.Users_table
           
private static class Database.WordCount_table
           
 
Field Summary
private  java.sql.Connection connection
          the connection to the database that will be used by all the methods in the class to access the database
private  java.lang.String dbName
          the database name
private  java.lang.String dbURL
          the database url
private  java.lang.String password
          the password used to access the DB
private  int trainingDataCount
          this is a cache for getTrainingDataCount() to improve its performance
private  java.lang.String username
          the username used to access the DB
 
Constructor Summary
Database(java.lang.String dbName, java.lang.String username, java.lang.String password)
          default constructor
this constructor attempts to connect to the database using the parameters provided
 
Method Summary
 boolean addBadWord(java.lang.String badword)
          inserts the bad word into the badwords table with a default replacement of "[Censored]"; if the bad word is already in the table, the method does nothing and returns false, otherwise it returns true
 boolean addBadWord(java.lang.String badword, java.lang.String replacement)
          inserts badword in the badwords table alongside its clean replacement, if the badword is already in the table, the method does nothing and returns false, otherwise it returns true.
 boolean addRawTweet(Tweet tweet)
          attempts to insert the raw tweet (every field, all data, before analysis) into the raw_tweets table in the database; if the tweet is already in the database, the method does nothing and returns false.
When a tweet is inserted, it is marked as under_analysis by default, which implies that the database will be considered inconsistent until finishedAnalysis(Tweet) is called on the same tweet to un-mark it.
 void addStopWord(java.lang.String stopword)
          adds the argument to the list of stop words in the database
 boolean addUser(Tweet tweet)
          attempts to add the user to the Users table in the database if they do not already exist in it
 void clearAllAnalysis()
          deletes all rows from the analysis tables: analysed_tweets and word_count
 IDatabase copy()
          creates a new IDatabase object connected to the same database, username and password
 void createTables()
          attempts to create the required tables in the database if they are not already created.
 void decrementCount(java.lang.String word, int decrementValue)
          decrements the count of word in the word_count table by amount decrementValue; if the word is not found in the table, the method does nothing and returns
 void deleteBadWord(java.lang.String badword)
          deletes this badword from the badwords table, if the badword was not found, the method returns with no side effects
 void finishedAnalysis(Tweet tweet)
          marks the parameter tweet in the database as completely analysed meaning that the analysis and statistics data is consistent with the raw_tweets table
 void flushDatabase()
          drops all the client specific tables from the database
Warning: this method is irreversible and causes all the client-specific tables to be lost
 java.util.List<java.lang.String> getAllBadWords()
          reads all the badword regular expressions from the database
 java.util.List<java.lang.String> getAllStopwords()
          gets the list of all stopwords in the database
 double getChiSquared(java.lang.String word, Classification c)
          calculates the chi-squared statistic, which measures how independent the parameter word is of the class c; the lower the score, the more independent the word is of the class
For a good explanation of the chi-squared function, see the 6th page of this document from Stanford University
 int getClassificationCount(Classification c)
          counts the number of tweets in the training data that belong to classification c
 int getCountPerClassification(java.lang.String feature, Classification c)
          counts the number of tweets in the training data that have the feature "feature" and have classification "c"
 int getNumberOfDays()
          gets the number of days to keep old tweets in the database
 java.util.List<Tweet> getRawTweets(java.util.Date olderThan)
          gets from the database a list of all tweets that were created before the specified time
 java.lang.String getSearchString()
          gets the search string that is stored in the database
 int getTrainingDataCount()
          counts the training data in the database
 java.lang.String getTwitterUsernameAndPass()
          gets the twitter username and password used for authentication from the database
private  User getUser(java.lang.String username)
          gets from the user table in the database the user with the given username
 int getWordCount(java.lang.String word)
          gets the current count of the word in the word_count table in the database or 0 if the word is not present in the database.
 boolean hasCrashed()
          finds if any raw tweets are still marked as "under analysis"
 void incrementChiSquareWordCount(java.lang.String word, Classification c)
          inserts (word,0,0) in the table if word is not in the table already, and increments either n_p if c = positive or n_n if c = negative
 void incrementCount(java.lang.String word, int incrementValue)
          increments the count of word in the word_count table by amount incrementValue, if the word is not already in the table it is inserted with an initial count of incrementValue
 boolean insertAnalysis(java.math.BigInteger id, boolean isOffensive, double score)
          attempts to insert the tweet with identifier "id" into the analysed_tweets table with some analysis values; if the tweet id is already in the database, the method does nothing and returns false
 boolean insertTrainingTweet(Tweet tweet, Classification c)
          attempts to insert the given tweet in the database alongside its classification; if the insertion fails, the method makes no changes and returns false
 boolean isEmpty()
          checks if the database has any tables in it
static void main(java.lang.String[] args)
          main method for quick testing
 void refilterTweets(java.lang.String badRegex)
          goes over every tweet in the raw_tweets table and marks each tweet that contains the regular expression as nsfw in the analysed_tweets table.
For example: if the input is "belgium", every tweet in raw_tweets is matched against ".*\bbelgium\b.*" and, if it matches, is marked as nsfw
 void removeAnalysis(Tweet tweet)
          deletes from the analysed_tweets table the analysis of the argument tweet
 void removeInactiveUsers()
          removes from the users table every user who doesn't have a reference in the raw_tweets table
 void removeInsignificantWords()
          after cleanup, remove all word_counts where count <= 1
 void removeRawTweets(java.util.Date olderThan)
          deletes all raw_tweets that are older than the provided time
 void storeLastNumberOfDays(int numberOfDays)
          stores in the database the number of days for which to keep old tweets in the database
 void storeLastSearchString(java.lang.String searchString)
          stores the search string in a special client table
 void underAnalysis(java.util.Date olderThan)
          marks all tweets older than the "olderThan" parameter as under analysis, hence if the server crashes before they are unmarked, a crash can be detected.
 boolean updateBadWord(java.lang.String badword, java.lang.String replacement)
          updates the clean replacement of the badword in the badwords table
Warning: the old replacement in the database will be lost
 void updateIsOffensive(int id, boolean isOffensive)
          updates the analysed_tweets table to indicate whether or not the tweet with identifier "id" contains offensive content.
 void updateScore(int id, double score)
          updates the analysed_tweets table to set a new sentiment score for the tweet with identifier "id"
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

connection

private java.sql.Connection connection
the connection to the database that will be used by all the methods in the class to access the database


dbURL

private java.lang.String dbURL
the database url


dbName

private java.lang.String dbName
the database name


username

private java.lang.String username
the username used to access the DB


password

private java.lang.String password
the password used to access the DB


trainingDataCount

private int trainingDataCount
this is a cache for getTrainingDataCount() to improve its performance

Constructor Detail

Database

public Database(java.lang.String dbName,
                java.lang.String username,
                java.lang.String password)
         throws java.sql.SQLException,
                java.lang.ClassNotFoundException
default constructor
this constructor attempts to connect to the database using the parameters provided

Parameters:
dbName - the name of the database
username - the MySQL username to use
password - the MySQL password
Throws:
java.sql.SQLException
java.lang.ClassNotFoundException
Method Detail

createTables

public void createTables()
Description copied from interface: IDatabase
attempts to create the required tables in the database if they are not already created.

Specified by:
createTables in interface IDatabase

addRawTweet

public boolean addRawTweet(Tweet tweet)
Description copied from interface: IDatabase
attempts to insert the raw tweet (every field, all data, before analysis) into the raw_tweets table in the database; if the tweet is already in the database, the method does nothing and returns false.
When a tweet is inserted, it is marked as under_analysis by default, which implies that the database will be considered inconsistent until finishedAnalysis(Tweet) is called on the same tweet to un-mark it.

Specified by:
addRawTweet in interface IDatabase
Parameters:
tweet - is the tweet object, as fetched from the Twitter API
Returns:
true if the insert was successful and false otherwise

insertAnalysis

public boolean insertAnalysis(java.math.BigInteger id,
                              boolean isOffensive,
                              double score)
Description copied from interface: IDatabase
attempts to insert the tweet with identifier "id" into the analysed_tweets table with some analysis values; if the tweet id is already in the database, the method does nothing and returns false

Specified by:
insertAnalysis in interface IDatabase
Parameters:
id - the id of the tweet as returned by getRawTweetID and addRawTweet
isOffensive - a boolean value indicating whether the tweet contains offensive content
score - the sentiment analysis value between 0.0 and 1.0
Returns:
true if the insert was successful, and false otherwise

updateIsOffensive

public void updateIsOffensive(int id,
                              boolean isOffensive)
Description copied from interface: IDatabase
updates the analysed_tweets table to indicate whether or not the tweet with identifier "id" contains offensive content.

Specified by:
updateIsOffensive in interface IDatabase
Parameters:
id - the id of the tweet as returned by getRawTweetID and addRawTweet
isOffensive - a boolean value indicating whether the tweet contains offensive content

updateScore

public void updateScore(int id,
                        double score)
Description copied from interface: IDatabase
updates the analysed_tweets table to set a new sentiment score for the tweet with identifier "id"

Specified by:
updateScore in interface IDatabase
Parameters:
id - the id of the tweet as returned by getRawTweetID and addRawTweet
score - the sentiment analysis value between 0.0 and 1.0

addUser

public boolean addUser(Tweet tweet)
Description copied from interface: IDatabase
attempts to add the user to the Users table in the database if they do not already exist in it

Specified by:
addUser in interface IDatabase
Returns:
true if the insertion was successful, false if the user is present in the database

addBadWord

public boolean addBadWord(java.lang.String badword)
Description copied from interface: IDatabase
inserts the bad word into the badwords table with a default replacement of "[Censored]"; if the bad word is already in the table, the method does nothing and returns false, otherwise it returns true

Specified by:
addBadWord in interface IDatabase
Parameters:
badword - the badword to be inserted
Returns:
true if the insert was successful, false otherwise

addBadWord

public boolean addBadWord(java.lang.String badword,
                          java.lang.String replacement)
Description copied from interface: IDatabase
inserts badword in the badwords table alongside its clean replacement, if the badword is already in the table, the method does nothing and returns false, otherwise it returns true.

Specified by:
addBadWord in interface IDatabase
Parameters:
badword - the badword to be inserted
replacement - the clean replacement
Returns:
true if the insert was successful, false if the badword is already in the database
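
The insert-if-absent contract shared by both addBadWord overloads can be illustrated without a database. The following is a minimal in-memory sketch, not the class's actual implementation: BadWordsSketch and replacementFor are hypothetical names, and a map stands in for the badwords table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the badwords table (not the real Database class). */
class BadWordsSketch {
    static final String DEFAULT_REPLACEMENT = "[Censored]";

    // badword -> clean replacement
    private final Map<String, String> replacements = new LinkedHashMap<>();

    /** Single-argument overload: falls back to the default replacement. */
    boolean addBadWord(String badword) {
        return addBadWord(badword, DEFAULT_REPLACEMENT);
    }

    /** Returns true if the word was inserted, false if it was already present. */
    boolean addBadWord(String badword, String replacement) {
        // putIfAbsent leaves an existing entry untouched and returns its old value;
        // it returns null only when the key was absent (replacement assumed non-null).
        return replacements.putIfAbsent(badword, replacement) == null;
    }

    /** Hypothetical helper used only to inspect the sketch. */
    String replacementFor(String badword) {
        return replacements.get(badword);
    }
}
```

A second call with the same bad word returns false and keeps the replacement that was stored first.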

updateBadWord

public boolean updateBadWord(java.lang.String badword,
                             java.lang.String replacement)
Description copied from interface: IDatabase
updates the clean replacement of the badword in the badwords table
Warning: the old replacement in the database will be lost

Specified by:
updateBadWord in interface IDatabase
Parameters:
badword - the badword for which the replacement needs to be updated
replacement - the new replacement word
Returns:
true if the update was successful and false otherwise

getAllBadWords

public java.util.List<java.lang.String> getAllBadWords()
Description copied from interface: IDatabase
reads all the badword regular expressions from the database

Specified by:
getAllBadWords in interface IDatabase
Returns:
a list of all the bad regular expressions

incrementCount

public void incrementCount(java.lang.String word,
                           int incrementValue)
Description copied from interface: IDatabase
increments the count of word in the word_count table by amount incrementValue, if the word is not already in the table it is inserted with an initial count of incrementValue

Specified by:
incrementCount in interface IDatabase
Parameters:
word - the word for which the count needs to be incremented
incrementValue - the number by which to increment the count of word

decrementCount

public void decrementCount(java.lang.String word,
                           int decrementValue)
Description copied from interface: IDatabase
decrements the count of word in the word_count table by amount decrementValue; if the word is not found in the table, the method does nothing and returns

Specified by:
decrementCount in interface IDatabase
Parameters:
word - the word for which the count needs to be decremented
decrementValue - the number by which to decrement the count of word

getWordCount

public int getWordCount(java.lang.String word)
Description copied from interface: IDatabase
gets the current count of the word in the word_count table in the database or 0 if the word is not present in the database. This function is not case sensitive, e.g. "Hello" and "hello" return the same result

Specified by:
getWordCount in interface IDatabase
Parameters:
word - the word for which to get the count.
Returns:
the count of the word in the word_count table in the database
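
The semantics of incrementCount, decrementCount and getWordCount described above can be sketched with a plain map. This is an illustration under stated assumptions, not the class's implementation: WordCountSketch is a hypothetical name, case-insensitivity is modelled by lower-casing keys (the documentation only states it for getWordCount), and nothing here prevents a count from going negative because the documentation does not say how that case is handled.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory stand-in for the word_count table. */
class WordCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Assumption: lookups and updates ignore case by normalising the key.
    private static String key(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    /** Inserts the word with incrementValue, or adds incrementValue to its count. */
    void incrementCount(String word, int incrementValue) {
        counts.merge(key(word), incrementValue, Integer::sum);
    }

    /** Subtracts decrementValue if the word exists; otherwise does nothing. */
    void decrementCount(String word, int decrementValue) {
        counts.computeIfPresent(key(word), (k, v) -> v - decrementValue);
    }

    /** Returns the stored count, or 0 for a word that is not present. */
    int getWordCount(String word) {
        return counts.getOrDefault(key(word), 0);
    }
}
```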

insertTrainingTweet

public boolean insertTrainingTweet(Tweet tweet,
                                   Classification c)
Description copied from interface: IDatabase
attempts to insert the given tweet in the database alongside its classification; if the insertion fails, the method makes no changes and returns false

Specified by:
insertTrainingTweet in interface IDatabase
Parameters:
tweet - the tweet to store in the database
c - the classification (Positive or Negative)
Returns:
true if the insertion was successful, false otherwise

getChiSquared

public double getChiSquared(java.lang.String word,
                            Classification c)
Description copied from interface: IDatabase
calculates the chi-squared statistic, which measures how independent the parameter word is of the class c; the lower the score, the more independent the word is of the class
For a good explanation of the chi-squared function, see the 6th page of this document from Stanford University

Specified by:
getChiSquared in interface IDatabase
Parameters:
word - the word to measure independence against
c - the classification to measure independence against
Returns:
chi-squared(word,c)
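
For orientation, the standard chi-squared statistic for a 2x2 contingency table (word present or absent, class c or not c) can be computed as in the sketch below. Whether getChiSquared uses exactly this form is an assumption; ChiSquaredSketch and its parameter names are hypothetical. In terms of this class, the four cells could plausibly be derived from getCountPerClassification, getClassificationCount and getTrainingDataCount.

```java
/** Hypothetical helper: standard 2x2 chi-squared statistic for feature selection. */
class ChiSquaredSketch {
    /**
     * @param n11 tweets that contain the word and belong to class c
     * @param n10 tweets that contain the word and do not belong to class c
     * @param n01 tweets that lack the word and belong to class c
     * @param n00 tweets that lack the word and do not belong to class c
     */
    static double chiSquared(long n11, long n10, long n01, long n00) {
        // Work in doubles so that products of large counts cannot overflow.
        double n = (double) n11 + n10 + n01 + n00;
        double diff = (double) n11 * n00 - (double) n10 * n01;
        double denominator = ((double) n11 + n01) * ((double) n11 + n10)
                * ((double) n10 + n00) * ((double) n01 + n00);
        if (denominator == 0.0) {
            // Assumption: a degenerate table (an empty row or column) scores 0.
            return 0.0;
        }
        return n * diff * diff / denominator;
    }
}
```

A score of 0 means the word occurs equally often inside and outside the class; larger scores indicate stronger dependence between word and class.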

getClassificationCount

public int getClassificationCount(Classification c)
Description copied from interface: IDatabase
counts the number of tweets in the training data that belong to classification c

Specified by:
getClassificationCount in interface IDatabase
Parameters:
c - the classification to count for
Returns:
the number of tweets in the training data that belong to classification c

getCountPerClassification

public int getCountPerClassification(java.lang.String feature,
                                     Classification c)
Description copied from interface: IDatabase
counts the number of tweets in the training data that have the feature "feature" and have classification "c"

Specified by:
getCountPerClassification in interface IDatabase
Parameters:
feature - the feature to be counted
c - the classification to be counted
Returns:
the number of tweets in the training data that have the feature "feature" and have classification "c"

incrementChiSquareWordCount

public void incrementChiSquareWordCount(java.lang.String word,
                                        Classification c)
Description copied from interface: IDatabase
inserts (word,0,0) in the table if word is not in the table already, and increments either n_p if c = positive or n_n if c = negative

Specified by:
incrementChiSquareWordCount in interface IDatabase
Parameters:
word - the word to insert/update in the table
c - the column to increment (either positive or negative)
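
The (word, n_p, n_n) row handling can be pictured with an in-memory sketch. The names below are hypothetical: FeatureCountsSketch stands in for the table, and the nested Classification enum stands in for the project's own Classification type. That getCountPerClassification reads from the same rows is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for the chi-squared feature-count table. */
class FeatureCountsSketch {
    /** Hypothetical stand-in for the project's Classification type. */
    enum Classification { POSITIVE, NEGATIVE }

    // word -> {n_p, n_n}
    private final Map<String, int[]> rows = new HashMap<>();

    void incrementChiSquareWordCount(String word, Classification c) {
        // Insert (word, 0, 0) if the word is new, then bump the matching column.
        int[] row = rows.computeIfAbsent(word, w -> new int[] {0, 0});
        if (c == Classification.POSITIVE) {
            row[0]++;
        } else {
            row[1]++;
        }
    }

    /** Assumption: per-class feature counts are read back from the same row. */
    int getCountPerClassification(String feature, Classification c) {
        int[] row = rows.get(feature);
        if (row == null) {
            return 0;
        }
        return c == Classification.POSITIVE ? row[0] : row[1];
    }
}
```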

getTrainingDataCount

public int getTrainingDataCount()
Description copied from interface: IDatabase
counts the training data in the database

Specified by:
getTrainingDataCount in interface IDatabase
Returns:
the number of training tweets in the database

getAllStopwords

public java.util.List<java.lang.String> getAllStopwords()
Description copied from interface: IDatabase
gets the list of all stopwords in the database

Specified by:
getAllStopwords in interface IDatabase
Returns:
the list of all stopwords

addStopWord

public void addStopWord(java.lang.String stopword)
Description copied from interface: IDatabase
adds the argument to the list of stop words in the database

Specified by:
addStopWord in interface IDatabase
Parameters:
stopword - the stopword to add

clearAllAnalysis

public void clearAllAnalysis()
Description copied from interface: IDatabase
deletes all rows from the analysis tables: analysed_tweets and word_count

Specified by:
clearAllAnalysis in interface IDatabase

flushDatabase

public void flushDatabase()
Description copied from interface: IDatabase
drops all the client specific tables from the database
Warning: this method is irreversible and causes all the client-specific tables to be lost

Specified by:
flushDatabase in interface IDatabase

removeRawTweets

public void removeRawTweets(java.util.Date olderThan)
Description copied from interface: IDatabase
deletes all raw_tweets that are older than the provided time

Specified by:
removeRawTweets in interface IDatabase
Parameters:
olderThan - delete all tweets older than this time
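
One way a caller might derive the olderThan argument from the retention period returned by getNumberOfDays() is sketched below. This pairing of the two methods is an assumption about intended usage, and RetentionSketch and cutoff are hypothetical names.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper that turns a retention period into a cutoff date. */
class RetentionSketch {
    /** Returns the instant numberOfDays before now; older tweets would be removed. */
    static Date cutoff(Date now, int numberOfDays) {
        return new Date(now.getTime() - TimeUnit.DAYS.toMillis(numberOfDays));
    }
}
```

The resulting Date could then be passed to removeRawTweets, for example as cutoff(new Date(), numberOfDays).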

isEmpty

public boolean isEmpty()
Description copied from interface: IDatabase
checks if the database has any tables in it

Specified by:
isEmpty in interface IDatabase
Returns:
true if the database has no tables and false otherwise

getSearchString

public java.lang.String getSearchString()
Description copied from interface: IDatabase
gets the search string that is stored in the database

Specified by:
getSearchString in interface IDatabase
Returns:
the search string that is stored in the database

getNumberOfDays

public int getNumberOfDays()
Description copied from interface: IDatabase
gets the number of days to keep old tweets in the database

Specified by:
getNumberOfDays in interface IDatabase
Returns:
the number of days to keep old tweets in the database

storeLastSearchString

public void storeLastSearchString(java.lang.String searchString)
Description copied from interface: IDatabase
stores the search string in a special client table

Specified by:
storeLastSearchString in interface IDatabase
Parameters:
searchString - the search string to store

storeLastNumberOfDays

public void storeLastNumberOfDays(int numberOfDays)
Description copied from interface: IDatabase
stores in the database the number of days for which to keep old tweets in the database

Specified by:
storeLastNumberOfDays in interface IDatabase
Parameters:
numberOfDays - the number to store

finishedAnalysis

public void finishedAnalysis(Tweet tweet)
Description copied from interface: IDatabase
marks the parameter tweet in the database as completely analysed meaning that the analysis and statistics data is consistent with the raw_tweets table

Specified by:
finishedAnalysis in interface IDatabase
Parameters:
tweet - the tweet to mark as completely analysed

underAnalysis

public void underAnalysis(java.util.Date olderThan)
Description copied from interface: IDatabase
marks all tweets older than the "olderThan" parameter as under analysis, hence if the server crashes before they are unmarked, a crash can be detected.

Specified by:
underAnalysis in interface IDatabase
Parameters:
olderThan - mark all tweets older than this parameter

getRawTweets

public java.util.List<Tweet> getRawTweets(java.util.Date olderThan)
Description copied from interface: IDatabase
gets from the database a list of all tweets that were created before the specified time

Specified by:
getRawTweets in interface IDatabase
Parameters:
olderThan - get all tweets created before this time parameter
Returns:
all tweets created before this time parameter

getUser

private User getUser(java.lang.String username)
              throws java.net.URISyntaxException
gets from the user table in the database the user with the given username

Parameters:
username - the username of the user to get from the database
Returns:
from the database the user with the given username
Throws:
java.net.URISyntaxException

removeAnalysis

public void removeAnalysis(Tweet tweet)
Description copied from interface: IDatabase
deletes from the analysed_tweets table the analysis of the argument tweet

Specified by:
removeAnalysis in interface IDatabase
Parameters:
tweet - delete the analysis of this tweet

hasCrashed

public boolean hasCrashed()
Description copied from interface: IDatabase
finds if any raw tweets are still marked as "under analysis"

Specified by:
hasCrashed in interface IDatabase
Returns:
true if one or more raw tweets are marked as under analysis, or if the connection to the database fails; false if no such raw tweets exist
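
The crash-detection protocol formed by addRawTweet, finishedAnalysis and hasCrashed can be sketched in memory as follows. This is an illustration only: AnalysisFlagSketch is a hypothetical name, tweets are reduced to a numeric id, and the real method's behaviour of returning true when the database connection fails is not modelled.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of the under_analysis flag on raw tweets. */
class AnalysisFlagSketch {
    // tweet id -> under_analysis flag
    private final Map<Long, Boolean> underAnalysis = new HashMap<>();

    /** New tweets start out marked as under analysis; duplicates are rejected. */
    boolean addRawTweet(long tweetId) {
        return underAnalysis.putIfAbsent(tweetId, Boolean.TRUE) == null;
    }

    /** Clears the mark once analysis data is consistent with the raw tweet. */
    void finishedAnalysis(long tweetId) {
        underAnalysis.computeIfPresent(tweetId, (id, flag) -> Boolean.FALSE);
    }

    /** Any tweet still marked means a previous run stopped mid-analysis. */
    boolean hasCrashed() {
        return underAnalysis.containsValue(Boolean.TRUE);
    }
}
```

On startup, a caller could check hasCrashed() before resuming and rebuild the analysis tables if it returns true.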

removeInactiveUsers

public void removeInactiveUsers()
Description copied from interface: IDatabase
removes from the users table every user who doesn't have a reference in the raw_tweets table

Specified by:
removeInactiveUsers in interface IDatabase

refilterTweets

public void refilterTweets(java.lang.String badRegex)
Description copied from interface: IDatabase
goes over every tweet in the raw_tweets table and marks each tweet that contains the regular expression as nsfw in the analysed_tweets table.
For example: if the input is "belgium", every tweet in raw_tweets is matched against ".*\bbelgium\b.*" and, if it matches, is marked as nsfw

Specified by:
refilterTweets in interface IDatabase
Parameters:
badRegex - the bad regular expression to check against
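
The whole-word wrapping shown in the example pattern can be reproduced with java.util.regex. The sketch below assumes the wrapping is done on the Java side and that matching is case sensitive, neither of which the documentation confirms; RefilterSketch, wholeWord and isNsfw are hypothetical names.

```java
import java.util.regex.Pattern;

/** Hypothetical helper mirroring the ".*\bWORD\b.*" wrapping from the description. */
class RefilterSketch {
    /** Wraps the bad regex so that it only matches between word boundaries. */
    static Pattern wholeWord(String badRegex) {
        return Pattern.compile(".*\\b" + badRegex + "\\b.*");
    }

    /** True if the whole tweet text matches the wrapped pattern. */
    static boolean isNsfw(String tweetText, String badRegex) {
        // matches() requires the entire input to match, hence the leading and trailing ".*".
        return wholeWord(badRegex).matcher(tweetText).matches();
    }
}
```

Because of the \b anchors, "belgium" flags a tweet that contains the word on its own but not one that only contains "belgiums". Note that "." does not match line breaks by default, so multi-line text would need Pattern.DOTALL.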

copy

public IDatabase copy()
Description copied from interface: IDatabase
creates a new IDatabase object connected to the same database, username and password

Specified by:
copy in interface IDatabase
Returns:
another IDatabase object connected to the same database, username and password

deleteBadWord

public void deleteBadWord(java.lang.String badword)
Description copied from interface: IDatabase
deletes this badword from the badwords table, if the badword was not found, the method returns with no side effects

Specified by:
deleteBadWord in interface IDatabase
Parameters:
badword - the badword to delete

getTwitterUsernameAndPass

public java.lang.String getTwitterUsernameAndPass()
Description copied from interface: IDatabase
gets the twitter username and password used for authentication from the database

Specified by:
getTwitterUsernameAndPass in interface IDatabase
Returns:
the Twitter username and password in one string separated by a colon, like this: "username:password"
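
A caller has to split the returned string back into its two parts. The sketch below splits at the first colon, which assumes the username itself contains no colon; CredentialsSketch and split are hypothetical names.

```java
/** Hypothetical helper for unpacking the "username:password" string. */
class CredentialsSketch {
    /** Returns {username, password}, splitting at the first colon only. */
    static String[] split(String usernameAndPass) {
        int colon = usernameAndPass.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected \"username:password\"");
        }
        // Everything after the first colon is kept, so passwords may contain colons.
        return new String[] {
            usernameAndPass.substring(0, colon),
            usernameAndPass.substring(colon + 1)
        };
    }
}
```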

removeInsignificantWords

public void removeInsignificantWords()
Description copied from interface: IDatabase
after cleanup, remove all word_counts where count <= 1

Specified by:
removeInsignificantWords in interface IDatabase

main

public static void main(java.lang.String[] args)
                 throws java.sql.SQLException,
                        java.lang.ClassNotFoundException
main method for quick testing

Parameters:
args -
Throws:
java.sql.SQLException
java.lang.ClassNotFoundException