uk.ac.cam.juliet.twitter.analysis
Class BayesUtil

java.lang.Object
  extended by uk.ac.cam.juliet.twitter.analysis.BayesUtil

public class BayesUtil
extends java.lang.Object

Contains all the text-processing functions that the BayesClassifier and the BayesLearner need.

Author:
Mansour Ahmed

Field Summary
static java.lang.String allCaps
          keyword to tag capital letters
 java.lang.String frownKeyword
          keyword to replace frowny emoticons
static java.lang.String laughKeyword
          keyword to replace laughs
 java.lang.String smileKeyword
          keyword to replace smiling emoticons
static java.lang.String urlKeyword
          keyword to replace urls
 java.lang.String usernameKeyword
          keyword to replace usernames
 
Constructor Summary
BayesUtil()
           
 
Method Summary
static java.lang.String detectAllCaps(java.lang.String sentence)
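          if the function detects any string of consecutive capital letters in the tweet (e.g. "I AM SO EXCITED!!") it inserts the keyword "_ALL_CAPS" in the tweet and turns the rest of the tweet to lowercase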
static java.util.List<java.lang.String> extractWords(java.lang.String sentence)
          simply creates a list of every word inside the sentence with all punctuation after words removed
 java.lang.String insertKeywords(java.lang.String sentence)
          calls the methods: removeStopWords, replaceEmoticons, replaceLaughs, replaceRepeatedLetters, replaceURLs and replaceUsernames on the input string in one go
static boolean isLatin(java.lang.String sentence)
          checks whether the text is entirely in Latin characters, to avoid characters that show up as question marks (?)
static void main(java.lang.String[] args)
          main method for quick testing
 java.lang.String processText(java.lang.String sentence)
          processes the input tweet and produces output in a standard form so that it can be used by the BayesClassifier and the BayesLearner

For example, the input string:
"@foo hahaha! I agree with you TOTALLY :-) http://cnn.com/"
produces the output:
"_USERNAME _LAUGH! agree totally :-) _URL _ALL_CAPS"
This function calls the other functions in the class in a certain order, each performing one atomic processing step.
 java.lang.String removeStopWords(java.lang.String sentence)
          stopwords are words that are usually ignored by search engines because they show up very frequently in English.
 java.lang.String replaceEmoticons(java.lang.String sentence)
          replaces happy and sad emoticons with two keywords, _SMILE and _FROWN
static java.lang.String replaceLaughs(java.lang.String sentence)
          replaces laughs (like "hahaha" or "ahaahaha") with the single keyword _LAUGH
static java.lang.String replaceRepeatedLetters(java.lang.String sentence)
          Repeated letters cause a problem for the Bayes algorithm. This function takes a word like "huuuuuungrrrry" and reduces every run of repeated letters to 2 letters only; the result in this case is "huungrry". This reduces the number of different ways the word "hungry" can be written.
static java.lang.String replaceURLs(java.lang.String sentence)
          replaces every URL it finds with the keyword _URL
 java.lang.String replaceUsernames(java.lang.String sentence)
          replaces any mentioned username in a tweet (e.g. "@foo") with the keyword _USERNAME
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

usernameKeyword

public java.lang.String usernameKeyword
keyword to replace usernames


laughKeyword

public static java.lang.String laughKeyword
keyword to replace laughs


smileKeyword

public java.lang.String smileKeyword
keyword to replace smiling emoticons


frownKeyword

public java.lang.String frownKeyword
keyword to replace frowny emoticons


urlKeyword

public static java.lang.String urlKeyword
keyword to replace urls


allCaps

public static java.lang.String allCaps
keyword to tag capital letters

Constructor Detail

BayesUtil

public BayesUtil()
Method Detail

processText

public java.lang.String processText(java.lang.String sentence)
processes the input tweet and produces output in a standard form so that it can be used by the BayesClassifier and the BayesLearner

For example, the input string:
"@foo hahaha! I agree with you TOTALLY :-) http://cnn.com/"
produces the output:
"_USERNAME _LAUGH! agree totally :-) _URL _ALL_CAPS"
This function calls the other functions in the class in a certain order, each performing one atomic processing step.

Parameters:
sentence - the sentence to be processed
Returns:
the processed text

removeStopWords

public java.lang.String removeStopWords(java.lang.String sentence)
Stop words are words that are usually ignored by search engines because they show up very frequently in English, for example "a", "always" and "I".

This method checks the input string against a standard list of stop words and removes any stop words it finds.

Parameters:
sentence - the input string in question
Returns:
the same sentence with all stop words removed
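
The filtering described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the class name StopWordSketch, the five-word list and the whitespace tokenisation are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopWordSketch {
    // Hypothetical, deliberately tiny list; the documented method checks a standard list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "always", "i", "with", "you"));

    static String removeStopWords(String sentence) {
        List<String> kept = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            // Compare in lowercase so that "I" and "i" are both dropped.
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }
}
```

With this list, "I agree with you totally" is reduced to "agree totally". Tokens with attached punctuation (such as "you,") would not match in this sketch.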

replaceRepeatedLetters

public static java.lang.String replaceRepeatedLetters(java.lang.String sentence)
Repeated letters cause a problem for the Bayes algorithm. This function takes a word like "huuuuuungrrrry" and reduces every run of repeated letters to 2 letters only; the result in this case is "huungrry". This reduces the number of different ways the word "hungry" can be written.

Parameters:
sentence - the input string in question
Returns:
the same input string but with every run of repeated letters reduced to 2 letters
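
One way to get this behaviour is a regular expression with a backreference. The sketch below is an assumption about how it could be done, not the class's actual code; the class name RepeatedLettersSketch and the restriction to ASCII letters are choices made for the example.

```java
final class RepeatedLettersSketch {
    static String replaceRepeatedLetters(String sentence) {
        // ([a-zA-Z]) captures one letter; \1{2,} requires at least two more copies,
        // so only runs of three or more letters are rewritten to exactly two.
        return sentence.replaceAll("([a-zA-Z])\\1{2,}", "$1$1");
    }
}
```

Runs of exactly two letters are left alone, so ordinary words such as "good" pass through unchanged.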

replaceURLs

public static java.lang.String replaceURLs(java.lang.String sentence)
replaces every URL it finds with the keyword _URL

Parameters:
sentence - the input string
Returns:
the same input string but with every URL replaced with the keyword _URL
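
A minimal sketch of the replacement, assuming that a URL is any whitespace-delimited token starting with http:// or https://. The class name UrlSketch and the pattern are illustrative; the documented method may recognise URLs differently.

```java
final class UrlSketch {
    // Keyword as it appears in the processText example output.
    static final String URL_KEYWORD = "_URL";

    static String replaceURLs(String sentence) {
        // Match the scheme, then everything up to the next whitespace character.
        return sentence.replaceAll("https?://\\S+", URL_KEYWORD);
    }
}
```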

replaceEmoticons

public java.lang.String replaceEmoticons(java.lang.String sentence)
replaces happy and sad emoticons with two keywords, _SMILE and _FROWN

Parameters:
sentence - the input sentence
Returns:
the same input sentence but with happy and sad emoticons replaced
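
The page does not list which emoticons count as happy or sad, so the sketch below assumes only the simplest forms: ":)" and ":-)" for smiles, ":(" and ":-(" for frowns. The class name EmoticonSketch is hypothetical.

```java
final class EmoticonSketch {
    static String replaceEmoticons(String sentence) {
        // The hyphen "nose" is optional in both patterns.
        String withSmiles = sentence.replaceAll(":-?\\)", "_SMILE");
        return withSmiles.replaceAll(":-?\\(", "_FROWN");
    }
}
```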

replaceLaughs

public static java.lang.String replaceLaughs(java.lang.String sentence)
replaces laughs (like "hahaha" or "ahaahaha") with the single keyword _LAUGH

Parameters:
sentence - the input sentence
Returns:
the same input sentence but with laughs replaced with keywords
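
As a rough sketch, a laugh can be treated as a whole word made only of the letters "h" and "a" that contains "ha" at least twice in a row. That definition, the class name LaughSketch and the pattern are assumptions for illustration, not the rule the class actually uses.

```java
import java.util.regex.Pattern;

final class LaughSketch {
    // A word of only 'a'/'h' characters containing "haha" somewhere inside it.
    private static final Pattern LAUGH =
            Pattern.compile("\\b[ah]*(?:ha){2,}[ah]*\\b", Pattern.CASE_INSENSITIVE);

    static String replaceLaughs(String sentence) {
        return LAUGH.matcher(sentence).replaceAll("_LAUGH");
    }
}
```

Trailing punctuation is kept, so "hahaha!" becomes "_LAUGH!", as in the processText example. A lone "aha" is not treated as a laugh under this definition.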

replaceUsernames

public java.lang.String replaceUsernames(java.lang.String sentence)
replaces any mentioned username in a tweet (e.g. "@foo") with the keyword _USERNAME

Parameters:
sentence - the input sentence
Returns:
the same input sentence with all usernames replaced with a keyword
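
A minimal sketch, assuming a mention is an "@" that does not follow a word character, followed by letters, digits or underscores. The class name UsernameSketch and the pattern are illustrative assumptions.

```java
final class UsernameSketch {
    static String replaceUsernames(String sentence) {
        // The lookbehind (?<!\w) skips the "@" inside e-mail addresses such as a@b.com.
        return sentence.replaceAll("(?<!\\w)@\\w+", "_USERNAME");
    }
}
```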

detectAllCaps

public static java.lang.String detectAllCaps(java.lang.String sentence)
if the function detects any string of consecutive capital letters in the tweet (e.g. "I AM SO EXCITED!!") it inserts the keyword "_ALL_CAPS" in the tweet and turns the rest of the tweet to lowercase

Parameters:
sentence - the input sentence
Returns:
the input sentence turned to lowercase, with "_ALL_CAPS" appended if the input sentence contains a string of consecutive capital letters
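
The behaviour can be sketched as below. Two points are assumptions, since the page does not settle them: a "string of consecutive capital letters" is taken to mean two or more ASCII capitals in a row, and the text is lowercased even when no such run is found. The class name AllCapsSketch is hypothetical.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class AllCapsSketch {
    // Assumed threshold: at least two capitals in a row.
    private static final Pattern CAPS_RUN = Pattern.compile("[A-Z]{2,}");

    static String detectAllCaps(String sentence) {
        // Test before lowercasing, otherwise the evidence is gone.
        boolean hasCapsRun = CAPS_RUN.matcher(sentence).find();
        String lower = sentence.toLowerCase(Locale.ROOT);
        return hasCapsRun ? lower + " _ALL_CAPS" : lower;
    }
}
```

Because the inserted keywords are themselves uppercase, a step like this would have to run before keywords such as _USERNAME are inserted, or every tagged tweet would be flagged.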

insertKeywords

public java.lang.String insertKeywords(java.lang.String sentence)
calls the methods: removeStopWords, replaceEmoticons, replaceLaughs, replaceRepeatedLetters, replaceURLs and replaceUsernames on the input string in one go

Parameters:
sentence - the input sentence
Returns:
the input string after all the replace and remove methods have been applied to it

extractWords

public static java.util.List<java.lang.String> extractWords(java.lang.String sentence)
simply creates a list of every word inside the sentence with all punctuation after words removed

Parameters:
sentence - the sentence in question
Returns:
a list of every word inside the sentence with all punctuation after words removed
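
A minimal sketch of the extraction, assuming that words are separated by whitespace and that "punctuation" means ASCII punctuation. The class name ExtractWordsSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractWordsSketch {
    static List<String> extractWords(String sentence) {
        List<String> words = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            // Strip only the punctuation that trails the word.
            String word = token.replaceAll("\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }
}
```

A token that consists only of punctuation, such as ":-)", becomes empty and is dropped in this sketch. Leading punctuation is kept, so keywords like _URL survive.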

isLatin

public static boolean isLatin(java.lang.String sentence)
checks whether the text is entirely in Latin characters, to avoid characters that show up as question marks (?)

Parameters:
sentence - the sentence to be checked
Returns:
true if the text is all in Latin characters, false otherwise
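
One possible way to implement such a check uses the Unicode script of each letter. This is a sketch under the assumption that digits, spaces and punctuation should be ignored; the class name LatinCheckSketch is hypothetical and the documented method may use a different test.

```java
final class LatinCheckSketch {
    static boolean isLatin(String sentence) {
        // Only letters are examined; every one of them must belong to the Latin script.
        return sentence.codePoints()
                .filter(Character::isLetter)
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }
}
```

Text with no letters at all is reported as Latin by this sketch, because there is nothing to reject.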

main

public static void main(java.lang.String[] args)
main method for quick testing

Parameters:
args -