5/8/2023 0 Comments

Python clean text

Twitter is one of the most used data sources for data analysis. The reason is that it's open and free to collect, unless you subscribe to the paid version. Besides, it's pretty simple to collect data from it. If you don't know yet how to collect Twitter data using Python, you can check my previous post, teehee. Twitter data contains a bunch of information parameters. Sometimes, the data contains unnecessary things that need to be cleaned, such as unwanted characters, links, newlines, and other kinds of stuff. In this article, I'm going to show you how to clean Twitter data using the Python programming language.

Firstly, you need to import the modules needed:

import pandas as pd
import html
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

Pandas, to open data files and to apply certain operations to the data. Html, to decode HTML entities into regular characters. Re, to filter out and delete unnecessary links, hashes, usernames, punctuation, or whatever you wish. The nltk imports provide the stopword list and the tokenizer.

Secondly, we need to import the Twitter data. In this case, I use CSV Twitter data; you may adjust the code if you use another type of file. We're taking advantage of the pandas library here to import the data.

pd.set_option('display.max_colwidth', None)
data = pd.read_csv('your_sample.csv')
data.head()

Once we have imported the data, we're ready for the data cleaning process. The first things that we're going to clean are data duplicates. Most of the time, we don't need the duplicates, because in further use (i.e. analysis) they could mess up the result by skewing the measurement.

new_data = data.drop_duplicates('Tweet Content', keep='first')  # delete the duplicates by dropping them and store the result in a new variable
new_data.head()

If your dataframe has indices included on it, once you drop those duplicates you need to store the new dataframe in a new file. Don't forget to store it without including the index, so that we can explore the data more freely later on. We're assuming here that we're only going to use the tweet text, so we extract the tweets out of the file:

new_data.to_csv(r'your_new_sample.csv', index=False)
new_sample = pd.read_csv('your_new_sample.csv')
new_sample.head()
tweets = new_sample['Tweet Content']
tweets.head()

Once we've extracted the tweet data, we'll notice things that need to be cleaned. Most of the time, the tweets returned by the Twitter JSON data contain HTML entities, and they need to be decoded into characters. Apart from that, we also need to clean up newlines, since they make the data messy. So we clean them using the html library:

for i in range(len(tweets)):
    x = tweets[i].replace("\n", " ")  # cleaning newline "\n" from the tweets
    tweets[i] = html.unescape(x)      # decoding HTML entities
tweets.head()

Sometimes when tweeting, Twitter users attach media like pictures, videos, etc. That media is converted into links in the JSON data. Since we're only going to be using the text data, which is the tweets, we need to clean up the links. We will also clean up hash characters (only the hash characters, not the whole hashtags) and usernames. All those things will be cleaned using the re (regex) Python library:

for i in range(len(tweets)):
    tweets[i] = re.sub(r"https?://\S+|#|@\w+", "", tweets[i])  # example pattern for links, hash characters, and usernames; adjust as needed
tweets.head()

Up till now, we've already got much cleaner data, but there is one more thing we need to do to make it even cleaner. Text data mostly contains insignificant words that are not used for the analysis process, because they could mess up the analysis score. These are the stopwords, and we're about to clean them now using the nltk Python library. There are several steps you need to do to remove the stopwords:

Preparing the stopwords

tweets_to_token = tweets
sw = stopwords.words('english')  # you can adjust the language as you desire
sw.remove('not')  # we exclude 'not' from the stopwords, since removing it would change the context of the text

Tokenize the tweets

for i in range(len(tweets_to_token)):
    tweets_to_token[i] = word_tokenize(tweets_to_token[i])

Remove the stopwords

for i in range(len(tweets_to_token)):
    tweets_to_token[i] = [word for word in tweets_to_token[i] if word not in sw]
tweets_to_token

So, that's pretty much all about how to clean your Twitter data.
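As a quick illustration of the duplicate-dropping step, here's a minimal, self-contained sketch on a toy dataframe (the data is made up; the column name matches the 'Tweet Content' column used in this article):

```python
import pandas as pd

# Toy dataframe standing in for the CSV loaded from Twitter.
data = pd.DataFrame({
    "Tweet Content": [
        "Python is great",
        "Python is great",            # exact duplicate
        "Cleaning tweets with nltk",
    ]
})

# Keep only the first occurrence of each duplicated tweet.
new_data = data.drop_duplicates("Tweet Content", keep="first")

print(len(data), len(new_data))  # 3 2
```

Keeping `keep='first'` preserves the earliest copy of each tweet, which matters if your rows are ordered by time.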
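The newline, HTML-entity, and regex cleaning steps can be sketched end to end on a single hand-made tweet. The regex pattern here is an assumption for illustration (links, bare '#' characters, and @usernames), not the exact one from any library:

```python
import html
import re

# A raw tweet as it might come back in the Twitter JSON: an HTML entity,
# a newline, a link, a hashtag, and a username mention.
raw = "I &amp; my friends love #python!\nSee https://example.com cc @someone"

x = raw.replace("\n", " ")   # drop the newline
decoded = html.unescape(x)   # "&amp;" -> "&"

# Illustrative pattern: strip links, the '#' character (keeping the
# hashtag word itself), and @usernames.
cleaned = re.sub(r"https?://\S+|#|@\w+", "", decoded)
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # tidy leftover whitespace

print(cleaned)  # I & my friends love python! See cc
```

Note that only the '#' is removed, so the hashtag word ("python") survives as ordinary text, which is usually what you want for analysis.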
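Finally, a self-contained sketch of the tokenize-and-remove-stopwords idea. To keep it runnable without downloading the nltk corpora, it uses a tiny hand-made stopword list and a plain whitespace split in place of `stopwords.words('english')` and `word_tokenize` (both assumptions for this sketch):

```python
# Tiny hand-made stopword list standing in for stopwords.words('english').
sw = ["is", "a", "the", "and", "not"]
sw.remove("not")  # keep 'not', as in the article, to preserve negations

tweets = ["cleaning tweets is not a hard task"]

# Tokenize (a plain split here instead of nltk's word_tokenize),
# then drop every token that appears in the stopword list.
for i in range(len(tweets)):
    tokens = tweets[i].split()
    tweets[i] = [word for word in tokens if word not in sw]

print(tweets[0])  # ['cleaning', 'tweets', 'not', 'hard', 'task']
```

The same filtering comprehension works unchanged with the real nltk stopword list and tokenizer once the corpora are downloaded.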