
Finding Temporal Patterns in Twitter Posts: Exploratory Data Analysis with Python


Clustering of Twitter data with Python, K-Means, and t-SNE

Tweet clusters t-SNE visualization. Image by author

In the article “What People Write about Climate”, I analyzed Twitter posts using natural language processing, vectorization, and clustering. Using this approach, it is possible to find distinct groups in unstructured text data, for example, to extract messages about ice melting or about electric transport from thousands of tweets about climate. During the processing of this data, another question arose: what if we could apply the same algorithm not to the messages themselves but to the time when those messages were published? This would allow us to analyze when and how often different people make posts on social media. It can be important not only from sociological or psychological perspectives but, as we will see later, also for detecting bots or users sending spam. Last but not least, almost everybody uses social platforms nowadays, and it is just interesting to learn something new about ourselves. Obviously, the same algorithm can be used not only for Twitter posts but for any media platform.

Methodology

I will mostly use the same approach as described in the first part about Twitter data analysis. Our data processing pipeline will consist of several steps:

  • Collecting tweets containing the specific hashtag and saving them in a CSV file. This was already done in the previous article, so I will skip the details here.
  • Finding the general properties of the collected data.
  • Calculating embedding vectors for each user based on the time of their posts.
  • Clustering the data using the K-Means algorithm.
  • Analyzing the results.

Let’s start.

1. Loading the Data

I will be using the Tweepy library to collect Twitter posts. More details can be found in the first part; here I will only show the source code:

import tweepy

api_key = "YjKdgxk..."
api_key_secret = "Qa6ZnPs0vdp4X...."

auth = tweepy.OAuth2AppHandler(api_key, api_key_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

hashtag = "#climate"
language = "en"

def text_filter(s_data: str) -> str:
    """ Remove extra characters from the text """
    return s_data.replace("&", "and").replace(";", " ").replace(",", " ") \
                 .replace('"', " ").replace("\n", " ").replace("  ", " ")

def get_hashtags(tweet) -> str:
    """ Parse hashtags from the tweet entities """
    hash_tags = ""
    if 'hashtags' in tweet.entities:
        hash_tags = ','.join(map(lambda x: x["text"], tweet.entities['hashtags']))
    return hash_tags

def get_csv_header() -> str:
    """ CSV header """
    return "id;created_at;user_name;user_location;user_followers_count;user_friends_count;retweets_count;favorites_count;retweet_orig_id;retweet_orig_user;hash_tags;full_text"

def tweet_to_csv(tweet) -> str:
    """ Convert tweet data to a CSV string """
    if not hasattr(tweet, 'retweeted_status'):
        full_text = text_filter(tweet.full_text)
        hashtags = get_hashtags(tweet)
        retweet_orig_id = ""
        retweet_orig_user = ""
        favs, retweets = tweet.favorite_count, tweet.retweet_count
    else:
        retweet = tweet.retweeted_status
        retweet_orig_id = retweet.id
        retweet_orig_user = retweet.user.screen_name
        full_text = text_filter(retweet.full_text)
        hashtags = get_hashtags(retweet)
        favs, retweets = retweet.favorite_count, retweet.retweet_count
    # The user location is cleaned with the same text_filter here
    s_out = f"{tweet.id};{tweet.created_at};{tweet.user.screen_name};{text_filter(tweet.user.location)};{tweet.user.followers_count};{tweet.user.friends_count};{retweets};{favs};{retweet_orig_id};{retweet_orig_user};{hashtags};{full_text}"
    return s_out

if __name__ == "__main__":
    limit = 100  # number of result pages to fetch
    pages = tweepy.Cursor(api.search_tweets, q=hashtag, tweet_mode='extended',
                          result_type="recent",
                          count=100,
                          lang=language).pages(limit)

    with open("tweets.csv", "a", encoding="utf-8") as f_log:
        f_log.write(get_csv_header() + "\n")
        for ind, page in enumerate(pages):
            for tweet in page:
                # Get data per tweet
                str_line = tweet_to_csv(tweet)
                # Save to CSV
                f_log.write(str_line + "\n")

Using this code, we can get all Twitter posts with a specific hashtag made within the last 7 days. A hashtag is effectively our search query; we can find posts about climate, politics, or any other topic. Optionally, a language code allows us to search for posts in different languages. Readers are welcome to do extra research on their own; for example, it may be interesting to compare the results between English and Spanish tweets.

After the CSV file is saved, let's load it into a dataframe, drop the unwanted columns, and see what kind of data we have:

import pandas as pd

df = pd.read_csv("climate.csv", sep=';',
                 dtype={'id': object, 'retweet_orig_id': object, 'full_text': str, 'hash_tags': str},
                 parse_dates=["created_at"], lineterminator='\n')
df.drop(["retweet_orig_id", "user_friends_count", "retweets_count", "favorites_count",
         "user_location", "hash_tags", "retweet_orig_user", "user_followers_count"],
        inplace=True, axis=1)
df = df.drop_duplicates('id')
with pd.option_context('display.max_colwidth', 80):
    display(df)

In the same way as in the first part, I collected Twitter posts with the hashtag “#climate”. The result looks like this:

We don't actually need the text or user id, but they can be useful for “debugging”, to see how the original tweet looks. For further processing, we will need to know the day, time, and hour of each tweet. Let's add columns to the dataframe:

import datetime

def get_time(dt: datetime.datetime):
    """ Get time from datetime """
    return dt.time()

def get_date(dt: datetime.datetime):
    """ Get date from datetime """
    return dt.date()

def get_hour(dt: datetime.datetime):
    """ Get hour from datetime """
    return dt.hour

df["date"] = df['created_at'].map(get_date)
df["time"] = df['created_at'].map(get_time)
df["hour"] = df['created_at'].map(get_hour)

We can easily verify the results:

display(df[["user_name", "date", "time", "hour"]])

Now we have all the needed information, and we are ready to go.

2. General Insights

As we can see from the last screenshot, 199,278 messages were loaded; these are messages with the “#Climate” hashtag, which I collected within several weeks. As a warm-up, let's answer a simple question: how many messages per day about the climate were people posting on average?

First, let's calculate the total number of days and the total number of users:

days_total = df['date'].unique().shape[0]
print(days_total)
# > 46

users_total = df['user_name'].unique().shape[0]
print(users_total)
# > 79985

As we can see, the data was collected over 46 days, and in total, 79,985 Twitter users posted (or reposted) at least one message with the hashtag “#Climate” during that time. Obviously, we can only count users who made at least one post; alas, we cannot get the number of readers this way.

Let's find the number of messages per day for each user. First, let's group the dataframe by user name:

gr_messages_per_user = df.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)

The “size” column gives us the number of messages each user sent. I also added the “size_per_day” column, which is easy to calculate by dividing the total number of messages by the total number of days. The result looks like this:

We can see that the most active users post up to 18 messages per day, while the most inactive users posted only one message within this 46-day period (1/46 = 0.0217). Let's draw a histogram using NumPy and Bokeh:

import numpy as np
from bokeh.io import show, output_notebook, export_png
from bokeh.plotting import figure, output_file
from bokeh.models import ColumnDataSource, LabelSet, Whisker
from bokeh.transform import factor_cmap, factor_mark, cumsum
from bokeh.palettes import *
output_notebook()

users = gr_messages_per_user['user_name']
amount = gr_messages_per_user['size_per_day']
hist_e, edges_e = np.histogram(amount, density=False, bins=100)

# Draw
p = figure(width=1600, height=500, title="Messages per day distribution")
p.quad(top=hist_e, bottom=0, left=edges_e[:-1], right=edges_e[1:], line_color="darkblue")
p.x_range.start = 0
# p.x_range.end = 150000
p.y_range.start = 0
p.xaxis[0].ticker.desired_num_ticks = 20
p.left[0].formatter.use_scientific = False
p.below[0].formatter.use_scientific = False
p.xaxis.axis_label = "Messages per day, avg"
p.yaxis.axis_label = "Amount of users"
show(p)

The output looks like this:

Messages per day distribution. Image by author

Interestingly, we can see only one bar. Of all 79,985 users who posted messages with the “#Climate” hashtag, almost all of them (77,275 users) sent, on average, less than one message per day. It looks surprising at first glance, but really, how often do we post tweets about the climate? Honestly, I have never done it in my life. We need to zoom in on the graph a lot to see the other bars of the histogram:

Messages per day distribution at a higher zoom level. Image by author

Only at this zoom level can we see that among all 79,985 Twitter users who posted something about “#Climate”, there are fewer than 100 “activists” posting messages every day! Okay, maybe “climate” is not something people post about daily, but is it the same with other topics? I created a helper function that returns the percentage of “active” users who posted more than N messages per day:

def get_active_users_percent(df_in: pd.DataFrame, messages_per_day_threshold: int):
    """ Get the percentage of active users with a messages-per-day threshold """
    days_total = df_in['date'].unique().shape[0]
    users_total = df_in['user_name'].unique().shape[0]
    gr_messages_per_user = df_in.groupby(['user_name'], as_index=False).size()
    gr_messages_per_user["size_per_day"] = gr_messages_per_user['size'].div(days_total)
    users_active = gr_messages_per_user[gr_messages_per_user['size_per_day'] >= messages_per_day_threshold].shape[0]
    return 100*users_active/users_total

Then, using the same Tweepy code, I downloaded dataframes for six topics from different domains. We can draw the results with Bokeh:

labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
counts = [get_active_users_percent(df_climate, messages_per_day_threshold=1),
          get_active_users_percent(df_politics, messages_per_day_threshold=1),
          get_active_users_percent(df_cats, messages_per_day_threshold=1),
          get_active_users_percent(df_humour, messages_per_day_threshold=1),
          get_active_users_percent(df_space, messages_per_day_threshold=1),
          get_active_users_percent(df_war, messages_per_day_threshold=1)]

palette = Spectral6
source = ColumnDataSource(data=dict(labels=labels, counts=counts, color=palette))
p = figure(width=1200, height=400, x_range=labels, y_range=(0,9),
           title="Percentage of Twitter users posting 1 or more messages per day",
           toolbar_location=None, tools="")
p.vbar(x='labels', top='counts', width=0.9, color='color', source=source)
p.xgrid.grid_line_color = None
p.y_range.start = 0
show(p)

The results are interesting:

Percentage of active users who posted at least 1 message per day with a specific hashtag

The most popular hashtag here is “#Cats”. In this group, about 6.6% of users make posts daily. Are their cats just adorable, and they cannot resist the temptation? On the contrary, “#Humour” is a popular topic with a large number of messages, but the number of people who post more than one message per day is minimal. For more serious topics like “#War” or “#Politics”, about 1.5% of users make posts daily. And surprisingly, many more people are making daily posts about “#Space” compared to “#Humour”.

To clarify these numbers in more detail, let's find the distribution of the number of messages per user; it is not directly relevant to message time, but it is still interesting to find the answer:

def get_cumulative_percents_distribution(df_in: pd.DataFrame, steps=200):
    """ Get a distribution of the total percentage of messages sent by a percentage of users """
    # Group the dataframe by user name and sort by the number of messages
    df_messages_per_user = df_in.groupby(['user_name'], as_index=False).size().sort_values(by=['size'], ascending=False)
    users_total = df_messages_per_user.shape[0]
    messages_total = df_messages_per_user["size"].sum()

    # Get the cumulative messages/users ratio
    messages = []
    percentage = np.arange(0, 100, 0.05)
    for perc in percentage:
        msg_count = df_messages_per_user[:int(perc*users_total/100)]["size"].sum()
        messages.append(100*msg_count/messages_total)

    return percentage, messages

This method calculates the total number of messages posted by the most active users. The numbers themselves can vary strongly between topics, so I use percentages for both outputs. With this function, we can compare results for different hashtags:

# Calculate
percentage, messages1 = get_cumulative_percents_distribution(df_climate)
_, messages2 = get_cumulative_percents_distribution(df_politics)
_, messages3 = get_cumulative_percents_distribution(df_cats)
_, messages4 = get_cumulative_percents_distribution(df_humour)
_, messages5 = get_cumulative_percents_distribution(df_space)
_, messages6 = get_cumulative_percents_distribution(df_war)

labels = ['#Climate', '#Politics', '#Cats', '#Humour', '#Space', '#War']
messages = [messages1, messages2, messages3, messages4, messages5, messages6]

# Draw
palette = Spectral6
p = figure(width=1200, height=400,
           title="Twitter messages per user percentage ratio",
           x_axis_label='Percentage of users',
           y_axis_label='Percentage of messages')
for ind in range(6):
    p.line(percentage, messages[ind], line_width=2, color=palette[ind], legend_label=labels[ind])
p.x_range.end = 100
p.y_range.start = 0
p.y_range.end = 100
p.xaxis.ticker.desired_num_ticks = 10
p.legend.location = 'bottom_right'
p.toolbar_location = None
show(p)

Because both axes are “normalized” to 0..100%, it is easy to compare the results for different topics:

Distribution of messages made by the most active users. Image by author

Again, the result looks interesting. We can see that the distribution is strongly skewed: 10% of the most active users are posting 50–60% of the messages (spoiler alert: as we will see soon, not all of them are humans ;).

This graph was made by a function of only about 20 lines of code. This “analysis” is pretty simple, but many additional questions can arise. There is a clear difference between topics, and finding the answer to why it is so is clearly not straightforward. Which topics have the largest number of active users? Are there cultural or regional differences, and is the curve the same in different countries, like the US, Russia, or Japan? I encourage readers to do some tests on their own.

Now that we have some basic insights, it's time to do something more challenging. Let's cluster all users and try to find some common patterns. To do this, first, we will need to convert each user's data into an embedding vector.

3. Making User Embeddings

An embedding vector is a list of numbers that represents the data for each user. In the previous article, I got embedding vectors from tweet words and sentences. Now, because I want to find patterns in the “temporal” domain, I will calculate embeddings based on the message time. But first, let's find out what the data looks like.

As a reminder, we have a dataframe with all tweets collected for a specific hashtag. Each tweet has a user name, creation date, time, and hour:

Let's create a helper function to show all tweet times for a specific user:

def draw_user_timeline(df_in: pd.DataFrame, user_name: str):
    """ Draw the cumulative message timeline for a specific user """
    df_u = df_in[df_in["user_name"] == user_name]
    days_total = df_u['date'].unique().shape[0]

    # Group messages by time of the day
    messages_per_day = df_u.groupby(['time'], as_index=False).size()
    msg_time = messages_per_day['time']
    msg_count = messages_per_day['size']

    # Draw
    p = figure(x_axis_type='datetime', width=1600, height=150,
               title=f"Cumulative tweets timeline during {days_total} days: {user_name}")
    p.vbar(x=msg_time, top=msg_count, width=datetime.timedelta(seconds=30), line_color='black')
    p.xaxis[0].ticker.desired_num_ticks = 30
    p.xgrid.grid_line_color = None
    p.toolbar_location = None
    p.x_range.start = datetime.time(0, 0, 0)
    p.x_range.end = datetime.time(23, 59, 0)
    p.y_range.start = 0
    p.y_range.end = 1
    show(p)

draw_user_timeline(df, user_name="UserNameHere")
...

The result looks like this:

Messages timeline for several users. Image by author

Here we can see messages made by some users within several weeks, displayed on a 00–24h timeline. We may already see some patterns here, but as it turned out, there is one problem. The Twitter API does not return the time zone. There is a “timezone” field in the message body, but it is always empty. Maybe when we see tweets in the browser, we see them in our local time; in that case, the original timezone is just not important. Or maybe it is a limitation of the free account. Anyway, we cannot cluster the data properly if one user from the US starts sending messages at 2 AM UTC and another user from India starts sending messages at 1 PM UTC; both timelines just won't match together.

As a workaround, I tried to “estimate” the timezone myself by using a simple empirical rule: most people are sleeping at night, and they most likely are not posting tweets at that time 😉 So, we can find the 9-hour interval where the average number of messages is minimal and assume that this is the “night” time for that user.

from typing import List

def get_night_offset(hours: List):
    """ Estimate the night position by calculating the rolling average minimum """
    night_len = 9
    min_pos, min_avg = 0, 99999
    # Find the minimum position
    data = np.array(hours + hours)
    for p in range(24):
        avg = np.average(data[p:p + night_len])
        if avg <= min_avg:
            min_avg = avg
            min_pos = p

    # Move the position to the right if possible (in case of a long sequence of similar numbers)
    for p in range(min_pos, len(data) - night_len):
        avg = np.average(data[p:p + night_len])
        if avg <= min_avg:
            min_avg = avg
            min_pos = p
        else:
            break

    return min_pos % 24

def normalize(hours: List):
    """ Shift the hours array to the right, keeping the 'night' time on the left """
    offset = get_night_offset(hours)
    data = hours + hours
    return data[offset:offset + 24]

In practice, it works well in cases like this, where the “night” period can be easily detected:

Of course, some people wake up at 7 AM and some at 10 AM, and without a time zone, we cannot tell the difference. Anyway, it is better than nothing, and this algorithm can be used as a “baseline”.

Obviously, the algorithm does not work in cases like this one:

In this example, we just don't know if this user was posting messages in the morning, in the evening, or after lunch; there is no information about that. But it is still interesting to see that some users post messages only at a specific time of the day. In this case, having a “virtual offset” is still helpful; it allows us to “align” all user timelines, as we will soon see in the results.

Now let's calculate the embedding vectors. There can be different ways of doing this. I decided to use vectors in the form [SumTotal, Sum00, ..., Sum23], where SumTotal is the total number of messages made by a user, and Sum00..Sum23 are the total numbers of messages made within each hour of the day. We can use Pandas' groupby method with the two parameters “user_name” and “hour”, which does almost all the needed calculations for us:

def get_vectorized_users(df_in: pd.DataFrame):
    """ Get embedding vectors for all users
        Embedding format: [total messages, messages per hour 00, 01, ..., 23]
    """
    gr_messages_per_user = df_in.groupby(['user_name', 'hour'], as_index=True).size()

    vectors = []
    users = gr_messages_per_user.index.get_level_values('user_name').unique().values
    for ind, user in enumerate(users):
        if ind % 10000 == 0:
            print(f"Processing {ind} of {users.shape[0]}")
        hours_all = [0]*24
        for hr, value in gr_messages_per_user[user].items():
            hours_all[hr] = value

        hours_norm = normalize(hours_all)
        vectors.append([sum(hours_norm)] + hours_norm)

    return users, np.asarray(vectors)

all_users, vectorized_users = get_vectorized_users(df)

Here, the “get_vectorized_users” method does the calculation. After calculating each 24-hour vector, I use the “normalize” function to apply the “timezone” offset, as described before.

In practice, the embedding vector of a relatively active user may look like this:

[120 0 0 0 0 0 0 0 0 0 1 2 0 2 2 1 0 0 0 0 0 18 44 50 0]

Here 120 is the total number of messages, and the rest is a 24-element array with the number of messages made within each hour (as a reminder, in our case, the data was collected over 46 days). For an inactive user, the embedding may look like this:

[4 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0]

Different embedding vectors can also be created, and a more complex scheme may provide better results. For example, it may be interesting to add the total number of “active” hours per day or to include the day of the week into the vector to see how a user's activity varies between working days and weekends, and so on. A minimal sketch of such an extended embedding is shown below.
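
This sketch is my own addition and is not used further in the article; it simply appends per-weekday counts (Monday..Sunday) to the hourly vector, assuming the same dataframe columns as prepared above:

def get_vectorized_users_extended(df_in: pd.DataFrame):
    """ A sketch of an extended embedding: [total, hour 00..23 counts, weekday Mon..Sun counts] """
    df_tmp = df_in.copy()
    df_tmp["weekday"] = df_tmp["created_at"].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday

    gr_hours = df_tmp.groupby(["user_name", "hour"]).size()
    gr_days = df_tmp.groupby(["user_name", "weekday"]).size()

    vectors, users = [], df_tmp["user_name"].unique()
    for user in users:
        hours_all = [0]*24
        for hr, value in gr_hours[user].items():
            hours_all[hr] = value
        days_all = [0]*7
        for day, value in gr_days[user].items():
            days_all[day] = value
        hours_norm = normalize(hours_all)
        vectors.append([sum(hours_norm)] + hours_norm + days_all)

    return users, np.asarray(vectors)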

4. Clustering

As in the previous article, I will be using the K-Means algorithm to find the clusters. First, let's find the optimal K value using the elbow method:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
%matplotlib inline

def draw_elbow_graph(x: np.array, k1: int, k2: int, k3: int):
    k_values, inertia_values = [], []
    for k in range(k1, k2, k3):
        print("Processing:", k)
        km = KMeans(n_clusters=k).fit(x)
        k_values.append(k)
        inertia_values.append(km.inertia_)

    plt.figure(figsize=(12, 4))
    plt.plot(k_values, inertia_values, 'o')
    plt.title('Inertia for each K')
    plt.xlabel('K')
    plt.ylabel('Inertia')

draw_elbow_graph(vectorized_users, 2, 20, 1)

The result looks like this:

The elbow graph for user embeddings. Image by author

Let's write the method to calculate the clusters and draw the timelines for some users:

from sklearn.metrics import silhouette_score, silhouette_samples

def get_clusters_kmeans(x, k):
    """ Get clusters using K-Means """
    km = KMeans(n_clusters=k).fit(x)
    s_score = silhouette_score(x, km.labels_)
    print(f"K={k}: Silhouette coefficient {s_score:0.2f}, inertia: {km.inertia_}")

    sample_silhouette_values = silhouette_samples(x, km.labels_)
    silhouette_values = []
    for i in range(k):
        cluster_values = sample_silhouette_values[km.labels_ == i]
        silhouette_values.append((i, cluster_values.shape[0], cluster_values.mean(),
                                  cluster_values.min(), cluster_values.max()))
    silhouette_values = sorted(silhouette_values, key=lambda tup: tup[2], reverse=True)

    for s in silhouette_values:
        print(f"Cluster {s[0]}: Size:{s[1]}, avg:{s[2]:.2f}, min:{s[3]:.2f}, max: {s[4]:.2f}")
    print()

    # Create a new dataframe with user names, vectors, and cluster labels
    cdf = pd.DataFrame({
        "id": all_users,
        "vector": [str(v) for v in vectorized_users],
        "cluster": km.labels_,
    })

    # Show the top clusters
    for cl in silhouette_values[:10]:
        df_c = cdf[cdf['cluster'] == cl[0]]
        # Show the cluster
        print("Cluster:", cl[0], cl[2])
        with pd.option_context('display.max_colwidth', None):
            display(df_c[["id", "vector"]][:20])
        # Show timelines of the first users
        for user in df_c["id"].values[:10]:
            draw_user_timeline(df, user_name=user)
        print()

    return km.labels_

clusters = get_clusters_kmeans(vectorized_users, k=5)

This method is mostly the same as in the previous part; the only difference is that I draw user timelines for each cluster instead of a word cloud.

5. Results

Finally, we are ready to see the results. Obviously, not all groups were perfectly separated, but some of the categories are interesting to mention. As a reminder, I was analyzing all tweets of users who made posts with the “#Climate” hashtag within 46 days. So, what clusters can we see in posts about the climate?

“Inactive” users, who sent only 1–2 messages within a month. This group is the largest; as discussed above, it represents more than 95% of all users. And the K-Means algorithm was able to detect this cluster as the largest one. Timelines for those users look like this:

“Interested” users. These users posted tweets every 2–5 days, so I can assume that they have at least some sort of interest in this topic.

“Active” users. These users are posting more than several messages per day:

We don't know if those people are just “activists” or if they regularly post tweets as part of their job, but at least we can see that their online activity is pretty high.

“Bots”. These users are highly unlikely to be humans at all. Not surprisingly, they have the highest number of posted messages. Of course, I don't have any 100% proof that all those accounts belong to bots, but it is unlikely that any human can post messages so regularly without rest and sleep:

The second “user”, for example, is posting tweets at the same time of day with 1-second accuracy; its tweets could be used as an NTP server 🙂

By the way, some other “users” are not really active, but their datetime pattern looks suspicious. This “user” has not so many messages, and there is a visible “day/night” pattern, so it was not clustered as a “bot”. But for me, it looks unrealistic that an ordinary user would publish messages strictly at the beginning of every hour:

Perhaps the autocorrelation function can provide good results in detecting all users with suspiciously repetitive activity.
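
As a rough illustration of that idea, here is a minimal sketch (my own addition, not part of the original analysis; the per-minute binning and the 0.5 threshold are arbitrary assumptions) that flags users whose posting times show strong periodicity:

def has_suspicious_periodicity(df_in: pd.DataFrame, user_name: str, threshold: float = 0.5) -> bool:
    """ A sketch: detect repetitive, bot-like posting schedules via autocorrelation """
    df_u = df_in[df_in["user_name"] == user_name]
    # Number of messages per minute over the whole observation period
    counts = df_u.set_index("created_at").resample("1min").size().to_numpy().astype(float)
    if counts.sum() < 10:
        return False  # too few messages to judge
    counts -= counts.mean()
    denom = np.dot(counts, counts)
    if denom == 0:
        return False
    # Normalized autocorrelation for all positive lags
    acf = np.correlate(counts, counts, mode="full")[counts.size:] / denom
    # A strong peak at any non-zero lag suggests a repetitive schedule
    return acf.max() > threshold

# Check a subset of users (the full list can take a while)
suspicious = [u for u in all_users[:1000] if has_suspicious_periodicity(df, u)]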

“Clones”. If we run the K-Means algorithm with higher values of K, we can also detect some “clones”. These clusters have similar time patterns and the highest silhouette values. For example, we can see several accounts with similar-looking nicknames that differ only in the last characters. Probably, a script is posting messages from several accounts in parallel:
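
Such groups can also be spotted from the user names themselves; the following sketch (my own addition, using a naive common-prefix heuristic; the prefix length, group size, and cluster number are arbitrary) shows the idea:

from collections import defaultdict

def find_name_groups(user_names, prefix_len: int = 8, min_group_size: int = 3):
    """ A sketch: group user names by a common prefix to spot possible "clone" accounts """
    groups = defaultdict(list)
    for name in user_names:
        groups[str(name)[:prefix_len].lower()].append(name)
    return {prefix: names for prefix, names in groups.items() if len(names) >= min_group_size}

# Example: check the users assigned to one K-Means cluster (hypothetical cluster number)
cluster_id = 3
cluster_users = all_users[clusters == cluster_id]
print(find_name_groups(cluster_users))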

As a final step, we can see the cluster visualization made by the t-SNE (t-distributed Stochastic Neighbor Embedding) algorithm, which looks pretty beautiful:

Here we can see a lot of smaller clusters that were not detected by K-Means with K=5. In this case, it makes sense to try higher K values; maybe another algorithm like DBSCAN (density-based spatial clustering of applications with noise) can also provide good results.
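
The t-SNE projection itself is not shown in the code above; a minimal sketch of how such a picture can be produced (the perplexity value and the DBSCAN parameters below are my own assumptions and would need tuning) may look like this:

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from bokeh.plotting import figure, show
from bokeh.palettes import Spectral6

# Project the user embeddings to 2D with t-SNE and color the points by K-Means cluster label
# (t-SNE can be slow for tens of thousands of points; a random subsample may be used instead)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
points_2d = tsne.fit_transform(vectorized_users)

p = figure(width=800, height=800, title="Tweet clusters, t-SNE projection")
for cl in np.unique(clusters):
    mask = clusters == cl
    p.scatter(points_2d[mask, 0], points_2d[mask, 1], size=2, color=Spectral6[cl % len(Spectral6)])
p.toolbar_location = None
show(p)

# An alternative: DBSCAN does not need a predefined number of clusters
db_labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(vectorized_users)
print("DBSCAN found", len(set(db_labels)) - (1 if -1 in db_labels else 0), "clusters")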

Conclusion

Using data clustering, we were able to find distinctive patterns in tens of thousands of tweets about “#Climate” made by different users. The analysis itself was done using only the time of the tweet posts. This can be useful in sociology or cultural anthropology studies; for example, we can compare the online activity of different users on different topics, figure out how often they make social network posts, and so on. Time analysis is language-agnostic, so it is also possible to compare results from different geographical areas, for example, online activity between English- and Japanese-speaking users. Time-based data can also be useful in psychology or medicine; for example, it is possible to figure out how many hours people are spending on social networks or how often they make pauses. And as demonstrated above, finding patterns in user “behavior” can be useful not only for research purposes but also for purely “practical” tasks like detecting bots, “clones”, or users posting spam.

Alas, not all of the analysis was successful because the Twitter API does not provide timezone data. For example, it would be interesting to see if people post more messages in the morning or in the evening, but without a proper local time, it is not possible; all messages returned by the Twitter API are in UTC time. But anyway, it is great that the Twitter API allows us to get large amounts of data even with a free account. And clearly, the ideas described in this post can be used not only for Twitter but for other social networks as well.

If you enjoyed this story, feel free to subscribe to Medium, and you will get notifications when my new articles are published, as well as full access to thousands of stories from other authors.

Thanks for reading.
