
Accessing Your Personal Data

The Extensive and Often Surprising Data that Corporations Have about You, Ready and Waiting for You to Analyze

Image created with the help of DALL-E 2

Data privacy laws are appearing in countries around the world, and they create a unique opportunity to learn how others view you while also gaining insights into yourself. Most of these laws resemble the European Union’s General Data Protection Regulation, commonly known as “GDPR”. It includes provisions requiring organizations to tell you what kind of personal data they store about you, why they store it, how they use it, and how long they keep it.

But the laws also include an often overlooked requirement commonly known as data portability. Data portability requires organizations to give you, upon request, a machine-readable copy of the data they currently store about you. Within the GDPR, the right to obtain a copy of your data is defined in Article 15, “Right of access by the data subject”, and the portability right itself in Article 20, “Right to data portability”. The data that organizations hold often has a rich and varied set of features and is clean, making it ripe for many data analysis, modelling, and visualization tasks.

In this article, I share my journey of requesting my data from a few of the companies with which I routinely interact. I include tips for requesting your data as well as ideas for using your data in data science and for personal insights.

Think you have a solid grasp of your taste in music? I thought I had broad and varied musical tastes. According to Apple, though, I’m more of a die-hard rocker.

Table by creator

Want to refine your geographic data mapping skills? These data sources provide a surprising amount of geocoded data to work with.

Plot of a walk through Universal Studios — Image by Creator

Care to try your time series modelling skills? Multiple data sets include fine-grained time series observations.

Forecast of exercise time using Apple health data — Plot by creator

The best news of all? This is your data. No license or permissions needed.

Fasten your seat belt: the variety of data you’ll receive is broad. The types of analyses and modelling you can do are non-trivial. And the insights you gain about yourself and how others view you can be intriguing.

To keep the focus on insights from the data, and in the interest of brevity, I keep code in this article to a minimum. Everybody likes code, though, so here’s a link to a repo with several of the notebooks I used to analyze my data.

Getting the Data

If you make a list of the organizations that have data about you, you’ll quickly realize the list is long. Social media companies, online retailers, mobile phone carriers, internet service providers, home automation and security services, and streaming entertainment providers are just a few of the categories of organizations storing data about you. Requesting your data from all of them can be quite time-consuming.

To make my analysis manageable, I limited my data requests to Facebook, Google, Microsoft, Apple, Amazon and my cellular carrier, Verizon. Here’s a table summarizing my experience with the data request and response process:

Table by creator

And here are the links I used to request my data, along with information on any data documentation provided by the vendors:

I use an Apple Watch to track health and fitness data. That data is accessed separately from all the other Apple data that you request from the main Apple website. For this reason, I show two separate Apple entries in the tables above and discuss the Apple data in two sections below.

The amount and type of data you receive will depend on how extensively you engage with a particular company. For example, I use social media infrequently. The fairly modest amount of data I received from Facebook is therefore not surprising. In contrast, I use Apple products and services a lot. I got a broad range and large volume of data from Apple.

Keep in mind that if you have multiple identities with a company, you’ll need to request the data for each identity. For example, if Google knows you by one email address for your Google Play account and a different email address for your Gmail account, you’ll need to make a data request for each address in order to get a full picture of the data Google stores about you.

In the table above I show the links that I used to request data from my target companies. The links are current as of the publishing of this article but may change over time. Generally, you can find instructions for requesting your data under the “Privacy”, “Privacy Rights” or similarly named links on a company’s home page. Those links frequently appear at the very bottom of the home page.

Bottom of microsoft.com screen — image by creator

You usually have to read through documentation describing your privacy rights and search for the “Accessing Your Data”, “Exporting Your Data”, “Data Portability” or similar topic to get a link to the actual page for requesting your data.

Finally, the process for requesting your data, the timeliness of the response and the quality of the documentation explaining the data vary greatly from one company to the next. Be patient and persevere. You will be rewarded with a wealth of data and information before long.

My Data Insights

Here’s a review of the data files that I received from each company, along with a few observations after analyzing the more interesting files. I also point out some opportunities to do more in-depth data analysis and modelling with the data from these companies.

Facebook

My download from Facebook included 51 .json files, excluding the many .json files containing individual message threads from my Facebook Messenger account. Facebook provides some high-level documentation for its files on the download website.

Data on my Facebook login activity, the devices I used to log in, the estimated geographic location of my logins, and similar administrative data about my account activities appear across several files. Nothing in these files is especially interesting, though I will say that the location data seemed surprisingly accurate, given that it was often inferred from my IP address at the time of the recorded activity.

The truly interesting data started to appear in a file that tracked my off-Facebook app and web activity. I can see how the data in that file, coupled with the data that Facebook already has from my Facebook profile, paints a demographic picture that results in my being chosen as a target by particular Facebook advertisers. The off-Facebook file starts to give you a sense of how the profiling and advertising process works at Facebook.

Let’s take a look at the file. It is named:

“/apps_and_websites_off_of_facebook/your_off-facebook_activity.json”

It contains 1,860 records of actions I took on 441 different non-Facebook websites over the past two years. Here is an edited sample of the websites and action types it records:

Table by creator
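A tally like the one above could be produced with a few lines of Python. The following is only a minimal sketch, not the code from the repo: the key names `name`, `events` and `type`, the helper names, the `top_level_key` parameter and the sample records are all assumptions for illustration, so inspect your own export before relying on them.

```python
import json
from collections import Counter


def load_activity(path, top_level_key):
    """Read the export and return the list stored under `top_level_key`.

    The top-level key is an assumption; open your file to see what it is called.
    """
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)[top_level_key]


def summarize_off_facebook(activity):
    """Count recorded events per site and per action type.

    `activity` is expected to be a list of dicts shaped like
    {"name": "example.com", "events": [{"type": "PAGE_VIEW"}, ...]}.
    """
    per_site = Counter()
    per_type = Counter()
    for site in activity:
        events = site.get("events", [])
        per_site[site.get("name", "unknown")] += len(events)
        per_type.update(event.get("type", "unknown") for event in events)
    return per_site, per_type


# Hypothetical records standing in for the parsed JSON file.
sample = [
    {"name": "travel-site.example",
     "events": [{"type": "PAGE_VIEW"}, {"type": "SEARCH"}]},
    {"name": "tech-site.example",
     "events": [{"type": "PAGE_VIEW"}]},
]

per_site, per_type = summarize_off_facebook(sample)
print(per_site.most_common(5))   # sites with the most recorded events
print(per_type.most_common())    # action types, most frequent first
```

Using `Counter` keeps the sketch short and gives a ranked list for free through `most_common`.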

Several technology and travel related sites rise to the top of my off-Facebook activity list. Now let’s take a look at my demographic profile.

The file named:

“ads_information/other_categories_used_to_reach_you.json”

contains a list of demographic categories that Facebook has assigned to me based, I assume, on my Facebook profile data, my Facebook friends, my activity on Facebook, and my off-Facebook app and web activity. Here is an edited sample of the demographic categories:

Table by creator

Many of the categories above are based on my profile, my device usage pattern, and my friends. The “Frequent Travelers” and “Frequent International Travelers” categories come, I assume, from my off-Facebook web activity. So far, this all checks out.

Finally, there’s a file named:

“ads_information/advertisers_using_your_activity_or_information.json”

The “advertisers_using_your_activity_or_information” in the file name leads me to believe that Facebook makes my data available to its advertisers, who in turn use it to target me with ads through Facebook. This file, then, lists those advertisers who displayed an ad to me, or who at least considered doing so based on my data.

The file contained 1,366 different advertisers. Here’s a small sample of those advertisers:

Table by creator

Travel sites, retailers, tech companies, fitness centers, automobile repair companies, healthcare insurers, media companies (who represent advertisers), and other businesses appear in the list. It’s a wide range of organizations, but in many instances I can see how they relate to me, my preferences and my habits.

Other files within the Facebook download include Facebook search history, search timestamps, and browser cookie data.

Google

Google’s export facility is cleverly named “Takeout”. The Takeout web page lists all the various Google services for which you can request your data (Gmail, YouTube, Search, Nest, etc.). It also shows the files available for each service and the export format for each file (JSON, HTML, or CSV). Most of the time, Google doesn’t give you a choice of export format for individual files.

A portion of the Google Takeout request site at takeout.google.com — Screen image by creator

Google does a decent job of providing a high-level overview of the purpose of each file. There is, however, no documentation for individual fields.

I received 94 files in my extract. As with Facebook, there were the usual administrative files related to device information, account attributes, preferences, and login/access history.

One interesting file is the one titled ‘…/Ads/MyActivity.json’. It contains a history of ads presented to me as a result of searches.

Some entries in the Ads/MyActivity file have URLs containing a clickserve domain, for example:

Screen capture by creator

Per Google’s 360 ads website, these are ads from an ad campaign run by one of Google’s advertisers, served to me as a result of some click activity of mine. The file doesn’t give any information on which action I took that caused the ad to be served.

The ‘title’ column in the file distinguishes between sites “Visited” and topics “Searched”. The “Visited” records all have “From Google Ads” in the ‘details’ column (see example above), leading me to believe that Google served an ad to me in response to my having visited a particular site.

The “Searched” records show sites I visited directly (macys.com, yelp.com, etc.). The ‘details’ column shows those sites, while the ‘title’ column apparently shows what I searched for on those separate sites. For example:

Screen capture by the creator
Screen capture by the creator

Another file I found interesting is named ‘…/My Activity/Discover/MyActivity.json’. It’s a history of the topic suggestions that Google presented to me through its “Discover” feature in the Google app (formerly the Google Feed feature; more on Discover here). Discover topics are chosen based on your web and app activity, assuming you give Google permission to use your activity to guide Discover topics.

Even though I don’t allow Discover to use my web and app activity, Discover still presented some topic suggestions relevant to me. Here is an edited sample of the topics presented most frequently over several days:

We see here the recurring themes of technology and travel, along with a new theme we will also see in the Apple files: music!

Google includes in its download several files tracking activity history across Google’s products and services. For example, I received history for my visits to the developers.google.com and cloud.google.com sites for training and documentation resources. No compelling insights came from this data, but it did remind me of topics I wanted to revisit and study further.

Other historical data in the extract included searches and actions performed within my Gmail account; search requests for images; places searched, directions requested, and maps viewed through the Google Maps app; searches performed for videos on the web (outside of YouTube); searches done on and watch history for YouTube; and contacts I store with Google, presumably in Gmail.

Unlike Facebook, Google doesn’t provide any information on a demographic profile that Google has built for me.

Note that you can view your Google activity data across its products and apps by visiting myactivity.google.com:

Screen clip by the creator

While you cannot export the data from this site, you can browse the data, which gives you a sense of the kind of data you may want to export through the Google Takeout site.

Microsoft

Microsoft lets you export some of your data through the Microsoft Privacy Dashboard. For individual Microsoft services not available on the Dashboard (for example, MSDN, OneDrive, Microsoft 365, or Skype data) you can use links in the “How to access and control your personal data” section of Microsoft’s privacy statement page. The same page directs you to a web form you can submit if you are looking for data that isn’t available by any of the above methods.

I chose to export all data available through the Privacy Dashboard. This included browsing history, search history, location activity, music, TV and movie history, and app and service usage data. I also asked for an export of my Skype data. My export included four csv files, six json files, and six jpeg files.

No file documentation was included in the export and I found none on the Microsoft site. The field names in the files are, however, fairly intuitive.

A couple of interesting observations from the Microsoft files:

The file ‘…MicrosoftSearchRequestsAndQuery.csv’ contains data from searches I performed over the last 18 months, including search terms and, apparently, the site that I clicked on, if any, from the search results. It looks like the data covers only searches that I did through Bing or Windows Search.

Based on the data, it appears I clicked on a link in the search results only 40% of the time (347 out of 870 searches performed). From this, I assume that the searches for which I didn’t click on a link were either poorly crafted and returned off-topic results, or I was able to get the answer I wanted just by reading the link previews in the search results. I don’t recall having to redo search terms frequently, and I know I often see the answer I need right in a link preview, since many of my searches are for reminders on coding syntax. Either way, I was a bit surprised at the 40% click-through rate. I would have expected it to be much higher.
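The click-through figure is simple arithmetic: 347 / 870 is roughly 0.399. A rate like that could be computed from the csv along the following lines. This is a minimal sketch under assumptions: the column names `SearchTerms` and `ClickedUrl` are hypothetical stand-ins (the real export’s headers may differ), and a search is counted as “clicked” whenever the click column is non-empty.

```python
import csv
import io


def click_through_rate(rows, clicked_field):
    """Return (clicked, total, rate) for an iterable of dict rows.

    A row counts as clicked when `clicked_field` holds a non-empty value.
    """
    total = 0
    clicked = 0
    for row in rows:
        total += 1
        if (row.get(clicked_field) or "").strip():
            clicked += 1
    rate = clicked / total if total else 0.0
    return clicked, total, rate


# Hypothetical three-row sample standing in for the real file.
sample_csv = io.StringIO(
    "SearchTerms,ClickedUrl\n"
    "pandas groupby syntax,\n"
    "python strftime codes,https://docs.python.org/\n"
    "austin weather,\n"
)

clicked, total, rate = click_through_rate(csv.DictReader(sample_csv), "ClickedUrl")
print(f"{clicked} of {total} searches led to a click ({rate:.0%})")
```

With a real export, the `io.StringIO` object would be replaced by a file opened with `open(path, newline="", encoding="utf-8")`.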

There was not much of interest in the Skype data. It contained the history of in-app message threads between me and other Skype meeting participants. Also included were .jpeg files with images of participants from a few of my calls.

Apple Fitness

I had to access my Apple health and fitness data separately from the other data that I exported from Apple. The health and fitness data is accessed from the Health app on the iPhone. You simply tap your icon in the upper right-hand corner of the Health app screen. It takes you to a profile screen, where you then tap the Export All Health Data link at the bottom of the screen:

Screen capture by creator

My health export included just under 500 .gpx files totaling 102 meg. They contain route information from my recorded workouts over the last several years. Another 48 files contained 5.3 meg of electrocardiogram data from self-tests that I performed on my Apple Watch.

The file named ‘…/Apple/apple_health_export/export.xml’ contains the really interesting data. For me, it’s 770 meg with 1,956,838 records covering many different health and exercise measurements over roughly seven years. Some of the activity types measured are as follows:

Table by creator

Note that the frequency at which Apple records data varies by activity type. For example, Active Energy Burned is recorded hourly while Stair Ascent Speed is recorded only when going up stairs, leading to the big difference in observation counts between these two activity types.

The data recorded for each observation includes the date/time at which the observation was recorded, the start and end dates/times of the activity being measured, and the device that recorded the activity (iPhone or Apple Watch).

In his excellent Medium article “Analyse Your Health with Python and Apple Health”, Alejandro Rodríguez provides the code that I used to parse the xml in the export.xml file and create a Pandas data frame. (Thanks, Alejandro!) After selecting a one-year subset of the data and grouping and aggregating it at the day and activity type levels, I found some interesting things.
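For readers who want a feel for the parsing and daily aggregation step without pandas, here is a minimal standard-library sketch. It is not the code from the article cited above. It assumes each observation is a `<Record>` element with `type`, `startDate` and `value` attributes, and the type identifier and sample values below are illustrative; check them against your own export.xml.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def daily_totals(xml_source, record_type):
    """Sum the numeric `value` of matching <Record> elements per start day."""
    totals = defaultdict(float)
    # iterparse streams the file, which helps with exports of several hundred meg.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Record" and elem.get("type") == record_type:
            day = elem.get("startDate", "")[:10]  # "YYYY-MM-DD" prefix
            try:
                totals[day] += float(elem.get("value", ""))
            except ValueError:
                pass  # skip records whose value is not numeric
        elem.clear()  # drop the element's contents once it has been handled
    return dict(totals)


# Tiny hypothetical export standing in for the real file.
sample = io.BytesIO(b"""<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="Apple Watch"
          startDate="2023-01-01 08:00:00 -0600" endDate="2023-01-01 08:10:00 -0600" value="500"/>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="iPhone"
          startDate="2023-01-01 18:00:00 -0600" endDate="2023-01-01 18:05:00 -0600" value="250"/>
  <Record type="HKQuantityTypeIdentifierFlightsClimbed" sourceName="iPhone"
          startDate="2023-01-02 09:00:00 -0600" endDate="2023-01-02 09:01:00 -0600" value="3"/>
</HealthData>""")

steps = daily_totals(sample, "HKQuantityTypeIdentifierStepCount")
print(steps)
```

One caveat worth keeping in mind: records from an iPhone and an Apple Watch can cover the same time span, so a real analysis may need to pick one source or deduplicate before summing.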

As I suspected, my average activity levels were different on days when I was travelling compared to days when I was in one of the cities I call home (Austin or Chicago). To see this, I had to use the latitude and longitude data from the .gpx workout route files mentioned earlier. That allowed me to determine which of the routes were in a home city and which occurred while I was travelling. I then merged that location data with my activity summary data. This was then further summarized by activity type and location (home city or travelling). Here is the pattern that emerged:

Image by creator
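One way to label a workout route as “home city” or “travelling” is to measure the great-circle distance from the route’s first point to each city center. The sketch below is an assumption about how such a step could look, not the method used for the chart above: the 50 km radius, the function names and the sample route are hypothetical, and the city coordinates are approximate.

```python
import math
import xml.etree.ElementTree as ET

# Approximate city-center coordinates as (latitude, longitude).
HOME_CITIES = {"Austin": (30.2672, -97.7431), "Chicago": (41.8781, -87.6298)}
HOME_RADIUS_KM = 50.0  # hypothetical cutoff for "in a home city"


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def first_point(gpx_text):
    """Return (lat, lon) of the first <trkpt>, ignoring the XML namespace."""
    root = ET.fromstring(gpx_text)
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "trkpt":
            return float(elem.get("lat")), float(elem.get("lon"))
    raise ValueError("no track points found")


def classify(lat, lon):
    """Return the home city within the radius, else 'Travelling'."""
    for city, (city_lat, city_lon) in HOME_CITIES.items():
        if haversine_km(lat, lon, city_lat, city_lon) <= HOME_RADIUS_KM:
            return city
    return "Travelling"


# Hypothetical one-point route far from both home cities.
sample_gpx = """<gpx xmlns="http://www.topografix.com/GPX/1/1" version="1.1">
  <trk><trkseg>
    <trkpt lat="28.4743" lon="-81.4678"><ele>30.0</ele></trkpt>
  </trkseg></trk>
</gpx>"""

print(classify(*first_point(sample_gpx)))
print(classify(30.25, -97.75))
```

Using only the first track point keeps the sketch cheap; a route that starts in one place and ends far away would need a more careful rule.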

While in Chicago, I’m in an apartment building with an elevator, so the big decline in average flights climbed was not a surprise. What was surprising was the rise in activity levels for Chicago versus Austin. My exercise routine is very similar in both locations, yet I do more work in Chicago. I think I can attribute this to the fact that I walk to more places in Chicago, rather than driving most of the time. Clearly, I need to up the amount that I exercise in Austin.

Spotting trends like the one above, which you cannot see in the standard charts of the Apple Health app, is a great use for the health data.

The data is also great for modeling, given that it is very complete and generally clean. Here, for example, is a time series forecast of my exercise minutes based on a one-year period using Facebook’s Prophet model:

Forecast of exercise minutes using default weekly seasonality, no annual seasonality — Image by creator

Here is the same forecast, but with annual seasonality enabled and weekly seasonality added manually based on my location (Austin, Chicago or travelling):

Forecast of exercise minutes using annual seasonality and manual weekly seasonality — Image by creator

The default weekly seasonality model above (first plot) does a worse job of fitting the training data than the model with custom seasonality terms added (second plot). However, the default seasonality model is much better (though still not great) at predicting future values of exercise minutes. Needless to say, hyperparameter tuning would help improve these results.

Mean Absolute Percent Error of Different Models — chart by creator
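Mean absolute percent error is the average of |actual - predicted| / |actual|, expressed as a percentage. A minimal sketch of the calculation follows. The sample numbers are made up, and skipping days with zero actual minutes is my own assumption for handling rest days, on which the ratio is undefined.

```python
def mape(actual, predicted):
    """Mean absolute percent error, skipping points where the actual value is zero."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]
    if not errors:
        raise ValueError("no non-zero actual values to score")
    return 100.0 * sum(errors) / len(errors)


# Made-up exercise minutes for four days; the zero-minute day is skipped.
actual_minutes = [30, 45, 0, 60]
forecast_minutes = [27, 45, 5, 66]

print(f"MAPE: {mape(actual_minutes, forecast_minutes):.2f}%")
```

Scoring the training period and a held-out period separately with the same function is what makes a comparison like “fits better, predicts worse” visible.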

This is just a sample of the kind of modeling you can experiment with using your health data. Do you want to try using very granular time-series data? Take a look at the workout route files. They have observations for every second of your recorded workouts, with latitude, longitude, elevation and speed fields.

Apple — Non-Fitness/Health

You request a download of all your non-fitness/health data from Apple’s main website. For me, that amounted to 84 files, mostly .csv and .json files along with a few .xml files. I also received hundreds of .vcf files, one for each of the contacts I have on my Apple devices. In total, I downloaded 68 meg of data, excluding the .vcf files.

Apple stands out in that it provides comprehensive documentation for each of the data files. It includes explanations of each field, though some definitions are more helpful than others. The documentation helped me interpret a few data files that looked intriguing.

As with most other exports, Apple’s files included the usual administrative data, such as my preferences for various apps, login information and device information. I didn’t find anything remarkable in those files.

There are several files related to Apple Music, one of the services to which I subscribe. Files with titles like:

  • “…/Media_Services/Apple Music - Play History Daily Tracks.csv”;
  • “…/Media_Services/Apple Music - Recently Played Tracks.csv”; and,
  • “…/Media_Services/Apple Music Play Activity.csv”

contain information such as:

  • date and time a song was played;
  • play duration in milliseconds;
  • how each play ended (for example, it reached the end of the track, or I skipped past the song);
  • the number of times the song has been played;
  • the number of times the song was skipped;
  • the song title;
  • the album title, if any;
  • the song’s genre; and,
  • where the song was played from: my library, a playlist, or one of Apple’s radio channels.

My files contained between 13,900 and 20,700 records, depending on the purpose of the file. The data covered nearly seven years of song plays.

Apple captures a range of data on how song plays are ended, probably for purposes of recommending other songs to me. Song play termination reasons include:

Table by creator

For purposes of the analyses I show below, I focused on the ‘NATURAL_END_OF_TRACK’, ‘TRACK_SKIPPED_FORWARDS’, and ‘MANUALLY_SELECTED_PLAYBACK_OF_A_DIFF_ITEM’ end reasons.

Sometimes I’ll repeat a song that I like. One question I had was “Do I play favorite songs obsessively, over and over again?”. I answered that question using the Apple data:

Table by creator

The table above summarizes the number of times I’ve played some favorite songs (‘Play Count’) and the number of days on which I played them (‘Played on Number of Days’). It looks like I generally play a song just once per day. Also, given that the play count is lower than the day count for some songs, I must skip some favorites if I have heard them too many times recently or if the song doesn’t fit my mood at the time. So, no obsessive playing here!
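A summary of this kind could be sketched as follows. How the two columns in the table were actually derived is not stated, so this is purely one possible reading: completed plays are rows that ended with ‘NATURAL_END_OF_TRACK’, while the day count includes any day on which the song started, skipped or not. The tuple layout, ISO-style timestamps and song titles are hypothetical.

```python
from collections import defaultdict

COMPLETED = "NATURAL_END_OF_TRACK"


def play_stats(plays):
    """Return {title: (completed_plays, distinct_days)}.

    `plays` is an iterable of (title, iso_timestamp, end_reason) tuples.
    """
    completed = defaultdict(int)
    days = defaultdict(set)
    for title, timestamp, end_reason in plays:
        days[title].add(timestamp[:10])  # "YYYY-MM-DD" prefix
        if end_reason == COMPLETED:
            completed[title] += 1
    return {title: (completed[title], len(days[title])) for title in days}


# Hypothetical play records.
plays = [
    ("Song A", "2023-03-01T08:00:00Z", "NATURAL_END_OF_TRACK"),
    ("Song A", "2023-03-01T17:30:00Z", "NATURAL_END_OF_TRACK"),
    ("Song A", "2023-03-02T09:00:00Z", "TRACK_SKIPPED_FORWARDS"),
    ("Song B", "2023-03-01T08:05:00Z", "TRACK_SKIPPED_FORWARDS"),
    ("Song B", "2023-03-04T08:05:00Z", "NATURAL_END_OF_TRACK"),
]

stats = play_stats(plays)
for title, (play_count, day_count) in sorted(stats.items()):
    print(title, play_count, day_count)
```

Under this reading, a song that is often skipped ends up with a play count below its day count, which matches the pattern described above.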

I also wondered if I favor certain types of songs on different days of the week, different times of the day, or even different months of the year. My intuition says that I do. With the Apple data, it was easy to visualize the genres I played at different times. Here, for example, are the genres I played most frequently during each month of the year:

Image by creator

I clearly favor rock songs, with alternative and pop music added for some occasional variety. July and August appear to be the months when I prefer the variety.

That said, I was surprised at just how much rock I seem to play. Admittedly, I love it. But I also believe I have pretty broad taste in music.

So, I questioned the accuracy of the genres assigned to the songs in Apple’s data. For one thing, 10,083 of the 22,313 song plays in my file had no genre assigned to them. Also, there appears to be a lot of overlap in the genres assigned. For example, “R&B/Soul”, “Soul and R&B”, “Soul”, and “R&B / Soul” are all genres assigned to different songs in my data. The totals in the chart above would surely be different if I recast the genres of all songs to use a consistent genre naming scheme.
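For anyone who does want to recast genres, a normalization step might look like the minimal sketch below. The rules are assumptions for illustration: labels are lower-cased, split on “/” and on the word “and”, sorted, and rejoined, and the `ALIASES` table (which folds plain “Soul” into the combined label) is a hypothetical editorial choice.

```python
import re

# Hypothetical editorial choice: treat plain "soul" as part of "r&b/soul".
ALIASES = {"soul": "r&b/soul"}


def normalize_genre(raw):
    """Map spelling variants of a genre label to one canonical form."""
    if raw is None or not raw.strip():
        return "unknown"
    # Split on "/" (with optional spaces) or on the standalone word "and".
    parts = re.split(r"\s*/\s*|\s+and\s+", raw.strip().lower())
    canonical = "/".join(sorted(part for part in parts if part))
    return ALIASES.get(canonical, canonical)


for label in ["R&B/Soul", "Soul and R&B", "Soul", "R&B / Soul", "Rock", ""]:
    print(repr(label), "->", normalize_genre(label))
```

Splitting on “and” would also break up labels such as “Rock and Roll”, so any real mapping deserves a manual review of the distinct genre values first.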

Rather than invest the time to update the genres, I decided on another test to determine whether the trends in the chart truly represent my playing patterns. Since Apple includes song play ending reasons in the data, I looked to see if I tend to skip past rock songs more frequently than other genres, which would indicate that I try to play other genres when too many rock songs are being played.

Plot by creator

As it turns out, I don’t skip past rock songs significantly more than I skip past other genres that I listen to frequently. I’ll have to face it: I’m a die-hard rock fan.
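A skip-rate comparison of this kind can be sketched in a few lines. The end-reason strings are the three named earlier in this article; treating a manual jump to a different item as a skip is my assumption, as are the tuple layout and the sample rows.

```python
from collections import defaultdict

COMPLETED = "NATURAL_END_OF_TRACK"
# Assumption: a manual jump to another item counts as a skip too.
SKIP_REASONS = {"TRACK_SKIPPED_FORWARDS", "MANUALLY_SELECTED_PLAYBACK_OF_A_DIFF_ITEM"}


def skip_rate_by_genre(plays):
    """Return {genre: share of plays skipped}, ignoring other end reasons.

    `plays` is an iterable of (genre, end_reason) pairs.
    """
    skipped = defaultdict(int)
    total = defaultdict(int)
    for genre, reason in plays:
        if reason != COMPLETED and reason not in SKIP_REASONS:
            continue  # neither finished nor skipped; leave it out
        total[genre] += 1
        if reason in SKIP_REASONS:
            skipped[genre] += 1
    return {genre: skipped[genre] / total[genre] for genre in total}


# Hypothetical rows; "OTHER" stands in for any end reason not considered here.
plays = [
    ("Rock", COMPLETED),
    ("Rock", COMPLETED),
    ("Rock", COMPLETED),
    ("Rock", "TRACK_SKIPPED_FORWARDS"),
    ("Pop", COMPLETED),
    ("Pop", "MANUALLY_SELECTED_PLAYBACK_OF_A_DIFF_ITEM"),
    ("Jazz", "OTHER"),
]

rates = skip_rate_by_genre(plays)
for genre, rate in sorted(rates.items()):
    print(f"{genre}: {rate:.0%} skipped")
```

Genres with only a handful of plays will produce noisy rates, so a minimum play count per genre is a sensible filter before comparing.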

Another interesting file is named “…/Media_Services/Stores Activity/Other Activity/App Store Click Activity.csv”. While I don’t analyze it here, I recommend it to anyone who wants to get a sense of the kind of data a retailer may want to track for activity on its website. For me, it included 4,900+ records with a detailed history of my activity while in the App Store and, apparently, in Apple Music. Types of actions I took, dates/times, an A/B test flag, search terms, and data presented to me (“impressed” is the term used) are among the items included in the file.

One last potentially interesting file for analysis is named “Media_Services/Stores Activity/Other Activity/Apple Music Click Activity V3.csv”. It includes the city and longitude/latitude of the IP address where, I assume, I was using Apple Music. For me, the file had 10,000 records.

Verizon

After a long 80+ day wait, Verizon notified me I could download my data. It included 17 csv files for a total of 1.4 meg of data. Most of the files covered account administrative information (cell line descriptions, device information, billing history, order history, etc.), the history of notifications that Verizon sent to me, and my recent texting history (but without text contents). Though Call History and Data Usage files were provided, they were empty apart from a notation that the data was “Masked for security”.

Verizon provided two documentation files. One contained the names and general descriptions of 34 possible files that could be included in a download. The files included depend on the Verizon services you use. The second documentation file contained a description of 3,091 data fields that could appear in the files. While the data field descriptions are helpful, they lack some detail. For example, a number of fields are described as containing codes for various purposes, but the codes themselves and their meanings are not described.

One file that was extremely interesting is named “…/Verizon/General Inferences.csv”. It contains a surprising amount of demographic detail about me and about other people in my household. Here is how Verizon’s documentation describes the file:

“The General Inferences file provides information general assumptions and inferences to deliver more relatable and relevant content across our platforms. This may include information like Attributes, Preferences, or Opinions.”

Based on the nature of the demographic features, I assume most of it was acquired by Verizon from outside data aggregators and not gathered by Verizon directly from me. The number and scope of demographic features far exceed any information that I ever provided directly to Verizon.

In fact, the Verizon documentation speaks about another file called the “General” information file (not included in my download). The documentation says the “General” file includes data that came from external information sources. My guess is that the information in the “General Inferences” file also comes from those external sources. Some of the financial data in the “General Inferences” file may have come from the credit report that Verizon requires of its customers.

A total of 332 demographic features were included in my General Inferences data. Here is an abridged list including some of the more surprising features:

Abridged list of demographic features from the General Inferences file - Table by creator

All the General Inferences features are apparently used by Verizon to market to me and retain me as a customer. As you can see in the above list, features about my spouse and our children are also included. You can see the complete list of 332 features here.

A few of the features that I found to be truly unusual include:

Table by creator

One has to wonder whether those types of data elements are really needed by Verizon to help it provide service to me and, if so, how Verizon uses them.

Amazon

Amazon provided 214 files containing 4.93 meg of data. Several of the files covered:

  • Account preferences;
  • Order history;
  • Fulfillment and returns history;
  • Viewing and listening history (Amazon Prime Video and Amazon Music);
  • Kindle purchases and reading activity; and,
  • Search history, including search terms.

If I were an Alexa customer or a Ring customer, I assume I would have received data for my activity on those services as well.

Six .txt files contained high-level descriptions of a few of the downloaded data files. Several .pdf files contained documentation for fields in the downloaded files (the “Digital.PrimeVideo.Viewinghistory.Description.pdf” file, for example).

The most interesting files from Amazon pertain to the marketing audiences associated with me by Amazon, its advertisers, or “third parties”. I presume the third parties are data vendors from whom Amazon purchases data.

The “…/Amazon/Advertising.1/Advertising.AmazonAudiences.csv” file contains the audiences that Amazon itself assigned me to. Here’s a sample of the 21 audiences:

Audiences assigned to me by Amazon — Table by creator

Amazon’s own audience assignments are largely accurate when I consider products that I purchased or searched for, either for myself or on behalf of others.

The “…/Amazon/Advertising.1/Advertising.AdvertiserAudiences.csv” file apparently contains a list of Amazon advertisers who brought their own audiences to Amazon and whose audience lists included me. The file contains 50 advertisers. Here’s a sample:

Amazon advertisers who’ve me of their audience lists — Table by creator

I do business with or own products from some of the advertisers in the list (for example, Delta, Intuit, Zipcar), so I understand how I ended up on their audience lists. I have no connection with others on the list (for example, AT&T, Red Bull, Royal Bank of Canada), so I’m not sure how I got on their audience lists.

According to Amazon, the file

“…/Amazon/Advertising.1/Advertising.3PAudiences.csv”

contains a list of

“Audiences in which you are included by third parties”.

Its accuracy is poor. A total of 33 audiences are listed, 28 of which focus on automobile ownership. The remaining audiences cover gender, education level, marital status and dependents. A sample of the automobile-related audiences:

Sample of automobile-related audience assignments by third party vendors — Table by creator

While the gender, education level and marital status assignments in the file are accurate, only a few of the automobile-related assignments are correct. Most are not. And I’m just not interested enough in automobiles to warrant 28 of 33 profile assignments. Mercifully, Amazon seems to ignore this data when it presents product or video recommendations to me.

Parting Thoughts

In this article, I hoped to show you the wide variety of data you can get from the companies with which you do business. The data allows you to learn what those companies think about you while also learning some surprising things about yourself!

We’ve seen that some companies correctly identify my interests in technology and travel, while one company incorrectly sees me as an avid automobile enthusiast. In an eye-opening and somewhat unnerving moment, I realized that another company has extensive demographic details about my family.

I learned I need to step up my workout regime in one of the two places I call home, even though I thought my workouts were equivalent in both places. I found that some companies (Facebook, Google) do not have a strong view of my profile. Yet the demographic picture that Verizon has of me is shockingly accurate.

The data that the various companies give you provides a rich source of raw material for experimentation. It’s data that is ripe for deep analysis, modelling and visualization. For example, geographic coordinates and timestamps are available for many observations, allowing you to visualize or model your movements.

I hope you discover your own set of interesting insights by downloading your personal data. Please let me know if you have noteworthy experiences in working with companies other than those I cover here.

It’s your data — Now go for it!
