Sunday, November 23, 2014

Facebook Friend Clustering

My wife is currently reading Dataclysm: Who We Are (When We Think No One's Looking) and pointed me to this interesting Relationship Test tool she learned about in the book. It analyzes your Facebook friends and then shows both their connections to you and to each other. The graph assigns weight to relationships, so "cliques" of friends will cluster together.  Here is the result I get:

Ken Weiner Friend Graph
Ken Weiner's Friend Graph

The white circle in the center is myself.  The cluster in the upper left identifies my friends from my
high school, Agoura High School.  The cluster in the lower right is mostly coworkers from my previous job at LowerMyBills.com. The 2 less-defined clusters to the right of my high school cluster is a mix of family and college friends.

In addition to visualizing friend clusters, this tool also tries to identify your spouse or romantic partner using an algorithm from a paper called Romantic Partnerships and the Dispersion of Social Ties:
A Network Analysis of Relationship Status on Facebook. It tries to answer this question:
Given all the connections among a person's friends, can you recognize his or her romantic partner from the network structure alone?
It measures something called dispersion - the extent to which two people's mutual friends are not themselves well-connected.

On my network, it worked extremely well, identifying my wife as the person with whom I have the highest dispersion score (called assimilation score in this tool).

Portion of my Facebook Connections Sorted by Assimilation Score
Portion of my Facebook Connections Sorted by Assimilation Score

This means that it is through my wife that I am connected to the most people that aren't themselves connected.  This is natural, of course, because I have met my wife's high school friends, college friends, and coworkers, most of which don't know each other.

There are a few people that rank high on the list that aren't actually "central" to my life.  If I ignore all the people with less that 3 mutual friends, the list becomes much more meaningful and predictive of who really is central to my life.

The ability to analyze my social network like this motivated me to visit Facebook and Unfriend acquaintances that I really never developed much of a relationship with.  Now, both my Facebook news feed and Dispersion scores will be that much more meaningful!

Sunday, November 16, 2014

HTTP Request Timings with cURL

When a client makes an HTTP request, the following things happen:

  1. DNS name resolution
  2. SSL/SSH/etc connect/handshake to the remote host if applicable
  3. TCP connection to the remote host
  4. Negotiations specific to the particular protocol(s) involved
  5. Redirects if applicable
  6. Content generation on remote host
  7. Content transfer

Have you ever wanted to know how long each phase of an HTTP request takes?

It turns out this is relatively straightforward to do with the command line tool cURL and its powerful "Write Out" option.

Write Out Option

cURL comes with an option to print out a lot of useful information related to a request.  The following is taken directly from the cURL man page:

-w, --write-out

Defines what to display on stdout after a completed and successful operation. The format is a string that may contain plain text mixed with any number of variables. The string can be specified as "string", to get read from a particular file you specify it "@filename" and to tell curl to read the format from stdin you write "@-".

The variables present in the output format will be substituted by the value or text that curl thinks  fit, as described below. All variables are specified as %{variable_name} and to output a normal % you just write them as %%. You can output a newline by using \n, a carriage return with \r and a tab space with \t.

NOTE: The %-symbol is a special symbol in the win32-environment, where all occurrences of % must be doubled when using this option.

The variables available are:

VariableDescriptionSince
content_typeThe Content-Type of the requested document, if there was any.
filename_effectiveThe ultimate filename that curl writes out to. This is only meaningful if curl is told to write to a file with the --remote-name or --output option. It's most useful in combination with the --remote-header-name option.7.25.1
ftp_entry_pathThe initial path curl ended up in when logging on to the remote FTP server.7.15.4
http_codeThe numerical response code that was found in the last retrieved HTTP(S) or FTP(s) transfer. In 7.18.2 the alias response_code was added to show the same info.
http_connectThe numerical code that was found in the last response (from a proxy) to a curl CONNECT request.7.12.4
local_ipThe IP address of the local end of the most recently done connection - can be either IPv4 or IPv6.7.29.0
local_portThe local port number of the most recently done connection.7.29.0
num_connectsNumber of new connects made in the recent transfer.7.12.3
num_redirectsNumber of redirects that were followed in the request.7.12.3
redirect_urlWhen an HTTP request was made without -L to follow redirects, this variable will show the actual URL a redirect would take you to.7.18.2
remote_ipThe remote IP address of the most recently done connection - can be either IPv4 or IPv6.7.29.0
remote_portThe remote port number of the most recently done connection.7.29.0
size_downloadThe total amount of bytes that were downloaded.
size_headerThe total amount of bytes of the downloaded headers.
size_requestThe total amount of bytes that were sent in the HTTP request.
size_uploadThe total amount of bytes that were uploaded.
speed_downloadThe average download speed that curl measured for the complete download. Bytes per second.
speed_uploadThe average upload speed that curl measured for the complete upload. Bytes per second.
ssl_verify_resultThe result of the SSL peer certificate verification that was requested. 0 means the verification was successful.7.19.0
time_appconnectThe time, in seconds, it took from the start until the SSL/SSH/etc connect/handshake to the remote host was completed.7.19.0
time_connectThe time, in seconds, it took from the start until the TCP connect to the remote host (or proxy) was completed.
time_namelookupThe time, in seconds, it took from the start until the name resolving was completed.
time_pretransferThe time, in seconds, it took from the start until the file transfer was just about to begin. This includes all pre-transfer commands and negotiations that are specific to the particular protocol(s) involved.
time_redirectThe time, in seconds, it took for all redirection steps include name lookup, connect, pretransfer and transfer before the final transaction was started. time_redirect shows the complete execution time for multiple redirections.7.12.3
time_starttransferThe time, in seconds, it took from the start until the first byte was just about to be transferred. This includes time_pretransfer and also the time the server needed to calculate the result.
time_totalThe total time, in seconds, that the full operation lasted. The time will be displayed with millisecond resolution.
url_effectiveThe URL that was fetched last. This is most meaningful if you've told curl to follow location: headers.

Here's a sample cURL command to show each one of these properties while making a request to Google's home page:

curl -w '\ncontent_type=%{content_type}\nfilename_effective=%{filename_effective}\nftp_entry_path=%{ftp_entry_path}\nhttp_code=%{http_code}\nhttp_connect=%{http_connect}\nlocal_ip=%{local_ip}\nlocal_port=%{local_port}\nnum_connects=%{num_connects}\nnum_redirects=%{num_redirects}\nredirect_url=%{redirect_url}\nremote_ip=%{remote_ip}\nremote_port=%{remote_port}\nsize_download=%{size_download}\nsize_header=%{size_header}\nsize_request=%{size_request}\nsize_upload=%{size_upload}\nspeed_download=%{speed_download}\nspeed_upload=%{speed_upload}\nssl_verify_result=%{ssl_verify_result}\ntime_appconnect=%{time_appconnect}\ntime_connect=%{time_connect}\ntime_namelookup=%{time_namelookup}\ntime_pretransfer=%{time_pretransfer}\ntime_redirect=%{time_redirect}\ntime_starttransfer=%{time_starttransfer}\ntime_total=%{time_total}\nurl_effective=%{url_effective}\n\n' -o /dev/null -s 'https://www.google.com/'

Where:
  • -w shows which properties to write out
  • -o /dev/null redirects the output of the request to /dev/null
  • -s tells cURL not to show a progress bar
  • http://www.cnn.com/ is the URL we are requesting

Timing

From the options above, the following are of interest when it comes to timing the phases of a request.

time_appconnectThe time, in seconds, it took from the start until the SSL/SSH/etc connect/handshake to the remote host was completed.7.19.0
time_connectThe time, in seconds, it took from the start until the TCP connect to the remote host (or proxy) was completed.
time_namelookupThe time, in seconds, it took from the start until the name resolving was completed.
time_pretransferThe time, in seconds, it took from the start until the file transfer was just about to begin. This includes all pre-transfer commands and negotiations that are specific to the particular protocol(s) involved.
time_redirectThe time, in seconds, it took for all redirection steps include name lookup, connect, pretransfer and transfer before the final transaction was started. time_redirect shows the complete execution time for multiple redirections.7.12.3
time_starttransferThe time, in seconds, it took from the start until the first byte was just about to be transferred. This includes time_pretransfer and also the time the server needed to calculate the result.
time_totalThe total time, in seconds, that the full operation lasted. The time will be displayed with millisecond resolution.


If we consider a simple request to a non-SSL page that involves no redirects, there would be 4 main phases:

HTTP Request Phases: DNS Lookup, TCP Connection, Content Generation, and Content Transfer
HTTP Request Phases



Let's run the following command to gather timings while loading CNN's home page:


curl -w '\ntime_namelookup=%{time_namelookup}\ntime_appconnect=%{time_appconnect}\ntime_connect=%{time_connect}\ntime_redirect=%{time_redirect}\ntime_pretransfer=%{time_pretransfer}\ntime_starttransfer=%{time_starttransfer}\ntime_total=%{time_total}\n\n' -o /dev/null -s 'http://www.cnn.com/'

Where:
  • -w shows which timing properties to write out
  • -o /dev/null redirects the output of the request to /dev/null
  • -s tells cURL not to show a progress bar
  • http://www.cnn.com/ is the URL we are requesting

This results in the following output:

time_namelookup=0.029
time_appconnect=0.000
time_connect=0.095
time_redirect=0.000
time_pretransfer=0.095
time_starttransfer=0.166
time_total=0.530

From this, we can compute the time taken in each of the 4 phases as follows:

DNS Lookup = DNS Lookup (29 ms) - Start (0 ms) = 29 milliseconds
TCP Connection = Pre Transfer (95 ms) - DNS Lookup (29 ms) =  66 millseconds
Content Generation = Start Transfer (166 ms) - Pre Transfer (95 ms) =  71 milliseconds
Content Transfer = Total (530 ms) - Start Transfer (166 ms) = 364 milliseconds

HTTP Request to http://www.cnn.com/
HTTP Request Timings for http://www.cnn.com/
So there you have it!  Next time a request seems slow, you can use cURL to help find out why. You can write cron-invoked scripts that invoke the cURL command, collect the timing information, and send the results to Splunk, Amazon Cloudwatch, or your favorite logging framework to get pretty graphs of data over time.