Here’s the final part of my short series on mining data on around 50,000 Twitter accounts, as recorded by Twanalyst. Previously:

  • Part one looked at user profiles. Generally, the more you fill out your profile (description, avatar, background image etc), the more followers you tend to have; and high-status description terms (‘entrepreneur’, ‘author’, ‘speaker’ etc) perform better than, er, low-status ones (‘student’, ‘nerd’ etc).
  • Part two discussed friends counts, and frequency of tweeting. There is an unsurprisingly close correlation between the number of friends you have and the number of followers; and you’re better off tweeting less than 30 times a day to avoid putting off followers. (Remembering always that correlation doesn’t mean causation, fact fans!)

Twanalyst also records data on the ‘type’ of tweets people write. It divides them into five categories:

  • Replies/mentions – anything beginning with a @ goes into this pot (mean 35%, median 34%)
  • Retweets – ie simply retweeting others’ content (with RT as the flag) (mean 5%, median 1%)
  • Links – tweets that contain web links pointing elsewhere (mean 16%, median 9%)
  • Hashtags – tweets that use a hashtag to participate in some group activity (mean 3%, median 0%)
  • Everything else – ie just normal tweets that aren’t any of the above (what people had for lunch, random witticisms, or whatever) (mean 41%, median 37%)

Obviously in reality these categories aren’t so discrete, but let’s live with that and assume everything falls into one or another. Twanalyst records each as a percentage of total tweeting output (it analyses the most recent 200 tweets).
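For the curious, here is a rough Python sketch of how tweets might be sorted into those five pots and turned into percentages. It is not Twanalyst's actual code: the function names, the regular expressions and, in particular, the order in which the categories are checked (reply first, then retweet, link, hashtag) are all assumptions made for illustration.

```python
import re
from collections import Counter

CATEGORIES = ("reply", "retweet", "link", "hashtag", "other")

def classify(tweet: str) -> str:
    """Assign a tweet to exactly one category; the first matching rule wins.

    The precedence order is an assumption, not something the post specifies.
    """
    text = tweet.strip()
    if text.startswith("@"):           # replies/mentions begin with @
        return "reply"
    if re.match(r"RT\b", text):        # retweets are flagged with a leading RT
        return "retweet"
    if re.search(r"https?://", text):  # contains a web link
        return "link"
    if re.search(r"#\w+", text):       # contains a hashtag
        return "hashtag"
    return "other"                     # everything else

def category_percentages(tweets):
    """Percentage of tweets per category, over at most the first 200 given."""
    recent = tweets[:200]
    counts = Counter(classify(t) for t in recent)
    total = len(recent) or 1           # avoid dividing by zero on empty input
    return {c: 100.0 * counts[c] / total for c in CATEGORIES}
```

Pass the tweets in newest-first order so that the slice picks up the most recent 200.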

Expressed as a graph of these percentages against average follower counts for each percentage point (I’ve chopped off a few extreme values due to accounts with hundreds of thousands of followers):

Tweet content/followers

The ‘lines of best fit’ are not hugely precise, but broadly speaking it seems that there is a slight correlation between tweeting links and higher follower counts – people are interested in accounts which gather interesting stuff from elsewhere and tweet about it. The other values don’t really have any strong correlations.

One final analysis. Twanalyst also calculates a user’s Automated Readability Index – ie a rough measure of the simplicity or complexity of the language they use. A figure of between 6 and 12 represents ‘normal’ prose: below is simplistic and much above enters the realm of obscurantism. (It should be noted though that because tweets often contain links, odd hashtags and so on, the ARI figure is of necessity a bit vague.) Here’s ARI (chopped off at 50, and ignoring twitter accounts with more than 100,000 followers) measured against average follower counts for each data point:

readability

Not much to add here, except the obvious: very simple and very complex writing styles seem to put people off (apart from an odd blip at ARI=48), but a reasonable level of complexity may actually be popular. Or it may all be coincidence. Over and out!
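For reference, the standard Automated Readability Index formula is 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43. Here is a minimal Python sketch of it. The tokenising rules (what counts as a word or a sentence) are my own simplifying assumptions; how Twanalyst actually splits tweets up isn't described here, and links or hashtags would skew the result just as noted above.

```python
import re

def automated_readability_index(text: str) -> float:
    """Standard ARI formula with a deliberately naive tokeniser (an assumption)."""
    # Words: runs of letters, digits and apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    # Sentences: non-empty chunks between runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    # Characters: letters and digits only, so apostrophes are not counted.
    characters = sum(len(w.replace("'", "")) for w in words)
    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)
```

Very short, simple text comes out with a low or even negative score, which is normal for this formula.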

Originally published at hatmandu.net. You can comment here or there.

 
 
22 December 2009 @ 11:23 am

For the last decade I’ve been following the fascinating work of Gerd Gigerenzer and colleagues (especially Dan Goldstein) – as briefly as I can state it, he has identified a number of very simple heuristics which outperform far more complex models for decision-making processes or making predictions about certain kinds of data (this stuff has partly inspired my Feweristics project). The most accessible explanation of all this is in his book Gut Feelings, where he explains things such as the recognition heuristic, and how it can be used to predict the winner of Wimbledon, or build a stock market portfolio that outperforms many experts, and so on.

Now two researchers, inspired by Goldstein and Gigerenzer’s ‘take-the-best heuristic’, have applied the less-information-beats-more methodology to the US elections since 1972. You can read their paper, Predicting elections from the most important issue facing the country (PDF – I found it via Decision Science News, the work of GG’s collaborator Dan Goldstein), though the bare bones are as follows.

In the abstract, authors Andreas Graefe and J Scott Armstrong say that their simple model, called PollyMIP, “correctly predicted the winner of the popular vote in 97% of all forecasts. For the last six elections, it yielded a higher number of correct predictions of the election winner than the Iowa Electronic Markets”. Basically, they used a database of pre-election polls to identify what voters thought was the single most important issue each time (this varied over time before the election, in some cases more than others), then used the same database to pull out poll results for which of the two candidates (ie Democrat or Republican) they believed would deal with that issue best (they looked at all polls up to 100 days before the election). In passing, they corroborated other research that the incumbent party always starts with an advantage. (The authors note in their paper: “In the real world, people usually have to make decisions under the constraints of limited information and time, which is why models of rational choice often fail in explaining behaviour.”)

In full, their PollyMIP heuristic works thus (taken verbatim from their appendix):

Step 1 (identifying the most important problem)
Search rule: Look up last available poll on the most important problem facing the country; sort problems in the order of importance.
Stopping rule: Stop search if there is a single most important problem. If two or more problems are of similar importance, average their importance with the results from the most recent previously published poll until a problem is identified as the single most important.

Step 2 (obtaining voter support for candidates on most important problem)
Search rule: Look up polls that obtained voter support on the problem identified in step 1.
Stopping rule: Stop search if there are one or more polls available. Average voter support for each candidate and calculate the two-party shares of the incumbent. Move to step 3.
If no polls are available and the most important problem (as identified in step 1) is different from the previous day, move to step 2.A. Otherwise move to step 2.B.

2.A (most important problem different to the day before)
Stopping rule: Take the incumbent’s two party share of voter support from the last available poll on the most important problem. Move to step 3.

2.B (most important problem similar to the day before)
Stopping rule: Take the PollyMIP score (see step 3) from the previous day. Move to step 3.

Step 3 (determining election winner)
Decision rule: Average the incumbent’s two-party share of voter support for the last three days, which is referred to as the PollyMIP score. If the PollyMIP score is above 50%, predict the incumbent to win. If it is below 50%, predict the challenger to win. Otherwise, predict a tie.

Or, more briefly: “(1) Identify the problem seen as most important by voters, (2) calculate the two-party shares of voter support for the candidates on this problem and average them for the last three days, and (3) predict the candidate with the higher voter support to win the popular vote.”
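The arithmetic behind steps 2 and 3 is simple enough to sketch in a few lines of Python. This is only an illustration of the rules quoted above, not the authors' code: the function names are made up, the inputs are assumed to be percentages, and the search and fallback rules (steps 1, 2.A and 2.B) are left out entirely.

```python
def two_party_share(incumbent_support: float, challenger_support: float) -> float:
    """Incumbent's share (in %) of the support given to the two main candidates."""
    return 100.0 * incumbent_support / (incumbent_support + challenger_support)

def pollymip_score(daily_shares):
    """Average the incumbent's two-party share over the last three days."""
    last_three = daily_shares[-3:]
    return sum(last_three) / len(last_three)

def predict(score: float) -> str:
    """Step 3 decision rule: above 50% incumbent, below 50% challenger, else tie."""
    if score > 50:
        return "incumbent"
    if score < 50:
        return "challenger"
    return "tie"
```

For example, daily shares of 52, 54 and 56 average to a PollyMIP score of 54, so the sketch would predict the incumbent to win.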

Not bad for predicting election results 97% of the time. I’d love to see whether this would work for Britain’s elections, too. (They used the iPOLL databank – anyone know if there’s an equivalent for the UK?)

Originally published at hatmandu.net. You can comment here or there.

 
 
I can't remember when I last had any idea who was in the Top 40, or when I last listened to the radio, or when I last paid any attention to the 'new releases' section in HMV. Christmas Number One isn't something I've concerned myself with for a very long time. As far as I'm concerned, if you're paying attention to this stuff then you're just feeding the problem in which popular music consists pretty much entirely of shit.

We've got the Internet now, people. If you want to find good new music, just go out there and look for it. Start with Last.fm and see how you go from there. Who cares what genre it is, or what label it's on, or whether it's ever going to be in the charts? If you like it, just find it and buy it. Do your own thing, stop caring about old-fashioned concepts like the Top 40, and then you'll be supporting the artists who really deserve it. Let the bloated, sick, over-hyped, X Factor-polluted world of popular music disappear up its own arse and die once and for all.
 
 
18 December 2009 @ 11:47 am
First they came for the dumb, and I said nothing, because I was not dumb.
Then they came for the stoic, and I did not complain, because I was not stoic.
Then they came for the ignorant, and I did not even know about it.
Then for the blind, but I did not see that, because I was not blind.
Then they came for the shy, but I did not demonstrate about it---I am not shy.
Then they came for the sane, but I wasn't mad about that, because I was not sane.
And when they came for me, I wasn't quite myself. So that was all right.
 
 

I’ve been analysing data from 50000 Twitter accounts, recorded by my Twanalyst tool (tracks your Twitter stats over time, and analyses your tweeting style and personality). In Part 1, I looked at how people’s profiles might correlate with their number of followers, and a few trends emerged.

This time I’ve been looking at the relationship between follower counts and the following:

  • Number of friends
  • Time since joining Twitter
  • Number of tweets written
  • Average number of tweets written per day

In each graph below, the X-axis shows the above data, with follower counts on the Y axis. The Y figures are averages taken for each value of X.
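The averaging step is easy to reproduce. The real crunching was done with an awk script; the Python below is just an equivalent sketch, and the record layout (a list of dicts with a "followers" field) is an assumption for illustration.

```python
from collections import defaultdict

def mean_followers_by(accounts, key):
    """Average follower count for each distinct value of accounts[i][key]."""
    totals = defaultdict(lambda: [0, 0])   # x value -> [sum of followers, count]
    for account in accounts:
        bucket = totals[account[key]]
        bucket[0] += account["followers"]
        bucket[1] += 1
    # One (x, mean y) point per distinct x, sorted by x for plotting.
    return {x: total / n for x, (total, n) in sorted(totals.items())}
```

So two accounts with 10 friends and 5 and 15 followers respectively would give a single point at X=10, Y=10.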

Friends

Friends/followers

The green line is the estimated line of best fit by OmniGraphSketcher (excellent Mac graphing program) – though it seems slightly generous. (I’ve cut friends off at 100000, as the few data points above that are so high that the rest of the data becomes unclear.) Roughly speaking, and unsurprisingly, there’s a one-to-one relationship between friends and followers. Want followers? Make friends.

Time

Days/followers

Obviously you need to have been on Twitter for a little time to get followers – but overall there isn’t really any strong correlation noticeable between how long you’ve been using it and how many followers you have. It must be what you do with Twitter that matters, rather than simply Being There.

Tweets

Tweets/followers

This doesn’t seem to show much, either. What might be helpful is to measure this against time…

Rate

Tweet rate/followers

When you measure the average number of tweets per day (since joining Twitter, and I’ve ignored a handful of rates over 300/day), a broad message comes across that you’re best off tweeting up to around 30 times a day – above that, and you risk putting people off. Again, this isn’t exactly surprising.
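The rate itself is just total tweets divided by days since joining. A small Python sketch, with hypothetical function names and the 300/day cut-off taken from the paragraph above:

```python
from datetime import date

MAX_RATE = 300  # rates above this are ignored, as in the graph

def tweets_per_day(tweet_count: int, joined: date, today: date) -> float:
    """Average tweets per day since joining (counting at least one day)."""
    days = max((today - joined).days, 1)   # avoid dividing by zero on day one
    return tweet_count / days

def keep_rate(rate: float) -> bool:
    """True if the rate is within the range plotted."""
    return rate <= MAX_RATE
```

An account that joined ten days ago and has written 300 tweets comes out at 30 a day, right on the edge of the sweet spot.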

So there aren’t really any profound observations here, sorry: the data seems to corroborate common sense.

In the third and final part of this series, next week, I’ll see if there are any correlations between tweeting style (as recorded by Twanalyst – number of retweets, posting of links, how much you reply to other people etc) and follower counts. Thanks for listening!

PS: I’m indebted to the UNIX BASH Scripting blog for an awk script that helped crunch this data.

Originally published at hatmandu.net. You can comment here or there.

 
 
16 December 2009 @ 04:00 pm
Over the last few days, various friends have kindly presented me with opportunities to see Depeche Mode live. Since I regard Depeche Mode as almost certainly my favourite band of all time, you'd think I'd have jumped at these opportunities. However, I haven't.

This is mainly because the Depeche Mode that I loved came to an end after Songs of Faith and Devotion. Alan Wilder left the band at that point, and I haven't really loved any of their albums since (with the possible exception of Playing the Angel, which I do rate pretty highly).

When I first watched the film 101 many years ago, the sight of Dave Gahan doing his rock god act in front of three guys playing synthesizers and hitting metal objects with hammers was truly momentous for me. That's the Depeche Mode I fell in love with: the dark, electronic, industrial pop band, complete as they were; not the band with 'real instruments' and extra musicians onstage as they are nowadays. It's a source of painful regret for me that I never saw Depeche Mode live around the time of Music For the Masses / Violator / Songs of Faith and Devotion.

(Another problem is that, when Depeche Mode come to London, they only seem to play at venues (the Millennium Dome, the Royal Albert Hall) which are problematic for people who suffer from vertigo, as I do.)

If Alan Wilder rejoined the band, and if they decided to play at Wembley Arena using only synthesizers (and metal objects with hammers, plus maybe the occasional bit of guitar or piano from Martin Gore), and if they weren't going to play songs from any albums since Songs of Faith and Devotion, then I would pay pretty much any amount of money to go and see them. But, sadly, I don't think that's likely to happen.
 
 
14 December 2009 @ 10:09 pm
Bah. Tchibo Online have a hat I want, and it's four quid, plus THIRTEEN POUNDS P&P.

Suffice it to say that I shall not be buying said hat (at least not online).
 
 
14 December 2009 @ 09:13 am
On Saturday myriad_freckles hosted a wine tasting dinner party. Her brother Arthur (of the not-very-recently-updated-but-very-informative wine blog) had decided on a selection of wines for us to try.

First was Cava from Codorníu. Cava is normally made with three grapes: Macabeo, Parellada and Xarello; but in this one, the Xarello was replaced by Chardonnay. It has small delicate bubbles, which shows it was made in bottles (like champagne) rather than in a vat. Flavours noted by the testers included Christmas pudding, dried fruit, apples, brioche and biscuit.

Then 2004 white Burgundy from Château de Chamirey on the Mercurey vineyard. This is an oaked Chardonnay, which gets its buttery taste from malolactic fermentation, the conversion of malic acid into lactic acid. To Rosy and me, this smelled of nail varnish remover at first. People thought it tasted of butter, pears, pear drops and lychee.

Then 2007 Chardonnay from Montes Alpha in Chile. We thought this smelled of pears and apples and tasted of lemons, vanilla, pineapple and passionfruit. New World wines are supposed to taste more of exotic fruits, and Old World of old-world fruits.

We moved on to red wine. The first was Merlot from Casa Lapostolle in Chile. Clare noted that this was very dark, meaning that the grapes had had a lot of sun. It might be called "Chilean gothic", we thought. The main flavours were vanilla, plum and chocolate.

This wine was oaked by American oak rather than French oak. American barrels are sawn, while French are split, so in the American barrels there is more pore contact with the wine, making the flavour more obvious. All red wines and some white wines (Chardonnay) are oaked. In cheaper bottles (under, say £6.50 ish) the oaked flavour will not come from being kept in a barrel, but instead from oak planks or (even worse) "tea bags" of oak chips being suspended in the vat. Barrels are expensive. (Here is even more about barrels.) Wine barrels can be reused, either by shaving the inside, or by selling them on for sherry or whisky or cider. We imagined a sort of barrel food-chain.

Next was 2005 Valpolicella Classico from Italy, which uses left-over grapes after the Amarone wine (made from partly-dried grapes) has been produced, which gives it a rich flavour. "Classico" means that it came from a vineyard higher up the valley where the soil is less damp. The flavour we got from this one was brown sugar, and we also noticed that it was very acidic and tannic, and makes you salivate. This is a characteristic of Italian wine, which tends to be designed to go well with food.

The third red wine we tried was a Syrah from Casa Marin in Chile. We thought it tasted full-bodied but understated and of milk, black pepper and mineral notes. Harry noted that there were purply bubbles when it was poured. Someone (we'd drunk a bit by now) said that it had a touch of evil. Arthur says this is an iconic style which is going to be big in the next five years.

And then there was Muriel Rioja Gran Reserva 1996 from Spain. We tasted blackcurrant and vanilla. Arthur noted that Rioja wines often have a liquorice flavour. This wine was oaked in American oak barrels, giving it the very rich vanilla taste. We also learned about the Phylloxera louse, which devastated the roots of European vines from about 1840. As a result, the vast majority of vines in Europe are grafted onto rootstock from the New World, which is resistant to the pest.

Our last red wine was 2002 Burgundy Ladoix Premier Cru Les Corvées. People commented on its light red colour, and the taste of strawberries. It is made from the Pinot Noir grape, which has characteristic tastes relating to decay, like mushrooms.

Then we had a pudding wine and unaccountably my notes stop before I have written down what we actually thought of it. I remember it was very tasty though. It was a Sauternes, which is a blend of three grape varieties: Muscadelle (not to be confused with Muscat or Muscadet), Sémillon and Sauvignon Blanc.
 
 
 
 
