## Blogs

### European Commission declares itself an Honest broker in future global negotiations on Internet Governance

For more than a decade there has been active resistance in some quarters to the continuing custody by the U.S. of the root domain registries of the Internet. Those directories (which control the routing of Internet traffic into and out of nations) are administered by ICANN, which in turn exi...

Categories: Blogs

### What I’m Reading on 02/11/2014

Impatience Has Its Reward: Books Are Rolled Out Faster – NYTimes.com “While the television industry has begun catering to impatient audiences by releasing entire series at once, the book business is upending its traditional timetable by encouraging a kind of …
Categories: Blogs

### What I’m Reading on 02/10/2014

Bruce Springsteen and the E Street Band Announce 2014 U.S. Tour | Music News | Rolling Stone “Bruce Springsteen and the E Street Band have announced dates for their spring 2014 U.S. tour. It kicks off April 8th at Cincinnati, …
Categories: Blogs

### Eyes & Ears

This week I have picked out two very nice mixes and one nice video of an installation by two artists, Romain Tardy and someone hiding behind the pseudonym “Squeaky Lobster”. It’s called The Ark, and it’s a beautiful combination of lights, water reflections and structures. Let’s start with these guys:

THE ARK
A site specific installation by Romain Tardy and Squeaky Lobster
Proyecta Oaxaca, Ethnobotanical garden of Oaxaca, Mexico

Café del Mar Summer mix 2013 by Toni Simonen

I’ve written here that I felt José Padilla and a couple others were the only ones who were able to capture the essence of Balearic music. It does not mean others do a bad job, and Toni Simonen is definitely up to something in this mix.

Manuel Göttsching – E2-E4 (part 1)

Now this is a true rarity. It was released in the eighties, and listening to it unveils the very soul and underlying characteristics of Balearic music. It’s all in there, in its most abstract yet crudest form. Only AMCA (A Man Called Adam) comes close to this level.

That’s all for now, thanks for listening folks!

Categories: Blogs

### What I’m Reading on 02/07/2014

Why Makers Fail At Retail | TechCrunch “When you get traction with software, you fire up new servers and scale your infrastructure. It is fast and cheap. And traction alone can get you funding. With hardware, traction is sales, or …
Categories: Blogs

### The Age of Commonalities has Arrived

Ten years ago this month I wrote an iss...

Categories: Blogs

### The Words Democrats and Republicans Use

An Antic Disposition - Fri, 2014-02-07 15:06

It came to me after listening to the State of the Union Address: can we tell whether a speech was given by a Democratic or a Republican President, purely based on metrics related to the words used? It makes sense that we could. After all, we can analyze emails and detect spam that way. Automatic text classification is a well-known problem. On the other hand, presidential speeches go back quite a bit. Does a speech by a Democrat in 2014 have anything in common with one from 1950? Only one way to find out…

I decided to limit myself to State of the Union (SOTU) addresses, since they are readily available, and only to those given after WW II. There has been a significant shift in American politics since WW II, so it made sense, for continuity, to look at Truman and later. If I had included all of Roosevelt’s twelve (!) SOTU speeches it might have distorted the results, giving undue weight to individual stylistic factors. So I grabbed the 71 post-WW II addresses and stuck them into a directory. I included only the annual addresses, not any exceptional ones, like G.W. Bush’s special SOTU in September 2001.

I then used R’s text mining package, tm, to load the files into a corpus, tokenize, remove punctuation, stop words, etc.  I then created a document-term matrix and removed any terms that occurred in fewer than half of the speeches.  This left me with counts of 610 terms in 71 documents.
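The original work was done with R’s tm package, and that script is not reproduced here. As a rough illustration of the same preprocessing idea, here is a minimal Python sketch. The tiny stop-word list, the function names and the three sample sentences are all made up for the example; only the "drop terms found in fewer than half of the documents" rule comes from the description above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "we", "our"}

def tokenize(text):
    """Lowercase the text, keep alphabetic words only, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def document_term_matrix(docs, min_doc_fraction=0.5):
    """Return (terms, rows), keeping terms found in at least min_doc_fraction of docs."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    threshold = min_doc_fraction * len(docs)
    terms = sorted(t for t, df in doc_freq.items() if df >= threshold)
    rows = [[c[t] for t in terms] for c in counts]
    return terms, rows

# Invented sample "speeches", purely for illustration.
docs = [
    "We will cut taxes and create jobs.",
    "Jobs and health care are our priority.",
    "Taxes fund our schools; jobs build our future.",
]
terms, rows = document_term_matrix(docs)
print(terms)
print(rows)
```

Each row is one document and each column one surviving term, which is the shape the later per-party counts are built from.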

Then came the fun part. I decided to use Pointwise Mutual Information (PMI), a measure of association used in information retrieval, to look at the association between terms in the speeches and party affiliation. PMI shows the degree of association (or “co-location”) of two terms while also accounting for the prevalence of each term individually. Wikipedia gives the formula, which is pretty much what you would expect: calculate the log probability of the co-location and subtract out the log probability of the background rate of the term. But instead of looking at the co-occurrence of two terms, I tried looking at the co-occurrence of terms with the party affiliation. For example, the PMI of “taxes” with the class Democrat would be: log p(“taxes”|Democrat) – log p(“taxes”). You can see my full script for the gory details.
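To make the formula concrete, here is a small Python sketch of PMI between a term and a party. It is not the script mentioned above. It assumes the probabilities are estimated from raw token counts pooled per party, and the counts themselves are invented toy numbers.

```python
import math

def pmi(term_counts_by_class, term, cls):
    """PMI(term, cls) = log p(term | cls) - log p(term), estimated from token counts.

    A term that never occurs in cls gives log(0) and raises ValueError;
    a real analysis would need smoothing or filtering for that case.
    """
    class_counts = term_counts_by_class[cls]
    p_term_given_cls = class_counts.get(term, 0) / sum(class_counts.values())
    total_term = sum(c.get(term, 0) for c in term_counts_by_class.values())
    total_all = sum(sum(c.values()) for c in term_counts_by_class.values())
    p_term = total_term / total_all
    return math.log(p_term_given_cls) - math.log(p_term)

# Invented toy counts, not taken from the speeches.
counts = {
    "Democrat": {"taxes": 30, "jobs": 70},
    "Republican": {"taxes": 10, "jobs": 90},
}
print(pmi(counts, "taxes", "Democrat"))    # positive: over-represented in this class
print(pmi(counts, "taxes", "Republican"))  # negative: under-represented in this class
```

A positive value means the term shows up more often in that party’s speeches than its overall background rate would suggest, and a negative value means less often.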

Here’s what I got, listing the 25 highest PMI terms for Democrats and Republicans:

So what does this all mean?  First note the difference in scale.  The top Republican terms had higher PMI than the top Democrat terms.  In some sense it is a political Rorschach test.  You’ll see what you want to see.  But in fairness to both parties I think this does accurately reflect their traditional priorities.

From the analytic standpoint, the interesting thing I notice is how this compares to other approaches, like using classification trees. For example, if I train a recursive partitioning classification tree on the original data, using rpart, I can classify the speeches with 86% accuracy by looking at the occurrences of only two terms:

Not a lot of insight there. It essentially latched on to background noise and two semantically useless words.   So I prefer the PMI-based results since they appear to have more semantic weight.

Next steps: I’d like to apply this approach back to speeches from 1860 through 1945.

Categories: Blogs

### A tale of two Aprils in the Adirondacks

This has been a long hard winter in the northeastern United States, and I don’t think we’re done yet. Earlier this week New York and Philadelphia got several inches of snow topped by ice. More upstate in New York where …
Categories: Blogs

### What I’m Reading on 02/06/2014

The Standards Wars and the Sausage Factory “Today no one dies from standard wars, not that you’d know it from Internet comments. But years, millions of dollars, and endless arguments are spent arguing about standards. The reason for our fights …
Categories: Blogs

### What I’m Reading on 02/05/2014

Computers May Someday Beat Chefs At Creating Flavors We Crave : The Salt : NPR “Computer scientists at IBM have already built a computer that can beat human contestants on the TV quiz show, “Jeopardy.” Now it appears they’re sharpening …
Categories: Blogs

### What I’m Reading on 02/04/2014

Microsoft Names Engineering Executive as New Chief – NYTimes.com “Microsoft on Tuesday announced that Satya Nadella would be its next leader, betting on a longtime engineering executive to help the company keep better pace with changes in technology.”
Categories: Blogs

### Adventures in Self-Publishing: Establishing a Web Presence (Part II)

In the last post, we talked about the different types of Web sites you can crea...

Categories: Blogs

### Why LibreOffice 4.2 matters more than you think

On Thursday the Document Foundation released its newest stable branch, LibreOffice 4.2. Don’t be misled by its number; if we weren’t on a strict time-based release schedule with a clear numbering scheme and no nickname for each release, I would have called this one the 5.0. Yes, you read that right, the mighty Five. Why? Mostly for two big reasons.

This is a major code overhaul

Do you remember one of my first posts about LibreOffice, at the end of 2010? I had hinted that one of our goals was to develop a brand new engine for Calc, which had stayed pretty much the same since 1998. Well, 4.2 delivers just that: Ixion has been integrated as the Calc engine, and that, along with other things such as real-time integration of data feeds, is about to change a lot, and not just in terms of performance (an improvement of over 30%, depending on the case). This might actually open the door to brand new types of users, in professional and scientific settings for instance.

Alongside this rewrite, we have also done major work on the user interface layout and dialogs. As Michael Meeks explains, we introduced this rewrite with the 4.0, but now a great many of our dialogs and widgets have been rewritten. Other user interface improvements, such as a brand new icon set and document snapshots on Windows, bring a fresh and refined user experience to LibreOffice.

We have heard this song here and there. You cannot be innovative and be successful as an enterprise solution. You cannot be the right choice for companies if you don’t have a major American corporation as your main sponsor/steward/overlord/friend. You cannot deliver a professional-grade office suite if you work with a time-based release system. I think that these theories have already been proven wrong, unless you have a twisted definition of what the enterprise market needs. But with the 4.2, we also have some nice and immediately actionable features that will appeal specifically to the enterprise market:

• Integration of the CMIS stack, allowing you to connect to document repositories on SharePoint, Nuxeo, Tibco, Alfresco, Google Drive and many other CMS.
• Expert configuration options now all put in one place
• Better group policy controls for deployment and installed user base
• Improved Microsoft Office (.docx, etc.) and RTF document filters
• Improved look and feel on Windows
• Change tracking on ODF and even on OOXML documents

I’m leaving out a good dozen other important improvements, but here’s the complete list.

And now, I’m going to explain why LibreOffice 4.2 really matters more than meets the eye. The amount of code clean-up, refactoring and rewriting, the inclusion of new features and the continued growth in contributors between the release of LibreOffice 4.0 and 4.2 is truly amazing. The 4.0 was a major accomplishment, but this time we did even more, seemingly with less effort (although this comment does not diminish anyone’s accomplishments for this release).

What’s going on here? A giant among Free and Open Source projects is emerging, and we are witnessing it unfold right before our eyes: a growing development powerhouse, increased funding, an effective structure, overworked but growing ranks of contributors, an increased presence at events worldwide, improved processes for localization and quality assurance… I guess many observers, as well as several insiders, thought that once we had set up the Document Foundation as a structure and released the 4.0, things would take a course and a pace of their own. That hasn’t happened. On the contrary, the word around the project was “Up!” and that has not changed since. Another possible reason (with some hindsight, I start to see it more clearly) is that once the founders got the structure, the governance and the main processes going, priorities started to change for the better: discussions became more about resources, funding and sustainability, but minds were freed from worrying about the next day and were able to focus on developing something great.

I realize I’m painting a very nice picture. I know that the road won’t be short and it won’t be easy either, but judging by what this community has already overcome, I am confident the Document Foundation is going to push the envelope on many levels in the years to come. I am truly proud of what we have accomplished so far and I would like to thank everyone who made this release possible. Happy FOSDEM!

Categories: Blogs

### Monthly disclaimer

The postings on this site are my own and don’t necessarily represent my employer’s positions, strategies or opinions. Blog entries before 2010 are in my Archived Blog. © Robert S. Sutor for Bob Sutor, 2014. All rights reserved.
Categories: Blogs

### What I’m Reading on 01/30/2014

Hal Varian and the “New” Predictive Techniques | Business Forecasting “Big Data: New Tricks for Econometrics is, for my money, one of the best discussions of techniques like classification and regression trees, random forests, and penalized regression (such as lasso, …
Categories: Blogs

### UK Cabinet Office Signals Move Towards Open Source Office Suites

It was ten years ago that the CIO of Massachusetts rattled the desktop world by announcing that the Executive Agencies of the Commonwealth would henceforth lice...

Categories: Blogs

### Eyes & Ears – February 2014

This month’s Eyes and Ears will not be about books, only about good music and videos.

• Hanoi: 17 years ago, I made a student trip to Vietnam that lasted a month and a half.

Hanoi from Matt Devir on Vimeo.

• La Grande Dune, by Lemongrass on its Papillon release, one of my favourites so far.

• Blank & Jones: Don’t let me pass you by, a refined deep house track that shows the variety of genres this duo can come up with. A real massage for the ears!

Enjoy!

Categories: Blogs

### What I’m Reading on 01/28/2014

The RedMonk Programming Language Rankings: January 2014 – tecosystems “The aspect of these rankings that most interests us is the trajectories they may record: which languages are trending up? Which are in decline? Given that and the adoption curve for …
Categories: Blogs

### What I’m Reading on 01/27/2014

Android’s Favorite Keyboard Could Be Coming to iOS “SwiftKey, Android’s most popular keyboard app, could soon be coming to iOS in the form of a note-taking app.”
Categories: Blogs

### First Move Advantage in Chess

An Antic Disposition - Mon, 2014-01-27 14:57

### The Elo Rating System

Competitive chess players, from the amateur club level all the way up to the top grandmasters, receive ratings based on their performance in games. The ratings formula in use since 1960 is based on a model first proposed by the Hungarian-American physicist Arpad Elo. It uses a logistic equation to estimate a player’s expected score (roughly, the probability of winning) as a function of that player’s rating advantage over the opponent:

$E = \frac 1 {1 + 10^{-\Delta R/400}}$

So for example, if you play an opponent who out-rates you by 200 points then your chances of winning are only 24%.
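As a quick check of that figure, here is the formula above as a few lines of Python. The function name is my own choice; the arithmetic is just the equation as written, with the rating difference taken from the player’s own point of view.

```python
def expected_score(rating_diff):
    """Elo expected score for a player whose rating advantage is rating_diff."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# Facing an opponent rated 200 points higher means a rating advantage of -200.
print(round(expected_score(-200), 2))  # 0.24
print(expected_score(0))               # 0.5 for evenly rated players
```

Note that the two players’ expected scores always sum to 1, so the 200-point favorite is expected to score about 76%.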

After each tournament, game results are fed back to a national or international rating agency and the ratings are adjusted. If you scored better than expected against the level of opposition you played, your rating goes up. If you did worse, it goes down. Winning against an opponent much weaker than you will lift your rating only a little. Defeating a higher-rated opponent will raise your rating more.
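The adjustment can be sketched with the textbook Elo update, new rating = old rating + K × (actual score − expected score). The K-factor of 20 below is an assumption for illustration (agencies use different values), and the update is applied per game here rather than per tournament to keep the sketch short.

```python
def expected_score(rating_diff):
    """Elo expected score for a player whose rating advantage is rating_diff."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def updated_rating(rating, opponent_rating, score, k=20):
    """Textbook Elo update; k=20 is an illustrative assumption, not an agency rule."""
    expected = expected_score(rating - opponent_rating)
    return rating + k * (score - expected)

# A 2000-rated player wins one game against a stronger and a weaker opponent.
gain_vs_stronger = updated_rating(2000, 2200, 1.0) - 2000
gain_vs_weaker = updated_rating(2000, 1800, 1.0) - 2000
print(gain_vs_stronger, gain_vs_weaker)
```

The win over the higher-rated opponent earns more points because the expected score going in was lower, which is exactly the behavior described above.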

Those are the basics of the Elo rating system, in its pure form. In practice it is slightly modified, with ratings floors, bootstrapping of new unrated players, etc. But that is its essence.

### Measuring the First Mover Advantage

It has long been known that the player that moves first, conventionally called “white”, has a slight advantage, due to their ability to develop their pieces faster and their greater ability to coax the opening phase of the game toward a system that they prefer.

So how can we show this advantage using a lot of data?

I started with a Chessbase database of 1,687,282 chess games, played from 2000 to 2013. All games involved players rated at least 2000 (a good club player). I excluded all computer games. I also excluded 0 or 1 move games, which usually indicate a default (a player not showing up for an assigned game) or a bye. I exported the games to PGN format and extracted the metadata for each game to a CSV file via a Python script. Additional processing was then done in R.
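The extraction script itself is not shown in this post. As a rough idea of what such a step can look like, here is a minimal sketch that pulls the standard PGN tag pairs (lines such as `[WhiteElo "2310"]`) out of a PGN stream and writes a few of them to CSV. The choice of columns, the function names and the two sample games are my own assumptions.

```python
import csv
import io
import re

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]$')
FIELDS = ["White", "Black", "WhiteElo", "BlackElo", "Result"]  # assumed columns

def pgn_headers(lines):
    """Yield one dict of tag pairs per game from an iterable of PGN lines."""
    tags, in_moves = {}, False
    for line in lines:
        match = TAG_RE.match(line.strip())
        if match:
            name, value = match.groups()
            # A tag after movetext, or a repeated tag, starts a new game.
            if tags and (in_moves or name in tags):
                yield tags
                tags, in_moves = {}, False
            tags[name] = value
        elif line.strip():
            in_moves = True  # any other non-blank line is movetext
    if tags:
        yield tags

def write_csv(games, out):
    """Write the selected header fields of each game as one CSV row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for game in games:
        writer.writerow({f: game.get(f, "") for f in FIELDS})

# Two invented games, for illustration only.
sample = '''[Event "Club Ch"]
[White "A"]
[Black "B"]
[Result "1-0"]
[WhiteElo "2310"]
[BlackElo "2295"]

1. e4 e5 2. Nf3 1-0

[Event "Open"]
[White "C"]
[Black "D"]
[Result "1/2-1/2"]
[WhiteElo "2250"]
[BlackElo "2400"]

1. d4 d5 1/2-1/2
'''
games = list(pgn_headers(sample.splitlines()))
buffer = io.StringIO()
write_csv(games, buffer)
print(buffer.getvalue())
```

Reading only the headers keeps the pass over a large database cheap, since the move text is never parsed.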

Looking at the distribution of ratings differences (white Elo-black Elo) we get this.  Two oddities to note.  First note the excess of games with a ratings difference of exactly zero.  I’m not sure what caused that, but since only 0.3% of games had this property, I ignored it.   Also there is clearly a “fringe” of excess counts for ratings that are exactly multiples of 5.  This suggests some quantization effect in some of the ratings, but should not harm the following analysis.

The collection has results of:

• 1-0 (36.4%)
• 1/2-1/2 (35.5%)
• 0-1 (28.1%)

So the overall score, from white’s perspective, was 54.2% (counting a win as 1 point and a draw as 0.5 points).
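The arithmetic is easy to reproduce from the three percentages listed above. Working from those rounded shares gives 54.15%, in line with the 54.2% quoted, which was presumably computed from the unrounded counts.

```python
# Result shares as listed above, from white's perspective.
results = {"1-0": 0.364, "1/2-1/2": 0.355, "0-1": 0.281}
# Points white earns for each result.
points = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}

white_score = sum(share * points[result] for result, share in results.items())
print(white_score)
```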

So white has a 4.2% first move advantage, yes? Not so fast. A look at the average ratings in the games shows:

• mean white Elo: 2312
• mean black Elo: 2309

So on average white was slightly higher rated than black in these games. A t-test indicated that the difference in means was significant at the 95% confidence level. So we’ll need to do some more work to tease out the actual advantage for white.

### Looking for a Performance Advantage

I took the data and binned it by ratings difference, from -400 to 400, and for each difference I calculated the expected score, per the Elo formula, and the average actual score in games played with that ratings difference. The following chart shows the actual scores as black circles and the predicted score as a red line. Again, this is from white’s perspective. Clearly the actual score is above the expected score for most of the range. In fact white appears evenly matched even when playing against an opponent rated 35 points higher.

The trend is a bit clearer if we look at the “excess score”, the amount by which white’s results exceed the expected results. In the following chart the average excess score is indicated by a dotted line at y=0.034. So the average performance advantage for white, accounting for the strength of opposition, was around 3.4%. But note how the advantage is strongest where white is playing a slightly stronger player.
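The binning and excess-score calculation was done in R and is not reproduced here, but the idea fits in a short Python sketch. The bin width of 10 points, the function names and the six sample games are assumptions made for illustration; the expected score is the Elo formula from earlier in the post.

```python
import math
from collections import defaultdict

def expected_score(rating_diff):
    """Elo expected score for a player whose rating advantage is rating_diff."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def score_by_rating_diff(games, bin_width=10, max_diff=400):
    """Map each rating-difference bin to (actual, expected, excess) for white.

    games is an iterable of (white_elo, black_elo, white_points) tuples,
    where white_points is 1.0, 0.5 or 0.0. bin_width is an assumed value.
    """
    totals = defaultdict(lambda: [0.0, 0])  # bin center -> [points, game count]
    for white_elo, black_elo, white_points in games:
        diff = white_elo - black_elo
        if abs(diff) > max_diff:
            continue
        center = math.floor(diff / bin_width + 0.5) * bin_width
        totals[center][0] += white_points
        totals[center][1] += 1
    table = {}
    for center in sorted(totals):
        total_points, n = totals[center]
        actual = total_points / n
        expected = expected_score(center)
        table[center] = (actual, expected, actual - expected)
    return table

# Invented games, for illustration only.
games = [
    (2300, 2300, 1.0), (2300, 2300, 0.5), (2300, 2300, 0.0), (2300, 2300, 0.5),
    (2200, 2400, 0.0), (2200, 2400, 0.5),
]
table = score_by_rating_diff(games)
for center, (actual, expected, excess) in table.items():
    print(center, round(actual, 3), round(expected, 3), round(excess, 3))
```

Averaging the excess column over all bins gives a single number comparable to the 3.4% figure above, though whether to weight each bin by its game count is a choice this sketch leaves open.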

Finally I looked at the actual game results, the distribution of wins, draws and losses, by ratings differences.  The Elo formula doesn’t speak to this.  It deals with expected scores.  But in the real world one cannot score 0.8 in a game.   There are only three options:  win, draw or lose.  In this chart you see the first mover advantage in another way.  The entire range of outcomes is essentially shifted over to the left by 35 points.

Categories: Blogs