Saturday, 8 November 2008

3. Under your nose the whole time...

Metadata is usually stored in Tab Separated Value (tsv) format.

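Don't be worried by that - you've already been using it for years. You may never have known it, but MS Excel speaks tsv: it can open and save tab-delimited text files, and when you copy cells out of a spreadsheet they travel as tab-separated text.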

All tsv means is that the fields of information are separated by tabs. If you're in an Excel spreadsheet, and you press 'tab', you move across by one cell - the software knows that 'tab' means 'the next field'.
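
If you'd like to see that idea outside a spreadsheet, here is a minimal sketch in Python (my choice for illustration, nothing in this series depends on it). The row is made up: an id, an author and a title, with `\t` standing for a tab.

```python
# A hypothetical metadata row: id, author, title.
# "\t" is how Python writes a single tab character.
row = "1\tJane Austen\tPride and Prejudice"

# Splitting on the tab gives one item per field, just like
# moving across one cell at a time in a spreadsheet.
fields = row.split("\t")

print(fields)
```

Each tab marks the end of one field and the start of the next, so the split hands back three separate pieces.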

There are other ways to separate units of information: commas, semi-colons, colons, all sorts of things. The handy thing about tabs is that, unlike commas, semi-colons, and colons, they tend not to be used in natural language. You might have a book title like 'My life: the story of my life', but you'll never have 'My life (TAB) the story of my life'. That makes tsv very handy.
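
Here is a rough sketch of why that matters, again in Python. The author name is invented for the example; the title is the one above. The same two fields are joined once with a colon and once with a tab, then split apart again.

```python
# Two fields: a hypothetical author and the book title from the text.
author = "A. N. Author"
title = "My life: the story of my life"

# Join the fields with a colon, then with a tab.
colon_row = ":".join([author, title])
tab_row = "\t".join([author, title])

# Split them back into fields.
colon_fields = colon_row.split(":")
tab_fields = tab_row.split("\t")

print(len(colon_fields), colon_fields)
print(len(tab_fields), tab_fields)
```

The colon version comes back as three pieces, because the colon inside the title is mistaken for a separator. The tab version comes back as the two fields we started with.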

('Natural language' is any language that isn't a computer language. English, Spanish, Sign language... anything that people use to understand other people. Computers are terrible with natural language, which causes big headaches for the regex user. We'll cover that in more detail later.)

Look at this Excel spreadsheet:




...and this TextPad document:




The TextPad document looks unintelligible, but it isn't. It holds exactly the same data as the Excel spreadsheet. In fact, all I did was copy & paste between the two - there are no changes at all. This means two things:

1. The tsv format is easy
2. If you ever get lost in TextPad, you can pop your data back into Excel to see what's going on.
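
The same round trip can be sketched in code. This assumes Python and its standard `csv` module, and the three rows are invented stand-ins for pasted spreadsheet text, not the data from the screenshots.

```python
import csv
import io

# Hypothetical text as it might look when pasted from a spreadsheet:
# tabs between cells, a newline at the end of each row.
pasted = (
    "id\tauthor\ttitle\n"
    "1\tjane austen\tPRIDE AND PREJUDICE\n"
    "2\tHerman Melville\tmoby dick\n"
)

# Tell the csv reader that the delimiter is a tab, not a comma.
rows = list(csv.reader(io.StringIO(pasted), delimiter="\t"))

# Print each row with a visible divider so the columns are easy to see.
for row in rows:
    print(" | ".join(row))
```

Nothing is converted along the way: the rows and columns were in the plain text all along, and the reader only has to be told where the cell boundaries are.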

(Sharp eyes will have spotted that this is a terrible, terrible piece of metadata. All I've done is pull some books off my bookshelf and pop them in a list. It's not standardised, the first name and surname are in the same column, the capitals are all over the shop, and my id column is based on the order I grabbed them. There's a reason for this - we're going to fix it.)

