
Cosmos is Microsoft's internal data storage/query system for analyzing enormous amounts (as in petabytes) of data. As the lead Program Manager for Cosmos I can't say too much about it, but what I can do is take a tour of the information that Microsoft has published about Cosmos. So read on if you are interested in the architecture Microsoft uses to store and query petabytes of data and what technical issues Microsoft's approach brings up.

While the Cosmos service is restricted to internal Microsoft use and therefore isn't typically discussed outside the company, over time Microsoft has authorized quite a bit of data to be published about Cosmos. For example, you could go to Microsoft's job site and do a search for "Cosmos". This might lead you to a job posting like this one. Or you might notice the reference in the job posting to something called Dryad (the job posting even includes a URL that, amongst other things, contains a link to a Google tech talk on Dryad). You might also do an Internet search and find out about a paper on a language called Scope that was accepted by VLDB '08 (which I can give you a sneak peek of here).

The story the previous data tells is that Cosmos is a platform for storing massive amounts of data across large numbers of machines and then distributing jobs onto the machines that store the data, so that the data can be processed in a fully distributed and efficient fashion. The Scope article in particular provides an architectural overview of the system and breaks Cosmos into three parts: Storage, Execution Management and Query Language.

The storage layer is append-only, so it's ideal for data such as logs (hmm… I hear those Cosmos guys are part of Search…). Files can be very large, so file contents are broken into multi-megabyte blocks called extents. Each extent is then copied onto a machine in a Cosmos cluster. So if someone wanted to read an entire 'file' they would have to touch a large number of machines in order to collect all the extents that make up that file. The Scope paper also points out that extents are replicated, so more than one machine will have a copy of a particular extent (no bonus for figuring out how many copies are typically kept around for each extent). The Scope paper also points out that the metadata about the files (e.g. which extents make up a file and where those extents are stored) is kept in a standalone cluster that uses a quorum protocol, so members of the metadata cluster can fail but the system will still function properly.

For me the interesting question about the storage layer is the metadata. Should there really be a central metadata repository? The answer depends on what one needs from the system. In the case of storing logs, losing data is typically just fine.
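
To make the extent idea a bit more concrete, here is a toy sketch in Python of an append-only store that cuts files into extents, replicates each extent onto several machines, and keeps a central metadata map of which extents make up a file and where they live. This is not how Cosmos is implemented. The class name `ToyCluster`, the 4-byte extent size, the replica count, and the round-robin placement are all assumptions made up for illustration, and the quorum protocol of the metadata cluster is not modeled at all.

```python
from collections import defaultdict

# Tiny extent size so the example is easy to follow; real extents are
# described as multi-megabyte. This value is an illustrative assumption.
EXTENT_SIZE = 4


class ToyCluster:
    """Hypothetical append-only store: files -> extents -> replicas on machines."""

    def __init__(self, machine_count, replicas):
        if replicas > machine_count:
            raise ValueError("need at least as many machines as replicas")
        # Each machine is modeled as a dict of extent_id -> bytes.
        self.machines = [dict() for _ in range(machine_count)]
        self.replicas = replicas
        # Central metadata: file name -> list of (extent_id, [machine indexes]).
        self.metadata = defaultdict(list)
        self._next_extent_id = 0

    def append(self, name, data):
        """Append data to a file, cutting it into extents and replicating each."""
        for start in range(0, len(data), EXTENT_SIZE):
            chunk = data[start:start + EXTENT_SIZE]
            extent_id = self._next_extent_id
            self._next_extent_id += 1
            # Round-robin placement is an assumption, not a documented policy.
            holders = [(extent_id + i) % len(self.machines)
                       for i in range(self.replicas)]
            for machine in holders:
                self.machines[machine][extent_id] = chunk
            self.metadata[name].append((extent_id, holders))

    def read(self, name, failed=frozenset()):
        """Reassemble a file by fetching each extent from any live replica."""
        parts = []
        for extent_id, holders in self.metadata[name]:
            live = [m for m in holders if m not in failed]
            if not live:
                raise IOError(f"extent {extent_id} has no live replica")
            parts.append(self.machines[live[0]][extent_id])
        return b"".join(parts)


cluster = ToyCluster(machine_count=4, replicas=2)
cluster.append("log", b"abcdefghij")
print(cluster.read("log"))              # whole file, gathered extent by extent
print(cluster.read("log", failed={1}))  # still readable with machine 1 down
```

Reading the file walks the metadata and touches several machines, which is the cost of spreading extents around. Replication is what lets the read succeed when a machine is down. Everything hinges on the metadata map, which is why the question of whether it should be centralized matters.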
