Monday 1 July 2013

A SQL version of the Pleiades dataset



For those of you who use the Pleiades data in your applications I’ve created a .sql file for their newest data dump which you can find here.  I used the Pleiades data dump from June 27, 2013.  What’s nice about the Pleiades people is that when they say a file is comma separated it really is.  I downloaded this file and extracted it into the promised .csv and then imported it directly into Excel where I formatted it into a set of .sql inserts.  The columns seem to have changed somewhat from the previous versions; there are fewer name columns.  As a result I reformatted my Pleiades data table in SquinchPix’s database so that it now looks like this:

CREATE TABLE `PlacRefer` (
  `bbox` varchar(100) default NULL COMMENT 'bounding box string',
  `description` varchar(512) default NULL COMMENT 'Free-form text description',
  `id` varchar(100) default NULL COMMENT 'Pleiades ID number',
  `max_date` varchar(255) default NULL COMMENT 'last date',
  `min_date` varchar(100) default NULL COMMENT 'earliest date',
  `reportLat` float default NULL COMMENT 'Latitude',
  `reportLon` float default NULL COMMENT 'Longitude',
  `Era` varchar(24) default NULL COMMENT 'Character indicate the era',
  `place_name` varchar(100) default NULL COMMENT 'Name string for the place'
);

I formatted the bounding box as a comma-separated varchar.  I do this because the bounding box requires special treatment; it might be missing altogether or it might be less than four points so, if you’re working with it, just get it into a string and split the string on commas.  Then you’ll have an array of items that you can treat as floats.  I finally got it through my thick skull that the description line can be parsed into keywords so I’ll be using that more in the future.  The ‘id’ field is the regular Pleiades ID.  Is it my imagination or did the Pleiades people suddenly get a large dump of data from the Near East?  The number of items in the file is now 34,000+ and this looks like a big increase.  The max_date and min_date fields give the terminus ante quem and terminus post quem, respectively, for any human settlement of the place in question.  The reportLat and reportLon fields haven’t changed.  The ‘era’ field gives zero or more characters that indicate the period of existence of any site: ‘R’ for ‘Roman’, ‘H’ for ‘Hellenistic’, etc.  I included them because it might be handy for your chronological interpretation.  The ‘place_name’ field is the only name field in the current setup.

If this table layout is satisfactory for you then you can get all the Sql to create and populate the table with all the newest Pleiades data from Google Drive here. Be careful; this new .sql deletes the PlacRefer table first.

I modified my Regnum Francorum Online parser to use this renewed table. The relevant code looks like this:

 $place_no = $l5[0];  // $l5[0] is a fragment of the input record which contains the     // Pleiades ID.
 unset($lat);  // we test for unset later
 unset($lon);

 $querygeo = "select a.reportLat, a.reportLon from PlacRefer a where a.id = $place_no;";
 $resultgeo = mysql_query($querygeo);
 $rowgeo    = mysql_fetch_array($resultgeo);

 $lat  = $rowgeo[0];
 $lon  = $rowgeo[1];

This is how you’ll probably use it most of the time – using the Pleiades ID to retrieve the lat/lon pair.  I was pleasantly surprised at how much the data has improved.   I redid all the Regnum Francorum Online records with the new data and it looks a lot better.  So congratulations to the Pleiades guys!  Although they should double check the exact location of Nördlingen.  Here's how the first 500 Regnum Francorum Online records look on a map.

First 500 Regnum Francorum Online records displayed on SquinchPix using new Pleiades data.
A big improvement over the previous version which you can see here.

If you want to do this yourself from the original Pleiades data dump then be sure to convert (no parentheses in the following sequences) all double quote characters to (") , left single quote to (‘) and right single quote to (’).  The data has elaborate description fields which have been formatted with lots of portions quoted in various ways by various workers.  Also many place names in the Near East and many French names contain embedded single quotes that must be changed to (‘) or (’) or the equivalent.  If you need a guide go here.

Get this right first because if you’re not absolutely sure that you’ve got all the pesky quotes taken care of then the sql import won’t run.

But you can avoid all that hassle by just downloading my .sql file from Google Drive and importing it to your DB.  Have fun!

Robert Consoli
Cross-posted from Squinches.