In this post we will use curl to build a geospatial index with Riak Search 2 (also known as Yokozuna), which is backed by Solr.
I’ll be using Riak Pre11 for this.
In etc/riak.conf, change search to on (it was on line 411 for me):
search = on
Make sure your ulimit is 4096 or greater:
ulimit -n 4096
Then start Riak. I’m using console myself, but start would also work.
./bin/riak console
Let’s create a schema so that Solr indexes our data correctly (file available here):
<?xml version="1.0" encoding="UTF-8"?>
<schema name="geotest" version="1.5">
  <uniqueKey>_yz_id</uniqueKey>
  <fields>
    <field name="name" type="string" indexed="true" stored="true"/>
    <field name="loc" type="location_rpt" indexed="true" stored="true"/>
    <!-- Begin Yokozuna Fields -->
    <field name="_yz_id" type="_yz_str" indexed="true" stored="true" required="true"/>
    <field name="text" type="text_ws" indexed="true" stored="false" multiValued="true"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <!-- Entropy Data: Data related to anti-entropy -->
    <field name="_yz_ed" type="_yz_str" indexed="true" stored="false"/>
    <!-- Partition Number: Used as a filter query param -->
    <field name="_yz_pn" type="_yz_str" indexed="true" stored="false"/>
    <!-- First Partition Number: The first partition in this doc's
         preflist, used for further filtering on overlapping partitions. -->
    <field name="_yz_fpn" type="_yz_str" indexed="true" stored="false"/>
    <!-- If there is a sibling, use vtag to differentiate them -->
    <field name="_yz_vtag" type="_yz_str" indexed="true" stored="false"/>
    <!-- Node: The name of the node that this doc was created on. -->
    <field name="_yz_node" type="_yz_str" indexed="true" stored="false"/>
    <field name="_yz_rt" type="_yz_str" indexed="true" stored="true"/>
    <!-- Riak Bucket: The bucket of the Riak object this doc corresponds to. -->
    <field name="_yz_rb" type="_yz_str" indexed="true" stored="true"/>
    <!-- Riak Key: The key of the Riak object this doc corresponds to. -->
    <field name="_yz_rk" type="_yz_str" indexed="true" stored="true"/>
    <!-- Node: Stores a flag if this doc is the product of a failed object extraction -->
    <field name="_yz_err" type="_yz_str" indexed="true" stored="false"/>
  </fields>
  <types>
    <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
               spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
               distErrPct="0.025"
               maxDistErr="0.000009"
               units="degrees"/>
    <!-- since fields of this type are by default not stored or indexed,
         any data added to them will be ignored outright. -->
    <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField"/>
    <!-- YZ String: Used for non-analyzed fields -->
    <fieldType name="_yz_str" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>
    <!--
      Numeric field types that index each value at various levels of precision
      to accelerate range queries when the number of values between the range
      endpoints is large. See the javadoc for NumericRangeQuery for internal
      implementation details.
      Smaller precisionStep values (specified in bits) will lead to more tokens
      indexed per value, slightly larger index size, and faster range queries.
      A precisionStep of 0 disables indexing at different precision levels.
    -->
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <!-- A Trie based date field for faster date range queries and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
    <!-- Binary data type. The data should be sent/retrieved as Base64 encoded Strings -->
    <fieldtype name="binary" class="solr.BinaryField"/>
    <fieldType name="random" class="solr.RandomSortField" indexed="true"/>
    <!-- A text field that only splits on whitespace for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <!-- A general text field that has reasonable, generic
         cross-language defaults: it tokenizes with StandardTokenizer,
         removes stop words from case-insensitive "stopwords.txt"
         (empty by default), and down cases. At query time only, it
         also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
The important parts here are the schema name, geotest:
<schema name="geotest" version="1.5">
our two fields, name and loc:
<field name="name" type="string" indexed="true" stored="true"/>
<field name="loc" type="location_rpt" indexed="true" stored="true"/>
and our fieldType, location_rpt:
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
           distErrPct="0.025"
           maxDistErr="0.000009"
           units="degrees"/>
Pretty much everything else is boilerplate so that the Riak Solr integration works.
We can now upload this schema (I saved the schema above as schema.xml in my current directory):
curl -i -XPUT http://localhost:8098/search/schema/geotest \
  -H 'content-type: application/xml' \
  --data-binary @schema.xml
…and we create an index named “my_geo_index” which uses the schema (name = “geotest”) we just uploaded.
curl -i -XPUT http://localhost:8098/search/index/my_geo_index \
  -H 'content-type: application/json' \
  -d '{"schema":"geotest"}'
They should both return 204 responses.
Next we’ll create a bucket type named “geo_type” using the riak-admin command. Our bucket type won’t have any special properties; it just needs to exist.
./bin/riak-admin bucket-type create geo_type '{"props":{}}'
We also need to activate our new bucket type:
./bin/riak-admin bucket-type activate geo_type
We will now create a bucket named “stuff” under the geo_type bucket type. In addition, this command associates the Solr index my_geo_index with the bucket stuff:
curl -XPUT 'http://localhost:8098/types/geo_type/buckets/stuff/props' \
  -H 'content-type: application/json' \
  -d '{"props":{"search_index":"my_geo_index"}}'
That’s it. Let’s index some data!
curl -i -H 'content-type: application/json' -X PUT 'http://localhost:8098/types/geo_type/buckets/stuff/keys/sf' -d '{"name":"San Francisco", "loc":"37.774929,-122.419416"}'
curl -i -H 'content-type: application/json' -X PUT 'http://localhost:8098/types/geo_type/buckets/stuff/keys/sj' -d '{"name":"San Jose", "loc":"37.339386,-121.894955"}'
curl -i -H 'content-type: application/json' -X PUT 'http://localhost:8098/types/geo_type/buckets/stuff/keys/mv' -d '{"name":"Mountain View", "loc":"37.386052,-122.083851"}'
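If you would rather script the inserts than paste curl lines, here is a rough Python sketch of the same three PUTs. It only builds the requests; the helper name build_put and the places list are my own inventions for illustration, and the host, port, bucket type and bucket are simply the ones used above. Note that loc is a single "lat,lon" string, latitude first.

```python
import json
from urllib.request import Request

BASE = "http://localhost:8098"  # same host/port as the curl examples


def build_put(key, name, lat, lon, bucket_type="geo_type", bucket="stuff"):
    """Build (but do not send) the PUT request for one document.

    build_put is a hypothetical helper, not part of any Riak client.
    """
    # loc is stored as one "lat,lon" string, latitude first.
    body = json.dumps({"name": name, "loc": f"{lat},{lon}"}).encode("utf-8")
    url = f"{BASE}/types/{bucket_type}/buckets/{bucket}/keys/{key}"
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


places = [
    ("sf", "San Francisco", 37.774929, -122.419416),
    ("sj", "San Jose", 37.339386, -121.894955),
    ("mv", "Mountain View", 37.386052, -122.083851),
]
requests_to_send = [build_put(*place) for place in places]
```

With a node running, each request could be sent with urllib.request.urlopen(req).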
Now for the fun part. Let’s find all of our data, scored and sorted by distance. The score will return a distance (in degrees). We are querying from a location in Palo Alto, California, so we should see fairly small distances to Mountain View, San Jose and San Francisco.
curl 'http://localhost:8098/search/my_geo_index?&fl=*,score&sort=score%20asc&q={!geofilt%20score=distance%20filter=false%20sfield=loc%20pt=37.441883,-122.143019%20d=10}&wt=json'
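The query string is easier to read when it is assembled from its parts. The Python sketch below builds the same URL and percent-encodes the whole {!geofilt ...} expression, so nothing depends on how a particular HTTP client treats literal braces and commas. The function name geo_query_url and its parameters are hypothetical; the path and query parameters are copied from the curl command above.

```python
from urllib.parse import urlencode, quote


def geo_query_url(lat, lon, d, index="my_geo_index", field="loc",
                  base="http://localhost:8098"):
    """Return the search URL for a distance-scored geofilt query.

    geo_query_url is a hypothetical helper for illustration only.
    """
    # Solr local-params syntax: score by distance, do not filter on d.
    q = (f"{{!geofilt score=distance filter=false "
         f"sfield={field} pt={lat},{lon} d={d}}}")
    params = {"fl": "*,score", "sort": "score asc", "q": q, "wt": "json"}
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return f"{base}/search/{index}?" + urlencode(params, quote_via=quote)


url = geo_query_url(37.441883, -122.143019, 10)
print(url)
```

The pt value is the query point (Palo Alto here), and sort=score asc puts the nearest documents first because the score is the distance.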
The query returns the results:
{
  "responseHeader": {
    "status": 0,
    "QTime": 24,
    "params": {
      "shards": "127.0.0.1:8093/solr/my_geo_index",
      "sort": "score asc",
      "fl": "*,score",
      "q": "{!geofilt score=distance filter=false sfield=loc pt=37.441883,-122.143019 d=10}",
      "127.0.0.1:8093": "_yz_pn:64 OR (_yz_pn:61 AND (_yz_fpn:61)) OR _yz_pn:60 OR _yz_pn:57 OR _yz_pn:54 OR _yz_pn:51 OR _yz_pn:48 OR _yz_pn:45 OR _yz_pn:42 OR _yz_pn:39 OR _yz_pn:36 OR _yz_pn:33 OR _yz_pn:30 OR _yz_pn:27 OR _yz_pn:24 OR _yz_pn:21 OR _yz_pn:18 OR _yz_pn:15 OR _yz_pn:12 OR _yz_pn:9 OR _yz_pn:6 OR _yz_pn:3",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 4,
    "start": 0,
    "maxScore": 0.39857662,
    "docs": [
      {"loc": "37.386052,-122.083851", "name": "Mountain View", "_yz_id": "geo_type_stuff_mv_42", "_yz_rk": "mv", "_yz_rt": "geo_type", "_yz_rb": "stuff", "score": 0.072977245},
      {"loc": "37.339386,-121.894955", "name": "San Jose", "_yz_id": "geo_type_stuff_sj_21", "_yz_rk": "sj", "_yz_rt": "geo_type", "_yz_rb": "stuff", "score": 0.2221485},
      {"loc": "37.774929,-122.419416", "name": "San Francisco", "_yz_id": "geo_type_stuff_sf_60_MeNonMcIRG8mmdJpk5KfM", "_yz_rk": "sf", "_yz_rt": "geo_type", "_yz_rb": "stuff", "score": 0.39857662},
      {"loc": "37.774929,-122.419416", "name": "San Francisco", "_yz_id": "geo_type_stuff_sf_60_5n4gT2r1Gt2qlHwfKCwypL", "_yz_rk": "sf", "_yz_rt": "geo_type", "_yz_rb": "stuff", "score": 0.39857662}
    ]
  }
}
You’ll notice that there are two San Franciscos. This is because I inserted the data into Riak twice (while I was writing this post) without supplying a vector clock the second time, which created siblings. This is easily fixed by resolving the siblings, as mentioned here.
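Until the siblings are resolved, a client can at least hide the duplicates when displaying results. The Python sketch below keeps the first document for each (bucket type, bucket, key) triple, using the _yz_rt, _yz_rb and _yz_rk fields from the response. This is display-side filtering only and an assumption on my part about how you might want to present results; it does not resolve the siblings in Riak. The dedupe_by_key name is hypothetical, and the docs list is trimmed to the fields the function needs.

```python
def dedupe_by_key(docs):
    """Keep the first doc seen for each (bucket type, bucket, key) triple."""
    seen = set()
    unique = []
    for doc in docs:
        ident = (doc["_yz_rt"], doc["_yz_rb"], doc["_yz_rk"])
        if ident not in seen:
            seen.add(ident)
            unique.append(doc)
    return unique


# Trimmed copies of the four docs returned by the query, in score order.
docs = [
    {"name": "Mountain View", "_yz_rt": "geo_type", "_yz_rb": "stuff",
     "_yz_rk": "mv", "score": 0.072977245},
    {"name": "San Jose", "_yz_rt": "geo_type", "_yz_rb": "stuff",
     "_yz_rk": "sj", "score": 0.2221485},
    {"name": "San Francisco", "_yz_rt": "geo_type", "_yz_rb": "stuff",
     "_yz_rk": "sf", "score": 0.39857662},
    {"name": "San Francisco", "_yz_rt": "geo_type", "_yz_rb": "stuff",
     "_yz_rk": "sf", "score": 0.39857662},
]

names = [doc["name"] for doc in dedupe_by_key(docs)]
print(names)
```

Because the input is already sorted by score, the surviving documents stay in distance order.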
Now we can convert to miles by multiplying the score (which is in degrees) by 69.09341. For San Jose that is 0.2221485 * 69.09341, or about 15.35 mi. For kilometers we use 111.1951, which gives us about 24.7 km.
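The same conversion for all three cities, as a small Python sketch. The two constants are the per-degree factors quoted above, the scores are copied from the query response, and the function names are my own.

```python
MILES_PER_DEGREE = 69.09341  # factor quoted in the post
KM_PER_DEGREE = 111.1951     # factor quoted in the post


def degrees_to_miles(degrees):
    """Convert a distance score in degrees to miles."""
    return degrees * MILES_PER_DEGREE


def degrees_to_km(degrees):
    """Convert a distance score in degrees to kilometers."""
    return degrees * KM_PER_DEGREE


# Scores copied from the query response.
scores = {
    "Mountain View": 0.072977245,
    "San Jose": 0.2221485,
    "San Francisco": 0.39857662,
}

for name, degrees in scores.items():
    print(f"{name}: {degrees_to_miles(degrees):.2f} mi, "
          f"{degrees_to_km(degrees):.2f} km")
```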
Since our query location was in Palo Alto, California, we can see that San Jose is indeed approximately 15 miles away. Our search was successful!
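As an independent sanity check, we can compute the great-circle angle between the query point and San Jose ourselves. The Python sketch below uses the haversine formula on a spherical Earth; that is my assumption about a reasonable way to cross-check the score, not a statement about what Solr does internally. The coordinates are the ones used earlier in the post.

```python
import math


def haversine_degrees(lat1, lon1, lat2, lon2):
    """Great-circle angle between two points, in degrees (spherical Earth)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(a)))


palo_alto = (37.441883, -122.143019)  # query point from the search URL
san_jose = (37.339386, -121.894955)   # loc stored under key "sj"

angle = haversine_degrees(*palo_alto, *san_jose)
miles = angle * 69.09341  # same miles-per-degree factor as above
print(f"{angle:.4f} degrees, about {miles:.1f} miles")
```

The angle should land very close to the 0.2221485 score Solr returned for San Jose.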