I'd been thinking about using SolrCloud for a project at work recently, and I wanted to test it out locally. To do this I used Vagrant to setup a multi-machine private network with static IPs.
Getting the machines setup was much easier than I expected, and then getting SolrCloud working wasn't much more complicated. However, there does appear to be a lack of good examples of exactly how to get something like this going, so in this blog post I'm going to take you through the steps to do just that.
I used OSX to create this setup, but it should also work on major Linux distributions. I'm not a Windows user so you'll have to do your own research if that's what you need.
A SolrCloud setup has two types of component: Solr nodes, which store and serve the data, and Zookeeper, which coordinates those nodes.
The 'Cloud' part of SolrCloud comes from the fact that any data you push into your set of Solr nodes can be split into shards and distributed across the nodes. Each shard is then replicated one or more times on different nodes to provide redundancy. A query arrives at one of the nodes and is forwarded to the nodes holding replicas of the relevant shards.

The number of shards into which the data is split, and the number of replicas of each shard, are set at the point of creating a collection (more about this later).
You can find a basic introduction to how SolrCloud works on the Solr wiki. For the rest of this article I'm going to assume you are aware of the basics.
Our aim for this test is to have each element of the SolrCloud setup running on its own virtual machine. Our setup will have three Solr nodes with which we can store and query the data, and a single Zookeeper instance to manage the nodes. We could have multiple Zookeeper instances to provide further redundancy. This would be called a Zookeeper ensemble. However, for this initial test we're going to stick with just the one.
Having three Solr nodes means that we can split our data into two shards with two replicas of each, and if one of the Solr nodes goes down we'll still be able to access all of the data.
Our test network will be built using Ubuntu virtual machines. The first thing we're going to do is create a new directory for our test VMs on our host machine, and then generate a Vagrant file including the Ubuntu Trusty64 Vagrant box.
```shell
mkdir -p ~/solrcloud-test
cd ~/solrcloud-test
vagrant init ubuntu/trusty64
```
This will generate a file called `Vagrantfile`, which contains the instructions Vagrant uses to build a basic Ubuntu VM. I'm going to use this file to create all four of the VMs needed for our test; Vagrant supports multi-machine setups out of the box. So, open the Vagrantfile and replace the line `config.vm.box = "ubuntu/trusty64"` with the instructions below:
```ruby
config.vm.provider "virtualbox" do |v|
  v.memory = 1024
  v.cpus = 2
end

config.vm.define "zoo1" do |zoo1|
  zoo1.vm.box = "ubuntu/trusty64"
  zoo1.vm.network "private_network", type: "dhcp"
end

config.vm.define "solr1" do |solr1|
  solr1.vm.box = "ubuntu/trusty64"
  solr1.vm.network "private_network", type: "dhcp"
end

config.vm.define "solr2" do |solr2|
  solr2.vm.box = "ubuntu/trusty64"
  solr2.vm.network "private_network", type: "dhcp"
end

config.vm.define "solr3" do |solr3|
  solr3.vm.box = "ubuntu/trusty64"
  solr3.vm.network "private_network", type: "dhcp"
end
```
The first block in the configuration above defines the amount of memory and the number of CPUs assigned to each of the VMs defined below it. The default is 512MB, which isn't enough to run Solr, so we need to bump it up to 1024MB.
Each of the next four blocks defines a separate virtual machine and gives it a name. The `[name].vm.box` line tells Vagrant which template to use when creating each box, and the `[name].vm.network` line instructs Vagrant to create a private network using DHCP. This means each of our boxes will be assigned an IP address that can only be accessed within our private network (the four Vagrant boxes and our host machine).
Now let's get these Vagrant boxes running.
```shell
cd ~/solrcloud-test
vagrant up
```
The process of building the four Vagrant boxes will begin. This could take a few minutes, particularly if you haven't used the Ubuntu Trusty64 box before, as Vagrant will download it.
Note: Some older versions of Vagrant have an issue when using the DHCP network type. They fail on `vagrant up` with an error saying a network of that type already exists. Upgrade to the latest version of Vagrant and that error will disappear.
Now that the Vagrant boxes are built and running, we can SSH into them. Open three additional terminal tabs or windows, go to the `solrcloud-test` directory in each, and use the command `vagrant ssh [box name]`, e.g.:

```shell
vagrant ssh zoo1
vagrant ssh solr1
# etc...
```
Now that we are working within the virtual machines, the first thing we need to do is make a note of the IP address of each one. There are a few ways to do this, but I use the `ifconfig -a` command. You should see something similar to the output below:

```text
eth0      Link encap:Ethernet  HWaddr 08:00:27:55:57:5e
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe55:575e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:754 errors:0 dropped:0 overruns:0 frame:0
          TX packets:584 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:79919 (79.9 KB)  TX bytes:69421 (69.4 KB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:c4:24:ec
          inet addr:172.28.128.3  Bcast:172.28.128.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fec4:24ec/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:175 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:37752 (37.7 KB)  TX bytes:2538 (2.5 KB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
```
The address you're looking for is the `inet addr: 172.28.128.3` part of the `eth1` block. If we run this command on each box, we should find the IP addresses are identical apart from the final number, because they are all assigned from the same reserved private address space. For example, in my setup zoo1 was assigned 172.28.128.3 and solr1 was assigned 172.28.128.4.
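If you'd rather not scan the full output by eye on all four boxes, you can pull out just the `eth1` address with a one-liner. This is a sketch that assumes the older `ifconfig` output format shown above (as used on Ubuntu 14.04):

```shell
# Print only the IPv4 address of eth1. The awk field separator splits
# on runs of colons and spaces, so in the "inet addr:172.28.128.3" line
# the address lands in the fourth field.
ifconfig eth1 | awk -F'[: ]+' '/inet addr/ {print $4}'
```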
You can test your private network by SSHing from one Vagrant box into another, with the username `vagrant` and the password `vagrant`.
And that's it! We have our machines up and running. You could use this setup to test any distributed system, or to experiment with security settings such as iptables rules on top of an application stack. Those things are beyond the scope of this tutorial, but I'd encourage you to play around with the setup.
So, next we need to install the relevant software on each machine.
Both Solr and Zookeeper rely on Java 8 in one way or another. So the first thing we're going to do is install this on each of the boxes. Run the following commands in each of the tabs you have open.
```shell
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```
This installs both the JRE and JDK versions of Oracle's official Java package. If you would prefer to use OpenJDK, you can follow the instructions here.
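To confirm the install worked on each box, you can check the reported version before moving on. The `is_java8` helper below is an illustrative sketch that just inspects the first line of `java -version` output:

```shell
# Returns success if the given "java -version" line reports a 1.8.x release.
is_java8() {
  echo "$1" | grep -q 'version "1\.8'
}

ver_line=$(java -version 2>&1 | head -n 1)
if is_java8 "$ver_line"; then
  echo "Java 8 detected: $ver_line"
else
  echo "Java 8 not found; both Solr and Zookeeper need it"
fi
```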
As the website states, "ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services". For the purposes of SolrCloud, Zookeeper does the following:

- Stores the cluster state: which nodes are live, and which shards and replicas live where.
- Holds the collection configuration and distributes it to every Solr node.
- Coordinates leader election for shards and detects when nodes go down.
Installing and configuring Zookeeper for our SolrCloud test is pretty easy. First, pull down the latest version with `curl` and unpack it:

```shell
curl -O http://mirrors.ukfast.co.uk/sites/ftp.apache.org/zookeeper/zookeeper-3.4.8/zookeeper-3.4.8.tar.gz
tar -zxf zookeeper-3.4.8.tar.gz
```
We then need to give Zookeeper some basic configuration. Zookeeper ships with a sample config file (`conf/zoo_sample.cfg`), but we don't need all the comments and examples that file provides, so we'll create a new file at `conf/zoo.cfg` using your editor of choice. I'm going to use nano.
Now copy the following three lines into that file and save it.
```text
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
```
`tickTime` is Zookeeper's basic time unit, in milliseconds; heartbeats and session timeouts (after which Zookeeper decides one of your Solr servers is down) are measured in multiples of it.

`dataDir` is where Zookeeper will store the data about your SolrCloud cluster. If this directory doesn't exist then Zookeeper will create it when it first starts up.

`clientPort` is the port on which your SolrCloud nodes will connect to Zookeeper.
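If you'd rather skip the editor entirely, the same file can be written in one step with a heredoc. This sketch assumes you unpacked the tarball into your home directory as above (the `mkdir -p` is a no-op if the directory already exists):

```shell
# Write the minimal Zookeeper config non-interactively.
mkdir -p ~/zookeeper-3.4.8/conf
cat > ~/zookeeper-3.4.8/conf/zoo.cfg <<'EOF'
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
EOF
```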
Finally, you need to start Zookeeper with the start-up script provided with the installation:
sudo ~/zookeeper-3.4.8/bin/zkServer.sh start
If all has gone well, you should see the following output in your terminal:
```text
ZooKeeper JMX enabled by default
Using config: /home/vagrant/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
```
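Beyond the start-up message, you can probe the running server directly. Zookeeper answers four-letter admin commands on its client port; `ruok` should get the reply `imok` from a healthy server. This is a sketch that assumes `nc` (netcat) is available, as it is on a stock Trusty box:

```shell
# Returns success if Zookeeper replied "imok" to a ruok probe.
is_zk_ok() {
  [ "$1" = "imok" ]
}

# The || true keeps the probe from aborting scripts when nothing is listening.
reply=$(echo ruok | nc -w 2 localhost 2181 2>/dev/null || true)
if is_zk_ok "$reply"; then
  echo "Zookeeper is up and healthy"
else
  echo "No healthy Zookeeper found on localhost:2181"
fi
```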
We now need to install our three instances of Solr. As with Zookeeper, we download a distribution, this time from the Apache Solr website, and unpack it on each of the three Solr boxes:
```shell
curl -O http://mirrors.muzzy.org.uk/apache/lucene/solr/6.2.0/solr-6.2.0.tgz
tar -xzf solr-6.2.0.tgz
```
To test everything is working, try starting Solr in basic standalone mode using the Solr start script provided in the distribution.
```shell
cd ~/solr-6.2.0
bin/solr start
```
Then visit the VM's IP in your host machine's browser, appending `:8983/solr` to the end, e.g. `http://172.28.128.4:8983/solr`. If all is successful you should see the Solr admin interface.
However, we don't want these Solr instances to run in standalone mode; we want them to run in cloud mode. This is just as easy: you only need the IP of your Zookeeper machine and the IP of each connecting Solr VM.
The first thing to do is stop the node we currently have running, using `bin/solr stop`.
Then we restart in cloud mode with the following command:
```shell
bin/solr start -c -z 172.28.128.3:2181 -h 172.28.128.4 -p 8983
```
Let's break down the elements of this command:
`bin/solr start -c`: This is the familiar start command, with the `-c` flag, which is a shortened version of `-cloud` and starts the node in SolrCloud mode.
`-z 172.28.128.3:2181`: the `-z` flag instructs Solr to connect to the Zookeeper instance at the given IP and port.
`-h 172.28.128.4`: this sets the hostname the node registers itself with, which should be the specific Solr machine's own IP. The port, set with `-p`, can be anything that doesn't clash with something else, but I'd suggest sticking with the default Solr port of 8983.
After running this command, you should be able to go to the Solr admin for that node (e.g. http://172.28.128.4:8983/solr/) and see the 'Cloud' option in the left-hand menu. If you click it, you should currently see only a blank white area, with a key in the bottom right. For anything to display in this section, we need to create a 'Collection'.
A 'Collection' in SolrCloud is the equivalent of a Solr core in standalone mode. We can easily create a simple collection with the following command, run from the root folder of one of your Solr nodes:
```shell
bin/solr create -c testCollection -d data_driven_schema_configs -n testCollection_cfg -shards 2 -replicationFactor 2
```
I'm not going to go into great detail on how to create Collections in this blog post, but here's a quick breakdown of the command we've just run:
`bin/solr create -c testCollection`: The create command, followed by the `-c` flag, which defines the name of the new collection.
`-d data_driven_schema_configs`: The `-d` flag sets the config directory for the collection. This config is uploaded to Zookeeper, which then shares it with the other Solr nodes. In this example I've used `data_driven_schema_configs`, one of the example config sets. The default directory in which the create command looks for configs is `/solr-6.2.0/server/solr/configsets/`. If you want to create your own config, you can copy one of the example config sets into a new folder, then provide a relative path to that folder instead; for example, if running from the root directory of your Solr install, you might pass something like `-d my_configs/my_config`.

`-n testCollection_cfg`: The `-n` flag sets the name under which the config is stored in Zookeeper.
`-shards 2`: This defines how many shards the collection should be split into.

`-replicationFactor 2`: This defines how many replicas of each shard are created.
For more info on the usage of the 'create' command, see the Solr docs.
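You can also inspect the cluster from the command line via the Collections API's CLUSTERSTATUS action. This is a sketch using my example IPs; any node in the cluster can answer it:

```shell
# Request the full cluster state as JSON; swap in one of your own node IPs.
STATUS_URL="http://172.28.128.4:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json"
curl -s --connect-timeout 3 "$STATUS_URL" || echo "could not reach the node"
```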
Now, if you go to the 'Cloud' section of the Solr admin on any of your connected nodes, you should see a graph with your collection name on the left, the split of your shards in the middle, and the locations of the replicas of those shards on the right.
There you have it: a working SolrCloud setup using Vagrant. We've got no data in our test collection yet, but adding data isn't SolrCloud-specific; you can push documents in using any method you would use with Solr in standalone mode.
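As a quick end-to-end check, here's a sketch of indexing a single document into `testCollection` over HTTP and querying it back, again using my example node IP; adjust the address to any of your Solr nodes:

```shell
# Index one JSON document (committing immediately), then search for it.
SOLR="http://172.28.128.4:8983/solr/testCollection"

curl -s --connect-timeout 3 "$SOLR/update?commit=true" \
  -H 'Content-Type: application/json' \
  -d '[{"id": "1", "title": "hello solrcloud"}]' || echo "could not reach the node"

curl -s --connect-timeout 3 "$SOLR/select?q=title:solrcloud" || echo "could not reach the node"
```

If routing and replication are working, the query response should report the document you just added.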