This tutorial is about creating a new indexer for CouchDB. Our genuine idea to create the following indexer. We store an integer together with the Document ID. In the end we can query a range of documents based on their number.
That integer could be stores as is in a Document or be calculated, e.g. from counting words.
This tutorial won't describe the straight way to a perfect solution but a way with certain bumps and rewrites so that you'll get a better understanding how CouchDB works.
For this tutorial I expect that you know CouchDB's HTTP API, the Erlang syntax and that you've compiled CouchDB from source before. But let's start right at the beginning with checking out the source code.
git clone git://git.apache.org/couchdb.git
This will create a directory called couchdb with the source code in
it. Switch to the 0.11.x branch and create your own branch named
numidx.
git checkout -t origin/0.11.x
git checkout -b numidx
Now compile it:
./bootstrap
./configure
make dev
And finally starting the Couch:
./utils/run
If it starts without errors, the first thing we do is changing the log
level to debug, so we get more verbose input when anything goes
wrong. You can see several configuration files with the etc/couchdb/
directory. The best place to put in configuration changes for
development is the local_dev.ini as it won't be overwritten by any
script during bootstrapping or any later phase when you are building
CouchDB. Comment the line level = debug in.
When you CouchDB again you'll see a way more verbose output.
CouchDB's most visible and easiest to understand part is the HTTP
layer. Adding new handlers is straight forwards, therefore that's what
we are doing first. It's a good idea to follow CouchDB's convention
with naming files, so we add a new one named
couch_httpd_numidx.erl into the src/couchdb/ directory.
Note: Whenever you create new modules it's always a good idea to
look into similar ones. The code can easily be adapted in most
cases. For the HTTP part couch_httpd_misc_handlers.erl and
couch_httpd_view.erl might be worth a look.
Our module will respond to an HTTP requests with some JSON (the requested database and design document name).
I won't go into much detail when in comes to basic Erlang code. If you've never created an Erlang module before have a look at Learn You Some Erlang for Great Good!. Explanations for the code are inline.
% The module should have the same name as the file
-module(couch_httpd_numidx).
-export([handle_numidx_req/3]).
-include("couch_db.hrl").
-import(couch_httpd, [send_json/2, send_method_not_allowed/2]).
% path_parts is the url, split at the slashes
handle_numidx_req(#httpd{method='GET',
path_parts=[DbName, <<"_design">>, DName, <<"_numidx">>]}=Req,
_Db, _DDoc) ->
% CouchDB normally returns JSON, so do we
send_json(Req, {[
{<<"db">>, DbName},
{<<"designdoc">>, DName}
]});
% We allow only GET and HEAD requests
handle_numidx_req(Req, _, _) ->
send_method_not_allowed(Req, "GET,HEAD").
To make get it compiled we add it to corresponding Makefile. We won't
edit the src/couchdb/Makefile directly as our changes will be lost
whenever ./configure is run. Instead we edit the template file
src/couchdb/Makefile.am. Add couch_httpd_numidx.erl to
source_files and couch_httpd_numidx.beam to compiled_files. Now
run:
./configure
make dev
If you make any changes like fixing syntax errors then you don't need
to rune ./configure again, running make dev is sufficient. After
finishing the compilation you should check if CouchDB still starts up
as expected, before we add our new handler to the
configuration. Append to the end of etc/couchdb/local_dev.ini:
[httpd_design_handlers]
_numidx = {couch_httpd_numidx, handle_numidx_req}
This means that we add a new handler named _numidx after
_design/<designdocname>/. Thus the URL to access our new handler is
http://localhost:5984/<dbname>/_design/<designdocname>/_numidx. Let's
try it. We create a new database, Design Document and finally query
our new handler.
Note: If you get any crashes and like to have some debug output, use
the predefined macros for logging, e.g. ?LOG_DEBUG("my var ~p",
[Var]).
curl -X PUT http://localhost:5984/numdocs/
curl -X PUT -d {} http://localhost:5984/numdocs/_design/tutorial
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx
If the return value looks
like{"db":"numdocs","designdoc":"tutorial"}, congratulations, you
just created your first CouchDB HTTP handler. Relax.
The following diagram illustrates what we've achieved so far. We make a request to CouchDB and the HTTP handler responds.

It's time to dig a bit deeper into CouchDB's internals. Documents get stored in a central database which is used to compute the Views. The cool thing about it is, that our index just works like the View indexer, the functionality for adding, removing and updating Documents is already in place. The only thing you need to worry is getting those Documents from the database into your index.
Every time a Document is changed in the database the so-called
sequence number of the database is increased. The function
couch_db:enum_docs_since/5 loops through the database from a
specific sequence number to the most current one. This way we can keep
our index up to date with the database. It is very similar to the
_changes
feed.
We could add this functionality to the HTTP handler, but we don't want to recreate our index on every request. Therefore we need a daemon which can be called from the HTTP handler but runs independently in the background.
This numidx daemon will be an Erlang OTP
gen_server.
In case you haven't coded a gen_server before, you may want to read
this
tutorial
which is a nice introduction and contains anything you should
know. Add a new file named couch_numidx.erl in src/couchdb/ with
the following basic skeleton:
-module(couch_numidx).
-behaviour(gen_server).
-export([start_link/0, init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2, code_change/3]).
-include("couch_db.hrl").
start_link() ->
?LOG_DEBUG("Numidx daemon: starting link.", []),
gen_server:start_link({local, couch_numidx}, couch_numidx, [], []).
init([]) ->
{ok, {}}.
terminate(Reason, _Srv) ->
ok.
handle_call(_Req, _From, State) ->
{noreply, State}.
handle_cast(_Msg, State) ->
{noreply, State}.
handle_info(_Msg, Server) ->
{noreply, Server}.
code_change(_OldVsn, State, _Extra) ->
{ok, State}.
All this daemon does at the moment is putting an a debug message into
the log when it is started. All you need to do to run it is putting
this additional section into your configuration file
(etc/couchdb/local_dev.ini):
[daemons]
numidx_manager={couch_numidx, start_link, []}
Now compile and run it (following the same steps as for
couch_http_numidx.erl). You should see a debug message along the
lines of [debug] [<0.85.0>] Numidx daemon: starting link.. So your
daemon is running.
So the daemon should read out all Documents in the database and
process them somehow. We will look in every Document if it has a
top-level property called numidx which contains an integer. We will
return all Documents IDs together with their numidx.
As the name handle_call/3 already suggests, it handles calls within
the daemon. The first parameter will be used to call it, it's pattern
matched to the handle_call/3 functions. It is a tuple whenever you
have parameters you'd like to pass on. So we create a new
handle_call/3 which isn't the fallback.
handle_call({do_get_docs, Db}, _From, State) ->
We pass in the Database (it's the database data structure not only the
name), _From and State aren't of importance for now. We come to
the heart of the function, the previously mentioned
couch_db:enum_docs_since/5. It looks like that:
{ok, _Cnt, Acc} = couch_db:enum_docs_since(Db, 0,
fun(DocInfo, _, NumidxAcc) -> {ok, NumidxAcc} end,
[], [])
Let's have a look at the function parameters first. It takes:
The return value is a 3-tuple which consists of ok, the number of
Documents that were processed, and the accumulator from the inner
function.
Although couch_db:enum_docs_since/5 is quite complex the body of the
inner function is way easier to understand and shows some of the
beauty of CouchDB's internal API. Right at the first line we can see
how easy it is to open a document:
{ok, Doc} = couch_db:open_doc(Db, DocInfo),
Nice, isn't it? Now we extract the JSON out of the document:
{Body} = Doc#doc.body,
We want to check whether it has a numidx property or not:
case proplists:get_value(<<"numidx">>, Body) of
In case it hasn't one, we just return the accumulator (NumidxAcc)
without any changes:
undefined ->
{ok, NumidxAcc};
If it contains numidx we return the updated accumulator. Note that
the function expects the accumulator to be returned as a 2-tuple with
{ok, the_accumulator}. We extract the DocId out of DocInfo and
append it to the current accumulator value.
Numidx ->
{doc_info, DocId, _DocSeq, _RevInfo} = DocInfo,
{ok, NumidxAcc ++ [[DocId, Numidx]]}
So the returned accumulator will contain a list of Document IDs
together with their numidx. That's the whole function body. The only
thing that needs to be done in handle_call/3 is returning the
accumulator:
{reply, Acc, State};
The return value is a 3-tuple withreply (compared to noreply in
the fallback function), the actual return value and the State we
don't care about at the moment. This was all the code we need for
handle_call/3. Here is it again in one piece together with the
default fallback:
handle_call({do_get_docs, Db}, _From, State) ->
{ok, _Cnt, Acc} = couch_db:enum_docs_since(Db, 0,
fun(DocInfo, _, NumidxAcc) ->
{ok, Doc} = couch_db:open_doc(Db, DocInfo),
{Body} = Doc#doc.body,
case proplists:get_value(<<"numidx">>, Body) of
undefined ->
{ok, NumidxAcc};
Numidx ->
{doc_info, DocId, _DocSeq, _RevInfo} = DocInfo,
{ok, NumidxAcc ++ [[DocId, Numidx]]}
end
end,
[], []),
{reply, Acc, State};
handle_call(_Req, _From, State) ->
{noreply, State}.
Such a call handler of a gen_server can't be called directly from the outside, but only from a function within the module which can be exported in return. In our case it's very simple:
-export([get_docs/1]).
get_docs(Db) ->
gen_server:call(couch_numidx, {do_get_docs, Db}).
The return value of such a call/2 is the second element
from the return value of handle_call/3. In our case the list of the
Document IDs and their numidx.
The final step is calling the get_docs/1 from our HTTP handler and
adding it to our JSON response. Our new handle_numidx_req/3 looks
like that:
handle_numidx_req(#httpd{method='GET',
path_parts=[DbName, <<"_design">>, DName, <<"_numidx">>]}=Req,
Db, _DDoc) ->
% Get the docs from the database
Docs = couch_numidx:get_docs(Db),
% CouchDB normally returns JSON, so do we
send_json(Req, {[
{<<"db">>, DbName},
{<<"designdoc">>, DName},
{<<"docs">>, Docs}
]});
As you can see it's just a single function call to our daemon which handles all the logic. It's time to compiling and running it. You can also download the full source of the daemon and the HTTP handler. Here's the well known URL to test it:
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx
The response is a bit disappointing, docs is empty. But that's
correct, we don't have any document in the database that contains a
numidx property. So we add a few documents (you don't need to
restart CouchDB, yay!):
curl -X PUT -d '{"numidx": 3}' http://localhost:5984/numdocs/doc3 curl -X PUT -d '{"numidx": 7}' http://localhost:5984/numdocs/doc7 curl -X PUT -d '{"numidx": 8}' http://localhost:5984/numdocs/doc8
And try it again. You should get a response similar to that one back:
{"db":"numdocs","designdoc":"tutorial","docs":[["doc3",3],["doc7",7],["doc8",8]]}
Wow, we learned a lot of new things. Writing a gen_server, accessing the CouchDB database from within Erlang, talking to the daemon from the HTTP handler. Here's a diagram of all that. Time to relax.

So far we haven't coded anything for storing the Document IDs together with the numidx. It is a binary tree, where one child is smaller, the other one greater than its parent. The value is a list of Document IDs with the same numidx First step is building it up in memory, then creating a CouchDB-like append only version which is written straight to disk.
The in-memory structure looks like that:
-record(node, {
% key is the numidx in out case
key = nil,
% list of Document IDs
value = [],
% child node with a smaller key
lt = nil,
% child node with a bigger key
gt = nil
}).
We define four operations on that data structure: - add: Adding a new item - delete: Deleting an item - lookup: Returns all values with a certain key - range: Returns all values (as a list) that are within a certain key range
I won't go into detail how to code this data structure, but just
present the code. Call the module numidx and add it to the makefile as
explained for couch_httpd_numidx. Note: for details about this
unbalanced binary tree, have a look at Concurrent programming in
Erlang Part I.
-module(numidx).
-export([add/2, lookup/2, range/2, delete/2]).
-record(node, {
key = nil,
value = [],
lt = nil,
gt = nil
}).
add(nil, {NewDocId, NewNumidx}) ->
#node{key=NewNumidx, value=[NewDocId]};
add(#node{key=Numidx, value=Value}=Node, {NewDocId, Numidx}) ->
Node#node{value=[NewDocId|Value]};
add(#node{key=Numidx}=Node, {_NewDocId, NewNumidx}=NewItem)
when NewNumidx < Numidx ->
NewLt = add(Node#node.lt, NewItem),
Node#node{lt=NewLt};
add(#node{key=Numidx}=Node, {_NewDocId, NewNumidx}=NewItem)
when NewNumidx > Numidx ->
NewGt = add(Node#node.gt, NewItem),
Node#node{gt=NewGt}.
lookup(nil, _SearchKey) ->
[];
lookup(#node{key=Key, value=Value}, Key) ->
Value;
lookup(#node{key=Key, lt=Lt}, SearchKey) when SearchKey < Key->
lookup(Lt, SearchKey);
lookup(#node{key=Key, gt=Gt}, SearchKey) when SearchKey > Key->
lookup(Gt, SearchKey).
range(nil, _Range) ->
[];
range(Node, {From, To}) ->
lists:foldl(fun(Key, Result) ->
[lookup(Node, Key)|Result]
end, [], lists:seq(From, To)).
delete(nil, _) ->
nil;
% Node matches and contains more than one value
delete(#node{key=Numidx, value=Value}=Node, {DelDocId, Numidx})
when length(Value) > 1 ->
Node#node{value=lists:delete(DelDocId, Value)};
% Node matches and has no child nodes
delete(#node{key=Numidx, lt=nil, gt=nil}, {_DelDocId, Numidx}) ->
nil;
% Node matches and has one child (lt)
delete(#node{key=Numidx, lt=Lt, gt=nil}, {_DelDocId, Numidx}) ->
Lt;
% Node matches and has one child (gt)
delete(#node{key=Numidx, lt=nil, gt=Gt}, {_DelDocId, Numidx}) ->
Gt;
% Node matches and has both child nodes
delete(#node{key=Numidx, lt=Lt, gt=Gt}, {_DelDocId, Numidx}) ->
% Find greatest node an rebuild tree (thanks Concurrent programming in
% Erlang Part I pp. 61)
NewNode = deletesp(Lt),
NewNode#node{gt=Gt};
% Node doesn't match, move on
delete(#node{key=Numidx}=Node, {_DelDocId, DelNumidx}=DelItem)
when DelNumidx < Numidx ->
DelLt = delete(Node#node.lt, DelItem),
Node#node{lt=DelLt};
% Node doesn't match, move on
delete(#node{key=Numidx}=Node, {_DelDocId, DelNumidx}=DelItem)
when DelNumidx > Numidx ->
DelGt = delete(Node#node.gt, DelItem),
Node#node{gt=DelGt}.
% Abusing #node, always write the result into #node.lt
deletesp(#node{gt=nil}=Node) ->
Node;
deletesp(#node{gt=Gt}=Node) ->
NewNode = deletesp(Gt),
NewNode#node{lt=Node#node{gt=NewNode#node.lt}}.
You can also download the full source of the numidx data structure.
When we store all Documents of the database in our index, we can
easily perform range queries on it. So let's first change the numidx
daemon's code (couch_numidx.erl) to make use of the numidx data
structure. Then we'll extend the daemon and the HTTP handler to be
able to perform range queries.
You remember the inner function of couch_db:enum_docs_since/5 in the
do_get_docs call handle? There's a case statement which adds all
Documents which contain an numidx property to the accumulator. This
is the right spot to make use of our newly created numidx data
structure.
To recap, here's the old code:
case proplists:get_value(<<"numidx">>, Body) of
undefined ->
{ok, NumidxAcc};
Numidx ->
{doc_info, DocId, _DocSeq, _RevInfo} = DocInfo,
{ok, NumidxAcc ++ [[DocId, Numidx]]}
end
This case statement should return the numidx data structure at the
end, instead of a list of all elements. Therefore the accumulator
needs to be changed as well (initial value is nil instead
[]). Here's the new couch_db:enum_docs_since/5.
{ok, _Cnt, Acc} = couch_db:enum_docs_since(Db, 0,
fun(DocInfo, _, NumidxDs) ->
{ok, Doc} = couch_db:open_doc(Db, DocInfo),
{Body} = Doc#doc.body,
case proplists:get_value(<<"numidx">>, Body) of
undefined ->
{ok, NumidxDs};
Numidx ->
{doc_info, DocId, _DocSeq, _RevInfo} = DocInfo,
NumidxDs2 = numidx:add(NumidxDs, {DocId, Numidx}),
{ok, NumidxDs2}
end
end,
nil, []),
Compile, run and request it with:
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx
The response will be a json_encode-error as the numidx data
structure can't be mapped to JSON. But you can see that it contains
all elements that we expected. But we want to make range queries
anyway, so we don't bother.
Adding the range queries follows the same steps as for the
do_get_docs call handle. Therefore I keep it short again. The code
for the daemon:
-export([get_docs/1, range_search/2]).
range_search(Ni, Range) ->
gen_server:call(couch_numidx, {do_range_search, Ni, Range}).
handle_call({do_range_search, Ni, Range}, _From, State) ->
Result = numidx:range(Ni, Range),
{reply, Result, State};
Currently such a handle_call doesn't make much sense, we could even
perform this operation within the HTTP handler. It's a preparation for
later use, when a bit of persistence is introduced.
As our primary goal is doing a range search, we will build up the
index automatically whenever such a query is performed. The first
thing is changing the handle_numidx_req/3 in couch_http_numidx.erl
to support a range. The request URL will look like (%5B1,7%5D is
[1,7] URL encoded):
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx/%5B1,7%5D
As CouchDB heavily uses JSON, it makes sense to use the range parameter encoded as JSON as well (so we don't need to write the parsing ourselves). The function signature is changed and the function body does not only request numidx data structure, but also call out for the range search:
handle_numidx_req(#httpd{method='GET',
path_parts=[DbName, <<"_design">>, DName, <<"_numidx">>, JsonRange]
}=Req, Db, _DDoc) ->
% Parse range parameter
Range = list_to_tuple(?JSON_DECODE(JsonRange)),
% Build the numidx data structure from the database
Numidx = couch_numidx:get_docs(Db),
% Perform the range search
Result = couch_numidx:range_search(Numidx, Range),
% CouchDB normally returns JSON, so do we
send_json(Req, {[
{<<"db">>, DbName},
{<<"designdoc">>, DName},
{<<"result">>, Result}
]});
The response we get back is
{"db":"numdocs","designdoc":"tutorial","result":[["doc3",3],["doc7",7]]}
The index works, and we can query it. Here's the source for couch_httpd_numidx.erl and couch_numidx.erl. Relax.
We finally have an index, it can store, remove (though we don't use it at the moment) and return Document IDs based on their numidx. It is accessible through the HTTP handler. Here's again a diagram:

Recreating the numidx data structure on every request doesn't really make sense. Especially when we switch from in-memory to on-disk storage, we don't want to recreate a file on every request.
This sounds like a global state. Doesn't Erlang have only local variables and no local ones? How can a global state play nice with concurrency? The answer to those questions is easier than you might think.
At the moment the daemon's init/1 function returns an ok together
with an empty tuple ({ok, {}}). This empty tuple is the initial
state. Normally you return a record with some initial values. In our
case we will store the sequence number of the database and the numidx
data structure within the record.
-record(ni_daemon, {
seq = 0,
numidx = nil
}).
init([]) ->
{ok, #ni_daemon{}}.
The handle_call/3s get this state as a parameter and return a new
state (which might as well be the same as the old one). On the first
call of do_get_docs the seq is 0, so it will just work as it is at
the moment. Then we return the current sequence number with the new
state. This way we won't rebuild the whole index on any subsequent
call, but only if the sequence number increased since the last time.
The inner function of couch_db:enum_docs_since/5 needs to keep track
of the current sequence number. The easiest way of doing it is using
the state of the handle_call/3 as initial value for the accumulator,
and returning the sequence number and the numidx data structure as a
new state. Another benefit is, that couch_db:enum_docs_since/5
returns the initial value of the accumulator in case it doesn't loop
at all (if there were no new updates). There's a debug message to see
that the sequence number increases correctly if new Documents are
added. Here's the code:
handle_call({do_get_docs, Db}, _From, State) ->
{ok, _Cnt, StateNew} = couch_db:enum_docs_since(Db, State#ni_daemon.seq,
fun(DocInfo, _, StateCur) ->
{doc_info, DocId, DocSeq, _RevInfo} = DocInfo,
{ok, Doc} = couch_db:open_doc(Db, DocInfo),
{Body} = Doc#doc.body,
?LOG_DEBUG("doc-seq: ~p", [DocSeq]),
case proplists:get_value(<<"numidx">>, Body) of
undefined ->
{ok, StateCur#ni_daemon{seq=DocSeq}};
Numidx ->
NumidxDs2 = numidx:add(StateCur#ni_daemon.numidx, {DocId, Numidx}),
{ok, StateCur#ni_daemon{seq=DocSeq, numidx=NumidxDs2}}
end
end, State, []),
{reply, ok, StateNew};
Instead of returning the numidx data structure we just return ok as
the numidx is stored in the state. Therefore we have to edit
do_range_search handler to make use of the state instead of passing
the numidx data structure in:
-export([get_docs/1, range_search/1]).
range_search(Range) ->
gen_server:call(couch_numidx, {do_range_search, Range}).
handle_call({do_range_search, Range}, _From, State) ->
Result = numidx:rangeState#ni_daemon.Ni, Range),
{reply, Result, State};
The HTTP handler needs to be updated as well. Change the two calls to the daemon to:
% Build the numidx data structure from the database
ok = couch_numidx:get_docs(Db),
% Perform the range search
Result = couch_numidx:range_search(Range),
Compile and run it (sources are available for couch_httpd_numidx.erl and couch_numidx.erl). Thanks to the debug messages you can see that the index is build up only on the first request. Whenever you add a new document and make another range request, the index is updated automatically. Try it:
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx/%5B0,7%5D
curl -X PUT -d '{"numidx": 5}' http://localhost:5984/numdocs/doc5
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx/%5B0,7%5D
You've learned about gen_server's way of keeping a state to keep information that subsequent requests need alive. Not only data structures but also file descriptors to save resources.

CouchDB is famous for its append-only B-tree. Therefore our on-disk numidx will work in a similar fashion. CouchDB already comes with all the functionality you need.
numidx will operate on files, but we don't care about opening them, we expect to get an working file descriptor for every operation. The numidx daemon will supply that file descriptor, though I'll introduce the essential parts of opening a file, so that you can easily debug the numidx data structure when you are following the next steps. The code should be self explanatory:
OpenFile = fun(Filename) ->
case couch_file:open(Filename, [create, overwrite]) of
{ok, Fd} ->
{ok, Fd};
{error, Reason} ->
io:format("ERROR (~s): Couldn't open file (~s) for tree storage~n",
[Reason, Filename]),
{error, Reason}
end
end.
You get ok and a file descriptor back. You can paste this function
into your shell while debugging numidx.erl.
Hint: in case you get
** exception error: undefined function couch_file:open/2
when you call OpenFile("/tmp/foo.bin"), your Erlang shell can't find the CouchDB files. To fix it add the CouchDB src directory to the search path, e.g. with:
ERL_FLAGS="-pa '/your/path/to/couchdb/src/couchdb/tutorial/couchdb/src/couchdb'" erl
The CouchDB internal file API is easy and straight forward. The two most basic ones are for writing to and reading from file. As already mentioned we write to the file in append-only fashion, thus you don't need to specify where in the file to write, the call is simply:
{ok, Pos} = couch_file:append_term(Fd, Data).
The function parameters are the file descriptor we operate with and
the data (any Erlang term you like) that should be appended to the
file. The return value is the position where it was written in the file. This
position is needed whenever you want to read a term again with
couch_file:pread_term/2:
{ok, Data} = couch_file:pread_term(Fd, Pos).
Yes it's that easy.
The API of the numidx data structure won't change much. The difference
will be that we pass in the file descriptor and the position where it
was written in the file instead of the actual data structure
itself. The return value will be an ok and the position where it was
written. Additionally the lt resp. gt in the node record will
not contain the actual children, but only the position in the file
where those children are stored.
First we create a new entry point (add/2):
add({Fd, nil}, NewItem) ->
add(Fd, nil, NewItem);
add({Fd, Pos}, NewItem) ->
{ok, Node} = couch_file:pread_term(Fd, Pos),
add(Fd, Node, NewItem).
The first case is for bootstrapping an empty tree. In the second one the actual (root) node is read from file and passed on. We use the same cases for adding a new node, thus the function signatures are very similar. The only change is that we pass on additionally the file descriptor to be able to read/write the node from/to file.
In the most basic case, when the current node is empty, we write the new node to disk and return the position, instead of returning the node itself. The code changes from:
add(nil, {NewDocId, NewNumidx}) ->
#node{key=NewNumidx, value=[NewDocId]};
to:
add(Fd, nil, {NewDocId, NewNumidx}) ->
couch_file:append_term(Fd, #node{key=NewNumidx, value=[NewDocId]});
As you can see it is really straight forward. For the case where the tree needs to be traversed, you need to read the child node from file first, then passing it on and finally writing the current node again with the information about the new child position:
add(Fd, #node{key=Numidx}=Node, {_NewDocId, NewNumidx}=NewItem)
when NewNumidx < Numidx ->
Lt = data_at_pos(Fd, Node#node.lt),
{ok, NewLtPos} = add(Fd, Lt, NewItem),
couch_file:append_term(Fd, Node#node{lt=NewLtPos});
You notice the data_at_pos/2 helper function. A direct call to
couch_file:pread_term/2 can't be used here, as Node#node.lt might
be nil and thus confuse the call. The helper function solves that
problem:
data_at_pos(_Fd, nil) ->
nil;
data_at_pos(Fd, Pos) ->
{ok, Data} = couch_file:pread_term(Fd, Pos),
Data.
I will leave the rest of the add/3 function cases as an exercise to
the reader. Though the full source can be found at the end of this
chapter. In case you wonder how a shell session looks like, have a
look at one of mine.
The changes to the tree a propagated upwards, this means that all
nodes up to the root need to be rewritten. The position add/2
returns in the position of root within the file. When you want to add
a new node, you have to use that position, else you would update an
old version of the tree.
This illustration should make the approach a bit clearer:

The lookup functions are even easier to implement than add, give
it a go. range also needs only small changes. delete is a bit
harder, but a detailed description is out of scope for this
tutorial. If you can get delete work without looking at the final
source, you're on the right track, getting your own index work
should be breeze.
Of course the daemon (couch_numidx.erl) needs to be updated. It will
open the file an pass the file descriptor on to the numidx data
structure. The function for opening a file is a verbatim copy of the
previously introduced OpenFile/1:
open_file(Filename) ->
case couch_file:open(Filename, [create, overwrite]) of
{ok, Fd} ->
{ok, Fd};
{error, Reason} ->
io:format("ERROR (~s): Couldn't open file (~s) for tree storage~n",
[Reason, Filename]),
{error, Reason}
end.
For now we use a hard-coded file path and overwrite the file every time
the daemon is started. The file descriptor is stored in the state of
the daemon, hence we need a new entry in the ni_daemon record. As
the numidx data structure shouldn't be kept in memory, but always read
from disk, we don't need the numidx in the ni_daemon anymore, but
the position of the root node within the file. The new record and the
init/1 now looks like that:
-define(FILENAME, "/tmp/couchdb_numidx.bin").
-record(ni_daemon, {
seq = 0,
ni_pos = nil,
fd = nil
}).
init([]) ->
{ok, Fd} = open_file(?FILENAME),
{ok, #ni_daemon{fd=Fd}}.
The handle for the do_get_docs call contains one call to the
numidx module that needs to be changed:
NumidxDs2 = numidx:add(StateCur#ni_daemon.numidx, {DocId, Numidx}),
{ok, StateCur#ni_daemon{seq=DocSeq, numidx=NumidxDs2}}
numidx:add/2 takes the file handler and the position, which are both
present in the state:
{ok, NewPos} = numidx:add({Fd, Pos}, {DocId, Numidx}),
{ok, StateCur#ni_daemon{seq=DocSeq, ni_pos=NewPos}}
The other handle which contains a call to numidx is in
do_range_search which needs changes in a similar fashion. For the
full source with all changes, see
couch_numidx.erl and for the new on-disk
numidx binary tree see numidx.erl.
The in-memory numidx data structure was changed to an append-only on-disk data structure, the daemon was updated to make use of it. The HTTP handler doesn't need to change as the daemon's public API hasn't changed.

So far we've only added Document to the index, but never deleted
them. Although the numidx data structure supports the deletion from
the tree we don't make use of it yet. Before we get to the
implementation details here's a quick rundown how
couch_db:enum_docs_since/5 handles deletion.
A Document with a specific ID appears always only once in
couch_db:enum_docs_since/5 either as deleted or not. This means if
you create a new Document, it's added to the database which increases
the sequence number. When you delete this Document again, the sequence
number increases again and the Document (only the meta information,
not the data) will appear as deleted in
couch_db:enum_docs_since/5. When you create a Document with an ID
that was already present in the database but got deleted previously,
it will only show up with the new sequence ID.
The inner function body of couch_db:enum_docs_since/5 is the place
where we handle deleted Documents. The information whether a Document
was deleted or not is present in the revision information, which is in
return part of the Document information. The first element of the
revision information contains all we need:
{doc_info, DocId, DocSeq, [RevInfo|_]} = DocInfo,
The basic code to determine deleted Document is a simple case
statement. For the false case we use the code we already have, in
the other case we only update the current sequence number.
{doc_info, DocId, DocSeq, [RevInfo|_]} = DocInfo,
?LOG_DEBUG("doc-seq: ~p", [DocSeq]),
case RevInfo#rev_info.deleted of
false ->
{ok, Doc} = couch_db:open_doc(Db, DocInfo),
{Body} = Doc#doc.body,
case proplists:get_value(<<"numidx">>, Body) of
undefined ->
{ok, StateCur#ni_daemon{seq=DocSeq}};
Numidx ->
{ok, NewPos} = numidx:add({Fd, Pos}, {DocId, Numidx}),
{ok, StateCur#ni_daemon{seq=DocSeq, ni_pos=NewPos}}
end;
true ->
?LOG_DEBUG("deleted: ~p", [DocSeq]),
{ok, StateCur#ni_daemon{seq=DocSeq}}
end
Run it, perform a range search, delete a document (replace the rev
with your specific one) and search again:
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx/%5B0,7%5D
curl -X DELETE http://localhost:5984/numdocs/doc5?rev=1-56f223e26669eb4ff2cdacc8c2852405
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx/%5B0,7%5D
In your debug log you should see something along the lines of:
[debug] [<0.86.0>] doc-seq: 6
[debug] [<0.86.0>] deleted: 6
[info] [<0.213.0>] 127.0.0.1 - - 'GET' /numdocs/_design/tutorial/_numidx/%5B0,7%5D 200
Though the range search still returns the Document as we haven't
deleted it from the numidx data structure yet. You might think it's a
simple call to numidx:delete/3 as it's for adding a new item. But it
is not.
The last parameter of numidx:delete/3 is a tuple with Document's ID
and its numidx. You can't open deleted Documents with
couch_db:open_doc/2 to get the Documents contents as they are not
there any longer. Only their ID is still present.
There are two options, either traversing the whole numidx data
structure (which isn't really a good idea) to find the item that
should be deleted or saving the corresponding numidx value in
another data structure. The latter is CouchDB's preferred as it's
faster the more items you have.
Luckily we don't need to build a new one. In our case we could even use another numidx data structure, but the general solution (for all kind of indexes) is using CouchDB's native B-tree implementation which is way better. This data structure that references from Document ID back to a value is called Back-Index. The Back-index needs to be kept in sync with the numidx data structure, so every insertion or deletion is performed on both.
CouchDB's B-tree API is as simple as the one for file access. In
init/1 the B-tree is initialized put into the state of the daemon:
-record(ni_daemon, {
seq = 0,
ni_pos = nil,
fd = nil,
btree = nil
}).
init([]) ->
{ok, Fd} = open_file(?FILENAME),
{ok, Btree} = couch_btree:open(nil, Fd),
{ok, #ni_daemon{fd=Fd, btree=Btree}}.
As nothing within the file is ever overwritten but only appended, we
can even use the same file for both, the numidx and the B-tree. The
B-tree record that is returned from couch_btree:open/2 takes care of
the position of the root node within the file itself.
Right after adding the new item to the numidx, we add it to the
Back-Index. The Btree variable is from the CurState record:
{ok, NewPos} = numidx:add({Fd, Pos}, {DocId, Numidx}),
{ok, NewBtree} = couch_btree:add(Btree, [{DocId, Numidx}]),
{ok, StateCur#ni_daemon{seq=DocSeq, ni_pos=NewPos, btree=NewBtree}}
Note that couch_btree:add/2 takes a list of tuples to be
inserted. The first tuple element is the key, the second one the
value. Multiple items can be inserted at the same time, which is
e.g. used for
CouchDB's bulk API.
For the case when a Document got deleted a lookup is needed:
case couch_btree:lookup(Btree, [DocId]) of
couch_btree:lookup/2 takes again, as all other B-Tree operations a
list of Documents to lookup. Hence the return value is a list as well,
either with not_found or a tuple with ok and the value that
corresponds to the key.
[{ok, Numidx}] ->
{ok, NewPos} = numidx:delete({Fd, Pos}, {DocId, Numidx}),
{ok, NewBtree} = couch_btree:add_remove(Btree, [], [DocId]),
{ok, StateCur#ni_daemon{seq=DocSeq, ni_pos=NewPos, btree=NewBtree}};
[not_found] ->
% If it's not in the Back-Index it's not in numidxx
{ok, StateCur#ni_daemon{seq=DocSeq}}
In the first case the key was found, so we delete it from the
Back-Index and the numidx. There's no explicit couch_btree:remove/2
function, so we use the couch_btree:add_remove/3 instead, which
takes a list of tuples to be inserted as a second parameter (of course
an empty list in our case). Thanks to the lookup we have the numidx
value for the Document so that we can remove it easily from the data
structure.
In the second case the Document ID wasn't found in the
Back-Index. This happens if a document doesn't contain a numidx
property and is therefore not store in the numidx data structure.
If compile and run it, your range request shouldn't show the deleted Document
any more. As always, here's the full source for
couch_numidx.erl.
You've learned how CouchDB handles the deletion of Documents and how to use the B-tree implementation. Additionally the Back-Index was introduced to allow reverse lookups on your index.

So far the numidx data structure is built on every startup and stored persistently for any subsequent request. The next logical step is to build the index when it doesn't exist yet and updating it whenever it changes (resp. updating it when the database was changed although the daemon wasn't running for whatever reason).
Again CouchDB has some nice built-in feature we can use. Reading and writing headers to the file. Those headers are always appended to the end of the file. When you want to get the latest header, the file is automatically searched from the back to get it.
Let's start with writing the header. Whenever the numidx update is
completed, i.e. it's in sync with the Documents in the database, the
header will be written. This happens right before the new state in the
do_update_docs handler is returned:
couch_file:write_header(StateNew#ni_daemon.fd, StateNew),
{reply, ok, StateNew};
On startups first check if the file can be opened for reading and read out the header. The Btree record needs be updated, as it contains the file descriptor which changed since the last time you've opened the file. If the file can't be opened, create a new one as we did it already:
open_file(Filename) ->
case couch_file:open(Filename) of
{ok, Fd} ->
{ok, State} = couch_file:read_header(Fd),
{ok, Btree} = couch_btree:open(nil, Fd),
{ok, State#ni_daemon{fd=Fd, btree=Btree}};
{error, enoent} ->
case couch_file:open(Filename, [create, overwrite]) of
{ok, Fd} ->
{ok, Btree} = couch_btree:open(nil, Fd),
State = #ni_daemon{fd=Fd, btree=Btree},
couch_file:write_header(Fd, State),
{ok, State};
{error, Reason} ->
io:format("ERROR (~s): Couldn't open file (~s) for tree storage~n",
[Reason, Filename]),
{error, Reason}
end;
{error, Reason} ->
io:format("ERROR (~s): Couldn't open file (~s) for tree storage~n",
[Reason, Filename]),
{error, Reason}
end.
init([]) ->
open_file(?FILENAME).
open_file/1 is extended to do a bit more than just opening
a file. It will read in the header of an existing one, or create a new
file with a header. In the end a new state will be returned. Thus
init/1 will only contain a call to open_file/1.
Make sure you remove your old index from disk
(tmp/couchdb_numidx.bin), else CouchDB will crash on startup as it
doesn't contain a valid header yet.
When you start it the first time and query it, you should get the well known debug messages along the lines of:
[debug] [<0.86.0>] doc-seq: 1
[debug] [<0.86.0>] doc-seq: 2
[debug] [<0.86.0>] doc-seq: 3
[debug] [<0.86.0>] doc-seq: 4
[debug] [<0.86.0>] doc-seq: 6
[debug] [<0.86.0>] deleted: 6
If you kill CouchDB and start it again, you shouldn't get that output on a query, but the right response back. The index wasn't rebuilt, but read from disk. Congratulations! This was an easy but pretty nice feature to implement.
Hint: When you keep coding from now on, you might want to comment out the part where it tries to read the file from disk or remove it on every startup. This prevents you from errors caused by an old (potentially broken) state read from disk.
Download the source of
couch_numidx.erl. No new modules where
touched, therefore there's no new diagram.
Now that the numidx is persistently stored on disk it's time to get rid of some hard-coded stuff. So far the numidx didn't even do what was advertised right at the beginning: storing an integer that might come from counting words of a arbitrary string.
The number we use for the numidx shouldn't come from a property named
numidx but as return value from a function within a design
documents, just as it works for Views.
Update your Design Document to contain such a function (replace the revision with your specific one):
curl -X PUT -d "{\"_rev\": \"1-967a00dff5e02add41819138abb3284d\", \"numidx\": \"function(doc) {return doc.text ? doc.text.split(' ').length : false;}\"}" http://localhost:5984/numdocs/_design/tutorial
The JavaScript function is quite simple:
function(doc) {
return doc.text ? doc.text.split(' ').length : false;
}
It takes a Document as a parameter (just as a View map function does)
and returns the number of words of a property called text, but only
if it exists. Else it returns false. This false means in our case
that this Document won't be added to the numidx.
The function defined in the Design Document will be applied to every Document. So the steps are: getting the document body as JSON (instead of reading the body of the Document):
{ok, Doc} = couch_db:open_doc(Db, DocInfo),
JsonDoc = couch_query_servers:json_doc(Doc),
Then pass it into a function defined in the Design Document:
[<<"numidx">>, Numidx] = couch_query_servers:ddoc_prompt(
DDoc, [<<"numidx">>], [JsonDoc]),
couch_query_servers:ddoc_prompt/3 takes three parameters:
DDoc
variable seems to come from nowhere and it indeed does at the
moment. It's available in the HTTP handler. You need to change the
code of the HTTP handler and the daemon to pass it on till here.
When you've mastered this tutorial so far, it should be a piece of
cake.shows functions also supply the information about the
request as second parameter.The return values are <<"numidx">> to identify the request (more on
where this value comes from later) and the return value of the
function. The old case statement when retrieving the <<"numidx">>
property of a Document is replaces with this new one:
case Numidx of
false ->
{ok, StateCur#ni_daemon{seq=DocSeq}};
_ ->
{ok, NewPos} = numidx:add({Fd, Pos}, {DocId, Numidx}),
{ok, NewBtree} = couch_btree:add(Btree, [{DocId, Numidx}]),
{ok, StateCur#ni_daemon{seq=DocSeq, ni_pos=NewPos,
btree=NewBtree}}
end;
It's almost identical to the old one, only Numidx matches different
patterns.
This is all that needs to be done in the daemon's code. It would be cool if it would work now, but there's still one small step that will hopefully get easier/nicer in a post 1.0 release of CouchDB. Though you can compile and run it, to see a huge crash report when you try a range request. It's always a good exercise trying to understand what's going on just from the log.
It's time to leave the Erlang world for a short time and move on to
JavaScript. In the share/server/ directory is JavaScript code that
is used for the default View/Query Server.
Start with creating a new file called numidx.js:
var Numidx = {
numidx : function(fun, ddoc, args) {
try {
respond(["numidx", fun.apply(ddoc, args)]);
} catch (error) {
respond(error);
}
}
};
Hint: If you need some output from your JavaScript on the server,
you can use log(["Some debug output"]);.
Don't care about the naming of the function yet. The interesting bits are the parameters: 1. the function of you specified in the Design Document 2. the Design Document itself (to set the scope when the function is called) 3. the arguments you've specified in Erlang. In our case it's only the Document we want to process.
The respond() is interesting as well. This is what gets returned to
Erlang. You also remember that <<"numidx">> the
couch_query_servers:ddoc_prompt/3 returns? This is where it comes
from. The second value it returns is value the JavaScript function
returns (for us either false or the numidx).
You could define additional functions that could be used within your numidx function, but we don't need that for our numidx.
The final step on the JavaScript side is the ugliest and will
hopefully be fixed soon. You need to patch parts of CouchDB
itself. Open share/server/loop.js and go to the DDoc declaration.
There's a variable called ddoc_dispatch, just add
"numidx" : Numidx.numidx
to the list of properties. Now you need to add the numidx.js to the
JS_FILE_COMPONENTS in the share/Makefile.am. ./configure and
compile it.
Now put some new Documents into the database with a text property to see if it really works.
curl -X PUT -d '{"text": "This is a test"}' http://localhost:5984/numdocs/doc4
curl -X PUT -d '{"text": "And another one with some more text"}' http://localhost:5984/numdocs/doc_somemore
curl -X PUT -d '{"text": "Cool, it seems to work"}' http://localhost:5984/numdocs/doc_seems
A request for the range 0 to 6:
curl -X GET http://localhost:5984/numdocs/_design/tutorial/_numidx/%5B0,6%5D
should return only doc4 and doc_seems:
{"db":"numdocs","designdoc":"tutorial","result":["doc4","doc_seems"]}
You've learned how to call out to the Query Server and how to access
functions in Design Documents. to apply them to Documents stored in
the database. You can of course download
couch\_httpd\_numidx.erl,
couch_numidx.erl,
loop.js and numidx.js.

The next step seems to be clear. Switch from a hard-coded file path to the same system as the default View Server does. One file per Design Document which is recreated when the Design Document changes.
This sounds easy, but it is not. This is the last part of the "Creating a new indexer" tutorial and it doesn't contain a step by step description what needs to be done. The reason is that the code for it is highly coupled with the View server. I really hope to get it refactored for a post 1.0 CouchDB release, as new indexers will always need similar functionality.
I will only describe how the current View Server works so that you have a chance to get you indexer at least work in similar fashion. I've used a lot a copy and paste for my spatial indexer, I wouldn't call that good coding practice :)
The modules of interest are couch_view_group, couch_view_updater
and of course the View daemon itself couch_view.
You can think of the couch_view_group as a file manager daemon. It
opens/creates files, and provides the correct file descriptors for the
View Daemon. Thee Design Documents get hashed to the right filename
and are put at the right place on the file system. In case of a Design
Document changes this daemon takes care of the new files.
A group is basically a Design Document with all its Views. It's a single file that contains all View indexes and the Back-Index. Whenever a function within a View changes, the whole index needs to be recreated.
In the numidx daemon the data structure is created on the first request, and updated when the database changes. CouchDB's Views have an extra daemon for it. It has several advantages, one is that you can request a View that returns the not yet updated View, but kicks off the update. If another request comes in it won't start the updater again, but either wait until it's finished, or again return the not yet updated View.
The couch_view daemon is the central place for the HTTP handler to
interact with views.
The last part of the tutorial was more of high-level overview of CouchDB's Views. If you read the whole tutorial you should now be able to build you own indexer. There are several that I'd like to see in the future. Of course the spatial index I'm working on, but also indexers based on Graphs would be possible. Or other data structures support multi-dimensional indexing.