20:00:45 <hdl> #startmeeting solr search 7 December
20:00:45 <munin> Meeting started Tue Dec 7 20:00:45 2010 UTC. The chair is hdl. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:45 <munin> Useful Commands: #action #agreed #help #info #idea #link #topic.
20:00:56 <hdl> Hi all
20:01:07 <slef> hi hdl!
20:01:48 <hdl> First we could do a roll call of those present and then begin on the agenda I published from the email.
20:02:02 <hdl> http://wiki.koha-community.org/wiki/7_December_2010
20:02:26 * hdl = Henri-Damien LAURENT, BibLibre
20:02:27 <thd> Thomas Dukleth, Agogme, New York City
20:02:39 <druthb> == D Ruth Bavousett, Washington DC
20:02:40 * slef = MJ Ray, worker-owner, software.coop
20:02:43 * clrh Claire Hernandez, Biblibre
20:02:47 <Colin> Colin Campbell PTFS-Europe
20:02:57 <wizzyrea> Liz Rea, NEKLS, Lurking
20:03:17 <cait> Katrin Fischer, BSZ
20:03:43 <owen> Owen Leonard, Nelsonville Public Library, lurking
20:03:44 <cfouts> Clay Fouts, PTFS/LibLime
20:03:54 <darling> Reed Wade, Catalyst
20:04:00 * jcamins_a = Jared Camins-Esakov, C & P Bibliography Services, lurking
20:04:55 <hdl> ok. If anyone arrives late please chime in.
20:05:18 <hdl> #topic Why we investigated solr
20:05:37 <hdl> The reason we investigated was that
20:05:52 <hdl> everyone agreed that C4::Search needed deep revamping.
20:06:20 <hdl> And we have had many problems with the current implementation of zebra in Koha
20:06:21 <chris_n> Chris Nighswonger, FBC
20:06:43 <hdl> problems with the hard coded indexes
20:07:17 <hdl> Problem with the untranslatable strings from search
20:07:47 <hdl> Problem with the inability for users to order the indexes on advanced search page.
20:07:59 <hdl> Problem also with the search engine itself :
20:08:32 <hdl> It proved quite a nightmare to know if indexing was ok.
20:08:54 <hdl> It also proved that some announced features were not quite meeting the demand.
20:09:32 <hdl> (facets, but also icu is quite disappointing....
since all the features in RPN are not embedded.
20:09:40 <hdl> Completeness for instance.
20:09:49 <hdl> And left truncation).
20:10:01 <hdl> All this has already been said...
20:10:22 <hdl> But I want to give the context that prompted us to investigate.
20:10:44 <clrh> #link http://www.biblibre.com/en/blog/entry/solr-developments-for-koha
20:10:44 <hdl> any problem with what i am stating... ?
20:10:48 <thd> The untranslatable strings problem is mistaken if the query is intercepted for translation or if a translation is provided for CCL, CQL, Solr/Lucene, etc. query configuration.
20:11:23 <hdl> well when you go on the results page.
20:11:34 <hdl> it is truly ccl search that is printed.
20:11:50 <hdl> ti,wrdl=huckleberry finn
20:11:58 <hdl> Is not user friendly.
20:12:46 <hdl> And moreover, some of our current usage of zebra is announced to be obsolete...
20:12:58 <thd> Certainly, but the user unfriendliness could be factored out and translated to the unfriendly form if the query is intercepted.
20:13:31 <hdl> we are using grs1 when DOM indexing is favoured... but would have required much time to build...
20:13:53 <slef> Well, I think my problems with solr and my doubts with some of the accusations against zebra I've stated on the list, so I won't repeat them here unless you want them repeated.
20:14:17 <hdl> And most of the embedded xslt are cut out for USMARC where UNIMARC is not supported.
20:14:49 <slef> I agree that some of the problems are with our usage of it, though, so C4::Search probably must change anyway.
20:15:03 <hdl> ok slef... I will talk about what solr would bring along. and what we did...
20:15:25 <hdl> #topic solr : what it brings along and what we did
20:15:29 <thd> Some of the problems are problems of adding UNIMARC support which is a problem that does not go away by using Solr/Lucene.
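[Editor's note: the exchange between hdl and thd about the raw CCL string on the results page can be illustrated with a minimal sketch. It assumes the query is intercepted before display and that a table of index labels exists; `INDEX_LABELS` and `friendly_query` are hypothetical names for illustration, not Koha code. The label table is the part that would be translated.]

```python
# Hypothetical label table (assumed for illustration): index code -> display label.
# A translated interface would supply its own table, e.g. {"ti": "Titre"}.
INDEX_LABELS = {"ti": "Title", "au": "Author", "su": "Subject"}


def friendly_query(ccl: str, labels: dict = INDEX_LABELS) -> str:
    """Render a simple 'index,qualifier=terms' string in a readable form."""
    qualifiers, sep, terms = ccl.partition("=")
    if not sep:
        # No index qualifier at all: show the terms unchanged.
        return ccl.strip()
    # The first qualifier is the index code; structure hints such as
    # 'wrdl' are dropped from the displayed form.
    index = qualifiers.split(",")[0].strip()
    label = labels.get(index, index)  # unknown codes fall back to the code itself
    return f"{label}: {terms.strip()}"


print(friendly_query("ti,wrdl=huckleberry finn"))
```

[This handles only single-clause queries; boolean combinations would need a real parser.]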
20:15:42 <darling> slef, hdl -- depends on if the topic is "swap/deprecate zebra for solr" or "add support for solr as an alt search engine"
20:15:52 <Nate> Nate Curulla, ByWater Solutions... Sorry for the lateness
20:16:05 <hdl> thd part of it yes.
20:16:35 <hdl> Solr brings along a widely used search engine.
20:16:50 <hdl> With the ability to do full text indexing of documents.
20:16:58 <hdl> And with many built in features.
20:17:03 <hdl> utf8 support
20:17:08 <hdl> facets.
20:17:20 <hdl> And all the things we explain in the blog.
20:17:21 <rhcl> rhcl = Greg Lawson, Rolling Hills Consolidated Library, lurker
20:17:58 <hdl> darling: I acknowledge that solr should be an option.
20:18:19 <hdl> But change in a search engine in an ILS is quite strategic.
20:18:19 <ibot> hdl: that doesn't look right
20:18:53 * druthb giggles at ibot.
20:18:53 <thd> Solr/Lucene uses Java ICU which is based on the same core ICU code as the C ICU used in Zebra.
20:19:07 <hdl> But we have limited time.
20:19:13 <hdl> And limited resources.
20:19:20 <cait> hdl: so you are not implementing it as an option?
20:19:36 <hdl> cait at the moment, no.
20:20:01 <hdl> Because we signed and are due to deliver a product at a specific time.
20:20:13 <hdl> So we made a choice.
20:20:34 <hdl> And we try to do it so that zebra can then be reintroduced.
20:20:40 <cait> I think there are valid concerns about solr - so that it should at least be an option at first
20:20:58 <thd> ICU = International Components for Unicode, a well supported project providing a Unicode programming library supported by IBM etc.
20:21:26 <hdl> thd icu support in zebra is rather poor compared to the icu syntax and possibilities.
20:21:42 <cait> I am concerned about replacing it so fast - as you said it's an important feature of an ILS
20:22:13 <hdl> cait: we are trying to gather all the use cases so that features are not lost.
20:22:38 <hdl> We are building on top of Data::SearchEngine and Data::Pagination
20:22:51 <thd> As hdl identifies, the method of calling the ICU in Zebra is an awkward add on from the point when Zebra had no Unicode support.
20:22:51 <clrh> #link http://search.cpan.org/~gphat/Data-SearchEngine-0.16/lib/Data/SearchEngine.pm
20:23:41 <hdl> Data::SearchEngine can be adapted in order to build RPN and CCL queries and work nicely with zebra.
20:23:47 <hdl> I bet it is doable.
20:23:59 <hdl> But again, we have limited resources.
20:24:03 <hdl> And limited time.
20:24:36 <hdl> We work like crazy in order to make the whole change and have some very promising results.
20:25:03 <clrh> #link http://catalogue.solr.biblibre.com/
20:25:14 <clrh> #link http://solr.biblibre.com/
20:25:17 <hdl> the two interfaces we built are there for you to try, test.
20:25:30 <hdl> On intranet there is a demo/demo account.
20:25:58 <thd> There is an English expression, "if you break it, you have bought it", which must have an equivalent in many languages. However, most of us recognise that the work should be done and is of importance to everyone.
20:26:23 <drulm> Hello. git version question: I am drawing from git://git.koha-community.org/koha.git
20:26:32 <hdl> You can then see how the indexes can be edited and queried.
20:26:35 <wizzyrea> drulm: we're in a meeting
20:26:37 <slef> drulm: /msg me please, a meeting is on.
20:26:54 <hdl> and then you can see the page for indexes :
20:27:25 <clrh> #link http://solr.biblibre.com/cgi-bin/koha/solr/indexes.pl
20:27:28 <hdl> you can add some indexes, and link them to the index user define.
20:27:43 <hdl> And there are some plugins that we can add.
20:27:59 <hdl> We already designed plugins to search for rejected forms.
20:28:17 <hdl> And usage of authorities in biblio records.
20:28:29 <hdl> At the moment,
20:28:50 <hdl> we gather use cases so as not to lose any query that we could do in zebra.
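[Editor's note: hdl's point that one abstraction layer could emit queries for either engine can be sketched as follows. This is a Python illustration under stated assumptions, not the Data::SearchEngine Perl API; the class names and the field/qualifier tables are hypothetical. The Solr side uses the standard Lucene `field:"phrase"` syntax and the CCL side reuses the `ti,wrdl=` form quoted earlier in the meeting. The two outputs are not claimed to be semantically equivalent.]

```python
from abc import ABC, abstractmethod


class QueryBuilder(ABC):
    """Engine-neutral interface: callers name a logical index, not an engine field."""

    @abstractmethod
    def build(self, index: str, terms: str) -> str:
        ...


class SolrQueryBuilder(QueryBuilder):
    # Hypothetical mapping from logical index names to Solr field names.
    FIELDS = {"title": "title", "author": "author"}

    def build(self, index: str, terms: str) -> str:
        # Escape backslashes and double quotes so the phrase stays well formed.
        escaped = terms.replace("\\", "\\\\").replace('"', '\\"')
        return f'{self.FIELDS[index]}:"{escaped}"'


class CclQueryBuilder(QueryBuilder):
    # Hypothetical mapping from logical index names to CCL qualifiers.
    QUALIFIERS = {"title": "ti", "author": "au"}

    def build(self, index: str, terms: str) -> str:
        return f"{self.QUALIFIERS[index]},wrdl={terms}"


# The calling code stays the same whichever backend is configured.
for builder in (SolrQueryBuilder(), CclQueryBuilder()):
    print(builder.build("title", "huckleberry finn"))
```

[Keeping the logical index names separate from engine field names is what would let a Zebra backend be added later without touching the callers.]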
20:29:20 <thd> hdl: I am concerned that CCL, Pazpar2, and Zebra support should not be an either that or Solr/Lucene option. We need CCL and Pazpar2 for metasearch and we currently need Zebra for a Z39.50/SRU server.
20:29:23 <clrh> #link https://spreadsheets.google.com/pub?key=0AuZF5Y_c4pIxdEVzTjUtUGFoQnFpSkpfbTU5Ykc3b2c&hl=en&output=html
20:29:59 <hdl> thd : in the commits of the wip/solr branch of our git,
20:30:23 <hdl> you can see the first commits for a Z3950 search engine on top of solr...
20:30:30 <hdl> using SimpleServer
20:30:46 <hdl> At the moment, the queries decoded are only simple queries.
20:31:06 <clrh> #link http://git.biblibre.com/?p=koha;a=blob;f=misc/z3950.pl;h=b373b7308f7133ab3f1a6698e2089a3cd7263940;hb=refs/heads/wip/solr
20:31:06 <hdl> But we will work by the end of the year to have something.
20:31:28 <hdl> more powerful and answering the needs.
20:31:34 <thd> hdl: It is the limitation of query support which concerns me for the Z39.50/SRU server as we have to write that ourselves.
20:31:56 <thd> ... for SimpleServer,
20:32:25 <hdl> Well, since JZ3950 was proven to be quite ... disappointing.
20:32:49 <hdl> We will build a grammar....
20:32:55 <thd> hdl: Do you mean JZKit?
20:33:04 <hdl> thd: yes.
20:33:37 <hdl> It seems that memory consumption explodes.
20:34:09 <hdl> thd: have you been in contact with the company working on that ?
20:34:21 <hdl> Is there some fix for that ?
20:34:36 <thd> I have investigated JZKit deeply as well as options from Index Data.
20:35:10 <thd> I do not have a full set of responses but have been writing a detailed report for the RFC.
20:35:31 <thd> The only hope of fixing issues is a support contract.
20:36:21 <thd> Even some cryptic options in SimpleServer would need a support contract to understand well.
20:36:48 <hdl> Well we envision a support contract, but more on solr issues than on zebra or z3950.
20:36:49 <thd> Simple2Zoom is another option in principle for a Z39.50/SRU server.
20:36:49 <slef> On the exploding memory consumption theme: my big concern with solr is *its* memory usage (reportedly at least a gigabyte, more than all of koha 3.0 including a base OS, which would mean more expensive servers would be needed for koha libraries). How is solr memory usage on biblibre servers?
20:37:35 <hdl> well, it has been quite slow. even indexing 300000 biblios.
20:38:05 <hdl> We never stressed it severely though.
20:38:14 <hdl> But it is a thing we will do.
20:38:44 <thd> I personally favour SimpleServer at the moment but Ian Ibbotson from Knowledge Integration and Sebastian Hammer from Index Data indirectly alerted me to an important problem for having a Z39.50/SRU server.
20:38:44 <darling> for what it's worth, at Catalyst we use it from time to time -- there are setups that run very smoothly and make no trouble; there are some where it sometimes goes insane
20:39:14 <darling> we have a practise of putting each solr instance on an isolated vm for that reason
20:40:21 <thd> hdl: How do you intend to return full MARC records for a Z39.50/SRU server when using Solr/Lucene?
20:40:26 <Brooke> what would be a reasonable benchmark?
20:40:35 <Colin> Did you evaluate any other options apart from Solr?
20:40:49 <hdl> indexing marcxml data or even iso2709
20:40:52 <Brooke> and how do we compare apples to apples with the current iterations absent performance guidelines?
20:41:09 <hdl> Colin: other options could be Nutch or so...
20:41:12 <darling> I'm working on a project right now that's indexing about 40k documents of about 5k of structured text each and it's so far not been trouble -- and it's very fast and sweet -- but tricky to configure and gives unclear error messages when it fails -- I like it but don't yet trust it
20:41:23 <hdl> But solr is quite a standard nowadays.
20:41:44 <thd> hdl: Solr/Lucene corrupts ISO2709 records. Lucene has a binary storage type which Solr/Lucene does not.
20:41:57 <slef> Who has declared it a standard?
20:42:29 <cait> hdl: I think you cannot argue that everybody else is using it
20:42:31 <thd> hdl: The experience of BlackLight is very informative on the issue of storing full bibliographic records in Solr/Lucene.
20:42:31 <darling> slef, it's widely used enough that I would call it a standardish solution
20:42:34 <hdl> slef: at least, it is widespread. and is doing quite a good job "out of the box."
20:42:44 <cait> I really have a bad feeling about a hasty replacement
20:42:58 <hdl> and i know at least 4 solutions using solr.
20:43:17 <hdl> cait it may not be for 3.4
20:43:20 <cait> we should start with an option, if most people start using it we can drop zebra perhaps sometime in the future
20:43:34 <slef> darling: those are the same sort of arguments which lead to calling Windows "standard" which I don't ;-)
20:43:40 <thd> hdl: Which 4 solutions do you mean?
20:43:44 <darling> slef, fair enough
20:43:49 <hdl> But I think that we had to share the point we achieved.
20:44:06 <Brooke> cait++
20:44:21 <druthb> cait++
20:44:22 <hdl> Vufind, BlackLight, Drupal and XC (but it is Drupal based)
20:44:41 <thd> slef: MS Windows standard :)
20:44:46 <hdl> We had no time doing that.
20:45:16 <cait> all those are discovery interfaces - no ILS
20:45:20 <slef> drupal uses solr? I thought it was some extension module
20:45:23 <hdl> And it is really time consuming without any tests... or use case... to build regression tests.
20:45:40 <clrh> it is an extension module more and more used slef
20:45:53 <thd> hdl: in the case of the OPACs such as VuFind and BlackLight remember that they are merely OPACs with the real library system and its own OPAC underneath.
20:46:00 <darling> cait++, and I would expect that the integration would be in such a way that would facilitate slipping in things other than solr later (or sooner, like for smaller setups)
20:46:32 <hdl> darling: Data::SearchEngine::Zebra... COULD be written.
20:46:37 <slef> clrh: which still doesn't make it drupal.
20:46:45 <darling> (our use of solr here is mostly for drupal sites we run)
20:46:52 <hdl> But we do not have any resources for that.
20:47:05 <hdl> C4::Search had to be revamped...
20:47:28 <hdl> And we began the work. And we come to you in order to show you what we achieved.
20:47:43 <hdl> And say we are at this point of the road...
20:48:09 <hdl> But we won't be able to do the whole lot alone...
20:48:15 <hdl> And it would just be insane.
20:48:15 <thd> hdl: I will add a more important case to your count which at least demonstrates scalability in a relatively simple configuration if money is spent on hardware. Wikipedia uses Lucene indexing via some extensions. If it scales for Wikipedia it can really scale with the hardware caveat.
20:49:43 <thd> No one should expect BibLibre to do this alone any more than we expected LibLime to add Zebra support alone.
20:49:56 <Brooke> thd++
20:50:06 <hdl> Installer should be ok with solr now. We have an installing option for solr core support.
20:50:12 <cait> thd: noZebra was kept as an option
20:50:20 <hdl> (but yes, only solr)
20:50:20 <Brooke> what can't be expected is collaboration for a BibLibre deadline when other individuals have their own timeframes and projects.
20:50:23 <thd> LibLime funded support contracts for Zebra on their own but that was not necessarily a reasonable option.
20:50:34 <hdl> cait : but quite rapidly deprecated.
20:51:06 <hdl> Brooke: I do not ask for more than what ppl are willing to do.
20:51:27 <hdl> At least just consider what we do...
20:51:30 <Brooke> I am aware of that. It is important to note that distinction, too.
20:51:42 <hdl> And let us work together... rather than in parallel
20:51:56 <Brooke> How do we reconcile profitability with community? Agility with stability?
20:52:14 <Brooke> Today it is solr/zebra, tomorrow it will be something else.
20:52:15 <hdl> use cases and regression tests.
20:52:28 <hdl> look at the ggl page.
20:52:40 <hdl> you have your use cases : add yours
20:52:48 <hdl> you think of a use case.
20:52:52 <hdl> add it
20:53:03 <thd> cait: I agree with the assertion which paul_p has made, that long term support for multiple record indexing models has proven too much for the size of the Koha project support community in the past.
20:53:13 <slef> I'll poll our libraries, but I suspect the concerns of needing java and lots more memory will outweigh the benefits, including some of the claims I questioned without any reply.
20:53:16 <hdl> there are plenty of ways to do it.
20:53:51 <hdl> slef: I will try and answer all the questions you asked.
20:53:54 <slef> I don't think there's any point polling our members for whether we could fund it from our community fund because it involves java, which is no fun.
20:54:06 <slef> What's the ggl page?
20:54:24 <clrh> #link https://spreadsheets.google.com/a/biblibre.com/ccc?key=0AuZF5Y_c4pIxdEVzTjUtUGFoQnFpSkpfbTU5Ykc3b2c&hl=en#gid=1
20:54:32 <thd> cait: I agree with you that we should retain options until we are satisfied that the Solr/Lucene solution is ready to replace Zebra for local indexing.
20:54:39 <clrh> oops maybe bad link
20:54:42 <darling> I have to run off to a mtg right now. My parting thoughts are: solr is tasty but not a simple option, I suspect we will be glad once it's in, and pocket sized installations shouldn't have to use it
20:54:48 <slef> clrh: "Sign in to your account at BibLibre"?
20:54:55 <clrh> #link https://spreadsheets.google.com/pub?key=0AuZF5Y_c4pIxdEVzTjUtUGFoQnFpSkpfbTU5Ykc3b2c&hl=en&output=html
20:54:58 <clrh> better?
20:55:39 <hdl> in the last 5 minutes.
20:55:46 <darling> slef, integrating w/it doesn't mean needing to deal w/java (though app server config matters would be a thing)
20:55:49 <hdl> #topic what could be done
20:55:56 <slef> clrh: how do I add "run on a 512Mb server" to it?
;-)
20:56:00 <hdl> Data::SearchEngine::Zebra
20:56:18 <hdl> could be written with a Data::SearchEngine::Query
20:56:31 <hdl> We could also build forms dynamically
20:56:55 <hdl> We could achieve relevancy.
20:57:24 <hdl> And do some on the fly weighting.
20:57:29 <slef> yeah, this is a basic problem I'm having... what's the incentive for someone to write||fund Data::SearchEngine::Zebra? Zebra currently works for most people most of the time.
20:57:52 <hdl> most people... not in France.
20:58:31 <hdl> slef: have you ever tried to fix a C4::Search bug ?
20:58:40 <hdl> Have you ever dived into that code ?
20:58:57 <hdl> It really needed revamping.
20:59:08 <thd> slef One hopes that the work involved in refactoring Zebra support to not be the only option would be very much less than the work in adding Solr/Lucene support.
20:59:09 <hdl> We began the work.
20:59:23 <hdl> And tried to do something quite SearchEngine independent.
20:59:32 <slef> hdl: I've no memory and searching suggests not.
20:59:52 <hdl> of what ?
21:00:00 <thd> hdl++ SearchEngine independence
21:00:03 <slef> <hdl> slef: have you ever tried to fix a C4::Search bug ?
21:00:09 <cait> hdl: sorry, I think it could be improved, but saying it does not work at all seems not right to me
21:00:16 <slef> thd: sure, but it's still not zero.
21:00:39 <hdl> cait: I do not say it does not work at all.
21:00:45 <hdl> It works.
21:00:46 <thd> slef: Yes, which is why BibLibre is hoping that you will help.
21:01:03 <hdl> But not always. and not on all the features it claims to provide.
21:01:16 <hdl> For instance availability search is not working.
21:02:14 <hdl> We are trying to make it so that what we claim to do, we actually do.
21:03:10 <hdl> It is quite frustrating that we could not come out of this meeting with a real plan... or actions.
21:03:39 <hdl> At least, we wanted to come to you and share all the stuff we did.
21:03:44 <slef> thd: Yes, it does look like BibLibre asking others to work for free.
:-( For reasons explained above, I don't see how to make this paid.
21:03:58 <thd> hdl: Some people have doubted that problems with Zebra when using the ICU are Zebra problems. I could test at least one of those problems readily with my own non-Koha code which has been well tested for character encoding if you provide a Z39.50 server with claimed problems.
21:04:34 <thd> slef: The most significant problem I see reported for Zebra is one which hdl does not repeat clearly.
21:04:42 <slef> It is frustrating that there is no conclusion to this.
21:04:48 <slef> Now I must go elsewhere. Sorry. Bye all.
21:04:48 <thd> slef: There are reports of Zebra failing.
21:05:11 <thd> slef: This is free software. It is never finished. :)
21:05:48 <hdl> No conclusion... Because we have to work and either make that happen or make another thing happe.
21:05:49 <hdl> n
21:06:30 <thd> hdl: I am sorry that I did not have my report ready well in advance of the meeting.
21:06:43 <hdl> any question you have, any comments on the way we did things. you can tell us.
21:06:48 <clrh> we really want to work with you, this is why we are here...
21:06:49 <thd> hdl: I have been very detailed which takes much time.
21:07:03 <clrh> I'll try to be more present on the channel in the coming weeks
21:07:09 <hdl> on list, or by chan.
21:07:38 <hdl> Thanks for your interest. I hope that you can test and send patches or at least see what we achieved.
21:07:46 <thd> hdl: What is the function of the use of YAML in your proof of concept or work in progress especially in bulkmarcimport.pl?
21:08:27 <hdl> thd: let me end the meeting
21:08:33 <hdl> and will answer you.
21:08:37 <thd> OK
21:08:37 <hdl> #endmeeting