10:00:17 #startmeeting solr meeting
10:00:17 Meeting started Wed Dec 15 10:00:17 2010 UTC. The chair is hdl. Information about MeetBot at http://wiki.debian.org/MeetBot.
10:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic.
10:00:19 Hi all
10:00:24 Hi..
10:00:54 let's proceed to a roll call
10:01:11 * hdl Henri-Damien LAURENT, BibLibre
10:01:21 Thomas Dukleth, Agogme, New York City
10:01:21 * clrh Claire Hernandez, BibLibre
10:01:24 irma birchall from CALYX in Sydney
10:01:32 Reed Wade, Catalyst, NZ
10:02:01 Jonathan Druart, BibLibre
10:02:12 Magnus Enger, Libriotech, Norway
10:03:08 any other persons ?
10:03:12 hi miguelxer
10:03:18 hello
10:03:18 hello, miguelxer
10:03:28 it is presentation time
10:03:51 good morning everyone!!, ha
10:04:36 #topic Why taking on solr
10:05:07 I think that this topic has been long advocated...
10:05:35 If you have any questions or doubts on what we said previously and posted on list,
10:05:39 then ask.
10:06:46 #link http://librarypolice.com/koha-meetings/2010/koha.2010-12-07-20.00.log.html
10:07:20 #link http://wiki.koha-community.org/wiki/Switch_to_Solr_RFC
10:07:32 Since there are no questions then we will skip to next topic : what we did and are up to.
10:07:42 no questions one
10:07:45 #link http://www.biblibre.com/en/blog/entry/solr-developments-for-koha
10:07:49 two
10:07:57 three
10:08:08 #topic what is done
10:08:17 Zeno Tajoli, CILEA
10:08:21 Does BibLibre intend to do work towards refactoring existing Koha record indexing and retrieval for use alongside Solr/Lucene and not as an either/or option?
10:08:36 i guess my main concern is support for Koha acting as Z39.50 and SRU server - what is the status on that and Solr?
10:09:02 getting back then.
10:09:26 our work now is not an either or option...
10:09:34 Because of time and resources.
10:09:54 But bricks that we used are flexible.
10:10:06 And I think that we could wrap ZOOM in them
10:10:30 That would be excellent not only for Koha...
But also for Data::SearchEngine.
10:11:02 magnus: #link http://git.biblibre.com/?p=koha;a=blob;f=misc/z3950.pl;hb=refs/heads/wip/solr misc/z3950.pl shows BibLibre work on a Simple2ZOOM gateway.
10:11:04 And we would be grateful if the community would help us in achieving that.
10:11:33 #link https://github.com/eiro/rg-z3950-rpn/ wip about rpn grammar and z3950 server development
10:11:59 #help on building Data::SearchEngine::ZOOM and Data::SearchEngine::Query::PQF via ZOOM wrappers
10:12:04 we have to match both together and translate into solr requests
10:12:53 magnus: we are doing some work on that... And will use 17 Mo of real use cases z3950 RPN queries to validate.
10:13:15 (well not all of those are interesting... But we plan to take out the most relevant ones)
10:13:35 What we achieved
10:13:57 - advanced search is now working pretty well.
10:14:14 - Configuration of indexes via database is now ok.
10:14:27 - display items information is now OK.
10:14:42 as you can see and test on solr.biblibre.com
10:14:48 wow that is impressive
10:14:49 or catalogue.solr.biblibre.com
10:15:04 hdl: 17 mo?
10:15:16 mega bytes.
10:15:20 I have a suggestion about Zebra in Koha
10:15:35 listening.
10:16:05 One of the problems that you report is about Facets
10:16:32 why not insert Pazpar2 as mandatory and use its facets ?
10:16:46 Ok, is it a dead road ?
10:17:25 pazpar2 could have been chosen... We chose solr because the community is much more active.
10:17:26 tajoli: Pazpar2 requires CCL support which BibLibre have removed in their implementation.
10:17:37 tajoli: if you want accurate facets with Pazpar2, you need to send to it the whole resultset
10:17:47 it's not a solution
10:18:04 And CCL is not a standard... It needs configuration
10:18:21 and all the setups are a little bit different.
10:18:33 that's true. we have proved it.
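The step described above as "match both together and translate into solr requests" can be pictured with a minimal sketch. It is not the BibLibre implementation: it handles only a small subset of PQF (the prefix query format used for Z39.50 RPN queries), the Solr field names in `USE_TO_FIELD` are hypothetical, and only the Bib-1 use attribute numbers (4 title, 7 ISBN, 21 subject, 1003 author, 1016 any) are standard. It assumes `@attr` appears directly before a term and does not escape Solr special characters.

```python
import shlex

# Bib-1 "use" attribute numbers are standard; the Solr field names are
# hypothetical and would come from the local index configuration.
USE_TO_FIELD = {
    "4": "title",
    "7": "isbn",
    "21": "subject",
    "1003": "author",
    "1016": "text",
}

# PQF boolean operators and the Solr/Lucene operators they map to.
# In PQF, "@not A B" means "A and not B".
OPERATORS = {"@and": "AND", "@or": "OR", "@not": "AND NOT"}


def _parse(tokens):
    """Consume one prefix-notation expression from the token list."""
    token = tokens.pop(0)
    if token in OPERATORS:
        left = _parse(tokens)
        right = _parse(tokens)
        return f"({left} {OPERATORS[token]} {right})"
    field = "text"  # hypothetical default field when no use attribute is given
    while token == "@attr":
        attr_type, _, value = tokens.pop(0).partition("=")
        if attr_type == "1":  # type 1 is the Bib-1 "use" attribute
            field = USE_TO_FIELD.get(value, "text")
        token = tokens.pop(0)
    term = f'"{token}"' if " " in token else token
    return f"{field}:{term}"


def pqf_to_solr(pqf):
    """Translate a small subset of PQF into a Solr query string."""
    return _parse(shlex.split(pqf))


print(pqf_to_solr('@and @attr 1=4 "open source" @attr 1=1003 hatcher'))
```

A real gateway would also need attribute scoping (`@attr` applied to a whole subtree), truncation and relation attributes, and escaping, which is where a corpus of real RPN queries would be useful for validation.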
10:19:03 tajoli: #link http://bugzilla.indexdata.dk/show_bug.cgi?id=2048 which is a bug for no facets with the ICU in Zebra which Index Data will only fix with a support contract.
10:19:03 Bug 2048: blocker, PATCH-Sent, ---, gmcharlt, CLOSED FIXED, Kohazebraqueue daemon MAJOR issue
10:19:05 fredericd: miguelxer and josepedro can you present yourselves for the record ?
10:20:03 What we have problems with and are willing to do ...
10:20:16 Z3950 support on top of solr.
10:21:06 clrh mentioned that we worked on the Z3950 grammar to get the whole of it (at least what is presented on the Index Data website which was the only resource we got that from)
10:21:33 #action Z3950 support on top of solr
10:21:40 we started with pazpar but we thought that is not the best solution
10:22:06 josepedro was that with Koha 3?
10:22:06 #action improving the indexing time.
10:22:30 yes
10:22:35 indexing speed is still quite slow compared to zebra...
10:22:43 But we are working on two ideas.
10:22:59 #link http://www.indexdata.com/yaz/doc/tools.html
10:23:38 - DIH Marc
10:23:41 #link http://lucene.472066.n3.nabble.com/Getting-started-with-DIH-td504691.html
10:23:48 (posted from erik hatcher...)
10:23:50 and SRU?
10:24:21 since BibLibre started with solr, we started to review their code comparing with VuFind
10:24:38 magnus: I guess and hope that SimpleServer will also cope with SRU
10:24:48 ok
10:25:09 - forking and sending parallel batches to index to solr.
10:25:31 #link http://www.nntp.perl.org/group/perl.perl4lib/2010/12/msg2836.html about another idea for improving => multi-threaded call
10:25:59 magnus: SimpleServer will map CQL to PQF for SRU.
10:26:15 we are reviewing the sru/srw libs, too
10:26:32 josepedro: what have you found ?
10:26:32 libs for dspace
10:26:48 josepedro: what have you figured out from your audits ?
10:27:23 and code reviews ?
10:28:06 we think that the implementation is not going to be very difficult with enough time.
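The second indexing idea above (sending batches to solr in parallel) can be sketched as follows. This is an illustration under stated assumptions, not the code discussed in the meeting: it uses a thread pool rather than forked processes, and `send_batch` is a hypothetical callable supplied by the caller that posts one batch of documents to Solr and returns how many it sent.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(records, size):
    """Yield consecutive slices of `records`, each at most `size` long."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def index_in_parallel(records, send_batch, batch_size=1000, workers=4):
    """Send batches concurrently and return the total number of documents sent.

    `send_batch` is a hypothetical function (batch -> int) that would post
    the batch to the Solr update handler; it is injected so the batching
    logic can be exercised without a running Solr instance.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the batches in order and re-raises any exception
        # raised inside send_batch when its result is consumed.
        return sum(pool.map(send_batch, batches(records, batch_size)))


if __name__ == "__main__":
    # Stand-in sender that only counts documents.
    print(index_in_parallel(list(range(2500)), len, batch_size=1000, workers=3))
```

Threads fit here because the work is mostly waiting on HTTP; if building the documents from MARC records dominates, separate processes (closer to the forking idea) may pay off more.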
10:28:54 Do you have a plan or could devote people to work with us ?
10:29:08 anyway, our main aim is a facets solution
10:30:02 yes, we would like to collaborate with you
10:30:12 josepedro: it looks like we have a nice solution... With true facets. needs some more work on thorough configuration of indexes.
10:30:39 and adapting the solr configuration for MARC21 (
10:30:51 we based our work on UNIMARC)
10:31:29 have you seen something about printing solr records directly??
10:31:48 without looking in the database??
10:32:04 Plugin system we implemented for solr (that could be also used for zebra) is quite handy
10:32:42 josepedro: I do not understand.
10:33:15 We are using records. getting them from koha and processing information to send to solr.
10:33:30 hdl: Where did you find a Solr/Lucene configuration for MARC 21?
10:33:40 Well, as I have understood the main problems are Zebra + ICU and facets.
10:34:03 For me setting up Zebra is not a problem
10:34:11 when you get the solr records, you search them in the database instead of printing them directly
10:34:33 and realtime indexing works (with a daily reboot)
10:34:43 tajoli: Zebra has a few problems but we should be able to have both Zebra and Solr/Lucene together.
10:35:35 But I understand that Zebra + ICU is mandatory with more than one charset (like in France).
10:36:04 whatever quite big library you are.
10:36:08 tajoli: That is exactly how BibLibre came to their problem.
10:36:25 even small have some special books
10:36:37 in Hebrew, Arabic... and Georgian..
10:36:38 tajoli: As hdl states big libraries need full Unicode support.
10:36:59 so what is the real question here? no one seems opposed to solr as such, but there are some good reasons for keeping zebra around too. as long as solr is introduced as an option along side zebra everyone is happy, right?
10:37:00 This is how we ... and the whole community came into that problem.
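For readers unfamiliar with what "true facets" means on the Solr side: facet counts are computed by the server over the whole result set, and are requested with a few extra parameters on the search URL. A minimal sketch of building such a request is below. The parameter names (`q`, `rows`, `wt`, `facet`, `facet.field`, `facet.mincount`) are standard Solr ones; the base URL and the field names are hypothetical.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, rows=20):
    """Build a Solr select URL that asks for facet counts on some fields."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),  # skip facet values with no matching record
    ]
    # facet.field is repeated once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local instance and field names.
print(facet_query_url("http://localhost:8983/solr", "title:history", ["author", "subject"]))
```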
10:37:04 And Zebra + ICU doesn't work
10:37:07 tajoli: Zebra is fixable but with support money.
10:37:23 tajoli: koha 3.0 was claimed to support full utf8 search
10:37:43 magnus: yes. Sure.
10:37:56 our main problem is that we have limited resources.
10:38:12 Would BibLibre not at least consider abstracting the search calls so that others attempting to reintroduce Zebra, Pazpar2, etc. support on top of BibLibre's Solr/Lucene work would not need to rewrite BibLibre's Solr/Lucene work for better abstraction?
10:38:12 We are willing to share ideas and development.
10:38:49 Well, what I want to say is that CILEA can TRY to help biblibre to develop an abstract search call interface
10:38:56 there's a separate project here and I think that's 'make koha support various search back ends'
10:39:05 sorry Biblibre
10:39:54 Well reed it is not much separate. I think it could be built on top of what we did.
10:39:54 But pointing out that Zebra continues to have these problems:
10:39:59 hdl: we all have limited resources
10:40:15 -- difficult indexes setup
10:40:20 -- no ICU
10:40:31 -- bad facets
10:40:35 but
10:40:42 Less RAM to use
10:40:57 For us this is the key point
10:41:15 hdl, good to hear -- but still is a thing that sounds like it needs additional attention
10:41:17 I have to see how much ram is needed with solr, didn't test anymore
10:41:17 We can't ask to improve RAM requests
10:41:20 big CPU consumption
10:41:45 big CPU consumption with solr ?
10:41:48 big CPU consumption is for zebra.
10:42:06 are you sure ?
10:42:14 absolutely.
10:42:15 tajoli: Zebra certainly has ICU but it does not work for scan queries for facets nor truncation other than right truncation.
10:42:24 I don't see it.
10:42:42 If you read some logs about performance improvements, mason pointed out
10:42:55 that you needed to set zebra on a different machine.
10:42:55 Ok, thank you.
10:43:23 reed: I always said that we would like to build that.
10:43:32 reed: but we cannot do that alone.
10:43:48 Refactoring the C4::Search was a priority for us.
10:43:50 solr ram reqs will vary depending on updates and number of indexes and searches and catalog size and it'll be a few years before we have some stable config advice
10:43:59 We chose the best bricks for that.
10:44:01 (I might be exaggerating the case a little)
10:44:09 With Solr as option I don't suggest to use Zebra with ICU
10:44:58 tajoli said he would help us in making solr an option... any other persons ,
10:44:59 ?
10:45:15 hdl: Do you understand the problem that anyone starting from your work on C4::Search to add other non-Solr/Lucene options would need to rewrite all search calls to keep your Solr/Lucene work?
10:45:30 i do not like the sound of "others attempting to reintroduce Zebra, Pazpar2, etc. support on top of BibLibre's Solr/Lucene work" - sure biblibre has put a lot into this, but why should "others" have to re-implement something that is working (although not perfectly) today?
10:46:06 I should have s/on top/along side/
10:46:20 magnus: C4::Search refactoring was not set by BibLibre.
10:46:33 hdl?
10:46:33 well, hdl is in France. France is in a galaxy far, far away.
10:46:33 and is a point in 3.4
10:46:47 forget hdl
10:46:48 hdl: I forgot hdl
10:46:52 ahhh
10:47:31 magnus: we worked on that... And for solr integration... because it fixed many problems at once...
10:47:42 And would allow better end user experience.
10:47:52 hdl: However, the particular implementation of refactoring C4::Search is BibLibre's work.
10:48:10 I confirm, I will TRY to help BibLibre to have Solr and Zebra as index tool in Koha.
10:48:15 We are willing to add advanced search customization from administration.
10:48:38 hdl: Substituting one record indexing system for another is not refactoring as such.
10:48:47 But not in the same install, as an option to select in install
10:49:25 clearly with Zebra none of those options:
10:49:30 tajoli, yeah, that sounds sensible
10:49:35 -- No ICU
10:49:51 -- No advanced search customization from administration
10:50:00 -- No improvement on facets
10:50:13 thd: show me any other code that works as much as what we did. And I will be happy
10:50:43 No indexes from administration
10:51:05 No checks on data
10:51:06 #action work in pairs with CILEA for zebra as an option implementation
10:51:17 hdl: I am trying to understand your last post.
10:51:22 Etc.
10:51:31 Zebra as is today
10:51:53 and proprietary LMS already offer (or say they do) facet searching and truncation etc.
10:52:28 and many are using solr internally
10:53:07 * slef = MJ Ray, worker-owner of software.coop
10:53:13 #help new ideas for plugins to add so that the indexing could be better.
10:53:26 hi MJ
10:53:40 late ...
10:53:59 but here :)
10:54:08 dentist, unavoidable
10:54:23 hdl: I think I have understood your post about comparable code but stating that other work is inadequate should not be a basis for not attempting to develop a better model than other work.
10:54:35 #welovethenhs but it does mean I'm reluctant to waste public money by moving appointments
10:54:41 josepedro: miguelxer we would appreciate your feedback
10:55:03 on the review.
10:55:21 hdl: I do not question that much of the best work in Koha is work from BibLibre and Paul Poulain's business previously.
10:55:57 Can I add an action from xercode as of code review on what we did ?
10:56:19 thd: it is not a question of comparison.
10:56:23 hdl: no one is denying that biblibre is doing good work here - it just seems odd to me that one of the biggest companies should introduce new "things" that break old "things" that a lot of people still want...
10:56:30 There is no other code to compare.
10:57:04 magnus: we do not want to break...
10:57:13 good :-)
10:57:15 But to build on safer ground.
10:57:24 we do not understand
10:57:43 hdl: safer ground?
10:57:55 We would like to have your feedback from the code review you did.
10:58:27 hdl: I have been typing furiously on that since last night in addition to other days.
10:58:41 thd: more abstract bricks so that it is more flexible.
10:59:05 Have anything that worked before working with the new system.
10:59:52 And then... use that abstraction to reintroduce options
11:00:14 hdl: Yet, not everything that worked before would work with the new system, otherwise we would merely be busy praising your effort without these qualifications.
11:00:40 We are gathering use cases of searches that worked in koha previously.
11:00:41 hdl: Do not mistake that I do praise BibLibre's work.
11:00:46 #link https://spreadsheets.google.com/pub?key=0AuZF5Y_c4pIxdEVzTjUtUGFoQnFpSkpfbTU5Ykc3b2c&hl=en&output=html
11:01:11 #help please try and contribute yours
11:01:14 but not use cases like "run in less than a gig" which are also a vital concern
11:01:38 are you having servers with less than one gig ?
11:01:43 we consider it a great job but we think that there are several points that need to be reviewed deeply.
11:01:49 yes, lots of our libraries have sub-gig servers
11:01:51 hdl: One thing which would have worked on Koha using Pazpar2 which needs CCL is metasearch.
11:02:01 for example, the last one I mentioned to you
11:02:13 ok josepedro can you send us a mail with your points?
11:02:13 hdl: What support do you envision for metasearch?
11:02:48 solr has internal support for metasearch.
11:03:05 hdl: To Z39.50 servers?
11:03:19 hdl: The co-op may be unusual in that we support almost as many self-hosted libraries as shared-hosted ones, but I would expect a lot of independent libraries to be worried by the increased resource demands of solr too.
11:04:07 hdl: Solr/Lucene is not the API for library databases while Z39.50/SRU is.
11:04:12 * magnus agrees with slef
11:04:15 yes, no problem. But we have already sent you something about this.
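The "more abstract bricks" idea above (one search interface, with the engine chosen as an install-time option) can be illustrated with a minimal sketch. Every name here is hypothetical: `SearchBackend`, `register_backend`, `get_backend` and the toy `InMemoryBackend` are illustrations only, not Koha's C4::Search nor the Data::SearchEngine API, and a real Zebra or Solr adapter would replace the in-memory stand-in.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical interface that the rest of the application would call."""

    @abstractmethod
    def search(self, query, limit=20):
        """Return a list of matching record ids."""


class InMemoryBackend(SearchBackend):
    """Toy stand-in for a real engine adapter (Zebra, Solr, ...)."""

    def __init__(self, records):
        self.records = records  # maps record id -> searchable text

    def search(self, query, limit=20):
        needle = query.lower()
        hits = [rid for rid, text in self.records.items() if needle in text.lower()]
        return hits[:limit]


# Registry of available engines; the install-time option picks one by name.
_BACKENDS = {}


def register_backend(name, factory):
    _BACKENDS[name] = factory


def get_backend(name, **config):
    """Instantiate the engine selected in the (hypothetical) configuration."""
    return _BACKENDS[name](**config)


register_backend("memory", InMemoryBackend)

if __name__ == "__main__":
    engine = get_backend("memory", records={1: "Solr in practice", 2: "Zebra guide"})
    print(engine.search("solr"))
```

With this shape, calling code depends only on `search`, so adding or keeping a second engine means writing one more adapter rather than rewriting every search call.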
11:04:29 slef: can you then be accurate in your demands ?
11:04:41 Those libraries surely do not have 300000 records.
11:05:02 And it would be quite nonsense to pretend that koha3.0 works in that context.
11:05:21 hdl: reportedly (see link I added to RFC), solr defaults to a 1Gb RAM usage.
11:05:37 hdl: Do you find database size limitations for Zebra?
11:06:18 #action josepedro send a mail with code reviews.
11:06:29 again, it is still work in progress.
11:06:35 We can help.
11:06:44 We are willing to receive help.
11:06:47 slef, hdl - solr is just not going to be viable in small installations, it can't become a requirement for using koha
11:06:49 Even comments.
11:06:51 hdl: I don't know how many records, but I suspect most koha libraries are smaller than that. Solr may be needed for the top 10% of libraries, but we cannot let 10% of libraries increase expense for the 90% unnecessarily, can we?
11:07:13 snap
11:07:42 hdl: I think you missed a 0 in 3 million if you intended 3 million.
11:07:55 slef: if there is an abstraction layer you will have no problems.
11:08:21 thd: no. 300,000 with less than 8Gb is not a viable option.
11:08:43 with koha3.2
11:08:53 reed++
11:09:03 redd++
11:09:09 reed++ # sorry
11:09:23 slef: reed we are willing to help but we cannot do that alone.
11:09:25 hdl: wow, I have only tested very small record sets.
11:09:32 tajoli proposed to help.
11:09:33 (I don't know OTTOMH, that is)
11:09:43 yes, I confirm
11:09:46 and that is fine.
11:09:53 hdl, right -- was going to say that I don't think you expect it should be a requirement for koha
11:10:03 We will try and help him.
11:10:07 hdl: Do you have a comparison of RAM requirements in your Solr/Lucene test?
11:10:19 nope thd, I'll try to provide it
11:10:43 #action provide a comparison of RAM requirements between zebra and solr
11:10:58 thd: to be honest... it would require doing multiple tests.
11:11:18 hdl: how multiple?
11:11:21 I am sure that crosswalking records in zebra is also ram demanding.
11:11:26 From our point of view, Koha has 3 big problems: 1- Facets. 2- Abstraction. 3- ModPerl. At present Zebra does not meet our expectations about facets, so we believe SOLR is the best solution and we would like to collaborate with BibLibre to develop this solution.
11:11:51 josepedro: Mod Perl Plack will be another meeting.
11:12:00 re: ram requirements --- my prediction is that solr schema tuning is going to be a very long process and so any profiling done now is likely to go out of date fast
11:12:15 josepedro: Please keep contributing...
11:12:50 any other questions ?
11:13:01 No
11:13:06 I propose to look at the MARC21 implications with solr - adapting the solr configuration for MARC21
11:13:22 reed: Improvements in the sophistication of indexing may actually greatly increase RAM requirements for all options.
11:13:30 agree
11:13:53 #action irmaB build an adaptation of MARC21 for solr
11:14:23 #action BibLibre make a solr instance for MARC21 and publicise that to Irma
11:14:32 yes.
11:14:39 hdl: Did you state that you adapted an existing MARC 21 Solr/Lucene schema to UNIMARC?
11:15:18 No.
11:15:41 VuFind has a MARC21 setup for Solr
11:15:56 tajoli: with solrmarc.
11:16:14 because you don't use solrmarc ?
11:16:17 yes... we investigated that. And think that we could build bridges.
11:16:28 No.
11:16:35 hdl: I think that was merely a confusion between your description and a comment next to yours.
11:16:53 It was proved not to be that efficient in indexing.
11:17:26 hmmm... seems to me the main focus now should be on getting solr and zebra to both be options along side each other - otherwise it sounds like the solr solution will have a hard time becoming part of koha/replacing zebra...
11:17:32 hdl: How do you return a complete record from Solr/Lucene?
11:17:46 And would have required too much time.... and would not have provided ppl with the flexibility we wanted to give them.
11:18:25 thd, you can look in Search::IndexRecord
11:18:32 Clearly it is not a task for 3.4 (April 2011)
11:18:36 magnus: we cannot. This is why we asked for help.
11:18:43 tajoli: agreed..
11:18:45 But for 3.6
11:18:47 a record is constructed before indexing
11:18:56 s/before/during
11:19:00 But anyway, it is work in progress.
11:19:25 and if all the RFCs cannot be integrated into 3.4 RM is fine with that.
11:19:54 clrh: I had looked at addbiblio.pl
11:20:47 hdl: it is difficult to persuade 90% of libraries that they should fund something to enable support for the biggest 10%, and I mentioned last meeting that I don't think our members will pay from the co-op's community donation fund. You see our difficulty here?
11:21:06 slef: our work was only very little funded.
11:21:10 And we did that.
11:21:18 hdl: which is cool, but it seems there are lots of fun things to do with solr that may not be so important compared to getting solr into koha in the first place (which seems to imply making zebra and solr work as options next to each other)
11:21:52 hdl: Do you not think you should have tried to obtain more funding for a development with such a large scope?
11:22:17 well, for me the main bonus of Solr is to replace Zebra + ICU
11:22:20 thd: no one would have ever funded refactoring.
11:22:31 you do not want to redo things.
11:22:38 libraries want features.
11:22:39 hdl: that is your decision. Maybe this is easier for a private company. I just explain the difficulty of our membership organisation in the hope you will comprehend it.
11:22:49 It is clear that Zebra + ICU doesn't work.
11:23:04 I explain ours.
11:23:23 hdl: that's not true. The co-op funded a lot of SQL-injection/placeholder cleanup way back when.
11:23:25 tajoli: It works mostly but with important exceptions.
11:23:33 So every library with a complex charset (like Arabic + Latin) can't use koha in a good way now
11:23:44 I think someone now is funding template refactoring (sorry I forget who).
11:23:50 tajoli: Yes that is correct.
11:24:01 Koha in fact is used in a single charset environment
11:24:34 The English speaking countries and countries like Italy with Latin charset only
11:24:49 ok.
11:24:49 And in this situation Zebra is good
11:25:03 ok.
11:25:14 I propose to stop the meeting now.
11:25:18 But if you need two charsets, problems arise
11:25:39 And if you have other questions, or concerns or feedback on test instances, let us know.
11:25:58 on list please
11:26:30 bye
11:26:54 catalyst are funding template refactoring
11:28:19 #endmeeting