Variable char encoding of bytes

23 07 2010

I may not be looking in the right place, but often I want to take a byte[] and convert it into a char[] where the char representation comes from a set of chars that I decide on. This is not for false encryption or obfuscation, I just want a safe compact representation of keys. Hex is ok, but bulky and does no give much flexibility. I couldn’t find anything off the shelf that would work with random encoding sequences and random byte[] lengths so here is what I quickly put together. Published for any one who also couldn’t find anything off the shelf. It benchmarks at about 110K encodings per second (9ns) on a Mac Book Pro Java16 and is reasonable memory efficient. It could be made more efficient, but that would require more cpu.

    /**
   * Generate an encoded array of chars using as few chars as possible
   *
   * @param hash
   *          the hash to encode
   * @param encode
   *          a char array of encodings any length you lik but probably but the shorter it
   *          is the longer the result. Dont be dumb and use an encoding size of < 2.
   * @return
   */
  public static String encode(byte[] hash, char[] encode) {
    StringBuilder sb = new StringBuilder(hash.length);
    int x = (int)(hash[0]+128);
    int xt = 0;
    int i = 0;
    while(i < hash.length) {
      if ( x < encode.length) {
        i++;
        if ( i < hash.length ) {
          if ( x == 0 ) {
            x = (int)(hash[i]+128);
          } else {
            x = (x+1)*(int)(hash[i]+128);
          }
        } else {
          sb.append(encode[x]);
          break;
        }
      }
      xt = x%encode.length;
      x = x/encode.length;
      sb.append(encode[xt]);
    }
    return sb.toString();
  }

And a unit test that checks the operation, and checks for collisions (AFAICT the algorithm is complete, but I could have missed something).

  @Test
  public void testEncoding() {
    SecureRandom sr = new SecureRandom();
    String encoding = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
    for (int j = 10; j < encoding.length(); j++) {
      for (int i = 1; i < 100; i++) {
        byte[] b = new byte[i];
        sr.nextBytes(b);
        System.err.println(StringUtils.encode(b, encoding.substring(0, j).toCharArray()));
      }
    }
  }

  /**
   *  This is a very long running test, do not enable unless you want to wait a long time.
   */
  @Test
  public void testEncodingCollision() {
    SecureRandom sr = new SecureRandom();
    String encoding = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    long l = 200000;
    long m = 1000;
    Set<String> check = new HashSet<String>((int)l);
    for (long j = 0; j < m; j++) {
      long s = System.currentTimeMillis();
      check.clear();
      for (long i = 0; i < l; i++) {
        byte[] b = new byte[20];
        sr.nextBytes(b);

        String e = StringUtils.encode(b, encoding.toCharArray());

        if (check.contains(e)) {
          Assert.fail();
        }
        check.add(e);
      }
      long e = System.currentTimeMillis();
      long t = (1000*(e-s+1))/l;
      long o = (1000*l)/(e-s+1);
      long tleft = (t*l*(m-j))/1000000;
      System.err.println("No Collisions after "+l*j+" operations of "+l*m+" "+((100*l*j)/(l*m))+" % at "+t+" ns/op "+o+" ops/s eta "+tleft);
    }
  }

If you want to use this, feel free and are worried about license consider the code Apache 2 Licensed (c) Ian Boston. If you do use this and find a problem or if you know of something out there in a standard lib that does the same, please comment.





Sometimes new is not better

21 06 2010

Here is a classic example of new not being better. My utilities provider has a “new and improved interface”, so much improved that they felt it necessary to tell their customers how much better it was.

If you click on the “Continue” button you find out just how much better it is

Hey, fantastic, I can request a call back! But if I press the browser back button, I find the “new and improved site” is even better.

I feel so reassured by Katie’s smile, I really feel I should ask her for help.

I wonder which numbskull decided to inflict this new and improved user experience on customers without thinking about getting the implementation to work. Made me laugh, hope it does the same for you.





And its working, JAMES3 IMAP in OSGi on Jackrabbit

5 06 2010

A Self contained James OSGi bundle containing a DNS Server and an IMAP server, binding down to a Sling repository. The screenshot is of OSX iMail connection doctor checking the imap connection and running some simple tests.





Bundling larger OSGi components.

4 06 2010

At the moment, I am bundling the James IMAP server with the JCR backend into OSGi. I could add all the jars to the OSGi Container, but I already have about 150 classloaders and dont really want to add more, so rightly or wrongly I am creating a larger bundle with most of the dependencies isolated inside the bundle. Wrongly the OSGi purists scream. James servers typically use Pheonix, being awkward, I want to bring the James IMAP server up standlone only creating what is absolutely necessary, so I am not using Pheonix to configure, I am instancing a custom NIOServer extending the default ImapServer based on the same. James servers, if you let BND analyse the classes, require lots of things, most of which I dont think I will be using. Things like Spring, Torque, Ant Tools, and getting well over 1000 packages just right in an OSGi Manifest is a bit of a pain. Once its right, you will be thankful you took the time, if not ask the OSGi purists who are now crawling up the walls and screaming. So here is the approach I am taking, which appears to be relatively painless (note, relatively):

  1. Add packages to the Ignore Package maneifest header, until the bundle starts.
  2. Once the bundle starts, add Embedded Jars to the bundle to eliminate class not found errors, except, where you know that the package should be available from the container.
  3. If you see a NoClassDefError, then the class is probably available from the OSGi container (exported from another bundle) but it cant be loaded because you dont have a policy that says it could be loaded, so add an import. NoClassDefError means that classloader cold not get the bytecode stream for the class. This error depends on the OSGi framework in use, ymmv. But look at the last exception in the stack, not the first.

There are better ways of doing this I am certain, but if you really do want to take a small part of a larger project and rebundle it, then this appears to work. Doing the same as a trial and error process really is like trying to shoot rabbits in the dark. (not that I have tried that one).

And I should add, James 3 is really nice.





Jackrabbit PM in columns

14 05 2010

A while back I asked about Cassandra as a Persistence Manger for Jackrabbit 2. The problem that exists with any Column DB and to some extends Cassandra (although it has a concept of Quorum) is that the persisted value is eventually consistent over the whole cluster. Jackrabbit PM’s, at least in JR2 and earlier need a level of ACID in what they do. The PM acts as a central Transaction Monitor streaming updates to storage and maintaining an in memory state. If you look a JR running against an RDBMS this is obvious. The ratio of select to modify is completely reversed with 90% update and 10% read inspite of the JR application experiencing 90% read and 10% update. This presents no problem on a single node, provided the PM give some guarantees to ACID like behaviour. The problem comes when JR is run in a cluster. When the state of items is changed on one node, that change must be known on all nodes, so that their internal caches can be adjusted. Thats what the ClusterNode implementation in JR2 does. In a cluster it goes further. The sequence which the changes are applied must be consistent, so that a add, remove, add happens in that order, on all nodes. Finally, and crucially for a persistence that is eventually consistent, all nodes must be able to see  the committed version of an item when they receive the event concerning the item.

Thats why I asked about Cassandra. If you took the existing RDBMS ClusterNode implementation, which uses a DB table to ensure the sequence of events is the same on all nodes, then events would get replayed correctly, but when they accessed their local connection to the column DB, its almost certain that their local state would not be sufficiently consistent to provide the correct version of the data. I have ignored the slight problem that the journal managed by a central DB is still a central point of failure, a synchronization point for the entire cluster (with a reasonable amount of latency), and botteneck for throughput.

So I did a bit of playing, its not complete, but I think its possible to have a ClusterNode implementation tightly coupled with a PM, where the ClusterNode uses JGroups to manage the journal sequence, with one elected master emitting sequence numbers and all slaves listening in waiting to become the next master should the current one fail. The approach looks like it will happily produce sequence at several orders of magnitude greater than the central DB approach. The second part is to use the modified ClusterNode to communicate a version of the Item in its event. Should the receiving node not get that version of the item, it can come back to the originating node and ask it for a copy, avoiding the problems of eventual consistency and the penalty of a fully quorate column db. Now, I havent tested any of this other than the cluster node impl, so I dont know if trafic back to the consistent node for the correct version of the item will dominate or not, but it all looks feasable.

Having said all that, the discussions on JR3 which looks like making it into trunk any moment now may make all of this obsolete, as there was talk of using a append only node hierarchy store, simular to that used by Git, CouchDB and others. This might still need to overcome the consistency issues when distributed, but it would be core rather than bolt on. Need to go an look ….





Whats going on in Nakamura/Sakai 3

23 04 2010

There are lots of things happening in Nakamura, the back end for the Sakai 3, and many of them are of little interested to anyone thinking of how Sakai 3 might impact their lives in teaching, learning and collaboration in higher ed. However, standing back for a moment, 2 features are going to make a massive impact, perhaps not initially, but certainly longer term. Full Business Rules Engine and Activity based Workflow Engine. I dont expect these words would excite many readers and I would not expect the features to be instantly visible to the teacher, student or researcher. However, with these capabilities in place it becomes possible for new features and functionality to be implemented with less effort and programming. We already have a highly flexible component architecture, in OSGi, allowing deployers to add functionality without modifying the core code base. We also have a extremely flexible unstructured storage model courtesy of Apache Sling and Apache Jackrabbit, which mean, unlike many comparative products there is no schema rebuild required to add functionality, but we moved to that structure 18 months ago and thats not what excites me. We already have a flexible widget based UI model that allows institutions to own, develop and deploy “Apps” for Sakai. We have OpenSocial integration allowing Students, Teachers and Researchers to build academic networks. Thats all in addition to all the features you rightly demand from Moodle 2, Blackboard 9 and others. But thats not what excites me.

What excites me is being able to give Business Analysts and Educational Designers at institutions visual tools that will enable them to directly develop, simulate and deploy workflows and rules that will give their institutions competitive advantage, without having to resort back to programmers. And its the simulare based on feedback and evidence that are most compelling, having the tools to run what if evidence based senario modelling on the introduction of new ways of operating, before deploying to production is core to so many highly effective organizations, so why not higher ed ?





Configuring Logging in Nakamura/Sling, at runtime

12 03 2010

One of the most frustrating things about trying to find out what is broken in a running application is that the information in the logs is not informative enough, and if it is, there is too much of it to be useful. This is often because the logging configuration has to be set prior to the server starting. So if you have 2 options. Configure the logging at debug level just in case there is a problem, or restart the server when there is a problem. Nether are really practical in production. Debug an everything kills the server, and your definitely in the wrong profession if you can predict where to put debug on (ie the future), you should be a banker.

Fortunately Sling/Felix have a nifty feature, exposed in Sakai Nakamura that allows a sysadmin to reconfigure logging at runtime. If you go to the Configuration console plugin http://localhost:8080/system/console/configMgr and find the Configuration factory called “org.apache.sling.commons.log.LogManager.factory.config” you can create a new configuation. In this configuration you can specify a level, target file, and best of all, a package sub path. For instance I just needed to look at the way ACLs were being resolved in Nakamura so I created a configuration logging to logs/acl.log at log level Debug and used a package name org.apache.jackrabbit.core.security.authorization.acl. Now all the ACL activity happening in a class DynamicACLProvider appears in the log file logs/acl.log. When I am done, I can delete the configuration. Neat. No server restart, and highly targeted logging with no need to go round to each node in a cluster and configure each one differently. I could probably make it all stream out to syslogd on demand and collect the results from a cluster centrally. So if you are sinking under mountains of log files with useless debug statements that you cant get rid of till next time downtime is scheduled, there is another way.





ACL extensions just got easier

9 03 2010

Extending the types of ACLs in Jackrabbit 1.x was hard. After, 1.5 where there was a reasonable ACL implementation, much of the code that managed this area was buried deep within inner classes inside the DefaultAccessManager and related classes of Jackrabbit 1.5/1.6. In Jackrabbit 2 as part of the improvement to the UserManager (I guess) its become much easier to make extention. I had a patch that modified the core classes of ACLEditor, ACLProvider, ACLTemplate in JR1.5 allowing the injection of a class that would control the way in which Access Control Entries were collected for a node. This allowed things like dynamic principal membership to be implemented (eg membership of a group that is determined by a rule set, and not a membership record). The upside, was this was possible in 1.5, the downside was that it was hard and required re-implementation of some of the core classes, in the same package space to get round private, protected and even code blocks based on class names. So the patch was huge and hard to maintain.

In JR2, inside the acl.ACLProvider there are protected methods designed to be overridden. You still have to do this inside the same package as the native JR2 classes, in order to instance private internal classes, but at least extending the default ACL resolution in Jackrabbit is a case of extending a few classes. All of this put inside an OSGi environment makes it possible to re bundle an embedded Jackrabbit server with custom extensions on the core… and no need for a massive patch set…. another reason to upgrade to JR2.





Jackrabbit2 User Manager Scaling.

7 03 2010

In Jackrabbit 1.x the User manger (at least after 1.5) stored users in a security workspace within the repository. This was great, and works well upto about 1000 users. However it uses a structure where users are nested in the user that created the user. If if “admin” creates all your users, then there will be 1000 child nodes to the admin user node. Sadly, additions become slower the more child nodes there are. Pretty soon a number of things happen. The number of child nodes causes updates to be come so slow it hard to add users (>5K users). This can be addressed by a sharding the child node path, avoiding large numbers of child nodes. Secondly (and this is harder to solve), the query that is used to find a user, or to check that the user doesn’t exist somewhere becomes progressively more expensive. So that when you get to about 25K users the create user operation has slowed by an order of magnitude.  That may not sound too bad, since its not often that you want to create a user, however, retreval of a user that becomes slower as well since you cant calculate the location of the user node from the userID, and since this needs to be done on almost every request, it slows everything.

Fortunately it looks like all of this has been fixed in Jackrabbit 2. Now the UserManager does not store users in a nested form, it has an autoscaling sharding mechanism and the search query generates results that are far more direct. Some of this is not enabled by default, but here is the config that makes it work.

In repository.xml

<!DOCTYPE Repository 
       PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 2.0//EN"
"http://jackrabbit.apache.org/dtd/repository-2.0.dtd">
<Repository>
     <FileSystem>
        <param name="path" value="${rep.home}/repository"/>
     </FileSystem>
     <Security appName="Jackrabbit">
        <SecurityManager workspaceName="security">
           <UserManager>
             <param name="defaultDepth" value="4" />
             <param name="autoExpandTree" value="true" />
             <param name="autoExpandSize" value="500" />
           </UserManager>
        </SecurityManager>
        <AccessManager>
        </AccessManager>
...

In tests in Sling I see no slowdown in user creation upto 10K node. With HTTP requests ranging from 12-25ms. To add those 10K nodes from a single request thread takes 6m4s, a rate of about 27/s. The same test performed with Sling on Jackrabbit 1.6 was averaging 6/s over 10K nodes. Concurrent add operations tend to speed this up further as the http cost is factored out. My laptop runs out of steam at about 70/s.

User Node paths that are generated are of the form

.../t/te/testuser201003070946420-300
with auto scaling in operation they become
.../t/te/tes/test/testu/testus/testuse/testuser/testuser2/testuser20/testuser201/testuser2010/testuser20100/testuser201003071035370-4998
However this is an extreem case where all the user ID’s are almost identical.
Conclusion.
If you have a app that has a lot of users, use Jackrabbit 2 not Jackrabbit 1.6, as it scales better.




New Programming Language

5 02 2010

My new programming language that always compiles, never has bugs, has perfect style and is generally delivered on time, (all IMHO) is English. Developers must have a screw loose. Generally they refuse to write anything down, often they say the documentation is in the code, any yet, most of their leasure time is taken up refactoring rewriting and perfecting that algorithm that started out as a simple sort and is now drinking credit card limits on a cluster of Amazon nodes. Meanwhile, those crafty non developer types, lean back and claim victory with a page of prose that no compiler can even start to understand, and yet, they are the ones living it up with deadlines met. Now, we do do our best to ensure that doesnt happen, but there is a lesson to be learnt here.

Although 90% of those that write code for a living don’t do documentation and certainly don’t do it in advance, its far quicker and easier to experiment with ideas in plain English. (oops forgot that should be _i18n_native_tongue_ ), I wont go on, suffice it to say I’ve been trying out spec before implementation and found that on many occasions nuances that I can see would have appeared in the implementation have appeared in the spec, but the cost of correction has been significantly lower and my personal dialect of english has passed my patched english complier most of the time. Other’s substandard copies have problems. The bigest thing I have noticed with this approach is that I haven’t wasted hours debugging what I would eventually throw away.

It’s interesting that the software industry does so little of this. My first vocation in life was a design engineer; mechanical, high vacuum. There was no point in taking a stock bar out of the stores putting it into a lathe and hoping to machine off the right amount to make a stub shaft. In fact with high vacuum and often high voltage that was a positively dangerous pastime. Perhaps 200 years ago at the start of the industrial revolution or today if creating a work of art, but for engineering the physical material is always the last final part of the process. As software engineering, still a young profession evolves perhaps we will see more design and works of art and less metal bashing.