May 26, 2013

Review of "Clojure Data Analysis Cookbook" book

(Disclaimer: I received the "Clojure Data Analysis Cookbook" from Packt Publishing for review - as one of Incanter's maintainers, I was interested in how this package is described there.)

The book is a classical cookbook - there are different recipes, grouped by common topics into chapters, but almost independent of each other.  You can select any of them and experiment with it.  But the book doesn't describe the theoretical foundations of the corresponding examples - it only gives a solution & explains how it works, although you can find more information by following the links to additional material.  Nor does this book describe Clojure itself - you need to grab some other book for this, such as "Programming Clojure, 2ed" or "Clojure Programming".

The book covers a wide range of problems, starting with data import (from different sources & formats) and cleanup. After that, the author shows how the performance of processing can be improved by different tricks, starting with basic type hints and continuing with Clojure's primitives for concurrent programming (agents, STM, pmap, reducers, etc.). A separate chapter is dedicated to data analysis using Cascalog - a Clojure-based framework for data transformation & analysis.

Several chapters cover Incanter - from operations with datasets to statistical analysis and charting - and this information is enough to start using Incanter for data analysis.  The book describes the latest version of Incanter - 1.4.1, although some information (such as import of data from SQL databases) will become outdated in Incanter 1.5.0 (I hope that it will be released soon).

Besides Incanter, the book describes how to use Clojure together with other programs and libraries for statistics and data mining, such as Weka, Mathematica & R.  And the last chapter of the book describes how you can create charts in web applications using Clojure web frameworks, ClojureScript & existing JavaScript libraries for charting.

I can recommend this book if you're interested in data analysis, and have some experience/interest in Clojure.

I can't say that this is a perfect book - there are some problems with code formatting, and in the electronic version, URLs aren't clickable (that would be very useful when you want to read additional information).  But the overall impression from this book is very good.

I want to thank the author, Eric Rochester, for writing such a useful book.

December 31, 2012

2012 in overview...

It was a relatively quiet year. It started with a 2-week vacation in the Dominican Republic - it was an interesting trip, my wife and I enjoyed it a lot, and we want to do it again, maybe next year.
Not so much happened at work, most of the tasks were not as challenging as in 2011, although I tried to do some experimental stuff, applying machine learning & related knowledge to my area of expertise. But none of these experiments went to production (I hope that they will, but who knows when this will happen). I also participated in Microsoft's plugfest - a conference about file formats & protocols - it was very useful for me, and as usual, most of the useful information was obtained during conversations with developers.
I continued to self-study - read a lot of books on different topics: machine learning, natural language processing, big data, software development practices, etc. (you can follow me on Goodreads if you're interested). But the pile of books labeled 'to-read' is still too big.
I also took several courses from Coursera: Natural Language Processing was very interesting, and right now I'm doing Heterogeneous Parallel Programming - to refresh my GPGPU programming skills.
This year I gave 2 presentations about Clojure - one for the functional programming user group in Bielefeld, and the second as part of a small Russian-speaking conference, ITSea-2012, which was combined with a vacation in Montenegro.
I tried to participate in different open source projects, mostly in CEDET & Incanter, and some others. And I hope that I'll have enough time to continue this work, especially for CEDET.
Cycling is still my favorite sport, although I started the cycling season relatively late this year, due to an operation on my right hand, so I managed to ride only 2400 km (the last trip was 2 days ago :-)

I want to wish Happy New Year to all my readers, and see you next year!

October 30, 2012

New version of article about Emacs/CEDET

I just uploaded to the site the new version of my article A Gentle introduction to CEDET. I also kept the old version of this article, but as a separate page.
The new version describes CEDET with the new activation scheme, so it's now applicable to versions from bzr and to versions bundled with GNU Emacs (once the next GNU Emacs version, where CEDET was updated, is released).
Besides this, I added a short description of how to customize CEDET to work with Java, and a small section about setting up name completion through the auto-complete package.

Instead of my config file, which contains too much unnecessary stuff, it's now better to use the separate config.

P.S. btw, fresh snapshots of CEDET can automatically detect Maven projects and get classpath information directly from them. So now name completion also works for 3rd party libraries. For example, this is name completion while working with source code from Apache Tika:

July 24, 2012

Getting started with examples from "Mahout in Action"

I decided to write this post because I saw several similar questions on how to start working with the examples from the "Mahout in Action" book (I was a technical proofreader for it, and am familiar with the examples ;-).

Preparations

The complete source code of the examples from the book is available in a separate repository on Github, together with short instructions on how to use them.  Please note that the book was written & tested for Mahout 0.5 - the stable release that existed at the time of publishing - and the master branch in the repository contains code for this version.  There are also separate branches for code that was modified to work with Mahout versions 0.6 and 0.7 - they are named accordingly.  To obtain the code, you can either use Git or use Github's "download source" functionality.  Here are links for all existing versions: 0.5 (master), 0.6, 0.7 - download and unpack the archives to some location.
To work with the examples, you need to have Apache Maven installed (it's better to install it from your system's package repository on Mac OS X or Linux systems). Maven is used to compile the source code and to create packages. The Maven project can also be imported into your favorite Java IDE - Eclipse, Netbeans, or Idea (I will explain how to use Eclipse, but for other IDEs the process is similar). To use Maven with Eclipse, you need to have the m2eclipse plugin installed - it provides import and build functionality.
To run the examples from chapter 16, you'll also need to have Apache Zookeeper installed - see the instructions in the README file in the repository - they're pretty detailed.
You also need to download the Mahout distribution to run some examples (usually they involve execution of the mahout script). Download the file mahout-distribution-<version>.tar.gz and unpack it. You can also download the file mahout-distribution-<version>-src.tar.gz, although this isn't necessary (it contains Mahout's source code).
I just want to mention that Mahout works best on Unix-based systems -- all examples were tested on Mac OS X & Linux. This also applies to Hadoop, so if you're using Windows, it may be better to install Linux in a virtual machine and use it for all work.
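Before going further, it may help to check that the tools mentioned above are actually on your PATH. This is only a minimal sketch: the function name check_tools is made up for this post, and the list of tools you pass to it depends on which examples you plan to run.

```shell
# check_tools (hypothetical helper): report which of the given commands
# can be found on PATH. Returns 0 when all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (uncomment to run):
# check_tools java mvn git
```

If something is reported as missing, install it first; the later steps assume that at least java and mvn are available.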

Build examples

To be able to run the examples, you need to build the packages (jar files). From the directory where the source code for the examples is located (you should have the file pom.xml in this directory), execute the following command:
   mvn package
It will compile the source code and create packages. Compiled packages are stored in the target directory. Several files are created:
  • mia-<version>.jar contains only examples, to run them you need to specify all dependencies;
  • mia-<version>-jar-with-dependencies.jar contains examples plus all dependencies - this jar could be run without specifying additional classpath elements;
  • mia-<version>-job.jar contains examples plus all dependencies, excluding Hadoop -- it should be used for Hadoop jobs.
Use the corresponding packages when the book refers to them.
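To keep the three jars apart, something like the following could be used. It is only a sketch: the file name patterns are taken from the list above, but the function name mia_jar and the keywords plain, standalone and hadoop are invented here for illustration.

```shell
# mia_jar (hypothetical helper): print the path of the jar that fits a use.
#   $1 - version of the examples, e.g. 0.5
#   $2 - purpose: plain | standalone | hadoop (keywords made up for this sketch)
mia_jar() {
    version="$1"
    purpose="$2"
    case "$purpose" in
        # examples only; all dependencies must be put on the classpath by hand
        plain)      echo "target/mia-${version}.jar" ;;
        # examples plus all dependencies; can be run as is
        standalone) echo "target/mia-${version}-jar-with-dependencies.jar" ;;
        # examples plus dependencies, excluding Hadoop; for Hadoop jobs
        hadoop)     echo "target/mia-${version}-job.jar" ;;
        *)          echo "unknown purpose: $purpose" >&2; return 1 ;;
    esac
}

# Usage (uncomment to run):
# java -cp "$(mia_jar 0.5 standalone)" mia.recommender.ch02.IREvaluatorIntro
```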

Import of example's source code into Eclipse

Importing the code into Eclipse is very easy - go to the File menu, select the Import... item, then unfold Maven, select Existing Maven Projects from the list, and press Next.  Eclipse will ask you where the source code is located - point it to the directory where you unpacked the examples - Eclipse will analyze pom.xml and display a string like: /pom.xml com.manning:mia:0.5:jar. You can press Finish after that.
After the import, the project will be opened in Eclipse, and you can look into the source code, modify the examples if you need, and execute them (see below).
If you need, you can also import the source code of Mahout itself into Eclipse. The procedure is similar, but this may not work for all releases - in some cases, it will give you an error that some plugins aren't covered by m2eclipse - you can select the Ignore item in the Quick fix menu (when you click the right mouse button).

How to run examples

You can run examples either from command line, or directly from Eclipse.

Run from Eclipse

To run an example from Eclipse, select the needed class in the browser on the left, right-click on it, select Run as..., and from the sub-menu select Java Application.
Take into account that some classes need additional parameters - you can specify them by selecting the Run configurations item from the Run as... sub-menu.
For example, the code from chapter 2 expects the file intro.csv to be located in the current directory (the top of the project), while it's actually located together with the source code, so execution without explicit configuration will lead to an error. To fix this problem, you need to specify that the working directory for these examples is in a non-default place - go to Run configurations, and select the Arguments tab in the dialog window. Then change the Working directory parameter from Default to Other, press the Workspace... button, and select the src/main/java/mia/recommender/ch02 directory from the tree view. After that you can press the Run button, and your example will be executed without error.

Run from command-line

You can run the examples from the command line either by using java directly, or by using Maven's exec plugin.
To run the examples with java, you need to specify the package with all dependencies in the classpath, and specify the name of the class to execute, like this:
   java -cp target/mia-0.5-jar-with-dependencies.jar mia.recommender.ch02.IREvaluatorIntro
But to run like this, you need to rebuild the package if you made any changes. From this perspective, Maven's exec plugin is handier - it automatically recompiles changed code and executes it without packaging everything once again.  To execute your class, you need to issue the following command (for this example, you need to copy the intro.csv file to the top-level directory, or it will fail):
   mvn exec:java -Dexec.mainClass="mia.recommender.ch02.IREvaluatorIntro"
If your class accepts command-line parameters, you can specify them using the plugin's exec.args parameter:
   mvn exec:java -Dexec.mainClass="mia.recommender.ch02.IREvaluatorIntro" -Dexec.args="src"
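The intro.csv step is easy to forget, so it can be wrapped together with the Maven call. This is a rough sketch, assuming that intro.csv sits in src/main/java/mia/recommender/ch02 (the directory mentioned in the Eclipse section) and that you run it from the top-level directory with pom.xml; the function name run_ch02_intro is made up.

```shell
# run_ch02_intro (hypothetical helper): run the chapter 2 evaluator example.
# Must be called from the top-level directory of the examples (next to pom.xml).
run_ch02_intro() {
    # The example reads intro.csv from the current directory,
    # so copy it from the chapter 2 sources if it is not there yet.
    if [ ! -f intro.csv ]; then
        cp src/main/java/mia/recommender/ch02/intro.csv . || return 1
    fi
    mvn exec:java -Dexec.mainClass="mia.recommender.ch02.IREvaluatorIntro"
}

# Usage (uncomment to run):
# run_ch02_intro
```

The copy is done only once, so your own changes to the top-level intro.csv are not overwritten on later runs.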

Conclusion

So, I hope that this article helped you to get started with the Mahout in Action examples. Most of the examples should work as described here; some require more work, but you can find instructions for them in the README file in the source code repository.
If you still have questions, I'll try to answer them ;-)

June 18, 2012

Experience with new mouse

I had some problems with my right hand at the start of this year, so I decided to take part of the load off it. I started to use a left-handed mouse and used it for several months (it's not always possible to work completely without a mouse, especially in Windows).
A week ago I got a new mouse as a present - an Evoluent VerticalMouse 4 (for the left hand) - and used it for a week at home, while using the old mouse at work. And I can say that the new mouse is much more comfortable for me - the hand rests comfortably on it, and there is a separate middle button (instead of the less comfortable wheel click, although that also works). There are also additional buttons that can be programmed to do something useful. The mouse works fine on all tested OSes (Mac, Linux).
It was so comfortable that I ordered the same mouse for work and have already replaced my old mouse with it.

June 17, 2012

ECB & fresh Emacs/CEDET...

I already tweeted about this, and also wrote to the ECB & CEDET mailing lists, but I also want to reach Planet Emacs readers :-)
I made small changes in the ECB code that allow it to be used together with fresh Emacs & CEDET versions. The modified code is available on my github. I tried this version together with CEDET from trunk, and also with CEDET from Emacs 24.1, and it worked for me.
If you're using ECB, please try this modified version and leave feedback (either here, or by sending e-mail to me or to the ECB mailing list). If you find bugs, feel free to file them using github's issue tracker.

March 23, 2012

Jenkins + CMake/CTest

Some time ago I set up Jenkins CI to compile & test our code. Installation itself was straightforward and matched the documentation on the official site (besides the official documentation, there is a pretty good book, "Jenkins: The Definitive Guide"). The only problem was that it didn't work with OpenJDK, so I needed to install the Sun JDK.
For C++ components we're using CMake. Out of the box, Jenkins has no support for CMake, but there is a plugin in the repository that you can install separately. This plugin allows you to specify different options for CMake, and will perform compilation and installation of the code. (Although, in the end, I set up a configuration that explicitly runs CMake with the needed options, and performs compilation via GNU Make with some additional options.) The main problem with CMake (CTest, really) was that test results are in a format that isn't accepted by the default test analyzer, which is designed to work with JUnit logs. The plugin repository has the xUnit plugin, which should work with other test frameworks, but it also doesn't support the CTest log format. The solution was found on StackOverflow - we need to write a custom XSL that will convert CTest results into JUnit format. I found a stylesheet on StackOverflow and slightly modified it. So, the build step for testing now looks the following way:
cd Debug
TRES=0
ctest -T test --no-compress-output || true
if [ -f Testing/TAG ] ; then
   xsltproc /var/lib/jenkins/userContent/xunit/CMake/2.8/ctest2junix.xsl Testing/`head -n 1 < Testing/TAG`/Test.xml > CTestResults.xml
fi

I needed to write the 3rd line this way because Jenkins aborted the job when ctest returned a non-zero result after one of the tests failed, so the other commands weren't executed, and we had no test results in the dashboard.
So now all our components are compiled & tested immediately after check-in, and we're able to catch errors early.

Update: the last line will cause the build to be marked as broken if any test is failing. You can remove it, and in this case Jenkins will mark the build as unstable if any test is failing - it will take this information from the test results.
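The listing above doesn't show a final line that passes the ctest status back to Jenkins, so here is a sketch of how such a variant could look. It is an assumption, not the exact script from my Jenkins job: the function name run_ctest_step and its two arguments are made up, and the status is kept in a variable and returned at the end.

```shell
# run_ctest_step (hypothetical helper): run the tests and convert the results.
#   $1 - build directory (Debug in the step above)
#   $2 - path to the XSL stylesheet that converts CTest XML to JUnit XML
run_ctest_step() {
    build_dir="$1"
    xsl="$2"
    cd "$build_dir" || return 1
    tres=0
    # Remember the ctest status instead of letting the job abort here
    ctest -T test --no-compress-output || tres=$?
    if [ -f Testing/TAG ]; then
        # The first line of Testing/TAG names the directory with Test.xml
        xsltproc "$xsl" "Testing/$(head -n 1 < Testing/TAG)/Test.xml" > CTestResults.xml
    fi
    # A non-zero status marks the build as broken; replace with "return 0"
    # to let Jenkins mark it as unstable based on CTestResults.xml
    return "$tres"
}

# Usage in a Jenkins shell build step (uncomment to run):
# run_ctest_step Debug /var/lib/jenkins/userContent/xunit/CMake/2.8/ctest2junix.xsl
```

Either way, the conversion to CTestResults.xml happens before the status is returned, so the dashboard gets test results even when some tests fail.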

December 30, 2011

Year's results...

I think that today is an appropriate time to look back on what happened in 2011, and to think about what to do next year. Many things happened in 2011...
  • I got a promotion to principal engineer, so some new activities were added to my usual duties, and I started to work more closely with other teams around the world.
  • Traveled a bit - was in England on a business trip, went with my wife to the Canary Islands for the 3rd time, spent a week traveling in England, Holland, plus did some trips in Germany (Cologne, Rhein, etc.).
  • Rode 2000km on my bicycle, and want to do more next year...
  • Participated in the release of several books in different roles: as technical proofreader for Manning's "Mahout in Action" & "Tika in Action", and as translator for the Russian translation of the famous "Types and Programming Languages", plus I helped slightly in translation/proofreading of more functional programming-related books that will be released next year (in Russia).
  • Open-source activity wasn't as great as before because of lack of free time; usually I did small patches for different projects - muse, haskell-mode, mahout, CEDET, incanter, and several more.
  • Tried to continue writing articles, so at the beginning of the year I wrote an article about TDD & Unit testing in C++, and at the end of the year wrote (together with Dmitry Bushenko) a small tutorial (in Russian) on how to extend Emacs for refactoring of code. We plan to extend this article with more examples, including more tools from CEDET/Semantic, and then translate it to English.  Also gave a talk about Emacs for the Scala.by user group (via Google+ Hangout)...
  • Read a lot, mostly technical books...
  • And one of the most interesting experiences was participation in Stanford University's experiment - online classes on Artificial Intelligence (AI) and Machine Learning (ML). It was very interesting to study there, and they taught me many new interesting things, plus I had a chance to read some books that were bought some time ago but simply stayed on my bookshelf, for example, Artificial Intelligence: A Modern Approach. I got only an 89% overall score in the AI class (I need to check my homework more carefully ;-), but anyway it was a very interesting experience. The ML class was "easier" for me, but it was also interesting and allowed me to study many new things.  I haven't received my certificate for the ML class yet, but I hope that it will be good - I did all the homework & programming tasks... (If you haven't participated in those courses, I recommend looking at the current list of courses at the ML Class site, maybe you'll find something interesting for you - they will offer many more courses in January-March!).


What's next? I plan to investigate topics that are interesting to me - natural language processing (including studying in Stanford's NLP class), machine learning, etc. I plan to participate more actively in open source projects, especially Clojure-related ones. Continue to read new books (I have already prepared a pile of books to read on vacation). Will try to set a personal record in cycling (I plan to ride 3000km next year :-). And do many other things...

And now, I want to wish Happy New Year to everybody!

November 2, 2011

Planet Clojure's milestone

Today, Planet Clojure reached another milestone, and now combines feeds from more than 300 blogs (301, precisely).

P.S. I have a question - is anybody interested in Planet Clojure in languages other than English? For example, I maintain the Russian Planet Clojure separately, but maybe it would be better to combine all planets in one place, as subdomains of Planet Clojure?

November 1, 2011

Miscellaneous...

I haven't blogged since May, and many things have happened since that time.

In June, my wife and I traveled to England and spent a week visiting different cities. We didn't see that much, though, so we're planning to make another trip to Scotland, and maybe other parts of the UK.

Cycling was also a big time consumer (this year I rode about 2,000km), but I like this activity, and I'm planning to upgrade my bicycle next year to something speedier - I don't know yet what to select - a road bike or a time trial bike...

Book-related

As usual, I read a lot, investigating new (for me) technologies & best practices.
Together with Manning's team, I participated (as technical proofreader) in the publishing of two books: Mahout in Action and Tika in Action. The first book was very interesting for me, as I planned to study this technology anyway, and the work with Manning allowed me to dig more deeply into all the examples, code, etc. This book is different from other machine learning books because it shows how to solve ML tasks with Mahout in practice.
A week ago another book-related project was also finished - the Russian translation of the well-known book on type theory - Types and Programming Languages by Benjamin C. Pierce. This project lasted for 3 years, but in the end we produced a high-quality translation of this book (and it's free in ebook form!). I hope that I'll receive the paper book shortly.
And more projects are in progress ;-)

Programming & other IT-related projects

I also continued to experiment with different technologies to get new experience (not related to work) - Mahout, Lucene, Hadoop, Erlang, Haskell, and many others.
Some work was also done with Clojure, but not so much; I hope that I'll have more free time in winter.
I'm also participating in the AI & ML classes from Stanford, and like them a lot, although not much free time is left. The first is harder, as it requires more study on your own, seeking new information, etc. (I've forgotten many things since I finished university :-) while the second is more self-contained - you usually don't need to search much in external sources to do your homework. Some time was needed to read the Octave manual, but this was not so complicated. Now I'm also studying how to program in R, to help my wife with her thesis.


More things have happened & are happening than I wrote about, and I'll try to blog more often.