Why it Shouldn't Matter Whether the Research Works Act Passes or Not

If you’re reading this post, you likely are already aware that Congress recently introduced a bill (H.R. 3699) that would roll back provisions that require certain kinds of taxpayer-funded research to be accessible to the public for free. I won’t go into the details, since it’s already been covered very nicely in many places.

While I am deeply disturbed that this bill has appeared (and I would love to see it go down in flames), at some level, part of me is actually glad that it is here, and I would further submit that it doesn’t (or at least shouldn’t) matter whether this bill passes or not. Perhaps it would even be for the best if it does pass. Why? Because at the end of the day, we scientists are in the driver’s seat, and we decide how we disseminate our work. Unless companies like Elsevier lobby Congress to require that we all submit our papers to their journals, we ourselves are complicit in the problem, and we hold the power to solve it, if we so choose.

David Crotty wrote a nice piece recently calling attention to the fact that much of the rhetoric on both sides of the issues has become overly emotional and inflammatory. He correctly points out that just because a subway was built with public money doesn’t mean that you have a right to ride it for free, nor does it mean that a commercial operator of that subway line doesn’t have a right to make a profit. This is all fair. However, if the fares on that subway prevented the vast majority of the public from riding it (and the subway operator were making enormous profits), then the public can and should be up in arms about the misuse of public funds. If the majority of taxpayers cannot access important medical research results because they are behind outrageous commercial publishing pay-walls, then the public has a right to be upset.

But who should the public be upset with? Clearly it is easy to be upset with fat-cat publishers like Elsevier. One could hardly ask for a more James-Bond-esque villain from the dry world of academic publishing. When they’re not publishing fake journals or running international weapons shows, they ask scientists to write the papers (free of charge), then ask other scientists to peer review them (on a volunteer basis), and then charge the author thousands in page charges (which come out of grant funds) and extort their institutions to the tune of tens of thousands of dollars for the subscriptions to the journals (again, largely from public funds). Someone from the general public who does not have access to a University mega-library must further pay something like $25+ per article just to look at the paper. Meanwhile, they provide comparatively poor editorial services, host ancient-looking rarely-updated websites, and generally make the entire scientific process go slower. In exchange for all of the “value” they add, they demand authors transfer their rights to the paper, and they maintain a mind-blowing 36% operating profit margin.

Now at this point, you might complain that I’m conflating the bad behavior of an individual company (or collection of companies) with an issue of principle — should the government be dictating the business model of a private company. We’ll get back to the “inflammatory” facts about Elsevier in a moment, but I fully agree that the government shouldn’t be dictating the business model of publishing houses. That’s fine, but the government already is in the business of telling me how I can and can’t spend grant money. Government funding agencies get quite a bit of say over how scientists spend taxpayer money (e.g. only coach class tickets, no money spent on food, etc. etc.). These are good restrictions, and scientists take them very seriously, because such provisions ensure that the government gets what they want out of the research money they spend. So, just as the government has the right to tell me not to spend taxpayer money on donuts for lab meetings, they should have a right to tell me not to spend government money submitting to journals that will squirrel away the research results and make them inaccessible to the public. The public absolutely has a right to demand this, and it has nothing to do with dictating anyone’s business model.

Furthermore, even leaving aside open-access mandates and the Research Works Act, the public has a right to be pissed at scientists who choose to publish their work in places that are not accessible to the public, even if this is strictly “legal”. Scientists should exert social pressure on other scientists who do so. There are increasingly many open access alternatives whereby we can disseminate our work to the widest possible audience, and I believe that we have a duty to use them, putting that duty even above the realities of “publish-or-perish”, to the extent that we are able. At the very least, we need to proactively foster an environment where, come grant review, or hiring or tenure-promotion time, we place value on publishing in open access venues. “Closed access” ought to be a dirty word.

Open access mandates have given scientists a “pass” on having to exercise our principles. We could keep submitting to the worst of publishers, knowing that our work would still get out to the public eventually. However, if the Research Works Act passes, it would force us to do the right thing. Supporters of the Research Works Act wax mistily about the importance of the free market, but it is exactly the power of the free market that I hope wins the day. The academic publishing world is indeed governed by market forces, but as in other “third party payer” situations (e.g. healthcare), the market is distorted by the fact that the payers and the decision makers are not the same. Scientists decide where to publish their work, using taxpayer money to do so. Academic scientists need to keep publishing if they want to keep their jobs and keep doing what they love, and thus scientists are put in the awkward spot of not always having the luxury of being able to put principle above survival. However, we have a responsibility to put the interests of the taxpaying public first. Patients should be able to read the latest research about their disease, entrepreneurs should have access to the latest information about technology coming out of universities to accelerate innovation, and the general public should have a direct window into how science really works.

Moreover, we scientists should stop setting ourselves up to be abused by publishing companies that charge too much and provide laughably terrible service in return, all while demanding sweeping rights to our work because of the “value” that they add. Scientists can vote with their research budgets and with their feet (papers). Publishers like the Public Library of Science prove that we can get both open access and better service. Traditional publishers like Nature Publishing Group and AAAS are also coming out in support of open access. This isn’t about for-profit vs. not-for-profit. It’s about open access.

Elsevier employees who I’ve talked to get uncomfortable and call “foul” when I bring up the company’s checkered past, but they can’t have it both ways. If they want to invoke the free market, then they have to accept that they are a company with an associated brand, and we are their customers. It absolutely matters that they have been behaving like an international band of thugs and villains. Especially in the modern market, “brand” includes corporate accountability. I’m the customer and I can choose to take my business elsewhere.

Scientists: stop submitting papers to organizations that exist to squeeze the life out of science. Stop reviewing for these journals. We can exert market forces to solve this problem.

The academy has long been the silent, sleeping giant in this industry. My only hope is that something as brazenly clumsy and greedy as the Research Works Act is enough to wake that giant.

Disclaimer:

In the spirit of full disclosure, I need to mention that one of my postdocs and I do have one Elsevier submission still in process from months before this whole Research Works Act thing blew up (yes, it takes that long to get a paper published there). We were invited to submit something (it’s one of those “best of conference” issues), and it seemed harmless at the time. I wasn’t excited about the fact that Elsevier was involved when we submitted initially (my first paper was in an Elsevier journal and I remember the process unfondly). Now that their involvement in pushing the RWA has come to light, I’m really unhappy about the situation.

The editorial process has been extremely poor throughout, which makes me doubly angry when I hear Elsevier and its minions go on and on about the value that they add. We have considered withdrawing the submission in protest, but since peer review has already taken place, we feel bad about wasting the effort of our colleagues who did the review. I’m hoping that we can get through the last throes of the publication process quickly, so that we can get this behind us. After that, I plan to sign the online petition at thecostofknowledge.com, wash my hands of this, and call it quits with Elsevier once and for all. I also plan to post a disclaimer, encouraging others to steer clear, alongside the version of the paper that I am allowed to post on my website under Elsevier’s awful policy.

Taming Multiple Pythons with Pythonbrew

Keeping track of multiple Python installs can be a pain if you want to actually switch between them. MacPorts had something called python_select that attempted to address this problem, but it is deprecated, and I wouldn’t recommend going anywhere near MacPorts if you can help it. You could create an armada of shell aliases for different versions, or write your own shell functions to manage the switching; I’ve done this before, and it mostly works (depending on how well you do it). However, a solution that I particularly like is to use pythonbrew.

Pythonbrew was designed to work like rvm for Ruby; that is, it installs and manages multiple, completely independent Python installs, including the interpreter and everything. Pythonbrew also supplies uncannily robust tools for switching between the Pythons it installs. For instance:

pythonbrew use 2.7.2

switches you to using its Python 2.7.2 for this particular shell session, while:

pythonbrew switch 2.7.2

switches for this and future sessions.

Of course, out of the box, this only works with Python installs that pythonbrew itself installed. This is great if you use pythonbrew (which I do recommend), but it doesn’t solve the problem if you have other pythons you’d like to switch between (although pythonbrew off will command pythonbrew to take itself out of the loop, reverting back to whatever you had before when you typed python).

However, pythonbrew isn’t too picky about which Pythons it switches between, as long as they live in:

~/.pythonbrew/pythons/Python-[version-number]

Thus, we can draw on pythonbrew’s Python-switching excellence by putting our Pythons-of-interest in there.

A straightforward and clean way to do this is with a virtual environment (see the virtualenv homepage for details). Basically, a virtual python environment is a symlinked “pretend” python install which carries its own site-packages, etc., but which links through to an underlying python interpreter / libraries for all of the guts.
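
If you ever lose track of whether the python you’re running is a virtualenv or the real thing, you can ask the interpreter itself. Here’s a minimal sketch (the function names are my own, not part of virtualenv). It relies on the fact that classic virtualenv records the underlying install in sys.real_prefix, while the standard-library venv module and newer virtualenv releases use sys.base_prefix instead:

```python
import sys


def underlying_prefix():
    """Return the prefix of the 'real' Python this interpreter rides on."""
    # Classic virtualenv sets sys.real_prefix; venv and newer virtualenv
    # releases set sys.base_prefix. Outside any virtual environment,
    # neither differs from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return sys.real_prefix
    return getattr(sys, "base_prefix", sys.prefix)


def in_virtual_environment():
    """True if this interpreter is running inside a virtual environment."""
    return underlying_prefix() != sys.prefix


if __name__ == "__main__":
    print("prefix:            %s" % sys.prefix)
    print("underlying prefix: %s" % underlying_prefix())
    print("virtual env:       %s" % in_virtual_environment())
```

Run it with whichever python is currently first on your PATH; if the two prefixes differ, you’re inside a virtual environment.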

To use the system python by name from pythonbrew, we could create a virtual environment inside ~/.pythonbrew/pythons, using virtualenv (assuming you’ve already done pip install virtualenv):

/usr/local/bin/virtualenv ~/.pythonbrew/pythons/Python-system

This allows us to switch to the system python (actually a virtualenv riding on top of it) by typing:

pythonbrew switch system

Likewise, we can play the same trick for a brew-based python install:

/usr/local/share/python/virtualenv ~/.pythonbrew/pythons/Python-2.7.2_brew

and even for EPD (see caveat below):

$EPD_BIN/virtualenv ~/.pythonbrew/pythons/Python-EPD

The caveat on the EPD part: Enthought somehow f-ed up EPD w.r.t. virtualenv. To install a working virtualenv on EPD, do:

sudo $EPD_BIN/pip install --upgrade -e "git+https://github.com/satra/virtualenv.git@fix/EPDpatch#egg=virtualenv"

(At some point in the future, either the EPD people or the virtualenv people will properly sort out their business and this special treatment won’t be necessary.)

I find my Pythonic life much more orderly now that I can just pythonbrew switch EPD and be confident that python, ipython, pip, etc. all point to the right place and do what I expect them to.
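
If you want a quick reminder of which names you’ve set up this way, here’s a rough sketch that lists them. It assumes only the directory convention described above (~/.pythonbrew/pythons/Python-[version-number]); the function name is hypothetical and this is not how pythonbrew itself is implemented:

```python
import os


def list_pythonbrew_pythons(root=None):
    """Return the names found under the pythonbrew 'pythons' directory."""
    if root is None:
        root = os.path.expanduser("~/.pythonbrew/pythons")
    if not os.path.isdir(root):
        return []
    prefix = "Python-"
    names = []
    for entry in sorted(os.listdir(root)):
        full = os.path.join(root, entry)
        # Only directories named Python-[something] count; the part
        # after the prefix is the name you would type after "switch".
        if entry.startswith(prefix) and os.path.isdir(full):
            names.append(entry[len(prefix):])
    return names


if __name__ == "__main__":
    for name in list_pythonbrew_pythons():
        print(name)
```

With the virtualenvs created above, the output should include system and EPD alongside any versions that pythonbrew installed itself.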

UPDATE: I was recently asked how the above compares to mkvirtualenv, and why I don’t use that instead. If you’re reading this post and haven’t heard of mkvirtualenv, you should definitely check that out too. Basically, it’s a tool for managing virtualenv’s and it has some very nice features, though it doesn’t (to my knowledge) take care of the problem of installing Pythons or switching between “base” installs (which is particularly a problem, I find, on OS X).

Also, since I wrote this, pythonbrew has added a venv command, which makes the creation of virtualenv’s under the umbrella of pythonbrew easier. If you’re interested in trying out pythonbrew, you should check out that functionality as well.

Python for Science on Mac (Without the Tears)

Why Python?

If you’re a scientist thinking about installing Python on your Mac, there’s a good chance that you’re doing so to escape the clutches of Matlab. There are many good reasons to do so. Matlab is a monstrosity. Its syntax is ugly and archaic. Its library support for anything not-directly-related to science and engineering is awkward at best. Its installer has a long history of not actually working, and it is non-free software with irritating licensing software thrown in to make your life unpleasant. And if you do parallel/cloud computing, good luck running 100 or 1000 copies on EC2 to get your work done faster. That’ll be 100 (or 1000) times $$$ in license fees, thank you very much.

Python, by contrast, is free, has elegant modern syntax, supports actually-fully-integrated-and-not-crappily-bolted-on object oriented programming, and has an excellent and enthusiastic community supporting it. There are (usually several) packages available for doing just about anything you’d like to do with a computer, including packages that cover the vast majority of functionality of Matlab. There are even packages for interfacing with Matlab if you can’t get away fully just yet.

Getting full “batteries-really-included” Python on your Mac

OK, so we’re ready to stab a dagger into Matlab’s cold, black heart, and embrace 21st century computing. Only, there’s one problem: like many open source things, Python is a “bazaar” rather than a “cathedral”; that is, because people can get in there and solve their own problems, there is often more than one solution to any one problem. This poses a problem, particularly if you’re using a Mac, since you’ll get lots of conflicting (and sometimes bad) advice on how you’re “supposed” to use Python on a Mac.

(As an aside, Matlab isn’t really what I’d call a “cathedral” either. Maybe more of a fast food chain? Everything is homogenized… but the food isn’t very good. People keep eating it, though, because they’re used to it, and it’s “easy” in some sense).

So there’s more than one way to do things, and if you do a Google search, you’ll get a lot of conflicting advice. Some will tell you that you should never use the system-installed python, some will tell you that you should. Some will tell you to use macports or fink. Others will tell you to use a homebrew-installed python, or download a copy from python.org, or to use the pythonbrew project. If you follow bits and pieces of this advice without a broader understanding of what’s going on, you can get yourself in trouble.

Some may disagree with what I have to say, but here are some preliminary rules of thumb that I recommend:

  • If you’re new to Python and you’re on a Mac or Windows, strongly consider using the Enthought Python Distribution. More on this later.

  • Aside from the above, don’t ever install anything Python-related from a Mac-style installer package. Python/Unix-style tools have package managers (e.g. pip for Python and brew for other stuff on Mac). They’re wonderful things and you should use them. More on these later.

  • Once you’ve chosen a package manager and Python distribution, stick with it. Mixing and matching will only confuse matters. Don’t try to use both brew and macports. Don’t, for that matter, try to use macports at all. It is a giant mess. Fink is even worse. Homebrew is much better (but don’t take my word for it).

  • I (personally) recommend that you immediately disregard out-of-hand any advice that tells you to a) use macports or fink b) install the “standard” python.org python or c) use easy_install for anything other than installing pip.

The instructions here assume you’re starting from a fresh state. If this is true, and if you want to just get going and don’t care about learning a bunch of Unix stuff, you can just install the Enthought Python Distribution and get back to work. If you’ve accumulated “damage” from past failed attempts, then you’ll need to understand a bit more about what’s going on so that you can undo what’s been done. But don’t worry, once you see how everything fits together and you have the right tools at your disposal, everything really is quite sensible.

The No-Fuss Option for Scientists and Engineers: Enthought Python Distribution (EPD)

The primary advantages of using EPD are:

  • scipy/numpy/matplotlib/ipython are all installed and ready to go. These can be tricky to install yourself, so you save time and hassle by going with EPD.

  • a host of other useful stuff is installed and ready to go.

  • the Enthought Tool Suite is installed. These are some interesting libraries, which I’ve found to be damn-near impossible to install by hand. That said, I’ve never felt good about using them for code I’m planning on sharing (which is almost all of it), precisely because they are damn near impossible to install anywhere else.

  • if you want it, you can purchase support from Enthought.

Downsides are:

  • The highly useful virtualenv package doesn’t work correctly with EPD, without special incantations. How Enthought screwed this up, and why they didn’t bake in the solution is beyond me.

  • While free for academics, EPD costs money for others.

  • You’re dependent on Enthought’s schedules for releases, etc.

Once you’ve got that installed, you should be able to type ipython at the command prompt and you can start enjoying python. Good tutorials for Matlab switchers can be found here.

Installing additional packages, not included by default

I’m not going to go deeply into package management here, since this is better covered elsewhere, but suffice it to say, installing packages is as easy as typing:

pip install [package name]

e.g.:

pip install pyparsing

If you get a warning about not having permission to do that, prepend sudo to the above command. If you get a message about there being no command called pip, then type easy_install pip to install pip and then avoid ever touching easy_install again.

If you need a package that is only available from source, and the instructions recommend installing with python setup.py install, don’t do it, at least not directly. Instead go to the directory where you would have typed that command and run:

pip install .

instead. This ensures that pip knows about all of the packages you’ve installed. That way, you have the option to pip uninstall later if you want to.

Understanding and Managing Multiple Pythons

The fundamental difficulty in using Python on a Mac is that there can easily end up being several independent python installs on your machine. If you installed EPD (above), you’ve already got another one, over and above the ones that come pre-installed with your Mac. Actually, it should be noted that the ones that come with your Mac are generally perfectly reasonable, if not always up to date. You can happily use these (particularly on Lion, which has version 2.7.1); however, there is some virtue as well in starting fresh.

So there can be multiple Pythons installed on your system, and you need to keep them all straight. If you installed EPD as described above, and you didn’t press the “Customize” button during the installation process, then EPD will have inserted itself “in front of” your system python. You can verify this by typing:

which python

and it should return something like /Library/Frameworks/EPD64.framework/Versions/Current/bin/python. The which command is your friend when it comes to getting your bearings when multiple versions of something are installed. Basically, like it says on the tin, it tells you which actual file will be executed when you type something at the prompt. If EPD is not installed, or is installed improperly, then which python might return something like /usr/bin/python, which is the OS X default installed python.

It sometimes helps too to call:

ls -l `which python`

which shows you whether the python being called at the command line is actually a symlink to somewhere else.
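
You can also get the same information from inside Python, which is handy when you’re already at a prompt and not sure which install you’re talking to. A minimal sketch (the function name and dictionary keys are my own invention):

```python
import os
import sys


def describe_interpreter():
    """Report which python binary is running and where it really lives."""
    exe = sys.executable or ""
    return {
        "executable": exe,
        # realpath follows any symlinks, much like reading "ls -l" output
        "real_path": os.path.realpath(exe),
        "is_symlink": os.path.islink(exe),
        "prefix": sys.prefix,
        "version": "%d.%d.%d" % tuple(sys.version_info[:3]),
    }


if __name__ == "__main__":
    info = describe_interpreter()
    for key in ("executable", "real_path", "is_symlink", "prefix", "version"):
        print("%-11s %s" % (key + ":", info[key]))
```

If executable and real_path differ, the python you typed is a symlink, and real_path tells you which install it actually belongs to.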

It’s also useful to keep track of which version of commands like pip, ipython and virtualenv you’re using. I’ve seen people get themselves into quite a pickle with multiple overlapping installs of python on a Mac where python corresponded to one install, but pip or ipython went to another. If things are wonky, make sure that all of these commands “match” one another. The basic rule in python is that pip (and easy_install, which you shouldn’t use) install new packages into the python that they “belong to”. So if you have multiple pythons installed, there will be multiple pips too. Here’s a brief survey of where all of the various possible Pythons keep their stuff:

  • System (pre-installed):

    • python: /usr/bin/python
    • pip, ipython, virtualenv, etc.: /usr/local/bin/
    • user-installed packages live in: /Library/Python/2.7/site-packages
  • EPD 7.1 (64bit):

    • python: /Library/Frameworks/EPD64.framework/Versions/Current/bin/python
    • pip, ipython, virtualenv, etc: /Library/Frameworks/EPD64…/bin (fill in the blank)
    • user-installed packages: /Library/Frameworks/EPD64…/lib/python2.7/site-packages
  • EPD 7.1 (32bit):

    • python: /Library/Frameworks/Python.framework/Versions/7.1/bin/python
    • pip, ipython, virtualenv, etc: /Library/Frameworks/Python.framework…/bin (fill in the blank)
    • user-installed packages: /Library/Frameworks/Python.framework/…/lib/python2.7/site-packages
  • Python 2.7 (from homebrew, e.g. brew install python):

    • python: /usr/local/bin/python
    • pip, ipython, virtualenv, etc.: /usr/local/share/python/
    • user-installed packages: /usr/local/lib/python2.7/site-packages
  • Python installed by pythonbrew:

    • python: ~/.pythonbrew/pythons/Python-2.7.2/bin/python
    • pip, ipython, virtualenv etc.: ~/.pythonbrew/pythons/Python-2.7.2/bin/
    • user-installed packages: ~/.pythonbrew/pythons/Python-2.7.2/lib/python2.7/site-packages

If you don’t know about Python installed by brew or pythonbrew yet, don’t worry, we’ll touch on those briefly next.
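
If your install doesn’t match any of the entries above, you can ask the interpreter where it thinks its own stuff goes. This is a small sketch using the standard sysconfig module (available in Python 2.7 and later); the function name and the dictionary keys are just labels I picked to mirror the list above. Note that it reports what the interpreter is configured for, which may not be where a particular distribution actually drops its command-line scripts:

```python
import sys
import sysconfig


def install_layout():
    """Return where the running Python keeps its binary, scripts and packages."""
    return {
        "python": sys.executable,
        # directory where tools like pip and ipython are expected to land
        "scripts": sysconfig.get_path("scripts"),
        # directory where pure-Python packages get installed
        "site-packages": sysconfig.get_path("purelib"),
    }


if __name__ == "__main__":
    layout = install_layout()
    for key in ("python", "scripts", "site-packages"):
        print("%-14s %s" % (key + ":", layout[key]))
```

Run it once with each python you have installed (using the full path to each binary) and you’ll have your own version of the survey.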

Installing other Pythons on your Mac

If for whatever reason you’re not satisfied with EPD, you can either use the system Python, or install other versions of Python on your machine. Only do this if you have a firm grasp of where stuff is going and how to switch between them. If you want to stop EPD from being used when you type python, edit out the lines in ~/.profile (or ~/.bash_profile) that the EPD installer added. These basically just put its directories in front of all others in the PATH environment variable. If you have more than one python installed, then when you type python (or ipython, or pip), the shell searches the directories specified in PATH, in order, until it finds one that has python in it. If you’re new to Unix, you can look at the current value of PATH by typing echo $PATH. Consult the list above to see where various pythons install their wares.
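
To make the PATH search concrete, here’s a rough sketch of what the shell is doing, written so that it shows every match rather than just the first one. The function names are mine, and it’s a simplification (real shells also have aliases, functions and a command cache), but it is enough to spot a python and a pip that come from different installs:

```python
import os


def find_all(command, path=None):
    """Return every executable named `command` on PATH, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    matches = []
    for directory in path.split(os.pathsep):
        # A POSIX shell treats an empty entry as the current directory;
        # this sketch simply skips it.
        if not directory:
            continue
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            if candidate not in matches:
                matches.append(candidate)
    return matches


def same_directory(commands, path=None):
    """True if the first match for each command lives in a single directory."""
    firsts = [find_all(command, path)[:1] for command in commands]
    directories = set(os.path.dirname(m[0]) for m in firsts if m)
    return len(directories) <= 1


if __name__ == "__main__":
    for command in ("python", "pip", "ipython"):
        print("%s: %s" % (command, find_all(command)))
    print("all from one place: %s" % same_directory(["python", "pip", "ipython"]))
```

The first entry in each list is the one that actually runs when you type the command; anything after it is being shadowed.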

As long as you understand the above, you can install as many Pythons on your system as you like. Personally, there are two means of installing additional pythons that I particularly like: pythonbrew and brew.

Pythonbrew is a tool for installing Python distributions in your home directory, where they won’t cause trouble. Brew is a general-purpose package manager for Mac OS X, that in my opinion blows Macports and Fink out of the water. You can google these for more details, but at the end of the day, for pythonbrew you can type:

pythonbrew install --force --no-test 2.7.2

And a fresh copy of Python 2.7.2 will appear under .pythonbrew/pythons/Python-2.7.2 in your home directory. Likewise, if you use brew:

brew install python --universal

A clean install of Python 2.7.2 will be built and made available as /usr/local/bin/python (see above for where it puts/expects other stuff).

Installing Useful Science Stuff in a New (non-EPD) Python

Unfortunately, as of the time of writing this post, the most useful scientific python packages are also the trickiest to install: numpy, scipy, ipython, matplotlib. Assuming you already have brew and pip installed (and you’re sure you’re using the pip that you mean to use), the following commands should get your new Python in fighting shape (a nice post about these issues can also be found here):

# Install pip and distribute, if you don't already have them
curl -O http://python-distribute.org/distribute_setup.py
python distribute_setup.py
easy_install -U pip

# numpy -- no fanciness currently required, just pip install...
pip install numpy

# if this doesn't work, try using:
# pip install "git+git://github.com/numpy/numpy#egg=numpy-dev"
# this tells pip to install from the latest git repository
# often, fixes that haven't found their way into the PyPI repository
# (where pip checks) are nonetheless present in the repo

# scipy

# scipy requires gfortran, so use brew to install it
brew install --upgrade gfortran

# install a version of scipy from the repo ("pip install scipy" fails)
pip install "git+git://github.com/scipy/scipy#egg=scipy-dev"

# ipython

pip install ipython

# readline is weird for some reason, so it is sometimes necessary to call:
easy_install readline
# yes, that said "easy_install"; pip doesn't work here for unknown reasons

# matplotlib
brew install --upgrade pkg-config
pip install "git+git://github.com/matplotlib/matplotlib#egg=matplotlib-dev"

# Python Imaging Library (PIL)
brew install jpeg libtiff pkg-config
pip install --upgrade PIL

It should be noted that the state of the above commands is in constant flux. Scipy and matplotlib, in particular, are continually ping-ponging back and forth between pip install scipy being enough, and it not being enough. OS X Lion just came out, so maybe that explains the current state. Once you know the basic drill, you can try the straight-ahead pip install first, and then revert to the git repository syntax if you need to.
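
Whichever route you end up taking, it’s worth checking that the packages actually import from the python you think they do. Here’s a small sketch for that (the function name is my own, and the default list of names is just the packages discussed above):

```python
import importlib


def check_packages(names=("numpy", "scipy", "IPython", "matplotlib")):
    """Try importing each package; report its version and location."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:
            # A half-built package can fail in ways other than a plain
            # ImportError, so this diagnostic catches broadly.
            report[name] = "MISSING (%s)" % exc
        else:
            version = getattr(module, "__version__", "unknown version")
            location = getattr(module, "__file__", "unknown location")
            report[name] = "%s from %s" % (version, location)
    return report


if __name__ == "__main__":
    for name, status in sorted(check_packages().items()):
        print("%-12s %s" % (name, status))
```

If a package shows up as missing, or shows up under a different install’s site-packages than you expected, that usually means the pip you ran belongs to a different python.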

What to try if the above commands don’t work

  • Rebrewing dependencies (e.g. brew install --upgrade gfortran) can sometimes make things work again.

  • If you get errors about missing symbols, you might get some traction by calling export ARCHFLAGS="-arch x86_64" before the above commands (if you’re on a 32-bit machine, replace “x86_64” with “i386”). Mac binaries can contain 64bit and/or 32bit contents, and you can’t build 32/64bit binaries against libraries that only contain one or the other. On OS X 10.7 in particular, you really can’t go wrong with 64bit all the way. There was a time when it was recommended by some to run python in 32bit to support certain 32bit-only packages (particularly wxWidgets; you’ll still see this advice on the EPD website). The time for these packages has passed, however, as Apple has decisively dropped support for the 32bit infrastructure (Carbon) that they relied on.
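
If you’re not sure whether the python you’re running is a 32-bit or 64-bit build, you can check from inside the interpreter before fiddling with ARCHFLAGS. A minimal sketch (the function name is hypothetical); it measures the size of a C pointer, which is 4 bytes in a 32-bit process and 8 in a 64-bit one:

```python
import platform
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on how the running interpreter was built."""
    # "P" is the struct format code for a C pointer (void *)
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("interpreter: %d-bit" % interpreter_bits())
    print("machine:     %s" % platform.machine())
    print("maxsize > 2**32: %s" % (sys.maxsize > 2 ** 32))
```

On a Mac, a universal binary can run in either mode, so the answer reflects how this particular process was started rather than everything the file contains.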

Conclusion

Getting a fully-loaded scientific Python distribution up and running on your Mac is getting easier and easier, though there still are some pitfalls. EPD provides an excellent choice for beginners, and you can go any number of other routes as well if you take the time to learn a little bit about how things are organized.

Have any other tips, comments or suggestions? Please share them in the comments sections! I’d like for this post to be as complete and correct as possible, and your help is much appreciated.

Acknowledgments

Special thanks to Nicolas Pinto and Zak Stone for providing helpful comments and tips for this post.