Software engineering notes

Archive for the ‘notes’ Category

getting started with Ubuntu server security

In preparation for playing around with a VPS, I’d like to get familiar with Ubuntu 10.10 64-bit server. I grabbed the iso from their download page and installed it on vmware. Please pause with me and feel gratitude for Ubuntu. Thank you, Ubuntu, for being awesome. I was going to pick a more commercially popular OS, but I value my life, and Ubuntu was made with humans in mind.

The first thing I want to look at is security. Ubuntu’s forum has a sticky for general, intro-level security.

Ubuntu Wiki configure SSH seems like as good a place as any to get started.

This wiki page leads with “Once you have installed an OpenSSH server…”, so I set off to install openssh-server: sudo apt-get install openssh-server

But that gave me an error about openssh-server not being available for my system. After some digging, I got the impression that I might just need to update my system:
sudo apt-get update

Yup, that was it. Whew! I’m grateful it wasn’t a multi-hour quest for some random config setting.

Allegedly, after installing openssh, I should be able to ssh in right away. I ran ifconfig to get my vm’s ip address, and then tried it: ssh erik@

ssh: connect to host port 22: Permission denied.

Well, at least it’s talking to me. I think we’re ready to move on with the wiki.

I was able to make a backup of the default sshd_config file and set permissions on it without issue. On to customizing my sshd_config file: sudo vi /etc/ssh/sshd_config

  • Change PasswordAuthentication to “no”
  • I didn’t see a default setting for AllowTcpForwarding and X11Forwarding, so I added entries to turn each of these off
  • I added an AllowUsers entry for my username
  • Changed LoginGraceTime from 120 to 20
  • Changed the LogLevel from “INFO” to “VERBOSE”
  • Uncommented the Banner entry, and changed the file name from “” to “issue” for simplicity. I’ll defer setting the contents of this file.
  • I also changed PermitRootLogin to “no”
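
A quick way to double-check edits like these is to read the file back and compare it with what was intended. The following is a minimal sketch in Python, not something from the wiki: parse_sshd_config, find_mismatches, EXPECTED and the SAMPLE text are all made-up names and data for illustration, and SAMPLE stands in for the real /etc/ssh/sshd_config. It relies on two documented sshd_config behaviours: keywords are case-insensitive, and the first value found for a keyword wins.

```python
# Hypothetical helper: turn sshd_config-style text into a dict of settings.
def parse_sshd_config(text):
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # keyword without a value; ignore in this sketch
        keyword, value = parts
        # sshd uses the first value it obtains for each keyword
        settings.setdefault(keyword.lower(), value.strip())
    return settings


# The values from the list above (assumed spelling of each "off" as "no").
EXPECTED = {
    "passwordauthentication": "no",
    "allowtcpforwarding": "no",
    "x11forwarding": "no",
    "logingracetime": "20",
    "loglevel": "VERBOSE",
    "permitrootlogin": "no",
}


def find_mismatches(settings, expected=EXPECTED):
    """Return the expected keywords whose current value differs (or is missing)."""
    return {k: settings.get(k) for k, v in expected.items() if settings.get(k) != v}


# Sample text standing in for the real file; password login is still enabled here.
SAMPLE = """\
# sample only
PermitRootLogin no
PasswordAuthentication yes
AllowTcpForwarding no
X11Forwarding no
LoginGraceTime 20
LogLevel VERBOSE
"""

mismatches = find_mismatches(parse_sshd_config(SAMPLE))
print(mismatches)
```

To check a real file, the sample text could be replaced with the contents of /etc/ssh/sshd_config, read with the usual open(...).read().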

As a sanity check, I ran ps -A | grep sshd to confirm sshd is running. As a second sanity check, I tried logging in via the local machine: ssh -v localhost. Amazingly, this also worked.

Ok. Moment of truth. Restarting sshd: sudo /etc/init.d/ssh restart.

Doh! I forgot to add my ssh key before disabling password login. Quick edit to restore PasswordAuthentication. Trying again … Connection refused on port 22. Oh, yeah. I changed it to 2222. Trying again … success! – from the local machine. Still can’t ssh in from a remote host. Time to check the ssh log: tail -f /var/log/auth.log

My ssh requests aren’t showing up in the logs. Time to look into the iptables settings. I’m guessing there’s a rule in there to ignore ssh, or no rule to allow ssh. I’ll continue this in another post.

Written by Erik

October 18, 2010 at 10:35 pm

Posted in notes


Linked Data: notes from Tim Berners-Lee’s 2009 TED talk video

I started watching Tim Berners-Lee’s TED talk last night. He defines the term linked data to refer to pieces of data placed on the Web and linked to one another.  He said there are three rules for putting something on the Web:

  1. http addresses are now being used to reference any unique entity on the Web, people, places, products, events, etc., not just documents
  2. If we request an object identified by an http address, we should get back useful information
  3. The object should include relationship pointers, formatted as http addresses, to other objects, e.g., “this person was born in Berlin, and Berlin is in Germany”.  A person can link to a city, which can link to a region …
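
As a rough illustration of the three rules, here is a minimal Python sketch of a toy “web” held in a dictionary. Everything in it is an assumption made for the example: the example.org addresses, the bornIn and locatedIn relation names, and the fetch and follow helpers are invented, and a real client would make HTTP requests instead of dictionary lookups.

```python
# Rule 1: every thing (person, city, country) is named by an http address.
# All addresses below are made up for illustration.
WEB = {
    "http://example.org/person/alice": {
        "name": "Alice",
        "bornIn": "http://example.org/city/berlin",
    },
    "http://example.org/city/berlin": {
        "name": "Berlin",
        "locatedIn": "http://example.org/country/germany",
    },
    "http://example.org/country/germany": {"name": "Germany"},
}


def fetch(address):
    """Rule 2: asking for an address returns useful data about that thing."""
    return WEB[address]


def follow(address, *relations):
    """Rule 3: walk relationship pointers from one object to the next."""
    current = fetch(address)
    for relation in relations:
        current = fetch(current[relation])
    return current


country = follow("http://example.org/person/alice", "bornIn", "locatedIn")
print(country["name"])  # Germany
```

Because each relationship is itself an address, the same follow call works no matter which site holds the next object, which is what makes the data browsable.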

Linked data is browsable.  The more data is connected together, the more powerful it is.

Berners-Lee mentioned DBpedia, which describes itself as “a community effort to extract structured information from Wikipedia and to make this information available on the Web”.

Diversity on the Web is important.  We can put all kinds of data on the Web, government, university, enterprise, scientific, personal, weather, events, news, etc.  Transparency in government is important, but this data is also beneficial because it describes how life is lived.  It’s actually useful.

But owners of data are tempted to hang on to it.  Hans Rosling calls this “database hugging”.  So, make a beautiful Website, but first make the unadulterated data available.  “Raw data now!”.  A lot of the data concerning the state of the human race is sitting on computers, inaccessible from the Web.  Now that scientists are putting genomic data and protein data on the Web as linked data, they can ask questions like “What proteins are involved in signal transduction and are related to pyramidal neurons?” (personal note: this seems a lot like a db query).

Linked social data is only possible if we break down the walls of social networks.

Open Street Map is all about everyone doing their bit.  Linked data is all about you doing your bit, everyone else doing theirs, and it all connecting.

“Linked data.  I want you to make it.  I want you to demand it.  I think it’s an idea worth spreading.”

Thanks, Tim!

Written by Erik

October 15, 2010 at 10:26 pm

Getting started with Google App Engine Java SDK

A few days ago, I tried to use the App Engine Eclipse plugin, but ran into some issues, as described in an earlier post. These were probably due to my lack of experience with Java, Eclipse, and/or the AppEngine dev model, but I was blocked all the same. This time, I’ll start at a lower level, with the App Engine Java SDK.

My first stop was the App Engine Java overview page, which suggests “… you haven’t already, see the Java Getting Started Guide …”, so I hopped over.

The steps outlined in Installing the Java SDK worked well, and I was able to launch the dev server.

Next, I created my first project, the Guestbook app. The steps here were helpful too, and I was able to compile the app successfully (via the Using Apache Ant documentation), but I ran into trouble when I tried to run it:

$ ant runserver
Unable to find a $JAVA_HOME at "/usr", continuing with system-provided Java...
Buildfile: build.xml



[java] 2010-10-14 00:47:18.489 java[24218:903] [Java CocoaComponent compatibility mode]: Enabled
[java] 2010-10-14 00:47:18.492 java[24218:903] [Java CocoaComponent compatibility mode]: Setting timeout for SWT to 0.100000
[java] Oct 14, 2010 7:47:20 AM info
[java] INFO: Logging to JettyLogger(null) via
[java] Oct 14, 2010 7:47:20 AM readAppEngineWebXml
[java] SEVERE: Received exception processing /Users/foo/Sites/appengine/Guestbook/war/WEB-INF/appengine-web.xml
[java] Could not locate /Users/foo/Sites/appengine/Guestbook/war/WEB-INF/appengine-web.xml

Total time: 3 seconds

The missing file is located in /Users/foo/Sites/appengine/Guestbook/war/WEB-INF/classes/WEB-INF/appengine-web.xml, which seems to be intentional given the statement “All other files found in src/, such as the META-INF/ directory, are copied verbatim to war/WEB-INF/classes/”.

If I add the following to build.xml, so that appengine-web.xml and web.xml are copied from src/WEB-INF into war/WEB-INF, then it works:

    <copy todir="war/WEB-INF">
      <fileset dir="src/WEB-INF"/>
    </copy>

The next step would be Using the Users Service, but it’s getting alte, and I’z getting seelpy, so I’ll save that for another day.

To conclude w/ something uplifting, here’s a pic of a sleeping hedgehog.


Sleepy Hedgehog, credit: Andreas-photography

Written by Erik

October 14, 2010 at 12:22 am

Posted in notes


the Internet as a series of tubes

We’ve long sought to create a singular artificial intelligence.  I wonder if another aspect of intelligence arises simply from the existence of connections.  Our brain is not composed of intelligence, but rather a mass of connections, neural pathways, which somehow creates an opportunity for intelligence to occur.  The Internet currently seems to be beginning to manifest intelligent behavior in that I can interact with it and gain something from that interaction (“You just checked into restaurant X.  Your friend Alan is here too”).  I wonder if it is our input then that brings this intelligence to life, i.e., we are the intelligence in the giant networked computer “brain”.  I can query the Web to find out where my friends are only because they have stated their position.  A measure of the Net’s intelligence could be based on the data it contains, the questions we imagine to ask, and our ability to ask them.

I feel inspired as an Internet software developer to make the process of interaction, contribution, and connection as easy as possible.  How can I make it easier to contribute?  Simplified markup is one option.  Easy authentication is another.  Improved data collection, such as automated geo positioning via mobile devices, and mining enable us to contribute implicitly.  How can I reduce the barrier to entry?

Yahoo!’s recent brand campaign stated that the Internet is all about “you”.  One way to interpret this is that Yahoo! facilitates contributions to, and recognition of, one’s self online.  I spend so much of my time online it seems like a second home.  So much of my persona involves how I see myself reflected in the Internet.  Yahoo!, and many other services, try to make it easy to be online and manifest a personality there.  This is one way to describe the process of growing the Web.  This propagation of the Web could be summarized as: building physical and logical connections between people, and allowing them to input and retrieve data.  I’m curious to see if the Web does seem to become more intelligent relative to the success of these activities.

A few current touch points involving the simplification of Internet interaction, i.e., interaction with the Net itself, come to mind: establishing network infrastructure (how easy can we make it to set up an internet access point?  this shouldn’t be a bottleneck); creating and maintaining online identities (oauth, openid); storing the collected data in easily retrievable formats (semantic web, search, open gov, freebase, wikipedia); processing big data using mapreduce; server-side processing with web hooks and app engines; providing easy access to the processed data via asynchronous communication, key/value interfaces, convenient off-network “connect” access to social data; using apps on existing networks, and mobile devices, for easy delivery and consumption, esp. iphone and android.

Written by Erik

January 13, 2010 at 12:53 am

Posted in notes


notes from Cloud Expo 2009: Christophe Bisciglia on “Working with Big Data and Hadoop”

– machines are reliable
– machines are unique or identifiable
– a data set should fit on one machine

– it’s not a database
— it doesn’t serve data in real-time
— it augments existing DBs
— it does enable deeper analysis that would normally slow a relational DB
– leverages commodity hardware for big data & analytics
– cloudera does for hadoop what redhat does for linux

– fox
— what ppl are watching on set-top boxes
– autodesk
– D.E.Shaw
— analyze financial data
– mailtrust
— use hadoop to process mail logs and generate indexes that support staff can use to make adhoc queries

– scientific and experimental data
– storage
— multiple machines are req’d to store the amount of data we’re interested in
— replication protects data from failure
— data is also 3 times as available

– allows for processing data locally
– allows for jobs to fail and be restarted

hadoop’s fault tolerance
– handled at software level

using hadoop
– map-reduce
— natively written in java
— map-reduce can be written in any language
– hive
— provides sql interface
– pig
— high level lang for ad-hoc analysis
— imperative lang
— great for researchers and technical prod. managers
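
To make the map-reduce bullets concrete, here is a minimal single-process sketch in Python of the classic word-count job. It only illustrates the programming model (map emits key/value pairs, a shuffle groups them by key, reduce collapses each group); it does not use Hadoop, and the function names are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big clusters process big data"])
print(counts["big"], counts["data"])  # 3 2
```

In a real cluster the map and reduce calls run on many machines at once, and the framework handles the shuffle and reruns failed tasks; that is the part this sketch leaves out.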

high performance DB and analytics.  when is it time for hadoop
– in general
— generation rate exceeds load capacity
— performance/cost considerations
— workloads that impede performance
– db
— 1000s of transactions per second
— many concurrent queries
— read/write
— many tables
— structured data
— high-end machines
— annual fees
– hadoop
— append only update pattern
— arbitrary keys
— unstructured or structured data
— commodity hardware
— free, open source

– traditional: web server –> db –> oracle –> biz analytics
– hadoop: web server –> db –> hadoop –> oracle –> biz analytics

– data storage costs drop every year
– hadoop removes bottlenecks; use the right tool for the job
– makes biz intel apps smarter

– cloudera’s distro for hadoop
– cloudera desktop

Written by Erik

November 3, 2009 at 6:26 pm

Posted in notes


notes from Cloud Expo 2009: Surendra Reddy’s presentation on “Walking Through Cloud Serving at Yahoo!”

open cloud access protocol (opencap)
– definition, deployment, and life cycle mgmt of cloud resources
– alloc, provisioning, and metering of cloud resources
– metadata/registry for cloud resources
– virtual infrastructure
– why ietf?  they are brilliant folks.  we’re not attaching any vendor-specific details
– open source implementation planned
– structure
— resource model
— all infrastructure
— nodes, networks, etc
— resource properties
— modeled as json objects
— standard catalog of attributes, extensible
— resource operations
— operation, control, etc.
— management notification services

– smtp is simple.  protocols must be simple
– traffic server has bindings built in (or vice versa?)
– open cloud consortium
— a national testbed to bring clouds together

Written by Erik

November 3, 2009 at 3:11 pm

Posted in notes


Notes from Cloud Expo 2009: Raghu Ramakrishnan’s talk on the Yahoo! cloud: “key challenges in cloud computing … and the yahoo! approach”

raghu ramakrishnan
– a triumphant preso
– “key challenges in cloud computing … and the y! approach”

this is a watershed time.  we’ve spent lots of time building packaged software; now we’re moving to the cloud

key challenges
– elastic scaling
– availability
— if the cloud goes down, everyone is hosed.  consistency or performance must be traded for availability.
– handling failures
— if things go wrong, what can the developer count on when things come up?
– operational efficiency
— cloud managers are db admins for 1000s of clients
– the right abstractions

yahoo’s cloud
– the cloud is an ecosystem.  it’s bigger than a single component.  all the pieces must work together seamlessly.

data management in the cloud
– how to make sense of the many options
– what are you trying to do?
– oltp vs olap
– oltp
— random access to a few records
— read-heavy vs write-heavy
– olap
— scan access to a large number of records
— by rows vs columns vs unstructured
– storage
— common features
— managed service. rest apis
— replication
— global footprint
— sherpa
— mobstor

y! storage problem
– small records, 100kb or less
– structured records, lots of fields
– extreme data scale

typical applications
– user logins and profiles
— single-record transactions suffice
– events
— alerts, social network activity
— ad clicks
– app-specific data
— postings to message boards
— uploaded photos and tags

vlsd data serving stores
– scale based on partitioning data across machines
– range selections
— requests span machines
– availability
– replication
– durability
— is it required?
– how is data stored on a single machine?
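
The partitioning and range-selection bullets can be sketched in a few lines of Python. This is an illustration of the two general schemes (hashed vs ordered placement), not of any Yahoo! system: the machine names, the split points and all function names are invented for the example.

```python
import bisect
import hashlib

MACHINES = ["m0", "m1", "m2"]  # hypothetical machine names


def hash_partition(key, machines=MACHINES):
    """Hashed placement: keys spread evenly, but neighbouring keys scatter."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return machines[int(digest, 16) % len(machines)]


# Ordered placement: each machine owns a contiguous key range.
# With these made-up split points, m0 owns keys below "h",
# m1 owns keys from "h" up to (not including) "q", and m2 owns the rest.
SPLIT_POINTS = ["h", "q"]


def range_partition(key, machines=MACHINES, split_points=SPLIT_POINTS):
    return machines[bisect.bisect_right(split_points, key)]


def machines_for_range(start, end):
    """A range selection touches every machine whose range overlaps [start, end]."""
    first = bisect.bisect_right(SPLIT_POINTS, start)
    last = bisect.bisect_right(SPLIT_POINTS, end)
    return MACHINES[first:last + 1]


print(range_partition("alice"), range_partition("mallory"), range_partition("zoe"))
# m0 m1 m2
print(machines_for_range("f", "k"))
# ['m0', 'm1']
```

The last call shows why range selections are the harder case: a scan from "f" to "k" has to visit two machines, whereas a single-key lookup only ever visits one.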

the cap theorem
– consistency vs availability vs partition tolerance
– consistency => serializability

approaches to cap
– use a single version of a db w/ deferred reconciliation
– defer transaction commit
– eventual consistency eg dynamo
– restrict transactions eg sharded mysql
– object timelines, eg sherpa
– ref: julianbrowne.com/article/viewer/brewers-cap-theorem

single slide hadoop primer
– hadoop is write optimized, not ideal for serving

out there in the world
– oltp
— oracle, mysql,
— write optimized: cassandra
— main-mem: memcached

ways of using hadoop
– data workloads -> olap -> pig for row ops, zebra for column ops, map reduce for others

hadoop based apps
– we own the terasort benchmark

– parallel db
– geo replication
– structured, flexible schemas
– hashed and ordered tables
– components
— req -> routers -> (record looked up, if necessary) -> lookup cached -> individual machine
– raghu is awesome (“And then!”, sprinting through dense slides)
– write-ahead
– asynch replication
— why? we’re doing geo replication due to the physics involved
— supposing an earthquake hits and ca falls in the ocean, two users can continue to update their profiles
– consistency model
— acid requires synch updates
— eventual consistency works
— is there any middle ground?
— sherpa follows a timeline of changes achieved through a standard per-record primary copy protocol
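
The “middle ground” bullets can be illustrated with a small Python sketch of a per-record primary copy scheme: every write to a record goes through one primary replica, which assigns increasing version numbers, and other replicas catch up asynchronously. This is only a toy model of the idea as described in the talk, under my own assumptions; it is not Sherpa’s implementation or API, and the class and method names are made up.

```python
class Replica:
    """One copy of a single record, e.g. in one data center."""

    def __init__(self, name):
        self.name = name
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Only move forward along the timeline; ignore stale or duplicate updates.
        if version > self.version:
            self.version, self.value = version, value


class Record:
    """A record whose writes are all ordered by one primary (master) copy."""

    def __init__(self, replica_names, master):
        self.replicas = {name: Replica(name) for name in replica_names}
        self.master = master
        self.pending = []  # updates not yet shipped to the other replicas

    def write(self, value):
        primary = self.replicas[self.master]
        version = primary.version + 1
        primary.apply(version, value)          # commit at the primary first
        self.pending.append((version, value))  # replicate later (asynchronously)
        return version

    def replicate(self):
        """Ship pending updates, in order, to every non-primary replica."""
        for version, value in self.pending:
            for name, replica in self.replicas.items():
                if name != self.master:
                    replica.apply(version, value)
        self.pending.clear()

    def read(self, replica_name):
        replica = self.replicas[replica_name]
        return replica.version, replica.value


profile = Record(["west", "east"], master="west")
profile.write("status: hello")
profile.write("status: at cloud expo")
stale = profile.read("east")   # nothing replicated yet: (0, None)
profile.replicate()
fresh = profile.read("east")   # caught up: (2, 'status: at cloud expo')
print(stale, fresh)
```

A reader at a remote replica may see an old version, but never versions out of order, which is weaker than acid and stronger than plain eventual consistency.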

– cloud allows us to operate at scale
– tablet splitting and balancing
– automatic transfer of mastership

comparing systems
– main point: all of this needs to be thought through and handled automatically

– sherpa, oracle, mysql work well for oltp

benchmark tiers
– cluster performance
– replication
– scale out
– availability
– we’d like to do this as a group effort, in keeping w/ our philosophy

the integrated cloud
– big idea: declarative lang for specifying structure of service
– key insight: multi-env
– central mechanism: the integrated cloud
– surendra will talk about this

foundation components
– how to describe app
– desc for resources, entrypoints, bindings, etc

yst handled 16.4 million uniques for mj death news

acm socc
– acm symposium on cloud computing

Written by Erik

November 3, 2009 at 10:25 am

Posted in notes


Notes from Cloud Expo 2009: Shelton Shugar, “Accelerating Innovation with Cloud Computing”

Shelton Shugar just delivered an excellent keynote address, “Accelerating Innovation with Cloud Computing”, at the 4th “Cloud Conference and Expo” in Santa Clara.  The subtitle of the expo is “”.  This is also the 7th annual virtualization summit.

Yahoo is not here to sell you anything; we’re not into consulting or selling software.  At Yahoo!, cloud computing is not about saving money.  Our motivation arises from the fact that cloud computing drives innovation.  Cloud computing is the “engine of innovation”.  Yahoo! has hundreds of products and platforms all over the world.  Many of these products were the result of acquisition, so they came onboard w/ their own infrastructure, down to the metal.  Cloud computing at y! is about streamlining the services these products and platforms require.  We store hundreds of petabytes of data all over the world, and handle petabytes of internet traffic daily.  We think about scale foremost and features second.

cloud strategy
we are building a private cloud, deployed in data centers all over the world, focusing on two areas: data processing and serving.  data processing refers to data mining and analysis.  serving refers to app environments for our products, edge capabilities for fast delivery, and a channel for data to flow into storage.  This is a multi-year effort.  “Open source plays a central role”.  We both consume and produce open source.

inside the y! cloud
5 buckets: edge services, cloud serving where we host apps w/in y!, online storage for serving content to consumers, a batch processing data warehouse, and data collection services to clean, de-dup, and filter incoming data.

Serving is based on the Yahoo! Traffic Server.  Over half of all y! traffic flows through YST.

The app serving layer is based on a tiered architecture.  Apps can be cloned.  Traffic can be split natively, which allows for bucket testing.  This frees developers from having to worry about versions of the platform, location of machines, etc.  Capacity can be moved via point and click.
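
For readers who haven’t run a bucket test, the idea behind splitting traffic can be sketched in Python as follows. This is a generic illustration, not how the Yahoo! serving layer does it: the assign_bucket function, the bucket names and the 90/10 split are assumptions made for the example.

```python
import hashlib


def assign_bucket(user_id, buckets):
    """Deterministically map a user to a bucket according to percentage weights.

    `buckets` is a list of (name, percent) pairs whose percents sum to 100.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100  # stable slot in 0..99 for this user
    upper = 0
    for name, percent in buckets:
        upper += percent
        if slot < upper:
            return name
    raise ValueError("bucket percentages must sum to 100")


SPLIT = [("current", 90), ("experiment", 10)]  # hypothetical 90/10 split
print(assign_bucket("user-42", SPLIT))
```

Hashing the user id, rather than picking a bucket at random per request, keeps each user in the same bucket across visits, so the two versions of an app can be compared fairly.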

Online storage uses RESTful apis.  Deployed worldwide.  Global replication is supported natively.  Multiple consistency models are provided.  Mobstore (mass object store) is used to store large objects (1mb-2gb) such as images and video.  Objects are immutable.  Structured content is provided via a product called Sherpa, a key-value store.  Content can be replicated easily.  Sherpa is intended to support enough of the capabilities that relational dbs are typically built and maintained for.

Batch processing is oriented around Hadoop.  This has been running for a few years.  It now runs on 10s of thousands of machines.  80PB worth.  We use it to optimize our advertising and process weblogs.  1000s of yahoos are trained to run jobs on it.  hdfs allows thousands of computers to be treated as a single machine.  Pig is a higher-level procedural lang that generates map-reduce code.  It’s almost as efficient as well-written map-reduce code.  the internal joke is that most people don’t write well-written map-reduce code.  We’re building columnar storage.

An example: the y! homepage
When a user visits the homepage, the user is using y! cloud services.  Content is optimized using a feedback loop to provide relevant stories in the news offered.  Hadoop is used to optimize ad matching.  Hadoop is used to build the search index.  edge services are used to cache and load-balance the page content and normalize the news feeds.

Another example of usage: y! mail.  Hadoop is used to identify and filter spam.  before hadoop, mail engineers had to spend lots of time maintaining storage and machines to process a huge amount of data.  hadoop abstracts scale for processing enormous data, handles failures, and manages multiple users.  this allows the scientists to focus on their jobs.  mail uses cloud storage’s replication services to help detect abuse.

Y! sports usage of cloud services.  Edge services provide a proxy service to route requests for dynamic content.  this allows y! sports to provide the most up-to-date content.  People want scores as fast as possible.  the consumers are happy due to faster access to content.

y! finance.  y! is #1 for finance.  finance uses hadoop to speed advertising optimization by importing resource utilization.

yql is an sql-like language.  it allows developers to query, filter, join, etc. data.  yql uses sherpa instead of managing its own storage.

open source @ y!
hadoop.  we contribute our code for hadoop to open source.  external developers benefit and contribute back.  pig is open source.  zookeeper is a system used to coordinate multiple systems.  open cirrus is a consortium designed to facilitate cloud computing.  it has 9 members.  y!’s contribution is m45, w/ 1000 cores.  we work w/ some of the leading universities in the world.  We’ve built an enormous community around hadoop.  we can hire people straight out of university.  open source attracts the best and the brightest.

About 500 people were in attendance.

The highlight of the talk was his announcement of the newly open-sourced Yahoo! Traffic Server, now an Apache Incubator project.  A “recent post” on OStatic gives more information about the project.  traffic server can process up to 34k transactions/sec on commodity hardware.  it’s modular.  it’s how we implement our caching, proxy, load balancing, etc.  we push 400tb daily through it.  we use it in online storage to help direct traffic.  we’re hoping to create a vibrant community around traffic server like we did w/ hadoop.

Back in june, we announced the y! distribution of hadoop.  we select the code we need and test it well.  it’s a solid collection of code that’s been proven to work.  shelton announced that we’re now updating our release.

we’re fully committed to cloud computing.  “moving to the cloud requires change”.  if you’re like us, w/ lots of legacy systems, you need to make a big organizational commitment.  it’s more like a marriage than a transaction.  it takes investment to create these services and migrate to them.  it takes time.  ours is a multi-year effort.  cloud computing is worth it for us.  it’s changing our culture.  we’re able to deploy so much faster than before.

Written by Erik

November 3, 2009 at 8:51 am

Posted in notes


Notes from Christian Heilmann’s Developer Evangelism handbook


  • The Developer Evangelist Handbook
  • remove the brand
    • “As a developer evangelist you have to keep your independence.”
    • “Your independence and your integrity is your main weapon. If you lost it you are not effective any longer. People should get excited about what you do because they trust your judgment – not because you work for a certain company.”
    • ways to work w/ the competition: “Remain an independent voice”, “Become a specialist in a certain underlying technology”, “Keep your finger on the pulse”
    • “You can’t be a professional evangelist and bad-mouth the competition at the same time. We all are professionals and work on projects to make the web a better place.”
    • “Acknowledge when the competition is better”
    • “Know about the competition”
  • Work with your own company
    • “Your job as a developer evangelist is to listen to developers, understand their problems and communicate with management to try to sort the issues out.”
    • “There is no “off the record””
    • “if people ask you what is going on don’t say “no comment” as that implies you know something but are not allowed to say it. Simply state that you are not in a position to know yet but that you are investigating.”

Written by Erik

November 2, 2009 at 4:01 pm

Posted in notes


notes from YUIConf 2009: “Building YUI 3 Custom Modules”, by Caridy Patino

what is a module in yui 3?
– modules are not plugins, but there is a plugin module
– module names are passed into a sandbox w/ the ‘use’ method
– prefer YUI().use instead of var Y = new YUI(); Y.use …
– you can have multiple use() calls to defer loading
– community modules vs basic yui core team modules

custom modules
– registration
— by seed YUI().use
— seed will import
— by inclusion
— manually add script include and then YUI().use
— YUI(config)
— most performant
— this takes advantage of onload handling
— reduces number of http requests req’d in ‘by inclusion’
— organization
—- use YUI_config global var to manage registration
—- you can have multiple config options

building custom modules
– YUI.add('foo', function (Y) { /* module code */ }, version, { requires: [ /* names of required modules */ ] });
– naming convention: utilities are all lowercase, classes are camelcase w/ uppercase leading char
– plugins extend host modules
– stack: utilities –> classes –> plugins –> mashups

how to use and build plugins
– plugins allow us to extend an existing class at runtime
– the def of a plugin looks much like that for a module class
– instead of extending y.base, we extend y.plugin.base

mashups and legacy code
– using multiple modules, including external dependencies, enhancing dom, defining event listeners
– use case: using a pre-existing yui2-based object in yui3
– check out zakas’ talk on scalable app arch
– cool: organize app as module repo
– conclusions
— define apps at a granular level
— modular apps are easier to test
— share code thru yui3 gallery
— use yui custom modules to integrate pre-existing code

– differences btwn yui2 and yui3 lazy loading?
— yui3 will load everything as a single item, if module requirements are defined using config option
— yui3 will load items in the order they are specified
– reusing modules across multiple sandbox
— yes, if defined as such in config


Written by Erik

October 29, 2009 at 10:21 am

Posted in notes
