Networking

TODO current landscape TODO merge in notes from 6.824 TODO beep TODO csfq core stateless fair queueing TODO mrd multi radio diversity TODO srm scalable reliable multicast: rand # from (C1+C2 r) d TODO igmp tree-building; slotting/damping; routers can prune its subtree out of the group TODO crossbar

  • LPM
  • sched: all are maximal matching
    • wavefront
    • PIM: rand
    • iSLIP: round-robin

TODO wireless mac

  • hidden terminal problem: A -> B <- C
  • exposed terminal problem: A <- B C -> D
  • approaches
    • 802.11 CSMA: rand in [0, cw], ack
    • reserv: MACA: RTS/CTS; MACAW: adds more features (DS, etc)
    • TDMA: when high traffic

TODO i3, doa

TODO etx

  • expected # transmissions is \frac{1}{d_f d_r}, where d_f, d_r are probs
  • better than hop count

routing

  • DSDV: seq no
  • DSR: best; “plz tell me route to x”; store in mem; src-route
  • OAODV: ad-hoc route discovery; uses DV

people

  • Bob Metcalfe: inventor of Ethernet
  • Van Jacobson: primary contributor to TCP/IP

basic performance facts

  • latency
    • typical backbone takes 60 ms from SF to NYC
    • EDGE: 150 ms best, 500 ms real-world avg
    • ref: an engineer’s guide to bandwidth on ydn

general terms

  • anycast: send to 1 of many; useful for HA/load-bal/geoloc; used in BGP, DNS

subnet notations

  • /0 = mask 0.0.0.0 (any address)
  • /24 = mask 255.255.255.0
  • /32 = mask 255.255.255.255 (specific host)
  • 0.0.0.0/0 or 123.123.123.123/0: any address
  • 192.168.0.1/24: typical LAN
  • 4.3.2.1/32: specific host

MACs

  • time division multiple access (TDMA)
  • frequency division multiple access (FDMA)
  • carrier sense multiple access (CSMA): verify channel clear before xmit
    • with collision detection (CSMA/CD): terminate xmission as soon as collision is detected, reducing prob of collision on retry
    • with collision avoidance (CSMA/CA): eg RTS/CTS

dns

  • 13 root NSs; maintained by ICANN
  • either recursive or non-recursive queries
  • protocol: stateless
    • udp port 53, or tcp if response > 512B or if doing zone xfer
    • sections: header, question, answer, authority, add’l
    • question: the only section in queries
      • name (no wildcards), type (incl. specials), class
    • answers: matching RRs; if any CNAMEs, then target RRs should also be included
    • auth: NS RRs pointing to more authoritative NSs; clients should cache this
    • add’l: RRs that may be useful
  • zone xfers: commonly done by secondary NS updating its DB
  • glue records: specify the IP, not just the hostname, of another NS, to avoid circular deps (if under same domain as the one being resolved)
  • caching: TTL 0 to 68 yrs; also have negative caching (non-existence)
  • some apps (like browsers) maintain own DNS cache, to reduce resolves
  • zones: subtrees in domain name tree; don’t necessarily extend to leaves
    • may delegate authority to another NS for a zone
    • an NS may be authority for multiple zones
  • other uses
    • many hostnames -> 1 IP: virtual hosting
    • 1 hostame -> many IPs: fault tolerance, load distribution, geolocation
    • MX records: where MTAs should send email for this domain
      • mailers try sending to the MX-indicated servers in the order that they’re listed
      • backup MXs do store-and-fwd (tmp intermediate hold)
    • internet-scale cached storage, eg email blacklists, software updates (latest version checks), sender policy framework/domainkeys (use TXT records)
  • reverse lookups: maintain a reverse lookup hierarchy under in-addr.com using pointer records
    • eg to find name of 1.2.3.4, lookup 4.3.2.1.in-addr.com (note topological addr reversal)
    • populated along with forward entries
    • impractical since introduction of CIDR; see RFC 2317 (uses CNAMEs)
  • resource record (RR): the basic data elt
    • contains: NAME (FQDN), TYPE, TTL, CLASS, type-specific data (RDLENGTH, RDATA)
      • NAME: may have wildcard
      • CLASS: IN for internet, CH for chaos, HS for hesiod
      • TYPE
        • A (answer): translate name to IP
        • NS: list FQDNs of (more) authoritative NSs for zones
        • MX: specify the mail server
        • CNAME: canonical name; specifies a FQDN; no other RRs allowed if this is present; hence root domains (blah.com) can’t be CNAMEs since they must also have at least NS, SOA (problem for amazon ELB)
        • PTR (pointer): like CNAMEs, but semantically means just a pointer; used for reverse lookups
        • SOA (start of authority): start of DNS zone; the first record in a NS for that domain
    • RRs of same type define a resource record set; this is what’s returned
    • RRS order undefined, but servers implement round-robin for load bal
  • DNSSEC: works on complete RRSs in canonical order
  • domain names have no char restrictions, but hostnames must be LDH (letters, digits, hyphen; ASCII subset)
  • internationalized domain names in applications (IDNA): unicode in LDH
    • based on punycode
  • dynamic DNS: use UPDATE DNS op to add/remove RRs dynamically from a zone DB on an authoritative NS
  • 240+ country code TLDs (ccTLDs)
  • registration
    • domain name registrars: accredited by ICANN; they feed registries
    • TLDs maintained by different orgs, operating a registry
    • registries, registrars charge annual fees
    • WHOIS DB held by many domain registries; provides contact info
  • WHOIS: tcp port 43; query/response
    • no std for determining authoritative WHOIS server for a domain
    • commonly used for TLDs
      • whois-servers.net provides DNS CNAMEs for TLD WHOIS servers of the form <TLD>.whois-servers.net
      • some TLDs publish a server referral (SRV record) for the WHOIS protocol in their zone, which identifies their WHOIS server, in the format _nicname._tcp.TLD
  • COM and NET: thin registry used; domain registry holds basic WHOIS
  • verisign’s sitefinder directed unregistered domains to verisign site, breaking many non-web apps like email (which relies on non-existence to mark msgs undeliverable)
  • general advice

SMTP

  • cmds
    • HELO: greeting
    • EHLO: greeting + list server capabilities/extensions
    • STARTTLS: upgrade to TLS
    • MAIL FROM: set return addr
    • RCPT: set dest addr
    • DATA: actual hdrs + msg
  • notes
    • BCCs are just RCPT TOs that aren’t in the data
    • routing protocol deals w transient failures
  • sender policy framework (SPF): use DNS to list IPs allowed to send from a domain (a “senders’ MX”)
    • thus, gmail (which can send from other domains) requires they be included, if this is set
    • implemented by spamassassin, courier, exchange, patches for postfix/sendmail/exim/qmail
    • may set Received-SPF header
  • domainkeys identified mail (DKIM): pub key
    • context
      • used by gmail, many others
      • DomainKeys is different; pushed by yahoo
      • comparable to sender id, which is a MS technology
    • each SMTP relay signs whole mail and specifies a selector name
      • PKs stored in DNS TXT record in SELECTOR._domainkey.example.com
      • may set DomainKey-Status header
    • DKIM-Signature
      • h= headers included in hash; ordered, so can’t reorder headers, despite RFC2822
      • c= canonicalization (simple is strict, relaxed tolerates whitespace changes, header rewrapping, etc)
      • ``
    • author domain signing practices (ADSP)
      • optional extension to DKIM
      • author address is From/Sender
      • author domain signature: DKIM sig by domain that is author’s domain
      • practices: published in adsp.domainkey.workping.com TXT
        • unknown: same as not defining any record
        • all: all mail is signed with author domain sig
        • discardable: all, and drop any mail that isn’t
      • registered with Authentication-Results as dkim-adsp
        • none, pass, unknown, fail, discard, nxdomain, temperror, permerror
  • internet message format RFC5322
    • comments are parenthesized strings
  • security
    • SASL: the usual authentication mechanism
    • SMTP port 25: no TLS enforcement
    • SMTPS port 465: unofficial & obsolete
    • Submission port 587: require STARTTLS
  • Authentication-Results RFC5451
    • subsumes Received-SPF and DomainKey-Status
    • always prepended
    • Authentication-Results: AUTHSERV-ID; METHOD=RESULT PTYPE.PROPERTY=PVALUE …; …
    • DKIM/DomainKeys
      • results: none, pass, fail, policy, neutral, temperror, permerror
      • DKIM properties: header.d, header.i
      • DomainKeys properties: header.d, header.from, header.sender
    • SPF/Sender-ID
      • results: none, neutral, pass, policy, hardfail, softfail, temperror, permerror
      • SPF properties: smtp.mailfrom, smtp.helo
    • SMTP AUTH
      • results: none, pass, fail, temperror, permerror
      • properties: smtp.auth
    • examples:
      • no auth done: Authentication-Results: example.org; none
      • could’ve been combined:

        Authentication-Results: example.com;
                  auth=pass (cram-md5) smtp.auth=sender@example.com;
                  spf=pass smtp.mailfrom=example.com
        Authentication-Results: example.com;
                  sender-id=pass header.from=example.com
        Received: ...
      • diff MTAs

        Authentication-Results: example.com;
                  sender-id=hardfail header.from=example.com;
                  dkim=pass (good signature) header.i=sender@example.com
        Received: from mail-router.example.com
                      (mail-router.example.com [192.0.2.1])
                  by auth-checker.example.com (8.11.6/8.11.6)
                      with ESMTP id i7PK0sH7021929;
                  Fri, Feb 15 2002 17:19:22 -0800
        Authentication-Results: example.com;
                  auth=pass (cram-md5) smtp.auth=sender@example.com;
                  spf=hardfail smtp.mailfrom=example.com
        Received: from dialup-1-2-3-4.example.net
                      (dialup-1-2-3-4.example.net [192.0.2.200])
                  by mail-router.example.com (8.11.6/8.11.6)
                      with ESMTP id g1G0r1kA003489;
                  Fri, Feb 15 2002 17:19:07 -0800

SNMP

  • agents running on managed devices comm w network mgmt system (NMS) on UDP
  • packet types: get requests, set requests, get next requests, get bulk requests, responses, traps (asynchronous notifications), inform requests (trap acks)

zeroconf

  • techniques for creating IP networks automatically
  • displaces DHCP and DNS services
  • link-local addr assignment: standard in IP; addr autoconfig; 169.254.0.0/16
  • uses multicast DNS (mDNS) for name resolution, service discovery (printers, shares)
    • sent to bcast IP 224.0.0.251 on UDP port 5353
    • apple, DNS-SD: TODO
    • making IETF std
  • look up SERVICEID.SERVICENAME.domain.com
  • impls
    • bonjour: apple’s
    • avahi: posix

CIDR TODO

ethernet

  • segment: single physical layer
  • bridge: connects segments into multi-hop network (LAN), forming a broadcast domain
    • maintain forwarding table mapping MAC addrs to ports
    • for unknown MAC addrs, flood all outgoing ports; cache incoming packet addrs
  • each host has 48-bit MAC address (MAC-48; also used in other PHYs)
    • first 3 bytes: organizationally unique identifier (OUI)
    • latter 3 bytes: NIC specific
  • multiple bridges are arranged into spanning tree to avoid loops since ethernet pkts have no TTL
    • admins choose a single root bridge
  • ARP (bcasting) is used to resolve IPs to MACs

tcp

more tcp

  • nagle’s: wait for acks, in order to batch sends
  • oob: send urgent data out of stream; unused; unreliable; poorly supported
  • delayed ACKs for batching; by default (on linux), wait up to 40ms

IP

  • explicit congestion notification (ECN): RFC 3168; set bit in header instead of dropping packet
    • only meaningful with active queue mgmt (AQM), of which early impl’s were RED

misc

  • flow table
    • uses: ACL in firewalls, NAT, QoS, …

nat TODO

  • types
    • full cone: once internal addr (iaddr:iport) is mapped to external addr (eaddr:eport), pkts from iaddr:iport go through eaddr:eport, and pkts to eaddr:eport go to iaddr:iport
    • (address) restricted cone: only accept pkts previously sent to host:anyport
    • port-restricted cone: accept only pkts
    • symmetric
  • nat-t
  • simple traversal of udp through nat (STUN)
    • UDP
  • traversal using relay nat/traversal using relays around nat (TURN)
    • use intermediate server; written for SIP; introduces lag
    • TCP/UDP
    • proxy:
  • ICE: direct connection with another peer
    • written for SIP; fall back on TURN

tun/tap

  • virtual network devices backed by software; mainly for VPNs
  • tun is IP (L3); tap is ethernet (L2); usu want tun, more efficient
  • but tap will let remote hosts coexist on same subnet; no routing table changes

vpn

  • L2TP: the main use of ipsec (L2TP:ipsec::HTTP:TLS)
  • ipsec expects key distribution to have been solve (public keys to have been exchanged already); this is reasonable for vpn’s
  • vpn creates a new interface backed by the vpn software daemon (which does encryption); the routing table is updated to route through this interface

ipsec

  • authenticated header (AH) vs. encapsulating security payload (ESP)
    • authentication vs. authentication and/or encryption
    • these are both protos above IP alongside TCP/UDP/ICMP/IGMP
    • can’t nat vs. can nat: AH includes src/dst IP address in HMAC vs. ESP doesn’t authenticate src/dst IP
    • encapsulation: precede vs. surround
  • transport vs. tunneling
    • encapsulation: packet body vs. whole IP packet (IP proto = IP(-in-IP))
    • endpoints: hosts vs. gateways
  • internet key exchange (IKE): establishes shared session secret; uses PKI or pre-shared key; builds upon the ISAKMP/Oakley protocol; basically d-h
  • main mode vs. aggressive mode
    • 6 packets back and forth to init vs. 3 but with slightly less security
  • security parameters index (SPI) is an identifying number in each packet; think of like a randomly chosen port numbered: indexes into state about session details, called a security assocation (SA), stored in a SADB
  • nat-t: encapsulates IPsec IP packets; runs on UDP port 4500
  • read somewhere (perhaps in NIST doc), unverified: ipsec doesn’t use pki/certs, but this was planned
  • refs
    • http://unixwiz.net/techtips/iguide-ipsec.html
    • http://csrc.nist.gov/publications/nistpubs/800-77/sp800-77.pdf
  • implementations
    • freeswan: orig ipsec for linux; dead
    • openswan: freeswan fork; targets more than linux
    • strongswan: freeswan fork; focus on good support for certs, smartcards
    • netkey: native linux ipsec; in linux 2.6+; see ipsec-tools aka kame-tools
    • windows server
    • refs: http://www.jacco2.dds.nl/networking/freeswan-l2tp.html

PPP

  • login protocol for dial-up access
  • some encryption: CHAP and PAP password encryption algos

RADIUS

  • runs over UDP; used by VPN servers, AP, NAS, etc.

Wi-Fi

  • 802.11-legacy
  • 802.11a: 54Mbps, 5GHz band, OFDM (orthogonal frequency division multiplexing), 35m
  • 802.11b: 11Mbps, 2.4GHz band, DSSS, 38m
    • CSMA/CA (carrier sense multiple access/collision avoidance)
  • 802.11g: 54Mbps, same 2.4GHz band as/back-compat w 802.11b, OFDM, 38m
    • back-compat w 802.11b but presence of b lowers bandwidth to 11Mbps
    • only 3 channels that don’t overlap
  • 10 beacons per second
  • implementations
    • ath5k is the newest atheros driver

Cellular

  • CDMA vs GSM: two main competing techs

  • general packet radio service (GPRS): packet data service for GSM; TODO
  • generic terms for n-th generation networks
  • CDMA: a 2G network; launched in 1995
    • pioneered/still very controlled by qualcomm
    • leader in USA until recent surge by competitor GSM
    • used by Verizon, Sprint (eg Virgin Mobile), MetroPCS
    • users register their phones; no SIM cards
  • CDMA2000: succeeds CDMA
    • comprises 1xRTT (2.5G), 1xEVDO (3G), 1xEVDV (3G)
  • Enhanced Data-rates for GSM Evolution (EDGE): 2.5G replacement for GSM
    • theoretical top speed of 200Kbps
  • Evolution Data-Only (EVDO): subset of CDMA2000
    • peaks at 2.4Mbps; averages at 300-600Kbps
    • used by Sprint, Verizon
  • Global System for Mobile Communications (GSM): 2G (9.6Kbps) network
    • popular because it’s international standard (eg europe, asia)
    • used by ATT, T-Mobile
    • has security vulns even when using KASUMI
    • users use portable SIM cards; can switch phones wo hassle
  • High-Speed (Downlink) Packet Access (HSDPA/HSPA): enhancement for 3G UMTS networks
    • used by ATT
  • Universal Mobile Telephone Service (UMTS): 3G network that GSM carriers use
    • developed by GSM community; one of ITU’s family of 3G systems
    • peaks at 2Mbps; averages at 300-400Kbps
    • W-CDMA
    • used by ATT
  • WiMax: high-perf Wi-Fi 802.16e open standard; 12Mbps; OFDMA; lost to LTE
    • once potential cornerstone of 4G networks
    • once backed by sprint/clearwire/intel
  • long term evolution (LTE): 4G tech winner; 100/50Mbps; proprietary standard
  • Advanced Wireless Service (AWS): 3G
    • used by T-Mobile
  • cell tower area: ~10 sq mi

wireless

  • WiFi
    • 802.11a: 54 Mbps, 20MHz-wide channels
    • 802.11b: 11 Mbps, 20MHz-wide channels
    • 802.11g: 54 Mbps (20-30 actual), 20MHz-wide channels
    • 802.11n: 150 Mbps (75-100 actual) per radio stream (MIMO)
      • 2.4GHz or 5GHz, but 2.4GHz is crowded & can’t take adv of future improvements
      • 20MHz- or 40MHz-wide channels (2x bandwidth)
    • 802.11ac: 1 Gbps @5GHz; due 2012
      • up to 160MHz-wide channels
      • multi user MIMO (MU-MIMO): xmit simultaneous streams to diff users on same channels
  • WiGig: 60GHz + WiFi; 7 Gbps
    • beyond 10 m, switch to 600-Mbps WiFi, which has 100-m range

DNS

  • start of authority (SOA) record: info about a zone
    • mname: FQDN of primary NS
    • rname: responsible person’s email with @ replaced by .
    • serial: an incrementing version number (uint32_t) for this record; usually an encoding of timestamp
    • refresh: seconds until secondary NS tries to refresh; 1 hr
    • retry: seconds beween secondary NS retries if refreshes failed; 15 min
    • expire: seconds until secondary NS stops answering queries for this zone; long time
    • minimum: seconds that records of this zone are valid; 1 day
  • NS record
  • A record: main record type
  • MX record: mail exchange record
  • CNAME record: alias
  • ref: http://rscott.org/dns/soa.html

TCP

  • segment: a packet (as in max seg size, MSS)
  • push: send immediately
  • Nagle’s algorithm: delay packets sends for batching
  • rfc793: TCP
    • options: nop, mss
  • rfc1323: TCP Extensions for High Performance
    • introduces 2 tcp options
    • wscale: scaled windows (larger than 2^16)
    • timestamps: for estimating RTT
  • rfc2018: TCP Selective Acknowledgment Options
    • sacks: ack particular bytes; rexmit what’s needed instead of entire tail
  • rfc2581: TCP Congestion Control
    • slow start, congestion avoidance, fast retransmit, fast recovery
  • rfc2861: TCP Congestion Window Validation
    • problem: after long periods of silence, window may not reflect current network congestion
    • decay window cwnd after transition from a sufficiently long application-limited period
  • http://www.tcpipguide.com/
  • http://www.iana.org/assignments/tcp-parameters/

Congestion Control in Linux TCP

  • TODO

network topologies

  • clos network
    • insight: number of crosspoints required can be much fewer than were the entire switching system implemented with one large crossbar switch
    • reasonable approximate indication of total cost of switching system
    • 3 stages: ingress, middle, egress; each stage has some crossbar switches
      • clos network
      • nonblocking: requires m \ge 2n - 1
      • may be extended to any odd number of stages by replacing each middle crossbar with 3-stage network
    • very-large-scale integration (VLSI): integrated circuits by combining thousands of transistor-based circuits into a single chip
  • fat tree (charles leiserson): upper links are fatter
    • make nonblocking my making upper links fat enough to aggregate lower links
    • this is basic def; “A Scalable, Commodity Data Center Network Architecture” uses hypertree version of this

Clean Slate

Why can’t I innovate in my wiring closet? (Nick McKeown’s talk, 4/17)

  • worthless talk
  • current
    • software approach: XORP (and Zebra; or Linux)
      • flexible, easy
      • limited perf
      • point solution, limited port count
    • hardware approach: NetFPGA (http://www.netfpga.org/)
      • small “router”, OSS hardware/software
      • line-rate perf
      • point solution, limited port count
    • first two are aimed more at the backbone
      • network researchers tend to focus more on the core
    • OpenWRT
      • embedded Linux for wireless APs and routers
      • aimed more at our more local networks
        • most of us interact at the edges
  • idea: change the substrate
    • IP is the current narrow waist
    • virtualization layer would allow us to introduce alternatives to IP
    • one way to characterize the GENI approach
    • “innovation from the top down”
    • see: WUSTL SPP, VINI, …
  • proposed approach: OpenFlow, “innovation from the bottom up”
    • complementary, not a replacement for that approach
    • flow layer
  • deployment
    • work with switch and AP vendors to add OpenFlow to products
      • no immediate motivation for them, besides better relations with researchers
    • deploy on university campuses
    • stand back and watch innovations
  • desirable features
    • isolation: regular production traffic untouched by experimental stuff
    • virtualized and programmable: different flows processed in diff ways
      • can’t add more hardware to the closet, “stuff will catch on fire!”
    • equipment we can trust in our wiring closet
    • open development env for all researchers (e.g. Linux, Verilog, etc.)
    • flexible defs of a flow
  • the router talks to a controller PC (when it sees a new flow?)
  • flow table entry, type 0
    • rule: match on header fields
    • action: fwd to port, encap and fwd to controller, drop, send to normal processing pipeline
    • stats: # packets, bytes
  • type 1 (future)
    • more flexible header/matching
    • add’l actions: rewrite headers, map to queue/class, encryption
    • support multiple controllers: load-balancing and robustness
  • status
    • Stanford CS building will have this

GENI

  • http://geni.net/

Data Center Network Architecture

Floodless in SEATTLE (princeton)

  • status quo
    • ethernet
      • simple to manage/reconfig: MAC IDs are of flat namespace
      • not scalable: relies on flooding/bcasting to learn network info
        • fwding table grows linearly with network size
        • flooding for basic operation
          • flood all ports for unknown MAC addrs
          • ARP (bcasting) is used to resolve IPs to MACs
        • multiple bridges arranged into spanning trees; 1 path btwn any 2 nodes
          • packets “routed” through tree don’t necessarily follow shortest path
          • also can’t choose alt paths for scalability/reliability
    • IP
      • shortest-path routing; smaller routing tables based on prefixes/subnets
      • config overhead due to hierarchical addressing
        • must specify subnets on router interfaces
        • DHCP servers must be consistent with router subnets
      • limited mobile host support (migratable VMs)
    • VLANs
      • group hosts into single broadcast domain
      • defined logically, not based on locations
      • hosts can retain IP addrs while moving btwn bridges
      • extending across multiple bridges required trunking (provisioning) at each bridge; deciding which bridges should be in a given VLAN must take into acct traffic & mobility patterns for efficient operation, hence done manually
      • limited control-plane scalability: must maintain fwd tables on every host in every VLAN
      • insufficient data-plane efficiency: single spanning tree in each VLAN
  • SEATTLE
    • switch topology is replicated to each switch using link-state protocol
      • enables shortest path routing
      • sensible: switch topoligy changes much less frequently than indiv hosts
    • 1-hop network-level DHT: IP -> MAC -> physical loc
      • modified ARP to query DHT
      • routers cache DHT lookups; argues that reactive caching results in smaller tables than proactive distribution
    • define “groups” (on DHT) instead of doing network-wide bcasts
  • http://everythingisdata.wordpress.com/2009/09/17/floodless-in-seattle/

A Scalable, Commodity Data Center Network Architecture (amin vahdat, ucsd, sigcomm08)

  • use an instance of a clos network they call a “fat tree” [not sure why they call it a fat tree, since that already has another meaning]
  • straightfwd application of clos network

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric (amin vahdat, ucsd, sigcomm09)

  • problems: DCs have: no scalability, mgmt difficulty, inflexible comm, limited support for VM migration
  • assumptions: modern DCs are “fat trees” or multi-rooted hierarchies
    • core, agg, and edge switches
    • location discovery protocol (LDP): used by switches to determine position in hierarchy and communicate this to the other switchesg
  • propose pseudo MAC (PMAC) addrs [add layer of indirection]
    • PMACs are hierarchical and specify locations; reduce table sizes
    • translated back to MACs by switches; routing is done by MACs
    • VMs keep stable MACs (and IPs)
  • centralized fabric mgr to resolve ARP queries, and to simplify multicast/fault tolerance
    • switches fwd ARP reqs to FM
    • usually can resolve bc PMAC-IP mappings are eagerly sent to FM as PMACs are assigned and as VMs are migrated (new PMAC)
    • rarely, can’t resolve, so bcast down fat tree and cache result
  • fault tolerance simplified
    • switches send keepalive (LDP) msg to neighbors every 10 ms
    • if no keepalive for 50 ms, assume switch failed, contact FM
    • FM updates state, informs affected switches
    • switches recompute tables based on new topology
  • implemented using OpenFlow
  • vs VL2: added LDP for easier mgmt
  • [Ethan also uses centralized routing implemented in OpenFlow]
  • http://everythingisdata.wordpress.com/2009/09/19/portland-a-scalable-fault-tolerant-layer-2-data-center-network-fabric/

VL2 (msr, sigcomm09)

  • focused on high bisectional bandwidth and good fairness among flows
  • VL2 (virtual layer 2) is an overlay that virtualizes IP addrs
    • app-level IP addrs (AAs) allow for flexibility and migration; single flat layer 2 addr space
    • loc-specific IP addrs (LAs) are actual locations; standard IP (layer 3) addrs
    • central directory translates AAs to LAs
    • modify end-host OSs to be clients of directory; no need to change routers though
    • OS module also intercepts ARP reqs for LA addrs and converts ARP bcasts into unicast lookups into a centralized directory service (like PortLand/Ethane)
  • use clos network topology
    • each top-of-rack switch connects to two agg switches
    • each agg switch connects to every core switch (called “intermediate” switch)
  • use valiant load balancing (VLB) and equal-cost multi-path (ECMP) routing
    • valiant load balancing: work from ’80
      • dst-indep (eg random) traffic spreading across multiple intermediate nodes
      • VL2 spreads flows rather than packets; avoids OOO delivery
    • VLB+ECMP spreads traffic uniformly across network paths
      • VLB uses random intermediate waypoint (switch) to route traffic btwn 2 endpoints (using IP-over-IP)
      • ECMP selects amongst multiple equal-cost paths; provides load balancing among the many VLB paths
    • [but does this apply to intra-rack flows too? seems silly.]
  • all together
    • first AA -> LA then encapsulate with an intermediate switch
  • found high variance among flows in MS properties, so static BW alloc is not promising
  • implemented entirely in existing protocols: packet encaps, ECMP, OSPF, etc
  • ran on cluster of 3 racks, 80 servers, 10 switches

PortLand vs VL2

BCube/DCell (MSRA)

  • more complex networks than fat trees
  • like hypercube networks; orig in thinking machine’s CM-5 supercomputer

Monsoon TODO

your data center is a router: the case for reconfigurable optical circuit switched paths (rice, cmu, intel pa, hotnets09)

  • use reconfigurable optically switched network for bulk xfers
  • keep electrically switched network for small flows

flyways to de-congest data center networks (MSR, hotnets09)

  • propose hybrid wired/wireless network; wifi is “flyway”
  • use wifi to handle transient overages in wire

Routing

Secure Internet Routing (Sharon Goldberg, Princeton, MIT special seminar, 2008-02-26)

  • work on both the Control Plane (routing protocol layer) and the Data Plane (normal traffic)
  • SBGP: announce paths signed with public keys
  • on the Control Plane, showed that Secure BGP does not dissuade rational players under the “attraction” incentive model
    • key idea (game theory-ish): a player’s utility function depends not just on their willingness to send packets to a place, but also how much they want others to send packets through them (how much they want to receive packets; this is the “attraction”)
    • SBGP does dissuade rational players from lying under the previous “no attraction” incentive model
    • how do players know how to manipulate people they’re receiving from? they can learn about others’ valuation functions out-of-band
  • next-hop routing: you can lie in this by generating cycles, taking advantage of players’ refusal to route using cycles
    • but SBGP prevents cycles
    • combine SBGP and next-hop routing: then you can’t lie!
    • next-hop routing not practical itself, but useful when combined with filtering, but you can no longer prove anything (simple NHR is necessary)
  • on the Data Plane, you can actually detect a lying router
    • pair share secure key into a hash function h
    • maintain a sketch: an array of counters A
    • for each packet x, A[h(x)]++
    • compare sketches every once in a while; if drop or change, then mismatch
  • ongoing work: need to have keys for each pair of endpoint nodes in a network
  • to localize, you need to have key pairs with every node in the system

DCell: Data Center Networking (reading group reading 2008-07-21)

  • properties
    • scalability
    • fault-tolerance: better than trees
    • high network capacity
  • related work
    • Fat Tree (by CEL, for supercomputers): “scalability of Fat Tree is not comparable to that of DCell”
      • Fat Tree: total bandwidth at every level of the tree is the same

Compact Routing

Linux

Speeding up Networking (Van Jacobson, Linux.conf.au 2006)

  • TODO
  • http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2006/01/27#vj_channels
  • http://evanjones.ca/researchpapers/network-channels.html

NSDI09

iplane nano (ucsd nsdi09)

  • motivation
    • eg p2p CDNs are geo dist; clients must pair with best replica
    • eg bt, skype, etc
    • internet perf neither const nor queriable; currently each app measures independently
  • desired sol: predict perf and have shared infrastructure for all apps
  • prior work
    • network coords: limited to latency, but lightweight
    • iplane: rich metrics, but 2GB atlas (large mem footprint)
    • iplane nano: 7mb atlas
  • overview
    • traceroutes from vantage points
    • plane: store all paths; leads to combinatorial explosion
    • key idea: replace atlas of paths with atlas of links
      • O(n^2) -> O(n)
    • clients download atlas and query locally against prediction engine
  • routing policy: link atlas throws away too much info
    • iplane: 81% correct predicted AS path using 2GB atlas
    • strawman link-based: 30%, 5MB; too big a hit
    • iplane nano: 70%, 6.6MB
  • so far just predicting routes; also predict properties (latency, loss rate, bw)
    • predict route, then compute prop based on props of links
    • ongoing challenge: measuring link props
  • evaluated how much it improves 3 apps (refer to paper for latter 2)
    • p2p cdn: choose replica with best perf
    • voip: choose detour node to bridge hosts behind nats
    • detour routing for reliability: route around failures
  • p2p cdn: 199 plab nodes, 10 random akamai nodes per client, 1MB downloaded from ‘best’ replica
    • best to worst: iplane nano, measured latency, vivaldi, OASIS, random

congestion control

Analysis and Simulation of a Fair Queueing Algorithm

  • traditionally, endpoints manage congestion control (eg TCP), while use simple FCFS/tail drop policy
    • problems
      • malicious endpoints can do whatever they want
      • heterogeneous CC techniques may interact in unpredictable/undesirable ways
      • high latency for interactive low-BW conns sharing same network as high-BW conns
  • FQ: bit-by-bit round robin, but 1 packet at a time
    • each user has separate queue
    • route packets in order defined by per-packet finishing times
    • if packet arrives and full, drop last packet from largest queue
  • give slight pref to packets arriving into empty queues
    • even without this, FQ has much better promptness
  • game theory: encourages good behavior since bad is punished (dropped)
  • adds much more state to routers

random early detection gateways for congestion avoidance

  • 2 schools: argue that router must play important role in CC
    • random early detection (RED): take action as soon as approaching congestion is detected
    • fair queuing (FQ): take action once queue buffers are full
  • detection: calculate avg queue size using low-pass filter with EWMA
    • min size and max size separate no/moderate/severe congestion states
  • on congestion, mark packet to inform senders to reduce rate
    • marking: either dropping packet or setting bit in header
    • no congestion: no packets need to be marked
    • severe congestion: all packets marked
    • moderate: mark with probability that’s scaling with queue size
  • properties
    • responds more promptly than “tail drop” policy, avoiding catastrophic congestion collapse
    • fairness: high-BW senders more likely to have packets dropped than low-BW senders
    • no bias against bursty traffic, unlike tail drop or random drop
  • [this work led to ECN]

Congestion Control for High Bandwidth-Delay Product Networks

  • XCP: CC from scratch
  • TCP: AIMD for both efficiency and fairness
  • XCP: separate; MIMD for efficiency, AIMD for fairness
  • why explicit congestion signals
    • clients can’t distinguish losses due to errors vs congestion
    • detection requires RTO timeout, duplicate acks, or similar
    • packet loss is binary, but notification can carry more info
  • comments
    • ECN for IP has been around for a long time
    • DECNet did something similar long before XCP
    • hard to deploy

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication (CMU)

  • TCP Incast problem: catastrophic drop in TCP goodput when:
    • client issues req in parallel to multiple servers and waits for all resps
    • servers respond at once, overflowing switch buffers on bottleneck link, causing pkt loss
    • default lower bound on TCP RTO is typically 200ms, much more than typical RTT on DC nets (<1ms)
      • servers that suffered pkt loss must wait for a relatively long time before retransmitting dropped packets
    • client is waiting for all to respond, so can’t make progress/further utilize network
  • causes
    • commodity switches have small buffers
    • RTO timers on all servers will retry at about same time; RTO grows exponentially
  • proposes lowering the TCP retransmission timeout (RTO) to <= 1ms [and not delaying ACKs?]
  • simulated next-gen network (10Gbps, 20gs RTT); showed problem still persists at scale (>1024 servers)
    • adding randomization to RTO avoids phenomenon, but might be unnecessary outside of sim
  • fine-grained RTO safe to use in WANs? yes, little to no impact
  • how does fine-grained RTO interact with unmodified TCP stacks that use delayed ACKs?
    • doesn’t actually wait up to 40ms; no delays when seeing a retransmission
    • interacts poorly, but not as bad as incast
    • latency more important than ACK traffic in DCs
  • http://everythingisdata.wordpress.com/2009/09/25/fine-grained-tcp-retransmissions/

Understanding TCP Incast Throughput Collapse in Datacenter Networks (yanpei chen, berkeley, 2009)

  • differences from CMU paper
    • while varying number of servers, CMU kept size of data received by client fixed, vs. size of data sent by each server fixed
    • delayed acks hurt both workloads bc overdrives congestion window (widens it too much) and bc of congestion; no explanation why overdrive
    • 200us RTO lower bound led to poor perf for both workloads bc network RTT was 2ms, whereas CMU’s was 100us
      • should really be using jacobson RTO estimation: RTO = min(200us, mean RTT
        • 4 * RTT linear deviation)
  • http://everythingisdata.wordpress.com/2009/09/25/understanding-tcp-incast-throughput-collapse/

Wireless

exor TODO

  • greedy algo: node closest to dest fwds packets
  • src routing; src knows where all nodes are & sets up batch map ordering

MACAW: a media access protocol for wireless lan’s

  • media access protocol: link-layer protocol for allocating access to a shared medium (wireless channel) among multiple transmitters
  • MACAW: media access protocol for mobile pads that communicate wirelessly with base stations
  • each base station can hear transmission from pads within a certain range of its location, called a cell
  • challenges
    • mobility, but location not communicated explicitly; transmitter simply moves
    • connectivity is not transitive: A-B comm, and B-C comm, but A-C don’t
      • must prevent B-C from disrupting A, even though A doesn’t know C
    • noise: asymmetric comm (A->B, but not B->A), interfering xmitters (A out of B’s range, but A’s xmissions might still disrupt B’s xmissions)
  • MACAW ignores asymmetry and interference; assumes simplified model
    • xmission successfully received iff there’s only one active xmitter in range of receiver
    • no 2 base stations are within range of one another
  • goals: high media utilization and fairness
  • MACA: earlier protocol to improve on
    • before xmitting, send “ready to send” (RTS) containing len of upcoming data
    • if receiver hears RTS and is not deferring, reply with “clear to send” (CTS) containing len
    • any station that hears a CTS defers xmissions for long enough to allow len reception
    • any station that hears a RTS defers xmissions for long enough to allow CTS reception (not len reception, since collisions are at receiver)
    • backoff: if no CTS, wait [1..BO] seconds, re-xmit RTS; BO = exponentially growing backoff counter, reduced to 1 on successful RTS-CTS
  • MACAW
    • less aggressive backoff: orig has large oscillations
      • increase BO by 1.5 after timeout, decrease by 1 after success
      • communicate BO btwn clients, to prevent client with lower BO starving other clients
      • maintain BO per-destination, rather than single counter
    • receivers should ACK bc min TCP RTO is long (.5s at the time)
    • allow xmitters to avoid contention
      • send DS after successful RTS-CTS, before data
        • so if a pad hears RTS but not CTS, don’t try to xmit during data xfer period
      • say A,B in diff cells, want to receive rapidly, but close to ea other [TODO confusing]
        • B will be unlikely to send CTS bc it would need to hear RTS exactly right after A finishes receiving a data msg
        • fix: if receiver hears RTS while deferring xmissions, after deferral period reply with “ready for RTS” (RRTS)
        • sender then resends RTS
        • solution is imperfect: if A is receiving data constantly, B won’t hear RTS, so won’t know to send RRTS
  • best-case perf of MACAW < MACA bc of add’l ACK/DS

misc

NetMedic: Detailed Diagnose in Enterprise Networks (sigcomm 2009)

  • http://everythingisdata.wordpress.com/2009/09/17/detailed-diagnose-in-enterprise-networks/
  • pinpointing network problems
  • model network as set of components (machines, processes, network paths, etc.)
    • each component has set of vars defining current state
    • and a set of unidirectional dependencies on other components
    • auto construct dependency graph
  • record state over time; infer correlations between states of 2 components
    • eg when C1_x abnormal, C2_y abnormal also
  • label edges with correlation strength, estimating likelihood that one component influences the state of another component
    • infer which component is causing the undesirable behavior/failure

Network Protocols and Infrastructures to Hide Latency for Cloud-Based Applications (MSR talk 2009)

  • xbox live primetime (XBL) service needs reliable and low-latency comm btwn xbox and DC
    • all comm is based on TCP
    • 1-way msg latency (ms)
      • RTT 200 avg 144 90% 203 99% 703
      • RTT 300 avg 208 90% 305 99% 860 99.9% 1360 99.99% 2321
  • forward error correction (FEC) widely used as E2E tech
    • real-time comm: degradable; video, voip
    • bulk data xfer: throughput; reliable multicast, digital fountains, network coding
    • short xfer: not degradable, latency; balakrishnan ’07 ’08; pangolin addresses this
  • propose pangolin protocol: FEC with rexmits
    • 1 msg -> k packets -> n packets (FEC); ACK received pkts; RTO
    • key for rest of talk: how to choose the rate?
  • TCP vs pangolin: 300ms RTT, 2% bidir loss, n=5 k=4
    • TCP, pangolin same up to 90%; TCP latency drops considerably
    • pangolin: 90% 313 99% 313 99.9% 625
    • TCP: 90% 313 99% 1734 99.9% 3656

sensor networks

VanGo

  • linear chain of filters (nesc modules)
  • configuration says where to partition
  • built on SNACK (Sensor Network Application Construction Kit)

Lewis Girod, Sevan, Ryan

  • using wavescope to do audio processing
  • specifically, separate signal from noise, probably using clustering of spectra (using euclidian distance of their frequencies)